stagg-solutions ~ $ cd capabilities/ai-hpc
practice · for-government / 04
GPU: A100 · H100 · H200
SERVE: vLLM · Triton · NIM
SCHEDULE: SLURM · Ray · k8s GPU op

Secure compute for defense AI.

GPU clusters that scale from 8 to thousands of devices. Inference servers tuned for throughput per watt. Air-gapped LLM serving with no egress. License infrastructure that survives the noisy hours. Confidential GPU compute on the bench.

SERVE: vLLM · paged-attn · cont. batching
PARTITION: MIG · 7 slices per H100
NETWORK: InfiniBand NDR 400 Gb · NVLink
SCHEDULE: SLURM · Ray · k8s GPU op
ISOLATION: Confidential CC-mode ready

$ cat cluster.txt

Reference cluster.

// shared infra · multi-tenant · multi-workload
// shared GPU cluster · serving + training + HPC · cleared scope
login / submit · jump host [entry]
  • SLURM · Ray · kubectl
  • SSO + PIV at the human edge
serving [inference]
  • vLLM · NIM
  • Triton Inference
  • TGI · TensorRT-LLM
training [multi-GPU]
  • DeepSpeed ZeRO-3
  • PyTorch FSDP
  • NeMo · Megatron-LM
HPC / batch [classic]
  • MPI · OpenMP · NCCL
  • FlexLM-licensed apps
  • MATLAB · Ansys · LAMMPS
GPU pool · H100 · H200 · A100 [compute]
  • MIG partitioning · 7 slices per H100 at QoS
  • NVLink · NVSwitch fabric
  • InfiniBand NDR 400 Gb · RoCE v2
  • Confidential CC-mode on H100 / H200
parallel filesystem [storage]
  • Lustre · GPFS / Spectrum Scale
  • VAST · WEKA
  • NVMe-oF · DAOS · GPUDirect Storage
  • MIG lets one H100 host up to 7 isolated workloads at predictable QoS
  • Confidential CC-mode (H100 / H200) keeps weights in protected GPU memory with encrypted CPU-to-GPU transfers · KMS key release gated on attestation

$ ls capabilities/ai-hpc/

What we operate.

// six capabilities
01

GPU cluster provisioning

k8s + bare metal

Kubespray + NVIDIA GPU Operator on the k8s tier. Bright Computing or in-house Ansible for bare-metal SLURM. MIG profiles per workload, GPUDirect RDMA wired through the fabric.

  • NVIDIA GPU Operator · driver + container toolkit + DCGM exporter
  • MIG profiles · 7 slices per H100 for serving QoS
  • InfiniBand NDR 400 Gb / NVLink for training fabric
  • GPUDirect RDMA + Storage for zero-copy paths
7 · MIG slices per H100
400 Gb · IB NDR
RDMA · zero-copy
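Choosing MIG profiles per workload comes down to a packing question: which mix of instances fits on one GPU. The sketch below is a simplified check, not a replacement for the driver's own validation. The profile table lists the H100 80GB profiles as compute and memory slice counts (verify against the NVIDIA MIG User Guide for your driver), and `fits_on_one_gpu` is a hypothetical helper that only checks the slice totals. Real MIG placement rules are stricter than a sum.

```python
# Simplified MIG packing check for one H100 80GB.
# Assumption: a layout is accepted when it fits in 7 compute slices and
# 8 memory slices. Actual placement rules enforced by the driver are stricter.

# profile name -> (compute slices, memory slices)
H100_80GB_PROFILES = {
    "1g.10gb": (1, 1),
    "1g.20gb": (1, 2),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}


def fits_on_one_gpu(layout):
    """Return True if the requested profiles fit within one GPU's slice totals."""
    compute = sum(H100_80GB_PROFILES[name][0] for name in layout)
    memory = sum(H100_80GB_PROFILES[name][1] for name in layout)
    return compute <= 7 and memory <= 8


if __name__ == "__main__":
    print(fits_on_one_gpu(["1g.10gb"] * 7))          # seven small serving slices
    print(fits_on_one_gpu(["1g.20gb"] * 5))          # memory slices run out first
    print(fits_on_one_gpu(["3g.40gb", "4g.40gb"]))   # two larger tenants
```

Note that the 7-way split is only reachable with the smallest profile. Larger memory per slice reduces the tenant count.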
02

LLM inference serving

vLLM · NIM

vLLM with continuous batching + paged attention as the default. Triton for multi-framework. NVIDIA NIM containers for the green-button path. Speculative decoding where the model supports it.

  • vLLM · paged-attn · prefix cache · continuous batching
  • Triton Inference Server · TF / PyTorch / ONNX backends
  • NVIDIA NIM microservices (where ATO permits)
  • Speculative decoding · 1.5–2× throughput on supported models
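Paged attention allocates KV cache in fixed-size blocks, so serving capacity is largely a question of how many blocks fit in the memory left after weights. The sketch below estimates that under stated assumptions: the model shape (80 layers, 8 KV heads, head dimension 128) is the published Llama-3.1-70B configuration, fp16 KV entries, a 16-token block, and a 20 GiB KV budget picked purely for illustration. The function names are hypothetical and are not part of vLLM.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers.

    K and V each store num_kv_heads * head_dim values per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_blocks_available(kv_budget_bytes, bytes_per_token, block_size=16):
    """Whole paged-attention blocks that fit in the KV memory budget."""
    return kv_budget_bytes // (bytes_per_token * block_size)


# Assumed model shape: Llama-3.1-70B with grouped-query attention.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Assumed budget: 20 GiB left for KV cache after weights (illustrative only).
blocks = kv_blocks_available(20 * 1024**3, per_token)

if __name__ == "__main__":
    print(per_token, "bytes per token")
    print(blocks, "blocks,", blocks * 16, "cacheable tokens")
```

The token total is shared across all in-flight sequences, which is why prefix caching and continuous batching matter: both raise the number of useful tokens served per block.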
03

Distributed training

DeepSpeed · FSDP

DeepSpeed ZeRO-3 or PyTorch FSDP for memory-efficient training. NeMo for the LLM toolchain. Ray for hyperparameter sweep + serving promotion. Checkpoints to parallel FS.

  • DeepSpeed ZeRO-3 · CPU / NVMe offload tuning
  • PyTorch FSDP + activation checkpointing
  • NeMo / NeMo Guardrails for the LLM pipeline
  • Ray Train + Ray Tune for orchestration
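The reason ZeRO-3 and FSDP make large models trainable is arithmetic: model states are partitioned across ranks instead of replicated. A minimal sketch, assuming mixed-precision Adam at 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states, as in the ZeRO paper) and ignoring activations, buffers, and fragmentation. The function name and the 64-GPU example are hypothetical.

```python
def zero3_model_state_gb(num_params, num_gpus, bytes_per_param=16):
    """Per-GPU memory (GB) for weights, gradients, and optimizer states.

    Assumes all three are fully partitioned across num_gpus (ZeRO stage 3).
    Activations and temporary buffers are not included.
    """
    return num_params * bytes_per_param / num_gpus / 1e9


if __name__ == "__main__":
    params = 70e9  # a 70B-parameter model
    print(zero3_model_state_gb(params, num_gpus=1))   # unpartitioned baseline
    print(zero3_model_state_gb(params, num_gpus=64))  # partitioned over 64 GPUs
```

Whatever remains on each device after this figure is the room left for activations, which is where activation checkpointing and CPU / NVMe offload tuning come in.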
04

HPC & SLURM

scheduler

SLURM clusters for MPI / OpenMP / MATLAB / Ansys workloads. Fairshare scheduling, GRES tracking for GPU, prolog/epilog hooks for accounting + cleanup.

  • SLURM with GRES + cgroup constraints · GPU accounting clean
  • OpenMPI + UCX + NCCL tuned for the fabric
  • Slurm REST + Open OnDemand for user-facing jobs
  • FairShare with TRES + association limits per program
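To make the fair-share behavior concrete, here is a sketch of Slurm's classic fair-share formula as we understand it from the Slurm documentation: the factor halves each time an association's effective usage reaches another multiple of its normalized share. This is an illustration only. Recent Slurm releases default to the Fair Tree algorithm, which ranks associations differently, and the function name and damping parameter here are our own.

```python
def classic_fairshare_factor(effective_usage, normalized_shares, damping=1.0):
    """Classic fair-share factor: 2 ** (-(usage / shares) / damping).

    effective_usage and normalized_shares are fractions of the cluster (0..1).
    Returns 1.0 for an unused allocation and 0.5 when usage equals the share.
    """
    if normalized_shares <= 0:
        return 0.0  # no shares assigned: lowest possible factor
    return 2.0 ** (-(effective_usage / normalized_shares) / damping)


if __name__ == "__main__":
    # A program holding 25% of the shares at three usage levels.
    for usage in (0.0, 0.25, 0.5):
        print(usage, classic_fairshare_factor(usage, 0.25))
```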
05

License infrastructure

FlexLM

Highly available FlexLM + RLM clusters. Pacemaker for failover. License usage exported to Prometheus so engineers stop guessing why their job sat queued.

  • FlexLM triad with Pacemaker / Corosync failover
  • RLM HA for the Reprise side of the house
  • License telemetry to Prometheus · per-feature heat maps
  • 200+ concurrent engineering users sustained
200+ · concurrent users
HA · pacemaker triad
Prom · live telemetry
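License telemetry can be as simple as turning `lmstat -a` output into Prometheus text-format gauges. The sketch below assumes the usual per-feature summary line (`Users of <feature>:  (Total of N licenses issued;  Total of M licenses in use)`). The metric names `license_issued` and `license_in_use` and the function name are hypothetical, chosen for illustration rather than taken from any existing exporter.

```python
import re

# Matches the per-feature summary line printed by `lmstat -a`.
FEATURE_RE = re.compile(
    r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<used>\d+) licenses? in use\)"
)


def lmstat_to_prometheus(lmstat_text):
    """Convert lmstat feature summaries to Prometheus exposition lines."""
    lines = []
    for match in FEATURE_RE.finditer(lmstat_text):
        label = f'feature="{match["feature"]}"'
        lines.append(f'license_issued{{{label}}} {match["issued"]}')
        lines.append(f'license_in_use{{{label}}} {match["used"]}')
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "Users of MATLAB:  (Total of 50 licenses issued;  "
        "Total of 12 licenses in use)\n"
    )
    print(lmstat_to_prometheus(sample))
```

Serving this text from a small HTTP endpoint, or writing it to a node-exporter textfile directory, gives per-feature usage that can be graphed against queue wait times.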
06

Air-gapped / confidential

CC-mode

LLM serving with zero egress. RAG against on-prem vector DBs. Confidential Compute mode on H100 / H200 keeps weights and KV cache in protected GPU memory and encrypts CPU-to-GPU transfers, with KMS key release gated on GPU attestation.

  • Air-gapped vLLM with local-only Milvus / Weaviate RAG
  • NVIDIA Confidential Compute on H100 / H200 (CC-mode)
  • Egress firewall · no model phone-home, no telemetry leak
  • Prompt + response logging at the gateway, not the model
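The egress rule reduces to a default-deny check: a destination is reachable only if it falls inside an explicitly allowed on-prem range. A minimal sketch of that policy logic using the standard `ipaddress` module follows. The CIDR ranges and the `egress_allowed` helper are hypothetical placeholders. In practice the enforcement point is the firewall itself, and code like this is better suited to auditing rules or flow logs.

```python
import ipaddress

# Hypothetical on-prem ranges, e.g. the vector DB and model-store subnets.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("192.168.50.0/24"),
]


def egress_allowed(dest_ip, allowed_networks=ALLOWED_NETWORKS):
    """Default-deny: allow only destinations inside an allowed network."""
    addr = ipaddress.ip_address(dest_ip)
    return any(addr in net for net in allowed_networks)


if __name__ == "__main__":
    print(egress_allowed("10.20.3.4"))  # local RAG backend
    print(egress_allowed("8.8.8.8"))    # public internet
```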

$ cat bench/vllm-h100.yaml

Serving config we deploy.

// vLLM · H100 · MIG 1g.20gb
# vllm-serve.yaml · tuned for Llama-3.1-70B-Instruct on H100 80GB
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b-vllm
  annotations:
    serving.kserve.io/autoscalerClass: "hpa"
    stagg.io/atomic-attest: "required"    # reject if cosign verify fails
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 8
    containers:
      - name: vllm
        image: harbor.stagg-solutions.com/ai/vllm:0.6.3-h100
        args:
          - --model
          - /models/meta-llama/Llama-3.1-70B-Instruct
          - --tensor-parallel-size
          - "4"
          - --enable-prefix-caching
          - --enable-chunked-prefill
          - --max-num-seqs
          - "256"
          - --gpu-memory-utilization
          - "0.92"
          - --quantization
          - awq                           # trim KV pressure
        resources:
          limits:
            nvidia.com/gpu: 4             # 4× MIG slices = predictable QoS
            memory: 120Gi
        env:
          - name: VLLM_USE_V1             # new scheduler · less host overhead
            value: "1"
          - name: NCCL_IB_HCA             # InfiniBand fabric pinning
            value: mlx5_0:1
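A rough way to sanity-check the memory flags in a config like this one is to estimate the weight shard per tensor-parallel rank and subtract it from the fraction of device memory the server is allowed to use. The sketch below assumes 4-bit AWQ weights with no allowance for scales, zero points, or activations, so treat the result as an upper bound on KV-cache headroom. Both helper names are hypothetical, and the two device sizes (a 20 GB slice and a full 80 GB GPU) are illustrative inputs.

```python
def weight_gb_per_rank(num_params, bits_per_weight, tensor_parallel):
    """Approximate weight memory (GB) held by each tensor-parallel rank."""
    return num_params * bits_per_weight / 8 / tensor_parallel / 1e9


def kv_headroom_gb(device_gb, gpu_memory_utilization, weights_gb):
    """Memory (GB) left for KV cache and activations under the utilization cap."""
    return device_gb * gpu_memory_utilization - weights_gb


if __name__ == "__main__":
    # 70B parameters, 4-bit weights, --tensor-parallel-size 4
    shard = weight_gb_per_rank(70e9, bits_per_weight=4, tensor_parallel=4)
    print("weights per rank:", shard)

    # --gpu-memory-utilization 0.92 on two assumed device sizes
    for device_gb in (20, 80):
        print(device_gb, "GB device:", kv_headroom_gb(device_gb, 0.92, shard))
```

If the headroom comes out small or negative, the usual levers are a lower `--max-num-seqs`, a larger slice, or a higher tensor-parallel degree.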

$ cat stack.json

The stack we operate.

// hardware to model
most popular · industry standard
available · production-ready
coming · trial / assess
Hardware · 5 GPU tiers
  NVIDIA H100 / H200, A100, L40S, Grace Hopper, B200
Fabric · 4 · zero-copy
  NVLink / NVSwitch, InfiniBand NDR (400 Gb), RoCE v2, GPUDirect RDMA + Storage
Schedule · 5
  SLURM, Kubernetes GPU Operator, Ray, Volcano, Run.ai
Serve · 4 + NIM
  vLLM, Triton Inference, TGI, TensorRT-LLM, NVIDIA NIM
Train · 5
  PyTorch FSDP, DeepSpeed ZeRO-3, NeMo, Megatron-LM, Axolotl
Data & vectors · 4 + 1 trial
  pgvector, Milvus, Weaviate, Qdrant, LanceDB
Storage · 4 · parallel FS
  Lustre · GPFS / Spectrum Scale, S3 · MinIO, VAST · WEKA, NVMe-oF · DAOS
Licensing & HPC apps · 4 · 200+ users
  FlexLM · RLM, MATLAB · Simulink, Ansys · Abaqus, OpenFOAM · LAMMPS
Security · 4 · confidential-ready
  Nitro Enclaves (cloud), NeMo Guardrails, Confidential GPU (CC-mode), Garak (red-team)

$ radar --quarter Q2

What's on the tech radar.

// where defense AI is going

Adopt

in production
  • vLLM v1 scheduler: Lower host overhead · cleaner batching
  • MIG partitioning: 7 isolated tenants on one H100
  • SLURM + GRES cgroups: Per-job GPU accounting · no leakage
  • FlexLM HA triad: License never the bottleneck

Trial

piloting
  • NVIDIA NIM: Green-button microservices · ATO-pending
  • Speculative decoding: 1.5–2× throughput · supported models
  • Confidential CC-mode: Protected weights in GPU memory · KMS key release on attestation
  • FSDP + activation ckpt: Training without recomputing the wheel

Assess

watching
  • B200 GPU class: FP4 inference economics
  • MoE serving at scale: Expert sharding strategies
  • On-prem RAG patterns: Per-tenant retrieval boundaries
  • DOD GenAI guardrails: Policy stacks · NeMo Guardrails + custom

$ stagg --scope ai-hpc

Compute that moves.

Free consultation. Tell us the workload (training? serving? HPC?), the boundary (cloud, on-prem, air-gapped?), and the timeline. We'll come back with a cluster architecture and a quote.

// direct comms

response: within 24 hours
format: free consultation