01
GPU cluster provisioning
k8s + bare metal
Kubespray + NVIDIA GPU Operator on the k8s tier. Bright Computing or in-house Ansible for bare-metal SLURM. MIG profiles per workload, GPUDirect RDMA wired through the fabric.
- NVIDIA GPU Operator · driver + container toolkit + DCGM exporter
- MIG profiles · up to 7 slices per H100 for serving QoS
- InfiniBand NDR 400 Gb/s + NVLink for training fabric
- GPUDirect RDMA + Storage for zero-copy paths
7 MIG slices · H100
400 Gb/s IB NDR
RDMA zero-copy
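A minimal sketch of how a serving pod might claim one MIG slice on the k8s tier. It assumes the GPU Operator runs the "mixed" MIG strategy and that the 80 GB H100 is carved into `1g.10gb` instances, which is what exposes the `nvidia.com/mig-1g.10gb` resource. The pod name and image are hypothetical placeholders.

```python
import json

# Assumed resource name: GPU Operator "mixed" MIG strategy, 1g.10gb profile.
MIG_RESOURCE = "nvidia.com/mig-1g.10gb"
MAX_SLICES_PER_GPU = 7  # an H100 exposes at most 7 MIG compute slices


def mig_pod(name: str, image: str, slices: int = 1) -> dict:
    """Return a Pod manifest (as a dict) requesting `slices` MIG instances."""
    if not 1 <= slices <= MAX_SLICES_PER_GPU:
        raise ValueError("slices must be between 1 and 7")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested through limits.
                    "resources": {"limits": {MIG_RESOURCE: slices}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    print(json.dumps(mig_pod("serve-demo", "registry.local/serve:demo"), indent=2))
```

The JSON output can be piped to `kubectl apply -f -`. Under the "single" strategy the resource name would be `nvidia.com/gpu` instead.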
02
LLM inference serving
vLLM · NIM
vLLM with continuous batching + paged attention as the default. Triton for multi-framework serving. NVIDIA NIM containers for the turnkey path. Speculative decoding where the model supports it.
- vLLM · paged-attn · prefix cache · continuous batching
- Triton Inference Server · TensorFlow / PyTorch / ONNX Runtime backends
- NVIDIA NIM microservices (where ATO permits)
- Speculative decoding · typically 1.5–2× faster token generation on supported models
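The idea behind paged attention can be sketched as a toy block allocator: KV cache is handed out in fixed-size blocks as a sequence grows, so memory is never reserved for tokens that have not been generated yet. This is an illustration only, not vLLM's implementation. The block size and every name below are assumptions.

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


class PagedKVCache:
    """Toy allocator mapping each sequence to a table of physical blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # sequence id -> list of block ids
        self.lengths = {}                    # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Account for one more token; allocate only on a block boundary."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks; a scheduler would preempt")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4)
    for _ in range(17):  # 17 tokens span two 16-token blocks
        cache.append_token("req-1")
    print(cache.tables["req-1"], len(cache.free))
    cache.release("req-1")
```

Because blocks are returned the moment a request finishes, continuous batching can admit a new request on the next step instead of waiting for the whole batch to drain.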
03
Distributed training
DeepSpeed · FSDP
DeepSpeed ZeRO-3 or PyTorch FSDP for memory-efficient training. NeMo for the LLM toolchain. Ray for hyperparameter sweeps + promotion to serving. Checkpoints written to a parallel file system.
- DeepSpeed ZeRO-3 · CPU / NVMe offload tuning
- PyTorch FSDP + activation checkpointing
- NeMo / NeMo Guardrails for the LLM pipeline
- Ray Train + Ray Tune for orchestration
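A rough worked example of why ZeRO-3 and FSDP matter. It assumes mixed-precision Adam, using the usual accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. Activations, buffers, and fragmentation are ignored, and the function name is hypothetical.

```python
def model_state_gb(params: float, gpus: int, zero_stage: int = 3) -> float:
    """Per-GPU model-state memory in GB (1e9 bytes), excluding activations.

    Assumes mixed-precision Adam: 2 + 2 + 12 = 16 bytes per parameter.
    """
    if gpus < 1:
        raise ValueError("gpus must be >= 1")
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if zero_stage >= 1:
        optim /= gpus    # stage 1 shards optimizer state
    if zero_stage >= 2:
        grads /= gpus    # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= gpus  # stage 3 also shards the weights
    return params * (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in (0, 1, 2, 3):
        print(stage, round(model_state_gb(7.5e9, gpus=64, zero_stage=stage), 3))
```

For a 7.5B-parameter model on 64 GPUs this drops model state from 120 GB per GPU with plain data parallelism to about 1.9 GB with stage 3, which is what leaves HBM free for activations and larger batches.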
04
HPC & SLURM
scheduler
SLURM clusters for MPI / OpenMP / MATLAB / Ansys workloads. Fairshare scheduling, GRES tracking for GPU, prolog/epilog hooks for accounting + cleanup.
- SLURM with GRES + cgroup constraints · clean GPU accounting
- OpenMPI + UCX + NCCL tuned for the fabric
- Slurm REST API (slurmrestd) + Open OnDemand for user-facing jobs
- FairShare with TRES + association limits per program
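To make the fairshare behavior concrete, here is a simplified sketch of the classic fair-share factor, F = 2^(-(U/S)/d). U is an association's normalized usage, S its normalized share, and d a dampening factor. This assumes the classic algorithm. Fair Tree ranks associations differently, and real Slurm also applies usage decay, which is omitted here.

```python
def fairshare_factor(usage: float, shares: float, dampening: float = 1.0) -> float:
    """Classic fair-share factor in [0, 1]; higher means higher priority.

    usage     -- normalized historical usage of the association (0..1)
    shares    -- normalized share of the cluster it is entitled to (0..1)
    dampening -- softens the penalty for over-use (1.0 = no dampening)
    """
    if shares <= 0:
        return 0.0  # no entitlement, so no fair-share priority
    if dampening <= 0:
        raise ValueError("dampening must be positive")
    return 2.0 ** (-(usage / shares) / dampening)


if __name__ == "__main__":
    # Hypothetical programs, each entitled to 25% of the cluster.
    for name, used in [("idle", 0.0), ("on-target", 0.25), ("heavy", 0.75)]:
        print(name, round(fairshare_factor(used, 0.25), 3))
```

A program that has used exactly its share scores 0.5, an idle one scores 1.0, and a heavy user decays toward 0. That is the lever that lets per-program association limits coexist with opportunistic use of idle nodes.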
05
License infrastructure
FlexLM
Highly available FlexLM + RLM clusters. Pacemaker for failover. License usage exported to Prometheus so engineers stop guessing why their job sat queued.
- FlexLM triad with Pacemaker / Corosync failover
- RLM HA for the Reprise side of the house
- License telemetry to Prometheus · per-feature heat maps
- 200+ concurrent engineering users sustained
200+ concurrent users
HA Pacemaker triad
Prom live telemetry
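One way the license telemetry could be produced is to parse the per-feature summary lines from `lmutil lmstat -a` and emit Prometheus text-format gauges. This is a sketch. The metric names are hypothetical, and the regex assumes the usual "Users of FEATURE: (Total of N licenses issued; Total of M licenses in use)" summary line.

```python
import re

# Matches the per-feature summary line printed by `lmutil lmstat -a`.
USAGE_RE = re.compile(
    r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<used>\d+) licenses? in use\)"
)


def lmstat_to_prometheus(text: str) -> str:
    """Convert lmstat output to Prometheus exposition text (gauges)."""
    issued = ["# TYPE flexlm_feature_issued gauge"]  # hypothetical metric name
    used = ["# TYPE flexlm_feature_used gauge"]      # hypothetical metric name
    for m in USAGE_RE.finditer(text):
        label = 'feature="%s"' % m.group("feature")
        issued.append("flexlm_feature_issued{%s} %s" % (label, m.group("issued")))
        used.append("flexlm_feature_used{%s} %s" % (label, m.group("used")))
    # Samples for each metric stay grouped, as the text format expects.
    return "\n".join(issued + used) + "\n"


if __name__ == "__main__":
    sample = (
        "Users of MATLAB:  (Total of 50 licenses issued;  Total of 12 licenses in use)\n"
        "Users of ansys:  (Total of 8 licenses issued;  Total of 8 licenses in use)\n"
    )
    print(lmstat_to_prometheus(sample), end="")
```

Written to a file for the node exporter's textfile collector, or served from a small HTTP endpoint, these gauges are enough to build per-feature heat maps and alert when a feature is fully checked out.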
06
Air-gapped / confidential
CC-mode
LLM serving with zero egress. RAG against on-prem vector DBs. Confidential Compute mode on H100 / H200 keeps weights and KV cache inside the GPU's protected memory and encrypts CPU-GPU transfers, with attestation gating key release from an on-prem KMS.
- Air-gapped vLLM with local-only Milvus / Weaviate RAG
- NVIDIA Confidential Compute on H100 / H200 (CC-mode)
- Egress firewall · no model phone-home, no telemetry leak
- Prompt + response logging at the gateway, not the model
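Logging at the gateway rather than in the model can be sketched as a thin wrapper around whatever callable forwards the prompt to the serving backend. Everything here is illustrative: `backend` and `sink` are hypothetical stand-ins for the real inference client and the audit log writer.

```python
import json
import time
import uuid
from typing import Callable


def logged(backend: Callable[[str], str],
           sink: Callable[[str], None]) -> Callable[[str], str]:
    """Wrap `backend` so every prompt/response pair is written to `sink`.

    The model server itself stays log-free; only the gateway sees plaintext.
    """
    def handle(prompt: str) -> str:
        start = time.time()
        response = backend(prompt)
        sink(json.dumps({
            "id": str(uuid.uuid4()),   # per-request correlation id
            "ts": start,               # request arrival time (epoch seconds)
            "latency_s": round(time.time() - start, 4),
            "prompt": prompt,
            "response": response,
        }))
        return response
    return handle


if __name__ == "__main__":
    records = []
    echo = logged(lambda p: p.upper(), records.append)  # stand-in backend
    echo("status of job 42?")
    print(records[0])
```

Keeping the audit trail in one place means retention, redaction, and access control are enforced at a single choke point, and the inference containers need no outbound log path.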