GPU: A100 · H100 · H200 | SERVE: vLLM · Triton · NIM | SCHEDULE: SLURM · Ray · k8s GPU op

For government / 04 · AI & HPC

Secure compute for defense AI.

GPU clusters, serving stacks, HPC pipelines — cleared or commercial. NVIDIA A100 / H100 clusters with SLURM scheduling and MIG partitioning, vLLM / Triton / NIM model serving, FlexLM license infrastructure, and HPC data pipelines that sustain 200+ concurrent engineering users. Air-gapped LLM serving with no egress; confidential GPU compute on the bench.


Reference cluster

Shared infra.
Multi-tenant, multi-workload.

One login / submit jump host (SLURM · Ray · kubectl, with SSO + PIV at the human edge) fans out to three workload tiers: serving (vLLM · NIM · Triton · TGI · TensorRT-LLM), training (DeepSpeed ZeRO-3 · PyTorch FSDP · NeMo · Megatron-LM), and HPC / batch (MPI · OpenMP · NCCL · FlexLM-licensed apps). All three share a GPU pool of H100 / H200 / A100 with MIG partitioning, NVLink / NVSwitch fabric, InfiniBand NDR at 400 Gb/s, and Confidential Computing (CC) mode, backed by a Lustre / GPFS / VAST / WEKA parallel filesystem over GPUDirect Storage.

What we operate

Six capabilities.

Provisioning to confidential serving.

GPU cluster provisioning

Kubespray + NVIDIA GPU Operator on the k8s tier; Ansible for bare-metal SLURM.

MIG profiles are set per workload, and GPUDirect RDMA is wired through the fabric. The NVIDIA GPU Operator handles the driver, container toolkit, and DCGM exporter. MIG gives up to 7 isolated slices per H100 for serving QoS. InfiniBand NDR (400 Gb/s) and NVLink carry the training fabric, with GPUDirect RDMA + Storage for zero-copy paths.

NVIDIA GPU Operator | Kubespray | MIG · 7 / H100 | InfiniBand NDR | GPUDirect RDMA

7 MIG slices · H100 | 400 Gb IB NDR | RDMA zero-copy
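
To make the "7 slices per H100" figure concrete, here is a minimal capacity sketch. It assumes the MIG profile table for an 80 GB H100 (7 compute slices, 8 memory slices) and a homogeneous layout where every instance on a GPU uses the same profile. The function names are hypothetical, and real placement rules are stricter than this arithmetic.

```python
# Assumed MIG profiles for an 80 GB H100: name -> (compute slices, memory slices).
PROFILES = {
    "1g.10gb": (1, 1),
    "1g.20gb": (1, 2),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
COMPUTE_SLICES = 7  # compute slices exposed by one GPU
MEMORY_SLICES = 8   # memory slices exposed by one GPU


def instances_per_gpu(profile: str) -> int:
    """How many instances of one profile fit on a single GPU."""
    compute, memory = PROFILES[profile]
    return min(COMPUTE_SLICES // compute, MEMORY_SLICES // memory)


def pool_capacity(gpus: int, profile: str) -> int:
    """Total isolated tenants when every GPU in the pool uses the same profile."""
    return gpus * instances_per_gpu(profile)


if __name__ == "__main__":
    for name in PROFILES:
        print(name, instances_per_gpu(name), pool_capacity(8, name))
```

Under these assumptions an 8-GPU node yields 56 of the smallest slices, or 16 of the half-GPU ones, which is the trade-off behind picking a profile per workload.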

LLM inference serving

vLLM with continuous batching + paged attention as the default.

Triton covers multi-framework serving, and NVIDIA NIM containers provide the green-button path. In detail: vLLM with paged attention, prefix caching, and continuous batching; Triton Inference Server with TensorFlow / PyTorch / ONNX backends; NIM microservices where the ATO permits; and speculative decoding for 1.5–2× throughput on models that support it.

vLLM | Triton Inference | NVIDIA NIM | TGI · TensorRT-LLM | Speculative decoding
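
The idea behind paged attention can be sketched as a toy block allocator: the KV cache is split into fixed-size blocks, and each sequence holds a table of block ids that grows only when it crosses a block boundary. This is an illustration of the technique, not vLLM's implementation; the class name, method names, and the 16-token block size are assumptions.

```python
import math


class BlockAllocator:
    """Toy KV-cache allocator: fixed-size blocks, one block table per sequence."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))       # ids of unused blocks
        self.tables: dict[str, list[int]] = {}    # sequence id -> block ids
        self.lengths: dict[str, int] = {}         # sequence id -> token count

    def append_tokens(self, seq_id: str, n: int) -> None:
        """Grow a sequence by n tokens, claiming new blocks only when needed."""
        length = self.lengths.get(seq_id, 0) + n
        table = self.tables.get(seq_id, [])
        extra = math.ceil(length / self.block_size) - len(table)
        if extra > len(self.free):
            # A real scheduler would queue or preempt here instead of failing.
            raise MemoryError("KV cache exhausted")
        for _ in range(extra):
            table.append(self.free.pop())
        self.tables[seq_id] = table
        self.lengths[seq_id] = length

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks return to the pool the moment a sequence finishes, a waiting request can start on the next step, which is what lets continuous batching keep the GPU full.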

Distributed training

DeepSpeed ZeRO-3 or PyTorch FSDP for memory-efficient training.

NeMo for the LLM toolchain. Ray for hyperparameter sweep + serving promotion. Checkpoints to the parallel FS. DeepSpeed ZeRO-3 with CPU / NVMe offload tuning, PyTorch FSDP + activation checkpointing, NeMo / NeMo Guardrails for the LLM pipeline, and Ray Train + Ray Tune for orchestration.

DeepSpeed ZeRO-3 | PyTorch FSDP | NeMo · Megatron-LM | Ray Train · Tune
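
A rough sizing sketch shows why ZeRO-3 (and FSDP full sharding, which partitions the same state) counts as memory-efficient. It assumes mixed-precision Adam at about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations, buffers, and fragmentation. The function name is hypothetical.

```python
def model_state_gb(params: float, gpus: int, zero_stage: int = 3) -> float:
    """Approximate per-GPU model-state memory in GB (1e9 bytes).

    Assumes mixed-precision Adam; activations are not included.
    """
    weights = 2 * params   # fp16 parameters
    grads = 2 * params     # fp16 gradients
    optim = 12 * params    # fp32 master weights, momentum, variance
    if zero_stage >= 1:
        optim /= gpus      # stage 1 shards optimizer state
    if zero_stage >= 2:
        grads /= gpus      # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= gpus    # stage 3 also shards parameters
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in (0, 1, 2, 3):
        print(stage, model_state_gb(7e9, gpus=8, zero_stage=stage))
```

On these assumptions a 7B-parameter model needs about 112 GB of model state per GPU without sharding and about 14 GB per GPU at stage 3 on 8 GPUs, before any CPU / NVMe offload.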

HPC & SLURM

SLURM clusters for MPI / OpenMP / MATLAB / Ansys workloads.

Fairshare scheduling, GRES tracking for GPUs, and prolog/epilog hooks for accounting + cleanup. The stack: SLURM with GRES + cgroup constraints for clean GPU accounting, OpenMPI + UCX + NCCL tuned for the fabric, the Slurm REST API + Open OnDemand for user-facing jobs, and FairShare with TRES + association limits per program.

SLURM · GRES | OpenMPI · UCX · NCCL | Open OnDemand | FairShare · TRES
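
As an illustration of how GRES and per-program accounting show up in a job, here is a small sketch that renders a batch script. The `--job-name`, `--account`, `--gres`, and `--time` options are standard `sbatch` flags; the helper name, the account name, and the command are hypothetical.

```python
def render_sbatch(job_name: str, account: str, gpus: int, command: str,
                  time_limit: str = "01:00:00") -> str:
    """Render a minimal sbatch script that requests GPUs through GRES."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",   # association used for fairshare and limits
        f"#SBATCH --gres=gpu:{gpus}",     # GPUs tracked and constrained as GRES
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical program account and solver command.
    print(render_sbatch("cfd-run", "program-a", 2, "./solver --input case.cfg"))
```

Tying every job to an account is what lets fairshare and association limits apply per program rather than per user.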

License infrastructure

Highly available FlexLM + RLM clusters, so licensing is never the bottleneck.

Pacemaker for failover. License usage is exported to Prometheus so engineers stop guessing why their job sat queued. FlexLM triad with Pacemaker / Corosync failover, RLM HA for the Reprise side, license telemetry to Prometheus with per-feature heat maps, and 200+ concurrent engineering users sustained.

FlexLM · RLM | Pacemaker / Corosync | Prometheus telemetry

200+ concurrent users | HA Pacemaker triad | Live telemetry
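
The license telemetry path can be sketched as a parser that turns `lmstat -a` feature summaries into Prometheus text-format lines. This assumes the usual "Users of FEATURE: (Total of N licenses issued; Total of M licenses in use)" summary line; the metric names are hypothetical, and a real exporter would also serve the output over HTTP.

```python
import re

# Matches the per-feature summary line printed by `lmstat -a`.
USERS_OF = re.compile(
    r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<used>\d+) licenses? in use\)"
)


def lmstat_to_prometheus(text: str) -> str:
    """Convert lmstat feature summaries to Prometheus exposition lines."""
    lines = []
    for m in USERS_OF.finditer(text):
        feature = m["feature"]
        # Hypothetical metric names; pick ones that match local conventions.
        lines.append(f'flexlm_licenses_issued{{feature="{feature}"}} {m["issued"]}')
        lines.append(f'flexlm_licenses_in_use{{feature="{feature}"}} {m["used"]}')
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = (
        "Users of MATLAB:  (Total of 10 licenses issued;  Total of 3 licenses in use)\n"
        "Users of Simulink:  (Total of 1 license issued;  Total of 0 licenses in use)\n"
    )
    print(lmstat_to_prometheus(sample), end="")
```

With issued and in-use counts per feature in Prometheus, a queued job can be traced to a saturated feature instead of guessed at.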

Air-gapped / confidential

LLM serving with zero egress; RAG against on-prem vector DBs.

Confidential Compute mode on H100 / H200 keeps weights and KV cache inside the GPU's hardware-protected memory and encrypts CPU-GPU transfers, with key release from the KMS gated on GPU attestation. The stack: air-gapped vLLM with local-only Milvus / Weaviate RAG, NVIDIA Confidential Compute (CC-mode), an egress firewall that blocks model phone-home and telemetry leaks, and prompt + response logging at the gateway, not the model.

Air-gapped vLLM | Confidential CC-mode | Milvus · Weaviate RAG | Egress firewall
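
The "zero egress" posture amounts to a deny-by-default destination check. The sketch below shows that policy with Python's standard `ipaddress` module; the CIDR ranges and function name are hypothetical, and in practice the rule is enforced at the firewall rather than in application code.

```python
import ipaddress

# Hypothetical on-prem ranges that serving pods may reach (vector DB, gateway).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/16"),
    ipaddress.ip_network("10.30.5.0/24"),
]


def egress_allowed(destination: str) -> bool:
    """Deny by default: allow only literal IPs inside an approved network."""
    try:
        addr = ipaddress.ip_address(destination)
    except ValueError:
        # Hostnames and malformed input are rejected rather than resolved.
        return False
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for dst in ("10.20.3.4", "8.8.8.8", "example.com"):
        print(dst, egress_allowed(dst))
```

Rejecting hostnames outright avoids depending on DNS inside the enclave, so a model or library that tries to phone home fails closed.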

The stack we operate

Hardware to model.

GPU tiers, fabric, serve, train.

Hardware & fabric

Five GPU tiers, zero-copy fabric.

NVIDIA H100 / H200 | A100 | L40S | Grace Hopper | B200 | NVLink / NVSwitch | InfiniBand NDR (400 Gb) | RoCE v2 | GPUDirect RDMA + Storage

Schedule, serve & train

SLURM + k8s, vLLM-led serving.

SLURM | Kubernetes GPU Operator | Ray | Volcano · Run.ai | vLLM | Triton Inference | TGI · TensorRT-LLM | NVIDIA NIM | PyTorch FSDP | DeepSpeed ZeRO-3 | NeMo · Megatron-LM

Data, vectors & storage

Parallel filesystems, on-prem vectors.

pgvector | Milvus | Weaviate · Qdrant | Lustre · GPFS / Spectrum Scale | S3 · MinIO | VAST · WEKA | NVMe-oF · DAOS

Licensing & security

HA licensing, confidential-ready security.

FlexLM · RLM | MATLAB · Simulink | Ansys · Abaqus | OpenFOAM · LAMMPS | Nitro Enclaves (cloud) | NeMo Guardrails | Confidential GPU (CC-mode) | Garak (red-team)

Tech radar · Q2

Where defense AI is going.

Adopt · trial · assess.

Adopt

In production.

vLLM v1 scheduler: lower host overhead, cleaner batching. MIG partitioning: up to 7 isolated tenants on one H100. SLURM + GRES cgroups: per-job GPU accounting, no leakage. FlexLM HA triad: licensing is never the bottleneck.

Trial

Piloting.

NVIDIA NIM: green-button microservices, ATO-pending. Speculative decoding: 1.5–2× throughput on supported models. Confidential CC-mode: weights protected in GPU memory, keys released on attestation. FSDP + activation ckpt: training without recomputing the wheel.

Assess

Watching.

B200 GPU class · FP4 | MoE serving at scale | On-prem RAG patterns | DoD GenAI guardrails

New engagement

Compute that moves.

A free consultation. Tell us the workload (training? serving? HPC?), the boundary (cloud, on-prem, air-gapped?), and the timeline — we'll come back with a cluster architecture and a quote.


Email: [email protected]

Phone: (801) 917-4617

Location: Utah · Mountain Time

Response: Within 24 hours