keda-gpu-scaler
Scale Kubernetes GPU workloads from real hardware metrics. No Prometheus. No DCGM. No PromQL.
A KEDA External Scaler that reads NVIDIA GPU metrics directly through the NVML C bindings and autoscales your vLLM, Triton, and custom inference deployments, including scale-to-zero.
       GPU Node                      KEDA Operator
┌─────────────────────┐           ┌──────────────────┐
│  keda-gpu-scaler    │──gRPC───> │ External Scaler  │
│  (DaemonSet)        │           │ trigger          │
│                     │           └────────┬─────────┘
│  NVML: 92% GPU util │                    │
│  NVML: 14.2GB VRAM  │          Scale vllm-deployment
└─────────────────────┘           from 3 → 8 replicas
Why This Exists
Scaling AI inference on Kubernetes with a CPU/memory-based HPA is broken. Your GPU nodes sit at 10% CPU while the GPUs are 100% saturated and 200+ requests are pending in the vLLM queue.
BEFORE: GPU Pod → dcgm-exporter → Prometheus → PromQL → KEDA → HPA
(5 components, 15-30s scrape delay, PromQL queries break on upgrades)
AFTER: GPU Pod → keda-gpu-scaler (NVML) → KEDA → HPA
(2 components, sub-second metrics, zero configuration)
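The final KEDA → HPA hop in both pipelines turns a metric value into a replica count. As a rough illustration, here is a minimal Python sketch of the standard Kubernetes HPA rule, `ceil(currentReplicas * metric / target)`, with the default 10% tolerance and min/max clamping. The function name, its parameters, and the 35% target are assumptions chosen for illustration (the target is picked so the numbers match the 3 → 8 example in the diagram); this is not code taken from keda-gpu-scaler itself.

```python
import math


def desired_replicas(
    current_replicas: int,
    metric_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA rule: ceil(current * metric / target), clamped."""
    if current_replicas < 1:
        # Scaling between 0 and 1 is driven by KEDA's activation logic, not the HPA.
        raise ValueError("current_replicas must be >= 1")
    if target_value <= 0:
        raise ValueError("target_value must be positive")

    ratio = metric_value / target_value
    # The HPA leaves the replica count alone when the ratio is close to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical target: 35% GPU utilization, currently observing 92% on 3 replicas.
print(desired_replicas(3, metric_value=92, target_value=35))  # → 8
```

Because the rule is proportional, a fresher metric mostly changes how quickly the HPA reacts rather than where it settles, which is why the scrape delay in the longer pipeline matters.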
GPU Metrics
| Metric | Description | Unit |
|---|---|---|
| `gpu_utilization` | GPU compute (SM) utilization | % (0-100) |
| `memory_utilization` | GPU memory controller utilization | % (0-100) |
| `memory_used_mib` | GPU VRAM used | MiB |
| `memory_used_percent` | GPU VRAM used as percentage of total | % (0-100) |
| `temperature` | GPU die temperature | Celsius |
| `power_draw` | GPU power consumption | Watts |
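To make the units in the table concrete, the following Python sketch shows one way the six metrics could be derived from raw NVML readings. NVML reports memory in bytes and power in milliwatts, so both need converting to match the table. The `RawSample` type and `to_metrics` function are hypothetical names introduced for this example, and the conversion is an assumption about how such a scaler might work, not the project's actual implementation.

```python
from dataclasses import dataclass

MIB = 1024 * 1024  # bytes per mebibyte


@dataclass(frozen=True)
class RawSample:
    """Hypothetical container for raw per-GPU readings in NVML's native units."""
    sm_util_percent: int        # nvmlDeviceGetUtilizationRates: gpu field
    mem_ctrl_util_percent: int  # nvmlDeviceGetUtilizationRates: memory field
    mem_used_bytes: int         # nvmlDeviceGetMemoryInfo: used
    mem_total_bytes: int        # nvmlDeviceGetMemoryInfo: total
    temperature_c: int          # nvmlDeviceGetTemperature (GPU sensor)
    power_milliwatts: int       # nvmlDeviceGetPowerUsage


def to_metrics(sample: RawSample) -> dict:
    """Convert raw readings into the metric names and units from the table."""
    used_percent = 0.0
    if sample.mem_total_bytes > 0:  # guard against division by zero
        used_percent = 100.0 * sample.mem_used_bytes / sample.mem_total_bytes
    return {
        "gpu_utilization": float(sample.sm_util_percent),
        "memory_utilization": float(sample.mem_ctrl_util_percent),
        "memory_used_mib": sample.mem_used_bytes / MIB,
        "memory_used_percent": used_percent,
        "temperature": float(sample.temperature_c),
        "power_draw": sample.power_milliwatts / 1000.0,  # mW -> W
    }


# Made-up readings for a 24 GiB GPU with half of its VRAM in use.
sample = RawSample(92, 40, 12 * 1024 * MIB, 24 * 1024 * MIB, 71, 250_000)
print(to_metrics(sample))
```

Note that `memory_utilization` and `memory_used_percent` measure different things: the first is how busy the memory controller is, the second is how full the VRAM is.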
Featured In
- GPU Autoscaling on Kubernetes with KEDA — Building an External Scaler — CNCF Blog
- Abstracting AI Infrastructure: Native GPU Scaling for Internal Developer Platforms — Platform Engineering
- The Financial Trap of Autonomous Networks: Scaling Agentic AI in the Telecom Core — IEEE ComSoc Technology Blog