Skip to content

keda-gpu-scaler

Scale Kubernetes GPU workloads from real hardware metrics. No Prometheus. No DCGM. No PromQL.

A KEDA External Scaler that reads NVIDIA GPU metrics directly from NVML C-bindings and autoscales your vLLM, Triton, and custom inference deployments — including scale-to-zero.

GPU Node                          KEDA Operator
┌─────────────────────┐           ┌──────────────────┐
│ keda-gpu-scaler     │──gRPC───> │ External Scaler  │
│ (DaemonSet)         │           │ trigger          │
│                     │           └────────┬─────────┘
│ NVML: 92% GPU util  │                    │
│ NVML: 14.2GB VRAM   │           Scale vllm-deployment
└─────────────────────┘           from 3 → 8 replicas

Why This Exists

Scaling AI inference on Kubernetes using CPU/Memory HPA is broken. Your GPU nodes sit at 10% CPU while the GPUs are 100% saturated with 200+ pending requests in the vLLM queue.

BEFORE: GPU Pod → dcgm-exporter → Prometheus → PromQL → KEDA → HPA
        (5 components, 15-30s scrape delay, PromQL queries break on upgrades)

AFTER:  GPU Pod → keda-gpu-scaler (NVML) → KEDA → HPA
        (2 components, sub-second metrics, zero configuration)

Docs

GPU Metrics

Metric Description Unit
gpu_utilization GPU compute (SM) utilization % (0-100)
memory_utilization GPU memory controller utilization % (0-100)
memory_used_mib GPU VRAM used MiB
memory_used_percent GPU VRAM used as percentage of total % (0-100)
temperature GPU die temperature Celsius
power_draw GPU power consumption Watts