Design Document: keda-gpu-scaler

Problem Statement

GPU inference workloads on Kubernetes cannot be autoscaled usefully with the standard HPA, because its CPU and memory metrics say little about GPU load: a vLLM pod serving 200 concurrent requests shows 8% CPU while the GPU is 100% saturated. The existing approach chains dcgm-exporter → Prometheus → PromQL → KEDA, adding 5 components and 15-30 seconds of metric latency.

The goal: scale GPU workloads from hardware metrics with sub-second latency, no metrics pipeline, and no PromQL.

Why an External Scaler (Not a Native KEDA Scaler)

Three constraints rule out embedding GPU support inside KEDA core:

1. CGO Constraint

NVIDIA's Go bindings (go-nvml) call into libnvidia-ml.so via cgo. KEDA builds its operator with CGO_ENABLED=0 for portability — every binary is a static Linux ELF. Adding a cgo dependency would break KEDA's entire build and release pipeline.

This isn't a temporary limitation. It's a fundamental incompatibility between how KEDA ships binaries and how NVIDIA's library works.

2. Node-Level Hardware Access

NVML reads GPU state through /dev/nvidiactl and /dev/nvidia0..N. These device files are only available on the physical GPU node. The KEDA operator runs as a single centralized Deployment — it has no access to GPU devices on worker nodes.

The only correct Kubernetes pattern for node-level hardware polling is a DaemonSet. Each instance runs on a GPU node, mounts the NVIDIA device files, and serves metrics locally.

3. Independent Release Cycle

GPU infrastructure moves fast. Tying GPU scaling features to KEDA's release cadence (which needs to coordinate across 50+ scalers) would slow iteration. As a standalone component, we can ship fixes and new GPU metrics in hours, not months.

This design was discussed and documented in KEDA issue #7538.

Architecture

GPU Node                                    KEDA Operator
┌──────────────────────────────┐           ┌──────────────────┐
│  DaemonSet: keda-gpu-scaler  │           │                  │
│                              │           │  ExternalScaler  │
│  ┌────────────┐              │  gRPC     │  trigger config  │
│  │ NVML poller│──metrics──►  │──:6000──► │                  │
│  │ (2s loop)  │              │           │  → HPA decision  │
│  └────────────┘              │           │  → scale up/down │
│       ↕                      │           └──────────────────┘
│  libnvidia-ml.so             │
│  /dev/nvidia0..N             │
└──────────────────────────────┘

Data Flow

1. The DaemonSet starts an NVML polling loop (default 2 seconds).
2. Each cycle reads: SM utilization, memory controller utilization, VRAM used/total, temperature, power draw.
3. Metrics are cached in memory (no disk, no external store).
4. KEDA calls GetMetrics() over gRPC on the externalscaler.ExternalScalerServer interface.
5. The scaler returns the requested metric with the aggregation method specified in the ScaledObject.
6. KEDA feeds the metric value into HPA for a scale up/down/to-zero decision.
7. (Optional) An HTTP /metrics endpoint on port 9090 exposes Prometheus gauges for GPU fleet monitoring, independent of the KEDA scaling path.
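
The poll-and-cache steps can be sketched as below. This is a minimal illustration, not the project's code: the `Collector` interface, the `Metrics` fields, the `Cache` type, and the `fixedCollector` mock are all hypothetical names, and a real collector would wrap NVML rather than return canned values.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Metrics is a hypothetical per-GPU sample; the field names are illustrative.
type Metrics struct {
	Index         int
	SMUtilPercent float64
	MemUsedBytes  uint64
	MemTotalBytes uint64
}

// Collector hides the NVML read so the loop can also run against a mock.
type Collector interface {
	CollectAll() ([]Metrics, error)
}

// Cache keeps only the latest sample, in memory.
type Cache struct {
	mu     sync.RWMutex
	latest []Metrics
}

// Refresh runs one poll cycle. On error the previous sample is kept.
func (c *Cache) Refresh(col Collector) error {
	m, err := col.CollectAll()
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.latest = m
	c.mu.Unlock()
	return nil
}

// Snapshot returns a copy, so gRPC handlers never share a slice with the poller.
func (c *Cache) Snapshot() []Metrics {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]Metrics(nil), c.latest...)
}

// Run polls immediately, then once per interval, until ctx is cancelled.
func (c *Cache) Run(ctx context.Context, col Collector, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := c.Refresh(col); err != nil {
			fmt.Println("poll failed:", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// fixedCollector is a mock that returns canned values.
type fixedCollector struct{ out []Metrics }

func (f fixedCollector) CollectAll() ([]Metrics, error) { return f.out, nil }

func main() {
	cache := &Cache{}
	col := fixedCollector{out: []Metrics{{Index: 0, SMUtilPercent: 91}}}

	// Run for a short, bounded time so the example terminates.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	cache.Run(ctx, col, 20*time.Millisecond)

	fmt.Println("cached GPU samples:", len(cache.Snapshot()))
}
```

Serving gRPC requests from `Snapshot()` keeps the request path free of NVML calls; the trade-off is that a returned value can be up to one poll interval old.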

gRPC Interface

The scaler implements four methods from KEDA's ExternalScaler protobuf contract:

| Method | Purpose |
| --- | --- |
| `IsActive` | Returns true if any GPU metric exceeds the activation threshold (enables scale-from-zero) |
| `StreamIsActive` | Streaming version of IsActive for push-based activation |
| `GetMetricSpec` | Returns the metric name and target value for HPA |
| `GetMetrics` | Returns the current GPU metric value |
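
How the three unary methods relate to one cached value can be sketched with plain Go types. This is a simplified model only: the real methods take and return the protobuf messages generated from KEDA's externalscaler.proto, and the `Trigger` and `Scaler` types and their fields here are hypothetical stand-ins.

```go
package main

import "fmt"

// Trigger is a hypothetical, already-parsed view of one ScaledObject trigger.
type Trigger struct {
	MetricName          string
	TargetValue         float64
	ActivationThreshold float64
}

// Scaler answers KEDA's questions from a single source of the current value.
type Scaler struct {
	Current func() float64 // e.g. the aggregated value from the metric cache
}

// IsActive mirrors the activation check: strictly above the threshold.
func (s Scaler) IsActive(tr Trigger) bool {
	return s.Current() > tr.ActivationThreshold
}

// GetMetricSpec tells HPA which metric to track and what value to aim for.
func (s Scaler) GetMetricSpec(tr Trigger) (name string, target float64) {
	return tr.MetricName, tr.TargetValue
}

// GetMetrics reports the current value under the same metric name.
func (s Scaler) GetMetrics(tr Trigger) (name string, value float64) {
	return tr.MetricName, s.Current()
}

func main() {
	value := 3.0
	s := Scaler{Current: func() float64 { return value }}
	// The metric name is a placeholder, not a documented metric type.
	tr := Trigger{MetricName: "gpu_memory_percent", TargetValue: 80, ActivationThreshold: 5}

	fmt.Println("active at 3:", s.IsActive(tr))
	value = 42
	fmt.Println("active at 42:", s.IsActive(tr))
}
```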

Why gRPC (Not HTTP Metrics)

KEDA's external scaler protocol is gRPC by design — type-safe via protobuf (no PromQL string parsing), supports streaming for push-based activation, and lower latency than HTTP scrape-and-parse.

Scaling Profiles

Raw metric thresholds are error-prone if you don't know what "80% GPU utilization" means for your workload. Profiles encode reasonable defaults:

| Profile | What it optimizes for |
| --- | --- |
| `vllm-inference` | LLM serving. Scales on VRAM pressure (80%) because vLLM pre-allocates KV cache. Activation threshold at 5% for scale-to-zero. |
| `vllm-queue-depth` | LLM serving. Scales on pending requests (target 5) read from vLLM's /metrics endpoint, which reacts faster than waiting for VRAM or utilization to move. Activation at 1 request. |
| `triton-inference` | Multi-model serving. Scales on SM utilization (75%) because Triton shares GPU across models. Higher activation (10%) to avoid flapping. |
| `triton-queue-wait` | Multi-model serving. Scales on average inference queue wait time (target 50ms) read from Triton's /metrics endpoint, a more direct overload signal than GPU utilization. Activation at 5ms. |
| `triton-request-rate` | Multi-model serving. Scales on inference throughput (target 50 req/s) read from Triton's /metrics endpoint. Activation at 1 req/s. |
| `training` | Batch training. Scales on SM utilization (90%) with no scale-to-zero (activation 0) to avoid killing jobs before they checkpoint. |
| `batch` | Offline batch inference. Aggressive scale-down with 70% memory threshold and low activation (1%). |

Users can override any profile parameter in the ScaledObject metadata.
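
Profile resolution with overrides can be sketched as a lookup followed by per-key replacement. The targets and activation values below come from the table above; everything else is an assumption for illustration: the GPU metric type names (`gpu_memory_percent`, `gpu_utilization`), the metadata keys `profile`, `targetValue`, and `activationThreshold`, and the `resolve` function itself.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Profile bundles a metric with its scaling target and activation threshold.
type Profile struct {
	MetricType string
	Target     float64
	Activation float64
}

// profiles mirrors the profile table. The two gpu_* metric names are placeholders.
var profiles = map[string]Profile{
	"vllm-inference":      {"gpu_memory_percent", 80, 5},
	"vllm-queue-depth":    {"vllm_queue_depth", 5, 1},
	"triton-inference":    {"gpu_utilization", 75, 10},
	"triton-queue-wait":   {"triton_queue_wait_ms", 50, 5},
	"triton-request-rate": {"triton_request_rate", 50, 1},
	"training":            {"gpu_utilization", 90, 0},
	"batch":               {"gpu_memory_percent", 70, 1},
}

// resolve starts from the named profile and applies any explicit overrides.
// The metadata keys other than metricType are assumed, not documented.
func resolve(metadata map[string]string) (Profile, error) {
	var p Profile
	if name, ok := metadata["profile"]; ok {
		base, found := profiles[name]
		if !found {
			return Profile{}, fmt.Errorf("unknown profile %q", name)
		}
		p = base
	}
	if v, ok := metadata["metricType"]; ok {
		p.MetricType = v
	}
	overrides := map[string]*float64{
		"targetValue":         &p.Target,
		"activationThreshold": &p.Activation,
	}
	for key, dst := range overrides {
		raw, ok := metadata[key]
		if !ok {
			continue
		}
		f, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return Profile{}, fmt.Errorf("%s: %w", key, err)
		}
		*dst = f
	}
	if p.MetricType == "" {
		return Profile{}, errors.New("metadata must set profile or metricType")
	}
	return p, nil
}

func main() {
	p, err := resolve(map[string]string{"profile": "vllm-inference", "targetValue": "85"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```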

Multi-GPU Aggregation

Nodes with 4-8 GPUs need an aggregation strategy. The aggregation parameter controls how per-GPU metrics are combined into a single scalar for KEDA:

- max (default): Scale when any GPU hits the threshold. Best for inference where hot GPUs indicate overload.
- avg: Scale on average utilization. Best for training where GPUs should be evenly loaded.
- min: Scale only when the least-loaded GPU hits the threshold, which means every GPU on the node is at or above it. Conservative.
- sum: Total utilization across all GPUs. Useful for capacity-based scaling.
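
The four strategies reduce a slice of per-GPU values to one scalar. A minimal sketch follows; the `aggregate` function name and its error handling are illustrative, and treating an empty method string as `max` is an assumption that matches the documented default.

```go
package main

import (
	"errors"
	"fmt"
)

// aggregate combines per-GPU values into the single number KEDA receives.
func aggregate(method string, values []float64) (float64, error) {
	if len(values) == 0 {
		return 0, errors.New("no GPU values to aggregate")
	}
	sum, lo, hi := 0.0, values[0], values[0]
	for _, v := range values {
		sum += v
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	switch method {
	case "", "max": // max is the default
		return hi, nil
	case "min":
		return lo, nil
	case "avg":
		return sum / float64(len(values)), nil
	case "sum":
		return sum, nil
	default:
		return 0, fmt.Errorf("unknown aggregation %q", method)
	}
}

func main() {
	perGPU := []float64{10, 90, 50} // utilization of three GPUs on one node
	for _, method := range []string{"max", "avg", "min", "sum"} {
		v, err := aggregate(method, perGPU)
		if err != nil {
			panic(err)
		}
		fmt.Println(method, v)
	}
}
```

With one hot GPU and two cooler ones, `max` crosses an 80% threshold while `avg` does not, which is why `max` suits inference and `avg` suits evenly loaded training.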

Testing Strategy

Unit Tests (no GPU required)

All metric parsing, aggregation, and profile resolution logic is unit-tested with a mock NVML implementation (pkg/gpu/mock.go). The mock returns configurable metric values for any number of simulated GPUs.

E2E Tests (no GPU required)

The gRPC server is tested end-to-end using the mock collector. Tests verify the full path: ScaledObject metadata → metric extraction → gRPC response → activation check.

Manual GPU Testing

For real hardware validation, deploy to a GPU cluster and verify:

# Check scaler logs
kubectl logs -n keda -l app=keda-gpu-scaler

# Verify KEDA sees the external scaler
kubectl get scaledobject -A -o yaml | grep -A5 external

Cross-Environment GPU Metrics (`pkg/env`)

The gpu-metrics CLI targets HPC and cloud environments equally. The pkg/env package centralises all orchestrator detection and metadata so the rest of the codebase never branches on scheduler type.

Environment detection

env.Detect() inspects process environment variables in priority order:

1. SLURM_JOB_ID → SLURM
2. FLUX_JOB_ID → Flux
3. KUBERNETES_SERVICE_HOST → Kubernetes (injected by kubelet into every pod)
4. Otherwise → Standalone

env.Parse(flagValue) converts the --env flag string to an env.Type. The value "auto" triggers Detect().
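
The priority order and the "auto" fallback can be sketched as follows. This is not the pkg/env source: the lowercase `detect` and `parse` functions take an injected `getenv` so they can be tested without touching the process environment, and the explicit flag values (`slurm`, `flux`, `kubernetes`, `standalone`) are assumed spellings.

```go
package main

import (
	"fmt"
	"os"
)

// Type names the orchestrator the process is running under.
type Type string

const (
	SLURM      Type = "slurm"
	Flux       Type = "flux"
	Kubernetes Type = "kubernetes"
	Standalone Type = "standalone"
)

// detect checks the environment variables in priority order.
func detect(getenv func(string) string) Type {
	switch {
	case getenv("SLURM_JOB_ID") != "":
		return SLURM
	case getenv("FLUX_JOB_ID") != "":
		return Flux
	case getenv("KUBERNETES_SERVICE_HOST") != "":
		return Kubernetes
	default:
		return Standalone
	}
}

// parse maps the --env flag to a Type; "auto" defers to detect.
func parse(flagValue string, getenv func(string) string) (Type, error) {
	switch v := Type(flagValue); v {
	case SLURM, Flux, Kubernetes, Standalone:
		return v, nil
	}
	if flagValue == "auto" {
		return detect(getenv), nil
	}
	return "", fmt.Errorf("unknown --env value %q", flagValue)
}

func main() {
	env, err := parse("auto", os.Getenv)
	if err != nil {
		panic(err)
	}
	fmt.Println("detected environment:", env)
}
```

Checking SLURM first matters when schedulers are nested, for example a SLURM job that launches containers which also see Kubernetes-style variables.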

Unified Context

env.FromType(t) returns an env.Context struct with orchestrator-agnostic fields (Orchestrator, NodeName, JobID, TaskRank) plus environment-specific extras (Partition for SLURM, FluxURI for Flux, PodName/Namespace for Kubernetes). All fields are JSON-serialised into the top-level environment block of every output document.

The unexported visibleDevices []int field carries the scheduler-assigned GPU indices so gpu-metrics can restrict NVML collection to the right devices without any scheduler-specific code in the CLI itself.

Unified JSON schema

Before this package, the JSON output had a slurm or flux top-level key that changed depending on the runtime. Any downstream parser had to branch on which key was present. The new schema is always:

{ "environment": { ... }, "collected_at": "...", "devices": [...] }

Because the top-level keys no longer depend on the scheduler, output from different environments can be compared directly with jq, pandas, or any streaming JSON processor.
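
The fixed envelope can be modelled with Go structs and encoding/json. Only the three top-level keys (`environment`, `collected_at`, `devices`) come from the schema above; the nested JSON tag names, the `Device` fields, and the `keysOf` helper are assumptions made for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// Context carries orchestrator-agnostic fields plus per-environment extras.
type Context struct {
	Orchestrator string `json:"orchestrator"`
	NodeName     string `json:"node_name"`
	JobID        string `json:"job_id,omitempty"`
	TaskRank     int    `json:"task_rank"`
	Partition    string `json:"partition,omitempty"` // SLURM only
	FluxURI      string `json:"flux_uri,omitempty"`  // Flux only
	PodName      string `json:"pod_name,omitempty"`  // Kubernetes only
	Namespace    string `json:"namespace,omitempty"` // Kubernetes only

	visibleDevices []int // unexported, so encoding/json never serialises it
}

// Device is a placeholder for the per-GPU record.
type Device struct {
	Index         int     `json:"index"`
	SMUtilPercent float64 `json:"sm_util_percent"`
}

// Document is the envelope that is identical in every environment.
type Document struct {
	Environment Context   `json:"environment"`
	CollectedAt time.Time `json:"collected_at"`
	Devices     []Device  `json:"devices"`
}

// keysOf marshals v and returns its top-level JSON keys, sorted.
func keysOf(v any) ([]string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(obj))
	for k := range obj {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys, nil
}

func main() {
	doc := Document{
		Environment: Context{Orchestrator: "slurm", NodeName: "node01", JobID: "42",
			Partition: "gpu", visibleDevices: []int{0, 1}},
		CollectedAt: time.Now().UTC(),
		Devices:     []Device{{Index: 0, SMUtilPercent: 73}},
	}
	out, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```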

Kubernetes Downward API

For Kubernetes, NODE_NAME, POD_NAME, and POD_NAMESPACE must be exposed via the Downward API (not auto-set by the runtime). The deployment manifests and Helm chart include these mappings so the metadata is populated out of the box.

Security Considerations

- The DaemonSet needs read-only access to NVIDIA device files and no cluster-wide RBAC
- The gRPC port (6000) is exposed only as a ClusterIP Service, so it is not reachable outside the cluster
- The metrics port (9090) is optional and can be disabled entirely with --metrics-port=0
- No secrets or credentials are required
- NVML calls are read-only (metrics collection, no device configuration)

MIG (Multi-Instance GPU) Support (`pkg/gpu`)

A100/H100 GPUs can be split into 2–7 MIG instances. Each gets its own NVML handle with independent utilization and memory counters.

K8s / Standalone: CollectAll() checks each GPU for MIG mode. When active, it loops GetMigDeviceHandleByIndex and returns one Metrics per instance instead of one per physical GPU. Chip-level metrics (temp, power, PCIe, NVLink) are read once from the physical handle and copied into each instance.

HPC (SLURM / Flux): MIG UUIDs in CUDA_VISIBLE_DEVICES (e.g. MIG-GPU-aaaa.../3/0) are detected by MIGUUIDs() and resolved individually via CollectByUUID().
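
Picking MIG entries out of CUDA_VISIBLE_DEVICES is plain string handling, sketched below. The lowercase `migUUIDs` function is an illustration and not the package's MIGUUIDs(); it assumes MIG entries can be recognised by their "MIG-" prefix, and the identifiers in the usage test data are made up.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// migUUIDs returns the entries of a CUDA_VISIBLE_DEVICES value that name
// MIG instances, in their original order. Plain indices and full-GPU UUIDs
// are skipped.
func migUUIDs(visible string) []string {
	var out []string
	for _, entry := range strings.Split(visible, ",") {
		entry = strings.TrimSpace(entry)
		if strings.HasPrefix(entry, "MIG-") {
			out = append(out, entry)
		}
	}
	return out
}

func main() {
	for _, id := range migUUIDs(os.Getenv("CUDA_VISIBLE_DEVICES")) {
		// Each identifier would then be resolved individually,
		// for example through CollectByUUID().
		fmt.Println("MIG instance:", id)
	}
}
```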

Extra Metrics fields: IsMIGInstance (bool), ParentIndex (int, -1 for UUID lookups), MigProfile (string, e.g. "3g.40gb").

If MIG is enabled but no instances exist (GPU not yet partitioned), collectMIGInstances logs a warning and returns nothing. CollectDevice() on MIG GPUs returns physical-level metrics only and logs a warning.

vLLM Engine Metrics (`pkg/vllm`)

GPU utilization and VRAM are proxies for load on a vLLM server — they say the GPU is busy, not how many requests are actually queued. vLLM exposes the real signal on its own Prometheus /metrics endpoint, so pkg/vllm.Client scrapes it directly instead of routing through NVML.

Client.Scrape() fetches the vLLM engine's metrics endpoint and parses the exposition-format text into an EngineMetrics struct, pulling vllm:num_requests_waiting (queue depth), vllm:num_requests_running, vllm:gpu_cache_usage_perc (KV cache usage), and vllm:num_requests_swapped; everything else is ignored. getVLLMMetricValue in pkg/scaler maps vllm_queue_depth and vllm_kv_cache_usage (normalized 0–100) to these fields, and parseMetadata requires vllmEndpoint whenever the requested metricType is one of them. The scaler caches one vllm.Client per distinct endpoint rather than reconnecting on every poll.
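
The parsing half of a scrape can be sketched without any HTTP code. The metric names are the ones listed above; the `EngineMetrics` field names, the `parseEngineMetrics` and `splitSample` helpers, and the assumption that the raw cache gauge is a 0 to 1 fraction are illustrative. This sketch keeps the last series seen for each name and ignores labels.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// EngineMetrics holds the four values of interest; field names are assumed.
type EngineMetrics struct {
	RequestsWaiting   float64
	RequestsRunning   float64
	RequestsSwapped   float64
	GPUCacheUsagePerc float64
}

// KVCachePercent normalises the cache gauge to 0-100, assuming a 0-1 input.
func (m EngineMetrics) KVCachePercent() float64 { return m.GPUCacheUsagePerc * 100 }

// splitSample extracts the metric name and value from one exposition line,
// skipping the optional {label="..."} block and any trailing timestamp.
func splitSample(line string) (name string, value float64, ok bool) {
	rest := ""
	if open := strings.IndexByte(line, '{'); open >= 0 {
		end := strings.LastIndexByte(line, '}')
		if end < open {
			return "", 0, false
		}
		name, rest = line[:open], line[end+1:]
	} else if sp := strings.IndexAny(line, " \t"); sp >= 0 {
		name, rest = line[:sp], line[sp:]
	} else {
		return "", 0, false
	}
	fields := strings.Fields(rest)
	if len(fields) == 0 {
		return "", 0, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return "", 0, false
	}
	return name, v, true
}

// parseEngineMetrics keeps the four known series and ignores everything else.
func parseEngineMetrics(text string) (EngineMetrics, error) {
	var m EngineMetrics
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank line or HELP/TYPE comment
		}
		name, value, ok := splitSample(line)
		if !ok {
			continue
		}
		switch name {
		case "vllm:num_requests_waiting":
			m.RequestsWaiting = value
		case "vllm:num_requests_running":
			m.RequestsRunning = value
		case "vllm:num_requests_swapped":
			m.RequestsSwapped = value
		case "vllm:gpu_cache_usage_perc":
			m.GPUCacheUsagePerc = value
		}
	}
	return m, sc.Err()
}

func main() {
	sample := `# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 7
vllm:gpu_cache_usage_perc{model_name="demo"} 0.5
`
	m, err := parseEngineMetrics(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("queue depth:", m.RequestsWaiting, "KV cache %:", m.KVCachePercent())
}
```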

Because this bypasses NVML entirely, it needs no GPU driver access — only HTTP reachability from the scaler DaemonSet pods to the vLLM Service. See docs/configuration.md#vllm-engine-metrics for usage.

Triton Engine Metrics (`pkg/triton`)

The same proxy problem applies to Triton: triton-inference's GPU utilization tells you the GPU is busy, not whether inference requests are backing up. Triton exposes that directly via its own Prometheus /metrics endpoint (default port 8002), so pkg/triton.Client scrapes it instead of routing through NVML.

Unlike vLLM's vllm:num_requests_waiting (an instantaneous gauge), the Triton metrics we need — nv_inference_queue_duration_us and nv_inference_count — are cumulative counters that only increase over the life of the server. A single scrape can't turn a running total into "requests/sec" or "average wait per request"; that requires two samples. pkg/triton.Client is therefore stateful: each Scrape() call parses the current counter values, diffs them against the previous scrape of that same Client (protected by a mutex), and derives AvgQueueWaitUs (Δqueue_duration_us / Δinference_count) and RequestRatePerSec (Δinference_count / Δtime). Both derived fields are 0 on a client's first scrape of an endpoint, and also fall back to 0 if the delta is negative (e.g. Triton restarted and its counters reset), rather than report a meaningless negative rate. parseMetrics sums each counter across all label combinations first, since Triton emits one series per loaded model.
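
The counter-to-rate step can be written as a small stateful type. This is a sketch of the technique described above, not the pkg/triton source: the `Tracker`, `Sample`, and `Derived` names are hypothetical, the inputs are assumed to be already summed across models, and timestamps are passed in as Unix nanoseconds (for example from time.Now().UnixNano()) so the logic stays deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// Sample is one scrape of the two cumulative counters, summed across models.
type Sample struct {
	QueueDurationUs float64 // nv_inference_queue_duration_us
	InferenceCount  float64 // nv_inference_count
	AtUnixNano      int64   // scrape time
}

// Derived holds the per-interval values computed from two samples.
type Derived struct {
	AvgQueueWaitUs    float64
	RequestRatePerSec float64
}

// Tracker remembers the previous sample so the next one can be diffed.
type Tracker struct {
	mu   sync.Mutex
	prev Sample
	seen bool
}

// Observe records cur and returns the values derived from the delta to the
// previous sample. It returns zeros on the first call and whenever a delta
// is negative, which happens when the server restarts and counters reset.
func (t *Tracker) Observe(cur Sample) Derived {
	t.mu.Lock()
	defer t.mu.Unlock()

	prev, seen := t.prev, t.seen
	t.prev, t.seen = cur, true
	if !seen {
		return Derived{}
	}

	dCount := cur.InferenceCount - prev.InferenceCount
	dQueue := cur.QueueDurationUs - prev.QueueDurationUs
	dSec := float64(cur.AtUnixNano-prev.AtUnixNano) / 1e9
	if dCount < 0 || dQueue < 0 || dSec <= 0 {
		return Derived{}
	}

	d := Derived{RequestRatePerSec: dCount / dSec}
	if dCount > 0 { // no requests in the interval means no average to report
		d.AvgQueueWaitUs = dQueue / dCount
	}
	return d
}

func main() {
	var tr Tracker
	tr.Observe(Sample{QueueDurationUs: 1000, InferenceCount: 20, AtUnixNano: 0})
	d := tr.Observe(Sample{QueueDurationUs: 6000, InferenceCount: 120, AtUnixNano: 10_000_000_000})
	fmt.Printf("avg wait: %.0f us, rate: %.0f req/s\n", d.AvgQueueWaitUs, d.RequestRatePerSec)
}
```

Storing the new sample even when the delta is rejected means the scrape after a restart diffs against the post-restart baseline, so only one interval is lost.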

getTritonMetricValue in pkg/scaler maps triton_queue_wait_ms (normalized from microseconds) and triton_request_rate to these derived fields, and parseMetadata requires tritonEndpoint whenever the requested metricType is one of them. As with vLLM, the scaler caches one pkg/triton.Client per distinct endpoint — this caching is what makes the derived metrics work at all, since a fresh client on every poll would never have a previous sample to diff against. See docs/configuration.md#triton-engine-metrics for usage.
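
A per-endpoint client cache is a mutex-guarded map keyed by endpoint. In this sketch the `Client` and `ClientCache` types are hypothetical stand-ins for the real scrape clients, and the endpoint URLs in `main` are made up.

```go
package main

import (
	"fmt"
	"sync"
)

// Client stands in for a stateful scrape client that remembers its
// previous sample between calls.
type Client struct {
	Endpoint string
}

// ClientCache hands out one long-lived Client per distinct endpoint.
type ClientCache struct {
	mu      sync.Mutex
	clients map[string]*Client
}

// Get returns the existing client for endpoint, creating it on first use.
func (c *ClientCache) Get(endpoint string) *Client {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.clients == nil {
		c.clients = make(map[string]*Client)
	}
	cl, ok := c.clients[endpoint]
	if !ok {
		cl = &Client{Endpoint: endpoint}
		c.clients[endpoint] = cl
	}
	return cl
}

func main() {
	var cache ClientCache
	a := cache.Get("http://triton-a:8002/metrics")
	b := cache.Get("http://triton-a:8002/metrics")
	fmt.Println("same client reused:", a == b)
}
```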

Future Work

- AMD ROCm support: Same DaemonSet pattern, different hardware library (rocm-smi)
- NVLink topology: Prefer scaling on nodes with direct GPU-to-GPU interconnect
- ~~vLLM queue depth: Read pending request count directly from vLLM's engine API for more precise scaling~~ Implemented as pkg/vllm (see above); see docs/configuration.md#vllm-engine-metrics
- ~~Triton queue wait / request rate: Read scheduling queue time and throughput directly from Triton's engine API for more precise scaling~~ Implemented as pkg/triton (see above); see docs/configuration.md#triton-engine-metrics