Getting Started
Prerequisites
- Kubernetes cluster with NVIDIA GPU worker nodes (OKE, GKE, EKS, AKS, or bare metal)
- KEDA v2.10+ installed
- NVIDIA GPU drivers and Device Plugin installed
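One way to confirm these prerequisites before deploying is sketched below. It assumes KEDA was installed into the `keda` namespace under the default `keda-operator` Deployment name, and that GPU nodes carry the `nvidia.com/gpu.present=true` label used later in this guide. Adjust both if your cluster differs. The function names are illustrative only.

```shell
# Print the KEDA operator image; its tag shows the installed KEDA version.
# Assumption: default Deployment name "keda-operator" in namespace "keda".
keda_image() {
  kubectl get deployment keda-operator -n keda \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# List the nodes labeled as having an NVIDIA GPU.
gpu_nodes() {
  kubectl get nodes -l nvidia.com/gpu.present=true -o name
}

# Show allocatable GPUs per node. A populated GPU column means the
# Device Plugin has registered the nvidia.com/gpu resource on that node.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Succeed only if at least one labeled GPU node exists.
has_gpu_nodes() {
  [ -n "$(gpu_nodes)" ]
}
```

After sourcing these functions, `keda_image` should report a tag of 2.10 or later, and `has_gpu_nodes && gpu_capacity` should list at least one node with a non-empty GPU count.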
Deploy the Scaler
Make sure KEDA is already running, then deploy:
Or use Helm:
# --set-string keeps the label value a string, which nodeSelector requires
helm install keda-gpu-scaler deploy/helm/keda-gpu-scaler \
  --namespace keda \
  --set-string nodeSelector."nvidia\.com/gpu\.present"=true
This puts a pod on every GPU node, polling NVML and serving metrics over gRPC on port 6000.
Attach to Your Workload
Point a ScaledObject at the GPU scaler:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: ai-workloads
spec:
  scaleTargetRef:
    name: vllm-deepseek-deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: external
      metadata:
        scalerAddress: "keda-gpu-scaler.keda.svc.cluster.local:6000"
        profile: "vllm-inference"
KEDA will now scale your deployment based on GPU utilization and VRAM pressure.
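KEDA drives scaling through a HorizontalPodAutoscaler that it creates for the ScaledObject. As a rough mental model, the basic HPA rule from the Kubernetes documentation is desired = ceil(current × metric / target), with the result clamped to `minReplicaCount` and `maxReplicaCount`. The sketch below works through that arithmetic only. The readings and targets passed to it are made-up numbers, the actual metric and target depend on the scaler profile, and the real controller also applies a tolerance and stabilization windows that are omitted here.

```shell
# desired_replicas CURRENT METRIC TARGET MIN MAX
# Integer ceiling of CURRENT * METRIC / TARGET, clamped to [MIN, MAX].
# Illustrative helper only; it is not part of KEDA or the GPU scaler.
desired_replicas() {
  current=$1 metric=$2 target=$3 min=$4 max=$5
  # Adding (target - 1) before dividing rounds the quotient up.
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}
```

For example, `desired_replicas 4 90 70 1 50` prints `6`: four replicas at a hypothetical 90% reading against a 70% target round up to six. A result above 50 would be capped at the `maxReplicaCount` from the manifest.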
Verify It Works
# Check scaler pods are running on GPU nodes
kubectl get pods -n keda -l app=keda-gpu-scaler -o wide
# Check KEDA sees the ScaledObject
kubectl get scaledobject -A
# Watch the HPA in action (KEDA creates it in the ScaledObject's namespace)
kubectl get hpa -n ai-workloads -w
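If the HPA does not appear or never moves, the ScaledObject's status conditions usually say why. The sketch below reads the `Ready` condition, looks up the HPA that KEDA generates (named `keda-hpa-<scaledobject-name>` by KEDA's convention), and checks that the Service behind `scalerAddress` exists. The object names and namespaces are taken from the example manifest above, and the function names are illustrative.

```shell
# Print the Ready condition ("True", "False" or "Unknown") of a ScaledObject.
# usage: scaledobject_ready NAME NAMESPACE
scaledobject_ready() {
  kubectl get scaledobject "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Show the HPA that KEDA created for a ScaledObject.
# usage: keda_hpa NAME NAMESPACE
keda_hpa() {
  kubectl get hpa "keda-hpa-$1" -n "$2"
}

# Confirm the Service named in scalerAddress
# (keda-gpu-scaler.keda.svc.cluster.local) exists.
scaler_service() {
  kubectl get svc keda-gpu-scaler -n keda
}
```

With the example manifest applied, `scaledobject_ready vllm-inference-scaler ai-workloads` should print `True` once KEDA has accepted the trigger. If it does not, `kubectl describe scaledobject vllm-inference-scaler -n ai-workloads` shows the condition messages.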
What to read next
- Configuration Reference for profiles, aggregation, and all parameters
- Architecture if you want to understand the design
- Migration Guide if you're replacing dcgm-exporter + Prometheus