Getting Started

Prerequisites

  • Kubernetes cluster with NVIDIA GPU worker nodes (OKE, GKE, EKS, AKS, or bare metal)
  • KEDA v2.10+ installed
  • NVIDIA GPU drivers and Device Plugin installed
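
The prerequisites above can be checked before deploying. The following is a minimal sketch, not part of the project: `gpu_preflight` is a hypothetical helper name, and it assumes `kubectl` points at the target cluster and that KEDA is installed in the `keda` namespace. It counts nodes that advertise an allocatable `nvidia.com/gpu` resource (the resource the NVIDIA Device Plugin registers) and counts running KEDA pods.

```shell
# Hypothetical preflight helper; assumes KEDA runs in the "keda" namespace.
gpu_preflight() {
  # Nodes that advertise at least one allocatable nvidia.com/gpu resource.
  gpu_nodes=$(kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
    | awk -F'\t' '$2 + 0 > 0 { n++ } END { print n + 0 }')

  # KEDA pods currently in the Running phase.
  keda_pods=$(kubectl get pods -n keda \
    --field-selector=status.phase=Running -o name | wc -l | tr -d ' ')

  echo "Nodes with allocatable GPUs: $gpu_nodes"
  echo "Running KEDA pods: $keda_pods"

  # Succeed only if both prerequisites look satisfied.
  [ "$gpu_nodes" -gt 0 ] && [ "$keda_pods" -gt 0 ]
}
```

Run `gpu_preflight` after sourcing the snippet. A non-zero exit status means that no node exposes GPUs or that no KEDA pod is running.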

Deploy the Scaler

Make sure KEDA is already running, then deploy:

kubectl apply -f deploy/manifests.yaml

Or use Helm:

helm install keda-gpu-scaler deploy/helm/keda-gpu-scaler \
  --namespace keda \
  --set-string nodeSelector."nvidia\.com/gpu\.present"=true

The deployment runs one scaler pod on every GPU node. Each pod polls NVML and serves metrics over gRPC on port 6000.
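
Since one scaler pod is expected per GPU node, comparing the two counts is a quick way to spot a node that was missed. This is a rough sketch with a hypothetical helper name (`scaler_coverage`). It assumes GPU nodes carry the `nvidia.com/gpu.present=true` label used in the Helm command and that the scaler pods carry the `app=keda-gpu-scaler` label in the `keda` namespace.

```shell
# Hypothetical helper: compares GPU node count with running scaler pod count.
scaler_coverage() {
  # Nodes labelled as having a GPU (label assumed from the Helm nodeSelector).
  nodes=$(kubectl get nodes -l nvidia.com/gpu.present=true -o name \
    | wc -l | tr -d ' ')

  # Scaler pods in the Running phase (pod label and namespace are assumptions).
  pods=$(kubectl get pods -n keda -l app=keda-gpu-scaler \
    --field-selector=status.phase=Running -o name | wc -l | tr -d ' ')

  echo "GPU nodes: $nodes, running scaler pods: $pods"

  # Succeed only when every GPU node has a running scaler pod.
  [ "$nodes" -eq "$pods" ]
}
```

If `scaler_coverage` exits non-zero, look for node taints or a selector mismatch that keeps a pod from being scheduled.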

Attach to Your Workload

Point a ScaledObject at the GPU scaler:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: ai-workloads
spec:
  scaleTargetRef:
    name: vllm-deepseek-deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: external
      metadata:
        scalerAddress: "keda-gpu-scaler.keda.svc.cluster.local:6000"
        profile: "vllm-inference"

Once the ScaledObject is applied, KEDA scales the deployment between 1 and 50 replicas based on GPU utilization and VRAM pressure.
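
To put the manifest into effect, save it to a file and apply it, for example with `kubectl apply -f scaledobject.yaml` (the file name is hypothetical). By default KEDA names the HPA it creates `keda-hpa-<ScaledObject name>`. The sketch below uses that convention to show the ScaledObject and its HPA together. The helper names are illustrative, not part of KEDA or this project.

```shell
# Build the default name KEDA gives the HPA for a ScaledObject.
keda_hpa_name() {
  printf 'keda-hpa-%s\n' "$1"
}

# Show the ScaledObject and its generated HPA.
# $1 = ScaledObject name, $2 = namespace
show_scaling_state() {
  kubectl get scaledobject "$1" -n "$2"
  kubectl get hpa "$(keda_hpa_name "$1")" -n "$2"
}
```

For the example above, call `show_scaling_state vllm-inference-scaler ai-workloads`. If the HPA is missing, the ScaledObject's status conditions usually explain why.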

Verify It Works

# Check scaler pods are running on GPU nodes
kubectl get pods -n keda -l app=keda-gpu-scaler -o wide

# Check KEDA sees the ScaledObject
kubectl get scaledobject -A

# Watch the HPA that KEDA creates in the workload's namespace
kubectl get hpa -n ai-workloads -w