Learn to scale ML models with Kubernetes deployments, manage GPU resource requests, and configure Horizontal Pod Autoscalers for production-ready inference.
Previously in this course, we mastered optimized inference runtimes and TensorRT-LLM to squeeze maximum throughput out of single nodes. While these optimizations are essential, they are only half the battle. To run these models in production, you need a robust orchestration layer. This lesson adds Kubernetes (K8s) to your stack, providing the declarative framework necessary to manage, scale, and schedule your model containers.
A Kubernetes Deployment is the standard controller for managing stateless applications. For ML inference, we treat our model-serving container (e.g., a vLLM or Triton Inference Server instance) as a stateless unit. By defining a Deployment, we ensure that a specific number of replicas is always running, that crashed pods are replaced automatically, and that new model versions can be rolled out with rolling updates.
To deploy our model, we define a YAML manifest. Unlike standard web services, an ML deployment must explicitly request hardware accelerators.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-model
  template:
    metadata:
      labels:
        app: llm-model
    spec:
      containers:
        - name: model-server
          image: my-registry/llm-inference:latest
          resources:
            limits:
              nvidia.com/gpu: 1 # Requesting a single GPU
            requests:
              memory: "16Gi"
              cpu: "4"
          ports:
            - containerPort: 8000
```
Kubernetes does not treat GPUs as first-class citizens in the same way it treats CPU or memory; they are handled via Extended Resources provided by the device plugin (usually the NVIDIA device plugin).
When you set nvidia.com/gpu: 1 in your limits, the K8s scheduler ensures the pod is placed on a node that has at least one free GPU. Note that you cannot "fractionalize" a full physical GPU without specific configurations like MIG (Multi-Instance GPU), which we will cover in our next lesson.
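Crucial rule: GPUs are declared in the limits section. If you set only a limit, Kubernetes uses that value as the request too. If you set both, the two values must be equal, and you cannot set a GPU request without a matching limit; a manifest that breaks these rules is rejected at validation rather than scheduled. Each nvidia.com/gpu unit is allocated to a single container, so unless the cluster is explicitly configured for sharing, containers do not contend for the same GPU memory space.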
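To make the rule concrete, here is a sketch of the container's resources stanza with the GPU count stated explicitly on both sides. It reuses the values from the Deployment above; only the explicit nvidia.com/gpu entry under requests is added for illustration.

```yaml
# Fragment of the container spec, not a complete manifest.
resources:
  requests:
    nvidia.com/gpu: 1   # optional; if present it must equal the limit
    memory: "16Gi"
    cpu: "4"
  limits:
    nvidia.com/gpu: 1   # required whenever a GPU is requested
```

Omitting the nvidia.com/gpu line under requests behaves the same way, because the limit is used as the request.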
Static replicas are rarely enough for production. Inference workloads are inherently bursty. We use the HorizontalPodAutoscaler to dynamically adjust the number of replicas based on real-time telemetry.
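As a baseline, the simplest HPA scales on CPU utilization. The sketch below targets the llm-inference-service Deployment defined earlier; the name llm-hpa-cpu and the 70% threshold are illustrative assumptions, and it presumes the cluster runs a metrics server and that the container declares a CPU request (ours requests 4 cores).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa-cpu            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed threshold, % of the CPU request
```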
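While the standard HPA scales on CPU or memory, these metrics are poor indicators of model load. A GPU-bound model might have low CPU utilization but be completely saturated at the inference engine level. To scale effectively, you should use custom metrics, such as request latency or queue depth, often exported via Prometheus. The HPA reads these through the custom metrics API, so the cluster needs a metrics adapter (for example, Prometheus Adapter) that exposes the Prometheus series to Kubernetes.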
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_request_queue_size
        target:
          type: AverageValue
          averageValue: 5
```
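To see how averageValue: 5 turns into a replica count, it helps to work through the scaling rule given in the Kubernetes HPA documentation. The numbers in the example are assumed for illustration.

$$
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
$$

Suppose 2 replicas are running and the average queue size per pod is 12 against the target of 5:

$$
\left\lceil 2 \times \frac{12}{5} \right\rceil = \lceil 4.8 \rceil = 5
$$

The HPA would scale the Deployment to 5 replicas, and the result is always clamped to the range set by minReplicas and maxReplicas.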
For our running project, your goal is to transition your optimized inference container into a resilient K8s deployment.
1. Create a deployment.yaml that includes the nvidia.com/gpu resource limit.
2. Configure an HPA using cpu utilization as a placeholder, then transition to a custom metric once you have Prometheus scraping your inference server.
3. Run kubectl get pods -w and trigger a load test using a tool like locust to watch the HPA spin up new pods.

- Use readinessProbes to prevent the load balancer from sending traffic to a pod that is still loading weights.
- Use nodeSelector or nodeAffinity to ensure your inference pods are never scheduled on CPU-only machines.

Scaling inference requires moving beyond manual management into declarative orchestration. By using Kubernetes Deployments, you guarantee consistency; by using resource requests, you ensure hardware affinity; and by using HPA, you ensure your system adapts to fluctuating demand. As we move forward, we will optimize this further by looking at GPU resource allocation and how to slice GPUs for multi-tenancy.
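The readiness probe and node pinning mentioned above can be sketched as a fragment of the Deployment's pod template. The /health path, the probe timings, and the accelerator: nvidia-gpu node label are assumptions for illustration; substitute the health endpoint your inference server actually exposes and the labels your GPU nodes actually carry.

```yaml
# Fragment of deployment.yaml (under spec.template), not a complete manifest.
spec:
  nodeSelector:
    accelerator: nvidia-gpu      # hypothetical node label marking GPU nodes
  containers:
    - name: model-server
      image: my-registry/llm-inference:latest
      ports:
        - containerPort: 8000
      readinessProbe:
        httpGet:
          path: /health          # assumed health endpoint
          port: 8000
        initialDelaySeconds: 60  # assumed: allow time for weights to load
        periodSeconds: 10
        failureThreshold: 6
```

Until the probe succeeds, the pod is kept out of the Service's endpoints, so it receives no traffic while the model is still loading.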
Up next: GPU Resource Allocation and Scheduling — we will dive into MIG (Multi-Instance GPU) and pod topology constraints to optimize your cluster utilization.
Scaling Deployments with Kubernetes