Scaling Deployments with Kubernetes: Orchestrating ML Inference

Learn to scale ML models with Kubernetes deployments, manage GPU resource requests, and configure Horizontal Pod Autoscalers for production-ready inference.


Previously in this course, we mastered optimized inference runtimes and TensorRT-LLM to squeeze maximum throughput out of single nodes. While these optimizations are essential, they are only half the battle. To run these models in production, you need a robust orchestration layer. This lesson adds Kubernetes (K8s) to your stack, providing the declarative framework necessary to manage, scale, and schedule your model containers.

Creating Kubernetes Deployments for Inference

A Kubernetes Deployment is the standard controller for managing stateless applications. For ML inference, we treat our model-serving container (e.g., a vLLM or Triton Inference Server instance) as a stateless unit. A Deployment keeps a specified number of replicas running, replaces pods that crash, and performs rolling updates when we deploy new model versions.

To deploy our model, we define a YAML manifest. Unlike standard web services, an ML deployment must explicitly request hardware accelerators.


YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-model
  template:
    metadata:
      labels:
        app: llm-model
    spec:
      containers:
      - name: model-server
        image: my-registry/llm-inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Requesting a single GPU
          requests:
            memory: "16Gi"
            cpu: "4"
        ports:
        - containerPort: 8000

Managing GPU Resource Requests

Kubernetes does not treat GPUs as first-class citizens in the same way it treats CPU or memory; they are handled via Extended Resources provided by the device plugin (usually the NVIDIA device plugin).

When you set nvidia.com/gpu: 1 in your limits, the K8s scheduler ensures the pod is placed on a node that has at least one free GPU. Note that you cannot "fractionalize" a full physical GPU without specific configurations like MIG (Multi-Instance GPU), which we will cover in our next lesson.

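Crucial rule: GPUs are declared in the limits section. You may omit the GPU request, in which case Kubernetes uses the limit as the request, but if you set both, the two values must be equal, and a GPU request without a matching limit is not allowed. Kubernetes rejects a pod spec that breaks these rules. GPU counts must also be whole numbers, and containers do not share a GPU by default, so each replica in the manifest above occupies an entire device.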
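The request and limit rules for GPUs can be checked before a manifest ever reaches the cluster. The following is a minimal sketch, not part of any Kubernetes tooling: check_gpu_resources is a hypothetical helper, and it assumes the container spec has already been loaded into a plain Python dict with the same shape as the YAML above.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def check_gpu_resources(container: dict) -> list[str]:
    """Return a list of problems with a container's GPU settings (empty if none)."""
    resources = container.get("resources", {})
    limit = resources.get("limits", {}).get(GPU_RESOURCE)
    request = resources.get("requests", {}).get(GPU_RESOURCE)
    problems = []

    if limit is None and request is None:
        return problems  # the container does not ask for a GPU at all

    if limit is None:
        # A GPU request on its own is not accepted; the limit is mandatory.
        problems.append("GPU request set without a matching limit")
    elif request is not None and str(request) != str(limit):
        problems.append(f"GPU request ({request}) must equal GPU limit ({limit})")

    # Extended resources are counted in whole units, so "0.5" is invalid.
    for label, value in (("limit", limit), ("request", request)):
        if value is not None and not str(value).isdigit():
            problems.append(f"GPU {label} must be a whole number, got {value!r}")

    return problems


# The container from the Deployment above: a GPU limit only, which is valid.
lesson_container = {
    "name": "model-server",
    "resources": {
        "limits": {"nvidia.com/gpu": 1},
        "requests": {"memory": "16Gi", "cpu": "4"},
    },
}
print(check_gpu_resources(lesson_container))  # prints []
```

A check like this could run in CI next to your manifests, so a mismatched request is caught before kubectl apply rather than after.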

Configuring Horizontal Pod Autoscaler (HPA)

Static replicas are rarely enough for production. Inference workloads are inherently bursty. We use the HorizontalPodAutoscaler to dynamically adjust the number of replicas based on real-time telemetry.

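The HPA can scale on CPU or memory out of the box, but these metrics are poor indicators of model load. A GPU-bound model might show low CPU utilization while the inference engine is completely saturated. To scale effectively, use custom metrics such as request latency or queue depth. These are typically scraped by Prometheus and exposed to the HPA through a custom metrics adapter (for example, Prometheus Adapter), so the metric name in the manifest below must match the name that adapter publishes.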


YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_size
      target:
        type: AverageValue
        averageValue: 5
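To build intuition for how the HPA turns a queue-depth reading into a replica count, here is a simplified sketch of the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name is hypothetical, the 10% tolerance is the documented default as far as I know, and the real controller also accounts for unready pods and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule for an AverageValue target on a per-pod metric."""
    ratio = current_avg / target_avg
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


# 2 replicas, average queue depth of 12 per pod, target of 5 per pod.
print(desired_replicas(2, current_avg=12, target_avg=5))  # prints 5
```

With the manifest's bounds, a very deep queue still caps out at 10 replicas, and a nearly empty queue never drops below 2.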

Hands-on Exercise: Scaling Your Inference Service

For our running project, your goal is to transition your optimized inference container into a resilient K8s deployment.

Containerize: Ensure your optimized model server is containerized and pushed to your private registry.
Define: Create a deployment.yaml that includes the nvidia.com/gpu resource limit.
Scale: Apply an HPA manifest. If you don't have custom metrics installed yet, start by scaling on CPU utilization as a placeholder, then switch to a custom metric once Prometheus is scraping your inference server.
Verify: Run kubectl get pods -w (and kubectl get hpa -w in a second terminal), then trigger a load test with a tool like Locust to watch the HPA spin up new pods.
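For the Scale step, a CPU-based placeholder HPA might look like the sketch below. It builds the manifest as a Python dict and prints it as JSON, which kubectl accepts just like YAML. The function name, the "-cpu-hpa" name suffix, and the 70% target are assumptions chosen for illustration. Utilization is measured relative to the container's CPU request, so the Deployment must set one (ours requests 4 CPUs), and the cluster needs a metrics source such as metrics-server.

```python
import json


def cpu_placeholder_hpa(deployment_name, min_replicas=2, max_replicas=10,
                        target_cpu_percent=70):
    """Build an autoscaling/v2 HPA that scales a Deployment on CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment_name}-cpu-hpa"},  # hypothetical name
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment_name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Percentage of the CPU request, averaged across pods.
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": target_cpu_percent,
                    },
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(cpu_placeholder_hpa("llm-inference-service"), indent=2))
```

If you save this as make_hpa.py (a hypothetical filename), you could apply it with python make_hpa.py | kubectl apply -f -.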

Common Pitfalls

Cold Starts: LLM container images are massive (often 10GB+). When the HPA triggers a scale-out, the new pod can take minutes to pull the image and load the weights into VRAM. Define a readinessProbe so the Service does not route traffic to a pod that is still loading weights.
GPU Fragmentation: nvidia.com/gpu is allocated in whole devices, so if your container uses only 4GB of a 24GB card, the remaining 20GB sits idle and you still pay for it. Match the GPU type to your model's actual VRAM footprint, or share cards with MIG, which the next lesson covers.
Node Selector/Affinity: A pod that requests nvidia.com/gpu cannot be scheduled on a node that advertises no GPUs, but in a cluster with several GPU models it can still land on the wrong type of card. Use nodeSelector or nodeAffinity to pin inference pods to the GPU type the model was sized for.
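As an illustration of the cold-start advice, the sketch below adds an HTTP readiness probe to a container spec held as a Python dict. The helper name and the /health path are assumptions (check which health endpoint your serving framework actually exposes), while port 8000 matches the containerPort in the Deployment above. The probe fields themselves (httpGet, periodSeconds, failureThreshold) are standard Kubernetes container fields.

```python
import copy


def with_readiness_probe(container, path="/health", port=8000,
                         period_seconds=10, failure_threshold=3):
    """Return a copy of the container spec with an HTTP readiness probe added."""
    patched = copy.deepcopy(container)  # leave the caller's dict untouched
    patched["readinessProbe"] = {
        # The pod only receives Service traffic once this request succeeds.
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }
    return patched


container = {
    "name": "model-server",
    "image": "my-registry/llm-inference:latest",
    "ports": [{"containerPort": 8000}],
}
ready_container = with_readiness_probe(container)
print(ready_container["readinessProbe"]["httpGet"])
```

This only helps if the health endpoint stays unhealthy until the weights are fully loaded, so it is worth confirming that behavior for your server before relying on it.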

Recap

Scaling inference requires moving beyond manual management into declarative orchestration. By using Kubernetes Deployments, you guarantee consistency; by using resource requests, you ensure hardware affinity; and by using HPA, you ensure your system adapts to fluctuating demand. As we move forward, we will optimize this further by looking at GPU resource allocation and how to slice GPUs for multi-tenancy.

Up next: GPU Resource Allocation and Scheduling — we will dive into MIG (Multi-Instance GPU) and pod topology constraints to optimize your cluster utilization.
