Mahamudul Hasan Rubel
Lesson 40 of 48 in the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML | June 28, 2026 | 4 min read

Scaling Deployments with Kubernetes: Orchestrating ML Inference

Learn to scale ML models with Kubernetes deployments, manage GPU resource requests, and configure Horizontal Pod Autoscalers for production-ready inference.

Tags: Kubernetes, Deployment, Scaling, Orchestration, MLOps, GPUs, ai, machine-learning, python

Earlier in this course, we covered optimized inference runtimes such as vLLM and TensorRT-LLM to squeeze maximum throughput out of a single node. Those optimizations are essential, but they are only half the battle. To run these models in production, you need a robust orchestration layer. This lesson adds Kubernetes (K8s) to your stack, providing the declarative framework needed to manage, scale, and schedule your model containers.

Creating Kubernetes Deployments for Inference

A Kubernetes Deployment is the standard controller for managing stateless applications. For ML inference, we treat our model-serving container (e.g., a vLLM or Triton Inference Server instance) as a stateless unit. A Deployment keeps a specified number of replicas running, replaces pods that crash, and performs rolling updates when we deploy a new model version.

To deploy our model, we define a YAML manifest. Unlike a typical web service, a GPU-backed inference deployment must explicitly request its hardware accelerator.

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-model
  template:
    metadata:
      labels:
        app: llm-model
    spec:
      containers:
      - name: model-server
        image: my-registry/llm-inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Requesting a single GPU
          requests:
            memory: "16Gi"
            cpu: "4"
        ports:
        - containerPort: 8000

Managing GPU Resource Requests

Kubernetes does not treat GPUs as first-class resources the way it treats CPU and memory. They are exposed as extended resources advertised by a device plugin (usually the NVIDIA device plugin).

When you set nvidia.com/gpu: 1 in your limits, the K8s scheduler places the pod only on a node that has at least one unallocated GPU. GPUs are requested in whole numbers: you cannot request a fraction of a physical GPU without specific configurations like MIG (Multi-Instance GPU), which we will cover in our next lesson.

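Crucial rule: GPUs cannot be overcommitted. You must set nvidia.com/gpu under limits. You may omit it from requests, because Kubernetes then uses the limit as the request, but if you specify both, the two values must be equal. A pod spec that sets a GPU request without a limit, or with mismatched values, is rejected by the API server rather than scheduled. Each allocated GPU is dedicated to one container, so containers do not contend for the same device unless you deliberately enable sharing.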
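
As a small sketch, here is a resources stanza in which the GPU count matches in both requests and limits. The CPU and memory values are simply reused from the Deployment manifest above; the GPU entry under requests is optional and shown only to make the matching values explicit.

```yaml
# Container resources stanza (fragment of the Deployment's pod spec).
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: 1   # optional; if present it must equal the limit
  limits:
    nvidia.com/gpu: 1   # required; whole GPUs only
```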

Configuring Horizontal Pod Autoscaler (HPA)

Static replicas are rarely enough for production. Inference workloads are inherently bursty. We use the HorizontalPodAutoscaler to dynamically adjust the number of replicas based on real-time telemetry.
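
The scaling decision itself follows a simple ratio documented for the HPA controller. As a worked illustration with made-up numbers, suppose 2 replicas currently report an average metric value of 12 against a target of 5:

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
\[
\left\lceil 2 \times \frac{12}{5} \right\rceil = \lceil 4.8 \rceil = 5
\]
```

The result is then clamped to the configured minReplicas and maxReplicas, and the controller skips scaling when the ratio is within its tolerance of 1.0 (10% by default).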

The HPA scales on CPU or memory out of the box, but these metrics are poor indicators of model load. A GPU-bound model can show low CPU utilization while the inference engine is completely saturated. To scale effectively, use custom metrics such as request latency or queue depth. These are typically scraped by Prometheus and must be exposed to the HPA through the custom metrics API, for example by running a metrics adapter such as Prometheus Adapter.

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_size  # must match the metric name exposed through your custom metrics API
      target:
        type: AverageValue
        averageValue: 5
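
For the HPA to see vllm_request_queue_size, something has to publish it on the custom metrics API. One common approach is Prometheus Adapter. The rule below is a minimal sketch, not a drop-in config: it assumes Prometheus already scrapes a vLLM gauge named vllm:num_requests_waiting and attaches namespace and pod labels to it. Both the source metric name and the label names are assumptions to check against your own setup.

```yaml
# Prometheus Adapter rule (fragment of the adapter's config file).
rules:
- seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}   # map the label to the Kubernetes namespace
      pod: {resource: "pod"}               # map the label to the Kubernetes pod
  name:
    matches: "^vllm:num_requests_waiting$"
    as: "vllm_request_queue_size"          # name referenced by the HPA manifest
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Once the adapter is running, kubectl get hpa llm-hpa should show a current value for the metric instead of <unknown>.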

Hands-on Exercise: Scaling Your Inference Service

For our running project, your goal is to transition your optimized inference container into a resilient K8s deployment.

  1. Containerize: Ensure your optimized model server is containerized and pushed to your private registry.
  2. Define: Create a deployment.yaml that includes the nvidia.com/gpu resource limit.
  3. Scale: Apply an HPA manifest. If you don't have custom metrics installed yet, start by scaling on CPU utilization as a placeholder (this requires metrics-server and a CPU request on the container), then switch to a custom metric once Prometheus is scraping your inference server.
  4. Verify: Run kubectl get pods -w, then trigger a load test with a tool like Locust and watch the HPA spin up new pods.
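
The CPU placeholder from step 3 could look like the sketch below. It reuses the names from the manifests earlier in this lesson; the 70% threshold is an assumed starting point, not a recommendation, and utilization is measured relative to the container's CPU request.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # assumed threshold, percent of the CPU request
```

Swapping this for the custom-metric version later only changes the metrics section, so the rest of the manifest can stay as it is.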

Common Pitfalls

  • Cold Starts: LLM container images are massive (often 10GB+). When the HPA triggers a scale-out, a new pod can take minutes to pull the image and load the weights into VRAM. Configure a readinessProbe so the Service does not route traffic to a pod that is still loading weights.
  • GPU Fragmentation: nvidia.com/gpu is allocated in whole devices, so a container that uses only 4GB of a 24GB card still reserves the entire GPU and wastes expensive hardware. Match the GPU type to your model's actual VRAM footprint, or share the card through a mechanism such as MIG.
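  • Node Selector/Affinity: If your cluster has a mix of GPU and CPU-only nodes, use nodeSelector or nodeAffinity to steer inference pods to the intended GPU nodes. A pod that requests nvidia.com/gpu will not be placed on a node without GPUs, but a selector still matters when the cluster mixes GPU models.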
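
The cold-start and node-placement pitfalls translate into a few lines of the pod template. The fragment below is a sketch under assumptions: /health is assumed to be a health endpoint served by your model server, the accelerator label is a hypothetical node label, and the probe timings are placeholders to tune against your measured load time.

```yaml
# Fragment of the Deployment's pod template (spec.template.spec).
spec:
  nodeSelector:
    accelerator: nvidia-gpu     # hypothetical node label; use one your GPU nodes carry
  containers:
  - name: model-server
    image: my-registry/llm-inference:latest
    ports:
    - containerPort: 8000
    startupProbe:               # allows up to 60 x 10s = 10 min for weights to load
      httpGet:
        path: /health           # assumed health endpoint
        port: 8000
      periodSeconds: 10
      failureThreshold: 60
    readinessProbe:             # pod receives traffic only while this succeeds
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 5
      failureThreshold: 3
```

The startup probe is an addition beyond the readiness probe named above: it holds off the other probes until the server first responds, which avoids tuning the readiness probe around a one-time load delay.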

Recap

Scaling inference requires moving beyond manual management into declarative orchestration. By using Kubernetes Deployments, you guarantee consistency; by declaring GPU resources, you ensure pods land on nodes with the right hardware; and by using HPA, you ensure your system adapts to fluctuating demand. As we move forward, we will optimize this further by looking at GPU resource allocation and how to slice GPUs for multi-tenancy.

Up next: GPU Resource Allocation and Scheduling. We will dive into MIG (Multi-Instance GPU) and pod topology constraints to optimize your cluster utilization.
