Mahamudul Hasan Rubel
Lesson 41 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 28, 2026 · 3 min read

GPU Resource Allocation and Scheduling: Mastering MIG and K8s

Learn to partition hardware with Multi-Instance GPU (MIG) and optimize Kubernetes scheduling to maximize GPU utilization across your production AI fleet.

Tags: Kubernetes · GPU · Infrastructure · MLOps · NVIDIA · ai · machine-learning · python

Previously in this course, we covered the basics of scaling deployments with Kubernetes. That lesson established the foundation for deploying inference services, but it treated the GPU as an indivisible unit. In production, a single A100 or H100 is often overkill for smaller models, which leaves behind "GPU dark matter": compute capacity that is allocated to a pod but sits idle. This lesson shows how to slice hardware into smaller, isolated instances using Multi-Instance GPU (MIG) and how to tune the way your cluster schedules workloads onto them.

Multi-Instance GPU (MIG) First Principles

MIG allows you to partition a single physical NVIDIA GPU into several isolated "instances." Each instance has its own dedicated hardware resources for compute, memory, and bandwidth. Unlike software-based time-slicing, MIG provides hardware-level fault isolation: if one instance crashes, it doesn't impact others on the same chip.

From a scheduler perspective, MIG treats a physical GPU as a collection of smaller virtual GPUs. To use this, you must configure the NVIDIA device plugin to expose these partitions to the Kubernetes scheduler.
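As a rough illustration of why a GPU cannot be split arbitrarily, the sketch below checks a proposed layout against the slice budget of an A100 40GB (7 compute slices and 8 memory slices). The function name `fits_slice_budget` is made up for this example, and the check is only a necessary condition: the driver also enforces placement rules, so a layout that passes here can still be rejected by nvidia-smi.

```python
# Slice cost of each A100 40GB MIG profile: (compute slices, memory slices).
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_slice_budget(layout):
    """Return True if the profiles in `layout` fit within the slice budget.

    Hypothetical helper: it only sums slices and ignores the driver's
    placement rules, so treat a True result as "possibly valid".
    """
    compute = sum(A100_40GB_PROFILES[name][0] for name in layout)
    memory = sum(A100_40GB_PROFILES[name][1] for name in layout)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


if __name__ == "__main__":
    print(fits_slice_budget(["3g.20gb", "3g.20gb"]))  # 6 compute, 8 memory slices
    print(fits_slice_budget(["4g.20gb", "4g.20gb"]))  # 8 compute slices, over budget
```

Two 3g.20gb instances use all eight memory slices but only six of the seven compute slices, so one compute slice is left stranded. That trade-off is worth knowing before you commit to a layout.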

Configuring MIG for Kubernetes

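MIG is enabled and partitioned on the node itself with nvidia-smi. The nvidia-device-plugin DaemonSet does not create partitions; it only advertises the instances that already exist to Kubernetes. So partition your nodes first, then let the plugin expose them.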

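  1. Partitioning the GPU: Enable MIG mode on the device, then use nvidia-smi mig -cgi to create GPU instances with the desired profile. The -C flag also creates the default compute instance inside each GPU instance. For example, creating two 3g.20gb instances on an A100 40GB:

    Bash
    # Enable MIG mode on GPU 0 (the GPU must be idle)
    sudo nvidia-smi -i 0 -mig 1
    # Create two 3g.20gb GPU instances plus their default compute instances
    sudo nvidia-smi mig -cgi 3g.20gb,3g.20gb -C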
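  2. Kubernetes Exposure: Once partitioned, the NVIDIA device plugin detects the instances and advertises them as schedulable resources. How they are named depends on the plugin's MIG strategy. With the mixed strategy, each profile gets its own resource key, so your pods request nvidia.com/mig-3g.20gb instead of the generic nvidia.com/gpu. With the single strategy, MIG instances are still advertised as nvidia.com/gpu. The rest of this lesson assumes the mixed strategy.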
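The naming rule in step 2 can be sketched as a small lookup. The helper name `mig_resource_name` is hypothetical, and it covers only the device plugin's mixed and single strategies:

```python
def mig_resource_name(profile, strategy="mixed"):
    """Return the resource key a pod should request for a MIG profile.

    Hypothetical helper for illustration. `strategy` mirrors the NVIDIA
    device plugin's MIG strategy setting.
    """
    if strategy == "mixed":
        # Each profile is advertised under its own key.
        return f"nvidia.com/mig-{profile}"
    if strategy == "single":
        # All MIG instances on the node are advertised as plain GPUs.
        return "nvidia.com/gpu"
    raise ValueError(f"unsupported MIG strategy: {strategy!r}")


if __name__ == "__main__":
    print(mig_resource_name("3g.20gb"))
    print(mig_resource_name("3g.20gb", strategy="single"))
```

Generating the key from the profile name keeps manifests and partitioning scripts from drifting apart when you change profiles.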

Managing Resource Quotas and Pod Placement

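When running multiple models on a shared node, you face the "noisy neighbor" problem. MIG isolates compute and GPU memory, but workloads on the same node still share host resources such as CPU, system memory, PCIe bandwidth, and network, so contention can still occur if workloads are poorly placed.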

To manage this, we use Kubernetes Node Affinity and Taints/Tolerations to ensure that high-throughput training jobs don't starve latency-sensitive inference services.
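One way to picture the taint-plus-affinity pattern is to build the placement fragment of a pod spec programmatically. The sketch below assumes inference nodes carry both a taint and a label with the hypothetical key/value `dedicated=inference`; the names are illustrative, not taken from this course's cluster.

```python
import json


def inference_placement(key="dedicated", value="inference"):
    """Build the placement fields of a pod spec for dedicated inference nodes.

    Assumes (hypothetically) that those nodes are tainted with
    `<key>=<value>:NoSchedule` and labeled `<key>=<value>`.
    """
    return {
        # The toleration lets this pod onto the tainted nodes.
        "tolerations": [
            {"key": key, "operator": "Equal", "value": value, "effect": "NoSchedule"}
        ],
        # The affinity keeps this pod off every other node.
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {"key": key, "operator": "In", "values": [value]}
                            ]
                        }
                    ]
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(inference_placement(), indent=2))
```

The two halves do different jobs. The taint repels training jobs that lack the toleration, while the affinity stops the inference pod from landing on general-purpose nodes.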

Worked Example: Scheduling with Resource Constraints

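Here is a manifest for an inference pod that requests one 3g.20gb MIG instance and restricts itself to MIG-capable nodes. It also sets a PriorityClass (as discussed in our guide to critical workloads), so the scheduler can preempt lower-priority pods to place it and is less likely to evict it under pressure. The manifest assumes the high-priority-inference PriorityClass already exists in the cluster: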

YAML
apiVersion: v1
kind: Pod
metadata:
  name: inference-service-mig
spec:
  priorityClassName: high-priority-inference
  containers:
  - name: model-server
    image: my-model-repo:v1
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/mig.capable
            operator: In
            values:
            - "true"

Comparison: MIG vs. Time-Slicing

Feature         | MIG (Hardware)               | Time-Slicing (Software)
Isolation       | Hard (Compute & Memory)      | Soft (Process level)
Performance     | Deterministic                | Jitter-prone
Fault Tolerance | High (Crash isolation)       | Low
Complexity      | High (Requires partitioning) | Low (Config flag)

Hands-on Exercise

  1. Check Partitioning: Run kubectl describe node <node-name> and look for allocatable resources. Does it list specific MIG profiles?
  2. Deploy: Modify your current project inference deployment to use a specific MIG profile instead of the generic GPU request.
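  3. Verify: Run nvidia-smi inside the container (for example, via kubectl exec) and confirm that the only visible device is a MIG instance whose total memory matches your profile. Note that kubectl top pod reports only CPU and host memory, not GPU memory, so it cannot confirm the GPU limit.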
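Step 1 of the exercise can also be automated. The sketch below filters a node object, in the JSON shape that `kubectl get node <node-name> -o json` produces, down to its MIG resources. The function name and the sample node are made up for illustration.

```python
import json


def mig_allocatable(node):
    """Return {resource name: count} for MIG resources in a node object."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(count)
        for name, count in allocatable.items()
        if name.startswith("nvidia.com/mig-")
    }


# Illustrative sample only; real output contains many more fields.
SAMPLE_NODE_JSON = """
{
  "status": {
    "allocatable": {
      "cpu": "64",
      "memory": "263856Mi",
      "nvidia.com/mig-3g.20gb": "2"
    }
  }
}
"""

if __name__ == "__main__":
    print(mig_allocatable(json.loads(SAMPLE_NODE_JSON)))
```

An empty result means the node advertises no per-profile MIG resources. In that case, check whether the GPU is partitioned and which MIG strategy the device plugin is running with.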

Common Pitfalls

  • Fragmented Resources: If you partition a GPU into two 3g.20gb instances, you cannot re-merge them into a larger instance while they are in use. The existing instances must be destroyed first, which means evicting every pod running on them and reconfiguring the node. Plan your cluster's static partitioning carefully.
  • Driver Version Mismatch: MIG requires specific NVIDIA driver versions. Ensure your node image matches the requirements of the nvidia-container-toolkit (see our GPU Passthrough guide for details on keeping drivers in sync).
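  • Ignoring Memory Limits: A MIG partition is a hard ceiling on GPU memory. If your application tries to allocate more than the partition provides, it fails with an out-of-memory error. Kubernetes memory limits apply to host RAM, not GPU memory, so they will not protect you here. Size the model, batch size, and cache settings to fit the partition.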
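A cheap guard against the last pitfall is to compare your estimated GPU memory requirement with the profile before deploying. This is a minimal sketch, and both helper names and the 10% headroom default are assumptions. It handles only plain profile names such as 3g.20gb.

```python
import re


def profile_memory_gb(profile):
    """Extract the memory size in GB from a profile name like '3g.20gb'."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    return int(match.group(2))


def fits_partition(required_gb, profile, headroom=0.9):
    """Return True if `required_gb` fits in the usable part of the profile.

    `headroom` is a hypothetical safety factor that reserves room for the
    CUDA context and allocator fragmentation; tune it for your workload.
    """
    return required_gb <= profile_memory_gb(profile) * headroom


if __name__ == "__main__":
    print(fits_partition(16, "3g.20gb"))  # 16 GB against an 18 GB usable budget
    print(fits_partition(19, "3g.20gb"))  # 19 GB exceeds the 18 GB usable budget
```

Running a check like this in CI catches a model that has outgrown its partition before it reaches the cluster.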

Recap

Mastering GPU resource allocation requires moving away from treating GPUs as monolithic assets. By leveraging MIG for hard-isolated inference and K8s node affinity for placement, you can significantly increase the density of your model serving without sacrificing performance or stability.

Up next: We will integrate these scheduling strategies into our final project by deploying our full LLM pipeline to a production-ready Kubernetes cluster.
