GPU Resource Allocation and Scheduling: Mastering MIG and K8s

Learn to partition hardware with Multi-Instance GPU (MIG) and optimize Kubernetes scheduling to maximize GPU utilization across your production AI fleet.

Tags: Kubernetes, GPU, Infrastructure, MLOps, NVIDIA, ai, machine-learning, python

Previously in this course, we covered the basics of scaling deployments with Kubernetes. While that lesson established the foundation for deploying inference services, it treated the GPU as an indivisible unit. In production, however, a single A100 or H100 is often overkill for smaller models, which leaves much of its compute idle: so-called "GPU dark matter." This lesson shows how to slice hardware into smaller, isolated instances using Multi-Instance GPU (MIG) and how to optimize the way your cluster schedules these workloads.

Multi-Instance GPU (MIG) First Principles

MIG allows you to partition a single physical NVIDIA GPU into several isolated "instances." Each instance has its own dedicated hardware resources for compute, memory, and bandwidth. Unlike software-based time-slicing, MIG provides hardware-level fault isolation: if one instance crashes, it doesn't impact others on the same chip.

From the scheduler's perspective, a MIG-enabled GPU appears as a collection of smaller, independent GPUs. For Kubernetes to see them, the NVIDIA device plugin must be configured to advertise these partitions as schedulable resources.
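To make the "collection of smaller GPUs" idea concrete, here is a minimal sketch of the slice arithmetic for a 40 GB A100, which exposes 7 compute slices and 8 memory slices. The function name fits_on_gpu is hypothetical, and the check is only a necessary condition: the driver enforces additional placement rules that this sketch ignores.

```python
# Slice cost of each A100 40GB MIG profile: (compute slices, memory slices).
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_SLICES = 8


def fits_on_gpu(profiles):
    """Return True if the requested profiles stay within the slice budget.

    Necessary but not sufficient: real placement rules are stricter.
    """
    compute = sum(A100_40GB_PROFILES[p][0] for p in profiles)
    memory = sum(A100_40GB_PROFILES[p][1] for p in profiles)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_SLICES


print(fits_on_gpu(["3g.20gb", "3g.20gb"]))             # True
print(fits_on_gpu(["3g.20gb", "3g.20gb", "2g.10gb"]))  # False
```

Two 3g.20gb instances use 6 compute slices and all 8 memory slices, so one compute slice is left without any memory to pair it with.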

Configuring MIG for Kubernetes

MIG itself is configured on the node with nvidia-smi; the nvidia-device-plugin DaemonSet then advertises the resulting instances to Kubernetes. Make sure your nodes are partitioned before you deploy workloads that request MIG resources.

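Partitioning the GPU: Enable MIG mode on the GPU, then use nvidia-smi mig -cgi to create GPU instances with the desired profile (the -C flag also creates the matching compute instances). For example, to create two 3g.20gb instances on a 40 GB A100: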


Bash
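# Enable MIG mode on GPU 0 (the GPU must be idle; a GPU reset may be required)
sudo nvidia-smi -i 0 -mig 1
# Create two 3g.20gb GPU instances; -C also creates their compute instances
sudo nvidia-smi mig -cgi 3g.20gb,3g.20gb -C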
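After partitioning, it can help to confirm which MIG devices exist. The sketch below parses the text printed by nvidia-smi -L. The sample output is illustrative only (the UUIDs are placeholders and the exact layout varies by driver version), and list_mig_profiles is a hypothetical helper name.

```python
import re

# Illustrative output only; real UUIDs and spacing differ by driver version.
SAMPLE_OUTPUT = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaa)
  MIG 3g.20gb     Device  0: (UUID: MIG-bbbb)
  MIG 3g.20gb     Device  1: (UUID: MIG-cccc)
"""

# Matches lines such as "  MIG 3g.20gb     Device  0: ..."
MIG_LINE = re.compile(r"^\s*MIG\s+(?P<profile>\S+)\s+Device\s+(?P<index>\d+):")


def list_mig_profiles(nvidia_smi_l_output):
    """Return the profile name of every MIG device line, in order."""
    profiles = []
    for line in nvidia_smi_l_output.splitlines():
        match = MIG_LINE.match(line)
        if match:
            profiles.append(match.group("profile"))
    return profiles


print(list_mig_profiles(SAMPLE_OUTPUT))  # ['3g.20gb', '3g.20gb']
```

On a real node, the input would come from something like subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout rather than a hard-coded string.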

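Kubernetes Exposure: Once the GPU is partitioned, the NVIDIA device plugin can advertise each instance as an individual resource, but how it does so depends on its MIG strategy. With the mixed strategy (--mig-strategy=mixed), your pods request instances using a profile-specific resource key such as nvidia.com/mig-3g.20gb. With the single strategy, MIG instances are exposed under the generic nvidia.com/gpu key, and with the default none strategy the plugin does not treat MIG instances specially.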
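The mapping from profile to resource key is simple enough to capture in a helper. This is a minimal sketch, assuming the device plugin's "mixed" and "single" MIG strategies; the function names are hypothetical and not part of any NVIDIA library.

```python
def mig_resource_name(profile, strategy="mixed"):
    """Return the extended resource name a pod should request.

    strategy mirrors the device plugin's MIG strategy setting.
    """
    if strategy == "mixed":
        # Each profile gets its own resource name.
        return f"nvidia.com/mig-{profile}"
    if strategy == "single":
        # All MIG instances are advertised under the generic GPU key.
        return "nvidia.com/gpu"
    raise ValueError(f"strategy {strategy!r} does not expose MIG instances")


def gpu_limits(profile, count=1, strategy="mixed"):
    """Build the resources.limits mapping for a container."""
    return {mig_resource_name(profile, strategy): count}


print(gpu_limits("3g.20gb"))                     # {'nvidia.com/mig-3g.20gb': 1}
print(gpu_limits("3g.20gb", strategy="single"))  # {'nvidia.com/gpu': 1}
```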

Managing Resource Quotas and Pod Placement

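When running multiple models on a shared node, you face the "noisy neighbor" problem. MIG isolates compute, memory, and memory bandwidth inside the GPU, but pods on the same node still share CPU, host memory, PCIe, network, and storage, so contention can occur if workloads are poorly placed.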

To manage this, we use Kubernetes Node Affinity and Taints/Tolerations to ensure that high-throughput training jobs don't starve latency-sensitive inference services.
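To see how taints and tolerations keep workloads apart, the sketch below models a simplified version of the matching rules Kubernetes applies. The taint key workload and its value inference are hypothetical, and the real scheduler handles more cases (such as tolerationSeconds) than this sketch does.

```python
def tolerates(toleration, taint):
    """Return True if a toleration matches a taint (simplified rules)."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: key and value must both match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations, node_taints):
    """A pod can land on a node only if every hard taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


# Hypothetical taint reserving a node for inference workloads.
inference_taint = {"key": "workload", "value": "inference", "effect": "NoSchedule"}
inference_tol = {"key": "workload", "operator": "Equal",
                 "value": "inference", "effect": "NoSchedule"}

print(schedulable([inference_tol], [inference_taint]))  # True
print(schedulable([], [inference_taint]))               # False
```

A training pod that carries no matching toleration is kept off the tainted node, which is what protects the inference services running there.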

Worked Example: Scheduling with Resource Constraints

Here is a manifest for an inference pod that requests a single 3g.20gb MIG instance. It also sets a PriorityClass (as discussed in our guide to critical workloads) so the scheduler places it ahead of lower-priority pods and can preempt them when the cluster is full:


YAML
apiVersion: v1
kind: Pod
metadata:
  name: inference-service-mig
spec:
  priorityClassName: high-priority-inference
  containers:
  - name: model-server
    image: my-model-repo:v1
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Label published by NVIDIA GPU Feature Discovery on MIG-capable nodes
          - key: nvidia.com/mig.capable
            operator: In
            values:
            - "true"
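If you deploy the same model against several MIG profiles, generating the manifest can be less error-prone than editing it by hand. The following is a minimal sketch that rebuilds the manifest above as a Python dictionary and prints it as JSON, which kubectl also accepts. The function name build_mig_pod is hypothetical, and the sketch assumes the profile-specific resource key shown in the manifest.

```python
import json


def build_mig_pod(name, image, profile, priority_class=None):
    """Build a Pod manifest (as a dict) that requests one MIG instance."""
    spec = {
        "containers": [{
            "name": "model-server",
            "image": image,
            # Profile-specific key, e.g. nvidia.com/mig-3g.20gb
            "resources": {"limits": {f"nvidia.com/mig-{profile}": 1}},
        }],
        "affinity": {"nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": [{
                    "key": "nvidia.com/mig.capable",
                    "operator": "In",
                    "values": ["true"],
                }]}],
            },
        }},
    }
    if priority_class:
        spec["priorityClassName"] = priority_class
    return {"apiVersion": "v1", "kind": "Pod",
            "metadata": {"name": name}, "spec": spec}


pod = build_mig_pod("inference-service-mig", "my-model-repo:v1",
                    "3g.20gb", "high-priority-inference")
print(json.dumps(pod, indent=2))
```

The printed JSON can be piped to kubectl apply -f - in place of the YAML file.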

Comparison: MIG vs. Time-Slicing

Feature	MIG (Hardware)	Time-Slicing (Software)
Isolation	Hard (Compute & Memory)	Soft (Process level)
Performance	Deterministic	Jitter-prone
Fault Tolerance	High (Crash isolation)	Low
Complexity	High (Requires partitioning)	Low (Config flag)

Hands-on Exercise

Check Partitioning: Run kubectl describe node <node-name> and look for allocatable resources. Does it list specific MIG profiles?
Deploy: Modify your current project inference deployment to use a specific MIG profile instead of the generic GPU request.
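Verify: Run nvidia-smi inside the container (for example, via kubectl exec) to confirm that the pod sees only its MIG device and that the GPU memory it reports matches your MIG profile. Note that kubectl top pod reports CPU and host memory only, not GPU memory.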
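For the first step of the exercise, the check can also be scripted. This is a minimal sketch that filters the allocatable section of a node object, as returned by kubectl get node <node-name> -o json, for MIG resources. The sample node below is made up for illustration, and mig_allocatable is a hypothetical helper name.

```python
def mig_allocatable(node):
    """Return {resource_name: count} for MIG resources on a node object."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(quantity)
        for name, quantity in allocatable.items()
        if name.startswith("nvidia.com/mig-")
    }


# Made-up node object, trimmed to the fields the function reads.
sample_node = {
    "status": {
        "allocatable": {
            "cpu": "64",
            "memory": "527951956Ki",
            "nvidia.com/mig-3g.20gb": "2",
        }
    }
}

print(mig_allocatable(sample_node))  # {'nvidia.com/mig-3g.20gb': 2}
```

An empty result suggests that either the GPU is not partitioned or the device plugin is not advertising profile-specific resource names.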

Common Pitfalls

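Fragmented Resources: If you partition a GPU into two 3g.20gb instances, you cannot re-partition it while workloads are running on it. The instances must be idle before they can be destroyed and recreated, which means draining every pod that uses them, and switching MIG mode on or off additionally requires a GPU reset. Plan your cluster's static partitioning carefully.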
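Driver Version Mismatch: MIG requires a sufficiently recent NVIDIA driver (the R450 series or newer for the A100; newer GPUs need newer drivers). Ensure the driver in your node image is also compatible with your versions of the nvidia-container-toolkit and the device plugin (see our GPU Passthrough guide for details on keeping drivers in sync).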
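Ignoring Memory Limits: A MIG partition caps GPU memory in hardware, so an application that tries to allocate more than the partition provides fails with a CUDA out-of-memory error. Kubernetes memory limits apply to host RAM only and do not change this. Size your model, batch size, and framework memory settings to fit the partition.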
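One way to catch an oversized model before deployment is a rough GPU memory estimate. The sketch below compares the size of the model weights with the nominal memory in the profile name. The function names and the 20% headroom are assumptions for illustration; actual usable memory is slightly below the nominal figure, and activations, KV cache, and the CUDA context all consume part of it.

```python
import re


def partition_memory_gb(profile):
    """Extract the nominal memory size from a profile name like '3g.20gb'."""
    match = re.fullmatch(r"\d+g\.(\d+)gb", profile)
    if not match:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    return int(match.group(1))


def weights_fit(num_params, bytes_per_param, profile, headroom=0.2):
    """Check that model weights fit while keeping a fraction of memory free.

    headroom is a hypothetical safety margin; tune it for your workload.
    """
    weights_gb = num_params * bytes_per_param / 1e9
    budget_gb = partition_memory_gb(profile) * (1 - headroom)
    return weights_gb <= budget_gb


# 7B parameters in fp16: 14 GB of weights against a 16 GB budget.
print(weights_fit(7e9, 2, "3g.20gb"))   # True
# 13B parameters in fp16: 26 GB of weights against a 16 GB budget.
print(weights_fit(13e9, 2, "3g.20gb"))  # False
```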

Recap

Mastering GPU resource allocation requires moving away from treating GPUs as monolithic assets. By leveraging MIG for hard-isolated inference and K8s node affinity for placement, you can significantly increase the density of your model serving without sacrificing performance or stability.

Up next: We will integrate these scheduling strategies into our final project by deploying our full LLM pipeline to a production-ready Kubernetes cluster.
