Learn to partition hardware with Multi-Instance GPU (MIG) and optimize Kubernetes scheduling to maximize GPU utilization across your production AI fleet.
Previously in this course, we covered the basics of scaling deployments with Kubernetes. While that lesson established the foundation for deploying inference services, it treated the GPU as an indivisible unit. In production, however, a single A100 or H100 is often overkill for smaller models, leaving much of its compute idle (so-called "GPU dark matter"). This lesson shows how to slice hardware into smaller, isolated instances using Multi-Instance GPU (MIG) and how to tune the way your cluster schedules these workloads.
MIG allows you to partition a single physical NVIDIA GPU into several isolated "instances" (up to seven on an A100 or H100). Each instance has its own dedicated hardware resources for compute, memory, and memory bandwidth. Unlike software-based time-slicing, MIG provides hardware-level fault isolation: if a workload in one instance crashes, it doesn't impact the other instances on the same chip.
From the scheduler's perspective, a MIG-enabled GPU appears as a collection of smaller GPUs. To use them, you must configure the NVIDIA device plugin to expose these partitions to the Kubernetes scheduler.
Setting this up takes two steps: first partition the GPUs on each node with `nvidia-smi`, then let the `nvidia-device-plugin` DaemonSet advertise the resulting instances to Kubernetes.
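Partitioning the GPU: Enable MIG mode on the GPU, then use `nvidia-smi mig -cgi` to create GPU instances from the desired profiles. For example, to create two `3g.20gb` instances on a 40 GB A100:

```bash
# Enable MIG mode on GPU 0 (may require a GPU reset to take effect)
sudo nvidia-smi -i 0 -mig 1
# Create two 3g.20gb GPU instances; -C also creates a compute instance in each
sudo nvidia-smi mig -cgi 3g.20gb,3g.20gb -C
```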
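Profile names encode a slice budget, which explains why two `3g.20gb` instances are the most that fit on one card. As a rough sketch, the helper below adds up compute and memory slices for a 40 GB A100 (7 compute slices and 8 memory slices, per NVIDIA's MIG documentation). The function name `mig_fits` is hypothetical, and passing this check is necessary but not sufficient, because the driver also enforces placement rules. `nvidia-smi mig -lgip` lists the profiles a given GPU actually supports.

```shell
# Hypothetical helper: checks a list of MIG profiles against the slice budget
# of a 40 GB A100 (7 compute slices, 8 memory slices).
mig_fits() {
  compute=0
  memory=0
  for profile in "$@"; do
    case "$profile" in
      1g.5gb)  c=1; m=1 ;;
      2g.10gb) c=2; m=2 ;;
      3g.20gb) c=3; m=4 ;;
      4g.20gb) c=4; m=4 ;;
      7g.40gb) c=7; m=8 ;;
      *) echo "unknown profile: $profile" >&2; return 2 ;;
    esac
    compute=$((compute + c))
    memory=$((memory + m))
  done
  echo "compute=$compute/7 memory=$memory/8"
  # Succeeds only if both budgets are respected.
  [ "$compute" -le 7 ] && [ "$memory" -le 8 ]
}
```

Running `mig_fits 3g.20gb 3g.20gb` reports `compute=6/7 memory=8/8` and succeeds. Adding a `1g.5gb` pushes memory to 9/8 and fails, so the seventh compute slice stays unused in this layout.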
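Kubernetes Exposure: Once the GPU is partitioned, the NVIDIA device plugin can advertise each instance as its own schedulable resource. How they are named depends on the plugin's MIG strategy: with `mixed`, pods request a profile-specific key such as `nvidia.com/mig-3g.20gb` instead of the generic `nvidia.com/gpu`; with `single`, instances are still advertised as `nvidia.com/gpu`.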
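Profile-specific resource keys only appear when the device plugin runs with the `mixed` MIG strategy. The sketch below shows one way to set that and then confirm what a node advertises. It assumes the plugin was installed from NVIDIA's Helm chart with both the repo alias and the release named `nvdp`; the function names are hypothetical, and the commands are wrapped in functions so the file can be sourced safely.

```shell
# Assumptions: NVIDIA's Helm chart, repo alias and release name "nvdp",
# namespace "nvidia-device-plugin". Adjust to match your cluster.
enable_mixed_mig_strategy() {
  helm upgrade --install nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin --create-namespace \
    --set migStrategy=mixed
}

# Print the MIG resources a node advertises,
# for example: "nvidia.com/mig-3g.20gb": "2"
list_mig_resources() {
  kubectl get node "$1" -o json \
    | grep -o '"nvidia.com/mig-[^"]*": *"[0-9]*"' \
    | sort -u
}
```

Call `enable_mixed_mig_strategy` once per cluster, then `list_mig_resources <node-name>` on a partitioned node. An empty result usually means the plugin has not picked up the partitions or is running with a different strategy.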
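When running multiple models on a shared node, you face the "noisy neighbor" problem. MIG isolates compute and memory on the GPU itself, but instances still share resources that MIG does not partition, such as host CPU, host memory, and PCIe bandwidth, so contention can occur if workloads are poorly placed.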
To manage this, we use Kubernetes Node Affinity and Taints/Tolerations to ensure that high-throughput training jobs don't starve latency-sensitive inference services.
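As a minimal sketch of the taint and toleration half of this approach, the functions below taint a node reserved for inference and print the matching toleration. The taint `workload=inference` and both function names are hypothetical; pick whatever key your cluster conventions use.

```shell
# Keep pods without a matching toleration (e.g. training jobs) off this node.
# The taint key/value "workload=inference" is an illustrative assumption.
taint_inference_node() {
  kubectl taint nodes "$1" workload=inference:NoSchedule --overwrite
}

# Print the toleration to place under an inference pod's `spec`.
inference_toleration() {
  cat <<'EOF'
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "inference"
    effect: "NoSchedule"
EOF
}
```

A taint only keeps other pods out. It does not attract inference pods to the node, which is why it is paired with node affinity.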
Here is a manifest for an inference pod that requests a single `3g.20gb` MIG instance. It also sets a PriorityClass (as discussed in our guide to critical workloads) so the pod is scheduled ahead of lower-priority work and is less likely to be preempted:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-service-mig
spec:
  priorityClassName: high-priority-inference
  containers:
    - name: model-server
      image: my-model-repo:v1
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/mig.capable
                operator: In
                values:
                  - "true"
```
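The manifest refers to a PriorityClass named `high-priority-inference`, and Kubernetes rejects pods that name a PriorityClass that does not exist. A minimal sketch for creating it follows; the value `1000000` and the description are illustrative assumptions, and the function name is hypothetical.

```shell
# Create the PriorityClass referenced by the pod manifest.
# The numeric value is an assumption; higher values are scheduled first.
create_inference_priority_class() {
  kubectl create priorityclass high-priority-inference \
    --value=1000000 \
    --description="Latency-sensitive inference pods"
}
```

Run `create_inference_priority_class` once per cluster before applying the pod manifest.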
| Feature | MIG (Hardware) | Time-Slicing (Software) |
|---|---|---|
| Isolation | Hard (Compute & Memory) | Soft (Process level) |
| Performance | Deterministic | Jitter-prone |
| Fault Tolerance | High (Crash isolation) | Low |
| Complexity | High (Requires partitioning) | Low (Config flag) |
- Run `kubectl describe node <node-name>` and look for allocatable resources. Does it list specific MIG profiles?
- Use `kubectl top pod` and `nvidia-smi` inside the container to confirm that the memory limit matches your MIG profile constraints.
- Check `nvidia-container-toolkit` (see our GPU Passthrough guide for details on keeping drivers in sync).
- Set `limits` to match the partition size to prevent OOM errors.

Mastering GPU resource allocation requires moving away from treating GPUs as monolithic assets. By leveraging MIG for hard-isolated inference and Kubernetes node affinity for placement, you can significantly increase the density of your model serving without sacrificing performance or stability.
Up next: We will integrate these scheduling strategies into our final project by deploying our full LLM pipeline to a production-ready Kubernetes cluster.