Kubernetes, DevOps · June 19, 2026 · 4 min read

Kubernetes PriorityClass: Managing Critical Workloads with Preemption

Master Kubernetes PriorityClass to manage critical workloads. Learn how pod preemption works to ensure high-priority services survive node resource contention.

Tags: Kubernetes, DevOps, SRE, Scheduling, Infrastructure

During a routine deployment of our core API last Thursday, we ran into a classic "noisy neighbor" problem that brought down our production logging stack. Even though we were already running VPA in recommendation mode to right-size our pods (see "Kubernetes Resource Management: Using VPA Recommendation Mode"), a sudden spike in batch processing jobs consumed all available CPU on our worker nodes, causing the API’s liveness probes to fail. We needed a way to guarantee that user-facing traffic took precedence over background tasks, which is where Kubernetes PriorityClass and pod preemption saved the day.

Understanding Kubernetes PriorityClass and Pod Preemption

At its core, a Kubernetes PriorityClass is a non-namespaced (cluster-scoped) object that maps a name to an integer priority, which defines the relative importance of a pod. When the scheduler cannot fit a pending pod on any node, it looks for a node where removing pods of lower priority would free enough room. If it finds one, it preempts those lower-priority pods: they are terminated gracefully, and the pending pod is scheduled once the resources are released.

Before we implemented this, we were scaling our node groups with the setup described in "Implementing Kubernetes Node Auto-Provisioning: Karpenter and Bottlerocket", but that wasn't fast enough for instantaneous traffic spikes. Relying on auto-scaling is great for capacity, but pod preemption is your last line of defense when the cluster is physically out of room.

Implementing Your First PriorityClass

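To get started, you define a PriorityClass manifest. The value field is an integer; the higher the number, the higher the priority. User-defined classes can use values up to 1,000,000,000; anything larger is reserved for the built-in system classes such as system-cluster-critical and system-node-critical.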


YAML
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "Used for high-priority user-facing APIs."

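Once applied, you reference the class by name in the pod spec. In a Deployment, that is the pod template under spec.template.spec:


YAML
# Pod spec fragment (spec.template.spec in a Deployment)
spec:
  priorityClassName: critical-service  # must match the PriorityClass metadata.name
  containers:
    - name: api-pod
      image: my-app:v1.2.3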
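
For context, here is a sketch of how the fragment above might sit inside a complete Deployment. The Deployment name, labels, replica count, and resource numbers are placeholders rather than values from our cluster. The resource requests are included because the scheduler decides whether a pod fits on a node based on requests, so preemption only has something to reason about when they are set.

```yaml
# Hypothetical Deployment; name, labels, replicas, and resource
# numbers are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Must match metadata.name of an existing PriorityClass,
      # otherwise the pod is rejected at admission.
      priorityClassName: critical-service
      containers:
        - name: api-pod
          image: my-app:v1.2.3
          resources:
            # Scheduling (and therefore preemption) is based on requests.
            requests:
              cpu: "500m"
              memory: "256Mi"
```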

The Messy Reality of Resource Scheduling

We first tried setting a global default priority for all pods, thinking it would make our Kubernetes resource management predictable. That was a mistake. We ended up with a cluster where every pod thought it was important, causing the scheduler to churn constantly as pods evicted each other in a death spiral. We learned that globalDefault should almost always be false.

We also initially underestimated the impact of preemptionPolicy, a field on the PriorityClass. By default, it's set to PreemptLowerPriority. We once set it to Never on a test workload to see if we could avoid restarts of other pods, but that just meant our critical pods stayed in a Pending state for roughly 45 minutes until a node became free. That’s an eternity in production.
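
As a minimal sketch of what a non-preempting class looks like, the manifest below sets preemptionPolicy: Never. The class name and value are made up for illustration. Pods using a class like this are still placed ahead of lower-priority pods in the scheduling queue, but they never evict running pods, which is why they can sit in Pending when the cluster is full.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting  # hypothetical name
value: 100000                        # illustrative value
# Never: pods in this class wait for free capacity instead of
# evicting lower-priority pods. The default is PreemptLowerPriority.
preemptionPolicy: Never
globalDefault: false
description: "Queued ahead of lower-priority pods but never preempts running pods."
```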

Managing high-priority workloads requires a tiered approach. We now use three tiers:

System-Critical (1,000,000+): Core networking, DNS, and ingress.
User-Facing (100,000): Our main API and frontend services.
Background (10,000): Batch jobs, cron jobs, and log shippers.
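
A rough sketch of how these three tiers could be declared follows. The class names are hypothetical and the values simply mirror the numbers in the list. Pods that reference no class get priority 0 when no globalDefault class exists, so even the background tier ranks above unlabelled pods.

```yaml
# Hypothetical tier definitions; names are assumptions, values
# follow the tiers listed above.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-system-critical
value: 1000000
globalDefault: false
description: "Core networking, DNS, and ingress."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-user-facing
value: 100000
globalDefault: false
description: "Main API and frontend services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-background
value: 10000
globalDefault: false
description: "Batch jobs, cron jobs, and log shippers."
```

Leaving wide gaps between the values makes it possible to slot in a new tier later without renumbering the existing ones.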

Lessons Learned from the Trenches

The biggest challenge isn't the configuration; it's the ripple effect. When a high-priority pod preempts a lower one, the lower-priority pod is terminated, and the replacement created by its controller must be scheduled elsewhere. If your cluster is already saturated, those replacement pods might end up in a Pending state, creating a backlog.

We’ve found that combining this with resource right-sizing (see "Kubernetes VPA and Goldilocks: Master Resource Right-Sizing") is the only way to keep the cluster healthy. If you don't know your actual resource usage, you're just guessing where the pressure points are.
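
For readers who have not set this up, a VPA in recommendation-only mode might look like the sketch below. It assumes the Vertical Pod Autoscaler components and CRDs are installed in the cluster and that the target is a Deployment named api; both the object name and the target name are hypothetical.

```yaml
# Sketch of a recommendation-only VPA; assumes the VPA CRDs are
# installed. The names "api-vpa" and "api" are placeholders.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    # "Off" computes recommendations without evicting or resizing pods.
    updateMode: "Off"
```

With updateMode set to "Off", the recommendations appear in the object's status and can be used to set requests by hand.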

FAQ

Q: Will my pods be killed immediately if a higher-priority pod arrives? A: Only if the higher-priority pod cannot be scheduled anywhere and the scheduler determines that evicting yours would make room. Even then the kill is not instant: the pod receives a SIGTERM and gets its termination grace period to shut down.

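Q: Can I prevent a specific pod from being preempted? A: Not with preemptionPolicy. Setting preemptionPolicy: Never on a PriorityClass only stops pods in that class from preempting others; they can still be preempted by higher-priority pods, and if the cluster is full they will simply wait in Pending until room opens up. The practical way to protect a pod is to give it a priority at least as high as anything that might compete with it, since preemption only targets pods of lower priority.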

Q: Is there a limit to how many PriorityClasses I should have? A: Keep it simple. We use three tiers, as mentioned above. Adding more granularity usually leads to "priority creep," where every developer argues their service deserves a +100 increase over the next.

I’m still not entirely convinced our current tiering strategy is optimal. We’re currently investigating whether we should move our database workloads, which we run using CloudNativePG (see "CloudNativePG for Reliable Kubernetes Database Management"), into their own separate node pools to avoid preemption entirely. For now, it works, but I suspect we'll need to revisit our eviction budgets as we continue to scale.
