Technology | Software Engineering | June 19, 2026 | 3 min read

Kubernetes Cluster API: Automating Node Upgrades with CAPI

Master Kubernetes Cluster API for automated node upgrades. Learn how to leverage MachineHealthCheck for reliable, hands-off node lifecycle management today.

Kubernetes, CAPI, DevOps, Infrastructure as Code, Automation, Cloud Native, Linux, Server

Managing node lifecycles manually is a recipe for burnout. If you're still cordoning, draining, and upgrading nodes by hand over SSH, you're doing it the hard way. In this post, I'll show you how to shift from manual intervention to declarative automation using Kubernetes Cluster API (CAPI).

Why Cluster API for Node Lifecycle Management?

The beauty of CAPI lies in treating infrastructure like application code. Instead of managing individual virtual machines, you define the desired state of your cluster in YAML. When it's time to upgrade, you update the MachineDeployment object, and the controller handles the rollout.

This approach minimizes human error. It's consistent, repeatable, and, most importantly, version-controlled.

Setting Up MachineHealthCheck

Before you automate upgrades, you need to ensure your existing nodes are healthy. The MachineHealthCheck resource is your best friend here. It automatically remediates nodes that fail health checks, ensuring your cluster maintains its desired capacity.

Here’s a sample MachineHealthCheck manifest for a CAPI-managed cluster:


YAML
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-node-healthcheck
  namespace: capi-system  # must be the namespace that holds the Cluster object
spec:
  clusterName: production-cluster
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: worker-deployment  # targets machines of this MachineDeployment
  unhealthyConditions:
    - type: Ready
      status: "Unknown"
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
  maxUnhealthy: 40%  # stop remediating if more than 40% of selected machines are unhealthy

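In this setup, if a node's Ready condition stays "Unknown" or "False" for more than 300 seconds, the MachineHealthCheck controller marks the corresponding Machine for remediation. For machines owned by a MachineSet, which is the case for anything created by a MachineDeployment, that means the Machine is deleted and the MachineSet provisions a fresh one. The maxUnhealthy: 40% setting is a safety valve: if more than 40% of the selected machines are unhealthy at once, remediation pauses, so a widespread outage (a network partition, for example) doesn't trigger mass deletion. It's self-healing infrastructure in action.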
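The Ready-condition checks only apply once a Node object exists. As a hedged sketch, the same manifest can also cover machines that never join the cluster by setting nodeStartupTimeout. The 10m value here is an illustrative assumption, not a recommendation from this post; tune it to how long your provider takes to boot a node.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-node-healthcheck
  namespace: capi-system
spec:
  clusterName: production-cluster
  # Assumption: 10m is illustrative. A Machine that still has no Node
  # after this long is treated as failed and remediated.
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: worker-deployment
  unhealthyConditions:
    - type: Ready
      status: "Unknown"
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
  maxUnhealthy: 40%
```

Setting this too low can cause machines to be replaced while they are still booting, so err on the generous side.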

Automated Node Upgrades with CAPI

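To perform automated node upgrades, you don't run a script. You update the version field of your MachineDeployment (spec.template.spec.version). The KubeadmConfigTemplate has no Kubernetes version field of its own, but depending on your infrastructure provider you may also need to point the MachineDeployment at a new infrastructure machine template whose image ships the target version. Upgrade the control plane first: kubelets must not run a newer Kubernetes version than the API server.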

Let’s say you’re moving from Kubernetes v1.28.2 to v1.29.1. You’ll update your manifest like this:


YAML
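apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: worker-deployment
spec:
  template:
    spec:
      version: v1.29.1  # was v1.28.2; changing this triggers the rollout
      # ... other config
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # create at most one extra machine at a time
      maxUnavailable: 0  # never go below the desired replica count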

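Once you apply this change, the MachineDeployment controller detects the delta, creates a new MachineSet for the updated template, and performs a rolling update: it brings up a new node on v1.29.1, waits for it to become ready, then cordons, drains, and deletes an old one. With maxSurge: 1 and maxUnavailable: 0, the number of ready nodes never drops below the desired replica count during the upgrade. Individual pods are still evicted and rescheduled as old nodes drain, so your workloads need enough replicas to tolerate that.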

Hard-Won Lessons from Production

I’ve seen plenty of upgrades go sideways. Here are three things I’ve learned the hard way:

1. Pre-flight Checks are Non-negotiable: Before triggering an upgrade, ensure your CNI and CSI drivers support the target Kubernetes version. Nothing breaks a cluster faster than an incompatible storage plugin.
2. Watch the Pod Disruption Budgets (PDBs): CAPI drains each node through the eviction API, which respects PDBs. If a PDB never allows an eviction (for example, minAvailable equal to the replica count), the drain blocks and the rollout stalls on that node. By default there is no drain timeout, so it waits indefinitely; setting nodeDrainTimeout on the machine template bounds the wait.
3. Use maxSurge Wisely: If your cloud provider has strict quota limits, don't set your maxSurge too high. You don't want the upgrade to fail because you hit a quota limit while trying to provision 10 new nodes at once.
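To make the PDB lesson concrete, here is a minimal sketch. The app: web label, the default namespace, the web-pdb name, and the 10m drain timeout are hypothetical values chosen for illustration. The PodDisruptionBudget lets one pod be evicted at a time, which keeps drains moving, and nodeDrainTimeout caps how long CAPI waits on a drain before it goes ahead and deletes the machine. The MachineDeployment part is a fragment, not a complete manifest.

```yaml
# Hypothetical workload: pods labeled app=web, running several replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: default
spec:
  maxUnavailable: 1  # always leaves room for one eviction during a drain
  selector:
    matchLabels:
      app: web
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: worker-deployment
spec:
  clusterName: production-cluster
  template:
    spec:
      clusterName: production-cluster
      version: v1.29.1
      # Assumption: 10m is illustrative. Unset (or 0) means wait forever.
      nodeDrainTimeout: 10m
      # ... bootstrap and infrastructureRef omitted
```

A drain timeout trades safety for progress: once it expires, the machine is deleted even if some pods were not evicted cleanly, so pick a value that comfortably covers your slowest graceful shutdown.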

Why This Matters

Automated node upgrades allow your team to focus on shipping features instead of fighting infrastructure fires. By leveraging Kubernetes Cluster API, you turn a high-risk manual task into a standard, CI/CD-friendly operation.

If you aren't using CAPI yet, start small. Move one non-critical node pool to CAPI management. Once you see the reliability gains from MachineHealthCheck and the simplicity of declarative upgrades, you’ll never go back to manual node management.

The goal isn't just to make things faster; it's to make them predictable. Automation is the only way to scale your operations without scaling your headcount.
