Master Kubernetes Cluster API for automated node upgrades. Learn how to leverage MachineHealthCheck for reliable, hands-off node lifecycle management today.
Managing node lifecycles manually is a recipe for burnout. If you're still SSHing into nodes to drain, cordon, and upgrade them, you're doing it the hard way. In this post, I'll show you how to shift from manual intervention to declarative automation using Kubernetes Cluster API (CAPI).
The beauty of CAPI lies in treating infrastructure like application code. Instead of managing individual virtual machines, you define the desired state of your cluster in YAML. When it's time to upgrade, you update the MachineDeployment object, and the controller handles the rollout.
This approach minimizes human error. It’s consistent, repeatable, and—most importantly—it’s version-controlled.
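As a rough illustration of that desired state, here is a minimal MachineDeployment sketch. The cluster and deployment names match the examples used later in this post. The replica count, the template names (`worker-bootstrap`, `worker-machines`), and the Docker infrastructure provider are assumptions chosen for illustration, so swap in the kinds and API versions your provider actually ships.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: worker-deployment
spec:
  clusterName: production-cluster
  replicas: 3                        # assumed pool size
  selector:
    matchLabels: {}                  # left empty so CAPI can apply its default labels
  template:
    spec:
      clusterName: production-cluster
      version: v1.28.2               # Kubernetes version the workers should run
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: worker-bootstrap     # hypothetical template name
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate  # assumption: Docker provider; yours will differ
        name: worker-machines        # hypothetical template name
```

Everything about the pool, including its size and Kubernetes version, lives in this one object, which is what makes it reviewable in a pull request.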
Before you automate upgrades, you need to ensure your existing nodes are healthy. The MachineHealthCheck resource is your best friend here. It automatically remediates nodes that fail health checks, ensuring your cluster maintains its desired capacity.
Here’s a sample MachineHealthCheck manifest for a CAPI-managed cluster:
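```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-node-healthcheck
  # Must be the same namespace as the Cluster object it watches
  namespace: capi-system
spec:
  clusterName: production-cluster
  # Only machines owned by this MachineDeployment are checked
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: worker-deployment
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
  # Stop remediating if more than 40% of selected machines are unhealthy
  maxUnhealthy: 40%
```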
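In this setup, if a node’s Ready condition stays "Unknown" or "False" for more than 300 seconds, CAPI deletes the machine and provisions a fresh one. The maxUnhealthy: 40% setting acts as a safety valve: remediation only proceeds while no more than 40% of the selected machines are unhealthy. With 10 workers, that means up to 4 can be replaced automatically, while a fifth failure pauses remediation so a cluster-wide outage doesn’t trigger mass deletion. It’s self-healing infrastructure in action.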
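The conditions above only fire once a node has registered and then gone bad. For a machine that never produces a node at all, such as one whose bootstrap fails, MachineHealthCheck also has a `nodeStartupTimeout` field. The fragment below is a sketch of where it sits in the same spec; the 10m value is an assumption to tune against how long your provider typically takes to boot a node.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-node-healthcheck
  namespace: capi-system
spec:
  clusterName: production-cluster
  # Assumed value: remediate machines that still have no node after 10 minutes
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: worker-deployment
  # ... unhealthyConditions and maxUnhealthy as before
```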
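To perform automated node upgrades, you don’t run a script. You update the version field (spec.template.spec.version) in your MachineDeployment. The KubeadmConfigTemplate has no version field of its own; it only needs to change if your bootstrap configuration changes. Depending on your infrastructure provider, you may also need to point the MachineDeployment at a machine template whose image ships the new Kubernetes version.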
Let’s say you’re moving from Kubernetes v1.28.2 to v1.29.1. You’ll update your manifest like this:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: worker-deployment
spec:
  template:
    spec:
      version: v1.29.1
      # ... other config
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```
Once you apply this change, the CAPI controllers detect the delta. The MachineDeployment controller performs a rolling update, spinning up a new node with v1.29.1 and waiting for it to become ready before cordoning, draining, and deleting an old one. By setting maxUnavailable: 0, you ensure the number of ready worker nodes never drops below the desired replica count during the upgrade process.
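If your cloud quota has no headroom for even one surge node, a possible variation is to invert the strategy. The fragment below is a sketch of that alternative, not a recommendation: it removes one old node before creating its replacement, so the pool never exceeds its desired size but temporarily runs one node short.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: worker-deployment
spec:
  # ... template as before
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never provision beyond the desired replica count
      maxUnavailable: 1  # allow one node to be missing while its replacement boots
```

The two values cannot both be 0, since the rollout would then have no room to make progress.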
I’ve seen plenty of upgrades go sideways. Here are three things I’ve learned the hard way:
Set maxSurge Wisely: If your cloud provider has strict quota limits, don't set your maxSurge too high. You don't want the upgrade to fail because you hit an API limit while trying to provision 10 new nodes at once.

Automated node upgrades allow your team to focus on shipping features instead of fighting infrastructure fires. By leveraging Kubernetes Cluster API, you turn a high-risk manual task into a standard, CI/CD-friendly operation.
If you aren't using CAPI yet, start small. Move one non-critical node pool to CAPI management. Once you see the reliability gains from MachineHealthCheck and the simplicity of declarative upgrades, you’ll never go back to manual node management.
The goal isn't just to make things faster; it's to make them predictable. Automation is the only way to scale your operations without scaling your headcount.