Master Kubernetes disaster recovery with Velero and Restic. Learn to implement multi-region backups and seamless cluster migration for your production workloads.
I’ve spent too many nights fixing broken clusters to trust that "high availability" is enough. If your cloud provider has a regional outage or a rogue kubectl command wipes a namespace, you’re in trouble. Relying on simple snapshots isn't a strategy; it's a gamble.
To implement a real Kubernetes disaster recovery plan, I use Velero paired with Restic. Velero handles the API objects (deployments, configmaps, secrets), while Restic manages the file-level backups for persistent volumes that don't support native cloud snapshots.
Velero acts as the control plane for your backups. When you trigger a backup, it queries the Kubernetes API, serializes the resource definitions into an archive, and uploads it to object storage (such as AWS S3 or Google Cloud Storage).
For persistent data, Restic enters the picture. It reads the files on each volume, splits them into chunks, deduplicates those chunks, and writes the result to the same bucket. By targeting an S3 bucket in a different region from the cluster, you keep a copy of your data that survives even a total regional collapse.
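First, install the Velero CLI (I’m using v1.12.0). Assuming you’re on AWS, save your credentials to a file (assumed here to be ./credentials-velero) and create a bucket in a secondary region.

```bash
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket my-backup-bucket-us-west-2 \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --use-node-agent \
  --uploader-type restic
```

The --use-node-agent and --uploader-type restic flags are critical here. Since Velero v1.10 they replace the older --use-restic flag: the first deploys the node agent that performs file-system backups, and the second selects Restic explicitly, because v1.12 would otherwise default to Kopia. Together they let Velero back up volumes that aren't natively supported by your cloud provider’s snapshot API.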
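Before relying on the install, it is worth confirming that Velero can reach the bucket and that the file-system backup agent is running. The following is a minimal sketch, not an official procedure. It assumes the default `velero` namespace and a DaemonSet named `node-agent` (the name used by Velero v1.10 and later; older releases called it `restic`). The function name `check_velero_install` is hypothetical.

```shell
# Sketch: sanity-check a fresh Velero install.
# Assumptions: default "velero" namespace, DaemonSet named "node-agent".

check_velero_install() {
  local namespace="${1:-velero}"

  # Lists the backup storage locations; the PHASE column should read
  # "Available" once Velero can reach the bucket with the given credentials.
  velero backup-location get --namespace "$namespace" || return 1

  # Blocks until the agent pods are rolled out on the nodes, or times out.
  kubectl --namespace "$namespace" rollout status daemonset/node-agent --timeout=120s
}
```

Calling `check_velero_install` with no argument checks the `velero` namespace; pass another namespace as the first argument if you installed Velero elsewhere.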
Velero doesn't automatically back up every Persistent Volume Claim (PVC). You need to tell it which volumes to handle via Restic. Use this annotation on your pods:
```yaml
metadata:
  annotations:
    backup.velero.io/backup-volumes: data-volume-name
```
Replace data-volume-name with the name field from your pod's volumes section. Once applied, Velero will trigger a Restic backup for that specific volume during your next scheduled job.
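An annotation set directly on a running pod disappears when that pod is replaced, so one option is to put it on the workload's pod template instead. The sketch below assumes the workload is a Deployment; the function name `annotate_deployment_volumes` and the Deployment name `api` used in the usage note are hypothetical.

```shell
# Sketch: add the Velero volume annotation to a Deployment's pod template
# so every pod it creates carries it. Assumes the workload is a Deployment.

annotate_deployment_volumes() {
  local namespace="$1"
  local deployment="$2"
  local volumes="$3"   # comma-separated pod volume names, e.g. "data,logs"
  local patch

  # Merge patch that touches only the pod template annotations.
  patch=$(printf '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"%s"}}}}}' "$volumes")

  kubectl --namespace "$namespace" patch deployment "$deployment" --type merge --patch "$patch"
}
```

For example, `annotate_deployment_volumes production-app api data-volume-name` would annotate the pods of a Deployment called `api`. Changing the pod template triggers a rolling restart, so apply it when a rollout is acceptable.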
Don't do this manually. Create a Schedule object to automate your K8s backup routine. I prefer a 4-hour interval for production databases (a cron expression of "0 */4 * * *"); the example below is the simpler nightly case, which runs once a day at midnight.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-backup
  namespace: velero
spec:
  schedule: "0 0 * * *"
  template:
    includedNamespaces:
      - production-app
    snapshotVolumes: true
```
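Rather than waiting for the next cron tick to find out whether the schedule works, you can run a one-off backup from the schedule's template and inspect the result. This is a sketch under the assumption that the `nightly-backup` Schedule above already exists; the function name `run_backup_from_schedule` and the backup name in the usage note are hypothetical.

```shell
# Sketch: trigger a single backup from an existing Schedule and inspect it.

run_backup_from_schedule() {
  local schedule="$1"
  local backup_name="$2"

  # --from-schedule reuses the Schedule's template (namespaces, volumes);
  # --wait blocks until the backup reaches a terminal phase.
  velero backup create "$backup_name" --from-schedule "$schedule" --wait || return 1

  # --details includes per-volume information, which is where the
  # file-system backups of annotated volumes should show up.
  velero backup describe "$backup_name" --details
}
```

A call such as `run_backup_from_schedule nightly-backup manual-test-1` prints the backup's phase and its volume backups. Check the phase in that output rather than relying on the exit code alone.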
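Testing your recovery is just as important as the backup itself. If you're planning a cluster migration to a new region or a fresh cluster, install Velero on the target cluster with the same bucket and region settings so it can see the existing backups, then restore from the one you want (velero backup get lists the exact names):

```bash
velero restore create --from-backup nightly-backup-20231027
```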
Velero will re-create the namespaces, deployments, and services. For the persistent data, Restic will download the blocks from your S3 bucket and reconstruct the volumes in the new cluster.
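A restore drill is easier to repeat when the restore and the follow-up checks live in one function. The following is a rough sketch rather than a complete verification: it only confirms that the restore finished and shows what came back. The function name `restore_and_verify` and the `-restore-test` suffix are hypothetical.

```shell
# Sketch: restore from a backup, then list what was recreated.

restore_and_verify() {
  local backup="$1"
  local namespace="$2"
  local restore_name="${backup}-restore-test"   # hypothetical naming scheme

  # --wait blocks until the restore reaches a terminal phase.
  velero restore create "$restore_name" --from-backup "$backup" --wait || return 1

  # Shows the restore phase plus any warnings and errors.
  velero restore describe "$restore_name" --details

  # Pods should be Running and PVCs Bound once the volume data is back.
  kubectl --namespace "$namespace" get pods,pvc
}
```

For the backup above, that would be `restore_and_verify nightly-backup-20231027 production-app`. Application-level checks, such as row counts or a health endpoint, still need to be added on top.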
Implementing a multi-region Kubernetes disaster recovery strategy isn't about the tools; it’s about the discipline. Velero and Restic are powerful, but they require you to define your RPO (Recovery Point Objective) and RTO (Recovery Time Objective) clearly. Start by backing up a single namespace, verify the restoration process, and then roll it out to your entire infrastructure. Your future self will thank you when the production cluster goes dark.
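One way to make an RPO concrete is to check that the most recent backup is younger than the target window. The sketch below is a plain shell helper, not a Velero feature. It assumes GNU `date` for timestamp parsing, and the function name `backup_within_rpo` is hypothetical.

```shell
# Sketch: succeed if a backup's completion time falls inside the RPO window.
# Assumes GNU date, which can parse RFC 3339 timestamps with -d.

backup_within_rpo() {
  local completed_at="$1"                  # e.g. 2023-10-27T00:00:00Z
  local rpo_seconds="$2"                   # e.g. 14400 for a 4-hour RPO
  local now_epoch="${3:-$(date -u +%s)}"   # optional override, useful in tests
  local completed_epoch
  local age

  # Convert the timestamp to seconds since the epoch; return 2 if unparseable.
  completed_epoch=$(date -u -d "$completed_at" +%s) || return 2

  age=$(( now_epoch - completed_epoch ))
  [ "$age" -le "$rpo_seconds" ]
}
```

The timestamp can be read from the Backup resource, for example with `kubectl -n velero get backups.velero.io <backup-name> -o jsonpath='{.status.completionTimestamp}'`. A cron job or monitoring check could then alert whenever `backup_within_rpo` returns non-zero.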