Technology | Software Engineering | June 19, 2026 | 3 min read

Kubernetes Disaster Recovery: Velero and Restic Implementation Guide

Master Kubernetes disaster recovery with Velero and Restic. Learn to implement multi-region backups and seamless cluster migration for your production workloads.

Why Your Cluster Needs a Disaster Recovery Plan

I’ve spent too many nights fixing broken clusters to trust that "high availability" is enough. If your cloud provider has a regional outage or a rogue kubectl command wipes a namespace, you’re in trouble. Relying on simple snapshots isn't a strategy; it's a gamble.

To implement a real Kubernetes disaster recovery plan, I use Velero paired with Restic. Velero handles the API objects (deployments, configmaps, secrets), while Restic manages the file-level backups for persistent volumes that don't support native cloud snapshots.

The Architecture: Velero and Restic

Velero acts as the control plane for your backups. When you trigger a backup, it queries the Kubernetes API, serializes the matching resource definitions into a compressed archive, and uploads that archive to object storage (like AWS S3 or Google Cloud Storage).

For persistent data, Restic enters the picture. It reads the files on the mounted volume, splits them into chunks, deduplicates those chunks, and sends the data to the same bucket. By targeting an S3 bucket in a different region from the cluster, you ensure that even a total regional outage won't take your backups down with it.

Step 1: Installing Velero with Restic

First, install the Velero CLI (I’m using v1.12.0). Assuming you’re on AWS, grab your credentials and create a bucket in a secondary region.


Bash
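velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.8.0 \
    --bucket my-backup-bucket-us-west-2 \
    --secret-file ./credentials-velero \
    --backup-location-config region=us-west-2 \
    --snapshot-location-config region=us-west-2 \
    --use-node-agent \
    --uploader-type restic

The --use-node-agent and --uploader-type restic flags are critical here. Since Velero v1.10 they replace the older --use-restic flag, which is deprecated. The first deploys the node-agent DaemonSet that performs file-level backups, and the second selects Restic as the uploader; set it explicitly, because the default uploader differs between releases. Together they let Velero back up volumes that aren't natively supported by your cloud provider’s snapshot API. The --secret-file flag points at your AWS credentials file (./credentials-velero is a placeholder path).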
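velero install accepts an AWS credentials file through its --secret-file flag. Here is a minimal sketch of generating that file from environment variables. The function name and the default path ./credentials-velero are my own choices for illustration, not something Velero mandates.

```bash
#!/usr/bin/env bash
# Sketch: write an AWS-style credentials file for `velero install --secret-file`.
# write_velero_credentials and the default path are illustrative assumptions.
write_velero_credentials() {
  local out="${1:-./credentials-velero}"
  # Abort if either variable is unset or empty.
  : "${AWS_ACCESS_KEY_ID:?set AWS_ACCESS_KEY_ID first}"
  : "${AWS_SECRET_ACCESS_KEY:?set AWS_SECRET_ACCESS_KEY first}"
  # The umask applies only inside this subshell, so a newly created
  # file is readable by the current user alone.
  (
    umask 077
    printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
      "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" > "$out"
  )
}
```

Call write_velero_credentials before running the install, then delete the file once the install finishes; Velero copies its contents into a Kubernetes Secret in the velero namespace.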

Step 2: Annotating Pods for Restic

Velero doesn't automatically back up every Persistent Volume Claim (PVC). You need to tell it which volumes to handle via Restic. Use this annotation on your pods:


YAML
metadata:
  annotations:
    backup.velero.io/backup-volumes: data-volume-name

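Replace data-volume-name with the name field from your pod's volumes section (use a comma-separated list for several volumes). For pods managed by a Deployment or StatefulSet, put the annotation in the pod template's metadata so it survives pod restarts. Once applied, Velero will trigger a Restic backup for that specific volume during the next backup that includes the pod.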
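For a quick test on a pod that is already running, the same annotation can be applied with kubectl annotate. The sketch below assumes hypothetical helper names (join_volumes, annotate_pod_volumes) and placeholder pod, namespace, and volume names.

```bash
#!/usr/bin/env bash
# Sketch: opt the volumes of a live pod into file-level backup.
# Helper names are illustrative; all resource names are placeholders.

# Join volume names with commas, the format the annotation expects.
join_volumes() {
  local IFS=,
  printf '%s\n' "$*"
}

# Annotate one pod. Set DRY_RUN=1 to print the kubectl command
# instead of running it.
annotate_pod_volumes() {
  local namespace="$1" pod="$2"
  shift 2
  local annotation
  annotation="backup.velero.io/backup-volumes=$(join_volumes "$@")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "kubectl -n $namespace annotate pod/$pod --overwrite $annotation"
  else
    kubectl -n "$namespace" annotate "pod/$pod" --overwrite "$annotation"
  fi
}
```

For example, DRY_RUN=1 annotate_pod_volumes production-app db-0 data logs shows the command that would mark the data and logs volumes of pod db-0 without touching the cluster.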

Step 3: Automating Backups

Don't do this manually. Create a Schedule object to automate your K8s backup routine. The example below runs nightly at midnight; for production databases I prefer a 4-hour interval (schedule: "0 */4 * * *").


YAML
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-backup
  namespace: velero
spec:
  schedule: "0 0 * * *"
  template:
    includedNamespaces:
      - production-app
    snapshotVolumes: true
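A roughly equivalent schedule can also be created from the CLI with velero schedule create. The sketch below builds the cron expression from an interval in hours; cron_every_hours and create_schedule are hypothetical helper names, and the schedule and namespace names are whatever you pass in.

```bash
#!/usr/bin/env bash
# Sketch: create a Velero schedule at a chosen hourly interval.
# cron_every_hours and create_schedule are hypothetical helpers.

# Print a cron expression that fires at minute 0 of hours 0, N, 2N, ...
# each day. The pattern restarts at midnight, so intervals are only
# perfectly even when N divides 24.
cron_every_hours() {
  local n="$1"
  case "$n" in
    ''|*[!0-9]*)
      echo "interval must be an integer from 1 to 23" >&2
      return 1
      ;;
  esac
  if [ "$n" -lt 1 ] || [ "$n" -gt 23 ]; then
    echo "interval must be an integer from 1 to 23" >&2
    return 1
  fi
  printf '0 */%s * * *\n' "$n"
}

# Create a schedule that backs up one namespace every N hours.
create_schedule() {
  local name="$1" namespace="$2" hours="$3" cron
  cron="$(cron_every_hours "$hours")" || return 1
  velero schedule create "$name" \
    --schedule "$cron" \
    --include-namespaces "$namespace"
}
```

Calling create_schedule db-backup production-app 4 would register a 4-hour schedule; velero schedule get lists what exists afterwards.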

Step 4: Execution: Cluster Migration and Restoration

Testing your recovery is just as important as the backup itself. If you're planning a cluster migration to a new region or a fresh cluster, follow these steps:

Point to the Backup Location: In your new cluster, install Velero and point it to the same S3 bucket used in the original cluster.
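Read-Only Mode: Set the backup storage location's accessMode to ReadOnly in the new cluster so this Velero instance can't create or delete backups in the shared bucket while you restore.
Restore Command: Look up the exact backup name with velero backup get (scheduled backups are named <schedule-name>-<YYYYMMDDhhmmss>), then run the restore:


Bash
velero restore create --from-backup nightly-backup-20231027000000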

Velero will re-create the namespaces, deployments, and services. For the persistent data, Restic will download the blocks from your S3 bucket and reconstruct the volumes in the new cluster.
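The migration steps above can be strung together in a small script. This is a sketch under assumptions: the backup storage location is named default, Velero runs in the velero namespace, and latest_backup and restore_latest are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: restore the newest backup produced by a schedule.
# Assumes the storage location is "default" in the "velero" namespace.

# Read backup names on stdin and print the newest one for a schedule.
# Scheduled backups are named <schedule>-<YYYYMMDDhhmmss>, so a plain
# lexical sort orders them chronologically.
latest_backup() {
  local schedule="$1"
  grep -E "^${schedule}-[0-9]{14}\$" | LC_ALL=C sort | tail -n 1
}

restore_latest() {
  local schedule="$1" backup
  # Stop this cluster from writing to or deleting from the shared bucket.
  kubectl patch backupstoragelocation default -n velero --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}' || return 1
  # List Backup objects synced from the bucket and strip the "kind/" prefix.
  backup="$(kubectl get backups.velero.io -n velero -o name \
    | sed 's|.*/||' | latest_backup "$schedule")"
  if [ -z "$backup" ]; then
    echo "no backup found for schedule $schedule" >&2
    return 1
  fi
  # --wait blocks until the restore finishes.
  velero restore create --from-backup "$backup" --wait
}
```

On a freshly installed cluster, give Velero a moment to sync the backup list from the bucket before calling restore_latest nightly-backup.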

Lessons from the Field

Test the Restore: I cannot emphasize this enough. A backup that hasn't been restored is just a file you hope works. Set up a CI/CD pipeline that restores a backup to a staging cluster weekly.
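Cost Management: Restic is efficient, but if you're backing up terabytes, S3 costs add up. Set a TTL on your backups so Velero expires old ones, and use an S3 Lifecycle Policy to move older data to "Glacier" or "Deep Archive" after 30 days. Be careful with the Restic repository prefix, though: Restic has to read existing repository data during later backups and maintenance, so test any archival rule on a non-production bucket before applying it there.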
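Secret Management: Velero doesn't encrypt your secrets on its own; they land in the backup archive as the API server returns them (base64-encoded, not encrypted). Enable server-side encryption on the bucket (for example SSE-KMS), restrict who can read it, and make sure the encryption keys themselves are kept in a separate secure system (like HashiCorp Vault or AWS KMS).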
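The weekly restore test from the first lesson could look something like the sketch below, suitable for a CI job pointed at a staging cluster. The helper names, the drill- prefix, and the scratch namespace are all assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch of a restore drill for a CI job. All names are placeholders.

# Treat only a fully completed restore as a pass; PartiallyFailed,
# Failed, and anything unexpected count as failures.
restore_passed() {
  [ "$1" = "Completed" ]
}

# Restore one namespace from a backup into a scratch namespace and
# report whether the restore completed cleanly.
restore_drill() {
  local backup="$1" source_ns="$2" scratch_ns="$3"
  local restore phase
  restore="drill-$(date +%Y%m%d%H%M%S)"
  velero restore create "$restore" \
    --from-backup "$backup" \
    --include-namespaces "$source_ns" \
    --namespace-mappings "$source_ns:$scratch_ns" \
    --wait
  # Read the final phase from the Restore object itself.
  phase="$(kubectl get restores.velero.io "$restore" -n velero \
    -o jsonpath='{.status.phase}')"
  echo "restore $restore finished with phase: $phase"
  restore_passed "$phase"
}
```

A non-zero exit from restore_drill fails the pipeline, which is the signal you want; a completed restore still deserves an application-level check (for example, a query against the restored database) before you call the drill a success.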

Final Thoughts

Implementing a multi-region Kubernetes disaster recovery strategy isn't about the tools; it’s about the discipline. Velero and Restic are powerful, but they require you to define your RPO (Recovery Point Objective) and RTO (Recovery Time Objective) clearly. Start by backing up a single namespace, verify the restoration process, and then roll it out to your entire infrastructure. Your future self will thank you when the production cluster goes dark.
