Chaos engineering for Kubernetes resilience: learn LitmusChaos setup, fault injection strategies, and proven DevOps testing practices to harden your clusters.
Chaos engineering isn't a buzzword once you've watched a pod disappear in production and the whole service go dark. It's a disciplined way to prove your Kubernetes cluster can survive the unexpected. In this post I'll walk you through LitmusChaos (the de-facto open-source engine), show how to inject faults with real-world examples, and share the testing rituals that keep my teams confident.
TL;DR – Install Litmus v2.13.0, define a ChaosEngine custom resource, run a pod-delete experiment, verify recovery with Prometheus alerts, and embed the run in a CI pipeline.
Kubernetes abstracts away servers, but it can’t hide the fact that nodes fail, network partitions happen, and config drifts creep in. Without intentional failure injection you’ll only ever test the happy path.
If you’re already using Helm, Argo CD, or GitOps, chaos fits right in as another declarative resource.
| Tool | Version used | Role |
|---|---|---|
| LitmusChaos | v2.13.0 | Chaos experiment engine, CRDs, UI |
| kubectl | v1.28.2 | Apply CRs, inspect resources |
| Prometheus | v2.48.0 | Metrics & alerting |
| Grafana | v10.2.2 | Dashboards for chaos run visibility |
| Argo Workflows | v3.5.5 | CI/CD integration (optional) |
| kube‑state‑metrics | v2.10.0 | Export cluster state for alerts |
All tools are installed via Helm charts unless noted otherwise.
```bash
# 1. Add the Litmus repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# 2. Install the operator in the litmus namespace
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set analytics=false \
  --set gateway.enabled=true \
  --set portal.enabled=true \
  --version 2.13.0
```
- analytics=false disables the optional telemetry (privacy-first).
- gateway.enabled exposes the UI via a LoadBalancer; you can also use an Ingress.

Verify the pods:
```bash
kubectl -n litmus get pods -l app.kubernetes.io/name=litmus
```
You should see litmus-operator, litmus-gateway, and litmus-portal all Running.
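If you would rather have a script block until the pods are ready than eyeball the list, a thin wrapper around `kubectl wait` is one option. This is only a sketch: `wait_for_litmus` is a made-up helper name, the 300s default timeout is an arbitrary choice, and the label selector is copied from the command above.

```bash
# Block until every pod carrying the Litmus label reports Ready, or time out.
# wait_for_litmus is a hypothetical helper; namespace and timeout are optional.
wait_for_litmus() {
  local namespace="${1:-litmus}"
  local timeout="${2:-300s}"
  kubectl -n "$namespace" wait pods \
    -l app.kubernetes.io/name=litmus \
    --for=condition=Ready \
    --timeout="$timeout"
}
```

Call it as `wait_for_litmus` right after `helm install`; it exits non-zero if the pods are not Ready in time.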
Litmus ships with 30+ built‑in experiments. Let’s start with the classic pod-delete that kills a pod at random.
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-demo
  namespace: demo
spec:
  # Run the experiment once
  jobCleanUpPolicy: "retain"
  annotationCheck: "true"
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        # Target the nginx deployment in demo namespace
        components:
          env:
            - name: NAMESPACE
              value: "demo"
            - name: APP_LABEL
              value: "app=nginx"
            - name: TOTAL_CHAOS_DURATION
              value: "30" # seconds
```
Save as pod-delete.yaml and apply:
```bash
kubectl apply -f pod-delete.yaml
```
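One thing that is easy to miss: with annotationCheck set to "true", Litmus only acts on workloads that carry the `litmuschaos.io/chaos="true"` annotation. A minimal sketch of opting a deployment in follows. The helper name `annotate_for_chaos` is hypothetical, and the deployment name `nginx` in the usage line is an assumption based on the label used in the manifest.

```bash
# Opt a deployment in to chaos by adding the annotation Litmus looks for
# when annotationCheck is "true". annotate_for_chaos is a hypothetical helper.
annotate_for_chaos() {
  local namespace="$1"
  local deployment="$2"
  kubectl -n "$namespace" annotate "deployment/$deployment" \
    litmuschaos.io/chaos="true" --overwrite
}
```

For the demo above that would be `annotate_for_chaos demo nginx`, assuming the deployment is actually called nginx.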
Litmus creates a ChaosRunner job that runs the experiment, then a ChaosResult CR that stores the outcome. Check the result:
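```bash
# The ChaosResult is named <engine>-<experiment>
kubectl -n demo get chaosresult pod-delete-demo-pod-delete \
  -o jsonpath='{.status.experimentStatus.verdict}'
```
You should see Pass if the pod was deleted and the deployment recovered after the TOTAL_CHAOS_DURATION window. While the experiment is still running, the verdict reads Awaited.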
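Because the result is not final the moment you apply the engine, a small polling loop saves re-running the command by hand. The sketch below rests on two assumptions: the ChaosResult is named `<engine>-<experiment>`, and the verdict lives at `.status.experimentStatus.verdict` with the values Pass, Fail, and Awaited. The function name `wait_for_verdict` and the default retry count and delay are illustrative choices, not anything Litmus ships.

```bash
# Poll a ChaosResult until its verdict is final, print it, and return
# success only when the verdict is "Pass".
# Assumes the ChaosResult is named <engine>-<experiment> and reports
# "Awaited" (or nothing yet) while the experiment is still running.
wait_for_verdict() {
  local namespace="$1" engine="$2" experiment="$3"
  local retries="${4:-30}" delay="${5:-10}"
  local verdict="" attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    verdict=$(kubectl -n "$namespace" get chaosresult "${engine}-${experiment}" \
      -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null) || verdict=""
    case "$verdict" in
      ""|Awaited) sleep "$delay" ;;   # not final yet, try again
      *) break ;;                     # Pass, Fail, or any other final state
    esac
    attempt=$((attempt + 1))
  done
  echo "${verdict:-Unknown}"
  [ "$verdict" = "Pass" ]
}
```

For the demo engine the call would be `wait_for_verdict demo pod-delete-demo pod-delete`. The exit status makes it easy to chain with `&&` in a script.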
In production we once suffered a 5-second latency spike on the payment-api service. To reproduce it we used Litmus's pod-network-latency experiment.
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: net-latency-demo
  namespace: finance
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NAMESPACE
              value: "finance"
            - name: APP_LABEL
              value: "app=payment-api"
            - name: NETWORK_INTERFACE
              value: "eth0"
            - name: NETWORK_LATENCY
              value: "5000" # ms
            - name: JITTER
              value: "200" # ms
            - name: TOTAL_CHAOS_DURATION
              value: "60" # seconds
```
The experiment injects a tc qdisc rule on each pod’s eth0. While the chaos runs, Prometheus alerts fire:
```yaml
# prometheus alert rule
- alert: HighLatencyPaymentAPI
  expr: avg_over_time(http_request_duration_seconds{job="payment-api"}[1m]) > 2
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Payment API latency > 2s"
```
When the alert clears, we know the auto‑scaler and circuit‑breaker logic behaved correctly. The whole run took 2 minutes, but the insight saved us weeks of debugging later.
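If you want to confirm that the tc rule really is in place while the experiment runs, one option is to list the queueing disciplines from inside a target pod. Treat this as a sketch under a big assumption: the container image must ship the `tc` binary (iproute2), which many slim images do not. The helper name `show_pod_qdisc` and the pod name in the usage line are hypothetical.

```bash
# Print the queueing disciplines on a pod's network interface. During the
# experiment a netem entry with the injected delay should be listed.
# Assumes the container image includes tc; show_pod_qdisc is a made-up helper.
show_pod_qdisc() {
  local namespace="$1" pod="$2" iface="${3:-eth0}"
  kubectl -n "$namespace" exec "$pod" -- tc qdisc show dev "$iface"
}
```

Usage would look like `show_pod_qdisc finance payment-api-0`, with the real pod name substituted.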
Running a single experiment manually is nice, but the real power comes from automated runs on every PR. Here’s a minimal GitHub Actions workflow that triggers Litmus via kubectl:
```yaml
name: Chaos Test
on:
  pull_request:
    branches: [ main ]
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: v1.28.2
      - name: Apply Kubeconfig
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}
        run: |
          # The .kube directory does not exist on a fresh runner
          mkdir -p "$HOME/.kube"
          echo "$KUBECONFIG_DATA" | base64 -d > "$HOME/.kube/config"
      - name: Deploy test app
        run: |
          kubectl apply -f k8s/nginx-deploy.yaml
      - name: Run pod-delete chaos
        run: |
          kubectl apply -f chaos/pod-delete.yaml
      - name: Verify result
        run: |
          # The ChaosResult is named <engine>-<experiment>; poll until the verdict is final
          for i in $(seq 1 30); do
            RESULT=$(kubectl -n demo get chaosresult pod-delete-demo-pod-delete \
              -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null || true)
            case "$RESULT" in
              ""|Awaited) sleep 10 ;;
              *) break ;;
            esac
          done
          if [[ "$RESULT" != "Pass" ]]; then
            echo "Chaos failed: $RESULT"
            exit 1
          fi
```
If the chaos fails, the workflow aborts and the PR can’t be merged. This “fail‑fast” approach catches regressions before they hit staging.
| ✅ | Practice |
|---|---|
| 1️⃣ | Scope experiments – target a single microservice per run. Broad chaos masks the root cause. |
| 2️⃣ | Define SLOs – know the latency or error‑rate thresholds you expect during a fault. |
| 3️⃣ | Automate observability – couple chaos with Prometheus alerts and Grafana dashboards. |
| 4️⃣ | Run in production‑like environments – staging clusters with the same node pool size and autoscaling settings. |
| 5️⃣ | Keep a chaos run log – store ChaosResult CRs in a long‑term bucket (e.g., GCS) for post‑mortem analysis. |
| 6️⃣ | Rollback safety – set jobCleanUpPolicy: "delete" for experiments that modify system state. |
| 7️⃣ | Team ownership – assign a “chaos champion” who owns the experiment catalog and reviews runs. |
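For practice 5, the archiving step can be as small as a dump-and-copy. The sketch below is one way to do it, not a Litmus feature: `archive_chaos_results` is a hypothetical helper, the bucket path in the usage line is a placeholder, and `gsutil` is used only because the table mentions GCS.

```bash
# Dump every ChaosResult in a namespace to a timestamped YAML file and copy
# it to object storage. archive_chaos_results is a hypothetical helper and
# the bucket argument is whatever gs:// prefix you keep run logs under.
archive_chaos_results() {
  local namespace="$1" bucket="$2"
  local stamp outfile
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  outfile="chaosresults-${namespace}-${stamp}.yaml"
  kubectl -n "$namespace" get chaosresult -o yaml > "$outfile" || return 1
  # ${bucket%/} strips one trailing slash so the object path has no "//"
  gsutil cp "$outfile" "${bucket%/}/${outfile}"
}
```

A call such as `archive_chaos_results demo gs://my-chaos-logs/demo` (placeholder bucket) at the end of a pipeline keeps one file per run.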
Don’t start with “kill‑all” – I once ran a node-drain experiment on every node during a canary. The cluster went into a crash‑loop because the DaemonSet that collected logs was also drained. Lesson: whitelist critical system pods.
Network chaos needs CNI awareness – Calico respects tc rules, but Cilium drops them unless you enable privileged mode. Adjust the network-chaos experiment’s privileged flag accordingly.
Chaos can hide bugs in readiness probes – When a pod restarts, the readiness probe must return 200 quickly, or the service will appear down even if the app is healthy. Updating the probe timeout from 1s to 5s fixed a flaky rollout.
Metrics lag matters – Prometheus's built-in default scrape_interval is 1m, and many example configs set it to 15s. For short-lived chaos (e.g., a 10 s pod delete), lower scrape_interval to 5s so alerts fire in time.
Version drift – Litmus v2.13 introduced pod-delete support for StatefulSet pods. If you’re on v2.10 you’ll see “unsupported workload” errors. Always pin the Helm chart version.
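The metrics-lag lesson above comes down to simple arithmetic, which is worth scripting before you pick a chaos duration. The sketch below uses a rough estimate, not an exact bound: one scrape interval to collect the sample, one rule evaluation interval to notice it, plus the rule's `for` duration. It ignores smoothing from range functions such as avg_over_time. The function names are made up for illustration, and all values are in whole seconds.

```bash
# Rough estimate of the seconds between a fault starting and an alert firing:
# scrape interval + rule evaluation interval + the rule's "for" duration.
# An approximation only; range-vector smoothing can add further delay.
estimated_alert_delay() {
  local scrape="$1" eval_interval="$2" for_duration="$3"
  echo $((scrape + eval_interval + for_duration))
}

# Succeeds (exit 0) only if the estimated delay fits inside the chaos window.
alert_fits_window() {
  local chaos_duration="$1" scrape="$2" eval_interval="$3" for_duration="$4"
  local delay
  delay=$(estimated_alert_delay "$scrape" "$eval_interval" "$for_duration")
  [ "$delay" -le "$chaos_duration" ]
}
```

Assuming a 15-second scrape interval, a 15-second evaluation interval, and `for: 30s`, `estimated_alert_delay 15 15 30` prints 60. So `alert_fits_window 10 15 15 30` fails: a 10-second fault is over long before the alert can fire.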
```bash
# Install Litmus
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set gateway.enabled=true \
  --version 2.13.0

# Simple pod delete experiment
cat > pod-delete.yaml <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-demo
  namespace: demo
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: NAMESPACE
              value: demo
            - name: APP_LABEL
              value: app=nginx
            - name: TOTAL_CHAOS_DURATION
              value: "30"
EOF
kubectl apply -f pod-delete.yaml

# The ChaosResult is named <engine>-<experiment>
kubectl -n demo get chaosresult pod-delete-demo-pod-delete -o jsonpath='{.status.ex
```