Software Engineering · Technology · June 18, 2026 · 6 min read

Chaos Engineering in Kubernetes: Tools, Practices & Real‑World Tips

Chaos engineering for Kubernetes resilience: learn LitmusChaos setup, fault injection strategies, and proven DevOps testing practices to harden your clusters.

DevOps · Linux · Server

Chaos engineering isn’t a buzzword when you’ve seen a pod disappear in production and the whole service go dark. It’s a disciplined way to prove your Kubernetes cluster can survive the unexpected. In this post I’ll walk you through LitmusChaos (a widely used open‑source chaos engine), show how to inject faults with real‑world examples, and share the testing rituals that keep my teams confident.

TL;DR – Install Litmus v2.13.0, define a ChaosEngine resource, run a pod-delete experiment, verify recovery with Prometheus alerts, and embed the run in a CI pipeline.

Why chaos matters for Kubernetes

Kubernetes abstracts away servers, but it can’t hide the fact that nodes fail, network partitions happen, and config drift creeps in. Without intentional failure injection you’ll only ever test the happy path.

Mean time to detect (MTTD) can drop from minutes to seconds once alerts are wired to chaos runs and tuned against them.
Mean time to recover (MTTR) improves because you rehearse the exact steps you’ll need in a real outage.
You expose hidden dependencies: think sidecar containers that never restart or ConfigMaps that aren’t reloaded.

If you’re already using Helm, Argo CD, or another GitOps workflow, chaos fits right in as one more declarative resource.

Core toolset

Tool	Version used	Role
LitmusChaos	v2.13.0	Chaos experiment engine, CRDs, UI
kubectl	v1.28.2	Apply CRs, inspect resources
Prometheus	v2.48.0	Metrics & alerting
Grafana	v10.2.2	Dashboards for chaos run visibility
Argo Workflows	v3.5.5	CI/CD integration (optional)
kube‑state‑metrics	v2.10.0	Export cluster state for alerts

All tools are installed via Helm charts unless noted otherwise.

Step‑by‑step: Installing LitmusChaos


Bash
# 1. Add the Litmus repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# 2. Install the operator in the litmus namespace
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set analytics=false \
  --set gateway.enabled=true \
  --set portal.enabled=true \
  --version 2.13.0

analytics=false disables the optional telemetry (privacy‑first).
gateway.enabled exposes the UI via a LoadBalancer; you can also use an Ingress.

Verify the pods:


Bash
kubectl -n litmus get pods -l app.kubernetes.io/name=litmus

You should see litmus-operator, litmus-gateway, and litmus-portal all Running.
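
In a script it is handier to block until the pods are Ready than to eyeball the list. The following is a minimal sketch, assuming the pods carry the same label as in the command above; the helper name wait_for_litmus and the 300s default timeout are my own choices, not part of Litmus.

```shell
# Hypothetical helper: block until the Litmus pods report Ready.
# Usage: wait_for_litmus [namespace] [timeout]
# The label selector mirrors the one used above; adjust it if your chart
# labels its pods differently.
wait_for_litmus() {
  ns="${1:-litmus}"
  max_wait="${2:-300s}"
  kubectl -n "$ns" wait pods \
    -l app.kubernetes.io/name=litmus \
    --for=condition=Ready \
    --timeout="$max_wait"
}
```

Calling wait_for_litmus litmus 120s returns non-zero if any matching pod is still not Ready after two minutes, which makes it easy to use as a gate in a setup script.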

Defining a chaos experiment

Litmus ships with 30+ built‑in experiments. Let’s start with the classic pod-delete that kills a pod at random.


YAML
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-demo
  namespace: demo
spec:
  # Workload to target: the nginx Deployment in the demo namespace
  appinfo:
    appns: "demo"
    applabel: "app=nginx"
    appkind: "deployment"
  # Only target workloads annotated with litmuschaos.io/chaos="true"
  annotationCheck: "true"
  engineState: "active"
  chaosServiceAccount: litmus-admin
  # Keep the experiment job and pods after the run so logs can be inspected
  jobCleanUpPolicy: "retain"
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"   # seconds

Save as pod-delete.yaml and apply:


Bash
kubectl apply -f pod-delete.yaml
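
One prerequisite is easy to miss: with annotationCheck set to "true", Litmus only touches workloads that carry the litmuschaos.io/chaos="true" annotation. Here is a small sketch of that opt-in step, assuming the target is a Deployment; the helper name opt_in_to_chaos is hypothetical.

```shell
# Hypothetical helper: mark a Deployment as an allowed chaos target.
# Usage: opt_in_to_chaos <namespace> <deployment-name>
opt_in_to_chaos() {
  kubectl -n "$1" annotate "deploy/$2" litmuschaos.io/chaos="true" --overwrite
}
```

For the demo above that would be opt_in_to_chaos demo nginx, assuming the Deployment is actually named nginx. Run it before the engine is applied, otherwise the experiment finds no eligible target.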

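The Litmus operator starts a chaos-runner pod, which launches the experiment job and records the outcome in a ChaosResult CR named <engine>-<experiment>. Check the verdict:


Bash
kubectl -n demo get chaosresult pod-delete-demo-pod-delete -o jsonpath='{.status.experimentStatus.verdict}'

You should see Pass once the run finishes without errors (the verdict reads Awaited while the experiment is still in progress). Confirm separately that the deployment is back at its desired replica count, for example with kubectl rollout status on the target deployment.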
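
Because the verdict is not final until the experiment ends, a one-shot query right after kubectl apply usually returns nothing useful. Below is a minimal polling sketch; the function name, the attempt count, and the sleep interval are my own assumptions, and it treats only Pass and Fail as final verdicts.

```shell
# Poll a ChaosResult until its verdict is Pass or Fail.
# Usage: wait_for_verdict <namespace> <chaosresult-name> [attempts] [sleep-seconds]
# Returns 0 on Pass, 1 on Fail, 2 if no final verdict appeared in time.
wait_for_verdict() {
  ns="$1"; name="$2"; attempts="${3:-30}"; pause="${4:-10}"
  i=0
  verdict=""
  while [ "$i" -lt "$attempts" ]; do
    # The ChaosResult may not exist yet, so tolerate lookup errors.
    verdict=$(kubectl -n "$ns" get chaosresult "$name" \
      -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null || true)
    case "$verdict" in
      Pass) echo "$verdict"; return 0 ;;
      Fail) echo "$verdict"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$pause"
  done
  echo "timeout (last verdict: ${verdict:-none})"
  return 2
}
```

For the engine above, wait_for_verdict demo pod-delete-demo-pod-delete polls for up to five minutes with the defaults. The distinct exit codes let a caller tell a failed experiment apart from one that never reported.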

Real‑world example: Simulating a network latency spike

In production we once suffered a 5‑second latency spike on the payment-api service. To reproduce it we used Litmus’s pod-network-latency experiment.


YAML
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: net-latency-demo
  namespace: finance
spec:
  # Assumes payment-api runs as a Deployment
  appinfo:
    appns: "finance"
    applabel: "app=payment-api"
    appkind: "deployment"
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: "eth0"
            - name: NETWORK_LATENCY
              value: "5000"   # ms
            - name: JITTER
              value: "200"    # ms
            - name: TOTAL_CHAOS_DURATION
              value: "60"     # seconds

The experiment injects a tc qdisc rule on each pod’s eth0. While the chaos runs, Prometheus alerts fire:


YAML
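# prometheus alert rule
# Assumes http_request_duration_seconds is a histogram, so the mean is _sum / _count
- alert: HighLatencyPaymentAPI
  expr: |
    sum(rate(http_request_duration_seconds_sum{job="payment-api"}[1m]))
      /
    sum(rate(http_request_duration_seconds_count{job="payment-api"}[1m])) > 2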
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Payment API latency > 2s"

When the alert fires during the run and clears shortly after it ends, we have evidence that detection works and that the auto‑scaler and circuit‑breaker logic recovered as intended. The whole run took 2 minutes, but the insight saved us weeks of debugging later.
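
For intuition, the fault is roughly what you would get from a netem rule applied inside the pod’s network namespace. The sketch below only prints the commands rather than running them; the helper name is hypothetical, and the exact invocation Litmus uses internally may differ.

```shell
# Print (do not run) tc commands that approximate the injected fault.
# Usage: netem_commands <interface> <delay-ms> <jitter-ms>
netem_commands() {
  iface="$1"; delay_ms="$2"; jitter_ms="$3"
  # Add a fixed delay with random jitter to all egress traffic on the interface.
  echo "tc qdisc add dev $iface root netem delay ${delay_ms}ms ${jitter_ms}ms"
  # Remove the rule again once the chaos window ends.
  echo "tc qdisc del dev $iface root netem"
}

netem_commands eth0 5000 200
```

Running these by hand needs NET_ADMIN in the target network namespace, which is one reason to let the experiment handle injection and cleanup.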

Embedding chaos in CI/CD

Running a single experiment manually is nice, but the real power comes from automated runs on every PR. Here’s a minimal GitHub Actions workflow that triggers Litmus via kubectl:


YAML
name: Chaos Test
on:
  pull_request:
    branches: [ main ]

jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: v1.28.2
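      - name: Apply Kubeconfig
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}
        run: |
          # The runner has no ~/.kube directory by default
          mkdir -p "$HOME/.kube"
          echo "$KUBECONFIG_DATA" | base64 -d > "$HOME/.kube/config"
          chmod 600 "$HOME/.kube/config"
      - name: Deploy test app
        run: |
          kubectl apply -f k8s/nginx-deploy.yaml
          # Assumes the manifest creates a Deployment named nginx in the demo namespace
          kubectl -n demo rollout status deploy/nginx --timeout=120s
      - name: Run pod-delete chaos
        run: |
          kubectl apply -f chaos/pod-delete.yaml
      - name: Verify result
        run: |
          # ChaosResult is named <engine>-<experiment>; the verdict reads "Awaited" until the run ends
          for i in $(seq 1 30); do
            RESULT=$(kubectl -n demo get chaosresult pod-delete-demo-pod-delete \
              -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null || true)
            if [[ "$RESULT" == "Pass" || "$RESULT" == "Fail" ]]; then break; fi
            sleep 10
          done
          if [[ "$RESULT" != "Pass" ]]; then
            echo "Chaos failed: ${RESULT:-no verdict}"
            exit 1
          fi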

If the chaos fails, the job exits non‑zero, and with branch protection requiring this check the PR can’t be merged. This “fail‑fast” approach catches regressions before they hit staging.

Best practices checklist

✅	Practice
1️⃣	Scope experiments – target a single microservice per run. Broad chaos masks the root cause.
2️⃣	Define SLOs – know the latency or error‑rate thresholds you expect during a fault.
3️⃣	Automate observability – couple chaos with Prometheus alerts and Grafana dashboards.
4️⃣	Run in production‑like environments – staging clusters with the same node pool size and autoscaling settings.
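5️⃣	Keep a chaos run log – store ChaosResult CRs in a long‑term bucket (e.g., GCS) for post‑mortem analysis.
6️⃣	Rollback safety – set engineState: "stop" to abort a running experiment and jobCleanUpPolicy: "delete" to remove experiment jobs afterwards. Neither reverts state the experiment changed, so plan an explicit rollback for those cases.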
7️⃣	Team ownership – assign a “chaos champion” who owns the experiment catalog and reviews runs.
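
Two of the checklist items translate directly into small helpers. This is a sketch under assumptions: both function names are hypothetical, and the archive step only writes a local file, since how you ship it to a bucket depends on your tooling.

```shell
# Abort a running experiment by flipping the engine to "stop".
# Usage: stop_chaos <namespace> <chaosengine-name>
stop_chaos() {
  kubectl -n "$1" patch chaosengine "$2" --type merge \
    -p '{"spec":{"engineState":"stop"}}'
}

# Dump all ChaosResult CRs in a namespace to a timestamped YAML file
# and print the file path. Usage: archive_results <namespace> [directory]
archive_results() {
  ns="$1"; dir="${2:-.}"
  out="$dir/chaosresults-$ns-$(date -u +%Y%m%dT%H%M%SZ).yaml"
  kubectl -n "$ns" get chaosresult -o yaml > "$out" && echo "$out"
}
```

For example, stop_chaos demo pod-delete-demo halts the demo engine, and archive_results demo /tmp leaves a file you can upload to long‑term storage from CI.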

Lessons learned from the field

Don’t start with “kill‑all” – I once ran a node-drain experiment on every node during a canary. The cluster went into a crash‑loop because the DaemonSet that collected logs was also drained. Lesson: whitelist critical system pods.
Network chaos needs CNI awareness – In our clusters Calico respected the injected tc rules, but Cilium dropped them unless the experiment ran in privileged mode. Check how the network experiment behaves on your CNI before trusting its results.
Chaos can hide bugs in readiness probes – When a pod restarts, the readiness probe must return 200 quickly, or the service will appear down even if the app is healthy. Updating the probe timeout from 1s to 5s fixed a flaky rollout.
Metrics lag matters – Prometheus’s built‑in default scrape_interval is 1m, and many installs configure 15s or 30s. For short‑lived chaos (e.g., a 10 s pod delete), lower scrape_interval to 5s for the target job so alerts fire in time.
Version drift – Litmus v2.13 introduced pod-delete support for StatefulSet pods. If you’re on v2.10 you’ll see “unsupported workload” errors. Always pin the Helm chart version.
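
The scrape‑interval point is easy to check with arithmetic: the number of scrapes guaranteed to land inside a chaos window is the window length divided by the interval, rounded down. A tiny sketch follows (the function name is mine); the same reasoning applies to an alert’s for: duration.

```shell
# Minimum number of scrapes guaranteed to fall inside a chaos window.
# Usage: min_scrapes_in_window <window-seconds> <scrape-interval-seconds>
min_scrapes_in_window() {
  echo $(( $1 / $2 ))
}

min_scrapes_in_window 10 15   # prints 0: the fault can go entirely unobserved
min_scrapes_in_window 10 5    # prints 2
```

A result of 0 means the experiment can start and finish between two scrapes, so no alert based on that metric can be relied on to fire.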

TL;DR cheat sheet


Bash
# Install Litmus
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set gateway.enabled=true \
  --version 2.13.0

# Simple pod delete experiment
cat > pod-delete.yaml <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-demo
  namespace: demo
spec:
  appinfo:
    appns: demo
    applabel: app=nginx
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
EOF

kubectl apply -f pod-delete.yaml
kubectl -n demo get chaosresult pod-delete-demo-pod-delete -o jsonpath='{.status.experimentStatus.verdict}'
