Chaos engineering isn’t a buzzword; it’s a disciplined way to prove your Kubernetes workloads survive the unexpected. In this post I’ll show you how to embed chaos into your CI/CD pipeline, compare the most‑used open‑source tools, and give you ready‑to‑run manifests that demonstrate fault injection at scale.
TL;DR – Deploy LitmusChaos 2.13.0, run a simple pod‑kill experiment, and integrate the test into a GitHub Actions workflow.
Kubernetes abstracts away servers, but the underlying nodes, network, and storage still fail. A single pod restart can cascade into a service outage if you haven’t validated your resilience. Chaos engineering forces you to answer two questions:
Skipping this step means you’re betting on “it works in prod” – a gamble most senior engineers won’t take.
| Practice | What you do | Typical KPI |
|---|---|---|
| Define steady state | Record latency, error rate, CPU usage for a healthy release. | Baseline response time < 200 ms |
| Hypothesize failure impact | “If a node disappears, latency will stay < 300 ms.” | Success if hypothesis holds |
| Inject fault | Use a chaos tool to kill pods, throttle network, corrupt disks. | Fault injection duration 30 s – 5 min |
| Observe & verify | Pull metrics from Prometheus, Grafana, or Datadog. | No SLO breach |
| Automate & repeat | Add experiment to CI pipeline, schedule nightly runs. | 100 % coverage across services |
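
To make the "define steady state" and "hypothesize" rows concrete, here is a minimal sketch that turns a list of latency samples into a pass/fail check. The function names `percentile` and `hypothesis_holds` are my own, and the samples are assumed to arrive as one millisecond value per line (for example, exported from your metrics backend); adapt both to your setup.

```bash
#!/usr/bin/env bash
# percentile P: read one numeric sample per line on stdin and print the
# nearest-rank P-th percentile (P is an integer from 1 to 100).
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # integer ceil(p * NR / 100)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# hypothesis_holds P LIMIT_MS: succeed when the P-th percentile of the
# samples on stdin is strictly below LIMIT_MS.
hypothesis_holds() {
  local p="$1" limit="$2" observed
  observed="$(percentile "$p")" || return 2   # no samples: cannot decide
  awk -v o="$observed" -v l="$limit" 'BEGIN { exit !(o < l) }'
}
```

With samples in a hypothetical `latency.txt`, `hypothesis_holds 95 300 < latency.txt` exits 0 when the p95 stays under 300 ms, which maps directly onto the hypothesis row in the table.
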
| Tool | Latest stable version (2024) | Language | Main fault types | Integration |
|---|---|---|---|---|
| LitmusChaos | 2.13.0 | Go, YAML | Pod kill, node drain, network latency, IO chaos | Helm, Argo CD, GitOps |
| Chaos Mesh | v2.6.3 | Go | Pod kill, network partition, stress CPU/Memory, JVM chaos | CRD, Kube‑Scheduler plugin |
| PowerfulSeal | v1.4.0 | Python | Node failure, VM termination, network blackhole | Terraform, Ansible |
| Gremlin (SaaS) | 2.4.1 | Go, CLI | All above + DNS, HTTP latency (paid) | CLI, API, Helm |
I gravitate toward LitmusChaos because it ships a rich experiment library, integrates cleanly with Helm, and the community maintains a “ChaosCenter” UI that developers love. Chaos Mesh is a solid alternative if you prefer pure CRDs without a UI.
```bash
# Prereqs
kubectl version --client   # v1.28.0 (the --short flag was removed in kubectl 1.28)
helm version               # v3.13.2

# 1. Add Litmus repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# 2. Install the operator in the litmus namespace
#    (confirm the --set keys for your chart version with: helm show values litmuschaos/litmus)
helm install litmus litmuschaos/litmus --namespace litmus --create-namespace \
  --set chaosOperator.image.tag=2.13.0 \
  --set portalServer.enabled=true \
  --set portalServer.image.tag=2.13.0

# 3. Verify CRDs are present
kubectl get crd | grep chaos
# chaosengines.litmuschaos.io
# chaosexperiments.litmuschaos.io
# chaosresults.litmuschaos.io
```
The portalServer component gives you a web UI at http://<node-ip>:9091. I keep it behind an internal load balancer; no public exposure needed.
Create a simple ChaosEngine that kills a pod from the nginx-demo deployment every 30 seconds, three times.
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-nginx
  namespace: demo
spec:
  appinfo:
    appns: demo
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "90" # seconds
            - name: CHAOS_INTERVAL
              value: "30"
```
Apply it:
```bash
kubectl apply -f pod-kill-nginx.yaml
kubectl get chaosengine -n demo pod-kill-nginx -o yaml
```
Watch the logs:
```bash
kubectl logs -l name=chaos-runner -n demo -f
```
You’ll see the operator delete a pod, wait 30 s, repeat. If your nginx service stays reachable (e.g., curl http://nginx.demo.svc.cluster.local returns 200), the hypothesis passes.
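A single curl only samples one moment, so it helps to probe for the whole chaos window. The sketch below is one way to do that; `probe_service` is a hypothetical helper name, and the 5-second per-request timeout is an arbitrary choice of mine rather than anything Litmus prescribes.

```bash
#!/usr/bin/env bash
# probe_service URL COUNT INTERVAL_S
# Request URL COUNT times, sleeping INTERVAL_S seconds between requests.
# Prints the number of non-200 responses and succeeds only if it is zero.
probe_service() {
  local url="$1" count="$2" interval="$3"
  local failures=0 i=1 code
  while [ "$i" -le "$count" ]; do
    # -w '%{http_code}' prints only the status code; a failed connection
    # is recorded as 000.
    code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")" || code="000"
    if [ "$code" != "200" ]; then
      failures=$((failures + 1))
      echo "probe $i: HTTP $code" >&2
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$failures"
  [ "$failures" -eq 0 ]
}
```

Run from a pod inside the cluster, `probe_service http://nginx.demo.svc.cluster.local 90 1` covers roughly the 90-second experiment above with one request per second.
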
In production we once lost a 30‑node worker pool during a rolling upgrade. To avoid surprise, we scripted a node‑drain experiment that evicts all pods from a randomly chosen node, then restores it.
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain
  namespace: prod
spec:
  appinfo:
    appns: prod
    applabel: "" # not needed for node-level chaos
    appkind: "" # same
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: NODE_LABEL
              value: "kubernetes.io/role=worker"
            - name: CHAOS_DURATION
              value: "120"
            - name: TOTAL_CHAOS_DURATION
              value: "180"
```
When this runs, Litmus picks a node matching NODE_LABEL, cordons it, evicts pods, waits CHAOS_DURATION, then uncordons. In our CI run we measure:
The experiment gave us confidence to shrink our node pool by 20 % without impacting latency.
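If you want to rehearse the cordon, evict, wait, uncordon sequence by hand before handing it to Litmus, plain kubectl can approximate it. This is a manual stand-in, not how Litmus implements node-drain internally; `drain_cycle` is a hypothetical helper name and the 120-second drain timeout is my own assumption.

```bash
#!/usr/bin/env bash
# drain_cycle NODE HOLD_S
# Cordon NODE, evict its pods, keep it out of rotation for HOLD_S seconds,
# then uncordon it. The node is uncordoned even if the drain step fails,
# and the drain's exit status is returned.
drain_cycle() {
  local node="$1" hold_s="$2" rc=0
  kubectl cordon "$node" || return 1
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --timeout=120s || rc=$?
  sleep "$hold_s"
  kubectl uncordon "$node" || return 1
  return "$rc"
}
```

Something like `drain_cycle worker-3 120` (with a node name from `kubectl get nodes`) is a reasonable first run on a staging cluster while you watch your dashboards.
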
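```yaml
name: Chaos Test
on:
  push:
    branches: [ main ]
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v4
      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: v1.28.0
      - name: Install Litmus CLI
        run: |
          # Check the tag and asset name against the litmusctl releases page;
          # -f makes curl fail on an HTTP error instead of saving the error page
          curl -fLo litmusctl https://github.com/litmuschaos/litmusctl/releases/download/v2.13.0/litmusctl-linux-amd64
          chmod +x litmusctl
          sudo mv litmusctl /usr/local/bin/
      - name: Run pod-kill experiment
        env:
          KUBE_CONFIG_DATA: ${{ secrets.KUBE_CONFIG_DATA }}
        run: |
          # Create ~/.kube in case it does not exist on the runner
          mkdir -p "$HOME/.kube"
          echo "$KUBE_CONFIG_DATA" | base64 -d > "$HOME/.kube/config"
          # Confirm this subcommand with `litmusctl --help`; applying the manifest with
          # `kubectl apply -f chaos/pod-kill-nginx.yaml` also creates the engine
          litmusctl create engine -f chaos/pod-kill-nginx.yaml --namespace demo
```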
As written, the workflow applies the engine to whichever cluster the kubeconfig secret points at. Point it at a temporary cluster (kind or a short‑lived GKE node pool) and add a final step that fails the job if the ChaosResult reports a non‑pass. That gives you fail‑fast feedback before a release ships.
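One way to gate the job on the ChaosResult is to poll its verdict with kubectl. The sketch below assumes the result follows Litmus's usual `<engine-name>-<experiment-name>` naming (so `pod-kill-nginx-pod-delete` here) and that the verdict lives at `.status.experimentStatus.verdict`; `wait_for_verdict` and its default timeouts are hypothetical choices of mine.

```bash
#!/usr/bin/env bash
# wait_for_verdict RESULT NAMESPACE [TIMEOUT_S] [POLL_S]
# Poll a ChaosResult until its verdict is final. Succeeds only on "Pass";
# fails on "Fail", "Stopped", or when TIMEOUT_S elapses first.
wait_for_verdict() {
  local result="$1" ns="$2" timeout_s="${3:-300}" poll_s="${4:-10}"
  local waited=0 verdict=""
  while [ "$waited" -le "$timeout_s" ]; do
    # The result may not exist yet; treat lookup errors as "no verdict".
    verdict="$(kubectl get chaosresult "$result" -n "$ns" \
      -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null)" || true
    case "$verdict" in
      Pass) echo "verdict: Pass"; return 0 ;;
      Fail|Stopped) echo "verdict: $verdict" >&2; return 1 ;;
    esac
    sleep "$poll_s"
    waited=$((waited + poll_s))
  done
  echo "timed out waiting for $result (last verdict: ${verdict:-none})" >&2
  return 1
}
```

As the last line of a workflow step, `wait_for_verdict pod-kill-nginx-pod-delete demo 300 10` makes the job exit non-zero unless the experiment passed.
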
- LitmusChaos 2.13.0, Litmus CLI v2.13.0, Kubernetes v1.28.x.
- chaosengine.spec.annotationCheck to block experiments in prod unless a Git tag chaos‑approved is present.
- chaosresult_status{status!="Pass"}.
- jobCleanUpPolicy: "delete" in the engine spec to avoid orphaned pods.
- network-chaos experiments.
- litmus-admin only for CI, litmus-viewer for dashboards.
- io-stress to simulate disk pressure, catching a bug where appendonly rewrites stalled under high I/O.

Chaos engineering in Kubernetes is no longer an experimental hobby; it’s a production necessity. Pick a tool (LitmusChaos 2.13.0 is my go‑to), write a few YAML experiments, and bake them into your CI pipeline. The payoff is concrete: you’ll know your SLOs hold up when a node disappears, a network partition appears, or a storage volume hiccups.
Give it a try next week. Deploy Litmus, run the pod‑kill example, and watch your dashboards stay green. If they don’t, you’ve just uncovered the next ticket on your backlog – and that’s exactly why you ran the test.
Happy breaking!