Master Kubernetes Chaos Engineering with LitmusChaos. Learn to perform fault injection and improve system resiliency with this hands-on deployment guide.
If you aren't testing how your services behave when things go sideways, you're just waiting for a production outage to do it for you. Kubernetes Chaos Engineering isn't about causing damage; it's about verifying your assumptions. Do your pods actually restart when a node goes down? Does your circuit breaker trip correctly when the database latency spikes?
I’ve spent too many late nights debugging "ghost" issues that only appeared under load. Implementing LitmusChaos changed how I approach stability. It’s a CNCF project that turns fault injection into a repeatable, observable process. In this guide, I’ll show you how to get it running and start testing your system resiliency.
kubectl configured and authenticated
We’ll deploy Litmus using Helm. It’s the cleanest way to manage the operator and the portal.
```bash
# Add the Litmus repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Create the namespace
kubectl create namespace litmus

# Install the operator
helm install litmus litmuschaos/litmus --namespace litmus
```
Once the install finishes, verify it with kubectl get pods -n litmus. You should see the Litmus portal and operator pods in the Running state; the exact pod names depend on the chart version.
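If you'd rather block until everything is up than re-run that command by hand, a small wrapper around kubectl wait can do it. This is a minimal sketch: the function name wait_for_litmus and the 300s default timeout are my own choices, not part of Litmus, and it assumes every pod in the namespace is long-running (a completed Job pod would never report Ready).

```bash
# Wait until every pod in the given namespace reports Ready.
# Usage: wait_for_litmus [namespace] [timeout]
wait_for_litmus() {
  local ns="${1:-litmus}"
  local timeout="${2:-300s}"
  # kubectl wait exits non-zero if the condition is not met before the timeout.
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$ns" --timeout="$timeout"
}
```

Call it as wait_for_litmus right after the helm install, or as wait_for_litmus litmus 120s for a tighter timeout.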
To perform fault injection, we need to define an "Experiment." Litmus uses Custom Resource Definitions (CRDs) to manage these. Let's simulate a pod-kill scenario to see if our replica sets recover as expected.
First, create a rbac.yaml to give the Litmus engine the necessary permissions to kill pods in your target namespace:
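```yaml
# The ChaosEngine below runs in the default namespace and refers to
# litmus-admin there, so the service account is created in default too.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: litmus-role
  namespace: default
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "events", "deployments", "jobs"]
    verbs: ["get", "list", "create", "delete", "patch", "update"]
  # Assumed from the upstream pod-delete RBAC sample: the experiment also
  # reads and updates Litmus custom resources. Check the docs for your version.
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["get", "list", "create", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: litmus-rb
  namespace: default
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: default
roleRef:
  kind: Role
  name: litmus-role
  apiGroup: rbac.authorization.k8s.io
```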
Apply this with kubectl apply -f rbac.yaml.
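Before running any chaos, I like to confirm the binding actually grants what I think it grants. The sketch below uses kubectl auth can-i with impersonation; the helper names sa_can and check_pod_permissions are hypothetical, and it assumes your own user is allowed to impersonate service accounts.

```bash
# Return 0 if the service account may perform the verb on the resource.
# Usage: sa_can <sa-namespace> <sa-name> <verb> <resource> <target-namespace>
sa_can() {
  local sa_ns="$1" sa="$2" verb="$3" resource="$4" target_ns="$5"
  # kubectl auth can-i exits 0 for "yes" and non-zero for "no".
  kubectl auth can-i "$verb" "$resource" \
    --as="system:serviceaccount:${sa_ns}:${sa}" \
    --namespace "$target_ns" >/dev/null 2>&1
}

# Print a yes/no line for each pod verb the Role is meant to grant.
# Usage: check_pod_permissions <sa-namespace> <sa-name> <target-namespace>
check_pod_permissions() {
  local sa_ns="$1" sa="$2" target_ns="$3" verb
  for verb in get list delete patch; do
    if sa_can "$sa_ns" "$sa" "$verb" pods "$target_ns"; then
      echo "$verb pods: yes"
    else
      echo "$verb pods: no"
    fi
  done
}
```

Pass the namespace named in the RoleBinding's subject as the first argument and the target namespace as the last. Any "no" line means the experiment is likely to fail with a permissions error rather than with a real resiliency finding.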
Now, let’s run a pod-delete experiment. I prefer using the Litmus Chaos Experiment CRDs directly for automation, but you can also use the web UI.
Create experiment.yaml:
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: "app=my-web-service"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '5'
```
Apply it: kubectl apply -f experiment.yaml.
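For automation you usually want to block until the engine finishes rather than eyeball it. Here is a rough polling sketch. It assumes the engine reports its state in .status.engineStatus and that the terminal values are "completed" and "stopped", which matches the Litmus versions I've used; confirm against your own cluster. The function names and the default of 30 attempts are mine.

```bash
# Print the current status string of a ChaosEngine.
# Usage: engine_status <engine-name> [namespace]
engine_status() {
  local engine="$1" ns="${2:-default}"
  kubectl get chaosengine "$engine" --namespace "$ns" \
    -o jsonpath='{.status.engineStatus}'
}

# Poll until the engine reports a terminal status or the attempts run out.
# Usage: wait_for_engine <engine-name> [namespace] [attempts] [delay-seconds]
wait_for_engine() {
  local engine="$1" ns="${2:-default}" attempts="${3:-30}" delay="${4:-5}"
  local status="" i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(engine_status "$engine" "$ns")"
    case "$status" in
      completed|stopped) echo "$status"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out (last status: ${status:-unknown})" >&2
  return 1
}
```

For the engine above, wait_for_engine pod-delete-engine default would poll every 5 seconds for up to 30 attempts.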
Watch the logs: kubectl logs -f pod-delete-engine-runner. You’ll see the chaos runner targeting your pods and terminating them. This is the core of SRE practices: observing how your system heals itself when a component is forcefully removed.
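Logs tell you what Litmus did, not what your users saw. While the chaos is running, I find it useful to hit the service from the outside and count failures. The following is a minimal sketch using curl; the function name probe_availability and whatever URL you pass in are assumptions, so point it at a real health endpoint for your service.

```bash
# Send <count> requests to <url>, pausing <delay> seconds between them,
# and print how many succeeded and how many failed.
# Usage: probe_availability <url> [count] [delay-seconds]
probe_availability() {
  local url="$1" count="${2:-30}" delay="${3:-1}"
  local ok=0 failed=0 i=0
  while [ "$i" -lt "$count" ]; do
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses;
    # --max-time caps each request so a hung pod counts as a failure.
    if curl --silent --fail --max-time 2 --output /dev/null "$url"; then
      ok=$((ok + 1))
    else
      failed=$((failed + 1))
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "ok=$ok failed=$failed"
}
```

Start it in a second terminal just before applying the engine. With the defaults it sends 30 requests one second apart, which roughly covers the 30-second chaos window.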
After the 30-second duration, check your application metrics. Did the error rate spike? Did your ingress controller route traffic to the remaining healthy pods? If you see a total outage, you’ve discovered a critical flaw in your configuration—likely a missing Readiness Probe or an insufficient replica count.
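Alongside your own metrics, Litmus records its own view of the run in a ChaosResult resource. The sketch below reads the verdict from it. It assumes the result is named <engine>-<experiment> and that the verdict lives at .status.experimentStatus.verdict, which is what I've seen in practice; the helper names are hypothetical.

```bash
# Print the verdict recorded in the ChaosResult for an engine/experiment pair.
# Usage: chaos_verdict <engine-name> <experiment-name> [namespace]
chaos_verdict() {
  local engine="$1" experiment="$2" ns="${3:-default}"
  kubectl get chaosresult "${engine}-${experiment}" --namespace "$ns" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}

# Return non-zero unless the verdict is Pass, so the check can gate a pipeline.
# Usage: assert_chaos_passed <engine-name> <experiment-name> [namespace]
assert_chaos_passed() {
  local verdict
  verdict="$(chaos_verdict "$@")"
  echo "verdict: ${verdict:-unknown}"
  [ "$verdict" = "Pass" ]
}
```

For this guide's experiment the call would be assert_chaos_passed pod-delete-engine pod-delete default. A Pass verdict only says the experiment ran and its own checks held, so treat it as a complement to the error-rate and traffic questions above, not a replacement.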
ChaosEngine to limit the impact to a subset of pods.
Kubernetes Chaos Engineering isn't about being a "cowboy" who breaks production. It’s about building confidence. By using LitmusChaos to inject faults, you gain the data needed to justify infrastructure improvements to stakeholders. Go ahead, break things. Just make sure you're the one who decides when it happens.