Software Engineering · Technology · June 18, 2026 · 3 min read

Kubernetes Chaos Engineering with LitmusChaos: A Practical Guide

Master Kubernetes Chaos Engineering with LitmusChaos. Learn to perform fault injection and improve system resiliency with this hands-on deployment guide.

Tags: Kubernetes, Chaos Engineering, LitmusChaos, SRE, DevOps, Reliability, Cloud Native, Linux, Server

Why You Should Break Your Cluster on Purpose

If you aren't testing how your services behave when things go sideways, you're just waiting for a production outage to do it for you. Kubernetes Chaos Engineering isn't about causing damage; it's about verifying your assumptions. Do your pods actually restart when a node goes down? Does your circuit breaker trip correctly when the database latency spikes?

I’ve spent too many late nights debugging "ghost" issues that only appeared under load. Implementing LitmusChaos changed how I approach stability. It’s a CNCF project that turns fault injection into a repeatable, observable process. In this guide, I’ll show you how to get it running and start testing your system resiliency.

Prerequisites

A Kubernetes cluster (v1.24+)
kubectl configured and authenticated
Helm 3 installed
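
A quick preflight check can confirm the tools above are on your PATH before you start. This is a minimal sketch: the function name require_cmds is made up for illustration, and it only checks that the binaries exist, not the cluster version or your credentials.

```shell
# require_cmds CMD...
# Hypothetical helper: returns non-zero if any listed command is missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be require_cmds kubectl helm && kubectl cluster-info, which also confirms that kubectl can reach the cluster.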

Step 1: Installing LitmusChaos

We’ll deploy Litmus using Helm. It’s the cleanest way to install and upgrade the Litmus components.


Bash
# Add the Litmus repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Create the namespace
kubectl create namespace litmus

# Install the Litmus chart as a release named "litmus"
helm install litmus litmuschaos/litmus --namespace litmus

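Once the pods are running, verify the installation with kubectl get pods -n litmus. Exact pod names depend on the chart version, but you should see the Litmus portal (ChaosCenter) components in the Running state. Recent chart versions install only the portal; in that case the chaos operator and the CRDs used in Step 3 arrive when you connect a chaos infrastructure from the portal, or via the separate litmuschaos/litmus-core chart.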
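
Rather than re-running kubectl get pods by hand, you can block until every pod in the namespace reports Ready. A minimal sketch: the wrapper name wait_for_litmus and the 300s default timeout are my own assumptions, while kubectl wait and its --for=condition=Ready flag are standard kubectl.

```shell
# wait_for_litmus [NAMESPACE] [TIMEOUT]
# Hypothetical wrapper: blocks until all pods in NAMESPACE are Ready,
# or fails once TIMEOUT elapses.
wait_for_litmus() {
  kubectl wait --for=condition=Ready pods --all \
    --namespace "${1:-litmus}" \
    --timeout="${2:-300s}"
}
```

Calling wait_for_litmus litmus 120s returns non-zero if any pod is still not Ready after two minutes, which makes it easy to use as a gate in a setup script.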

Step 2: Setting Up Your First Experiment

To perform fault injection, we need to define an "Experiment." Litmus uses Custom Resource Definitions (CRDs) to manage these. Let's simulate a pod-kill scenario to see if our replica sets recover as expected.

First, create a rbac.yaml to give the Litmus engine the necessary permissions to kill pods in your target namespace:


YAML
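# Assumption: a minimal rule set for pod-delete. Compare it with the
# permissions listed in the experiment's ChaosExperiment resource.
# The service account must live in the same namespace as the ChaosEngine.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: litmus-role
  namespace: default
rules:
# Core resources: target pods, helper pods, and chaos events
- apiGroups: [""]
  resources: ["pods", "pods/log", "events"]
  verbs: ["create", "get", "list", "watch", "patch", "update", "delete", "deletecollection"]
# Workloads the experiment inspects to find its targets
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list"]
# Jobs created to execute the experiment
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "list", "delete", "deletecollection"]
# Litmus custom resources the runner reads and updates
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines", "chaosexperiments", "chaosresults"]
  verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: litmus-rb
  namespace: default
subjects:
- kind: ServiceAccount
  name: litmus-admin
  namespace: default
roleRef:
  kind: Role
  name: litmus-role
  apiGroup: rbac.authorization.k8s.io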

Apply this with kubectl apply -f rbac.yaml.
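
Before running any chaos, it is worth confirming that the binding actually grants what you expect. The sketch below wraps kubectl auth can-i with service account impersonation; the helper names sa_subject and check_sa_permission are hypothetical, and the first argument should be whichever namespace the litmus-admin service account lives in.

```shell
# sa_subject SA_NAMESPACE SA_NAME
# Builds the user string kubectl expects when impersonating a service account.
sa_subject() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# check_sa_permission SA_NAMESPACE SA_NAME TARGET_NAMESPACE VERB RESOURCE
# Hypothetical helper: prints yes/no and returns kubectl's exit status.
check_sa_permission() {
  kubectl auth can-i "$4" "$5" \
    --namespace "$3" \
    --as "$(sa_subject "$1" "$2")"
}
```

For example, check_sa_permission default litmus-admin default delete pods should print yes once the Role and RoleBinding are in place. Note that your own user needs permission to impersonate service accounts for this check to work.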

Step 3: Executing the Chaos

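Now, let’s run a pod-delete experiment. I prefer applying the Litmus custom resources directly for automation, but you can also use the web UI. The ChaosEngine below references a ChaosExperiment named pod-delete, so that experiment resource must already be installed in the target namespace (for example, from ChaosHub).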

Create experiment.yaml:


YAML
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: "app=my-web-service"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '5'

Apply it: kubectl apply -f experiment.yaml.

Watch the logs: kubectl logs -f pod-delete-engine-runner -n default. The runner pod is named after the engine; it launches the experiment job, which selects pods labeled app=my-web-service and terminates them. This is the core of the practice: observing how your system heals itself when a component is forcefully removed.
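
Logs are useful while the experiment runs, but for automation you usually want the recorded verdict. Litmus writes it to a ChaosResult resource named after the engine and the experiment. The sketch below reads it with kubectl; the helper names are hypothetical, and the jsonpath assumes the v1alpha1 ChaosResult layout, so confirm it with kubectl describe chaosresult on your cluster.

```shell
# chaos_result_name ENGINE EXPERIMENT
# Litmus names the ChaosResult "<engine>-<experiment>".
chaos_result_name() {
  printf '%s-%s' "$1" "$2"
}

# chaos_verdict ENGINE EXPERIMENT [NAMESPACE]
# Hypothetical helper: prints the verdict stored on the ChaosResult
# (for example Pass, Fail, or Awaited while the run is in progress).
chaos_verdict() {
  kubectl get chaosresult "$(chaos_result_name "$1" "$2")" \
    --namespace "${3:-default}" \
    --output jsonpath='{.status.experimentStatus.verdict}'
}
```

With the engine from this guide, chaos_verdict pod-delete-engine pod-delete reads the result for the run above.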

Analyzing the Results

After the 30-second duration, check your application metrics. Did the error rate spike? Did your ingress controller route traffic to the remaining healthy pods? If you see a total outage, you’ve discovered a critical flaw in your configuration—likely a missing Readiness Probe or an insufficient replica count.
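
If you don't have dashboards wired up yet, a crude way to answer "did the error rate spike?" is to poll the service while the experiment runs and count failed requests. A minimal sketch, assuming your service exposes an HTTP endpoint you can reach; the function name probe_error_rate and the URL in the usage note are placeholders.

```shell
# probe_error_rate URL ATTEMPTS [PAUSE_SECONDS]
# Hypothetical helper: polls URL and prints "<failures>/<attempts>".
probe_error_rate() {
  url="$1"; attempts="$2"; pause="${3:-1}"
  failures=0; i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses;
    # --max-time bounds each request so a hung pod counts as a failure.
    if ! curl --silent --fail --max-time 2 --output /dev/null "$url"; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  printf '%s/%s\n' "$failures" "$attempts"
}
```

Start something like probe_error_rate http://my-web-service.example/healthz 40 1 in a second terminal just before applying the ChaosEngine, so the probe window covers the whole 30-second chaos duration. A result of 0/40 suggests traffic kept flowing to healthy pods.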

Best Practices for SREs

Start Small: Don't run chaos against your production database on day one. Start with dev, then staging.
Define Steady State: Before running an experiment, define what "healthy" looks like (e.g., latency < 200ms, 0 errors). If you can't measure it, don't break it.
Automate: Integrate these experiments into your CI/CD pipeline. Running chaos once a week is good; running it on every release is better.
Control the Blast Radius: Use specific labels in your ChaosEngine to limit the impact to a subset of pods.
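
The steady-state rule above can be turned into a simple gate that runs before and after each experiment. This is a sketch only: is_steady_state is a hypothetical helper, the defaults mirror the example thresholds (latency under 200 ms, 0 errors), and it expects whole-number inputs that you collect from your own monitoring.

```shell
# is_steady_state LATENCY_MS ERROR_COUNT [MAX_LATENCY_MS] [MAX_ERRORS]
# Hypothetical helper: succeeds only when latency is strictly below the
# limit and the error count does not exceed the allowed maximum.
is_steady_state() {
  latency_ms="$1"; errors="$2"
  max_latency_ms="${3:-200}"; max_errors="${4:-0}"
  [ "$latency_ms" -lt "$max_latency_ms" ] && [ "$errors" -le "$max_errors" ]
}
```

In a pipeline, a line such as is_steady_state "$p95_ms" "$error_count" || exit 1 placed before the kubectl apply step keeps you from injecting faults into a system that is already unhealthy. The variables p95_ms and error_count are placeholders for values from your metrics backend.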

Final Thoughts

Kubernetes Chaos Engineering isn't about being a "cowboy" who breaks production. It’s about building confidence. By using LitmusChaos to inject faults, you gain the data needed to justify infrastructure improvements to stakeholders. Go ahead, break things—just make sure you're the one who decides when it happens.
