Software Engineering | Technology | June 19, 2026 | 3 min read

Kubernetes Chaos Engineering: A Practical Guide to LitmusChaos

Master Kubernetes chaos engineering with LitmusChaos. Learn how to perform fault injection and build resilient systems with this step-by-step technical guide.

Tags: Kubernetes, DevOps, SRE, LitmusChaos, Chaos Engineering, Cloud Native, Reliability, Linux, Server

Why Chaos Engineering?

I’ve spent too many nights awake because a "minor" service failure caused a cascading outage. In distributed systems, failure isn't an option; it's a guarantee. If you aren't proactively breaking your services, your users will do it for you during peak traffic.

That’s where Kubernetes chaos engineering comes in. It’s not about causing chaos for the sake of it; it’s about verifying your hypotheses regarding system resilience. By using LitmusChaos, we can systematically inject faults (pod kills, network latency, CPU spikes) to confirm that our monitoring and self-healing mechanisms actually work.

Getting Started with LitmusChaos

I prefer LitmusChaos because it’s Kubernetes-native, meaning it uses Custom Resource Definitions (CRDs) to manage experiments. It fits right into my existing GitOps workflows.

Prerequisites

A Kubernetes cluster (v1.24+)
Helm (v3.0+)
kubectl configured with cluster-admin access

Installation

First, let’s create a namespace and install the Litmus operator via Helm:


Bash
kubectl create namespace litmus
helm repo add litmuschaos https://litmuschaos.github.io/charts
helm install litmus litmuschaos/litmus --namespace litmus

Once installed, verify the pods are running: kubectl get pods -n litmus

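You should see the Litmus portal pods and their supporting services running. If you're on a local machine, port-forward the portal frontend to access the UI: kubectl port-forward svc/litmusportal-frontend-service 9091:9091 -n litmus. The service name can vary with the chart version and the Helm release name, so confirm it first with kubectl get svc -n litmus.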
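To script that verification instead of eyeballing it, one option is to count the pods that aren't ready yet. This is a minimal sketch, not part of Litmus: count_unready_pods is a hypothetical helper name, and it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...).

```shell
# Count pods that are not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin, where column 2 is
# READY (e.g. 1/1) and column 3 is STATUS. Completed pods are ignored.
count_unready_pods() {
  awk '{
    if ($3 == "Completed") next
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) n++
  } END { print n + 0 }'
}

# Usage against a live cluster (needs kubectl access):
#   kubectl get pods -n litmus --no-headers | count_unready_pods
```

A result of 0 means every pod in the namespace is Running with all of its containers ready.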

Running Your First Experiment: Pod Chaos

Let's test if your deployment can handle a random pod kill. We’ll use the pod-delete experiment.

1. Define the ChaosEngine

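The ChaosEngine connects your application to the chaos experiment. It only works if the Litmus CRDs and chaos operator are present in the cluster, the pod-delete ChaosExperiment is installed in the target namespace, and a service account with the permissions the experiment needs exists there. Create a file named engine.yaml: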


YAML
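apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
  namespace: default
spec:
  # 'active' starts the experiment; set it to 'stop' to abort
  engineState: 'active'
  # Target workload: Deployment pods labelled app=nginx in the default namespace
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  # Assumed name: this service account and its RBAC must already exist
  chaosServiceAccount: pod-delete-sa
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total length of the experiment, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Seconds to wait between successive pod deletions
            - name: CHAOS_INTERVAL
              value: '5'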

2. Apply and Observe

Apply the engine: kubectl apply -f engine.yaml

Now, watch your pods: kubectl get pods -w

You’ll see your nginx pods terminating and the ReplicaSet spinning up replacements. This is the heart of fault injection. If the Deployment runs more than one replica and your application handles the SIGTERM signal correctly, you shouldn't see any downtime. If you do, you've found a gap in your resilience.
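Watching pods tells you what happened, but Litmus also records a verdict in a ChaosResult resource. The sketch below turns that verdict into an exit code a script can act on. verdict_exit_code is a hypothetical helper, and the result name (engine name plus experiment name) and the JSON path in the usage comment are my assumptions about the ChaosResult layout, so confirm them against your Litmus version with kubectl get chaosresult -n default -o yaml.

```shell
# Map a ChaosResult verdict string to an exit code:
#   Pass -> 0, Awaited (experiment still running) -> 2, anything else -> 1.
verdict_exit_code() {
  case "$1" in
    Pass)    echo 0 ;;
    Awaited) echo 2 ;;
    *)       echo 1 ;;
  esac
}

# Usage against a live cluster (result name and JSON path are assumptions):
#   verdict=$(kubectl get chaosresult engine-nginx-pod-delete -n default \
#     -o jsonpath='{.status.experimentStatus.verdict}')
#   exit "$(verdict_exit_code "$verdict")"
```

Treating an empty or unknown verdict as a failure is deliberate: a missing result should never read as a pass.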

Integrating into SRE Workflows

Running experiments manually is a good start, but it isn't site reliability engineering. To make this sustainable, you need to automate these checks.

The "Steady State" Hypothesis

Before running any experiment, define your steady state. What metrics should remain stable? For a web service, this is usually:

Error rate < 0.1%
P99 latency < 200ms
Successful request count

I recommend hooking Litmus into your CI/CD pipeline. Use litmusctl to trigger experiments as part of your integration tests. If the steady-state metrics deviate beyond your thresholds during the experiment, the pipeline should fail.
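The pipeline gate itself can be a few lines of shell. This is a sketch under stated assumptions: steady_state_ok is a hypothetical helper, the thresholds mirror the example values above, and fetching the two numbers from your metrics backend is left to your own stack.

```shell
# Steady-state thresholds, taken from the example values in the text.
MAX_ERROR_RATE_PCT="${MAX_ERROR_RATE_PCT:-0.1}"
MAX_P99_MS="${MAX_P99_MS:-200}"

# Succeeds (status 0) only if both metrics are strictly below their thresholds.
# Args: error rate in percent, P99 latency in milliseconds.
# Returns 2 if either argument is not a plain non-negative number.
steady_state_ok() {
  # awk does the floating-point comparison that plain shell tests cannot.
  awk -v e="$1" -v p="$2" -v me="$MAX_ERROR_RATE_PCT" -v mp="$MAX_P99_MS" '
    BEGIN {
      num = "^[0-9]+([.][0-9]+)?$"
      if (e !~ num || p !~ num) exit 2
      exit !((e + 0) < (me + 0) && (p + 0) < (mp + 0))
    }'
}

# Usage in a pipeline step, after the experiment has run:
#   steady_state_ok "$error_rate_pct" "$p99_ms" || exit 1
```

Rejecting non-numeric input matters here: a failed metrics query that returns an empty string should fail the build rather than slip through as zero.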

Hard-Won Lessons

Start Small: Don't start by deleting your database pods. Start with redundant microservices.
Monitor Everything: Chaos experiments without observability are just sabotage. Make sure your Prometheus/Grafana stack is collecting metrics and its alerts are configured before you start the experiment.
Blast Radius: Always define your ChaosEngine scope carefully. You don't want to accidentally kill production pods while testing in a staging namespace.
Rollback is Key: Ensure your application has proper readiness and liveness probes. Without them, Kubernetes can't tell that a pod has stopped serving, so it keeps routing traffic to it and never restarts it, and your experiment just creates a black hole.
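The blast-radius lesson is easy to enforce with a preflight guard that refuses to apply an engine unless the current kube context is on an explicit allowlist. This is a minimal sketch: context_allowed is a hypothetical helper, and the context names are placeholders for your own.

```shell
# Space-separated allowlist of kube contexts where chaos may run.
# These names are hypothetical; replace them with your own contexts.
ALLOWED_CONTEXTS="${ALLOWED_CONTEXTS:-staging-cluster dev-cluster}"

# Succeeds only if the given context name is on the allowlist.
context_allowed() {
  local current="$1" allowed
  for allowed in $ALLOWED_CONTEXTS; do
    if [ "$current" = "$allowed" ]; then
      return 0
    fi
  done
  return 1
}

# Usage before applying the engine:
#   context_allowed "$(kubectl config current-context)" || {
#     echo "refusing to run chaos in this context" >&2; exit 1; }
#   kubectl apply -f engine.yaml
```

An allowlist is safer than a blocklist here, because a newly added production context is rejected by default.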

Final Thoughts

Kubernetes chaos engineering is about building confidence. When I can prove that my services survive a node drain or a network partition, I sleep better. LitmusChaos provides the framework; the rest is up to you to define what "resilient" means for your specific architecture.

Don't wait for a real incident to find out your system is fragile. Start small, experiment often, and automate the results.
