MHRubel

© 2026 Mahamudul Hasan Rubel. All rights reserved.

Software Engineering · Technology · June 18, 2026 · 6 min read

Implement Chaos Engineering in Kubernetes: Tools, Practices & Real‑World Examples

Chaos engineering, Kubernetes resilience, LitmusChaos, fault injection, DevOps testing – learn practical steps, tool choices, and production‑ready examples to make your clusters fault‑tolerant.

Tags: chaos engineering, kubernetes, litmuschaos, fault injection, devops testing, observability
Categories: Software Engineering, DevOps, Linux, Server

Chaos engineering isn’t a buzzword; it’s a disciplined way to prove your Kubernetes workloads survive the unexpected. In this post I’ll show you how to embed chaos into your CI/CD pipeline, compare the most‑used open‑source tools, and give you ready‑to‑run manifests that demonstrate fault injection at scale.

TL;DR – Deploy LitmusChaos 2.13.0, run a simple pod‑kill experiment, and integrate the test into a GitHub Actions workflow.


Why chaos matters for Kubernetes

Kubernetes abstracts away servers, but the underlying nodes, network, and storage still fail. A single pod restart can cascade into a service outage if you haven’t validated your resilience. Chaos engineering forces you to answer two questions:

  1. What can go wrong? – Identify failure modes (node loss, network latency, config drift).
  2. Does the system survive? – Run controlled experiments, measure SLO impact, and iterate.

Skipping this step means you’re betting on “it works in prod” – a gamble most senior engineers won’t take.


Core Chaos Engineering Practices

| Practice | What you do | Typical KPI |
|---|---|---|
| Define steady state | Record latency, error rate, CPU usage for a healthy release. | Baseline response time < 200 ms |
| Hypothesize failure impact | “If a node disappears, latency will stay < 300 ms.” | Success if hypothesis holds |
| Inject fault | Use a chaos tool to kill pods, throttle network, corrupt disks. | Fault injection duration 30 s – 5 min |
| Observe & verify | Pull metrics from Prometheus, Grafana, or Datadog. | No SLO breach |
| Automate & repeat | Add experiment to CI pipeline, schedule nightly runs. | 100 % coverage across services |
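
The hypothesis row boils down to a pass/fail comparison you can script. A minimal sketch, assuming you have already pulled an observed latency in whole milliseconds from your metrics stack; hypothesis_holds is a name made up for illustration, not part of any chaos tool.

```shell
# hypothesis_holds OBSERVED_MS LIMIT_MS
# Succeeds (exit 0) when the observed latency stays strictly below the limit.
# Both arguments are whole milliseconds. The function name is illustrative.
hypothesis_holds() {
  observed_ms=$1
  limit_ms=$2
  if [ "$observed_ms" -lt "$limit_ms" ]; then
    echo "PASS: ${observed_ms} ms < ${limit_ms} ms"
    return 0
  fi
  echo "FAIL: ${observed_ms} ms >= ${limit_ms} ms"
  return 1
}
```

Calling hypothesis_holds 240 300 after a node-loss experiment would report a pass for the “latency will stay < 300 ms” hypothesis; the exit code is what a CI step would act on.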

Tool Landscape

| Tool | Latest stable version (2024) | Language | Main fault types | Integration |
|---|---|---|---|---|
| LitmusChaos | 2.13.0 | Go, YAML | Pod kill, node drain, network latency, IO chaos | Helm, Argo CD, GitOps |
| Chaos Mesh | v2.6.3 | Go | Pod kill, network partition, stress CPU/Memory, JVM chaos | CRD, Kube‑Scheduler plugin |
| PowerfulSeal | v1.4.0 | Python | Node failure, VM termination, network blackhole | Terraform, Ansible |
| Gremlin (SaaS) | 2.4.1 | Go, CLI | All above + DNS, HTTP latency (paid) | CLI, API, Helm |

I gravitate toward LitmusChaos because it ships a rich experiment library, integrates cleanly with Helm, and the community maintains a “ChaosCenter” UI that developers love. Chaos Mesh is a solid alternative if you prefer a CRD‑first workflow; it ships its own dashboard as well.


Quick Start: Deploy LitmusChaos on a GKE Cluster

Bash
# Prereqs
kubectl version --client              # v1.28.0 (the --short flag was removed in 1.28)
helm version                          # v3.13.2

# 1. Add Litmus repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# 2. Install the operator in the litmus namespace
# The --set keys depend on the chart version; confirm them with
# `helm show values litmuschaos/litmus` before relying on them.
helm install litmus litmuschaos/litmus --namespace litmus --create-namespace \
  --set chaosOperator.image.tag=2.13.0 \
  --set portalServer.enabled=true \
  --set portalServer.image.tag=2.13.0

# 3. Verify CRDs are present (CRD names are plural)
kubectl get crd | grep chaos
# chaosengines.litmuschaos.io
# chaosexperiments.litmuschaos.io
# chaosresults.litmuschaos.io

The portalServer component gives you a web UI at http://<node-ip>:9091. I keep it behind an internal load balancer; no public exposure needed.
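
Before running experiments it helps to confirm the install actually settled. A rough sketch, assuming kubectl points at the cluster and the chart went into the litmus namespace; litmus_ready is a hypothetical helper name, and the CRD names are the plural forms Litmus registers. Note that kubectl wait on all pods will time out if the namespace also holds completed Job pods.

```shell
# litmus_ready [NAMESPACE]
# Waits for every pod in the namespace to become Ready, then confirms that
# the three core Litmus CRDs are registered. Returns non-zero on failure.
litmus_ready() {
  ns=${1:-litmus}
  kubectl wait --for=condition=Ready pod --all -n "$ns" --timeout=300s || return 1
  for crd in chaosengines chaosexperiments chaosresults; do
    kubectl get crd "${crd}.litmuschaos.io" >/dev/null || {
      echo "missing CRD: ${crd}.litmuschaos.io" >&2
      return 1
    }
  done
  echo "Litmus is ready in namespace ${ns}"
}
```

Run it as litmus_ready litmus right after the Helm install; a non-zero exit means either the pods never became Ready or a CRD is missing.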


First Experiment: Pod Kill

Create a simple ChaosEngine that kills a pod from the nginx-demo deployment every 30 seconds, three times.

YAML
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-nginx
  namespace: demo
spec:
  engineState: "active"        # the engine only runs while this is "active"
  appinfo:
    appns: demo
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "90"          # seconds
            - name: CHAOS_INTERVAL
              value: "30"

Apply it:

Bash
kubectl apply -f pod-kill-nginx.yaml
kubectl get chaosengine -n demo pod-kill-nginx -o yaml

Watch the logs:

Bash
# The runner pod is named <engine-name>-runner
kubectl logs pod-kill-nginx-runner -n demo -f

You’ll see the experiment delete a pod, wait 30 s, and repeat. If your nginx service stays reachable throughout (e.g., curl http://nginx.demo.svc.cluster.local returns 200), the hypothesis passes.
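
Rather than eyeballing a single curl, you can probe the service for the whole chaos window. A minimal sketch, assuming it runs from a pod inside the cluster (the service DNS name does not resolve from outside) and that curl is installed; probe_service is an illustrative name.

```shell
# probe_service URL ATTEMPTS [PAUSE_SECONDS]
# Requests URL repeatedly and counts HTTP 200 responses.
# Succeeds only if every attempt returned 200.
probe_service() {
  url=$1
  attempts=$2
  pause=${3:-1}
  ok=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # curl prints 000 and exits non-zero when the connection fails
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url") || true
    if [ "$code" = "200" ]; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "${ok}/${attempts} requests returned 200"
  [ "$ok" -eq "$attempts" ]
}
```

Something like probe_service http://nginx.demo.svc.cluster.local 90 1 roughly covers the 90-second experiment; any dropped request fails the check.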


Real‑World Example: Simulating a Node Drain

In production we once lost a 30‑node worker pool during a rolling upgrade. To avoid a repeat, we scripted a node‑drain experiment that evicts all pods from a randomly chosen node, then restores it.

YAML
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain
  namespace: prod
spec:
  engineState: "active"        # the engine only runs while this is "active"
  appinfo:
    appns: prod
    applabel: ""               # not needed for node-level chaos
    appkind: ""                # same
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: NODE_LABEL               # used when TARGET_NODE is not set
              value: "kubernetes.io/role=worker"
            - name: TOTAL_CHAOS_DURATION     # seconds the node stays drained
              value: "180"

When this runs, Litmus picks a node matching NODE_LABEL, cordons it, evicts its pods, waits TOTAL_CHAOS_DURATION seconds, then uncordons it. In our CI run we measured:

  • Pod restart latency – stayed under 5 s (our SLA).
  • Cluster autoscaler – didn’t spin up extra nodes (cost neutral).

The experiment gave us confidence to shrink our node pool by 20 % without impacting latency.
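
Measuring the first bullet does not need anything fancy. A sketch, assuming the workload is a Deployment and that “recovered” means kubectl rollout status reports all replicas available again; recovery_seconds and within_sla are hypothetical helper names, and the one-second resolution of date is coarse for a 5 s SLA, so treat the number as approximate.

```shell
# recovery_seconds NAMESPACE DEPLOYMENT [TIMEOUT]
# Prints how many whole seconds pass until the Deployment reports all
# replicas available again. Returns 1 if the timeout is hit first.
recovery_seconds() {
  ns=$1
  deploy=$2
  timeout=${3:-120s}
  start=$(date +%s)
  kubectl rollout status "deployment/${deploy}" -n "$ns" \
    --timeout="$timeout" >/dev/null || return 1
  end=$(date +%s)
  echo $((end - start))
}

# within_sla SECONDS LIMIT
# Succeeds when the measured recovery time is at or under the limit.
within_sla() {
  [ "$1" -le "$2" ]
}
```

Start the measurement as soon as the drain begins, for example secs=$(recovery_seconds prod my-api) followed by within_sla "$secs" 5, where my-api stands in for your own Deployment name.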


Automating Chaos in CI/CD (GitHub Actions Example)

YAML
name: Chaos Test
on:
  push:
    branches: [ main ]
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v4

      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: v1.28.0

      - name: Run pod-kill experiment
        env:
          KUBE_CONFIG_DATA: ${{ secrets.KUBE_CONFIG_DATA }}
        run: |
          # Make sure ~/.kube exists before writing the kubeconfig
          mkdir -p "$HOME/.kube"
          echo "$KUBE_CONFIG_DATA" | base64 -d > "$HOME/.kube/config"
          # A ChaosEngine is an ordinary custom resource, so kubectl can apply it
          kubectl apply -f chaos/pod-kill-nginx.yaml

The workflow assumes KUBE_CONFIG_DATA points at a disposable cluster (kind or a short‑lived GKE node pool), never production. Applying the engine only starts the experiment, so the job should also read the ChaosResult verdict and fail on anything other than Pass. That gives you fail‑fast feedback before a release ships.
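
The fail-on-non-pass gate can be a small polling loop at the end of the job. A minimal sketch, assuming the ChaosResult follows the <engine-name>-<experiment-name> naming (pod-kill-nginx-pod-delete here) and exposes its verdict at .status.experimentStatus.verdict; wait_for_verdict is a hypothetical helper, and the default of 30 attempts at 10 s each is an arbitrary choice.

```shell
# wait_for_verdict RESULT_NAME NAMESPACE [ATTEMPTS] [PAUSE_SECONDS]
# Polls the ChaosResult until its verdict is Pass or Fail.
# Returns 0 on Pass, 1 on Fail, 2 if no final verdict shows up in time.
wait_for_verdict() {
  result=$1
  ns=$2
  attempts=${3:-30}
  pause=${4:-10}
  verdict=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # The resource may not exist yet right after the engine is applied
    verdict=$(kubectl get chaosresult "$result" -n "$ns" \
      -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null) || true
    case "$verdict" in
      Pass) echo "verdict: Pass"; return 0 ;;
      Fail) echo "verdict: Fail"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$pause"
  done
  echo "verdict: timed out (last seen: ${verdict:-none})"
  return 2
}
```

As the last command of a workflow step, wait_for_verdict pod-kill-nginx-pod-delete demo makes the job exit non-zero on a failed or stuck experiment.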


Best‑Practice Checklist

  • Version pin everything – Helm chart 2.13.0, Litmus CLI v2.13.0, Kubernetes v1.28.x.
  • Scope experiments – Start with non‑critical services (canary) before touching payment APIs.
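  • Guard rails – Set chaosengine.spec.annotationCheck to "true" so Litmus only targets workloads annotated with litmuschaos.io/chaos="true", and in prod apply that annotation only when a chaos‑approved Git tag is present.
  • Observability – Wire Prometheus alerts on failed experiments, for example the chaos‑exporter metric litmuschaos_failed_experiments (check the metric names your exporter version exposes).
  • Clean‑up – Set jobCleanUpPolicy: "delete" in the engine spec to avoid orphaned pods.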
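
The Git-tag guard rail is not something Litmus checks by itself: annotationCheck only looks for the litmuschaos.io/chaos annotation on the target workload. A sketch of a CI step that ties the two together, assuming the pipeline runs inside a Git checkout and the target is a Deployment; enable_chaos_if_approved is a made-up helper name.

```shell
# enable_chaos_if_approved NAMESPACE DEPLOYMENT
# Annotates the Deployment for chaos only when the current commit carries
# the chaos-approved tag. Returns 1 (and changes nothing) otherwise.
enable_chaos_if_approved() {
  ns=$1
  deploy=$2
  if git tag --points-at HEAD | grep -qx 'chaos-approved'; then
    kubectl annotate "deployment/${deploy}" -n "$ns" \
      litmuschaos.io/chaos="true" --overwrite
    return $?
  fi
  echo "refusing: HEAD is not tagged chaos-approved" >&2
  return 1
}
```

With annotationCheck enabled on the engine, an untagged commit leaves the workload without the annotation, so the experiment has nothing it is allowed to target.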

Lessons Learned from the Field

  1. Network latency beats pod kill – In our microservice mesh, a 200 ms added latency caused timeouts more often than a full pod restart. Prioritize network-chaos experiments.
  2. Chaos UI is a double‑edged sword – Developers love the portal, but it also becomes a “run‑anywhere” button. Enforce RBAC: litmus-admin only for CI, litmus-viewer for dashboards.
  3. Don’t forget stateful workloads – For a Redis cluster we used pod-io-stress to simulate disk pressure, catching a bug where append‑only file (AOF) rewrites stalled under high I/O.
  4. Chaos budgets – Treat chaos like a test coverage metric. Aim for at least 75 % of services having one experiment per release cycle.
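
The chaos budget in the last item is simple arithmetic you can enforce in CI. A minimal sketch, assuming you can count the services that have at least one experiment and the total number of services yourself; chaos_coverage and meets_budget are illustrative names, and the percentage is rounded down.

```shell
# chaos_coverage COVERED TOTAL
# Prints the share of services with at least one experiment as a whole
# percentage, rounded down. Returns 2 if TOTAL is not positive.
chaos_coverage() {
  covered=$1
  total=$2
  if [ "$total" -le 0 ]; then
    echo "total must be positive" >&2
    return 2
  fi
  echo $((covered * 100 / total))
}

# meets_budget COVERED TOTAL [MIN_PERCENT]
# Succeeds when coverage is at or above the budget (75 % by default).
meets_budget() {
  pct=$(chaos_coverage "$1" "$2") || return 2
  [ "$pct" -ge "${3:-75}" ]
}
```

With 9 of 12 services covered, chaos_coverage 9 12 prints 75 and meets_budget 9 12 succeeds; at 8 of 12 the check fails.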

Wrap‑Up

Chaos engineering in Kubernetes is no longer an experimental hobby; it’s a production necessity. Pick a tool (LitmusChaos 2.13.0 is my go‑to), write a few YAML experiments, and bake them into your CI pipeline. The payoff is concrete: you’ll know your SLOs hold up when a node disappears, a network partition appears, or a storage volume hiccups.

Give it a try next week. Deploy Litmus, run the pod‑kill example, and watch your dashboards stay green. If they don’t, you’ve just uncovered the next ticket on your backlog – and that’s exactly why you ran the test.

Happy breaking!
