Technology | Software Engineering | June 19, 2026 | 3 min read

Kubernetes Canary Deployments: A Guide to Flagger and Istio

Master Kubernetes Canary Deployments using Flagger and Istio. Learn how to automate traffic shifting, run health checks, and achieve safer progressive delivery.

Tags: Kubernetes, DevOps, Istio, Flagger, SRE, Traffic Management, Progressive Delivery, Linux, Server

I’ve spent countless hours dealing with the "deploy and pray" method. You push a change, hold your breath, and watch the error rates climb. It’s stressful, and frankly, it’s unnecessary. If you’re running on Kubernetes, you have the tools to make this painless.

In this post, I’ll show you how to implement Kubernetes Canary Deployments using Flagger and Istio Service Mesh. We’ll move away from manual releases and toward automated progressive delivery.

Why Flagger and Istio?

Istio handles the heavy lifting of traffic routing at the network level, but managing those weight shifts manually is a recipe for disaster. That’s where Flagger comes in. It acts as an operator that watches your deployments, automates the canary analysis, and shifts traffic based on real-time metrics from Prometheus.

By combining these, you get:

Automated traffic shifting: Gradual increases in canary traffic in fixed steps (e.g., 10%, 20%, 30%).
Built-in rollbacks: If your error rates spike, Flagger automatically reverts to the previous version.
Metric-driven gates: Integration with Prometheus ensures you only promote if your latency and success rates remain within thresholds.

The Architecture

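Before we touch the code, understand the flow. You create a "Canary" custom resource that points at your Deployment. On first apply, Flagger bootstraps the rollout: it creates a "-primary" copy of the Deployment, scales the original down to zero, and generates the Kubernetes services and the Istio VirtualService. When you later update the Deployment’s image, Flagger detects the change, scales the updated Deployment back up as the canary, and starts shifting traffic from the primary to the canary via the VirtualService.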

Implementing the Canary Resource

First, ensure you have Istio (1.18+) and Flagger (1.30+) installed in your cluster. I’m assuming you have Prometheus running, as it’s the source of truth for your health checks.

Define your Canary resource like this:


YAML
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: backend-service
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-service
  service:
    port: 80
    gateways:
    - public-gateway.istio-system.svc.cluster.local
    hosts:
    - api.example.com
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m

Breaking Down the Analysis

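interval: How often Flagger checks the metrics.
threshold: How many failed checks trigger a rollback.
maxWeight: The highest percentage of traffic the canary receives before promotion.
stepWeight: The percentage of traffic added to the canary each cycle.
metrics: We’re tracking success rate (must stay at or above 99%) and P99 latency (must stay at or below 500ms).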
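To make the timing concrete, here is the same analysis block annotated with the arithmetic it implies. This is a rough sketch: the durations are lower bounds that assume every check passes on the first try, and they ignore any webhooks or startup time.

```yaml
analysis:
  interval: 1m     # one metrics check per minute
  threshold: 5     # rollback after 5 failed checks: 5 x 1m = roughly 5 minutes
  maxWeight: 50    # the canary never receives more than 50% before promotion
  stepWeight: 10   # 10 -> 20 -> 30 -> 40 -> 50: 5 steps x 1m = roughly 5 minutes minimum
```

Halving stepWeight to 5 would double the number of steps, and so roughly double the minimum rollout time, in exchange for exposing fewer users at each step.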

The Deployment Workflow

Once you apply this, the workflow changes. You stop managing the rollout by hand. You still update the image tag in your Deployment manifest and apply it, but from that point on Flagger takes the wheel.
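As a minimal sketch, the only manifest you touch might look like the one below. The image name, tag, and labels are hypothetical, and the container port is assumed to match the Canary’s service port because the Canary above sets no targetPort.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-service        # must match targetRef.name in the Canary
  namespace: prod
spec:
  selector:
    matchLabels:
      app: backend-service     # hypothetical label
  template:
    metadata:
      labels:
        app: backend-service
    spec:
      containers:
      - name: backend-service
        # Hypothetical image. Bumping the tag (say 1.0.0 -> 1.1.0) changes the
        # pod spec, which is the change Flagger reacts to.
        image: registry.example.com/backend-service:1.1.0
        ports:
        - containerPort: 80    # assumed to line up with service.port: 80
```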

When you change the image, Flagger:

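Scales up the canary pods from your updated backend-service Deployment, while the backend-service-primary Deployment (created when the Canary was first initialized) keeps serving live traffic.
Updates the Istio VirtualService to route 10% of traffic (one stepWeight) to the canary.
Queries Prometheus for the last minute of traffic.
If metrics are healthy, it increases the weight to 20%, then 30%, and so on.
Once it reaches 50% (maxWeight) and everything is green, it promotes the canary: it copies the new spec to the primary deployment, routes all traffic back to the primary, and scales the canary down to zero.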
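For a picture of what the traffic shift means in practice, here is a simplified sketch of the VirtualService during the first step with stepWeight: 10. Flagger owns this object, so you never write it yourself. The apiVersion and the extra in-mesh host are assumptions that depend on your Istio and Flagger versions, and the generated object carries more fields than shown.

```yaml
apiVersion: networking.istio.io/v1beta1   # assumption; varies by version
kind: VirtualService
metadata:
  name: backend-service
  namespace: prod
spec:
  gateways:
  - public-gateway.istio-system.svc.cluster.local
  hosts:
  - api.example.com
  - backend-service            # assumption: in-mesh host alongside the public one
  http:
  - route:
    - destination:
        host: backend-service-primary
      weight: 90               # stable version keeps most of the traffic
    - destination:
        host: backend-service-canary
      weight: 10               # first step: one stepWeight
```

On each healthy check, the two weights move by another 10 points until the canary reaches maxWeight.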

Lessons from the Trenches

I’ve learned a few hard lessons implementing this in production.

Don't ignore the feedback loop. If your analysis window is too short, you’ll promote buggy code. If it’s too long, your deployments will take forever. I usually stick to a 1-minute interval for 5-10 iterations. It’s the "Goldilocks" zone for most microservices.

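Watch your Prometheus queries. Flagger’s built-in success-rate and latency checks run specific Prometheus queries against the standard Istio telemetry (like istio_requests_total), which the Envoy sidecars emit rather than your application code. Make sure sidecar injection is enabled for the workload and that Prometheus is actually scraping those metrics. If the queries return no data, Flagger cannot advance the canary: it halts with a "no values found" warning, and the halted checks typically count toward the rollback threshold.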
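If you want the query to be explicit rather than built in, a custom MetricTemplate is one option. The sketch below approximates the kind of success-rate query Flagger runs for Istio. The template name and the Prometheus address are assumptions for illustration, so check both against your own cluster.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: success-rate           # hypothetical name
  namespace: prod
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090   # assumption: your Prometheus URL
  # Percentage of non-5xx requests seen by the canary workload.
  # {{ namespace }}, {{ target }} and {{ interval }} are filled in by Flagger.
  query: |
    sum(rate(istio_requests_total{
      reporter="destination",
      destination_workload_namespace="{{ namespace }}",
      destination_workload=~"{{ target }}",
      response_code!~"5.*"
    }[{{ interval }}]))
    /
    sum(rate(istio_requests_total{
      reporter="destination",
      destination_workload_namespace="{{ namespace }}",
      destination_workload=~"{{ target }}"
    }[{{ interval }}]))
    * 100
```

A metric in the Canary would then point at the template through templateRef (name and namespace) instead of relying on the built-in request-success-rate. Running the same query by hand in Prometheus is a quick way to confirm the data exists before you start a rollout.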

Use Webhooks for smoke tests. You can add webhooks to the analysis section to run automated integration tests during the canary phase. It’s the best way to catch logic errors that metrics alone might miss.
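Here is a sketch of what that could look like, as a fragment of the Canary’s analysis section. It assumes you have deployed Flagger’s load tester in the prod namespace and that the service exposes a /healthz endpoint. Both the URL and the endpoint are hypothetical.

```yaml
  analysis:
    webhooks:
    - name: smoke-test
      type: pre-rollout        # runs before any traffic is shifted to the canary
      url: http://flagger-loadtester.prod/   # assumption: load tester service URL
      timeout: 30s
      metadata:
        type: bash
        # Hypothetical check: fail the rollout if the canary's health endpoint errors.
        cmd: "curl -sf http://backend-service-canary.prod/healthz"
```

A failing pre-rollout hook stops the canary before real users see it, which is why smoke tests tend to fit better here than as a regular metric check.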

Final Thoughts

Implementing Kubernetes Canary Deployments isn't just about the technology; it's about shifting your mindset. You stop fearing the release because you've automated the safety net. With Flagger and Istio, you can sleep better knowing the system is watching your error rates for you.

Start small. Apply this to a non-critical service first. Once you see the traffic shifting in the Istio dashboard, you'll never want to go back to manual updates again.
