Learn how to implement GitOps-driven observability using OpenSLO and Prometheus. Stop chasing noise and start managing Service Level Objectives as code today.
I’ve spent too many late nights chasing alerts that didn't actually matter. If you’ve worked in SRE or DevOps for a while, you know the drill: your PagerDuty goes off at 3:00 AM because a CPU spike hit a threshold that hasn't been relevant for six months. We call this "alert fatigue," but it’s really a failure of observability strategy.
The shift toward GitOps changed how we deploy infrastructure and applications. Now, it’s time to apply that same rigor to our Service Level Objectives (SLOs). By treating SLOs as code using OpenSLO and Prometheus, we move from reactive threshold-based alerting to proactive, error-budget-based reliability.
In a traditional setup, you define alerts in Prometheus alerting rules or Grafana dashboards. If you want to change an objective, you’re clicking through UIs or manually editing YAML files buried in a repository, with no review step and no record of why a threshold moved. It’s brittle.
When you adopt GitOps for your observability stack, you store your SLO definitions in version control. This forces a code review process, keeps a history of why an objective was changed, and allows you to sync your production state with your source of truth automatically.
We’re going to use OpenSLO, a vendor-agnostic specification, to define our objectives. Then we’ll use a generator such as sloth to compile these specifications into Prometheus recording and alerting rules. pyrra follows the same idea using its own ServiceLevelObjective resource.
OpenSLO uses YAML to describe what success looks like. Let’s say we want to track the latency of our authentication service.
```yaml
# Illustrative layout: check field names against the OpenSLO schema
# version your generator supports.
apiVersion: openslo/v1
kind: SLO
metadata:
  name: auth-service-latency
spec:
  service: auth-service
  description: "99% of requests should be faster than 200ms"
  budgeting:
    type: RollingWindow
    window: 30d
  objectives:
    - displayName: High Latency
      indicator:
        spec:
          type: latency
          source: prometheus
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="auth"}[5m])) by (le))
      target: 0.99
      value: 0.2
```
This YAML is readable, version-controlled, and tells the entire team exactly what the performance expectations are.
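To make the 99% target concrete, it can help to turn it into a budget of slow requests. The sketch below is a minimal illustration, not part of OpenSLO or sloth; the function names and the 10 million request volume are assumptions chosen only to make the arithmetic easy to follow.

```python
def allowed_bad_events(target: float, total_events: int) -> int:
    """Number of events that may miss the objective without breaching the SLO."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return round(total_events * (1.0 - target))


def budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed = allowed_bad_events(target, total_events)
    if allowed == 0:
        raise ValueError("not enough traffic to derive a budget")
    return 1.0 - bad_events / allowed


# Hypothetical traffic: 10 million auth requests in the 30-day window.
total = 10_000_000
print(allowed_bad_events(0.99, total))        # slow requests the SLO tolerates
print(budget_remaining(0.99, total, 25_000))  # share of budget left after 25,000 slow requests
```

Under these assumed numbers, a 99% target tolerates 100,000 requests slower than 200ms, so 25,000 slow requests leave three quarters of the budget unspent.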
You don't want to write complex PromQL recording rules by hand. They’re prone to error and hard to debug. Instead, use a generator. If you’re using sloth (v0.11.0+), you can generate the rules directly from your OpenSLO file:
```bash
# Generate recording rules for Prometheus
sloth generate -i openslo.yaml -o prometheus-rules.yaml
```
The output creates a set of recording rules that calculate your error budget burn rate. This is the secret sauce: instead of alerting on latency, you alert on the rate at which you are burning your error budget.
When you implement SLO-based alerting, you stop alerting on "CPU > 80%." Instead, you alert when the error budget is being consumed too quickly.
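Burn rate is the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 spends the budget exactly over the full window, and anything higher spends it sooner. The following is a rough sketch of that arithmetic; the function names are made up for illustration and do not correspond to anything sloth generates.

```python
import math


def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return error_ratio / (1.0 - target)


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the current burn rate holds."""
    if rate <= 0:
        return math.inf
    return window_days / rate


# Assumed scenario: 14.4% of requests miss a 99% objective.
rate = burn_rate(error_ratio=0.144, target=0.99)
print(f"burn rate {rate:.1f}, budget exhausted in {days_until_exhausted(rate):.2f} days")
```

This is where a threshold like 14.4 comes from. At that rate, a 30-day budget lasts 30 / 14.4, or roughly two days.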
Here is a simplified Prometheus alert rule of the kind this workflow produces. The metric names are illustrative; the names your generator emits will differ:

```yaml
groups:
  - name: slos
    rules:
      - alert: SLOErrorBudgetBurnRateHigh
        expr: |
          (job_error_budget_burn_rate_fast > 14.4)
          and
          (job_error_budget_burn_rate_slow > 5)
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate"
          description: "The auth-service is burning its error budget fast enough to exhaust the 30-day budget in about 2 days."
```
By combining the "Fast Burn" (1-hour window) and "Slow Burn" (6-hour window) conditions, you filter out most false positives. If a short blip happens, the fast burn rate crosses its threshold but the slow burn rate doesn't, so you don't wake up. If the service is actually failing, both cross their thresholds, and you’ll know it’s time to act.
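The two-window logic can be simulated offline to see why a blip does not page. This sketch assumes one error-ratio sample per minute and uniform traffic, so a plain average stands in for the windowed ratio. It reuses the 14.4 and 5 thresholds from the rule above and ignores the `for:` clause. All names here are hypothetical.

```python
from statistics import fmean

TARGET = 0.99
FAST_WINDOW_MIN = 60    # 1-hour window
SLOW_WINDOW_MIN = 360   # 6-hour window
FAST_THRESHOLD = 14.4
SLOW_THRESHOLD = 5.0


def window_burn_rate(error_ratios: list[float], window_minutes: int) -> float:
    """Average burn rate over the most recent `window_minutes` samples."""
    recent = error_ratios[-window_minutes:]
    if not recent:
        return 0.0
    return fmean(recent) / (1.0 - TARGET)


def should_page(error_ratios: list[float]) -> bool:
    """Page only when both the fast and the slow window exceed their thresholds."""
    fast = window_burn_rate(error_ratios, FAST_WINDOW_MIN)
    slow = window_burn_rate(error_ratios, SLOW_WINDOW_MIN)
    return fast > FAST_THRESHOLD and slow > SLOW_THRESHOLD


# Ten minutes of total failure after a quiet period: fast window trips, slow does not.
blip = [0.0] * 350 + [1.0] * 10
# Two hours at a 20% error ratio: both windows trip.
sustained = [0.0] * 240 + [0.2] * 120

print("blip pages:", should_page(blip))
print("sustained outage pages:", should_page(sustained))
```

In the blip case the 1-hour burn rate is about 16.7 while the 6-hour burn rate stays near 2.8, so the combined condition stays false.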
Now that your SLOs are YAML files and your rules are generated, put them in your infrastructure repository. Use a tool like ArgoCD or Flux to sync these to your Kubernetes cluster.
Commit openslo.yaml to your observability repo. Your GitOps controller then applies the generated PrometheusRule Custom Resource in your K8s cluster.

This workflow ensures that your Service Level Objectives are always in sync with your production environment. No more "drift" between what you think you're monitoring and what's actually configured.
Transitioning to GitOps-driven observability isn't just about the tools; it's about shifting the culture. It forces you to define what "healthy" actually means for your users. Start small. Pick one service, define one SLO, and automate the deployment of that single rule. Once you see the noise drop and the signal rise, you won't want to go back to static thresholds.
If you’re running Prometheus v2.45+ and using a GitOps controller, there’s no reason your reliability standards shouldn't be as robust as your application code.