Software Engineering | Technology | June 19, 2026 | 4 min read

GitOps-Driven Observability: Implementing SLO-Based Alerting with OpenSLO

Learn how to implement GitOps-driven observability using OpenSLO and Prometheus. Stop chasing noise and start managing Service Level Objectives as code today.



I’ve spent too many late nights chasing alerts that didn't actually matter. If you’ve worked in SRE or DevOps for a while, you know the drill: your PagerDuty goes off at 3:00 AM because a CPU spike hit a threshold that hasn't been relevant for six months. We call this "alert fatigue," but it’s really a failure of observability strategy.

The shift toward GitOps changed how we deploy infrastructure and applications. Now, it’s time to apply that same rigor to our Service Level Objectives (SLOs). By treating SLOs as code using OpenSLO and Prometheus, we move from reactive threshold-based alerting to proactive, error-budget-based reliability.

Why SLOs as Code?

In a traditional setup, you define alerts in Prometheus alerting rules or Grafana dashboards. If you want to change an objective, you’re clicking through UIs or hand-editing rule files with no review and no record of why a threshold changed. It’s brittle.

When you adopt GitOps for your observability stack, you store your SLO definitions in version control. This forces a code review process, keeps a history of why an objective was changed, and allows you to sync your production state with your source of truth automatically.

The Stack: OpenSLO and Prometheus

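We’re going to use OpenSLO, a vendor-agnostic specification, to define our objectives. Then we’ll use a generator such as sloth to compile these specifications into Prometheus recording and alerting rules. pyrra does a similar job, but as far as I know it is built around its own ServiceLevelObjective resource, so check its documentation before feeding it OpenSLO files.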

1. Defining the SLO in OpenSLO

OpenSLO uses YAML to describe what success looks like. Let’s say we want to track the latency of our authentication service.


YAML
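apiVersion: openslo/v1
kind: SLO
metadata:
  name: auth-service-latency
spec:
  service: auth-service
  description: "99% of requests should be faster than 200ms"
  budgetingMethod: Occurrences
  timeWindow:
    - duration: 30d
      isRolling: true
  indicator:
    metadata:
      name: auth-service-p99-latency  # illustrative name
    spec:
      thresholdMetric:
        metricSource:
          type: Prometheus
          spec:
            query: |
              histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="auth"}[5m])) by (le))
  objectives:
    - displayName: High Latency
      op: lte       # the p99 latency must be at or below `value`
      value: 0.2    # seconds
      target: 0.99  # fraction of evaluations that must meet the threshold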

This YAML is readable, version-controlled, and tells the entire team exactly what the performance expectations are.
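To make the 0.99 target concrete, it helps to turn it into an error budget: the number of requests that are allowed to miss the objective over the window. Here’s a minimal sketch of that arithmetic. The function name and the traffic figure are made up for illustration; they aren’t part of OpenSLO or any generator.

```python
def error_budget(target: float, total_requests: int) -> int:
    """Number of requests allowed to miss the objective in the window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return round((1.0 - target) * total_requests)


# Hypothetical traffic: 1,000,000 auth requests over the 30-day window.
print(error_budget(0.99, 1_000_000))  # 10000
```

With a 99% target, one request in a hundred can be slow before the objective is missed. Everything that follows is about how fast that allowance is being spent.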

2. Compiling to Prometheus Rules

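You don't want to write complex PromQL recording rules by hand. They’re error-prone and hard to debug. Instead, use a generator. If you’re using sloth (v0.11.0+), you can generate the rules from your spec file. Sloth’s OpenSLO support has been narrower than the full specification, so confirm that your version accepts the apiVersion and metric type you used before wiring this into CI:


Bash
# Generate Prometheus rules from the SLO spec (-i: input file, -o: output file)
sloth generate -i openslo.yaml -o prometheus-rules.yaml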

The output creates a set of recording rules that calculate your error budget burn rate. This is the secret sauce: instead of alerting on latency, you alert on the rate at which you are burning your error budget.

Alerting on Burn Rates

When you implement SLO-based alerting, you stop alerting on "CPU > 80%." Instead, you alert when the error budget is being consumed too quickly.
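If "burn rate" is new to you, the arithmetic is short. A burn rate of 1 means you are spending the budget exactly fast enough to use it all by the end of the window; anything higher means you run out early. Here’s a rough sketch, assuming the 30-day window and 0.99 target from the SLO above (the helper names are mine, not something a generator emits):

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - target)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole window's budget is gone at a constant burn rate."""
    return float("inf") if rate <= 0 else window_days / rate


def budget_spent(rate: float, hours: float, window_days: float = 30.0) -> float:
    """Fraction of the window's budget consumed after `hours` at this rate."""
    return rate * hours / (window_days * 24.0)


# A burn rate of 14.4 empties a 30-day budget in roughly two days...
print(round(days_to_exhaustion(14.4), 2))     # 2.08
# ...and spends about 2% of the budget in a single hour.
print(round(budget_spent(14.4, hours=1), 4))  # 0.02
```

That 2%-in-an-hour figure is where the 14.4 threshold commonly used for paging alerts comes from.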

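Here is a simplified Prometheus alert rule of the kind this workflow produces. The metric names are illustrative; the recording rules your generator emits will be named differently:


YAML
groups:
- name: slos
  rules:
  - alert: SLOErrorBudgetBurnRateHigh
    # Page only when both the fast (1h) and slow (6h) burn rates are elevated.
    expr: |
      (job_error_budget_burn_rate_fast > 14.4)
      and (job_error_budget_burn_rate_slow > 5)
    # The burn-rate windows already smooth the signal; a long `for` only delays the page.
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error budget burn rate"
      description: "The auth-service is burning its 30-day error budget fast enough to exhaust it in about 2 days."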

By combining the "Fast Burn" (1-hour window) and "Slow Burn" (6-hour window) conditions, you filter out most false positives. If a brief blip happens, the fast burn rate crosses its threshold but the slow one doesn't, so you don't get paged. If the service is actually failing, both cross their thresholds, and you’ll know it’s time to act.
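The two-window check is easy to reason about as plain code. This is a toy model of the alert expression, not what Prometheus actually runs: the thresholds are the ones from the rule above, and the burn-rate inputs are invented numbers.

```python
FAST_THRESHOLD = 14.4  # burn rate over the 1-hour window
SLOW_THRESHOLD = 5.0   # burn rate over the 6-hour window


def should_page(fast_burn: float, slow_burn: float) -> bool:
    """Page only when both windows agree that the budget is burning too fast."""
    return fast_burn > FAST_THRESHOLD and slow_burn > SLOW_THRESHOLD


# A short blip: the 1h window spikes, the 6h window barely moves.
print(should_page(fast_burn=20.0, slow_burn=2.0))  # False

# A sustained failure: both windows are elevated.
print(should_page(fast_burn=20.0, slow_burn=8.0))  # True
```

The slow window acts as a confirmation that the problem is significant, while the fast window makes sure the alert clears soon after the problem stops.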

Integrating with GitOps

Now that your SLOs are YAML files and your rules are generated, put them in your infrastructure repository. Use a tool like Argo CD or Flux to sync these to your Kubernetes cluster.

Commit: Push your openslo.yaml to your observability repo.
CI: Run a pipeline to lint and compile the YAML to Prometheus rules.
CD: Argo CD detects the change and updates the PrometheusRule custom resource in your K8s cluster (this assumes the Prometheus Operator is installed).

This workflow ensures that your Service Level Objectives are always in sync with your production environment. No more "drift" between what you think you're monitoring and what's actually configured.
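The lint step in CI doesn't have to be elaborate to be useful. Here’s a sketch of the kind of sanity check I mean, assuming your pipeline has already parsed the SLO file into a dict with whatever YAML loader you use. `lint_slo` is a hypothetical helper and the checks are my own picks, not an official OpenSLO validator.

```python
def lint_slo(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    if doc.get("kind") != "SLO":
        problems.append("kind must be 'SLO'")
    if not (doc.get("metadata") or {}).get("name"):
        problems.append("metadata.name is required")

    objectives = (doc.get("spec") or {}).get("objectives") or []
    if not objectives:
        problems.append("spec.objectives must contain at least one objective")
    for i, objective in enumerate(objectives):
        target = objective.get("target")
        if not isinstance(target, (int, float)) or not 0.0 < target < 1.0:
            problems.append(f"objectives[{i}].target must be between 0 and 1")
    return problems


# Example: a target written as a percentage instead of a fraction.
bad = {
    "kind": "SLO",
    "metadata": {"name": "auth-service-latency"},
    "spec": {"objectives": [{"target": 99}]},
}
print(lint_slo(bad))  # ['objectives[0].target must be between 0 and 1']
```

Failing the pipeline when the list is non-empty catches the most common mistake, a target of 99 instead of 0.99, before it ever reaches the cluster.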

Final Thoughts

Transitioning to GitOps-driven observability isn't just about the tools; it's about shifting the culture. It forces you to define what "healthy" actually means for your users. Start small. Pick one service, define one SLO, and automate the deployment of that single rule. Once you see the noise drop and the signal rise, you won't want to go back to static thresholds.

If you’re running Prometheus v2.45+ and using a GitOps controller, there’s no reason your reliability standards shouldn't be as robust as your application code.
