Software Engineering · Technology · June 18, 2026 · 7 min read

OpenTelemetry in Kubernetes: End‑to‑End Tracing, Metrics & Logging Guide

OpenTelemetry, Kubernetes observability, and distributed tracing made simple. Learn to collect traces, metrics, and logs in one pipeline with ready‑to‑use configs and code samples.

Tags: OpenTelemetry, Kubernetes, tracing, metrics, logging, observability, DevOps, Linux, Server

Keywords: OpenTelemetry, Kubernetes observability, distributed tracing, metrics collection, logging integration

Why I Switched to OpenTelemetry

I spent two years juggling Jaeger, Prometheus, and Fluent Bit separately. The context switch cost me time and introduced gaps—some requests never showed up in the trace graph, metrics lagged, and logs were hard to correlate. When the OpenTelemetry project hit v1.26.0 (Oct 2023) with stable collector binaries, I finally had a single, vendor‑agnostic data plane.

In this post I’ll walk you through a production‑ready setup:

Instrument a Go microservice with the OpenTelemetry SDK.
Deploy the OpenTelemetry Collector as a DaemonSet for node‑level telemetry.
Export traces to Jaeger, metrics to Prometheus, and logs to Loki—all via the collector.

You can copy the snippets, drop them into a cluster running Kubernetes 1.27, and be up and running in under an hour.

1. Instrumenting Your Application

1.1 Choose the right SDK

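For Go services I use go.opentelemetry.io/otel v1.20.0. The core module covers manual instrumentation; ready‑made wrappers for HTTP and gRPC (otelhttp, otelgrpc) live in the separate opentelemetry-go-contrib repository, and database drivers are covered by contrib or third‑party wrappers. If you’re on Java, the equivalent is io.opentelemetry:opentelemetry-sdk 1.32.0.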

1.2 Minimal tracing code


Go
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.22.0"
)

// initTracer installs a global tracer provider that batches spans to the
// collector over OTLP/gRPC and returns the provider's shutdown function.
func initTracer() func(context.Context) error {
	// Collector endpoint is injected via env var OTEL_EXPORTER_OTLP_ENDPOINT
	exporter, err := otlptracegrpc.New(context.Background())
	if err != nil {
		log.Fatalf("failed to create exporter: %v", err)
	}
	// Resource attributes are attached to every span this process emits.
	r := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("orders-api"),
		attribute.String("environment", "prod"),
	)
	bsp := sdktrace.NewBatchSpanProcessor(exporter)
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithResource(r),
		sdktrace.WithSpanProcessor(bsp),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown
}

func main() {
	shutdown := initTracer()
	// Shutdown flushes any spans still sitting in the batch processor.
	defer func() {
		if err := shutdown(context.Background()); err != nil {
			log.Printf("tracer shutdown error: %v", err)
		}
	}()

	mux := http.NewServeMux()
	mux.HandleFunc("/order", func(w http.ResponseWriter, r *http.Request) {
		ctx, span := otel.Tracer("orders-handler").Start(r.Context(), "CreateOrder")
		defer span.End()
		// Business logic here; pass ctx down so child spans attach to CreateOrder.
		_ = ctx
		w.WriteHeader(http.StatusCreated)
	})
	// Log instead of log.Fatal so the deferred shutdown above still runs.
	if err := http.ListenAndServe(":8080", mux); err != nil {
		log.Printf("server stopped: %v", err)
	}
}

Key points

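The exporter reads its target from OTEL_EXPORTER_OTLP_ENDPOINT. With a DaemonSet collector, point it at the node IP, for example http://$(HOST_IP):4317 with HOST_IP taken from status.hostIP; inside an app pod, localhost is the pod itself, so the collector has to publish 4317 as a hostPort. The http:// scheme also tells the gRPC exporter not to use TLS.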
Use resource to tag every span with service name and environment—critical for filtering in Jaeger.

1.3 Adding metrics


Go
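import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric" // used when recording with attributes
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
)

// initMeter takes the same resource used for tracing, so metrics and spans
// carry identical service name and environment attributes.
func initMeter(res *resource.Resource) func(context.Context) error {
	exp, err := otlpmetricgrpc.New(context.Background())
	if err != nil {
		log.Fatalf("failed to create metric exporter: %v", err)
	}
	provider := sdkmetric.NewMeterProvider(
		// Push accumulated metrics to the collector every 15 seconds.
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(15*time.Second))),
		sdkmetric.WithResource(res),
	)
	otel.SetMeterProvider(provider)
	return provider.Shutdown
}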

Define a counter:


Go
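// Int64Counter returns (counter, error); create the instrument once at startup.
ordersCreated, err := otel.Meter("orders-api").Int64Counter("orders_created_total")
if err != nil {
	log.Fatalf("failed to create counter: %v", err)
}
// Attributes are passed as an option, not as bare key-values.
ordersCreated.Add(ctx, 1, metric.WithAttributes(attribute.String("status", "success")))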

1.4 Structured logging with OpenTelemetry

The Go SDK at this version doesn’t ship a stable logs API, but you can inject trace IDs into logs yourself. Using logrus v1.9.0:


Go
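import (
	"context"

	"github.com/sirupsen/logrus"
	"go.opentelemetry.io/otel/trace"
)

// logWithTrace logs msg and, when ctx carries a valid span, adds its IDs.
func logWithTrace(ctx context.Context, msg string) {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		// No active span: log without all-zero IDs that would pollute searches.
		logrus.Info(msg)
		return
	}
	fields := logrus.Fields{
		"trace_id": sc.TraceID().String(),
		"span_id":  sc.SpanID().String(),
	}
	logrus.WithFields(fields).Info(msg)
}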

Now every log line carries the trace context, making correlation in Loki trivial.
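If you would rather stay on the standard library, the same idea can be sketched with log/slog (Go 1.21+) instead of logrus. This is an illustration under assumptions, not the setup used in this post: traceIDs and logLineWithIDs are hypothetical names, and the IDs are passed in as plain strings where the real service would read them from the span context.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// traceIDs is a hypothetical stand-in for the values you would read from
// trace.SpanFromContext(ctx).SpanContext() in the instrumented service.
type traceIDs struct{ TraceID, SpanID string }

type traceIDsKey struct{}

// traceHandler wraps another slog.Handler and stamps trace_id/span_id on
// every record whose context carries traceIDs.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if ids, ok := ctx.Value(traceIDsKey{}).(traceIDs); ok {
		r.AddAttrs(
			slog.String("trace_id", ids.TraceID),
			slog.String("span_id", ids.SpanID),
		)
	}
	return h.Handler.Handle(ctx, r)
}

// logLineWithIDs logs msg as JSON and returns the decoded fields so the
// result is easy to inspect. Empty IDs simulate a context without a span.
func logLineWithIDs(traceID, spanID, msg string) (map[string]any, error) {
	ctx := context.Background()
	if traceID != "" {
		ctx = context.WithValue(ctx, traceIDsKey{}, traceIDs{traceID, spanID})
	}
	var buf bytes.Buffer
	logger := slog.New(traceHandler{slog.NewJSONHandler(&buf, nil)})
	logger.InfoContext(ctx, msg)

	fields := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := logLineWithIDs(
		"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", "order created")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["trace_id"], fields["span_id"], fields["msg"])
}
```

Putting the IDs in a handler rather than at each call site means a forgotten helper call can’t silently produce uncorrelated log lines. A production wrapper would also override WithAttrs and WithGroup so derived loggers keep the wrapper.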

2. Deploying the OpenTelemetry Collector

2.1 Collector image

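I run otel/opentelemetry-collector-contrib:0.103.0. The “contrib” build bundles the Prometheus and filelog receivers and the Loki exporter. It no longer ships a dedicated Jaeger exporter (removed in v0.86.0); Jaeger ingests OTLP directly, so traces go out through the plain otlp exporter.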

2.2 DaemonSet manifest


YAML
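apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector-sa
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.103.0
          args: ["--config=/etc/otel-collector-config.yaml"]
          securityContext:
            runAsUser: 0            # pod log files on the host are typically readable by root only
          ports:
            - containerPort: 4317   # OTLP gRPC
              hostPort: 4317        # lets app pods reach the collector at <node IP>:4317
            - containerPort: 8888   # collector's own metrics (otelcol_*)
            - containerPort: 8889   # Prometheus exporter (pipeline metrics)
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
            - name: varlogpods      # host logs for the filelog receiver
              mountPath: /var/log/pods
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
        - name: varlogpods
          hostPath:
            path: /var/log/pods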

2.3 ConfigMap – pipeline definition


YAML
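apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-nodes'
              kubernetes_sd_configs:
                - role: node
              relabel_configs:
                # Node discovery returns the kubelet address (:10250); rewrite
                # it to node-exporter's default port. "$$" escapes "$" so the
                # collector does not expand it as an environment variable.
                - source_labels: [__address__]
                  regex: '(.*):10250'
                  target_label: __address__
                  replacement: '$$1:9100'
      filelog:
        include:
          - /var/log/pods/**/*.log
        operators:
          # Assumes Docker json-file logs; containerd and CRI-O nodes write
          # CRI-format lines and need a different parser.
          - type: json_parser
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
            severity:
              parse_from: attributes.severity
    processors:
      batch:
        timeout: 5s
        send_batch_size: 512        # must not exceed send_batch_max_size
        send_batch_max_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
        spike_limit_mib: 100
    exporters:
      # The dedicated jaeger exporter was removed from contrib in v0.86.0;
      # send OTLP to Jaeger's OTLP gRPC port instead.
      otlp/jaeger:
        endpoint: jaeger-collector.observability.svc:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"    # 8888 is taken by the collector's own metrics
      loki:
        endpoint: http://loki.observability.svc:3100/loki/api/v1/push
    service:
      pipelines:
        # memory_limiter goes first so it can refuse data before batching.
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
        logs:
          receivers: [filelog]
          processors: [memory_limiter, batch]
          exporters: [loki]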

Why this shape?

OTLP receives both traces and metrics from the app containers.
Prometheus receiver scrapes node‑exporter metrics automatically.
Filelog receiver reads container logs mounted from the host (/var/log/pods/**).
Batch reduces network chatter; memory_limiter prevents OOM on busy nodes and should run first in each pipeline so it can push back before data is batched.

2.4 RBAC


YAML
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector-sa
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-binding
subjects:
  - kind: ServiceAccount
    name: otel-collector-sa
    namespace: observability
roleRef:
  kind: ClusterRole
  name: otel-collector-role
  apiGroup: rbac.authorization.k8s.io

Apply everything:


Bash
kubectl create namespace observability   # skip if the namespace already exists
kubectl apply -f rbac.yaml -f collector-config.yaml -f collector-daemonset.yaml

You’ll see the collector pods in observability namespace, each exposing 4317/tcp.

3. Wiring the Back‑ends

3.1 Jaeger


Bash
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm install jaeger jaegertracing/jaeger \
  --namespace observability \
  --set collector.service.enabled=true \
  --set query.service.type=LoadBalancer \
  --set collector.image.tag=1.53.0

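The Jaeger collector Service (jaeger-collector.observability.svc) resolves through cluster DNS. Because the contrib collector removed its dedicated jaeger exporter in v0.86.0, the traces pipeline has to send OTLP to Jaeger’s OTLP gRPC port (4317) rather than 14250; confirm the chart’s collector Service exposes it with kubectl get svc -n observability.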

3.2 Prometheus & Grafana


Bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace observability \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

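Prometheus has to scrape the collector’s /metrics endpoint (otel-collector:8888) before Grafana can chart it; with kube-prometheus-stack that usually means adding a PodMonitor or ServiceMonitor for the DaemonSet. Then import an OpenTelemetry Collector dashboard from grafana.com to visualize spans per minute, CPU usage, and queue depth. Note that dashboard ID 1860 is Node Exporter Full, which covers the node metrics rather than the collector itself.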

3.3 Loki


Bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace observability \
  --set loki.image.tag=2.9.2 \
  --set promtail.enabled=false

In Grafana, create a Loki data source pointing at http://loki.observability.svc:3100. You can now search logs by trace_id thanks to the fields we injected.

4. Verifying End‑to‑End Flow

Generate traffic: curl -X POST http://orders-api.prod.svc.cluster.local:8080/order.
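Check Jaeger: Open http://<lb-ip>:16686, look for service orders-api. With the sample code above you should see the custom CreateOrder span; HTTP and DB spans join the tree once you wrap the server and database client with their instrumentation packages.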
Metrics: In Grafana, open the “OpenTelemetry Collector” dashboard. Verify otelcol_exporter_sent_spans increments.
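Logs: In Grafana → Explore → Loki, pick a stream selector that matches your pods and filter on a trace ID copied from Jaeger, for example {exporter="OTLP"} |= "<trace-id>". trace_id is a field inside the log line, not a Loki label, so a selector such as {trace_id="*"} matches nothing.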
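A single curl produces one trace, which is easy to miss once sampling is on. The sketch below sends a small burst of POST requests and counts the 201 responses. The names postOrders and postToStandIn, the ORDERS_URL environment variable, and the in-process stand-in server are all assumptions made for illustration; they let the code run without a cluster.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
)

// postOrders sends n POST requests to url and returns how many came back
// with 201 Created, the status the /order handler writes.
func postOrders(client *http.Client, url string, n int) (int, error) {
	created := 0
	for i := 0; i < n; i++ {
		resp, err := client.Post(url, "application/json", nil)
		if err != nil {
			return created, err
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusCreated {
			created++
		}
	}
	return created, nil
}

// postToStandIn runs postOrders against an in-process stand-in that mimics
// the /order handler, so the sketch can be tried without a cluster.
func postToStandIn(n int) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))
	defer srv.Close()
	return postOrders(srv.Client(), srv.URL+"/order", n)
}

func main() {
	// ORDERS_URL is a hypothetical variable; inside the cluster it could be
	// http://orders-api.prod.svc.cluster.local:8080/order.
	url := os.Getenv("ORDERS_URL")

	var created int
	var err error
	if url == "" {
		created, err = postToStandIn(20)
	} else {
		created, err = postOrders(http.DefaultClient, url, 20)
	}
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("orders created:", created)
}
```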

If any step fails, start with kubectl logs -n observability ds/otel-collector and look for pipeline errors. Common gotchas:

Missing OTEL_EXPORTER_OTLP_ENDPOINT – set it via a ConfigMap or DownwardAPI env var in the pod spec.
Port conflicts – the collector’s gRPC port (4317) must be free on each node.
Loki rate limits – if you see “request body too large”, lower send_batch_max_size in the batch processor so each push to Loki carries fewer records.
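The first gotcha is cheap to catch at startup. Here is a minimal sketch of a fail-fast check, assuming the endpoint is given as a full URL with an explicit port; collectorAddr is a hypothetical helper, and the check is deliberately stricter than what the exporter itself accepts.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"os"
	"time"
)

// collectorAddr validates an OTEL_EXPORTER_OTLP_ENDPOINT value and returns
// the host:port to dial. It insists on an http(s) scheme and explicit port.
func collectorAddr(endpoint string) (string, error) {
	if endpoint == "" {
		return "", errors.New("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
	}
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("invalid endpoint %q: %w", endpoint, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("endpoint %q needs an http:// or https:// scheme", endpoint)
	}
	if u.Hostname() == "" || u.Port() == "" {
		return "", fmt.Errorf("endpoint %q needs an explicit host and port", endpoint)
	}
	return u.Host, nil
}

func main() {
	addr, err := collectorAddr(os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT"))
	if err != nil {
		fmt.Println("config error:", err)
		os.Exit(1)
	}
	// A plain TCP dial only proves the port is open, not that OTLP works,
	// but it separates "wrong address" from "collector rejected the data".
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		fmt.Println("collector unreachable:", err)
		os.Exit(1)
	}
	conn.Close()
	fmt.Println("collector reachable at", addr)
}
```

Running this as the first step of the service, or as an init container, turns a silent loss of telemetry into an immediate and readable failure.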

5. Production‑grade Tweaks

Concern	Recommended Setting
CPU limits	`resources: limits: cpu: "500m" requests: cpu: "200m"` per collector container
Memory	Keep `memory_limiter` limit at 400 MiB (see config), set the container memory limit above it, and monitor the collector's `otelcol_process_memory_rss`
TLS	Switch OTLP exporter to mTLS (`tls: {cert_file: /certs/client.crt, key_file: /certs/client.key}`) and enable `tls` on Jaeger/Loki
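Sampling	Sample in the SDK with `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.2`, or add the collector's `probabilistic_sampler` processor (`sampling_percentage: 20`) to the trace pipeline, to cut 80 % of trace traffic in high‑load clusters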
Autoscaling	If you prefer a Deployment over DaemonSet, enable HPA based on `otelcol_exporter_sent_spans` metric
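To make the sampling row concrete, here is a simplified sketch of how a parent‑based trace‑ID‑ratio decision can work. It is written in the spirit of the SDK sampler, not copied from it; shouldSample and parentBased are hypothetical names. The property that matters is that the decision depends only on the trace ID, so every service that sees the same trace agrees.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample keeps a trace when the low 8 bytes of its ID, read as a
// 63-bit number, fall below ratio * 2^63. Same ID and ratio, same answer.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

// parentBased honors the parent's decision when a parent exists and applies
// the ratio only to root spans, so a trace is never sampled in part.
func parentBased(hasParent, parentSampled bool, traceID [16]byte, ratio float64) bool {
	if hasParent {
		return parentSampled
	}
	return shouldSample(traceID, ratio)
}

func main() {
	raw, err := hex.DecodeString("4bf92f3577b34da6a3ce929d0e0e4736")
	if err != nil {
		panic(err)
	}
	var id [16]byte
	copy(id[:], raw)

	fmt.Println("root span kept at ratio 0.2:", parentBased(false, false, id, 0.2))
	fmt.Println("child of sampled parent kept:", parentBased(true, true, id, 0.2))
}
```

Sampling at the SDK saves application CPU and network, while sampling in the collector keeps the decision in one place; either way the ratio is a trade between cost and the chance of catching a rare failure.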

6. Lessons Learned

One collector per node beats a single central collector – network hops drop, and node‑level metrics stay accurate.
Never rely on auto‑instrumentation alone – custom spans (business logic) give you the real insight.
Inject trace IDs into logs early – retro‑fitting later is a nightmare.
Version lock – OpenTelemetry components evolve fast; pin the collector image, SDK, and exporter versions to avoid breaking changes.

Implementing OpenTelemetry in Kubernetes isn’t a “set‑and‑forget” task, but once the pipeline runs, you gain a unified view that cuts mean‑time‑to‑detect (MTTD) by half in my experience. Give it a try, tweak the processors to match your traffic, and watch your observability maturity climb.

Next steps

Add a remote‑write exporter (for example prometheusremotewrite) to push metrics to a long‑term time‑series database of your choice.
Experiment with OpenTelemetry Semantic Conventions for Kubernetes (k8s.pod.name, k8s.namespace.name).
Explore OTel Collector’s “tail sampling” for cost‑effective long‑term storage.

Happy tracing!
