Software Engineering · Technology · June 18, 2026 · 7 min read

OpenTelemetry in Kubernetes: End‑to‑End Tracing, Metrics & Logging

OpenTelemetry + Kubernetes observability explained: a step‑by‑step guide to collect distributed traces, metrics, and logs from Java, Go, and Python services using Otel Collector, Prometheus, and Loki.

Keywords: OpenTelemetry, Kubernetes observability, distributed tracing, metrics collection, logging integration

Why I’m writing this

A few months back my team hit a classic production problem: a latency spike in a microservice, with no visibility into the request flow, no alert thresholds on our metrics, and logs scattered across three different namespaces. We patched things together with ad‑hoc scripts, but that quickly became a maintenance nightmare.

Instead of patching further, I went all‑in on OpenTelemetry (OTel). The result? One unified pipeline that captures traces, metrics, and logs from every pod, stores them in Tempo, Prometheus, and Loki, and visualizes everything in Grafana. This post is the exact recipe I used, with version numbers, Helm values, and code snippets you can copy‑paste.

What you’ll get

A working OpenTelemetry Collector deployed via Helm (chart version 0.83.0).
Instrumentation libraries for Java (Spring Boot 3.2), Go (1.22), and Python (3.11).
Metrics scraped by Prometheus (v2.48) and logs shipped to Loki (v2.9).
Grafana dashboards for distributed tracing, latency heatmaps, and error rates.

Once the make deploy target lands in the repo, everything will be reproducible with a single command.

Prerequisites

Item	Minimum version
Kubernetes cluster	1.27 (any CNCF‑certified distro)
Helm	3.12
kubectl	1.27
Docker	24.0
Grafana	10.2 (optional – for visualization)

Make sure kubectl points to a cluster where you have cluster‑admin rights; the Helm charts below create ClusterRoles and ServiceAccounts.
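
Checking the table by hand gets tedious across several clusters. Here is a minimal sketch of a version gate, assuming you have already collected each tool's version string (for example from kubectl version or helm version). The `unmet` helper and the sample strings are hypothetical.

```python
import re

# Minimum (major, minor) versions from the prerequisites table.
MINIMUMS = {
    "kubernetes": (1, 27),
    "helm": (3, 12),
    "kubectl": (1, 27),
    "docker": (24, 0),
}


def parse_version(text):
    """Extract (major, minor) from strings such as 'v1.27.3' or '24.0.7'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def unmet(found):
    """Return the tools whose reported version is below the minimum."""
    return sorted(
        tool for tool, text in found.items()
        if parse_version(text) < MINIMUMS[tool]
    )


# Made-up sample: helm is one minor version too old.
found = {
    "kubernetes": "v1.29.4",
    "helm": "v3.11.3",
    "kubectl": "v1.27.0",
    "docker": "24.0.7",
}
print(unmet(found))
```

Tuples compare element by element, so (3, 11) < (3, 12) does the right thing without any string tricks.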

1. Deploy the OpenTelemetry Collector

I use the official Helm chart from the OpenTelemetry community. It gives us a gateway mode collector that receives data from all pods and forwards it to the back‑ends.


Bash
helm repo add otel https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm upgrade --install otel-collector otel/opentelemetry-collector \
  --namespace observability \
  --create-namespace \
  --version 0.83.0 \
  -f - <<EOF
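mode: "deployment"
replicaCount: 2
# Name the Service "otel-collector" so the endpoints used later in this post resolve.
fullnameOverride: "otel-collector"
ports:
  # Collector self-telemetry (otelcol_* metrics) on 8888; the chart disables this port by default.
  metrics:
    enabled: true
  # Port served by the prometheus exporter configured below.
  prom-exporter:
    enabled: true
    containerPort: 8889
    servicePort: 8889
    protocol: TCP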
config:
  receivers:
    otlp:
      protocols:
        grpc:
        http:
  processors:
    batch:
    memory_limiter:
      limit_mib: 400
      check_interval: 5s
  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
    loki:
      endpoint: "http://loki:3100/loki/api/v1/push"
    zipkin:
      endpoint: "http://tempo:9411/api/v2/spans"
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [zipkin]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [prometheus]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loki]
EOF

What’s happening?

The collector runs as a Deployment with two replicas for HA.
It listens on the OTLP gRPC/HTTP ports (default 4317/4318).
Traces go to Tempo (Zipkin protocol) – you can swap it for Jaeger if you prefer.
Metrics are exposed on :8889 for Prometheus to scrape.
Logs are shipped to Loki.

Tip: Keep the memory_limiter processor; without it the collector can OOM on busy clusters.
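
Order matters as much as presence: memory_limiter should be the first processor in each pipeline so it can push back before batching adds to memory use. Here is a minimal sketch of a pre‑deploy check, assuming the Helm values have already been loaded into a Python dict. The `check_pipelines` helper and the trimmed sample are hypothetical, not part of any OTel tooling.

```python
def check_pipelines(config):
    """Return a list of problems found in a collector config dict."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for name, pipeline in pipelines.items():
        # Every component a pipeline references must be declared at the top level.
        for kind in ("receivers", "processors", "exporters"):
            declared = config.get(kind) or {}
            for component in pipeline.get(kind, []):
                if component not in declared:
                    problems.append(f"{name}: {kind[:-1]} '{component}' is not declared")
        # memory_limiter should run before any other processor.
        processors = pipeline.get("processors", [])
        if not processors or processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter is not the first processor")
    return problems


# Trimmed copy of the values above; the logs pipeline is deliberately broken.
sample = {
    "receivers": {"otlp": None},
    "processors": {"batch": None, "memory_limiter": {"limit_mib": 400}},
    "exporters": {"zipkin": {}, "loki": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["zipkin"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["lokii"],
            },
        }
    },
}

for problem in check_pipelines(sample):
    print(problem)
```

The sample reports the misspelled exporter and the missing memory_limiter in the logs pipeline, and nothing for traces.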

2. Install the back‑ends (Prometheus, Loki, Tempo)

All three are available as Helm charts. Below is a minimal set‑up that works as a starting point; size storage and retention for your own traffic before treating it as production‑ready.


Bash
# Prometheus
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace observability \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.enabled=true \
  --set grafana.defaultDashboardEnabled=true \
  --version 55.7.0

# Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-stack \
  --namespace observability \
  --set loki.image.tag=2.9.1 \
  --set promtail.enabled=true \
  --set promtail.config.client.url=http://loki:3100/loki/api/v1/push \
  --version 2.9.0

# Tempo
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install tempo grafana/tempo \
  --namespace observability \
  --set tempo.image.tag=2.4.0 \
  --set tempo.storage.trace.blocksize=64Mi \
  --set tempo.storage.trace.retention=48h \
  --version 1.6.0

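You now have a full observability stack in the observability namespace. The kube-prometheus-stack chart provisions the Prometheus data source in its bundled Grafana; check under Connections → Data sources that Loki and Tempo are listed as well, and add them by hand if they are not.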

3. Instrument a Java service (Spring Boot)

Add the OTel starter to your pom.xml:


XML
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
  <version>1.34.1</version>
</dependency>

Create an application.yaml (or application.properties) that points to the collector:


YAML
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector.observability:4317
      compression: gzip
  metric:
    export:
      interval: 30s
  resource:
    attributes:
      service.name: orders-service
      deployment.environment: prod

Dockerfile (multi‑stage, Java 21):


DOCKERFILE
FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app
COPY . .
RUN ./mvnw -B package -DskipTests

FROM eclipse-temurin:21-jre-alpine
COPY --from=build /app/target/orders-service.jar /app.jar
# The Spring Boot starter instruments the app in-process, so no -javaagent flag is needed.
ENTRYPOINT ["java","-jar","/app.jar"]

Deploy with a simple manifest:


YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: java
          image: ghcr.io/yourorg/orders-service:1.0.0
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.observability:4317"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "service.name=orders-service,deployment.environment=prod"
          ports:
            - containerPort: 8080

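Once the pod is serving requests, you’ll see trace IDs in the logs (trace_id=...) and HTTP server metrics such as http_server_request_duration_seconds_count in Prometheus. The exact metric name depends on the instrumentation version; http_server_requests_seconds_count is the Micrometer name and only shows up if Spring Boot Actuator metrics are exported as well.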
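
The OTEL_RESOURCE_ATTRIBUTES value in the manifest is a comma‑separated list of key=value pairs, and a typo in it is easy to miss until a service shows up in Grafana under the wrong name. Here is a simplified sketch of how such a string breaks down. The `parse_resource_attributes` helper is hypothetical, and the SDKs apply their own parsing rules.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw):
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict."""
    attributes = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate stray or trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed attribute: {pair!r}")
        # Values may be percent-encoded; decode them for display.
        attributes[key.strip()] = unquote(value.strip())
    return attributes


attrs = parse_resource_attributes(
    "service.name=orders-service,deployment.environment=prod"
)
print(attrs)
```

Running a check like this in CI against the manifest value catches a missing "=" before the pod ever starts.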

4. Instrument a Go service

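Add the OTel Go SDK (v1.28.0), the OTLP gRPC trace exporter, and the HTTP instrumentation:


Bash
go get go.opentelemetry.io/otel@v1.28.0
go get go.opentelemetry.io/otel/sdk@v1.28.0
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.28.0
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.53.0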

Initialize the tracer in main.go:


Go
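package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracer registers a global TracerProvider that exports spans to the
// collector over OTLP/gRPC and returns the provider's shutdown function.
func initTracer() func(context.Context) error {
	ctx := context.Background()
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.observability:4317"),
		otlptracegrpc.WithInsecure(), // plaintext inside the cluster; switch to TLS for production
	)
	if err != nil {
		log.Fatalf("failed to create exporter: %v", err)
	}
	bsp := sdktrace.NewBatchSpanProcessor(exp)
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String("payment-service"),
			semconv.DeploymentEnvironmentKey.String("prod"),
		),
	)
	if err != nil {
		log.Fatalf("failed to create resource: %v", err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSpanProcessor(bsp),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	// Propagate W3C trace context so spans join traces started by other services.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return tp.Shutdown
}

// hello is a placeholder handler; replace it with real business logic.
func hello(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "hello")
}

func main() {
	shutdown := initTracer()
	defer func() {
		// Flush buffered spans before the process exits.
		if err := shutdown(context.Background()); err != nil {
			log.Printf("shutdown error: %v", err)
		}
	}()

	mux := http.NewServeMux()
	// otelhttp.NewHandler starts a server span named "Hello" for every request.
	mux.Handle("/", otelhttp.NewHandler(http.HandlerFunc(hello), "Hello"))
	log.Println("Listening on :8080")
	if err := http.ListenAndServe(":8080", mux); err != nil {
		log.Printf("server stopped: %v", err)
	}
}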

Dockerfile:


DOCKERFILE
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN go build -ldflags="-s -w" -o /app/payment-service

FROM alpine:3.19
COPY --from=build /app/payment-service /payment-service
ENTRYPOINT ["/payment-service"]

Deploy the same way as the Java service, just swap the image name and service.name env var.

5. Instrument a Python FastAPI service

Install the OTel packages:


Bash
pip install fastapi \
            opentelemetry-sdk==1.22.0 \
            opentelemetry-instrumentation-fastapi==0.43b0 \
            opentelemetry-exporter-otlp==1.22.0 \
            uvicorn

Add a tiny wrapper (instrument.py):


PYTHON
import os

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Resource attributes identify this service on every span it emits.
resource = Resource.create({
    "service.name": "inventory-service",
    "deployment.environment": "prod",
})

trace.set_tracer_provider(TracerProvider(resource=resource))

# Fall back to the in-cluster collector address when the env var is not set.
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv(
        "OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector.observability:4317"
    )
)
# Batch spans in memory and export them in the background.
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

app = FastAPI()
# Creates a server span for every incoming request.
FastAPIInstrumentor.instrument_app(app)

@app.get("/items/{item_id}")
async def read_item(item_id: str):
    return {"item_id": item_id}
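
With the default propagators, the instrumented app continues any trace announced in the incoming W3C traceparent header instead of starting a new one. Here is a stdlib‑only sketch of what that header carries. The `parse_traceparent` helper is hypothetical (the SDK does this internally), and it handles only the basic four‑field form.

```python
import re

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the fields of a traceparent header, or None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(header))
```

The trace_id field is the same 32‑character value you later search for in Tempo and Loki.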

Dockerfile:


DOCKERFILE
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.observability:4317"
CMD ["uvicorn", "instrument:app", "--host", "0.0.0.0", "--port", "8080"]

Again, a plain Deployment manifest does the trick.

6. Wire Prometheus to scrape the Collector

Add a ServiceMonitor (Prometheus Operator will pick it up):


YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: observability
  labels:
    release: prometheus
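spec:
  selector:
    matchLabels:
      # The chart labels its Service with the chart name, not the release name.
      app.kubernetes.io/name: opentelemetry-collector
  endpoints:
    # Application metrics from the prometheus exporter (:8889); the collector
    # Service must expose a port with this name.
    - port: prom-exporter
      path: /metrics
      interval: 30s
    # Collector self-telemetry (otelcol_* metrics) on :8888; this port must be
    # enabled on the Service as well.
    - port: metrics
      path: /metrics
      interval: 30s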

Now you can query otelcol_exporter_sent_spans_total or process_cpu_seconds_total in Grafana.
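
Metric names look different in Prometheus than they do in the SDKs because the collector's prometheus exporter rewrites them: dots become underscores, a unit suffix is appended, and monotonic counters get a _total suffix. Here is a simplified sketch of that translation, as I understand recent collector versions to behave. The `prometheus_name` helper and its small unit table are illustrative; the real exporter covers more units and edge cases.

```python
import re

# Small subset of the unit-to-suffix mapping; illustrative, not exhaustive.
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def prometheus_name(otel_name, unit="", monotonic_counter=False):
    """Approximate the Prometheus name for an OTel metric."""
    # Characters Prometheus does not allow in metric names become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    if monotonic_counter and not name.endswith("_total"):
        name += "_total"
    return name


print(prometheus_name("http.server.request.duration", unit="s"))
print(prometheus_name("http.server.duration", unit="ms"))
# "orders.processed" is a made-up counter name.
print(prometheus_name("orders.processed", monotonic_counter=True))
```

Histograms additionally fan out into _bucket, _count, and _sum series, which is why request‑rate queries use the _count series.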

7. Verify end‑to‑end flow

Generate traffic – hey -c 10 -n 1000 http://orders-service.production:8080/
Check traces – Open Grafana → Explore → Tempo datasource → search by service.name="orders-service"
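Metrics – In Grafana, use the Prometheus query sum by (service_name) (rate(http_server_request_duration_seconds_count[1m])) to see QPS per service; adjust the metric and label names to whatever your instrumentation version exports.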
Logs – In Grafana Loki, query {job="orders-service"} |~ "trace_id" – you’ll see the same trace IDs as in Tempo.

If any signal is missing, the collector logs (kubectl logs -n observability -l app.kubernetes.io/name=opentelemetry-collector) will tell you which exporter failed.
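
The log‑to‑trace check in the last step can be scripted once you have log lines on hand, for example from kubectl logs. Here is a minimal sketch, assuming each line carries a trace_id=<32 hex chars> field as described above. The sample lines are made up.

```python
import re
from collections import Counter

TRACE_ID = re.compile(r"\btrace_id=([0-9a-f]{32})\b")


def trace_ids(lines):
    """Count log lines per trace ID; lines without an ID are ignored."""
    counts = Counter()
    for line in lines:
        match = TRACE_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


logs = [
    "INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 GET /orders",
    "WARN trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=53995c3f42cd8ad8 slow query",
    "INFO health check ok",
]

for trace_id, count in trace_ids(logs).most_common():
    print(trace_id, count)
```

Paste any ID it prints into the Tempo search box; if the trace is missing there, the traces pipeline is the place to look.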

8. Production‑grade tweaks

Concern	Recommended setting
Collector memory	`memory_limiter.limit_mib: 800` for >5 k RPS
Span retention	Set `tempo.storage.trace.retention: 72h`
Scrape interval	15 s for high‑frequency metrics
TLS	Enable `receivers.otlp.protocols.grpc.tls` in the collector config and configure certs
RBAC	Use a dedicated `otel-collector` ServiceAccount bound to a narrowly scoped ClusterRole (for example, read‑only access to `metrics.k8s.io`) rather than `cluster-admin`
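
When raising limit_mib, remember that it has to fit inside the container's memory limit, and that the processor starts refusing data at a soft limit below it (limit_mib minus spike_limit_mib, where spike_limit_mib defaults to 20% of limit_mib). Here is a small sizing sketch. The 20% container headroom is my own rule of thumb, not an official recommendation, and `memory_limiter_settings` is a hypothetical helper.

```python
def memory_limiter_settings(container_limit_mib, headroom_pct=20, spike_pct=20):
    """Derive memory_limiter values from a container memory limit (in MiB)."""
    # Leave headroom between the hard limit and the container limit (assumption).
    limit_mib = container_limit_mib * (100 - headroom_pct) // 100
    # spike_pct mirrors the processor's default spike allowance of 20%.
    spike_limit_mib = limit_mib * spike_pct // 100
    return {
        "limit_mib": limit_mib,
        "spike_limit_mib": spike_limit_mib,
        "soft_limit_mib": limit_mib - spike_limit_mib,
    }


print(memory_limiter_settings(1000))
```

For a 1000 MiB container this lands on the 800 MiB hard limit from the table, with the collector starting to push back at 640 MiB.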

9. Wrap‑up

OpenTelemetry gave us a single source of truth for everything that happens inside Kubernetes. No more juggling separate tracing libraries, Prometheus exporters, or log shippers. The collector acts as a gateway, so swapping a back‑end means changing one exporter in one place instead of touching every service.

Try it on a dev cluster first, then scale the collector replicas and increase the memory limits as traffic grows. The code snippets above are fully versioned, so you can lock them down in your CI/CD pipelines.

Happy observability!

Next steps

Add auto‑instrumentation for Node.js (@opentelemetry/auto-instrumentations-node@0.38.0).
Experiment with ratio‑based sampling (for example OTEL_TRACES_SAMPLER=parentbased_traceidratio) to reduce data volume.
Add Prometheus alerting rules, routed through Alertmanager, on latency percentiles (histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name))).
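
On the sampling item: ratio‑based sampling works because the keep/drop decision is derived from the trace ID itself, so every service that sees the same trace makes the same choice. Here is a sketch of the idea. It mirrors the approach behind the SDKs' ratio samplers, but it is not their exact code, and `should_sample` is a hypothetical helper.

```python
def should_sample(trace_id_hex, ratio):
    """Deterministic keep/drop decision derived from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random number.
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_bits < round(ratio * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for ratio in (0.0, 0.5, 0.7, 1.0):
    print(ratio, should_sample(trace_id, ratio))
```

Because the decision depends only on the ID and the ratio, no coordination between services is needed to keep traces whole.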

Feel free to drop a comment if you hit a snag. I’ll update the repo with a make deploy target that bundles everything shown here.
