OpenTelemetry + Kubernetes observability explained: a step‑by‑step guide to collecting distributed traces, metrics, and logs from Java, Go, and Python services using the OTel Collector, Prometheus, Loki, and Tempo.
Keywords: OpenTelemetry, Kubernetes observability, distributed tracing, metrics collection, logging integration
A few months back my team ran into a classic production nightmare: a latency spike in a microservice, with no visibility into the request flow, no alert thresholds on our metrics, and logs scattered across three different namespaces. We patched things together with ad‑hoc scripts, but those quickly became a burden to maintain.
Instead of patching, I went all‑in on OpenTelemetry (OTel). The result? One unified pipeline that captures traces, metrics, and logs from every pod, stores them in Prometheus and Loki, and visualizes everything in Grafana. This post is the exact recipe I used, with version numbers, Helm values, and code snippets you can copy‑paste.
Everything is reproducible with a single make deploy command.
| Item | Minimum version |
|---|---|
| Kubernetes cluster | 1.27 (any CNCF‑certified distro) |
| Helm | 3.12 |
| kubectl | 1.27 |
| Docker | 24.0 |
| Grafana | 10.2 (optional – for visualization) |
Make sure kubectl points to a cluster with cluster‑wide admin rights; the Collector needs to create ClusterRoles and ServiceAccounts.
I use the official Helm chart from the OpenTelemetry community. It runs the Collector in gateway mode: a central Deployment that receives data from all pods and forwards it to the back‑ends.
```bash
helm repo add otel https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm upgrade --install otel-collector otel/opentelemetry-collector \
  --namespace observability \
  --create-namespace \
  --version 0.83.0 \
  -f - <<EOF
mode: "deployment"
replicaCount: 2
config:
  receivers:
    otlp:
      protocols:
        # {} rather than an empty value: Helm treats null as "delete this key"
        grpc: {}
        http: {}
  processors:
    batch: {}
    memory_limiter:
      limit_mib: 400
      check_interval: 5s
  exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
    loki:
      endpoint: "http://loki:3100/loki/api/v1/push"
    zipkin:
      endpoint: "http://tempo:9411/api/v2/spans"
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [zipkin]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [prometheus]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loki]
EOF
```
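What’s happening?

- The otlp receiver accepts data from the services over gRPC and HTTP.
- Traces are forwarded to Tempo through the zipkin exporter, and logs are pushed to Loki.
- Metrics are exposed on :8889 for Prometheus to scrape.

Tip: Keep the memory_limiter processor; without it the collector can OOM on busy clusters.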
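A quick way to catch typos in a config like the one above is to check that every pipeline only references components that are actually declared. The sketch below does that on a plain dictionary that mirrors the Helm values; the helper names `find_undeclared` and `limiter_not_first` are made up for this example and are not part of any OTel tooling. It also flags pipelines where memory_limiter is not the first processor, which is the ordering used in this post.

```python
# Illustrative only: a dictionary mirroring the Collector config from the Helm values.
config = {
    "receivers": {"otlp": {}},
    "processors": {
        "batch": {},
        "memory_limiter": {"limit_mib": 400, "check_interval": "5s"},
    },
    "exporters": {"prometheus": {}, "loki": {}, "zipkin": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["zipkin"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["prometheus"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["loki"],
            },
        }
    },
}


def find_undeclared(cfg):
    """Return one message per pipeline component that is not declared at the top level."""
    problems = []
    for name, pipeline in cfg["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            declared = cfg.get(kind, {})
            for component in pipeline.get(kind, []):
                if component not in declared:
                    problems.append(f"{name}: {kind[:-1]} '{component}' is not declared")
    return problems


def limiter_not_first(cfg):
    """Return the names of pipelines whose first processor is not memory_limiter."""
    offenders = []
    for name, pipeline in cfg["service"]["pipelines"].items():
        processors = pipeline.get("processors") or []
        if not processors or processors[0] != "memory_limiter":
            offenders.append(name)
    return offenders


print(find_undeclared(config))    # [] when every reference resolves
print(limiter_not_first(config))  # [] when memory_limiter leads every pipeline
```

Running a check like this in CI before `helm upgrade` is cheaper than finding out from a crash-looping Collector pod.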
Prometheus, Loki, and Tempo are all available as Helm charts. Below is a minimal but production‑ready set‑up.
```bash
# Prometheus (the chart also bundles Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace observability \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.enabled=true \
  --set grafana.defaultDashboardsEnabled=true \
  --version 55.7.0

# Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-stack \
  --namespace observability \
  --set loki.image.tag=2.9.1 \
  --set promtail.enabled=true \
  --set promtail.config.client.url=http://loki:3100/loki/api/v1/push \
  --version 2.9.0

# Tempo (uses the grafana repo added above)
helm upgrade --install tempo grafana/tempo \
  --namespace observability \
  --set tempo.image.tag=2.4.0 \
  --set tempo.storage.trace.blocksize=64Mi \
  --set tempo.storage.trace.retention=48h \
  --version 1.6.0
```
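Now you have a full observability stack in the observability namespace. The kube-prometheus-stack chart provisions the Prometheus data source in Grafana for you. If Loki or Tempo do not show up under Grafana's data sources, add them manually, pointing at http://loki:3100 and the Tempo service.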
Add the OTel starter to your pom.xml:
```xml
<dependency>
  <groupId>io.opentelemetry.instrumentation</groupId>
  <artifactId>opentelemetry-spring-boot-starter</artifactId>
  <version>1.34.1</version>
</dependency>
```
Create an application.yaml (or application.properties) that points to the collector:
```yaml
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector.observability:4317
      compression: gzip
  metrics:
    export:
      interval: 30s
  resource:
    attributes:
      service.name: orders-service
      deployment.environment: prod
```
Dockerfile (multi‑stage, Java 21):
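```dockerfile
# Build stage
FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app
COPY . .
RUN ./mvnw -B package -DskipTests

# Runtime stage
FROM eclipse-temurin:21-jre-alpine
COPY --from=build /app/target/orders-service.jar /app.jar
# The Spring Boot starter instruments the app from inside the jar, so no
# -javaagent flag is needed (and the application jar is not a Java agent).
ENTRYPOINT ["java","-jar","/app.jar"]
```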
Deploy with a simple manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: java
          image: ghcr.io/yourorg/orders-service:1.0.0
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.observability:4317"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "service.name=orders-service,deployment.environment=prod"
          ports:
            - containerPort: 8080
```
When the pod starts, you’ll see trace IDs in the logs (trace_id=...) and metrics like http_server_requests_seconds_count in Prometheus.
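To confirm that trace IDs really are landing in the logs, you can pull them out of a few log lines and compare them with what the tracing back-end shows. The sketch below assumes the IDs appear as a `trace_id=<32 lowercase hex characters>` token; the sample lines and the `extract_trace_ids` helper are invented for illustration, so adjust the pattern to whatever your log layout actually emits.

```python
import re

# Assumed log layout: a "trace_id=<32 hex chars>" token somewhere in the line.
TRACE_ID_RE = re.compile(r"\btrace_id=([0-9a-f]{32})\b")

# An all-zero trace ID is invalid and means "no active trace".
INVALID_TRACE_ID = "0" * 32


def extract_trace_ids(lines):
    """Return the distinct, valid trace IDs found in the lines, in first-seen order."""
    seen = []
    for line in lines:
        for trace_id in TRACE_ID_RE.findall(line):
            if trace_id != INVALID_TRACE_ID and trace_id not in seen:
                seen.append(trace_id)
    return seen


# Made-up sample lines for illustration.
sample = [
    "2024-01-01T10:00:00Z INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 GET /orders",
    "2024-01-01T10:00:01Z INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 saved order",
    "2024-01-01T10:00:02Z INFO trace_id=00000000000000000000000000000000 health check",
    "2024-01-01T10:00:03Z INFO no trace context here",
]

print(extract_trace_ids(sample))  # ['4bf92f3577b34da6a3ce929d0e0e4736']
```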
Add the OTel Go SDK (v1.28.0) and the HTTP instrumentation:
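```bash
go get go.opentelemetry.io/otel@v1.28.0
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.45.0
# main.go also imports the SDK and the OTLP gRPC trace exporter
go get go.opentelemetry.io/otel/sdk@v1.28.0
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.28.0
```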
Initialize the tracer in main.go:
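```go
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracer configures a global TracerProvider that batches spans and
// exports them to the Collector over OTLP/gRPC. It returns the provider's
// Shutdown function so main can flush pending spans on exit.
func initTracer() func(context.Context) error {
	ctx := context.Background()

	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.observability:4317"),
		otlptracegrpc.WithInsecure(), // plaintext inside the cluster
	)
	if err != nil {
		log.Fatalf("failed to create exporter: %v", err)
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String("payment-service"),
			semconv.DeploymentEnvironmentKey.String("prod"),
		),
	)
	if err != nil {
		log.Fatalf("failed to create resource: %v", err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSpanProcessor(sdktrace.NewBatchSpanProcessor(exp)),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown
}

// hello is a placeholder handler so the example compiles and serves traffic.
func hello(w http.ResponseWriter, _ *http.Request) {
	w.Write([]byte("hello\n"))
}

func main() {
	shutdown := initTracer()
	defer func() {
		if err := shutdown(context.Background()); err != nil {
			log.Printf("shutdown error: %v", err)
		}
	}()

	mux := http.NewServeMux()
	// otelhttp.NewHandler wraps the handler and creates a span per request.
	mux.Handle("/", otelhttp.NewHandler(http.HandlerFunc(hello), "Hello"))

	log.Println("Listening on :8080")
	if err := http.ListenAndServe(":8080", mux); err != nil {
		log.Printf("server stopped: %v", err)
	}
}
```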
Dockerfile:
```dockerfile
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN go build -ldflags="-s -w" -o /app/payment-service

FROM alpine:3.19
COPY --from=build /app/payment-service /payment-service
ENTRYPOINT ["/payment-service"]
```
Deploy the same way as the Java service, just swap the image name and service.name env var.
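Since the manifests differ mainly in the `OTEL_RESOURCE_ATTRIBUTES` value, it can help to see how that string is structured: a comma-separated list of `key=value` pairs. The sketch below parses it and swaps the service name; `parse_resource_attributes` and `with_service_name` are hypothetical helpers written for this post, and the sketch deliberately ignores percent-encoding of special characters in values.

```python
def parse_resource_attributes(raw):
    """Parse 'k1=v1,k2=v2' into a dict, skipping empty or malformed entries."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.strip().partition("=")
        if not sep or not key:
            continue  # no '=' or empty key: ignore the entry
        attrs[key.strip()] = value.strip()
    return attrs


def with_service_name(raw, name):
    """Return the attribute string with service.name replaced by (or set to) name."""
    attrs = parse_resource_attributes(raw)
    attrs["service.name"] = name
    return ",".join(f"{key}={value}" for key, value in attrs.items())


java_value = "service.name=orders-service,deployment.environment=prod"

print(parse_resource_attributes(java_value))
print(with_service_name(java_value, "payment-service"))
# service.name=payment-service,deployment.environment=prod
```

Note that the Go example above sets its resource attributes in code, so the environment variable only takes effect if the SDK is configured to read it.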
Install the OTel packages:
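```bash
# The instrumentation packages are versioned in lockstep with the SDK:
# 0.43b0 is the release that pairs with SDK 1.22.0.
pip install fastapi \
  opentelemetry-sdk==1.22.0 \
  opentelemetry-instrumentation-fastapi==0.43b0 \
  opentelemetry-exporter-otlp==1.22.0 \
  uvicorn
```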
Add a tiny wrapper (instrument.py):
```python
import os

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify this service on every span it emits.
resource = Resource.create({
    "service.name": "inventory-service",
    "deployment.environment": "prod",
})

trace.set_tracer_provider(TracerProvider(resource=resource))

# Batch spans and ship them to the Collector over OTLP/gRPC.
otlp_exporter = OTLPSpanExporter(endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"))
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)


@app.get("/items/{item_id}")
async def read_item(item_id: str):
    return {"item_id": item_id}
```
Dockerfile:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
ENV OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.observability:4317"
CMD ["uvicorn", "instrument:app", "--host", "0.0.0.0", "--port", "8080"]
```
Again, a plain Deployment manifest does the trick.
Add a ServiceMonitor (Prometheus Operator will pick it up):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: observability
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      # must match the labels on the Collector's Service
      app.kubernetes.io/name: otel-collector
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```
Now you can query otelcol_exporter_sent_spans_total or process_cpu_seconds_total in Grafana.
1. Generate some load: `hey -c 10 -n 1000 http://orders-service.production:8080/`
2. Traces: in Grafana, search Tempo for `service.name="orders-service"`.
3. Metrics: run `rate(http_server_requests_seconds_count[1m])` to see QPS per service.
4. Logs: query Loki with `{job="orders-service"} |~ "trace_id"`; you’ll see the same trace IDs as in Tempo.

If any component is missing, the collector logs (`kubectl logs -l app=otel-collector`) will tell you which exporter failed.
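If `rate(http_server_requests_seconds_count[1m])` feels opaque, it helps to compute a rough equivalent by hand: add up the counter's increases across the samples in the window and divide by the elapsed seconds. The sketch below is a simplified illustration with invented sample data; the real PromQL `rate()` also extrapolates to the window boundaries, which this version skips.

```python
def simple_rate(samples):
    """Approximate per-second rate from (unix_seconds, counter_value) pairs, oldest first.

    Returns None when there are fewer than two samples or no time has elapsed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, e.g. after a pod restart: count from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Invented scrape results: one sample every 15 s over a one-minute window.
window = [(0, 100), (15, 250), (30, 400), (45, 550), (60, 700)]
print(simple_rate(window))  # 10.0 requests per second
```

The reset handling is the reason to query counters through `rate()` rather than subtracting raw values: a restarted pod would otherwise show up as a large negative spike.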
| Concern | Recommended setting |
|---|---|
| Collector memory | memory_limiter.limit_mib: 800 for >5 k RPS |
| Span retention | Set tempo.storage.trace.retention: 72h |
| Scrape interval | 15 s for high‑frequency metrics |
| TLS | Enable receivers.otlp.protocols.grpc.tls in the Collector config and configure certs |
| RBAC | Use a dedicated otel-collector ServiceAccount bound to a ClusterRole limited to metrics.k8s.io, not cluster-admin |
OpenTelemetry gave us a single source of truth for everything that happens inside Kubernetes. No more juggling separate tracing libraries, Prometheus exporters, or log shippers. The collector sits between the services and the back‑ends, so swapping a back‑end means changing the collector config once instead of touching every service.
Try it on a dev cluster first, then scale the collector replicas and increase the memory limits as traffic grows. The code snippets above are fully versioned, so you can lock them down in your CI/CD pipelines.
Happy observability!
Next steps
- Add Node.js services (`opentelemetry-auto-instrumentations-node@0.38.0`).
- Enable ratio‑based sampling (`trace.id_ratio_based`) to reduce data volume.
- Track p95 latency per service (`histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name))`).

Feel free to drop a comment if you hit a snag. I’ll update the repo with a make deploy target that bundles everything shown here.
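On the sampling item: ratio‑based sampling decides from the trace ID itself, so every service that sees the same trace makes the same keep or drop decision without coordinating. The sketch below shows the general idea using the low 64 bits of the ID; `should_sample` is a hypothetical function for illustration, and the exact algorithm in your SDK may differ.

```python
def should_sample(trace_id_hex, ratio):
    """Deterministically keep roughly `ratio` of traces, based only on the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Map the ratio onto the 64-bit range and compare the ID's low 64 bits to it.
    bound = round(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

print(should_sample(trace_id, 1.0))  # True: everything is kept
print(should_sample(trace_id, 0.0))  # False: everything is dropped
print(should_sample(trace_id, 0.5))  # False: this ID falls in the upper half of the range
```

Because the decision is a pure function of the ID, lowering the ratio later only drops additional traces; anything kept at 10% would also have been kept at 50%.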