Observability and Logging: Mastering MLOps Production Telemetry

Learn to implement structured logging and track request latency for production ML systems. Master MLOps observability to catch failures before they scale.

Previously in this course, we covered CI/CD for ML pipelines, establishing the foundation for automated deployments. However, a deployment is not the end of the lifecycle. In production, "it works on my machine" is a dangerous fallacy. You need Observability—the ability to infer the internal state of your system based on its external outputs—to ensure your model is performing as expected.

While "Logging and Observability for Production ML Pipelines" covers the high-level strategy, this lesson focuses on the implementation: how to instrument your code to track request latency and error rates, turning raw logs into data you can query and act on.

The Shift to Structured Logging

Plain print statements and unstructured log strings scale poorly in production. Extracting fields from them requires brittle, expensive parsing, which makes it hard to filter by request ID or model version. Structured logging, typically one JSON object per event, turns logs into queryable data.

Instead of print(f"Request {id} took {time}s"), emit a JSON object:


PYTHON
import structlog  # Widely used structured-logging library; JSON output requires configuring JSONRenderer

logger = structlog.get_logger()

def log_inference_event(request_id, model_version, latency, status_code):
    logger.info("inference_completed",
        request_id=request_id,
        model_version=model_version,
        latency_seconds=latency,
        status=status_code
    )

By standardizing your schema, you can immediately run queries in tools like Datadog, ELK, or CloudWatch to calculate the p99 latency of specific model versions across your entire fleet.
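
To make the querying step concrete, here is a minimal standard-library sketch of the aggregation a log backend performs on these events: parse each JSON line, group by model_version, and compute a nearest-rank p99. The sample log lines and the helper names nearest_rank_percentile and p99_by_model_version are illustrative assumptions, not output from a real service; a tool like Datadog or CloudWatch would run the equivalent query for you.

```python
import json
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def p99_by_model_version(log_lines):
    """Group inference events by model_version and compute p99 latency for each."""
    latencies = defaultdict(list)
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") != "inference_completed":
            continue  # ignore unrelated log events
        latencies[event["model_version"]].append(event["latency_seconds"])
    return {
        version: nearest_rank_percentile(vals, 99)
        for version, vals in latencies.items()
    }


# Hypothetical log lines carrying the same fields as log_inference_event.
sample_logs = [
    json.dumps({
        "event": "inference_completed",
        "request_id": f"r{i}",
        "model_version": "v1" if i % 2 else "v2",
        "latency_seconds": 0.05 + 0.01 * i,
        "status": 200,
    })
    for i in range(10)
]

print(p99_by_model_version(sample_logs))
```

With unstructured strings, the same question would need a regular expression per log format; here every field is already addressable by name.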

Tracking Latency and Error Rates

In ML serving, latency is often an early signal of resource saturation. A latency spike frequently points to GPU memory pressure or inefficient batching, though slow dependencies and cold starts can cause it too. To track this, wrap your inference calls in a decorator that records the duration and captures exceptions.


PYTHON
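import time
from functools import wraps

import structlog

logger = structlog.get_logger()

def track_performance(func):
    """Log latency and outcome for every call to the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "error"  # assume failure until the call returns normally
        try:
            result = func(*args, **kwargs)
            status = "success"
            return result
        except Exception as e:
            logger.error("inference_failed", error=str(e), error_type=type(e).__name__)
            raise  # bare raise preserves the original traceback
        finally:
            # Runs on both success and failure, so every request is counted.
            latency = time.perf_counter() - start
            logger.info("request_metrics", latency_seconds=latency, status=status)
    return wrapper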

@track_performance
def run_model_inference(input_data):
    # Your model call here
    pass

This decorator provides a non-intrusive way to gather telemetry. It ensures that every request is accounted for, providing the data necessary to monitor production performance effectively.
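
When only part of a function needs timing (for example, the forward pass but not the preprocessing), the same pattern can be expressed as a context manager. The sketch below is a standard-library variant and not part of the original project: the name track_block and the use of the stdlib logging module in place of structlog are assumptions made to keep it self-contained.

```python
import json
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("inference")


@contextmanager
def track_block(name):
    """Time the enclosed block and log one JSON event with its outcome."""
    start = time.perf_counter()
    metrics = {"event": "request_metrics", "block": name, "status": "success"}
    try:
        yield metrics  # callers may attach extra fields, e.g. batch size
    except Exception as exc:
        metrics["status"] = "error"
        metrics["error"] = str(exc)
        raise  # re-raise with the original traceback intact
    finally:
        metrics["latency_seconds"] = time.perf_counter() - start
        log.info(json.dumps(metrics))


with track_block("forward_pass") as m:
    m["batch_size"] = 8
    time.sleep(0.01)  # stand-in for the real model call
```

Yielding the metrics dictionary lets the caller add request-specific fields without changing the helper itself.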

Visualizing Performance Dashboards

Raw logs are for debugging; dashboards are for observability. A production-ready dashboard for a model-serving application should prioritize four key metrics:

Throughput: Requests per second (RPS).
Latency Distribution: p50, p95, and p99 latency.
Error Rate: Percentage of requests returning 4xx or 5xx.
Token Usage: If using LLMs, track input and output tokens per request to monitor cost and spot unexpectedly long prompts or completions.

Metric	Business Impact	Infrastructure Signal
p99 Latency	User churn	GPU/Memory bottleneck
Error Rate	Service downtime	Model/Dependency failure
RPS	Capacity planning	Traffic spikes/DDoS
Token Count	Operational cost	Prompt complexity
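
As a rough illustration of how these four metrics fall out of the logged events, the sketch below aggregates one time window of request records. The record fields (status as an HTTP code, input_tokens, output_tokens), the function name summarize_window, and the sample data are assumptions for the example; in practice the dashboard backend would compute these from your logs or metrics store.

```python
import statistics


def summarize_window(events, window_seconds):
    """Aggregate one time window of request events into dashboard metrics."""
    latencies = [e["latency_seconds"] for e in events]
    errors = sum(1 for e in events if e["status"] >= 400)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "rps": len(events) / window_seconds,
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "error_rate": errors / len(events),
        "input_tokens": sum(e["input_tokens"] for e in events),
        "output_tokens": sum(e["output_tokens"] for e in events),
    }


# Hypothetical window: 100 requests over 10 seconds, 5 of them failing.
events = [
    {
        "latency_seconds": 0.01 * (i + 1),
        "status": 500 if i % 20 == 0 else 200,
        "input_tokens": 100,
        "output_tokens": 20,
    }
    for i in range(100)
]

summary = summarize_window(events, window_seconds=10)
print(summary)
```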

Hands-on Exercise: Instrumenting the Project

For our running project, add request-level logging to your serving layer (e.g., FastAPI or Flask).

Create a logger.py that configures structlog to output JSON.
Implement the track_performance decorator shown above.
Apply it to your inference endpoint.
Run 100 requests against your service and verify that the logs are being generated in valid JSON format.

Common Pitfalls

Logging Too Much: Logging every input token for a large LLM will blow up your storage costs and degrade I/O performance. Log metadata (request length, user ID) instead of full payloads.
Missing Correlation IDs: If a request traverses a load balancer, an API gateway, and a model server, ensure a correlation_id is passed through all of them. Without it, you cannot trace a single request's lifecycle.
Synchronous Logging: Never perform blocking I/O (like writing to a remote log server) in the main inference thread. Use asynchronous log handlers, or write to stdout or a local file and let a log shipper (e.g., Fluentd or Logstash) forward the logs out-of-band.
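
The last two pitfalls can be addressed together. The following is a minimal standard-library sketch, not the project's actual setup: a contextvars variable carries the correlation ID, and a QueueHandler/QueueListener pair moves the log I/O to a background thread. The names correlation_id, CorrelationFilter, JsonFormatter, and build_logger are hypothetical, and the in-memory StringIO sink stands in for a slow remote handler.

```python
import contextvars
import io
import json
import logging
import logging.handlers
import queue

# Holds the ID of the request currently being handled in this context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every record before it is queued."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render a record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "correlation_id": getattr(record, "correlation_id", "-"),
        })


def build_logger(stream):
    """Return (logger, listener); the listener thread does the actual I/O."""
    log_queue = queue.Queue(-1)  # unbounded in-memory buffer
    queue_handler = logging.handlers.QueueHandler(log_queue)
    queue_handler.addFilter(CorrelationFilter())  # runs in the request thread

    sink = logging.StreamHandler(stream)  # stand-in for a slow remote handler
    sink.setFormatter(JsonFormatter())
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()

    log = logging.getLogger("inference.async")
    log.setLevel(logging.INFO)
    log.addHandler(queue_handler)
    log.propagate = False
    return log, listener


buffer = io.StringIO()
async_logger, listener = build_logger(buffer)

correlation_id.set("req-42")  # normally taken from an incoming request header
async_logger.info("inference_completed")  # only enqueues; does not touch the sink

listener.stop()  # flushes remaining records and joins the background thread
print(buffer.getvalue())
```

The filter is attached to the queue handler so that the ID is read in the request's own context; by the time the listener thread formats the record, that context is no longer available.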

Recap

We have moved beyond simple print statements to structured JSON logging, implemented a decorator for automated latency tracking, and defined the essential metrics for our production dashboard. These practices are a prerequisite for detecting the silent model failures that plague unmonitored systems.

Up next: We will dive into Drift Detection and Data Monitoring, where we’ll learn to identify when your production input data no longer matches the distribution your model was trained on.
