Project Milestone: Production Deployment of ML Systems

Learn to execute a production deployment on Kubernetes, integrate telemetry, and build automated feedback loops to ensure your ML system remains performant.

Previously in this course, we explored Scaling Deployments with Kubernetes: Orchestrating ML Inference and established the fundamental observability patterns in Observability and Logging: Mastering MLOps Production Telemetry. This lesson serves as our final project milestone, where we move from local validation to a live, production-grade deployment.

We are taking our optimized inference engine—built upon the lessons learned in Project Milestone: Inference Optimization for Production—and wrapping it in a resilient, observable Kubernetes ecosystem.

The Architecture of a Production Deployment

A production deployment is more than just a kubectl apply. It requires a closed-loop system where the model, the infrastructure, and the data pipeline communicate. To reach this, we must ensure our Kubernetes manifests reflect production-grade resource requests, liveness/readiness probes, and proper sidecar patterns for logging.

1. Hardening the Deployment Manifest

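We start by defining our Deployment resource. Unlike development pods, production inference pods should explicitly request GPU resources and bound their memory. Note that a memory limit does not prevent OOM (Out of Memory) kills: a container that exceeds its limit is OOM-killed. What the limit does is stop one runaway container from starving the rest of the node, so size it with headroom above the model's peak usage.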


YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm-server
        image: my-registry/llm-service:v1.2.0
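        resources:
          limits:
            nvidia.com/gpu: 1   # extended resources go under limits; the request defaults to the same value
            memory: "16Gi"      # illustrative value; a container exceeding it is OOM-killed
          requests:
            memory: "16Gi"
            cpu: "4"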
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
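
The hardening requirements above (resource requests, a memory limit, and probes) can be checked mechanically before a manifest is applied. The following is a minimal sketch, not a complete linter: `hardening_gaps` is a hypothetical helper, and it assumes the manifest has already been parsed into a Python dict (for example with a YAML parser, which is not shown here).

```python
def hardening_gaps(deployment: dict) -> list[str]:
    """List production-readiness gaps in a parsed Deployment manifest."""
    gaps = []
    pod_spec = deployment["spec"]["template"]["spec"]
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for key in ("cpu", "memory"):
            # Kubernetes defaults a missing request to the limit, so either counts.
            if key not in requests and key not in limits:
                gaps.append(f"{name}: no {key} request")
        if "memory" not in limits:
            gaps.append(f"{name}: no memory limit")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                gaps.append(f"{name}: no {probe}")
    return gaps
```

An empty list means none of these specific gaps were found; it does not mean the chosen values are right for your model.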

2. Monitoring and Logging Integration

To achieve production visibility, we don't just rely on standard output. We implement a sidecar pattern, typically a Fluent Bit or Promtail agent, to ship logs to an ELK stack or Grafana Loki.

For telemetry, ensure you are exposing a /metrics endpoint (typically via a Prometheus client library) that tracks:

Request Latency (p99): The 99th-percentile end-to-end time for a request, including token generation.
GPU Utilization: Percentage of time the GPU is active.
Token Throughput: Tokens generated per second per user request.
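
To make the latency and throughput definitions concrete, here is a rough standard-library sketch that computes them from in-memory samples. The `RequestSample` record and both helper names are hypothetical; a real service would normally export these through a Prometheus client library and let the monitoring backend aggregate them.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One completed inference request (hypothetical record shape)."""
    latency_s: float       # wall-clock time from request to final token
    tokens_generated: int  # number of output tokens produced


def p99_latency(samples: list[RequestSample]) -> float:
    """Nearest-rank 99th percentile of request latency, in seconds."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(s.latency_s for s in samples)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def tokens_per_second(sample: RequestSample) -> float:
    """Token throughput for a single request."""
    if sample.latency_s <= 0:
        raise ValueError("latency must be positive")
    return sample.tokens_generated / sample.latency_s
```

The nearest-rank method is only one of several percentile definitions; histogram-based backends will report a bucketed approximation instead.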

3. Configuring Automated Feedback Loops

The "Project Milestone" for this deployment is the creation of a feedback loop. We need to capture user interactions (thumbs up/down or corrections) and push them back into a storage layer (like S3 or a SQL database) to serve as the ground truth for future retraining.


PYTHON
# Simple feedback ingestion logic for your API
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()

def persist_feedback(data: dict):
    # Logic to write to your feature store or S3 bucket
    # This data will trigger the CT pipeline in future iterations
    pass

@app.post("/feedback")
async def collect_feedback(feedback: dict, background_tasks: BackgroundTasks):
    background_tasks.add_task(persist_feedback, feedback)
    return {"status": "accepted"}
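
One possible body for `persist_feedback` is to append each event as a line of JSON to a file on a mounted volume. This is a sketch under stated assumptions: the `feedback/feedback.jsonl` path, the `received_at` and `payload` field names, and the extra `log_path` parameter are all illustrative, not part of the original service.

```python
import json
import threading
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical location; in a cluster this would sit on a mounted volume.
FEEDBACK_LOG = Path("feedback/feedback.jsonl")

_write_lock = threading.Lock()


def persist_feedback(data: dict, log_path: Path = FEEDBACK_LOG) -> None:
    """Append one feedback event to a JSON Lines file."""
    record = {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "payload": data,
    }
    line = json.dumps(record, ensure_ascii=False)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Serialize writers within this process so lines never interleave.
    with _write_lock, log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

The lock only protects writers inside one process. With several replicas, each pod would need its own file or a shared store such as the S3 bucket or SQL database mentioned above.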

Hands-on Exercise: The Production Checklist

Your goal is to deploy your current project to a namespace and verify its health.

Apply the manifest: Deploy your model to a Kubernetes namespace using the template above.
Verify Probes: Observe the pod status; if the readinessProbe fails, investigate the vLLM logs via kubectl logs <pod-name>.
Connect Feedback: Ensure your API has a POST endpoint that logs JSON payloads to a persistent volume (PV) for later evaluation.
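
For the probe verification step, the pod status can also be checked programmatically. The sketch below assumes you have the output of `kubectl get pods -n <namespace> -o json` parsed into a dict; `unready_pods` is a hypothetical helper name.

```python
def unready_pods(pod_list: dict) -> list[str]:
    """Return names of pods whose Ready condition is not "True"."""
    not_ready = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready
```

Any pod name it returns is a candidate for `kubectl logs <pod-name>`.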

Common Pitfalls in Production

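Cold Start Latency: LLMs are heavy. A failing readiness probe only keeps the pod out of the Service's endpoints; it is a failing liveness probe that restarts the container. If a liveness probe starts checking before the weights have finished loading into VRAM, the container is restarted in a loop. Add a startupProbe with a generous failure budget, or increase initialDelaySeconds.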
Resource Fragmentation: If each pod requests 1 GPU and a node has 2, the scheduler may "bin-pack" two models onto the same node. If the models are memory-bound, use pod anti-affinity or topology spread constraints to keep them on separate nodes.
Logging Noise: Do not log full prompt/response pairs at INFO level. This will explode your logging costs. Use DEBUG or sample the logs.
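
The sampling option for noisy logs can be sketched with the standard `logging` module. `SamplingFilter` and the `inference.payloads` logger name are hypothetical, and the 1% rate is only an example.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Pass only a random fraction of records through a logger or handler."""

    def __init__(self, sample_rate: float, seed: Optional[int] = None):
        super().__init__()
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self._rng = random.Random(seed)

    def filter(self, record: logging.LogRecord) -> bool:
        # random() is in [0, 1), so rate 0 drops everything and rate 1 keeps everything.
        return self._rng.random() < self.sample_rate


# Hypothetical logger dedicated to full prompt/response payloads.
payload_logger = logging.getLogger("inference.payloads")
payload_logger.addFilter(SamplingFilter(sample_rate=0.01))
```

Keeping payload logging on its own logger means the sampling rate can be tuned without affecting ordinary request logs.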

Recap

We have moved from raw model weights to a scalable, observable, and feedback-aware production service. By combining Kubernetes orchestration with structured telemetry and a feedback ingestion API, we have closed the loop on our MLOps lifecycle, ensuring our system is not just deployed, but continuously improving.

Up next: We will dive into Advanced Activation Checkpointing, focusing on how to push the boundaries of memory efficiency for even larger models.
