Learn to execute a production deployment on Kubernetes, integrate telemetry, and build automated feedback loops to ensure your ML system remains performant.
Previously in this course, we explored Scaling Deployments with Kubernetes: Orchestrating ML Inference and established the fundamental observability patterns in Observability and Logging: Mastering MLOps Production Telemetry. This lesson serves as our final project milestone, where we move from local validation to a live, production-grade deployment.
We are taking our optimized inference engine—built upon the lessons learned in Project Milestone: Inference Optimization for Production—and wrapping it in a resilient, observable Kubernetes ecosystem.
A production deployment is more than just a kubectl apply. It requires a closed-loop system where the model, the infrastructure, and the data pipeline communicate. To reach this, we must ensure our Kubernetes manifests reflect production-grade resource requests, liveness/readiness probes, and proper sidecar patterns for logging.
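We start by defining our Deployment resource. Unlike development pods, production pods must explicitly request GPU resources and declare memory requests and limits, so the scheduler places them on nodes with enough capacity and a memory spike in one container is contained rather than causing OOM (Out of Memory) kills across the node.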
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: my-registry/llm-service:v1.2.0
          resources:
            limits:
              nvidia.com/gpu: 1   # GPUs are declared under limits
              memory: "16Gi"      # explicit memory limit, matching the request
            requests:
              memory: "16Gi"
              cpu: "4"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
```
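The production requirements named earlier (GPU and memory settings, liveness and readiness probes) can be checked mechanically before a manifest is applied. The following is a minimal sketch, not a substitute for a real policy tool. It assumes the manifest has already been parsed into a Python dict, and the function name `lint_deployment`, its rule set, and the sample manifest are all hypothetical.

```python
from typing import Any


def lint_deployment(manifest: dict[str, Any]) -> list[str]:
    """Return human-readable warnings for a parsed Deployment manifest."""
    warnings: list[str] = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        if "nvidia.com/gpu" not in limits:
            warnings.append(f"{name}: no GPU limit set")
        if "memory" not in limits:
            warnings.append(f"{name}: no memory limit set")
        if "memory" not in requests:
            warnings.append(f"{name}: no memory request set")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                warnings.append(f"{name}: missing {probe}")
    return warnings


# Hypothetical, deliberately incomplete manifest: it has a GPU limit, a memory
# request, and a readiness probe, but no memory limit and no liveness probe.
manifest = {
    "kind": "Deployment",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "vllm-server",
                        "resources": {
                            "limits": {"nvidia.com/gpu": 1},
                            "requests": {"memory": "16Gi", "cpu": "4"},
                        },
                        "readinessProbe": {"httpGet": {"path": "/health", "port": 8000}},
                    }
                ]
            }
        }
    },
}

for warning in lint_deployment(manifest):
    print(warning)
```

A check like this could run in CI so that an incomplete manifest is flagged before it reaches the cluster.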
To achieve production visibility, we don't rely on standard output alone. We implement a sidecar pattern, typically a Fluent Bit or Promtail agent, to ship logs to an ELK stack or Grafana Loki.
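A log-shipping agent is easiest to configure when the application emits one structured record per line. The sketch below uses only the standard `logging` module to write JSON lines. The field names `request_id` and `latency_ms` and the helper `build_logger` are assumptions for illustration, not part of the lesson's service.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("request_id", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str = "inference", stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("request served", extra={"request_id": "abc123", "latency_ms": 42.5})
```

Because every line is valid JSON, the agent can forward fields such as the level or request ID as indexed attributes instead of parsing free text.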
For telemetry, ensure you are exposing a /metrics endpoint (typically via a Prometheus client library) that tracks the health signals your service depends on.
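To make the /metrics endpoint less abstract, here is a rough sketch of what its response body looks like in the Prometheus text exposition format. In a real service a Prometheus client library would produce this output. The `Counter` class below is a hand-rolled stand-in, and the metric name `inference_requests_total` and its `status` label are hypothetical examples rather than the lesson's required metrics.

```python
from collections import defaultdict


class Counter:
    """A tiny labelled counter, standing in for a client-library metric."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        # Maps a sorted tuple of (label, value) pairs to the running total.
        self._values: dict[tuple[tuple[str, str], ...], float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        """Return the metric in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


requests_total = Counter("inference_requests_total", "Inference requests handled.")
requests_total.inc(status="ok")
requests_total.inc(status="ok")
requests_total.inc(status="error")
print(requests_total.render(), end="")
```

Each sample line carries the metric name, its labels, and the current value, which is what a Prometheus server scrapes on every interval.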
The "Project Milestone" for this deployment is the creation of a feedback loop. We need to capture user interactions (thumbs up/down or corrections) and push them back into a storage layer (like S3 or a SQL database) to serve as the ground truth for future retraining.
```python
# Simple feedback ingestion logic for your API
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()


def persist_feedback(data: dict):
    # Logic to write to your feature store or S3 bucket
    # This data will trigger the CT pipeline in future iterations
    pass


@app.post("/feedback")
async def collect_feedback(feedback: dict, background_tasks: BackgroundTasks):
    background_tasks.add_task(persist_feedback, feedback)
    return {"status": "accepted"}
```
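The `persist_feedback` placeholder could be filled in many ways. One minimal sketch, assuming a local JSON Lines file as a stand-in for S3 or a SQL database, is shown below. The record fields (`request_id`, `rating`, `correction`) and the accepted rating values are assumptions, since the lesson does not fix a feedback schema.

```python
import json
import tempfile
import time
from pathlib import Path

# Hypothetical local stand-in for an S3 bucket or database table.
FEEDBACK_LOG = Path("feedback.jsonl")


def persist_feedback(data: dict, path: Path = FEEDBACK_LOG) -> dict:
    """Validate a feedback event and append it to a JSON Lines file."""
    if data.get("rating") not in ("up", "down"):
        raise ValueError("rating must be 'up' or 'down'")
    record = {
        "received_at": time.time(),
        "request_id": data.get("request_id"),
        "rating": data["rating"],
        "correction": data.get("correction"),
    }
    # Append mode keeps earlier events; one JSON object per line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Demonstration against a temporary file so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "feedback.jsonl"
    persist_feedback({"request_id": "r1", "rating": "up"}, path=demo_path)
    persist_feedback(
        {"request_id": "r2", "rating": "down", "correction": "Paris"},
        path=demo_path,
    )
    print(len(demo_path.read_text(encoding="utf-8").splitlines()), "events stored")
```

An append-only log of this shape is simple to batch into a retraining dataset later, and the validation step keeps malformed events out of the ground truth.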
Your goal is to deploy your current project to a namespace and verify its health.
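Health verification can be scripted instead of checked by hand. The sketch below polls an HTTP health endpoint until it returns 200 or a timeout elapses. It assumes the service is reachable from where the script runs (for example through a local port-forward), and the function name `wait_until_healthy` and its default timings are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet, or a non-2xx response; retry after a pause.
            pass
        time.sleep(interval_s)
    return False


# Example call, assuming port 8000 of the service is forwarded to localhost:
# wait_until_healthy("http://localhost:8000/health", timeout_s=120)
```

A generous timeout matters here because a large model can take a while to load before the endpoint starts answering.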
If the readinessProbe fails, investigate the vLLM logs via kubectl logs <pod-name>.
If the container simply needs longer to start, increase initialDelaySeconds.
Avoid writing high-volume events at INFO level. This will explode your logging costs. Use DEBUG or sample the logs.
We have moved from raw model weights to a scalable, observable, and feedback-aware production service. By combining Kubernetes orchestration with structured telemetry and a feedback ingestion API, we have closed the loop on our MLOps lifecycle, ensuring our system is not just deployed, but continuously improving.
Up next: We will dive into Advanced Activation Checkpointing, focusing on how to push the boundaries of memory efficiency for even larger models.