Mahamudul Hasan Rubel

© 2026 Mahamudul Hasan Rubel. All rights reserved.

Lesson 42 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 28, 2026 · 3 min read

Project Milestone: Production Deployment of ML Systems

Learn to execute a production deployment on Kubernetes, integrate telemetry, and build automated feedback loops to ensure your ML system remains performant.

Tags: MLOps, Kubernetes, Deployment, Production, Feedback Loops, Project Milestone, ai, machine-learning, python

Previously in this course, we explored "Scaling Deployments with Kubernetes: Orchestrating ML Inference" and established the fundamental observability patterns in "Observability and Logging: Mastering MLOps Production Telemetry". This lesson serves as our final project milestone, where we move from local validation to a live, production-grade deployment.

We are taking our optimized inference engine, built on the lessons from "Project Milestone: Inference Optimization for Production", and wrapping it in a resilient, observable Kubernetes ecosystem.

The Architecture of a Production Deployment

A production deployment is more than just a kubectl apply. It requires a closed-loop system where the model, the infrastructure, and the data pipeline communicate. To reach this, we must ensure our Kubernetes manifests reflect production-grade resource requests, liveness/readiness probes, and proper sidecar patterns for logging.

1. Hardening the Deployment Manifest

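We start by defining our Deployment resource. Unlike development pods, production pods must explicitly request GPU resources and set memory requests and limits. Requests tell the scheduler how much capacity the pod needs; limits cap what the container may use, so a runaway process is OOM (Out of Memory) killed on its own instead of starving every other workload on the node.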

YAML
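apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm-server
        image: my-registry/llm-service:v1.2.0
        resources:
          limits:
            # GPUs are declared under limits; the request defaults to the same value
            nvidia.com/gpu: 1
            # Assumed value: limit set equal to the request; tune for your model
            memory: "16Gi"
          requests:
            memory: "16Gi"
            cpu: "4"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          # Give the server time to load weights before the first probe
          initialDelaySeconds: 30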
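
The requirements named in this section (a GPU limit, memory requests and limits, and a readiness probe) can be checked mechanically before you apply a manifest. The following is a minimal sketch, not part of the course code: `lint_containers` is a hypothetical helper, and it assumes the manifest has already been parsed into a plain Python dict (for example by a YAML loader of your choice).

```python
from typing import Any, Dict, List


def lint_containers(deployment: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems: List[str] = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})

        if "nvidia.com/gpu" not in limits:
            problems.append(f"{name}: no nvidia.com/gpu limit")
        if "memory" not in limits:
            problems.append(f"{name}: no memory limit")
        if "memory" not in requests:
            problems.append(f"{name}: no memory request")
        if "readinessProbe" not in container:
            problems.append(f"{name}: no readinessProbe")
    return problems


# Illustrative input: a container that is missing a memory limit and a probe.
example = {
    "spec": {"template": {"spec": {"containers": [
        {
            "name": "vllm-server",
            "resources": {
                "limits": {"nvidia.com/gpu": 1},
                "requests": {"memory": "16Gi", "cpu": "4"},
            },
        }
    ]}}}
}

for problem in lint_containers(example):
    print(problem)
```

A check like this could run in CI so that an incomplete manifest is rejected before it reaches the cluster; the exact set of rules is a judgment call for your team.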

2. Monitoring and Logging Integration

To achieve production visibility, we don't rely on standard output alone. We implement a sidecar pattern, typically a Fluent Bit or Promtail agent, to ship logs to an ELK stack or Grafana Loki.

For telemetry, ensure you are exposing a /metrics endpoint (typically via a Prometheus client library) that tracks:

  • Request Latency (p99): The 99th-percentile time from receiving a request to returning the last generated token.
  • GPU Utilization: Percentage of time the GPU is active.
  • Token Throughput: Tokens generated per second, per request and in aggregate.
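
To make the latency and throughput definitions concrete, here is a minimal, standard-library sketch of how those two numbers can be derived from raw request records. It is an illustration only: in a real service a Prometheus client library would normally record these as a histogram and a counter, and `RequestRecord`, `p99_latency`, and `token_throughput` are hypothetical names introduced here. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    latency_s: float       # seconds from request receipt to the last generated token
    tokens_generated: int  # completion tokens produced for this request


def p99_latency(records: List[RequestRecord]) -> float:
    """Nearest-rank 99th percentile: the smallest latency that is
    greater than or equal to 99% of all observed latencies."""
    if not records:
        raise ValueError("need at least one record")
    latencies = sorted(r.latency_s for r in records)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = (99 * len(latencies) + 99) // 100
    return latencies[rank - 1]


def token_throughput(records: List[RequestRecord]) -> float:
    """Aggregate tokens per second of generation time across all records."""
    total_tokens = sum(r.tokens_generated for r in records)
    total_seconds = sum(r.latency_s for r in records)
    return total_tokens / total_seconds if total_seconds > 0 else 0.0
```

The p99 matters more than the mean here because a small number of very slow generations can dominate what users perceive, while barely moving the average.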

3. Configuring Automated Feedback Loops

The "Project Milestone" for this deployment is the creation of a feedback loop. We need to capture user interactions (thumbs up/down or corrections) and push them back into a storage layer (like S3 or a SQL database) to serve as the ground truth for future retraining.

PYTHON
# Simple feedback ingestion logic for your API
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


def persist_feedback(data: dict) -> None:
    # Placeholder: write to your feature store or S3 bucket.
    # This data will trigger the CT pipeline in future iterations.
    pass


# 202 Accepted: the payload is queued here and persisted after the response is sent
@app.post("/feedback", status_code=202)
async def collect_feedback(feedback: dict, background_tasks: BackgroundTasks):
    # add_task schedules persist_feedback to run once the response has been returned
    background_tasks.add_task(persist_feedback, feedback)
    return {"status": "accepted"}
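
The body of `persist_feedback` is left open above. As one possible starting point, the sketch below appends each payload as a JSON line to a file on local disk, which in a pod could be a mounted persistent volume. The directory name, the file naming scheme, and the `received_at` field are all assumptions made for illustration; an S3 or SQL backend would replace the file write.

```python
import json
import time
from pathlib import Path

# Hypothetical location; in a pod this would point at the volume mount path.
FEEDBACK_DIR = Path("feedback")


def persist_feedback(data: dict, directory: Path = FEEDBACK_DIR) -> Path:
    """Append one feedback payload as a JSON line and return the file written."""
    directory.mkdir(parents=True, exist_ok=True)
    # One file per UTC day keeps files small and easy to batch for retraining.
    day = time.strftime("%Y-%m-%d", time.gmtime())
    path = directory / f"feedback-{day}.jsonl"
    record = {"received_at": time.time(), "feedback": data}
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return path
```

JSON Lines is a convenient format for this because each record is independent and a partially written file stays readable up to the last complete line. Note that several replicas appending to the same file on a shared volume can interleave writes, so a per-pod file name or a proper storage service is the safer choice at scale.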

Hands-on Exercise: The Production Checklist

Your goal is to deploy your current project to a namespace and verify its health.

  1. Apply the manifest: Deploy your model to a Kubernetes namespace using the template above.
  2. Verify Probes: Observe the pod status; if the readinessProbe fails, investigate the vLLM logs via kubectl logs <pod-name>.
  3. Connect Feedback: Ensure your API has a POST endpoint that logs JSON payloads to a persistent volume (PV) for later evaluation.

Common Pitfalls in Production

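  • Cold Start Latency: LLMs are heavy. If your liveness probe is too aggressive, the container will be restarted before it finishes loading the weights into VRAM; a failing readiness probe only keeps the pod out of the Service endpoints. Increase initialDelaySeconds, or use a startupProbe to hold off the other probes until the model is loaded.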
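  • Resource Fragmentation: If each pod requests 1 GPU and a node has 2, the scheduler may place two models on the same node. If the models are memory-bound, configure scheduling (for example with pod anti-affinity or topology spread constraints) so that too many of them are not "bin-packed" onto a single node.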
  • Logging Noise: Do not log full prompt/response pairs at INFO level. This will explode your logging costs. Use DEBUG or sample the logs.
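
For the logging-noise pitfall, sampling can be done with the standard `logging` module alone. The sketch below is one way to do it, not the course's implementation: `SamplingFilter` is a hypothetical class, the logger name and the 1% rate are arbitrary, and the decision to always keep WARNING and above is an assumption you may want to change.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Keep every record at WARNING or above and a random fraction of the rest."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        # random() returns a value in [0, 1), so a rate of 0 drops everything
        # below WARNING and a rate of 1 keeps everything.
        return self._rng.random() < self.sample_rate


# Usage: log roughly 1% of prompt/response payloads on a dedicated logger.
payload_logger = logging.getLogger("inference.payloads")
payload_logger.setLevel(logging.DEBUG)
payload_logger.addFilter(SamplingFilter(sample_rate=0.01))
```

Putting payload logs on their own logger keeps the sampling decision separate from operational logs, which you usually want in full.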

Recap

We have moved from raw model weights to a scalable, observable, and feedback-aware production service. By combining Kubernetes orchestration with structured telemetry and a feedback ingestion API, we have closed the loop on our MLOps lifecycle, ensuring our system is not just deployed, but continuously improving.

Up next: We will dive into Advanced Activation Checkpointing, focusing on how to push the boundaries of memory efficiency for even larger models.

Part of the course

Advanced AI/ML: Deep Learning, LLMs & Production Systems

advanced · Lesson 42 of 48
