Stop losing track of which data trained which model. Learn how to implement version control for data and models to ensure your ML pipelines are reproducible.
Previously in this course, we covered serializing pipelines with Joblib. While serialization saves your model to disk, it doesn't tell you how that model was created, what data it ingested, or which parameters were used. This lesson adds the final piece to the puzzle: lineage.
In production, a model file is useless without the context of its birth. If you can't trace a model back to its specific training data snapshot and code version, you aren't doing MLOps—you're just guessing.
In software engineering, Git tracks code changes. In machine learning, your "program" is a combination of code, hyperparameters, and the training data itself. If your data changes, the model output changes, even if the code remains identical.
To achieve true reproducibility, you must treat your data as a versioned artifact. This means moving away from "data.csv" and toward "data_v1.2.3.parquet."
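One way to make that shift concrete is to copy each dataset to an immutable, versioned filename and record its content hash. The sketch below is illustrative only: the `snapshot_dataset` helper, its parameters, and the `<name>_v<version>` naming scheme are assumptions, not part of any library or of this course's codebase.

```python
import hashlib
import shutil
from pathlib import Path


def snapshot_dataset(src_path, snapshot_dir, version):
    """Copy a data file to a versioned name and return (destination, sha256 hex digest).

    Hypothetical helper: the naming scheme <stem>_v<version><suffix> is an assumption.
    """
    src = Path(src_path)
    digest = hashlib.sha256()
    with src.open("rb") as f:
        # Read in 1 MiB chunks so large files do not have to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    dest_dir = Path(snapshot_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.stem}_v{version}{src.suffix}"
    if dest.exists():
        # Refuse to overwrite: an existing snapshot must stay immutable
        raise FileExistsError(f"{dest} already exists; bump the version instead")
    shutil.copy2(src, dest)
    return dest, digest.hexdigest()
```

Refusing to overwrite an existing snapshot is the design choice that matters here: once a version name is taken, it should always refer to the same bytes.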
The core pattern for production systems is to store a manifest file alongside your serialized pipeline. This manifest should contain:
- An environment specification (requirements.txt or conda.yaml) ensuring dependency parity.
We'll build a simple ModelRegistry structure. Instead of just saving a .pkl file, we save a folder containing the model and its metadata.
```python
import joblib
import json
import hashlib
import os
from datetime import datetime


def save_model_with_lineage(model, data_path, metrics, output_dir):
    # 1. Create a hash of the data file to detect changes
    with open(data_path, "rb") as f:
        file_hash = hashlib.md5(f.read()).hexdigest()

    # 2. Gather metadata
    metadata = {
        "timestamp": datetime.now().isoformat(),
        "data_source": data_path,
        "data_hash": file_hash,
        "metrics": metrics,
        "model_class": model.__class__.__name__,
    }

    # 3. Save everything as a package
    os.makedirs(output_dir, exist_ok=True)
    joblib.dump(model, os.path.join(output_dir, "model.pkl"))
    with open(os.path.join(output_dir, "metadata.json"), "w") as f:
        json.dump(metadata, f, indent=4)

    print(f"Model and lineage saved to {output_dir}")


# Usage:
# save_model_with_lineage(my_pipeline, "data/train_v1.parquet", {"f1": 0.88}, "models/v001")
```
By keeping the metadata.json physically alongside the model.pkl, you ensure that anyone (or any automated system) can audit exactly what went into the model. This is the foundation for version control for ML experiments.
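As a rough illustration of such an audit, the sketch below re-hashes a data file and compares it with the `data_hash` stored in `metadata.json`. The function name `verify_data_lineage` is hypothetical. It assumes the metadata layout and the MD5 hashing used by `save_model_with_lineage`.

```python
import hashlib
import json
import os


def verify_data_lineage(model_dir, data_path):
    """Return True if data_path still matches the hash recorded in metadata.json.

    Hypothetical helper: assumes metadata.json holds a "data_hash" key
    containing an MD5 hex digest of the training data file.
    """
    with open(os.path.join(model_dir, "metadata.json")) as f:
        metadata = json.load(f)

    # Recompute the hash the same way it was computed at save time
    with open(data_path, "rb") as f:
        current_hash = hashlib.md5(f.read()).hexdigest()

    return current_hash == metadata["data_hash"]
```

A `False` result means the file at `data_path` is no longer the data the model was trained on, even if its name is unchanged.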
Your task is to extend the save_model_with_lineage function above. Modify the code to include:
- The current Git commit hash (using subprocess.check_output(['git', 'rev-parse', 'HEAD'])).
- The names of the pipeline steps (using model.named_steps.keys()).
- … .pkl and .json files.
Pro-tip: If you are already using tools like MLflow or DVC, these steps are automated, but you must still understand what they are doing and why. As we discussed in hyperparameter stability analysis, understanding your model's provenance is the only way to debug performance drops in production.
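If you want a starting point for the exercise, the two helpers below sketch one possible approach. They are not a reference solution. The names `get_git_commit` and `get_step_names` are hypothetical, and returning `None` or an empty list on failure is an assumption about how you might want missing information handled.

```python
import subprocess


def get_git_commit():
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not inside a repository, no commits yet, or git is not installed
        return None
    return out.decode("utf-8").strip()


def get_step_names(model):
    """Return the pipeline step names, or an empty list if the model has none."""
    named_steps = getattr(model, "named_steps", None)
    return list(named_steps.keys()) if named_steps is not None else []
```

Both return values are JSON-serializable, so they can be added to the `metadata` dictionary as extra keys.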
- Don't keep overwriting a single model.pkl. Always use versioned naming (e.g., model_v1.pkl, model_v2.pkl). If you overwrite, you lose the ability to roll back.
- Different versions of scikit-learn or numpy can produce different numerical outputs. Always track your environment state.
Versioning is the backbone of production-ready machine learning. By linking your models to their source data via metadata manifests, you transform your pipeline from a "black box" into a traceable, auditable engineering process. This discipline ensures that your final project review is based on facts, not assumptions.
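To track environment state in the same manifest, one option is to record the installed versions of the key packages. The sketch below uses the standard-library `importlib.metadata` module. The `capture_environment` name and the default package list are assumptions chosen for illustration.

```python
import platform
from importlib import metadata


def capture_environment(packages=("scikit-learn", "numpy", "joblib")):
    """Return the Python version and the installed versions of the given packages.

    Hypothetical helper: the default package list is an assumption.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {"python": platform.python_version(), "packages": versions}
```

The returned dictionary is JSON-serializable, so it could be stored under an extra key such as `"environment"` in metadata.json.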
Up next: Designing Inference APIs — we'll wrap our versioned models in a production-ready HTTP service.