Versioning Models and Data: Establishing Lineage for ML Pipelines

Stop losing track of which data trained which model. Learn how to implement version control for data and models to ensure your ML pipelines are reproducible.


Previously in this course, we covered serializing pipelines with Joblib. While serialization saves your model to disk, it doesn't tell you how that model was created, what data it ingested, or which parameters were used. This lesson adds the final piece to the puzzle: lineage.

In production, a model file is hard to trust without a record of how it was produced. If you can't trace a model back to its specific training data snapshot and code version, you aren't doing MLOps; you're guessing.

Why Version Control for Data Matters

In software engineering, Git tracks code changes. In machine learning, your "program" is a combination of code, hyperparameters, and the training data itself. If your data changes, the model output changes, even if the code remains identical.

To achieve true reproducibility, you must treat your data as a versioned artifact. This means moving away from a mutable file like "data.csv" and toward an explicitly versioned snapshot like "data_v1.2.3.parquet".
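One way to make a data file behave like a versioned artifact is to copy it to a name derived from its contents. The sketch below is a minimal illustration, not a replacement for a tool like DVC; the helper name snapshot_dataset and the `<stem>_<hash prefix><suffix>` naming scheme are assumptions made for this example.

```python
import hashlib
import shutil
from pathlib import Path


def snapshot_dataset(src_path, snapshot_dir):
    """Copy a data file to an immutable, content-addressed name.

    Hypothetical helper: the naming scheme is an assumption for
    illustration, not a standard.
    """
    src = Path(src_path)
    digest = hashlib.sha256()
    with open(src, "rb") as f:
        # Read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)

    target_dir = Path(snapshot_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{src.stem}_{digest.hexdigest()[:12]}{src.suffix}"

    # Identical content maps to the same name, so an existing snapshot is reused
    if not target.exists():
        shutil.copy2(src, target)
    return target
```

Because the name depends on the bytes in the file, editing the source data produces a new snapshot instead of silently replacing the old one.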

The Pattern: Immutable Artifact Linking

The core pattern for production systems is to store a manifest file alongside your serialized pipeline. This manifest should contain:

Git Commit Hash: The exact version of your pipeline code.
Data URI/Hash: A unique identifier for the training data (e.g., an S3 path or a DVC hash).
Environment Signature: A pinned dependency file (such as a requirements.txt with exact versions or a conda.yaml) ensuring dependency parity.
Model Metrics: The performance scores achieved during validation.
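The four fields above can be sketched as a small data structure. This is one possible layout under stated assumptions: the class name LineageManifest, the helper package_versions, and the field names are hypothetical, and the commit and hash values are placeholders to be filled in by your own pipeline.

```python
import json
import platform
from dataclasses import dataclass, asdict, field
from importlib import metadata


def package_versions(names):
    """Return installed versions for the given distributions (None if absent)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


@dataclass
class LineageManifest:
    # Field names are assumptions chosen to mirror the list above
    git_commit: str
    data_uri: str
    data_hash: str
    metrics: dict
    environment: dict = field(default_factory=dict)

    def to_json(self):
        return json.dumps(asdict(self), indent=4, sort_keys=True)


manifest = LineageManifest(
    git_commit="<commit hash goes here>",  # placeholder, not a real hash
    data_uri="data/train_v1.parquet",
    data_hash="<data hash goes here>",  # placeholder, not a real hash
    metrics={"f1": 0.88},
    environment={
        "python": platform.python_version(),
        "packages": package_versions(["scikit-learn", "numpy"]),
    },
)
print(manifest.to_json())
```

Recording installed package versions directly in the manifest is a lightweight complement to a lock file: it captures what was actually present at training time, even if the lock file later drifts.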

Worked Example: Linking Models to Data

We’ll build a simple ModelRegistry structure. Instead of just saving a .pkl file, we save a folder containing the model and its metadata.


PYTHON
import joblib
import json
import hashlib
import os
from datetime import datetime, timezone

def save_model_with_lineage(model, data_path, metrics, output_dir):
    # 1. Hash the data file in chunks so large files never have to fit in memory
    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)

    # 2. Gather metadata (UTC timestamp so records compare across machines)
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_source": data_path,
        "data_hash": digest.hexdigest(),
        "hash_algorithm": "sha256",
        "metrics": metrics,
        "model_class": model.__class__.__name__,
    }

    # 3. Save the model and its manifest together as one package
    os.makedirs(output_dir, exist_ok=True)
    joblib.dump(model, os.path.join(output_dir, "model.pkl"))
    with open(os.path.join(output_dir, "metadata.json"), "w") as f:
        json.dump(metadata, f, indent=4)

    print(f"Model and lineage saved to {output_dir}")

# Usage:
# save_model_with_lineage(my_pipeline, "data/train_v1.parquet", {"f1": 0.88}, "models/v001")

By keeping the metadata.json physically alongside the model.pkl, you ensure that anyone (or any automated system) can audit exactly what went into the model. This is the foundation for version control for ML experiments.
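Auditing works in the other direction too: given a saved package, you can check whether the data on disk is still the data the model was trained on. The following is a minimal sketch that assumes the metadata.json layout from the worked example. The function name verify_lineage and the optional "hash_algorithm" key are assumptions; when the key is absent, the sketch falls back to MD5.

```python
import hashlib
import json
import os


def verify_lineage(model_dir, data_path=None):
    """Return True if the data file still matches the hash in metadata.json."""
    with open(os.path.join(model_dir, "metadata.json")) as f:
        metadata = json.load(f)

    # Default to the recorded location; allow an override if the file has moved
    path = data_path or metadata["data_source"]

    # "hash_algorithm" is an assumed optional key; fall back to MD5 when absent
    digest = hashlib.new(metadata.get("hash_algorithm", "md5"))
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)

    return digest.hexdigest() == metadata["data_hash"]
```

A check like this could run before retraining or before a deployment, so that a silently modified training file is caught early.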

Hands-on Exercise: Create a Metadata Manifest

Your task is to extend the save_model_with_lineage function above. Modify the code to record:

The Git commit hash of the current repository (use subprocess.check_output(['git', 'rev-parse', 'HEAD']), which returns bytes, so decode and strip the result before storing it).
A list of the pipeline steps (e.g., list(model.named_steps.keys()), since a keys view is not JSON-serializable).

Then verify that the output directory contains both the .pkl and .json files.
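As a hint for the Git part of the exercise, note that the command fails when the code runs outside a repository or on a machine without Git installed. The sketch below shows one way to handle that; the helper name get_git_commit and the choice to return None on failure are assumptions, and you may prefer to raise an error instead so that untracked models are never saved.

```python
import subprocess


def get_git_commit(repo_dir="."):
    """Return the current commit hash as a string, or None if unavailable."""
    try:
        output = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a Git repository, or git is not installed
        return None
    # check_output returns bytes with a trailing newline
    return output.decode("ascii").strip()
```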

Pro-tip: If you are already using tools like MLflow or DVC, much of this bookkeeping is automated, but it is still worth understanding why they do it. As we discussed in hyperparameter stability analysis, knowing your model's provenance is what makes it possible to debug performance drops in production.

Common Pitfalls

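The "Latest" Trap: Never overwrite a saved model in place. Always version the artifact, either in the file name (e.g., model_v1.pkl, model_v2.pkl) or in the directory that holds it (e.g., models/v001/model.pkl, as in the worked example). If you overwrite, you lose the ability to roll back.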
Ignoring Environment Drift: Even if your data and code are the same, a different version of scikit-learn or numpy can produce different numerical outputs. Always track your environment state.
Assuming Data is Static: Data in production often evolves. If you don't track the version of the data, you won't be able to distinguish between a model failure and a data quality issue when you eventually encounter data drift.
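To guard against the "Latest" trap mechanically, the output directory can be allocated by code instead of typed by hand. This is a minimal sketch assuming the models/v001 directory convention from the worked example; the helper name next_version_dir is hypothetical, and it does not protect against two processes allocating a version at the same moment beyond raising an error on collision.

```python
import os
import re


def next_version_dir(models_root):
    """Create and return the next unused directory such as models/v002."""
    os.makedirs(models_root, exist_ok=True)
    existing = [
        int(m.group(1))
        for name in os.listdir(models_root)
        if (m := re.fullmatch(r"v(\d+)", name))
    ]
    next_version = max(existing, default=0) + 1
    path = os.path.join(models_root, f"v{next_version:03d}")
    # exist_ok=False makes a collision raise FileExistsError instead of overwriting
    os.makedirs(path, exist_ok=False)
    return path
```

On an empty models root the first call creates v001 and the next creates v002, so an existing model package is never reused as an output location.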

Recap

Versioning is the backbone of production-ready machine learning. By linking your models to their source data via metadata manifests, you transform your pipeline from a "black box" into a traceable, auditable engineering process. This discipline ensures that your final project review is based on facts, not assumptions.

Up next: Designing Inference APIs — we'll wrap our versioned models in a production-ready HTTP service.
