Data drift occurs when production data shifts away from your training baseline. Learn to calculate the Population Stability Index and set up alerts to catch it.
Previously in this course, we covered Input Validation and Schema Enforcement for ML Pipelines, which ensures your model receives the correct data types. Now, we move beyond schema structure to address the content of that data: data drift.
Even if your API receives perfectly formatted JSON, your model can fail if the underlying distribution of the features changes. This phenomenon, often called "silent failure," occurs when the world changes: user behavior shifts, seasonal trends emerge, or upstream systems change their data generation process. In this lesson, we will implement methods to detect these shifts quantitatively.
Data drift (or covariate shift) is the change in the distribution of input features ($P(X)$) between the training set and the production set.
If your model was trained on data where the average customer age was 35, but your production traffic shifts to an average of 55, the model is now operating in a region of the feature space it hasn't "seen" during training. Because the model's decision boundaries were optimized for the 35-year-old distribution, performance will likely degrade.
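To make the age example concrete, here is a minimal sketch that simulates it. The standard deviation of 8 years, the sample sizes, and the 99th-percentile cutoff are assumptions chosen for illustration, not values from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated customer ages (assumed spread of 8 years in both samples).
train_age = rng.normal(35, 8, 10_000)  # training baseline, mean 35
prod_age = rng.normal(55, 8, 10_000)   # production traffic, mean 55

# Upper edge of the region the model saw during training
# (99th percentile of the training ages, an illustrative cutoff).
upper = np.percentile(train_age, 99)

# Share of production samples that fall beyond that edge.
share_outside = np.mean(prod_age > upper)

print(f"Training 99th percentile: {upper:.1f}")
print(f"Production share above it: {share_outside:.1%}")
```

Under these assumptions, roughly half or more of the production samples land above the range that covered 99% of the training data, which is the "unseen region" described above.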
The Population Stability Index (PSI) is a widely used metric for measuring how much a variable's distribution has changed over time. It quantifies the difference between two distributions by binning the data and comparing the percentage of samples in each bin.
The formula for PSI, summed over the bins $i$, is: $$\mathrm{PSI} = \sum_{i} \left(\text{Actual}\%_i - \text{Expected}\%_i\right) \times \ln\left(\frac{\text{Actual}\%_i}{\text{Expected}\%_i}\right)$$
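As a quick worked example of the formula, the sketch below evaluates it for three bins. The bin shares are made-up illustrative numbers, not output from a real feature.

```python
import numpy as np

# Hypothetical bin shares for a three-bin feature (illustrative numbers only).
expected_pct = np.array([0.5, 0.3, 0.2])  # reference (training) shares
actual_pct = np.array([0.4, 0.3, 0.3])    # current (production) shares

# Per-bin contribution: (Actual% - Expected%) * ln(Actual% / Expected%)
contributions = (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
psi_total = contributions.sum()

for i, c in enumerate(contributions):
    print(f"bin {i}: {c:.4f}")
print(f"PSI: {psi_total:.4f}")
```

The unchanged middle bin contributes nothing, and the total comes to about 0.063. No bin contributes a negative value, because the difference and the logarithm always share the same sign.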
To implement this, we need to compare a "reference" dataset (usually your training set) against a "current" dataset (the latest batch of production data). We'll use numpy to bin the data and compute the PSI.
```python
import numpy as np


def calculate_psi(expected, actual, buckets=10):
    """Compute the PSI of `actual` against the reference sample `expected`."""

    def get_bin_percentages(data, bins):
        # Clip to the reference range so out-of-range values land in the
        # edge bins instead of being silently dropped by np.histogram.
        clipped = np.clip(data, bins[0], bins[-1])
        counts, _ = np.histogram(clipped, bins=bins)
        return counts / len(data)

    # Define bin edges from the reference (expected) distribution
    _, bins = np.histogram(expected, bins=buckets)

    expected_pct = get_bin_percentages(expected, bins)
    actual_pct = get_bin_percentages(actual, bins)

    # Floor each share at a small epsilon to avoid division by zero or log(0)
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)

    psi_values = (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
    return np.sum(psi_values)


# Example usage
train_data = np.random.normal(0, 1, 1000)
prod_data = np.random.normal(0.2, 1.1, 1000)  # Slightly shifted

psi = calculate_psi(train_data, prod_data)
print(f"PSI: {psi:.4f}")
```
In a production environment, you shouldn't manually run this script. You need an automated monitoring loop.
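One possible shape for such a check is sketched below. The names `check_drift` and `psi`, and the feature dictionaries, are hypothetical. The 0.10 and 0.25 cutoffs are a commonly quoted rule of thumb and an assumption here, so tune them for your own features. In practice a scheduler would call the check on each new production batch and route any "alert" status to your alerting system.

```python
import numpy as np

# Rule-of-thumb thresholds (assumption; tune them per feature).
WARN_THRESHOLD = 0.10
ALERT_THRESHOLD = 0.25


def psi(expected, actual, buckets=10):
    """Compact PSI helper, a variant of the lesson's calculate_psi."""
    _, bins = np.histogram(expected, bins=buckets)
    # Clip production values into the reference range so none are dropped.
    actual = np.clip(actual, bins[0], bins[-1])
    exp_pct = np.histogram(expected, bins=bins)[0] / len(expected)
    act_pct = np.histogram(actual, bins=bins)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 0.0001, None)
    act_pct = np.clip(act_pct, 0.0001, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


def check_drift(reference, current):
    """Return {"psi", "status"} for every feature in the reference dict."""
    report = {}
    for name, ref_values in reference.items():
        score = psi(ref_values, current[name])
        if score >= ALERT_THRESHOLD:
            status = "alert"
        elif score >= WARN_THRESHOLD:
            status = "warn"
        else:
            status = "ok"
        report[name] = {"psi": round(score, 4), "status": status}
    return report


# Simulated batches: "age" drifts, "income" does not (illustrative data).
rng = np.random.default_rng(42)
reference = {
    "age": rng.normal(35, 8, 5000),
    "income": rng.normal(50_000, 10_000, 5000),
}
current = {
    "age": rng.normal(55, 8, 5000),
    "income": rng.normal(50_000, 10_000, 5000),
}

report = check_drift(reference, current)
for feature, result in report.items():
    print(f"{feature}: PSI={result['psi']} -> {result['status']}")
```

Keeping the thresholds as named constants makes it easy to adjust them per feature once you have seen how noisy the PSI is on stable data.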
Using the `calculate_psi` function provided above, perform the following:

1. Generate two datasets: `baseline` (normal distribution) and `drifted` (normal distribution with a different mean).
2. Compute the PSI for several `buckets` values; how does changing the number of bins affect the sensitivity of the PSI?

Monitoring data drift is not just about tracking numbers; it's about maintaining the contract between your model and the reality it predicts. By using the Population Stability Index, you can provide a rigorous, statistical basis for when it's time to trigger a model refresh.
Up next: Tracking Performance Degradation — we will bridge the gap between input drift and actual model accuracy decay.