Learn how to evaluate model calibration using calibration curves and the Brier score. Ensure your predicted probabilities are accurate representations of reality.
Previously in this course, we covered Managing Model Complexity: Pruning and Regularization Strategies to prevent overfitting. Now that your model is stable, we need to answer a critical question: when your model says there is an 80% chance of an event occurring, does it happen 80% of the time?
Many beginners treat classification models as simple "Yes/No" machines. In production, however, we often care about the confidence behind each prediction. If your model is poorly calibrated, a high predicted probability might not mean high certainty, which can lead to disastrous business decisions.
A model is calibrated if its predicted probabilities align with the actual observed frequency of the positive class. If you take all samples where the model predicted 0.7 probability, approximately 70% of those samples should be positive.
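To make the definition concrete, here is a minimal sketch using simulated data. It assumes a hypothetical, perfectly calibrated model: each outcome is drawn with exactly the probability that was "predicted". The bin edges (0.65 to 0.75) and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, perfectly calibrated "model" (an assumption for illustration):
# each label is positive with exactly the predicted probability.
y_prob = rng.uniform(0.0, 1.0, size=100_000)
y_true = rng.random(100_000) < y_prob

# Take all samples whose predicted probability is close to 0.7 ...
near_07 = (y_prob >= 0.65) & (y_prob < 0.75)

# ... and measure how often the positive class actually occurred.
observed_rate = y_true[near_07].mean()
print(f"Samples in bin: {near_07.sum()}, observed positive rate: {observed_rate:.3f}")
```

For a calibrated model the observed rate should land close to 0.7; a real model that reports 0.7 but shows a noticeably different rate in this bin is miscalibrated there.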
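To measure this, we use two primary tools: the calibration curve (also called a reliability diagram), which plots the observed fraction of positives against the mean predicted probability in each bin, and the Brier score, which is the mean squared difference between the predicted probabilities and the actual 0/1 outcomes (lower is better).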
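As a quick sanity check on the Brier score, the sketch below computes it by hand as the mean of squared differences between predicted probability and outcome. The four labels and probabilities are made up for illustration.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.3])

# Brier score: mean squared difference between probability and outcome.
# Squared errors here are 0.01, 0.01, 0.04 and 0.09.
brier_manual = np.mean((y_prob - y_true) ** 2)
print(f"Brier score: {brier_manual:.4f}")  # 0.0375
```

For binary labels, scikit-learn's brier_score_loss(y_true, y_prob) computes the same quantity.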
Let's use scikit-learn to evaluate the calibration of a classifier.
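```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Stand-in data and model so the snippet runs end to end; swap in your own.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# y_test are the true labels; y_prob are the predicted probabilities
# of the positive class.
y_prob = model.predict_proba(X_test)[:, 1]

# 1. Calculate Brier Score (lower is better; 0 is a perfect score)
brier = brier_score_loss(y_test, y_prob)
print(f"Brier Score: {brier:.4f}")

# 2. Generate Calibration Curve
#    prob_true: observed fraction of positives per bin
#    prob_pred: mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

# 3. Visualize
disp = CalibrationDisplay(prob_true, prob_pred, y_prob)
disp.plot()
plt.title("Calibration Curve")
plt.show()
```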
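The plot puts the mean predicted probability on the x-axis and the observed fraction of positives on the y-axis, so a perfectly calibrated model follows the diagonal. If your curve bows below the diagonal, your model is "overconfident": the predicted probabilities are higher than the actual frequency. If it bows above, the model is "underconfident": events happen more often than the model predicts.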
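To see what a curve below the diagonal looks like in numbers, here is a small simulation. The "model" is hypothetical: it reports the square root of the true probability, which is never smaller than the true probability on [0, 1], so it systematically over-predicts.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# True event probabilities and outcomes drawn from them.
p_true = rng.uniform(0.0, 1.0, size=20_000)
y_true = (rng.random(20_000) < p_true).astype(int)

# Hypothetical over-predicting model: sqrt(p) >= p for p in [0, 1].
y_prob = np.sqrt(p_true)

# prob_true: observed fraction of positives; prob_pred: mean prediction per bin.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

for observed, predicted in zip(prob_true, prob_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

In each bin the observed frequency falls short of the mean prediction, which is exactly the pattern that appears as a curve under the diagonal.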
You shouldn't always use 0.5 as your decision threshold. If your model is well-calibrated, you can pick a threshold based on the cost of errors.
For instance, if you are predicting fraud, missing a fraudulent transaction might be expensive. You might choose a lower threshold (e.g., 0.3) to flag more potential fraud, accepting more false positives to ensure you catch more actual fraud. Because the model is calibrated, you know that a 0.3 probability actually corresponds to a 30% risk, allowing you to make a mathematically sound trade-off.
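One common way to turn error costs into a threshold is sketched below. It assumes the probabilities are calibrated, that correct decisions cost nothing, and that each type of error has a fixed cost. The cost values and variable names are hypothetical, picked so the result matches the 0.3 example above.

```python
import numpy as np

# Hypothetical costs, chosen only for illustration.
cost_false_positive = 3.0  # e.g. reviewing a legitimate transaction
cost_false_negative = 7.0  # e.g. missing a fraudulent transaction

# Flag a case when the expected cost of ignoring it, p * cost_false_negative,
# exceeds the expected cost of flagging it, (1 - p) * cost_false_positive.
# Solving for p gives the break-even threshold:
threshold = cost_false_positive / (cost_false_positive + cost_false_negative)

# Made-up calibrated probabilities for four transactions.
y_prob = np.array([0.05, 0.25, 0.35, 0.80])
flags = y_prob >= threshold

print(f"Threshold: {threshold:.2f}")
print(f"Flagged: {flags.tolist()}")
```

The rule only makes sense if a predicted 0.3 really means a 30% risk, which is why calibration comes first.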
If the curve shows that your model is poorly calibrated, you can use sklearn.calibration.CalibratedClassifierCV to wrap your model. This uses techniques like isotonic regression or Platt scaling to "fix" the probabilities.
Calibration is about the integrity of your probability estimates. By using the Brier score as a quantitative metric and calibration curves for visual debugging, you ensure that your model’s output can be trusted for real-world decision-making. When your model is calibrated, you can move beyond simple binary predictions and start optimizing for the actual costs and benefits of your business logic.
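The CalibratedClassifierCV wrapper mentioned above can be sketched as follows. The synthetic dataset, the choice of GaussianNB as the base model, and the cv=5 setting are assumptions made for illustration, not recommendations. Whether the Brier score improves depends on your data, so compare before and after on a held-out set.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=2, n_redundant=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Uncalibrated baseline.
base = GaussianNB().fit(X_train, y_train)
raw_prob = base.predict_proba(X_test)[:, 1]

# Wrapped model: method="sigmoid" is Platt scaling;
# method="isotonic" selects isotonic regression instead.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
cal_prob = calibrated.predict_proba(X_test)[:, 1]

print(f"Brier score (raw):        {brier_score_loss(y_test, raw_prob):.4f}")
print(f"Brier score (calibrated): {brier_score_loss(y_test, cal_prob):.4f}")
```

Isotonic regression is more flexible than Platt scaling but generally needs more data to avoid overfitting, so it is worth trying both and comparing the calibration curves.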
Up next: We will explore how to use more sophisticated methods to find the optimal settings for your models with Advanced Hyperparameter Search.
Evaluating Model Calibration