Learn to evaluate models on imbalanced data using the Matthews Correlation Coefficient and Cohen’s Kappa to avoid the traps of misleading accuracy.
Previously in this course, we explored "Confusion Matrices and Beyond" and "Mastering Precision-Recall Curves for Production ML Pipelines". While those tools provide excellent visibility into error types, they often leave you with a trade-off decision: which single number summarizes the "goodness" of a model when your classes are highly skewed?
When you face severe class imbalance, standard accuracy is almost useless. If 99% of your transactions are legitimate, a model that predicts "legitimate" for every single case hits 99% accuracy while missing every single fraud attempt. In this lesson, we move beyond simple ratios to metrics that account for the base rate of your classes.
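The fraud example above can be reproduced in a few lines. This is a minimal sketch in plain Python so the arithmetic stays visible; the dataset size (10,000 transactions) and the label encoding (1 for fraud, 0 for legitimate) are assumptions chosen only for illustration.

```python
# Hypothetical data: 10,000 transactions, 1% fraud (label 1), 99% legitimate (label 0).
y_true = [1] * 100 + [0] * 9_900

# A "lazy" model that predicts "legitimate" for every transaction.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: fraction of actual frauds that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"Accuracy:     {accuracy:.2%}")
print(f"Fraud recall: {recall:.2%}")
```

The accuracy looks excellent while the recall on the class you actually care about is zero, which is the gap the metrics in this lesson are meant to expose.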
Most standard metrics are asymmetric in what they emphasize. Accuracy is dominated by the majority class, while precision and recall are defined around the positive class and ignore True Negatives. When the positive class is rare (the "needle in the haystack" problem), a high accuracy can hide the fact that your model has no predictive power over the minority class.
To evaluate models rigorously, we need metrics that treat both classes as equally important or account for the likelihood of "guessing" correctly by chance.
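The Matthews Correlation Coefficient (MCC) is arguably the best single-number metric for binary classification. It produces a value between -1 and +1, where +1 indicates perfect prediction, 0 indicates performance no better than random guessing, and -1 indicates total disagreement between predictions and true labels.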
Unlike F1-score, which ignores True Negatives, MCC uses all four quadrants of the confusion matrix (TP, TN, FP, FN). It is calculated as:
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
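Because it incorporates True Negatives, MCC is far less sensitive to class imbalance than accuracy or F1. If your model predicts the majority class for everything, TP and FP are both zero, so the numerator and the denominator both become zero and the formula is undefined. By convention, and in scikit-learn's matthews_corrcoef, the score is reported as 0 in this case, correctly signaling that the model has learned nothing.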
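To make the formula concrete, here is a minimal sketch that computes MCC directly from the four confusion-matrix counts. The function name `mcc_from_counts` and the example counts are hypothetical, and returning 0.0 when the denominator is zero is a convention assumed here, not something the formula itself dictates.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: a zero denominator (e.g. the model never
        # predicts one of the classes) is reported as 0.0.
        return 0.0
    return numerator / denominator


# Hypothetical model with some skill on a 950/50 split.
example = mcc_from_counts(tp=30, tn=900, fp=50, fn=20)

# "Lazy" model: predicts the majority (negative) class for everything.
lazy = mcc_from_counts(tp=0, tn=950, fp=0, fn=50)

print(f"Example model MCC: {example:.4f}")
print(f"Lazy model MCC:    {lazy:.4f}")
```

Working from counts rather than label arrays keeps the link to the formula explicit; in a real pipeline you would more likely call a library implementation on the labels themselves.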
Cohen’s Kappa measures the agreement between predicted and observed labels, corrected for the agreement that would occur by chance.
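If your dataset is 90% Class A, a model that randomly guesses "Class A" 90% of the time will agree with the true labels about 82% of the time just by chance: 0.9 × 0.9 = 81% from Class A plus 0.1 × 0.1 = 1% from the other class. Kappa corrects for this. With observed agreement (accuracy) $p_o$ and chance agreement $p_e$, it is defined as:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
A model that only matches chance agreement scores 0, and a perfect model scores 1.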
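The chance correction can be sketched from confusion-matrix counts as follows. The function name `kappa_from_counts` is hypothetical, the counts are the expected outcome of the random guesser described above on an assumed 1,000 samples (Class A treated as the negative class), and the handling of the degenerate case is an assumption.

```python
def kappa_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's Kappa for a binary confusion matrix."""
    n = tp + tn + fp + fn

    # Observed agreement: plain accuracy.
    p_o = (tp + tn) / n

    # Chance agreement: probability that prediction and label coincide if
    # they were independent but kept the same class frequencies.
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)  # both say "positive"
    p_neg = ((tn + fp) / n) * ((tn + fn) / n)  # both say "negative"
    p_e = p_pos + p_neg

    if p_e == 1.0:
        # Assumption: if only one class ever appears in both labels and
        # predictions, there is nothing to correct, so report 0.0.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Expected counts for a random guesser on 900 Class A (negative) and
# 100 Class B (positive) samples, guessing Class A 90% of the time.
chance_kappa = kappa_from_counts(tp=10, tn=810, fp=90, fn=90)
chance_accuracy = (10 + 810) / 1000

print(f"Accuracy of random guesser: {chance_accuracy:.2f}")
print(f"Kappa of random guesser:    {abs(chance_kappa):.4f}")
```

The guesser reaches 82% accuracy, yet its Kappa is zero up to floating-point rounding, because all of that agreement is explained by the class frequencies.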
Let’s look at a scenario where we have 1,000 samples, 950 are negative, and 50 are positive.
```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef

# Hypothetical results:
#   40 of 50 positives predicted correctly (TP), 10 missed (FN)
#   940 of 950 negatives predicted correctly (TN), 10 flagged wrongly (FP)
y_true = [0] * 950 + [1] * 50
# Same order as y_true: 940 TN, 10 FP, then 10 FN, 40 TP (simplified representation)
y_pred = [0] * 940 + [1] * 10 + [0] * 10 + [1] * 40

# Sanity-check the counts; for binary labels ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
print(f"MCC: {mcc:.4f}")
print(f"Kappa: {kappa:.4f}")
```
In this case, the accuracy is 98%, but MCC and Kappa both come out at about 0.79. The gap reflects the 10 missed positives and 10 false alarms, which are large relative to the 50 positive samples even though they barely move accuracy.
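For readers who want to check the scores by hand, the arithmetic for this scenario (TP = 40, TN = 940, FP = 10, FN = 10) works out as follows. Here $p_o$ denotes the observed agreement (accuracy) and $p_e$ the agreement expected by chance from the class frequencies; these symbols are introduced only for this worked example.

$$MCC = \frac{40 \times 940 - 10 \times 10}{\sqrt{(40+10)(40+10)(940+10)(940+10)}} = \frac{37500}{47500} \approx 0.789$$

$$p_o = \frac{40 + 940}{1000} = 0.98, \qquad p_e = \frac{950}{1000} \times \frac{950}{1000} + \frac{50}{1000} \times \frac{50}{1000} = 0.905$$

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.075}{0.095} \approx 0.789$$

The two scores coincide here because the model predicts exactly as many positives (50) as there are in the data; in general they differ.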
Take the code block above and modify the y_pred list to represent a "lazy" model that predicts the majority class (0) for every single input. Calculate the MCC and Kappa scores for this model. You should observe that both scores drop to 0 even though accuracy stays at 95%, showing that these metrics do not fall into the "accuracy trap" that catches simpler metrics.
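When working with imbalanced data, do not rely on accuracy alone: report MCC or Cohen’s Kappa alongside the confusion matrix, since both account for the base rate of your classes.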
Up next: Project Milestone: Building the Baseline Pipeline.