Advanced Metrics for Imbalanced Datasets: MCC and Kappa

Learn to evaluate models on imbalanced data using the Matthews Correlation Coefficient and Cohen’s Kappa to avoid the traps of misleading accuracy.

imbalanced data, metrics, MCC, model evaluation, machine learning, ai, machine-learning, python

Previously in this course, we explored Confusion Matrices and Beyond and Mastering Precision-Recall Curves for Production ML Pipelines. While those tools provide excellent visibility into error types, they often leave you with a trade-off decision: which single number summarizes the "goodness" of a model when your classes are highly skewed?

When you face severe class imbalance, standard accuracy is almost useless. If 99% of your transactions are legitimate, a model that predicts "legitimate" for every single case hits 99% accuracy while missing every single fraud attempt. In this lesson, we move beyond simple ratios to metrics that account for the base rate of your classes.
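To make the trap concrete, here is a minimal sketch in plain Python. The 10,000-transaction dataset and the always-"legitimate" model are made up for illustration; only the 99%/1% split comes from the scenario above.

```python
# Hypothetical data: 10,000 transactions, 1% of them fraudulent (label 1).
y_true = [0] * 9_900 + [1] * 100

# A "lazy" model that predicts "legitimate" (label 0) for every transaction.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: share of actual frauds the model caught.
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = frauds_caught / sum(y_true)

print(f"Accuracy:     {accuracy:.2%}")
print(f"Fraud recall: {fraud_recall:.2%}")
```

The accuracy looks excellent while the recall on the class we actually care about is zero, which is the gap the metrics in this lesson are meant to expose.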

The Problem with Traditional Metrics

Most standard metrics are asymmetric. Precision, recall, and F1 are built around the positive class and never count True Negatives, while accuracy weights every sample equally and is therefore dominated by the majority class. When the positive class is rare (the "needle in the haystack" problem), a high accuracy can hide the fact that your model has no predictive power over the minority class.

To evaluate models rigorously, we need metrics that treat both classes as equally important or account for the likelihood of "guessing" correctly by chance.

Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient is arguably the best single-number metric for binary classification. It produces a value between -1 and +1, where:

+1: Perfect prediction.
0: No better than random guessing.
-1: Total disagreement between prediction and observation.

Unlike F1-score, which ignores True Negatives, MCC uses all four quadrants of the confusion matrix (TP, TN, FP, FN). It is calculated as:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
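As a rough illustration of the formula, the sketch below computes MCC and F1 straight from the four counts, then swaps which class is called "positive". The helper names (`mcc_from_counts`, `f1_from_counts`) and the counts themselves are hypothetical, and returning 0 for a zero denominator is a convention assumed here, not part of the formula.

```python
from math import sqrt


def mcc_from_counts(tp, tn, fp, fn):
    """MCC computed directly from the four confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: report 0 when the formula is undefined (0/0)
    return (tp * tn - fp * fn) / denominator


def f1_from_counts(tp, tn, fp, fn):
    """F1 score; tn is accepted for a uniform signature but never used."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts with the rare class treated as "positive".
original = dict(tp=40, tn=940, fp=10, fn=10)
# The same predictions with the labels flipped: TP and TN trade places,
# and so do FP and FN (both are 10 here).
swapped = dict(tp=940, tn=40, fp=10, fn=10)

mcc_original, mcc_swapped = mcc_from_counts(**original), mcc_from_counts(**swapped)
f1_original, f1_swapped = f1_from_counts(**original), f1_from_counts(**swapped)

print(f"MCC: {mcc_original:.4f} (original) vs {mcc_swapped:.4f} (swapped)")
print(f"F1:  {f1_original:.4f} (original) vs {f1_swapped:.4f} (swapped)")
```

MCC is unchanged by the relabeling because the formula is symmetric in the two classes, while F1 shifts noticeably depending on which class you declare positive.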

Why use MCC?

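Because it incorporates True Negatives and is high only when the model does well on both classes, MCC is much harder to inflate through imbalance than accuracy or F1. If your model predicts the majority class for everything, TP and FP are both zero, so the numerator is zero. The denominator is zero as well, which leaves the formula undefined (0/0); the usual convention, which scikit-learn's matthews_corrcoef follows, is to report 0 in that case, correctly signaling that the model has learned nothing.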
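The majority-class case is worth checking by hand, because the denominator of the MCC formula vanishes along with the numerator. The snippet below is a small sketch with made-up counts (950 negatives, 50 positives, everything predicted negative); `mcc_parts` is a hypothetical helper, and mapping the undefined 0/0 result to 0 is a convention, assumed here explicitly.

```python
from math import sqrt


def mcc_parts(tp, tn, fp, fn):
    """Return the numerator and denominator of the MCC formula separately."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator, denominator


# Hypothetical lazy model: no positive predictions at all, so TP = FP = 0.
numerator, denominator = mcc_parts(tp=0, tn=950, fp=0, fn=50)

# 0/0 is undefined; by convention we report 0 ("no better than chance").
mcc = numerator / denominator if denominator else 0.0

print(f"numerator={numerator}, denominator={denominator}, MCC={mcc}")
```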

Cohen’s Kappa

Cohen’s Kappa measures the agreement between predicted and observed labels, corrected for the agreement that would occur by chance.

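If your dataset is 90% Class A, a model that randomly guesses "Class A" 90% of the time will agree with the true labels about 82% of the time just by chance (0.9 × 0.9 = 0.81 on Class A, plus 0.1 × 0.1 = 0.01 on the other class). Kappa subtracts this "chance agreement" $p_e$ from your observed accuracy $p_o$ and rescales by the largest possible improvement over chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$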

Kappa = 1: Perfect agreement.
Kappa = 0: Agreement is exactly what you’d expect by random chance.
Negative values: The model performs worse than random.
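To see where the chance correction comes from, here is a minimal from-scratch sketch using the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected if predictions were independent of the true labels. The function name and the counts are hypothetical, and the guard for p_e = 1 is an assumption made for this sketch.

```python
def cohen_kappa_from_counts(tp, tn, fp, fn):
    """Return (observed agreement, chance agreement, kappa) for binary counts."""
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n  # plain accuracy

    # Marginal rates: how often each side says "positive".
    p_true_pos = (tp + fn) / n
    p_pred_pos = (tp + fp) / n

    # Agreement expected if predictions were independent of the true labels.
    p_chance = p_true_pos * p_pred_pos + (1 - p_true_pos) * (1 - p_pred_pos)

    if p_chance == 1:
        return p_observed, p_chance, 0.0  # assumption: degenerate case reported as 0
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, p_chance, kappa


# Hypothetical model on 1,000 samples with 50 true positives in the data.
p_observed, p_chance, kappa = cohen_kappa_from_counts(tp=30, tn=920, fp=30, fn=20)

print(f"Observed agreement: {p_observed:.3f}")
print(f"Chance agreement:   {p_chance:.3f}")
print(f"Kappa:              {kappa:.3f}")
```

Most of the observed agreement in this example is already explained by chance, so Kappa ends up far lower than the raw accuracy.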

Worked Example: Evaluating an Imbalanced Pipeline

Let’s look at a scenario where we have 1,000 samples, 950 are negative, and 50 are positive.


PYTHON
from sklearn.metrics import cohen_kappa_score, confusion_matrix, matthews_corrcoef

# Hypothetical results on 1,000 samples (950 negative, 50 positive):
#   TN = 940 negatives predicted correctly, FP = 10 negatives flagged as positive
#   FN = 10 positives missed, TP = 40 positives predicted correctly

# Labels are laid out in blocks so y_true and y_pred line up as TN, FP, FN, TP.
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 940 + [1] * 10 + [0] * 10 + [1] * 40

# Sanity-check the four counts before trusting the summary metrics.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)

print(f"MCC: {mcc:.4f}")
print(f"Kappa: {kappa:.4f}")

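In this case, the accuracy is 98%, but MCC and Kappa both come out at roughly 0.79. The gap reflects the 10 missed positives out of 50, which accuracy barely registers. The two scores coincide here only because FP = FN, so the predicted and true class proportions match; in general they differ.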

Hands-on Exercise

Take the code block above and modify the y_pred list to represent a "lazy" model that predicts the majority class (0) for every single input. Calculate the MCC and Kappa scores for this model. You should observe that both scores are 0 even though accuracy is still 95% (for MCC, scikit-learn returns 0 because the formula's denominator is zero). This shows how both metrics expose the "accuracy trap" that simple metrics fall into.

Common Pitfalls

Ignoring the Business Cost: MCC gives a more balanced summary than accuracy, but it does not account for the cost of a False Positive vs. a False Negative. Always pair these metrics with a cost-sensitive analysis if the business impact of the two error types differs.
Over-interpreting Kappa: Kappa is sensitive to the prevalence of the classes. If the class distribution in your test set is significantly different from your production environment, the "chance agreement" calculation may be misleading.
Threshold Dependency: Both MCC and Kappa are computed on hard labels (0 or 1), so their values depend on the decision threshold you apply to predicted probabilities. If you are tuning thresholds, recompute the metrics at each candidate threshold rather than only at the default 0.5, as discussed in Mastering Precision-Recall Curves for Production ML Pipelines.
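The threshold point can be sketched with a small sweep: binarize the scores at several cut-offs and recompute MCC each time. The scores, labels, and threshold grid below are invented for illustration, and `mcc_at_threshold` is a hypothetical helper rather than a library function.

```python
from math import sqrt

# Hypothetical true labels and predicted probabilities for 12 samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.60, 0.55, 0.90]


def mcc_at_threshold(y_true, scores, threshold):
    """Binarize scores at `threshold`, then compute MCC from the counts."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: undefined 0/0 is reported as 0
    return (tp * tn - fp * fn) / denominator


# Sweep thresholds 0.1, 0.2, ..., 0.9 and keep the MCC at each one.
thresholds = [t / 10 for t in range(1, 10)]
mcc_by_threshold = {t: mcc_at_threshold(y_true, scores, t) for t in thresholds}

best_threshold = max(mcc_by_threshold, key=mcc_by_threshold.get)

for t, value in mcc_by_threshold.items():
    print(f"threshold={t:.1f}  MCC={value:.3f}")
print(f"Best threshold on this toy data: {best_threshold:.1f}")
```

On this toy data the default 0.5 cut-off is not the one that maximizes MCC. In practice, pick the threshold on a validation split, not on the test set you report results from.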

Recap

When working with imbalanced data:

Accuracy is misleading because it is dominated by the majority class: a model can score highly while ignoring the minority class entirely.
MCC is your best bet for a balanced, all-quadrant view of model performance.
Cohen’s Kappa is essential when you need to account for random chance agreement.
Use these metrics to compare models during your Introduction to Cross-Validation cycles to ensure your chosen model is truly learning the patterns, not just the base rates.
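As a sketch of how this might look inside a cross-validation loop, the example below scores one model with accuracy, MCC, and Kappa side by side using scikit-learn's `cross_validate` and `make_scorer`. The synthetic dataset, the logistic regression model, and the fold settings are placeholder assumptions, not part of the course pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced dataset: roughly 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Wrap the metric functions so cross_validate can call them on each fold.
scoring = {
    "accuracy": "accuracy",
    "mcc": make_scorer(matthews_corrcoef),
    "kappa": make_scorer(cohen_kappa_score),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)

for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name:>8}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Stratified folds are used here so that every fold contains some positives; with plain K-fold on very skewed data, a fold with too few positives can make both metrics unstable.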

Up next: Project Milestone: Building the Baseline Pipeline.
