Mahamudul Hasan Rubel

Lesson 19 of the Intermediate Machine Learning: Real-World Pipelines course
AI/ML · June 25, 2026 · 4 min read

Advanced Metrics for Imbalanced Datasets: MCC and Kappa

Learn to evaluate models on imbalanced data using the Matthews Correlation Coefficient and Cohen’s Kappa to avoid the traps of misleading accuracy.

Tags: imbalanced data, metrics, MCC, model evaluation, machine learning, ai, machine-learning, python

Previously in this course, we explored Confusion Matrices and Beyond and Mastering Precision-Recall Curves for Production ML Pipelines. While those tools provide excellent visibility into error types, they often leave you with a trade-off decision: which single number summarizes the "goodness" of a model when your classes are highly skewed?

When you face severe class imbalance, standard accuracy is almost useless. If 99% of your transactions are legitimate, a model that predicts "legitimate" for every single case hits 99% accuracy while missing every single fraud attempt. In this lesson, we move beyond simple ratios to metrics that account for the base rate of your classes.
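To make the accuracy trap concrete, here is a minimal sketch in plain Python. The 10,000-transaction dataset and the always-"legitimate" model are illustrative assumptions, not data from a real system.

```python
# Illustrative data: 10,000 transactions, 1% of them fraudulent (label 1).
n_total = 10_000
n_fraud = 100
y_true = [0] * (n_total - n_fraud) + [1] * n_fraud

# A "lazy" model that labels every transaction as legitimate (0).
y_pred = [0] * n_total

# Accuracy: share of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall on the fraud class: share of actual frauds the model caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fraud_recall = caught / n_fraud

print(f"Accuracy: {accuracy:.2%}")          # Accuracy: 99.00%
print(f"Fraud recall: {fraud_recall:.2%}")  # Fraud recall: 0.00%
```

The model looks excellent on accuracy and is useless for the one task it exists for.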

The Problem with Traditional Metrics

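Most standard metrics give a one-sided view. Accuracy is dominated by the majority class, while precision, recall, and F1 are built around the positive class and never look at True Negatives. When the positive class is rare (the "needle in the haystack" problem), a high accuracy can hide the fact that your model has no predictive power over the minority class, and precision or recall on its own tells only half the story.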

To evaluate models rigorously, we need metrics that treat both classes as equally important or account for the likelihood of "guessing" correctly by chance.

Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient is arguably the best single-number metric for binary classification. It produces a value between -1 and +1, where:

  • +1: Perfect prediction.
  • 0: No better than random guessing.
  • -1: Total disagreement between prediction and observation.

Unlike F1-score, which ignores True Negatives, MCC uses all four quadrants of the confusion matrix (TP, TN, FP, FN). It is calculated as:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

Why use MCC?

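Because it incorporates True Negatives, MCC is only high when the model does well on both classes, which makes it far more robust to imbalance than accuracy or F1. If your model predicts the majority class for everything, TP and FP are both zero, so the numerator and the denominator are both zero and the formula is undefined (0/0). By convention, and in scikit-learn's matthews_corrcoef, this case is reported as 0, correctly signaling that the model has learned nothing.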
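The formula is easy to check by hand. The sketch below computes MCC directly from the four confusion-matrix counts in plain Python; the helper name mcc_from_counts and the example counts are illustrative assumptions, and returning 0.0 when the denominator is zero is a convention chosen for this sketch.

```python
from math import sqrt


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any row or column total is zero the formula is 0/0.
    # Returning 0.0 here is a convention assumed by this sketch.
    if denominator == 0:
        return 0.0
    return numerator / denominator


# A reasonably good model on 950 negatives and 50 positives.
good = mcc_from_counts(tp=40, tn=940, fp=10, fn=10)

# A model that predicts the negative class for all 1,000 samples.
lazy = mcc_from_counts(tp=0, tn=950, fp=0, fn=50)

print(f"Good model MCC: {good:.4f}")  # Good model MCC: 0.7895
print(f"Lazy model MCC: {lazy:.4f}")  # Lazy model MCC: 0.0000
```

The lazy model scores 95% accuracy on this data, yet its MCC is 0.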

Cohen’s Kappa

Cohen’s Kappa measures the agreement between predicted and observed labels, corrected for the agreement that would occur by chance.

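If your dataset is 90% Class A, a model that randomly guesses "Class A" 90% of the time will achieve about 82% accuracy just by chance: 0.9 × 0.9 = 81% from agreeing on Class A, plus 0.1 × 0.1 = 1% from agreeing on Class B. Kappa subtracts this "chance agreement" from your actual accuracy and rescales the result so that perfect agreement still scores 1.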

  • Kappa = 1: Perfect agreement.
  • Kappa = 0: Agreement is exactly what you’d expect by random chance.
  • Negative values: The model performs worse than random.
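In symbols, with $p_o$ the observed agreement (plain accuracy) and $p_e$ the agreement expected by chance from the class totals:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

As a minimal sketch, the function below applies this definition to binary confusion-matrix counts. The name cohen_kappa_binary and the example counts are illustrative assumptions. Both proportions are kept scaled by n² so the arithmetic stays in integers until the final division.

```python
def cohen_kappa_binary(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's Kappa for a binary confusion matrix."""
    n = tp + tn + fp + fn
    # Observed agreement p_o, scaled by n^2.
    observed = n * (tp + tn)
    # Chance agreement p_e, scaled by n^2: for each class, multiply the
    # number of true members by the number of predicted members.
    expected = (tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)
    # p_e == 1 happens only when labels and predictions are all one class;
    # returning 0.0 for that undefined case is an assumption of this sketch.
    if expected == n * n:
        return 0.0
    return (observed - expected) / (n * n - expected)


# Random guesser: 900 of 1,000 samples are Class A (negative), and the
# model says "Class A" 90% of the time, independently of the true label.
random_guess = cohen_kappa_binary(tp=10, tn=810, fp=90, fn=90)

# A model that actually separates the classes on 950 negatives, 50 positives.
real_model = cohen_kappa_binary(tp=40, tn=940, fp=10, fn=10)

print(f"Random guesser Kappa: {random_guess:.4f}")  # Random guesser Kappa: 0.0000
print(f"Real model Kappa: {real_model:.4f}")        # Real model Kappa: 0.7895
```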

Worked Example: Evaluating an Imbalanced Pipeline

Let’s look at a scenario where we have 1,000 samples, 950 are negative, and 50 are positive.

PYTHON
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    matthews_corrcoef,
)

# Hypothetical results on 950 negatives and 50 positives:
# - 940 negatives predicted correctly (TN), 10 flagged by mistake (FP)
# - 40 positives predicted correctly (TP), 10 missed (FN)
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 940 + [1] * 10 + [0] * 10 + [1] * 40  # order: TN, FP, FN, TP

# Sanity check: the lists should encode the confusion matrix described above.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

accuracy = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"MCC: {mcc:.4f}")
print(f"Kappa: {kappa:.4f}")

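In this case, the accuracy is 98%, while MCC and Kappa both come out at about 0.79. The two scores coincide here only because the model predicts exactly as many positives (50) as the data contains; in general they differ. Either way, they give a more sober view of the model's performance relative to the data's skew: the 20 errors barely dent accuracy, but they amount to missing one in five positives.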

Hands-on Exercise

Take the code block above and modify the y_pred list to represent a "lazy" model that predicts the majority class (0) for every single input. Calculate the MCC and Kappa scores for this model. You should observe that accuracy stays at 95% while both scores drop to 0, showing that these metrics do not fall into the "accuracy trap" that affects simpler metrics.

Common Pitfalls

  1. Ignoring the Business Cost: MCC treats False Positives and False Negatives symmetrically, so it doesn't account for the cost of one versus the other. Always pair these metrics with a cost-sensitive analysis if the business impact of errors differs.
  2. Over-interpreting Kappa: Kappa is sensitive to the prevalence of the classes. If the class distribution in your test set is significantly different from your production environment, the "chance agreement" calculation may be misleading.
  3. Threshold Dependency: Remember that both MCC and Kappa are usually calculated on hard labels (0 or 1). If you are tuning thresholds, ensure you are evaluating your metrics across the probability spectrum, as discussed in Mastering Precision-Recall Curves for Production ML Pipelines.
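One way to act on the threshold pitfall is to sweep a range of cut-offs and compute MCC at each one. The sketch below does this on synthetic scores; the Beta-distributed scores, the seed, and the threshold grid are assumptions made for illustration, standing in for the predicted probabilities of a real model.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(seed=0)

# Synthetic, illustrative scores for 950 negatives and 50 positives.
y_true = np.array([0] * 950 + [1] * 50)
y_score = np.concatenate([
    rng.beta(2, 5, size=950),  # negatives tend toward low scores
    rng.beta(5, 2, size=50),   # positives tend toward high scores
])

# Convert scores to hard labels at each threshold and score them with MCC.
thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
mcc_by_threshold = [
    matthews_corrcoef(y_true, (y_score >= t).astype(int)) for t in thresholds
]

best_idx = int(np.argmax(mcc_by_threshold))
print(f"Best threshold: {thresholds[best_idx]:.2f}")
print(f"MCC at best threshold: {mcc_by_threshold[best_idx]:.4f}")
print(f"MCC at default 0.50: {mcc_by_threshold[9]:.4f}")
```

In a real pipeline, pick the threshold on validation data rather than on the test set, otherwise the reported MCC will be optimistic.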

Recap

When working with imbalanced data:

  • Accuracy is misleading because it is dominated by the majority class.
  • MCC is your best bet for a balanced, all-quadrant view of model performance.
  • Cohen’s Kappa is essential when you need to account for random chance agreement.
  • Use these metrics to compare models during cross-validation (see Introduction to Cross-Validation) to ensure your chosen model is truly learning the patterns, not just the base rates.
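As a rough sketch of that last point, MCC and Kappa can be plugged into scikit-learn's cross-validation through make_scorer. The synthetic dataset from make_classification, the logistic regression model, and all parameter values below are assumptions for illustration; substitute your own pipeline and data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: roughly 95% negatives and 5% positives.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    weights=[0.95, 0.05],
    random_state=42,
)

# Report accuracy alongside the two chance-aware metrics.
scoring = {
    "accuracy": "accuracy",
    "mcc": make_scorer(matthews_corrcoef),
    "kappa": make_scorer(cohen_kappa_score),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=scoring
)

for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name:>8}: mean={scores.mean():.4f} std={scores.std():.4f}")
```

Comparing the rows side by side shows how far accuracy can sit above MCC and Kappa on the same folds.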

Up next: Project Milestone: Building the Baseline Pipeline.
