Overfitting and Underfitting: Bias, Variance, and Model Health

Learn to diagnose overfitting and underfitting by mastering the concepts of bias and variance. Build models that generalize, not just memorize.


Previously in this course, we explored training error vs generalization error. While that lesson taught you how to compare performance metrics, it didn't explain why those gaps appear. In this lesson, we dive into the root causes: the relationship between model complexity, bias, and variance.

Understanding Bias and Variance from First Principles

Every machine learning model you build is an attempt to map input features to a target outcome. However, no model is perfect. The error a model makes on unseen data can be decomposed into three parts: bias, variance, and irreducible noise.
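For squared-error loss, this decomposition can be written out explicitly. The notation below is an assumption introduced for illustration and does not come from the lesson itself: $f$ is the true function, $\hat{f}$ is the model fitted on a randomly drawn training set, and the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$. The expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first two terms depend on the model you choose; the third does not.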

1. Bias: The Error of Simplicity

Bias measures how far a model's average prediction (averaged over the different training sets it could have been trained on) falls from the true value. A model with high bias makes strong, simplifying assumptions about the data.

The symptom: The model ignores relevant patterns.
The result: It performs poorly on both the training set and the test set. We call this underfitting. It’s like trying to model a complex, winding mountain road with a single straight line.
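To make the straight-line analogy concrete, here is a minimal sketch on synthetic data. The sine curve, noise level, and variable names are assumptions chosen for illustration and are not the course dataset. A straight line fitted to a curve should leave both errors far above the noise level.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic "winding road": a sine curve plus a little noise (noise variance 0.01).
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A straight line is too simple for this curve, so it underfits.
line = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, line.predict(X_train))
test_mse = mean_squared_error(y_test, line.predict(X_test))

print(f"Train MSE: {train_mse:.3f}")
print(f"Test MSE:  {test_mse:.3f}")
```

Both numbers should sit well above the noise variance of 0.01, which is the signature of high bias: the error is large even on data the model has already seen.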

2. Variance: The Error of Complexity

Variance measures how much the model's predictions change if you train it on a different subset of the same data. A model with high variance is overly sensitive to the specific noise or "quirks" in your training set.

The symptom: The model "memorizes" the training data rather than learning the underlying pattern.
The result: It performs exceptionally well on the training data but fails miserably on the test set. We call this overfitting. It’s like trying to model that same mountain road by drawing a jagged line that hits every single pebble on the path.
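Variance can be estimated empirically by doing what the definition says: retrain the same kind of model on different subsets and watch how much its predictions move. The sketch below assumes synthetic data and a hypothetical helper named `prediction_spread`; the degrees and subset size are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# One pool of noisy data; each round trains on a different random subset of it.
X_pool = rng.uniform(-1, 1, size=(200, 1))
y_pool = np.sin(3 * X_pool[:, 0]) + rng.normal(0, 0.3, size=200)

# Fixed inputs at which we compare predictions across retrained models.
X_grid = np.linspace(-0.9, 0.9, 50).reshape(-1, 1)


def prediction_spread(degree, n_rounds=100, subset_size=25):
    """Average variance of predictions across models trained on random subsets."""
    preds = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X_pool), size=subset_size, replace=False)
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
        model.fit(X_pool[idx], y_pool[idx])
        preds.append(model.predict(X_grid))
    # Variance across rounds at each grid point, then averaged over the grid.
    return float(np.mean(np.var(np.array(preds), axis=0)))


spread_simple = prediction_spread(degree=1)
spread_complex = prediction_spread(degree=10)

print(f"Prediction variance, degree 1:  {spread_simple:.4f}")
print(f"Prediction variance, degree 10: {spread_complex:.4f}")
```

The degree-10 model should show a noticeably larger spread: its predictions depend heavily on which 25 points it happened to see.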

Identifying the Signs: A Diagnostic Framework

To diagnose your model, you must compare performance metrics across your training and testing sets.

Condition	Training Error	Test Error	Complexity
Underfitting	High	High	Too Low
Overfitting	Low	High	Too High
Ideal Model	Low	Low	Balanced
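The table can be read as a simple decision rule. The function below is a hypothetical helper, not part of any library, and it assumes you can supply an `acceptable_error` threshold that separates "high" from "low" for your problem. Choosing that threshold is a judgment call that depends on the scale of your target.

```python
def diagnose_fit(train_error, test_error, acceptable_error):
    """Map a pair of errors onto the rows of the diagnostic table.

    `acceptable_error` is an assumed, problem-specific threshold:
    errors above it count as "High", errors at or below it as "Low".
    """
    if train_error > acceptable_error:
        # High training error: the model cannot even fit the data it has seen.
        return "underfitting"
    if test_error > acceptable_error:
        # Low training error but high test error: the model does not generalize.
        return "overfitting"
    # Both errors are low.
    return "balanced"


print(diagnose_fit(20_000, 20_000, acceptable_error=1_000))
print(diagnose_fit(100, 50_000, acceptable_error=1_000))
print(diagnose_fit(100, 120, acceptable_error=1_000))
```

In practice you would also look at the size of the gap rather than a single cutoff, but the rule captures the logic of the three rows.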

Worked Example: Spotting the Gap

Imagine we are predicting house prices. We use a polynomial regression model.


PYTHON
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

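# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split.
# degree=1 is a low-complexity model that tends to underfit;
# degree=15 is a high-complexity model that tends to overfit.
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X_train)  # fit the feature expansion on training data only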

model = LinearRegression()
model.fit(X_poly, y_train)

# Assessing the gap
train_preds = model.predict(X_poly)
test_preds = model.predict(poly.transform(X_test))

print(f"Train MSE: {mean_squared_error(y_train, train_preds)}")
print(f"Test MSE: {mean_squared_error(y_test, test_preds)}")

If your Train MSE is 100 but your Test MSE is 50,000, you are looking at a classic case of overfitting. The model has chased the noise in the training data, losing its ability to generalize. Conversely, if both are 20,000 and that is far more error than the problem can tolerate, your model hasn't captured enough signal. That is underfitting.
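If you want to watch the gap open up yourself, the sketch below sweeps the polynomial degree on a small synthetic dataset. The data, the degrees, and the `results` dictionary are assumptions for illustration; this is not the house-price data, and the exact numbers will differ from the ones quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small synthetic dataset: a smooth curve plus noise.
rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

results = {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = {
        "train": mean_squared_error(y_train, model.predict(X_train)),
        "test": mean_squared_error(y_test, model.predict(X_test)),
    }
    print(
        f"degree={degree:>2}  "
        f"train MSE={results[degree]['train']:.4f}  "
        f"test MSE={results[degree]['test']:.4f}"
    )
```

Training error can only go down as the degree rises, because each model contains the simpler ones. Test error typically falls from degree 1 to degree 3 and then climbs again at degree 15, which is the gap described above.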

Hands-on Exercise

Take the baseline linear model you trained in our previous session.

Calculate the MSE for both your training set and your test set.
If the training MSE is significantly lower than the test MSE, your model is likely overfitting. What is one feature you could remove to simplify the model?
If both MSE values are high, your model is likely underfitting. What is one feature you could add or transform (e.g., squaring a value) to help the model capture more signal?
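As a hint for the underfitting case, here is a sketch of what adding a squared feature can look like. It uses synthetic data with a made-up quadratic relationship, and `train_and_test_mse` is a hypothetical helper; your baseline model and features will be different.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic target that depends on the square of the feature.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300)
y = x**2 + rng.normal(0, 1.0, size=300)

X_base = x.reshape(-1, 1)           # original feature only
X_aug = np.column_stack([x, x**2])  # original feature plus its square


def train_and_test_mse(X, y):
    """Fit a linear model on a fixed split and return (train MSE, test MSE)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return (
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )


base_train, base_test = train_and_test_mse(X_base, y)
aug_train, aug_test = train_and_test_mse(X_aug, y)

print(f"Baseline:       train MSE={base_train:.2f}, test MSE={base_test:.2f}")
print(f"With x squared: train MSE={aug_train:.2f}, test MSE={aug_test:.2f}")
```

The model is still linear in its inputs; only the inputs changed. When both errors drop together like this, the extra feature was supplying signal the simpler model could not express.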

Common Pitfalls

Confusing high bias with high variance: Remember, high bias = "I don't know enough," high variance = "I'm obsessed with these specific examples."
Ignoring the noise: Some of the error on the test set is irreducible noise (inherent randomness in the data), and it sets a floor that no model can get below. You cannot eliminate it, so don't spend weeks trying to "fix" a model that has already hit the theoretical performance ceiling of the dataset.
Adding complexity too early: Don't jump to complex models (like deep neural networks) if a simple linear model hasn't been properly tuned first. Start simple, then add complexity only when the data demands it.
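The irreducible-noise pitfall is easy to check on synthetic data, where the noise level is known. The sketch below assumes a linear relationship with noise variance 4.0 (both invented for illustration) and fits a model with exactly the right form. Even then, the test MSE should settle near the noise variance rather than near zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
noise_std = 2.0  # assumed noise level, so the noise variance is 4.0


def make_data(n):
    """Draw n points from y = 2x + 1 plus Gaussian noise."""
    X = rng.uniform(0, 10, size=(n, 1))
    y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, noise_std, size=n)
    return X, y


X_train, y_train = make_data(5000)
X_test, y_test = make_data(5000)

# The model family matches the true relationship, so bias is essentially zero
# and, with this much data, variance is tiny. What remains is noise.
model = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))

print(f"Test MSE:       {test_mse:.2f}")
print(f"Noise variance: {noise_std**2:.2f}")
```

If your own model's error has plateaued and more complexity or more data no longer helps, a floor like this is one plausible explanation.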

Recap

Overfitting and underfitting represent the two ends of the model complexity spectrum. By comparing training and test error, looking at both how high they are and how far apart they are, you can diagnose whether your model suffers from high bias (underfitting) or high variance (overfitting). Your goal is to find the "sweet spot" where the model is complex enough to capture the signal but simple enough to ignore the noise.

Up next: We will learn how to quantify these errors using specific Regression Evaluation Metrics like RMSE and R-squared to make your diagnostic process more precise.
