Handling Outliers: A Guide to Robust Data Cleaning for ML

Outliers can derail your model’s performance. Learn to identify them using the IQR method and decide when to cap or remove them for better model accuracy.


Previously in this course, we covered Feature Engineering Strategies to create more meaningful inputs for our models. While better features help, they can't save a model if your dataset is polluted with extreme values. This lesson adds the critical step of managing outliers to ensure your model learns general patterns rather than chasing noise.

The Problem with Outliers

In machine learning, outliers are data points that deviate significantly from the rest of your observations. If your feature is "House Price" and most homes cost between $200k and $800k, a $50M mansion is an outlier.

If you don't address these, your model, especially the linear models we discussed in The Mechanics of Linear Regression, will bend toward them. Least-squares fitting penalizes the squared error, so a single extreme point contributes disproportionately to the loss. This pulls the "line of best fit" away from the bulk of your data, leading to poor generalization.
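
To make the pull concrete, here is a minimal sketch on synthetic data (the numbers are made up for illustration and are not from the project dataset). It fits a straight line with NumPy's least-squares `polyfit`, then corrupts one observation and fits again.

```python
import numpy as np

# Synthetic data that follows y = 2x + 1 exactly, for x = 0..9
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Ordinary least-squares fit of a straight line
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt a single observation with an extreme value (the true value is 19)
y_outlier = y.copy()
y_outlier[-1] = 200.0

slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"clean fit:    slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier: slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

One changed value out of ten is enough to move the fitted slope far from the slope that generated the other nine points.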

Detecting Outliers with IQR

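The mean and standard deviation are sensitive to the very outliers you are trying to find: one extreme value inflates both, which can hide that value from a rule based on distance from the mean. Instead, we use robust statistics that rely on percentiles, specifically the Interquartile Range (IQR).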
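
A quick way to see the difference is to compute both statistics with and without one extreme value. The sketch below uses made-up house prices in thousands of dollars, echoing the mansion example above; `spread` is a hypothetical helper name.

```python
import numpy as np

# Made-up house prices in $k, all within the typical 200-800 range
prices = np.array([200, 250, 300, 350, 400, 450, 500, 550, 600, 800], dtype=float)

# The same prices plus a single $50M mansion
prices_with_mansion = np.append(prices, 50_000.0)

def spread(a):
    """Return (standard deviation, IQR) for a 1-D array."""
    q1, q3 = np.percentile(a, [25, 75])
    return a.std(), q3 - q1

std_clean, iqr_clean = spread(prices)
std_out, iqr_out = spread(prices_with_mansion)

print(f"std: {std_clean:.1f} -> {std_out:.1f}")
print(f"IQR: {iqr_clean:.1f} -> {iqr_out:.1f}")
```

The standard deviation grows by orders of magnitude, while the IQR barely moves because the quartiles depend only on the middle of the distribution.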

The IQR is the distance between the 25th percentile (Q1) and the 75th percentile (Q3). Any point that falls more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3 is considered a potential outlier.

Worked Example: IQR Filtering

Let’s use pandas to identify and handle these values in our project dataset.


PYTHON
import pandas as pd
import numpy as np

# Load your project data
df = pd.read_csv("project_data.csv")

# Calculate IQR for a target feature, e.g., 'income'
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['income'] < lower_bound) | (df['income'] > upper_bound)]
print(f"Found {len(outliers)} outliers.")
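
If you want to try the same logic without a CSV on disk, the sketch below wraps it in a small helper and runs it on made-up incomes. The function name `iqr_bounds` and its `k` parameter are illustrative assumptions, not part of pandas.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond Q1 and Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes in $k; the last value is deliberately extreme
income = pd.Series([32, 35, 38, 40, 41, 43, 45, 47, 52, 400], name="income")

lower, upper = iqr_bounds(income)
is_outlier = (income < lower) | (income > upper)

print(f"bounds: [{lower}, {upper}]")
print(income[is_outlier])
```

Keeping `k` as a parameter makes it easy to experiment with a stricter or looser fence; 1.5 is a convention rather than a law.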

Deciding: Cap or Remove?

Once detected, you have two primary strategies:

Removal: Best if the outlier is a data entry error (e.g., a negative age or a price of $0). It’s "trash" data that provides no signal.
Capping (Winsorization): Best if the outlier is a valid but extreme observation. By capping, you set all values above the upper_bound to the upper_bound value itself, and all values below the lower_bound to the lower_bound. This keeps the data point in your set but prevents it from exerting undue influence on the model.


PYTHON
# Capping example: clip values to the IQR bounds computed above
df['income'] = df['income'].clip(lower=lower_bound, upper=upper_bound)
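
The two strategies can also be combined. The following is a minimal sketch on made-up data, assuming a domain rule that age cannot be negative: it removes the row that breaks the rule, then caps the remaining valid-but-extreme income. The column names and the rule are illustrative assumptions.

```python
import pandas as pd

# Made-up data: one impossible age (-5) and one extreme but plausible income (400)
df = pd.DataFrame({
    "age": [25, 31, -5, 47, 52, 38],
    "income": [40, 45, 42, 50, 400, 44],
})

# Removal: drop only rows that violate a domain rule
cleaned = df[df["age"] >= 0].copy()

# Capping: clip remaining incomes to the IQR fences
q1, q3 = cleaned["income"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned["income_capped"] = cleaned["income"].clip(
    lower=q1 - 1.5 * iqr,
    upper=q3 + 1.5 * iqr,
)

print(cleaned)
```

Writing the capped values to a new column, rather than overwriting `income`, keeps the raw values available in case you need to revisit the decision.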

Visualizing the Impact

Before you decide to drop or cap, always visualize. A boxplot is the standard tool for this. If you’ve followed Exploratory Data Analysis Fundamentals, you know that the box spans Q1 to Q3 and that, with the default settings in matplotlib and seaborn, the "whiskers" extend to the most extreme points within 1.5 times the IQR of the box, with outliers appearing as individual dots beyond them.

If the dots are sparse and far from the whiskers, check whether they are errors; confirmed errors are a clear case for removal. If they are clustered near the whiskers, they are more likely valid extremes, so consider capping.
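
As a rough sketch of that before-and-after check, the code below draws two boxplots side by side with matplotlib. The incomes are made up, the output filename is arbitrary, and the `Agg` backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Made-up incomes in $k; the last value is deliberately extreme
income = pd.Series([32, 35, 38, 40, 41, 43, 45, 47, 52, 400], name="income")

# Cap at the IQR fences
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
capped = income.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Draw the raw and capped distributions next to each other
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(8, 4))
ax_before.boxplot(income.to_numpy())
ax_before.set_title("Before capping")
ax_after.boxplot(capped.to_numpy())
ax_after.set_title("After capping")
fig.savefig("income_boxplots.png")
```

In the left panel the extreme income shows up as an isolated dot; in the right panel it sits at the end of the upper whisker.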

Hands-on Exercise

Select one numerical feature in your project dataset that you suspect has outliers.
Generate a boxplot using matplotlib or seaborn to confirm the presence of outliers.
Calculate the IQR and define your upper and lower bounds.
Apply the capping method (Winsorization) to the feature.
Create a new boxplot to verify that the extreme dots have been pulled into the range.

Common Pitfalls

Assuming all outliers are errors: Sometimes the outlier is the most important signal (e.g., fraud detection). Never drop data blindly without understanding its context.
Applying global removal: Don't remove rows based on one feature outlier if that row has valid, important data in other columns. Capping is often safer than dropping rows.
Ignoring scaling: If you plan to use StandardScaler later, remember that it is highly sensitive to outliers. Always handle your outliers before scaling your features.
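
The scaling pitfall is easy to demonstrate. The sketch below uses plain NumPy rather than scikit-learn: the hypothetical `standardize` helper applies the same per-feature formula as StandardScaler, (x - mean) / std, to made-up incomes before and after capping.

```python
import numpy as np

# Made-up incomes in $k; the last value is deliberately extreme
income = np.array([32, 35, 38, 40, 41, 43, 45, 47, 52, 400], dtype=float)

def standardize(a):
    """Z-score a 1-D array: subtract the mean, divide by the standard deviation."""
    return (a - a.mean()) / a.std()

# Cap at the IQR fences before scaling
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
capped = np.clip(income, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

z_raw = standardize(income)
z_capped = standardize(capped)

# Range covered by the nine typical values after scaling
typical_range_raw = z_raw[:-1].max() - z_raw[:-1].min()
typical_range_capped = z_capped[:-1].max() - z_capped[:-1].min()

print(f"typical range, scaled raw:    {typical_range_raw:.2f}")
print(f"typical range, scaled capped: {typical_range_capped:.2f}")
```

Without capping, the outlier inflates the standard deviation so much that the nine typical values are squeezed into a narrow band, leaving the model little variation to learn from.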

Recap

We've moved from raw data inspection to active cleaning. By using the IQR, we establish a robust, objective way to identify outliers. Whether you choose to cap or remove them depends on the nature of your data, but the goal remains the same: preventing extreme values from skewing your model's performance.

Up next: We will explore the Bias-Variance Tradeoff, where we'll learn why balancing model complexity is the secret to building high-performing, reliable predictors.
