Outliers can derail your model’s performance. Learn to identify them using the IQR method and decide when to cap or remove them for better model accuracy.
Previously in this course, we covered Feature Engineering Strategies to create more meaningful inputs for our models. While better features help, they can't save a model if your dataset is polluted with extreme values. This lesson adds the critical step of managing outliers to ensure your model learns general patterns rather than chasing noise.
In machine learning, outliers are data points that deviate significantly from the rest of your observations. If your feature is "House Price" and most homes cost between $200k and $800k, a $50M mansion is an outlier.
If you don't address these, your model, especially the linear models we discussed in The Mechanics of Linear Regression, will bend toward that single extreme point. Least-squares fitting penalizes squared errors, so one very large residual can dominate the loss. This pulls the "line of best fit" away from the bulk of your data, leading to poor generalization.
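A minimal sketch of that pull, using made-up numbers rather than the project dataset: the sizes, prices, and the single $50M point below are all hypothetical, and `np.polyfit` stands in for a linear regression fit.

```python
import numpy as np

# Hypothetical toy data: house size (in 1000 sq ft) vs. price (in $1000s).
size = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
price = np.array([200.0, 300.0, 400.0, 500.0, 600.0, 700.0])  # exactly 200 * size

# Least-squares line on the clean data (polyfit returns slope first for deg=1).
slope_clean, intercept_clean = np.polyfit(size, price, deg=1)

# Add one extreme point: an ordinary-sized home priced at $50M.
size_out = np.append(size, 3.0)
price_out = np.append(price, 50_000.0)
slope_out, intercept_out = np.polyfit(size_out, price_out, deg=1)

print(f"slope without outlier: {slope_clean:.1f}")
print(f"slope with outlier:    {slope_out:.1f}")
```

With only seven points, the one extreme value changes the fitted slope by more than an order of magnitude, even though the other six points lie on a perfect line.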
Standard deviation is sensitive to the very outliers you are trying to find. Instead, we use robust statistics that rely on percentiles, specifically the Interquartile Range (IQR).
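To see why percentiles are the safer yardstick, the sketch below compares how much the standard deviation and the IQR change when one extreme value is appended. The income figures are hypothetical, and the `iqr` helper is defined here only for illustration.

```python
import numpy as np

# Hypothetical incomes in $1000s.
clean = np.array([40, 45, 50, 55, 60, 65, 70, 75, 80], dtype=float)
# Same data plus a single extreme value.
with_outlier = np.append(clean, 5_000.0)

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

# How much does each spread measure grow once the outlier is added?
std_ratio = with_outlier.std() / clean.std()
iqr_ratio = iqr(with_outlier) / iqr(clean)

print(f"std grew by a factor of {std_ratio:.1f}")
print(f"IQR grew by a factor of {iqr_ratio:.2f}")
```

The standard deviation inflates dramatically, which would widen any "mean plus or minus k standard deviations" cutoff and help hide the outlier, while the IQR barely moves.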
The IQR is the distance between the 25th percentile (Q1) and the 75th percentile (Q3). Any point that falls more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3 is considered a potential outlier.
Let’s use pandas to identify and handle these values in our project dataset.
```python
import pandas as pd
import numpy as np

# Load your project data
df = pd.read_csv("project_data.csv")

# Calculate IQR for a target feature, e.g., 'income'
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['income'] < lower_bound) | (df['income'] > upper_bound)]
print(f"Found {len(outliers)} outliers.")
```
Once detected, you have two primary strategies:
Removal: drop the rows that fall outside the bounds entirely.
Capping: set any value above upper_bound to the upper_bound value itself, and any value below lower_bound to lower_bound. This keeps the data point in your set but prevents it from exerting undue influence on the model.

```python
# Capping example
df['income'] = np.where(df['income'] > upper_bound, upper_bound, df['income'])
df['income'] = np.where(df['income'] < lower_bound, lower_bound, df['income'])
```
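Both strategies can be tried side by side on a small frame. This is a sketch under stated assumptions: the toy `income` values are invented, `between` is one possible way to express removal, and `clip` is an alternative to the `np.where` calls that has the same effect.

```python
import pandas as pd

# Hypothetical toy data standing in for the project's 'income' column.
df = pd.DataFrame(
    {"income": [32.0, 35.0, 38.0, 40.0, 41.0, 43.0, 45.0, 47.0, 50.0, 400.0]}
)

q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Removal: keep only rows whose income lies within the bounds (inclusive).
removed = df[df["income"].between(lower_bound, upper_bound)]

# Capping: clip values to the bounds and leave every row in place.
capped = df.assign(income=df["income"].clip(lower=lower_bound, upper=upper_bound))

print(f"rows: original={len(df)}, removed={len(removed)}, capped={len(capped)}")
print(f"max after capping: {capped['income'].max()}")
```

Removal shrinks the dataset by one row here, while capping keeps all ten rows and pulls the extreme value down to the upper bound. Working on copies (`removed`, `capped`) rather than overwriting `df` makes it easy to compare both versions before committing to one.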
Before you decide to drop or cap, always visualize. A boxplot is the industry standard for this. If you’ve followed Exploratory Data Analysis Fundamentals, you know that a boxplot clearly shows the "whiskers" marking the bounds of normal data, with outliers appearing as individual dots beyond them.
If the dots are sparse and far away, you have a clear case for removal. If they are clustered near the whiskers, consider capping.
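A boxplot for one feature takes only a few lines. The sketch below assumes matplotlib is installed and uses invented `income` values; the `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the project's 'income' column.
income = pd.Series(
    [32.0, 35.0, 38.0, 40.0, 41.0, 43.0, 45.0, 47.0, 50.0, 400.0], name="income"
)

fig, ax = plt.subplots(figsize=(3, 5))
# whis=1.5 (matplotlib's default) places the whiskers using the 1.5 * IQR rule.
artists = ax.boxplot(income.to_numpy(), whis=1.5)
ax.set_ylabel("income")

# Points drawn beyond the whiskers ("fliers") are the candidate outliers.
flier_values = list(artists["fliers"][0].get_ydata())
print(f"Points beyond the whiskers: {flier_values}")

# Call plt.show() or fig.savefig(...) here to view the chart.
plt.close(fig)
```

Reading the flier values back from the plot is a quick cross-check that the dots you see match the rows flagged by the IQR bounds.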
Use matplotlib or seaborn to confirm the presence of outliers.
If you apply StandardScaler later, remember that it is highly sensitive to outliers. Always handle your outliers before scaling your features.
We've moved from raw data inspection to active cleaning. By using the IQR, we establish a robust, objective way to identify outliers. Whether you choose to cap or remove them depends on the nature of your data, but the goal remains the same: preventing extreme values from skewing your model's performance.
Up next: We will explore the Bias-Variance Tradeoff, where we'll learn why balancing model complexity is the secret to building high-performing, reliable predictors.