Mahamudul Hasan Rubel
Lesson 25 of the AI/ML Foundations: Core Concepts & First Models course
AI/ML · June 25, 2026 · 3 min read

Handling Outliers: A Guide to Robust Data Cleaning for ML

Outliers can derail your model’s performance. Learn to identify them using the IQR method and decide when to cap or remove them for better model accuracy.

Tags: AI/ML, data cleaning, IQR, robust statistics, outliers, machine learning, pandas, ai, machine-learning, python

Previously in this course, we covered Feature Engineering Strategies to create more meaningful inputs for our models. While better features help, they can't save a model if your dataset is polluted with extreme values. This lesson adds the critical step of managing outliers to ensure your model learns general patterns rather than chasing noise.

The Problem with Outliers

In machine learning, outliers are data points that deviate significantly from the rest of your observations. If your feature is "House Price" and most homes cost between $200k and $800k, a $50M mansion is an outlier.

If you don't address these, your model (especially the linear models we discussed in The Mechanics of Linear Regression) will work hard to reduce the error on that single extreme point. Least-squares fitting penalizes large residuals disproportionately, so one extreme value can pull the "line of best fit" away from the bulk of your data, leading to poor generalization.
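
To make the pull concrete, here is a minimal sketch that fits a least-squares line with and without a single extreme point. The sizes and prices are hypothetical toy values chosen for illustration, and `np.polyfit` stands in for whichever linear model you are using.

```python
import numpy as np

# Hypothetical toy data: house size (1000 sq ft) vs. price (in $1000s).
size = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
price = np.array([200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0])

# Least-squares line through the typical homes only.
clean_slope, clean_intercept = np.polyfit(size, price, deg=1)

# Add one $50M mansion of moderate size and refit.
size_out = np.append(size, 4.5)
price_out = np.append(price, 50_000.0)
out_slope, out_intercept = np.polyfit(size_out, price_out, deg=1)

print(f"Slope without outlier: {clean_slope:.1f}")
print(f"Slope with outlier:    {out_slope:.1f}")
```

The second slope is far steeper than the first, even though seven of the eight points are unchanged.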

Detecting Outliers with IQR

Standard deviation is sensitive to the very outliers you are trying to find. Instead, we use robust statistics that rely on percentiles, specifically the Interquartile Range (IQR).
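
A quick way to see the difference is to compute both statistics on the same data with and without one extreme value. The incomes below are hypothetical, and `iqr` is a small helper defined only for this sketch.

```python
import pandas as pd

# Hypothetical incomes in $1000s; the appended value is an extreme observation.
clean = pd.Series([42, 45, 48, 50, 52, 55, 58, 60, 63, 65], dtype=float)
with_outlier = pd.concat([clean, pd.Series([5000.0])], ignore_index=True)

def iqr(s: pd.Series) -> float:
    """Distance between the 75th and 25th percentiles."""
    return s.quantile(0.75) - s.quantile(0.25)

summary = pd.DataFrame(
    {
        "std": [clean.std(), with_outlier.std()],
        "iqr": [iqr(clean), iqr(with_outlier)],
    },
    index=["clean", "with_outlier"],
)
print(summary)
```

The standard deviation grows by orders of magnitude once the extreme value is added, while the IQR barely moves.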

The IQR is the distance between the 25th percentile (Q1) and the 75th percentile (Q3). Any point that falls more than 1.5 times the IQR below Q1 or above Q3 is considered a potential outlier.

Worked Example: IQR Filtering

Let’s use pandas to identify and handle these values in our project dataset.

PYTHON
import pandas as pd
import numpy as np

# Load your project data
df = pd.read_csv("project_data.csv")

# Calculate IQR for a target feature, e.g., 'income'
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['income'] < lower_bound) | (df['income'] > upper_bound)]
print(f"Found {len(outliers)} outliers.")
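
If you want to try the same steps without a CSV on hand, the sketch below wraps the bounds calculation in a reusable function and runs it on a small in-memory frame. The name `iqr_bounds`, its `k` parameter, and the toy values are assumptions made for illustration, not part of the project dataset.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond Q1 and Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical stand-in for the 'income' column (values in $1000s).
toy = pd.DataFrame(
    {"income": [42.0, 45.0, 48.0, 50.0, 52.0, 55.0, 58.0, 60.0, 63.0, 65.0, 5000.0]}
)

lower, upper = iqr_bounds(toy["income"])

# between() is inclusive, so only values strictly outside the fences are flagged.
is_outlier = ~toy["income"].between(lower, upper)
print(toy[is_outlier])
```

Keeping the result as a boolean mask, rather than filtering immediately, lets you inspect the flagged rows before deciding what to do with them.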

Deciding: Cap or Remove?

Once detected, you have two primary strategies:

  1. Removal: Best if the outlier is a data entry error (e.g., a negative age or a price of $0). It’s "trash" data that provides no signal.
  2. Capping (Winsorization): Best if the outlier is a valid but extreme observation. By capping, you set all values above the upper_bound to the upper_bound value itself. This keeps the data point in your set but prevents it from exerting undue influence on the model.
PYTHON
# Capping example
df['income'] = np.where(df['income'] > upper_bound, upper_bound, df['income'])
df['income'] = np.where(df['income'] < lower_bound, lower_bound, df['income'])
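
For comparison, here is a minimal sketch of both strategies side by side on a toy frame. The data and the `age` column are hypothetical. The capping step uses pandas' `Series.clip`, which is one alternative to the `np.where` calls and bounds both tails in a single call.

```python
import pandas as pd

# Hypothetical stand-in for the project data (income in $1000s).
df_toy = pd.DataFrame(
    {
        "income": [42.0, 45.0, 48.0, 50.0, 52.0, 55.0, 58.0, 60.0, 63.0, 65.0, 5000.0],
        "age": [25, 31, 38, 42, 29, 51, 46, 35, 60, 27, 44],
    }
)

q1 = df_toy["income"].quantile(0.25)
q3 = df_toy["income"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Removal: keep only rows whose income lies within the fences.
removed = df_toy[df_toy["income"].between(lower_bound, upper_bound)]

# Capping: clip() returns a new Series, so the original frame is left untouched.
capped = df_toy.assign(
    income=df_toy["income"].clip(lower=lower_bound, upper=upper_bound)
)

print(len(df_toy), len(removed), len(capped))
print(capped["income"].max())
```

Removal shrinks the frame by one row and discards that row's `age` value along with it. Capping keeps every row and only changes the extreme income.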

Visualizing the Impact

Before you decide to drop or cap, always visualize. A boxplot is the industry standard for this. If you’ve followed Exploratory Data Analysis Fundamentals, you know that a boxplot clearly shows the "whiskers" marking the bounds of normal data, with outliers appearing as individual dots beyond them.

If the dots are sparse and far from the whiskers, inspect them individually: points that turn out to be errors are a clear case for removal. If they are clustered near the whiskers, consider capping.
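
A before-and-after comparison can be sketched with matplotlib as follows. The data, the output file name `income_boxplots.png`, and the choice of the non-interactive `Agg` backend are assumptions made so that the script runs anywhere. In a notebook you would likely drop the backend line and call `plt.show()` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical incomes in $1000s, with one mild and one extreme outlier.
income = pd.Series(
    [42, 45, 48, 50, 52, 55, 58, 60, 63, 65, 95, 5000], dtype=float, name="income"
)

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
capped = income.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Draw the raw and capped distributions side by side.
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(8, 4))
ax_before.boxplot(income.to_numpy())
ax_before.set_title("Before capping")
ax_after.boxplot(capped.to_numpy())
ax_after.set_title("After capping")
fig.tight_layout()
fig.savefig("income_boxplots.png")
```

In the left panel the extreme value compresses the box into a thin line near the bottom of the plot. In the right panel the capped values sit at the whisker end, and the box becomes readable again.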

Hands-on Exercise

  1. Select one numerical feature in your project dataset that you suspect has outliers.
  2. Generate a boxplot using matplotlib or seaborn to confirm the presence of outliers.
  3. Calculate the IQR and define your upper and lower bounds.
  4. Apply the capping method (Winsorization) to the feature.
  5. Create a new boxplot to verify that the extreme dots have been pulled into the range.

Common Pitfalls

  • Assuming all outliers are errors: Sometimes the outlier is the most important signal (e.g., fraud detection). Never drop data blindly without understanding its context.
  • Applying global removal: Don't remove rows based on one feature outlier if that row has valid, important data in other columns. Capping is often safer than dropping rows.
  • Ignoring scaling: If you plan to use StandardScaler later, remember that it is highly sensitive to outliers. Always handle your outliers before scaling your features.
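
The scaling pitfall can be illustrated with a short sketch. To keep it dependency-light, the `standardize` helper below is a hypothetical stand-in that applies the same formula as StandardScaler's defaults (subtract the mean, divide by the population standard deviation). The income values are toy data.

```python
import pandas as pd

# Hypothetical incomes in $1000s; the last value is an extreme observation.
income = pd.Series([42, 45, 48, 50, 52, 55, 58, 60, 63, 65, 5000], dtype=float)

def standardize(s: pd.Series) -> pd.Series:
    """Z-score using the population std (ddof=0), as StandardScaler does by default."""
    return (s - s.mean()) / s.std(ddof=0)

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
capped = income.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

raw_z = standardize(income)
capped_z = standardize(capped)

# Spread of the ten typical values after scaling, with and without capping first.
raw_range = raw_z.iloc[:10].max() - raw_z.iloc[:10].min()
capped_range = capped_z.iloc[:10].max() - capped_z.iloc[:10].min()
print(f"Scaled without capping: {raw_range:.3f}")
print(f"Capped, then scaled:    {capped_range:.3f}")
```

Without capping, the ten typical incomes are squeezed into a tiny sliver of the scaled range because the extreme value dominates the mean and standard deviation. Capping first leaves them spread over a useful range.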

Recap

We've moved from raw data inspection to active cleaning. By using the IQR, we establish a robust, objective way to identify outliers. Whether you choose to cap or remove them depends on the nature of your data, but the goal remains the same: preventing extreme values from skewing your model's performance.

Up next: We will explore the Bias-Variance Tradeoff, where we'll learn why balancing model complexity is the secret to building high-performing, reliable predictors.
