Handling Missing and Inconsistent Data: A Practical Guide

Learn how to master data cleaning with Pandas. We cover detecting nulls, imputing missing values, and fixing inconsistent labels for better model performance.

Previously in this course, we covered Loading and Inspecting Datasets with Pandas: A Practical Guide. Now that you can load your data, this lesson adds the critical skill of data cleaning. Raw data is rarely "model-ready"; it arrives with missing values and inconsistent labels that can break your algorithms or bias your results.

Detecting Null Values

Before you can fix data, you must find where it's broken. In Pandas, missing data is typically represented as NaN (Not a Number) or None.

A quick way to detect these is to combine the .isnull() method (an alias of .isna()) with .sum(). This tells you exactly how many values are missing in each column:


PYTHON
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF']
})

# Detect nulls
print(df.isnull().sum())

This output gives you a clear audit of your dataset's health. If a column has 90% missing data, you might consider dropping it entirely. If it has 1%, you can likely fix it.
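Counts are easier to act on as percentages. The following is a minimal sketch that reuses the sample DataFrame; the DROP_THRESHOLD cutoff is a hypothetical value chosen only for illustration, not a recommended rule.

```python
import numpy as np
import pandas as pd

# Same sample DataFrame as above
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF'],
})

# isnull() yields booleans; their mean is the fraction of missing values
null_pct = df.isnull().mean() * 100
print(null_pct)

# Hypothetical cutoff, chosen only for illustration
DROP_THRESHOLD = 50.0
cols_to_drop = null_pct[null_pct > DROP_THRESHOLD].index.tolist()
print(cols_to_drop)
```

Here 'age' is 40% missing and 'city' is 20% missing, so neither column crosses the illustrative cutoff.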

Strategies for Missing Data: Drop or Impute

Once you've identified the gaps, you have two primary strategies: dropping (removing the rows/columns) or imputation (filling the gaps with estimated values).

1. Dropping Data

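If you have a massive dataset and the rows with missing values are a negligible share of it, dropping them is the simplest approach. It introduces little bias as long as the values are missing at random rather than concentrated in one segment of your data.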

df.dropna(): Removes rows with any missing values.
df.dropna(subset=['age']): Removes only the rows where the 'age' column is missing, regardless of gaps in other columns.
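
To see the difference between the two calls, here is a short sketch on the same sample DataFrame used earlier. The variable names complete_rows and age_known are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF'],
})

# Drop rows that have a missing value in any column
complete_rows = df.dropna()

# Drop rows only when 'age' is missing; a missing 'city' is tolerated
age_known = df.dropna(subset=['age'])

# dropna() returns a new DataFrame, so df itself is unchanged
print(len(df), len(complete_rows), len(age_known))
```

Of the five original rows, two survive the unrestricted call and three survive the call restricted to 'age'.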

2. Imputation

When data is expensive to collect, dropping rows is wasteful. Instead, we use imputation.

Mean/Median Imputation: Replace NaN with the mean or median of the column. Use this for numerical data; the median is less sensitive to outliers.
Mode Imputation: Replace NaN with the most frequent value. Use this for categorical data.


PYTHON
# Mean imputation for numerical data
df['age'] = df['age'].fillna(df['age'].mean())

# Mode imputation for categorical data
df['city'] = df['city'].fillna(df['city'].mode()[0])
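
When a numerical column contains extreme values, the choice between mean and median matters. The sketch below uses a hypothetical income column, invented purely for illustration, to show how far the two fill values can diverge.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value and one gap
incomes = pd.Series([30_000, 32_000, np.nan, 35_000, 31_000, 1_000_000])

# mean() and median() skip NaN by default
mean_fill = incomes.mean()
median_fill = incomes.median()
print(mean_fill, median_fill)

# Fill the gap with the median, which the outlier does not distort
incomes_filled = incomes.fillna(median_fill)
print(incomes_filled.tolist())
```

The mean is pulled up to 225,600 by the single outlier, while the median stays at 32,000, which is much closer to a typical row.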

Handling Inconsistent Categorical Labels

Data often contains human-entered errors. For example, a column might contain both "New York" and "NY". To a machine, these are distinct categories, so one real group is split into several smaller ones and its signal is diluted.

You can standardize these labels using the .replace() method or by mapping values:


PYTHON
# Standardize inconsistent labels
df['city'] = df['city'].replace({'NY': 'New York', 'N.Y.': 'New York'})

Always inspect your unique categories first using df['column_name'].unique() to see exactly what variations exist before you start replacing them.
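
As a rough sketch of that inspect-then-replace workflow, the example below normalizes case and whitespace before applying a mapping. The messy values and the canonical dictionary are hypothetical and invented for illustration; your own data will need its own mapping.

```python
import pandas as pd

# Hypothetical messy column, invented for illustration
cities = pd.Series(['NY', 'ny ', 'N.Y.', 'New York', 'LA', 'new york'])

# See the variants and how often each occurs
print(cities.value_counts(dropna=False))

# Normalize whitespace and case first so fewer variants need mapping
normalized = cities.str.strip().str.lower()

# Map known variants to one canonical label; unmapped values stay as they are
canonical = {
    'ny': 'New York',
    'n.y.': 'New York',
    'new york': 'New York',
    'la': 'LA',
}
cleaned = normalized.replace(canonical)

print(cleaned.unique())
```

Normalizing first means the mapping only has to cover three spellings of New York instead of five.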

Hands-on Exercise

1. Load a dataset where you know some values are missing (or create a small one using pd.DataFrame).
2. Calculate the total percentage of nulls in each column.
3. Choose one numerical column to impute with the median and one categorical column to clean using a label mapping.
4. Verify your work by running df.isnull().sum() again; it should return 0 for those columns.

Common Pitfalls

Imputing before splitting: Never calculate the mean or median on your entire dataset before splitting it into training and testing sets. This causes "data leakage," where information from the test set influences your training process. Always calculate statistics on the training set only.
Blindly dropping: If you drop rows where a specific feature is missing, you might accidentally remove an entire segment of your population, introducing significant selection bias.
Over-cleaning: Sometimes, a missing value is a signal itself (e.g., a missing "credit_score" might imply the user has no credit history). Consider creating a new binary feature like is_age_missing before imputing.
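
The first and third pitfalls can be avoided together. The following is a minimal sketch, assuming a simple random 80/20 split done with pandas alone; the data, the label column, and the split ratio are hypothetical, and in practice you might use a dedicated splitting utility instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan, 41, 35, np.nan, 28, 50],
    'label': [0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
})

# Record missingness before filling, so the signal is not lost
df['is_age_missing'] = df['age'].isnull().astype(int)

# Split first (a simple random 80/20 split using pandas only)
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Compute the statistic on the training rows only
train_median = train['age'].median()

# Apply the same training statistic to both splits
train = train.assign(age=train['age'].fillna(train_median))
test = test.assign(age=test['age'].fillna(train_median))

print(train['age'].isnull().sum(), test['age'].isnull().sum())
```

Because the median comes from the training rows alone, nothing about the test rows influences the value used to fill them.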

Recap

Data cleaning is the process of turning raw, messy input into a reliable foundation for your model. By identifying nulls with .isnull(), choosing between deletion and imputation, and standardizing categorical labels, you ensure that your model learns patterns rather than noise.

Up next: Feature Selection and Basic Filtering
