Learn how to master data cleaning with Pandas. We cover detecting nulls, imputing missing values, and fixing inconsistent labels for better model performance.
Previously in this course, we covered Loading and Inspecting Datasets with Pandas: A Practical Guide. Now that you can load your data, this lesson adds the critical skill of data cleaning. Raw data is rarely "model-ready"; it arrives with missing values and inconsistent labels that can break your algorithms or bias your results.
Before you can fix data, you must find where it's broken. In Pandas, missing data is typically represented as NaN (Not a Number) or None.
The most efficient way to detect these is using the .isnull() method combined with .sum(). This tells you exactly how many gaps exist in each column:
```python
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF']
})

# Detect nulls
print(df.isnull().sum())
```
This output gives you a clear audit of your dataset's health. If a column has 90% missing data, you might consider dropping it entirely. If it has 1%, you can likely fix it.
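To turn raw counts into the percentages mentioned above, one option is to take the mean of the boolean mask, since True counts as 1. This is a minimal sketch, assuming the same sample DataFrame as before. The 0.5 cutoff and the names missing_fraction and columns_to_drop are illustrative choices, not recommended defaults.

```python
import numpy as np
import pandas as pd

# Same sample data as in the detection example
df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF']
})

# Fraction of missing values per column (mean of a boolean mask)
missing_fraction = df.isnull().mean()
print(missing_fraction)

# Hypothetical cutoff: list columns with more than half their values missing
threshold = 0.5
columns_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
print(columns_to_drop)
```

Here neither column crosses the cutoff, so the list of columns to drop is empty.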
Once you've identified the gaps, you have two primary strategies: dropping (removing the rows/columns) or imputation (filling the gaps with estimated values).
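If you have a massive dataset and only a negligible share of rows have missing values, dropping those rows is usually the simplest option, and it avoids filling the data with estimated values. This holds when the gaps are scattered at random; if they cluster in one group, dropping rows can itself bias the data.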
- df.dropna(): removes rows that have a missing value in any column.
- df.dropna(subset=['age']): removes only the rows where the 'age' column is missing.

When data is expensive to collect, dropping rows is wasteful. Instead, we use imputation.
- Mean/median imputation: replace NaN with the average or median of the column. Use this for numerical data.
- Mode imputation: replace NaN with the most frequent value. Use this for categorical data.

```python
# Mean imputation for numerical data
df['age'] = df['age'].fillna(df['age'].mean())

# Mode imputation for categorical data
df['city'] = df['city'].fillna(df['city'].mode()[0])
```
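The trade-off between the two strategies can be checked directly. The sketch below assumes the original sample DataFrame before any filling. It compares how many rows survive each dropna call, then applies median imputation, which the mean example does not show. The names dropped_any, dropped_age, and imputed are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF']
})

# Strategy 1: drop rows
dropped_any = df.dropna()                # rows with any missing value removed
dropped_age = df.dropna(subset=['age'])  # only rows missing 'age' removed
print(len(df), len(dropped_any), len(dropped_age))

# Strategy 2: impute on a copy, keeping every row
imputed = df.copy()
imputed['age'] = imputed['age'].fillna(imputed['age'].median())
imputed['city'] = imputed['city'].fillna(imputed['city'].mode()[0])
print(imputed)
```

On this tiny sample, dropna() keeps only two of the five rows, while imputation keeps all five. The median is less sensitive to extreme values than the mean, which matters when a numerical column contains outliers.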
Data often contains human-entered errors. For example, a column might contain both "New York" and "NY". To a machine, these are distinct categories, so observations that belong together end up split across separate groups.
You can standardize these labels using the .replace() method or by mapping values:
```python
# Standardize inconsistent labels
df['city'] = df['city'].replace({'NY': 'New York', 'N.Y.': 'New York'})
```
Always inspect your unique categories first using df['column_name'].unique() to see exactly what variations exist before you start replacing them.
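The inspect-then-standardize workflow can be sketched as follows. The sample values, including the stray whitespace and the lowercase 'ny' variant, are made up for illustration. The mapping step uses Series.map, which returns NaN for any value missing from the dictionary, so the sketch falls back to the stripped original label for anything unmapped.

```python
import pandas as pd

# Hypothetical column with inconsistent, human-entered labels
df = pd.DataFrame({'city': ['NY', 'New York', ' N.Y.', 'ny', 'LA']})

# Step 1: inspect the variations and how often each occurs
print(df['city'].unique())
print(df['city'].value_counts())

# Step 2: normalize whitespace and case before matching
cleaned = df['city'].str.strip().str.upper()

# Step 3: map known variants to one canonical label.
# Values not in the dict become NaN, so fill them with the stripped original.
canonical = {'NY': 'New York', 'N.Y.': 'New York', 'NEW YORK': 'New York'}
df['city'] = cleaned.map(canonical).fillna(df['city'].str.strip())

print(df['city'].unique())
```

Normalizing case and whitespace first keeps the mapping dictionary short, since it only needs one entry per distinct spelling rather than one per formatting variant.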
- Create a small dataset with missing values (for example, with pd.DataFrame).
- After filling the gaps, run df.isnull().sum() again; it should return 0 for those columns.
- Add an indicator column such as is_age_missing before imputing.

Data cleaning is the process of turning raw, messy input into a reliable foundation for your model. By identifying nulls with .isnull(), choosing between deletion and imputation, and standardizing categorical labels, you help your model learn patterns rather than noise.
Up next: Feature Selection and Basic Filtering
Handling Missing and Inconsistent Data