High cardinality can cripple your ML models. Learn to handle categorical features with many unique values using target encoding, hashing, and grouping.
Previously in this course, we covered Encoding Categorical Variables: A Practical Guide for ML. While one-hot encoding works for features with a handful of categories, it fails when you face high cardinality—features like "zip code," "user ID," or "product SKU" that contain hundreds or thousands of unique values.
When you one-hot encode a feature with 1,000 unique values, you add 1,000 new columns to your dataset. This leads to the "curse of dimensionality," where your model becomes bloated, slow to train, and prone to overfitting because most of those columns will be sparse (full of zeros). Here is how to handle these features like a production-grade engineer.
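To make the column explosion concrete, here is a minimal sketch using pandas. The `zip_code` column and its 1,000 synthetic values are made up for illustration; only `pd.get_dummies` is a real library call.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5,000 rows, one categorical column with 1,000 distinct values
codes = np.arange(5000) % 1000  # guarantees every value appears
df = pd.DataFrame({"zip_code": [f"zip_{c:04d}" for c in codes]})

# One-hot encode: one new column per distinct category
one_hot = pd.get_dummies(df["zip_code"])
n_rows, n_cols = one_hot.shape

# Each row holds exactly one non-zero entry, so the share of
# non-zero cells is 1 / n_cols
density = one_hot.to_numpy().mean()

print(n_cols)
print(density)
```

A single input column becomes as many columns as there are categories, and almost every cell in the result is zero.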
We generally use three approaches to tame features that overwhelm our models:
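The simplest way to reduce cardinality is to collapse rare categories into a single "Other" bucket. As a rule of thumb, if a category appears in fewer than 1% of your rows, the model rarely sees enough examples to learn a reliable pattern from it, so its contribution is more likely noise than signal. Treat the 1% cutoff as a starting point to tune, not a fixed rule.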
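The grouping step might look like the sketch below. The helper name `group_rare_categories`, its parameters, and the toy SKU data are all hypothetical; the 1% threshold simply mirrors the rule of thumb above.

```python
import pandas as pd

def group_rare_categories(series: pd.Series,
                          min_frequency: float = 0.01,
                          other_label: str = "Other") -> pd.Series:
    """Replace categories whose share of rows is below min_frequency."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_frequency].index
    # Keep frequent categories, overwrite rare ones with the bucket label
    return series.where(~series.isin(rare), other_label)

# Toy data: two frequent SKUs and two rare ones (0.3% and 0.2% of rows)
skus = pd.Series(["A"] * 600 + ["B"] * 395 + ["C"] * 3 + ["D"] * 2)
grouped = group_rare_categories(skus)

print(grouped.value_counts())
```

In a real pipeline you would likely compute the list of rare categories on the training split only and reuse it at prediction time, so that unseen categories also fall into the same bucket.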
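Feature hashing uses a hash function to map categories to a fixed number of columns. It is memory-efficient and doesn't require storing a dictionary of category mappings, which makes it well suited to streaming data or massive feature sets. The trade-off is that distinct categories can collide in the same column, and the hashed columns can't be mapped back to the original category names.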
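As an illustration, scikit-learn's `FeatureHasher` can hash string categories into a fixed-width sparse matrix. The SKU strings and the choice of 16 output columns are assumptions made for this sketch, not recommended settings.

```python
from sklearn.feature_extraction import FeatureHasher

# Fixed output width, regardless of how many distinct categories exist
hasher = FeatureHasher(n_features=16, input_type="string")

# Hypothetical SKU values; note that the first and third are identical
skus = ["sku_1001", "sku_2002", "sku_1001", "sku_9999"]

# input_type="string" expects one iterable of strings per sample,
# so each value is wrapped in its own list
hashed = hasher.transform([[s] for s in skus])
dense = hashed.toarray()

print(hashed.shape)
```

The same category always lands in the same column, so no mapping needs to be stored. A larger `n_features` lowers the chance of collisions at the cost of a wider matrix.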
Target encoding replaces each category with the average value of the target variable for that category. It turns a categorical feature into a single, high-information numerical feature.
Note: You must use cross-validation or smoothing to prevent "data leakage," as replacing a category with its target mean can cause the model to memorize the target directly.
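One way to act on this note is out-of-fold encoding: each row is encoded using target statistics computed from the other folds only. The sketch below combines that with additive (m-estimate) smoothing. Both the helper `oof_target_encode` and the smoothing formula are assumptions chosen for illustration; they are not necessarily what `category_encoders` does internally.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, m=10.0, n_splits=5, seed=0):
    """Out-of-fold target encoding with additive (m-estimate) smoothing.

    A row's own target value never contributes to its encoding.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target].mean()  # global mean of the fitting folds
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        # Blend the category mean with the prior; m acts like m pseudo-rows
        smoothed = (stats["sum"] + m * prior) / (stats["count"] + m)
        # Categories unseen in the fitting folds fall back to the prior
        values = df.iloc[enc_idx][col].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

df = pd.DataFrame({
    "region": ["North", "North", "South", "East", "West", "North", "South"] * 100,
    "target": [1, 0, 1, 0, 1, 1, 0] * 100,
})
df["region_encoded"] = oof_target_encode(df, "region", "target")

print(df.groupby("region")["region_encoded"].mean())
```

Larger values of `m` pull small categories more strongly toward the overall mean, which is the behaviour you want when a category has only a handful of rows.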
Let’s apply target encoding to our project dataset. We’ll simulate a "Region" column with high cardinality and use category_encoders to transform it.
```python
import pandas as pd
from category_encoders import TargetEncoder

# Setup sample data
data = {
    'region': ['North', 'North', 'South', 'East', 'West', 'North', 'South'] * 100,
    'target': [1, 0, 1, 0, 1, 1, 0] * 100
}
df = pd.DataFrame(data)

# Initialize the encoder
# smoothing=10 is a common setting to prevent overfitting
encoder = TargetEncoder(cols=['region'], smoothing=10)

# Fit and transform
df['region_encoded'] = encoder.fit_transform(df['region'], df['target'])

print(df[['region', 'region_encoded']].drop_duplicates())
```
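In this example, each region string is replaced by a smoothed estimate of the mean target for that region. Because the target here is binary, that estimate approximates the probability of the target being 1, pulled toward the overall mean by the smoothing. This preserves the relationship between the feature and the target while reducing the dimensionality to a single column. Note that the encoder is fit on the same rows it transforms, so treat this as a demonstration; in a real pipeline, fit it on the training data only.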
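As an exercise, apply TargetEncoder to a high-cardinality column of your own and observe how the number of features in your model changes compared to a one-hot encoded version.
High cardinality is a common bottleneck in production machine learning. We avoid the bloat of one-hot encoding by grouping rare categories into an "Other" bucket, hashing categories into a fixed number of columns, or target encoding them into a single numeric column.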
By choosing the right technique, you maintain model performance without sacrificing interpretability or speed.
Up next: We will address Handling Multi-Collinearity to ensure our features don't fight each other during training.
Dealing with High Cardinality