Dealing with High Cardinality: Advanced Categorical Encoding

High cardinality can cripple your ML models. Learn to handle categorical features with many unique values using target encoding, hashing, and grouping.


Previously in this course, we covered Encoding Categorical Variables: A Practical Guide for ML. While one-hot encoding works for features with a handful of categories, it fails when you face high cardinality—features like "zip code," "user ID," or "product SKU" that contain hundreds or thousands of unique values.

When you one-hot encode a feature with 1,000 unique values, you add 1,000 new columns to your dataset, and every row holds a single 1 and 999 zeros. This is the "curse of dimensionality" in practice: the model becomes bloated, slow to train, and prone to overfitting, because most of those sparse columns are backed by only a handful of rows. Here is how to handle these features like a production-grade engineer.
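To make the blow-up concrete, here is a minimal sketch with made-up data. The `sku` column, the row count, and the random seed are all assumptions chosen for illustration, not part of the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows drawn from up to 1,000 distinct SKU codes.
rng = np.random.default_rng(0)
sku = pd.Series(rng.integers(0, 1000, size=10_000)).map("sku_{}".format)
df_demo = pd.DataFrame({"sku": sku})

# One-hot encoding creates one column per distinct category.
one_hot = pd.get_dummies(df_demo["sku"])

n_categories = df_demo["sku"].nunique()
# Each row has exactly one active column, so the share of non-zero
# cells in the encoded matrix is 1 / n_categories.
density = 1.0 / n_categories

print(f"{n_categories} categories -> {one_hot.shape[1]} columns")
print(f"share of non-zero cells: {density:.4%}")
```

One column of strings becomes roughly a thousand columns in which almost every cell is zero.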

Strategies for High Cardinality

We generally use three approaches to tame features that overwhelm our models:

1. Grouping and Rare Category Aggregation

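The simplest way to reduce cardinality is to collapse sparse categories into a single "Other" bucket. If a category appears in fewer than 1% of your rows, the model usually has too few examples to learn a reliable effect for it, so whatever it learns is more likely noise than signal. Treat the 1% cutoff as a rule of thumb and tune it to the size of your dataset.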
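A minimal pandas sketch of this idea follows. The helper name `group_rare`, its `min_share` parameter, and the toy `cities` data are hypothetical and only there to show the mechanics.

```python
import pandas as pd

def group_rare(series: pd.Series, min_share: float = 0.01, other: str = "Other") -> pd.Series:
    """Replace categories whose share of rows is below min_share with `other`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; overwrite the rest with the bucket label.
    return series.where(~series.isin(rare), other)

# Toy data: two common categories and two that each cover under 1% of rows.
cities = pd.Series(["A"] * 600 + ["B"] * 395 + ["C"] * 3 + ["D"] * 2)
grouped = group_rare(cities)

print(grouped.value_counts())
```

In a real pipeline you would decide which categories count as rare using the training split only, then apply that same list to validation and live data.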

2. Feature Hashing (The Hashing Trick)

Feature hashing uses a hash function to map categories to a fixed number of columns. It’s memory-efficient and doesn't require storing a dictionary of category mappings, making it perfect for streaming data or massive feature sets.
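As a rough sketch, scikit-learn's `FeatureHasher` can do this in a few lines. The SKU strings are made up, and `n_features=8` is deliberately far too small so that collisions are easy to see; real pipelines typically use thousands of buckets or more.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical feature: 1,000 distinct SKU strings.
skus = [f"sku_{i}" for i in range(1000)]

# n_features is the fixed output width, chosen tiny here on purpose.
hasher = FeatureHasher(n_features=8, input_type="string")

# With input_type="string", each sample is an iterable of strings,
# so every single value is wrapped in its own list.
X = hasher.transform([[s] for s in skus]).toarray()

print(X.shape)  # (1000, 8)

# Many SKUs end up with the same encoded row: that is a hash collision.
distinct_rows = {tuple(row) for row in X}
print(f"{len(skus)} categories share {len(distinct_rows)} distinct encodings")
```

No mapping is stored, so a category the encoder has never seen is hashed the same way as any other. The trade-off is that you cannot map a column back to the category that produced it.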

3. Target Encoding

Target encoding replaces each category with the average value of the target variable for that category. It turns a categorical feature into a single, high-information numerical feature.

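Note: Per-category means computed on the same rows the model trains on leak the target. Each row's own label contributes to its encoded value, and for rare categories the encoding can be almost the label itself. Use smoothing (shrinking small categories toward the global mean) and, ideally, out-of-fold encoding via cross-validation so the model cannot memorize the target directly.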
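The two safeguards can be combined in a short sketch. The helpers `smoothed_means` and `out_of_fold_encode` and the strength parameter `m` are hypothetical names, and the additive smoothing formula used here is one common choice; it is not necessarily the formula any particular library applies.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def smoothed_means(cat: pd.Series, y: pd.Series, m: float):
    """Per-category target mean shrunk toward the global mean.

    Additive smoothing: (sum_of_targets + m * global_mean) / (count + m).
    Small categories stay close to the global mean; large ones barely move.
    """
    global_mean = float(y.mean())
    stats = y.groupby(cat).agg(["sum", "count"])
    means = (stats["sum"] + m * global_mean) / (stats["count"] + m)
    return means, global_mean

def out_of_fold_encode(cat: pd.Series, y: pd.Series, m: float = 10.0,
                       n_splits: int = 5, seed: int = 0) -> pd.Series:
    """Encode each row using target statistics from the other folds only."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        means, global_mean = smoothed_means(cat.iloc[fit_idx], y.iloc[fit_idx], m)
        # Categories missing from the fitting folds fall back to the global mean.
        values = cat.iloc[enc_idx].map(means).fillna(global_mean)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Toy data: two large categories and one with only two rows.
demo = pd.DataFrame({
    "city": ["A"] * 50 + ["B"] * 50 + ["C"] * 2,
    "y": [1, 0] * 25 + [1] * 50 + [1, 0],
})
demo["city_te"] = out_of_fold_encode(demo["city"], demo["y"])

print(demo.groupby("city")["city_te"].mean())
```

Because no row's encoding is computed from its own label, the encoded column cannot simply hand the answer to the model.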

Worked Example: Implementing Target Encoding

Let’s apply target encoding to our project dataset. We’ll use a small "Region" column with only four values as a stand-in for a high-cardinality feature, so the output stays readable, and use category_encoders to transform it.


PYTHON
import pandas as pd
from category_encoders import TargetEncoder

# Set up sample data (only four regions, to keep the output readable)
data = {
    'region': ['North', 'North', 'South', 'East', 'West', 'North', 'South'] * 100,
    'target': [1, 0, 1, 0, 1, 1, 0] * 100
}
df = pd.DataFrame(data)

# Initialize the encoder
# smoothing controls how strongly each category mean is pulled toward the global mean
encoder = TargetEncoder(cols=['region'], smoothing=10)

# Fit and transform. fit_transform returns a DataFrame, so select the column.
# For brevity we fit on all rows; in a real pipeline, fit on the training split only.
df['region_encoded'] = encoder.fit_transform(df[['region']], df['target'])['region']

print(df[['region', 'region_encoded']].drop_duplicates())

In this example, each region string is replaced by a smoothed estimate of the mean target for that region, which for a binary target is the estimated probability that the target is 1. This preserves the relationship between the feature and the target while reducing the feature to a single column.

Hands-on Exercise

1. Load your project dataset from your local environment.
2. Identify a column with more than 20 unique values.
3. Apply a frequency-based filter: replace any category appearing fewer than 10 times with the label "Other."
4. Use TargetEncoder on the resulting column and observe how the number of features in your model changes compared to a one-hot encoded version.

Common Pitfalls

Data Leakage with Target Encoding: Never compute category means on data that includes your validation or test rows. Always fit your encoder on the training fold only, then apply it unchanged to everything else.
Hash Collisions: When using feature hashing, choosing too small a number of buckets will cause different categories to map to the same index. A modest collision rate is often tolerable, but heavy collisions force unrelated categories to share a single representation and can degrade performance.
Ignoring "New" Categories: In production, you will inevitably encounter categories in live data that didn't exist in your training set. Ensure your encoding pipeline has a strategy (like assigning to "Other" or a neutral mean) for unseen data.
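The first and last pitfalls can be handled together: fit on training data only, and give unseen categories a neutral fallback. The class below is a hypothetical, stripped-down encoder written for illustration (no smoothing, pandas only), not the category_encoders implementation.

```python
import pandas as pd

class SimpleTargetEncoder:
    """Hypothetical minimal encoder: per-category mean with a global-mean fallback."""

    def fit(self, cat: pd.Series, y: pd.Series) -> "SimpleTargetEncoder":
        # Statistics come from the training data only.
        self.global_mean_ = float(y.mean())
        self.means_ = y.groupby(cat).mean()
        return self

    def transform(self, cat: pd.Series) -> pd.Series:
        # Categories never seen during fit map to NaN, then to the global mean.
        return cat.map(self.means_).fillna(self.global_mean_)

train = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "target": [1, 0, 1, 1],
})
live = pd.DataFrame({"region": ["North", "South", "Central"]})  # "Central" is new

enc = SimpleTargetEncoder().fit(train["region"], train["target"])
live["region_encoded"] = enc.transform(live["region"])

print(live)
# North -> 0.5, South -> 1.0, Central -> 0.75 (the training global mean)
```

The live rows never influence the fitted means, and the new "Central" region gets the training global mean instead of breaking the pipeline.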

Recap

High cardinality is a common bottleneck in production machine learning. We avoid the bloat of one-hot encoding by:

Grouping rare items to simplify the feature space.
Hashing to map vast categories into a fixed-size vector.
Target Encoding to inject predictive power into a single numeric column.

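By choosing the right technique, you keep the feature space compact and training fast without giving up predictive performance. Each option has its own trade-off: grouping discards detail, hashing gives up interpretability, and target encoding demands care around leakage.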

Up next: We will address Handling Multi-Collinearity to ensure our features don't fight each other during training.
