Mahamudul Hasan Rubel
Lesson 6 of the Intermediate Machine Learning: Real-World Pipelines course
AI/ML · June 25, 2026 · 3 min read

Encoding Categorical Variables: Production Pipelines

Master categorical encoding in your ML pipelines. Learn when to use OneHot vs. Ordinal encoding and how to implement target encoding without data leakage.

Tags: categorical encoding, machine learning, scikit-learn, feature engineering, data science, ai, machine-learning, python

Previously in this course, we discussed ColumnTransformer for Heterogeneous Data to isolate feature processing. In this lesson, we move from structural organization to the numerical transformation of categorical data.

Most machine learning algorithms operate on numeric arrays; they cannot ingest strings or category labels directly. Converting those labels into numbers, known as categorical encoding, is a foundational step in your feature engineering workflow.

OneHotEncoder vs. OrdinalEncoder: The Core Trade-off

The two most common strategies in Scikit-Learn are OneHotEncoder and OrdinalEncoder. Choosing the wrong one can lead to poor model convergence or spurious relationships.

OneHotEncoder creates a binary column for every unique category. It assumes no inherent order.

  • Best for: Nominal data (e.g., "Color": Red, Blue, Green).
  • The Risk: High cardinality. If you have a "Zip Code" feature with 5,000 unique values, OneHot encoding generates 5,000 new columns. This leads to sparse matrices, increased memory usage, and the "curse of dimensionality."
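
To make the column explosion concrete, here is a minimal sketch on synthetic data. The zip codes and row counts are made up for illustration; the point is only that the encoded matrix gets one column per unique value and that scikit-learn returns it in sparse form by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic feature (assumption): 5,000 distinct zip codes, 20,000 rows in total.
zip_codes = np.array([f"{z:05d}" for z in range(10000, 15000)])
X = np.concatenate([zip_codes, rng.choice(zip_codes, size=15000)]).reshape(-1, 1)

# Default OneHotEncoder output is a SciPy sparse matrix.
encoded = OneHotEncoder().fit_transform(X)
print(encoded.shape)  # one column per unique zip code

# Memory actually held by the sparse matrix vs. a dense float64 equivalent.
sparse_bytes = encoded.data.nbytes + encoded.indices.nbytes + encoded.indptr.nbytes
dense_bytes = encoded.shape[0] * encoded.shape[1] * 8
print(f"sparse: {sparse_bytes / 1e6:.2f} MB, dense: {dense_bytes / 1e6:.2f} MB")
```

The sparse representation keeps memory manageable, but the model still has to learn 5,000 separate coefficients or split candidates from very few rows per column.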

OrdinalEncoder maps each category to an integer (0, 1, 2...).

  • Best for: Ordinal data (e.g., "Size": Small, Medium, Large) or tree-based models like Random Forests or XGBoost, which can handle integer-encoded features effectively. By default the integers follow the sorted order of the category values, so pass the categories parameter explicitly when the order carries meaning.
  • The Risk: Linear models (like Logistic Regression) will interpret the integers as magnitudes, treating the step from "Small" (0) to "Large" (2) as exactly twice the step from "Small" to "Medium" (1). Only use this for linear models if the categories are truly ordered.
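
One detail worth checking in practice: unless told otherwise, OrdinalEncoder assigns integers in sorted order of the category values, which for strings means alphabetical. The sketch below, using a made-up "Size" column, compares the default mapping with one where the order is passed through the categories parameter.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy "Size" column (illustrative data only).
sizes = np.array([["Small"], ["Large"], ["Medium"], ["Small"]], dtype=object)

# Default: categories are sorted, so Large=0, Medium=1, Small=2.
default_enc = OrdinalEncoder().fit(sizes)
default_codes = default_enc.transform(sizes).ravel()
print(default_enc.categories_[0], default_codes)

# Explicit order: Small=0, Medium=1, Large=2.
ordered_enc = OrdinalEncoder(categories=[["Small", "Medium", "Large"]]).fit(sizes)
ordered_codes = ordered_enc.transform(sizes).ravel()
print(ordered_enc.categories_[0], ordered_codes)
```

For tree-based models the default mapping is usually harmless, since splits only depend on thresholds. For linear models the explicit order is what makes the integers meaningful.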

Handling High Cardinality with Target Encoding

When you face high cardinality—features like "User ID" or "City" that have thousands of levels—OneHot encoding is often prohibitive. In these cases, we use Target Encoding (or Mean Encoding).

Target encoding replaces each category with the average target value for that category. If you are predicting churn, the category "New York" becomes the average churn rate of all users in New York.

The Golden Rule: You must compute these averages only on the training set to prevent data leakage. If you calculate the mean target for a category using the entire dataset, the model "sees" the answer for the validation/test rows during training.
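
The rule can be illustrated by hand with pandas. This is a minimal sketch on a made-up churn table, not the implementation any library uses: the per-city means are learned from the training rows only and then looked up for the test rows. Falling back to the global training mean for unseen cities is an assumption of this sketch.

```python
import pandas as pd

# Toy training and test sets (illustrative values only).
train = pd.DataFrame({
    "city": ["NY", "NY", "NY", "LA", "LA", "SF"],
    "churn": [1, 1, 0, 0, 0, 1],
})
test = pd.DataFrame({"city": ["NY", "LA", "Boston"]})

# Learn the mapping from the training set only.
global_mean = train["churn"].mean()
city_means = train.groupby("city")["churn"].mean()

# Apply the learned mapping to unseen rows; "Boston" never appeared in
# training, so it falls back to the global training mean.
test_encoded = test["city"].map(city_means).fillna(global_mean)
print(test_encoded.tolist())
```

Note that the test target is never touched: the test rows only receive values that were computed before they were seen.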

Implementation: Target Encoding in a Pipeline

To implement this safely, we use the category_encoders library, which integrates seamlessly with Scikit-Learn pipelines.

PYTHON
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# Define our high-cardinality columns
cat_features = ['city', 'zip_code']

# Create the encoder
# min_samples_leaf and smoothing help prevent overfitting
te = TargetEncoder(min_samples_leaf=10, smoothing=10)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', te, cat_features)
    ],
    remainder='passthrough'
)

# Pipeline ensures fit/transform happens only on training folds
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])
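
The fold-level guarantee only holds when the whole pipeline, not pre-encoded data, is handed to a cross-validation utility. The sketch below shows that pattern with cross_val_score. To keep it dependent on scikit-learn and pandas only, it swaps category_encoders' TargetEncoder for a hypothetical, deliberately minimal MeanTargetEncoder (no smoothing, single column), and it uses synthetic data; neither is part of the lesson's code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class MeanTargetEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical minimal target encoder for a single-column DataFrame."""

    def fit(self, X, y):
        col = pd.DataFrame(X).iloc[:, 0]
        y = pd.Series(np.asarray(y), index=col.index)
        self.global_mean_ = y.mean()
        self.mapping_ = y.groupby(col).mean()  # per-category target mean
        return self

    def transform(self, X):
        col = pd.DataFrame(X).iloc[:, 0]
        # Categories not seen during fit fall back to the global mean.
        encoded = col.map(self.mapping_).fillna(self.global_mean_)
        return encoded.to_numpy(dtype=float).reshape(-1, 1)


# Synthetic data (assumption): 50 cities, each with its own churn rate.
rng = np.random.default_rng(42)
n_rows = 1000
city_ids = rng.integers(0, 50, size=n_rows)
city_rate = rng.uniform(0.1, 0.9, size=50)
X = pd.DataFrame({"city": [f"city_{i}" for i in city_ids]})
y = (rng.random(n_rows) < city_rate[city_ids]).astype(int)

pipe = Pipeline([
    ("encode", MeanTargetEncoder()),
    ("model", LogisticRegression()),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the category means never include the rows being scored.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Encoding the full DataFrame first and then cross-validating only the model would bypass this protection, because the means would already contain the validation rows' targets.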

Hands-on Exercise

  1. Create a synthetic dataset with a high-cardinality feature (e.g., 50 unique cities).
  2. Compare the performance of a LogisticRegression model using OneHotEncoder vs. TargetEncoder.
  3. Compare the size of the resulting feature matrices using .shape (for a dense NumPy array, .nbytes reports the memory footprint). Note how OneHotEncoder expands the feature space significantly.

Common Pitfalls

  • The "Unknown Category" Trap: If your production data contains a category never seen during training, OneHotEncoder will throw an error by default. Set handle_unknown='ignore' in your encoder configuration to prevent deployment crashes.
  • Leakage in Target Encoding: Never perform target encoding before splitting your data. If you calculate the mean target across the whole set, your model will score suspiciously high on local validation but fail in production. Always rely on the Pipeline structure to ensure the mapping is learned solely from the training fold.
  • Overfitting with Target Encoding: If a category has very few samples (e.g., only 1 person in "Small Town"), the target mean will be highly unreliable. Always use smoothing parameters to shrink the estimate toward the global mean.
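
The first and third pitfalls are easy to reproduce in a few lines. The sketch below uses toy data. The smoothing shown is a simple additive form, (n * category_mean + m * global_mean) / (n + m), with a hypothetical weight m; it illustrates the shrinkage idea and is not necessarily the exact formula category_encoders applies.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Pitfall 1: a category that was never seen during training.
train_colors = np.array([["Red"], ["Blue"], ["Green"]], dtype=object)
new_colors = np.array([["Purple"]], dtype=object)

strict = OneHotEncoder().fit(train_colors)
try:
    strict.transform(new_colors)
    raised = False
except ValueError:
    raised = True  # default behaviour: unknown categories are an error

tolerant = OneHotEncoder(handle_unknown="ignore").fit(train_colors)
unknown_row = tolerant.transform(new_colors).toarray()  # all-zero row
print(raised, unknown_row)

# Pitfall 3: a category with a single sample.
train = pd.DataFrame({
    "city": ["Big City"] * 4 + ["Small Town"],
    "churn": [1, 0, 0, 0, 1],
})
m = 10  # hypothetical smoothing weight: larger m shrinks harder
global_mean = train["churn"].mean()
stats = train.groupby("city")["churn"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
print(stats["mean"]["Small Town"], smoothed["Small Town"])
```

With one sample, the raw mean for "Small Town" is 1.0, while the smoothed estimate stays close to the global churn rate of 0.4.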

Recap

Categorical encoding is not one-size-fits-all. Use OneHotEncoder for low-cardinality nominal data, OrdinalEncoder for tree-based models or ordinal data, and TargetEncoder for high-cardinality features. Always encapsulate these transformations within your Pipeline architecture to ensure that your preprocessing logic is applied consistently and that you avoid the silent, deadly trap of data leakage.

Up next: We will discuss how to prune the feature space we’ve just created using Feature Selection in Pipelines.

