Master categorical encoding in your ML pipelines. Learn when to use OneHot vs. Ordinal encoding and how to implement target encoding without data leakage.
Previously in this course, we discussed ColumnTransformer for Heterogeneous Data to isolate feature processing. In this lesson, we move from structural organization to the numerical transformation of categorical data.
Most machine learning algorithms operate on numeric arrays; they cannot ingest strings or categories directly. Converting these values into numbers, known as categorical encoding, is a foundational step in your feature engineering workflow.
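The two most common strategies in Scikit-Learn are OneHotEncoder and OrdinalEncoder. Choosing the wrong one can hurt the model: integer codes on unordered categories suggest a ranking that does not exist, while one-hot columns on a feature with many levels inflate the feature space.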
OneHotEncoder creates a binary column for every unique category. It assumes no inherent order.
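OrdinalEncoder maps each category to an integer (0, 1, 2...). It suits features with a natural order, such as "low", "medium", "high". By default the integers follow the sorted order of the category values, so pass the categories parameter when you need a specific ordering.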
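To make the difference concrete, here is a small sketch of both encoders on a toy DataFrame. The column names, values, and the size ordering are illustrative assumptions, not part of the lesson's dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data (hypothetical): one nominal feature and one ordered feature
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# One binary column per unique color; the default output is sparse,
# so convert it to a dense array for inspection
onehot = OneHotEncoder()
color_onehot = onehot.fit_transform(df[["color"]]).toarray()

# Explicit category order so that small < medium < large
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_ordinal = ordinal.fit_transform(df[["size"]])

print(onehot.categories_[0])   # learned color categories, in sorted order
print(color_onehot.shape)      # (4, 3): four rows, three color columns
print(size_ordinal.ravel())    # integer codes following the order given above
```

Without the categories argument, OrdinalEncoder would sort the size labels alphabetically, which would place "large" before "medium".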
When you face high cardinality (features like "User ID" or "City" that have thousands of levels), OneHot encoding is often prohibitive because it adds one column per level. In these cases, we use Target Encoding (or Mean Encoding).
Target encoding replaces each category with the average target value for that category. If you are predicting churn, the category "New York" becomes the average churn rate of all users in New York.
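The arithmetic behind this can be sketched with plain pandas. The tiny churn table below is made up for illustration, and the means are computed over all rows only to show the calculation itself.

```python
import pandas as pd

# Hypothetical churn data: 1 = churned, 0 = stayed
df = pd.DataFrame({
    "city": ["New York", "New York", "Boston", "Boston", "Boston", "Austin"],
    "churn": [1, 0, 1, 1, 0, 0],
})

# Average target value per category
city_means = df.groupby("city")["churn"].mean()

# Replace each category with its average churn rate
df["city_encoded"] = df["city"].map(city_means)

print(city_means)
print(df)
```

Here "New York" has one churner out of two users, so every New York row is encoded as 0.5.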
The Golden Rule: You must compute these averages only on the training set to prevent data leakage. If you calculate the mean target for a category using the entire dataset, the model "sees" the answer for the validation/test rows during training.
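A minimal sketch of this rule, assuming a small hand-made train/test split. The frames, the column names, and the choice to fall back to the global training mean for unseen categories are illustrative assumptions.

```python
import pandas as pd

# Hypothetical training rows (the target is known)
train = pd.DataFrame({
    "city": ["New York", "New York", "Boston", "Boston", "Boston", "Austin"],
    "churn": [1, 1, 1, 0, 0, 0],
})

# Hypothetical test rows (the target is not used anywhere below)
test = pd.DataFrame({"city": ["New York", "Boston", "Chicago"]})

# Learn the mapping from the training set only
global_mean = train["churn"].mean()
mapping = train.groupby("city")["churn"].mean()

# Apply the learned mapping; categories never seen in training
# ("Chicago" here) fall back to the global training mean
test["city_encoded"] = test["city"].map(mapping).fillna(global_mean)

print(test)
```

No test target is ever read, so the encoded test values carry no information about the test labels.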
To implement this safely, we use the category_encoders library, whose encoders follow the Scikit-Learn fit/transform interface and can therefore be used as steps inside Scikit-Learn pipelines.
```python
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# Define our high-cardinality columns
cat_features = ['city', 'zip_code']

# Create the encoder
# min_samples_leaf and smoothing help prevent overfitting
# cols is passed explicitly so that numeric-typed columns
# (e.g. zip codes stored as integers) are encoded as well
te = TargetEncoder(cols=cat_features, min_samples_leaf=10, smoothing=10)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', te, cat_features)
    ],
    remainder='passthrough'
)

# Because the encoder is a pipeline step, cross-validation refits it on
# each training fold, so the category means never see the held-out rows
pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestClassifier())
])
```
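The role of min_samples_leaf and smoothing is easier to see with the arithmetic written out. The sketch below assumes a sigmoid-weighted blend between the category mean and the global mean, which is our understanding of how category_encoders combines them; treat the exact formula as an assumption and check the documentation for your installed version. The function name smoothed_target_mean is hypothetical.

```python
import math


def smoothed_target_mean(category_mean, n, prior, min_samples_leaf=10, smoothing=10.0):
    """Blend a category's mean with the global mean (prior).

    Assumed formula: the weight on the category mean is a sigmoid of
    (n - min_samples_leaf) / smoothing, where n is the category's row count.
    """
    weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + category_mean * weight


# Global churn rate of 0.2 and a category whose observed churn rate is 1.0
rare = smoothed_target_mean(category_mean=1.0, n=1, prior=0.2)
at_threshold = smoothed_target_mean(category_mean=1.0, n=10, prior=0.2)
common = smoothed_target_mean(category_mean=1.0, n=1000, prior=0.2)

print(rare, at_threshold, common)
```

Under this assumption, a category with few rows is pulled toward the global rate, while a category with many rows keeps a value close to its own mean. That shrinkage is what limits overfitting on rare levels.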
Compare a LogisticRegression model using OneHotEncoder vs. TargetEncoder.
Inspect the transformed output's .shape. Note how OneHotEncoder expands the feature space significantly.
When it meets a category that was absent during fitting, OneHotEncoder will throw an error by default. Set handle_unknown='ignore' in your encoder configuration to prevent deployment crashes.
Keep the encoder inside the Pipeline structure to ensure the mapping is learned solely from the training fold.
Categorical encoding is not one-size-fits-all. Use OneHotEncoder for low-cardinality nominal data, OrdinalEncoder for tree-based models or ordinal data, and TargetEncoder for high-cardinality features. Always encapsulate these transformations within your Pipeline architecture so that your preprocessing logic is applied consistently and you avoid the silent trap of data leakage.
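The handle_unknown advice above can be sketched as follows. The two small frames are hypothetical: the city "Chicago" appears only in the new data, which mimics a category first seen after deployment.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data containing an unseen city
train = pd.DataFrame({"city": ["Boston", "Austin", "Boston"]})
new_data = pd.DataFrame({"city": ["Austin", "Chicago"]})

# Default behaviour: an unseen category raises an error at transform time
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new_data)
    raised = False
except ValueError:
    raised = True

# With handle_unknown='ignore', the unseen category becomes an all-zero row
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = tolerant.transform(new_data).toarray()

print(raised)
print(encoded)
```

The all-zero row means the model receives no signal for that category, which is usually preferable to a failed prediction request.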
Up next: We will discuss how to prune the feature space we’ve just created using Feature Selection in Pipelines.