Mahamudul Hasan Rubel
Lesson 14 of the AI/ML Foundations: Core Concepts & First Models course
AI/ML · June 25, 2026 · 3 min read

Encoding Categorical Variables: A Practical Guide for ML

Learn how to prepare non-numeric data for machine learning. Master one-hot and label encoding to turn categorical features into model-ready inputs.

Tags: encoding, machine learning, pandas, scikit-learn, preprocessing, data science, ai, machine-learning, python

Previously in this course, we covered Data Scaling Techniques to normalize continuous variables. Scaling handles numbers, but real-world datasets are often filled with text-based categories (like "Red," "Green," and "Blue") that most machine learning models cannot consume directly. This lesson adds encoding to your preprocessing toolkit, so you can convert human-readable categories into the numerical input that scikit-learn estimators expect.

Understanding Categorical Data Types

Before applying any transformation, you must identify the nature of your categorical data. Treating one type as the other is a common and avoidable source of poor model performance.

  • Nominal Data: Categories with no inherent order (e.g., "City," "Color," "Device Type"). There is no mathematical "greater than" between "New York" and "London."
  • Ordinal Data: Categories with a clear, meaningful rank (e.g., "Low," "Medium," "High" or "Education Level"). The order matters, but the distance between levels is not necessarily uniform: the step from "Low" to "Medium" need not equal the step from "Medium" to "High."
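
The distinction can be made explicit in pandas itself. The following is a minimal sketch, assuming small made-up columns (city, priority) that are not part of the course dataset; it uses an ordered categorical dtype to record the rank of an ordinal feature.

```python
import pandas as pd

# Hypothetical toy columns, used only for illustration
city = pd.Series(['London', 'New York', 'London'], dtype='category')  # nominal

# An ordered categorical dtype records the rank explicitly
level_dtype = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)
priority = pd.Series(['High', 'Low', 'Medium']).astype(level_dtype)  # ordinal

print(city.cat.ordered)             # False: no order between cities
print(priority.cat.ordered)         # True
print(priority.cat.codes.tolist())  # integer codes follow the declared order
print((priority > 'Low').tolist())  # comparisons make sense only for ordered categories
```

Declaring the order up front documents your intent and makes it harder to rank a nominal column by accident.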

Label Encoding (Ordinal Data)

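Label encoding assigns a unique integer to each category (e.g., Low=0, Medium=1, High=2). As long as the integers follow the category order, the rank is preserved, and tree-based models like Random Forests can exploit it by splitting on thresholds (for example, "level <= 1"). Note that scikit-learn's LabelEncoder is intended for target labels and assigns integers in sorted (alphabetical) order; for ordinal features, use OrdinalEncoder with an explicit category order.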
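
A minimal sketch of why the integer assignment matters, assuming a made-up list of levels. scikit-learn's LabelEncoder sorts its classes, so the resulting integers follow alphabetical order rather than rank; an explicit mapping (the rank dictionary here is illustrative) keeps the intended order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = ['Low', 'High', 'Medium', 'Low']  # hypothetical data

# LabelEncoder sorts the classes: High=0, Low=1, Medium=2
alphabetical = LabelEncoder().fit_transform(levels)
print(alphabetical.tolist())

# An explicit mapping keeps the intended rank: Low=0, Medium=1, High=2
rank = {'Low': 0, 'Medium': 1, 'High': 2}
ranked = pd.Series(levels).map(rank)
print(ranked.tolist())
```

In the alphabetical version, "High" ends up below "Low", which silently destroys the ordering the model was supposed to see.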

One-Hot Encoding (Nominal Data)

One-Hot Encoding creates a new binary column for every unique category. If you have a "Color" column with three values, it creates three columns: is_red, is_green, and is_blue. Had the colors been integer-encoded instead (say Red=0, Blue=1, Green=2), a model could wrongly assume that "Green" (2) is somehow "greater than" "Red" (0); separate binary columns avoid that.
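
For a quick look at what one-hot output looks like, pandas offers get_dummies. This is a sketch for exploration only, assuming a made-up color column; the prefix "is" is chosen here just to mirror the column names above.

```python
import pandas as pd

df_colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})  # hypothetical data

# One binary column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df_colors['color'], prefix='is', dtype=int)
print(dummies)

# Exactly one column is "hot" in every row
print(dummies.sum(axis=1).tolist())
```

get_dummies does not remember the categories it saw, so it is less suited to production than a fitted transformer.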

Implementing Encoding with Scikit-Learn

In a production environment, you should use scikit-learn transformers to ensure your encoding logic is reproducible. We will use OrdinalEncoder for ordinal data and OneHotEncoder for nominal data.

PYTHON
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Sample dataset
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium'],
    'color': ['red', 'blue', 'green', 'blue']
})

# 1. Label/Ordinal Encoding
# We define the order explicitly: small < medium < large
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it into a column
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']]).ravel()

# 2. One-Hot Encoding
# sparse_output=False returns a dense array for easier viewing (scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False)
ohe_results = ohe.fit_transform(df[['color']])
ohe_df = pd.DataFrame(ohe_results, columns=ohe.get_feature_names_out(['color']))

# Combine back to the original dataframe
df_final = pd.concat([df, ohe_df], axis=1).drop('color', axis=1)
print(df_final)
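
Reproducibility in practice means fitting the encoder once and reusing it on later data. The sketch below assumes a hypothetical second batch (new_rows) and shows that a fitted OrdinalEncoder applies the same mapping to it and can reverse the encoding.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})
new_rows = pd.DataFrame({'size': ['large', 'small']})  # hypothetical later batch

encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoder.fit(train[['size']])  # learn the mapping once, on the training data

# Reuse the fitted encoder: small=0.0, medium=1.0, large=2.0
encoded_new = encoder.transform(new_rows[['size']]).ravel()
print(encoded_new.tolist())

# inverse_transform recovers the original labels
restored = encoder.inverse_transform(encoded_new.reshape(-1, 1)).ravel()
print(restored.tolist())
```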

Hands-on Exercise: Preparing the Project Dataset

For our running project, locate a column in your dataset that contains categories (e.g., "Department," "Status," or "Region").

  1. Inspect the column using df['column'].value_counts().
  2. Determine if the data is nominal or ordinal.
  3. Apply OneHotEncoder if it is nominal, or OrdinalEncoder with an explicit category order if it is ordinal.
  4. Drop the original string column and join the new binary features to your main DataFrame.

Common Pitfalls

  • The Dummy Variable Trap: OneHotEncoder creates one column per category by default, but one of them is always redundant. With two categories (A and B), you only need one column (is_A): if it's 1, it's A; if it's 0, it's B. The redundant column is perfectly collinear with the others, which can destabilize the coefficients of unregularized linear models (multicollinearity). Use drop='first' in your OneHotEncoder settings to handle this; tree-based models generally do not need it.
  • Encoding Unseen Categories: By default (handle_unknown='error'), OneHotEncoder throws an error if the data you transform contains a category that wasn't in your training set. In production, set handle_unknown='ignore' to safely treat unknown categories as all-zeros. Be aware that, combined with drop='first', an all-zeros row is indistinguishable from the dropped category.
  • Ignoring Cardinality: If you have a column with thousands of unique values, One-Hot encoding will create thousands of columns, leading to a massive, sparse matrix that slows down training. In such cases, consider grouping rare categories into an "Other" category before encoding.

Recap

Encoding is the process of converting non-numeric categorical data into a format that algorithms can process. By using OrdinalEncoder for ranked data and OneHotEncoder for nominal data, you ensure that your model interprets your features correctly. Always remember to handle unknown categories and watch out for the dummy variable trap to keep your model performant and reliable.

Up next: We will learn how to wrap these preprocessing steps into a Pipeline object to ensure your data transformation logic is reusable and mistake-proof.
