Learn how to prepare non-numeric data for machine learning. Master one-hot and label encoding to turn categorical features into model-ready inputs.
Previously in this course, we covered Data Scaling Techniques to normalize continuous variables. While scaling handles numbers, real-world datasets are often filled with text-based categories—like "Red," "Green," and "Blue"—that mathematical models cannot interpret directly. This lesson adds the essential skill of encoding to your preprocessing toolkit, allowing you to bridge the gap between human-readable categories and the numerical input required by Scikit-Learn.
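Before applying any transformation, you must identify the nature of your categorical data. Categorical features are either ordinal (the categories have a natural order, such as Low, Medium, High) or nominal (the categories have no inherent order, such as colors). Choosing an encoding that does not match the type is a common cause of poor model performance.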
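As a rough sketch of this first step, the snippet below lists the non-numeric columns of a small DataFrame and records which ones are treated as ordinal. The sample data and the names `categorical_cols`, `ordinal_cols`, and `nominal_cols` are illustrative assumptions; deciding whether a column is ordinal is a judgment about its meaning, not something pandas can infer.

```python
import pandas as pd

# Hypothetical sample data; the column names are for illustration only
df = pd.DataFrame({
    "size": ["small", "medium", "large", "medium"],
    "color": ["red", "blue", "green", "blue"],
    "price": [9.5, 12.0, 15.25, 11.0],
})

# Non-numeric columns are the candidates for encoding
categorical_cols = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]

# Inspect the distinct values of each candidate column
for col in categorical_cols:
    print(col, sorted(df[col].unique()))

# Record the ordinal/nominal decision explicitly (a manual, domain-driven choice)
ordinal_cols = ["size"]
nominal_cols = [col for col in categorical_cols if col not in ordinal_cols]
```

Writing the decision down as two explicit lists makes it easy to route each group to a different encoder later.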
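Ordinal (label) encoding assigns a unique integer to each category (e.g., Low=0, Medium=1, High=2). When you specify the category order explicitly, the integers preserve the rank, so models can use the hierarchy: linear models treat the value as a magnitude, and tree-based models like Random Forests can split on thresholds such as "Medium or below." Note that scikit-learn's LabelEncoder is intended for target labels and assigns integers in sorted (alphabetical) order, so it does not preserve rank on its own.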
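The following sketch shows why it matters to state the category order explicitly. It compares scikit-learn's `LabelEncoder`, which sorts categories alphabetically, with `OrdinalEncoder` given an explicit order. The sample values and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative ordinal feature
levels = np.array(["low", "medium", "high", "medium"], dtype=object)

# LabelEncoder sorts the categories alphabetically: high=0, low=1, medium=2
alphabetical_codes = LabelEncoder().fit_transform(levels)

# OrdinalEncoder with an explicit order: low=0, medium=1, high=2
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ranked_codes = encoder.fit_transform(levels.reshape(-1, 1)).ravel()

print(alphabetical_codes)  # codes follow alphabetical order, not rank
print(ranked_codes)        # codes follow the stated rank
```

With the alphabetical codes, "high" receives the smallest integer, which reverses part of the intended ranking.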
One-Hot Encoding creates a new binary column for every unique category. If you have a "Color" column with three values, it creates three columns: is_red, is_green, and is_blue. This prevents the model from assuming that "Green" (2) is somehow "greater than" "Red" (0).
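To make the "greater than" problem concrete, here is a small sketch comparing distances between categories under integer codes and under one-hot vectors. The integer mapping is an arbitrary assumption chosen to match the example above.

```python
import numpy as np

# Arbitrary integer codes for a nominal feature (illustrative mapping)
int_codes = {"red": 0, "blue": 1, "green": 2}

# One-hot vectors for the same three categories
one_hot = {
    "red": np.array([1, 0, 0]),
    "blue": np.array([0, 1, 0]),
    "green": np.array([0, 0, 1]),
}

# With integer codes, green looks twice as far from red as blue does
int_red_blue = abs(int_codes["red"] - int_codes["blue"])
int_red_green = abs(int_codes["red"] - int_codes["green"])

# With one-hot vectors, every pair of categories is equally far apart
oh_red_blue = np.linalg.norm(one_hot["red"] - one_hot["blue"])
oh_red_green = np.linalg.norm(one_hot["red"] - one_hot["green"])

print(int_red_blue, int_red_green)
print(oh_red_blue, oh_red_green)
```

The integer codes invent a spacing between colors that does not exist in the data, while the one-hot representation treats all colors symmetrically.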
In a production environment, you should use scikit-learn transformers to ensure your encoding logic is reproducible. We will use OrdinalEncoder for ordinal data and OneHotEncoder for nominal data.
```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Sample dataset
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium'],
    'color': ['red', 'blue', 'green', 'blue']
})

# 1. Label/Ordinal Encoding
# We define the order explicitly
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# fit_transform returns a 2D array; ravel() flattens it into a single column
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']]).ravel()

# 2. One-Hot Encoding
# sparse_output=False returns a dense array for easier viewing
# (the parameter is named sparse_output in scikit-learn 1.2 and later)
ohe = OneHotEncoder(sparse_output=False)
ohe_results = ohe.fit_transform(df[['color']])
ohe_df = pd.DataFrame(ohe_results, columns=ohe.get_feature_names_out(['color']))

# Combine back to the original dataframe
df_final = pd.concat([df, ohe_df], axis=1).drop('color', axis=1)
print(df_final)
```
For our running project, locate a column in your dataset that contains categories (e.g., "Department," "Status," or "Region").
1. Inspect the column's values with df['column'].value_counts().
2. Apply OrdinalEncoder if the column is ordinal, or OneHotEncoder if it is nominal.

The dummy variable trap: When using OneHotEncoder, you might be tempted to include all columns. However, if you have two categories (A and B), you only need one column (is_A). If it's 1, it's A; if it's 0, it's B. The second column is redundant and can cause issues in linear models (multicollinearity). Use drop='first' in your OneHotEncoder settings to handle this.

Unknown categories: If the encoder meets a category at prediction time that it did not see during fitting, it raises an error by default. Set handle_unknown='ignore' in your OneHotEncoder to safely treat unknown categories as all-zeros.

Encoding is the process of converting non-numeric categorical data into a format that algorithms can process. By using OrdinalEncoder for ranked data and OneHotEncoder for nominal data, you ensure that your model interprets your features correctly. Always remember to handle unknown categories and watch out for the dummy variable trap to keep your model performant and reliable.
Up next: We will learn how to wrap these preprocessing steps into a Pipeline object to ensure your data transformation logic is reusable and mistake-proof.
Encoding Categorical Variables