Mahamudul Hasan Rubel
Lesson 31 of the AI/ML Foundations: Core Concepts & First Models course
AI/ML · June 25, 2026 · 3 min read

Advanced Feature Transformation: Handling Skewed Data Distributions

Master advanced feature transformations to fix skewed data distributions. Learn to apply log and power transforms to improve your model's predictive accuracy.

machine learning · feature engineering · data science · scikit-learn · python · ai · machine-learning

Previously in this course, we covered Feature Engineering Strategies: Boosting Model Predictive Power, where we discussed creating interaction terms and polynomial features. While adding new features is powerful, it is equally important to refine the existing ones. This lesson focuses on data distribution—specifically, how to handle features that are heavily skewed and prevent them from biasing your model.

Understanding Skewed Data and Why It Matters

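Linear models such as Linear Regression do not strictly require normally distributed inputs; the classical normality assumption applies to the residuals. In practice, though, they tend to behave better when features are roughly symmetric. When a feature is highly skewed, meaning it has a long tail on one side, a handful of extreme values can dominate the fit and the model struggles to find a representative "line of best fit."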

Imagine you are predicting house prices. A feature like "square footage" might be normally distributed, but "distance to the city center" or "property taxes" often show a long right-tail (many low values, few extremely high values). This asymmetry forces the model to over-index on those few extreme outliers, leading to poor generalization.
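Before transforming anything, it helps to put a number on the asymmetry. The following is a minimal sketch that uses pandas' Series.skew() on two synthetic samples; the variable names and the distributions standing in for "square footage" and "property taxes" are assumptions made for illustration, not data from this course's project. A value near 0 indicates a roughly symmetric feature, and a common rule of thumb treats an absolute skewness above about 1 as heavily skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the features mentioned above (synthetic data)
square_footage = pd.Series(rng.normal(loc=1500, scale=300, size=5000))
property_taxes = pd.Series(rng.lognormal(mean=8, sigma=1, size=5000))

# Sample skewness: close to 0 for symmetric data, large and positive for a long right tail
skew_symmetric = square_footage.skew()
skew_right = property_taxes.skew()

print(f"square_footage skew: {skew_symmetric:.2f}")
print(f"property_taxes skew: {skew_right:.2f}")
```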

Applying the Log Transform

The log transform is the most common tool for squashing long right-tails. By taking the logarithm of each value in a feature, you compress the high end of the range while expanding the low end.

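Mathematically, it turns multiplicative relationships into additive ones, since log(a·b) = log(a) + log(b). The logarithm is undefined at zero (np.log(0) returns -inf with a runtime warning rather than a usable number), so if your data contains zeros, use np.log1p, which computes log(1 + x) and maps 0 to 0.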
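Both claims are easy to check numerically. The sketch below uses arbitrary example values (assumptions chosen for illustration) to show that the log of a product equals the sum of the logs, and to compare how np.log and np.log1p behave on an array that contains a zero.

```python
import numpy as np

# Multiplicative becomes additive: log(a * b) == log(a) + log(b)
a, b = 20.0, 500.0
log_of_product = np.log(a * b)
sum_of_logs = np.log(a) + np.log(b)

# Zeros: plain log yields -inf, log1p yields 0
values = np.array([0.0, 1.0, 10.0, 1000.0])
with np.errstate(divide="ignore"):  # silence the divide-by-zero warning for the demo
    plain_log = np.log(values)
safe_log = np.log1p(values)

# expm1 is the exact inverse of log1p, so the original values can be recovered
restored = np.expm1(safe_log)

print(plain_log)
print(safe_log)
```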

PYTHON
import numpy as np
import matplotlib.pyplot as plt

# Simulate right-skewed data (seeded so the plots are reproducible)
rng = np.random.default_rng(42)
data = rng.exponential(scale=2, size=1000)

# Apply log transform; log1p computes log(1 + x), so zeros are safe
transformed_data = np.log1p(data)

# Visualize the effect side by side
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].hist(data, bins=30)
ax[0].set_title("Original Skewed Data")
ax[1].hist(transformed_data, bins=30)
ax[1].set_title("Log Transformed Data")
plt.tight_layout()
plt.show()

Using Power Transforms (Box-Cox and Yeo-Johnson)

Sometimes the log transform isn't enough. If your data is skewed but the relationship isn't perfectly logarithmic, you need a more flexible approach. Power transforms systematically search for the best exponent to make your data as "normal" as possible.

  1. Box-Cox Transform: Works only on strictly positive data. It finds an optimal lambda ($\lambda$) value to stabilize variance and minimize skewness.
  2. Yeo-Johnson Transform: The default method in scikit-learn's PowerTransformer. It handles zero and negative values as well as positive ones, making it safer for general-purpose pipelines.

Scikit-learn provides a PowerTransformer that automates this:

PYTHON
from sklearn.preprocessing import PowerTransformer

# Initialize the transformer
pt = PowerTransformer(method='yeo-johnson')

# Reshape data for sklearn (needs a 2D array)
data_reshaped = data.reshape(-1, 1)

# Fit and transform
data_normalized = pt.fit_transform(data_reshaped)
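To see what the transformer actually learned, you can inspect its fitted lambdas_ attribute and compare the skewness after each method. This is a minimal sketch on synthetic, strictly positive data (an assumption so that Box-Cox is applicable); the results dictionary and the box_cox_rejected flag are illustrative names, not part of scikit-learn. The last part shows that Box-Cox raises a ValueError when the data contains zeros or negative values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Strictly positive, right-skewed sample shaped (n_samples, 1)
X = rng.exponential(scale=2, size=(1000, 1)) + 1e-3

results = {}
for method in ("box-cox", "yeo-johnson"):
    pt = PowerTransformer(method=method)
    X_t = pt.fit_transform(X)
    results[method] = {
        "lambda": pt.lambdas_[0],               # fitted exponent for the single feature
        "skew": pd.Series(X_t.ravel()).skew(),  # skewness after transforming
    }

print("skew before:", pd.Series(X.ravel()).skew())
print(results)

# Box-Cox only accepts strictly positive data
try:
    PowerTransformer(method="box-cox").fit(np.array([[-1.0], [0.0], [2.0]]))
    box_cox_rejected = False
except ValueError:
    box_cox_rejected = True
```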

Hands-on Exercise: Normalize Your Project Data

In your current project dataset, look for a numerical feature that shows a long tail in a histogram (refer back to Exploratory Data Analysis Fundamentals if you need a refresher).

  1. Identify a skewed feature.
  2. Apply np.log1p to that column in your DataFrame.
  3. Plot the histogram again to verify the distribution has become more symmetric.
  4. If the skew remains, try PowerTransformer and compare the results.
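The four steps above could be sketched as follows. The DataFrame and the property_tax column are hypothetical placeholders filled with synthetic data; substitute your own project dataset and column name. Skewness values are printed alongside each step so you have a number to compare in addition to the histograms.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical project data: replace with your own DataFrame
rng = np.random.default_rng(7)
df = pd.DataFrame({"property_tax": rng.lognormal(mean=3, sigma=1, size=2000)})

# Step 1: identify a skewed feature
skew_before = df["property_tax"].skew()

# Step 2: apply np.log1p to the column
df["property_tax_log"] = np.log1p(df["property_tax"])

# Step 3: verify the distribution is more symmetric
# (use df["property_tax_log"].hist(bins=30) to inspect it visually)
skew_log = df["property_tax_log"].skew()

# Step 4: try PowerTransformer and compare the results
pt = PowerTransformer(method="yeo-johnson")
df["property_tax_yj"] = pt.fit_transform(df[["property_tax"]]).ravel()
skew_yj = df["property_tax_yj"].skew()

print(f"before: {skew_before:.2f}, log1p: {skew_log:.2f}, yeo-johnson: {skew_yj:.2f}")
```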

Common Pitfalls

  • Transforming the Target: If you transform your target variable (the label), remember that your model's predictions will be in the transformed space. You will need to apply the inverse transformation (e.g., np.expm1 if you used log1p) to interpret the results in the original units.
  • Data Leakage: Always fit your transformer on the training set and transform the test set using that same fit. If you fit on the entire dataset, you are leaking information about the distribution of your test data into your training process.
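  • Negative Values: The plain log is undefined for zeros and negative values, log1p handles zeros but still fails for values at or below -1, and Box-Cox requires strictly positive data. Always check your data range before applying these techniques; stick to Yeo-Johnson if you aren't sure.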
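One way to avoid the first two pitfalls is to let scikit-learn manage the fitting and the inverse transformation for you. The sketch below is an illustration on synthetic data, not the course project's code: the feature transformer sits inside a pipeline so it is fitted on the training split only, and TransformedTargetRegressor applies np.log1p to the target during training and np.expm1 to the predictions. The data-generating formula is an assumption chosen only to produce a skewed, positive target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic skewed feature and a skewed, strictly positive target (illustrative only)
rng = np.random.default_rng(0)
X = rng.exponential(scale=2, size=(500, 1))
y = 1000 * (1 + X[:, 0]) ** 1.5 * np.exp(rng.normal(0, 0.1, size=500))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = TransformedTargetRegressor(
    # The pipeline fits PowerTransformer on whatever data is passed to fit()
    regressor=make_pipeline(PowerTransformer(method="yeo-johnson"), LinearRegression()),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions, back to original units
)

# Only the training split is seen during fitting, so nothing leaks from the test set
model.fit(X_train, y_train)

# Predictions come back in the original units of y
preds = model.predict(X_test)
print(preds[:5])
```

Because the transformer lives inside the pipeline, the same pattern also stays leakage-free under cross-validation, where each fold refits the transformer on its own training portion.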

Recap

We’ve learned that non-linear transformations are essential for cleaning up data distribution issues that hinder model performance. By applying a log transform for simple right-skewed features or a PowerTransformer for more complex cases, you ensure your features are better aligned with the assumptions of your algorithms. These small adjustments often lead to significant gains in model stability.

Up next: We will discuss how to implement regularization techniques to prevent your models from over-relying on specific features.
