Master advanced feature transformations to fix skewed data distributions. Learn to apply log and power transforms to improve your model's predictive accuracy.
Previously in this course, we covered Feature Engineering Strategies: Boosting Model Predictive Power, where we discussed creating interaction terms and polynomial features. While adding new features is powerful, it is equally important to refine the existing ones. This lesson focuses on data distribution—specifically, how to handle features that are heavily skewed and prevent them from biasing your model.
Linear models such as Linear Regression do not strictly require normally distributed inputs (the classical normality assumption concerns the residuals), but they tend to fit better when features are roughly symmetric. When a feature is highly skewed, meaning it has a long tail on one side, a handful of extreme values gain outsized leverage and the model struggles to find a representative "line of best fit."
Imagine you are predicting house prices. A feature like "square footage" might be roughly normally distributed, but "distance to the city center" or "property taxes" often show a long right tail (many low values, few extremely high values). This asymmetry pushes the model to give disproportionate weight to those few extreme values, leading to poor generalization.
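Before transforming anything, it can help to put a number on the asymmetry. The sketch below is illustrative only: the two columns are simulated stand-ins for the housing features mentioned above (the column names and distribution parameters are assumptions, not real data), and it uses pandas' built-in sample skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical housing features: one roughly symmetric, one right-skewed
df = pd.DataFrame({
    "square_footage": rng.normal(loc=1800, scale=300, size=5000),
    "property_tax": rng.lognormal(mean=8, sigma=1, size=5000),
})

# Sample skewness: near 0 for symmetric data, positive for a long right tail
skewness = df.skew()
print(skewness)

# In a right-skewed feature the mean is pulled above the median
mean_above_median = df["property_tax"].mean() > df["property_tax"].median()
print(mean_above_median)
```

A skewness value far from zero, or a mean sitting well above the median, is a reasonable signal that a feature is a candidate for the transforms discussed next.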
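The log transform is the most common tool for squashing long right tails. By taking the logarithm of each value in a feature, you compress the high end of the range while expanding the low end.
Mathematically, the logarithm turns multiplicative relationships into additive ones, because log(a * b) = log(a) + log(b). The logarithm is undefined at zero (np.log(0) evaluates to -inf with a runtime warning rather than a usable number), so if your data contains zeros, use np.log1p, which computes log(1 + x). Note that log1p still requires every value to be greater than -1.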
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulate right-skewed data
data = np.random.exponential(scale=2, size=1000)

# Apply log transform
transformed_data = np.log1p(data)

# Visualize the effect
fig, ax = plt.subplots(1, 2)
ax[0].hist(data, bins=30)
ax[0].set_title("Original Skewed Data")
ax[1].hist(transformed_data, bins=30)
ax[1].set_title("Log Transformed Data")
plt.show()
```
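The two properties of the log transform described above can also be checked numerically. This is a small sketch with made-up values (the numbers in `a`, `b`, and `x` are arbitrary and chosen only for illustration); it compares a plain logarithm with log1p on an array that contains a zero.

```python
import numpy as np

# Multiplicative relationship becomes additive: log(a * b) == log(a) + log(b)
a, b = 20.0, 50.0
is_additive = np.isclose(np.log(a * b), np.log(a) + np.log(b))
print(is_additive)

# Values spanning several orders of magnitude, including a zero
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# A plain log sends zero to -inf; the warning is silenced here on purpose
with np.errstate(divide="ignore"):
    plain_log = np.log(x)

# log1p computes log(1 + x), so zero maps to zero and every result is finite
shifted_log = np.log1p(x)

print(plain_log)
print(shifted_log)
```

The ordering of the values is preserved in both cases; only the spacing between them changes, which is what compresses the long tail.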
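Sometimes the log transform isn't enough. If a log over-corrects or under-corrects the skew, you need a more flexible approach. Power transforms such as Box-Cox and Yeo-Johnson estimate an exponent (usually written λ) from the data itself, choosing the value that makes the transformed feature as close to "normal" as possible. Box-Cox requires strictly positive values, while Yeo-Johnson also accepts zeros and negatives.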
Scikit-learn provides a PowerTransformer that automates this:
```python
from sklearn.preprocessing import PowerTransformer

# Initialize the transformer
pt = PowerTransformer(method='yeo-johnson')

# Reshape data for sklearn (needs 2D array)
data_reshaped = data.reshape(-1, 1)

# Fit and transform
data_normalized = pt.fit_transform(data_reshaped)
```
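To see what the transformer actually learned, you can inspect its fitted `lambdas_` attribute and compare skewness before and after. The following is a self-contained sketch on simulated exponential data (the seed and sample size are arbitrary assumptions); it also round-trips the values with `inverse_transform`.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(seed=42)
X = rng.exponential(scale=2, size=(1000, 1))  # right-skewed, one feature column

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
X_t = pt.fit_transform(X)

# One estimated exponent (lambda) per input column
print("Estimated lambda:", pt.lambdas_)

# Skewness before and after the transform
skew_before = skew(X.ravel())
skew_after = skew(X_t.ravel())
print("Skewness before:", skew_before)
print("Skewness after:", skew_after)

# inverse_transform maps the values back to the original units
X_back = pt.inverse_transform(X_t)
round_trip_ok = np.allclose(X, X_back)
print(round_trip_ok)
```

Because `standardize=True` is the default, the output is also rescaled to zero mean and unit variance, so the transformed values are not directly comparable in magnitude to a plain log1p result.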
In your current project dataset, look for a numerical feature that shows a long tail in a histogram (refer back to Exploratory Data Analysis Fundamentals if you need a refresher).
1. Apply np.log1p to that column in your DataFrame.
2. Apply PowerTransformer and compare the results.
3. Invert the transform (np.expm1 if you used log1p) to interpret the results in the original units.
Avoid applying a plain log to data with negative values or zeros. Always check your data range before applying these techniques; stick to Yeo-Johnson if you aren't sure.
We’ve learned that non-linear transformations are essential for cleaning up data distribution issues that hinder model performance. By applying a log transform for simple right-skewed features or a PowerTransformer for more complex cases, you bring your features closer to the conditions under which your algorithms work best. These small adjustments can lead to noticeable gains in model stability.
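The practice steps in this lesson can be sketched end to end as follows. This is only an illustration under stated assumptions: the DataFrame is simulated and the column name `monthly_spend` is hypothetical, so substitute your own skewed feature.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(seed=1)

# Stand-in for a project dataset; "monthly_spend" is a hypothetical skewed column
df = pd.DataFrame({"monthly_spend": rng.lognormal(mean=3, sigma=1, size=2000)})

# Check the data range first: log1p needs every value to be greater than -1
assert df["monthly_spend"].min() >= 0

# Log transform
df["spend_log"] = np.log1p(df["monthly_spend"])

# Power transform (scikit-learn expects a 2D array, hence the double brackets)
pt = PowerTransformer(method="yeo-johnson")
df["spend_yj"] = pt.fit_transform(df[["monthly_spend"]].to_numpy()).ravel()

# Compare skewness across the raw and transformed versions
skew_table = df[["monthly_spend", "spend_log", "spend_yj"]].skew()
print(skew_table)

# Map log-scale values back to the original units
recovered = np.expm1(df["spend_log"])
recovered_ok = np.allclose(recovered, df["monthly_spend"])
print(recovered_ok)
```

For the Yeo-Johnson column, the matching way back to original units would be `pt.inverse_transform` rather than np.expm1.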
Up next: We will discuss how to implement regularization techniques to prevent your models from over-relying on specific features.