Feature Selection in Pipelines: Improving Model Efficiency

Learn to integrate SelectKBest and RFE into your scikit-learn pipelines to automate feature selection, reduce overfitting, and improve model efficiency.


Previously in this course, we covered how to encode categorical variables and manage data transformations within a ColumnTransformer. Those steps prepare your data, but raw datasets often contain "noise" features that inflate model complexity without contributing predictive power.

This lesson adds automated feature selection to your toolkit. By integrating selection stages directly into your Pipeline, you ensure that the same logic applied during training is consistently applied to new, unseen data, preventing the common trap of manual feature pruning.

Why Automate Feature Selection?

In production, you rarely want to curate your feature list by hand. Manual selection is prone to human error and difficult to reproduce. Automated feature selection offers two primary benefits:

Improved Model Efficiency: Fewer features mean lower memory usage and faster inference times.
Reduced Overfitting: By removing irrelevant or redundant features, you lower the variance of your model, allowing it to generalize better to new data.

We categorize these methods into filters (statistical tests) and wrappers (iterative model-based methods).

Using SelectKBest for Filter-Based Selection

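Filter methods evaluate the intrinsic properties of features independently of the final model. SelectKBest is the standard choice here: it scores each feature with a univariate statistical test and keeps the k highest-scoring ones. For classification targets, common score functions are the ANOVA F-value (f_classif) and chi-squared (chi2, which requires non-negative feature values); for regression targets, use f_regression.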

Because it is a transformer, it slots perfectly into a Pipeline.


PYTHON
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Constructing the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(score_func=f_classif, k=10)),
    ('classifier', LogisticRegression())
])

In this setup, the SelectKBest step calculates the F-value for every feature during fit. It then drops everything except the 10 highest-scoring features. When you call predict on new data, the pipeline automatically applies that same 10-feature mask.
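
To make the mask concrete, here is a minimal sketch that fits the same pipeline and inspects which columns survive. The synthetic dataset from make_classification, its sizes, and the random_state values are assumptions chosen for illustration; they are not part of the course project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 30 features, 5 informative.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_classif, k=10)),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Boolean mask over the 30 input columns; True marks a kept feature.
mask = pipeline.named_steps["selector"].get_support()
print("Kept column indices:", mask.nonzero()[0])

# The classifier is trained only on the selected columns.
n_seen = pipeline.named_steps["classifier"].coef_.shape[1]
print("Features seen by the classifier:", n_seen)

# predict() scales, applies the stored mask, then classifies.
predictions = pipeline.predict(X_test)
```

The mask is learned once during fit and stored on the selector, which is why predict never needs the target to decide which columns to keep.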

Using RFE for Wrapper-Based Selection

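Wrappers are more computationally expensive than filters but often more effective, because they judge features by how a model actually uses them. RFE (Recursive Feature Elimination) fits the estimator repeatedly, removing the weakest feature (by coefficient or feature importance) at each iteration until the desired number of features is reached.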

If you want the number of features to be chosen by cross-validation rather than fixed in advance, consider RFECV, scikit-learn's cross-validated variant of recursive feature elimination. When you already know how many features the pipeline should keep, standard RFE is the simpler choice:


PYTHON
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Using RFE to select 5 features based on Random Forest importance
rfe_pipeline = Pipeline([
    (CE9178">'scaler', StandardScaler()),
    (CE9178">'selector', RFE(estimator=RandomForestClassifier(), n_features_to_select=5)),
    (CE9178">'classifier', LogisticRegression())
])
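
As a rough sketch of what RFE records while it runs, the example below fits the pipeline on synthetic data and reads the selector's support_ and ranking_ attributes. The dataset, the n_estimators=50 setting (used only to keep the run short), and the random_state values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 12 features, 4 informative.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

rfe_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", RFE(
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        n_features_to_select=5,
        step=1,  # drop one feature per iteration (the default)
    )),
    ("classifier", LogisticRegression()),
])
rfe_pipeline.fit(X, y)

selector = rfe_pipeline.named_steps["selector"]

# support_ is the boolean mask of kept features.
print("Kept mask:", selector.support_)

# ranking_ is 1 for every kept feature; larger values were eliminated earlier.
print("Ranking:  ", selector.ranking_)
```

With 12 input features, 5 to keep, and step=1, the forest is refit once per eliminated feature, which is where the extra cost of a wrapper comes from.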

Practice Exercise: Implementing a Selection Pipeline

Your goal is to optimize the "Project Pipeline" we have been building.

Create a Pipeline that includes your existing preprocessing steps (imputation, scaling).
Insert a SelectKBest step between scaling and modeling.
Use a RandomForestClassifier as your estimator.
Experiment with different values of k. How does the training time change as you decrease k? Does the validation score drop significantly?
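
If you want a starting point, the steps above can be sketched roughly as follows. This is not the project pipeline itself: the synthetic data, the injected missing values, the median imputation strategy, and the list of k values are all assumptions standing in for your own dataset and preprocessing.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 40 features, 6 informative, about 5% missing values.
X, y = make_classification(
    n_samples=400, n_features=40, n_informative=6, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

results = {}
for k in (40, 20, 10, 5):
    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("selector", SelectKBest(score_func=f_classif, k=k)),
        ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])
    start = time.perf_counter()
    # Each fold refits the whole pipeline, selector included.
    scores = cross_val_score(pipe, X, y, cv=5)
    elapsed = time.perf_counter() - start
    results[k] = (scores.mean(), elapsed)
    print(f"k={k:>2}  mean CV accuracy={scores.mean():.3f}  time={elapsed:.2f}s")
```

Timings on a dataset this small are noisy, so treat them as a rough signal and rerun on your real data before drawing conclusions.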

Common Pitfalls

Fitting on the whole dataset: Never perform feature selection on your entire dataset before splitting. If your feature selection process "sees" the target variable (which happens in supervised selection), it will leak information. Always include the selector inside the pipeline and fit that pipeline only on your training fold.
Ignoring Multi-collinearity: SelectKBest treats features independently. If you have two highly correlated features, it might keep both. For a deeper look at managing redundant features, see our guide on handling multi-collinearity.
Computational Cost: RFE is expensive if you have thousands of features. If your pipeline is hanging, start with a filter method like SelectKBest to prune the search space before using a wrapper.
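
The first pitfall is easy to underestimate, so here is a small illustrative experiment. It uses pure noise (random features and random labels, both assumptions made up for this sketch), so no model should beat chance. Selecting features on the full dataset before cross-validation usually produces a score that looks far better than chance anyway, while the pipeline version stays honest.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise (assumption for illustration): labels are unrelated to the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# Leaky: the selector sees every label, including those of future validation rows.
X_leaky = SelectKBest(score_func=f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Honest: the selector is refit inside each training fold only.
honest_pipeline = Pipeline([
    ("selector", SelectKBest(score_func=f_classif, k=20)),
    ("classifier", LogisticRegression()),
])
honest_score = cross_val_score(honest_pipeline, X, y, cv=5).mean()

print(f"Selection before CV (leaky):  {leaky_score:.2f}")
print(f"Selection inside pipeline:    {honest_score:.2f}")
```

Any gap between the two numbers is entirely an artifact of leakage, since there is no real signal in the data to find.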

Recap

Feature selection is a critical step in building lean, performant ML systems. By using SelectKBest for fast filtering and RFE for targeted reduction, you can significantly improve model efficiency. Remember to keep these steps encapsulated within your Pipeline to ensure that your feature selection logic remains strictly tied to your training process, preventing data leakage and ensuring reproducibility.

Up next: We will discuss how to identify and mitigate data leakage more broadly across your entire pipeline architecture.
