Learn to integrate SelectKBest and RFE into your scikit-learn pipelines to automate feature selection, reduce overfitting, and improve model efficiency.
Previously in this course, we covered how to encode categorical variables and manage data transformations within a ColumnTransformer. Those steps prepare your data, but raw datasets often contain "noise" features that inflate model complexity without contributing predictive power.
This lesson adds automated feature selection to your toolkit. By integrating selection stages directly into your Pipeline, you ensure that the same logic applied during training is consistently applied to new, unseen data, preventing the common trap of manual feature pruning.
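In production, you rarely want to curate your feature list by hand. Manual selection is prone to human error and difficult to reproduce. Automated feature selection offers two primary benefits: it reduces overfitting by discarding noise features, and it improves model efficiency by shrinking the input the model has to process.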
We categorize these methods into filters (statistical tests) and wrappers (iterative model-based methods).
Filter methods evaluate the intrinsic properties of features independently of the final model. SelectKBest is the standard choice here, as it ranks features based on a statistical test (like the ANOVA F-value, f_classif, or chi-squared, chi2, for classification, and f_regression for regression). Note that chi2 requires non-negative feature values.
Because it is a transformer, it slots perfectly into a Pipeline.
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Constructing the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(score_func=f_classif, k=10)),
    ('classifier', LogisticRegression())
])
```
In this setup, the SelectKBest step calculates the F-value for every feature during fit. It then drops everything except the 10 highest-scoring features. When you call predict on new data, the pipeline automatically applies that same 10-feature mask.
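To make the fitted mask concrete, here is a minimal sketch that fits the same kind of pipeline and inspects which columns survived. The synthetic dataset from make_classification, its size, and the max_iter setting are assumptions chosen for illustration, not part of the course project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 25 features, 5 informative
X, y = make_classification(
    n_samples=300, n_features=25, n_informative=5, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_classif, k=10)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Boolean mask over the 25 input columns; True marks a kept feature
mask = pipeline.named_steps["selector"].get_support()
n_kept = int(mask.sum())
print("kept feature indices:", mask.nonzero()[0])

# predict() reuses the mask learned during fit; scores are not recomputed
preds = pipeline.predict(X_test)

# The classifier only ever saw the selected columns
n_coefs = pipeline.named_steps["classifier"].coef_.shape[1]
print("features seen by classifier:", n_coefs)
```

Reading the mask through named_steps is a convenient way to check which features the model actually depends on before you deploy it.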
Wrappers are more computationally expensive but often more effective. RFE (Recursive Feature Elimination) fits the model repeatedly, removing the weakest feature at each iteration until the desired number of features is reached.
If you need a more robust version that cross-validates the number of features to keep, consider RFECV, scikit-learn's cross-validated variant of recursive feature elimination. For a fixed number of features, standard RFE is the simpler choice:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Using RFE to select 5 features based on Random Forest importance
rfe_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', RFE(estimator=RandomForestClassifier(), n_features_to_select=5)),
    ('classifier', LogisticRegression())
])
```
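The elimination process is easier to follow when you look at what RFE records after fitting. The sketch below runs RFE on its own, outside a pipeline, and prints its support_ and ranking_ attributes. The synthetic data, the small n_estimators value, and the fixed random_state are assumptions made to keep the example fast and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (an assumption for illustration): 12 features, 4 informative
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)

rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=5,
    step=1,  # drop one feature per iteration (the default)
)
rfe.fit(X, y)

# support_ is a boolean mask of the surviving features
selected = rfe.support_.nonzero()[0]
print("selected feature indices:", selected)

# ranking_ is 1 for every selected feature; larger values were eliminated earlier
print("ranking:", rfe.ranking_)
```

Because the estimator is refit on every round, the cost grows with the number of features to eliminate. Raising step removes several features per round and trades some precision for speed.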
Your goal is to optimize the "Project Pipeline" we have been building.
1. Build a Pipeline that includes your existing preprocessing steps (imputation, scaling).
2. Add a SelectKBest step between scaling and modeling.
3. Use RandomForestClassifier as your estimator.
4. Experiment with k. How does the training time change as you decrease k? Does the validation score drop significantly?

Watch out for two common pitfalls:

- SelectKBest treats features independently. If you have two highly correlated features, it might keep both. For a deeper look at managing redundant features, see our guide on handling multicollinearity.
- RFE is expensive if you have thousands of features. If your pipeline is hanging, start with a filter method like SelectKBest to prune the search space before using a wrapper.

Feature selection is a critical step in building lean, performant ML systems. By using SelectKBest for fast filtering and RFE for targeted reduction, you can significantly improve model efficiency. Remember to keep these steps encapsulated within your Pipeline so that your feature selection logic remains strictly tied to your training process, preventing data leakage and ensuring reproducibility.
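The advice above about pruning with a filter before running a wrapper can be sketched as a two-stage selection inside one pipeline. The step names, the synthetic 200-feature dataset, and the specific values of k, step, and n_estimators are all assumptions for illustration; tune them for your own data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic wide dataset (an assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=8, random_state=0
)

two_stage = Pipeline([
    ("scaler", StandardScaler()),
    # Cheap filter first: 200 -> 30 features with a single scoring pass
    ("filter", SelectKBest(score_func=f_classif, k=30)),
    # Wrapper second: RFE now only has 30 candidates to work through
    ("wrapper", RFE(
        estimator=RandomForestClassifier(n_estimators=25, random_state=0),
        n_features_to_select=5,
        step=5,  # drop 5 features per round to cut the number of refits
    )),
    ("classifier", LogisticRegression(max_iter=1000)),
])
two_stage.fit(X, y)

after_filter = int(two_stage.named_steps["filter"].get_support().sum())
after_wrapper = int(two_stage.named_steps["wrapper"].support_.sum())
print("features after filter:", after_filter)
print("features after wrapper:", after_wrapper)
```

Both selectors are fitted only on the data passed to fit, so the same pipeline can be handed to cross-validation without leaking information from held-out folds.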
Up next: We will discuss how to identify and mitigate data leakage more broadly across your entire pipeline architecture.