Learn to integrate feature engineering into your Scikit-Learn pipeline and run structured grid searches to improve your model's performance over the baseline.
Previously in this course, we explored feature engineering strategies to boost predictive power and implemented grid search to find optimal model settings. In this lesson, we consolidate these steps by embedding your engineering logic directly into a Pipeline object, ensuring your model is robust, reproducible, and ready for deployment.
In production ML, the most common source of "silent" bugs is training-serving skew, where the transformations applied to data during training differ from those applied to data in production. By bundling your scalers, encoders, and model into a single Pipeline, you treat the entire sequence as one object. When you call predict() on new data, the pipeline automatically applies the same fitted transformations it learned during training.
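A minimal sketch of that guarantee, assuming synthetic data and a plain LinearRegression (both are stand-ins for illustration, not the project's data or model). It checks that pipeline predictions on raw rows match what you would get by manually reusing the fitted scaler:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_train = X_train @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))

# Manual route: you must remember to reuse the *fitted* scaler at serving time
scaler = StandardScaler().fit(X_train)
manual_model = LinearRegression().fit(scaler.transform(X_train), y_train)
manual_preds = manual_model.predict(scaler.transform(X_new))

# Pipeline route: statistics learned in fit() are reapplied inside predict()
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_new)

same = np.allclose(manual_preds, pipe_preds)
print(f"Pipeline matches manual preprocessing: {same}")
```

The manual route only stays correct if every serving script repeats the transform step with the same fitted scaler; the pipeline removes that opportunity for drift.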
To refine our current project model, we need to move away from manual data preparation. Instead of transforming your DataFrame and then passing it to a model, you define a sequence of "steps."
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Define features
numeric_features = ['age', 'income', 'sq_ft']
categorical_features = ['location_type']

# Create preprocessor for different data types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Build the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])
```
Once your pipeline is structured, you can perform hyperparameter tuning across the entire stack. Because the pipeline names its components, you access them using the step_name__parameter_name syntax.
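To make the naming rule concrete, here is a small sketch that inspects and sets nested parameters on an unfitted pipeline. The step names and column names below are illustrative assumptions that mirror the lesson's example:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative pipeline; column names are placeholders
demo_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["age", "income"])]
)
demo_pipeline = Pipeline(steps=[
    ("preprocessor", demo_preprocessor),
    ("regressor", RandomForestRegressor(random_state=42)),
])

# Every tunable key is listed by get_params(); filter to one step's keys
params = demo_pipeline.get_params()
regressor_keys = sorted(k for k in params if k.startswith("regressor__"))
print(regressor_keys)

# The same double-underscore syntax works for set_params()
demo_pipeline.set_params(regressor__n_estimators=50)
n_estimators = demo_pipeline.named_steps["regressor"].n_estimators
print(n_estimators)

# Nesting chains further: step -> transformer name -> parameter
has_nested_key = "preprocessor__num__with_mean" in params
print(has_nested_key)
```

Printing the keys from get_params() before writing a grid is a quick way to confirm that each key you plan to tune actually exists.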
```python
from sklearn.model_selection import GridSearchCV

# Define the grid
param_grid = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_depth': [None, 10, 20]
}

# Run the search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
```
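Because the search uses neg_mean_squared_error, the reported best score is a negated MSE. The sketch below shows one way to turn it into an RMSE, assuming synthetic data from make_regression and a deliberately tiny grid so it runs quickly (neither comes from the project):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project's training data
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("regressor", RandomForestRegressor(random_state=42)),
    ]),
    param_grid={
        "regressor__n_estimators": [10, 20],
        "regressor__max_depth": [None, 5],
    },
    cv=3,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# scikit-learn maximizes scores, so MSE is negated; flip the sign before the root
cv_rmse = float(np.sqrt(-demo_search.best_score_))
print(f"Cross-validated RMSE of the best candidate: {cv_rmse:.2f}")

# best_estimator_ is the whole pipeline, refit on all data passed to fit()
best_pipeline = demo_search.best_estimator_
print(type(best_pipeline).__name__)
```

This cross-validated RMSE is an estimate from the training data only, so it does not replace a final check on the hold-out test set.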
The final step is to compare your refined model against the baseline linear model you trained earlier. A common mistake is to compare training error, which rewards overfitting; always compare performance on your hold-out test set using metrics like RMSE or R-squared.
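One way to set up that comparison is sketched below. It assumes synthetic data and uses a LinearRegression pipeline as a stand-in for your baseline and a small random forest as a stand-in for the refined model; swap in your own fitted objects:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data
X, y = make_regression(n_samples=200, n_features=4, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "baseline_linear": Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ]),
    "refined_forest": Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
    ]),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)          # fit on training rows only
    preds = model.predict(X_te)    # score on the untouched test rows
    rmse = float(np.sqrt(mean_squared_error(y_te, preds)))
    r2 = float(r2_score(y_te, preds))
    results[name] = {"rmse": rmse, "r2": r2}
    print(f"{name}: RMSE={rmse:.2f}, R^2={r2:.3f}")
```

On data that is genuinely linear, as make_regression produces, the baseline may well come out ahead, which is exactly the kind of result this comparison is meant to surface.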
1. Start with the Pipeline object from the previous lesson.
2. Add a PolynomialFeatures step to the pipeline before the regressor.
3. Use GridSearchCV to tune both the polynomial degree and the model's max_depth.

- Make sure StandardScaler or other transformers are inside the pipeline. If you scale your data before splitting, you leak information from the test set into your training process.
- If you rename a step in your Pipeline, remember to update your param_grid keys, or the grid search will fail with an "Invalid parameter" error.

Refining your project model is about moving from manual scripts to automated, reproducible workflows. By encapsulating transformations and models into pipelines, you ensure consistent data processing and simplify hyperparameter tuning. Always validate against your baseline to confirm that your "improvements" deliver real-world value.
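The exercise steps above could be approached along these lines. This is a sketch under assumptions, not a reference solution: the data is synthetic, the step name "poly" is chosen here for illustration, and the grid is kept tiny so it runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=1)

# Split first, so the scaler inside the pipeline never sees test rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

poly_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),  # placed before the regressor
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# Keys must match the step names defined above
poly_grid = {
    "poly__degree": [1, 2],
    "regressor__max_depth": [None, 5],
}

poly_search = GridSearchCV(poly_pipeline, poly_grid, cv=3, scoring="neg_mean_squared_error")
poly_search.fit(X_tr, y_tr)
print(poly_search.best_params_)

# A key whose prefix matches no step name is rejected with a ValueError
bad_key_error = None
try:
    poly_pipeline.set_params(polynomial__degree=2)
except ValueError as exc:
    bad_key_error = str(exc)
print(bad_key_error is not None)
```

Because the scaler and polynomial expansion live inside the pipeline, each cross-validation fold refits them on that fold's training rows only, which is what keeps the search free of leakage.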
Up next: We will evaluate feature importance to prune irrelevant variables and simplify our model.
Refining the Project Model