Learn how to instantiate, fit, and generate predictions with your first baseline linear model using Scikit-Learn to establish a performance benchmark.
Previously in this course, we covered the mechanics of linear regression and the importance of training and testing data splits. Now that your data is cleaned and partitioned, it's time to build your first baseline model.
Establishing a baseline is an essential early step in any machine learning project. It provides a "floor" for performance: a simple, interpretable model against which you can measure the effectiveness of more complex techniques.
In Scikit-Learn, the workflow follows a consistent API: you instantiate an estimator object, call .fit() to train it on your data, and call .predict() to generate outputs. When you build Scikit-Learn pipelines, this process becomes more robust because the pipeline applies the same transformation steps during both fitting and prediction, so you cannot accidentally skip or reorder them.
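As a rough illustration of the instantiate, fit, predict pattern described above, here is a minimal sketch of a pipeline that includes one transformation step. The synthetic data, the StandardScaler step, and the names demo_pipeline and demo_predictions are assumptions made for this example; they are not part of the course project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; substitute your own features and target)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Instantiate: a scaler followed by the estimator
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# Fit: the scaler learns its statistics from X_train only, then the model trains
demo_pipeline.fit(X_train, y_train)

# Predict: the already-fitted scaler is applied to X_test before the model predicts
demo_predictions = demo_pipeline.predict(X_test)
print(demo_predictions.shape)  # (40,)
```

Because the scaler is fitted inside the pipeline, its statistics come from the training rows only, which is one way to avoid leaking test-set information into preprocessing.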
For our running project, we will use a LinearRegression model. Since we have already preprocessed our features—handling missing data and feature selection—we can feed our training set directly into the pipeline.
```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# 1. Instantiate the model
model = LinearRegression()

# 2. Create the pipeline (assuming you have a preprocessor defined)
# If you haven't defined a preprocessor yet, use a simple identity or
# just the model itself for the absolute baseline.
baseline_pipeline = Pipeline([
    ('regressor', model)
])

# 3. Fit the model
# X_train and y_train come from your previous train-test split step
baseline_pipeline.fit(X_train, y_train)
print("Model training complete.")
```
Once the model is fitted, it has "learned" the coefficients (weights) that minimize the error on your training data. To see how it performs on unseen data, we pass the test set to the .predict() method.
```python
import pandas as pd

# 4. Generate predictions on the test set
y_pred = baseline_pipeline.predict(X_test)

# Compare the first 5 predictions to actual values
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison.head())
```
These initial predictions are your first real look at how well your features capture the underlying patterns in the target variable.
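To go one step beyond eyeballing a comparison table, the sketch below inspects the coefficients a fitted pipeline has learned and summarizes its test-set error with two standard metrics. It runs on synthetic data, and the variable names (X_syn, syn_pipeline, baseline_mae, baseline_r2) and the choice of mean absolute error and R² are assumptions for illustration, not necessarily the metrics used later in the course.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data with known weights (3.0, -2.0) and intercept 1.0, plus noise
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(300, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + 1.0 + rng.normal(scale=0.5, size=300)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

syn_pipeline = Pipeline([("regressor", LinearRegression())])
syn_pipeline.fit(Xs_train, ys_train)

# The fitted estimator is reachable by its step name
fitted_model = syn_pipeline.named_steps["regressor"]
print("Coefficients:", fitted_model.coef_)    # should land near 3.0 and -2.0
print("Intercept:", fitted_model.intercept_)  # should land near 1.0

# Summarize how far the predictions are from the held-out targets
ys_pred = syn_pipeline.predict(Xs_test)
baseline_mae = mean_absolute_error(ys_test, ys_pred)
baseline_r2 = r2_score(ys_test, ys_pred)
print(f"MAE: {baseline_mae:.3f}, R^2: {baseline_r2:.3f}")
```

Recording numbers like these for the baseline gives later models a concrete figure to beat.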
Using the dataset you cleaned in the project dataset initialization lesson:
1. Import LinearRegression from sklearn.linear_model.
2. Create a Pipeline that wraps the model.
3. Fit the pipeline using the X_train and y_train variables.
4. Generate predictions for X_test and store them in a variable called y_pred.
5. Compare y_test and y_pred.

Note: Scikit-Learn expects X to be a 2D array (samples, features) and y to be a 1D array (samples).

In this lesson, we transitioned from theory to application by instantiating a LinearRegression model, fitting it via a pipeline, and generating predictions on the test set. This baseline acts as your reference point for performance. By establishing this foundation, you now have a clear target to beat as you experiment with feature engineering and more advanced algorithms in the coming lessons.
Up next: We will examine the gap between your training results and test results to discuss training error vs generalization error.
Training the Baseline Linear Model