Master the mechanics of linear regression, from the line of best fit to variable relationships, and learn how to build your first predictive model.
Previously in this course, we covered Project Dataset Initialization to ensure our data was ready for modeling. Now that your data is clean, we move to the core of supervised learning: understanding how machines learn to make a prediction.
Linear regression is the "Hello World" of predictive modeling. It is a foundational statistical method used to model the relationship between an independent variable and a dependent variable by fitting a linear equation to observed data.
At its core, a linear model assumes that the relationship between your input (the independent variable, $x$) and your output (the dependent variable, $y$) can be approximated by a straight line.
You likely remember the high school algebra formula for a line: $y = mx + b$
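In machine learning, we adapt this notation slightly to represent the model we are training: $\hat{y} = w_0 + w_1x$, where $\hat{y}$ is the predicted value, $w_0$ is the intercept (the $b$ above), and $w_1$ is the slope (the $m$ above).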
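To make the notation concrete, here is a minimal sketch of the prediction equation as a plain Python function. The function name `predict` and the weight values are illustrative assumptions, chosen to match the $y = 2x + 5$ relationship simulated later in this lesson.

```python
def predict(x, w0, w1):
    """Return the prediction y_hat = w0 + w1 * x for a single input x."""
    return w0 + w1 * x


# Hypothetical weights: intercept w0 = 5, slope w1 = 2
w0, w1 = 5.0, 2.0

predictions = [predict(x, w0, w1) for x in [1, 2, 3]]
print(predictions)  # [7.0, 9.0, 11.0]
```

Training a model amounts to choosing `w0` and `w1` automatically instead of typing them in by hand.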
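The "line of best fit" is the line that minimizes the total distance between the actual data points and the line itself. For each point, the vertical gap between the actual value $y$ and the predicted value $\hat{y}$ is called the "error" or "residual." By adjusting $w_0$ and $w_1$, the algorithm finds the line with the smallest total error across your entire dataset; in ordinary least squares, the method behind scikit-learn's LinearRegression, that total is the sum of squared residuals.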
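The idea of "smallest total error" can be sketched in a few lines of plain Python. This example assumes the usual least-squares definition of total error (the sum of squared residuals) and uses made-up data points; the helper name `sum_squared_error` is hypothetical. It computes the closed-form least-squares weights for one feature and compares their error with that of an arbitrary alternative line.

```python
# Made-up data for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [7.2, 8.9, 11.1, 12.8, 15.0]


def sum_squared_error(w0, w1):
    """Total squared residual for the line y_hat = w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


# Closed-form least-squares solution for a single feature
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
w1 = (
    sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    / sum((x - x_mean) ** 2 for x in xs)
)
w0 = y_mean - w1 * x_mean

best = sum_squared_error(w0, w1)
guess = sum_squared_error(4.0, 2.5)  # an arbitrary alternative line

print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")
print(f"SSE of best fit: {best:.4f}")
print(f"SSE of guess:    {guess:.4f}")
```

Any other choice of intercept and slope gives an error at least as large as the least-squares line, which is what "best fit" means here.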
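In any regression task, you must distinguish between your variables: the independent variable ($x$, the input or feature you use to make a prediction) and the dependent variable ($y$, the output or target you want to predict).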
The goal of the model is to explain how changes in the independent variable influence the dependent variable. If the two have a strong linear relationship, a linear model can predict the dependent variable accurately. If the data is scattered randomly, a linear model will struggle to find a meaningful trend.
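One common way to put a number on "strong" versus "scattered" is the Pearson correlation coefficient, which ranges from -1 to 1 and measures linear association only. The sketch below uses synthetic data and NumPy's `np.corrcoef`; the variable names, noise level, and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
x = np.linspace(0, 10, 50)

# Synthetic targets: one follows x closely, the other is pure noise
y_linear = 2 * x + 5 + rng.normal(0, 1, size=x.size)
y_random = rng.normal(0, 1, size=x.size)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_random = np.corrcoef(x, y_random)[0, 1]

print(f"Strong linear relationship: r = {r_linear:.2f}")
print(f"No relationship:            r = {r_random:.2f}")
```

A value of r near 1 or -1 suggests a line will fit well, while a value near 0 suggests that a straight line explains little of the variation.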
Before running a complex algorithm, always visualize your data to see if a linear relationship actually exists. As discussed in our Exploratory Data Analysis Fundamentals lesson, a scatter plot is your best tool here.
If your scatter plot shows a cloud of points that seem to drift upward or downward in a consistent path, linear regression is a great starting point. If the relationship looks like a curve, a straight line will likely underperform.
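As a numeric companion to the scatter plot, the following sketch fits a straight line to two synthetic datasets, one linear and one U-shaped, and reports how much of the variation the line explains ($R^2$). The helper `r_squared_of_line` is a hypothetical name, and `np.polyfit` is used here only as a convenient least-squares line fitter.

```python
import numpy as np


def r_squared_of_line(x, y):
    """Fit a straight line by least squares and return its R^2."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)        # variation the line misses
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
    return 1 - ss_res / ss_tot


x = np.linspace(-3, 3, 61)
y_straight = 2 * x + 5   # drifts upward along a consistent path
y_curved = x ** 2        # U-shaped curve

r2_straight = r_squared_of_line(x, y_straight)
r2_curved = r_squared_of_line(x, y_curved)

print(f"R^2 on linear data: {r2_straight:.3f}")
print(f"R^2 on curved data: {r2_curved:.3f}")
```

The curved data has a perfectly predictable pattern, yet the straight line captures almost none of it, which is why looking at the shape of the data first matters.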
Let's look at how this works in practice using scikit-learn. We will simulate a simple relationship where $y = 2x + 5$.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# 1. Prepare data
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # Features must be 2D
y = np.array([7, 9, 11, 13, 15])  # Target

# 2. Instantiate and train the model
model = LinearRegression()
model.fit(x, y)

# 3. Visualize the result
plt.scatter(x, y, color='blue')
plt.plot(x, model.predict(x), color='red')
plt.title("Line of Best Fit")
plt.show()

print(f"Learned Slope: {model.coef_[0]}")
print(f"Learned Intercept: {model.intercept_}")
```
In this code, model.fit performs the mathematical heavy lifting to find the optimal $w_1$ (slope) and $w_0$ (intercept).
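Once the weights are learned, predicting is just the model equation applied to a new input. The sketch below refits the same toy data and checks that `model.predict` agrees with computing $w_0 + w_1x$ by hand; the new input value of 6 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as above, generated from y = 2x + 5
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([7, 9, 11, 13, 15])

model = LinearRegression().fit(x, y)

w1 = model.coef_[0]      # learned slope
w0 = model.intercept_    # learned intercept

x_new = np.array([[6]])  # new inputs must also be 2D
by_model = model.predict(x_new)[0]
by_hand = w0 + w1 * 6

print(f"model.predict: {by_model:.2f}")
print(f"w0 + w1 * x:   {by_hand:.2f}")
```

Because the toy data lies exactly on a line, both values should come out at approximately 17, up to floating-point rounding.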
Using the dataset you cleaned in Project Dataset Initialization, perform the following steps:
We've covered the mechanics of the linear model: finding the line of best fit by adjusting the slope and intercept to minimize error. By identifying the relationship between your independent and dependent variables, you can predict continuous values effectively. Remember, visualization is your first line of defense against poor model performance.
Up next: We will dive into the mechanics of classification, where we shift from predicting continuous numbers to predicting discrete categories.