In this complete tutorial, we'll introduce the linear regression algorithm in machine learning and walk through its step-by-step implementation in Python with examples.
Linear regression is one of the most fundamental and widely applied algorithms in machine learning, and Python is one of the most in-demand skills for data scientists. Together, these make learning linear regression in Python an essential skill.
Following this linear regression tutorial, you’ll learn:
- What linear regression is in machine learning.
- The linear regression equation and how the line of best fit is estimated.
- How to fit simple and multiple linear regression (including polynomial regression) in Python (Scikit-Learn).
- Other useful tips when applying linear regression.
Ready to learn machine learning algorithms? Linear regression is the best start.
Let’s jump in!
If you are not familiar with machine learning algorithms, please take a look at Machine Learning for Beginners: Overview of Algorithm Types.
If you are new to Python, please take our FREE Python crash course for data science to get a good foundation.
What is Linear Regression in Machine Learning?
In regression tasks, we have a labeled training dataset of input variables (X) and a numerical output variable (y).
When applying linear regression, we want to find the best fit linear relationship between X and y. Then, for a new observation, we can use its input variables and the linear function f to predict its output value.
y = f(X)
When we have only one input variable, it is a simple linear regression. When there is more than one input variable, it is multiple linear regression.
Linear regression is the best algorithm to start with when learning machine learning, because it is:
- relatively easy to learn.
- widely used in many different industries as well as academia.
- easy to interpret to gain insights, especially for prediction tasks.
- a popular model to be tested in interview questions.
Next, let’s look at the details of the linear function.
Linear Regression Equation
Let's start from the straightforward simple linear equation.
Simple Linear Regression
Assume we have the input variable x1 and output variable y. A linear equation describing the relationship between x1 and y is below:
y = w0 + w1*x1
w0 and w1 are the two coefficients, where w0 is the intercept (of the y-axis), and w1 is the slope of the line. w1 shows the impact of the independent variable x1 on y. For example, when w1 = 0, there’s no impact of x1 on y since (0*x1 = 0).
In simple terms, linear regression is an algorithm that finds the best values of w0 and w1 to fit the training dataset.
The graph below visualizes an example. The yellow dots represent the x1 and y (observations from the training dataset), while the line is the best fitted linear relationship based on this training dataset.
So what are the criteria of the best fit line?
How to get the Line of Best Fit: cost function?
Prediction is the most common application of linear regression. We want this line to produce minimal errors when making predictions.
So, the goal is to minimize the difference between the predicted output and the actual output values.
The most used estimation method is (Ordinary) Least Squares (OLS), which targets minimizing the sum of squared residuals/errors. To achieve this, we use the cost function called Mean Squared Error (MSE), which is the average of the squared residuals.
Assume we have n observations with actual values Y1, Y2, …, Yn and predicted values Ŷ1, Ŷ2, …, Ŷn. The MSE formula is:
MSE = (1/n) * Σ (Yi − Ŷi)^2, summing over i = 1, …, n
As you can see, the smaller the MSE, the better the predictor fits the data. For linear equations, it’s possible to find w0 and w1 that result in the lowest possible MSE.
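To make the cost concrete, here is a minimal sketch computing the MSE both by hand and with scikit-learn's mean_squared_error (the sample numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual output values
y_pred = np.array([2.8, 5.1, 7.3, 8.7])   # predicted output values

# MSE by hand: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# MSE via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both are 0.0575 for these numbers
```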
The optimal coefficients can also be calculated with a closed-form solution, which involves linear algebra, but we don’t have to worry about it when practicing ML.
There are also other cost functions, but MSE is the most widely used, and you'll encounter it often. In fact, across different machine learning models, as long as the prediction target is a real number (-inf to inf), MSE is the go-to choice.
Another popularly mentioned technique is Gradient Descent. Starting from some random values for each coefficient, the method changes these values by iteratively reducing the cost function. The procedure continues until the minimum possible cost function is reached.
But gradient descent is mainly useful when the dataset is too large for the closed-form solution or when training happens in batches, so it's not commonly needed for plain linear regression.
Multiple Linear Regression
When there is more than one input variable, say x1, x2, …, xm, the multiple linear regression formula is:
y = w0 + w1*x1 + w2*x2 + … + wm*xm
The concepts we discussed for simple linear regression also apply to multiple linear regression. Instead of fitting a line in two dimensions, we fit a hyperplane in higher dimensions.
It is harder to visualize the equation in a graph since it involves more dimensions.
Linear regression was first developed in the field of statistics and has been well studied for years, so there are many different names used to describe its variables.
Note that the input variables are also called:
- explanatory variables
- independent variables
while the output variable can also be called:
- dependent variable.
In this linear regression tutorial, we use these names interchangeably.
Now with the basic concepts, let’s see how linear regression works in Python.
Simple Linear Regression in Python
Again, if you are new to Python, please take our FREE Python crash course before this linear regression tutorial in Python.
First, we'll show detailed steps of fitting a simple linear regression model. Then we'll move on to multiple linear regression.
Step #1: Import Python packages
First of all, we need to import some packages that are necessary for linear regression:
- NumPy – fundamental package for scientific computing; we'll use it to create the example dataset.
- pandas – a powerful tool for data analysis and manipulation.
- scikit-learn (sklearn) – tools for predictive data analysis, including linear regression.
- Matplotlib – a plotting library for visualization.
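The imports above can be sketched as follows:

```python
# Core packages for this tutorial
import numpy as np                                 # numerical computing / example data
import pandas as pd                                # data manipulation
import matplotlib.pyplot as plt                    # plotting
from sklearn.linear_model import LinearRegression  # the regression model
```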
Step #2: Generate Random Training Dataset
Since we want to present linear regression simply, we’ll generate a random sample dataset as the training set.
We first generate a random dataset of size 50 with x1 as the single input variable, and set y (the output) to have a roughly linear relationship with x1. A random noise term is added to make the dataset more realistic.
Now we have a simple linear regression problem.
We can see that if we visualize x1 and y using a scatterplot, it is a roughly linear relationship.
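The data-generation and scatterplot steps above can be sketched as follows; the seed, x1 range, and noise scale are assumptions for illustration, not values from the original tutorial:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # fixed seed for reproducibility (an assumption)

n = 50
x1 = rng.uniform(0, 10, n)       # single input variable
noise = rng.normal(0, 2, n)      # random noise
y = 10 + 2 * x1 + noise          # roughly linear relationship

# visualize the training data
plt.scatter(x1, y, color="gold")
plt.xlabel("x1")
plt.ylabel("y")
plt.show()
```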
Step #3: Create and Fit Linear Regression Models
Now let's use the linear regression algorithm from the scikit-learn package to create a model. The Ordinary Least Squares method is used by default.
- x1 is reshaped from a 1-D NumPy array to a matrix (2-D array), which is required by the sklearn package.
In reshape(-1, 1), -1 tells NumPy to infer the number of rows from the original x1, while 1 specifies a single column.
- We start the process by creating an instance of the class LinearRegression, lr.
- There are several input parameters for LinearRegression, but we'll leave them at their defaults.
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
(Note: the normalize parameter shown here has been removed in recent scikit-learn versions.)
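The reshape-and-fit steps can be sketched as follows (the dataset is regenerated here so the snippet runs on its own; the seed and noise scale are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, 50)
y = 10 + 2 * x1 + rng.normal(0, 2, 50)   # assumed true relationship

# scikit-learn expects a 2-D feature matrix, so reshape the 1-D array:
# -1 infers the number of rows, 1 means a single column
X = x1.reshape(-1, 1)

lr = LinearRegression()  # Ordinary Least Squares under the hood
lr.fit(X, y)
```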
Step #4: Check the Result Model: coefficients and plot
After fitting the model, we can call its attributes to look at the results.
We can print out the coefficients.
As we can see, .intercept_ returns a scalar, while .coef_ returns an array. The values of these coefficients are close to the true values (y = 10 + 2*x1).
We can also visualize the regression line together with the training dataset.
The line learned by linear regression turns out to be a pretty good fit.
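A sketch of checking the coefficients and plotting the fitted line (the dataset is regenerated with assumed values so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, 50)
y = 10 + 2 * x1 + rng.normal(0, 2, 50)   # assumed true relationship
X = x1.reshape(-1, 1)
lr = LinearRegression().fit(X, y)

print(lr.intercept_)  # scalar: estimate of w0, close to 10
print(lr.coef_)       # array: estimate of w1, close to 2

# plot the training data together with the regression line
order = np.argsort(x1)
plt.scatter(x1, y, color="gold", label="training data")
plt.plot(x1[order], lr.predict(X)[order], color="steelblue", label="fitted line")
plt.legend()
plt.show()
```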
Step #5: Make Predictions with Linear Regression!
This is the most exciting part. Let’s predict using our new model.
Assume we have 4 new input values of x1 (0, 1, 2, 3). We can either:
- predict by plugging in the values into the equation manually
- or use the predict method in scikit-learn.
Both methods return the same predicted values.
array([10.06295511, 11.85833473, 13.65371434, 15.44909395])
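The two approaches can be sketched as follows; the data generation uses assumed values, so the numbers will differ from the output shown above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, 50)
y = 10 + 2 * x1 + rng.normal(0, 2, 50)   # assumed true relationship
lr = LinearRegression().fit(x1.reshape(-1, 1), y)

x_new = np.array([0, 1, 2, 3]).reshape(-1, 1)

# method 1: plug the values into the fitted equation manually
manual = lr.intercept_ + lr.coef_[0] * x_new.ravel()

# method 2: use the predict method
preds = lr.predict(x_new)

print(np.allclose(manual, preds))  # True: both methods agree
```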
This is great! It's time to move on to multiple input variables.
Multiple Linear Regression in Python
Now let’s try an example with multiple features x1, x2, x3.
The steps are not outlined here, but they follow the same procedure as the simple linear regression section. Please check the previous section for a detailed explanation of the Python code.
Let’s also use a pandas DataFrame so we can see the features together. Below is the DataFrame with the output y, and input variables x1, x2, x3.
Then we can use the same procedure to fit the multiple linear regression.
10.611687306000336 [ 2.12810817 3.04993324 -5.0536238 ]
Again, the linear regression algorithm finds good estimates of the coefficients that are close to the original equation (y = 10 + 2*x1 + 3*x2 - 5*x3).
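The full procedure can be sketched as follows, generating data from the equation above (the sample size, noise scale, and seed are assumptions), so the printed numbers will differ slightly from the output shown:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, n),
    "x2": rng.uniform(0, 10, n),
    "x3": rng.uniform(0, 10, n),
})
# assumed true relationship plus noise
df["y"] = 10 + 2 * df["x1"] + 3 * df["x2"] - 5 * df["x3"] + rng.normal(0, 2, n)

lr = LinearRegression()
lr.fit(df[["x1", "x2", "x3"]], df["y"])

print(lr.intercept_, lr.coef_)  # estimates near 10 and [2, 3, -5]
```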
We are not showing any plots for this example since we can’t visualize a 4D chart.
Polynomial Regression (Overfit/Underfit) in Python
Finally, let’s also try to fit polynomial regression, a special case of multiple linear regression. It models the relationship between y and an nth degree polynomial in x1.
We’ll also look at overfitting and underfitting, the two common problems when fitting models.
Again, the detailed steps are the same as in the simple linear regression section, so we won't lay out the explanation here.
Before we start, let's create a sample dataset with input variable x1, its square x1^2, and output variable y.
First, let’s try the same method to fit a linear regression model.
We can see that this linear regression doesn’t fit the polynomial well. Our linear equation doesn’t have the x1^2, which is part of the data’s fundamental structure. The regression line won’t accurately predict any new observations. This is known as underfitting.
Let’s improve it by adding another feature x1^2.
We combine x1 with x1^2 as the matrix features.
Then we can fit the linear model with these features.
The coefficients are quite close to the original equation (y = 10 + 2*x1 + x1**2) now.
10.580989730588694 [1.86539339 0.97235238]
The line also fits much better.
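The quadratic fit described above can be sketched as follows; the data-generation details (range, noise scale, seed) are assumptions, so the numbers will differ slightly from the output shown:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.uniform(-5, 5, 50)
y = 10 + 2 * x1 + x1 ** 2 + rng.normal(0, 2, 50)   # quadratic relationship

# stack x1 and x1^2 side by side as the feature matrix
X = np.column_stack([x1, x1 ** 2])

lr = LinearRegression().fit(X, y)
print(lr.intercept_, lr.coef_)  # estimates near 10 and [2, 1]
```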
To make it more complicated, let’s look at another example with an x1^3 term.
As before, we generate a random dataset with the following linear relationship.
We combine the terms x1, x1^2, x1^3 together.
Then we fit the linear model again.
Again, the coefficient values are similar to the original equation (y = 10 + 2*x1 + x1**2 - 0.1*x1**3).
10.250585811682729 [ 1.620373 1.00304899 -0.09791949]
Linear regression models can fit complicated curves!
In our examples, we know the terms to include in the models since we generated the samples. In reality, it won’t be easy to decide on the features.
Let’s see another example.
We generate a random sample of size 10 from a simple linear relationship between x1 and y.
But what if we fit more variables to it?
Let’s put all the fancy terms x1^2, x1^3, … , x1^8.
13.482967579012808 [ 8.60273387e+00 7.19785753e-01 -9.59145942e-01 -1.54303821e-01 3.02473147e-02 4.24998891e-03 -2.31913879e-04 -2.96290114e-05]
It looks like a perfect fit!
The line goes through every single data point.
This is called overfitting (or overtraining): too many features are included, and the model becomes overly tailored to one particular dataset. Real-life datasets naturally contain noise that can't be fit perfectly, so an overfitted model captures this noise and fails to make good predictions on new data.
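A sketch illustrating overfitting with the degree-8 features; the dataset details are assumptions. Adding the higher-order terms always drives the training error down, even though the true relationship is linear:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x1 = rng.uniform(-3, 3, 10)               # only 10 observations
y = 10 + 2 * x1 + rng.normal(0, 1, 10)    # true relationship is linear

# degree-8 polynomial features: x1, x1^2, ..., x1^8
X_poly = np.column_stack([x1 ** d for d in range(1, 9)])
X_lin = x1.reshape(-1, 1)

mse_lin = mean_squared_error(y, LinearRegression().fit(X_lin, y).predict(X_lin))
mse_poly = mean_squared_error(y, LinearRegression().fit(X_poly, y).predict(X_poly))

# the degree-8 model's training error is far smaller --
# it is fitting the noise, not just the underlying line
print(mse_lin, mse_poly)
```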
In the next section, you’ll see a method to avoid overfitting/underfitting.
More for Linear Regression Machine Learning
The examples we've shown are simple since we generated them ourselves. In practice, there are a few other things to consider:
There are many possible choices of input variables, and we must be careful to avoid overfitting or underfitting the training data.
A good idea is to use model validation techniques such as cross-validation to test the model. The procedure involves dividing the dataset into training, validation, and test sets.
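A minimal cross-validation sketch, assuming a linear true relationship: the degree-8 polynomial model achieves a great training fit, but cross-validation exposes its poor generalization compared with the simple linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x1 = rng.uniform(-3, 3, 30)
y = 10 + 2 * x1 + rng.normal(0, 1, 30)   # true relationship is linear

results = {}
for degree in (1, 8):
    X = np.column_stack([x1 ** d for d in range(1, degree + 1)])
    # 5-fold cross-validation, scored by negative MSE
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=5, scoring="neg_mean_squared_error")
    results[degree] = -scores.mean()
    print(degree, results[degree])  # the degree-8 model has much worse CV error
```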
Training Dataset Range
It’s also risky to extrapolate outside the range of the input variables.
For example, if the input variable x1 in the training dataset ranges from 10 to 100, it’s not good to predict for x1 = -50 or 1000. Since these values are outside the range of the training dataset, there’s no evidence that they behave the same way as the training data.
Linear Regression Assumptions
Linear regression models do rely on certain assumptions (such as linearity, independence of errors, and constant error variance). Yet many of these assumptions can be relaxed through modifications.
Please take a look at the Wikipedia section for more explanation.
That’s it! In this tutorial, you’ve learned a lot about linear regression, a critical machine learning algorithm.
Hope you are ready to fit your linear regression models in Python!
Leave a comment if you have any questions or anything else to share.