In this tutorial, you’ll learn what **gradient boosting** is in machine learning and how to regularize it.

Gradient boosting is a popular machine learning predictive modeling technique and has shown success in many practical applications. Its main idea is to ensemble weak predictive models by “boosting” them into a stronger model. We can apply this algorithm to both supervised regression and classification problems.

By following this tutorial, you’ll learn the following:

- What is gradient boosting: intro to basic concepts
- How to improve gradient boosting with regularization

Before you build gradient boosting models, this tutorial will help you understand their fundamentals so that you can apply them more effectively.

Let’s get started!

## What is gradient boosting in machine learning?

Here is the basic definition.

Gradient boosting is a machine learning algorithm that sequentially ensembles weak predictive models into a single predictive model. Usually, the combined model is much stronger than each model in the ensemble.

So how do we ensemble the models?

Before diving into gradient boosting, let’s look at ensembling in machine learning.

### What is ensemble learning?

Ensembling is a technique to build a predictive model by combining the strengths of multiple base models. The goal of the method is to produce better predictions than any of the individual models alone.

So the ensembling process involves developing a group of base models, and putting them together in specific ways. Usually, we can either:

- build the base models independently and then **average** their results to get the final prediction.

  In this case, the ensembled predictor often has better predictions because of reduced variance. So the base models often have low bias and high variance (e.g., an overgrown decision tree).

- or build the base models **sequentially** to improve the previous ones’ results.

  In this case, the ensembled predictor usually produces better predictions because of less bias. So the base models are often high bias and low variance (e.g., a one-split decision tree).

Gradient boosting uses the *sequential* ensembling method mentioned above or, more specifically, the **boosting** method.

**Further learning**: what is the main difference between gradient boosting (base model as decision trees) vs. random forest?

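Another popular machine learning algorithm, random forest, uses the averaging ensemble method described above, which is known as bagging: its trees are built independently and their results are averaged. Gradient boosting, in contrast, builds its trees sequentially so that each one improves on the previous ones. You can learn more about the random forest algorithm here.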
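To make the variance-reduction claim for averaged ensembles concrete, here is a toy simulation. It is only a sketch under a strong assumption: each base model is replaced by a hypothetical stand-in, `noisy_base_model`, whose prediction is the true value plus independent random noise. Real base models are correlated, so the reduction is usually smaller in practice.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

TRUE_VALUE = 5.0


def noisy_base_model():
    """Hypothetical base model: unbiased, but with high variance."""
    return TRUE_VALUE + rng.gauss(0, 1)


def averaged_ensemble(n_models):
    """Average the predictions of independently built base models."""
    return statistics.mean(noisy_base_model() for _ in range(n_models))


# Repeat each experiment many times to estimate the prediction variance.
single = [noisy_base_model() for _ in range(2000)]
averaged = [averaged_ensemble(25) for _ in range(2000)]

print("variance of a single model:    ", statistics.variance(single))
print("variance of a 25-model average:", statistics.variance(averaged))
```

Both predictors are centered on the same value, but the averaged one varies much less from run to run, which is the effect bagging relies on.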

Next, let’s dig into gradient *boosting*.

### What is gradient boosting?

As mentioned above, gradient boosting uses the boosting ensemble method to combine weak learners into a stronger model. More specifically, the purpose of gradient boosting is to minimize a loss function by sequentially building base models (weak learners) using a gradient descent type of algorithm.

So the three critical components of gradient boosting in machine learning are:

- weak learners
- loss function
- algorithm of additive training plus gradient descent minimization

We’ll take a closer look at each of them.

#### Weak learners

We can use different types of weak models as weak learners for gradient boosting. The most popular ones are decision trees, mainly due to the following reasons:

- trees can be very fast to construct
- trees can learn non-linear relationships
- trees are resilient to outliers in the inputs and to irrelevant features

Within gradient boosting, we usually restrict the trees to be simple (for example, by limiting their depth) so that each one remains weak individually. A weak learner may perform only slightly better than random guessing. For example, we could use only decision stumps, i.e., decision trees with a single split.
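A decision stump is simple enough to write by hand. The sketch below is a minimal illustration for a regression problem with a single numeric feature; the helper names `fit_stump` and `predict_stump` and the toy data are made up for this example and are not from any library.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single numeric feature.

    Tries every midpoint between neighboring feature values and keeps the
    split with the smallest sum of squared errors.
    Returns (threshold, left_value, right_value).
    """
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values to split")
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_value = sum(left) / len(left)      # each leaf predicts its mean
        right_value = sum(right) / len(right)
        sse = (sum((y - left_value) ** 2 for y in left)
               + sum((y - right_value) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    """Predict with a fitted stump: one comparison, two possible outputs."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


# Toy data with an obvious jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 0.9, 1.0, 3.0, 3.2, 2.8]
stump = fit_stump(xs, ys)
print("threshold, left value, right value:", stump)
```

The stump can only output two different values, so on its own it is a very rough model. That roughness is what makes it a weak learner.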

#### Loss function

The objective of training most supervised learning models is to minimize a loss function. For example, in regression problems, we often try to minimize the mean squared error; for binary classification problems, we often use cross-entropy.

Within the gradient boosting technique, we can use any loss function as long as it is differentiable. This flexibility makes gradient boosting highly customizable.
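As a small worked example (an illustration, not the only choice), take the squared error loss with a conventional factor of one half, writing \(y\) for the observed target and \(\hat{y}\) for the model’s prediction:

\(L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2\)

Differentiating with respect to the prediction gives:

\(\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y})\)

So the negative gradient, \(y - \hat{y}\), is exactly the residual. For this particular loss, moving the predictions in the gradient descent direction amounts to fitting the next learner to the residuals of the current model.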

#### Algorithm

During the training process, gradient boosting uses additive (sequential) expansions and gradient descent minimization. So starting with one weak learner, the algorithm adds new weak learners one by one, each trained based on the predictions of the previous learners and in the direction of the negative gradient of the loss function.

As you may recall, gradient descent is a generic algorithm to find a local minimum of a differentiable function. It takes repeated steps in the direction opposite to the gradient of the function, which is the direction of steepest decrease. So it is a natural fit for minimizing a differentiable loss function in the gradient boosting algorithm.
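For readers who want a refresher, here is a minimal sketch of plain gradient descent on a one-dimensional function. The function \(g(x) = (x - 3)^2\), the step size, and the number of steps are arbitrary choices for illustration.

```python
def gradient_descent(grad, x0, step_size=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * grad(x)  # move in the direction of steepest decrease
    return x


# g(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print("estimated minimizer:", minimum)
```

Gradient boosting applies the same idea, except that each step is taken by adding a weak learner to the model rather than by adjusting a number directly.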

Now that you’ve learned the basic concepts of gradient boosting, we’ll look at more details.

### How does gradient boosting work?

Let’s see how gradient boosting works with an example.

Assuming we have a supervised learning problem, we’ll define it in mathematical terms:

- \(\textbf{x}_i \): the input features for instance i
- \(y_i \): the observed target for instance i
- \(\hat{y}_{i} \): the corresponding gradient boosting prediction for instance i

The loss function of the entire dataset can be written as the sum of each instance’s loss (\(i=1, …, n\)):

\(Loss = \sum_{i=1}^{n}L(y_i, \hat{y}_{i}) \)

And our goal is to make predictions that minimize this loss function.

How do we do that?

Recall that gradient boosting is a sequential ensemble model. During the process, we add weak learners on top of each other to boost performance. The detailed procedure is shown below.

We start from the first weak learner, \(f_{1}(\textbf{x}_i)\), which tries to minimize the loss function. Then during each later step \(t = 2, …, K \) of gradient boosting, we find the next weak learner \(f_{t}(\textbf{x}_i)\) that tries to “fix” the residuals left between \(y_i \) and the combined prediction of the previous learners.

\(\hat{y}_{i}^{(1)} = f_{1}(\textbf{x}_i)\)

\(\hat{y}_{i}^{(2)} = \hat{y}_{i}^{(1)} + f_{2}(\textbf{x}_i)\)

\(… \)

\(\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t-1)} + f_{t}(\textbf{x}_i)\)

\(… \)

\(\hat{y}_{i}^{(K)} = \hat{y}_{i}^{(K-1)} + f_{K}(\textbf{x}_i)\)

In the end, \(\hat{y}_{i} \) is the sum of \(K\) weak learners:

\(\hat{y}_{i} = \sum_{k=1}^{K}f_{k}(\textbf{x}_i)\)

This might be hard to understand. You can also think of it this way: each additional weak learner attempts to correct the biggest mistakes of the previous learners in the ensemble. Ideally, after adding enough weak learners, the errors of the model on the training set will be sufficiently small.
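The additive procedure above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not a full implementation: it uses the squared error loss (so each new learner is fit to the current residuals), a single numeric feature, and hand-written decision stumps. The helper names (`fit_stump`, `boost`, and so on) and the toy data are invented for this example.

```python
def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) of the best one-split tree."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_learners):
    """Additive training: every new stump is fit to the current residuals."""
    learners = []
    predictions = [0.0] * len(xs)  # before f_1 is added, the model predicts 0
    for _ in range(n_learners):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        # y_hat^(t) = y_hat^(t-1) + f_t(x)
        predictions = [p + predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return learners


def predict(learners, x):
    """Final prediction: the sum of all weak learners."""
    return sum(predict_stump(stump, x) for stump in learners)


def training_mse(learners, xs, ys):
    return sum((y - predict(learners, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: a curved relationship that no single stump can capture.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]

errors = {k: training_mse(boost(xs, ys, k), xs, ys) for k in (1, 5, 20)}
for k, error in errors.items():
    print(f"{k:>2} stumps: training MSE = {error:.3f}")
```

Because the model starts at zero, the first stump is fit to the targets themselves, which matches \(\hat{y}_{i}^{(1)} = f_{1}(\textbf{x}_i)\). The training error shrinks as more stumps are added, even though each stump alone is a poor model.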

Now you have the main idea of the gradient-boosting algorithm! The detailed theory is more complicated. But feel free to explore more as you need.

Next, let’s look at common regularization techniques for gradient boosting. They are critical for improving the performance of complicated gradient-boosting models.

## How to regularize gradient boosting?

Compared to a single predictive model (such as a linear regression or a single decision tree), gradient boosting ensembles many weak learners sequentially. This makes gradient boosting more prone to overfitting the training set. To solve this problem, it is critical to simplify the gradient boosting model with some common regularization techniques.

**Further learning**: one of the most popular gradient boosting implementations is XGBoost (eXtreme Gradient Boosting). One of its critical optimizations is additional regularization, which can substantially improve its performance over traditional gradient boosting. We’ve covered XGBoost in a separate tutorial with an example in Python here.

If you just began learning machine learning, the idea of regularization might also be new. So let’s look at its basic definition first.

### What is regularization?

Regularization is a general process to make the resulting answer “simpler”. When it comes to machine learning, regularization helps prevent overfitting. When we find that the error is much higher for a held-out validation set than for the training set, the model may have been fit too closely to the training set. This problem is more common when we fit a sophisticated model such as gradient boosting. When this happens, the model generalizes poorly to new data. So regularization techniques are commonly incorporated in gradient boosting to constrain the fitting procedure.

There are different regularization tools. Each has different parameters that we can tune to optimize models. Let’s look at three common ones.

### Regularization #1: loss function regularization

One primary regularization method is adding a penalty term to the loss function. This imposes an additional condition on the model. In mathematical terms, our goal becomes minimizing the function below:

\(Loss + penalty = \sum_{i=1}^{n}L(y_i, \hat{y}_{i}) + \sum_{k=1}^{K}\Omega(f_k)\)

The penalty term \(\Omega(f) \) can be various functions of \(f\). Usually, it is designed so that the more complicated the model \(f\), the larger the term, and the more penalization is added to the loss function. This helps us to get a model that’s both simple and predictive.
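One concrete choice, shown here only as an illustration, is the penalty that XGBoost uses for tree learners. Here \(T\) is the number of leaves in the tree \(f\), \(w_j\) is the value predicted by leaf \(j\), and \(\gamma\) and \(\lambda\) are non-negative tuning parameters:

\(\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2\)

A tree with more leaves or larger leaf values pays a larger penalty, so it must reduce the loss by more to be worth adding.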

### Regularization #2: learning rate shrinkage

Within each step of the gradient boosting process, we optimize the loss function by adding a new weak learner. Instead of taking a full step, we can slow down the learning process by applying a learning rate to the new learner.

So, for example, mathematically, the additive model at step \(t\) could become:

\(\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t-1)} + \nu f_{t}(\textbf{x}_i)\)

where \(\nu \) is the learning rate, a constant between 0 and 1. So we are shrinking the new weak learner’s impact proportionally for regularization.
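To see the effect of \(\nu\) in isolation, the sketch below boosts with the simplest possible weak learner, a constant equal to the mean residual, under the squared error loss. This stand-in learner and the function name `boost_constant` are assumptions made for illustration; with trees the update rule is the same, only the learner is richer.

```python
def boost_constant(ys, learning_rate, n_learners):
    """Boost with a constant weak learner and return the prediction after each step."""
    prediction = 0.0
    history = []
    for _ in range(n_learners):
        residuals = [y - prediction for y in ys]
        f_t = sum(residuals) / len(residuals)  # weak learner: the mean residual
        prediction += learning_rate * f_t      # shrunken update: nu * f_t
        history.append(prediction)
    return history


ys = [2.0, 4.0, 6.0]  # the best constant prediction is the mean, 4.0

full_steps = boost_constant(ys, learning_rate=1.0, n_learners=5)
small_steps = boost_constant(ys, learning_rate=0.1, n_learners=5)

print("nu = 1.0:", [round(p, 3) for p in full_steps])
print("nu = 0.1:", [round(p, 3) for p in small_steps])
```

With a full step the model jumps to its best fit immediately, while with a small learning rate it approaches the same fit gradually. This is why a smaller learning rate usually needs more weak learners, and the two parameters are typically tuned together.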

### Regularization #3: stochastic gradient boosting

Introducing randomness to the gradient boosting procedure can also help simplify and improve the process. Within this approach, at each learning iteration, we take a random subsample (usually without replacement) of the training set and use it to fit the next weak learner. It is crucial to choose a good subsample size as a fraction of the training set. This process is called stochastic gradient boosting.
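The subsampling step itself is small. The sketch below shows only how the rows for each iteration might be drawn, without replacement; fitting the weak learner on those rows is left out. The function name `subsample_indices` and the 50% fraction are illustrative choices, not values taken from any library.

```python
import random


def subsample_indices(n_rows, fraction, rng):
    """Draw the row indices (without replacement) used in one boosting iteration."""
    size = max(1, int(n_rows * fraction))
    return rng.sample(range(n_rows), size)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 10

# A fresh subsample is drawn at every boosting iteration.
draws = [subsample_indices(n_rows, 0.5, rng) for _ in range(3)]
for t, rows in enumerate(draws, start=1):
    print(f"iteration {t}: fit the next weak learner on rows {sorted(rows)}")
```

Each weak learner sees a different part of the training set, which makes the learners less alike and reduces the chance that the ensemble memorizes individual training instances.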

Other more straightforward regularization approaches for gradient boosting include limiting the number of weak learners and limiting the complexity of each weak learner (for example, the depth of each tree).

And that’s it!

In this tutorial, you’ve learned what gradient boosting is in machine learning and how to regularize it.

Please leave a comment with any questions you may have or anything else.