In this tutorial, you’ll learn XGBoost and how to implement it in Python, with an example.
XGBoost is a popular framework due to its proven success in machine learning competitions. It is based on the gradient boosting algorithm but with many practical improvements. We can easily apply XGBoost for supervised learning problems to make predictions.
By following this tutorial, you’ll learn:
- What is XGBoost (vs. gradient boosting)
- How to build an XGBoost model (Classifier) in Python, step-by-step
- And more!
If you are looking to apply XGBoost for your prediction task, this tutorial will get you started.
Let’s dive in!
What is XGBoost in Python?
XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm. It became well known because of its outstanding accuracy and efficiency compared to other machine learning algorithms in competitions.
Let’s start from its basics.
What is gradient boosting?
Gradient boosting is a machine learning algorithm that sequentially ensembles weak predictive models into a single stronger predictive model. We can apply it to both supervised regression and classification problems.
Further learning: we’ve covered gradient boosting’s fundamentals in another tutorial. Please check out What is gradient boosting in machine learning: fundamentals explained.
Decision trees are the most common choice of weak model in gradient boosting ensembles. Tree-based gradient boosting methods perform well and are easy to use. Despite these advantages, they tend to overfit and can demand a lot of computing power.
XGBoost, also a tree-based gradient boosting implementation, overcomes some disadvantages with its optimizations. Let’s look at some key improvements.
XGBoost vs. traditional gradient boosting
Now that you understand the relationship between XGBoost and gradient boosting, let’s compare XGBoost’s major optimizations over traditional tree-based gradient boosting.
XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance. You can consider XGBoost as a more regularized version of gradient tree boosting. For example, the objective function of XGBoost has a regularization term added to the loss function. We’ll use some of the regularization parameters in our XGBoost Python example.
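As a rough sketch of what that regularization term looks like, the objective can be written in the form used by the XGBoost documentation. Here l is the loss function, f_k is the k-th tree, T is the number of leaves in a tree, and w_j are its leaf weights. The symbols γ, λ, and α are assumed to correspond to the gamma, reg_lambda, and reg_alpha parameters that appear in the model printouts later in this tutorial.

```latex
\mathrm{obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

The first sum rewards fitting the training data, while the second penalizes trees with many leaves or large leaf weights.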
XGBoost was designed to be scalable. It has implemented practices such as memory optimization, cache optimization, and distributed computing that can handle large datasets.
So overall, XGBoost is a faster framework that can build better models.
XGBoost Python package
The XGBoost framework has an open-source Python package, xgboost. It was built to integrate easily with the popular machine learning library scikit-learn (sklearn). If you are familiar with sklearn, you’ll find it easy to use xgboost.
All right, now we are ready to build an XGBoost model in Python!
Step #1: Explore and prep data
We’ll use some bank marketing data as an example. This dataset records the telemarketing campaigns of a Portuguese bank. Based on the client information and campaign activities, we’ll predict if a client will subscribe to a term deposit (yes or no). So we are dealing with a supervised classification task.
First, let’s load the dataset.
Further learning: to follow this XGBoost in Python tutorial, you need to know Python, including basic knowledge of pandas. If you need help, please check out our course Python for Data Analysis: step-by-step with projects. This course teaches pandas, which is necessary to transform your dataset before modeling, and much more!
In reality, we’d explore the dataset before transforming it. But for simplicity, I’ll skip the process and change the dataset with the following code.
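The kind of transformation involved can be sketched on a tiny stand-in DataFrame. This is not the tutorial's original cleaning code: the raw column names (job, y) and the yes/no labels are assumptions about the raw file, and only the renamed columns (job_type, result) come from the cleaned dataset shown in this tutorial.

```python
import pandas as pd

# Tiny stand-in for the raw data; the raw column names are assumptions.
raw = pd.DataFrame({
    "age": [56, 37, 40, 25],
    "job": ["housemaid", "services", "admin.", "student"],
    "y": ["no", "no", "yes", "no"],
})

# Rename columns to more descriptive names.
df = raw.rename(columns={"job": "job_type", "y": "result"})

# Encode the target as integers: 1 for "yes", 0 for "no" (assumed mapping).
df["result"] = df["result"].map({"no": 0, "yes": 1})

df.info()
print(df["result"].value_counts())
```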
Now let’s look at our cleaned dataset.
We have 14 features and the target result, with no missing data. Note that even if there were missing values, xgboost could handle them internally.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   age                   41188 non-null  int64
 1   job_type              41188 non-null  object
 2   marital               41188 non-null  object
 3   education             41188 non-null  object
 4   default_status        41188 non-null  object
 5   housing_loan_status   41188 non-null  object
 6   personal_loan_status  41188 non-null  object
 7   contact_type          41188 non-null  object
 8   contact_month         41188 non-null  object
 9   contact_day_of_week   41188 non-null  object
 10  num_contacts          41188 non-null  int64
 11  days_last_contact     41188 non-null  int64
 12  previous_contacts     41188 non-null  int64
 13  previous_outcome      41188 non-null  object
 14  result                41188 non-null  int64
dtypes: int64(5), object(10)
memory usage: 4.7+ MB
result stores whether the customer accepted or rejected the offer from the bank (encoded as 0 or 1):
0    36548
1     4640
Name: result, dtype: int64
Next, let’s split the dataset into training and test sets as in the usual machine learning processes. We set the sampling to be stratified based on the target’s values and also a random number seed so that we get reproducible output.
Now we have the training set (80% of the original set) as X_train and y_train, and the test set as X_test and y_test.
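A stratified 80/20 split like the one described can be sketched with sklearn's train_test_split. The data here is a synthetic stand-in, and the seed value of 8 is an assumption (it mirrors the random_state seen in the model printouts, but the seed used for the split is not shown in this tutorial).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with an imbalanced binary target (10% positives).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "num_contacts": rng.integers(1, 10, size=1000),
    "result": np.repeat([1, 0], [100, 900]),
})

X = df.drop(columns="result")
y = df["result"]

# stratify=y keeps the share of positives the same in both sets;
# random_state makes the split reproducible (the value 8 is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=8
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Because the split is stratified, the share of positive cases printed for the training and test sets should match the share in the full dataset.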
Step #2: Build a pipeline of training
In this step, we’ll set up a training pipeline using the sklearn package. The sklearn pipeline sequentially applies a list of transforms and a final estimator. It conveniently assembles several steps that can be cross-validated together when training. Building a pipeline is much easier, and keeps the process more consistent, than setting everything up manually. Hence, it’s a good practice to follow.
In our example, we set up the pipeline pipe from estimators, a list of tuples in sequential order:
- an encoder of TargetEncoder: encoding like this is a standard preprocessing procedure in classification prediction problems. It transforms the categorical features into numeric ones.
- an estimator of XGBClassifier: this is the sklearn wrapper implementation of XGBoost classification. For regression problems, please use XGBRegressor.
If we print out pipe, we can see that we’ve assigned its steps as ‘encoder’ followed by ‘clf’. In the following steps, we’ll fit this pipeline on the training set, which ensures the dataset is always encoded before fitting the XGBoost classifier.
Pipeline(steps=[('encoder', TargetEncoder()), ('clf', XGBClassifier(base_score=None, booster=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, enable_categorical=False, gamma=None, gpu_id=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_delta_step=None, max_depth=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=8, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, subsample=None, tree_method=None, validate_parameters=None, verbosity=None))])
Step #3: Set up hyperparameter tuning
There’s one more step before training our XGBoost model in Python. The XGBoost model has many hyperparameters, and we should tune them to get a better-performing model.
This tutorial will use a package called scikit-optimize (skopt) for hyperparameter tuning. It is easy to use and integrates well with sklearn. Within the package, we’ll use the object called BayesSearchCV. In short, it uses Bayesian optimization: a predictive model of the search space of hyperparameter values guides the search toward a good combination of values, based on cross-validation performance. So it is an efficient yet effective approach to hyperparameter tuning.
To use BayesSearchCV, we need to define a search space of hyperparameter values. In the Python code below, within the variable search_space, we set up the ranges of the hyperparameter values to search as a dictionary:
- the keys are parameter names: in our case of a pipeline, the name of the estimator ‘clf’, followed by two underscores ‘__’, and the hyperparameter name
- the values are the type and range of the hyperparameter, defined by the classes in skopt.space (such as Real, Integer, and Categorical)
As a result, only these hyperparameters will be considered for tuning. This list is not exhaustive. You can remove or include more hyperparameters by reading their definitions in the XGBClassifier documentation, and you can change their search ranges as you like.
After that, we set up a variable opt as a BayesSearchCV object and feed it with:
- pipe: the pipeline we set up earlier
- search_space: the search space of the hyperparameters
- cv: the number of folds of cross-validation
- n_iter: the number of hyperparameter settings that are sampled
- scoring: the metric for evaluation
Note that it is necessary to use the scikit-learn pipeline when using BayesSearchCV. This ensures that, within each cross-validation split, the TargetEncoder is fitted on the training folds only and then applied to the held-out fold.
Finally, we’ve got everything set up for training!
Step #4: Train the XGBoost model
opt includes both the pipeline and the hyperparameter tuning settings. We call its fit method on the training set.
And after waiting, we have our XGBoost model trained!
Step #5: Evaluate the model and make predictions
Let’s look at the chosen pipeline/model.
Pipeline(steps=[('encoder', TargetEncoder(cols=['job_type', 'marital', 'education', 'default_status', 'housing_loan_status', 'personal_loan_status', 'contact_type', 'contact_month', 'contact_day_of_week', 'previous_outcome'])), ('clf', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.7750018497221565, colsample_bynode=0.5614437441596264, colsamp... learning_rate=0.4299244814327041, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto', random_state=8, reg_alpha=2.784887532399771, reg_lambda=1.67027558902639, scale_pos_weight=1, subsample=0.5966102807384807, tree_method='exact', validate_parameters=1, verbosity=None))])
We’ve set the scoring to roc_auc. We can read the best_score_ attribute to see the mean cross-validated AUC score on the training set. The closer this score is to 1, the better predictions the model can make. And by calling the score method on the test dataset, we get the score for the test set.
We can see that the scores on the training and test sets are close.
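To make predictions, we use the predict and predict_proba methods on the test set. The first returns the predicted class labels, and the second returns the predicted probability of each class.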
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
array([[0.9388285 , 0.06117148], [0.95764744, 0.04235254], [0.9209205 , 0.07907953], ..., [0.52436817, 0.4756318 ], [0.9597406 , 0.04025942], [0.8426394 , 0.15736063]], dtype=float32)
Step #6: Measure feature importance (optional)
We can look at the feature importance to interpret the model better. The xgboost package offers a plotting function, plot_importance, that works on the fitted model.
So first, we need to extract the fitted XGBoost model from opt. Printing the steps of the chosen pipeline gives the list below, where the fitted XGBClassifier is the second element. We can use basic Python indexing to grab it.
[('encoder', TargetEncoder(cols=['job_type', 'marital', 'education', 'default_status', 'housing_loan_status', 'personal_loan_status', 'contact_type', 'contact_month', 'contact_day_of_week', 'previous_outcome'])), ('clf', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.7750018497221565, colsample_bynode=0.5614437441596264, colsample_bytree=0.9126202065825759, enable_categorical=False, gamma=8.289497472648083, gpu_id=-1, importance_type=None, interaction_constraints='', learning_rate=0.4299244814327041, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto', random_state=8, reg_alpha=2.784887532399771, reg_lambda=1.67027558902639, scale_pos_weight=1, subsample=0.5966102807384807, tree_method='exact', validate_parameters=1, verbosity=None))]
Below, we extract the model and store it within xgboost_model. Then we apply the plot_importance function to plot feature importance based on this model. By default, importance is calculated as the number of times each feature is used to split the data across all trees.
We can see the feature importance plot below. Please investigate more if you are interested.
And that’s it!
As you can see, building XGBoost models in Python is easy.
In this tutorial, you’ve successfully built an XGBoost model and made predictions in Python!
Please leave a comment with any questions you may have or anything else.