In this tutorial, you’ll learn what XGBoost is and how to implement it in Python, with an example.
XGBoost is a popular framework due to its proven success in machine learning competitions. It is based on the gradient boosting algorithm but adds many practical improvements. We can easily apply XGBoost to supervised learning problems to make predictions.
By following this tutorial, you’ll learn:
- What is XGBoost (vs. gradient boosting)
- How to build an XGBoost model (Classifier) in Python, step-by-step
- And more!
If you are looking to apply XGBoost for your prediction task, this tutorial will get you started.
Let’s dive in!
What is XGBoost in Python?
XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm. It became well known because of its outstanding accuracy and efficiency compared to other machine learning algorithms in competitions.
Let’s start from its basics.
What is gradient boosting?
Gradient boosting is a machine learning algorithm that sequentially ensembles weak predictive models into a single stronger predictive model. We can apply it to both supervised regression and classification problems.
Further learning: we’ve covered gradient boosting’s fundamentals in another tutorial. Please check out What is gradient boosting in machine learning: fundamentals explained.
The most common choice of weak model in gradient boosting ensembles is the decision tree. Tree-based gradient boosting methods are performant and easy to use. But despite these advantages, the model tends to overfit and can demand a lot of computing power.
XGBoost, also a tree-based gradient boosting implementation, overcomes some of these disadvantages with its optimizations. Let’s look at some key improvements.
XGBoost vs. traditional gradient boosting
Now that you understand the relationship between XGBoost and gradient boosting, let’s compare XGBoost’s major optimizations over traditional tree-based gradient boosting.
More regularized
XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance. You can consider XGBoost as a more regularized version of gradient tree boosting. For example, the objective function of XGBoost has a regularization term added to the loss function. We’ll use some of the regularization parameters in our XGBoost Python example.
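For instance, here is a minimal sketch of passing regularization-related hyperparameters to the classifier. The parameter names are from the `xgboost` API; the values shown are arbitrary illustrations, not recommendations:

```python
from xgboost import XGBClassifier

# reg_alpha (L1 penalty), reg_lambda (L2 penalty), and gamma (minimum loss
# reduction required to split a node) all help control overfitting.
# The values below are arbitrary illustrations.
clf = XGBClassifier(reg_alpha=1.0, reg_lambda=1.0, gamma=2.0)
```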
More scalable
XGBoost was designed to be scalable. It has implemented practices such as memory optimization, cache optimization, and distributed computing that can handle large datasets.
So overall, XGBoost is a faster framework that can build better models.
XGBoost Python package
The XGBoost framework has an open-source Python package. This package was built with easy integration with the popular machine-learning library scikit-learn
(sklearn
). If you are familiar with sklearn
, you’ll find it easy to use xgboost
.
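If you don’t have the packages installed yet, a typical setup might look like this (assuming `pip`; `category_encoders` and `scikit-optimize` are used later in this tutorial):

```python
# Run in a shell:
#   pip install xgboost category_encoders scikit-optimize

# Then verify the installation:
import xgboost
print(xgboost.__version__)
```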
All right, now we are ready to build an XGBoost model in Python!
Step #1: Explore and prep data
We’ll use some bank marketing data as an example. This dataset records the telemarketing campaigns of a Portuguese bank. Based on the client information and campaign activities, we’ll predict if a client will subscribe to a term deposit (yes or no). So we are dealing with a supervised classification task.
First, let’s load the dataset.
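The loading code isn’t shown in this extract; here is a minimal sketch, assuming the UCI bank marketing CSV (`bank-additional-full.csv`, semicolon-separated) sits in your working directory:

```python
import pandas as pd

# File name and separator are assumptions based on the UCI bank marketing dataset
df = pd.read_csv('bank-additional-full.csv', sep=';')
```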
Further learning: to follow this XGBoost in Python tutorial, you need to know Python, including basic knowledge of `pandas`. If you need help, please check out our course Python for Data Analysis: step-by-step with projects. This course teaches `pandas`, which is necessary to transform your dataset before modeling, and much more!
In reality, we’d explore the dataset before transforming it. But for simplicity, I’ll skip that process and transform the dataset with the following code.
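The exact transformation isn’t shown in this extract; below is a plausible sketch that produces the column names and the 0/1 target summarized afterward. The renaming map and dropped columns are assumptions based on the raw UCI dataset:

```python
# Rename the raw columns to the names shown in df.info() below (assumed mapping)
df = df.rename(columns={
    'job': 'job_type',
    'default': 'default_status',
    'housing': 'housing_loan_status',
    'loan': 'personal_loan_status',
    'contact': 'contact_type',
    'month': 'contact_month',
    'day_of_week': 'contact_day_of_week',
    'campaign': 'num_contacts',
    'pdays': 'days_last_contact',
    'previous': 'previous_contacts',
    'poutcome': 'previous_outcome',
})

# Encode the target as 1/0 and drop columns not used for modeling (assumed)
df['result'] = (df['y'] == 'yes').astype(int)
df = df.drop(columns=['y', 'duration', 'emp.var.rate', 'cons.price.idx',
                      'cons.conf.idx', 'euribor3m', 'nr.employed'])
```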
Now let’s look at our cleaned dataset.
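For example, with `pandas`:

```python
df.info()
```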
We have 14 features and the target `result`, with no missing data. Note that `xgboost` can handle missing values internally, even if there are some.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   age                   41188 non-null  int64
 1   job_type              41188 non-null  object
 2   marital               41188 non-null  object
 3   education             41188 non-null  object
 4   default_status        41188 non-null  object
 5   housing_loan_status   41188 non-null  object
 6   personal_loan_status  41188 non-null  object
 7   contact_type          41188 non-null  object
 8   contact_month         41188 non-null  object
 9   contact_day_of_week   41188 non-null  object
 10  num_contacts          41188 non-null  int64
 11  days_last_contact     41188 non-null  int64
 12  previous_contacts     41188 non-null  int64
 13  previous_outcome      41188 non-null  object
 14  result                41188 non-null  int64
dtypes: int64(5), object(10)
memory usage: 4.7+ MB
```
`result` stores whether the customer accepted or rejected the offer from the bank (`1` or `0`).
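For example:

```python
df['result'].value_counts()
```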
```
0    36548
1     4640
Name: result, dtype: int64
```
Next, let’s split the dataset into training and test sets as in the usual machine learning process. We set the sampling to be stratified based on the target’s values and also set a random seed so that we get reproducible output.
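Here is a minimal sketch of the split; the seed value is an assumption (it matches the `random_state=8` seen in the model printouts later):

```python
from sklearn.model_selection import train_test_split

# Stratify on the target and fix the seed for reproducibility;
# test_size=0.2 gives the 80/20 split described below.
X = df.drop(columns=['result'])
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=8)
```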
Now we have the training set (80% of the original set) as `X_train`, `y_train`, and the test set as `X_test`, `y_test`.
Step #2: Build a training pipeline
In this step, we’ll set up a training pipeline using the `sklearn` package. An `sklearn` pipeline sequentially applies a list of transforms and a final estimator. It conveniently assembles several steps that can be cross-validated together during training. Building a pipeline is much easier than setting up the process manually, and it ensures consistency. Hence, it’s a good practice to follow.
In our example, we set up the pipeline `pipe`, passing the parameter `steps` as `estimators` (see the sketch after this list). `estimators` is a list of tuples in sequential order:
- an encoder, `TargetEncoder`: encoding like this is a standard preprocessing procedure in classification prediction problems. It will transform the categorical features into numeric ones. You can read more about `category_encoders` here.
- an estimator, `XGBClassifier`: this is the `sklearn` wrapper implementation of the XGBoost classifier. For regression problems, please use `XGBRegressor`.
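A minimal sketch of the pipeline construction, consistent with the printout below (the `random_state=8` is taken from that printout):

```python
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Encoder first, classifier second; the names 'encoder' and 'clf'
# are reused when tuning hyperparameters later.
estimators = [
    ('encoder', TargetEncoder()),
    ('clf', XGBClassifier(random_state=8)),
]
pipe = Pipeline(steps=estimators)
print(pipe)
```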
If we print out `pipe`, we can see that we’ve assigned its `steps` as '`encoder`' followed by '`clf`'. In the following steps, we’ll train on the dataset by calling this pipeline, which ensures the dataset is always encoded before fitting the XGBoost classifier.
```
Pipeline(steps=[('encoder', TargetEncoder()),
                ('clf',
                 XGBClassifier(base_score=None, booster=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None, enable_categorical=False,
                               gamma=None, gpu_id=None, importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_delta_step=None, max_depth=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, n_estimators=100,
                               n_jobs=None, num_parallel_tree=None,
                               predictor=None, random_state=8, reg_alpha=None,
                               reg_lambda=None, scale_pos_weight=None,
                               subsample=None, tree_method=None,
                               validate_parameters=None, verbosity=None))])
```
Step #3: Set up hyperparameter tuning
One more step before training our XGBoost model in Python. The XGBoost model has many hyperparameters, and we should tune them to get a better-performing model.
This tutorial will use a package called `scikit-optimize` (`skopt`) for hyperparameter tuning. It is easy to use and integrates well with `sklearn`. Within the package, we’ll use an object called `BayesSearchCV`. In short, it uses Bayesian optimization: a predictive model of the hyperparameter search space guides the search toward good value combinations based on cross-validation performance. This makes it an efficient yet effective approach to hyperparameter tuning.
To use the `BayesSearchCV` method, we need to define a search space of hyperparameter values. In the Python code below, within the variable `search_space`, we set up the ranges of hyperparameter values that will be searched as a dictionary:
- the keys are parameter names: in our case of a pipeline, the name of the estimator ('clf'), followed by two underscores ('__'), and then the hyperparameter name
- the values are the type and range of each hyperparameter, defined by the `space` module of `scikit-optimize`
As a result, only these hyperparameter values will be considered for tuning. This list of hyperparameters is not exhaustive: you can remove hyperparameters or include more by reading their definitions in the `XGBClassifier` documentation, and you can change their search ranges as you like.
After that, we set up a variable `opt` as `BayesSearchCV` and feed it the following (see the sketch after this list):

- `pipe`: the pipeline we set up earlier
- `search_space`: the search space of the hyperparameters
- `cv`: the number of folds of cross-validation
- `n_iter`: the number of hyperparameter settings that are sampled
- `scoring`: the metric for evaluation
Note that it is necessary to use the `scikit-learn` pipeline with `BayesSearchCV`. This ensures the `TargetEncoder` is fit only on the training folds during cross-validation, so the encoding doesn’t leak information from the validation folds.
Finally, we’ve got everything set up for training!
Step #4: Train the XGBoost model
`opt` includes both the pipeline and the hyperparameter tuning settings. We call its `fit` method on the training set.
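```python
opt.fit(X_train, y_train)
```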
And after waiting, we have our XGBoost model trained!
Step #5: Evaluate the model and make predictions
Let’s look at the chosen pipeline/model.
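For example, via the `best_estimator_` attribute of the fitted search object:

```python
opt.best_estimator_
```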
```
Pipeline(steps=[('encoder',
                 TargetEncoder(cols=['job_type', 'marital', 'education',
                                     'default_status', 'housing_loan_status',
                                     'personal_loan_status', 'contact_type',
                                     'contact_month', 'contact_day_of_week',
                                     'previous_outcome'])),
                ('clf',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=0.7750018497221565,
                               colsample_bynode=0.5614437441596264,
                               colsamp...
                               learning_rate=0.4299244814327041,
                               max_delta_step=0, max_depth=6,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1, predictor='auto',
                               random_state=8, reg_alpha=2.784887532399771,
                               reg_lambda=1.67027558902639, scale_pos_weight=1,
                               subsample=0.5966102807384807,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None))])
```
We’ve set the `scoring` within `opt` as `roc_auc`. We can call `best_score_` to see the best cross-validated AUC score on the training set. The closer this score is to 1, the better the model’s predictions. And by calling the `score` method on the test dataset, we get the score for the test set.
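```python
print(opt.best_score_)            # best cross-validated AUC on the training set
print(opt.score(X_test, y_test))  # AUC on the held-out test set
```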
We can see that the scores on the training and test sets are close.
```
0.7682616152239224
0.7775439731826973
```
To make predictions, we use the `predict` or `predict_proba` methods on the test set.
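```python
opt.predict(X_test)        # predicted classes (0 or 1)
opt.predict_proba(X_test)  # predicted probabilities for each class
```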
```
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
```

```
array([[0.9388285 , 0.06117148],
       [0.95764744, 0.04235254],
       [0.9209205 , 0.07907953],
       ...,
       [0.52436817, 0.4756318 ],
       [0.9597406 , 0.04025942],
       [0.8426394 , 0.15736063]], dtype=float32)
```
Step #6: Measure feature importance (optional)
You can look at feature importance to interpret the model better. The `xgboost` package offers a plotting function, `plot_importance`, based on the fitted model.
So first, we need to extract the fitted XGBoost model from `opt`.
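For example, by printing the steps of the best pipeline:

```python
opt.best_estimator_.steps
```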
As you can see, the `XGBClassifier` is printed with the above code. Now we need to use basic Python indexing techniques to grab it.
```
[('encoder',
  TargetEncoder(cols=['job_type', 'marital', 'education', 'default_status',
                      'housing_loan_status', 'personal_loan_status',
                      'contact_type', 'contact_month', 'contact_day_of_week',
                      'previous_outcome'])),
 ('clf',
  XGBClassifier(base_score=0.5, booster='gbtree',
                colsample_bylevel=0.7750018497221565,
                colsample_bynode=0.5614437441596264,
                colsample_bytree=0.9126202065825759, enable_categorical=False,
                gamma=8.289497472648083, gpu_id=-1, importance_type=None,
                interaction_constraints='', learning_rate=0.4299244814327041,
                max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
                monotone_constraints='()', n_estimators=100, n_jobs=8,
                num_parallel_tree=1, predictor='auto', random_state=8,
                reg_alpha=2.784887532399771, reg_lambda=1.67027558902639,
                scale_pos_weight=1, subsample=0.5966102807384807,
                tree_method='exact', validate_parameters=1, verbosity=None))]
```
Below, we extract the model and store it as `xgboost_model`, then apply the `plot_importance` function to plot feature importance based on that model. By default, importance is calculated as the number of times each feature appears in a tree.
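A minimal sketch of this extraction and plot (the `plt.show()` call is an addition for running outside a notebook):

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# The classifier is the second step of the pipeline: steps[1] is the
# ('clf', XGBClassifier(...)) tuple, and [1] grabs the fitted model itself.
xgboost_model = opt.best_estimator_.steps[1][1]
plot_importance(xgboost_model)
plt.show()
```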
We can see the feature importance plot below. Feel free to investigate further if you are interested.

And that’s it!
As you can see, building XGBoost models in Python is easy.
In this tutorial, you’ve successfully built an XGBoost model and made predictions in Python!
Please leave a comment if you have any questions or anything else you’d like to share.