How to build XGBoost models in Python
With a step-by-step example

Lianne & Justin


In this tutorial, you’ll learn what XGBoost is and how to implement it in Python, with a worked example.

XGBoost is a popular framework due to its proven success in machine learning competitions. It is based on the gradient boosting algorithm but with many practical improvements. We can easily apply XGBoost for supervised learning problems to make predictions.

By following this tutorial, you’ll learn:

  • What is XGBoost (vs. gradient boosting)
  • How to build an XGBoost model (Classifier) in Python, step-by-step
  • And more!

If you are looking to apply XGBoost for your prediction task, this tutorial will get you started.

Let’s dive in!



What is XGBoost in Python?

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm. It became well known because of its outstanding accuracy and efficiency compared to other machine learning algorithms in competitions.

Let’s start from its basics.

What is gradient boosting?

Gradient boosting is a machine learning algorithm that sequentially ensembles weak predictive models into a single stronger predictive model. We can apply it to both supervised regression and classification problems.
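To make the idea of sequentially combining weak models concrete, here is a minimal sketch in plain Python. It is not how XGBoost is implemented: it assumes a squared-error regression problem, a single numeric feature, and one-split decision stumps as the weak models. The function names and the toy data are made up for illustration.

```python
# Minimal gradient boosting sketch for squared-error regression.
# Weak learner: a one-split decision stump on a single numeric feature.

def fit_stump(x, residuals):
    """Pick the threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)  # start from the mean prediction
    stumps = []

    def predict(xi):
        return base + learning_rate * sum(stump(xi) for stump in stumps)

    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is proportional
        # to the residual, so each new stump is trained on the residuals.
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict


def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Toy data, made up for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

weak = gradient_boost(x, y, n_rounds=1)
strong = gradient_boost(x, y, n_rounds=20)
print(mse(weak, x, y), mse(strong, x, y))  # the error shrinks as stumps are added
```

Each stump on its own is a poor model, but every round corrects part of what the previous rounds got wrong, which is the core of the algorithm.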

Further learning: we’ve covered gradient boosting’s fundamentals in another tutorial. Please check out What is gradient boosting in machine learning: fundamentals explained.

The most common choice of weak models in gradient boosting is decision trees. Tree-based gradient boosting methods perform well and are easy to use. But despite these advantages, the resulting model tends to overfit, and training can demand a lot of computing power.

XGBoost, also a tree-based gradient boosting implementation, overcomes some disadvantages with its optimizations. Let’s look at some key improvements.

XGBoost vs. traditional gradient boosting

Now that you understand the relationship between XGBoost and gradient boosting, we’ll compare XGBoost’s major optimizations over traditional tree-based gradient boosting.

More regularized

XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance. You can consider XGBoost as a more regularized version of gradient tree boosting. For example, the objective function of XGBoost has a regularization term added to the loss function. We’ll use some of the regularization parameters in our XGBoost Python example.
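To see what the regularization term does, here is a small sketch of two formulas from the XGBoost documentation (“Introduction to Boosted Trees”): the optimal weight of a leaf and the gain of a candidate split. It assumes G and H are the sums of the first- and second-order gradients of the loss over the samples in a leaf, and it ignores the L1 term (reg_alpha). The function names and the numbers are ours, for illustration only.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf weight: a larger reg_lambda shrinks it toward zero."""
    return -G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    """Improvement in the objective from splitting a leaf into two children."""

    def score(G, H):
        return G * G / (H + reg_lambda)

    return 0.5 * (
        score(G_left, H_left)
        + score(G_right, H_right)
        - score(G_left + G_right, H_left + H_right)
    ) - gamma  # gamma is the fixed penalty for adding one more leaf


# Made-up gradient statistics for the two children of a candidate split.
print(leaf_weight(G=-4, H=4, reg_lambda=0))  # 1.0
print(leaf_weight(G=-4, H=4, reg_lambda=1))  # 0.8
print(round(split_gain(-4, 4, 3, 3, reg_lambda=0, gamma=0), 4))  # 3.4286
print(round(split_gain(-4, 4, 3, 3, reg_lambda=1, gamma=1), 4))  # 1.6625
```

With regularization switched on, the same split is worth less and the leaf weights are smaller. A split whose gain drops below zero is not worth making, which is how reg_lambda and gamma hold back overly complex trees.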

More scalable

XGBoost was designed to be scalable. It has implemented practices such as memory optimization, cache optimization, and distributed computing that can handle large datasets.

So overall, XGBoost typically trains faster and produces models that generalize better than a plain gradient boosting implementation.

XGBoost Python package

The XGBoost framework has an open-source Python package, xgboost. This package was built to integrate easily with the popular machine learning library scikit-learn (sklearn). If you are familiar with sklearn, you’ll find it easy to use xgboost.

All right, now we are ready to build an XGBoost model in Python!


Step #1: Explore and prep data

We’ll use some bank marketing data as an example. This dataset records the telemarketing campaigns of a Portuguese bank. Based on the client information and campaign activities, we’ll predict if a client will subscribe to a term deposit (yes or no). So we are dealing with a supervised classification task.

First, let’s load the dataset.

Further learning: to follow this XGBoost in Python tutorial, you need to know Python, including basic knowledge of pandas. If you need help, please check out our course Python for Data Analysis: step-by-step with projects. This course teaches pandas, which is necessary to transform your dataset before modeling, and much more!

In reality, we’d explore the dataset before transforming it. But for simplicity, we’ll skip that process and transform the dataset directly.
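Here is a minimal sketch of what such a load-and-transform step could look like. It rests on several assumptions: that the raw data follows the semicolon-delimited layout of the UCI Bank Marketing file (bank-additional-full.csv), that the call duration and the macroeconomic columns are dropped, and that the remaining columns are renamed to the names used in the rest of this tutorial. The two rows are illustrative stand-ins so that the snippet runs without the real file.

```python
import io

import pandas as pd

# Two illustrative rows in the assumed raw layout (not the real dataset).
raw_csv = io.StringIO(
    "age;job;marital;education;default;housing;loan;contact;month;"
    "day_of_week;duration;campaign;pdays;previous;poutcome;emp.var.rate;"
    "cons.price.idx;cons.conf.idx;euribor3m;nr.employed;y\n"
    "56;housemaid;married;basic.4y;no;no;no;telephone;may;mon;261;1;999;0;"
    "nonexistent;1.1;93.994;-36.4;4.857;5191.0;no\n"
    "41;admin.;single;university.degree;no;yes;no;cellular;nov;thu;180;2;6;1;"
    "success;-1.1;94.767;-50.8;1.028;4963.6;yes\n"
)
# With the real file, you would pass its path instead of raw_csv.
df = pd.read_csv(raw_csv, sep=";")

# Assumed clean-up: drop columns that are not used as features...
df = df.drop(
    columns=[
        "duration", "emp.var.rate", "cons.price.idx",
        "cons.conf.idx", "euribor3m", "nr.employed",
    ]
)
# ...give the remaining columns descriptive names...
df = df.rename(
    columns={
        "job": "job_type",
        "default": "default_status",
        "housing": "housing_loan_status",
        "loan": "personal_loan_status",
        "contact": "contact_type",
        "month": "contact_month",
        "day_of_week": "contact_day_of_week",
        "campaign": "num_contacts",
        "pdays": "days_last_contact",
        "previous": "previous_contacts",
        "poutcome": "previous_outcome",
        "y": "result",
    }
)
# ...and encode the target as 1 (yes) or 0 (no).
df["result"] = df["result"].map({"yes": 1, "no": 0})

df.info()
```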

Now let’s look at our cleaned dataset.

We have 14 features and the target result, with no missing data. Note that xgboost can handle missing values internally, so it would still work if there were some.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   age                   41188 non-null  int64 
 1   job_type              41188 non-null  object
 2   marital               41188 non-null  object
 3   education             41188 non-null  object
 4   default_status        41188 non-null  object
 5   housing_loan_status   41188 non-null  object
 6   personal_loan_status  41188 non-null  object
 7   contact_type          41188 non-null  object
 8   contact_month         41188 non-null  object
 9   contact_day_of_week   41188 non-null  object
 10  num_contacts          41188 non-null  int64 
 11  days_last_contact     41188 non-null  int64 
 12  previous_contacts     41188 non-null  int64 
 13  previous_outcome      41188 non-null  object
 14  result                41188 non-null  int64 
dtypes: int64(5), object(10)
memory usage: 4.7+ MB

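result stores whether the customer accepted or rejected the offer from the bank (1 or 0). As the counts below show, the classes are imbalanced: only 4,640 of the 41,188 clients (about 11%) accepted.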

0    36548
1     4640
Name: result, dtype: int64

Next, let’s split the dataset into training and test sets, as usual in machine learning. We stratify the sampling on the target’s values so that both sets keep the same share of positive cases, and we set a random seed so that the output is reproducible.

Now we have the training set (80% of the original set) as X_train, y_train, and the test set as X_test, y_test.
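A stratified 80/20 split like this can be sketched with sklearn’s train_test_split. The toy data below stands in for the bank dataset, and the seed value of 8 is our assumption.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 rows, 10% positive cases.
X = [[i] for i in range(100)]
y = [1 if i < 10 else 0 for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # keep 80% for training
    stratify=y,       # preserve the share of positives in both sets
    random_state=8,   # fixed seed for reproducible output
)

print(len(X_train), len(X_test))  # 80 20
print(sum(y_train), sum(y_test))  # 8 2
```

Stratifying matters here because the positive class is rare: a purely random split could leave the test set with noticeably more or fewer positives than the training set.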

Step #2: Build a training pipeline

In this step, we’ll set up a training pipeline using the sklearn package. The sklearn pipeline sequentially applies a list of transforms and a final estimator. It conveniently assembles several steps that can be cross-validated together when training. Building a pipeline is much easier than setting up the process manually, and it ensures the same steps are applied consistently. Hence, it’s a good practice to follow.

In our example, we set up the pipeline pipe with its steps parameter set to estimators. estimators is a list of tuples in sequential order:

  • an encoder of TargetEncoder: encoding like this is a standard preprocessing procedure in classification problems. It transforms the categorical features into numeric ones. You can read more in the category_encoders documentation.
  • an estimator of XGBClassifier: this is the sklearn wrapper implementation of the XGBoost classification. For regression problems, please use XGBRegressor.

If we print out pipe, we can see that we’ve assigned its steps as 'encoder' followed by 'clf'. In the following steps, we’ll train the model by calling this pipeline, which ensures the dataset is always encoded before fitting the XGBoost classifier.

Pipeline(steps=[('encoder', TargetEncoder()),
                ('clf',
                 XGBClassifier(base_score=None, booster=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None, enable_categorical=False,
                               gamma=None, gpu_id=None, importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_delta_step=None, max_depth=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, n_estimators=100,
                               n_jobs=None, num_parallel_tree=None,
                               predictor=None, random_state=8, reg_alpha=None,
                               reg_lambda=None, scale_pos_weight=None,
                               subsample=None, tree_method=None,
                               validate_parameters=None, verbosity=None))])

Step #3: Set up hyperparameter tuning

There is one more step before training our XGBoost model in Python. The XGBoost model has many hyperparameters, and we should tune them to get a better-performing model.

This tutorial will use a package called scikit-optimize (skopt) for hyperparameter tuning. It is easy to use and integrates well with sklearn. Within the package, we’ll use the object called BayesSearchCV. In short, it uses Bayesian optimization: a predictive model of the search space suggests which hyperparameter values to try next, based on the cross-validation performance of the values tried so far. So it usually needs fewer trials than a grid or random search to arrive at a good combination of values.

To use BayesSearchCV, we need to define a search space of hyperparameter values. Within the variable search_space, we set up the ranges of the hyperparameter values that will be searched as a dictionary:

  • the keys are parameter names: in our case of a pipeline, the name of the estimator ‘clf’, followed by two underscores ‘__’, and the hyperparameter name
  • the values are the type and range of the hyperparameter, defined by the space module of scikit-optimize

As a result, only these hyperparameter values will be considered for tuning. This list of hyperparameters is not exhaustive. You can remove or include more hyperparameters by reading their definition within the XGBClassifier documentation. And you can change their search values as you like.

After that, we set up a variable opt as BayesSearchCV and feed it with:

  • pipe: the pipeline we set up earlier
  • search_space: the search space of the hyperparameters
  • cv: the number of folds of cross-validation
  • n_iter: the number of hyperparameter settings that are sampled
  • scoring: the metric for evaluation

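Note that it is important to pass the whole scikit-learn pipeline, rather than the bare classifier, to BayesSearchCV. This ensures the TargetEncoder is fitted only on the training folds during cross-validation, so the target values of the validation fold don’t leak into the encoding.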
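To see why the encoder must be fitted on the training fold only, here is a simplified sketch of target encoding in plain Python. It replaces each category with the mean target value seen in the training fold. The real TargetEncoder also smooths these means, which we leave out, and the tiny dataset is made up.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Mean target value per category, learned from the training fold."""
    totals, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in counts}


# Training fold: categories with known outcomes.
train_jobs = ["admin", "admin", "student", "student", "retired"]
train_result = [0, 1, 1, 1, 0]

# Validation fold: its outcomes must not influence the encoding.
valid_jobs = ["student", "admin", "technician"]

means = fit_target_means(train_jobs, train_result)
global_mean = sum(train_result) / len(train_result)  # fallback for unseen categories

encoded_valid = [means.get(job, global_mean) for job in valid_jobs]
print(encoded_valid)  # [1.0, 0.5, 0.6]
```

If the means were computed on the full dataset instead, each validation row would be encoded with information from its own target value, and the cross-validation score would look better than it really is.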

Finally, we’ve got everything set up for training!

Step #4: Train the XGBoost model

opt includes both the pipeline and the hyperparameter tuning settings. We call its fit method on the training set: opt.fit(X_train, y_train).

And after waiting, we have our XGBoost model trained!

Step #5: Evaluate the model and make predictions

Let’s look at the chosen pipeline/model.

Pipeline(steps=[('encoder',
                 TargetEncoder(cols=['job_type', 'marital', 'education',
                                     'default_status', 'housing_loan_status',
                                     'personal_loan_status', 'contact_type',
                                     'contact_month', 'contact_day_of_week',
                                     'previous_outcome'])),
                ('clf',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=0.7750018497221565,
                               colsample_bynode=0.5614437441596264,
                               colsamp...
                               learning_rate=0.4299244814327041,
                               max_delta_step=0, max_depth=6,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1, predictor='auto',
                               random_state=8, reg_alpha=2.784887532399771,
                               reg_lambda=1.67027558902639, scale_pos_weight=1,
                               subsample=0.5966102807384807,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None))])

We’ve set the scoring within opt as roc_auc. We can read opt.best_score_ to see the mean cross-validated AUC of the best model on the training set. The closer this score is to 1, the better predictions the model can make. And by calling opt.score(X_test, y_test), we get the AUC on the test set.

We can see that the cross-validated score (first) and the test score (second) are close, which suggests the model is not overfitting.

0.7682616152239224
0.7775439731826973
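To make the AUC scores above easier to interpret, here is a small sketch that computes the metric by hand. It uses the ranking interpretation of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name and the toy scores are ours.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs that the scores rank correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0  # ties count as half
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 0, 1]
print(roc_auc(y_true, [0.1, 0.2, 0.9, 0.3, 0.8]))  # 1.0, a perfect ranking
print(roc_auc(y_true, [0.5, 0.5, 0.5, 0.5, 0.5]))  # 0.5, no better than chance
```

Read this way, a score of about 0.77 means the model ranks a subscribing client above a non-subscribing one roughly 77% of the time.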

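To make predictions, we use the predict or predict_proba methods on the test set. predict returns the predicted class (0 or 1) for each row, while predict_proba returns the predicted probability of each class.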

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
array([[0.9388285 , 0.06117148],
       [0.95764744, 0.04235254],
       [0.9209205 , 0.07907953],
       ...,
       [0.52436817, 0.4756318 ],
       [0.9597406 , 0.04025942],
       [0.8426394 , 0.15736063]], dtype=float32)
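The two outputs are linked: for a binary problem, the predicted class is 1 when the probability in the second column exceeds 0.5. The sketch below applies that rule to the six probability rows printed above and then tries a lower cut-off. The value 0.4 is an arbitrary choice of ours, not a recommendation.

```python
# The six rows visible in the predict_proba output: [P(class 0), P(class 1)].
probabilities = [
    [0.9388285, 0.06117148],
    [0.95764744, 0.04235254],
    [0.9209205, 0.07907953],
    [0.52436817, 0.4756318],
    [0.9597406, 0.04025942],
    [0.8426394, 0.15736063],
]

# Default rule: predict 1 when P(class 1) > 0.5.
labels = [int(p_yes > 0.5) for _, p_yes in probabilities]
print(labels)  # [0, 0, 0, 0, 0, 0]

# A lower cut-off flags more clients as likely subscribers.
custom_threshold = 0.4
custom_labels = [int(p_yes > custom_threshold) for _, p_yes in probabilities]
print(custom_labels)  # [0, 0, 0, 1, 0, 0]
```

With an imbalanced target like ours, working with the probabilities and choosing your own cut-off can be more useful than relying on the default labels.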

Step #6: Measure feature importance (optional)

We can look at the feature importance if we want to interpret the model better. The xgboost package offers a plotting function, plot_importance, which works on a fitted model.

So first, we need to extract the fitted XGBoost model from opt.

Listing the steps of the best pipeline stored in opt shows the fitted XGBClassifier as its second step. Now we need to use basic Python indexing to grab it.

[('encoder',
  TargetEncoder(cols=['job_type', 'marital', 'education', 'default_status',
                      'housing_loan_status', 'personal_loan_status',
                      'contact_type', 'contact_month', 'contact_day_of_week',
                      'previous_outcome'])),
 ('clf',
  XGBClassifier(base_score=0.5, booster='gbtree',
                colsample_bylevel=0.7750018497221565,
                colsample_bynode=0.5614437441596264,
                colsample_bytree=0.9126202065825759, enable_categorical=False,
                gamma=8.289497472648083, gpu_id=-1, importance_type=None,
                interaction_constraints='', learning_rate=0.4299244814327041,
                max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
                monotone_constraints='()', n_estimators=100, n_jobs=8,
                num_parallel_tree=1, predictor='auto', random_state=8,
                reg_alpha=2.784887532399771, reg_lambda=1.67027558902639,
                scale_pos_weight=1, subsample=0.5966102807384807,
                tree_method='exact', validate_parameters=1, verbosity=None))]

Below, we extract the model and store it in xgboost_model. Then we apply the plot_importance function to plot the feature importance of that model. By default, importance is calculated as the number of times each feature is used to split the data across all trees.

We can see the feature importance plot below. Please investigate more if you are interested.

[Figure: XGBoost model feature importance plot]

And that’s it!

As you can see, building XGBoost models in Python is easy.


In this tutorial, you’ve successfully built an XGBoost model and made predictions in Python!

Please leave a comment with any questions you may have or anything else.
