How to use AutoML Python tools to automate your machine learning process
 Popular and free libraries: H2O, TPOT, PyCaret, AutoGluon

Lianne & Justin


In this tutorial, you’ll learn and apply popular Automated Machine Learning (AutoML) tools in Python.

Applying machine learning in real life can be complicated. The process can be time-consuming and resource-intensive, and it is especially challenging for beginners. To simplify this task, we can automate much of the process with free Python AutoML packages.

By following this guide, you’ll learn:

  • What is AutoML, and how to use it in Python?
  • How to use popular and general Python AutoML libraries:
    • H2O
    • TPOT
    • PyCaret
    • AutoGluon

Throughout the guide, you’ll use a time series dataset as an example to try each AutoML tool to find well-performing model pipelines in Python.

If you are interested in learning AutoML to see which tool is best for your needs, this practical tutorial will get you started.

Let’s jump in!



How to start using AutoML in Python

What is AutoML in Python?

First of all, why AutoML?

Applying machine learning to solve real-world problems is not easy. It involves many steps to reach a production-ready model. This process can take much effort, even for industry experts. So it is challenging, if not impossible, for machine learning beginners.

Luckily, the demand for machine learning has been increasing dramatically. It has driven the efforts to automate the process to make ML simpler and more approachable. And that’s what AutoML is used for.

Automated Machine Learning (AutoML) is the process of automating machine learning workflows. In an ideal situation, we, as the users, only need to provide a dataset. The AutoML tool should automatically produce good-performing model pipelines for us.

automl python process

So AutoML should handle tasks like:

  • data preprocessing
  • algorithm selection
  • hyperparameter tuning
  • model training

With Python being one of the most common data science languages, there are quite a few AutoML Python libraries that we can use. We’ve reviewed the popular AutoML Python packages. And we want to introduce 4 easy-to-use and relatively up-to-date ones:

  • H2O
  • TPOT
  • PyCaret
  • AutoGluon

These Python AutoML tools can help you produce high-performing machine learning models with less thinking and coding. They are useful not only for machine learning beginners but also for experienced data scientists.

It could be exciting to just start throwing data into them, but please read the tips below first.

Before using AutoML tools in Python

Even though automating the entire machine learning process sounds attractive and promising, the existing AutoML tools are still limited and require human intervention. So before using these AutoML packages, please make sure you’ve learned the basics of the following:

  • Python: you still need to know basic Python. It is highly recommended to conduct basic data cleaning before feeding the data into AutoML tools
  • machine learning: you still need to have the basic knowledge to run AutoML tools properly and understand the results

Also, we strongly recommend following the tips below to set up AutoML tools in Python:

  • follow the installation guide: the AutoML packages often rely on other tools, which could be more complicated to set up than standard Python libraries. It is better to follow their official installation guide
  • create virtual environments: the AutoML packages could also be based on different versions of Python and tools. To avoid conflicts, it is better to set up a separate environment for each AutoML tool

All right! I also want to give another tip about expectations for the AutoML tools. You need to budget plenty of training time for these tools. This is because they often consider many choices of preprocessing steps, machine learning algorithms, methods of ensembling, and so on. So they need to run for a long enough time (even hours to days) to optimize the results. However, we’ll set a maximum run time for each tool within this tutorial to shorten your learning time.

I know you can’t wait to try the AutoML tools. Next, let’s quickly look at our example dataset and preprocess it.

Preprocess the Example dataset

We’ll use the individual household electric power consumption dataset. It is a time-series recording of a household’s electric power usage between 2006 and 2010.

As mentioned earlier, before using the AutoML packages in Python, I recommend you clean the data yourself. So below is the process of data cleaning and preprocessing.

Further learning: if you have trouble understanding the code below, check out our course Python for Data Analysis with projects. This course shows how to use Python for basic analysis, essential before applying AutoML.

In the end, we have the training and test sets: df_train and df_test. The target is electricity_usage, while the 9 features are below:

  • electricity_usage_1hr_lag
  • electricity_usage_2hr_lag
  • electricity_usage_3hr_lag
  • electricity_usage_4hr_lag
  • electricity_usage_5hr_lag
  • electricity_usage_6hr_lag
  • electricity_usage_7hr_lag
  • electricity_usage_8hr_lag
  • month

We’ll use the previous 8 hours of electricity usage and the month of the year to predict the household’s electricity usage.
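To make the feature list above more concrete, here is a minimal sketch of how lag features like these could be built with pandas. It is not the exact preprocessing used for this tutorial: the helper names make_lag_features and chronological_split, the 80/20 split, and the synthetic demo series are all assumptions for illustration. It assumes the usage has already been aggregated into an hourly series with a DatetimeIndex.

```python
import numpy as np
import pandas as pd


def make_lag_features(hourly_usage: pd.Series, n_lags: int = 8) -> pd.DataFrame:
    """Build the target, n_lags hourly lag features, and the month.

    `hourly_usage` is assumed to be an hourly series with a DatetimeIndex.
    """
    df = pd.DataFrame({"electricity_usage": hourly_usage})
    for lag in range(1, n_lags + 1):
        # shift(lag) moves values down, so row t holds the usage from t - lag hours
        df[f"electricity_usage_{lag}hr_lag"] = df["electricity_usage"].shift(lag)
    df["month"] = df.index.month
    # The first n_lags rows have an incomplete history, so drop them
    return df.dropna()


def chronological_split(df: pd.DataFrame, train_fraction: float = 0.8):
    """Split without shuffling so the test set is strictly later in time."""
    cutoff = int(len(df) * train_fraction)
    return df.iloc[:cutoff], df.iloc[cutoff:]


# Demo on synthetic data (a stand-in for the real household dataset)
index = pd.date_range("2006-12-16 17:00", periods=200, freq="h")
usage = pd.Series(np.random.default_rng(0).random(200), index=index)
features = make_lag_features(usage)
df_train, df_test = chronological_split(features)
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for a time series.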

Now we are ready to feed the datasets into AutoML tools!


H2O


Intro

H2O is an open-source, in-memory, distributed, scalable and fast machine learning platform. Its core code is written in Java. But we can use it in other languages like Python, Scala, and R. The tool currently supports both supervised and unsupervised learning problems.

As mentioned earlier, please follow the installation guide since the process is not as straightforward as the standard Python libraries.

Train with AutoML

To use H2O in Python, we first initialize a connection between our Python session and a local H2O server.

If the connection is successful, we can see a summary of the cluster status like below. I’ve only printed part of the summary since it’s long.

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; OpenJDK 64-Bit Server VM (build 17.0.1+12-39, mixed mode, sharing)
  Starting server from C:\Users\liann\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\liann\AppData\Local\Temp\tmp0t1e8bbx
  JVM stdout: C:\Users\liann\AppData\Local\Temp\tmp0t1e8bbx\h2o_liann_started_from_python.out
  JVM stderr: C:\Users\liann\AppData\Local\Temp\tmp0t1e8bbx\h2o_liann_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.

H2O uses its own data objects. So instead of using pandas DataFrames, we need to convert them to H2OFrame, a 2D array of uniformly-typed columns. It is similar to the pandas DataFrame in many ways. You can read more about it here. Then we also identify the target and features.
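The connection and conversion steps described above can be sketched roughly as follows. This assumes df_train and df_test are pandas DataFrames with a target column named electricity_usage; the helper names split_columns and to_h2o_frames are made up for illustration.

```python
def split_columns(columns, target="electricity_usage"):
    """Return (target, features): every column except the target is a feature."""
    features = [col for col in columns if col != target]
    return target, features


def to_h2o_frames(df_train, df_test):
    """Start (or connect to) a local H2O server and convert pandas DataFrames."""
    import h2o  # imported here so split_columns works without H2O installed

    h2o.init()  # prints the cluster summary when the connection succeeds
    train = h2o.H2OFrame(df_train)
    test = h2o.H2OFrame(df_test)
    return train, test


# Example of identifying the target and features from column names
y, x = split_columns(["electricity_usage", "electricity_usage_1hr_lag", "month"])
```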

Next, we can use an H2OAutoML object to automate our supervised machine learning model training. This object trains several models and, by default, evaluates them with cross-validation.

We’ve set a couple of parameters in the argument:

  • sort_metric='mse': set the MSE (mean squared error) as the metric to sort the model performance by
  • max_runtime_secs=5*60: specify 5 minutes as the maximum time the process will run for
  • seed=666: set seed for reproducibility. However, because we’ve set the max_runtime_secs, this cannot guarantee the same results after each run. You can read more about it here
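Putting the parameters above together, the training call might look like the sketch below. The function name train_h2o_automl is hypothetical, and it assumes train is an H2OFrame, features is a list of column names, and h2o.init() has already been called.

```python
# Settings taken from the list above
AUTOML_SETTINGS = {
    "sort_metric": "mse",        # rank models by mean squared error
    "max_runtime_secs": 5 * 60,  # stop after 5 minutes
    "seed": 666,                 # helps reproducibility, not exact with a time limit
}


def train_h2o_automl(train, features, target="electricity_usage"):
    """Train H2O AutoML on an H2OFrame and return the fitted H2OAutoML object."""
    from h2o.automl import H2OAutoML  # needs a running H2O cluster

    aml = H2OAutoML(**AUTOML_SETTINGS)
    aml.train(x=features, y=target, training_frame=train)
    return aml
```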

Within the AutoML progress note, you’ll notice it says, “XGBoost is not available; skipping it”. This is because I’m running this in a Windows environment, and XGBoost is not supported on Windows with H2O. You can read more about the limitation here.

Below the progress bar, you’ll see the Model Details about the best-performing model trained in this session.

AutoML progress: |
10:05:12.611: AutoML: XGBoost is not available; skipping it.
10:05:12.624: Step 'best_of_family_xgboost' not defined in provider 'StackedEnsemble': skipping it.
10:05:12.624: Step 'all_xgboost' not defined in provider 'StackedEnsemble': skipping it.
███████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_3_AutoML_1_20220302_100512
No model summary for this model
ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **
MSE: 0.25770256452650037
RMSE: 0.5076441317758932
MAE: 0.34986002383931386
RMSLE: 0.21355570733121418
R^2: 0.6653029342713153
Mean Residual Deviance: 0.25770256452650037
Null degrees of freedom: 10047
Residual degrees of freedom: 10031
Null deviance: 7737.841923940112
Residual deviance: 2589.395368362276
AIC: 14926.41110344668
ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **
MSE: 0.35857040562993847
RMSE: 0.5988074862841466
MAE: 0.4082713452285525
RMSLE: 0.24890166892926036
R^2: 0.5493550817323876
Mean Residual Deviance: 0.35857040562993847
Null degrees of freedom: 34742
Residual degrees of freedom: 34726
Null deviance: 27644.918328643114
Residual deviance: 12457.811602800952
AIC: 62998.89118626651

Print best performing model(s)

We can print the leaderboard if we want to compare the top-performing models.

This returns an H2OFrame storing the top models and their metrics.
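As a small sketch, assuming aml is a trained H2OAutoML object, the leaderboard is available as a property. The helper name top_models is hypothetical. I believe the extra columns such as training_time_ms and algo come from h2o.automl.get_leaderboard(aml, extra_columns='ALL'), but please check that against the H2O documentation for your version.

```python
def top_models(aml, n_rows=10):
    """Return the first n_rows of the AutoML leaderboard (an H2OFrame)."""
    # `leaderboard` is a property of a trained H2OAutoML object,
    # sorted by the chosen sort_metric
    return aml.leaderboard.head(rows=n_rows)
```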

| model_id | mse | mean_residual_deviance | rmse | mae | rmsle | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_3_AutoML_1_20220302_100512 | 0.35857 | 0.35857 | 0.598807 | 0.408271 | 0.248902 | 1201 | 0.075261 | StackedEnsemble |
| StackedEnsemble_AllModels_4_AutoML_1_20220302_100512 | 0.35865 | 0.35865 | 0.598874 | 0.408348 | 0.248916 | 879 | 0.07141 | StackedEnsemble |
| StackedEnsemble_AllModels_2_AutoML_1_20220302_100512 | 0.358774 | 0.358774 | 0.598977 | 0.40858 | 0.248953 | 429 | 0.046865 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_3_AutoML_1_20220302_100512 | 0.358876 | 0.358876 | 0.599062 | 0.408973 | 0.249059 | 409 | 0.029082 | StackedEnsemble |
| StackedEnsemble_AllModels_1_AutoML_1_20220302_100512 | 0.358881 | 0.358881 | 0.599067 | 0.409069 | 0.248949 | 280 | 0.024984 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_1_20220302_100512 | 0.359382 | 0.359382 | 0.599485 | 0.409906 | 0.249145 | 269 | 0.017429 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_1_AutoML_1_20220302_100512 | 0.360317 | 0.360317 | 0.600264 | 0.411528 | 0.249621 | 579 | 0.007317 | StackedEnsemble |
| GBM_1_AutoML_1_20220302_100512 | 0.360923 | 0.360923 | 0.600769 | 0.412772 | 0.250035 | 782 | 0.007483 | GBM |
| GBM_2_AutoML_1_20220302_100512 | 0.36299 | 0.36299 | 0.602487 | 0.414703 | 0.25077 | 460 | 0.005719 | GBM |
| GBM_grid_1_AutoML_1_20220302_100512_model_17 | 0.364255 | 0.364255 | 0.603535 | 0.41623 | 0.251649 | 410 | 0.005731 | GBM |

H2O leaderboard

Calculate metrics

We can also calculate MSE based on the holdout test dataset df_test. Please note that this is different from the results above, which were computed on the training data and its cross-validation folds.
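One way to do this, sketched under the assumption that the test set has already been converted to an H2OFrame, is to ask the model for its performance on that frame. The helper name h2o_test_mse is hypothetical.

```python
def h2o_test_mse(model, test_frame):
    """Mean squared error of an H2O model on a holdout H2OFrame."""
    # model_performance scores the model on data it has not seen during training
    performance = model.model_performance(test_frame)
    return performance.mse()
```

For the AutoML run, the model to pass in would be aml.leader, the best model on the leaderboard.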

0.29635466797374066

Plot predicted and actual data comparisons

Lastly, we can also plot and compare the predicted and actual electricity usage for the test set.
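A plot like this could be produced along the following lines. This is only a sketch: the helper names comparison_frame and plot_comparison and the tiny demo values are assumptions, and matplotlib is imported lazily because it is only needed when drawing.

```python
import pandas as pd


def comparison_frame(actual, predicted, index=None):
    """Put actual and predicted values side by side for plotting."""
    return pd.DataFrame(
        {"actual": list(actual), "predicted": list(predicted)}, index=index
    )


def plot_comparison(frame, title="Predicted vs. actual electricity usage"):
    """Draw both series on one set of axes."""
    import matplotlib.pyplot as plt  # only needed when actually plotting

    ax = frame.plot(figsize=(12, 4), title=title)
    ax.set_ylabel("electricity_usage")
    plt.show()
    return ax


# Made-up values, just to show the shape of the data being plotted
demo = comparison_frame([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

With H2O, the predicted values would first need to be brought back into pandas, for example with model.predict(test).as_data_frame().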

You can see how closely the prediction follows the actual target.

Figure: H2O comparison plot (predicted vs. actual time series)

TPOT


Intro

TPOT (Tree-based Pipeline Optimization Tool) is a Python Automated Machine Learning tool based on the popular machine learning package scikit-learn. It automates the process, including feature engineering, model selection, and parameter optimization. The tool follows a technique called genetic programming, which applies operations similar to natural genetic processes to evolve programs.

Please follow its official installation guide to set it up.

Train with AutoML

TPOT is designed to work much like scikit-learn, so you may find it easier to use if you are familiar with scikit-learn.

First, we’ll create an instance of the class TPOTRegressor since ours is a regression problem. Within the argument, we’ve set some parameters:

  • generations=10 and population_size=10: these two parameters are related to the genetic programming. In general, the higher these numbers, the better TPOT can work. But we’ve set them to be lower than their default values of 100 to simplify the process
  • verbosity=2: this determines how much information TPOT prints out while it’s running. 2 means it will print more information as well as provide a progress bar
  • scoring='neg_mean_squared_error': this is the function used to evaluate the quality of a pipeline. We’ll use the negative mean squared error (the MSE multiplied by -1), since scikit-learn scorers treat higher values as better
  • max_time_mins=5: limit the optimization time of TPOT to 5 minutes
  • random_state = 666: set the seed of the pseudo-random number generator for reproducibility

Then, we’ll separate the features and target of the training set.

Now we are ready to feed the data into TPOT to optimize a machine learning pipeline/model. The fit function uses genetic programming with cross-validation to find the optimal pipeline.
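The steps above, sketched as code. This assumes the classic TPOT API that the listed parameters belong to (newer TPOT releases may use different parameter names), and that df_train is a pandas DataFrame with a target column named electricity_usage. The function name fit_tpot is hypothetical.

```python
# Settings taken from the parameter list above
TPOT_SETTINGS = {
    "generations": 10,
    "population_size": 10,
    "verbosity": 2,
    "scoring": "neg_mean_squared_error",
    "max_time_mins": 5,
    "random_state": 666,
}


def fit_tpot(df_train, target="electricity_usage"):
    """Search for a pipeline with TPOT and return the fitted TPOTRegressor."""
    from tpot import TPOTRegressor  # imported lazily; TPOT must be installed

    # Separate the features and the target of the training set
    X_train = df_train.drop(columns=[target])
    y_train = df_train[target]

    tpot = TPOTRegressor(**TPOT_SETTINGS)
    tpot.fit(X_train, y_train)  # runs until done or until max_time_mins is hit
    return tpot
```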

You should see the progress bar moving, and when the maximum time is reached, it returns the current best pipeline.

Figure: TPOT fit results

Print best performing model(s)

We can already see the best pipeline from the result above. But we can also export the optimized pipeline as Python code.
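In the classic TPOT API, exporting is a single method call on the fitted object. The sketch below wraps it in a hypothetical helper; the output file name is made up.

```python
def export_best_pipeline(tpot, path="tpot_best_pipeline.py"):
    """Write the best pipeline found by a fitted TPOT object to a Python file."""
    tpot.export(path)  # generates standalone scikit-learn code for the pipeline
    return path
```

The exported file can then be edited and run on its own, without rerunning the search.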

Figure: TPOT exported pipeline

Calculate metrics

We can also evaluate the pipeline using the test set.

We’ve set the score as the negative MSE, so the MSE is the absolute value of the number below.

-0.2989048361963268

We could verify this by using the sklearn code below.
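With a fitted TPOT object, the two numbers would come from tpot.score(X_test, y_test) and from sklearn's mean_squared_error(y_test, tpot.predict(X_test)). The relationship between them can be illustrated without either library. Below is a small pure-Python sketch with made-up values; the function names are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


def neg_mse_score(y_true, y_pred):
    """Higher-is-better score, as with scoring='neg_mean_squared_error'."""
    return -mse(y_true, y_pred)


# Made-up values, only to show that the score is the MSE with its sign flipped
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.0]
print(neg_mse_score(y_true, y_pred))
print(mse(y_true, y_pred))
```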

0.2989048361963268

Plot predicted and actual data comparisons

In the end, let’s also plot the comparison of the prediction and actual data of the test set.

Figure: TPOT comparison plot (predicted vs. actual time series)

PyCaret


Intro

PyCaret is a Python library that automates machine learning workflows. It is built on top of other Python machine learning libraries, including scikit-learn, XGBoost, LightGBM, etc. The library mainly targets citizen data scientists who prefer low code, but it also suits other data science users. We can use PyCaret for both supervised and unsupervised learning problems.

Please follow the installation guide to set up PyCaret.

Train with AutoML

We’ll train our dataset with the regression module from PyCaret. PyCaret also has a time series module. But it is still in beta at the time of writing, so we are not using it here.

First, we set up the training environment and create the transformation pipeline. Within the argument, we’ve set the parameter session_id for reproducibility.

When the setup function is executed, PyCaret automatically infers the data types in the dataset. So you’ll see the columns in the dataset together with their inferred data types printed. If they are correct, you can press enter to continue the process. If they are not, you can use the numeric_features, categorical_features, date_features parameters in the setup function to specify them.

Figure: PyCaret inferred data types

Then, the process will begin and print a summary. I won’t show it since it’s long.

Next, we can use the compare_models function to train and evaluate the model performance based on cross-validation. We’ve set two parameters for its argument:

  • sort='MSE': set MSE (mean squared error) as the sorting criteria of the results
  • budget_time=5: set the time limit to be 5 minutes
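The setup and comparison steps described above might look like this in code. It is a sketch that assumes df_train is a pandas DataFrame with a target column named electricity_usage; the function name run_pycaret is hypothetical, and session_id=666 is an assumed value that simply mirrors the seed used for the other tools.

```python
# Settings taken from the list above
COMPARE_SETTINGS = {
    "sort": "MSE",     # rank models by mean squared error
    "budget_time": 5,  # time limit in minutes
}


def run_pycaret(df_train, target="electricity_usage", session_id=666):
    """Set up the PyCaret experiment and return the best model found."""
    from pycaret.regression import setup, compare_models  # lazy import

    # setup infers the data types and builds the transformation pipeline
    setup(data=df_train, target=target, session_id=session_id)
    best_model = compare_models(**COMPARE_SETTINGS)
    return best_model
```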

This prints out a list of models and their scores. I’ve only printed the top 5 to save space.


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 0.4159 | 0.3645 | 0.6036 | 0.5357 | 0.2523 | 0.5605 | 0.2220 |
| catboost | CatBoost Regressor | 0.4158 | 0.3656 | 0.6045 | 0.5343 | 0.2526 | 0.5545 | 3.8820 |
| gbr | Gradient Boosting Regressor | 0.4202 | 0.3675 | 0.6061 | 0.5318 | 0.2540 | 0.5726 | 2.1120 |
| rf | Random Forest Regressor | 0.4232 | 0.3745 | 0.6118 | 0.5229 | 0.2571 | 0.5804 | 4.8620 |
| xgboost | Extreme Gradient Boosting | 0.4262 | 0.3837 | 0.6193 | 0.5112 | 0.2589 | 0.5683 | 2.0300 |

PyCaret model comparisons

Print best performing model(s)

To look at the best-performing model, we can print it out.

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=666, reg_alpha=0.0, reg_lambda=0.0, silent='warn',
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

Calculate metrics

To calculate the MSE metric on the test set, we can use the mean_squared_error function from sklearn.
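A possible way to do this is sketched below. The helper names are hypothetical. One assumption to check against your installed version: as far as I know, predict_model stores predictions in a column named Label in PyCaret 2.x and prediction_label in 3.x, so the sketch looks for either.

```python
def prediction_column(columns):
    """Pick the prediction column that predict_model added to the data.

    Assumption: the column is 'Label' in PyCaret 2.x and
    'prediction_label' in PyCaret 3.x.
    """
    for name in ("prediction_label", "Label"):
        if name in columns:
            return name
    raise KeyError("no prediction column found")


def pycaret_test_mse(best_model, df_test, target="electricity_usage"):
    """MSE of a PyCaret model on the holdout test set."""
    from pycaret.regression import predict_model  # lazy imports
    from sklearn.metrics import mean_squared_error

    # predict_model applies the preprocessing pipeline before predicting
    predictions = predict_model(best_model, data=df_test)
    column = prediction_column(predictions.columns)
    return mean_squared_error(df_test[target], predictions[column])
```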

0.2978445058103306

Plot predicted and actual data comparisons

And lastly, let’s also plot to compare the actual and predicted data.

Figure: PyCaret comparison plot (actual vs. predicted time series)

AutoGluon


Intro

AutoGluon is an AutoML tool that works not only for tabular data but also for text and images. It focuses on automated stack ensembling, deep learning, etc. It seems only to cover supervised learning problems.

Please follow the installation guide to set up AutoGluon.

Train with AutoML

We’ll use the tabular module for our example. Within the TabularPredictor, we set the eval_metric to be the mean squared error. Then, we can use the fit function to use AutoGluon to train models. We also set the time_limit to be 5 minutes.
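As a sketch of the steps just described, assuming df_train is a pandas DataFrame with a target column named electricity_usage (the function name fit_autogluon is hypothetical):

```python
FIT_TIME_LIMIT_SECS = 5 * 60  # AutoGluon's time_limit is given in seconds


def fit_autogluon(df_train, target="electricity_usage"):
    """Train AutoGluon on tabular data and return the fitted predictor."""
    from autogluon.tabular import TabularPredictor  # lazy import

    predictor = TabularPredictor(label=target, eval_metric="mean_squared_error")
    predictor.fit(train_data=df_train, time_limit=FIT_TIME_LIMIT_SECS)
    return predictor
```

Once fitted, predictor.leaderboard() lists the trained models and predictor.evaluate(df_test) reports metrics on a holdout set.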

When it’s running, you should be seeing the process and its summary printed. I won’t show it since it’s long.

Print best performing model(s)

We can use the leaderboard function to see the models and their information.

Here are the top 5 models.

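| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | -0.348620 | 0.303443 | 287.461536 | 0.002577 | 0.276444 | 2 | True | 12 |
| 1 | LightGBMXT | -0.351087 | 0.010267 | 1.051211 | 0.010267 | 1.051211 | 1 | True | 3 |
| 2 | LightGBM | -0.351415 | 0.006418 | 0.353956 | 0.006418 | 0.353956 | 1 | True | 4 |
| 3 | CatBoost | -0.353631 | 0.005425 | 4.793479 | 0.005425 | 4.793479 | 1 | True | 6 |
| 4 | XGBoost | -0.353716 | 0.009078 | 1.165040 | 0.009078 | 1.165040 | 1 | True | 9 |

AutoGluon leaderboard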

Calculate metrics

We can use the evaluate function to see the metrics for our test set.

You can see the metrics printed, including the negative mean squared error.

Evaluation: mean_squared_error on test data: -0.29892013448684046
	Note: Scores are always higher_is_better. This metric score can be multiplied by -1 to get the metric value.
Evaluations on test data:
{
    "mean_squared_error": -0.29892013448684046,
    "root_mean_squared_error": -0.546735890981048,
    "mean_absolute_error": -0.41346365122169965,
    "r2": 0.4071766623884153,
    "pearsonr": 0.6835849536261334,
    "median_absolute_error": -0.30548684794108083
}
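Because AutoGluon reports every score as higher-is-better, the error metrics come back negated. Here is a small sketch that flips them back into ordinary metric values. The scores are copied from the output above; the names ERROR_METRICS and to_metric_values are made up for this example.

```python
# Scores copied from the evaluation output above
scores = {
    "mean_squared_error": -0.29892013448684046,
    "root_mean_squared_error": -0.546735890981048,
    "mean_absolute_error": -0.41346365122169965,
    "r2": 0.4071766623884153,
    "pearsonr": 0.6835849536261334,
    "median_absolute_error": -0.30548684794108083,
}

# Error metrics are reported negated so that higher is always better
ERROR_METRICS = {
    "mean_squared_error",
    "root_mean_squared_error",
    "mean_absolute_error",
    "median_absolute_error",
}


def to_metric_values(scores):
    """Multiply error metrics by -1; leave r2 and pearsonr unchanged."""
    return {
        name: (-value if name in ERROR_METRICS else value)
        for name, value in scores.items()
    }


metrics = to_metric_values(scores)
print(metrics["mean_squared_error"])
```

As a sanity check, the recovered RMSE squared should match the recovered MSE.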

Plot predicted and actual data comparisons

And we can also plot to compare the actual and predicted values.

Figure: AutoGluon comparison plot (predicted vs. actual time series)

Other AutoML Python Tools

Besides these 4 libraries, there are also other Python AutoML tools. We’ve excluded them from this guide for the following reasons:

  • Auto-Sklearn: the package only explicitly supports the Linux operating system
  • HyperOpt-Sklearn: the package is updated less frequently, based on its GitHub history
  • Google or other cloud services: these often cost money. With that said, you can usually try them for free

Please test them out as you need.


In this tutorial, you’ve learned about popular AutoML tools in Python.

Hope you can try them out to automate your machine learning process.

We’d love to hear from you. Leave a comment for any questions you may have or anything else.
