In this post, we dive into the coronavirus data using a machine learning technique: hyperparameter tuning.
You’ll see the step-by-step procedure of finding the parameters of a model that best fits the COVID-19 data.
The goal is to estimate:
- the death rate, aka case fatality ratio (CFR) and
- the distribution of time from symptoms to death/recovery.
If you want:
- more insights about coronavirus
- or to see an example of hyperparameter tuning/optimization in Python
take a look!
The dataset we are using is COVID-19 Complete Dataset from Kaggle. It includes the number of confirmed, death, and recovered cases daily across the globe.
Yes, we know the numbers are not accurate due to various reasons. But let’s leave that concern out for this project.
And to simplify the analysis, we zoom into China’s Hubei province, the first heavily affected region.
Step #1: Exploring the Data: COVID-19
To begin the project, let’s read the data into Python and take a look.
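Assuming the data is loaded with pandas, the read step might look like the sketch below. The inline sample (and the column layout it mimics) is a tiny stand-in for the actual Kaggle file, whose name depends on where you saved the download:

```python
from io import StringIO
import pandas as pd

# A tiny stand-in for the Kaggle CSV; the real dataset has 5,822 rows
# with these same 8 columns.
sample = StringIO(
    "Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered\n"
    "Hubei,China,30.9756,112.2707,1/22/20,444,17,28\n"
    "Hubei,China,30.9756,112.2707,1/23/20,444,17,28\n"
)
df = pd.read_csv(sample)

print(df.shape)    # (2, 8) for the sample; (5822, 8) for the full file
print(df.dtypes)   # note: Date loads as a plain object (string) column
```

The `dtypes` output is what tells us the Date column still needs conversion.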
The dataset has 5,822 rows and 8 features. Each row represents one day’s data for a province/state of a country/region. The features are:
- Province/State: the province/state of the location
- Country/Region: the country/region of the location
- Lat: latitude of the location
- Long: longitude of the location
- Date: date of the cumulative report
- Confirmed: cumulative number of confirmed cases till this day
- Deaths: cumulative number of deaths till this day
- Recovered: cumulative number of recovered cases till this day
We also notice that the feature Date needs to be converted to DateTime format.
Step #2: Cleaning the Data
In this step, we clean the data up to fit our project by:
- converting the Date feature to the correct format.
- narrowing down the data to Hubei.
- turning the cumulative counts to single-day counts: New_Confirmed, New_Deaths, New_Recovered.
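A minimal sketch of these three cleaning steps, run here on a small made-up frame with the same columns (the real code operates on the full Kaggle file):

```python
import pandas as pd

# Made-up rows mimicking the Kaggle columns we need.
df = pd.DataFrame({
    'Province/State': ['Hubei', 'Hubei', 'Hubei', 'Guangdong'],
    'Date': ['1/22/20', '1/23/20', '1/24/20', '1/22/20'],
    'Confirmed': [444, 444, 549, 26],
    'Deaths': [17, 17, 24, 0],
    'Recovered': [28, 28, 31, 0],
})

# 1) Convert Date to datetime.
df['Date'] = pd.to_datetime(df['Date'])

# 2) Narrow down to Hubei.
hubei = df[df['Province/State'] == 'Hubei'].sort_values('Date').reset_index(drop=True)

# 3) Turn cumulative counts into single-day counts; the first day keeps
#    its cumulative value since there is no prior day to subtract.
for col in ['Confirmed', 'Deaths', 'Recovered']:
    hubei['New_' + col] = hubei[col].diff().fillna(hubei[col])

print(hubei[['Date', 'New_Confirmed', 'New_Deaths', 'New_Recovered']])
```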
Now we have a dataset of newly confirmed, death, and recovered cases for each day from Jan 22nd to Mar 2nd, 2020.
Related article: How to Manipulate Date And Time in Python Like a Boss
Step #3: Explaining the Methodologies
As claimed at the beginning of the post, we are trying to estimate the death rate and the distribution of time from symptoms to death/recovery.
You might wonder, why do we want to estimate the death rate?
Isn’t it simply (2,803/67,103) = 4.2% on March 2nd, 2020, for Hubei, based on the data below?
It is not the best estimate, for two main reasons:
- This death rate is likely biased upwards.
Early in the epidemic outbreak, the confirmed cases only include people being officially diagnosed with severe symptoms. Others with minor symptoms or not diagnosed are not counted in the denominator.
- This death rate is also likely biased downwards.
There is a time gap between being infected and dying/recovering. So we can’t immediately observe the outcome (whether they live) for people infected.
This can have a big impact when the epidemic is growing exponentially.
For example, assume on day T = 1 there is 1 new case; T = 2 there are 2 new cases; T = 3 has 100 new cases.
The person infected on day T = 1 died on T = 3. The simple death rate at T = 3 would be 1 / (1 + 2 + 100), which is not accurate.
At that time point, we only know for sure the death rate of T = 1 is 1/1 = 100%.
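In code, the toy example works out like this (all numbers are the hypothetical ones above):

```python
# Cumulative deaths over cumulative confirmed understates the fatality
# of the cases whose outcome we can actually observe.
new_cases = {1: 1, 2: 2, 3: 100}   # new cases on days T = 1, 2, 3
deaths_by_day3 = 1                 # the single T = 1 case died on T = 3

naive_rate = deaths_by_day3 / sum(new_cases.values())
print(f"{naive_rate:.1%}")         # ~1.0%, dragged down by unresolved cases

resolved_rate = deaths_by_day3 / new_cases[1]
print(f"{resolved_rate:.0%}")      # 100% among cases old enough to resolve
```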
While we can’t address the first problem without extra data, we’ll be accounting for the second problem.
As for the time from symptoms to death/recovery, most articles provide summary information such as average time to death/recovery. We want to estimate the distribution to get a full picture.
To do these, we use hyperparameter optimization techniques from machine learning.
Hyperparameter optimization or tuning finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. — Wikipedia
As you can tell, the key is to find an optimal model first.
In the next step, we will smooth the data and find a target model whose parameters we’ll tune to fit our data.
Step #4: Smoothing the Data
Let’s take a closer look at our Hubei data throughout the time.
As you can see, the newly confirmed, deaths, and recoveries are all volatile. The numbers go up and down like rollercoasters.
Due to limited technology and other issues, it is normal for the data not to be smooth at the early stages of a virus outbreak.
But this is not ideal for our analysis.
Also, we would like to look at the whole time window of the virus infection: we want to know when people developed symptoms, rather than when they were officially confirmed.
Hence, we redistribute each day’s counts of these features (New_Confirmed, New_Deaths, New_Recovered) equally over a time window ending on that day.
A confirmed case today might have had symptoms days before; deaths and recoveries are smoothed out in the same way.
How should we redistribute?
This concept is easier to understand with an example. Assume our data looks like below, with T representing the day and New_Confirmed as the daily confirmed cases.
And we want to set the lookback window to be 5.
We start on day 1 of T = 1.
There is not enough history for the 5-day window to look back. But we can still do a 2-day redistribution.
The 10 cases at T = 1 can be divided into 2 parts of 5 cases each, added onto T = 0 and T = 1.
Similarly, for day 2 of T = 2, we can redistribute its 10 cases among 3 days.
And so on.
Now let’s go back to our analysis.
For the feature New_Confirmed, we set a window size of 10. For New_Deaths and New_Recovered, we set the window to be 3, since it usually takes longer to confirm a new infection than a death/recovery.
The detailed procedure is coded in Python below.
Since we pick a lookback window of 10 days, we remove the latest days, which are incomplete: cases confirmed after our data ends would also have been redistributed onto them.
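Since the original snippet isn’t reproduced here, a hedged sketch of the redistribution step could look like this (the `redistribute` helper and its toy input are our own illustration):

```python
import numpy as np

def redistribute(counts, window):
    """Spread each day's count equally over that day and the (window - 1)
    days before it; near the start of the series, use however many days
    are available, as in the 2-day and 3-day examples above."""
    counts = np.asarray(counts, dtype=float)
    out = np.zeros_like(counts)
    for t, c in enumerate(counts):
        w = min(window, t + 1)              # shrink the window at the start
        out[t - w + 1 : t + 1] += c / w
    return out

# Toy series: 0 cases on T = 0, then 10 cases on each of T = 1 and T = 2.
smoothed = redistribute([0, 10, 10, 0, 0], 5)
print(smoothed)   # counts pulled back toward earlier days; total preserved
# In the real analysis, the last (window - 1) days are then dropped, since
# cases confirmed after the series ends would also redistribute onto them.
```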
Let’s see what our data looks like after this smoothing process.
New_Confirmed, New_Deaths, and New_Recovered are all looking much better now.
Before moving on, let’s also remove any nulls from the dataset.
Step #5: Choosing the Optimal Model
What is a good statistical model that fits this data?
The Gamma distribution is a popular model for the waiting times between relevant events. It is also known for its flexible shape.
Let’s assume both the time from symptoms to death (time-to-death) and time from symptoms to recovery (time-to-recovery) follow Gamma distributions.
So we need to find out the values of these four parameters:
- ttd_scale: time-to-death scale parameter
- ttd_shape: time-to-death shape parameter
- ttr_scale: time-to-recovery scale parameter
- ttr_shape: time-to-recovery shape parameter
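As a quick illustration of the mechanics (the shape and scale values here are made up, not fitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (not fitted) parameters: time-to-death ~ Gamma(shape, scale).
ttd_shape, ttd_scale = 2.0, 7.0

# Days from symptom onset to death for 100,000 simulated fatal cases.
ttd_samples = rng.gamma(ttd_shape, ttd_scale, size=100_000)
print(ttd_samples.mean())   # mean of a Gamma is shape * scale = 14 days here
```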
And as you may recall, we are trying to estimate the death rate as well. Let’s include it as the last piece of the puzzle to solve.
Now to specify the hyperparameter tuning problem, we will run simulations to:
find a tuple of hyperparameters (ttd_scale, ttd_shape, ttr_scale, ttr_shape, and death_rate) that yields an optimal model (including Gamma distributions) which minimizes a predefined loss function (Mean Squared Error) on given independent data (COVID-19 dataset).
Among the hyperparameter optimization approaches, we will use random search.
Before deciding, we also experimented with more sophisticated algorithms. But they all narrowed in on one set of “best” parameters. We want to explore different possible scenarios and use our judgment on which make sense.
So we choose the random search approach, which can provide a wide range of scenarios.
Step #6: Running Simulations for Hyperparameter Tuning
For the simulation process, we make the below assumptions based on our judgment:
- ttd_shape, ttr_shape, ttd_scale, ttr_scale are between 0.1 and 100.
- death_rate is between 1% and 30%.
The simulation and optimization function run_sim is coded below.
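The original run_sim isn’t reproduced here; the sketch below is our own hedged reconstruction of what such a simulation-plus-loss function could look like, spreading each day’s cases forward in time with the two Gamma delay distributions:

```python
import numpy as np
from scipy import stats

def run_sim(params, new_confirmed, actual_deaths, actual_recovered):
    """Hedged sketch of a run_sim-style function: split each day's new
    cases into eventual deaths and recoveries by death_rate, spread them
    forward in time according to the two Gamma distributions, and score
    the prediction against observed daily counts with Mean Squared Error."""
    n = len(new_confirmed)
    days = np.arange(n + 1)
    # P(outcome lands d days after symptom onset), for d = 0 .. n-1.
    p_death = np.diff(stats.gamma.cdf(days, a=params['ttd_shape'],
                                      scale=params['ttd_scale']))
    p_recov = np.diff(stats.gamma.cdf(days, a=params['ttr_shape'],
                                      scale=params['ttr_scale']))

    pred_d = np.zeros(n)
    pred_r = np.zeros(n)
    for t, cases in enumerate(new_confirmed):
        horizon = n - t
        pred_d[t:] += cases * params['death_rate'] * p_death[:horizon]
        pred_r[t:] += cases * (1 - params['death_rate']) * p_recov[:horizon]

    return (np.mean((pred_d - actual_deaths) ** 2)
            + np.mean((pred_r - actual_recovered) ** 2))
```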
Then this run_sim function is run 20,000 times. The result of each simulation is saved in the test.txt file.
Within the procedure, we use the hyperopt package to apply the hyperparameter optimization techniques.
If you are interested in learning more about the package, please read Automated Machine Learning Hyperparameter Tuning in Python.
There are also other packages available for hyperparameter optimization. But we choose hyperopt for its popularity. Stay tuned for our future articles with more options!
Update: take a look at this article for a better package Hyperparameter Tuning with Python: Complete Step-by-Step Guide.
Next, leave your computer to “struggle” for ~2 hours to run the entire simulation!
Let’s take a look at the simulation results.
The data below shows the best 20 parameter combinations, i.e., those with the lowest loss.
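One way such a results table could be assembled (the test.txt layout assumed here, one comma-separated line of the five parameters plus the loss per simulation, is our own guess; the numbers are illustrative):

```python
from io import StringIO
import pandas as pd

# Stand-in for test.txt with three made-up simulation results.
sample = StringIO(
    "2.1,7.3,1.8,11.2,0.22,1500.0\n"
    "1.9,8.0,2.2,10.5,0.087,1620.0\n"
    "50.0,50.0,50.0,50.0,0.29,9999.0\n"
)
cols = ['ttd_shape', 'ttd_scale', 'ttr_shape', 'ttr_scale',
        'death_rate', 'loss']
results = pd.read_csv(sample, names=cols)

top = results.nsmallest(2, 'loss')   # best rows first; use 20 on the real file
print(top[['death_rate', 'loss']])
```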
Let’s examine the two best scenarios.
Both claim a death rate much higher than the naive 4.2%: 22% and 9%!
Step #7: Visualizing the Results
In this last step, let’s look at the details of the top two scenarios.
Scenario One: Death rate = 22%
Under this doomsday scenario, the death rate is as high as 22%.
While the prediction matched the actual numbers in the earlier days, the two diverged from early February.
So good news, the death rate is not as high as this scenario suggests.
The predicted recoveries were higher than the actual numbers at the beginning, but the two moved closer in later days.
Scenario Two: Death rate = 8.7%
Even though this scenario has a higher loss, its predicted deaths match the actual deaths more closely throughout the period.
Under this scenario, the predicted recoveries also track the actual numbers closely throughout.
So given the above observations, the second scenario is more likely to be close to reality, even for the hard-hit Hubei province.
Please note that the analysis and conclusions are all based on certain assumptions. And as mentioned in Step #3, we only corrected the downward bias from the time lag; the upward bias from uncounted mild cases remains, so the true death rate could be lower than estimated.
The only sure thing is that we all hope the coronavirus crisis passes soon!
Stay strong and healthy.
Thank you for reading. Leave a comment below to let us know your thoughts.