# How to Improve Sports Betting Odds: Step by Step Guide in Python

The data science strategy I used to make \$20,000 betting on sports.

#### Lianne & Justin

When I was a student learning statistics, I tried applying data science techniques to sports betting. That year, I made a \$20,000 profit from it.

Sports betting could be more than using your gut feeling. You could have better odds by adding proper data analysis and predictive modeling.

This guide walks you through a step-by-step algorithm for betting smarter with Python, followed by a few tips for improving it further.

Let’s dive in!

To begin, let’s review the traditional statistics found on sports websites.

Imagine today is the final game of the season. Team A is going to face Team C. Which team do you think has better odds of winning the championship? Who would you place your bets on?

Below are the records of the playoff games involving Team A and Team C.

Sports websites such as NHL.com (National Hockey League) often provide statistics like these:

The traditional method of ranking the teams is by looking at Win-Loss %, i.e., the percentage of wins out of total games. In this case, Team A would tie with Team C since they both have the same winning percentage of 50%.

But when you look closer at the game data, Team C beat Team B by a much wider margin (7–0) than Team A did (1–0). So Team C should have a higher chance of beating Team A this time.

It is easy to study each result when there’s a small number of games. But when there are many teams and games, we need a systematic way to analyze the data.

How do we incorporate the details like goal differences from the past games to get a better ranking of teams?

And what if this final game takes place on Team A’s home turf? The home team usually benefits over the visiting team. With this extra piece of information, which team do you think has a higher chance of winning now?

How do we incorporate the home advantage when evaluating the game?

## Statistical method — step by step

To answer the questions above, we build a statistical model using NHL data (downloaded from the Hockey Reference website). You can adapt it to other sports as well.

The algorithm used to model the ratings is called the adjusted plus/minus rating system. You can read the detailed description of the system here. Or just implement it by following the three steps below.

## Step #1: Load the data

The data has 640 rows, which includes game results between Oct 2nd, 2019, and Jan 3rd, 2020. It has five variables — date, visitor, visitor_goals, home, and home_goals.

For example, the row below records the game on Dec 6th, 2019. The Montreal Canadiens (visiting team) played against the New York Rangers (home team) with a final score of 2–1.

• Read the data into Python
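The loading step can be sketched as below. Only the five column names come from the description above; the rows are made-up placeholders (Team A, B, C), and the file name in the comment is hypothetical.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the real file; only the column
# names are taken from the article.
sample_csv = io.StringIO(
    "date,visitor,visitor_goals,home,home_goals\n"
    "2019-10-02,Team B,0,Team A,1\n"
    "2019-10-03,Team B,0,Team C,7\n"
    "2019-10-05,Team A,2,Team B,3\n"
    "2019-10-06,Team C,1,Team B,2\n"
)

# With a real export you would pass a path instead, for example
# pd.read_csv("nhl_games.csv", parse_dates=["date"])  (hypothetical name).
df = pd.read_csv(sample_csv, parse_dates=["date"])

print(df.shape)   # (rows, columns)
print(df.dtypes)  # date should be parsed as a datetime column
```

On the real dataset, the shape check is a quick way to confirm you got the 640 rows and five variables described above.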

## Step #2: Transform the data

First, we create the goal_difference variable as home_goals minus visitor_goals. It is greater than 0 when the home team wins, less than 0 when the home team loses, and 0 when the two teams tie.

We also add two indicators home_win and home_loss to consider the home advantage impact on the teams.

The data looks like this:

• Transform the data
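Here is a minimal sketch of this transformation on made-up rows. The exact definitions of home_win and home_loss are an assumption on our part: we treat them as 1/0 flags based on the sign of goal_difference.

```python
import pandas as pd

# Made-up rows with the article's column names (not real NHL results).
df = pd.DataFrame({
    "visitor": ["Team B", "Team B", "Team A", "Team C"],
    "visitor_goals": [0, 0, 2, 3],
    "home": ["Team A", "Team C", "Team B", "Team B"],
    "home_goals": [1, 7, 3, 2],
})

# Positive when the home team wins, negative when it loses.
df["goal_difference"] = df["home_goals"] - df["visitor_goals"]

# Assumed definitions: 1/0 flags derived from the sign of goal_difference.
df["home_win"] = (df["goal_difference"] > 0).astype(int)
df["home_loss"] = (df["goal_difference"] < 0).astype(int)

print(df)
```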

Next, we create two dummy variable matrices df_visitor and df_home recording the visiting and home team.

The head of df_visitor looks like this:

It is a matrix with the team name as the column and each game’s visiting team dummy variable as the row. Row 0 has column Vancouver Canucks of value 1 and other columns of value 0. It shows that the visiting team in this particular game is the Vancouver Canucks.

The df_home matrix has the same structure but indicates the home team in each game.
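One way to build the two dummy matrices is pandas' `get_dummies`. The sketch below uses made-up games. The `reindex` call is our own precaution rather than something the article describes: it keeps both matrices on the same set of columns even if a team has only played at home (or only away) so far.

```python
import pandas as pd

# Made-up games; only the column names follow the article.
df = pd.DataFrame({
    "visitor": ["Team B", "Team B", "Team A", "Team C"],
    "home": ["Team A", "Team C", "Team B", "Team B"],
})

# Shared, sorted list of every team seen in either role.
teams = sorted(set(df["visitor"]) | set(df["home"]))

# One row per game, one column per team, 1 marks the team in that role.
df_visitor = pd.get_dummies(df["visitor"], dtype=int).reindex(columns=teams, fill_value=0)
df_home = pd.get_dummies(df["home"], dtype=int).reindex(columns=teams, fill_value=0)

print(df_visitor.head())
print(df_home.head())
```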

Next, we transform these two matrices further to become the final dataset.

• Combine previous results to get the final dataset

We subtract df_visitor from df_home to get the final dataset called df_model. In every row of df_model, the visiting team has a value of -1, the home team has a value of 1, and all other teams have a value of 0.

Also, we add back the variable goal_difference from the original dataset df.

The final dataset df_model looks like this:

For example, row 4 says Anaheim Ducks (home team) played against Arizona Coyotes (visiting team). And Anaheim Ducks (home team) won the game by one goal.

In this way, the final dataset includes information on both goal differences and the home advantage factor.
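Putting the pieces together, the construction of df_model could look like the sketch below. The games are made up, so the rows will not match the real NHL output described above.

```python
import pandas as pd

# Made-up games with the article's column names.
df = pd.DataFrame({
    "visitor": ["Team B", "Team B", "Team A", "Team C"],
    "visitor_goals": [0, 0, 2, 1],
    "home": ["Team A", "Team C", "Team B", "Team B"],
    "home_goals": [1, 7, 3, 2],
})
df["goal_difference"] = df["home_goals"] - df["visitor_goals"]

teams = sorted(set(df["visitor"]) | set(df["home"]))
df_visitor = pd.get_dummies(df["visitor"], dtype=int).reindex(columns=teams, fill_value=0)
df_home = pd.get_dummies(df["home"], dtype=int).reindex(columns=teams, fill_value=0)

# +1 marks the home team, -1 the visiting team, 0 every other team.
df_model = df_home.sub(df_visitor)

# Add back the target from the original dataset.
df_model["goal_difference"] = df["goal_difference"]

print(df_model)
```

A quick sanity check is that the team columns of every row sum to zero, since each game has exactly one home team and one visiting team.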

Now we are ready to feed the data into a model!

## Step #3: Build the predictive model

We use the ridge regression model as a demonstration.

It is a linear regression model with an additional penalty term. Because the independent variables are multicollinear, ordinary linear regression doesn’t produce stable results.
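If you like to see the penalty written out, here is one common way to state the ridge objective for this setup. The notation is ours, not the article's: $y_i$ is the goal_difference of game $i$, $x_i$ is that game's row of +1/-1/0 team indicators, $\beta_0$ is the intercept, $\beta$ holds one rating per team, and $\alpha \ge 0$ is the penalty strength.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta}
  \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2
  + \alpha \lVert \beta \rVert_2^2
```

Every row of team indicators contains one +1 and one -1, so the team columns always sum to zero. Adding the same constant to every team's rating leaves the fit unchanged, and that is the multicollinearity in question. The penalty term settles on one solution from that family.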

• Fit the ridge regression model

We use the goal_difference feature as the target variable.
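A minimal sketch of the fit with scikit-learn's `Ridge` follows. The games are made up and the `alpha` value is an assumption, so treat this as an illustration rather than the exact settings behind the article's results.

```python
import pandas as pd
from sklearn.linear_model import Ridge

# (visitor, visitor_goals, home, home_goals): made-up results.
games = [
    ("Team B", 0, "Team A", 1),
    ("Team B", 0, "Team C", 7),
    ("Team A", 2, "Team B", 3),
    ("Team C", 1, "Team B", 2),
]
teams = sorted({g[0] for g in games} | {g[2] for g in games})

# Build df_model directly: +1 home, -1 visitor, 0 otherwise.
rows = []
for visitor, vg, home, hg in games:
    row = dict.fromkeys(teams, 0)
    row[home], row[visitor] = 1, -1
    row["goal_difference"] = hg - vg
    rows.append(row)
df_model = pd.DataFrame(rows)

X = df_model.drop(columns="goal_difference")
y = df_model["goal_difference"]

# alpha sets the penalty strength; 0.001 is an assumed value to tune.
model = Ridge(alpha=0.001)
model.fit(X, y)

# One coefficient per team, sorted from strongest to weakest.
ratings = pd.Series(model.coef_, index=X.columns).sort_values(ascending=False)
print(ratings)
print("intercept:", model.intercept_)
```

In this design the intercept can be read as the expected home margin, in goals, between two equally rated teams.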

• Display the results

Let’s print the coefficients of the model.

The result is as below:

These coefficients of each team can be considered as the rating for each team.

The higher the coefficient/rating, the stronger the team.
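To turn ratings into a pick for a single game, you can plug them back into the regression equation: the predicted goal_difference is the intercept plus the home team's rating minus the visiting team's rating. The numbers below are hypothetical placeholders, not the fitted NHL values.

```python
# Hypothetical ratings and intercept, NOT the fitted NHL values.
ratings = {"Team A": -1.0, "Team B": -1.0, "Team C": 2.0}
home_edge = 2.5  # model intercept: expected home margin between equal teams

def predicted_margin(home, visitor):
    """Expected goal_difference (home minus visitor) under the linear model."""
    return home_edge + ratings[home] - ratings[visitor]

home, visitor = "Team A", "Team C"
margin = predicted_margin(home, visitor)
pick = home if margin > 0 else visitor
print(f"{margin:+.1f} goals, bet on {pick}")
```

A positive margin favors the home team and a negative one favors the visitor, in the same way as the sign of goal_difference.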

According to this model, Colorado Avalanche is the best team with the highest rating. My favorite Toronto Maple Leafs is approved as a good team by the model as well!

You did it!

That said, before applying this algorithm to your sports betting, let’s consider a couple of other things.

## How does this method compare to traditional methods?

The statistical method does seem more sophisticated than traditional methods. But how do the performances compare?

Let’s look at three other conventional methods:

### Method #1: Win-Loss %

As we talked about in the earlier section of this article, this is a fundamental statistic that often appears on sports websites. For each particular team, the win-loss % = Total games won/Total games played.
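As a quick illustration, here is win-loss % computed on a few made-up games in plain Python. The scores are invented so that every team lands on 50%, which mirrors the tie between Team A and Team C in the opening example.

```python
from collections import Counter

# (visitor, visitor_goals, home, home_goals): made-up results.
games = [
    ("Team B", 0, "Team A", 1),
    ("Team B", 0, "Team C", 7),
    ("Team A", 2, "Team B", 3),
    ("Team C", 1, "Team B", 2),
]

wins, played = Counter(), Counter()
for visitor, vg, home, hg in games:
    played[visitor] += 1
    played[home] += 1
    # Credit the win to whichever side scored more goals.
    wins[home if hg > vg else visitor] += 1

win_pct = {team: wins[team] / played[team] for team in played}
print(win_pct)
```

The 7–0 win and the 1–0 win count exactly the same here, which is the weakness the other methods try to fix.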

### Method #2: Home team win

As the name of this method suggests, it’s a bet of always choosing the home team to win.

### Method #3: Goal difference with home advantage

This is a more involved method that uses both goal difference and home advantage.

Yet, when coming up with a team rating, it does not consider the strength of the team’s opponents. The method with ridge regression would consider this because it looks at all the teams and all the games together.

***Skip these if you hate formulas***

First, for each particular team, we calculate:

Team Goal Difference Per Game = (Goals scored by the team - Goals allowed by the team) / (Games played by the team)

Next, we use all the past game results to obtain a single statistic for the home advantage across all teams:

Home Advantage Goal Difference = (Goals scored by all home teams - Goals scored by all visiting teams) / (Games played by all the teams)

With these statistics, we can predict whether the home or visiting team wins a particular game.

Take the example from the beginning again. Team A (home team) is going to play Team C (visiting team). We use the statistic below to predict the result:

Margin = Team A Goal Difference Per Game - Team C Goal Difference Per Game + Home Advantage Goal Difference

If Margin > 0, then we bet on Team A (home team) to win. If Margin < 0, we choose Team C (the visiting team).
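The three formulas above can be sketched in plain Python as follows. The games are made up, and we assume the denominator of the home advantage statistic is the total number of games in the dataset.

```python
# (visitor, visitor_goals, home, home_goals): made-up results.
games = [
    ("Team B", 0, "Team A", 1),
    ("Team B", 0, "Team C", 7),
    ("Team A", 2, "Team B", 3),
    ("Team C", 1, "Team B", 2),
]

goal_diff, played = {}, {}
home_goals = visitor_goals = 0
for visitor, vg, home, hg in games:
    goal_diff[home] = goal_diff.get(home, 0) + hg - vg
    goal_diff[visitor] = goal_diff.get(visitor, 0) + vg - hg
    played[home] = played.get(home, 0) + 1
    played[visitor] = played.get(visitor, 0) + 1
    home_goals += hg
    visitor_goals += vg

# Team Goal Difference Per Game
per_game = {team: goal_diff[team] / played[team] for team in played}

# Home Advantage Goal Difference (assumption: divide by the number of games)
home_advantage = (home_goals - visitor_goals) / len(games)

# Team A hosts Team C, as in the opening example.
home, visitor = "Team A", "Team C"
margin = per_game[home] - per_game[visitor] + home_advantage
pick = home if margin > 0 else visitor
print(per_game, home_advantage, margin, pick)
```

On these invented scores, Team C's big win outweighs Team A's home edge, so the rule picks the visitor.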

******************************

To compare these methods, we use cross-validation for evaluation.
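The article does not say which cross-validation scheme was used, so the sketch below is only one plausible setup: a shuffled k-fold split written in plain Python, applied to Method #2 and Method #3. The season is synthetic (hidden team strengths and noise are invented), so the printed accuracies say nothing about real NHL games.

```python
import random

def home_team_rule(train, test):
    """Method #2: always predict a home win (training games are ignored)."""
    return [True] * len(test)

def goal_difference_rule(train, test):
    """Method #3: per-team goal difference plus a league-wide home edge."""
    diff, played = {}, {}
    for visitor, vg, home, hg in train:
        diff[home] = diff.get(home, 0) + hg - vg
        diff[visitor] = diff.get(visitor, 0) + vg - hg
        played[home] = played.get(home, 0) + 1
        played[visitor] = played.get(visitor, 0) + 1
    home_edge = sum(hg - vg for _, vg, _, hg in train) / len(train)

    def per_game(team):
        # A team missing from the training folds gets a neutral 0.
        return diff[team] / played[team] if team in played else 0.0

    return [per_game(home) - per_game(visitor) + home_edge > 0
            for visitor, _, home, _ in test]

def cross_validate(games, rule, k=5, seed=0):
    """Share of held-out games whose winner the rule predicts correctly."""
    shuffled = games[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    correct = 0
    for i, test in enumerate(folds):
        train = [g for j, fold in enumerate(folds) if j != i for g in fold]
        predictions = rule(train, test)
        correct += sum(pred == (hg > vg)
                       for pred, (_, vg, _, hg) in zip(predictions, test))
    return correct / len(games)

# Synthetic season: strengths, home edge and noise are invented for the demo.
rng = random.Random(42)
strength = {"Team A": 0.0, "Team B": -1.0, "Team C": 1.0, "Team D": 0.5}
games = []
for _ in range(200):
    visitor, home = rng.sample(list(strength), 2)
    raw = strength[home] - strength[visitor] + 0.3 + rng.gauss(0, 2)
    margin = round(raw) or (1 if raw > 0 else -1)  # force a winner, no ties
    base = rng.randint(0, 3)
    vg, hg = (base, base + margin) if margin > 0 else (base - margin, base)
    games.append((visitor, vg, home, hg))

for rule in (home_team_rule, goal_difference_rule):
    print(rule.__name__, round(cross_validate(games, rule), 3))
```

Any other method, including the ridge model, can be plugged in as another `rule(train, test)` function that returns one home-win prediction per test game. Because games are ordered in time, a split that only trains on earlier games may be a fairer test than shuffling.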

Our statistical model is the winner!

It had a 60% accuracy rate of predicting hockey game results!

But early in the season, it’s better to rely on other metrics. The model only pulls ahead of the other methods as the season progresses and more data becomes available.

## What are other tips that could improve the results further?

There is, of course, still room to improve our prediction results.

### Tip #1: Consider the schedule of the team in recent days

You could add variables considering the recent schedule of the teams. Did the team play games or rest within the last few days? Did the team travel a lot outside the home location?
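As one concrete version of this tip, the sketch below builds a back-to-back indicator in plain Python. The sign convention follows the authors' reply in the comments: 1 if only the home team played the day before, -1 if only the visiting team did, and 0 otherwise. The schedule is made up, and "back to back" is assumed to mean a game on the previous calendar day.

```python
from datetime import date, timedelta

# (date, visitor, home): made-up schedule.
games = [
    (date(2019, 10, 2), "Team B", "Team A"),
    (date(2019, 10, 3), "Team B", "Team C"),
    (date(2019, 10, 5), "Team A", "Team B"),
    (date(2019, 10, 6), "Team C", "Team B"),
]

def back_to_back_feature(games):
    """Return one value per game, in date order: 1, -1 or 0."""
    last_played = {}
    feature = []
    for day, visitor, home in sorted(games):
        yesterday = day - timedelta(days=1)
        home_tired = last_played.get(home) == yesterday
        visitor_tired = last_played.get(visitor) == yesterday
        # 1: only home is tired, -1: only visitor is tired, 0: both or neither.
        feature.append(int(home_tired) - int(visitor_tired))
        last_played[home] = last_played[visitor] = day
    return feature

rest = back_to_back_feature(games)
print(rest)
```

The resulting list can be added as one more column next to the team indicators before fitting the model.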

### Tip #2: Weigh game results

The team’s situation always shifts across the season. So the recent games should be more informative compared to the earlier ones. Adding an indicator for that would help.
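One simple way to favor recent games is to give each game a weight that decays with its age. The sketch below uses an exponential decay with a 30-day half-life. Both the decay form and the half-life are our assumptions, not something the article specifies.

```python
from datetime import date

def recency_weights(game_dates, half_life_days=30.0):
    """Weight 1.0 for the latest game, halving every `half_life_days`."""
    latest = max(game_dates)
    return [0.5 ** ((latest - d).days / half_life_days) for d in game_dates]

# Made-up game dates within the article's data window.
dates = [date(2019, 10, 2), date(2019, 12, 4), date(2020, 1, 3)]
weights = recency_weights(dates)
print([round(w, 3) for w in weights])
```

Many estimators accept such weights at fit time. For example, scikit-learn's `Ridge.fit` takes a `sample_weight` argument.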

### Tip #3: Use different models

We used the ridge regression model as an example. Yet, for better results, you could test and combine other machine learning or statistical models, such as neural networks or gradient boosting machines (GBM).

The models can’t incorporate all the information. As an experienced sports fan, you must have valuable knowledge. Combining both the statistical methods and your experience is crucial to making better predictions.

Sports betting is an excellent way of practicing data science skills while having fun.

Fit the model before the chip drops!

Good luck, everyone!

Thank you for reading. I hope you found this sports betting guide helpful.

Now I’d like to hear what you have to say. Feel free to leave a comment below right now.

### 9 thoughts on “How to Improve Sports Betting Odds: Step by Step Guide in Python”

1. Hi,

I follow your blog and I love your posts! Even though I’m currently learning data science and doing some courses online, I still don’t get how you come up with these interesting ideas. I hope you keep doing it! Thank you

2. Thanks for the article!
Where did the date go in your model?
nevermind … I think there is one row for each game, yes?

3. I have a question about the parameter estimates. How do we use these in a similar fashion to your simple example below? Say for example, the Panthers play the Maple Leafs, do we subtract those two estimates?

Use the example at the beginning again. Team A (home team) is going to play Team C (visiting team). We use the below statistic to predict the result:

Margin = Team A Goal Difference Per Game — Team C Goal Difference Per Game + Home Advantage Goal Difference

If Margin > 0, then we bet on Team A (home team) to win. If Margin < 0, we choose Team C (the visiting team).

1. I think I answered my own question. I see how they are used in a regression equation and are essentially adjusted average plus/minus estimates. The example I mixed up with a different approach. I read that section wrong. This is great stuff!

4. Great info! I have one question about adding in different variable such as days of rest, etc like you mentioned. Would this new variable be defined as a new column along with the “X” data that contains whether the teams were home/away or would you be using this info to weigh those 1’s and -1’s based on the additional variables?

Thanks!

1. Hi Sam,

One way to do it would be to make a new variable in the X data. If one team has a rest disadvantage over the other, you would indicate it as 1 or -1.

For example, if the away team is playing back to back, but the home team is not, we can set that variable as -1. If the home team is the one playing back to back, that variable will be set as 1. If both are playing back to back it would be zero.

Hope this helps,
Lianne and Justin
