In this tutorial, we’ll help you understand the logistic regression algorithm in machine learning.
Logistic Regression is a popular algorithm for supervised classification problems. It’s relatively simple and easy to interpret, which makes it one of the first predictive algorithms a data scientist learns and applies.
Following this beginner-friendly tutorial, you’ll learn step-by-step:
- What is logistic regression in machine learning (ML).
- What odds, log odds, and the logistic function are.
- How to optimize the model using Maximum Likelihood Estimation and the cross entropy cost function.
- How to predict with the logistic model.
- And more.
Even if you’ve already learned logistic regression, this tutorial still serves as a helpful review.
Let’s get started!
- What is Logistic Regression in Machine Learning?
- Odds and log Odds: the Prerequisites
- Logistic Function: the Logistic Regression Model/Equation
- Maximum Likelihood Estimation: the Best Model Fit
- What is Maximum Likelihood Estimation?
- Log Likelihood Function in statistics
- Cost Function (Cross Entropy Loss) in machine learning
- Optimization Methods
- Model Interpretations
- How to use Logistic Regression Models to Predict?
- More for Logistic Regression Implementation
Before beginning our logistic regression tutorial, if you are not familiar with ML algorithms, please take a look at Machine Learning for Beginners: Overview of Algorithm Types.
Understanding linear regression is critical to studying logistic regression as well. Check out Linear Regression in Machine Learning: Practical Python Tutorial.
It’s also helpful to understand basic statistics such as probability theory. But we’ll try to explain with references or examples.
What is Logistic Regression in Machine Learning?
Within classification problems, we have a labeled training dataset consisting of input variables (X) and a categorical output variable (y). The logistic regression algorithm helps us to find the best fit logistic function to describe the relationship between X and y.
For the classic logistic regression, y is a binary variable with two possible values, such as win/loss, good/bad. Since y is binary, we often label classes as either 1 or 0, with 1 being the desired class of prediction.
When a new observation comes in, we can use its input variables and the fitted logistic relationship to predict the probability that the new case belongs to class y = 1. The formula for this probability given the input variables X is written below. Let’s denote it as p for simplicity.
P( y = 1 | X ) = p
How does this probability link to a classification problem?
This probability, ranging from 0 to 1, can be used as a criterion to classify the new observation. The higher the value of p, the more likely the new observation belongs to class y = 1, instead of y = 0.
For example, we can choose a cutoff threshold of 0.5. When p > 0.5, the new observation will be classified as y = 1, otherwise as y = 0.
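As a quick sketch, turning a predicted probability into a class label is a one-line comparison (the probabilities below are hypothetical, just to illustrate the cutoff):

```python
def classify(p, threshold=0.5):
    """Map a predicted probability p = P(y=1 | X) to a class label."""
    return 1 if p > threshold else 0

# Hypothetical predicted probabilities for two new observations
print(classify(0.82))  # above the 0.5 cutoff -> class 1
print(classify(0.31))  # below the cutoff -> class 0
```

In practice, the threshold itself is a tuning choice: lowering it favors catching more positives at the cost of more false alarms.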
Note that logistic regression generally means binary logistic regression with the binary target. This can be extended to model outputs with multiple classes such as win/tie/loss, dog/cat/fox/rabbit. In this tutorial, we only cover the binary logistic regression.
Some common applications for logistic regression include:
- fraud detection.
- customer churn prediction.
- cancer diagnosis.
Great, now we are ready to dive into the details of logistic regression.
Follow along, and you’ll get pieces of how logistic regression works explained!
Let’s start with the basics.
Odds and log Odds: the Prerequisites
What is the definition of odds in statistics?
Odds are the ratio of the probability of something happening to the probability of it not happening. It’s also a metric representing the likelihood of the event occurring.
For example, the odds of the observation belonging to class y = 1 is p/(1-p). When the odds are between 0 and 1, the odds are against the observation belonging to y = 1. When the odds are greater than 1, the odds are for the observation belonging to y = 1.
Or, it might be easier to think of odds in terms of gambling, when we bet money on an event to occur.
For example, let’s bet that a six will come up for a toss of a fair six-sided die. The probability of it happening is 1/6. So the odds in favor of us winning are (1/6) / (5/6) = 1/5 or 1:5. Or the odds of us losing are (5/6) / (1/6) = 5:1. The odds are clearly against us winning.
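The die example above can be checked in a couple of lines of Python, using `Fraction` to keep the ratios exact:

```python
from fractions import Fraction

p_win = Fraction(1, 6)           # probability of rolling a six
odds_win = p_win / (1 - p_win)   # odds in favor of winning
odds_lose = (1 - p_win) / p_win  # odds of losing

print(odds_win)   # 1/5, i.e., 1:5
print(odds_lose)  # 5, i.e., 5:1
```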
Why do we want to take log of odds?
Because we want to adapt the well-studied linear regression algorithm to classification problems. As mentioned, classification tasks output a probability p ranging from 0 to 1. If we can transform p into a quantity that ranges from -infinity to +infinity, we can model that quantity with linear regression.
The log-odds (logit) is the standard mapping used for this transformation.
Since p ranges from 0 to 1, the odds p/(1-p) range from 0 to +infinity. After the log transformation, log(odds) ranges from -infinity to +infinity!
Logistic Function: the Logistic Regression Model/Equation
Now with the transformation, we can model the log(odds) as a linear equation. Assuming we have explanatory variables x1, …, xm and coefficients w0, …, wm, the relationship can be written as below:
logit(p) = log(odds) = log(p/(1-p)) = w0 + w1*x1 + w2*x2 + … + wm*xm
If you understand linear regression, the logistic regression equation should look very familiar.
Now, how is this linked to the “logistic” function?
The standard logistic function is simply the inverse of the logit equation above. If we solve for p from the logit equation, the formula of the logistic function is below:
p = 1/(1 + e^(-(w0 + w1*x1 + w2*x2 + … + wm*xm)))
where e is the base of the natural logarithm.
The logistic function is a type of sigmoid function.
sigmoid(h) = 1/(1 + e^(-h))
where h = w0 + w1*x1 + w2*x2 + … + wm*xm for logistic function.
The logistic or sigmoid function has an S-shaped (sigmoid) curve, with the y-axis ranging from 0 to 1, as shown below.
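A minimal sketch of the two functions shows that sigmoid maps any real number into (0, 1) and that the logit and sigmoid are inverses of each other:

```python
import math

def sigmoid(h):
    """Logistic (sigmoid) function: maps (-inf, +inf) to (0, 1)."""
    return 1 / (1 + math.exp(-h))

def logit(p):
    """Log odds: maps (0, 1) back to (-inf, +inf)."""
    return math.log(p / (1 - p))

print(sigmoid(0))                     # 0.5: the midpoint of the S-curve
print(sigmoid(10), sigmoid(-10))      # close to 1 and close to 0
print(round(sigmoid(logit(0.3)), 6))  # 0.3: they are inverse functions
```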
Now that we know the logistic regression formula we are trying to fit, let’s see how to find the best fit equation.
Maximum Likelihood Estimation: the Best Model Fit
Like linear regression, the logistic regression algorithm finds the best values of coefficients (w0, w1, …, wm) to fit the training dataset.
How do we find the best fit model?
Can we use the same estimation method, (Ordinary) Least Squares (OLS), as linear regression?
The answer is NO. We have to use a different method.
OLS minimizes the sum of squared residuals, which involves the difference between the predicted output and the actual output. But the actual output in the logistic linear equation is log(p/(1-p)), and we can’t calculate its value since we don’t know p. The only output we observe is the class label, either y = 0 or y = 1. So we have to use another estimation method.
What is Maximum Likelihood Estimation?
The standard way to determine the best fit for logistic regression is maximum likelihood estimation (MLE).
In this estimation method, we use a likelihood function that measures how well a set of parameters fit a sample of data. The parameter values that maximize the likelihood function are the maximum likelihood estimates. In other words, the goal is to make inferences about the population that is most likely to have generated the training dataset.
In the logistic regression case, we want to find the estimates for the parameters w0, …, wm.
Let’s see a simple example with the following dataset:
| Observation # | Input x1 | Binary Output y |
|---------------|----------|-----------------|
| 0             | 0.5      | 0               |
| 1             | 1.0      | 0               |
| 2             | 0.65     | 0               |
| 3             | 0.75     | 1               |
| 4             | 1.2      | 1               |
With one input variable x1, the logistic regression formula becomes:
log(p/(1-p)) = w0 + w1*x1
p = 1/(1 + e^(-(w0 + w1*x1)))
Since y is binary with values 0 or 1, a Bernoulli random variable can be used to model its probability:
P(y=1) = p
P(y=0) = 1 - p
P(y) = (p^y)*(1-p)^(1-y)
with y being either 0 or 1
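A quick check confirms that the compact formula reduces to p when y = 1 and to 1 - p when y = 0:

```python
def bernoulli_pmf(y, p):
    """P(y) = p^y * (1-p)^(1-y) for y in {0, 1}."""
    return (p ** y) * ((1 - p) ** (1 - y))

p = 0.7  # a hypothetical probability of y = 1
print(bernoulli_pmf(1, p))  # 0.7, i.e., p
print(bernoulli_pmf(0, p))  # 0.3 (up to floating point), i.e., 1 - p
```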
This distribution formula is only for a single observation. How do we model the distribution of multiple observations like P(y0, y1, y2, y3, y4)?
Let’s assume these observations are mutually independent. Then we can write the joint distribution of the training dataset as:
P(y0, y1, y2, y3, y4) = P(y0) * P(y1) * P(y2) * P(y3) * P(y4)
To make it more specific, each observed y has a different probability of being 1. Let’s assume P(yi = 1) = pi for i = 0,1,2,3,4. Then we can rewrite the formula as below:
P(y0) * P(y1) * P(y2) * P(y3) * P(y4) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) *… * p4^(y4)*(1-p4)^(1-y4)
We can calculate the p estimate for each observation based on the logistic function formula:
- p0 = 1/(1 + e^(-(w0 + w1*0.5)))
- p1 = 1/(1 + e^(-(w0 + w1*1.0)))
- p2 = 1/(1 + e^(-(w0 + w1*0.65)))
- p3 = 1/(1 + e^(-(w0 + w1*0.75)))
- p4 = 1/(1 + e^(-(w0 + w1*1.2)))
We also have the values of the output variable y:
- y0 = 0
- y1 = 0
- y2 = 0
- y3 = 1
- y4 = 1
Log Likelihood Function in statistics
So we have all the p0 – p4 and y0 – y4 values from the training dataset. Our likelihood becomes a function of the parameters w0 and w1:
L(w0, w1) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) * … * p4^(y4)*(1-p4)^(1-y4)
The goal is to choose the values of w0 and w1 that result in the maximum likelihood based on the training dataset.
Note that it’s computationally more convenient to optimize the log-likelihood function. Since the natural logarithm is a strictly increasing function, the same w0 and w1 values that maximize L would also maximize l = log(L).
So in statistics, we often try to maximize the function below:
l(w0, w1) = log(L(w0, w1)) = y0*log(p0) + (1-y0)*log(1-p0) + y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-p4)
Cost Function (Cross Entropy Loss) in machine learning
In machine learning, however, we prefer to minimize cost/loss functions, so we often define the cost function as the negative of the average log-likelihood.
cost function = -avg(l(w0, w1)) = -1/5 * l(w0, w1) = -1/5 * (y0*log(p0) + (1-y0)*log(1-p0) + y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-p4))
Maximizing the (log) likelihood is the same as minimizing the cross entropy loss function.
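The cost function above can be sketched directly for our five observations. As a sanity check, at w0 = w1 = 0 every p is 0.5, so the cost is log(2) ≈ 0.693, and any better-fitting parameters should drive the cost lower:

```python
import math

# Example dataset from the tutorial
x = [0.5, 1.0, 0.65, 0.75, 1.2]
y = [0, 0, 0, 1, 1]

def cross_entropy_cost(w0, w1):
    """Negative average log-likelihood over the training dataset."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(w0 + w1 * xi)))  # logistic function
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(x)

print(cross_entropy_cost(0.0, 0.0))       # log(2) ~ 0.693
print(cross_entropy_cost(-4.411, 4.759))  # lower cost near the optimum
```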
Unlike OLS estimation for the linear regression, we don’t have a closed-form solution for the MLE. But we do know that the cost function is convex, which means a local minimum is also the global minimum.
To minimize this cost function, Python libraries such as scikit-learn (sklearn) use iterative numerical optimizers such as gradient descent or quasi-Newton methods. Since these solvers rely on gradients, it’s better to scale the input variables and/or use regularization to make the algorithm more stable.
We won’t walk through a full sklearn workflow in this tutorial. But by fitting scikit-learn’s LogisticRegression on our example dataset (with regularization turned off, since sklearn regularizes by default), we find the best estimates to be w0 = -4.411 and w1 = 4.759.
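For the curious, those estimates can be reproduced in a few lines of plain Python. The sketch below runs Newton’s method on the log-likelihood (this is one standard way to fit logistic regression, not necessarily what sklearn runs internally):

```python
import math

# Example dataset from the tutorial
x = [0.5, 1.0, 0.65, 0.75, 1.2]
y = [0, 0, 0, 1, 1]

w0, w1 = 0.0, 0.0
for _ in range(25):
    p = [1 / (1 + math.exp(-(w0 + w1 * xi))) for xi in x]
    # Gradient of the log-likelihood with respect to w0 and w1
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
    # Observed information (negative Hessian), a 2x2 matrix
    h00 = sum(pi * (1 - pi) for pi in p)
    h01 = sum(pi * (1 - pi) * xi for pi, xi in zip(p, x))
    h11 = sum(pi * (1 - pi) * xi * xi for pi, xi in zip(p, x))
    det = h00 * h11 - h01 * h01
    # Newton step: w <- w + H^(-1) * g, solving the 2x2 system by hand
    w0 += (h11 * g0 - h01 * g1) / det
    w1 += (h00 * g1 - h01 * g0) / det

print(round(w0, 3), round(w1, 3))  # approximately -4.411 and 4.759
```

Newton’s method converges quickly here because the cross entropy cost is convex; at the optimum, both gradient components are zero.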
We can plot the logistic regression with the sample dataset. As you can see, the output y only takes the two values 0 and 1, while the fitted logistic function has an S shape.
We can also make some interpretations with the parameter w1.
Recall that we have:
log(odds of y=1) = log(p/(1-p)) = w0 + w1*x1
where p = P(y = 1)
Since w1 = 4.759, a one-unit increase of x1 is expected to increase the log odds by 4.759.
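Equivalently, since adding w1 to the log odds multiplies the odds themselves by e^w1, each one-unit increase in x1 multiplies the odds of y = 1 by a constant factor. A quick sketch:

```python
import math

w1 = 4.759
odds_ratio = math.exp(w1)
# Each 1-unit increase in x1 multiplies the odds of y = 1 by ~116.6
print(round(odds_ratio, 1))
```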
How to use Logistic Regression Models to Predict?
As mentioned earlier, we often use logistic regression models for predictions.
Given a new observation, how would we predict which class y = 0 or 1 it belongs to?
For example, say a new observation has input variable x1 = 0.9. By using the logistic regression equation estimated from MLE, we can calculate the probability p that it belongs to class y = 1.
p = 1/(1 + e^(-(-4.411 + 4.759*0.9))) = 46.8%
If we use 50% as the threshold, we would predict that this observation is in class 0, since p < 50%.
Since the logistic regression has an S shape, the larger x1, the more likely the observation has class y = 1. What’s the threshold of x1 for us to classify the observation as y = 1?
At the threshold of probability p=50%, the odds are p/(1-p) = 50%/50% = 1. So the log(odds) = log(1) = 0.
Since log(odds) follows the linear equation, we have:
log(odds) = 0 = -4.411 + 4.759*x1
Solving for x1, we get 0.927. That’s the threshold of x1 for prediction, i.e., when x1 > 0.927, the observation will be classified as y = 1.
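The prediction steps above can be reproduced in a few lines, using the fitted estimates w0 = -4.411 and w1 = 4.759:

```python
import math

w0, w1 = -4.411, 4.759  # fitted estimates from the tutorial

def predict_proba(x1):
    """P(y = 1 | x1) from the fitted logistic model."""
    return 1 / (1 + math.exp(-(w0 + w1 * x1)))

p = predict_proba(0.9)
print(round(p, 3))  # 0.468 -> below 0.5, so predict class 0

# The decision boundary is where log(odds) = 0, i.e., w0 + w1*x1 = 0
boundary = -w0 / w1
print(round(boundary, 3))  # 0.927
```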
More for Logistic Regression Implementation
Similar to linear regression, logistic regression does have certain assumptions. But most of them can be relaxed with transformations, as long as the model provides reliable predictions. Some assumptions are listed below:
- Binary output/target: as mentioned at the beginning, logistic regression is for classification problems. We need to make sure the target is binary and transform it to values of 0 or 1.
- Linear relationship: the logistic algorithm makes use of the linear equation, so the same assumptions apply here.
- Independent inputs: highly correlated input variables (multicollinearity) can prevent the model from converging.
You’ve made it!
In this tutorial, you’ve learned a lot about logistic regression, a critical machine learning classification algorithm.
We may cover an application of logistic regression in Python in another tutorial. Stay tuned!
Leave a comment for any questions you may have or anything else.