In this guide, we’ll show a logistic regression example in Python, step-by-step.
Logistic regression is a popular machine learning algorithm for supervised classification problems. In a previous tutorial, we explained the logistic regression model and its related concepts. Following this tutorial, you’ll see the full process of applying it with Python and scikit-learn (sklearn), including:
- How to explore, clean, and transform the data.
- How to split into training and test datasets.
- How to fit, evaluate, and interpret the model.
If you want to apply logistic regression in your next ML Python project, you’ll love this practical, real-world example.
- Step #1: Import Python Libraries
- Step #2: Explore and Clean the Data
- Step #3: Transform the Categorical Variables: Creating Dummy Variables
- Step #4: Split Training and Test Datasets
- Step #5: Transform the Numerical Variables: Scaling
- Step #6: Fit the Logistic Regression Model
- Step #7: Evaluate the Model
- Step #8: Interpret the Results
This logistic regression tutorial assumes you have basic knowledge of machine learning and Python. If not, please check out the below resources:
- Machine Learning for Beginners: Overview of Algorithm Types
- Logistic Regression for Machine Learning: Complete Tutorial
- FREE Python crash course (Python Basics)
- Learn Python Pandas for Data Science: Quick Tutorial (Python Pandas)
- Python NumPy Tutorial: Practical Basics for Data Science (Python NumPy)
Once you are ready, try following the steps below and practice on your Python environment!
Step #1: Import Python Libraries
Before starting the analysis, let’s import the necessary Python packages:
- Pandas – a powerful tool for data analysis and manipulation.
- NumPy – the fundamental package for scientific computing.
- Scikit Learn (sklearn) – a popular tool for machine learning.
Don’t worry about the detailed usage of these functions. You’ll see them in action soon.
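The imports could look roughly like the sketch below. The original import cell is not shown here, so the exact selection is an assumption based on the functions named in the later steps.

```python
# Data handling
import numpy as np
import pandas as pd

# Modeling, splitting, scaling, and evaluation helpers from scikit-learn
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```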
Step #2: Explore and Clean the Data
The dataset we are going to use is a Heart Attack dataset from Kaggle. The goal of the project is to predict the binary target: whether the patient has heart disease or not.
After downloading the csv file, we can use read_csv to load the data as a pandas DataFrame. We also specify na_values='?' since question marks represent missing values in the dataset.
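As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV instead of the real download, so it runs anywhere. The two sample rows are made up for illustration, and the file name mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Two made-up rows in a similar layout; '?' marks a missing value.
sample_csv = io.StringIO(
    "age,sex,chol,thal\n"
    "28,1,132,?\n"
    "29,1,?,6\n"
)

# With the real download you would pass the file path instead of sample_csv
# (for example "data.csv", a hypothetical name).
df = pd.read_csv(sample_csv, na_values="?")

# Count the missing values per column after '?' has been converted to NaN.
print(df.isna().sum())
```

Because '?' is converted on load, the affected columns are parsed as numbers rather than strings.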
First, let’s take a look at the variables by calling the columns of the dataset.
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num '], dtype='object')
This corresponds to the documentation on Kaggle that 14 variables are available for analysis.
‘num ‘ is the target: a value of 1 indicates the presence of heart disease in the patient, and 0 indicates its absence. Note that the column name carries trailing whitespace.
Let’s rename the target variable num to target, and also print out the classes and their counts.
0    188
1    106
Name: target, dtype: int64
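The rename and the class counts can be sketched as follows on a toy frame. Stripping whitespace from every header before renaming is an assumption of this sketch, chosen so the exact number of trailing spaces in the original column name does not matter.

```python
import pandas as pd

# Toy frame; the target header carries trailing spaces, as in the printed Index.
df = pd.DataFrame({"age": [40, 52, 61, 45], "num   ": [0, 1, 0, 0]})

# Strip whitespace from every header, then rename 'num' to 'target'.
df = df.rename(columns=lambda name: name.strip())
df = df.rename(columns={"num": "target"})

# Number of rows in each class of the target.
counts = df["target"].value_counts()
print(counts)
```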
We can see that the classes are somewhat imbalanced (about 64% zeros and 36% ones), but not severely, so we’ll proceed without special adjustment.
Next, let’s take a look at the summary information of the dataset.
As you can see, there are 294 observations in the dataset and 13 other features besides target.
To keep the cleaning process simple, we’ll remove:
- the columns with many missing values, which are slope, ca, thal.
- the rows with missing values.
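The two removal steps above can be sketched like this. The frame is a made-up miniature, not the Kaggle data.

```python
import numpy as np
import pandas as pd

# Made-up miniature with the same kind of gaps as the real dataset.
df = pd.DataFrame({
    "age": [40, 52, 61, 45],
    "chol": [180.0, np.nan, 240.0, 210.0],
    "slope": [np.nan, np.nan, 2.0, np.nan],
    "ca": [np.nan, np.nan, np.nan, np.nan],
    "thal": [np.nan, 6.0, np.nan, np.nan],
    "target": [0, 1, 0, 1],
})

# Drop the mostly-missing columns first, then any row that still has a gap.
df = df.drop(columns=["slope", "ca", "thal"])
df = df.dropna()

print(df.shape)
```

Dropping the sparse columns first matters: calling dropna on the full frame would discard almost every row.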
Let’s recheck the summary to make sure the dataset is cleaned.
The ten features we’ll be using are:
- age: age in years
- sex: sex (1 = male; 0 = female)
- cp: chest pain type
– 1: typical angina
– 2: atypical angina
– 3: non-anginal pain
– 4: asymptomatic
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- restecg: resting electrocardiographic results
– 0: normal
– 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
– 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
We can also take a quick look at the data itself by printing out the dataset.
We have five categorical variables (sex, cp, fbs, restecg, and exang); the remaining five (age, trestbps, chol, thalach, and oldpeak) are numerical.
In practice, more data cleaning and exploration should be done. For more, check out these tutorials:
How to use Python Seaborn for Exploratory Data Analysis
Data Cleaning in Python: the Ultimate Guide
Step #3: Transform the Categorical Variables: Creating Dummy Variables
When fitting logistic regression, we often transform the categorical variables into dummy variables.
In logistic regression models, encoding all of the independent variables as dummy variables allows easy interpretation and calculation of the odds ratios, and increases the stability and significance of the coefficients.
UCLA: A SMART GUIDE TO DUMMY VARIABLES: FOUR APPLICATIONS AND A MACRO
Among the five categorical variables, sex, fbs, and exang only have two levels of 0 and 1, so they are already in the dummy variable format. But we still need to convert cp and restecg into dummy variables.
Let’s take a closer look at these two variables.
There are four classes for cp and three for restecg.
We can use the get_dummies function to convert them into dummy variables. The drop_first parameter is set to True so that the unnecessary first level dummy variable is removed.
As shown, the variable cp is now represented by three dummy variables cp_2, cp_3, and cp_4. cp_1 was removed since it’s not necessary to distinguish the classes of cp.
- when cp = 1: cp_2 = 0, cp_3 = 0, cp_4 = 0.
- when cp = 2: cp_2 = 1, cp_3 = 0, cp_4 = 0.
- when cp = 3: cp_2 = 0, cp_3 = 1, cp_4 = 0.
- when cp = 4: cp_2 = 0, cp_3 = 0, cp_4 = 1.
Similarly, the variable restecg is now represented by two dummy variables restecg_1.0 and restecg_2.0.
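A minimal sketch of the conversion on a toy frame is below. It assumes cp is stored as integers and restecg as floats, which would explain the restecg_1.0 and restecg_2.0 column names.

```python
import pandas as pd

# Toy frame: cp as integers, restecg as floats (assumed dtypes).
df = pd.DataFrame({
    "cp": [1, 2, 3, 4],
    "restecg": [0.0, 1.0, 2.0, 0.0],
    "sex": [1, 0, 1, 1],
})

# drop_first=True removes the first level of each variable (cp_1, restecg_0.0),
# which then acts as the reference level.
df = pd.get_dummies(df, columns=["cp", "restecg"], drop_first=True)

print(df.columns.tolist())
```

The first row (cp = 1) ends up with cp_2, cp_3, and cp_4 all equal to zero, matching the list above.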
To recap, we can print out the numeric columns and categorical columns as numeric_cols and cat_cols below.
['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
['cp_2', 'cp_3', 'cp_4', 'exang', 'fbs', 'restecg_1.0', 'restecg_2.0', 'sex']
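One way to build these two lists is sketched below. The column order in `columns` is an assumption; deriving cat_cols as a sorted set difference is also a guess, suggested by the alphabetical order of the printed list.

```python
# Column names after the dummy-variable step (order assumed for illustration).
columns = ["age", "sex", "trestbps", "chol", "fbs", "thalach", "exang",
           "oldpeak", "target", "cp_2", "cp_3", "cp_4",
           "restecg_1.0", "restecg_2.0"]

numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# Everything that is neither numeric nor the target is treated as categorical.
cat_cols = sorted(set(columns) - set(numeric_cols) - {"target"})

print(numeric_cols)
print(cat_cols)
```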
Step #4: Split Training and Test Datasets
To make sure the fitted model can be generalized to unseen data, we always train it using some data while evaluating the model using the holdout data. So we need to split the original dataset into training and test datasets.
To do this, we can use the train_test_split function with the below specifications:
- test_size = 0.2: keep 20% of the original dataset as the test dataset, i.e., 80% as the training dataset.
- stratify=df['target']: when the dataset is imbalanced, it’s good practice to do stratified sampling. In this way, both the training and test datasets will have similar proportions of the target classes as the complete dataset.
To verify the specifications, we can print out the shapes and the classes of target for both the training and test sets.
(208, 14)
(53, 14)
0    0.625
1    0.375
Name: target, dtype: float64
0    0.622642
1    0.377358
Name: target, dtype: float64
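The split and the checks can be sketched as follows on a toy frame with a similar class ratio. The random_state value is an assumption added for reproducibility; the seed used in the original is not known.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 25 zeros and 15 ones, i.e. 62.5% / 37.5%.
df = pd.DataFrame({
    "age": range(40),
    "target": [0] * 25 + [1] * 15,
})

# Stratified 80/20 split; random_state is an assumed seed.
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

print(df_train.shape, df_test.shape)
print(df_train["target"].value_counts(normalize=True))
print(df_test["target"].value_counts(normalize=True))
```

With stratification, both parts keep the 62.5% / 37.5% class mix of the full toy frame.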
Step #5: Transform the Numerical Variables: Scaling
Before fitting the model, let’s also scale the numerical variables, which is another common practice in machine learning.
After creating an instance of StandardScaler, we calculate (fit) the mean and standard deviation for scaling using df_train’s numeric_cols. Then we create a function get_features_and_target_arrays that:
- performs standardization on the numeric_cols of df to return the new array X_numeric_scaled.
- transforms cat_cols to a NumPy array X_categorical.
- combines both arrays back to the entire feature array X.
- assigns the target column to y.
Then we can apply this function to the training dataset to output our training feature and target, X and y.
This step has to be done after the train-test split, since the scaling parameters must be calculated from the training dataset only. Otherwise, information from the test dataset would leak into the model.
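A minimal sketch of this step is below. The function name comes from the text, but its signature, the order in which the two arrays are stacked, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric_cols = ["age", "chol"]
cat_cols = ["sex"]

# Made-up training rows for illustration.
df_train = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "chol": [180.0, 200.0, 220.0, 240.0],
    "sex": [1, 0, 1, 0],
    "target": [0, 0, 1, 1],
})

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler()
scaler.fit(df_train[numeric_cols])


def get_features_and_target_arrays(df, numeric_cols, cat_cols, scaler):
    """Return (X, y); the signature and stacking order are assumed."""
    X_numeric_scaled = scaler.transform(df[numeric_cols])  # standardize
    X_categorical = df[cat_cols].to_numpy()                # dummies as-is
    X = np.hstack((X_categorical, X_numeric_scaled))       # combine features
    y = df["target"]
    return X, y


X, y = get_features_and_target_arrays(df_train, numeric_cols, cat_cols, scaler)
print(X.shape)
```

The same fitted scaler is later passed in again for the test dataset, so both sets are scaled with the training mean and standard deviation.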
Step #6: Fit the Logistic Regression Model
Finally, we can fit the logistic regression in Python on our example dataset.
We first create an instance clf of the class LogisticRegression. Then we can fit it using the training dataset.
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='none', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
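The fitting step can be sketched as below on synthetic data. The printed model shows penalty='none', i.e. no regularization. Because the spelling of that option has changed across scikit-learn releases, this sketch swaps in a very large C instead, which makes the default L2 penalty negligible. That substitution is an assumption of the sketch, not the original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled training arrays X and y.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

# Very large C: the L2 penalty is effectively switched off (assumed
# approximation of the unpenalized model printed above).
clf = LogisticRegression(C=1e9, max_iter=1000)
clf.fit(X, y)

# One coefficient per feature, plus a single intercept.
print(clf.coef_.shape, clf.intercept_.shape)
```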
At this point, we have the logistic regression model for our example in Python!
Step #7: Evaluate the Model
After fitting the model, let’s look at some popular evaluation metrics for the dataset.
Further Reading: If you are not familiar with the evaluation metrics, check out 8 popular Evaluation Metrics for Machine Learning Models.
Before starting, we need to get the scaled test dataset.
We can plot the ROC curve.
We can also plot the precision-recall curve.
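As a hedged sketch, the points of both curves can be computed with roc_curve and precision_recall_curve and then passed to any plotting library. The data here is synthetic, and the plotting helper used in the original is not known.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted P(target = 1)

# ROC curve: false positive rate vs. true positive rate per threshold.
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)

# Precision-recall curve: precision vs. recall per threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

print(fpr[:3], tpr[:3])
```

Plotting fpr against tpr gives the ROC curve, and plotting recall against precision gives the precision-recall curve.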
To calculate other metrics, we need to get the prediction results from the test dataset:
- predict_proba to get the predicted probability of the logistic regression for each class in the model.
The first column of the output of predict_proba is P(target = 0), and the second column is P(target = 1). So we are calling for the second column by its index position 1.
- predict the test dataset labels by choosing the class with the highest probability, which means a threshold of 0.5 in this binary example.
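The relationship between the two calls can be checked with a short sketch on synthetic data, shown below. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)   # columns follow clf.classes_, here [0, 1]
p_positive = proba[:, 1]       # second column: P(target = 1)
y_pred = clf.predict(X)        # class with the highest probability

print(clf.classes_)
# In the binary case, predict agrees with thresholding P(target = 1) at 0.5.
print(np.array_equal(y_pred, (p_positive > 0.5).astype(int)))
```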
Using the below Python code, we can calculate some other evaluation metrics:
- Log loss
- Average Precision
- F1 score
- Classification report, which contains some of the above plus extra information
Please read the scikit-learn documentation for details.
Log loss = 0.35613
AUC = 0.92424
Average Precision = 0.89045
Using 0.5 as threshold:
Accuracy = 0.83019
Precision = 0.76190
Recall = 0.80000
F1 score = 0.78049
Classification Report
              precision    recall  f1-score   support
           0       0.88      0.85      0.86        33
           1       0.76      0.80      0.78        20
    accuracy                           0.83        53
   macro avg       0.82      0.82      0.82        53
weighted avg       0.83      0.83      0.83        53
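A sketch of how such a printout can be produced with sklearn.metrics is below. The labels and probabilities are made up, so the numbers will not match the output above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             classification_report, f1_score, log_loss,
                             precision_score, recall_score, roc_auc_score)

# Made-up test labels and predicted P(target = 1).
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9])
y_pred = (y_proba > 0.5).astype(int)  # labels at a threshold of 0.5

# Threshold-free metrics use the probabilities.
auc = roc_auc_score(y_test, y_proba)
print(f"Log loss = {log_loss(y_test, y_proba):.5f}")
print(f"AUC = {auc:.5f}")
print(f"Average Precision = {average_precision_score(y_test, y_proba):.5f}")

# Threshold-based metrics use the predicted labels.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Using 0.5 as threshold:")
print(f"Accuracy = {accuracy:.5f}")
print(f"Precision = {precision:.5f}")
print(f"Recall = {recall:.5f}")
print(f"F1 score = {f1:.5f}")

print("Classification Report")
print(classification_report(y_test, y_pred))
```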
Also, it’s a good idea to get the metrics for the training set for comparison, which we’ll not show in this tutorial. For example, if the training set gives accuracy that’s much higher than the test dataset, there could be overfitting.
To show the confusion matrix, we can plot a heatmap, which is also based on a threshold of 0.5 for binary classification.
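The heatmap itself is not reproduced here, but the underlying counts can be back-calculated from the classification report: recall 0.80 on 20 positives gives 16 true positives, precision 0.7619 then implies 5 false positives, and the 33 negatives split into 28 and 5. The sketch below rebuilds label arrays with those counts (a reconstruction, not the original predictions) and checks them against the reported metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Counts back-calculated from the classification report (reconstruction).
tn, fp, fn, tp = 28, 5, 4, 16

# Label arrays that reproduce exactly these counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"{accuracy:.5f} {precision:.5f} {recall:.5f}")
```

The three printed values agree with the accuracy, precision, and recall reported above.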
Step #8: Interpret the Results
In the last step, let’s interpret the results for our example logistic regression model. We’ll cover both the categorical feature and the numerical feature.
For the categorical feature sex, this fitted model says that, holding all the other features at fixed values, the ratio of the odds of having heart disease for males (sex = 1) to the odds for females (sex = 0) is exp(1.290292) ≈ 3.63. You can derive it from the logistic regression equation: when sex goes from 0 to 1, the log-odds change by the coefficient, so the odds are multiplied by exp(coefficient).
For the categorical feature cp (chest pain type), we created dummy variables with typical angina (cp = 1) as the reference level. So the odds ratio of atypical angina (cp = 2) to typical angina (cp = 1) is exp(-2.895253) ≈ 0.055.
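The two odds ratios can be computed directly from the coefficients quoted above. In a fitted model the values would come from clf.coef_; here they are typed in by hand.

```python
import math

coef_sex = 1.290292    # coefficient for sex, as quoted in the text
coef_cp_2 = -2.895253  # coefficient for cp_2, as quoted in the text

# Exponentiating a coefficient turns a log-odds difference into an odds ratio.
odds_ratio_sex = math.exp(coef_sex)
odds_ratio_cp_2 = math.exp(coef_cp_2)

print(f"sex (male vs. female): {odds_ratio_sex:.2f}")
print(f"cp_2 (atypical vs. typical angina): {odds_ratio_cp_2:.3f}")
```

An odds ratio above 1 means higher odds than the reference level, and a value below 1 means lower odds.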
Since the numerical variables are scaled by StandardScaler, we need to think of them in terms of standard deviations. Let’s first print out the numeric variables and their sample standard deviations.
For example, holding other variables fixed, there is a 41% increase in the odds of having heart disease for every one-standard-deviation increase in cholesterol (63.470764 mg/dl), since exp(0.345501) ≈ 1.41.
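The arithmetic can be sketched as follows. The conversion to a per-mg/dl effect is an addition for illustration: it assumes the scaler divided cholesterol by the standard deviation quoted above, so the coefficient per original unit is the scaled coefficient divided by that standard deviation.

```python
import math

coef_chol = 0.345501  # coefficient on standardized cholesterol (from the text)
sd_chol = 63.470764   # one standard deviation of cholesterol (from the text)

# Odds multiplier for a one-standard-deviation increase.
or_per_sd = math.exp(coef_chol)

# Odds multiplier for a 1 mg/dl increase (assumes scaling by sd_chol).
or_per_unit = math.exp(coef_chol / sd_chol)

print(f"per standard deviation: {or_per_sd:.4f}")
print(f"per mg/dl: {or_per_unit:.5f}")
```

The percentage increase in odds is the multiplier minus 1, which is where the 41% figure comes from.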
That’s it. You’ve discovered the general procedures of fitting logistic regression models with an example in Python.
Try to apply it to your next classification problem!
Leave a comment for any questions you may have or anything else.
14 thoughts on “Logistic Regression Example in Python: Step-by-Step Guide”
I get valueerror when fitting:
ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’).
Based on the message it looks like your dataset has missing values in it. Try removing them to see if it works for you.
Can you explain the exp(0.345501) = 1.41. What are the units for these numbers or better yet how are you getting this? I recreated a model from this example and it predicts really well but i cannot interpret the units and coefficients properly from your explanations. Thank you
Oops, I apologize. I just realized we are computing e^0.345501 - 1 ≈ 0.41 to get the 41% increase in odds. Sorry about the confusion
No problem Adam. Thanks for reading
Hi! I would like to ask how we can obtain the value 0.345501 from 63.470764 in the interpretation part for cholesterol?
Can you please put the code? It would be helpful
Hi Andre, there is code (GitHub Gists embedded). Maybe try a different device to view it.
What I want to know is whether each coefficient (or the adjusted OR derived from the coefficient) is significant or not. Is there a function that will generate the p-value for the categorical and numeric variables? Thanks!
Hi Anita, you can take a look at the statsmodels package.
Thanks for sharing your work! What should I do if I have 3 values in my target column?
Thanks in advance!
If you have more than two values in your target, it’s called multinomial logistic regression. The theory is similar to what we wrote in the post, but with multiple sets of coefficients. You can read more about it here.
If you just want to fit the model, you can just use the LogisticRegression object in scikit learn but with the target having more levels to it. Try it out!
But can you explain how we can obtain the value 0.345501 at the interpretation part of cholesterol?
Hi Yeo, we used the standard scaler to scale the features, before fitting the model. So when we look at the coefficients of the model the interpretation is in terms of standard deviations. For cholesterol, the coefficient is 0.345501. This means if we increase cholesterol by 1 standard deviation, it will increase the log-odds by 0.345501. In the table we can see that, 1 standard deviation of cholesterol is 63.47.