Logistic Regression Example in Python: Step-by-Step Guide
 Follow along to build your logistic regression model

Lianne & Justin


In this guide, we’ll show a logistic regression example in Python, step-by-step.

Logistic regression is a popular machine learning algorithm for supervised learning – classification problems. In a previous tutorial, we explained the logistic regression model and its related concepts. Following this tutorial, you’ll see the full process of applying it with Python sklearn, including:

  • How to explore, clean, and transform the data.
  • How to split into training and test datasets.
  • How to fit, evaluate, and interpret the model.

If you want to apply logistic regression in your next ML Python project, you’ll love this practical, real-world example.

Let’s start!



This logistic regression tutorial assumes you have basic knowledge of machine learning and Python. If not, please brush up on those basics before following along.

Once you are ready, try following the steps below and practice on your Python environment!


Step #1: Import Python Libraries

Before starting the analysis, let’s import the necessary Python packages:

  • Pandas – a powerful tool for data analysis and manipulation.
  • NumPy – the fundamental package for scientific computing.
  • Scikit Learn (sklearn) – a popular tool for machine learning.
Don’t worry about the detailed usage of these functions; you’ll see them in action soon.
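As a sketch, the imports used throughout this tutorial might look like the following (the exact submodules are an assumption based on the steps below):

```python
import pandas as pd
import numpy as np

# scikit-learn pieces used in the later steps
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
```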

Further Readings:
Learn Python Pandas for Data Science: Quick Tutorial
Python NumPy Tutorial: Practical Basics for Data Science

Step #2: Explore and Clean the Data

The dataset we are going to use is a Heart Attack dataset from Kaggle. The goal of the project is to predict the binary target: whether the patient has heart disease or not.

Upon downloading the csv file, we can use read_csv to load the data as a pandas DataFrame. We also specify na_values='?', since question marks represent missing values in this dataset.
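A minimal sketch of the loading step; the Kaggle file name is an assumption, so an inline sample is used here to keep the snippet runnable and to show how na_values turns '?' into NaN:

```python
import io
import pandas as pd

# With the downloaded file you would write something like:
#     df = pd.read_csv('heart.csv', na_values='?')   # file name is an assumption
# Inline sample standing in for the csv file:
sample = "age,chol,thal\n54,239,3\n61,?,?\n"
df = pd.read_csv(io.StringIO(sample), na_values='?')
print(df.isna().sum())
```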

First, let’s take a look at the variables by calling the columns of the dataset.

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num       '],
      dtype='object')

This matches the documentation on Kaggle: 14 variables are available for analysis.

The column num (whose name contains trailing spaces) is the target: a value of 1 indicates the presence of heart disease in the patient, and 0 indicates its absence.

Let’s rename the target variable num to target, and also print out the classes and their counts.
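A sketch of the renaming step, using a toy DataFrame to stand in for the loaded data (the real code operates on the df from read_csv):

```python
import pandas as pd

# Toy stand-in for the loaded DataFrame; note the trailing spaces
# in the original column name.
df = pd.DataFrame({'num       ': [0, 1, 0, 0, 1]})

df = df.rename(columns={'num       ': 'target'})
print(df['target'].value_counts())
```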

0    188
1    106
Name: target, dtype: int64

We can see that the dataset is only slightly imbalanced among classes of 0 and 1, so we’ll proceed without special adjustment.

Next, let’s take a look at the summary information of the dataset.

[Output: summary information of the dataset]

As you can see, there are 294 observations in the dataset and 13 other features besides target.

To keep the cleaning process simple, we’ll remove:

  • the columns with many missing values, which are slope, ca, thal.
  • the rows with missing values.
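The two cleaning steps above can be sketched as follows, again on a toy DataFrame standing in for the Kaggle data:

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for the Kaggle data.
df = pd.DataFrame({
    'age': [54, 61, 45],
    'slope': [np.nan, 2, np.nan],
    'ca': [np.nan, np.nan, np.nan],
    'thal': [np.nan, 7, np.nan],
    'chol': [239, np.nan, 250],
})

# Drop the sparse columns, then any remaining rows with missing values.
df = df.drop(columns=['slope', 'ca', 'thal']).dropna()
print(df.shape)
```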

Let’s recheck the summary to make sure the dataset is cleaned.

[Output: summary of the cleaned dataset]

The ten features we’ll be using are:

  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type
    – 1: typical angina
    – 2: atypical angina
    – 3: non-anginal pain
    – 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results
    – 0: normal
    – 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    – 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • exang: exercise-induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest

We can also take a quick look at the data itself by printing out the dataset.

[Output: preview of the final dataset]

We have five categorical variables: sex, cp, fbs, restecg, and exang; the remaining five (age, trestbps, chol, thalach, and oldpeak) are numerical.

Further Readings:
In reality, more data cleaning and exploration should be done. Please check out tutorials:
How to use Python Seaborn for Exploratory Data Analysis
Data Cleaning in Python: the Ultimate Guide

Step #3: Transform the Categorical Variables: Creating Dummy Variables

When fitting logistic regression, we often transform the categorical variables into dummy variables.

In logistic regression models, encoding all of the independent variables as dummy variables allows easy interpretation and calculation of the odds ratios, and increases the stability and significance of the coefficients.

UCLA: A SMART GUIDE TO DUMMY VARIABLES: FOUR APPLICATIONS AND A MACRO

Among the five categorical variables, sex, fbs, and exang only have two levels of 0 and 1, so they are already in the dummy variable format. But we still need to convert cp and restecg into dummy variables.

Let’s take a closer look at these two variables.

There are four classes for cp and three for restecg.

We can use the get_dummies function to convert them into dummy variables. The drop_first parameter is set to True so that the unnecessary first level dummy variable is removed.
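A runnable sketch of the dummy-variable conversion, on a toy frame holding the two multi-level features:

```python
import pandas as pd

# Toy frame with the two multi-level categorical features.
df = pd.DataFrame({'cp': [1, 2, 3, 4], 'restecg': [0.0, 1.0, 2.0, 0.0]})

# drop_first=True removes the redundant first-level dummy for each feature.
df = pd.get_dummies(df, columns=['cp', 'restecg'], drop_first=True)
print(sorted(df.columns))
```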

[Output: dataset with dummy variables for cp and restecg]

As shown, the variable cp is now represented by three dummy variables cp_2, cp_3, and cp_4. cp_1 was removed since it’s not necessary to distinguish the classes of cp.

  • when cp = 1: cp_2 = 0, cp_3 = 0, cp_4 = 0.
  • when cp = 2: cp_2 = 1, cp_3 = 0, cp_4 = 0.
  • when cp = 3: cp_2 = 0, cp_3 = 1, cp_4 = 0.
  • when cp = 4: cp_2 = 0, cp_3 = 0, cp_4 = 1.

Similarly, the variable restecg is now represented by two dummy variables restecg_1.0 and restecg_2.0.

To recap, we can print out the numeric columns and categorical columns as numeric_cols and cat_cols below.

['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
['cp_2', 'cp_3', 'cp_4', 'exang', 'fbs', 'restecg_1.0', 'restecg_2.0', 'sex']

Step #4: Split Training and Test Datasets

To make sure the fitted model generalizes to unseen data, we train it on one portion of the data and evaluate it on held-out data. So we need to split the original dataset into training and test datasets.

To do this, we can use the train_test_split method with the below specifications:

  • test_size = 0.2: keep 20% of the original dataset as the test dataset, i.e., 80% as the training dataset.
  • stratify=df['target']: when the dataset is imbalanced, it’s good practice to do stratified sampling. In this way, both the training and test datasets will have similar proportions of the target classes as the complete dataset.
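A sketch of the split with these specifications, on a toy imbalanced DataFrame (random_state is an added assumption for reproducibility):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; in the tutorial this is the cleaned df.
rng = np.random.default_rng(0)
df = pd.DataFrame({'age': rng.integers(30, 70, 100),
                   'target': [0] * 60 + [1] * 40})

df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df['target'])
print(df_train.shape, df_test.shape)
```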

To verify the specifications, we can print out the shapes and the classes of target for both the training and test sets.

(208, 14)
(53, 14)

0    0.625
1    0.375
Name: target, dtype: float64

0    0.622642
1    0.377358
Name: target, dtype: float64

Step #5: Transform the Numerical Variables: Scaling

Before fitting the model, let’s also scale the numerical variables, which is another common practice in machine learning.

After creating an instance of StandardScaler, we calculate (fit) the mean and standard deviation for scaling using the numeric_cols of df_train. Then we create a function get_features_and_target_arrays that:

  • performs standardization on the numeric_cols of df to return the new array X_numeric_scaled
  • transforms cat_cols to a NumPy array X_categorical.
  • combines both arrays back to the entire feature array X.
  • assigns the target column to y.

Then we can apply this function to the training dataset to output our training feature and target, X and y.

This step has to be done after the train-test split, since the scaling parameters are calculated from the training dataset only.
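Here is a runnable sketch of the scaling function described above; the toy df_train, numeric_cols, and cat_cols stand in for the objects built in the earlier steps:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training frame; in the tutorial these come from Steps #3 and #4.
df_train = pd.DataFrame({'age': [54, 61, 45, 50],
                         'sex': [1, 0, 1, 0],
                         'target': [0, 1, 0, 1]})
numeric_cols = ['age']
cat_cols = ['sex']

scaler = StandardScaler()
scaler.fit(df_train[numeric_cols])  # fit on the training set only

def get_features_and_target_arrays(df, numeric_cols, cat_cols, scaler):
    X_numeric_scaled = scaler.transform(df[numeric_cols])
    X_categorical = df[cat_cols].to_numpy()
    X = np.hstack((X_categorical, X_numeric_scaled))
    y = df['target']
    return X, y

X, y = get_features_and_target_arrays(df_train, numeric_cols, cat_cols, scaler)
print(X.shape)
```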

Step #6: Fit the Logistic Regression Model

Finally, we can fit the logistic regression in Python on our example dataset.

We first create an instance clf of the class LogisticRegression. Then we can fit it using the training dataset.
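A sketch of the fit, on toy data standing in for the X and y from Step #5. The article’s output shows an unpenalized model; that is penalty='none' on scikit-learn < 1.2 and penalty=None on newer versions, so the default L2 penalty is used here to keep the sketch runnable across versions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the X, y built in Step #5.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# To reproduce the article's unpenalized fit, pass penalty=None
# (scikit-learn >= 1.2) or penalty='none' (older versions).
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```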

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='none', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

At this point, we have the logistic regression model for our example in Python!

Step #7: Evaluate the Model

After fitting the model, let’s look at some popular evaluation metrics for the dataset.

Further Reading: If you are not familiar with the evaluation metrics, check out 8 popular Evaluation Metrics for Machine Learning Models.

Before starting, we need to get the scaled test dataset.

We can plot the ROC curve.

[Figure: ROC curve]

We can also plot the precision-recall curve.

[Figure: precision-recall curve]

To calculate other metrics, we need to get the prediction results from the test dataset:

  • predict_proba to get the predicted probability of the logistic regression for each class in the model.
    The first column of the output of predict_proba is P(target = 0), and the second column is P(target = 1). So we are calling for the second column by its index position 1.
  • predict the test dataset labels by choosing the class with the highest probability, which means a threshold of 0.5 in this binary example.

Using the below Python code, we can calculate some other evaluation metrics:

  • Log loss
  • AUC
  • Average Precision
  • Accuracy
  • Precision
  • Recall
  • F1 score
  • Classification report, which contains some of the above plus extra information

Please read the scikit-learn documentation for details.
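Putting the pieces together, these metrics can be computed as follows; the toy model and data stand in for the tutorial’s clf and test split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (log_loss, roc_auc_score, average_precision_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# Toy model and test data standing in for clf, X_test, y_test.
rng = np.random.default_rng(1)
X_test = rng.normal(size=(60, 3))
y_test = (X_test[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X_test, y_test)

test_prob = clf.predict_proba(X_test)[:, 1]  # P(target = 1), second column
test_pred = clf.predict(X_test)              # labels at the 0.5 threshold

print('Log loss =', log_loss(y_test, test_prob))
print('AUC =', roc_auc_score(y_test, test_prob))
print('Average Precision =', average_precision_score(y_test, test_prob))
print('Accuracy =', accuracy_score(y_test, test_pred))
print('Precision =', precision_score(y_test, test_pred))
print('Recall =', recall_score(y_test, test_pred))
print('F1 score =', f1_score(y_test, test_pred))
print(classification_report(y_test, test_pred))
```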

Log loss = 0.35613
AUC = 0.92424
Average Precision = 0.89045

Using 0.5 as threshold:
Accuracy = 0.83019
Precision = 0.76190
Recall = 0.80000
F1 score = 0.78049

Classification Report
              precision    recall  f1-score   support

           0       0.88      0.85      0.86        33
           1       0.76      0.80      0.78        20

    accuracy                           0.83        53
   macro avg       0.82      0.82      0.82        53
weighted avg       0.83      0.83      0.83        53

Also, it’s a good idea to get the metrics for the training set for comparison, which we won’t show in this tutorial. For example, if the training accuracy is much higher than the test accuracy, the model may be overfitting.

To show the confusion matrix, we can plot a heatmap, which is also based on a threshold of 0.5 for binary classification.
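The underlying matrix can be computed with confusion_matrix; the heatmap itself can then be rendered with seaborn or ConfusionMatrixDisplay. Toy data stands in for the test split here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy model and test data standing in for clf, X_test, y_test.
rng = np.random.default_rng(2)
X_test = rng.normal(size=(60, 3))
y_test = (X_test[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X_test, y_test)

cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
# To render the heatmap, one option is:
#     import seaborn as sns; sns.heatmap(cm, annot=True, fmt='d')
```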

[Figure: confusion matrix heatmap]

Step #8: Interpret the Results

In the last step, let’s interpret the results for our example logistic regression model. We’ll cover both the categorical feature and the numerical feature.
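A coefficient table like the one below can be assembled from the fitted model’s intercept_ and coef_ attributes; this sketch uses a toy fit and hypothetical feature names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy fit standing in for clf from Step #6; feature names are hypothetical.
feature_names = ['x1', 'x2']
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

coef_table = pd.DataFrame(
    {'variable': ['intercept'] + feature_names,
     'coefficient': np.concatenate([clf.intercept_, clf.coef_[0]])})
print(coef_table)
```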

       variable  coefficient
0     intercept    -0.178340
1          cp_2    -2.895253
2          cp_3    -1.808676
3          cp_4    -0.830942
4         exang     0.514580
5           fbs     1.514143
6   restecg_1.0    -0.638990
7   restecg_2.0    -0.429625
8           sex     1.290292
9      age          0.059633
10     trestbps    -0.013132
11         chol     0.345501
12      thalach    -0.285511
13      oldpeak     1.231252

For the categorical feature sex, the fitted model says that, holding all the other features at fixed values, the odds of having heart disease for males (sex = 1) are exp(1.290292) ≈ 3.63 times the odds for females (sex = 0). You can derive this from the logistic regression equation.

For the categorical feature cp (chest pain type), we created dummy variables with typical angina (cp = 1) as the reference level. So the odds ratio of atypical angina (cp = 2) versus typical angina is exp(-2.895253) ≈ 0.055.

Since the numerical variables were scaled by StandardScaler, we interpret their coefficients in terms of standard deviations. Let’s first print out each numeric variable with its sample standard deviation.
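A sketch of this step, with toy values standing in for df_train[numeric_cols]; the exp call shows the odds-ratio calculation with the article’s chol coefficient:

```python
import numpy as np
import pandas as pd

# Toy training frame; the tutorial uses the real df_train[numeric_cols].
df_train = pd.DataFrame({'age': [54, 61, 45, 50],
                         'chol': [239, 286, 250, 204]})
numeric_cols = ['age', 'chol']

print(df_train[numeric_cols].std())  # one "unit" per scaled feature

# Odds ratio for a one-standard-deviation increase in chol,
# using the coefficient 0.345501 from the fitted model:
print(np.exp(0.345501))
```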

   variable       unit
0       age   7.909365
1  trestbps  18.039942
2      chol  63.470764
3   thalach  24.071915
4   oldpeak   0.891801

For example, holding other variables fixed, there is a 41% increase in the odds of having heart disease for every standard-deviation increase in cholesterol (63.470764 mg/dl), since exp(0.345501) ≈ 1.41.


That’s it. You’ve discovered the general procedures of fitting logistic regression models with an example in Python.

Try to apply it to your next classification problem!

Leave a comment for any questions you may have or anything else.
