# Logistic Regression Example in Python: Step-by-Step Guide

Follow to build your Logistic model

#### Lianne & Justin

In this guide, we’ll show a logistic regression example in Python, step-by-step.

Logistic regression is a popular machine learning algorithm for supervised learning – classification problems. In a previous tutorial, we explained the logistic regression model and its related concepts. Following this tutorial, you’ll see the full process of applying it with Python sklearn, including:

• How to explore, clean, and transform the data.
• How to split into training and test datasets.
• How to fit, evaluate, and interpret the model.

If you want to apply logistic regression in your next ML Python project, you’ll love this practical, real-world example.

Let’s start!

This logistic regression tutorial assumes you have basic knowledge of machine learning and Python.

Once you are ready, try following the steps below and practice in your own Python environment!

## Step #1: Import Python Libraries

Before starting the analysis, let’s import the necessary Python packages:

• Pandas – a powerful tool for data analysis and manipulation.
• NumPy – the fundamental package for scientific computing.
• scikit-learn (sklearn) – a popular tool for machine learning.

Don’t worry about the detailed usage of these packages. You’ll see them in action soon.
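As a rough sketch, the imports used in the later steps might look like the following. The exact list is an assumption based on the functions this guide mentions, not the original import cell.

```python
# Data handling
import numpy as np
import pandas as pd

# Modeling, splitting, and scaling
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Evaluation metrics used in Step #7
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    classification_report,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
```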

## Step #2: Explore and Clean the Data

The dataset we are going to use is a heart attack dataset from Kaggle. The goal of the project is to predict the binary target: whether the patient has heart disease or not.

After downloading the CSV file, we can use read_csv to load the data as a pandas DataFrame. We also specify na_values='?' since question marks represent missing values in the dataset.
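Here is a minimal sketch of the loading step. It reads a tiny in-memory CSV with made-up values so that it runs without the Kaggle file; with the real download you would pass the file path instead (the name `data.csv` in the comment is an assumption).

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded file; the values are made up.
csv_text = "age,sex,chol,num\n28,1,132,0\n29,1,?,0\n30,0,237,1\n"

# na_values='?' tells pandas to read '?' as a missing value (NaN).
# With the real file: df = pd.read_csv('data.csv', na_values='?')
df = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(df.columns)
print(df.isna().sum())
```

Because the `?` becomes NaN, the `chol` column is parsed as a numeric column rather than as text.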

First, let’s take a look at the variables by calling the columns of the dataset.

```
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num       '],
      dtype='object')
```

This corresponds to the documentation on Kaggle that 14 variables are available for analysis.

The column 'num       ' (note the trailing spaces in its name) is the target: a value of 1 indicates the presence of heart disease in the patient, and 0 its absence.

Let’s rename the target variable num to target, and also print out the classes and their counts.
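A short sketch of the rename and the class counts, using a toy DataFrame with made-up values. The real dataset's counts are the ones printed in this guide.

```python
import pandas as pd

# Toy frame; the target column keeps the trailing spaces seen in the real data.
df = pd.DataFrame({
    'age': [28, 29, 30, 31, 32],
    'num       ': [0, 0, 1, 0, 1],
})

# Rename the awkward column name to 'target'.
df = df.rename(columns={'num       ': 'target'})

# Count how many rows fall into each class.
counts = df['target'].value_counts()
print(counts)
```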

```
0    188
1    106
Name: target, dtype: int64
```

We can see that the dataset is only slightly imbalanced among classes of 0 and 1, so we’ll proceed without special adjustment.

Next, let’s take a look at the summary information of the dataset.

As you can see, there are 294 observations in the dataset and 13 other features besides target.

To keep the cleaning process simple, we’ll remove:

• the columns with many missing values, which are slope, ca, thal.
• the rows with missing values.

Let’s recheck the summary to make sure the dataset is cleaned.
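The cleaning step can be sketched as below on a toy DataFrame with made-up values. Only the column names `slope`, `ca`, and `thal` come from this guide.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (made up for illustration).
df = pd.DataFrame({
    'age': [28, 29, 30, 31],
    'chol': [132.0, np.nan, 237.0, 219.0],
    'slope': [np.nan, np.nan, 2.0, np.nan],
    'ca': [np.nan, np.nan, np.nan, np.nan],
    'thal': [np.nan, 6.0, np.nan, np.nan],
    'target': [0, 0, 1, 0],
})

# 1. Drop the columns that are mostly missing.
df = df.drop(columns=['slope', 'ca', 'thal'])

# 2. Drop any remaining rows that still contain a missing value.
df = df.dropna()

# Recheck the summary: every column should now be fully non-null.
df.info()
```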

The ten features we’ll be using are:

• age: age in years
• sex: sex (1 = male; 0 = female)
• cp: chest pain type
– 1: typical angina
– 2: atypical angina
– 3: non-anginal pain
– 4: asymptomatic
• trestbps: resting blood pressure (in mm Hg on admission to the hospital)
• chol: serum cholesterol in mg/dl
• fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
• restecg: resting electrocardiographic results
– 0: normal
– 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
– 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
• thalach: maximum heart rate achieved
• exang: exercise-induced angina (1 = yes; 0 = no)
• oldpeak: ST depression induced by exercise relative to rest

We can also take a quick look at the data itself by printing out the dataset.

We have five categorical variables (sex, cp, fbs, restecg, and exang); the remaining five are numerical.

In practice, more data cleaning and exploration should be done. For details, see these tutorials:

• How to use Python Seaborn for Exploratory Data Analysis
• Data Cleaning in Python: the Ultimate Guide

## Step #3: Transform the Categorical Variables: Creating Dummy Variables

When fitting logistic regression, we often transform the categorical variables into dummy variables.

> In logistic regression models, encoding all of the independent variables as dummy variables allows easy interpretation and calculation of the odds ratios, and increases the stability and significance of the coefficients.
>
> UCLA: A Smart Guide to Dummy Variables: Four Applications and a Macro

Among the five categorical variables, sex, fbs, and exang only have two levels of 0 and 1, so they are already in the dummy variable format. But we still need to convert cp and restecg into dummy variables.

Let’s take a closer look at these two variables.

There are four classes for cp and three for restecg.

We can use the get_dummies function to convert them into dummy variables. The drop_first parameter is set to True so that the unnecessary first level dummy variable is removed.

As shown, the variable cp is now represented by three dummy variables cp_2, cp_3, and cp_4. cp_1 was removed since it’s not necessary to distinguish the classes of cp.

• when cp = 1: cp_2 = 0, cp_3 = 0, cp_4 = 0.
• when cp = 2: cp_2 = 1, cp_3 = 0, cp_4 = 0.
• when cp = 3: cp_2 = 0, cp_3 = 1, cp_4 = 0.
• when cp = 4: cp_2 = 0, cp_3 = 0, cp_4 = 1.

Similarly, the variable restecg is now represented by two dummy variables restecg_1.0 and restecg_2.0.
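The conversion described above can be sketched with a small made-up DataFrame. The column names match the dataset, the values do not.

```python
import pandas as pd

# One row per chest pain type; restecg is stored as floats, as in the dataset.
df = pd.DataFrame({
    'cp': [1, 2, 3, 4],
    'restecg': [0.0, 1.0, 2.0, 0.0],
    'sex': [1, 0, 1, 1],
})

# drop_first=True removes the first level (cp_1, restecg_0.0),
# which then acts as the reference category.
df = pd.get_dummies(df, columns=['cp', 'restecg'], drop_first=True)

print(df.columns.tolist())
print(df[['cp_2', 'cp_3', 'cp_4']].astype(int))
```

The row with cp = 1 ends up with zeros in all three cp dummies, which is why no separate cp_1 column is needed.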

To recap, we can print out the numeric columns and categorical columns as numeric_cols and cat_cols below.

```
['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
['cp_2', 'cp_3', 'cp_4', 'exang', 'fbs', 'restecg_1.0', 'restecg_2.0', 'sex']
```
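One possible way to build the two lists printed above is to name the numeric columns by hand and treat every other non-target column as categorical. This is a sketch and an assumption about how the lists were produced. The empty DataFrame only supplies the column names.

```python
import pandas as pd

# Empty frame carrying the column names after the dummy-variable step.
df = pd.DataFrame(columns=[
    'age', 'sex', 'trestbps', 'chol', 'fbs', 'thalach', 'exang', 'oldpeak',
    'target', 'cp_2', 'cp_3', 'cp_4', 'restecg_1.0', 'restecg_2.0',
])

numeric_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# Everything that is neither numeric nor the target is categorical.
cat_cols = sorted(set(df.columns) - set(numeric_cols) - {'target'})

print(numeric_cols)
print(cat_cols)
```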

## Step #4: Split Training and Test Datasets

To make sure the fitted model can be generalized to unseen data, we always train it using some data while evaluating the model using the holdout data. So we need to split the original dataset into training and test datasets.

To do this, we can use the train_test_split function with the following settings:

• test_size = 0.2: keep 20% of the original dataset as the test dataset, i.e., 80% as the training dataset.
• stratify=df['target']: when the dataset is imbalanced, it’s good practice to do stratified sampling. In this way, both the training and test datasets will have similar proportions of the target classes as the complete dataset.
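A minimal sketch of the split with these two settings, run on a synthetic DataFrame. The `random_state` value is an assumption added here so the example is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data: 60% class 0, 40% class 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(28, 67, size=100),
    'target': [0] * 60 + [1] * 40,
})

# Stratifying on the target keeps the class proportions in both parts.
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df['target'], random_state=0
)

print(df_train.shape)
print(df_test.shape)
print(df_train['target'].value_counts(normalize=True))
print(df_test['target'].value_counts(normalize=True))
```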

To verify the specifications, we can print out the shapes and the classes of target for both the training and test sets.

```
(208, 14)
(53, 14)

0    0.625
1    0.375
Name: target, dtype: float64

0    0.622642
1    0.377358
Name: target, dtype: float64
```

## Step #5: Transform the Numerical Variables: Scaling

Before fitting the model, let’s also scale the numerical variables, which is another common practice in machine learning.

After creating an instance of StandardScaler, we calculate (fit) the mean and standard deviation for scaling using df_train’s numeric_cols. Then we create a function get_features_and_target_arrays that:

• performs standardization on the numeric_cols of df to return the new array X_numeric_scaled
• transforms cat_cols to a NumPy array X_categorical.
• combines both arrays back to the entire feature array X.
• assigns the target column to y.

Then we can apply this function to the training dataset to output our training feature and target, X and y.

This step has to be done after the train test split since the scaling calculations are based on the training dataset.
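The scaling step could be sketched as follows. The function name comes from this guide, but its signature, the column order inside X, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric_cols = ['age', 'chol']
cat_cols = ['sex']

# Toy training data (made-up values).
df_train = pd.DataFrame({
    'age': [30, 40, 50, 60],
    'chol': [180.0, 200.0, 220.0, 240.0],
    'sex': [1, 0, 1, 1],
    'target': [0, 0, 1, 1],
})

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler()
scaler.fit(df_train[numeric_cols])


def get_features_and_target_arrays(df, numeric_cols, cat_cols, scaler):
    """Return the feature array X and the target y for a DataFrame."""
    X_numeric_scaled = scaler.transform(df[numeric_cols])  # standardize
    X_categorical = df[cat_cols].to_numpy()                # keep as 0/1
    # Assumed order: categorical columns first, then scaled numeric columns.
    X = np.hstack((X_categorical, X_numeric_scaled))
    y = df['target']
    return X, y


X, y = get_features_and_target_arrays(df_train, numeric_cols, cat_cols, scaler)
print(X.shape, y.shape)
```

The same fitted scaler is reused later for the test set, so the test data never influences the scaling parameters.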

## Step #6: Fit the Logistic Regression Model

Finally, we can fit the logistic regression in Python on our example dataset.

We first create an instance clf of the class LogisticRegression. Then we can fit it using the training dataset.

`LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='none', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)`

At this point, we have the logistic regression model for our example in Python!

## Step #7: Evaluate the Model

After fitting the model, let’s look at some popular evaluation metrics for the dataset.

Further Reading: If you are not familiar with the evaluation metrics, check out 8 popular Evaluation Metrics for Machine Learning Models.

Before starting, we need to get the scaled test dataset.

We can plot the ROC curve.

We can also plot the precision-recall curve.
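The points behind both curves can be computed without any plotting library. The sketch below uses synthetic data and a model with default settings for brevity, so the numbers will not match this guide's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve

# Synthetic data: 200 training rows and 100 test rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

clf = LogisticRegression().fit(X_train, y_train)

# Score each test row with the predicted probability of class 1.
scores = clf.predict_proba(X_test)[:, 1]

# ROC curve: false positive rate vs. true positive rate per threshold.
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)

# Precision-recall curve: precision vs. recall per threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

print(len(fpr), len(precision))
```

If matplotlib is installed, `RocCurveDisplay.from_estimator(clf, X_test, y_test)` and `PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)` from `sklearn.metrics` (scikit-learn 1.0 or later) draw the same curves directly.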

To calculate other metrics, we need to get the prediction results from the test dataset:

• predict_proba to get the predicted probability of the logistic regression for each class in the model.
The first column of the output of predict_proba is P(target = 0), and the second column is P(target = 1). So we are calling for the second column by its index position 1.
• predict the test dataset labels by choosing the class with the highest probability, which means a threshold of 0.5 in this binary example.
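These two calls can be sketched on synthetic data as below. The variable names `test_prob` and `test_pred` are assumptions, and the model uses default settings for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 training rows and 100 test rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

clf = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class: [P(target=0), P(target=1)].
proba = clf.predict_proba(X_test)
test_prob = proba[:, 1]          # keep only P(target = 1)

# predict returns the most likely class for each row.
test_pred = clf.predict(X_test)

print(proba[:3])
print(test_pred[:10])
```

For a binary model, `test_pred` is the same as labeling a row 1 whenever `test_prob` exceeds 0.5.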

Using the below Python code, we can calculate some other evaluation metrics:

• Log loss
• AUC
• Average Precision
• Accuracy
• Precision
• Recall
• F1 score
• Classification report, which contains some of the above plus extra information
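The metrics listed above could be computed as in the following sketch. It runs on synthetic data with a default-settings model, so its output will differ from the numbers reported for this guide's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    classification_report,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Synthetic data: 200 training rows and 100 test rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

clf = LogisticRegression().fit(X_train, y_train)
test_prob = clf.predict_proba(X_test)[:, 1]  # P(target = 1)
test_pred = clf.predict(X_test)              # labels at threshold 0.5

metrics = {
    # These three use the predicted probabilities.
    'Log loss': log_loss(y_test, test_prob),
    'AUC': roc_auc_score(y_test, test_prob),
    'Average Precision': average_precision_score(y_test, test_prob),
    # These four use the predicted labels.
    'Accuracy': accuracy_score(y_test, test_pred),
    'Precision': precision_score(y_test, test_pred),
    'Recall': recall_score(y_test, test_pred),
    'F1 score': f1_score(y_test, test_pred),
}
for name, value in metrics.items():
    print(f'{name} = {value:.5f}')

print(classification_report(y_test, test_pred))
```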

```
Log loss = 0.35613
AUC = 0.92424
Average Precision = 0.89045

Using 0.5 as threshold:
Accuracy = 0.83019
Precision = 0.76190
Recall = 0.80000
F1 score = 0.78049

Classification Report
              precision    recall  f1-score   support

           0       0.88      0.85      0.86        33
           1       0.76      0.80      0.78        20

    accuracy                           0.83        53
   macro avg       0.82      0.82      0.82        53
weighted avg       0.83      0.83      0.83        53
```

Also, it’s a good idea to get the metrics for the training set for comparison, which we’ll not show in this tutorial. For example, if the training set gives accuracy that’s much higher than the test dataset, there could be overfitting.

To show the confusion matrix, we can plot a heatmap, which is also based on a threshold of 0.5 for binary classification.
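As an illustration, the confusion matrix can be rebuilt from the classification report above: 33 actual negatives and 20 actual positives, with recall 0.80 and precision 0.7619 for class 1, imply 16 true positives, 4 false negatives, 5 false positives, and 28 true negatives. The label arrays below are reconstructed from those counts. They are not the actual test rows.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Reconstructed labels: 33 actual zeros followed by 20 actual ones.
y_true = [0] * 33 + [1] * 20
# 28 TN, 5 FP for the zeros; 4 FN, 16 TP for the ones.
y_pred = [0] * 28 + [1] * 5 + [0] * 4 + [1] * 16

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

print(f'Accuracy = {accuracy_score(y_true, y_pred):.5f}')
print(f'Precision = {precision_score(y_true, y_pred):.5f}')
print(f'Recall = {recall_score(y_true, y_pred):.5f}')
```

The resulting array can then be passed to a heatmap function, or to `ConfusionMatrixDisplay` from `sklearn.metrics`, to draw the plot.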

## Step #8: Interpret the Results

In the last step, let’s interpret the results for our example logistic regression model. We’ll cover both the categorical feature and the numerical feature.

For the categorical feature sex, the fitted model says that, holding all the other features at fixed values, the ratio of the odds of having heart disease for males (sex = 1) to the odds for females (sex = 0) is exp(1.290292) ≈ 3.63. You can derive this from the logistic regression equation.

For the categorical feature cp (chest pain type), we created dummy variables with typical angina (cp = 1) as the reference level. So the odds ratio of atypical angina (cp = 2) to typical angina (cp = 1) is exp(-2.895253) ≈ 0.055.

Since the numerical variables are scaled by StandardScaler, we need to think of them in terms of standard deviations. Let’s first print out each numeric variable and its sample standard deviation.

For example, holding other variables fixed, the odds of having heart disease increase by about 41% for every one-standard-deviation increase in cholesterol (63.470764 mg/dl), since exp(0.345501) ≈ 1.41.
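The arithmetic behind these interpretations is short. In the model, log(odds) is a linear function of the features, so raising a feature by one unit while holding the others fixed adds its coefficient to the log odds, which multiplies the odds by exp(coefficient). The sketch below applies this to the coefficients quoted in this step. The variable names are illustrative.

```python
import math

# Coefficients and standard deviation quoted in this step.
coef_sex = 1.290292     # sex = 1 (male) vs. sex = 0 (female)
coef_cp_2 = -2.895253   # cp = 2 vs. the reference level cp = 1
coef_chol = 0.345501    # per one standard deviation of cholesterol
chol_std = 63.470764    # sample standard deviation of chol, in mg/dl

# Odds ratio = exp(coefficient).
or_sex = math.exp(coef_sex)
or_cp_2 = math.exp(coef_cp_2)
or_chol = math.exp(coef_chol)

print(f'sex odds ratio  = {or_sex:.2f}')
print(f'cp_2 odds ratio = {or_cp_2:.3f}')
print(f'chol odds ratio = {or_chol:.2f} per standard deviation')

# Percentage change in odds is (odds ratio - 1) * 100.
print(f'chol: {(or_chol - 1) * 100:.0f}% increase per {chol_std:.1f} mg/dl')

# Dividing by the standard deviation rescales the effect to 1 mg/dl.
or_chol_per_unit = math.exp(coef_chol / chol_std)
print(f'chol odds ratio per 1 mg/dl = {or_chol_per_unit:.4f}')
```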

That’s it. You’ve discovered the general procedures of fitting logistic regression models with an example in Python.

Try to apply it to your next classification problem!
