# How to use Python Seaborn for Exploratory Data Analysis

Explore an example dataset by Histogram, Heatmap, Scatter plot, Barplot, etc.

#### Lianne & Justin

This is a tutorial on using the seaborn library in Python for Exploratory Data Analysis (EDA).

Alongside data cleaning, which we covered in Data Cleaning in Python: the Ultimate Guide (2020), EDA is another critical process in data analysis (or machine learning/statistical modeling).

In this guide, you’ll discover (with examples):

• How to use the seaborn Python package to produce useful and beautiful visualizations, including histograms, bar plots, scatter plots, boxplots, and heatmaps.
• How to explore univariate, multivariate numerical and categorical variables with different plots.
• How to discover the relationships among multiple variables.
• Lots more.

Let’s get started!

## What is Exploratory Data Analysis (EDA) and Why?

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Source: Wikipedia

It is important to explore the data before further analysis or modeling. In this process, we get an overview of the dataset and can discover trends, patterns, and relationships that are not readily apparent.

## What is seaborn?

Seaborn is a popular Python library for statistical data visualization and is well suited to performing EDA.

It is based on matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.

Within this post, we’ll use a scraped and cleaned YouTube dataset as an example.

In our previous article How to Get MORE YouTube Views with Machine Learning techniques, we made recommendations on how to get more views based on the same dataset.

Before exploring, let’s read the data into Python as dataset df.

df contains 729 rows and 60 variables. It records different features for each video within Sydney’s YouTube channel, such as:

• views: the number of views of the video
• length: the length of the video/workout in minutes
• calories: the number of calories burned during the workout in the video
• days_since_posted: the number of days since the video was posted
• date: the date when the video/workout was posted (Sydney posts one video/workout almost every day)
• workout_type: the type of workout the video was focusing on

Again, you can find more details in How to Get MORE YouTube Views with Machine Learning techniques. We’ll just use this dataset here.
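As a rough sketch of the loading step, the snippet below reads a tiny CSV into a DataFrame called df. The CSV text and its values are made up for illustration; only the column names come from the list above. In practice you would pass the path of the cleaned file to `pd.read_csv` instead of the in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical excerpt with made-up values; the real dataset has 729 rows and 60 columns.
csv_text = """views,length,calories,days_since_posted,date,workout_type
120000,30,250,400,2019-03-01,strength
95000,40,380,399,2019-03-02,hiit
60000,50,450,398,2019-03-03,stretch
"""

# Parse the date column as datetimes while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.shape)   # (rows, columns)
print(df.dtypes)  # check that numeric and date columns were parsed as expected
```

Checking `df.shape` and `df.dtypes` right after loading is a cheap way to catch parsing problems before plotting.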

## Univariate Analysis: Numerical Variable

First, let’s explore the numerical variables one at a time.

We create df_numeric to include only the 7 numeric features.
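One common way to build such a subset is `select_dtypes`. The sketch below uses a small made-up DataFrame, so it yields 3 numeric columns rather than the 7 in the real dataset, and it may differ from how df_numeric was originally created.

```python
import pandas as pd

# Made-up stand-in for the YouTube dataset.
df = pd.DataFrame({
    "views": [120000, 95000, 60000],
    "length": [30, 40, 50],
    "calories": [250.0, 380.0, 450.0],
    "workout_type": ["strength", "hiit", "stretch"],
})

# Keep only the columns with a numeric dtype (integers and floats).
df_numeric = df.select_dtypes(include="number")
print(df_numeric.columns.tolist())
```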

### Histogram: Single Variable

Histograms are one of our favorite plots.

A histogram is an approximate representation of the distribution of numerical data.

To construct a histogram, the first step is to “bin” (or “bucket”) the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.

Source: Wikipedia

Seaborn’s function distplot has options for:

• bins: the bins setting. It’s useful to plot the variable with different bins settings to discover patterns. If we don’t set this value, the library will find a useful default for us.
• kde: whether to plot a Gaussian kernel density estimate. This helps to estimate the shape of the probability density function of a continuous random variable. More details can be found on seaborn’s page.
• rug: whether to draw a rug plot on the support axis. This draws a small vertical tick at each observation. It helps to know the exact position of the values for the variable.

Let’s start by looking at a single variable: length, which represents the length of the video.

We can see both the kde line and the rug ticks in the plot below.

Videos on Sydney’s channel are often about 30, 40, or 50 minutes long, which gives the distribution a multimodal shape.

### Histogram: Multiple Variables

Often, we want to visualize multiple numeric variables and look at them together.

We build the function plot_multiple_histograms below to plot histograms for a specific group of variables.

We can see that different variables show different shapes of distributions, outliers, skewness, etc.

## Univariate Analysis: Categorical Variables

Next, let’s look at categorical univariate variables.

### Bar Chart: Single Variable

The bar chart (or countplot in seaborn) is the categorical counterpart of the histogram.

A bar chart or bar plot is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent.

A bar graph shows comparisons among discrete categories.

Source: Wikipedia

First, let’s select the categorical (non-numeric) variables.

We plot the bar chart for the variable area, which represents the body areas that the workout video is focusing on.

There are many areas that the videos targeted. It is hard to read without zooming in. Still, we can see that more than half (over 400) of these videos focused on the “full” body area; and the second most popular area focused on is “ab”.
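The two steps above (selecting the non-numeric columns, then plotting one of them) can be sketched as follows. The data is made up, and sorting the bars by frequency is our own addition to make the chart easier to read.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up stand-in for the dataset.
df = pd.DataFrame({
    "area": ["full"] * 6 + ["ab"] * 3 + ["butt"] * 2 + ["arm"],
    "views": range(12),
})

# Select the categorical (non-numeric) columns.
df_non_numeric = df.select_dtypes(exclude="number")
print(df_non_numeric.columns.tolist())

# Order the bars from the most to the least frequent category.
order = df["area"].value_counts().index.tolist()

fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(data=df, x="area", order=order, ax=ax)
ax.tick_params(axis="x", rotation=45)  # rotate labels when there are many categories
```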

### Bar Chart: Multiple Variables

Also, we create a function plot_multiple_countplots to plot the bar charts of multiple variables at once.

We use it to plot some indicator variables below.

The is_{}_area are indicator variables for different body areas. For example, is_butt_area == True when the workout focuses on the butt, otherwise it is False.

The is_{}_workout are indicator variables for different workout types. For example, is_strength_workout == True when the workout focuses on strength, otherwise it is False.
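A possible shape for plot_multiple_countplots is sketched below. The signature is our assumption, and the indicator columns are derived here from made-up area and workout_type values purely to have something to plot.

```python
import math
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_multiple_countplots(df, cols, ncols=2):
    """Draw one count plot per column in `cols` on a grid of axes."""
    nrows = math.ceil(len(cols) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.flatten()
    for ax, col in zip(flat_axes, cols):
        sns.countplot(data=df, x=col, ax=ax)
    # Hide the unused axes in the last row, if any.
    for ax in flat_axes[len(cols):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig

# Made-up stand-in data with boolean indicator columns.
df = pd.DataFrame({
    "area": ["full", "butt", "ab", "full", "butt", "full"],
    "workout_type": ["strength", "hiit", "strength", "cardio", "strength", "hiit"],
})
df["is_butt_area"] = df["area"].eq("butt")
df["is_full_area"] = df["area"].eq("full")
df["is_strength_workout"] = df["workout_type"].eq("strength")

fig = plot_multiple_countplots(
    df, ["is_butt_area", "is_full_area", "is_strength_workout"]
)
```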

## Multivariate Analysis

After exploring the variables one-by-one, let’s look at multiple variables together.

Different plots can be used to explore relationships among different combinations of variables.

In the last section, you can also find a modeling approach for testing relationships among multiple variables.

### Scatter Plot: Two Numerical Variables

First, let’s see how we can discover the relationship between two numerical variables.

What if we want to know how the workout length impacts the number of views?

We can use scatterplots (relplot) to answer the question.

A scatter plot uses Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed.

The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

Source: Wikipedia

We can see that the more popular videos tend to have lengths between 30 and 40 minutes.
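A minimal relplot call for this question might look like the sketch below. The data is synthetic and random, so it will not reproduce the 30 to 40 minute pattern seen in the real dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic stand-in: random lengths (minutes) and view counts.
df = pd.DataFrame({
    "length": rng.integers(10, 61, 200),
    "views": rng.lognormal(11, 0.5, 200),
})

# relplot is a figure-level function; kind="scatter" is its default.
g = sns.relplot(data=df, x="length", y="views", kind="scatter", alpha=0.6)
g.set_axis_labels("length (minutes)", "views")
```

Setting `alpha` below 1 helps reveal dense regions when many points overlap.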

### Bar Chart: Two Categorical Variables

What if we want to know the relationship between two categorical variables?

Let’s visualize the most common 6 areas (area2) and the most common 4 workout types (workout_type2) within the videos.

We can see that “full” body “strength” workouts are the most common within the videos.
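One way to sketch this is a count plot with a `hue` variable. We assume here that area2 and workout_type2 keep the most common levels and group everything else as “Other”; the helper `keep_top` is hypothetical, and the toy data keeps the top 3 and top 2 levels instead of 6 and 4.

```python
import pandas as pd
import seaborn as sns

def keep_top(series, n, other="Other"):
    """Keep the `n` most frequent levels and replace the rest with `other`."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other)

# Made-up stand-in data.
df = pd.DataFrame({
    "area": ["full"] * 5 + ["ab"] * 3 + ["butt"] * 2 + ["arm"],
    "workout_type": [
        "strength", "hiit", "strength", "strength", "cardio",
        "hiit", "strength", "hiit",
        "strength", "stretch",
        "pilates",
    ],
})
df["area2"] = keep_top(df["area"], 3)
df["workout_type2"] = keep_top(df["workout_type"], 2)

# The same counts as a table, useful for checking the chart.
print(pd.crosstab(df["area2"], df["workout_type2"]))

# One group of bars per area, one bar per workout type within each group.
ax = sns.countplot(data=df, x="area2", hue="workout_type2")
```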

### Boxplot: Numerical and Categorical Variables

Box plots are useful visualizations when comparing groups of categories together.

A box plot (box-and-whisker plot) is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

Source: Wikipedia

We can use side-by-side boxplots to compare a numeric variable among categories of a categorical variable.

Do Sydney’s videos get more views on certain days of the week?

Let’s plot day_of_week and views.

This is interesting but hard to see due to outliers. Let’s remove them.

We can see that Monday videos tend to have more views than videos posted on other days, while Sunday videos get the fewest views.
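The sketch below draws both versions side by side on synthetic data. It hides the outliers with `showfliers=False`, which only stops the outlier points from being drawn and does not drop any rows; this may differ from how the outliers were removed for the plots described above.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

rng = np.random.default_rng(0)
# Synthetic stand-in: skewed view counts with no real weekday effect.
df = pd.DataFrame({
    "day_of_week": np.resize(day_order, 300),
    "views": rng.lognormal(11, 0.8, 300),
})

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
# Left: default boxplot, where outliers stretch the y-axis.
sns.boxplot(data=df, x="day_of_week", y="views", order=day_order, ax=axes[0])
# Right: same data with outlier points hidden.
sns.boxplot(data=df, x="day_of_week", y="views", order=day_order,
            showfliers=False, ax=axes[1])
axes[0].set_title("With outliers")
axes[1].set_title("Outliers hidden")
```

Passing `order` keeps the weekdays in calendar order instead of the order in which they appear in the data.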

### Swarmplot: Numerical and Categorical Variables

Another way of looking at the same question is with a swarm plot.

A swarm plot is a categorical scatterplot where the points are adjusted (only along the categorical axis) so that they don’t overlap.

This gives a better representation of the distribution of values.

Source: seaborn documentation

A swarm plot is a good complement to a box plot when we want to show all observations along with some representation of the underlying distribution.

A swarm plot would have too many dots for larger datasets, but it’s good here with a smaller dataset.
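A minimal swarm plot for the same question could be drawn as below, again on synthetic data. The small marker `size` is our choice to reduce crowding.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

rng = np.random.default_rng(0)
# Synthetic stand-in: a small dataset suits swarm plots best.
df = pd.DataFrame({
    "day_of_week": np.resize(day_order, 105),
    "views": rng.lognormal(11, 0.5, 105),
})

fig, ax = plt.subplots(figsize=(10, 4))
# Each dot is one video; dots are nudged sideways so they do not overlap.
sns.swarmplot(data=df, x="day_of_week", y="views", order=day_order, size=3, ax=ax)
```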

### Boxplot Group: Numerical and Categorical Variables

Are the views on certain days of the week higher for certain workout types?

To answer this question, two categorical variables (workout_type, day_of_week) and one numerical variable (views) are involved.

Let’s see how we can visualize the answer to this question.

We can use a panel boxplot (catplot) to visualize the three variables together.

The catplot is useful to show the relationship between a numerical and one or more categorical variables using one of several visual representations.

That’s quite messy with too many categories of workout_type.

Based on the distribution of workout_type, we group the categories other than “strength”, “hiit”, “stretch”, and “cardio” together as “Other”.

Also, we remove the outliers to make the plot clearer.

We can notice things such as:

• “stretch” workouts are only posted on Sundays.
• “hiit” workouts seem to have more views on Mondays.
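The grouping and the panel boxplot can be sketched as follows. The data is synthetic, the column name workout_type2 for the grouped variable is our assumption, and putting one workout type per panel (`col=`) is one possible layout among several.

```python
import numpy as np
import pandas as pd
import seaborn as sns

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
all_types = ["strength", "hiit", "stretch", "cardio", "pilates", "barre"]
main_types = ["strength", "hiit", "stretch", "cardio"]

rng = np.random.default_rng(0)
# Synthetic stand-in covering every day and workout type combination.
df = pd.DataFrame({
    "day_of_week": np.resize(day_order, 420),
    "workout_type": np.resize(all_types, 420),
    "views": rng.lognormal(11, 0.6, 420),
})

# Group the less common workout types together as "Other".
df["workout_type2"] = df["workout_type"].where(
    df["workout_type"].isin(main_types), "Other"
)

# One boxplot panel per workout type, with outlier points hidden.
g = sns.catplot(
    data=df, x="day_of_week", y="views", col="workout_type2",
    kind="box", order=day_order, showfliers=False, col_wrap=3, height=3,
)
g.set_xticklabels(rotation=45)
```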

### Heatmap: Numerical and Categorical Variables

We can also use pivot tables and heatmaps to visualize multiple variables.

A heat map is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions.

The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.

Source: Wikipedia

For example, the heatmap below has area and workout_type categories as the axes; the color scale represents views in each cell.
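A small sketch of the pivot-then-heatmap pattern is shown below with made-up numbers. We aggregate with the mean here as an assumption; the median or the sum would be equally reasonable choices depending on the question.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in data.
df = pd.DataFrame({
    "area": ["full", "full", "ab", "ab", "full", "ab"],
    "workout_type": ["strength", "hiit", "strength", "hiit", "strength", "hiit"],
    "views": [100, 200, 300, 400, 300, 600],
})

# Rows are areas, columns are workout types, cells are mean views.
pivot = df.pivot_table(index="area", columns="workout_type",
                       values="views", aggfunc="mean")
print(pivot)

# annot=True writes the cell value on top of the color.
ax = sns.heatmap(pivot, annot=True, fmt=".0f", cmap="viridis")
```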

### (Advanced) Relationship Test and Scatterplot: Numerical and Categorical Variables

How do we automatically discover the relationships among multiple variables?

Let’s take the most critical features below and see how we could find interesting relationships.

We have 4 numerical variables and 3 categorical variables.

There could be many complicated relationships among them!

In this section, we use the same method to test for relationships (including multicollinearity) among them as in How to Get MORE YouTube Views with Machine Learning techniques.

At a high level, we use K-fold cross-validation to achieve this.

First, we transform the categorical variables. Since we will be using 5-fold cross-validation, we need to make sure there are at least 5 observations for each category level.
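One simple way to guarantee a minimum count per level is to merge rare levels into a single bucket, as sketched below. The helper `merge_rare_levels` is hypothetical and the data is made up; the actual transformation used for this dataset may differ.

```python
import pandas as pd

def merge_rare_levels(series, min_count=5, other="Other"):
    """Replace levels seen fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

# Made-up stand-in for a categorical column.
area = pd.Series(["full"] * 8 + ["ab"] * 5 + ["arm"] * 3 + ["leg"] * 2)
area_clean = merge_rare_levels(area, min_count=5)
print(area_clean.value_counts())
```

The merged bucket can itself end up with fewer than `min_count` rows, so it is worth checking the counts again after merging.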

Next, we loop through each variable and fit a model to predict it using the other variables. We use a simple Gradient Boosting Model (GBM) with K-fold cross-validation.

Depending on whether the target variable is numerical or categorical, we apply different models and scores (model predictive power evaluation metrics).

When the target is numerical, we use the Gradient Boosting Regressor model and Root Mean Squared Error (RMSE); when the target is categorical, we use the Gradient Boosting Classifier model and Accuracy.

For each target, we print out the K-fold validation score (average of the scores) and the most important 5 predictors.

We also add three features, rand0, rand1, and rand2, composed of random numbers. They serve as anchors when comparing the relationship between variables. If a predictor is no more important than these random variables, then it is not an important predictor of the target variable.
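The sketch below illustrates the idea for a single numerical target using scikit-learn. It runs on synthetic data in which calories is constructed to depend on length, and it covers only the regression branch. For a categorical target, `GradientBoostingClassifier` with `scoring="accuracy"` would take the place of the regressor and RMSE. This is a simplified illustration, not the exact procedure from the earlier article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in: calories is built from length plus noise.
df = pd.DataFrame({"length": rng.integers(10, 61, n).astype(float)})
df["calories"] = 9 * df["length"] + rng.normal(0, 20, n)
df["views"] = rng.lognormal(11, 0.5, n)
# Random anchor features with no relationship to anything.
for i in range(3):
    df[f"rand{i}"] = rng.random(n)

target = "calories"
X = df.drop(columns=[target])
y = df[target]

model = GradientBoostingRegressor(random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
# scikit-learn reports negated RMSE so that higher is always better.
scores = cross_val_score(model, X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print("mean RMSE over 5 folds:", -scores.mean())

# Fit on all rows to rank predictors by importance.
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```

A predictor that ranks at or below rand0, rand1, and rand2 is carrying about as much signal as noise for that target.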

From the results above, we can look into each of the target variables and their relationship with the predictors.

Again, the step-by-step procedure of this test can be found in the Test for Multicollinearity section in How to Get MORE YouTube Views with Machine Learning techniques.

We can see that there is a strong relationship between length and calories.

Let’s use a scatter plot to visualize them: the x-axis as length and the y-axis as calories, while the size of the dots represents the views.

We can see that the longer the video, the more calories are burned, which is intuitive. We can also see that the videos with more views tend to have a shorter length.
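A scatter plot with sized dots can be sketched as below. The data is synthetic, with calories built from length, so only the length and calories trend is meaningful here; views is random.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 150
# Synthetic stand-in data.
df = pd.DataFrame({"length": rng.integers(10, 61, n).astype(float)})
df["calories"] = 9 * df["length"] + rng.normal(0, 20, n)
df["views"] = rng.lognormal(11, 0.5, n)

fig, ax = plt.subplots(figsize=(8, 5))
# size= maps views to dot area; sizes= sets the smallest and largest dot.
sns.scatterplot(data=df, x="length", y="calories", size="views",
                sizes=(20, 200), alpha=0.5, ax=ax)
```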

