In this tutorial, we’ll summarize essential statistics concepts for data science.
Statistics provides many of the backbone theories and techniques for data science and machine learning, and it’s a skill that employers look for in data scientists. A data analyst or scientist needs core statistics knowledge to perform appropriate data analysis. In this tutorial, you’ll learn practical statistics for data science:
- What is statistics for data science.
- What are categorical and numerical data types.
- All the popular descriptive statistics with Python examples.
- How to use inferential statistical methods.
- The commonly used probability distributions.
- What are hypothesis testing and regression analysis.
If you’re a beginner, you can follow this tutorial to learn the important statistical concepts required for data science. If you’ve already learned the basics of statistics, it can serve as a reference or a cheat sheet.
Let’s get started!
- What is Statistics for Data Science?
- Types of Data
- Statistical Method #1: Descriptive Statistics
- Statistical Method #2: Inferential Statistics
What is Statistics for Data Science?
Statistics is the science that deals with the collection, analysis, interpretation, and presentation of data. Data science is a hybrid of different fields, including statistics, mathematics, and computer science. Even a data scientist with strong programming skills can’t produce sound analyses without basic statistical theories and techniques.
So let’s look at some key terms to begin the applied statistics for data science tutorial.
When trying to solve a problem using statistics, we often want to study a particular population. Because populations are usually large, it’s not always feasible to collect data on every member, so we often focus the analysis on a chosen subset called a sample. For example, we can survey a sample of 1000 customers to infer the behavior of all the company’s customers.
During this process, different experimental designs and/or random sampling methods are used. The goal is to make sure the sample is representative of the population. That way, we can extend the conclusions and inferences drawn from the sample analysis to the entire population.
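As a rough illustration of sampling, here is a minimal sketch of simple random sampling with Python’s standard library. The population is simulated and every number in it is hypothetical; the point is only that a random sample’s mean tends to land close to the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: monthly spending (in dollars) of 100,000 customers.
population = [random.gauss(500, 100) for _ in range(100_000)]

# Simple random sample of 1000 customers, drawn without replacement.
sample = random.sample(population, k=1000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice we only observe the sample; the population mean is printed here just to show how close the estimate is.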
Although these terms may sound unfamiliar, statistical concepts are used frequently in daily life. Some common applications you have probably encountered include:
- the average age or income of the population in your country reported in the news.
- summary statistics of a sports team performance.
- stock price analysis.
So you are likely already using statistics without noticing!
Since data is the core of statistics, let’s look at the main categories of data in statistics.
Types of Data
It’s critical to understand the difference between types of data. Each type calls for different analysis techniques, which you will see in the examples later.
Most data can be divided into two types: numerical and categorical.
Numerical (quantitative) data takes the form of numbers and has a mathematical meaning. We can perform mathematical operations on it to get measures like the minimum, maximum, and average. Numerical data can be further divided into discrete (countable values, such as counts) and continuous (any value within a range, such as measurements).
Categorical (qualitative) data usually takes the form of names or labels that represent different categories. Because the labels carry no arithmetic meaning, even when they are coded as numbers, we can’t apply measures such as the mean to them. Categorical data also has subcategories: nominal (without order) and ordinal (ordered or ranked).
For example, when collecting data about a group of students:
- the height is continuous numerical data.
- the number of classes taken is discrete numerical data.
- the gender is nominal categorical data.
- the level of their happiness (not happy/OK/happy) today is ordinal categorical data.
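The student example above can be sketched in Python to show how the data type decides which summary makes sense. The records below are made up for illustration, and the choice of summaries (mean, sum, mode, median level) is one reasonable option rather than the only one.

```python
import statistics

# Hypothetical records for five students (made-up values for illustration).
students = [
    {"height_cm": 165.2, "classes": 4, "gender": "F", "happiness": "happy"},
    {"height_cm": 178.9, "classes": 5, "gender": "M", "happiness": "OK"},
    {"height_cm": 171.4, "classes": 3, "gender": "F", "happiness": "not happy"},
    {"height_cm": 160.0, "classes": 4, "gender": "F", "happiness": "happy"},
    {"height_cm": 182.3, "classes": 6, "gender": "M", "happiness": "OK"},
]

# Continuous and discrete numerical data: arithmetic is meaningful.
mean_height = statistics.fmean(s["height_cm"] for s in students)
total_classes = sum(s["classes"] for s in students)

# Nominal categorical data: no order, so we count labels and take the mode.
gender_mode = statistics.mode([s["gender"] for s in students])

# Ordinal categorical data: the labels have a rank, so a median level exists.
levels = ["not happy", "OK", "happy"]  # lowest to highest
ranks = sorted(levels.index(s["happiness"]) for s in students)
median_happiness = levels[statistics.median_low(ranks)]

print(mean_height, total_classes, gender_mode, median_happiness)
```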
Now, with the overall picture of data types, let’s look at different statistical methods to extract information from the data. The two main methods are descriptive statistics and inferential statistics. We’ll learn them one by one.
Statistical Method #1: Descriptive Statistics
After loading in a dataset, the first thing to do is data exploration. We often use descriptive statistics (numerical measures or graphs) to organize and summarize the sample dataset.
It’s essential to look at some descriptive statistics before starting the analysis. We not only need to understand the data, but also need to clean the errors, missing values, or unexpected values before moving further.
Let’s see some popular descriptive statistics for data science. We’ll first go through the definitions, and the examples in Python will follow.
Measures of Center
- Mean (Average): the sum of the values divided by the number of values of the dataset. It’s the most commonly used measure of center.
Let’s assume we have a dataset x1, …, xN.
x̄ = (x1 + … + xN)/N = ( Σ xi ) / N
- Median: when the data is ordered by numerical values, the median splits the data into two equal parts. It does not have to be one of the observed data values.
When there are outliers (extreme values of data), the median is a better measure of the center than the mean since it’s not as affected by the extreme values. One disadvantage of the median is that it takes longer to compute than the mean, which becomes significant for massive datasets.
- Mode: the value(s) that appear the most frequently in the dataset. There could be more than one mode since multiple values can have the same highest frequency of appearance.
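The three measures of center above can be computed with Python’s built-in `statistics` module. This is a small sketch on hypothetical spending values, with one deliberate outlier to show how the mean and median react differently.

```python
import statistics

# Hypothetical monthly spending for seven customers; 5000 is an outlier.
spending = [420, 450, 450, 480, 500, 520, 5000]

mean_value = statistics.mean(spending)      # pulled upward by the outlier
median_value = statistics.median(spending)  # middle value of the sorted data
modes = statistics.multimode(spending)      # every value tied for most frequent

# A dataset can have more than one mode.
two_modes = statistics.multimode(["a", "b", "a", "b", "c"])

print(mean_value, median_value, modes, two_modes)
```

The mean lands far above most of the values, while the median stays in the middle of the typical customers.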
Measures of Spread/Variation
After studying the center of the dataset, it’s also important to know the variation of the data from the center.
- Standard Deviation: a number measuring how spread out the data is around the mean. The standard deviation is always greater than or equal to zero.
The larger the value, the more spread out the data is from the mean. For example, suppose two groups of customers both have average spending of $500 per month, but groups 1 and 2 have standard deviations of 10 and 100, respectively. We can expect more variation in spending in group 2 since it has a higher standard deviation.
We also often use the standard deviation to describe how far a data value is from the mean. For example, if a data value = mean + 2*standard deviation, it is two standard deviations above the mean.
Let’s assume we have a sample of x1, …, xN. The standard deviation is calculated as the square root of the average of the squared deviations from the mean. There’s a slight difference between the sample and population calculations (the sample version divides by N-1, the population version by N), but it doesn’t make a big difference when the dataset (N) is large.
- Variance: the average of the squares of the deviations from the mean, i.e., the square of the standard deviation.
- Range: the difference between the largest and smallest values. This gives us an idea of the spread.
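To make the N versus N-1 distinction concrete, here is a minimal sketch that computes the measures of spread by hand and checks them against the standard library. The five data values are hypothetical.

```python
import math
import statistics

# Hypothetical sample of five values.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(data)
mean_value = sum(data) / n

squared_deviations = [(x - mean_value) ** 2 for x in data]

population_variance = sum(squared_deviations) / n    # divide by N
sample_variance = sum(squared_deviations) / (n - 1)  # divide by N-1
sample_std = math.sqrt(sample_variance)

data_range = max(data) - min(data)

# The standard library implements the same formulas.
assert math.isclose(sample_variance, statistics.variance(data))
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_std, statistics.stdev(data))

print(population_variance, sample_variance, sample_std, data_range)
```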
Measures of Location
- Percentiles (Quartiles): when the data is ordered from smallest to largest, a percentile is the value below which a given percentage of the data falls.
Quartiles are special percentiles that separate the data into quarters. The first quartile is the same as the 25th percentile, and the third quartile is the same as the 75th percentile. The median is called both the second quartile and the 50th percentile.
The percentiles are useful when comparing values and defining outliers.
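As a sketch of how quartiles can be used to flag outliers, the code below computes the quartiles of a hypothetical sample and applies the 1.5 × IQR rule. That rule is a common convention assumed here for illustration; it is not something this tutorial prescribes.

```python
import statistics

# Hypothetical sample; 40 is deliberately far from the rest.
data = [1, 3, 4, 5, 6, 7, 8, 9, 11, 40]

# method="inclusive" interpolates linearly between the observed values,
# treating the data's min and max as the 0th and 100th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))

# Assumed convention (not from the text above): flag values more than
# 1.5 * IQR outside the quartiles as potential outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number_summary, outliers)
```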
Plots / Graphs
- Box plots (box-and-whisker plots): the plot gives a good illustration of the spread, skewness, and center of the data, including the outliers. It’s used for numerical data.
The plot is built based on the five-number summary: the minimum, the first quartile, the median, the third quartile, and the maximum. We’ll explain more details in the example below.
- Histogram: this graph is an approximate representation of the distribution of the data, which shows the shape, center, and spread of the data. It’s mostly used for numerical data.
Before building the plot, we need to select consecutive, non-overlapping bins that cover the range of the variable. Then we count the frequency, the number of values that fall into each bin. The horizontal axis shows the variable by bins, while the vertical axis represents the frequency (or relative frequency).
- Bar graphs: this chart is useful for comparing the relative size of different categories for categorical data. In a bar chart, each bar’s length or height is proportional to the value its category represents.
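The binning step behind a histogram can be sketched without any plotting library. The values and bin edges below are hypothetical, and the text rendering only stands in for a real chart.

```python
# Hypothetical numerical sample and hand-picked bin edges.
values = [3, 7, 12, 15, 18, 21, 22, 25, 29, 36]
bin_edges = [0, 10, 20, 30, 40]

# Consecutive, non-overlapping bins: [0, 10), [10, 20), [20, 30), [30, 40].
# Only the last bin includes its right edge.
counts = [0] * (len(bin_edges) - 1)
for v in values:
    for i in range(len(counts)):
        last_bin = i == len(counts) - 1
        in_bin = bin_edges[i] <= v < bin_edges[i + 1]
        if in_bin or (last_bin and v == bin_edges[-1]):
            counts[i] += 1
            break

relative_frequencies = [c / len(values) for c in counts]

# Text rendering: one row per bin, bar length equals the frequency.
for i, count in enumerate(counts):
    print(f"[{bin_edges[i]:>2}, {bin_edges[i + 1]:>2}) {'#' * count}")
```

Changing the bin edges changes the shape of the histogram, which is why the bin choice matters.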
We’ve covered a lot of definitions, so let’s see some examples.
Descriptive Stats in Python: Numerical Sample
First, let’s look at a numerical sample dataset.
Note: you don’t have to know Python to understand the examples since the code is straightforward to read.
We constructed a random sample of 10 numerical values. They are continuous numerical data.
We can calculate its mean, which is 7.5882702497944905.
We can also calculate its sample standard deviation (7.215113981338963) and variance (52.05786976371297). Note that the variance is the square of the standard deviation.
And the range, the max minus the min values, is 23.54768608322609.
We can also look at its five-number summary (min, 25th, 50th, 75th percentiles, and max), which is (-1.5019951, 3.33621042, 6.15884524, 11.63802853, 22.04569099).
The minimum and maximum values are easy to spot by looking at the sample.
Note that the 50th percentile is also the median or the second quartile, and here it is not an observed value in the sample. It’s (8.17089294 + 4.14679754)/2, so half of the data is greater than it and the other half is less.
The 25th percentile is 3.336, which is between 3.32552029 and 3.36828081. The default method in NumPy’s percentile function uses linear interpolation (3.32552029 + (3.36828081-3.32552029)*.25) to get the percentile value. Most of the time, you won’t have to understand the detailed calculation.
Based on the above five-number summary, let’s look at the boxplot. The ends of the “whiskers” are the min and max from the sample. The first and third quartiles are the two ends of the box, while the median is the line inside the box.
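The arithmetic quoted above can be rechecked in a few lines. This sketch uses only the numbers given in the text (the full sample is not reproduced here) and assumes the linear interpolation method described above.

```python
# Values quoted in the text above.
minimum, maximum = -1.5019951, 22.04569099
std_dev, variance = 7.215113981338963, 52.05786976371297

# Range: max minus min.
data_range = maximum - minimum

# Median of 10 values: the average of the two middle values.
median = (4.14679754 + 8.17089294) / 2

# 25th percentile of 10 sorted values under linear interpolation:
# position 0.25 * (10 - 1) = 2.25, i.e. a quarter of the way between
# the two neighboring sorted values.
lower, upper = 3.32552029, 3.36828081
q1 = lower + (upper - lower) * 0.25

# The variance should equal the square of the standard deviation.
variance_from_std = std_dev ** 2

print(round(data_range, 8), round(median, 8), round(q1, 8))
```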
We can also plot its histogram.
The histogram uses bins chosen automatically by the plotting library, and the height of each bar shows the frequency count.
Descriptive Stats in Python: Categorical Sample
Let’s also look at a categorical example.
We generated a random sample of categorical data in Python: (‘D’, ‘E’, ‘C’, ‘E’, ‘C’, ‘E’, ‘B’, ‘D’, ‘B’, ‘E’).
We can look at its frequency count.
The Python code returns (array(['B', 'C', 'D', 'E'], dtype='<U1'), array([2, 2, 2, 4], dtype=int64)). We can verify that category ‘B’ appears twice in the sample, as do categories ‘C’ and ‘D’, while category ‘E’ appears four times.
We can also find its mode, which is category E, the most frequent value.
A popular plot for categorical data is the bar plot.
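The frequency count and mode for this sample can be reproduced with a short sketch. The output quoted above has the shape of a NumPy result; the version below uses `collections.Counter` from the standard library instead, which is a substitution made here to keep the example dependency-free.

```python
from collections import Counter

# The categorical sample quoted in the text above.
sample = ["D", "E", "C", "E", "C", "E", "B", "D", "B", "E"]

counts = Counter(sample)

# Sorted by label, mirroring the (labels, counts) output shown above.
labels = sorted(counts)
frequencies = [counts[label] for label in labels]

# The mode is the label with the highest count.
mode, mode_count = counts.most_common(1)[0]

# A text bar chart: bar length is proportional to the frequency.
for label in labels:
    print(label, "#" * counts[label])
```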
To learn more about Python basics, check out our FREE Python for Data Science course.
With basic knowledge of Python and statistics, check out How to use Python Seaborn for Exploratory Data Analysis for more graphs and plots in Python.
Statistical Method #2: Inferential Statistics
As mentioned earlier, we often have to draw conclusions about a population by studying a sample. Statistical inference is the formal process of using probability theory to determine how confident we are in those conclusions. So unlike descriptive statistics, which only describe properties of the observed data, inferential statistics makes inferences about the population.
The inferential statistics analysis assumes the sample is from a broader population. For example, we could assume the sample is generated from a specific probability distribution. To infer the characteristics of the assumed population, we need to conduct hypothesis testing, parameter estimation, or regression analysis.
Let’s first look at some basic probability distributions. Each distribution has parameters that determine its characteristics, and these can be estimated with statistics derived from the sample dataset.
When facing probability problems, we often try to find patterns or distribution functions to fit the data. These distributions with well-studied characteristics make solving these problems easier.
What are some of the commonly used distributions in data science?
Normal (Gaussian) Distribution
The normal (Gaussian) distribution is the most important probability distribution. It’s a continuous distribution with the characteristic symmetric bell-shaped probability density function. Even though the normal distribution doesn’t always appear in reality, it’s often assumed in statistical inferences due to the powerful Central Limit Theorem.
The normal distribution has two parameters: the mean and the standard deviation. The standard normal distribution is a special case with the mean equal to 0 and the standard deviation being 1.
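Here is a minimal sketch of the standard normal distribution using `statistics.NormalDist`, followed by an informal simulation of the Central Limit Theorem. The sample size, number of repetitions, and seed are arbitrary choices for illustration.

```python
import random
import statistics
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within 1 and 2 standard deviations of the mean.
within_one = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(f"within 1 sd: {within_one:.4f}, within 2 sd: {within_two:.4f}")

# Central Limit Theorem, informally: means of samples drawn from a
# non-normal distribution (here, uniform on [0, 1]) cluster around the
# population mean (0.5) in an approximately normal way.
random.seed(0)
sample_size = 50
sample_means = [
    sum(random.random() for _ in range(sample_size)) / sample_size
    for _ in range(2000)
]
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
print(f"mean of sample means: {mean_of_means:.3f}, spread: {spread_of_means:.3f}")
```

The spread of the sample means should be close to the population standard deviation divided by the square root of the sample size.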
Gamma Distribution
The Gamma distribution is a family of continuous distributions that can be used to model positive numbers (0 to +infinity). The distributions in this family take various shapes depending on the combination of the two parameter values.
Some popular special cases of Gamma distribution include the exponential distribution and the chi-square distribution. The exponential distribution is often used to model the time between events, while the chi-square distribution is used widely in various hypothesis tests of inferential statistics.
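The relationship between the Gamma and exponential distributions can be sketched by simulation with the standard library’s `random` module. The shape, scale, and rate values are arbitrary, and the parameterization (shape and scale) follows Python’s `random.gammavariate`, which may differ from other sources.

```python
import random
import statistics

random.seed(1)
n_draws = 20_000

# Gamma with shape 2 and scale 3: positive values, mean = shape * scale = 6.
shape, scale = 2.0, 3.0
gamma_draws = [random.gammavariate(shape, scale) for _ in range(n_draws)]

# Exponential with rate 0.5 (mean 1 / 0.5 = 2), e.g. time between events.
rate = 0.5
exp_draws = [random.expovariate(rate) for _ in range(n_draws)]

# The exponential is the Gamma special case with shape 1 and scale 1 / rate.
gamma_as_exp = [random.gammavariate(1.0, 1.0 / rate) for _ in range(n_draws)]

gamma_mean = statistics.fmean(gamma_draws)
exp_mean = statistics.fmean(exp_draws)
gamma_as_exp_mean = statistics.fmean(gamma_as_exp)
print(f"{gamma_mean:.2f} {exp_mean:.2f} {gamma_as_exp_mean:.2f}")
```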
Uniform Distribution (continuous)
The Uniform distribution is one of the easiest continuous distributions to understand. It describes an experiment in which every outcome between two boundaries a and b is equally likely. It’s often used in simulations and Bayesian analysis.
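A short sketch of what “equally likely” means for a continuous uniform distribution: the probability of an interval depends only on its length. The boundaries a = 2 and b = 10 and the helper name `uniform_cdf` are hypothetical choices for this example.

```python
import random


def uniform_cdf(x: float, a: float, b: float) -> float:
    """P(X <= x) for X uniform on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


a, b = 2.0, 10.0

# Sub-intervals of equal length are equally likely.
p_3_to_5 = uniform_cdf(5, a, b) - uniform_cdf(3, a, b)
p_7_to_9 = uniform_cdf(9, a, b) - uniform_cdf(7, a, b)

# Simulation check with a fixed seed.
random.seed(2)
draws = [random.uniform(a, b) for _ in range(10_000)]
share_3_to_5 = sum(3 <= d <= 5 for d in draws) / len(draws)

print(p_3_to_5, p_7_to_9, share_3_to_5)
```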
Binomial Distribution
The Binomial distribution is a basic discrete distribution. It’s useful for modeling the number of successes in n independent and identical trials, each with a binary outcome such as success/failure. The two parameters of the Binomial distribution are n (the fixed number of trials) and p (the probability of success for each trial).
For example, we could use it to calculate the probability of rolling a six a given number of times when rolling a fair, six-sided die independently ten times.
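The die example can be worked out directly from the Binomial probability formula. This is a minimal sketch; the helper name `binomial_pmf` is introduced here for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 1 / 6  # ten rolls of a fair die; "success" means rolling a six

# Probability of each possible number of sixes, from 0 to 10.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

p_no_six = pmf[0]
p_exactly_two = pmf[2]
p_at_least_one = 1 - pmf[0]
expected_sixes = sum(k * prob for k, prob in enumerate(pmf))  # equals n * p

print(f"P(0 sixes) = {p_no_six:.4f}, P(2 sixes) = {p_exactly_two:.4f}")
print(f"P(at least one six) = {p_at_least_one:.4f}")
```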
Hypothesis Testing
Hypothesis testing involves collecting data from a sample and evaluating it. Based on the analysis of the data, we decide whether there is sufficient evidence to reject the null hypothesis.
The general procedure of hypothesis testing can be summarized as follows:
- Based on the statistical problem, determine two hypotheses with opposing conclusions: the null and the alternative hypothesis. We usually set what we are trying to prove as the alternative hypothesis.
For example, suppose we want to test whether customers spend less than $1000 per month on average. The null hypothesis could be average_spending >= 1000, and the alternative hypothesis could be average_spending < 1000.
- Collect sample data, which serves as the evidence for evaluating the hypotheses.
- Analyze sample data by calculating the test statistics based on certain assumptions/distributions.
- Decide to either reject or not reject the null hypothesis. When the null is rejected, we say there’s enough evidence to support the alternative.
How do we make the decision?
The decision relies on another essential concept, the p-value. It’s calculated from the test statistic of the sample and its distribution. Assuming the null hypothesis is true, the p-value is the probability of obtaining results at least as extreme as those observed in the current sample. Once we have the p-value, we compare it to a preset significance level α (0.05 is the most commonly used level):
- when p-value < α, reject the null hypothesis. There’s enough evidence to support the alternative.
- when p-value >= α, do not reject the null hypothesis. The sample has failed to provide sufficient evidence against the null.
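The procedure above can be sketched for the customer spending example. The sketch below assumes a one-sample, left-tailed z-test, which is only one of several possible tests and relies on the sample being large enough for the sample mean to be approximately normal. The data is simulated, and the function name `left_tailed_z_test` is hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist


def left_tailed_z_test(sample, null_mean):
    """Return (z, p_value) for H0: mean >= null_mean vs H1: mean < null_mean.

    Assumes a sample large enough for the sample mean to be approximately
    normal (a z-test; this is an assumption of the sketch).
    """
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - null_mean) / standard_error
    p_value = NormalDist().cdf(z)  # chance of a z this low or lower under H0
    return z, p_value


# Hypothetical data: simulated monthly spending for 200 customers.
random.seed(3)
spending = [random.gauss(960, 250) for _ in range(200)]

alpha = 0.05
z, p_value = left_tailed_z_test(spending, null_mean=1000)
reject_null = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Whether the null is rejected depends on the simulated data; the sketch only shows how the test statistic, the p-value, and α fit together.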
Regression Analysis
So far, we’ve mainly talked about a single variable. What if the data contains multiple variables? How do we study the relationships among them?
Regression analysis measures the relationship between one or more independent variables and one dependent variable. It can be thought of as a form of statistical inference as well.
The most basic model is linear regression. For details, please check out Linear Regression in Machine Learning: Practical Python Tutorial. This tutorial covers both theoretical explanations and practical Python code for Linear Regression.
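As a quick taste of linear regression, here is a minimal sketch of a least-squares fit with one independent variable, written with the standard library only. The data is made up, and the function name `fit_simple_linear_regression` is hypothetical.

```python
import statistics


def fit_simple_linear_regression(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_mean, y_mean = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data: advertising spend (x) and sales (y), made-up numbers.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_linear_regression(ad_spend, sales)
predicted = intercept + slope * 6.0  # prediction for a new x value

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The slope estimates how much the dependent variable changes for a one-unit change in the independent variable.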
That’s all the core statistics for data science!
You should now have a good overall understanding of the statistical concepts used in data science. Each concept can be explored in much more detail, so keep discovering and learning!
Leave a comment for any questions you may have or anything else.