In this tutorial, we are showing how to GroupBy with a foundation Python library, Pandas.
We can’t do data science/machine learning without Group by in Python. It is an essential operation on datasets (DataFrame) when doing data manipulation or analysis.
In this complete guide, you’ll learn (with examples):
- What is a Pandas GroupBy (object).
- How to create summary statistics for groups with aggregation functions.
- How to create like-indexed objects of statistics for groups with the transformation method.
- How to use the flexible yet less efficient apply function.
- How to use custom functions for multiple columns.
If you want to master this important technique with hands-on examples, don’t miss this guide.
Let’s begin now!
What are Pandas and GroupBy?
Pandas is a powerful and easy to use open-source Python data analysis and manipulation tool. It offers data structures and operations for numerical tables and time series. pandas can be used to import data, manipulate, and clean data.
It is a must-know package for data science.
Related article: How to Install/Setup Python and Prep for Data Science NOW
Pandas can be downloaded with Python by installing the Anaconda distribution. Check out this step-by-step guide.
According to Pandas documentation, “group by” is a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
This can be used to group large amounts of data and compute operations on these groups. The idea should be familiar to those who have used SQL’s SELECT and Group By statement.
Let’s see a simple example to learn the idea.
Suppose we have a DataFrame with two columns type and amount. The GroupBy process is visualized below:
- splitting the dataset into two groups based on types A and B, then
- applying the function (for example, average) on each group, then
- combining the results as a table
We’ll be going through the most common use cases for GroupBy in data science based on our experience. We recommend downloading the same dataset (with the link below) and practice along.
While these should be a good starting point, you can always search for more details in the Pandas Group By documentation.
Before we start exploring GroupBy, let’s import the packages and read the data in Python.
We are importing pandas together with some other necessary packages.
The dataset we are going to look at is some real anonymized transactions of a Czech bank from 1993 to 1999.
Upon loading into Python, we also clean the data to prepare it for analysis by:
- converting the date of the transaction to datetime
- translating the type of the transaction to English
- translating the operation type of the transaction to English
- filling in missing values of type and operation as ‘Unknown’
We’ll be using these two variables as groups
- sorting the DataFrame by account_id and date
- setting the date variable as index
Now we can look at the top 10 rows of the dataset df:
Related article: How to apply Unsupervised Anomaly Detection on bank transactions
Within the previous article, the same dataset was used for anomaly detection. Check it out if you are interested.
- What are Pandas and GroupBy?
- GroupBy Object: Splitting an Object into Groups
- Aggregation: Summary Statistics for Groups
- Transformation: Like-indexed Object with Statistics for Groups
- Apply: Flexible Statistics for Groups but Less Efficient
GroupBy Object: Splitting an Object into Groups
We are going to use variables type and operation to group the data. Let’s see what categories they have.
The type feature marks the transactions as WITHDRAWAL, CREDIT, or NOT SURE.
The operation feature tags the transactions in more detail of its specific types.
To start the groupby process, we create a GroupBy object called grouped. This helps in splitting the pandas objects into groups.
By using the type function on grouped, we know that it is an object of pandas.core.groupby.generic.DataFrameGroupBy.
The GroupBy object has methods we can call to manipulate each group.
For example, we can use the groups method to get a dictionary with:
- keys being the groups and
- values being the indices of the rows within the groups, which is the date.
We can also get the DataFrame of a group using the get_group method. For example, we can create a new dataset with only the “credit card withdrawal”.
Other than these basic methods, we can also apply functions to the groups.
The main functions we’ll cover are aggregation, transformation, and apply. Let’s look at them one-by-one.
Aggregation: Summary Statistics for Groups
When we use the aggregating functions on a GroupBy object, they are applied to all the columns by default. The resulting output is a DataFrame with the group name as the index.
Example(s) #1: Single Aggregating Function on Multiple Columns
Let’s see an example with the common aggregating functions mean.
We can see that this function applies only to the numerical columns.
And obviously, it doesn’t make sense to apply the mean on trans_id (transaction ID), account_id (account ID), and account (account number).
So let’s filter for specific variables to apply the mean aggregating function.
This looks much better. For example, the average transaction amount for withdrawals in cash is 5095.
There are also other common aggregating functions. Let’ see more examples.
The describe function generates basic descriptive statistics such as:
- standard deviation
- the main quartiles
The min, median, max, and count functions compute the min, median, max, count of the group values.
The nunique function displays the distinctive count for the groups.
We are not showing the result of the min, median, max, and count since they are straight-forward.
The nunique function’s result is below. We can see that trans_id is unique, while one account_id could have multiple transactions.
Related article: The above are the most used groupby functions. You may also check out the pandas document for a full list.
Example #2: Multiple Aggregating Functions on Multiple Columns
So far, we’ve been only applying a single aggregating function at a time. What if we want to apply multiple functions at once?
We can use the aggregate (agg) method to call multiple functions.
Example #3: Custom Defined Function on Multiple Columns
We can also define custom functions and use the agg method.
The example below defines a new function calculating the square root for the sum of the groups.
Example #4: Custom Output Name on Multiple Columns
What if we want to change the output column names?
We can use the named aggregation syntax.
A dictionary is created with:
- the keys being the name of the output column
- the values are tuples with:
– the first element being the column to select and
– the second element being the aggregating function to apply on that column
Then we use the ** operation to unpack the dictionary into the function.
It might sound a little confusing. Let’s see an example below.
The aggs dictionary defines the new output columns sum_amount, avg_balance, and avg_amount.
For example, sum_amount is the new column name where the sum function is applied to the amount column.
Transformation: Like-indexed Object with Statistics for Groups
In this section, we’ll look at the transform method.
It also allows us to apply functions to each group. What’s the difference between transform and aggregation?
The transform method returns a result as an object that is indexed the same (size) as the one being grouped.
The first argument to the transform function must be a series (the group). It must either return a series with the same number of rows as the group chunk or broadcastable to the size of the group chunk. Usually, the broadcastable part just means a single scalar value.
That sounds confusing. Don’t worry. It will be more clear with examples.
Example #1: Custom Defined Function on Multiple Columns – Range
We first define a function max_minus_min, which returns a scalar value: the range (max – min). As you can imagine, this range value is the same for every member of the same group.
We can then use the transform method to apply this function on the variables amount and balance.
This function returns a DataFrame with the two range variables. This object has the same size (number of rows) and index (date) as the original dataset df.
Since df_range has the same order of rows as df, we can add these two new range variables back to the original dataset for further analysis.
You can print out the head of df to see how the amount_range and balance_range are the same for rows of the same groups. We won’t show it here.
Example #2: Custom Defined Function on Multiple Columns – Percentage
Another common example would be getting the percentage of the value among its group.
We define the x_pct function with input x as a series. And it returns the percentage series the same size as x.
Then we use the transform method to pass on this new function to variables amount and balance.
Again, you can add these two variables back as Example #1 to the original dataset for further analysis.
Example #3: Custom Defined Function on Multiple Columns – Time Series
The most common usage of transform for us is creating time series features. It is handy when we need to use a rolling window to calculate things that happened in a previous time frame.
Please note that pandas does have a rolling function. But when we need to apply the function to groups, the best way is to use GroupBy’s transform method.
Since the dataset is for bank transactions. We often want to see the volume transacted within the past n days, for each particular account ID.
These types of features are useful for customer behavior analysis, such as fraud analytics, anti-money laundering detection.
Related article: How to apply Unsupervised Anomaly Detection on bank transactions.
Check out this article using the same dataset for analysis.
The Python code below does the following steps to create the rolling variables:
- create a GroupBy object by account_id since we want to look at account level activities
- define the function sum_n_days with input x (a series) and n (integer). The x is a times series on which we can calculate the sum of its value, within the last n days
- use the transform method to apply this function on the amount and balance variables
- add back these two new variables to the original dataset
We can check an example when account_id = 2 to make sure it’s correct.
As you can see from row 4, there are two transactions within the past 5 days (including the current one on 1993-03-31). So amount_5days is 3713.5 (3700 + 13.5).
Apply: Flexible Statistics for Groups but Less Efficient
Lastly, let’s talk about the apply function, which is also a method to apply functions group-wise.
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results.
For these, use thepandas Flexible apply documentation
applyfunction, which can be substituted for both
transformin many standard use cases. However,
applycan handle some exceptional use cases.
For us, apply is more flexible because we can access multiple columns in the function, which you couldn’t do in aggregation and transform.
Everything sounds great about apply so far, but:
While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform.
Pandas offers a wide range of methods that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply.pandas apply documentation
For this reason, we haven’t found much need to use it. It is still a popular function, so try it out yourself to decide whether to use it.
First, we also need to define a function that will be passed onto apply. The function must take a DataFrame as its first argument and return a DataFrame, series, or scalar. apply will then take care of combining the results back together into a single dataframe or series.
Example #1: Multiple Columns Returning a DataFrame
In the first example, we define a function f which returns a DataFrame with:
- amount_div_balance: the percentage of amount out of balance of the account
- amount_zscore: the z score for the amount
As you can see, apply can have functions using multiple columns from the dataset (amount and balance).
Then we pass on the function f to apply on amount and balance.
Again, you can use the same method as in Example #1 of transform, to add these new variables to the original dataset df for further analysis.
Example #2: Multiple Columns Returning a Scalar
Let’s see another example with a function that uses multiple columns.
The function g uses amount and balance to calculate their mean’s comparison, which is a scalar value.
Then we can use apply again to apply the function g.
Since we return a scalar value, the result doesn’t have the same number of rows as the original dataset. The number of rows is the same as the number of groups, which is the number of account IDs.
We did it! Now you should be able to GroupBy with Pandas (an essential skill for data science) like a boss.
Leave a comment for any questions you may have or anything else.