In this tutorial, we are showing how to **GroupBy **with a foundation Python library, **Pandas**.

We *can’t *do data science/machine learning without Group by in Python. It is an essential operation on datasets (DataFrame) when doing data manipulation or analysis.

In this complete guide, you’ll learn (*with examples*):

- What is a Pandas
**GroupBy**(object). - How to create summary statistics for groups with
**aggregation**functions. - How to create like-indexed objects of statistics for groups with the
**transformation**method. - How to use the flexible yet less efficient
**apply**function. - How to use
**custom****functions**for**multiple columns**.

If you want to master this important technique with hands-on examples, don’t miss this guide.

Let’s begin now!

## What are Pandas and GroupBy?

**Pandas** is a powerful and easy to use open-source Python data analysis and manipulation tool. It offers data structures and operations for numerical tables and time series. pandas can be used to import data, manipulate, and clean data.

It is a must-know package for data science.

**Related article**: How to Install/Setup Python and Prep for Data Science NOW

Pandas can be downloaded with Python by installing the Anaconda distribution. Check out this step-by-step guide.

According to Pandas documentation, “group by” is a process involving one or more of the following steps:

**Splitting**the data into groups based on some criteria.**Applying**a function to each group independently.**Combining**the results into a data structure.

This can be used to group large amounts of data and compute operations on these groups. The idea should be familiar to those who have used SQL’s SELECT and Group By statement.

Let’s see a simple example to learn the idea.

Suppose we have a DataFrame with two columns *type *and *amount*. The GroupBy process is visualized below:

- splitting the dataset into two groups based on types A and B, then
- applying the function (for example, average) on each group, then
- combining the results as a table

We’ll be going through the most common use cases for GroupBy in data science based on our experience. We recommend downloading the same dataset (with the link below) and practice along.

While these should be a good starting point, you can always search for more details in the Pandas Group By documentation.

Before we start exploring GroupBy, let’s import the packages and read the data in Python.

We are importing pandas together with some other necessary packages.

The dataset we are going to look at is some real anonymized transactions of a Czech bank from 1993 to 1999.

Upon loading into Python, we also clean the data to prepare it for analysis by:

- converting the
*date*of the transaction to datetime - translating the
*type*of the transaction to English - translating the
*operation*type of the transaction to English - filling in missing values of type and operation as ‘Unknown’

We’ll be using these two variables as groups - sorting the DataFrame by
*account_id*and*date* - setting the date variable as index

Now we can look at the top 10 rows of the dataset *df*:

**Related article**: How to apply Unsupervised Anomaly Detection on bank transactions

Within the previous article, the same dataset was used for anomaly detection. Check it out if you are interested.

## GroupBy Object: Splitting an Object into Groups

We are going to use variables type and operation to group the data. Let’s see what categories they have.

The type feature marks the transactions as WITHDRAWAL, CREDIT, or NOT SURE.

The operation feature tags the transactions in more detail of its specific types.

To start the groupby process, we create a *GroupBy *object called *grouped*. This helps in splitting the pandas objects into groups.

By using the type function on grouped, we know that it is an object of pandas.core.groupby.generic.DataFrameGroupBy.

The GroupBy object has methods we can call to manipulate each group.

For example, we can use the *groups *method to get a dictionary with:

- keys being the groups and
- values being the indices of the rows within the groups, which is the date.

We can also get the DataFrame of a group using the *get_group *method. For example, we can create a new dataset with only the “credit card withdrawal”.

Other than these basic methods, we can also apply functions to the groups.

The main functions we’ll cover are *aggregation, transformation, and apply*. Let’s look at them one-by-one.

## Aggregation: Summary Statistics for Groups

When we use the aggregating functions on a GroupBy object, they are applied to all the columns by default. The resulting output is a DataFrame with the group name as the index.

### Example(s) #1: Single Aggregating Function on Multiple Columns

Let’s see an example with the common aggregating functions *mean*.

We can see that this function applies only to the numerical columns.

And obviously, it doesn’t make sense to apply the mean on *trans_id *(transaction ID), *account_id *(account ID), and *account *(account number).

So let’s filter for specific variables to apply the mean aggregating function.

This looks much better. For example, the average transaction amount for withdrawals in cash is 5095.

There are also other common aggregating functions. Let’ see more examples.

The *describe *function generates basic descriptive statistics such as:

- count
- mean
- standard deviation
- minimum
- maximum
- the main quartiles

The *min, median, max,* and *count *functions compute the min, median, max, count of the group values.

The *nunique *function displays the distinctive count for the groups.

We are not showing the result of the min, median, max, and count since they are straight-forward.

The nunique function’s result is below. We can see that trans_id is unique, while one account_id could have multiple transactions.

**Related article**: The above are the most used groupby functions. You may also check out the pandas document for a full list.

### Example #2: Multiple Aggregating Functions on Multiple Columns

So far, we’ve been only applying a single aggregating function at a time. What if we want to apply multiple functions at once?

We can use the aggregate (*agg*) method to call multiple functions.

### Example #3: Custom Defined Function on Multiple Columns

We can also define custom functions and use the agg method.

The example below defines a new function calculating the square root for the sum of the groups.

### Example #4: Custom Output Name on Multiple Columns

What if we want to change the output column names?

We can use the named aggregation syntax.

A dictionary is created with:

- the keys being the name of the output column
- the values are tuples with:

– the first element being the column to select and

– the second element being the aggregating function to apply on that column

Then we use the ** operation to unpack the dictionary into the function.

It might sound a little confusing. Let’s see an example below.

The *aggs *dictionary defines the new output columns *sum_amount, avg_balance*, and *avg_amount*.

For example, sum_amount is the new column name where the sum function is applied to the amount column.

## Transformation: Like-indexed Object with Statistics for Groups

In this section, we’ll look at the transform method.

It also allows us to apply** **functions to each group. What’s the difference between transform and aggregation?

The transform method returns a result as an object that is indexed the same (size) as the one being grouped.

The first argument to the transform function must be a series (the group). It must either return a series with the same number of rows as the group chunk or broadcastable to the size of the group chunk. Usually, the broadcastable part just means a single scalar value.

That sounds confusing. Don’t worry. It will be more clear with examples.

### Example #1: Custom Defined Function on Multiple Columns – Range

We first define a function *max_minus_min*, which returns a scalar value: the range (max – min). As you can imagine, this range value is the same for every member of the same group.

We can then use the transform method to apply this function on the variables amount and balance.

This function returns a DataFrame with the two range variables. This object has the same size (number of rows) and index (date) as the original dataset df.

Since *df_range *has the same order of rows as df, we can add these two new range variables back to the original dataset for further analysis.

You can print out the head of df to see how the amount_range and balance_range are the same for rows of the same groups. We won’t show it here.

### Example #2: Custom Defined Function on Multiple Columns – Percentage

Another common example would be getting the percentage of the value among its group.

We define the *x_pct *function with input x as a series. And it returns the percentage series the same size as x.

Then we use the transform method to pass on this new function to variables amount and balance.

Again, you can add these two variables back as Example #1 to the original dataset for further analysis.

### Example #3: Custom Defined Function on Multiple Columns – Time Series

The most common usage of transform for us is creating time series features. It is handy when we need to use a rolling window to calculate things that happened in a previous time frame.

Please note that pandas does have a rolling function. But when we need to apply the function to groups, the best way is to use GroupBy’s transform method.

Since the dataset is for bank transactions. We often want to see the volume transacted within the past n days, for each particular account ID.

These types of features are useful for customer behavior analysis, such as fraud analytics, anti-money laundering detection.

**Related article**: How to apply Unsupervised Anomaly Detection on bank transactions.

Check out this article using the same dataset for analysis.

The Python code below does the following steps to create the rolling variables:

- create a GroupBy object by account_id since we want to look at account level activities
- define the function
*sum_n_days*with input x (a series) and n (integer). The x is a times series on which we can calculate the sum of its value, within the last n days - use the transform method to apply this function on the amount and balance variables
- add back these two new variables to the original dataset

We can check an example when account_id = 2 to make sure it’s correct.

As you can see from row 4, there are two transactions within the past 5 days (including the current one on 1993-03-31). So amount_5days is 3713.5 (3700 + 13.5).

## Apply: Flexible Statistics for Groups but Less Efficient

Lastly, let’s talk about the apply function, which is also a method to apply functions group-wise.

Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results.

For these, use the

pandas Flexible apply documentation`apply`

function, which can be substituted for both`aggregate`

and`transform`

in many standard use cases. However,`apply`

can handle some exceptional use cases.

For us, apply is more flexible because we can access multiple columns in the function, which you couldn’t do in aggregation and transform.

Everything sounds great about apply so far, but:

While apply is a very flexible method, its downside is that using it can be quite a bit slower than using more specific methods like agg or transform.

Pandas offers a wide range of methods that will be much faster than using apply for their specific purposes, so try to use them before reaching for apply.

pandas apply documentation

For this reason, we haven’t found much need to use it. It is still a popular function, so try it out yourself to decide whether to use it.

First, we also need to define a function that will be passed onto apply. The function must take a DataFrame as its first argument and return a DataFrame, series, or scalar. apply will then take care of combining the results back together into a single dataframe or series.

### Example #1: Multiple Columns Returning a DataFrame

In the first example, we define a function *f* which returns a DataFrame with:

*amount_div_balance*: the percentage of amount out of balance of the account*amount_zscore*: the z score for the amount

As you can see, apply can have functions using multiple columns from the dataset (amount and balance).

Then we pass on the function f to apply on amount and balance.

Again, you can use the same method as in Example #1 of transform, to add these new variables to the original dataset df for further analysis.

### Example #2: Multiple Columns Returning a Scalar

Let’s see another example with a function that uses multiple columns.

The function *g* uses amount and balance to calculate their mean’s comparison, which is a scalar value.

Then we can use apply again to apply the function g.

Since we return a scalar value, the result doesn’t have the same number of rows as the original dataset. The number of rows is the same as the number of groups, which is the number of account IDs.

We did it! Now you should be able to **GroupBy with Pandas** (an essential skill for data science) *like a boss*.

Leave a comment for any questions you may have or anything else.

Before you leave, don’t forget to *sign up for the Just into Data newsletter* below! Or connect with us on Twitter, Facebook.

So you *won’t *miss any new data science articles from us!