In this tutorial, we’ll show how to detect outliers or anomalies in unlabeled bank transactions with Python.
- How to identify rare events in an unlabeled dataset using a machine learning algorithm: isolation forest (an unsupervised, tree-based method).
- How to visualize the anomaly detection results.
- How to fight crime with anti-money laundering (AML) or fraud analytics in banks.
- Use cases and tips from people with industry experience.
If you want to see unsupervised learning with a practical example, step-by-step, let’s dive in!
What is Anomaly Detection
Anomaly detection (outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. (Wikipedia)
Often these rare data points will translate to problems such as bank security issues, structural defects, intrusion activities, medical problems, or errors in a text.
There are supervised and unsupervised anomaly detection techniques; which kind applies depends on whether the dataset is labeled or not.
Within this article, we are going to use anomaly detection to spot irregular bank transactions. These transactions could be fraudulent or money laundering activities.
The dataset we are going to look at consists of real, anonymized transactions from a Czech bank between 1993 and 1999.
The transactions are not labeled (as fraud or money laundering). The majority of the transactions are legit. Yet, due to the nature of the banking business, it is likely that a small share of them involved illegal activities.
Let’s read the data into Python.
We are going to focus on four variables for this exercise:
- date: the date of the transaction
- account_id: the account ID associated with the transaction
- type: the type of the transaction
- amount: the amount of the transaction
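Here is a minimal sketch of the loading step. It assumes the transactions are stored in a semicolon-separated file with dates written as YYMMDD strings; the in-memory sample below is made up so the snippet runs on its own. With the real file, we'd pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file. The separator and the
# YYMMDD date format are assumptions about how the data is stored.
raw = io.StringIO(
    "trans_id;account_id;date;type;amount\n"
    "1;1;950324;PRIJEM;1000.0\n"
    "2;1;960101;VYDAJ;210.0\n"
    "3;1;960105;VYDAJ;2452.0\n"
    "4;2;960105;VYDAJ;500.0\n"
)

# Read the date as text first so it is not mistaken for an integer.
df_trans = pd.read_csv(raw, sep=";", dtype={"date": str})
df_trans["date"] = pd.to_datetime(df_trans["date"], format="%y%m%d")

# Keep only the four variables we focus on.
df_trans = df_trans[["date", "account_id", "type", "amount"]]
print(df_trans)
```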
Step #1: Exploring and Cleaning the Dataset
Now that we have the dataset df_trans, let’s do some exploratory data analysis and transform the data.
We plot the features amount and type to look at their distributions, using a histogram for amount and a bar plot for type.
We can see that the transaction amount is skewed right with many smaller values and a few large values.
The majority of the transaction types are:
- credit: PRIJEM
- withdrawal: VYDAJ
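The two plots described above could be produced along these lines. This is only a sketch on a handful of made-up rows; the figure layout and bin count are our own choices, and the skewness value is added as a numeric check of what the histogram shows.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for df_trans with the two columns we want to inspect.
df_trans = pd.DataFrame({
    "type": ["PRIJEM", "VYDAJ", "VYDAJ", "PRIJEM", "VYDAJ", "VYDAJ"],
    "amount": [100.0, 250.0, 80.0, 1200.0, 60.0, 45000.0],
})

fig, (ax_amount, ax_type) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram for the numeric feature.
ax_amount.hist(df_trans["amount"], bins=20)
ax_amount.set_xlabel("amount")
ax_amount.set_ylabel("number of transactions")

# Bar plot for the categorical feature.
type_counts = df_trans["type"].value_counts()
ax_type.bar(list(type_counts.index), type_counts.values)
ax_type.set_xlabel("type")

fig.tight_layout()

# A positive skewness is the numeric counterpart of a right-skewed histogram.
amount_skew = df_trans["amount"].skew()
print(type_counts)
print(f"skewness of amount: {amount_skew:.2f}")
```

In a notebook, the figure shows up on its own; in a script, `fig.savefig(...)` writes it to a file.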
We’ll change the transaction types to English to make them easier to read.
For our analysis, let’s focus only on withdrawal transactions. Keep in mind, though, that both fraud and AML detection could involve other types of transactions.
Also, we set the index to date because it’s required by the time series functions in the pandas package.
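These three cleaning steps might look like the sketch below. The English labels `CREDIT` and `WITHDRAWAL` and the name `df_withdrawals` are our own choices for illustration, and the rows are made up.

```python
import pandas as pd

# Made-up stand-in for df_trans.
df_trans = pd.DataFrame({
    "date": pd.to_datetime(["1996-01-01", "1996-01-03", "1996-01-05", "1996-01-05"]),
    "account_id": [1, 1, 1, 2],
    "type": ["VYDAJ", "PRIJEM", "VYDAJ", "VYDAJ"],
    "amount": [210.0, 1000.0, 2452.0, 500.0],
})

# 1. Translate the Czech transaction types to English.
type_map = {"PRIJEM": "CREDIT", "VYDAJ": "WITHDRAWAL"}
df_trans["type"] = df_trans["type"].replace(type_map)

# 2. Keep withdrawals only, 3. index by date and sort chronologically.
df_withdrawals = (
    df_trans[df_trans["type"] == "WITHDRAWAL"]
    .set_index("date")
    .sort_index()
)
print(df_withdrawals)
```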
This is what our final dataset looks like.
Step #2: Creating New Features
What would be a warning sign for fraud or money laundering activities?
Based on our experience in the banking industry, an unusually large transaction volume from an account is a red flag.
The activity could also be spread over a period of time, since the perpetrators don’t want to alert the bank with a single large transaction.
Let’s create two new features to account for the transaction volume over a period of time:
- sum_5days: the cumulative withdrawal amount from an account over the past 5 days (including the current day).
- count_5days: the count of withdrawal transactions from an account over the past 5 days (including the current day).
As we can see, on 1996-01-05, the account with id=1 had 2 withdrawals in the past 5 days, with a cumulative amount of 2662 (210 + 2452).
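One way to compute these two features is a time-based rolling window per account. The sketch below uses made-up rows that mirror the example above (we assume the 210 withdrawal fell on 1996-01-01) and relies on pandas' `"5D"` offset window, which covers the current day plus the four days before it.

```python
import pandas as pd

# Made-up withdrawals, sorted by account and date, and indexed by date.
withdrawals = (
    pd.DataFrame({
        "date": pd.to_datetime(["1996-01-01", "1996-01-05", "1996-01-05"]),
        "account_id": [1, 1, 2],
        "amount": [210.0, 2452.0, 500.0],
    })
    .sort_values(["account_id", "date"])
    .set_index("date")
)

# A 5-day window per account, ending at (and including) each row's date.
rolling = withdrawals.groupby("account_id")["amount"].rolling("5D")
sums = rolling.sum()
counts = rolling.count()

# The results come back ordered by account and date, matching the sort
# above, so we can assign them positionally.
withdrawals["sum_5days"] = sums.to_numpy()
withdrawals["count_5days"] = counts.to_numpy().astype(int)
print(withdrawals)
```

The positional assignment only works because the frame was sorted by account and date first; on unsorted data, the values would land on the wrong rows.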
Let’s explore these features with more visualizations.
Most accounts withdraw relatively little money within 5 days.
Most accounts don’t withdraw with high frequency either.
There is a positive correlation between the amount withdrawn and the number of withdrawals.
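Besides reading the relationship off a plot, we can also put a number on it. The sketch below computes the Pearson correlation between the two features on a few made-up values; the real figure depends on the actual data.

```python
import pandas as pd

# Made-up feature values for six account-days.
features = pd.DataFrame({
    "sum_5days": [200.0, 1500.0, 3000.0, 800.0, 12000.0, 6000.0],
    "count_5days": [1, 2, 3, 1, 5, 4],
})

# Pearson correlation: values above 0 mean the features rise together.
corr = features["sum_5days"].corr(features["count_5days"])
print(f"Pearson correlation: {corr:.2f}")
```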
There could be many other features to spot irregular behavior as well, but we’ll focus on these two only.
Step #3: Detecting the Outliers with a Machine Learning Algorithm
How do we use these features to detect the outliers?
We’ll use an unsupervised learning algorithm: Isolation Forest.
The Isolation Forest (iForest) algorithm takes advantage of the fact that anomalies are “few and different”, which makes them easier to “isolate” than normal points. So instead of trying to build a model of normal instances, it explicitly isolates anomalous points in the dataset.
The main advantage of this approach is the possibility of exploiting sampling techniques to an extent that is not allowed to the profile-based methods, creating a very fast algorithm with a low memory demand. (Wikipedia)
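To build some intuition for “easier to isolate”, here is a toy sketch in plain Python. It is not the full iForest algorithm (there is no subsampling and no anomaly score), only the core idea: keep splitting the data at random until a point is alone, and count the splits. The data and function names are made up for illustration.

```python
import random


def isolation_depth(point, data, rng):
    """Count the random axis-aligned splits needed to leave `point` alone."""
    subset = list(data)
    depth = 0
    while len(subset) > 1:
        dim = rng.randrange(len(point))
        low = min(p[dim] for p in subset)
        high = max(p[dim] for p in subset)
        if low == high:
            if all(p == subset[0] for p in subset):
                break  # only duplicates left, nothing more to split
            continue  # this feature is constant here, try another one
        split = rng.uniform(low, high)
        # Keep only the side of the split that still contains `point`.
        if point[dim] < split:
            subset = [p for p in subset if p[dim] < split]
        else:
            subset = [p for p in subset if p[dim] >= split]
        depth += 1
    return depth


def average_depth(point, data, rng, trials=200):
    """Average the isolation depth over many random split sequences."""
    return sum(isolation_depth(point, data, rng) for _ in range(trials)) / trials


rng = random.Random(0)
# Made-up (sum_5days, count_5days) pairs for 200 ordinary account-days ...
ordinary = [(rng.gauss(3000, 800), rng.gauss(2, 0.5)) for _ in range(200)]
# ... plus one that withdrew far more, far more often.
outlier = (80000.0, 9.0)
data = ordinary + [outlier]

outlier_depth = average_depth(outlier, data, rng)
typical_depth = average_depth(ordinary[0], data, rng)
print(f"outlier: {outlier_depth:.1f} splits, typical point: {typical_depth:.1f} splits")
```

The outlier needs far fewer splits on average. iForest turns exactly this kind of path length, averaged over many random trees, into an anomaly score.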
Another advantage of iForest is that we don’t need to scale the variables before applying it, unlike many distance-based techniques.
To apply this algorithm, we use PyOD, a comprehensive and scalable Python toolkit for detecting outlying objects.
Before applying the algorithm, it is also critical to define the proportion of anomalies to detect. This is often bound by the capacity of the business operations: if we send too many alerts for anomalies, the operations team won’t be able to handle them. In this article, we’ll choose 0.1%.
Now we can fit the anomaly detection model!
To visualize what our model learned, we use a contour plot with matplotlib. The plot is adapted from this resource.
The Python code below seems complicated. But we don’t have to understand all of it. Let’s focus on the plot.
Any data point outside the orange area is an anomalous withdrawal transaction.
It’s hard to see, but the white dots make up 99.9% of all withdrawal transactions. This matches the 0.1% we set as the proportion of anomalies to detect.
At this point, we’ve completed the anomaly detection.
Let’s see how we can use the results.
How to Use the Results for Anti-Money Laundering or Fraud Analytics
With the above results, we can send withdrawal transactions for review when:
- the account had 1 withdrawal transaction of over ~80K
- the account had a cumulative amount of over ~65K and more than 4 withdrawal transactions within the past 5 days
- the account had a cumulative amount of over ~45K and more than 5 withdrawal transactions within the past 5 days
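The three rules above translate into simple boolean filters. The sketch below is a literal reading of them on made-up rows; the thresholds are the approximate values read off the plot, and the column names follow the features we created earlier.

```python
import pandas as pd

# Made-up feature rows, one per account.
features = pd.DataFrame({
    "account_id": [1, 2, 3, 4, 5],
    "sum_5days": [2662.0, 85000.0, 70000.0, 50000.0, 50000.0],
    "count_5days": [2, 1, 5, 6, 3],
})

# Rule 1: a single withdrawal of over ~80K.
single_large = (features["count_5days"] == 1) & (features["sum_5days"] > 80_000)
# Rule 2: over ~65K in total and more than 4 withdrawals in 5 days.
many_large = (features["sum_5days"] > 65_000) & (features["count_5days"] > 4)
# Rule 3: over ~45K in total and more than 5 withdrawals in 5 days.
very_frequent = (features["sum_5days"] > 45_000) & (features["count_5days"] > 5)

features["alert"] = single_large | many_large | very_frequent
alerted_accounts = features.loc[features["alert"], "account_id"].tolist()
print(alerted_accounts)
```

Rules like these are easier to explain to an operations team, and easier to run inside a bank's systems, than a raw model score.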
Based on our experience working in AML or fraud analytics, there are other factors to consider before putting these alerting strategies into production. For example, we need to:
- balance between customer experience and bank losses
We don’t want these alerts/investigations to impact the legit customers too much. It is normal for the banks to take some losses while running the business.
We may mitigate this impact by setting up customer profiles to whitelist some accounts.
- understand the operations team capacity
As mentioned before, we need to send the anomalous transactions to agents, who then do further investigation or other support work. The strategies should be optimized to handle the cases with the largest potential losses first.
Because of this limited capacity, many of the irregular cases are never investigated.
- know the limitations of the system
Sometimes the system within the banks is not advanced enough to accommodate all the possible features we want to capture.
Or the system doesn’t have the capacity for real-time decision making.
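Two of the considerations above, whitelisting and team capacity, can be sketched in a few lines. Everything here is hypothetical: the whitelist, the daily capacity of 2, and the choice of `sum_5days` as a stand-in for potential loss.

```python
import pandas as pd

# Made-up alerts produced by the detection step.
alerts = pd.DataFrame({
    "account_id": [2, 3, 4, 7],
    "sum_5days": [85000.0, 70000.0, 50000.0, 120000.0],
})

whitelisted_accounts = {7}  # hypothetical: accounts with a known, legit profile
daily_capacity = 2          # hypothetical: cases the team can review per day

# Drop whitelisted accounts, then review the largest amounts first.
review_queue = (
    alerts[~alerts["account_id"].isin(whitelisted_accounts)]
    .sort_values("sum_5days", ascending=False)
    .head(daily_capacity)
)
print(review_queue)
```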
It is one thing to detect outliers with machine learning algorithms; it is another to actually implement the strategies.
This is a good example showing why data scientists have to understand both the technical and the business aspects of a problem.
Hope you found this guide helpful. Now you should have a better understanding of unsupervised anomaly detection, an important machine learning application.