Data Cleaning Techniques in Python: the Ultimate Guide
 What and how to clean, with examples

Lianne & Justin

Lianne & Justin

data cleaning python techniques
Source: Pixabay

This is a SUPER practical tutorial on data cleaning (techniques) in Python.

No analysis creates meaningful results with messy data. So no matter you are generating simple reports or fitting machine learning models, we MUST preprocess and clean the data.

By following this guide step-by-step, you’ll learn:

  • What is Data Cleaning, and why is it important?
  • Methods & techniques in Python on how to find and clean:
    • Missing data
    • Irregular data (outliers)
    • Unnecessary data — repetitive data, duplicates, and more
    • Inconsistent data — capitalization, data types, typos, addresses

If you want to clean data efficiently in Python, this guide with examples is for you!

Let’s get started.

Editor’s Note: This guide improved and replaced the original post.



What is Data Cleaning, and why is it important?

Let’s kick off the guide with a definition of data cleaning. I’ll use the below sentence from Wikipedia.

Data cleaning or data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

What a long description! Data cleaning is certainly not fun and very time-consuming.

So why do we have to clean the data?

In reality, the raw data is almost always messy. If you use such data for analysis, for example, feed into a machine learning model, you’ll get useless insights most of the time. That’s why data cleansing is a critical process for data analysts and data scientists.

As you’ve seen, data cleaning involves different processes depending on the dataset. But it is still good to have a checklist to start with. In this guide, we’ll cover the most common data cleaning problems. For each problem, you’ll learn how to find them and techniques to clean them.

Data Cleaning in Python

We’ll use Python in Jupyter Notebook for data cleaning throughout the guide. More specifically, we’ll use the below Python libraries:

  • pandas: a popular data analysis and manipulation tool, which will be used for most of our data cleaning techniques
  • seaborn: statistical data visualization library
  • missingno: missing data-focused visualization package
  • nltk: natural language toolkit

To follow this data cleaning in Python guide, you need basic knowledge of Python, including pandas. If you are new to Python, please check out the below resources:

Example dataset

For the majority part of the guide, we’ll use the Russian housing market dataset from Kaggle. The goal of the project is to predict housing prices. The dataset is large and should help you understand most of the data cleaning techniques that we’ll cover.

First, let’s take a look at the data in Python.

We learned that the dataset has 30,471 rows and 292 columns from the results.

Since there are too many columns, I’m only showing the information summary of the first few columns below. The columns include information in a house sale transaction, such as sale price, house size, neighborhood features. If you are interested, you can read more about the dataset on Kaggle.

data cleaning python techniques example dataset
The info summary of the first 11 columns

We can also identify the numeric and non-numeric columns. These are necessary since we often treat them using different methods.

So now numeric_cols stores the numeric columns.

Index(['id', 'full_sq', 'life_sq', 'floor', 'max_floor', 'material',
       'build_year', 'num_room', 'kitch_sq', 'state',
       ...
       'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000',
       'cafe_count_5000_price_high', 'big_church_count_5000',
       'church_count_5000', 'mosque_count_5000', 'leisure_count_5000',
       'sport_count_5000', 'market_count_5000', 'price_doc'],
      dtype='object', length=276)

And non_numeric_cols stores the non-numeric columns.

Index(['timestamp', 'product_type', 'sub_area', 'culture_objects_top_25',
       'thermal_power_plant_raion', 'incineration_raion',
       'oil_chemistry_raion', 'radiation_raion', 'railroad_terminal_raion',
       'big_market_raion', 'nuclear_reactor_raion', 'detention_facility_raion',
       'water_1line', 'big_road1_1line', 'railroad_1line', 'ecology'],
      dtype='object')

Now we can go through the ‘dirty’ data types checklist and fix them one by one!

We’ll start with one of the most common problems: missing data.


Missing data

Missing data or missing values are when there’s no data value stored for a column in a row. It is very common among real-world datasets. If not handled properly, they would significantly impact the results of data analysis. Many machine learning models can’t tolerate any missing values. So, we must learn how to handle missing data.

How to find out?

Before cleaning missing data, we need to learn how to detect it. We’ll cover 3 methods in Python.

Method #1: missing data (by columns) count & percentage

This is the most basic method to detect missing data among columns.

The info method that we’ve used earlier includes this information. For example, we print out the summary of all the non-numeric columns below. Note that we are not printing for the entire DataFrame df since there are too many columns.

By looking at the Non-Null Count, we can spot the number of missing data by columns. For our example, all non-null counts are 30,471, the same as the total number of rows, so there are no missing data among the non-numeric columns!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   timestamp                  30471 non-null  object
 1   product_type               30471 non-null  object
 2   sub_area                   30471 non-null  object
 3   culture_objects_top_25     30471 non-null  object
 4   thermal_power_plant_raion  30471 non-null  object
 5   incineration_raion         30471 non-null  object
 6   oil_chemistry_raion        30471 non-null  object
 7   radiation_raion            30471 non-null  object
 8   railroad_terminal_raion    30471 non-null  object
 9   big_market_raion           30471 non-null  object
 10  nuclear_reactor_raion      30471 non-null  object
 11  detention_facility_raion   30471 non-null  object
 12  water_1line                30471 non-null  object
 13  big_road1_1line            30471 non-null  object
 14  railroad_1line             30471 non-null  object
 15  ecology                    30471 non-null  object
dtypes: object(16)
memory usage: 3.7+ MB

Besides the info method, there are also dedicated methods for missing values. For example, num_missing stores the columns and their numbers of missing values. We can print out the first 10 columns to look at.

We can see that the columns id, timestamp, full_sq has 0 missing data. While life_sq has 6,383 missing values, floor has 167 missing values, and so on.

id                0
timestamp         0
full_sq           0
life_sq        6383
floor           167
max_floor      9572
material       9572
build_year    13605
num_room       9572
kitch_sq       9572
dtype: int64

Besides the count of missing values, we can use the below code to calculate the percentages of missing values by columns. The key is to use the mean method rather than the sum method.

For instance, we see that the column life_sq has 21% of its data missing, the column floor has 0.5% missing.

id            0.000000
timestamp     0.000000
full_sq       0.000000
life_sq       0.209478
floor         0.005481
max_floor     0.314135
material      0.314135
build_year    0.446490
num_room      0.314135
kitch_sq      0.314135
dtype: float64

Method #2: missing data (by columns) heatmap

Sometimes a picture could be worth a thousand words. We can build a heatmap to visualize the missing data. This technique is proper when you have a smaller number of columns.

Let’s use the first 30 columns of our dataset as an example. We’ll use 2 Python libraries to create the heatmap: seaborn and missingno.

seaborn

The seaborn library is a popular statistical data visualization library. Let’s first use it to plot the missing data heatmap.

Based on the heatmap below, we can learn the pattern of missing data for the first 30 columns. The horizontal axis shows the column names; the vertical axis shows the number of rows; the yellow color represents the missing data while the blue color is otherwise.

For example, the column life_sq has missing values across different rows. While the column max_floor has most of its missing values before ~10,000 rows. These patterns can potentially help us understand the reason for data being missing.

missing data heatmap seaborn data cleaning python techniques
Missing data heatmap – seaborn
missingno

The missingno library is a small toolset focused on missing data visualizations and utilities. So you can get the same missing data heatmap as above with shorter code.

missing data heatmap missingno data cleaning python techniques
Missing data heatmap – missingno

Method #3: missing data (by rows) histogram

We’ve been looking at missing data by columns. But we can also summarize the missing data by rows. Missing data histogram is a technique for summarizing such information.

This histogram helps to identify the missing patterns among the 30,471 observations. For example, there are over 6,000 observations with no missing values, and close to 4,000 observations with 1 missing value.

missing data histogram data cleaning python techniques
Missing data histogram

What to do?

There are NO agreed-upon solutions to dealing with missing data. We have to study the dataset and its specific columns to determine how to clean their missing values.

Below covers the 4 most used methods of cleaning missing data in Python. If the situation is more complicated, you could be creative and use more sophisticated methods such as modeling.

Technique #1: drop columns / features

This technique is straightforward. We drop the entire column or feature with missing data, which will certainly cause a loss of information. So we should only perform this when we are sure that the missing data is not informative. Otherwise, we should consider other solutions.

Since we have lots of columns with missing data (51 out of 292), let’s not drop all of them. Let’s consider the columns with a high percentage of missing.

Based on the previous code, pct_missing stores the list of columns and their missing data percentages. What are the columns with over 30% missing data?

There are 10 columns with over 30% of their data missing.

max_floor                     0.314135
material                      0.314135
build_year                    0.446490
num_room                      0.314135
kitch_sq                      0.314135
state                         0.444980
hospital_beds_raion           0.473926
cafe_sum_500_min_price_avg    0.435857
cafe_sum_500_max_price_avg    0.435857
cafe_avg_price_500            0.435857
dtype: float64

Assume we investigated the data and consider these columns not very useful. We can leave them out of the analysis. Using the code below, we first filter for the columns with less than or equal to 30% missing, then apply the copy method to make a copy to the new DataFrame df_less_missing_cols. As shown in the comment below, you can also use the drop method to achieve the same.

This new DataFrame has the same number of rows as the original DataFrame, but only 282 columns. So we’ve successfully removed those 10 columns that have more than 30% missing.

(30471, 282)

What about those columns with less than 30% missing? You could use other techniques that will be covered soon.

Technique #2: drop rows / observations

We can drop the entire row with missing data like the first technique. Again, please be aware of the loss of information when removing rows.

If we only want to drop the rows with many missing values, we can do it similarly to the previous technique.

For example, from the missing data histogram shown earlier, we notice that only a small amount of observations have over 35 columns all missing. We can create a new dataset df_less_missing_rows to only keep observations with less than 35 missing columns.

This removed 1,169 rows from the original dataset.

(29302, 292)

Technique #3: impute the missing with constant values

Instead of dropping data, we can also replace the missing. An easy method is to impute the missing with constant values.

For example, we can impute the numeric columns with a value of -999 and impute the non-numeric columns with ‘_MISSING_’.

Technique #4: impute the missing with statistics

Besides constants, we can also impute the missing values with statistics.

For example, we can impute the numeric columns with their respective medians.

We can also impute the non-numeric columns with their most frequent values.

The variable most_freq stores the most common values for each non-numeric column.

timestamp                             2014-12-16
product_type                          Investment
sub_area                     Poselenie Sosenskoe
culture_objects_top_25                        no
thermal_power_plant_raion                     no
incineration_raion                            no
oil_chemistry_raion                           no
radiation_raion                               no
railroad_terminal_raion                       no
big_market_raion                              no
nuclear_reactor_raion                         no
detention_facility_raion                      no
water_1line                                   no
big_road1_1line                               no
railroad_1line                                no
ecology                                     poor
Name: top, dtype: object

Then we can use it to fill in the missing.

As previously shown, the dataset doesn’t have any missing values for the non-numeric columns. So this step above made no impact on our example dataset.

Further learning:
Within our course Python for Data Analysis with projects, there’s a detailed section about data cleaning, where you can get clearer explanations and learn more techniques. For example, you’ll get a more detailed walkthrough of the code, learn another technique to impute the missing data based on modeling.

That’s it for the data cleaning techniques of missing data in Python! Let’s move on to the next common issue: outliers.

Irregular data (outliers)

Outliers are data that is distinct from other observations. They could bias our data analysis results, providing a misleading representation of the data. Outliers could be real outliers or mistakes.

How to find out?

The definition of outliers is broadly defined for numeric data only. So let’s focus on methods of spotting outliers of numeric columns.

Method #1: descriptive statistics

First, let’s look at kurtosis. Kurtosis is a statistical measure of ‘tailedness’. The higher kurtosis is often linked to the greater extremity of deviations (or outliers) in the data. So this is a single statistic to detect potential outliers.

Among the first 10 columns, we can see that life_sq has the highest kurtosis value. But note that the high value of kurtosis doesn’t guarantee outliers. So we’ll investigate this column more soon.

id               -1.199940
full_sq       12128.727462
life_sq       17058.409645
floor             1.731182
max_floor         6.595262
material          1.358512
build_year    16865.716093
num_room         13.292374
kitch_sq       4680.043742
state            88.399730
dtype: float64

Besides the kurtosis, we could look at the column’s common descriptive statistics.

For example, we could use the describe method in Python.

For the column life_sq, we can see that the maximum value is 7,478, while the 75th percentile is only 43. The maximum value is an outlier.

count    24088.000000
mean        34.403271
std         52.285733
min          0.000000
25%         20.000000
50%         30.000000
75%         43.000000
max       7478.000000
Name: life_sq, dtype: float64

What if you want to learn more about the outliers? Let’s go-to method #2.

Method #2: histogram & box plot

Let’s use the data visualization method to detect outliers. We’ll plot a histogram and a box plot of the column life_sq.

From the histogram, we can see that the data is highly skewed with the possible existence of outliers. But due to the low frequency, we can’t see the exact location of the outliers and the counts.

histogram to detect outliers data cleaning
Histogram of life_sq

From the box plot, we get a more clear view of the outliers. There is an outlier with a value of over 7,000. In fact, all the dots on the plot are considered outliers by the box plot definition. You can read details in the documentation of the boxplot method.

box plot to detect outliers data cleaning
Box plot of life_sq

Method #3: bar chart

As mentioned, outliers are mainly defined for numeric data. But for non-numeric data, there could be irregular values too. We can use a bar chart to learn about the categories and their distributions.

For example, the column ecology has a reasonable distribution. But if there is a category with only one value called ‘extraordinary’, that could be considered an ‘outlier’.

bar chart data cleaning
Bar chart of ecology

Other methods: there are other methods to spot outliers, such as scatter plots, z-score. Please feel free to explore.

What to do?

While outliers are not hard to see, it is tricky to clean them. It depends on the dataset and the goal of the project.

The methods of handling outliers are somewhat similar to missing data. We could drop, replace, or even just keep them.

We won’t cover details in this guide. You can refer to the missing data section for possible solutions in Python. Or, to see detailed examples, you can check out the cleaning data section in our Python for Data Analysis with projects course.

Unnecessary data

Those are a lot of hard work for missing data and outliers! Let’s clean something more straightforward in this section: the unnecessary data.

All the data included in the analysis should support the project’s goal. We consider the data to be unnecessary when it doesn’t add value. We’ll cover 3 main types of unnecessary data.

Unnecessary type #1: repetitive & uninformative

One column can have many observations being the same value. When an extremely high percentage of the column has a repetitive value, we should investigate whether such a column provides valuable information.

How to find out?

We can generate a list of columns with a high percentage of the same value. For example, we specify below to show columns with over 99.9% rows being the same value.

In fact, there’s no such column in our example dataset.

What to do?

If there is one column with a high percentage of the same value, we should look into it to see if it’s informative. We can drop them when they are not, e.g., when the column has 100% being the same value.

Unnecessary type #2: irrelevant

Again, the data needs to provide valuable information for the project. If the features are not related to the question we are trying to solve, they are irrelevant.

How to find out?

We need to skim through the features to identify irrelevant ones. For example, a feature recording the temperature in the US wouldn’t provide direct insights into housing prices in Russia.

What to do?

When the features are not serving the project’s goal, we can remove them. You could use the drop method in pandas.

Unnecessary type #3: duplicates

The duplicate data is when copies of the same observation exist. Let’s look at 2 main types of duplicate data and clean them in Python.

Duplicates type #1: all columns based

How to find out?

This is easy to understand. Such duplicate occurs when all the columns’ values within the observations are the same.

We can use the duplicated method to grab the boolean values of whether a row is duplicated, and then use it to filter for duplicated rows from df.

There are no two rows that look exactly the same. So this returns 0 rows.

What to do?

If there were any duplicates, we could remove them. We can use the drop_duplicates method. But this will return the same DataFrame since there weren’t any duplicates.

Duplicates type #2: key columns based

Instead of looking at all columns, sometimes we want to detect duplicates based on a set of identifiers (columns).

How to find out?

For example, we know that the column id is probably unique. What if we drop it and see if there are duplicates based on the rest of the columns?

We first drop id, and then see if there are duplicated rows from the DataFrame.

I won’t print out the results here, but there are 10 rows shown up to be duplicated.

What to do?

We could remove them and save the new dataset as df_dedupped. We can also compare the shapes of the two datasets (df and df_dedupped).

There were 10 rows removed as duplicates.

(30471, 292)
(30461, 291)

This is great!

Let’s see another example.

How to find out?

For house sales, let’s assume that if two transactions have the same timestamp, full_sq, life_sq, floor, build_year, num_room, price_doc, they are duplicates. Let’s look at duplicates based on this set of columns. We can use the below code to show the duplicated rows based on those columns.

I won’t print out the detailed result, but there are 16 rows returned as duplicates.

What to do?

We can drop these duplicates based on the subset of key columns.

We’ve successfully dropped the 16 duplicates in the new dataset named df_dedupped2.

(30471, 292)
(30455, 292)

That’s all for the unnecessary data.

You’ve come a long way!

Let’s move on to the last section of data cleaning in Python: inconsistent data.

Inconsistent data

It is crucial to have the dataset follow specific standards. There could be different inconsistent data that needs to be cleaned and we’ll cover 4 common ones. Please note that the actual data may be even messier, be creative when cleaning it!

Inconsistent type #1: capitalization

Inconsistent use of upper and lower cases in categorical values is typical. We need to clean it since Python is case-sensitive.

How to find out?

Let’s look at sub_area, which stores the name of different areas.

Poselenie Sosenskoe               1776
Nekrasovka                        1611
Poselenie Vnukovskoe              1372
Poselenie Moskovskij               925
Poselenie Voskresenskoe            713
                                  ... 
Molzhaninovskoe                      3
Poselenie Shhapovskoe                2
Poselenie Kievskij                   2
Poselenie Klenovskoe                 1
Poselenie Mihajlovo-Jarcevskoe       1
Name: sub_area, Length: 146, dtype: int64

We are lucky that the text in this column looks standardized. But there could be inconsistent capitalizations. For instance, ‘Poselenie Sosenskoe’ and ‘pOseleNie sosenskeo’ could refer to the same district.

What to do?

To avoid this, we can lowercase (or uppercase) all letters.

poselenie sosenskoe               1776
nekrasovka                        1611
poselenie vnukovskoe              1372
poselenie moskovskij               925
poselenie voskresenskoe            713
                                  ... 
molzhaninovskoe                      3
poselenie shhapovskoe                2
poselenie kievskij                   2
poselenie klenovskoe                 1
poselenie mihajlovo-jarcevskoe       1
Name: sub_area_lower, Length: 146, dtype: int64

Inconsistent type #2: data types

Another standardization we often need to look at is the data types.

Let’s see an example.

How to find out?

If you recall, earlier, we’ve used the info method to print out the summary of the DataFrame. We could see the data types of each column. We could also print out one column to take a look. Let’s try timestamp.

timestamp has dtype of object while it records dates.

0        2011-08-20
1        2011-08-23
2        2011-08-27
3        2011-09-01
4        2011-09-05
            ...    
30466    2015-06-30
30467    2015-06-30
30468    2015-06-30
30469    2015-06-30
30470    2015-06-30
Name: timestamp, Length: 30471, dtype: object

What to do?

We can convert the column to a DateTime format and even extract the specific year, month, weekday, etc.

 timestamp_dt  year  month  weekday
0  2011-08-20  2011      8        5
1  2011-08-23  2011      8        1
2  2011-08-27  2011      8        5
3  2011-09-01  2011      9        3
4  2011-09-05  2011      9        0

Inconsistent type #3: typos of categorical values

A categorical column takes on a limited and usually fixed number of possible values. Sometimes it shows other values due to reasons like typos.

How to find out?

Let’s see an example. Within the code below:

  • We generate a new DataFrame, df_city_ex
    There is only one column that stores the city names. There are misspellings. For example, ‘torontoo’ and ‘tronto’ both refer to the city of ‘toronto’.
  • The variable cities stores the 4 correct names of ‘toronto’, ‘vancouver’, ‘montreal’, and ‘calgary’.
  • To identify typos, we use fuzzy logic matches. We use edit_distance from nltk, which measures the number of operations (e.g., substitution, insertion, deletion) needed to change from one string into another string.
  • We calculate the distance between the actual values and the correct values.

The typos have a smaller ‘distance’ from the correct value. Because they only differ by a couple of letters.

data cleaning python typos edit_distance nltk

What to do?

We can set criteria to convert these typos to the correct values. For example, the below code sets all the values within 2 characters distance from ‘toronto’/’vancouver’ to be ‘toronto’/’vancouver’.

data cleaning techniques typos edit_distance nltk

Inconsistent type #4: addresses

This is the last data cleaning in Python problem we’ll cover. If you’ve worked with addresses, you know how messy they can be. Just imagine how people can write addresses in all different ways!

How to find out?

We can find messy address data by looking at it. Even though sometimes we can’t spot any problems, we can still run code to standardize them in case there are hidden issues.

There is no address column in our housing dataset due to privacy. Let’s create a new dataset df_add_ex with a column address.

As we can see, address is quite messy.

messy address data cleaning

What to do?

We can run the below code to:

  • lowercase the letters
  • remove leading and trailing white spaces
  • delete periods
  • standardize wordings

We are using regular expressions to specify search patterns. You can learn more about it here.

It looks much better now.

messy address cleaned up python

You did it! What a long journey you’ve come along.

In this guide, you’ve successfully applied various data cleaning techniques in Python.

Hope you are now confident to start your data cleaning process for your machine learning problems.

We’d love to hear from you. Leave a comment for any questions you may have or anything else.

Twitter
LinkedIn
Facebook
Email
Lianne & Justin

Lianne & Justin

2 thoughts on “Data Cleaning Techniques in Python: the Ultimate Guide<br /><div style='color:#7A7A7A;font-size: large;font-family:roboto;font-weight:400;'> What and how to clean, with examples</div>”

Leave a Comment

Your email address will not be published.

More recent articles

Scroll to Top

Learn Python for Data Analysis

with a practical online course

lectures + projects

based on real-world datasets

We use cookies to ensure you get the best experience on our website.  Learn more.