This is a SUPER practical tutorial on data cleaning (techniques) in Python.
No analysis creates meaningful results with messy data. So whether you are generating simple reports or fitting machine learning models, we MUST preprocess and clean the data.
By following this guide step-by-step, you’ll learn:
- What is Data Cleaning, and why is it important?
- Methods & techniques in Python on how to find and clean:
- Missing data
- Irregular data (outliers)
- Unnecessary data — repetitive data, duplicates, and more
- Inconsistent data — capitalization, data types, typos, addresses
If you want to clean data efficiently in Python, this guide with examples is for you!
Let’s get started.
Editor’s Note: This guide has been improved and replaces the original post.
What is Data Cleaning, and why is it important?
Let’s kick off the guide with a definition of data cleaning. I’ll use the below sentence from Wikipedia.
Data cleaning or data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
What a long description! Data cleaning is certainly not fun, and it can be very time-consuming.
So why do we have to clean the data?
In reality, raw data is almost always messy. If you use such data for analysis, for example, by feeding it into a machine learning model, you’ll get useless insights most of the time. That’s why data cleansing is a critical process for data analysts and data scientists.
As you’ve seen, data cleaning involves different processes depending on the dataset. But it is still good to have a checklist to start with. In this guide, we’ll cover the most common data cleaning problems. For each problem, you’ll learn how to find them and techniques to clean them.
Data Cleaning in Python
We’ll use Python in Jupyter Notebook for data cleaning throughout the guide. More specifically, we’ll use the below Python libraries:
- `pandas`: a popular data analysis and manipulation tool, which will be used for most of our data cleaning techniques
- `seaborn`: statistical data visualization library
- `missingno`: missing data-focused visualization package
- `nltk`: natural language toolkit
To follow this data cleaning in Python guide, you need basic knowledge of Python, including `pandas`. If you are new to Python, please check out the below resources:
- Python basics: FREE Python crash course.
- Python for data analysis basics: Python for Data Analysis with projects course. This course includes a dedicated data cleaning section, where you can get more explanations and learn more techniques than in this guide.
Example dataset
For the majority of the guide, we’ll use the Russian housing market dataset from Kaggle. The goal of the project is to predict housing prices. The dataset is large and should help you understand most of the data cleaning techniques we’ll cover.
First, let’s take a look at the data in Python.
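The code below is a minimal sketch of loading the data, assuming the Kaggle training file is saved locally as `train.csv` (the file name is our assumption):

```python
import pandas as pd

# Load the Russian housing market dataset
# (assuming the Kaggle training file is saved as 'train.csv').
df = pd.read_csv('train.csv')

# Check the number of rows and columns.
print(df.shape)
```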
From the results, we learned that the dataset has 30,471 rows and 292 columns.
Since there are too many columns, I’m only showing the information summary of the first few columns below. The columns include information about a house sale transaction, such as sale price, house size, and neighborhood features. If you are interested, you can read more about the dataset on Kaggle.

We can also identify the numeric and non-numeric columns. This is necessary since we often treat them with different methods.
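One way to split the columns, sketched with the standard `select_dtypes` method:

```python
# Separate numeric and non-numeric column names.
numeric_cols = df.select_dtypes(include=['number']).columns
non_numeric_cols = df.select_dtypes(exclude=['number']).columns

print(numeric_cols)
print(non_numeric_cols)
```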
So now `numeric_cols` stores the numeric columns.
Index(['id', 'full_sq', 'life_sq', 'floor', 'max_floor', 'material', 'build_year', 'num_room', 'kitch_sq', 'state', ... 'cafe_count_5000_price_2500', 'cafe_count_5000_price_4000', 'cafe_count_5000_price_high', 'big_church_count_5000', 'church_count_5000', 'mosque_count_5000', 'leisure_count_5000', 'sport_count_5000', 'market_count_5000', 'price_doc'], dtype='object', length=276)
And `non_numeric_cols` stores the non-numeric columns.
Index(['timestamp', 'product_type', 'sub_area', 'culture_objects_top_25', 'thermal_power_plant_raion', 'incineration_raion', 'oil_chemistry_raion', 'radiation_raion', 'railroad_terminal_raion', 'big_market_raion', 'nuclear_reactor_raion', 'detention_facility_raion', 'water_1line', 'big_road1_1line', 'railroad_1line', 'ecology'], dtype='object')
Now we can go through the ‘dirty’ data types checklist and fix them one by one!
We’ll start with one of the most common problems: missing data.
Missing data
Missing data, or missing values, occur when no value is stored for a column in a row. This is very common in real-world datasets. If not handled properly, missing data can significantly impact the results of data analysis. Many machine learning models can’t tolerate any missing values, so we must learn how to handle missing data.
How to find out?
Before cleaning missing data, we need to learn how to detect it. We’ll cover 3 methods in Python.
Method #1: missing data (by columns) count & percentage
This is the most basic method to detect missing data among columns.
The `info` method that we used earlier includes this information. For example, below we print out the summary of all the non-numeric columns. Note that we are not printing it for the entire DataFrame `df` since there are too many columns.
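A sketch of that call:

```python
# Summary of the non-numeric columns only.
df[non_numeric_cols].info()
```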
By looking at the Non-Null Count, we can spot the number of missing values by column. For our example, all non-null counts are 30,471, the same as the total number of rows, so there is no missing data among the non-numeric columns!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   timestamp                  30471 non-null  object
 1   product_type               30471 non-null  object
 2   sub_area                   30471 non-null  object
 3   culture_objects_top_25     30471 non-null  object
 4   thermal_power_plant_raion  30471 non-null  object
 5   incineration_raion         30471 non-null  object
 6   oil_chemistry_raion        30471 non-null  object
 7   radiation_raion            30471 non-null  object
 8   railroad_terminal_raion    30471 non-null  object
 9   big_market_raion           30471 non-null  object
 10  nuclear_reactor_raion      30471 non-null  object
 11  detention_facility_raion   30471 non-null  object
 12  water_1line                30471 non-null  object
 13  big_road1_1line            30471 non-null  object
 14  railroad_1line             30471 non-null  object
 15  ecology                    30471 non-null  object
dtypes: object(16)
memory usage: 3.7+ MB
Besides the `info` method, there are also dedicated methods for missing values. For example, `num_missing` stores the columns and their numbers of missing values. We can print out the first 10 columns to take a look.
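One way to write this:

```python
# Count missing values in each column and inspect the first 10 columns.
num_missing = df.isna().sum()
print(num_missing[:10])
```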
We can see that the columns `id`, `timestamp`, and `full_sq` have 0 missing values, while `life_sq` has 6,383 missing values, `floor` has 167 missing values, and so on.
id                0
timestamp         0
full_sq           0
life_sq        6383
floor           167
max_floor      9572
material       9572
build_year    13605
num_room       9572
kitch_sq       9572
dtype: int64
Besides the count of missing values, we can use the below code to calculate the percentage of missing values by column. The key is to use the `mean` method rather than the `sum` method.
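A sketch of that calculation (the mean of the boolean mask gives the fraction of missing values per column):

```python
# Percentage of missing values in each column.
pct_missing = df.isna().mean()
print(pct_missing[:10])
```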
For instance, we see that the column `life_sq` has 21% of its data missing, and the column `floor` has 0.5% missing.
id            0.000000
timestamp     0.000000
full_sq       0.000000
life_sq       0.209478
floor         0.005481
max_floor     0.314135
material      0.314135
build_year    0.446490
num_room      0.314135
kitch_sq      0.314135
dtype: float64
Method #2: missing data (by columns) heatmap
Sometimes a picture is worth a thousand words. We can build a heatmap to visualize the missing data. This technique works best when you have a smaller number of columns.
Let’s use the first 30 columns of our dataset as an example. We’ll use 2 Python libraries to create the heatmap: `seaborn` and `missingno`.
seaborn
The `seaborn` library is a popular statistical data visualization library. Let’s first use it to plot the missing data heatmap.
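Here is a minimal sketch of such a heatmap; the figure size and the blue/yellow colour codes are our own choices:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Missingness of the first 30 columns:
# blue = value present, yellow = value missing (colour choice is ours).
cols = df.columns[:30]
colours = ['#000099', '#ffff00']
plt.figure(figsize=(25, 8))
sns.heatmap(df[cols].isna(), cmap=sns.color_palette(colours), cbar=False)
plt.show()
```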
Based on the heatmap below, we can learn the pattern of missing data for the first 30 columns. The horizontal axis shows the column names; the vertical axis shows the row index; yellow represents missing data, while blue represents present values.
For example, the column `life_sq` has missing values scattered across different rows, while the column `max_floor` has most of its missing values in roughly the first 10,000 rows. These patterns can potentially help us understand why data is missing.

[Figure: missing data heatmap of the first 30 columns, plotted with seaborn]
missingno
The `missingno` library is a small toolset focused on missing data visualizations and utilities. So you can get the same missing data heatmap as above with shorter code.
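A sketch using `missingno`’s matrix plot (one possible way to get an equivalent view):

```python
import missingno as msno
import matplotlib.pyplot as plt

# One-liner missingness matrix for the first 30 columns.
msno.matrix(df.iloc[:, :30])
plt.show()
```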

[Figure: missing data heatmap of the first 30 columns, plotted with missingno]
Method #3: missing data (by rows) histogram
We’ve been looking at missing data by columns. But we can also summarize the missing data by rows. Missing data histogram is a technique for summarizing such information.
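A minimal sketch of such a histogram (the bin count and labels are our own choices):

```python
import matplotlib.pyplot as plt

# Number of missing values in each row, then its distribution.
missing_by_row = df.isna().sum(axis=1)
missing_by_row.hist(bins=50)
plt.xlabel('Number of missing values in a row')
plt.ylabel('Number of rows')
plt.show()
```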
This histogram helps to identify the missing patterns among the 30,471 observations. For example, there are over 6,000 observations with no missing values, and close to 4,000 observations with 1 missing value.

What to do?
There are NO agreed-upon solutions for dealing with missing data. We have to study the dataset and its specific columns to determine how to clean their missing values.
Below, we cover the 4 most used methods of cleaning missing data in Python. If the situation is more complicated, you could be creative and use more sophisticated methods such as modeling.
Technique #1: drop columns / features
This technique is straightforward. We drop the entire column or feature with missing data, which will certainly cause a loss of information. So we should only perform this when we are sure that the missing data is not informative. Otherwise, we should consider other solutions.
Since we have lots of columns with missing data (51 out of 292), let’s not drop all of them. Let’s consider only the columns with a high percentage of missing values.
Based on the previous code, `pct_missing` stores the list of columns and their missing data percentages. What are the columns with over 30% missing data?
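One way to check:

```python
# Columns where more than 30% of the values are missing.
print(pct_missing[pct_missing > 0.3])
```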
There are 10 columns with over 30% of their data missing.
max_floor                     0.314135
material                      0.314135
build_year                    0.446490
num_room                      0.314135
kitch_sq                      0.314135
state                         0.444980
hospital_beds_raion           0.473926
cafe_sum_500_min_price_avg    0.435857
cafe_sum_500_max_price_avg    0.435857
cafe_avg_price_500            0.435857
dtype: float64
Assume we investigated the data and consider these columns not very useful. We can leave them out of the analysis. Using the code below, we first filter for the columns with less than or equal to 30% missing, then apply the `copy` method to make a copy as the new DataFrame `df_less_missing_cols`. As shown in the comment below, you can also use the `drop` method to achieve the same result.
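A sketch of that filtering step:

```python
# Keep only the columns with at most 30% missing values.
df_less_missing_cols = df.loc[:, pct_missing <= 0.3].copy()

# Alternative with drop:
# df_less_missing_cols = df.drop(columns=pct_missing[pct_missing > 0.3].index)

print(df_less_missing_cols.shape)
```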
This new DataFrame has the same number of rows as the original DataFrame, but only 282 columns. So we’ve successfully removed those 10 columns that have more than 30% missing.
(30471, 282)
What about those columns with less than 30% missing? You could use other techniques that will be covered soon.
Technique #2: drop rows / observations
Like the first technique, we can drop entire rows with missing data. Again, please be aware of the loss of information when removing rows.
If we only want to drop the rows with many missing values, we can do it similarly to the previous technique.
For example, from the missing data histogram shown earlier, we notice that only a small number of observations have over 35 missing values. We can create a new dataset `df_less_missing_rows` that only keeps observations with fewer than 35 missing values.
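A sketch of that row filter:

```python
# Keep only the rows with fewer than 35 missing values.
df_less_missing_rows = df[df.isna().sum(axis=1) < 35].copy()

print(df_less_missing_rows.shape)
```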
This removed 1,169 rows from the original dataset.
(29302, 292)
Technique #3: impute the missing with constant values
Instead of dropping data, we can also replace the missing values. An easy method is to impute them with constant values.
For example, we can impute the numeric columns with a value of -999 and the non-numeric columns with ‘_MISSING_’.
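A sketch (working on a copy so the original `df` stays untouched):

```python
# Impute numeric columns with -999 and non-numeric columns with '_MISSING_'.
df_constant = df.copy()
df_constant[numeric_cols] = df_constant[numeric_cols].fillna(-999)
df_constant[non_numeric_cols] = df_constant[non_numeric_cols].fillna('_MISSING_')
```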
Technique #4: impute the missing with statistics
Besides constants, we can also impute the missing values with statistics.
For example, we can impute the numeric columns with their respective medians.
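A sketch of median imputation (again on a copy, which we call `df_imputed`):

```python
# Impute each numeric column with its own median.
med = df[numeric_cols].median()
df_imputed = df.copy()
df_imputed[numeric_cols] = df_imputed[numeric_cols].fillna(med)
```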
We can also impute the non-numeric columns with their most frequent values.
The variable `most_freq` stores the most common value of each non-numeric column.
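One way to obtain it is from the ‘top’ row of `describe`, which matches the output below:

```python
# 'top' from describe() is the most frequent value of each non-numeric column.
most_freq = df[non_numeric_cols].describe().loc['top']
print(most_freq)
```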
timestamp                     2014-12-16
product_type                  Investment
sub_area             Poselenie Sosenskoe
culture_objects_top_25                no
thermal_power_plant_raion             no
incineration_raion                    no
oil_chemistry_raion                   no
radiation_raion                       no
railroad_terminal_raion               no
big_market_raion                      no
nuclear_reactor_raion                 no
detention_facility_raion              no
water_1line                           no
big_road1_1line                       no
railroad_1line                        no
ecology                             poor
Name: top, dtype: object
Then we can use it to fill in the missing.
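Continuing the sketch above:

```python
# Fill missing non-numeric values with each column's most frequent value.
df_imputed[non_numeric_cols] = df_imputed[non_numeric_cols].fillna(most_freq)
```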
As previously shown, the dataset doesn’t have any missing values for the non-numeric columns. So this step above made no impact on our example dataset.
Further learning:
Within our course Python for Data Analysis with projects, there’s a detailed section about data cleaning, where you can get clearer explanations and learn more techniques. For example, you’ll get a more detailed walkthrough of the code and learn another technique for imputing missing data based on modeling.
That’s it for the data cleaning techniques of missing data in Python! Let’s move on to the next common issue: outliers.
Irregular data (outliers)
Outliers are data points that are distinctly different from the other observations. They can bias our data analysis results and provide a misleading representation of the data. Outliers could be genuine extreme values or mistakes.
How to find out?
Outliers are mostly defined for numeric data only. So let’s focus on methods for spotting outliers in numeric columns.
Method #1: descriptive statistics
First, let’s look at kurtosis. Kurtosis is a statistical measure of ‘tailedness’. Higher kurtosis is often linked to greater extremity of deviations (or outliers) in the data, so this single statistic can flag potential outliers.
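A sketch of the kurtosis calculation for the first 10 numeric columns:

```python
# Kurtosis of the first 10 numeric columns.
print(df[numeric_cols[:10]].kurt())
```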
Among the first 10 columns, we can see that `life_sq` has the highest kurtosis value. But note that a high kurtosis value doesn’t guarantee outliers, so we’ll investigate this column more soon.
id              -1.199940
full_sq      12128.727462
life_sq      17058.409645
floor            1.731182
max_floor        6.595262
material         1.358512
build_year   16865.716093
num_room        13.292374
kitch_sq      4680.043742
state           88.399730
dtype: float64
Besides the kurtosis, we could look at the column’s common descriptive statistics.
For example, we could use the `describe` method in Python.
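A sketch for the `life_sq` column:

```python
# Descriptive statistics of life_sq.
print(df['life_sq'].describe())
```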
For the column `life_sq`, we can see that the maximum value is 7,478, while the 75th percentile is only 43. The maximum value is an outlier.
count    24088.000000
mean        34.403271
std         52.285733
min          0.000000
25%         20.000000
50%         30.000000
75%         43.000000
max       7478.000000
Name: life_sq, dtype: float64
What if you want to learn more about the outliers? Let’s go to method #2.
Method #2: histogram & box plot
Let’s use data visualization methods to detect outliers. We’ll plot a histogram and a box plot of the column `life_sq`.
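A minimal sketch of both plots (the bin count is our own choice):

```python
import matplotlib.pyplot as plt

# Histogram of life_sq.
df['life_sq'].hist(bins=100)
plt.show()

# Box plot of life_sq.
df.boxplot(column=['life_sq'])
plt.show()
```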
From the histogram, we can see that the data is highly skewed with the possible existence of outliers. But due to the low frequency, we can’t see the exact location of the outliers and the counts.

[Figure: histogram of life_sq]
From the box plot, we get a clearer view of the outliers. There is an outlier with a value of over 7,000. In fact, all the dots on the plot are considered outliers by the box plot definition. You can read the details in the documentation of the `boxplot` method.

[Figure: box plot of life_sq]
Method #3: bar chart
As mentioned, outliers are mainly defined for numeric data. But for non-numeric data, there could be irregular values too. We can use a bar chart to learn about the categories and their distributions.
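One way to sketch such a bar chart with pandas plotting:

```python
import matplotlib.pyplot as plt

# Bar chart of the ecology categories and their counts.
df['ecology'].value_counts().plot(kind='bar')
plt.show()
```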
For example, the column `ecology` has a reasonable distribution. But if there were a category with only one value, called ‘extraordinary’, that could be considered an ‘outlier’.

[Figure: bar chart of ecology]
Other methods: there are other methods to spot outliers, such as scatter plots and z-scores. Please feel free to explore them.
What to do?
While outliers are not hard to see, it is tricky to clean them. It depends on the dataset and the goal of the project.
The methods of handling outliers are somewhat similar to missing data. We could drop, replace, or even just keep them.
We won’t cover details in this guide. You can refer to the missing data section for possible solutions in Python. Or, to see detailed examples, you can check out the cleaning data section in our Python for Data Analysis with projects course.
Unnecessary data
That was a lot of hard work on missing data and outliers! Let’s clean something more straightforward in this section: the unnecessary data.
All the data included in the analysis should support the project’s goal. We consider the data to be unnecessary when it doesn’t add value. We’ll cover 3 main types of unnecessary data.
Unnecessary type #1: repetitive & uninformative
A column can have many observations with the same value. When an extremely high percentage of a column is a single repeated value, we should investigate whether such a column provides valuable information.
How to find out?
We can generate a list of columns with a high percentage of the same value. For example, the code below shows columns where over 99.9% of the rows have the same value.
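A sketch of one way to do this check:

```python
# Columns where the most common value covers more than 99.9% of the rows.
low_information_cols = []

for col in df.columns:
    top_pct = df[col].value_counts(dropna=False, normalize=True).max()
    if top_pct > 0.999:
        low_information_cols.append(col)

print(low_information_cols)  # empty for this dataset
```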
In fact, there’s no such column in our example dataset.
What to do?
If there is a column with a high percentage of the same value, we should look into it to see whether it’s informative. We can drop such columns when they are not, e.g., when 100% of the column is the same value.
Unnecessary type #2: irrelevant
Again, the data needs to provide valuable information for the project. If the features are not related to the question we are trying to solve, they are irrelevant.
How to find out?
We need to skim through the features to identify irrelevant ones. For example, a feature recording the temperature in the US wouldn’t provide direct insights into housing prices in Russia.
What to do?
When the features are not serving the project’s goal, we can remove them. You could use the `drop` method in `pandas`.
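A hypothetical sketch; `us_temperature` is not a real column in this dataset and only illustrates the call:

```python
# Drop a column that is irrelevant to the project goal (hypothetical name).
df_relevant = df.drop(columns=['us_temperature'], errors='ignore')
```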
Unnecessary type #3: duplicates
Duplicate data occurs when copies of the same observation exist. Let’s look at 2 main types of duplicate data and clean them in Python.
Duplicates type #1: all columns based
How to find out?
This is easy to understand. Such duplicates occur when all the columns’ values of two (or more) observations are identical.
We can use the `duplicated` method to grab boolean values indicating whether a row is duplicated, and then use them to filter for duplicated rows from `df`.
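A sketch:

```python
# Rows that are exact copies of another row across all columns.
print(df[df.duplicated()])
```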
There are no two rows that look exactly the same. So this returns 0 rows.
What to do?
If there were any duplicates, we could remove them with the `drop_duplicates` method. But this will return the same DataFrame since there weren’t any duplicates.
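A sketch:

```python
# Drop fully duplicated rows (no effect here since there are none).
df_no_dups = df.drop_duplicates()
print(df_no_dups.shape)
```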
Duplicates type #2: key columns based
Instead of looking at all columns, sometimes we want to detect duplicates based on a set of identifiers (columns).
How to find out?
For example, we know that the column `id` is probably unique. What if we drop it and check whether there are duplicates based on the rest of the columns?
We first drop `id`, and then see whether there are duplicated rows in the DataFrame.
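A sketch of that check:

```python
# Drop the unique identifier, then look for duplicates among the remaining columns.
df_no_id = df.drop(columns=['id'])
print(df_no_id[df_no_id.duplicated()])
```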
I won’t print out the results here, but 10 rows show up as duplicates.
What to do?
We could remove them and save the new dataset as `df_dedupped`. We can also compare the shapes of the two datasets (`df` and `df_dedupped`).
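Continuing the sketch above:

```python
# Remove the duplicates and compare shapes with the original DataFrame.
df_dedupped = df_no_id.drop_duplicates()
print(df.shape)
print(df_dedupped.shape)
```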
There were 10 rows removed as duplicates.
(30471, 292) (30461, 291)
This is great!
Let’s see another example.
How to find out?
For house sales, let’s assume that if two transactions have the same `timestamp`, `full_sq`, `life_sq`, `floor`, `build_year`, `num_room`, and `price_doc`, they are duplicates. Let’s look at duplicates based on this set of columns. We can use the below code to show the duplicated rows based on those columns.
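A sketch:

```python
# Check for duplicates based on a set of key columns.
key_cols = ['timestamp', 'full_sq', 'life_sq', 'floor',
            'build_year', 'num_room', 'price_doc']
print(df[df.duplicated(subset=key_cols)])
```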
I won’t print out the detailed result, but there are 16 rows returned as duplicates.
What to do?
We can drop these duplicates based on the subset of key columns.
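Continuing the sketch above:

```python
# Drop duplicates based on the key columns and compare shapes.
df_dedupped2 = df.drop_duplicates(subset=key_cols)
print(df.shape)
print(df_dedupped2.shape)
```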
We’ve successfully dropped the 16 duplicates in the new dataset named `df_dedupped2`.
(30471, 292) (30455, 292)
That’s all for the unnecessary data.
You’ve come a long way!
Let’s move on to the last section of data cleaning in Python: inconsistent data.
Inconsistent data
It is crucial to have the dataset follow specific standards. There are different types of inconsistent data that need to be cleaned, and we’ll cover 4 common ones. Please note that real-world data may be even messier, so be creative when cleaning it!
Inconsistent type #1: capitalization
Inconsistent use of upper and lower cases in categorical values is typical. We need to clean it since Python is case-sensitive.
How to find out?
Let’s look at `sub_area`, which stores the names of different areas.
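A sketch of that check:

```python
# Frequency of each area name, including missing values.
print(df['sub_area'].value_counts(dropna=False))
```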
Poselenie Sosenskoe               1776
Nekrasovka                        1611
Poselenie Vnukovskoe              1372
Poselenie Moskovskij               925
Poselenie Voskresenskoe            713
                                  ...
Molzhaninovskoe                      3
Poselenie Shhapovskoe                2
Poselenie Kievskij                   2
Poselenie Klenovskoe                 1
Poselenie Mihajlovo-Jarcevskoe       1
Name: sub_area, Length: 146, dtype: int64
We are lucky that the text in this column looks standardized. But there could be inconsistent capitalizations. For instance, ‘Poselenie Sosenskoe’ and ‘pOseleNie sosenskeo’ could refer to the same district.
What to do?
To avoid this, we can lowercase (or uppercase) all letters.
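A sketch that stores the lowercased version in a new column:

```python
# Lowercase all area names to remove capitalization inconsistencies.
df['sub_area_lower'] = df['sub_area'].str.lower()
print(df['sub_area_lower'].value_counts(dropna=False))
```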
poselenie sosenskoe               1776
nekrasovka                        1611
poselenie vnukovskoe              1372
poselenie moskovskij               925
poselenie voskresenskoe            713
                                  ...
molzhaninovskoe                      3
poselenie shhapovskoe                2
poselenie kievskij                   2
poselenie klenovskoe                 1
poselenie mihajlovo-jarcevskoe       1
Name: sub_area_lower, Length: 146, dtype: int64
Inconsistent type #2: data types
Another type of standardization we often need to look at is data types.
Let’s see an example.
How to find out?
If you recall, earlier we used the `info` method to print out a summary of the DataFrame, which shows the data type of each column. We can also print out a single column to take a look. Let’s try `timestamp`.
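For example:

```python
# Inspect the timestamp column and its dtype.
print(df['timestamp'])
```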
`timestamp` has a `dtype` of object even though it records dates.
0        2011-08-20
1        2011-08-23
2        2011-08-27
3        2011-09-01
4        2011-09-05
            ...
30466    2015-06-30
30467    2015-06-30
30468    2015-06-30
30469    2015-06-30
30470    2015-06-30
Name: timestamp, Length: 30471, dtype: object
What to do?
We can convert the column to a DateTime format and even extract the specific year, month, weekday, etc.
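A sketch of that conversion:

```python
# Convert to datetime and extract year, month, and weekday.
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
df['year'] = df['timestamp_dt'].dt.year
df['month'] = df['timestamp_dt'].dt.month
df['weekday'] = df['timestamp_dt'].dt.weekday

print(df[['timestamp_dt', 'year', 'month', 'weekday']].head())
```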
  timestamp_dt  year  month  weekday
0   2011-08-20  2011      8        5
1   2011-08-23  2011      8        1
2   2011-08-27  2011      8        5
3   2011-09-01  2011      9        3
4   2011-09-05  2011      9        0
Inconsistent type #3: typos of categorical values
A categorical column takes on a limited and usually fixed number of possible values. Sometimes it shows other values due to reasons like typos.
How to find out?
Let’s see an example. Within the code below (a minimal sketch follows this list):
- We generate a new DataFrame, `df_city_ex`. There is only one column, which stores city names. There are misspellings; for example, ‘torontoo’ and ‘tronto’ both refer to the city of ‘toronto’.
- The variable `cities` stores the 4 correct names: ‘toronto’, ‘vancouver’, ‘montreal’, and ‘calgary’.
- To identify typos, we use fuzzy matching. We use `edit_distance` from `nltk`, which measures the number of operations (e.g., substitution, insertion, deletion) needed to change one string into another.
- We calculate the distance between the actual values and the correct values.
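A sketch of this setup; apart from ‘torontoo’ and ‘tronto’, the example misspellings are our own assumptions:

```python
import pandas as pd
from nltk.metrics import edit_distance

# A small example DataFrame with misspelled city names.
df_city_ex = pd.DataFrame({
    'city': ['torontoo', 'toronto', 'tronto', 'vancouver',
             'vancover', 'vancouvr', 'montreal', 'calgary']
})

# The correct spellings.
cities = ['toronto', 'vancouver', 'montreal', 'calgary']

# Edit distance from each value to each correct name.
for city in cities:
    df_city_ex[f'city_distance_{city}'] = df_city_ex['city'].apply(
        lambda x: edit_distance(x, city))

print(df_city_ex)
```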
The typos have a smaller ‘distance’ from the correct value because they only differ by a couple of letters.

What to do?
We can set criteria to convert these typos to the correct values. For example, the below code sets all the values within a distance of 2 characters from ‘toronto’/’vancouver’ to be ‘toronto’/’vancouver’.
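A sketch, continuing the example above:

```python
# Map values within edit distance 2 of the correct spelling back to it.
msk = df_city_ex['city'].apply(lambda x: edit_distance(x, 'toronto')) <= 2
df_city_ex.loc[msk, 'city'] = 'toronto'

msk = df_city_ex['city'].apply(lambda x: edit_distance(x, 'vancouver')) <= 2
df_city_ex.loc[msk, 'city'] = 'vancouver'

print(df_city_ex['city'])
```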

Inconsistent type #4: addresses
This is the last data cleaning in Python problem we’ll cover. If you’ve worked with addresses, you know how messy they can be. Just imagine how people can write addresses in all different ways!
How to find out?
We can find messy address data by looking at it. Even though sometimes we can’t spot any problems, we can still run code to standardize them in case there are hidden issues.
There is no address column in our housing dataset due to privacy. Let’s create a new dataset `df_add_ex` with a column `address`.
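A sketch of such a dataset; the address values are made up for illustration:

```python
import pandas as pd

# A toy DataFrame with inconsistently formatted addresses (made-up values).
df_add_ex = pd.DataFrame({
    'address': ['123 MAIN St Apartment 15', '123 Main Street Apt 12   ',
                '543 FirSt Av', '  876 FIRST Ave.']
})
print(df_add_ex)
```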
As we can see, `address` is quite messy.

What to do?
We can run the below code to:
- lowercase the letters
- remove leading and trailing white spaces
- delete periods
- standardize wordings
We use regular expressions to specify search patterns. You can learn more about them here.
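A sketch of one way to chain those steps; the specific wording substitutions are examples, not an exhaustive list:

```python
# Standardize the addresses step by step.
df_add_ex['address_std'] = (
    df_add_ex['address']
    .str.lower()                                    # lowercase the letters
    .str.strip()                                    # remove leading/trailing spaces
    .str.replace(r'\.', '', regex=True)             # delete periods
    .str.replace(r'\bstreet\b', 'st', regex=True)   # standardize wordings
    .str.replace(r'\bapartment\b', 'apt', regex=True)
    .str.replace(r'\bav\b', 'ave', regex=True)
)
print(df_add_ex)
```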
It looks much better now.

You did it! What a long journey it has been.
In this guide, you’ve successfully applied various data cleaning techniques in Python.
We hope you now feel confident starting the data cleaning process for your own machine learning problems.
We’d love to hear from you. Leave a comment for any questions you may have or anything else.