This is a guide to help you quickly learn and use Python pandas for data science and machine learning.
Pandas is the most popular Python package for data manipulation and analysis. Data analysts or scientists often spend 80% of their time exploring and transforming data, which makes pandas an essential skill for data science.
In this tutorial, you’ll learn the most useful data science-related pandas techniques:
- What pandas is.
- How to install and import pandas.
- The pandas data structures.
- How to perform key DataFrame operations, such as reading/saving data, exploring data, and manipulating data.
- Much more.
This practical tutorial, full of examples, will help you start doing real data analysis if you have just learned the basics of Python.
Note: if you are new to Python, take our FREE Python crash course for data science.
Let’s begin!
- What is Pandas Python for Data Science?
- Install and Import Pandas
- Learn Python Pandas Data Structures
- Import Data into pandas DataFrame
- Explore the DataFrames
- info: summary
- head or tail: first or last n rows
- shape: dimensionality
- describe: descriptive statistics
- boxplot and histogram: explore numerical variables
- value_counts and bar plot: unique values of categorical variables
- Select and Manipulate Columns and Rows
- Groupby
- Data Cleaning/Manipulation
- Merge and Concatenate DataFrames
- Sort DataFrames
- Save or Output DataFrames
What is Pandas Python for Data Science?
Or why pandas?
Because pandas:
- is easy to use, fast, and open-source.
- offers powerful data structures and operations.
- can be used to import, write, manipulate, clean, and analyze data. Many people consider it the Excel of Python.
- is the most popular Python data analysis/manipulation package, which is built upon NumPy.
In summary, pandas is a must-know package for data science or machine learning.
Note before the tutorial: it’s not necessary (or possible) to memorize all the functions or methods. Plus, there might be other functions or methods you need in the future.
We suggest you follow the tutorial to learn systematically, which covers the essentials. When you are practicing data science, a simple Google search would often solve the problem.
Install and Import Pandas
So how to get started with pandas?
You can install the Anaconda distribution, which is FREE and includes popular packages such as pandas, to start using Python for data science.
- If you want to get pandas, check out How to Install/Setup Python and Prep for Data Science NOW.
- If you want to install pandas and learn the basics of Python, take our FREE Python crash course. This guide assumes you have all the basic Python knowledge.
If you already have Python but not pandas, please use the command below to install it.
pip install pandas
After installing the package, we need to import it in each new Python session.
We also import the matplotlib package, which allows us to show some simple data visualizations.
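For example:
import pandas as pd
import matplotlib.pyplot as plt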
Learn Python Pandas Data Structures
pandas has two main data structures: Series and DataFrame.
- Series: a 1-dimensional labeled array that can hold any data type, such as integers, strings, floating points, or Python objects. It has row/axis labels as its index.
- DataFrame: a 2-dimensional labeled data structure with columns of potentially different types. It also has row labels as its index.
A DataFrame can be considered a collection of Series; it has a structure like a spreadsheet.
In this tutorial, we will focus on DataFrames, since they involve a lot more data manipulation than Series.
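As a quick illustration (the values below are made up):
# A Series: a 1-dimensional labeled array
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame: 2-dimensional, with columns of potentially different types
df = pd.DataFrame({'name': ['x', 'y'], 'score': [1.5, 2.5]})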
Next, let’s see how to read in data as pandas DataFrames.
Import Data into pandas DataFrame
To work in pandas, we can either create DataFrames or read data from external sources.
We’ll cover three common cases in practice.
Dictionary to DataFrame
There are different ways to construct DataFrames from scratch. Let’s use Python dictionaries as an example.
After a raw dataset is preprocessed in Python, it is often convenient to compile the new dataset as a dictionary. But dictionaries are not great objects for further analysis.
This is when we need to convert this dictionary object into a pandas DataFrame.
This is especially common when analyzing log files or text.
Let’s see an example below.
We have a dataset for housing sales records. First, we create two lists with the houses' addresses and prices. Then we combine these lists into a dictionary d with key 'address' mapped to l1 and key 'price' mapped to l2. We can simply transform this dictionary into a pandas DataFrame df2, as shown below.
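A minimal sketch (the addresses and prices below are made-up values for illustration):
l1 = ['123 Main St', '456 Oak Ave', '789 Pine Rd', '101 Maple Dr', '202 Elm St', '303 Cedar Ln']
l2 = [350000, 420000, 510000, 295000, 600000, 480000]
d = {'address': l1, 'price': l2}
df2 = pd.DataFrame(d)
df2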

This is great!
We’ve created our first pandas DataFrame. As you can see, it has six rows and two columns with an index from 0 to 5.
Yet, more often, we import data from outside sources. Let's look at the most common source: csv files.
read_csv
Within this tutorial, we’ll mainly use two sample datasets. Both can be downloaded following the links.
One is a YouTube dataset sydney_videos.csv; the other is a transactional dataset trans.csv. They are both datasets we’ve done analyses on, which makes this guide more practical. Please make sure the two files are stored in the same working directory as your Python code. This saves us from specifying the path in read_csv.
Since both datasets are saved as csv files, let's use the pandas read_csv function to import them into Python.
For the YouTube dataset, we can use the default settings of read_csv to load it as DataFrame df0.
We then call df0 for a look.
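df0 = pd.read_csv('sydney_videos.csv')
df0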

We can see that df0 is a DataFrame with 837 rows and four columns. Each of the rows is a video, with columns:
- title
- posted_ago (the time posted)
- length
- views (the number of views)
We can also change the settings of the input parameters to read_csv. For example, to read the transactional dataset, we set:
- sep=';' to change the delimiter from commas (,) to semicolons (;).
- dtype={'k_symbol': 'string', 'bank': 'string'} to set the data types of columns k_symbol and bank to strings.
- parse_dates=['date'] to parse the column date as a DateTime format.
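Putting it together:
df1 = pd.read_csv('trans.csv', sep=';', dtype={'k_symbol': 'string', 'bank': 'string'}, parse_dates=['date'])
df1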

Further Reading: Pandas read_csv to DataFrames: Python Pandas Tutorial
Read this complete tutorial to learn more about Python pandas read_csv. Or take the last section as a cheat sheet.
read_sql
Another common case is when we want to import data into Python from SQL servers. We can use read_sql to read a SQL query result or a database table into a DataFrame.
For example, we have a sample database in Microsoft SQL Server. We first set up a connection with the server by using the SQLAlchemy function create_engine. Then we can query within Python to get the data as a DataFrame.
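A minimal sketch, assuming a SQL Server database reachable through SQLAlchemy; the connection string and query below are placeholders to adapt to your own setup:
from sqlalchemy import create_engine

# Hypothetical connection details; replace with your own server, database, and credentials.
engine = create_engine('mssql+pyodbc://username:password@server/database?driver=ODBC+Driver+17+for+SQL+Server')
df_sql = pd.read_sql('SELECT * FROM some_table', engine)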

Further Readings:
How to Download and Install SQL Server and Sample Database: Step-by-Step
SQL Tutorial for Beginners: Learn SQL for Data Analysis
SQL is another essential data science skill. Read these articles to master it.
Besides CSV files and SQL databases, we can read data from a variety of sources with pandas. Learn more with pandas’ official IO tools documentation.
At the end of the tutorial, we’ll show how to save the DataFrames on your computer.
But let’s see what we can do with these DataFrames first.
Explore the DataFrames
After creating the pandas DataFrames, let’s learn more about their basic characteristics.
We’ll cover a few popular methods to do this.
info: summary
The info method shows a helpful summary of our DataFrames. It is one of the first methods you should apply after loading in the dataset.
Let’s try it out on the YouTube DataFrame df0.
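df0.info()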
We can see that it prints information, including:
- the index information
- the columns: their names, non-null counts, and dtypes
- memory usage

head or tail: first or last n rows
The head method is also used in every data analysis project. With its help, we can see exactly what the first n rows look like within the DataFrame.
Let’s print the first five rows of df0 using the default settings of the method.
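df0.head()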

We can also specify the number of rows to display.
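# For example, the first three rows; the number is arbitrary.
df0.head(3)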

The tail method returns the last n rows of the DataFrame. It can be used as a quick verification after we sort or append rows.
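df0.tail()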

shape: dimensionality
To get the number of rows and columns of the DataFrame, we can use shape.
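df0.shape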
(837, 4)
describe: descriptive statistics
The method describe generates descriptive statistics for both numeric and object columns.
For example, for the numeric column views, we can print out its summary statistics such as:
- count
- mean
- standard deviation
- percentiles.
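df0['views'].describe()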

For the object column posted_ago, we can see its corresponding statistics:
- count
- unique counts
- the most common value and its frequency (top, freq).
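df0['posted_ago'].describe()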

boxplot and histogram: explore numerical variables
When the columns are numeric, it’s useful to explore their distributions by boxplot and histogram.
We can do a boxplot on the numerical column views.
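df0.boxplot(column='views')
plt.show()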

We can also look at its histogram.
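df0['views'].hist()
plt.show()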

Note: we prefer the seaborn library for data science since it has more features. But these methods in pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis
The YouTube dataset was also used in this complete guide of using seaborn Python for Exploratory Data Analysis. Discover how to use histograms, barplots, scatter plots, boxplots, heatmaps, etc.
value_counts and bar plot: unique values of categorical variables
The value_counts method is handy for categorical columns. It returns a pandas Series with counts of unique values, in descending order. And it excludes NA values by default.
For example, we can use it to see frequency tables of column posted_ago. It provides more details than the previous method describe().
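df0['posted_ago'].value_counts()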

We can also display the results in percentages by using the normalize parameter.
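df0['posted_ago'].value_counts(normalize=True)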

It’s also helpful to visualize this result. We can do this by using the plot method.
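df0['posted_ago'].value_counts().plot(kind='bar')
plt.show()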
We just plotted a bar chart.

Note: again, we prefer the seaborn library for data science since it has more features. But these methods in pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis
Now with a better understanding of the DataFrames, let's see what we can do with their columns and rows.
Select and Manipulate Columns and Rows
Columns and rows are the basic structures of DataFrames. It’s common practice to select a subset or modify the columns/rows. Let’s learn how to do it with Python pandas. We’ll also cover related topics such as index settings and filtering.
Columns of DataFrames
The columns of the DataFrames often represent variables/features in the dataset. Each column has a name we can refer to it by.
Select Columns
The most basic way of selecting a column is by calling its name within square brackets [].
For example, we can print out the columns length and views separately below.
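# Each selection returns a pandas Series.
df0['length']
df0['views']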
Column Operations
We can also perform simple numerical or categorical operations on columns.
For example, we can multiply the numeric column views by 100.
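df0['views'] * 100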

Add New Columns
We can also create new columns in the DataFrame by assigning transformations of existing columns.
The column length stores the number of seconds of the videos. Sometimes it’s more convenient to look at it in minutes. Let’s create a new column below.
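# length is in seconds; divide by 60 to get minutes.
df0['length_in_minutes'] = df0['length'] / 60
df0.head()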
By printing out the head of df0, we can see the new column length_in_minutes being added to it.

Similarly, we can perform categorical operations and create new columns.
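df0['title_lowercase'] = df0['title'].str.lower()
df0.head()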
The new column title_lowercase, which contains the lowercase version of title, is added to the DataFrame.

Rows of DataFrames
Rows represent observations in the datasets. In this section, we’ll cover the basic methods for getting subsets of the DataFrame and indexing.
The two main methods to locate rows are either by labels (loc) or integer-location based indexing (iloc).
Before that, let’s change the index of our dataset to contain special labels.
set_index: set index to columns
Like strings or lists, every pandas DataFrame has an index. By default, it is an integer ranging from 0 to (number of observations – 1). But we can also set the index to be one of the columns.
The default index of the transaction dataset df1 is integers 0 to 1056319. Since it is a time series dataset, the column date could be a better index.

Let’s set the index to the column date.
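df1 = df1.set_index('date')
df1.head()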
We can see that the date column replaced the leftmost index column.

Now we are ready to explore the row locating methods.
loc: locate by labels
We'll start with the loc method, which helps us locate rows by their labels/names.
For example, if we want to see all the transactions that happened on 1996-12-25, we can use loc to find them.
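df1.loc['1996-12-25']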

We can also get a range of dates using slicing. Note the difference from the usual Python slices. Both the start and the stop are included.
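# Both endpoints are included; the end date here is an arbitrary example.
df1.loc['1996-12-25':'1996-12-31']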

The loc method also allows us to subset both rows and columns by the labels.
For example, instead of printing all the columns for date index '1996-12-25', we can specify the columns.
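# The columns chosen here are examples.
df1.loc['1996-12-25', ['type', 'amount']]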

Besides labels, we can also use boolean arrays (True/False) to subset rows, which will be shown in the Filtering section later.
reset_index
One of the disadvantages of an index is that the data in it is not as accessible as the other columns. To do better analysis on the dates in later sections, let’s reset the index of df1 to return the date as a column.
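df1 = df1.reset_index()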

iloc: locate by index
The iloc method allows us to locate rows based on integer-location based indexing.
For example, to call the 10th to 19th rows, we can use slicing with iloc.
Note that the stop index 20 is not included since iloc slicing is similar to the regular Python slices.
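df1.iloc[10:20]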

When selecting columns using iloc, we need to specify the column number (starting from 0) rather than the names. For example, the column date is the first column, so it has the index 0.
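# Rows 10 to 19 of the first column (date); the positions are examples.
df1.iloc[10:20, 0]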

Besides integers, we can also use boolean arrays (True/False) to subset rows with iloc, similar to loc.
Random Sampling
Random sampling is a common practice in data science. We can use the sample method in pandas to do this.
For example, we can take a random sample of size ten from df1 and assign it to a new DataFrame df1_samp1.
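df1_samp1 = df1.sample(n=10)
df1_samp1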

Or, we can take a certain proportion from the original dataset using the parameter frac.
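# The fraction and the name df1_samp2 are example choices.
df1_samp2 = df1.sample(frac=0.001)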

Filter: Conditional Selection
Filtering is another popular technique in Python pandas to learn for data science. It’s useful when we want to subset the data based on some condition(s).
For example, we can grab the rows with both views over 50k and length over 50 minutes. The variable msk is a boolean array whose length equals the number of rows in the DataFrame. Each element is True when the condition is satisfied, and False otherwise.
As you may have noticed, we can use logical operators such as & (AND) and | (OR) in filtering. It is good practice to put parentheses around the conditions to make the order of evaluation explicit. Otherwise, it might not work as expected.
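# Using the length_in_minutes column we created earlier.
msk = (df0['views'] > 50000) & (df0['length_in_minutes'] > 50)
df0[msk]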

Moreover, we can combine filtering with loc to select rows and columns as below. This is when loc is used to locate rows based on boolean arrays.
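# The columns selected here are examples.
df0.loc[msk, ['title', 'views']]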

Besides displaying the subset, we can also change the values in the DataFrame based on the filters.
We have non-English terms like 'PRIJEM' in the column type within the transactions dataset.

We can transform them into English as below.
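# A minimal sketch: 'DEPOSIT' is our chosen English label; other categories would be mapped similarly.
df1.loc[df1['type'] == 'PRIJEM', 'type'] = 'DEPOSIT'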
Now with the English categories in column type, we can more easily perform our analysis.

That's a lot of row and column manipulation! Let's move on to another critical pandas functionality: GroupBy.
Groupby
Before learning pandas GroupBy, let’s look at some aggregating functions.
Below are some of the popular ones. We are not showing the results of the code since they are straightforward.
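# Some popular aggregating functions, shown on the numeric column amount.
df1['amount'].sum()
df1['amount'].mean()
df1['amount'].min()
df1['amount'].max()
df1['amount'].count()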
Instead of being applied over the whole dataset, aggregating functions are more powerful when applied to groups.
pandas GroupBy is a popular manipulation tool for data science. It can be used to group large amounts of data and compute operations on each group. The idea should be familiar to those who have used SQL's SELECT and GROUP BY statements.
For example, we can get the sum and mean of the column amount, grouped by column type.
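df1.groupby('type')['amount'].agg(['sum', 'mean'])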
We can also group by multiple columns. For example, we can get the sum and mean of the column amount, grouped by type and operation.
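df1.groupby(['type', 'operation'])['amount'].agg(['sum', 'mean'])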
There is a lot more to learn about pandas Groupby. That’s why we wrote another tutorial below. Take a look to dig into it deeper!
Further Reading: How to GroupBy with Python Pandas Like a Boss
To learn more about Python Pandas GroupBy, read this complete tutorial with practical examples. This article covers transform and apply functions as well.
Data Cleaning/Manipulation
pandas is also a powerful tool for cleaning and transforming the dataset. Let’s see some useful functions.
Apply Customized Functions
With the pandas apply method, we can apply a function on the DataFrame.
Let’s see an example.
The DataFrame df1’s type column has three main categories, as shown below.
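df1['type'].value_counts()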

What if we want to perform different operations on different types of transactions:
- when it’s a deposit, calculate amount/balance.
- when it’s a withdrawal, calculate amount/(balance + amount).
We can first define a function specifying the logic described above and then apply it to the DataFrame. The parameter axis is set to 1 so that the function is applied to each row across its columns.
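A sketch of what this could look like; the function name calc_ratio and the new column name ratio are our own choices, and the category labels assume the English ones from the earlier mapping:
def calc_ratio(row):
    # Deposit: amount relative to the resulting balance.
    if row['type'] == 'DEPOSIT':
        return row['amount'] / row['balance']
    # Withdrawal: amount relative to the balance before the withdrawal.
    elif row['type'] == 'WITHDRAWAL':
        return row['amount'] / (row['balance'] + row['amount'])
    # Other categories: no ratio defined.
    return None

df1['ratio'] = df1.apply(calc_ratio, axis=1)
df1.head()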
As you can see, the new column is added to the DataFrame.

Missing Data (NA/Null)
Missing data is a common issue for most datasets.
For example, by calling the tail of df1, we can see it has some missing values in the operation, bank, and account columns.
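df1.tail()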

We can fill in the missing values of column operation with 'UNKNOWN' using the fillna method.
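df1['operation'] = df1['operation'].fillna('UNKNOWN')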

We can also combine filtering with the isnull and notnull methods to subset rows with or without missing values.
The code below subsets all the rows with the missing values in the bank column.
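df1[df1['bank'].isnull()]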

The code below subsets all the rows with the bank column NOT missing.
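df1[df1['bank'].notnull()]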

Further Reading: Data Cleaning Techniques in Python: the Ultimate Guide
There is a lot more to learn about missing data. Take a look at this guide for how to deal with missing data.
Remove Columns/Rows
When we want to drop rows or columns from a DataFrame, we use the drop method.
When we drop columns, we set the parameter axis = 1. The default is 0, which drops by index/rows.
For example, we can drop the column title as below.
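df0.drop(['title'], axis=1)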

Or we can drop a few rows.
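# The row labels here are examples.
df0.drop([0, 1, 2])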

Note that the drop method does not modify the DataFrame in place; we often assign the result to a new DataFrame for further analysis. For example, df0_new = df0.drop(['title'], axis=1). This makes sure we always have the original dataset.
For details on drop_duplicates, please check out the duplicates section in Data Cleaning Guide.
Date functions
pandas has a lot of functionality for dates, which are common variables in data science.
For example, we can get the year, month, day, and weekday from the date column.
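# date is a datetime column (we parsed it with parse_dates earlier).
df1['year'] = df1['date'].dt.year
df1['month'] = df1['date'].dt.month
df1['day'] = df1['date'].dt.day
df1['weekday'] = df1['date'].dt.weekday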

We can then use methods such as Groupby and plot to analyze the data based on the date.
For example, we plot the bar chart of the average amount by month.
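df1.groupby('month')['amount'].mean().plot(kind='bar')
plt.show()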

Further Reading: Data Cleaning in Python: the Ultimate Guide (2020)
Again, we strongly recommend reading this guide if you want to clean the data better.
Merge and Concatenate DataFrames
Similar to SQL tables, we can also merge or concatenate DataFrames. Let’s see how it works.
For the demonstration, we create two new DataFrames left and right below.
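For instance (the values below are made up; col1 is the common column):
left = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['a', 'b', 'd'], 'col3': [4, 5, 6]})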

Merge
The merge join is done on columns or indexes.
For example, we can merge the left and right DataFrames on their common column col1. The how parameter specifies the type of merge to be performed. For example, when how = 'left', the new DataFrame keeps all the rows from the left DataFrame and the corresponding values from the right DataFrame.
The options for how are 'inner', 'left', 'outer', and 'right'.
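pd.merge(left, right, on='col1', how='left')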
Concatenate
When we have two DataFrames with the same row or column information, we can glue them together using concat.
For example, if we create two new DataFrames with the same columns, we may want to combine these two DataFrames.
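# A new DataFrame with the same columns as left; the values are made up.
extra = pd.DataFrame({'col1': ['e', 'f'], 'col2': [7, 8]})
pd.concat([left, extra])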

Sort DataFrames
What if we want to display the DataFrame in order of a column?
For example, we can use sort_values to order the DataFrame by the columns account_id and trans_id, in ascending order.
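df1.sort_values(by=['account_id', 'trans_id'], ascending=True)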

Save or Output DataFrames
After the many operations above, we may want to output and save these pandas DataFrames for further analysis.
We’ll introduce the two common methods – pickles and csv files.
pickles
If we need to save the DataFrame and resume our analysis in Python, it's better to output it in the pickle format. As a pickle, the DataFrame will be saved exactly as it is in Python, i.e., we keep the same index, data types, etc.
Let’s see how it works.
When we want to pause the analysis, we use to_pickle to save the DataFrame df0 as videos.pkl.
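df0.to_pickle('videos.pkl')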
Next time, we can access the same DataFrame by using read_pickle. Let’s also print out its head and info.
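df0 = pd.read_pickle('videos.pkl')
df0.head()
df0.info()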
The DataFrame looks the same as before. We can keep analyzing from where we left off.


csv files
csv files are more common than pickles and can be opened by other software/programs. So if we want to share the data with others, it’s better to save our DataFrames as csv files.
Note that some information might be lost compared to pickles. For example, we might have to specify the column data types when loading the csv files into Python again.
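For example (the filename videos.csv is our choice; index=False skips writing the row index):
df0.to_csv('videos.csv', index=False)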
That’s it! You’ve learned a lot about Python pandas for data science.
Did you get a better idea of how to analyze and manipulate data?
Leave a comment for any questions you may have or anything else!
Related “Break into Data Science” resources:
How to Learn Data Science Online: ALL You Need to Know
A detailed review of resources online, including courses, books, free tutorials, portfolios building, and more.
Python crash course: breaking into Data Science
A FREE Python online course, beginner-friendly tutorial. Start your successful data science career journey.
What are the In-Demand Skills for Data Scientists
Why Python, SQL, Machine Learning are the most in-demand skills for data science.
SQL Tutorial for Beginners: Learn SQL for Data Analysis
An ultimate tutorial to learn SQL for data analysis (from beginner to advanced). Learn & master SQL queries with this practical guide.