Learn Python Pandas for Data Science: Quick Tutorial
 Examples for all primary operations of DataFrames

Lianne & Justin

Lianne & Justin

Share on twitter
Share on linkedin
Share on facebook
Share on email
red pandas data science python
Source: Pixabay

This is a guide to help you quickly learn and use Python pandas for data science, machine learning.

Pandas is the most popular Python package for data manipulation and analysis. Data analysts or scientists often spend 80% of their time exploring and transforming data, which makes pandas an essential skill for data science.

In this tutorial, you’ll learn the most useful data science-related pandas techniques:

  • What is pandas Python.
  • How to install and import pandas.
  • What are the pandas data structures.
  • How to do the key dataframe operations such as reading/saving data, exploring data, manipulating data.
  • Much more.

This practical tutorial with examples will help you start real data analysis, if you just learned the basics of Python.
Note: if you are new to Python, take our FREE Python crash course for data science.

Let’s begin!



What is Pandas Python for Data Science?

Or why pandas?

Because pandas:

  • is easy to use, fast, and open-source.
  • offers powerful data structures and operations.
  • can be used to import, write, manipulate, clean, and analyze data. Many people consider it as the Excel within Python.
  • is the most popular Python data analysis/manipulation package, which is built upon NumPy.

In summary, pandas is a must-know package for data science or machine learning.

Note before the tutorial: it’s not necessary (or possible) to memorize all the functions or methods. Plus, there might be other functions or methods you need in the future.
We suggest you follow the tutorial to learn systematically, which covers the essentials. When you are practicing data science, a simple Google search would often solve the problem.

Install and Import Pandas

So how to get started with pandas?

You can install the Anaconda distribution to start using Python for data science, which is FREE and includes popular packages such as pandas.

If you already have Python but not pandas, please use the command below to install it.

pip install pandas

After downloading the package, we need to import the packages for each new Python session.

We also import the matplotlib package, which allows us to show some simple data visualizations.

Learn Python Pandas Data Structures

pandas has two main data structures: Series and DataFrame.

  • Series: a 1-dimensional labeled array that can hold any data type such as integers, strings, floating points, Python objects. 
    It has row/axis labels as the index.
  • DataFrame: a 2-dimensional labeled data structure with columns of potentially different types.
    It also has row labels as the index.

DataFrames can be considered as a collection of Series; it has a structure like a spreadsheet.

In this tutorial, we will focus on DataFrames since they need a lot more data manipulation than Series.

Next, let’s see how to read in data as pandas DataFrames.


Import Data into pandas DataFrame

To work in pandas, we can either create DataFrames or read data from external sources.

We’ll cover three common cases in practice.

Dictionary to Dataframe

There are different ways to construct DataFrames from scratch. Let’s use Python dictionaries as an example.

After a raw dataset is preprocessed in Python, it is often convenient to compile the new dataset as a dictionary. But dictionaries are not great objects for further analysis.

This is when we need to convert this dictionary object into a pandas DataFrame.

This is especially common when analyzing log files or text.

Let’s see an example below.

We have a dataset for housing sales records. First, we create two lists with values of the houses’ addresses and prices. Then we’ll combine these lists into a dictionary d with key ‘address’, value l1 and key ‘price’, value l2. We can simply transform this dictionary as a pandas DataFrame df2 as below.

learn python pandas for data science
dictionary to dataframe

This is great!

We’ve created our first pandas DataFrame. As you can see, it has six rows and two columns with an index from 0 to 5.

Yet, it is more often we import data from outside sources. Let’s look at the most common source: csv files.

read_csv

Within this tutorial, we’ll mainly use two sample datasets. Both can be downloaded following the links.

One is a YouTube dataset sydney_videos.csv; the other is a transactional dataset trans.csv. They are both datasets we’ve done analyses on, which makes this guide more practical. Please make sure the two files are stored in the same working directory as your Python code. This saves us from specifying the path in read_csv.

Since both datasets are saved as csv files. Let’s use the pandas read_csv function to import them into Python.

For the YouTube dataset, we can use the default settings of read_csv to load it as DataFrame df0.

We then call df0 for a look.

learn python pandas for data science sample dataset1

We can see that df0 is a DataFrame with 837 rows and four columns. Each of the rows is a video with columns:

  • title
  • time posted_ago
  • length
  • number of views

We can also change the settings of the input parameters to read_csv. For example, to read the transactional dataset, we set:

  • sep = ‘;’ to change the delimiter from commas (,) to semicolons (;).
  • dtype={‘k_symbol’: ‘string’, ‘bank’: ‘string’} to set the data types of columns k_symbol and bank to strings.
  • parse_dates=[‘date’] to parse the column date as a DateTime format.
learn python pandas for data science sample dataset2

Further Reading: Pandas read_csv to DataFrames: Python Pandas Tutorial
Read this complete tutorial to learn more about Python pandas read_csv. Or take the last section as a cheat sheet.

read_sql

Another common case is when we want to import data into Python from SQL servers. We can use read_sql to read in the SQL query or database table into a DataFrame.

For example, we have a sample database in Microsoft SQL Server. We first set up a connection with the server by using the SQLAlchemy function create_engine. Then we can query within Python to get the data as DataFrame.

read_sql from sql server

Further Readings:
How to Download and Install SQL Server and Sample Database: Step-by-Step
SQL Tutorial for Beginners: Learn SQL for Data Analysis
SQL is another essential data science skill. Read these articles to master it.

Besides CSV files and SQL databases, we can read data from a variety of sources with pandas. Learn more with pandas’ official IO tools documentation.

At the end of the tutorial, we’ll show how to save the DataFrames on your computer.

But let’s see what we can do with these DataFrames first.

Explore the DataFrames

After creating the pandas DataFrames, let’s learn more about their basic characteristics.

We’ll cover a few popular methods to do this.

info: summary

The info method shows a helpful summary of our DataFrames. It is one of the first methods you should apply after loading in the dataset.

Let’s try it out on the YouTube DataFrame df0.

We can see that it prints information, including:

  • the index information
  • the columns and their names, non-null values, counts, dtypes
  • memory usage
dataframe information summary

head or tail: first or last n rows

The head method is also used in every data analysis project. With its help, we can see exactly how the first n rows look like within the DataFrame.

Let’s print the first five rows of df0 using the default settings of the method.

dataframe head Learn Python Pandas for Data Science

We can also specify the number of rows to display.

dataframe head number of rows specified

The tail method returns the last n rows of the DataFrame. It can be used as a quick verification after we sort or append rows.

dataframe tail

shape: dimensionality

To get the number of rows and columns of the DataFrame, we can use shape.

(837, 4)

describe: descriptive statistics

The method describe generates descriptive statistics for both numeric and object columns.

For example, for the numeric column views, we can print out its summary statistics such as:

  • count
  • mean
  • standard deviation
  • percentiles.
dataframe describe numerical variable Learn Python Pandas for Data Science

For the object column posted_ago, we can see its corresponding statistics:

  • count
  • unique counts
  • the most common value and its frequency (top, freq).
dataframe describe categorical variable

boxplot and histogram: explore numerical variables

When the columns are numeric, it’s useful to explore their distributions by boxplot and histogram.

We can do a boxplot on the numerical column views.

dataframe pandas boxplot

We can also look at its histogram.

dataframe pandas histogram

Note: we prefer the seaborn library for data science since it has more features. But these methods in pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis
The YouTube dataset was also used in this complete guide of using seaborn Python for Exploratory Data Analysis. Discover how to use histograms, barplots, scatter plots, boxplots, heatmaps, etc.

value_counts and bar plot: unique values of categorical variables

The value_counts method is handy for categorical columns. It returns a pandas Series with counts of unique values, in descending order. And it excludes NA values by default.

For example, we can use it to see frequency tables of column posted_ago. It provides more details than the previous method describe().

dataframe value_counts

We can also display the results in percentages by using the normalize parameter.

dataframe value count proportions

It’s also helpful to visualize this result. We can do this by using the plot method.

We just plotted a bar chart.

dataframe pandas bar plot

Note: again, we prefer the seaborn library for data science since it has more features. But these methods in pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis

Now with a better understanding of the DataFrames, let’s see what we can do with its columns and rows.

Select and Manipulate Columns and Rows

Columns and rows are the basic structures of DataFrames. It’s common practice to select a subset or modify the columns/rows. Let’s learn how to do it with Python pandas. We’ll also cover related topics such as index settings and filtering.

Columns of DataFrames

The columns of the DataFrames often represent variables/features in the dataset. There are often names we can refer to these columns by.

Select Columns

The most basic way of selecting columns is by calling its name within squared brackets [].

For example, we can print out the columns length and views separately below.

Column Operations

We can also perform simple numerical or categorical operations on columns.

For example, we can perform a multiplication of 100 on the numeric column views.

column operations multiply by 100

Add New Columns

We can also create new columns in the DataFrame by assigning existing columns’ transformations.

The column length stores the number of seconds of the videos. Sometimes it’s more convenient to look at it in minutes. Let’s create a new column below.

By printing out the head of df0, we can see the new column length_in_minutes being added to it.

new columns added dataframe

Similarly, we can perform categorial operations and create new columns.

The new column title_lowercase contains the lower cases of title, is added to the DataFrame.

add column with lowercase title

Rows of DataFrames

Rows represent observations in the datasets. In this section, we’ll cover the basic methods for getting subsets of the DataFrame and indexing.

The two main methods to locate rows are either by labels (loc) or integer-location based indexing (iloc).

Before that, let’s change the index of our dataset to contain special labels.

set_index: set index to columns

Like strings or lists, every pandas DataFrame has an index. By default, it is an integer ranging from 0 to (number of observations – 1). But we can also set the index to be one of the columns.

The default index of the transaction dataset df1 is integers 0 to 1056319. Since it is a time series dataset, the column date could be a better index.

default index

Let’s set the index to the column date.

We can see that the date column replaced the leftmost index column.

set index to column date

Now we are ready to explore the row locating methods.

loc: locate by labels

We’ll start from the loc method, which helps us to locate rows by their labels/names.

For example, if we want to see all the transactions that happened on 1996-12-25, we can use loc to find them.

loc date labels example

We can also get a range of dates using slicing. Note the difference from the usual Python slices. Both the start and the stop are included.

loc date range example

The loc method also allows us to subset both rows and columns by the labels.

For example, instead of printing all the columns for date index ‘1996-12-25’, we can specify the columns.

loc rows and columns

Besides labels, we can also use boolean arrays (True/False) to subset rows, which will be shown in the Filtering section later.

reset_index

One of the disadvantages of an index is that the data in it is not as accessible as the other columns. To do better analysis on the dates in later sections, let’s reset the index of df1 to return the date as a column.

reset index

iloc: locate by index

The iloc method allows us to locate rows based on integer-location based indexing.

For example, to call the 10th to 19th rows, we can use slicing with iloc.

Note that the stop index 20 is not included since iloc slicing is similar to the regular Python slices.

iloc range example

When selecting columns using iloc, we need to specify the column number (starting from 0) rather than the names. For example, the column date is the first column, so it has the index 0.

iloc rows and columns example

Besides integers, we can also use boolean arrays (True/False) to subset rows with iloc, similar to loc.

Random Sampling

Random sampling is a common practice in data science. We can use the sample method in pandas to do this.

For example, we can take a random sample of size ten from df1 and assign it to a new DataFrame df1_samp1.

sampling of size 10

Or, we can take a certain proportion from the original dataset using the parameter frac.

sampling of a fraction

Filter: Conditional Selection

Filtering is another popular technique in Python pandas to learn for data science. It’s useful when we want to subset the data based on some condition(s).

For example, we can grab the rows with both views over 50k and length of over 50 minutes. The variable msk is a boolean array with length equals the number of rows in the Dataframe. The array element is True when the condition is satisfied, and False otherwise.

As you may have noticed, we can use logical operators such as & (AND), | (OR) in filtering. It is good practice to put the parentheses around the conditions to specify the execution. Otherwise, it might not work as expected.

filter for rows Learn Python Pandas for Data Science

Moreover, we can combine filtering with loc to select rows and columns as below. This is when loc is used to locate rows based on boolean arrays.

loc with boolean array filtering Learn Python Pandas for Data Science

Besides displaying the subset, we can also change the values in the DataFrame based on the filters.

We have non-English terms like ‘PRIJEM’ in the column type within the transactions dataset.

type with non-english words Learn Python Pandas for Data Science

We can transform them into English as below.

Now with the English categories in column type, we can more easily perform our analysis.

replace types using loc and filtering dataframe Learn Python Pandas for Data Science

That’s a lot of rows and columns manipulation! Let’s move on to another critical pandas functionality: Group by.

Groupby

Before learning pandas GroupBy, let’s look at some aggregating functions.

Below are some of the popular ones. We are not showing the results of the code since they are straightforward.

Instead of applying over the whole dataset, aggregating functions are more powerful on groups.

pandas Groupby is a popular manipulation tool for data science. It can be used to group large amounts of data and compute operations based on groups. The idea should be familiar to those who have used SQL’s SELECT and Group By statement.

For example, we can get the sum and mean of the column amount, grouped by column type.

We can also group by multiple columns. For example, we can get the sum, mean of the column amount, grouped by type and operation.

There is a lot more to learn about pandas Groupby. That’s why we wrote another tutorial below. Take a look to dig into it deeper!

Further Reading: How to GroupBy with Python Pandas Like a Boss
To learn more about Python Pandas GroupBy, read this complete tutorial with practical examples. This article covers transform and apply functions as well.

Data Cleaning/Manipulation

pandas is also a powerful tool for cleaning and transforming the dataset. Let’s see some useful functions.

Apply customized Functions

With the pandas apply method, we can apply a function on the DataFrame.

Let’s see an example.

The DataFrame df1’s type column has three main categories, as shown below.

types Learn Python Pandas for Data Science

What if we want to perform different operations on different types of transactions:

  • when it’s a deposit, calculate amount/balance.
  • when it’s a withdrawal, calculate amount/(balance + amount).

We can first define a function specifying the logic we described above and then apply it to the DataFrame. The parameter axis is set to 1 to apply the function on each row for columns.

As you can see, the new column is added to the DataFrame.

apply function to create new column in dataframe Learn Python Pandas for Data Science

Missing Data (NA/Null)

Missing data is a common issue for most datasets.

For example, by calling the tail of df1, we can see it has some missing values in the operation, bank, and account columns.

missing data NAN, NA in dataframe Learn Python Pandas for Data Science

We can fill in the missing values of column operation with ‘UNKNOWN’ using the fillna method.

fillna in dataframe Learn Python Pandas for Data Science

We can also combine the filtering method and isnull, and notnull to subset rows with or without missing values.

The code below subsets all the rows with the missing values in the bank column.

isnull dataframe  Learn Python Pandas for Data Science

The code below subsets all the rows with the bank column NOT missing.

notnull dataframe Learn Python Pandas for Data Science

Further Reading: Data Cleaning in Python: the Ultimate Guide (2020)
There is a lot more to learn about missing data. Take a look at this guide for how to deal with missing data.

Remove Columns/Rows

When we want to drop rows or columns from a DataFrame, we use the drop method.

When we drop columns, we set the parameter axis = 1. The default is 0, which drops by index/rows.

For example, we can drop the column title as below.

drop column title dataframe Learn Python Pandas for Data Science

Or we can drop a few rows.

drop rows in dataframe Learn Python Pandas for Data Science

Note that the drop method is not permanent; we often assign it to a new DataFrame for further analysis. For example, df0_new = df0.drop([‘title’], axis=1). This makes sure we always have the original dataset.

For details on drop_duplicates, please check out the duplicates section in Data Cleaning Guide.

Date functions

pandas has a lot of functionality for dates, which is a common variable in data science.

For example, we can get the year, month, day, and weekday from the date column.

date year month day weekday of dataframe columns  Learn Python Pandas for Data Science

We can then use methods such as Groupby and plot to analyze the data based on the date.

For example, we plot the bar chart of the average amount by month.

bar plot of amount group by month Learn Python Pandas for Data Science

Further Reading: Data Cleaning in Python: the Ultimate Guide (2020)
Again, we strongly recommend reading this guide if you want to clean the data better.

Merge and Concatenate DataFrames

Similar to SQL tables, we can also merge or concatenate DataFrames. Let’s see how it works.

For the demonstration, we create two new DataFrames left and right below.

new dataframes for merging

Merge

The merge join is done on columns or indexes.

For example, we can merge the left and right DataFrames on their common column col1. The how parameter specifies the type of merge to be performed. For example, when how = ‘left’, the new DataFrame keeps all the rows from the left DataFrame and the corresponding values from the right DataFrame.

Concatenate

When we have two DataFrames with the same row or column information, we can glue them together using concat.

For example, if we create two new DataFrames with the same columns, we may want to combine these two DataFrames.

concatenated dataframe Learn Python Pandas for Data Science

Sort DataFrames

What if we want to display the DataFrame in order of a column?

For example, we can use sort_values to order the DataFrame by columns account_id, and trans_id, in ascending order.

sort by columns in pandas dataframe Learn Python Pandas for Data Science

Save or Output DataFrames

We may want to output and save these pandas DataFrames for further analysis after the many operations.

We’ll introduce the two common methods – pickles and csv files.

pickles

If we need to save the DataFrame and resume our analysis in Python, it’s better to output it as the pickle format. As pickles, the DataFrame will be saved as it is in Python, i.e., we have the same index, data types, etc.

Let’s see how it works.

When we want to pause the analysis, we use to_pickle to save the DataFrame df0 as videos.pkl.

Next time, we can access the same DataFrame by using read_pickle. Let’s also print out its head and info.

The DataFrame looks the same as before. We can keep analyzing from where we left off.

pickle read into dataframe Learn Python Pandas for Data Science
Learn Python Pandas for Data Science dataframe summary

csv files

csv files are more common than pickles and can be opened by other software/programs. So if we want to share the data with others, it’s better to save our DataFrames as csv files.

Note that some information might be lost compared to pickles. For example, we might have to specify the column data types when loading the csv files into Python again.


That’s it! You’ve learned a lot about Python pandas for data science.

Did you get a better idea of how to analyze and manipulate data?

Leave a comment for any questions you may have or anything else!


Related “Break into Data Science” resources:

How to Learn Data Science Online: ALL You Need to Know
A detailed review of resources online, including coursesbooksfree tutorialsportfolios building, and more.

Python crash course: breaking into Data Science
A FREE Python online course, beginner-friendly tutorial. Start your successful data science career journey.

What are the In-Demand Skills for Data Scientists
Why Python, SQL, Machine Learning are the most in-demand skills for data science.

SQL Tutorial for Beginners: Learn SQL for Data Analysis
An ultimate tutorial to learn SQL for data analysis (from beginner to advanced). Learn & master SQL queries with this practical guide.

Share on twitter
Twitter
Share on linkedin
LinkedIn
Share on facebook
Facebook
Share on email
Email
Lianne & Justin

Lianne & Justin

Leave a Comment

Your email address will not be published. Required fields are marked *

More recent articles

Scroll to Top
We use cookies to ensure you get the best experience on our website.  Learn more.