How to do Web Scraping using Python Beautiful Soup
 Step-by-step basics

Lianne & Justin


In this tutorial, we’ll show you the basics of web scraping using the Python Beautiful Soup library.

Imagine you are browsing the internet for data and find some websites with relevant information. How can you get the data other than by manually copying and pasting?

While some websites do offer convenient ways to get data using APIs, most websites don’t. This is when web scraping becomes the go-to method. Given Python’s popularity for data science, it’s essential to learn this skill to automate this data collection process.

Following an example, you’ll learn:

  • the general process of web scraping using Python
  • and in particular, how to use Beautiful Soup, a popular Python library.

Let’s get started!



To follow this tutorial, you need to know:

Note: it takes much effort and time to understand all the pieces of web scraping. And it requires creativity and research even for an experienced programmer to web scrape. So this article will only focus on the basics to help you get started with this technique.


Step #0: Prepare for web scraping

Should I web scrape?

There are two things to consider before web scraping:

  • Does the website offer downloadable files or APIs?
    Web scraping is not the most efficient way to grab data from a website. It depends on each site’s structure, so a small change in the website may force you to update the code. So before jumping in and scraping, you should find out if there are better methods to retrieve the data, such as downloadable files or APIs (Application Programming Interfaces).

Further Reading: APIs are offered by the websites themselves, so they are easier to use and more durable than web scraping. So congratulations if the data you want is accessible through APIs. Check out the tutorial How to call APIs with Python to request data to grab it!

  • Is it legal to scrape the data?
    Web scraping does occupy a website’s resources. For this reason, and for copyright reasons, some websites forbid web scraping. So before proceeding, it’s important to read the rules about scraping in the website’s terms and conditions or the robots.txt file (e.g., https://www.indeed.com/robots.txt).
    When the website does not specify any related rules, it’s still a good idea not to use excessive resources of the website. For example, you can use the time.sleep() function to pause between requests when scraping multiple pages. For our example project in this tutorial, the process is short, so we won’t be using it.
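
The two checks above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the robots.txt content, the example.com URLs, and the half-second delay are all hypothetical placeholders, not rules from any real website.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of downloading a real file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Hypothetical page URLs on a placeholder domain.
urls = [
    "https://example.com/page1.html",
    "https://example.com/private/page2.html",
]

# Keep only the URLs that the rules allow any user agent ("*") to fetch.
allowed = [url for url in urls if parser.can_fetch("*", url)]
print(allowed)

DELAY_SECONDS = 0.5  # assumed pause; adjust it to the website's guidelines
for url in allowed:
    # ... the request for `url` would be sent here ...
    time.sleep(DELAY_SECONDS)  # slow down between consecutive pages
```

For a real website, RobotFileParser can also download the file itself through its set_url() and read() methods.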

When you are sure web scraping is the way to go, let’s move on and learn some basic knowledge.

What’s the procedure for web scraping?

To web scrape, we need to know some behind-the-scenes processes when browsing the websites.

Each time we want the website to do something (e.g., return a page with the information), we send HTTP requests to its server. As a response, the server provides a response message containing the main contents in the HTML (Hypertext Markup Language) format. Then the web browser (e.g., Chrome) assembles the contents and displays them as the regular multi-media page for us.

The process is similar to sending an order for products to a merchant and having them shipped to you.

It’s not necessary to understand all the details. But it’s helpful to keep the broad steps in mind since we’ll follow a similar procedure to scrape the web pages. We’ll write Python code to:

  • send requests to the server
  • grab the data returned in HTML documents
  • use Beautiful Soup to parse it
  • extract and transform the data for analysis.

Great, now we are ready to move into Python!

Step #1: Import Python libraries

In this step, we’ll look at the Python libraries used for web scraping in this tutorial:

  • requests: a widely used third-party Python library for HTTP, which allows us to send HTTP requests easily.
  • bs4 (beautifulsoup4): this is the Beautiful Soup library, which helps us pull data out of the HTML files returned by HTTP requests. It’s named bs4 since it’s the 4th version of the library at the time of this article.
  • pandas: the popular data manipulation and analysis tool. We’ll be using it to format the scraped data into DataFrames for further analysis.

After installing using the pip install command, we can import them into Python as below.
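
A minimal sketch of the installation and imports, assuming the three libraries are installed under their usual PyPI names (requests, beautifulsoup4, pandas). The alias pd is just the common convention.

```python
# Install once from a terminal (PyPI package names):
#   pip install requests beautifulsoup4 pandas

import requests                 # sends the HTTP requests
from bs4 import BeautifulSoup   # parses the returned HTML
import pandas as pd             # holds the scraped data in DataFrames
```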

Step #2: Explore the website

Next, we need to take a closer look at the website.

Which web pages contain the data we want?

What are the URLs of those pages?

To keep the learning process straightforward, we created a page called My Example Website.

Assume we want a list of fruit names and a table of real estate addresses and prices. They happen to be both on this web page with URL https://liannewriting.github.io/scraping_example.html.

My Example Website rendered in a Chrome browser

This is easy because the information we want is on a static URL, which does not change.

Yet, there are also dynamic URLs. For example, when searching for ‘data scientist’ jobs in ‘New York, NY’ on Indeed.com, you may notice the URL looks like this: https://www.indeed.com/jobs?q=data%20scientist&l=New%20York%2C%20NY. This is trickier: the URL is long and contains query parameters. In this case, you’ll need to figure out the pattern of the URLs before scraping.

We won’t go into details here. Just know that this URL has the two following main components:

  • a consistent base URL: https://www.indeed.com/jobs
  • ‘?’ and two query parameters: q=data scientist&l=New York, NY.

This understanding of the structure is important when sending requests to the server. To read more about static and dynamic URLs, check out Dynamic URLs vs. static URLs.
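
To make the two components concrete, here is a small sketch that rebuilds the Indeed URL above from its base URL and query parameters using only the standard library. The variable names are our own choice for illustration.

```python
from urllib.parse import urlencode, quote

base_url = "https://www.indeed.com/jobs"
params = {"q": "data scientist", "l": "New York, NY"}

# quote_via=quote encodes spaces as %20 (the default would use '+' instead).
query_string = urlencode(params, quote_via=quote)
url = f"{base_url}?{query_string}"
print(url)
```

The requests library accepts the same kind of dictionary through the params argument of requests.get(), so you rarely need to assemble the query string by hand.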

All right, let’s get back to our example. Once we find the URLs, we can explore the next question:

What is the structure of the website?

It’s a good idea to explore the website’s contents using your web browser’s developer tools. For example, if you are using Chrome, here’s one easy way to access the Chrome DevTools:

  • open the web page https://liannewriting.github.io/scraping_example.html in Chrome.
  • right-click any element you want to look at and select Inspect to get into the Elements panel. You can expand or collapse the children under each component.
    As you hover the cursor around the element in the panel, you’ll see the corresponding part highlighted on the page. For example, the div container with ‘Avocado’ is highlighted below.

From this, you can see that the main tags we are interested in are:

  • the div elements with the CSS class 'data-container', which contain the fruit names.
  • the table element with the id 'data-table', which has all the addresses and price data.

All right, after you become familiar with the website, we are ready to send requests to the server.

Step #3: Request for data

There are different types of HTTP requests. We’ll use one of the most common and simplest methods: the GET request.

In the Python code below, we use the requests package to send a GET request and assign the returned response object as response. Also, we ask Python to print out the URL visited as well as the status code to validate the request result.
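
A minimal sketch of this step, wrapped in a small helper so it can be reused. The function name fetch and the 10-second timeout are our own assumptions; requests.get(), response.url, and response.status_code are part of the requests library.

```python
import requests

URL = "https://liannewriting.github.io/scraping_example.html"


def fetch(url, timeout=10):
    """Send a GET request and report the visited URL and the status code."""
    response = requests.get(url, timeout=timeout)
    print("Visited URL:", response.url)
    print(response.status_code)
    return response
```

Calling response = fetch(URL) sends the request (this needs network access) and keeps the response object for the next steps.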

A status code of 200 represents a successful request. Codes in the 4xx and 5xx ranges indicate client or server errors, which need more attention before proceeding further.

Nice! Our request was OK, as shown below.

Visited URL: https://liannewriting.github.io/scraping_example.html
200

Now we are ready to make the ‘soup’ with Python BeautifulSoup.

Step #4: Parse the HTML doc with Beautiful Soup

The data is in the text content of the response, response.text, which is HTML. We can pass it to BeautifulSoup together with html.parser, the HTML parser included in Python’s standard library, saving us a lot of time when web scraping in Python. This transforms the HTML document into a BeautifulSoup object, which is a complex tree of Python objects.
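
A minimal sketch of the parsing step. To keep it runnable without a network connection, the html string below is a shortened, assumed stand-in for response.text, not the full example page.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: a shortened, assumed version of the example page.
html = """
<html><body>
<div class="data-container">Apple</div>
<div class="data-container">Orange</div>
<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
</table>
</body></html>
"""

# The second argument selects the parser; "html.parser" needs no extra install.
soup = BeautifulSoup(html, "html.parser")
print(type(soup))
```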

bs4.BeautifulSoup

This is great. Let’s print out the soup to take a look!

We can print out the soup object in an easier to read format using the prettify() method.
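
For instance, on a tiny assumed HTML fragment (again a stand-in for the real page), prettify() returns a string with each tag and text node on its own indented line:

```python
from bs4 import BeautifulSoup

# A tiny assumed fragment standing in for the real page.
html = '<div class="data-container">Apple</div><div class="data-container">Orange</div>'
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()  # returns a formatted string; soup itself is unchanged
print(pretty)
```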

The printed result is not included since it’s long. But you can either print it on your computer or look at it here.

You’ll notice soup has content corresponding to what we explored in the Chrome DevTools earlier. But it’s still a long document that’s hard to read.

It’s time to learn what’s inside this BeautifulSoup object.

We’ll focus on the Tag object, which corresponds to the tags of the original HTML document. It is the most common type you’ll deal with when scraping for data.
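
As a quick illustration on a one-line assumed fragment, a Tag exposes the tag name, its attributes, and its text. Note that class is a multi-valued attribute in HTML, so Beautiful Soup stores it as a list.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A one-element assumed fragment standing in for the real page.
html = '<div class="data-container">Apple</div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.div       # shortcut for the first <div> in the document
print(type(tag))     # the Tag class
print(tag.name)      # the tag name
print(tag.attrs)     # the attributes as a dictionary
print(tag.text)      # the text inside the tag
```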

Let’s find the data.

Step #5: Find the data with Beautiful Soup

The two most popular methods to search for data using Python Beautiful Soup are find() and find_all().

Let’s start from a common usage: searching for tags with a specific class.

Example #1: Find div with class

First, we can use find_all() to get all the div tags with the class name ‘data-container’ as below:

  • the first argument is name, which we don’t spell out as a keyword. Instead, we pass the string 'div' positionally to filter for tags with the name div.
  • the second argument is class_, which we specify as the class name 'data-container'. Note that ‘class’ is a reserved keyword in Python, so Beautiful Soup uses ‘class_’ as the argument.
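
The call described above can be sketched as follows. The html string is a shortened, assumed stand-in for the example page, with one extra div of a different class (our own addition) to show that the filter works.

```python
from bs4 import BeautifulSoup

# Shortened, assumed stand-in for the example page.
html = """
<div class="header">Fruits</div>
<div class="data-container">Apple</div>
<div class="data-container">Orange</div>
<div class="data-container">Peach</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 'div' is the tag name; class_ filters on the CSS class.
data_containers = soup.find_all("div", class_="data-container")
print(data_containers)
```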

The find_all() method returns a list of results. So if you print out data_containers, it looks like below:

[<div class="data-container">Apple</div>,
 <div class="data-container">Orange</div>,
 <div class="data-container">Peach</div>,
 <div class="data-container">Pear</div>,
 <div class="data-container">Avocado</div>,
 <div class="data-container">Strawberry</div>,
 <div class="data-container">Grape</div>,
 <div class="data-container">Blueberry</div>,
 <div class="data-container">Blackberry</div>]

This is one big step closer to the data!

We can explore the elements further, for example by printing the first tag and then its text.
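
A minimal sketch, assuming data_containers was built as in the previous step (it is rebuilt here from a short stand-in fragment so the snippet runs on its own):

```python
from bs4 import BeautifulSoup

# Short assumed stand-in for the example page.
html = '<div class="data-container">Apple</div><div class="data-container">Orange</div>'
soup = BeautifulSoup(html, "html.parser")
data_containers = soup.find_all("div", class_="data-container")

first = data_containers[0]
print(first)        # the whole tag, including its markup
print(first.text)   # only the text inside the tag
```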

<div class="data-container">Apple</div>
Apple

To get the full list of fruit names, we can set up an empty list dat first and then loop through all the tags for data.
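
A sketch of that loop, again on a shortened, assumed stand-in for the page so it runs on its own:

```python
from bs4 import BeautifulSoup

# Shortened, assumed stand-in for the example page.
html = """
<div class="data-container">Apple</div>
<div class="data-container">Orange</div>
<div class="data-container">Peach</div>
"""
soup = BeautifulSoup(html, "html.parser")
data_containers = soup.find_all("div", class_="data-container")

dat = []
for tag in data_containers:
    dat.append(tag.text)  # keep only the fruit name, not the markup
print(dat)
```

A list comprehension such as [tag.text for tag in data_containers] would produce the same list in one line.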

['Apple', 'Orange', 'Peach', 'Pear', 'Avocado', 'Strawberry', 'Grape', 'Blueberry', 'Blackberry']

Nice! Now let’s move on to the addresses and price data within the table.

Example #2: Find table with id

Since there’s only one table within this document, we can use the find() method instead of find_all(). The find() method returns the first matching result, or None if nothing matches.

To get the table with the id attribute ‘data-table’, we use the code below, similar to the previous example. Because an id should be unique within an HTML document, this gets us exactly the table we want.
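
A minimal sketch of the lookup, using a shortened, assumed copy of the table from the example page:

```python
from bs4 import BeautifulSoup

# Shortened, assumed copy of the table on the example page.
html = """
<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
<tr><td>2 Second St</td><td>200000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# id is passed as a keyword argument and matched against the id attribute.
data_table = soup.find("table", id="data-table")
print(data_table)
```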

<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
<tr><td>2 Second St</td><td>200000</td></tr>
<tr><td>3 Third St</td><td>300000</td></tr>
<tr><td>4 Fourth St</td><td>400000</td></tr>
<tr><td>5 Fifth St</td><td>500000</td></tr>
<tr><td>6 Sixth St</td><td>600000</td></tr>
</table>

To get the data, we need to dig deeper into the nested tags.

Notice that within the table tag, there are also the tr, th and td tags:

  • tr: the table row tags represent the rows of the HTML table. This is not the innermost tag containing the data, so let’s leave it.
  • th: the table header tags hold the headers, which are the column names. We don’t need this information, so let’s ignore them too.
  • td: finally, the table data/cell tags hold the data we’re interested in. So we’ll use the find_all() method on data_table to return all the td elements nested inside it.
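
The td lookup described in the last bullet can be sketched like this. The table is a shortened, assumed copy of the one on the example page, and the variable name cells is our own choice.

```python
from bs4 import BeautifulSoup

# Shortened, assumed copy of the table on the example page.
html = """
<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
<tr><td>2 Second St</td><td>200000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
data_table = soup.find("table", id="data-table")

# find_all() on a tag searches only inside that tag, at any nesting depth.
cells = data_table.find_all("td")
print(cells)
```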

[<td>1 First St</td>,
 <td>100000</td>,
 <td>2 Second St</td>,
 <td>200000</td>,
 <td>3 Third St</td>,
 <td>300000</td>,
 <td>4 Fourth St</td>,
 <td>400000</td>,
 <td>5 Fifth St</td>,
 <td>500000</td>,
 <td>6 Sixth St</td>,
 <td>600000</td>]

There you go!

From the above printed list, we can see that indices 0, 2, 4, 6, 8, 10 are addresses while the others are prices. We can loop through them using the enumerate function. Read Tip #3: Enumerate Function in Python Tips and Tricks if you are not familiar with it.
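
One way to write that loop, as a sketch on a shortened, assumed copy of the table. The list names addresses and prices follow the column names used later, and cells is our own name for the td results.

```python
from bs4 import BeautifulSoup

# Shortened, assumed copy of the table on the example page.
html = """
<table id="data-table">
<tr><td>1 First St</td><td>100000</td></tr>
<tr><td>2 Second St</td><td>200000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find("table", id="data-table").find_all("td")

addresses = []
prices = []
for i, cell in enumerate(cells):
    if i % 2 == 0:
        addresses.append(cell.text)  # even index: address column
    else:
        prices.append(cell.text)     # odd index: price column (still a string)

print(addresses)
print(prices)
```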

Now that we have all the data in the Python list format, it’s better to transform them into pandas DataFrames. While this step is optional, it’s more convenient for analysis in Python.
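
A minimal sketch of the conversion. The lists are typed in from the results shown earlier so the snippet runs on its own, the column names follow the printed output, and the DataFrame names df_fruits and df_homes are hypothetical.

```python
import pandas as pd

# Values copied from the scraping results shown earlier in this tutorial.
dat = ["Apple", "Orange", "Peach", "Pear", "Avocado",
       "Strawberry", "Grape", "Blueberry", "Blackberry"]
addresses = ["1 First St", "2 Second St", "3 Third St",
             "4 Fourth St", "5 Fifth St", "6 Sixth St"]
prices = ["100000", "200000", "300000", "400000", "500000", "600000"]

df_fruits = pd.DataFrame({"fruit_name": dat})
df_homes = pd.DataFrame({"addresses": addresses, "prices": prices})

print(df_fruits)
print(df_homes)
```

The scraped prices are strings; if you need numbers for analysis, df_homes["prices"].astype(int) converts them.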

   fruit_name
0       Apple
1      Orange
2       Peach
3        Pear
4     Avocado
5  Strawberry
6       Grape
7   Blueberry
8  Blackberry

     addresses  prices
0   1 First St  100000
1  2 Second St  200000
2   3 Third St  300000
3  4 Fourth St  400000
4   5 Fifth St  500000
5   6 Sixth St  600000

Further Reading: if you are not familiar with pandas, the most popular data manipulation and analysis tool in Python, check out Learn Python Pandas for Data Science: Quick Tutorial.

Now they should look familiar to you!

We’ve only covered the most common object type, Tag. For more details on all the object types, check out the Kinds of objects documentation. And if you prefer to use CSS selectors, check out the documentation on CSS selectors as well.
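
For comparison, here is a sketch of the same two searches written with CSS selectors through Beautiful Soup’s select() and select_one() methods. The html string is a shortened, assumed stand-in for the example page.

```python
from bs4 import BeautifulSoup

# Shortened, assumed stand-in for the example page.
html = """
<div class="data-container">Apple</div>
<div class="data-container">Orange</div>
<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

fruit_tags = soup.select("div.data-container")  # all divs with that class
table = soup.select_one("#data-table")          # the first element with that id
cells = soup.select("#data-table td")           # td elements inside the table

print([tag.text for tag in fruit_tags])
print([cell.text for cell in cells])
```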

Other Python web scraping libraries

Selenium: this package is especially useful in some special situations:

  • when scraping a website that requires more actions, such as logging in, before the data can be accessed.
  • when scraping for dynamic content, which involves JavaScript.

Scrapy: this is a more powerful framework that can help with web scraping. Try it if you want more advanced functionalities.


With the example in this tutorial, you learned web scraping basics with the Beautiful Soup Python library. For your convenience, the above Python code is compiled together in this GitHub repo.

Even though real-world situations are often more complicated, you’ve got a good foundation to explore on your own!

We’d love to hear from you. Leave a comment for any questions you may have or anything else.
