In this tutorial, we’ll show you the basics of web scraping using the Python Beautiful Soup library.
Imagine browsing the internet for data and finding some websites with relevant information. How can you get that data besides manually copying and pasting?
While some websites do offer convenient ways to get data using APIs, most websites don’t. This is when web scraping becomes the go-to method. Given Python’s popularity for data science, it’s essential to learn this skill to automate the data collection process.
Following an example, you’ll learn:
- the general process of web scraping using Python
- and in particular, how to use Beautiful Soup, a popular Python library.
Let’s get started!
To follow this tutorial, you need to know:
- Python basics, which you can learn with our FREE Python crash course: breaking into Data Science.
- HTML basics, which you can review quickly with the HTML Introduction from W3 Schools.
Note: understanding all the pieces of web scraping takes a lot of time and effort, and even experienced programmers need creativity and research to scrape the web. So this article will only focus on the basics to help you get started with this technique.
Step #0: Prepare for web scraping
Should I web scrape?
There are two things to consider before web scraping:
- Does the website offer downloadable files or APIs?
Web scraping is not the most efficient way to grab data from a website. It depends on each site’s structure, so a small change in the website may force you to update your code. So before jumping in and scraping, you should find out if there are better methods to retrieve the data, such as downloadable files or APIs (Application Programming Interfaces).
Further Reading: APIs are offered by the websites themselves, so they are easier to use and more stable than web scraping. So congratulations if the data you want is accessible through APIs! Check out the tutorial How to call APIs with Python to request data to grab it.
- Is it legal to scrape the data?
Web scraping does occupy a website’s resources. Because of this, and for copyright reasons, some websites explicitly forbid web scraping. So before proceeding, it’s important to read the rules about scraping in the website’s terms and conditions or its robots.txt file (e.g., https://www.indeed.com/robots.txt).
When the website does not specify any related rules, it’s still a good idea not to use excessive resources of the website. For example, you can use the time.sleep() method to slow down the process when scraping multiple pages. For our example project in this tutorial, the process is short, so we won’t be using it.
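As a sketch, slowing down a multi-page scrape with time.sleep() might look like this (the URLs here are hypothetical placeholders, not real pages to scrape):

```python
import time

# Hypothetical list of page URLs -- placeholders for illustration only
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    # ... send the request and process the page here ...
    time.sleep(1)  # pause 1 second so we don't overload the server
```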
When you are sure web scraping is the way to go, let’s move on and learn some basic knowledge.
What’s the procedure for web scraping?
To web scrape, we need to know some behind-the-scenes processes when browsing the websites.
Each time we want the website to do something (e.g., return a page with the information), we send HTTP requests to its server. As a response, the server provides a response message containing the main contents in the HTML (Hypertext Markup Language) format. Then the web browser (e.g., Chrome) assembles the contents and displays them as the regular multi-media page for us.
This is similar to sending a shopping order to a merchant and having the products shipped to you.
It’s not necessary to understand all the details. But it’s helpful to keep the broad steps in mind since we’ll follow a similar procedure to scrape the web pages. We’ll write Python code to:
- send requests to the server
- grab the data returned in HTML documents
- use Beautiful Soup to parse it
- extract and transform the data for analysis.
Great, now we are ready to move into Python!
Step #1: Import Python libraries
In this step, we’ll look at the Python libraries that will be used for our web scraping in Python:
- requests: the de facto standard Python package for HTTP (it’s a third-party library, not part of the standard library), which allows us to send HTTP requests easily.
- bs4 (beautifulsoup4): this is the Beautiful Soup library, which helps us pull data out of the HTML files returned by HTTP requests. It’s named bs4 since it’s the 4th version of the library at the time of this article.
- pandas: the popular data manipulation and analysis tool. We’ll be using it to format the scraped data into DataFrames for further analysis.
After installing them with the pip install command, we can import them into Python as below.
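Assuming the packages are installed (e.g., pip install requests beautifulsoup4 pandas), the imports would look something like this:

```python
import requests                 # send HTTP requests
from bs4 import BeautifulSoup   # parse HTML documents
import pandas as pd             # format scraped data as DataFrames
```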
Step #2: Explore the website
Next, we need to take a closer look at the website.
Which web pages contain the data we want?
What are the URLs of those pages?
To keep the learning process straightforward, we created a page called My Example Website.
Assume we want a list of fruit names and a table of real estate addresses and prices. They both happen to be on this web page, with the URL https://liannewriting.github.io/scraping_example.html.
This is easy because the information we want is on a static URL, which does not change.
Yet, there are also dynamic URLs. For example, when searching for ‘data scientist’ jobs in ‘New York, NY’ on Indeed.com, you may notice the URL looks like this
https://www.indeed.com/jobs?q=data%20scientist&l=New%20York%2C%20NY. This is more tricky. It is long and contains query parameters. In which case, you’ll need to figure out the pattern of the URLs before scraping it.
We won’t go into details here. Just know that this URL has the two following main components:
- a consistent base URL: https://www.indeed.com/jobs
- a ‘?’ followed by two query parameters: q=data scientist&l=New York, NY.
This understanding of the structure is important when sending requests to the server. To read more about static and dynamic URLs, check out Dynamic URLs vs. static URLs.
All right, let’s get back to our example. Once we find the URLs, we can explore the next question:
What is the structure of the website?
It’s a good idea to explore the website’s contents using your web browser’s developer tools. For example, if you are using Chrome, here’s one easy way to access the Chrome DevTools:
- open the web page https://liannewriting.github.io/scraping_example.html in Chrome.
- right-click any element you want to look at and select Inspect to get into the Elements panel. You can expand or collapse the children under each element.
As you hover the cursor over an element in the panel, you’ll see the corresponding part highlighted on the page. For example, hovering over the div container with ‘Avocado’ highlights it.
From this, you can see that the main tags we are interested in are:
- the div elements with the CSS class 'data-container', which contain the fruit names.
- the table element with the id 'data-table', which holds all the address and price data.
All right, now that we’re familiar with the website, we are ready to send requests to the server.
Step #3: Request for data
There are different types of HTTP requests. We’ll use one of the most common and simplest methods: the GET request.
In the Python code below, we use the requests package to send a GET request and assign the returned response object to response. We also ask Python to print out the URL visited as well as the status code to validate the request result.
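A minimal sketch of that request step; the try/except is an addition here so the snippet degrades gracefully if you run it without network access:

```python
import requests

url = "https://liannewriting.github.io/scraping_example.html"
try:
    # Send a GET request and keep the returned response object
    response = requests.get(url, timeout=10)
    # Print the visited URL and the status code to validate the request
    print("Visited URL:", response.url, response.status_code)
except requests.RequestException as exc:
    print("Request failed:", exc)
```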
A status code of 200 represents a successful request. Other codes indicate the request had errors, which need more attention before proceeding further.
Nice! Our request was OK, as shown below.
Visited URL: https://liannewriting.github.io/scraping_example.html 200
Now we are ready to make the ‘soup’ with Python BeautifulSoup.
Step #4: Parse the HTML doc with Beautiful Soup
The data is in the text content of response, which is response.text, and is HTML. We can use BeautifulSoup to parse it, saving us a lot of time when web scraping in Python. This transforms the HTML document into a BeautifulSoup object, which is a complex tree of Python objects.
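In the real workflow you’d pass response.text from Step #3 to the BeautifulSoup constructor; the short HTML string below is a stand-in so the snippet runs on its own:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text so the snippet is self-contained
html = "<html><body><div class='data-container'>Apple</div></body></html>"

# Parse the HTML into a BeautifulSoup object (a tree of Python objects)
soup = BeautifulSoup(html, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```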
This is great. Let’s print out the soup to take a look!
We can print out the soup object in an easier-to-read format using the prettify() method. The printed result is not included here since it’s long, but you can print it on your computer or look at it here.
soup has content corresponding to what we explored in the Chrome DevTools earlier. So it’s still a long document that’s hard to read.
It’s time to learn what’s inside this soup object. We’ll focus on the Tag object, which corresponds to the tags of the original HTML document. It is the most common type you’ll deal with when scraping for data.
Let’s find the data.
Step #5: Find the data with Beautiful Soup
Let’s start from a common usage: searching for tags with a specific class.
Example #1: Find the div tags by class name
First, we can use find_all() to get all the div tags with the class name ‘data-container’ as below:
- the first argument is name, which we didn’t write out explicitly; we pass in the string 'div' to filter for tags named div.
- the second argument is class_, which we specified as the class name 'data-container'. Note that ‘class’ is a reserved word in Python, so Beautiful Soup uses ‘class_’ as the argument name.
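A sketch of that call, using an inline copy of the page’s fruit list as a stand-in for response.text:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: the fruit list from the example page
html = """
<div class="data-container">Apple</div>
<div class="data-container">Orange</div>
<div class="data-container">Peach</div>
<div class="data-container">Pear</div>
<div class="data-container">Avocado</div>
<div class="data-container">Strawberry</div>
<div class="data-container">Grape</div>
<div class="data-container">Blueberry</div>
<div class="data-container">Blackberry</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Get all <div> tags whose class attribute is 'data-container'
data_containers = soup.find_all("div", class_="data-container")
print(data_containers)
```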
The find_all() method returns a list of results. So if you print out data_containers, it looks like below:
[<div class="data-container">Apple</div>, <div class="data-container">Orange</div>, <div class="data-container">Peach</div>, <div class="data-container">Pear</div>, <div class="data-container">Avocado</div>, <div class="data-container">Strawberry</div>, <div class="data-container">Grape</div>, <div class="data-container">Blueberry</div>, <div class="data-container">Blackberry</div>]
This is one big step closer to the data!
We can use the below code to explore the elements more.
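For instance, we can print a single Tag and its text (using a one-div stand-in for the parsed page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="data-container">Apple</div>', "html.parser")
data_containers = soup.find_all("div", class_="data-container")

print(data_containers[0])       # the whole Tag element
print(data_containers[0].text)  # only the text inside the tag
```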
<div class="data-container">Apple</div> Apple
To get the full list of fruit names, we can set up an empty list dat first and then loop through all the tags for the data.
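Here’s a sketch of that loop, again with the page’s fruit list inlined as a stand-in for the parsed response:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: the fruit list from the example page
html = """
<div class="data-container">Apple</div>
<div class="data-container">Orange</div>
<div class="data-container">Peach</div>
<div class="data-container">Pear</div>
<div class="data-container">Avocado</div>
<div class="data-container">Strawberry</div>
<div class="data-container">Grape</div>
<div class="data-container">Blueberry</div>
<div class="data-container">Blackberry</div>
"""
soup = BeautifulSoup(html, "html.parser")

dat = []  # empty list to collect the fruit names
for container in soup.find_all("div", class_="data-container"):
    dat.append(container.text)  # keep only the text inside each tag
print(dat)
```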
['Apple', 'Orange', 'Peach', 'Pear', 'Avocado', 'Strawberry', 'Grape', 'Blueberry', 'Blackberry']
Nice! Now let’s move onto the addresses and price data within the table.
Example #2: Find the table by id
Since there’s only one table within this document, we can use the find() method instead of find_all(). The find() method returns the single result if found.
To get the table with the id attribute ‘data-table’, we use the code below, similar to the previous example. Note that an id attribute should be unique among HTML elements, so it will get us the exact table we want.
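A sketch of that lookup, with the example page’s table inlined as a stand-in for response.text:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: the table from the example page
html = """
<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
<tr><td>2 Second St</td><td>200000</td></tr>
<tr><td>3 Third St</td><td>300000</td></tr>
<tr><td>4 Fourth St</td><td>400000</td></tr>
<tr><td>5 Fifth St</td><td>500000</td></tr>
<tr><td>6 Sixth St</td><td>600000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# id attributes are unique, so find() returns exactly the table we want
data_table = soup.find("table", id="data-table")
print(data_table)
```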
<table id="data-table"> <tr><th>Address</th><th>Price</th></tr> <tr><td>1 First St</td><td>100000</td></tr> <tr><td>2 Second St</td><td>200000</td></tr> <tr><td>3 Third St</td><td>300000</td></tr> <tr><td>4 Fourth St</td><td>400000</td></tr> <tr><td>5 Fifth St</td><td>500000</td></tr> <tr><td>6 Sixth St</td><td>600000</td></tr> </table>
To get the data, we need to dig deeper into the nested tags.
Notice that within the table tag, there are also nested tags:
- tr: the table row tags represent the rows of the HTML table. This is not the innermost tag containing the data, so let’s leave it.
- th: the table heading tags hold the headers, which are the column names. We don’t need this information, so let’s ignore them too.
- td: finally, the table data/cell tags hold the data we’re interested in. So we’ll use the find_all() method on data_table to return all its child td tags.
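That step might look like this, building on the same inlined table:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: the table from the example page
html = """
<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
<tr><td>2 Second St</td><td>200000</td></tr>
<tr><td>3 Third St</td><td>300000</td></tr>
<tr><td>4 Fourth St</td><td>400000</td></tr>
<tr><td>5 Fifth St</td><td>500000</td></tr>
<tr><td>6 Sixth St</td><td>600000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
data_table = soup.find("table", id="data-table")

# All <td> cells nested inside the table, in document order
tds = data_table.find_all("td")
print(tds)
```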
[<td>1 First St</td>, <td>100000</td>, <td>2 Second St</td>, <td>200000</td>, <td>3 Third St</td>, <td>300000</td>, <td>4 Fourth St</td>, <td>400000</td>, <td>5 Fifth St</td>, <td>500000</td>, <td>6 Sixth St</td>, <td>600000</td>]
There you go!
From the above printed list, we can see that indices 0, 2, 4, 6, 8, 10 are addresses while the others are prices. We can loop through them using the enumerate function. Read Tip #3: Enumerate Function in Python Tips and Tricks if you are not familiar with it.
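One way to split the cells with enumerate (the variable names addresses and prices are chosen here for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: the table from the example page
html = """
<table id="data-table">
<tr><th>Address</th><th>Price</th></tr>
<tr><td>1 First St</td><td>100000</td></tr>
<tr><td>2 Second St</td><td>200000</td></tr>
<tr><td>3 Third St</td><td>300000</td></tr>
<tr><td>4 Fourth St</td><td>400000</td></tr>
<tr><td>5 Fifth St</td><td>500000</td></tr>
<tr><td>6 Sixth St</td><td>600000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
tds = soup.find("table", id="data-table").find_all("td")

addresses, prices = [], []
for i, td in enumerate(tds):
    if i % 2 == 0:   # even indices (0, 2, 4, ...) hold addresses
        addresses.append(td.text)
    else:            # odd indices hold prices
        prices.append(td.text)
```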
Now that we have all the data in the Python list format, it’s better to transform them into pandas DataFrames. While this step is optional, it’s more convenient for analysis in Python.
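A sketch of that conversion; the column names fruit_name, addresses, and prices mirror the printed output below, while the DataFrame variable names are chosen here for illustration:

```python
import pandas as pd

# Lists collected in the earlier scraping steps
fruit_names = ['Apple', 'Orange', 'Peach', 'Pear', 'Avocado',
               'Strawberry', 'Grape', 'Blueberry', 'Blackberry']
addresses = ['1 First St', '2 Second St', '3 Third St',
             '4 Fourth St', '5 Fifth St', '6 Sixth St']
prices = ['100000', '200000', '300000', '400000', '500000', '600000']

# Wrap each dataset in a DataFrame for further analysis
fruits = pd.DataFrame({'fruit_name': fruit_names})
houses = pd.DataFrame({'addresses': addresses, 'prices': prices})
print(fruits)
print(houses)
```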
   fruit_name
0       Apple
1      Orange
2       Peach
3        Pear
4     Avocado
5  Strawberry
6       Grape
7   Blueberry
8  Blackberry

     addresses  prices
0   1 First St  100000
1  2 Second St  200000
2   3 Third St  300000
3  4 Fourth St  400000
4   5 Fifth St  500000
5   6 Sixth St  600000
Further Reading: if you are not familiar with pandas, the most popular data manipulation and analysis tool in Python, check out Learn Python Pandas for Data Science: Quick Tutorial.
Now they should look familiar to you!
We’ve only covered the most common Tag object. For more details on all the object types, check out the kinds of objects documentation. And if you prefer to use CSS selectors, check out that documentation as well.
Other Python web scraping libraries
Selenium: this package is especially useful in some special situations:
- when scraping a website that requires more actions before accessing the data, such as logging in.
Scrapy: this is a more powerful framework that can help with web scraping. Try it if you want more advanced functionalities.
With the example in this tutorial, you learned web scraping basics with the Beautiful Soup Python library. For your convenience, the above Python code is compiled together in this GitHub repo.
Even though the real-world situation is often more complicated, you’ve got a good foundation to explore yourself!
We’d love to hear from you. Leave a comment for any questions you may have or anything else.