
12.1 Scraping with Beautiful Soup


How web scraping works

Most web pages are written in HTML. Every part of a page has an HTML tag that tells the web browser how the text should be displayed. For example, paragraphs begin with a <p> tag and end with a </p> tag; titles are bookended by the <title> and </title> tags. Web scraping tools interpret these tags and collect the data as instructed.

How to web scrape with Python’s Beautiful Soup

Python users have several packages at their disposal for gathering data en masse – Beautiful Soup, Selenium, and Scrapy – depending on the project’s complexity. Since this is an introduction to web scraping, we will use the beginner-friendly Beautiful Soup to demonstrate the principles behind web scraping. Our goal in the exercise below is to gather an online article into a CSV file, with the headline and the content in separate columns.

#Step 1: Import the Beautiful Soup package

After importing the Beautiful Soup package under the alias ‘bs’, we import urllib.request to open the URL. Next, we bring in the article in its entirety, i.e. the headline, the text, and all the code on that page, and save this information under the name ‘sauce’.
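A minimal sketch of this step; the URL is a placeholder, so substitute the address of the article you actually want to scrape:

```python
import bs4 as bs                 # Beautiful Soup, imported under the alias 'bs'
import urllib.request

url = "https://example.com/article"          # placeholder URL, not from the original
sauce = urllib.request.urlopen(url).read()   # raw HTML of the entire page
```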

#Step 2: Examine the HTML code

The raw source in ‘sauce’ is incredibly messy, so we organize it by parsing it with Beautiful Soup and the lxml parser. Doing so allows us to see the tags that structure the HTML code and identify the parts we need. In the example below, we see the article’s headline is wrapped between two tags, i.e. <title> and </title>.
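One way to do this, assuming the lxml package is installed alongside Beautiful Soup:

```python
soup = bs.BeautifulSoup(sauce, "lxml")   # parse the raw HTML
print(soup.prettify())                   # indented view that exposes the tag structure
```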

#Step 3: Identify and import the parts we want

In this instance, we are interested in extracting the text and the headline.

Let’s start off by scraping the text, which is usually bookended by a <p> tag at the front and a </p> tag at the end; the ‘p’ stands for ‘paragraph’. Since this tag is an HTML convention and is consistent throughout the page, we can get the entire text with one simple command. Instructing Python to print the paragraphs as text makes the display easier to read in Jupyter Notebook.
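That command might look like the sketch below; find_all is Beautiful Soup’s standard method for collecting every matching tag:

```python
article = soup.find_all("p")      # list of every <p> element on the page
for paragraph in article:
    print(paragraph.text)         # .text drops the tags, leaving readable text
```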

#Step 4: Remove special characters

Now we have the text stored inside the ‘article’ bucket, but there are special characters, such as newline escapes (\n), that need to be removed. To eliminate these special characters, we first need to convert the data type from a ‘list’ to a ‘string’.
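One way to make the conversion, joining the .text of each paragraph into a single string (the name article_string is an assumption carried through the rest of the example):

```python
# find_all() returned a list of tags; join their text into one string.
article_string = " ".join(paragraph.text for paragraph in article)
```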

There are several ways for us to clean the data. For instance, we could use re.compile(r"\W+") and re.sub(not_word_pattern, " ", article_string), but this approach removes all punctuation, making text analysis problematic since punctuation marks are used as markers when extracting sentences (see Chapter 13: Text mining).

Using the regex method shown below allows us to keep all alphanumeric characters, spaces, and periods.
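A minimal sketch of that method; the exact character class is an assumption, but it keeps letters, digits, spaces, and periods while replacing everything else with a space:

```python
import re

article_string = re.sub(r"[^A-Za-z0-9. ]", " ", article_string)   # drop other characters
article_string = re.sub(r"\s+", " ", article_string).strip()      # collapse extra whitespace
```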

#Step 5: Retrieve the headline

As we saw earlier in this example, the headline is wrapped by the <title> and </title> tags, so it can be obtained through a shorter process:
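A one-line sketch:

```python
headline = soup.title   # the <title> element, returned as a Beautiful Soup tag
```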

Just like before, we need to convert the data type, this time from a Beautiful Soup tag to a string. We also need to cleanse the headline of any special characters, trailing spaces, or tags.
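Reusing the regex approach from Step 4, one way to do this:

```python
headline_string = headline.get_text()   # Beautiful Soup tag -> plain string
headline_string = re.sub(r"[^A-Za-z0-9. ]", " ", headline_string).strip()
```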

#Step 6: Export the headline and text as a CSV file

When the data is clean enough, we combine the headline and the text in one data frame (‘results_df’). Finally, we export the data frame as a CSV file.
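A sketch using pandas; the output filename is an assumption:

```python
import pandas as pd

# One-row data frame: headline and article text in separate columns.
results_df = pd.DataFrame({"headline": [headline_string],
                           "text": [article_string]})
results_df.to_csv("article.csv", index=False)   # hypothetical output filename
```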

The result is a CSV file with the headline and the article text in separate columns.

A note about web scraping

Web scraping can be challenging since the code needs to accommodate page structures that can differ greatly between websites. Many commercial websites also deploy anti-scraping technology to block web scraping attempts. Circumventing these barriers could violate ethical and legal boundaries.