
12. Extracting Data from the Web


The term ‘scraping’ refers to any automated method of extracting information from websites. Scraping serves many purposes, and its use cases include:

  • Marketers analyzing product reviews scraped from a website to get a snapshot of consumer behavior; 
  • Researchers examining public discourse around a political issue through social media discussions, retrieved via the platform’s API or scraped from its pages; or
  • Hedge funds assessing data scraped from the web to stay ahead of the curve.

The large data volume needed to gain meaningful insight makes an automated collection process imperative. Imagine copying and pasting the title, text, and date of 1,000 movie reviews into a spreadsheet one at a time. While services for manual labor exist (e.g. Amazon’s Mechanical Turk), such a tedious and time-consuming process is neither scalable nor sustainable. Thankfully, a myriad of options are available to minimize the time spent gathering web data. These range from solutions that require no coding skills (e.g. ParseHub, Octoparse, Scraper) to others that rely on programming languages such as Python and R.
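To make the contrast with manual copy-and-paste concrete, the sketch below parses review titles, dates, and text out of HTML using Python's built-in `html.parser` module. The markup and its class names are hypothetical stand-ins for one page of movie reviews; in practice the HTML would be fetched from a live site (e.g. with `urllib.request` or the `requests` library), and a dedicated parsing library such as Beautiful Soup would usually replace the hand-rolled parser shown here.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for one scraped page of movie reviews.
SAMPLE_PAGE = """
<div class="review"><h2>Great film</h2><span class="date">2021-03-01</span>
<p>Loved every minute.</p></div>
<div class="review"><h2>Not for me</h2><span class="date">2021-03-02</span>
<p>Too long.</p></div>
"""

class ReviewParser(HTMLParser):
    """Collect {title, date, text} records from the sample markup."""
    def __init__(self):
        super().__init__()
        self.reviews = []      # completed records
        self._current = None   # record currently being built
        self._field = None     # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "review":
            self._current = {"title": "", "date": "", "text": ""}
        elif self._current is not None:
            if tag == "h2":
                self._field = "title"
            elif tag == "span" and attrs.get("class") == "date":
                self._field = "date"
            elif tag == "p":
                self._field = "text"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("h2", "span", "p"):
            self._field = None
        elif tag == "div" and self._current is not None:
            self.reviews.append(self._current)
            self._current = None

parser = ReviewParser()
parser.feed(SAMPLE_PAGE)
for review in parser.reviews:
    print(review["title"], "|", review["date"])
```

However many reviews the page holds, the same parser extracts them all in one pass, which is exactly what the manual approach cannot scale to.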

People are free to specify the data they want harvested. For instance, someone analyzing the pricing fluctuations of competing products can narrow the collection to details such as the product name, price, and date, rather than also harvesting items like user reviews and user ratings along the way.
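Narrowing the collection in this way can be a simple filtering step over whatever records a scraper returns. The sketch below assumes hypothetical product records with made-up field names; it keeps only the name, price, and date and writes the trimmed records to CSV, a common end point for this kind of pipeline.

```python
import csv
import io

# Hypothetical records as a scraper might return them: each dict carries
# every field found on the product page, reviews and ratings included.
scraped = [
    {"product": "Widget A", "price": "19.99", "date": "2021-04-01",
     "user_review": "Works fine", "user_rating": "4"},
    {"product": "Widget B", "price": "24.50", "date": "2021-04-01",
     "user_review": "Broke quickly", "user_rating": "2"},
]

# Narrow the collection to the fields relevant for price tracking.
WANTED = ["product", "price", "date"]
narrowed = [{key: record[key] for key in WANTED} for record in scraped]

# Write the trimmed records out as CSV rows.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=WANTED)
writer.writeheader()
writer.writerows(narrowed)
print(buffer.getvalue())
```

Dropping unneeded fields at collection time keeps the dataset small and focused on the question being asked, here price tracking rather than sentiment.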