Introduction to web scraping

This lesson is not currently maintained. It was intended to be a lesson for librarians as part of the Library Carpentry lesson program. In 2023, the Library Carpentry Curriculum Advisory Committee requested that the lesson be transferred to The Carpentries Incubator, in the hope that members of the community may find it and wish to take ownership of its development.

If you would like to become a maintainer for this lesson and take charge of its development, please open an issue on the repository and a member of The Carpentries team will help you get started.

Web scraping is the process of extracting data from websites. Some data on the web is presented in a format that makes it easy to collect and use, for example as downloadable comma-separated values (CSV) datasets that can be imported into a spreadsheet or loaded into a data analysis script. Often, however, data that is publicly available is not readily available for reuse. For example, it may be contained in a PDF, in a table on a website, or spread across multiple web pages.

There are a variety of ways to scrape a website to extract information for reuse. In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this becomes impractical if there is a large amount of data to extract, or if it is spread over a large number of pages. Instead, specialized tools and techniques can be used to automate the process by defining what sites to visit, what information to look for, and whether data extraction should stop once the end of a page has been reached or should follow hyperlinks and repeat the process recursively. Automating web scraping also makes it possible to run the process at regular intervals and capture changes in the data.
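The three decisions described above (what pages to visit, what information to look for, and whether to follow hyperlinks) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the approach taught later in the lesson: the `PAGES` dictionary is a hypothetical in-memory stand-in for a website, and the names `HeadingAndLinkParser` and `scrape` are invented for this example. A real scraper would fetch pages over HTTP instead.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "website" mapping URL -> HTML.
# A real scraper would download these pages over HTTP.
PAGES = {
    "/page1": '<h1>First</h1><a href="/page2">next</a>',
    "/page2": '<h1>Second</h1><a href="/page1">back</a><a href="/page3">next</a>',
    "/page3": "<h1>Third</h1>",
}


class HeadingAndLinkParser(HTMLParser):
    """Collects the text of <h1> elements and the href of <a> elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


def scrape(start, follow_links=True):
    """Return the headings found from `start`, optionally following links."""
    seen = set()      # pages already visited, so link cycles terminate
    queue = [start]   # pages still to visit
    results = []
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = HeadingAndLinkParser()
        parser.feed(PAGES[url])
        results.extend(parser.headings)
        if follow_links:
            queue.extend(parser.links)
    return results


print(scrape("/page1"))                      # follows links across pages
print(scrape("/page1", follow_links=False))  # stops after the first page
```

Keeping track of visited pages in `seen` is what stops the process from looping forever when pages link back to each other.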

Prerequisites

As web scraping is a technique for extracting data from web pages, it requires some understanding of the technologies used to display information on the web. This lesson therefore assumes that learners have some familiarity with HTML and the Document Object Model (DOM).

The first part of this lesson uses browser extensions to introduce the concepts of web scraping and the XPath syntax for selecting elements on a web page. It requires no further specific knowledge. The second part introduces specialized libraries for scraping websites by writing custom computer programs, and requires some familiarity with the Python programming language and object-oriented programming.
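To give a first impression of what selecting elements with XPath looks like, here is a small sketch using Python's standard `xml.etree.ElementTree` module. It rests on a few assumptions: the HTML fragment, the `members` table, and the names in it are hypothetical, the fragment is well-formed XML (which real web pages often are not), and ElementTree supports only a limited subset of XPath. The tools used in the lesson itself handle real-world HTML and richer XPath expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML fragment (ElementTree requires valid XML).
HTML = """
<html>
  <body>
    <table id="members">
      <tr><td class="name">Ada</td><td class="role">Librarian</td></tr>
      <tr><td class="name">Grace</td><td class="role">Archivist</td></tr>
    </table>
  </body>
</html>
""".strip()

root = ET.fromstring(HTML)

# Select every <td class="name"> inside the table whose id is "members".
name_cells = root.findall(".//table[@id='members']/tr/td[@class='name']")
names = [cell.text for cell in name_cells]

print(names)
```

The path expression reads from left to right: find a `table` anywhere in the document with the given `id`, then its `tr` children, then the `td` children with the class `name`.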

Software requirements

Refer to the Setup section to install the software required to follow along with this lesson.

Under development

Please note that the contents of this lesson are still being actively developed. Any feedback is appreciated; please do not hesitate to contact the author or contribute to the lesson by forking it on GitHub.

Schedule

	Setup	Download files required for the lesson
00:00	1. Introduction: What is web scraping?	What is web scraping and why is it useful? What are typical use cases for web scraping?
00:10	2. Selecting content on a web page with XPath	How can I select a specific element on a web page? What is XPath and how can I use it?
00:55	3. Manually scrape data using browser extensions	How can I get started scraping data off the web? How can I use XPath to more accurately select what data to scrape?
02:00	4. Web scraping using Python and Scrapy	How can scraping a website be automated? How can I set up a scraping project using the Scrapy framework for Python? How do I tell Scrapy what elements to scrape from a web page? How do I tell Scrapy to follow URLs and scrape their contents? What can I do with the data extracted with Scrapy?
04:00	5. Conclusion	When is web scraping OK and when is it not? Is web scraping legal? Can I get into trouble? How can I make sure I’m doing the right thing? What can I do with the data that I’ve scraped?
04:30	Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.