From a Spreadsheet to a Database

Bahlai’s Law says,

"Other people's data is always inconsistent and in the wrong format."

We open this lesson by taking data from three different data sources, one scraped by learners in the library carpentry webscraping lesson and joining them together to answer a research question.

We will then take such data, a bibliography embedded in a spreadsheet, and turn it into something usable.

Along the way, we will use all of the tools introduced so far to extract, reformat, and analyze information that would otherwise be difficult or impossible to work with.

Prerequisites

Learners should have completed introductory lessons on:

  • the Unix shell (head and tail, word count, sorting, and pipes)
  • Git (setting up a repository, committing files)
  • Python (libraries, loops, list indexing, string formatting)
  • SQL (creating tables, inserting data, joins)

Schedule

Setup Download files required for the lesson
00:00 1. Loading data directly into the database What is the difference between a spreadsheet and a table?
How do I bulk load data into sqlite3?
00:40 2. Combining datasets What is the essential thing needed to combine data from different sources?
How do we answer research questions with SQL and other peoples’ data?
01:20 3. Extracting Data What do we do with data we cannot quickly bulk load into a database?
How does python read csvs?
01:50 4. Transforming and Loading Data into a database How do we clean data using python?
How can we load that cleaned data into a database?
02:20 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.