This lesson is in the early stages of development (Alpha version)

Transforming and Loading Data into a database

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How do we clean data using python?

  • How can we load that cleaned data into a database?

Objectives
  • Create a csv of loadable data for our database

  • Load the transformed data into the database

  • Ask questions of that data

Normalization

Normalization is the process of splitting data up into multiple tables. If we were going to continue this process and make this a research database, we would take the entire set of incoming data and break it up into multiple tables.

To learn more about normalization, this tutorial is quite comprehensive. There are a few rules to remember:

If we have time after adding data, we will explore how to normalize this bibliography. For now, let us focus on the quick and dirty solution.

Transforming Data

We now have (key, author) pairs. The next step is to put them in a relational database so that we can start asking and answering questions. Our starting point is the final script from the previous topic:

# display-authors-2.py
# Print (key, author) pairs.

import sys
import csv

with open(sys.argv[1], 'r') as raw:
    reader = csv.reader(raw);
    for line in reader:
        key, authors = line[0], line[3]
        for auth in authors.split('; '): # semi-colon plus space instead of semi-colon
            print(key, auth)

whose output looks like this:

8SW85SQM McClelland, James L
85QV9X5F McClelland, J. L.
85QV9X5F McNaughton, B. L.
85QV9X5F O'Reilly, R. C.
Z4X6DT6N Ratcliff, R.
F5DGU3Q4 McCloskey, M.
F5DGU3Q4 Cohen, N. J.
PNGQMCP5 Buciluǎ, Cristian
PNGQMCP5 Caruana, Rich
PNGQMCP5 Niculescu-Mizil, Alexandru

What we want is data we can load with our prior commands. In order to do that, we need to write this back out as a different csv.

so let’s modify the program to output this instead:

# convert-1.py
# Output a csv we can load into the database.

import sys
import csv

output_rows=[]

with open(sys.argv[1], 'r') as raw:
    reader = csv.reader(raw);
    for line in reader:
        key, authors = line[0], line[3]
        for auth in authors.split('; '):
            output_rows.append([key, auth])

# We need a second argument for our output file
with open(sys.argv[2], 'w') as csvout:
    writer = csv.writer(csvout) #We are making a csv file from the second argument
    writer.writerow(["Key", "Author"]) #We need to make a header row
    writer.writerows(output_rows) #We need to dump the data we put into output_rows

print(len(output_rows)) #We need to know how many rows output_rows has so we can make sure they all make it into the database.

Let’s run the program:

$ python convert-1.py bibliography_data/bibliography.csv key_author.csv
$ head key_author.csv

This generates a beautiful new csv ready for importing:

Key,Author
8SW85SQM,"McClelland, James L"
85QV9X5F,"McClelland, J. L."
85QV9X5F,"McNaughton, B. L."
85QV9X5F,"O'Reilly, R. C."
Z4X6DT6N,"Ratcliff, R."
F5DGU3Q4,"McCloskey, M."
F5DGU3Q4,"Cohen, N. J."
PNGQMCP5,"Buciluǎ, Cristian"
PNGQMCP5,"Caruana, Rich"

And we know that there are 6587 key-author pairs here.

Now, we need to make a script like we did at the start of the lesson to load this data.

Call the file load_bibliography.sql

.mode csv
.import key_author.csv key_author

.header on
.mode column

SELECT *
  FROM key_author
 LIMIT 10;

SELECT count(*)
  FROM key_author;

And we run it in an in-memory database:

$ sqlite3
sqlite> .read load_bibliography.csv

We see:

Key         Author             
----------  -------------------
8SW85SQM    McClelland, James L
85QV9X5F    McClelland, J. L.  
85QV9X5F    McNaughton, B. L.  
85QV9X5F    O'Reilly, R. C.    
Z4X6DT6N    Ratcliff, R.       
F5DGU3Q4    McCloskey, M.      
F5DGU3Q4    Cohen, N. J.       
PNGQMCP5    Buciluǎ, Cristian  
PNGQMCP5    Caruana, Rich      
PNGQMCP5    Niculescu-Mizil, Al

count(*)  
----------
6587     

Cutting Corners

Python and other languages have libraries for interacting with databases, so why are we bulk loading data like this instead? The answer is that it’s a simple solution that uses all of the tools we have introduced so far. If we had to do anything more complicated with this data, we would almost certainly import sqlite3 and do things the right way.

That looks good, so let’s try asking some questions:

SELECT author, count(*) 
  FROM key_author 
 GROUP BY author 
 ORDER BY count(*) desc 
 LIMIT 10;
Author          count(*)  
--------------  ----------
Bengio, Yoshua  122       
Bengio, Y.      111       
Hinton, Geoffr  78        
LeCun, Yann     56        
Hinton, G. E.   45        
Salakhutdinov,  34        
LeCun, Y.       31        
Vincent, Pasca  29        
Jordan, M. I.   27        
Frasconi, P.    25 

The first thing we see is that our work has paid off: we can now find out who the most prolific authors are with a single command. The second thing we see is that we’re not done yet: “Bengio, Yoshua” and “Bengio, Y.” are almost certainly the same person, as are “LeCun, Yann” and “LeCun, Y.”. If we really want to know who has written the most papers, we’re going to have to reconcile different names for the same person.

Denormalization is the Root of Much Evil

Data is normalized if it contains no redundancy. (The full definition has more conditions, but this is good enough for our purposes.) Our data is denormalized because it represents particular authors in several different ways. We can try to normalize the data using heuristics such as, “If the surname matches and the initialized form of the other parts of the name match, it’s the same person.” Errors will always creep in, however: in our case, we have no way of knowing whether “Albar, M.” is Mohammd Albar or Michael Albar. Answers based on heuristically normalized data are therefore always approximations of reality. It is crucial in these cases that we record the heuristics we used and the transformations we made so that others (including our future selves) can check our work.

Instead of normalizing names and otherwise cleaning the data as part of our python processing step, let’s see what other questions we can answer.

Who has co-authored papers with whom?

SELECT a.author, b.author 
  FROM key_author a 
  JOIN key_author b USING(key) 
 WHERE a.author > b.author
 LIMIT 10;
Author             Author           
-----------------  -----------------
McNaughton, B. L.  McClelland, J. L.
O'Reilly, R. C.    McClelland, J. L.
O'Reilly, R. C.    McNaughton, B. L.
McCloskey, M.      Cohen, N. J.     
Caruana, Rich      Buciluǎ, Cristian
Niculescu-Mizil,   Buciluǎ, Cristian
Niculescu-Mizil,   Caruana, Rich    
Rigamonti, Robert  Fua, Pascal      
Rigamonti, Robert  Lepetit, Vincent 
Sironi, Amos       Fua, Pascal 

(We use a.author > b.author to ensure that each distinct pair of authors only shows up once.) What if we want to know how often different pairs of people have written together?

SELECT a.author, b.author, count(*)
  FROM key_author a 
  JOIN key_author b USING(key) 
 WHERE a.author > b.author
 GROUP BY a.author, b.author
 ORDER BY count(*) desc
limit 10;
Author           Author          count(*)  
---------------  --------------  ----------
Vincent, Pascal  Bengio, Yoshua  27        
Roux, Nicolas L  Bengio, Yoshua  20        
Delalleau, Oliv  Bengio, Yoshua  19        
Bengio, Y.       Bengio, S.      18        
Larochelle, Hug  Bengio, Yoshua  15        
Roux, Nicolas L  Delalleau, Oli  15        
Vincent, P.      Bengio, Y.      15        
Chapados, N.     Bengio, Y.      14        
Gori, M.         Frasconi, P.    14        
Salakhutdinov,   Hinton, Geoffr  14  

Again, we have a data normalization problem: the pair “Vincent, Pascal” and “Bengio, Yoshua” is almost certainly the same as the pair “Vincent, P.” and “Bengio, S.” But that problem would be there even if we were analyzing this data manually, and putting it in a database has made asking new questions so easy that we can do research we otherwise wouldn’t have been able to tackle. The final step is to commit our script to Git and celebrate with a nice cup of tea.

Doing Things the Right Way

Rewrite the Python program to use the sqlite3 library to create the database. And use python to create cleaner input data. What did you have to do to make the data consistent and legible?

Distinct Pairs

Explain why using a.author > b.author ensures that distinct pairs of authors only show up once.

Data cleansing

Write a function that takes two authors’ names as input and returns True if they are probably the same person and False if they do not. Compare your function to your neighbor’s: can you find a case where the two would disagree?

Key Points

  • Explain why inferences drawn from uncleaned data are always approximate.