Transforming and Loading Data into a Database
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How do we clean data using Python?
How can we load that cleaned data into a database?
Objectives
Create a CSV of loadable data for our database
Load the transformed data into the database
Ask questions of that data
Normalization
Normalization is the process of splitting data up into multiple tables. If we were going to continue this process and make this a research database, we would take the entire set of incoming data and break it up into multiple tables.
To learn more about normalization, this tutorial is quite comprehensive. There are a few rules to remember:
- No multivalued attributes!
- Every row must have a “key” which uniquely identifies that row
- Every attribute must relate only to the key.
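To make these rules concrete, here is a minimal sketch of what a normalized version of this bibliography might look like, using Python's built-in sqlite3 library. The table and column names are invented for illustration and are not part of this lesson's pipeline:
# normalization-sketch.py
# A hypothetical normalized layout: articles, authors, and a linking table,
# so that no row ever holds a multivalued "authors" attribute.
import sqlite3

connection = sqlite3.connect(':memory:')  # throwaway in-memory database
connection.executescript("""
CREATE TABLE article (
    key   TEXT PRIMARY KEY,          -- the key uniquely identifies each row
    title TEXT
);
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT
);
CREATE TABLE article_author (        -- one row per (article, author) pair
    key       TEXT REFERENCES article(key),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (key, author_id)
);
""")
connection.close()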
If we have time after adding data, we will explore how to normalize this bibliography. For now, let us focus on the quick and dirty solution.
Transforming Data
We now have (key, author) pairs. The next step is to put them in a relational database so that we can start asking and answering questions. Our starting point is the final script from the previous topic:
# display-authors-2.py
# Print (key, author) pairs.
import sys
import csv

with open(sys.argv[1], 'r') as raw:
    reader = csv.reader(raw)
    for line in reader:
        key, authors = line[0], line[3]
        for auth in authors.split('; '):  # split on semicolon plus space, not just semicolon
            print(key, auth)
whose output looks like this:
8SW85SQM McClelland, James L
85QV9X5F McClelland, J. L.
85QV9X5F McNaughton, B. L.
85QV9X5F O'Reilly, R. C.
Z4X6DT6N Ratcliff, R.
F5DGU3Q4 McCloskey, M.
F5DGU3Q4 Cohen, N. J.
PNGQMCP5 Buciluǎ, Cristian
PNGQMCP5 Caruana, Rich
PNGQMCP5 Niculescu-Mizil, Alexandru
What we want is data we can load with our prior commands. To do that, we need to write this back out as a different CSV, so let's modify the program to produce one:
# convert-1.py
# Output a CSV we can load into the database.
import sys
import csv

output_rows = []
with open(sys.argv[1], 'r') as raw:
    reader = csv.reader(raw)
    for line in reader:
        key, authors = line[0], line[3]
        for auth in authors.split('; '):
            output_rows.append([key, auth])

# We need a second argument for our output file.
with open(sys.argv[2], 'w') as csvout:
    writer = csv.writer(csvout)         # make a CSV writer from the second argument
    writer.writerow(["Key", "Author"])  # write a header row
    writer.writerows(output_rows)       # dump the rows we collected in output_rows
    print(len(output_rows))             # report how many rows we wrote, so we can check
                                        # that they all make it into the database
Let’s run the program:
$ python convert-1.py bibliography_data/bibliography.csv key_author.csv
$ head key_author.csv
This generates a beautiful new CSV ready for importing:
Key,Author
8SW85SQM,"McClelland, James L"
85QV9X5F,"McClelland, J. L."
85QV9X5F,"McNaughton, B. L."
85QV9X5F,"O'Reilly, R. C."
Z4X6DT6N,"Ratcliff, R."
F5DGU3Q4,"McCloskey, M."
F5DGU3Q4,"Cohen, N. J."
PNGQMCP5,"Buciluǎ, Cristian"
PNGQMCP5,"Caruana, Rich"
The program also prints the number of rows it wrote, so we know that there are 6587 key-author pairs here.
Now we need to make a script, like the one we wrote at the start of the lesson, to load this data. Call the file load_bibliography.sql:
.mode csv
.import key_author.csv key_author
.header on
.mode column
SELECT *
FROM key_author
LIMIT 10;
SELECT count(*)
FROM key_author;
And we run it in an in-memory database:
$ sqlite3
sqlite> .read load_bibliography.sql
We see:
Key Author
---------- -------------------
8SW85SQM McClelland, James L
85QV9X5F McClelland, J. L.
85QV9X5F McNaughton, B. L.
85QV9X5F O'Reilly, R. C.
Z4X6DT6N Ratcliff, R.
F5DGU3Q4 McCloskey, M.
F5DGU3Q4 Cohen, N. J.
PNGQMCP5 Buciluǎ, Cristian
PNGQMCP5 Caruana, Rich
PNGQMCP5 Niculescu-Mizil, Al
count(*)
----------
6587
Cutting Corners
Python and other languages have libraries for interacting with databases, so why are we bulk loading data like this instead? The answer is that it’s a simple solution that uses all of the tools we have introduced so far. If we had to do anything more complicated with this data, we would almost certainly
import sqlite3
and do things the right way.
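For reference, a minimal sketch of what that might look like, assuming we load the key_author.csv file we just made; the database filename below is invented for illustration:
# load-with-sqlite3.py
# A sketch of loading key_author.csv with Python's sqlite3 library instead of
# bulk loading in the sqlite3 shell.  The database filename is an assumption.
import csv
import sqlite3

connection = sqlite3.connect('bibliography.db')
cursor = connection.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS key_author (Key TEXT, Author TEXT)')

with open('key_author.csv', 'r') as csvin:
    reader = csv.reader(csvin)
    next(reader)  # skip the header row
    cursor.executemany('INSERT INTO key_author VALUES (?, ?)', reader)

connection.commit()
connection.close()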
The loaded table looks good, so let's try asking some questions:
SELECT author, count(*)
FROM key_author
GROUP BY author
ORDER BY count(*) desc
LIMIT 10;
Author count(*)
-------------- ----------
Bengio, Yoshua 122
Bengio, Y. 111
Hinton, Geoffr 78
LeCun, Yann 56
Hinton, G. E. 45
Salakhutdinov, 34
LeCun, Y. 31
Vincent, Pasca 29
Jordan, M. I. 27
Frasconi, P. 25
The first thing we see is that our work has paid off: we can now find out who the most prolific authors are with a single command. The second thing we see is that we’re not done yet: “Bengio, Yoshua” and “Bengio, Y.” are almost certainly the same person, as are “LeCun, Yann” and “LeCun, Y.”. If we really want to know who has written the most papers, we’re going to have to reconcile different names for the same person.
Denormalization is the Root of Much Evil
Data is normalized if it contains no redundancy. (The full definition has more conditions, but this is good enough for our purposes.) Our data is denormalized because it represents particular authors in several different ways. We can try to normalize the data using heuristics such as, “If the surname matches and the initialized form of the other parts of the name match, it’s the same person.” Errors will always creep in, however: in our case, we have no way of knowing whether “Albar, M.” is Mohammd Albar or Michael Albar. Answers based on heuristically normalized data are therefore always approximations of reality. It is crucial in these cases that we record the heuristics we used and the transformations we made so that others (including our future selves) can check our work.
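As a sketch of one such heuristic (not a definitive rule, and it will mis-handle exactly the “Albar, M.” case described above):
# same-person.py
# Guess whether two "Surname, Given names" strings refer to the same person:
# the surnames must match exactly and the initials of the remaining parts must match.
def same_person(name_a, name_b):
    surname_a, _, rest_a = name_a.partition(',')
    surname_b, _, rest_b = name_b.partition(',')
    if surname_a.strip() != surname_b.strip():
        return False
    initials_a = [part[0] for part in rest_a.split()]
    initials_b = [part[0] for part in rest_b.split()]
    return initials_a == initials_b

print(same_person('Bengio, Yoshua', 'Bengio, Y.'))  # True
print(same_person('LeCun, Yann', 'Hinton, G. E.'))  # False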
Instead of normalizing names and otherwise cleaning the data as part of our Python processing step, let's see what other questions we can answer.
Who has co-authored papers with whom?
SELECT a.author, b.author
FROM key_author a
JOIN key_author b USING(key)
WHERE a.author > b.author
LIMIT 10;
Author Author
----------------- -----------------
McNaughton, B. L. McClelland, J. L.
O'Reilly, R. C. McClelland, J. L.
O'Reilly, R. C. McNaughton, B. L.
McCloskey, M. Cohen, N. J.
Caruana, Rich Buciluǎ, Cristian
Niculescu-Mizil, Buciluǎ, Cristian
Niculescu-Mizil, Caruana, Rich
Rigamonti, Robert Fua, Pascal
Rigamonti, Robert Lepetit, Vincent
Sironi, Amos Fua, Pascal
(We use a.author > b.author to ensure that each distinct pair of authors only shows up once.)
What if we want to know how often different pairs of people have written together?
SELECT a.author, b.author, count(*)
FROM key_author a
JOIN key_author b USING(key)
WHERE a.author > b.author
GROUP BY a.author, b.author
ORDER BY count(*) desc
LIMIT 10;
Author Author count(*)
--------------- -------------- ----------
Vincent, Pascal Bengio, Yoshua 27
Roux, Nicolas L Bengio, Yoshua 20
Delalleau, Oliv Bengio, Yoshua 19
Bengio, Y. Bengio, S. 18
Larochelle, Hug Bengio, Yoshua 15
Roux, Nicolas L Delalleau, Oli 15
Vincent, P. Bengio, Y. 15
Chapados, N. Bengio, Y. 14
Gori, M. Frasconi, P. 14
Salakhutdinov, Hinton, Geoffr 14
Again, we have a data normalization problem: the pair “Vincent, Pascal” and “Bengio, Yoshua” is almost certainly the same as the pair “Vincent, P.” and “Bengio, Y.”. But that problem would be there even if we were analyzing this data manually, and putting it in a database has made asking new questions so easy that we can do research we otherwise wouldn’t have been able to tackle. The final step is to commit our script to Git and celebrate with a nice cup of tea.
Doing Things the Right Way
Rewrite the Python program to use the
sqlite3
library to create the database, and use Python to create cleaner input data. What did you have to do to make the data consistent and legible?
Distinct Pairs
Explain why using
a.author > b.author
ensures that distinct pairs of authors only show up once.
Data cleansing
Write a function that takes two authors’ names as input and returns
True
if they are probably the same person and
False
if they are not. Compare your function to your neighbor’s: can you find a case where the two would disagree?
Key Points
Inferences drawn from uncleaned data are always approximate.