Transforming and Loading Data into a database
OverviewTeaching: 20 min
Exercises: 10 minQuestions
How do we clean data using python?
How can we load that cleaned data into a database?Objectives
Create a csv of loadable data for our database
Load the transformed data into the database
Ask questions of that data
Normalization is the process of splitting data up into multiple tables. If we were going to continue this process and make this a research database, we would take the entire set of incoming data and break it up into multiple tables.
To learn more about normalization, this tutorial is quite comprehensive. There are a few rules to remember:
- No multivalued attributes!
- Every row must have a “key” which uniquely identifies that row
- Every attribute must relate only to the key.
If we have time after adding data, we will explore how to normalize this bibliography. For now, let us focus on the quick and dirty solution.
We now have (key, author) pairs. The next step is to put them in a relational database so that we can start asking and answering questions. Our starting point is the final script from the previous topic:
# display-authors-2.py # Print (key, author) pairs. import sys import csv with open(sys.argv, 'r') as raw: reader = csv.reader(raw); for line in reader: key, authors = line, line for auth in authors.split('; '): # semi-colon plus space instead of semi-colon print(key, auth)
whose output looks like this:
8SW85SQM McClelland, James L 85QV9X5F McClelland, J. L. 85QV9X5F McNaughton, B. L. 85QV9X5F O'Reilly, R. C. Z4X6DT6N Ratcliff, R. F5DGU3Q4 McCloskey, M. F5DGU3Q4 Cohen, N. J. PNGQMCP5 Buciluǎ, Cristian PNGQMCP5 Caruana, Rich PNGQMCP5 Niculescu-Mizil, Alexandru
What we want is data we can load with our prior commands. In order to do that, we need to write this back out as a different csv.
so let’s modify the program to output this instead:
# convert-1.py # Output a csv we can load into the database. import sys import csv output_rows= with open(sys.argv, 'r') as raw: reader = csv.reader(raw); for line in reader: key, authors = line, line for auth in authors.split('; '): output_rows.append([key, auth]) # We need a second argument for our output file with open(sys.argv, 'w') as csvout: writer = csv.writer(csvout) #We are making a csv file from the second argument writer.writerow(["Key", "Author"]) #We need to make a header row writer.writerows(output_rows) #We need to dump the data we put into output_rows print(len(output_rows)) #We need to know how many rows output_rows has so we can make sure they all make it into the database.
Let’s run the program:
$ python convert-1.py bibliography_data/bibliography.csv key_author.csv $ head key_author.csv
This generates a beautiful new csv ready for importing:
Key,Author 8SW85SQM,"McClelland, James L" 85QV9X5F,"McClelland, J. L." 85QV9X5F,"McNaughton, B. L." 85QV9X5F,"O'Reilly, R. C." Z4X6DT6N,"Ratcliff, R." F5DGU3Q4,"McCloskey, M." F5DGU3Q4,"Cohen, N. J." PNGQMCP5,"Buciluǎ, Cristian" PNGQMCP5,"Caruana, Rich"
And we know that there are 6587 key-author pairs here.
Now, we need to make a script like we did at the start of the lesson to load this data.
Call the file
.mode csv .import key_author.csv key_author .header on .mode column SELECT * FROM key_author LIMIT 10; SELECT count(*) FROM key_author;
And we run it in an in-memory database:
$ sqlite3 sqlite> .read load_bibliography.csv
Key Author ---------- ------------------- 8SW85SQM McClelland, James L 85QV9X5F McClelland, J. L. 85QV9X5F McNaughton, B. L. 85QV9X5F O'Reilly, R. C. Z4X6DT6N Ratcliff, R. F5DGU3Q4 McCloskey, M. F5DGU3Q4 Cohen, N. J. PNGQMCP5 Buciluǎ, Cristian PNGQMCP5 Caruana, Rich PNGQMCP5 Niculescu-Mizil, Al count(*) ---------- 6587
Python and other languages have libraries for interacting with databases, so why are we bulk loading data like this instead? The answer is that it’s a simple solution that uses all of the tools we have introduced so far. If we had to do anything more complicated with this data, we would almost certainly
import sqlite3and do things the right way.
That looks good, so let’s try asking some questions:
SELECT author, count(*) FROM key_author GROUP BY author ORDER BY count(*) desc LIMIT 10;
Author count(*) -------------- ---------- Bengio, Yoshua 122 Bengio, Y. 111 Hinton, Geoffr 78 LeCun, Yann 56 Hinton, G. E. 45 Salakhutdinov, 34 LeCun, Y. 31 Vincent, Pasca 29 Jordan, M. I. 27 Frasconi, P. 25
The first thing we see is that our work has paid off: we can now find out who the most prolific authors are with a single command. The second thing we see is that we’re not done yet: “Bengio, Yoshua” and “Bengio, Y.” are almost certainly the same person, as are “LeCun, Yann” and “LeCun, Y.”. If we really want to know who has written the most papers, we’re going to have to reconcile different names for the same person.
Denormalization is the Root of Much Evil
Data is normalized if it contains no redundancy. (The full definition has more conditions, but this is good enough for our purposes.) Our data is denormalized because it represents particular authors in several different ways. We can try to normalize the data using heuristics such as, “If the surname matches and the initialized form of the other parts of the name match, it’s the same person.” Errors will always creep in, however: in our case, we have no way of knowing whether “Albar, M.” is Mohammd Albar or Michael Albar. Answers based on heuristically normalized data are therefore always approximations of reality. It is crucial in these cases that we record the heuristics we used and the transformations we made so that others (including our future selves) can check our work.
Instead of normalizing names and otherwise cleaning the data as part of our python processing step, let’s see what other questions we can answer.
Who has co-authored papers with whom?
SELECT a.author, b.author FROM key_author a JOIN key_author b USING(key) WHERE a.author > b.author LIMIT 10;
Author Author ----------------- ----------------- McNaughton, B. L. McClelland, J. L. O'Reilly, R. C. McClelland, J. L. O'Reilly, R. C. McNaughton, B. L. McCloskey, M. Cohen, N. J. Caruana, Rich Buciluǎ, Cristian Niculescu-Mizil, Buciluǎ, Cristian Niculescu-Mizil, Caruana, Rich Rigamonti, Robert Fua, Pascal Rigamonti, Robert Lepetit, Vincent Sironi, Amos Fua, Pascal
a.author > b.author to ensure that
each distinct pair of authors only shows up once.)
What if we want to know how often different pairs of people have written together?
SELECT a.author, b.author, count(*) FROM key_author a JOIN key_author b USING(key) WHERE a.author > b.author GROUP BY a.author, b.author ORDER BY count(*) desc limit 10;
Author Author count(*) --------------- -------------- ---------- Vincent, Pascal Bengio, Yoshua 27 Roux, Nicolas L Bengio, Yoshua 20 Delalleau, Oliv Bengio, Yoshua 19 Bengio, Y. Bengio, S. 18 Larochelle, Hug Bengio, Yoshua 15 Roux, Nicolas L Delalleau, Oli 15 Vincent, P. Bengio, Y. 15 Chapados, N. Bengio, Y. 14 Gori, M. Frasconi, P. 14 Salakhutdinov, Hinton, Geoffr 14
Again, we have a data normalization problem: the pair “Vincent, Pascal” and “Bengio, Yoshua” is almost certainly the same as the pair “Vincent, P.” and “Bengio, S.” But that problem would be there even if we were analyzing this data manually, and putting it in a database has made asking new questions so easy that we can do research we otherwise wouldn’t have been able to tackle. The final step is to commit our script to Git and celebrate with a nice cup of tea.
Doing Things the Right Way
Rewrite the Python program to use the
sqlite3library to create the database. And use python to create cleaner input data. What did you have to do to make the data consistent and legible?
Explain why using
a.author > b.authorensures that distinct pairs of authors only show up once.
Write a function that takes two authors’ names as input and returns
Trueif they are probably the same person and
Falseif they do not. Compare your function to your neighbor’s: can you find a case where the two would disagree?
Explain why inferences drawn from uncleaned data are always approximate.