Loading data directly into the database
Overview
Teaching: 30 min
Exercises: 10 minQuestions
What is the difference between a spreadsheet and a table?
How do I bulk load data into sqlite3?
Objectives
Explore data from different locations using a single database
Almost everything we can do in a database can be done with sufficiently complex spreadsheets. However, when operating on clean(ish) data, databases allow us to transform large amounts of data in consistent and reliable ways without needing to see the data we are manipulating.
None of the examples we are covering today is so large that it would cause a modern spreadsheet program to crash. With that said, the 57,254 rows in 2016Census_G01_AUS_SA1.csv
are balky and difficult to work with when we are scrolling around, creating named ranges, and working on multiple layers of comparison.
The single main benefit of SQL over a spreadsheet that we are covering today is the idea of a JOIN
. Spreadsheets can certainly perform join-like operations, but the steps are many, specific to each join, and are not reusable. Whereas with SQL, since every output of a query is itself a table which can be used to input into other queries, we not only can JOIN
multiple sets of data together, but we can do it with a single line of code.
The main drawbacks of SQL over a spreadsheet is that you have to write that code and that you have to have clean data. (Cue Monty Python Spanish Inquisition sketch here.) However, because each query can produce output used in subsequent queries, we don’t have to deal with that complexity all at the same time. Instead, we can gradually creep up on our desired output, making sure of our footing at each step.
We will end this first chapter by joining data from different data sources to answer a rather silly research question:
A totally not staged question
I wonder, what’s the demographic impact of each political party in Australia on Twitter. @DenubisX - gimme some numbers!
Research Question Operationalisation
How can we turn this question into something we can answer?
In research design, especially in psychology, social sciences, life sciences, and physics, operationalization is a process of defining the measurement of a phenomenon that is not directly measurable, though its existence is indicated by other phenomena.
We start with the question: “What can we measure?”
- We know which politicians are on twitter from the list of Australian members of parliament.
- We know what party those politicians are from, from the same list.
- We know that the 2016 census has demographic population per “Statistical Area”
- We know that the ABS publishes a dataset which names the regions alongside their statistical identifiers.
We can therefore ask something like: For members of parliment on twitter, what is the sum total of people in their district, grouped by party?
Granted, this question is entirely contrived, but hopefully plausible in the ability of linking datasets.
We now need to inspect the data files to see if these measurable things are present.
In this chapter we will be working with a number of different datasets:
- The scraped list of Australian politicians from the Library Carpentry Webscraping session.
- Excerpted Data from the 2016 Australian Census
- Geographic Electoral Division data from the Australian Bureau of Statistics
Download this zip file of all the data we will be using in this course.
Extract the zip file above. It should have the following structure:
$ ls -F aust_mp_twitter_data/ bibliography_data/
aust_mp_twitter_data/:
2016Census_G01_AUS_SA1.csv 20180628-austmpdata.csv CED_2017_AUST.csv
bibliography_data/:
bibliography.csv
If we characterise the 3 csv files in aust_mp_twitter_data/
using
$ head -2 aust_mp_twitter_data/*.csv
we see:
==> aust_mp_twitter_data/2016Census_G01_AUS_SA1.csv <==
SA1_7DIGITCODE_2016,Tot_P_M,Tot_P_F,Tot_P_P,Age_0_4_yr_M,Age_0_4_yr_F,Age_0_4_yr_P,(...)
1100701,140,116,256,6,7,8(...)
==> aust_mp_twitter_data/20180628-austmpdata.csv <==
district,link,name,party,phonenumber,twitter
"Warringah, New South Wales",https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=EZ5,Hon Tony Abbott MP,Liberal Party of Australia,(02) 9977 6411,http://twitter.com/TonyAbbottMHR
==> aust_mp_twitter_data/CED_2017_AUST.csv <==
"SA1_MAINCODE_2016","CED_CODE_2017","CED_NAME_2017","STATE_CODE_2017","STATE_NAME_2017","AREA_ALBERS_SQKM"
"11901135801","101","Banks","1","New South Wales",0.1146
We see the following columns of interest:
Tot_P_P
. Described by the Census Metadata as:G3 Tot_P_P Total_Persons_Persons G01 G01a Persons
- The “SA1 Maincode” columns, suggesting linking; and
- The the “Party” and “Twitter” columns of “our own” web-scraped data.
To a first approximation, we can answer Petra’s question!
Direct database csv loading
CSV, comma separated values, or “flat-file” files are one of the preferred means of exchanging data produced from spreadsheets and databases. When the data is produced directly from a computer, for a computer’s consumption, it is very easy to load. However, this process is fragile and when we are dealing with “denormalized” data, we will have to perform cleaning steps before the data is ready for use.
While we can enter queries directly into our database, this approach does not scale well when we need to make tweaks to complex queries or we need to write a series of queries to get the data we want. Therefore, we want to make our scripts in a text editor.
Let us start by exploring the sqlite3 interface:
$ sqlite3
You should see:
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>
In-memory databases
By running sqlite3 without a database name, we are creating an in-memory database. This database will evaporate as soon as we quit the program, which is very useful for initial explorations of data or temporary manipulations without needing to run commands to delete and remake whole database.
Let’s run that .help
command.
Here are the commands we are interested in right now:
command | explanation |
---|---|
.import FILE TABLE | Import data from FILE into TABLE |
.mode CSV | Sets the input and output modes to csv |
.separator COL | Change the column separator … for both the output mode and .import |
.tables | List names of tables |
.read FILENAME | Execute SQL in FILENAME |
.mode csv
The sqlite .help command says “output,” but if we forget to run it and look in the manual, we learn:
Note that it is important to set the "mode" to "csv" before running the ".import" command. This is necessary to prevent the command-line shell from trying to interpret the input file text as some other format.
Now, let us see if we can load some data. In sqlite:
sqlite> .mode csv
sqlite> .import aust_mp_twitter_data/20180628-austmpdata.csv austmpdata
Now let us look at the data to see if it’s viable for us to work with.
sqlite> .mode column
sqlite> .header on
sqlite> SELECT * FROM austmpdata LIMIT 5;
We should see something like
district link name party phonenumber twitter
-------------------------- -------------------------------------------------------------------- ------------------ -------------------------- -------------- --------------------------------
Warringah, New South Wales https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=EZ5 Hon Tony Abbott MP Liberal Party of Australia (02) 9977 6411 http://twitter.com/TonyAbbottMHR
Bennelong, New South Wales https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=M3M Mr John Alexander Liberal Party of Australia (02) 9869 4288 http://twitter.com/JohnAlexander
Grayndler, New South Wales https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=R36 Hon Anthony Albane Australian Labor Party (02) 9564 3588 http://twitter.com/AlboMP
Mallee, Victoria https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=303 Mr Andrew Broad MP The Nationals https://twitter.com/broad4mallee
McMahon, New South Wales https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=DZS Hon Chris Bowen MP Australian Labor Party (02) 9604 0710 http://twitter.com/Bowenchris
We can then verify the table by calling:
sqlite> .schema austmpdata
CREATE TABLE austmpdata(
"district" TEXT,
"link" TEXT,
"name" TEXT,
"party" TEXT,
"phonenumber" TEXT,
"twitter" TEXT
);
This output is a maximally generic table – but it is a table which holds our data. This is nothing we couldn’t do in a spreadsheet. Yet.
Now, however, let us try to interact with multiple tables at the same time.
Writing a script to load our research database
The first thing we need to do is to write a script to load the database.
Make a file called aust_mp.sql
.echo on
.mode csv
.import aust_mp_twitter_data/20180628-austmpdata.csv austmp
.import aust_mp_twitter_data/2016Census_G01_AUS_SA1.csv census
.import aust_mp_twitter_data/CED_2017_AUST.csv ced
.mode column
.header on
-- normally we'd want to select * here. But considering how many columns our dataset outputs, it makes for really awful output. Instead, we'll take the columns we care about instead.
SELECT SA1_7DIGITCODE_2016, Tot_P_M, Tot_P_F, Tot_P_P
FROM census
LIMIT 5;
SELECT district, party, twitter
FROM austmp
LIMIT 5;
SELECT *
FROM ced
LIMIT 5;
SELECT COUNT(*)
FROM austmp;
SELECT COUNT(*)
FROM census;
SELECT COUNT(*)
FROM ced;
Sanity checks
It is always important to have sanity checks when loading other peoples’ data. Running
count(*)
counts the number of rows imported (which can be verified against a row countwc -l
andSELECT * FROM table LIMIT 5
.)
Now, let’s run it.
sqlite> .read aust_mp.sql
You should see:
.mode csv
.import aust_mp_twitter_data/20180628-austmpdata.csv austmp
.import aust_mp_twitter_data/2016Census_G01_AUS_SA1.csv census
.import aust_mp_twitter_data/CED_2017_AUST.csv ced
.mode column
.header on
SELECT SA1_7DIGITCODE_2016, Tot_P_M, Tot_P_F, Tot_P_P
FROM census
LIMIT 5;
SA1_7DIGITCODE_2016 Tot_P_M Tot_P_F Tot_P_P
------------------- ---------- ---------- ----------
1100701 140 116 256
1100702 202 175 381
1100703 201 228 428
1100704 189 254 446
1100705 192 208 402
SELECT district, party, twitter
FROM austmp
LIMIT 5;
district party twitter
-------------------------- -------------------------- --------------------------------
Warringah, New South Wales Liberal Party of Australia http://twitter.com/TonyAbbottMHR
Bennelong, New South Wales Liberal Party of Australia http://twitter.com/JohnAlexander
Grayndler, New South Wales Australian Labor Party http://twitter.com/AlboMP
Mallee, Victoria The Nationals https://twitter.com/broad4mallee
McMahon, New South Wales Australian Labor Party http://twitter.com/Bowenchris
SELECT *
FROM ced
LIMIT 5;
SA1_MAINCODE_2016 CED_CODE_2017 CED_NAME_2017 STATE_CODE_2017 STATE_NAME_2017 AREA_ALBERS_SQKM
----------------- ------------- ------------- --------------- --------------- ----------------
11901135801 101 Banks 1 New South Wales 0.1146
11901135802 101 Banks 1 New South Wales 0.0722
11901135803 101 Banks 1 New South Wales 0.1597
11901135804 101 Banks 1 New South Wales 0.173
11901135805 101 Banks 1 New South Wales 0.108
SELECT COUNT(*)
FROM austmp;
COUNT(*)
----------
145
SELECT COUNT(*)
FROM census;
COUNT(*)
----------
57523
SELECT COUNT(*)
FROM ced;
COUNT(*)
----------
57523
Looking at the output above, there should be a moment of “Huh, that’s funny.” at the COUNT(*)
of the ced
table and the census
table. However, if we look at the SELECT *
above, we can indeed see different data. The reason that they have the same number of rows is that they’re both working off of the Australian Bureau of Statistics “Statistical Area 1” identifiers.
Exercise
Let’s practice some SQL. Write queries which answer the following questions:
- Using the
LIKE
operator, which MPs are listed as Doctors?- What is the
SUM()
area of each state?Hints:
- The wildcard operator is
%
forLIKE
.- We need to
GROUP BY
before we aggregate withSUM()
Solution
SELECT name FROM austmp WHERE name LIKE 'Dr%'; SELECT state_name_2017, SUM(AREA_ALBERS_SQKM) FROM ced GROUP BY state_name_2017;
Discussion
How is this approach different in your experience from working with a spreadsheet? What sort of data cleansing and verification work do you do with imported spreadsheet data? How would we translate that to this process?
Resources
Key Points
Most databases allow people to load data from csv
The default settings on data loading are OK for a quick look at the data