Databases using SQL
Overview
Teaching: 60 min
Exercises: 5 minQuestions
What is a relational database and why should I use it?
What is SQL?
Objectives
Understand the benefits of using a relational database
Set up a small database from csv files using SQLite
Understand SQLite data types
Setup
Note: this should have been done by participants before the start of the workshop.
We use DB Browser for SQLite and the
EEBO/TCP Database throughout this lesson. See Setup for instructions on how to download the data, and also how to install and open DB Browser for SQLite.
Motivation
To start, let’s orient ourselves in our project workflow. Previously, we used Excel and OpenRefine to go from messy, human created data to cleaned, computer-readable data. Now we’re going to move to the next piece of the data workflow, using the computer to read in our data, and then use it for analysis and visualization.
Dataset Description
The data we will be using is a time-series for the texts collected and encoded by the Text Creation Partnership. This is a real dataset.
Questions
First, let’s download and look at some of the cleaned spreadsheets
from the
Humanities dataset.
We’ll need the following four files:
authors.csv
titles.csv
dates.csv
pages.csv
Challenge
Open each of these csv files and explore them. What information is contained in each file? Specifically, if I had the following research questions:
- How has the length and dates of Martin Luther title attributions changed over time?
- What is the average number of each titles, per year?
- What information can I learn about Martin Luther species in the 1500s, over time?
What would I need to answer these questions? Which files have the data I need? What operations would I need to perform if I were doing these analyses by hand?
Goals
In order to answer the questions described above, we’ll need to do the following basic data operations:
- select subsets of the data (rows and columns)
- group subsets of data
- do math and other calculations
- combine data across spreadsheets
In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.
In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.
Putting our data into a relational database and using SQL will help us achieve these goals.
Definition: Relational Database
A relational database stores data in relations made up of records with fields. The relations are usually represented as tables; each record is usually shown as a row, and the fields as columns. In most cases, each record will have a unique identifier, called a key, which is stored as one of its fields. Records may also contain keys that refer to records in other tables, which enables us to combine information from two or more sources.
Databases
Why use relational databases
Using a relational database serves several purposes.
- It keeps your data separate from your analysis.
- This means there’s no risk of accidentally changing data when you analyze it.
- If we get new data we can just rerun the query.
- It’s fast, even for large amounts of data.
- It improves quality control of data entry (type constraints and use of forms in MS Access, Filemaker, Oracle Application Express etc.)
- The concepts of relational database querying are core to understanding how to do similar things using programming languages such as R or Python.
Database Management Systems
There are a number of different database management systems for working with relational data. We’re going to use SQLite today, but basically everything we teach you will apply to the other database systems as well (e.g. MySQL, PostgreSQL, MS Access, MS SQL Server, Oracle Database and Filemaker Pro). The only things that will differ are the details of exactly how to import and export data and the details of data types.
Relational databases
Let’s look at a pre-existing database, the eebo.db
file from the Humanities dataset that we downloaded during
Setup. Clicking on the “open file” icon, then
find that file and clicking on it will open the database.
You can see the tables in the database by looking at the left hand side of the
screen under Tables, where each table corresponds to one of the csv
files
we were exploring earlier. To see the contents of any table, click on it, and
then click the “Browse and Search” tab in the right panel. This will
give us a view that we’re used to - just a copy of the table. Hopefully this
helps to show that a database is, in some sense, just a collection of tables,
where there’s some value in the tables that allows them to be connected to each
other (the “related” part of “relational database”).
The leftmost tab, “Structure”, provides some metadata about each table. It
describes the columns, often called fields. (The rows of a database table
are called records.) If you scroll down in the Structure view, you’ll
see a list of fields, their labels, and their data type. Each field contains
one variety or type of data, often numbers or text. You can see in the
dates
table that most fields contain numbers (integers) while the titles
table is nearly all text.
The “Execute SQL” tab is blank now - this is where we’ll be typing our queries to retrieve information from the database tables.
To summarize:
- Relational databases store data in tables with fields (columns) and records (rows)
- Data in tables has types, and all values in a field have the same type (list of data types)
- Queries let us look up data or make calculations based on columns
Database Design
- Every row-column combination contains a single atomic value, i.e., not containing parts we might want to work with separately.
- One field per type of information
- No redundant information
- Split into separate tables with one table per class of information
- Needs an identifier in common between tables – shared column - to reconnect (known as a foreign key).
Import
Before we get started with writing our own queries, we’ll create our own
database. We’ll be creating this database from the three csv
files
we downloaded earlier. Close the currently open database and then
follow these instructions:
- Start a New Database
- New Database
- Give a name Save As. Creates the database in the opened folder
- Press Cancel
- Select File -> Import -> Table from CSV File.
- Select the
authors.csv
file to import - Give the table a name that matches the file name (
authors
), or use the default - If the first row has column headings, check the appropriate box
- Make sure the delimiter and quotation options are appropriate for the CSV files.
- Press OK
Challenge
- Import the
places
,pages
anddates
tables
You can also use this same approach to append new data to an existing table.
Adding data to existing tables
- “File” tab -> Import -> Table from CSV file
- Enter data into a csv file and append
Data types
Data type | Description |
---|---|
CHARACTER(n) | Character string. Fixed-length n |
VARCHAR(n) or CHARACTER VARYING(n) | Character string. Variable length. Maximum length n |
BINARY(n) | Binary string. Fixed-length n |
BOOLEAN | Stores TRUE or FALSE values |
VARBINARY(n) or BINARY VARYING(n) | Binary string. Variable length. Maximum length n |
INTEGER(p) | Integer numerical (no decimal). |
SMALLINT | Integer numerical (no decimal). |
INTEGER | Integer numerical (no decimal). |
BIGINT | Integer numerical (no decimal). |
DECIMAL(p,s) | Exact numerical, precision p, scale s. |
NUMERIC(p,s) | Exact numerical, precision p, scale s. (Same as DECIMAL) |
FLOAT(p) | Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation. |
REAL | Approximate numerical |
FLOAT | Approximate numerical |
DOUBLE PRECISION | Approximate numerical |
DATE | Stores year, month, and day values |
TIME | Stores hour, minute, and second values |
TIMESTAMP | Stores year, month, day, hour, minute, and second values |
INTERVAL | Composed of a number of integer fields, representing a period of time, depending on the type of interval |
ARRAY | A set-length and ordered collection of elements |
MULTISET | A variable-length and unordered collection of elements |
XML | Stores XML data |
SQL Data Type Quick Reference
Different databases offer different choices for the data type definition.
The following table shows some of the common names of data types between the various database platforms:
Data type | Access | SQLServer | Oracle | MySQL | PostgreSQL |
---|---|---|---|---|---|
boolean | Yes/No | Bit | Byte | N/A | Boolean |
integer | Number (integer) | Int | Number | Int / Integer | Int / Integer |
float | Number (single) | Float / Real | Number | Float | Numeric |
currency | Currency | Money | N/A | N/A | Money |
string (fixed) | N/A | Char | Char | Char | Char |
string (variable) | Text (<256) / Memo (65k+) | Varchar | Varchar2 | Varchar | Varchar |
binary object OLE Object Memo Binary (fixed up to 8K) | Varbinary (<8K) | Image (<2GB) Long | Raw Blob | Text Binary | Varbinary |
Key Points
SQL allows us to select and group subsets of data, do math and other calculations, and combine data.
A relational database is made up of tables which are related to each other by shared keys.
Different database management systems (DBMS) use slightly different vocabulary, but they are all based on the same ideas.
Basic Queries
Overview
Teaching: 30 min
Exercises: 5 minQuestions
How do I write a basic query in SQL?
Objectives
Write and build queries.
Filter data given various criteria.
Sort the results of a query.
Writing my first query
Let’s start by using the eebo table. Here we have data on every title that is included in the catalogue.
Let’s write an SQL query that selects only the year column from the surveys table. SQL queries can be written in the box located under the “Execute SQL” tab. Click ‘Run SQL’ to execute the query in the box.
SELECT Title
FROM eebo;
We have capitalized the words SELECT and FROM because they are SQL keywords. SQL is case insensitive, but it helps for readability, and is good style.
If we want more information, we can just add a new column to the list of fields, right after SELECT:
SELECT Title, TCP
FROM eebo;
Or we can select all of the columns in a table using the wildcard *
SELECT *
FROM eebo;
Limiting results
Sometimes you don’t want to see all the results you just want to get a sense of of what’s being returned. In that case you can use the LIMIT command. In particular you would want to do this if you were working with large databases.
SELECT *
FROM eebo
LIMIT 10;
Unique values
If we want only the unique values so that we can quickly see what authors have
been cataloged we use DISTINCT
SELECT DISTINCT Date
FROM eebo;
If we select more than one column, then the distinct pairs of values are returned
SELECT DISTINCT Date, Title
FROM eebo;
Calculated values
We can also do calculations with the values in a query. For example, if we wanted to look at the number of pages associated with different dates, but we needed to know how many tens of pages we would use
SELECT Date, PageCount/10
FROM eebo;
When we run the query, the expression pages / 10
is evaluated for each
row and appended to that row, in a new column. If we used the INTEGER
data type
for the pages field then integer division would have been done, to obtain the
correct results in that case divide by 10.0
. Expressions can use any fields,
any arithmetic operators (+
, -
, *
, and /
) and a variety of built-in
functions. For example, we could round the values to make them easier to read.
SELECT TCP, title, ROUND(pages / 10, 2)
FROM eebo;
Challenge
- Write a query that returns the year, EEBO and page length
Filtering
Databases can also filter data – selecting only the data meeting certain
criteria. For example, let’s say we only want data for the titles that
have a free status. We need to add a WHERE
clause to our query:
SELECT *
FROM eebo
WHERE status='Free';
We can do the same thing with numbers. Here, we only want the data since 1640:
SELECT *
FROM eebo
WHERE date >= '1600';
If we used the TEXT
data type for the year the WHERE
clause should
be Date >= '1600'
. We can use more sophisticated conditions by combining tests
with AND
and OR
. For example, suppose we want the data on Free status
starting in the year 1600:
SELECT *
FROM catalogue
WHERE (date >= '1600') AND (status = 'Free');
Note that the parentheses are not needed, but again, they help with
readability. They also ensure that the computer combines AND
and OR
in the way that we intend.
Challenge
- Produce a table listing the data for all titles in the catalogue with a page length more than 75, telling us the date, TCP id code, and page.
Building more complex queries
Now, lets combine the above queries to get data for 2 authors from the year 1580 on. This time, let’s use IN as one way to make the query easier to understand. It is equivalent to saying `WHERE (author = ‘Aylett, Robert, 1583-1655?’) OR (author = ‘Bacon, Francis, 1561-1626.’), but reads more neatly:
SELECT *
FROM eebo
WHERE (date >= '1580') AND (author IN ('Aylett, Robert, 1583-1655?', 'Bacon, Francis, 1561-1626.'));
We started with something simple, then added more clauses one by one, testing their effects as we went along. For complex queries, this is a good strategy, to make sure you are getting what you want. Sometimes it might help to take a subset of the data that you can easily see in a temporary database to practice your queries on before working on a larger or more complicated database.
When the queries become more complex, it can be useful to add comments. In SQL,
comments are started by --
, and end at the end of the line. For example, a
commented version of the above query can be written as:
-- Get post 1580 data on authors
-- These are in the catalogue table, and we are interested in all columns
SELECT * FROM eebo
-- Sampling year is in the column `Date`, and we want to include after 1580
WHERE (date >= '1580')
-- Author names
AND (author IN ('Aylett, Robert, 1583-1655?', 'Bacon, Francis, 1561-1626.'));
Although SQL queries often read like plain English, it is always useful to add comments; this is especially true of more complex queries.
Sorting
We can also sort the results of our queries by using ORDER BY
.
For simplicity, let’s go back to the catalogue table and order by publication date.
First, let’s look at what’s in the catalogue table. It’s a table of the eebo catalogue id and the information for each id.
SELECT *
FROM eebo;
Now let’s order it by date.
SELECT *
FROM eebo
ORDER BY date ASC;
The keyword ASC
tells us to order it in Ascending order.
We could alternately use DESC
to get descending order.
SELECT *
FROM eebo
ORDER BY date DESC;
ASC
is the default.
We can also sort on several fields at once. To truly be alphabetical, we might want to order by date then author.
SELECT *
FROM eebo
ORDER BY Date ASC, Author ASC;
Challenge
- Write a query that returns year, TCP id, and page from the catalogue table, sorted with the largest page lengths at the top.
Order of execution
Another note for ordering. We don’t actually have to display a column to sort by it. For example, let’s say we want to order an author by the EEBO index, but we only want to see genus and species.
SELECT Title, Terms
FROM eebo
WHERE author = 'Bacon, Francis, 1561-1626.'
ORDER BY TCP ASC;
We can do this because sorting occurs earlier in the computational pipeline than field selection.
The computer is basically doing this:
- Filtering rows according to WHERE
- Sorting results according to ORDER BY
- Displaying requested columns or expressions.
Clauses are written in a fixed order: SELECT
, FROM
, WHERE
, then ORDER
BY
. It is possible to write a query as a single line, but for readability,
we recommend to put each clause on its own line.
Challenge
- Let’s try to combine what we’ve learned so far in a single query. Using the catalogue table write a query to display the title, terms field and the page length (rounded to two decimal places), for titles published in 1550, ordered alphabetically by the author.
- Write the query as a single line, then put each clause on its own line, and see how more legible the query becomes!
Key Points
It is useful to apply conventions when writing SQL queries to aid readability.
Use logical connectors such as AND or OR to create more complex queries.
Calculations using mathematical symbols can also be performed on SQL queries.
Adding comments in SQL helps keep complex queries understandable.
SQL Aggregation
Overview
Teaching: 30 min
Exercises: 5 minQuestions
How can I summarize my data by aggregating, filtering, or ordering query results?
Objectives
Apply aggregation to group records in SQL.
Filter and order results of a query based on aggregate functions.
Save a query to make a new table.
Apply filters to find missing values in SQL.
COUNT and GROUP BY
Aggregation allows us to combine results by grouping records based on value and calculating combined values in groups.
Let’s go to the eebo table and find out how many individual titles there are. Using the wildcard simply counts the number of records (rows):
SELECT COUNT(*)
FROM eebo;
We can also find out the total page length:
SELECT COUNT(*), SUM(PageCount)
FROM eebo;
There are many other aggregate functions included in SQL including
MAX
, MIN
, and AVG
.
Challenge
Write a query that returns: total page length, average page length, and the min and max page lengths for all titles encoded over the duration of the project. Can you modify it so that it outputs these values only for page lengths between 1 and 10?
Now, let’s see how many individuals were counted in each species. We do this
using a GROUP BY
clause
SELECT TCP, COUNT(*)
FROM eebo
GROUP BY Status;
GROUP BY
tells SQL what field or fields we want to use to aggregate the data.
If we want to group by multiple fields, we give GROUP BY
a comma separated list.
Challenge
Write queries that return:
- How many groups of terms were created in each year
- in total
- per author
- Average number of each term groupings in each year.
Can you modify the above queries combining them into one?
The HAVING
keyword
In the previous lesson, we have seen the keywords WHERE
, allowing to
filter the results according to some criteria. SQL offers a mechanism to
filter the results based on aggregate functions, through the HAVING
keyword.
For example, we can adapt the last request we wrote to only return information about page length with a count higher than 100:
SELECT TCP, PageCount
FROM eebo
GROUP BY PageCount
HAVING PageCount > 10;
The HAVING
keyword works exactly like the WHERE
keyword, but uses
aggregate functions instead of database fields.
If you use AS
in your query to rename a column, HAVING
can use this
information to make the query more readable. For example, in the above
query, we can call the COUNT(pages)
by another name, like
page
. This can be written this way:
SELECT TCP, author, PageCount AS page
FROM eebo
GROUP BY TCP
HAVING PageCount > 10;
Note that in both queries, HAVING
comes after GROUP BY
. One way to
think about this is: the data are retrieved (SELECT
), can be filtered
(WHERE
), then joined in groups (GROUP BY
); finally, we only select some
of these groups (HAVING
).
Challenge
Write a query that returns, from the
authors
table, theeebo
IDs in eachauthors
, only for theauthors
with more than 5 works.
Ordering Aggregated Results
We can order the results of our aggregation by a specific column, including the aggregated column. Let’s count the number of individuals of each species captured, ordered by the count:
SELECT author, COUNT(*)
FROM eebo
GROUP BY author
ORDER BY COUNT(author);
Saving Queries for Future Use
It is not uncommon to repeat the same operation more than once, for example for monitoring or reporting purposes. SQL comes with a very powerful mechanism to do this: views. Views are a form of query that is saved in the database, and can be used to look at, filter, and even update information. One way to think of views is as a table, that can read, aggregate, and filter information from several places before showing it to you.
Creating a view from a query requires to add CREATE VIEW viewname AS
before the query itself. For example, imagine that my project only covers
the data gathered of books published between 1642 - 1651. That
query would look like:
SELECT *
FROM eebo
WHERE (date > '1558' AND date < '1603');
But we don’t want to have to type that every time we want to ask a question about that particular subset of data. Let’s create a view:
CREATE VIEW elizabethan AS
SELECT *
FROM eebo
WHERE (date > '1558' AND date < '1603');
You can also add a view using Create View in the View menu and see the results in the Views tab just like a table.
Now, we will be able to access these results with a much shorter notation:
SELECT *
FROM elizabethan;
There should only be 51 records. If you look at the PageCount
column, it’s
easy to see what the average page length would be. If we use SQL to find the
average page count of books that are available, SQL behaves like we would hope,
ignoring the NULL values:
SELECT AVG(PageCount)
FROM elizabethan
WHERE author != '';
But if we try to be extra clever, and find the average ourselves, we might get tripped up:
SELECT SUM(PageCount), COUNT(*), SUM(PageCount)/COUNT(*)
FROM elizabethan
WHERE author != '';
Here the COUNT
command includes all 51 records (even those with NULL
values), but the SUM
only includes the 49 records with data in the
weight
field, giving us an incorrect average. However,
our strategy will work if we modify the count command slightly:
SELECT SUM(PageCount), COUNT(PageCount), SUM(PageCount)/COUNT(PageCount)
FROM elizabethan
WHERE author != '';
When we count the pages field specifically, SQL ignores the records with data
missing in that field. So here is one example where NULLs can be tricky:
COUNT(*)
and COUNT(field)
can return different values.
Another case is when we use a “negative” query. Let’s count all the non-free titles:
SELECT COUNT(*)
FROM elizabethan
WHERE Author != '';
Now let’s count all the titles wth no author listed:
SELECT COUNT(*)
FROM elizabethan
WHERE Author != '';
But if we compare those two numbers with the total:
SELECT COUNT(*)
FROM elizabethan;
We’ll see that they don’t add up to the total! That’s because SQL doesn’t automatically include NULL values in a negative conditional statement. So if we are quering “not x”, then SQL divides our data into three categories: ‘x’, ‘not NULL, not x’ and NULL and returns the ‘not NULL, not x’ group. Sometimes this may be what we want - but sometimes we may want the missing values included as well! In that case, we’d need to change our query to:
SELECT COUNT(*)
FROM eebo
WHERE author != '' OR author IS NULL;
There is one more subtlety we need to be aware of. Suppose we run this query:
SELECT COUNT(*), PageCount
FROM elizabethen;
Key Points
Use the
GROUP BY
keyword to aggregate data.Functions like
MIN
,MAX
,AVERAGE
,SUM
,COUNT
, etc. operate on aggregated data.Use the
HAVING
keyword to filter on aggregate properties.Use a
VIEW
to access the result of a query as though it was a new table.
Joins and aliases
Overview
Teaching: 30 min
Exercises: 5 minQuestions
How do I bring data together from separate tables?
How can I make sure column names from my queries make sense and aren’t too long?
Objectives
Employ joins to combine data from two tables.
Apply functions to manipulate individual values.
Employ aliases to assign new names to items in a query.
Joins
To combine data from two tables we use the SQL JOIN
command, which comes after
the FROM
command.
The JOIN
command on its own will result in a cross product, where each row in
first table is paired with each row in the second table. Usually this is not
what is desired when combining two tables with data that is related in some way.
For that, we need to tell the computer which columns provide the link between the two
tables using the word ON
. What we want is to join the data with the same
species codes.
SELECT *
FROM authors
JOIN dates
ON authors.TCP = dates.TCP;
ON
is like WHERE
, it filters things out according to a test condition. We use
the table.colname
format to tell the manager what column in which table we are
referring to.
The output of the JOIN
command will have columns from first table plus the
columns from the second table. For the above command, the output will be a table
that has the following column names:
TCP | Authors | TCP | Date | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
… |
Alternatively, we can use the word USING
, as a short-hand. In this case we are
telling the manager that we want to combine authors
with titles
and that
the common column is eebo
.
SELECT *
FROM authors
JOIN dates
USING (TCP);
The output will only have one TCP column
TCP | Authors | Date | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
… |
We often won’t want all of the fields from both tables, so anywhere we would
have used a field name in a non-join query, we can use table.colname
.
For example, what if we wanted information authors and titles but not their TCP ids.
SELECT authors.Author, dates.Date
FROM dates
JOIN authors
ON dates.TCP = authors.TCP;
Authors | Title | |||
---|---|---|---|---|
… |
Many databases, including SQLite, also support a join through the WHERE clause of a query.
For example, you may see the query above written without an explicit JOIN.
SELECT dates.Date, authors.Author
FROM authors, dates
WHERE authors.TCP = dates.TCP;
For the remainder of this lesson, we’ll stick with the explicit use of the JOIN keyword for joining tables in SQL.
Challenge:
- Write a query that returns the authors and places of every TCP ID captured in the catalogue.
Different join types
We can count the number of records returned by our original join query.
SELECT COUNT(*)
FROM authors
JOIN places
USING (TCP);
Notice that this number is smaller than the number of records present in the catalogue data.
SELECT COUNT(*)
FROM eebo;
This is because, by default, SQL only returns records where the joining value
is present in the join columns of both tables (i.e. it takes the intersection
of the two join columns). This joining behaviour is known as an INNER JOIN
.
In fact the JOIN
command is simply shorthand for INNER JOIN
and the two
terms can be used interchangably as they will produce the same result.
We can also tell the computer that we wish to keep all the records in the first
table by using the command LEFT OUTER JOIN
, or LEFT JOIN
for short.
Challenge:
- Re-write the original query to keep all the entries present in the
eebo
table. How many records are returned by this query?
Challenge:
- Count the number of records in the
places
table that have aNULL
value in theeebo
column.
In SQL a NULL
value in one table can never be joined to a NULL value in a
second table because NULL
is not equal to anything, even itself.
Combining joins with sorting and aggregation
Joins can be combined with sorting, filtering, and aggregation. So, if we wanted the average number of pages for each author in the catalogue, we could do something like
SELECT authors.author, AVG(eebo.PageCount)
FROM eebo
JOIN authors
ON authors.TCP = eebo.TCP
GROUP BY eebo.PageCount;
Challenge:
- Write a query that returns the number of authors of the titles published in each year in descending order.
Challenge:
- Write a query that finds the average pages of each year of publication.
Functions
SQL includes numerous functions for manipulating data. You’ve already seen some
of these being used for aggregation (SUM
and COUNT
) but there are functions
that operate on individual values as well. Probably the most important of these
are IFNULL
and NULLIF
. IFNULL
allows us to specify a value to use in
place of NULL
.
We can represent unknown ids with “U” instead of NULL
:
SELECT TCP, Place, IFNULL(Place, 'U')
FROM places;
The lone “Place” column is only included in the query above to illustrate where
IFNULL
has changed values; this isn’t a usage requirement.
Challenge:
- Write a query that returns ‘NP’ instead of
NULL
for values in theAuthor
column.
Challenge:
- Write a query that calculates the average page length of each title, assuming that unknown lengths are 30 (as above).
IFNULL
can be particularly useful in JOIN
. When joining the authors
and
dates
tables earlier, some results were excluded because the ``
was NULL
. We can use IFNULL
to include them again, re-writing the NULL
to
a valid joining value:
SELECT eebo.Place, places.Place
FROM places
JOIN eebo
ON eebo.Place = IFNULL(places.Place, 'AB');
Challenge:
- Write a query that returns the number of titles of the authors caught in each plot, using
IFNULL
to assume that unknown titles are all of the authors “Nemo”.
The inverse of IFNULL
is NULLIF
. This returns NULL
if the first argument
is equal to the second argument. If the two are not equal, the first argument
is returned. This is useful for “nulling out” specific values.
We can “null out” vid:
SELECT TCP, Place, NULLIF(Place, 'London')
FROM eebo;
Some more functions which are common to SQL databases are listed in the table below:
Function | Description |
---|---|
ABS(n) |
Returns the absolute (positive) value of the numeric expression n |
LENGTH(s) |
Returns the length of the string expression s |
LOWER(s) |
Returns the string expression s converted to lowercase |
NULLIF(x, y) |
Returns NULL if x is equal to y, otherwise returns x |
ROUND(n) or ROUND(n, x) |
Returns the numeric expression n rounded to x digits after the decimal point (0 by default) |
TRIM(s) |
Returns the string expression s without leading and trailing whitespace characters |
UPPER(s) |
Returns the string expression s converted to uppercase |
Finally, some useful functions which are particular to SQLite are listed in the table below:
Function | Description |
---|---|
IFNULL(x, y) |
Returns x if it is non-NULL, otherwise returns y |
RANDOM() |
Returns a random integer between -9223372036854775808 and +9223372036854775807. |
REPLACE(s, f, r) |
Returns the string expression s in which every occurrence of f has been replaced with r |
SUBSTR(s, x, y) or SUBSTR(s, x) |
Returns the portion of the string expression s starting at the character position x (leftmost position is 1), y characters long (or to the end of s if y is omitted) |
Challenge:
Write a query that returns author names, sorted from longest titles name down to shortest.
Aliases
As queries get more complex names can get long and unwieldy (as we saw before). To help make things clearer we can use aliases to assign new names to things in the query.
We can alias both table names:
SELECT dt.Date, auth.Author
FROM dates AS dt
JOIN authors AS auth
ON dt.TCP = auth.TCP;
And column names:
SELECT dt.Date AS yr, auth.Author AS author
FROM dates AS dt
JOIN authors AS auth
ON dt.TCP = auth.TCP;
The AS
isn’t technically required, so you could do
SELECT dt.Date yr
FROM dates dt;
but using AS
is much clearer so it is good style to include it.
Challenge (optional):
SQL queries help us ask specific questions which we want to answer about our data. The real skill with SQL is to know how to translate our humanities questions into a sensible SQL query (and subsequently visualize and interpret our results).
Have a look at the following questions; these questions are written in plain English. Can you translate them to SQL queries and give a suitable answer?
How many entries from each year are there per year?
How many years have similar amounts of books published?
Proposed solutions:
Solution:
SELECT date as year, count(*) FROM eebo GROUP BY year ORDER BY year DESC
Solution:
SELECT date as year, count(*) AS volumes FROM eebo GROUP BY year ORDER BY volumes DESC
Key Points
Use the
JOIN
command to combine data from two tables—theON
orUSING
keywords specify which columns link the tables.Regular
JOIN
returns only matching rows. Other join commands provide different behavior, e.g.,LEFT JOIN
retains all rows of the table on the left side of the command.
IFNULL
allows you to specify a value to use in place ofNULL
, which can help in joins
NULLIF
can be used to replace certain values withNULL
in resultsMany other functions like
IFNULL
andNULLIF
can operate on individual values.Aliases can help shorten long queries. To write clear and readible queries, use the
AS
keyword when creating aliases.