SQL Aggregation

next episode

Overview

Teaching: 50 min
Exercises: 10 min

Questions

How can I summarize my data by aggregating, filtering, or ordering query results?

How to save query result?

Objectives

Apply aggregation to group records in SQL.

Filter and order results of a query based on aggregate functions.

Save a query to make a new table.

Apply filters to find missing values in SQL.

COUNT and GROUP BY

Aggregation allows us to combine results by grouping records based on value, it is useful for calculating combined values in groups.

Let’s go to the invoice_info table and find out how many invoices there are. Using the wildcard * simply counts the number of records (rows):

SELECT COUNT(*)
FROM invoice_info;

We can also find out how many total bottles sold:

SELECT COUNT(*), SUM(Bottles_Sold)
FROM invoice_info;

There are many other aggregate functions included in SQL, for example: MAX, MIN, and AVG.

Now, let’s see how many invoices were there for each Store. We do this using a GROUP BY clause

SELECT Store_id, COUNT(*)
FROM invoice_info
GROUP BY Store_id;

GROUP BY tells SQL what field or fields we want to use to aggregate the data. If we want to group by multiple fields, we give GROUP BY a comma separated list.

Challenge

What is the most expensive soda in each category? Use the item_info table, write queries that return: Item_Description, Category_id, max Bottle_Retail_Price of the most expensive soda in each category.
Solution
SELECT Item_Description, Category_id, MAX(Bottle_Retail_Price)
FROM item_info
GROUP BY Category_id;

Ordering Aggregated Results

We can order the results of our aggregation by a specific column, including the aggregated column. Let’s count the number of individuals of each species captured, ordered by the count:

SELECT Store_id, COUNT(*)
FROM invoice_info
GROUP BY Store_id
ORDER BY COUNT(*);

The `HAVING` keyword

You can make the query more readable by using AS in your query to rename a column. For example, in the above query, we can call the COUNT(*) by another name, like num_invoices.

In the previous episode, we have seen the keyword WHERE, allowing us to filter the results according to some criteria. SQL offers a mechanism to filter the results based on aggregate functions, through the HAVING keyword.

For example, we can request to only return information about stores that has more than 1000 invoices.

SELECT Store_id, COUNT(*) AS num_invoices
FROM invoice_info
WHERE Bottles_sold > 1
GROUP BY Store_id
HAVING num_invoices > 1000
ORDER BY num_invoices;

The HAVING keyword works exactly like the WHERE keyword, but uses aggregated columns instead of database fields to filter.

Note that HAVING comes after GROUP BY. One way to think about this is: the data are retrieved (SELECT), which can be filtered (WHERE), then combined in groups (GROUP BY); finally, we can filter again based on some of these groups (HAVING).

Challenge

What sodas were sold more than 100000 bottles in the whole database?
In another word, write a query that returns item_id and total bottles sold in invoice_info table Where the total bottles sold is more than 100000
Solution
SELECT Item_id, SUM(Bottles_Sold) AS ct
FROM invoice_info
GROUP BY Item_id
HAVING ct > 100000;

What About NULL?

Real-world data is never complete — there are always holes. Databases represent these holes using a special value called null. null is not zero, False, or the empty string; it is a one-of-a-kind value that means “nothing here”. Dealing with null requires a few special tricks and some careful thinking.

To start, let’s have a look at the store_info table. Stores with Store_id 5226 and 5249 have no County_id — or rather, its County_id is null:

To find the Stores that does not have a County, note that you cannot do == NULL. In stead, you can use IS NULL statement.

SELECT * FROM store_info 
WHERE COunty_id IS NULL;

Another case is when we use a “negative” query. Let’s count all the Stores in County_id = 82:

SELECT COUNT(*) 
FROM store_info 
WHERE County_id = 82;

Now let’s count all the Stores in County_id != 82:

SELECT COUNT(*) 
FROM store_info 
WHERE County_id != 82;

But if we compare those two numbers with the total:

SELECT COUNT(*)
FROM store_info;

We’ll see that they don’t add up to the total! That’s because SQL doesn’t automatically include NULL values in a negative conditional statement. So if we are querying “not x”, then SQL divides our data into three categories: ‘x’, ‘not NULL, not x’ and NULL; then, returns the ‘not NULL, not x’ group. Sometimes this may be what we want - but sometimes we may want the missing values included as well! In that case, we’d need to change our query to:

Challenge

How do you count all the Stores not in County_id 82 to include the missing values?
Solution
SELECT COUNT(*) 
FROM store_info 
WHERE County_id != 82 OR County_id IS NULL;

Subqueries

Another way to combine the data from two tables is subqueries. You can use the result of a query as a table. For example, you can find which store name that sell certain items:

SELECT Store_Name
FROM store_info
WHERE Store_id IN
(SELECT Store_id
FROM invoice_info
INNER JOIN item_info
USING(Item_id)
);

Saving Queries for Future Use

It is not uncommon to repeat the same operation more than once, for example for monitoring or reporting purposes. SQL comes with a very powerful mechanism to do this by creating views. Views are a form of query that is saved in the database, and can be used to look at, filter, and even update information. One way to think of views is as a table, that can read, aggregate, and filter information from several places before showing it to you.

Creating a view from a query requires to add CREATE VIEW viewname AS before the query itself. For example, imagine that we are doing a project that only need the data during the May of 2017. We can query by:

SELECT * FROM invoice_info
WHERE Date BETWEEN "2017-05-01" AND "2017-05-31"
ORDER BY Date;

But we don’t want to type that every time we want to ask a question about that particular subset of data. Hence, we can benefit from a view:

CREATE VIEW May_2017 AS
SELECT * FROM invoice_info
WHERE Date BETWEEN "2017-05-01" AND "2017-05-31"
ORDER BY Date;

Now if you execute the query with pd.read_sql(), what happened? It does not work!
WHY? alt text
It is saving something to database, instead of returning the query result. So nothing can be store to the DataFrame.
Instead, what you can do is to create a cursor object, and execute the statement:

q2 = '''CREATE VIEW May_2017 AS
    SELECT * FROM invoice_info
    WHERE Date BETWEEN "2017-05-01" AND "2017-05-31"
    ORDER BY Date;'''
c = conn.cursor()
c.execute(q2)

To save changes, you do

conn.commit()

Now, you have successfully created the view. May_2017 view is almost like a table in the database. You can do something like this:

SELECT * FROM May_2017;

Key Points

Use the GROUP BY keyword to aggregate data.

Functions like MIN, MAX, AVERAGE, SUM, COUNT, etc. operate on aggregated data.

Aliases can help shorten long queries. To write clear and readible queries, use the AS keyword when creating aliases.

Use the HAVING keyword to filter on aggregate properties.

Use a VIEW to store the result of a query as though it was a new table.

previous episode

SQL for Business

SQL Aggregation

next episode

Overview

COUNT and GROUP BY

Challenge

Solution

Ordering Aggregated Results

The HAVING keyword

Challenge

Solution

What About NULL?

Challenge

Solution

Subqueries

Saving Queries for Future Use

Key Points

previous episode

next episode

The `HAVING` keyword