Introduction
Overview
Teaching: 15 min
Exercises: 10 min
Questions
What are common use cases for timeseries analysis?
Objectives
Read and plot timeseries data using Python Pandas.
Group data and generate descriptive statistics.
Resample timeseries data using datetime indexing in Pandas.
Introducing time series
We regularly encounter time series data in our daily lives. Examples are abundant and include stock prices, daily rates of infection during the COVID-19 pandemic, gas and electricity use as described in our utility bills, and data points from our fitness tracking devices. Any type of observation for which data are collected at specific intervals can be represented as a time series.
Time series data may be analyzed for multiple purposes, including
- Identifying trends.
- Making forecasts.
- Fraud and anomaly detection.
This series of lessons is intended as an introduction to key concepts in the analysis and visualization of time series data. Additional lessons in this series will introduce concepts and methods for making forecasts and classifying time series for different purposes.
Challenge: Time series in our daily lives
Above, we provided examples of information that is best represented using time series, for example daily prices of stocks.
What are some other examples of information that is collected consecutively or at specific intervals? Think of an example and reflect on the following questions:
- Are observations recorded at specific times?
- Are observations recorded at regular intervals?
- What types of trends would you expect to see in the data?
Time series analysis using Python
Several Python libraries have been developed which can be used to support analysis of trends, forecasting, and classification of time series data. A non-exhaustive list of libraries includes:
- pandas
- statistics
- scikit-learn
- tensorflow
This lesson is focused on Pandas, which provides many useful features for inspecting large datasets, as well as resampling time series data to quickly calculate and plot statistics and moving averages.
Key Points
Pandas is a Python library that operates efficiently against large datasets.
Datetime indexing with Pandas enables resampling of timeseries data using different time steps.
Reading Data Files with Python
Overview
Teaching: 45 min
Exercises: 20 min
Questions
How can we manipulate tabular data files in Python?
Objectives
Read tabular data from a file into Pandas.
Change the structure of a dataframe.
Combine multiple files into a single dataset.
Our data have been stored in one CSV file per smart meter, in which the data include all of the meter readings taken from that meter during the period of study. Readings are taken every 15 minutes, for a total of 96 readings per day.
The structure of the data has not been altered from the source data. Each CSV file contains the following fields:
- METER_FID: The identification number of the meter for which the data were recorded.
- START_READ: The meter reading at 00:15 (12:15 AM) of the date specified in the INTERVAL_TIME field. Note that this value is the same for each reading taken during a day, and is the same as the END_READ value of the previous day.
- END_READ: The actual value of the meter after the final interval of the date specified in the INTERVAL_TIME field, representing the total amount of power consumed during the whole day. This value becomes the value of the START_READ field for the next day.
- INTERVAL_TIME: The time at which a meter reading was taken. Readings are taken every 15 minutes, with the first reading for a day taken at 00:15 (12:15 AM) and representing power usage since 00:00 (12:00 AM) of the same day.
- INTERVAL_READ: The amount of power consumed between readings. The sum of readings between 00:15 of a given day and 00:00 of the following day should be equal (to two decimal places) to the difference between the END_READ and START_READ values for that day.
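As a quick sanity check of the relationship between INTERVAL_READ, START_READ, and END_READ described above, we can sum the interval readings per day and compare the result with the difference between the daily register values. The following is a minimal sketch, assuming the first meter's file is at the path used later in this lesson; note that grouping by calendar date is approximate here, because a day's final reading is timestamped at 00:00 of the following day.
import pandas as pd

# Read a single meter's data and parse the reading timestamps
df = pd.read_csv("../data/ladpu_smart_meter_data_01.csv")
df["date"] = pd.to_datetime(df["INTERVAL_TIME"]).dt.date

# Sum the interval readings per calendar day and compare with the
# difference between that day's END_READ and START_READ values
daily = df.groupby("date").agg(
    interval_sum=("INTERVAL_READ", "sum"),
    start=("START_READ", "first"),
    end=("END_READ", "first"),
)
daily["difference"] = (daily["end"] - daily["start"]).round(2)
print(daily.head())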
In order to inspect and analyze the data for each meter, we could read and work with each file independently. However, as we may also want to compare statistics or trends across multiple meters, it can be more practical to combine or concatenate the data from multiple meters into a single dataset. Having done that, we can group the data by meter, date, or other variables of interest. The Pandas library for Python allows us to do just this.
Libraries
When you launch a Python environment, it opens with a set of default libraries loaded. These libraries provide access to a default set of functions, including commonly used functions like print() and help(). Libraries can be used to extend the functionality of the environment beyond the default. Here, we are going to import the pandas and glob libraries to add functionality to our environment. Specifically, pandas is a library that provides methods for indexing and manipulating large datasets. The glob library allows us to quickly create lists of files based on patterns in filenames.
import pandas as pd
import glob
Reading files
Creating a list of filenames that we want to read is a common way of iterating over files in Python. We use glob to create the list, though for now we will only read the first file in the list.
file_list = glob.glob('../data/*.csv')
print(file_list)
The glob() function does not guarantee the order of the files in the list. That is, our files may appear in any order. To make the result predictable, we can sort the list by filename before reading the first file.
file_list = sorted(file_list)
data = pd.read_csv(file_list[0])
Inspecting the data
The data is assigned to the variable data, which is an object. We can get information about data as an object, as well as information about the data it contains.
# Find the data type of the 'data' object
print(type(data))
<class 'pandas.core.frame.DataFrame'>
We can get information about the data by requesting its attributes or using functions.
Use the shape attribute to see the size in rows and columns.
print(data.shape)
(105012, 5)
Use the dtypes attribute to see the data types of all columns in the dataset.
print(data.dtypes)
INTERVAL_TIME object
METER_FID int64
START_READ float64
END_READ float64
INTERVAL_READ float64
dtype: object
We can limit the above to a single column.
print(data.START_READ.dtypes)
float64
shape and dtypes are two examples of attributes.
We can also use functions or methods to see information about the dataframe.
The info() function outputs structural information about the data.
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105012 entries, 0 to 105011
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 105012 non-null object
1 METER_FID 105012 non-null int64
2 START_READ 105012 non-null float64
3 END_READ 105012 non-null float64
4 INTERVAL_READ 105012 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 4.0+ MB
None
Note that info() prints out all of the information we got from the type() function and the shape and dtypes attributes.
We can also inspect some rows of data. Since the table is large - over 100,000 rows - we may only want to look at the first few or the last few rows. To do this, we can use the head() and tail() functions, respectively.
print(data.head())
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
0 2017-01-01 00:00:00 285 14951.787 14968.082 0.0744
1 2017-01-01 00:15:00 285 14968.082 14979.831 0.0762
2 2017-01-01 00:30:00 285 14968.082 14979.831 0.1050
3 2017-01-01 00:45:00 285 14968.082 14979.831 0.0636
4 2017-01-01 01:00:00 285 14968.082 14979.831 0.0870
print(data.tail())
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
105007 2019-12-31 22:45:00 285 28358.546 28397.749 0.4944
105008 2019-12-31 23:00:00 285 28358.546 28397.749 0.4974
105009 2019-12-31 23:15:00 285 28358.546 28397.749 0.4422
105010 2019-12-31 23:30:00 285 28358.546 28397.749 0.4212
105011 2019-12-31 23:45:00 285 28358.546 28397.749 0.3816
So far we have called all of the functions using default arguments. For example, by default head() and tail() will print the first or last five rows. If we want to view more (or fewer) rows, we can pass the number of rows as an argument. If, for example, we wanted to see the first ten rows of data, we would pass that number as the argument:
print(data.head(10))
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
0 2017-01-01 00:00:00 285 14951.787 14968.082 0.0744
1 2017-01-01 00:15:00 285 14968.082 14979.831 0.0762
2 2017-01-01 00:30:00 285 14968.082 14979.831 0.1050
3 2017-01-01 00:45:00 285 14968.082 14979.831 0.0636
4 2017-01-01 01:00:00 285 14968.082 14979.831 0.0870
5 2017-01-01 01:15:00 285 14968.082 14979.831 0.0858
6 2017-01-01 01:30:00 285 14968.082 14979.831 0.0750
7 2017-01-01 01:45:00 285 14968.082 14979.831 0.0816
8 2017-01-01 02:00:00 285 14968.082 14979.831 0.0966
9 2017-01-01 02:15:00 285 14968.082 14979.831 0.0720
Challenge: Know Your Data
Which of the following commands will output the data type of the ‘INTERVAL_TIME’ column in our ‘data’ dataframe?
A. print(type(INTERVAL_TIME))
B. print(data.info())
C. print(data.INTERVAL_TIME.dtypes)
D. print(INTERVAL_TIME.dtypes)
Solution
B and C will both work. Option B prints the dtypes for the whole dataframe. Option C prints the dtype for the INTERVAL_TIME column.
Challenge: Drilling Down
We have seen how we can use dot notation to get the ‘dtypes’ information for a single column:
print(data.START_READ.dtypes)
Dot notation can be used in function calls, too. Which of the following commands would we use to view the first 20 rows of data from the START_READ column:
A. print(data.START_READ.head())
B. print(data.START_READ.info())
C. print(data.head(20).START_READ)
D. print(data.START_READ.head(20))
Solution
Options C and D print out the first 20 rows of data from the START_READ column.
Modifying data frames - adding columns
If we inspect the first five rows of the INTERVAL_TIME column, we see that dates are provided in a long format.
print(data.INTERVAL_TIME.head())
0 2017-01-01 00:00:00
1 2017-01-01 00:15:00
2 2017-01-01 00:30:00
3 2017-01-01 00:45:00
4 2017-01-01 01:00:00
Name: INTERVAL_TIME, dtype: object
Note as well that the data type (dtype) is given as object, even though datetime is a data type recognized by Pandas. In order to use date information in analyses, we can convert the data in the INTERVAL_TIME column into a recognized date format. In a later episode, we will look at ways we can manipulate datetime information by creating a datetime index. For now we will begin by exploring some of Pandas’ datetime methods without reindexing the data.
We can change the data type of the “INTERVAL_TIME” column in place, but for a closer look at what is happening we will save the date data in a new column with the updated data type.
data["iso_date"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True)
We can verify that the new column was added and that the data type of the new column is datetime using the info() command.
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105012 entries, 0 to 105011
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 105012 non-null object
1 METER_FID 105012 non-null int64
2 START_READ 105012 non-null float64
3 END_READ 105012 non-null float64
4 INTERVAL_READ 105012 non-null float64
5 iso_date 105012 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 4.8+ MB
None
Note too that we are adding the new column but leaving the original date data intact.
Combining multiple files into a single dataframe
Now that we have read a single file into our Python environment and explored its structure a little, we want to develop a process to combine all of our data files into a single dataframe. We can do this using list comprehension to read the files in our file list and concatenate them into a single dataframe.
To do this, we will re-assign our data variable to include the complete, concatenated dataset. Note that the result is a large dataset, with over 1,500,000 rows.
data = pd.concat([pd.read_csv(f) for f in file_list])
print(data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1575180 entries, 0 to 105011
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 1575180 non-null object
1 METER_FID 1575180 non-null int64
2 START_READ 1575180 non-null float64
3 END_READ 1575180 non-null float64
4 INTERVAL_READ 1575180 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 72.1+ MB
None
Let’s take a moment to unpack the line of code that creates the dataframe:
data = pd.concat([pd.read_csv(f) for f in file_list])
This line of code includes several statements that are evaluated in the following order:
- The for loop for f in file_list is evaluated first. As denoted by the square brackets, this creates a list of dataframes by calling Pandas’ read_csv() function on each of the files in the file_list variable that we created above.
- The list of 15 dataframes is combined into a single dataframe using the concat() function. This function takes as its argument a list of dataframes that share the same structure, and combines them vertically into a single dataframe.
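If the list comprehension syntax is unfamiliar, an equivalent version using an explicit for loop may be easier to follow. A sketch:
# Read each file into its own dataframe, collect the dataframes in a
# list, then concatenate the list into a single dataframe
dataframes = []
for f in file_list:
    dataframes.append(pd.read_csv(f))
data = pd.concat(dataframes)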
Now we can add the iso_date column using the same command as before:
data["iso_date"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1575180 entries, 0 to 105011
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 1575180 non-null object
1 METER_FID 1575180 non-null int64
2 START_READ 1575180 non-null float64
3 END_READ 1575180 non-null float64
4 INTERVAL_READ 1575180 non-null float64
5 iso_date 1575180 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 84.1+ MB
None
Challenge: Separating Dates into Columns
It can be useful in a variety of contexts to split date data into multiple columns, with one column for the year, one for the month, and another for the day. This can facilitate filtering or ordering by date in systems like SQLite, for example, that don’t by default have a date data type.
Given the lines of code below, put them in the correct order to read the data file “ladpu_smart_meter_data_01.csv” and split the “INTERVAL_TIME” column into three new columns for “year,” “month,” and “day.”
data["day"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True).dt.day data["year"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True).dt.year data = pd.read_csv("../data/ladpu_smart_meter_data_01.csv") data["month"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True).dt.month
Solution
data = pd.read_csv("../data/ladpu_smart_meter_data_01.csv") data["year"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True).dt.year data["month"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True).dt.month data["day"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True).dt.day
Key Points
Pandas is a Python library designed to work with large datasets.
Use concat() to concatenate tabular dataframes that have the same structure.
Subsetting Dataframes with Pandas
Overview
Teaching: 30 min
Exercises: 15 min
Questions
How can we filter and subset data in Python?
Objectives
Understand how Pandas indexes data.
Subset dataframes using index positions and filtering criteria.
Pandas is designed to work with large datasets. Multiple indexing features allow for quick subsetting and filtering of data.
Rows and columns
The axes of a dataframe are the rows and columns. Both row and column labels are indexed when a dataframe is created, such as when reading a CSV file into memory. This allows for quick selection of specific rows, columns, or cells, either through slicing or by subsetting the dataframe by filtering values according to specific criteria.
It’s important to keep this terminology clear, because in addition to indexing rows and column labels, Pandas creates an index of row labels for every dataframe. This index can be specified when the dataframe is created, or alternatively Pandas will create an integer based index of row labels by default.
We can demonstrate this by reading a single file from our dataset. First, let’s load the libraries we will use for this episode.
import pandas as pd
import glob
Next we use the glob() function to create a list of files. Then we use the same process as before to read all of the files into a single large dataframe. We will also use the info() function to look at the structure of our dataframe.
file_list = glob.glob("../data/*.csv")
data = pd.concat([pd.read_csv(f) for f in file_list])
print(data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1575180 entries, 0 to 105011
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 1575180 non-null object
1 METER_FID 1575180 non-null int64
2 START_READ 1575180 non-null float64
3 END_READ 1575180 non-null float64
4 INTERVAL_READ 1575180 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 72.1+ MB
None
We can use the axes attribute to inspect the row and column indices. The output gives the row index first, and the column index second.
print(data.axes)
[Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7,
8, 9,
...
105002, 105003, 105004, 105005, 105006, 105007, 105008, 105009,
105010, 105011],
dtype='int64', length=1575180), Index(['INTERVAL_TIME', 'METER_FID', 'START_READ', 'END_READ',
'INTERVAL_READ'],
dtype='object')]
The output above is a list, and the row index of our dataframe is the first object in the list:
Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            105002, 105003, 105004, 105005, 105006, 105007, 105008, 105009,
            105010, 105011],
           dtype='int64', length=1575180)
This indicates that our rows are indexed or labeled using incremented integers, beginning with the first row labeled 0 and the last row labeled 105011.
The final row index number may be surprising. The output of the info() function above indicates that our dataframe has 1,575,180 rows. But here the last row has an index label of 105011. Why is this?
When we created our dataframe, we did so by reading all of the files in our file list and concatenating them into a single dataframe. Each file was first read into its own dataframe before concatenation, which means that each of the 15 dataframes had its own row index beginning with 0 and stopping at whatever the last row index would be for that dataframe. When dataframes are concatenated, original row index numbers are preserved by default. Because of this, many thousands of row index numbers were common across the source dataframes, resulting in duplicate row index labels in the concatenated dataframe.
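We can confirm that the concatenated dataframe contains duplicate row labels using the index's has_duplicates attribute. A short sketch:
# True if any row label appears more than once in the index
print(data.index.has_duplicates)
True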
To make the most efficient use of Pandas in this case, we can reset the index of the dataframe. After resetting the index, we will check the axes attribute again.
data.reset_index(inplace=True, drop=True)
print(data.axes)
[RangeIndex(start=0, stop=1575180, step=1), Index(['INTERVAL_TIME', 'METER_FID', 'START_READ', 'END_READ',
       'INTERVAL_READ'],
      dtype='object')]
Note that this time the index is identified as a RangeIndex, rather than the Int64Index that was output before. Instead of listing every integer index number, the range is given. This means that our rows are indexed or labeled using incremented integers, beginning with the first row labeled 0 and the last row labeled 1575179. Recall that Python uses zero-indexing, so the stop value in the RangeIndex should be understood as “up to but not including 1575180”: the dataframe’s 1,575,180 rows are labeled from 0 to 1575179.
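Note that the same result can be achieved at concatenation time: the concat() function accepts an ignore_index argument that discards the original row labels and builds a fresh RangeIndex. A sketch:
# Concatenate and renumber the rows in a single step
data = pd.concat([pd.read_csv(f) for f in file_list], ignore_index=True)
print(data.index)
RangeIndex(start=0, stop=1575180, step=1)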
The second object in the list output by printing the dataframe’s axes attribute is the column index. By default, the items in this index will be the column names, which can also be understood as the column axis labels:
Index(['INTERVAL_TIME', 'METER_FID', 'START_READ', 'END_READ',
'INTERVAL_READ']
We can see the row labels using the head() function. Note that there is no column name for the row index. This is because there was no source column in our CSV file that the row labels refer to. We will update the attributes of the row index below.
print(data.head())
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
0 2017-01-01 00:00:00 285 14951.787 14968.082 0.0744
1 2017-01-01 00:15:00 285 14968.082 14979.831 0.0762
2 2017-01-01 00:30:00 285 14968.082 14979.831 0.1050
3 2017-01-01 00:45:00 285 14968.082 14979.831 0.0636
4 2017-01-01 01:00:00 285 14968.082 14979.831 0.0870
Selecting Specific Columns
If we want to select all of the values in a single column, we can use the column name.
print(data["INTERVAL_TIME"])
0 2017-01-01 00:00:00
1 2017-01-01 00:15:00
2 2017-01-01 00:30:00
3 2017-01-01 00:45:00
4 2017-01-01 01:00:00
...
1575175 2019-12-31 22:45:00
1575176 2019-12-31 23:00:00
1575177 2019-12-31 23:15:00
1575178 2019-12-31 23:30:00
1575179 2019-12-31 23:45:00
Name: INTERVAL_TIME, Length: 1575180, dtype: object
In order to select multiple columns, we need to provide the column names as a list.
print(data[["METER_FID", "INTERVAL_TIME"]])
METER_FID INTERVAL_TIME
0 285 2017-01-01 00:00:00
1 285 2017-01-01 00:15:00
2 285 2017-01-01 00:30:00
3 285 2017-01-01 00:45:00
4 285 2017-01-01 01:00:00
... ... ...
1575175 8078 2019-12-31 22:45:00
1575176 8078 2019-12-31 23:00:00
1575177 8078 2019-12-31 23:15:00
1575178 8078 2019-12-31 23:30:00
1575179 8078 2019-12-31 23:45:00
Note that all of our output includes row labels.
We can request attributes or perform operations on subsets.
print(data["INTERVAL_TIME"].shape)
(1575180,)
print(data["INTERVAL_READ"].sum())
365877.39449999994
Challenge: Find the Maximum Value of a Column
In addition to getting the sum of values from a specific column, Pandas has functions for generating other statistics. These include min() for the minimum value within a column and max() for the maximum value. Which of the below lines of code would give us the maximum values of both the “START_READ” and “END_READ” columns?
A. print(data[START_READ, END_READ].max())
B. print(data["START_READ", "END_READ"].max())
C. print(data[[START_READ, END_READ]].max())
D. print(data[["START_READ", "END_READ"]].max())
Solution
Option D prints out the maximum value of the columns.
Challenge: Perform Operations on Specific Data Types
In addition to selecting columns by name, Pandas has a select_dtypes() method that lets us select columns of a specific data type. Which of the options below will print out the sum of each column with a float data type?
A. print(data.select_dtypes(float).sum())
B. print(data.select_dtypes([["START_READ", "END_READ", "INTERVAL_READ"]]).sum())
C. print(data[["START_READ", "END_READ", "INTERVAL_READ"]].sum())
D. print(data.sum(select_dtypes(float)))
Solution
Both A and C will output the correct result. However, option C requires us to already know that those columns have data of the correct data type.
Selecting Specific Rows
Subsets of rows of a Pandas dataframe can be selected in different ways. The first two methods we’ll look at, .loc and .iloc indexing, have some subtle differences that are worth exploring.
.loc indexing will select rows with specific labels. Recall from earlier that when we read our CSV file into a dataframe, a row index was created which used integers as the label for each row. We can see information about the row index using the dataframe’s index attribute.
print(data.index)
RangeIndex(start=0, stop=1575180, step=1)
To select a specific row using .loc, we need to know the label of the index. In this case, the labels start with 0 and go up to 1575179, so if we want to select the first row we use the first index label. In this case that is 0.
print(data.loc[0])
INTERVAL_TIME 2017-01-01 00:00:00
METER_FID 285
START_READ 14951.787
END_READ 14968.082
INTERVAL_READ 0.0744
Name: 0, dtype: object
Note that above we said the label of the last row is 1575179, even though the stop value of the index attribute is 1575180. That is because default row indexing uses zero-indexing, which is common for Python data structures. The stop value given above should be understood as up to but not including. We can demonstrate this by trying to use the label 1575180 to select a row:
print("Index error:")
print(data.loc[1575180])
Index error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\range.py:391, in RangeIndex.get_loc(self, key, method, tolerance)
390 try:
--> 391 return self._range.index(new_key)
392 except ValueError as err:
ValueError: 1575180 is not in range
The error message in this case means that we tried to select a row using a label that is not in the index.
To print the actual last row, we provide a row label that is one less than the stop value:
print("Actual last row:")
print(data.loc[1575179])
Actual last row:
INTERVAL_TIME 2019-12-31 23:45:00
METER_FID 8078
START_READ 40684.475
END_READ 40695.86
INTERVAL_READ 0.087
Name: 1575179, dtype: object
iloc is used to select a row or subset of rows based on the integer position of row indexers. This method also uses zero-indexing, so integers will range from 0 to 1 less than the number of rows in the dataframe. The first row would have a position integer of 0, the second row would have a position integer of 1, etc.
print(data.iloc[0])
INTERVAL_TIME 2017-01-01 00:00:00
METER_FID 285
START_READ 14951.787
END_READ 14968.082
INTERVAL_READ 0.0744
Name: 0, dtype: object
As above, we know there are 1575180 rows in our dataset but zero-indexing means that the position integer of the last row is 1575179. If we try to select a row using the position integer 1575180, we again get an out-of-bounds error.
print(data.iloc[1575180])
IndexError: single positional indexer is out-of-bounds
We can also select rows using their position relative to the last row. If we want to select the last row without already knowing how many rows are in the dataframe, we can refer to its position using [-1].
print(data.iloc[-1])
INTERVAL_TIME 2019-12-31 23:45:00
METER_FID 8078
START_READ 40684.475
END_READ 40695.86
INTERVAL_READ 0.087
Name: 1575179, dtype: object
An alternative, more roundabout way is to use the len() function. Above, we noted that position integers will range from 0 to 1 less than the number of rows in the dataframe. In combination with the len() function, we can select the last row in a dataframe using:
print(data.iloc[len(data) - 1])
INTERVAL_TIME 2019-12-31 23:45:00
METER_FID 8078
START_READ 40684.475
END_READ 40695.86
INTERVAL_READ 0.087
Name: 1575179, dtype: object
Challenge: Selecting Cells
Given the lines of code below, put them in the correct order to read the data file ladpu_smart_meter_data_01.csv and print the starting and ending meter readings.
print(data.iloc[0]["START_READ"]) data = pd.read_csv("../data/ladpu_smart_meter_data_01.csv") print(data.iloc[-1]["END_READ"])
Solution
data = pd.read_csv("../data/ladpu_smart_meter_data_01.csv") print(data.iloc[0]["START_READ"]) print(data.iloc[-1]["END_READ"])
Slicing Data
So far we have used label and position based indexing to select single rows from a data frame. We can select larger subsets using index slicing. Because this is such a common operation, we don’t have to specify an indexer - we can slice the dataframe directly.
print(data[1:10])
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
1 2017-01-01 00:15:00 285 14968.082 14979.831 0.0762
2 2017-01-01 00:30:00 285 14968.082 14979.831 0.1050
3 2017-01-01 00:45:00 285 14968.082 14979.831 0.0636
4 2017-01-01 01:00:00 285 14968.082 14979.831 0.0870
5 2017-01-01 01:15:00 285 14968.082 14979.831 0.0858
6 2017-01-01 01:30:00 285 14968.082 14979.831 0.0750
7 2017-01-01 01:45:00 285 14968.082 14979.831 0.0816
8 2017-01-01 02:00:00 285 14968.082 14979.831 0.0966
9 2017-01-01 02:15:00 285 14968.082 14979.831 0.0720
When only a single colon is used in the square brackets, the integer on the left indicates the starting position. The integer on the right indicates the ending position, but here again note that the returned rows will be up to but not including the row with the specified position.
By default, all rows between the starting and ending position will be returned. We can also specify a number of rows to increment over, using a second colon followed by the interval of rows to use. For example, if we want to return every other row out of the first twenty rows in the dataset, we would use the following:
print(data[0:20:2])
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
0 2017-01-01 00:00:00 285 14951.787 14968.082 0.0744
2 2017-01-01 00:30:00 285 14968.082 14979.831 0.1050
4 2017-01-01 01:00:00 285 14968.082 14979.831 0.0870
6 2017-01-01 01:30:00 285 14968.082 14979.831 0.0750
8 2017-01-01 02:00:00 285 14968.082 14979.831 0.0966
10 2017-01-01 02:30:00 285 14968.082 14979.831 0.0798
12 2017-01-01 03:00:00 285 14968.082 14979.831 0.0660
14 2017-01-01 03:30:00 285 14968.082 14979.831 0.0846
16 2017-01-01 04:00:00 285 14968.082 14979.831 0.0912
18 2017-01-01 04:30:00 285 14968.082 14979.831 0.0720
If we don’t specify a starting or ending position, Python will default to the first and last positions, respectively. The following will output the first ten rows.
print(data[:10])
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
0 2017-01-01 00:00:00 285 14951.787 14968.082 0.0744
1 2017-01-01 00:15:00 285 14968.082 14979.831 0.0762
2 2017-01-01 00:30:00 285 14968.082 14979.831 0.1050
3 2017-01-01 00:45:00 285 14968.082 14979.831 0.0636
4 2017-01-01 01:00:00 285 14968.082 14979.831 0.0870
5 2017-01-01 01:15:00 285 14968.082 14979.831 0.0858
6 2017-01-01 01:30:00 285 14968.082 14979.831 0.0750
7 2017-01-01 01:45:00 285 14968.082 14979.831 0.0816
8 2017-01-01 02:00:00 285 14968.082 14979.831 0.0966
9 2017-01-01 02:15:00 285 14968.082 14979.831 0.0720
We can use a negative position index to return the last ten rows.
print(data[-10:])
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
1575170 2019-12-31 21:30:00 8078 40684.475 40695.86 0.1410
1575171 2019-12-31 21:45:00 8078 40684.475 40695.86 0.1098
1575172 2019-12-31 22:00:00 8078 40684.475 40695.86 0.1020
1575173 2019-12-31 22:15:00 8078 40684.475 40695.86 0.1140
1575174 2019-12-31 22:30:00 8078 40684.475 40695.86 0.1098
1575175 2019-12-31 22:45:00 8078 40684.475 40695.86 0.1284
1575176 2019-12-31 23:00:00 8078 40684.475 40695.86 0.1194
1575177 2019-12-31 23:15:00 8078 40684.475 40695.86 0.1062
1575178 2019-12-31 23:30:00 8078 40684.475 40695.86 0.1344
1575179 2019-12-31 23:45:00 8078 40684.475 40695.86 0.0870
This works the same with position based indexing. It can also work for label based indexing, depending on the labels used.
print(data.iloc[2:12:2])
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
2 2017-01-01 00:30:00 285 14968.082 14979.831 0.1050
4 2017-01-01 01:00:00 285 14968.082 14979.831 0.0870
6 2017-01-01 01:30:00 285 14968.082 14979.831 0.0750
8 2017-01-01 02:00:00 285 14968.082 14979.831 0.0966
10 2017-01-01 02:30:00 285 14968.082 14979.831 0.0798
print(data.loc[3:19:3])
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
3 2017-01-01 00:45:00 285 14968.082 14979.831 0.0636
6 2017-01-01 01:30:00 285 14968.082 14979.831 0.0750
9 2017-01-01 02:15:00 285 14968.082 14979.831 0.0720
12 2017-01-01 03:00:00 285 14968.082 14979.831 0.0660
15 2017-01-01 03:45:00 285 14968.082 14979.831 0.0726
18 2017-01-01 04:30:00 285 14968.082 14979.831 0.0720
We can select the values of single cells or column and row subsets by combining the methods used so far. First we specify the row index to use, then the column.
print(data.iloc[0]['INTERVAL_READ'])
0.0744
We can select multiple rows and/or columns. Note that selecting multiple columns requires us to put them into a list.
print(data.iloc[:10][['INTERVAL_TIME', 'INTERVAL_READ']])
INTERVAL_TIME INTERVAL_READ
0 2017-01-01 00:00:00 0.0744
1 2017-01-01 00:15:00 0.0762
2 2017-01-01 00:30:00 0.1050
3 2017-01-01 00:45:00 0.0636
4 2017-01-01 01:00:00 0.0870
5 2017-01-01 01:15:00 0.0858
6 2017-01-01 01:30:00 0.0750
7 2017-01-01 01:45:00 0.0816
8 2017-01-01 02:00:00 0.0966
9 2017-01-01 02:15:00 0.0720
We can also use position indexing to select columns, with the same slicing syntax as above. For example, to select the first ten rows of the first two columns:
print(data.iloc[1:10, 0:2])
INTERVAL_TIME METER_FID
1 2017-01-01 00:15:00 285
2 2017-01-01 00:30:00 285
3 2017-01-01 00:45:00 285
4 2017-01-01 01:00:00 285
5 2017-01-01 01:15:00 285
6 2017-01-01 01:30:00 285
7 2017-01-01 01:45:00 285
8 2017-01-01 02:00:00 285
9 2017-01-01 02:15:00 285
Challenge: Subsetting
The frequency at which meter readings were taken means that a single day’s worth of data consists of 96 rows. Which of the following lines of code would we use to select daily start and ending meter readings, plus the date for each day?
A. data.loc[::96][['START_READ', 'END_READ', 'INTERVAL_TIME']]
B. data.loc[0:96:96]['START_READ', 'END_READ', 'INTERVAL_TIME']
C. data.loc[::96]['START_READ', 'END_READ', 'INTERVAL_TIME']
D. data.loc[:-1:96][['START_READ', 'END_READ', 'INTERVAL_TIME']]
Solution
Option A returns the first row of the specified columns for each day.
Conditions
There are several ways to select subsets of a data frame using conditions to select rows based on the value of specific variables. One commonly used method is boolean indexing, which returns rows for which a given condition evaluates to True. The following example tests whether the values of the “INTERVAL_READ” variable are greater than 0.16.
print(data["INTERVAL_READ"] > 0.16)
0 False
1 False
2 False
3 False
4 False
...
1575175 False
1575176 False
1575177 False
1575178 False
1575179 False
Name: INTERVAL_READ, Length: 1575180, dtype: bool
Note that we have to include the condition to be evaluated as the indexer in order to see the rows for which the condition evaluates to True.
print(data[data['INTERVAL_READ'] > 0.16])
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
32 2017-01-01 08:00:00 285 14968.082 14979.831 0.2304
33 2017-01-01 08:15:00 285 14968.082 14979.831 0.1854
34 2017-01-01 08:30:00 285 14968.082 14979.831 0.1878
35 2017-01-01 08:45:00 285 14968.082 14979.831 0.1920
36 2017-01-01 09:00:00 285 14968.082 14979.831 0.1710
... ... ... ... ... ...
1575153 2019-12-31 17:15:00 8078 40684.475 40695.860 0.2616
1575154 2019-12-31 17:30:00 8078 40684.475 40695.860 0.4152
1575155 2019-12-31 17:45:00 8078 40684.475 40695.860 0.4950
1575156 2019-12-31 18:00:00 8078 40684.475 40695.860 0.3192
1575158 2019-12-31 18:30:00 8078 40684.475 40695.860 0.1980
[703542 rows x 5 columns]
The column used for the condition does not have to be included in the subset. Selecting rows in this case requires the use of label based indexing.
print(data.loc[data["INTERVAL_READ"] > 0.26, ['METER_FID', 'INTERVAL_TIME']])
METER_FID INTERVAL_TIME
66 285 2017-01-01 16:30:00
77 285 2017-01-01 19:15:00
134 285 2017-01-02 09:30:00
139 285 2017-01-02 10:45:00
141 285 2017-01-02 11:15:00
... ... ...
1575152 8078 2019-12-31 17:00:00
1575153 8078 2019-12-31 17:15:00
1575154 8078 2019-12-31 17:30:00
1575155 8078 2019-12-31 17:45:00
1575156 8078 2019-12-31 18:00:00
[373641 rows x 2 columns]
It is possible to specify multiple conditions.
print(data[(data["INTERVAL_READ"] > 0.16) & (data["METER_FID"] == 285)].head())
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
32 2017-01-01 08:00:00 285 14968.082 14979.831 0.2304
33 2017-01-01 08:15:00 285 14968.082 14979.831 0.1854
34 2017-01-01 08:30:00 285 14968.082 14979.831 0.1878
35 2017-01-01 08:45:00 285 14968.082 14979.831 0.1920
36 2017-01-01 09:00:00 285 14968.082 14979.831 0.1710
The query() method provides similar functionality with a simplified syntax. Using the query() method, the above example becomes:
print(data.query('INTERVAL_READ > 0.16 and METER_FID == 285').head())
INTERVAL_TIME METER_FID START_READ END_READ INTERVAL_READ
32 2017-01-01 08:00:00 285 14968.082 14979.831 0.2304
33 2017-01-01 08:15:00 285 14968.082 14979.831 0.1854
34 2017-01-01 08:30:00 285 14968.082 14979.831 0.1878
35 2017-01-01 08:45:00 285 14968.082 14979.831 0.1920
36 2017-01-01 09:00:00 285 14968.082 14979.831 0.1710
Note in this case that the conditions are passed as a string to the query. This requires us to pay close attention to consistent use of single and double quotes.
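For example, if a condition compares against a string value, the quotes around the string literal must differ from the quotes that wrap the whole query. A sketch:
# The query is wrapped in single quotes, so the string literal inside
# it uses double quotes
print(data.query('INTERVAL_TIME == "2017-01-01 08:00:00" and METER_FID == 285').head())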
We can also select specific columns when using the query() method.
print(data.query('INTERVAL_READ > 0.16 and METER_FID == 285')[["METER_FID", "INTERVAL_TIME"]].head())
METER_FID INTERVAL_TIME
32 285 2017-01-01 08:00:00
33 285 2017-01-01 08:15:00
34 285 2017-01-01 08:30:00
35 285 2017-01-01 08:45:00
36 285 2017-01-01 09:00:00
Challenge: Select Intervals with Above Average Power Consumption
Fill in the blanks below to select the “INTERVAL_TIME”, “INTERVAL_READ”, and “METER_FID” for all rows for which the value of the “INTERVAL_READ” variable is above the average value for that variable in the entire dataset.
data.___[data["_________"] ___ data["INTERVAL_READ"].mean(), ["___________", "___________", "METER_FID"]]
Solution
data.loc[data["INTERVAL_READ"] > data["INTERVAL_READ"].mean(), ["INTERVAL_TIME", "INTERVAL_READ", "METER_FID"]]
Key Points
Rows can be selected in Pandas using label or position based indexing.
Boolean indexing and the query() method can be used to select subsets of a dataframe using conditions.
Summarizing Tabular Data with Pandas
Overview
Teaching: 30 min
Exercises: 15 min
Questions
How can we summarize large datasets?
Objectives
Summarize large datasets using descriptive statistics.
Perform operations to identify outliers in tabular data.
Metadata provided with the published Los Alamos Public Utility Department Smart Meter dataset at Dryad (https://doi.org/10.5061/dryad.m0cfxpp2c) indicates that, aside from de-identification, the data have not been pre-processed or normalized and may include missing data, duplicate entries, and other anomalies.
The full dataset is too large for this tutorial format. The subset used for the lesson has been cleaned to eliminate null and duplicate values, but may still contain outliers or erroneous meter readings. One way to identify potential errors within a dataset is to inspect the data using descriptive statistics.
Descriptive Statistics
Descriptive statistics provide us with a means to summarize large datasets, and to identify the presence of outliers, missing data, etc.
First, import Pandas and the glob library, which provides utilities for generating lists of files that match specific file name patterns. We also import NumPy, which we will use later in this episode to calculate multiple statistics at once.
import pandas as pd
import glob
import numpy as np
Next, use the glob function to create a list of files that Pandas will combine into a single data frame.
file_list = glob.glob("../data/*.csv")
file_list
['../data\\ladpu_smart_meter_data_01.csv',
'../data\\ladpu_smart_meter_data_02.csv',
'../data\\ladpu_smart_meter_data_03.csv',
'../data\\ladpu_smart_meter_data_04.csv',
'../data\\ladpu_smart_meter_data_05.csv',
'../data\\ladpu_smart_meter_data_06.csv',
'../data\\ladpu_smart_meter_data_07.csv',
'../data\\ladpu_smart_meter_data_08.csv',
'../data\\ladpu_smart_meter_data_09.csv',
'../data\\ladpu_smart_meter_data_10.csv',
'../data\\ladpu_smart_meter_data_11.csv',
'../data\\ladpu_smart_meter_data_12.csv',
'../data\\ladpu_smart_meter_data_13.csv',
'../data\\ladpu_smart_meter_data_14.csv',
'../data\\ladpu_smart_meter_data_15.csv']
As before, concatenate all of the individual CSV files into a single dataframe and reset the index.
data = pd.concat([pd.read_csv(f) for f in file_list])
data.reset_index(inplace=True, drop=True)
print(data.axes)
[RangeIndex(start=0, stop=1575180, step=1), Index(['INTERVAL_TIME', 'METER_FID', 'START_READ', 'END_READ',
'INTERVAL_READ'],
dtype='object')]
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1575180 entries, 0 to 1575179
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 1575180 non-null object
1 METER_FID 1575180 non-null int64
2 START_READ 1575180 non-null float64
3 END_READ 1575180 non-null float64
4 INTERVAL_READ 1575180 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 60.1+ MB
None
In a later episode we will create a datetime index using the INTERVAL_TIME field. For now we will change the data type of INTERVAL_TIME to datetime and store the updated data in an iso_date column.
data["iso_date"] = pd.to_datetime(data["INTERVAL_TIME"], infer_datetime_format=True)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1575180 entries, 0 to 1575179
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 1575180 non-null object
1 METER_FID 1575180 non-null int64
2 START_READ 1575180 non-null float64
3 END_READ 1575180 non-null float64
4 INTERVAL_READ 1575180 non-null float64
5 iso_date 1575180 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 72.1+ MB
None
The output of the info() method above indicates that four of the columns in the dataframe have numeric data types: “METER_FID”, “START_READ”, “END_READ”, and “INTERVAL_READ”. By default, Pandas will calculate descriptive statistics for numeric data types within a dataset.
print(data.describe())
METER_FID START_READ END_READ INTERVAL_READ
count 1.575180e+06 1.575180e+06 1.575180e+06 1.575180e+06
mean 2.357953e+04 3.733560e+04 3.734571e+04 2.322766e-01
std 1.414977e+04 1.877812e+04 1.877528e+04 3.026917e-01
min 2.850000e+02 7.833000e+00 7.833000e+00 0.000000e+00
25% 1.006300e+04 2.583424e+04 2.584509e+04 8.820000e-02
50% 2.419700e+04 3.399917e+04 3.401764e+04 1.452000e-01
75% 3.503400e+04 4.391686e+04 4.393681e+04 2.490000e-01
max 4.501300e+04 9.997330e+04 9.997330e+04 3.709200e+00
Since the values for “START_READ” and “END_READ” are calculated across fifteen different meters over a period of three years, those statistics may not be useful or of interest. Without grouping or otherwise manipulating the data, the only statistics that may be informative in the aggregate are for the “INTERVAL_READ” variable. This is the variable that measures actual power consumption per time intervals of 15 minutes.
Also note that “METER_FID” is treated as an integer by Pandas. This is reasonable, since the identifiers are whole numbers. However, since these identifiers aren’t treated numerically in our analysis we can change the data type to object. We will see below how this affects the way statistics are calculated.
data = data.astype({"METER_FID": "object"})
print(data.dtypes)
INTERVAL_TIME object
METER_FID object
START_READ float64
END_READ float64
INTERVAL_READ float64
iso_date datetime64[ns]
dtype: object
If we re-run the code print(data.describe()) as above, Pandas will now exclude information about “METER_FID”, since by default Pandas only outputs descriptive statistics for numeric data types. We can change the default behavior to include statistics for all columns.
print(data.describe(include="all", datetime_is_numeric=True))
INTERVAL_TIME METER_FID START_READ END_READ \
count 1575180 1575180.0 1.575180e+06 1.575180e+06
unique 105012 15.0 NaN NaN
top 2017-01-01 00:00:00 285.0 NaN NaN
freq 15 105012.0 NaN NaN
mean NaN NaN 3.733560e+04 3.734571e+04
min NaN NaN 7.833000e+00 7.833000e+00
25% NaN NaN 2.583424e+04 2.584509e+04
50% NaN NaN 3.399917e+04 3.401764e+04
75% NaN NaN 4.391686e+04 4.393681e+04
max NaN NaN 9.997330e+04 9.997330e+04
std NaN NaN 1.877812e+04 1.877528e+04
INTERVAL_READ iso_date
count 1.575180e+06 1575180
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 2.322766e-01 2018-07-02 20:14:16.239287296
min 0.000000e+00 2017-01-01 00:00:00
25% 8.820000e-02 2017-10-02 12:11:15
50% 1.452000e-01 2018-07-03 00:22:30
75% 2.490000e-01 2019-04-02 12:33:45
max 3.709200e+00 2019-12-31 23:45:00
std 3.026917e-01 NaN
For non-numeric data types, Pandas has included statistics for count, unique, top, and freq. Respectively, these represent the total number of observations, the number of uniquely occurring values, the most commonly occurring value, and the number of times the most commonly occurring value appears in the dataset.
We can view the descriptive statistics for a single column:
print(data["INTERVAL_READ"].describe())
count 1.575180e+06
mean 2.322766e-01
std 3.026917e-01
min 0.000000e+00
25% 8.820000e-02
50% 1.452000e-01
75% 2.490000e-01
max 3.709200e+00
Name: INTERVAL_READ, dtype: float64
We notice that the difference between the value given for the 75th percentile and the maximum meter reading is larger than the differences between the other percentiles. For a closer inspection of high readings, we can also specify percentiles.
print(data["INTERVAL_READ"].describe(percentiles = [0.75, 0.85, 0.95, 0.99]))
count 1.575180e+06
mean 2.322766e-01
std 3.026917e-01
min 0.000000e+00
50% 1.452000e-01
75% 2.490000e-01
85% 3.846000e-01
95% 6.912000e-01
99% 1.714926e+00
max 3.709200e+00
Name: INTERVAL_READ, dtype: float64
There seem to be some meter readings that are unusually high between the 95th percentile and the maximum. We will investigate this in more detail below.
Challenge: Descriptive Statistics by Data Type
We have seen that the default behavior in Pandas is to output descriptive statistics for numeric data types. It is also possible, using the include argument demonstrated above, to output descriptive statistics for specific data types.
Write some code to output descriptive statistics by data type for each of the data types in the dataset. Refer to the Python data types documentation and use any of the methods we have demonstrated to identify the data types of different columns in a Pandas dataframe.
Solution
print("Dataframe info:") print(data.info()) Dataframe info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 1575180 entries, 0 to 1575179 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 INTERVAL_TIME 1575180 non-null object 1 METER_FID 1575180 non-null object 2 START_READ 1575180 non-null float64 3 END_READ 1575180 non-null float64 4 INTERVAL_READ 1575180 non-null float64 5 iso_date 1575180 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(3), object(2) memory usage: 72.1+ MB None print("Descriptive stats for floating point numbers:") print(data.describe(include=float)) Descriptive stats for floating point numbers: START_READ END_READ INTERVAL_READ count 1.575180e+06 1.575180e+06 1.575180e+06 mean 3.733560e+04 3.734571e+04 2.322766e-01 std 1.877812e+04 1.877528e+04 3.026917e-01 min 7.833000e+00 7.833000e+00 0.000000e+00 25% 2.583424e+04 2.584509e+04 8.820000e-02 50% 3.399917e+04 3.401764e+04 1.452000e-01 75% 4.391686e+04 4.393681e+04 2.490000e-01 max 9.997330e+04 9.997330e+04 3.709200e+00 print("Descriptive stats for objects:") print(data.describe(include=object)) Descriptive stats for objects: INTERVAL_TIME METER_FID count 1575180 1575180 unique 105012 15 top 2017-01-01 00:00:00 285 freq 15 105012
Working with Values
Returning to our descriptive statistics, we have already noted that the maximum value is well above the 99th percentile. Our minimum value of zero may also be unusual, since many homes might be expected to use some amount of energy every fifteen minutes, even when residents are away.
print(data["INTERVAL_READ"].describe())
count 1.575180e+06
mean 2.322766e-01
std 3.026917e-01
min 0.000000e+00
25% 8.820000e-02
50% 1.452000e-01
75% 2.490000e-01
max 3.709200e+00
Name: INTERVAL_READ, dtype: float64
We can also select the minimum and maximum values in a column using the min() and max() functions.
print("Minimum value:", data["INTERVAL_READ"].min())
print("Maximum value:", data["INTERVAL_READ"].max())
Minimum value: 0.0
Maximum value: 3.7092
This is useful if we want to know what those values are. We may want to have more information about the corresponding meter’s start and end reading, the date, and the meter ID. One way to discover this information is to use the idxmin() and idxmax() functions to get the position indices of the rows where the minimum and maximum values occur.
print("Position index of the minimum value:", data["INTERVAL_READ"].idxmin())
print("Position index of the maximum value:", data["INTERVAL_READ"].idxmax())
Position index of the minimum value: 31315
Position index of the maximum value: 1187650
Now we can use the position index to select the row with the reported minimum value.
print(data.iloc[31315])
INTERVAL_TIME 2017-11-24 05:45:00
METER_FID 285
START_READ 18409.997
END_READ 18418.554
INTERVAL_READ 0.0
iso_date 2017-11-24 05:45:00
Name: 31315, dtype: object
We can do the same with the maximum value.
print(data.iloc[1187650])
INTERVAL_TIME 2017-12-06 18:30:00
METER_FID 29752
START_READ 99685.954
END_READ 99788.806
INTERVAL_READ 3.7092
iso_date 2017-12-06 18:30:00
Name: 1187650, dtype: object
An important caveat here is that in both cases, the idxmin() and idxmax() functions return a single position index number, when in fact the minimum and maximum values may occur multiple times. We can use the value_counts() function to demonstrate that 0 occurs several times in the dataset. Since the dataset is large, we will first subset the data to only include meter readings with very small intervals. Then we will get the count of values across the subset.
low_readings = data[data["INTERVAL_READ"] <= 0.005].copy()
print(pd.value_counts(low_readings["INTERVAL_READ"]))
0.00480 856
0.00420 349
0.00000 72
0.00001 40
0.00360 3
0.00354 1
0.00177 1
0.00240 1
0.00180 1
0.00294 1
Name: INTERVAL_READ, dtype: int64
It’s also helpful to visualize the distribution of values. Pandas has a hist() function for this.
low_readings["INTERVAL_READ"].hist()
We can plot the distribution of INTERVAL_READ values across the entire dataset. Note that the default behavior in this case is to group the values into 10 bins. The number of bins can be specified using the bins argument.
data["INTERVAL_READ"].hist(bins-20)
To find the number of rows with “INTERVAL_READ” values equal to the minimum or maximum value of that column, we can also create a subset of rows with that value and then get the length of the subset.
print("Number of rows with minimum interval read values:", len(data[data["INTERVAL_READ"] == data["INTERVAL_READ"].min()]))
print("Number of rows with maximum interval read values:", len(data[data["INTERVAL_READ"] == data["INTERVAL_READ"].max()]))
Number of rows with minimum interval read values: 72
Number of rows with maximum interval read values: 1
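Since a comparison against a column yields a series of True and False values, and True counts as 1 in a sum, we can get the same counts by summing the boolean mask directly. A sketch:
# Summing a boolean mask counts the rows where the condition is True
print((data["INTERVAL_READ"] == data["INTERVAL_READ"].min()).sum())
print((data["INTERVAL_READ"] == data["INTERVAL_READ"].max()).sum())
72
1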
Grouping data to calculate statistics per smart meter
Up to this point we have been calculating statistics across the entire dataset. This may be useful in some limited circumstances, but recall that the dataset consists of data from 15 smart meters over the course of three years. In many cases it will be more useful to calculate statistics per meter. One way to do that would have been to read and analyze the data for each meter individually, one file at a time. However, data such as these may often be streamed from multiple meters or other instruments into a single source or database. In these or similar cases it can be more efficient to group data according to specific characteristics.
Our dataset includes a METER_FID field that specifies the identifier of the smart meter from which the data were captured. We can use Pandas’ groupby() method to group data on this field. First we can inspect the number of meters represented in the dataset using the unique() function.
print(data["METER_FID"].unique())
[285 10063 44440 4348 45013 32366 24197 18918 35034 42755 25188 29752 20967 12289 8078]
The output is an array of the unique values in the METER_FID column. We can group the data by these values to essentially create 15 logical subsets of data, with one subset per meter.
meter_data_groups = data.groupby("METER_FID")
print(meter_data_groups.groups.keys())
dict_keys([285, 4348, 8078, 10063, 12289, 18918, 20967, 24197, 25188, 29752, 32366, 35034, 42755, 44440, 45013])
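As an aside, if we only need the number of distinct meters rather than the group keys themselves, Pandas also provides the nunique() function. A sketch:
# Count the number of distinct METER_FID values
print(data["METER_FID"].nunique())
15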
We can use the key or the name of the group to reference a specific group using the get_group() method. Because a group in Pandas is a dataframe, we can operate against groups using the same methods available to any Pandas dataframe.
print(type(meter_data_groups.get_group(285)))
<class 'pandas.core.frame.DataFrame'>
Above, we used the max() function to identify the maximum INTERVAL_READ value within the entire dataset. Using get_group(), we can select the maximum value and other statistics using similar syntax.
print("Maximum value for meter ID 285:", meter_data_groups.get_group(285)["INTERVAL_READ"].max())
print("Sum of values for meter ID 285:", meter_data_groups.get_group(285)["INTERVAL_READ"].sum())
Maximum value for meter ID 285: 2.0526
Sum of values for meter ID 285: 13407.0322
Importantly, we can calculate statistics across all groups at once.
meter_data_groups['INTERVAL_READ'].sum()
METER_FID
285 13407.0322
4348 9854.7946
8078 16979.6884
10063 22628.9866
12289 21175.5478
18918 13790.9044
20967 20488.7761
24197 16401.9134
25188 28962.4688
29752 67684.5526
32366 25832.2660
35034 22653.0976
42755 21882.2350
44440 46950.2104
45013 17184.9206
Name: INTERVAL_READ, dtype: float64
Finally, to calculate multiple statistics we pass a list of the statistics we’re interested in as an argument to the agg() method.
print(meter_data_groups['INTERVAL_READ'].agg([np.sum, np.amax, np.amin, np.mean, np.std]))
sum amax amin mean std
METER_FID
285 13407.0322 2.0526 0.0 0.127671 0.188185
4348 9854.7946 1.4202 0.0 0.093844 0.093136
8078 16979.6884 1.4202 0.0 0.161693 0.158175
10063 22628.9866 2.2284 0.0 0.215490 0.234565
12289 21175.5478 1.4928 0.0 0.201649 0.125956
18918 13790.9044 1.8240 0.0 0.131327 0.153955
20967 20488.7761 1.2522 0.0 0.195109 0.127863
24197 16401.9134 1.4574 0.0 0.156191 0.117501
25188 28962.4688 2.2116 0.0 0.275802 0.259377
29752 67684.5526 3.7092 0.0 0.644541 0.779913
32366 25832.2660 1.5744 0.0 0.245993 0.213024
35034 22653.0976 2.2038 0.0 0.215719 0.157792
42755 21882.2350 2.1780 0.0 0.208378 0.151845
44440 46950.2104 2.2098 0.0 0.447094 0.334872
45013 17184.9206 1.6278 0.0 0.163647 0.148538
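Alternatively, the describe() method we used earlier also works on grouped data, returning the full set of default statistics for each meter. A sketch:
# One row of descriptive statistics per meter
print(meter_data_groups["INTERVAL_READ"].describe())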
Key Points
Use the Pandas describe() method to generate descriptive statistics for a dataframe.
Creating a Datetime Index with Pandas
Overview
Teaching: 40 min
Exercises: 20 min
Questions
How can we visualize trends in time series?
Objectives
Use Pandas datetime indexing to subset time series data.
Resample and plot time series statistics.
Calculate and plot rolling averages.
So far we have been interacting with the time series data in ways that are not very different from other tabular datasets. However, time series data can contain unique features that make it difficult to analyze aggregate statistics. For example, the sum of power consumption as measured by a single smart meter over the course of three years may be useful in a limited context, but if we want to analyze the data in order to make predictions about likely demand within one or more households at a given time, we need to account for time features including:
- trends
- seasonality
In this episode, we will use Pandas’ datetime indexing features to illustrate these features.
Setting a datetime index
To begin with we will demonstrate datetime indexing with data from a single meter. First, let’s import the necessary libraries.
import pandas as pd
import matplotlib.pyplot as plt
import glob
import numpy as np
Create the file list as before, but this time read only the first file.
file_list = glob.glob("../data/*.csv")
data = pd.read_csv(file_list[0])
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105012 entries, 0 to 105011
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 105012 non-null object
1 METER_FID 105012 non-null int64
2 START_READ 105012 non-null float64
3 END_READ 105012 non-null float64
4 INTERVAL_READ 105012 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 4.0+ MB
None
We can plot the INTERVAL_READ values without making any changes to the data. In the image below, we observe what may be trends or seasonality. But we are only able to infer this based on our understanding of the data - note that the labels on the X axis in the image are the row index positions, not date information.
data["INTERVAL_READ"].plot()
The plot is difficult to read because it is dense. Recall that meter readings are taken every 15 minutes, for a total in the case of this meter of 105,012 readings across the three years of data included in the dataset. In the plot above, we have asked Pandas to plot the value of each one of those readings - that’s a lot of points!
In order to reduce the amount of information and inspect the data for trends and seasonality, the first thing we need to do is set the index of the dataframe to use the INTERVAL_TIME column as a datetime index.
data.set_index(pd.to_datetime(data['INTERVAL_TIME']), inplace=True)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 105012 entries, 2017-01-01 00:00:00 to 2019-12-31 23:45:00
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 105012 non-null object
1 METER_FID 105012 non-null int64
2 START_READ 105012 non-null float64
3 END_READ 105012 non-null float64
4 INTERVAL_READ 105012 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 4.8+ MB
None
Note that the output of the info()
function indicates the index is now a DatetimeIndex. However, INTERVAL_TIME is still listed as a column, with a data type of object (string).
print(data.head())
INTERVAL_TIME METER_FID START_READ END_READ \
INTERVAL_TIME
2017-01-01 00:00:00 2017-01-01 00:00:00 285 14951.787 14968.082
2017-01-01 00:15:00 2017-01-01 00:15:00 285 14968.082 14979.831
2017-01-01 00:30:00 2017-01-01 00:30:00 285 14968.082 14979.831
2017-01-01 00:45:00 2017-01-01 00:45:00 285 14968.082 14979.831
2017-01-01 01:00:00 2017-01-01 01:00:00 285 14968.082 14979.831
INTERVAL_READ
INTERVAL_TIME
2017-01-01 00:00:00 0.0744
2017-01-01 00:15:00 0.0762
2017-01-01 00:30:00 0.1050
2017-01-01 00:45:00 0.0636
2017-01-01 01:00:00 0.0870
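As an aside, when we know in advance which column holds the timestamps, we can parse the dates and set the index in a single step while reading the file. A minimal sketch, using the same file list as above:
# Parse INTERVAL_TIME as datetime and use it as the index during the read
data = pd.read_csv(file_list[0], parse_dates=["INTERVAL_TIME"], index_col="INTERVAL_TIME")
Because the column becomes the index, this approach also avoids carrying a duplicate INTERVAL_TIME column of type object alongside it.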
Slicing data by date
Now that we have a datetime index for our dataframe, we can use date and time components as labels for label based indexing and slicing. We can use all or part of the datetime information, depending on how we want to subset the data.
To select data for a single day, we can slice using the date.
one_day_data = data.loc["2017-08-01"].copy()
print(one_day_data.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 96 entries, 2017-08-01 00:00:00 to 2017-08-01 23:45:00
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 96 non-null object
1 METER_FID 96 non-null int64
2 START_READ 96 non-null float64
3 END_READ 96 non-null float64
4 INTERVAL_READ 96 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 4.5+ KB
None
The info()
output above shows that the subset includes 96 readings for a single day, taken between 00:00 (12:00 AM) and 23:45 (11:45 PM). This is expected, based on the structure of the data.
The head()
command outputs the first five rows of data.
print(one_day_data.head())
INTERVAL_TIME METER_FID START_READ END_READ \
INTERVAL_TIME
2017-08-01 00:00:00 2017-08-01 00:00:00 285 17563.273 17572.024
2017-08-01 00:15:00 2017-08-01 00:15:00 285 17572.024 17577.448
2017-08-01 00:30:00 2017-08-01 00:30:00 285 17572.024 17577.448
2017-08-01 00:45:00 2017-08-01 00:45:00 285 17572.024 17577.448
2017-08-01 01:00:00 2017-08-01 01:00:00 285 17572.024 17577.448
INTERVAL_READ
INTERVAL_TIME
2017-08-01 00:00:00 0.0264
2017-08-01 00:15:00 0.0432
2017-08-01 00:30:00 0.0432
2017-08-01 00:45:00 0.0264
2017-08-01 01:00:00 0.0582
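Plotting a single day of readings is a quick way to inspect the daily usage profile. A minimal sketch using the subset created above:
# Plot the 96 fifteen-minute readings for August 1, 2017
one_day_data["INTERVAL_READ"].plot()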
We can include timestamps in the label to subset the data to specific portions of a day. For example, we can inspect power consumption from midnight (00:00) through 5:00 PM (17:00). In this case we specify a range for the slice, using a start and an end label.
print(data.loc["2017-08-01 00:00:00": "2017-08-01 17:00:00"])
INTERVAL_TIME METER_FID START_READ END_READ \
INTERVAL_TIME
2017-08-01 00:00:00 2017-08-01 00:00:00 285 17563.273 17572.024
2017-08-01 00:15:00 2017-08-01 00:15:00 285 17572.024 17577.448
2017-08-01 00:30:00 2017-08-01 00:30:00 285 17572.024 17577.448
2017-08-01 00:45:00 2017-08-01 00:45:00 285 17572.024 17577.448
2017-08-01 01:00:00 2017-08-01 01:00:00 285 17572.024 17577.448
... ... ... ... ...
2017-08-01 16:00:00 2017-08-01 16:00:00 285 17572.024 17577.448
2017-08-01 16:15:00 2017-08-01 16:15:00 285 17572.024 17577.448
2017-08-01 16:30:00 2017-08-01 16:30:00 285 17572.024 17577.448
2017-08-01 16:45:00 2017-08-01 16:45:00 285 17572.024 17577.448
2017-08-01 17:00:00 2017-08-01 17:00:00 285 17572.024 17577.448
INTERVAL_READ
INTERVAL_TIME
2017-08-01 00:00:00 0.0264
2017-08-01 00:15:00 0.0432
2017-08-01 00:30:00 0.0432
2017-08-01 00:45:00 0.0264
2017-08-01 01:00:00 0.0582
... ...
2017-08-01 16:00:00 0.0540
2017-08-01 16:15:00 0.0288
2017-08-01 16:30:00 0.0462
2017-08-01 16:45:00 0.0360
2017-08-01 17:00:00 0.0378
[69 rows x 5 columns]
We can also slice by month or year. For example, subsetting all of the data from 2018 requires only the year as the index label.
print(data.loc["2018"].info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35036 entries, 2018-01-01 00:00:00 to 2018-12-31 23:45:00
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 35036 non-null object
1 METER_FID 35036 non-null int64
2 START_READ 35036 non-null float64
3 END_READ 35036 non-null float64
4 INTERVAL_READ 35036 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 1.6+ MB
None
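Slicing by month works the same way; we include both the year and the month in the label. A quick sketch (the month chosen here is arbitrary):
# Partial string indexing on year and month
print(data.loc["2018-06"].info())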
The following would subset the data to include days that occurred during the northern hemisphere’s winter of 2018-2019.
print(data.loc["2018-12-21": "2019-03-21"].info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8732 entries, 2018-12-21 00:00:00 to 2019-03-21 23:45:00
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 8732 non-null object
1 METER_FID 8732 non-null int64
2 START_READ 8732 non-null float64
3 END_READ 8732 non-null float64
4 INTERVAL_READ 8732 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 409.3+ KB
None
Resampling time series
Referring to the above slice of data from December 21, 2018 through March 21, 2019, we can also plot the meter reading values. The result is still noisy: the plot includes 8,732 values taken at 15 minute intervals during that time period, which is a lot of data to put into a single plot.
data.loc["2018-12-21": "2019-03-21"]["INTERVAL_READ"].plot()
Alternatively, we can resample the data to convert readings to a different frequency. That is, instead of a reading every 15 minutes, we can resample to aggregate or group meter readings by hour, day, month, etc. The resample()
method has one required argument, which specifies the frequency at which the data should be resampled. For example, to group the data by day, we pass an uppercase “D” for this argument.
daily_data = data.resample("D")
print(type(daily_data))
<class 'pandas.core.resample.DatetimeIndexResampler'>
Note that the data type of the resampled data is a DatetimeIndexResampler. This is similar to a groupby object in Pandas, and can be accessed using many of the same functions and methods. Previously, we used the max()
and min()
functions to retrieve maximum and minimum meter readings from the entire dataset. Having resampled the data, we can now retrieve these values per day. Be sure to specify the column against which we want to perform these operations.
print(daily_data['INTERVAL_READ'].min())
INTERVAL_TIME
2017-01-01 0.0282
2017-01-02 0.0366
2017-01-03 0.0300
2017-01-04 0.0288
2017-01-05 0.0288
...
2019-12-27 0.0324
2019-12-28 0.0444
2019-12-29 0.0546
2019-12-30 0.0366
2019-12-31 0.0276
Freq: D, Name: INTERVAL_READ, Length: 1095, dtype: float64
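Because the resampled object behaves like a groupby object, we can also pass a list of statistics to agg() to calculate several daily values at once. A sketch:
# Calculate several statistics per day in a single call
print(daily_data['INTERVAL_READ'].agg(['min', 'max', 'mean']).head())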
We can also plot the results of operations, for example the daily total power consumption.
daily_data["INTERVAL_READ"].sum().plot()
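The same pattern applies to other frequencies. For example, we could resample to monthly totals; note that depending on your version of Pandas the month-end alias is either “M” or “ME”.
# Resample to monthly totals and plot the result
data.resample("M")["INTERVAL_READ"].sum().plot()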
Calculating and plotting rolling averages
Resampling allows us to reduce some of the noise in our plots by aggregating data to a different time frequency, making trends in the data more evident. One way we can further amplify trends in time series is by calculating and plotting rolling averages. When we resample data, statistics are aggregated to the period represented by the resampling frequency. In our example above, the data were resampled per day, so our minimum and maximum values, or any other statistics, are calculated per day. By contrast, rolling means allow us to calculate averages across days.
To demonstrate, we will resample our December 21, 2018 - March 21, 2019 subset and compare daily total power consumption with a rolling average of power consumption per week.
winter_18 = data.loc["2018-12-21": "2019-03-21"].copy()
print(winter_18.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8732 entries, 2018-12-21 00:00:00 to 2019-03-21 23:45:00
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INTERVAL_TIME 8732 non-null object
1 METER_FID 8732 non-null int64
2 START_READ 8732 non-null float64
3 END_READ 8732 non-null float64
4 INTERVAL_READ 8732 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 409.3+ KB
None
Next, we resample the subset to a daily frequency.
winter_18_daily = winter_18.resample("D")
With the resampled data, we can calculate daily total power consumption using the sum()
function. The head()
command is used below to view the first 14 rows of output.
print(winter_18_daily["INTERVAL_READ"].sum().head(14))
INTERVAL_TIME
2018-12-21 4.2462
2018-12-22 2.8758
2018-12-23 2.7906
2018-12-24 2.7930
2018-12-25 2.6748
2018-12-26 13.7778
2018-12-27 9.3912
2018-12-28 7.4262
2018-12-29 7.9764
2018-12-30 8.9406
2018-12-31 7.3674
2019-01-01 7.5324
2019-01-02 10.2534
2019-01-03 6.8544
Freq: D, Name: INTERVAL_READ, dtype: float64
To calculate a rolling seven day statistic, we use the Pandas rolling()
method. The window argument specifies how many timesteps the rolling statistic is calculated across. Since our data have been resampled to a daily frequency, a single timestep is one day. Passing a value of 7 to the window argument means that we will calculate a statistic across every seven days, or weekly. In order to calculate averages, the result of the rolling()
method is then passed to the mean()
function. For the sake of comparison we will also use the head()
function to output the first 14 rows of the result.
print(winter_18_daily["INTERVAL_READ"].sum().rolling(window=7).mean().head(14))
INTERVAL_TIME
2018-12-21 NaN
2018-12-22 NaN
2018-12-23 NaN
2018-12-24 NaN
2018-12-25 NaN
2018-12-26 NaN
2018-12-27 5.507057
2018-12-28 5.961343
2018-12-29 6.690000
2018-12-30 7.568571
2018-12-31 8.222057
2019-01-01 8.916000
2019-01-02 8.412514
2019-01-03 8.050114
Freq: D, Name: INTERVAL_READ, dtype: float64
Note that the first six rows of output are not a number, or NaN. This is because setting a window of 7 for our rolling average meant that there were not enough days from which to calculate an average between December 21 and 26. Beginning with December 27, a seven day average can be calculated using the daily total power consumption between December 21 and 27. On December 28, the average is calculated using the daily total power consumption between December 22 and 28, etc. We can verify this by manually calculating the 7 day average between December 21 and 27 and comparing it with the value given above:
print((4.2462 + 2.8758 + 2.7906 + 2.7930 + 2.6748 + 13.7778 + 9.3912) / 7)
5.507057142857143
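If the leading NaN values are a problem, the rolling() method also accepts a min_periods argument that sets the minimum number of observations needed to produce a value. A sketch: with min_periods=1, the early rows are averaged over however many days are available so far.
# Allow partial windows at the start of the series
print(winter_18_daily["INTERVAL_READ"].sum().rolling(window=7, min_periods=1).mean().head(7))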
Plotting the daily total and the weekly average together demonstrates how the rolling average smooths the data and can make trends more apparent.
fig, ax = plt.subplots()
ax.plot(winter_18_daily["INTERVAL_READ"].sum(), label="Daily total")
ax.plot(winter_18_daily["INTERVAL_READ"].sum().rolling(window=7).mean(), label="7 day rolling average")
ax.legend(loc=2)
ax.set_xlabel("Time")
ax.set_ylabel("Power consumption")
plt.xticks(rotation=45)
plt.tight_layout()
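If you are working in a script rather than a notebook, you may also need to explicitly display the figure:
plt.show()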
Key Points
Use
resample()
on a datetime index to calculate aggregate statistics across time series.