
Subsetting Dataframes with Pandas

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How can we filter and subset data in Python?

Objectives
  • Understand how Pandas indexes data.

  • Subset dataframes using index positions and filtering criteria.

Pandas is designed to work with large datasets. Multiple indexing features allow for quick subsetting and filtering of data.

Rows and columns

The axes of a dataframe are the rows and columns. Both row and column labels are indexed when a dataframe is created, such as when reading a CSV file into memory. This allows for quick selection of specific rows, columns, or cells, either by slicing or by subsetting the dataframe according to filtering criteria.

It’s important to keep this terminology clear, because in addition to indexing rows and column labels, Pandas creates an index of row labels for every dataframe. This index can be specified when the dataframe is created, or alternatively Pandas will create an integer based index of row labels by default.
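As a quick illustration of this difference (a toy example that is not part of the meter dataset; the column name and labels are invented), the sketch below builds one dataframe with the default integer index and another with an explicit index supplied through the index argument.

import pandas as pd   # included so this snippet stands alone; the lesson loads pandas just below

# default: Pandas assigns integer row labels 0, 1, 2, ...
default_index = pd.DataFrame({"reading": [0.07, 0.11, 0.09]})
print(default_index.index)

# explicit: row labels are supplied when the dataframe is created
custom_index = pd.DataFrame({"reading": [0.07, 0.11, 0.09]}, index=["a", "b", "c"])
print(custom_index.index)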

We can see the default index in action by reading our dataset into a dataframe. First, let’s load the libraries we will use for this episode.

import pandas as pd
import glob

Next we use the glob() function to create a list of files. Then we use the same process as before to read all of the files into a single large dataframe. We will also use the info() function to look at the structure of our dataframe.

file_list = glob.glob("../data/*.csv")
data = pd.concat([pd.read_csv(f) for f in file_list])
print(data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1575180 entries, 0 to 105011
Data columns (total 5 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   INTERVAL_TIME  1575180 non-null  object 
 1   METER_FID      1575180 non-null  int64  
 2   START_READ     1575180 non-null  float64
 3   END_READ       1575180 non-null  float64
 4   INTERVAL_READ  1575180 non-null  float64
dtypes: float64(3), int64(1), object(1)
memory usage: 72.1+ MB
None

We can use the axes attribute to inspect the row and column indices. The output gives the row index first, and the column index second.

print(data.axes)
[Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            105002, 105003, 105004, 105005, 105006, 105007, 105008, 105009,
            105010, 105011],
           dtype='int64', length=1575180), Index(['INTERVAL_TIME', 'METER_FID', 'START_READ', 'END_READ',
       'INTERVAL_READ'],
      dtype='object')]

The output above is a list, and the row index of our dataframe is the first object in the list:

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            105002, 105003, 105004, 105005, 105006, 105007, 105008, 105009,
            105010, 105011],
           dtype='int64', length=1575180)

This indicates that our rows are indexed or labeled using incremented integers. Because an Int64Index lists every label explicitly, the values shown are the actual row labels: the first row is labeled 0 and the last row is labeled 105011.

The final row index number may be surprising. The output of the info() function above indicates that our dataframe has 1,575,180 rows. But here the last row has an index number of 105011. Why is this?

When we created our dataframe, we did so by reading all of the files in our file list and concatenating them into a single dataframe. Each file was first read into its own dataframe before concatenation, which means that each of the 15 dataframes had its own row index, beginning with 0 and ending at whatever the last row label was for that dataframe. When dataframes are concatenated, the original row index numbers are preserved by default. Because of this, many thousands of row index numbers are common across the files, resulting in duplicate row labels in the concatenated dataframe.
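To see this behaviour in miniature (a sketch using two small throwaway dataframes rather than the meter data; the values and column name are invented), note how the original labels survive concatenation and how a duplicated label then matches more than one row.

part_one = pd.DataFrame({"value": [1, 2]})   # row labels 0, 1
part_two = pd.DataFrame({"value": [3, 4]})   # row labels 0, 1 again
combined = pd.concat([part_one, part_two])
print(combined.index)    # the labels 0 and 1 each appear twice
print(combined.loc[0])   # .loc[0] now returns two rows, one from each original dataframe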

To make the most efficient use of Pandas in this case, we can reset the index of the dataframe. After resetting the index, we will check the axes attribute again.

data.reset_index(inplace=True, drop=True)
print(data.axes)
[RangeIndex(start=0, stop=1575180, step=1), Index(['INTERVAL_TIME', 'METER_FID', 'START_READ', 'END_READ',
       'INTERVAL_READ'],
      dtype='object')]

Note that this time the index is identified as a RangeIndex, rather than the Int64Index that was output before. Instead of listing every integer label, the range is given. This means that our rows are labeled with incremented integers, beginning with the first row labeled 0 and the last row labeled 1575179. Recall that Python uses zero-indexing, so the stop value in the RangeIndex should be understood as “up to but not including 1575180.” We can confirm the number of rows by referring to the output of the info() function above, which reported 1575180 entries.
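As an aside (not a step in the lesson), the duplicate labels could also have been avoided when the dataframes were combined: pd.concat() accepts an ignore_index argument that builds a fresh RangeIndex for the result, which has the same effect as the reset_index() call above. A sketch, using data_alt as a throwaway variable name:

# alternative approach: discard the per-file labels while concatenating
data_alt = pd.concat([pd.read_csv(f) for f in file_list], ignore_index=True)
print(data_alt.index)   # RangeIndex(start=0, stop=1575180, step=1)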

The second object in the list output by printing the dataframe’s axes attribute is the column index. By default, the items in this index will be the column names, which can also be understood as the column axis labels:

Index(['INTERVAL_TIME', 'METER_FID', 'START_READ', 'END_READ',
       'INTERVAL_READ'],
      dtype='object')

We can see the row labels using the head() function. Note that there is no column name for the row index. This is because there was no source column in our CSV file that the row labels refer to. We will update the attributes of the row index below.

print(data.head())
         INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
0  2017-01-01 00:00:00        285   14951.787  14968.082         0.0744
1  2017-01-01 00:15:00        285   14968.082  14979.831         0.0762
2  2017-01-01 00:30:00        285   14968.082  14979.831         0.1050
3  2017-01-01 00:45:00        285   14968.082  14979.831         0.0636
4  2017-01-01 01:00:00        285   14968.082  14979.831         0.0870

Selecting Specific Columns

If we want to select all of the values in a single column, we can use the column name.

print(data["INTERVAL_TIME"])
0          2017-01-01 00:00:00
1          2017-01-01 00:15:00
2          2017-01-01 00:30:00
3          2017-01-01 00:45:00
4          2017-01-01 01:00:00
                  ...         
1575175    2019-12-31 22:45:00
1575176    2019-12-31 23:00:00
1575177    2019-12-31 23:15:00
1575178    2019-12-31 23:30:00
1575179    2019-12-31 23:45:00
Name: INTERVAL_TIME, Length: 1575180, dtype: object

In order to select multiple columns, we need to provide the column names as a list.

print(data[["METER_FID", "INTERVAL_TIME"]])
         METER_FID        INTERVAL_TIME
0              285  2017-01-01 00:00:00
1              285  2017-01-01 00:15:00
2              285  2017-01-01 00:30:00
3              285  2017-01-01 00:45:00
4              285  2017-01-01 01:00:00
...            ...                  ...
1575175       8078  2019-12-31 22:45:00
1575176       8078  2019-12-31 23:00:00
1575177       8078  2019-12-31 23:15:00
1575178       8078  2019-12-31 23:30:00
1575179       8078  2019-12-31 23:45:00

Note that all of our output includes row labels.
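One detail worth noting in passing (an aside to the lesson text): selecting a single column with single square brackets returns a Series, while passing a list of column names returns a dataframe, even when the list has only one item. We can check this with the type() function.

print(type(data["INTERVAL_TIME"]))     # pandas Series
print(type(data[["INTERVAL_TIME"]]))   # pandas DataFrame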

We can request attributes of these column subsets or perform operations on them, just as we would on a full dataframe.

print(data["INTERVAL_TIME"].shape)
(1575180,)
print(data["INTERVAL_READ"].sum())
365877.39449999994

Challenge: Find the Maximum Value of a Column

In addition to getting the sum of values from a specific column, Pandas has functions for generating other statistics. These include min() for the minimum value within a column and max() for the maximum value.

Which of the below lines of code would give us the maximum values of both the “START_READ” and “END_READ” columns?

A. print(data[START_READ, END_READ].max())
B. print(data["START_READ", "END_READ"].max())
C. print(data[[START_READ, END_READ]].max())
D. print(data[["START_READ", "END_READ"]].max())

Solution

Option D prints out the maximum value of the columns.

Challenge: Perform Operations on Specific Data Types

In addition to selecting columns by name, Pandas has a select_dtypes() method that lets us select columns of a specific data type.

Which of the options below will print out the sum of each column with a float data type?

A. print(data.select_dtypes(float).sum())
B. print(data.select_dtypes([["START_READ", "END_READ", "INTERVAL_READ"]]).sum())
C. print(data[["START_READ", "END_READ", "INTERVAL_READ"]].sum())
D. print(data.sum(select_dtypes(float)))

Solution

Both A and C will output the correct result. However, option C requires us to already know which columns hold float data.
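If we are unsure which columns select_dtypes() will pick up, we can inspect the selection before summing (a small sketch extending the solution; the expected columns follow from the info() output earlier).

float_columns = data.select_dtypes(float)
print(float_columns.columns)   # expect START_READ, END_READ, INTERVAL_READ
print(float_columns.sum())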

Selecting Specific Rows

Subsets of rows of a Pandas dataframe can be selected in several different ways. The first two methods we’ll look at, .loc and .iloc indexing, have some subtle differences that are worth exploring.

.loc indexing will select rows with specific labels. Recall from earlier that when we read our CSV file into a dataframe, a row index was created which used integers as the label for each row. We can see information about the row index using the dataframe’s index attribute.

print(data.index)
RangeIndex(start=0, stop=1575180, step=1)

To select a specific row using .loc, we need to know its label in the index. In this case, the labels start at 0 and go up to 1575179, so if we want to select the first row we use the first index label. In this case that is 0.

print(data.loc[0])
INTERVAL_TIME    2017-01-01 00:00:00
METER_FID                        285
START_READ                 14951.787
END_READ                   14968.082
INTERVAL_READ                 0.0744
Name: 0, dtype: object

Note that above we said the label of the last row is 1575179, even though the stop value of the index attribute is 1575180. That is because default row indexing uses zero-indexing, which is common for Python data structures. The stop value given above should be understood as up to but not including. We can demonstrate this by trying to use the label 1575180 to select a row:

print("Index error:")
print(data.loc[1575180])
Index error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\range.py:391, in RangeIndex.get_loc(self, key, method, tolerance)
    390 try:
--> 391     return self._range.index(new_key)
    392 except ValueError as err:

ValueError: 1575180 is not in range

The error message in this case means that we tried to select a row using a label that is not in the index.

To print the actual last row, we provide a row index number that is one less than the stop value:

print("Actual last row:")
print(data.loc[1575179])
Actual last row:
INTERVAL_TIME    2019-12-31 23:45:00
METER_FID                       8078
START_READ                 40684.475
END_READ                    40695.86
INTERVAL_READ                  0.087
Name: 1575179, dtype: object

.iloc is used to select a row or subset of rows based on integer position. This method also uses zero-indexing, so positions range from 0 to one less than the number of rows in the dataframe. The first row has a position of 0, the second row a position of 1, and so on.

print(data.iloc[0])
INTERVAL_TIME    2017-01-01 00:00:00
METER_FID                        285
START_READ                 14951.787
END_READ                   14968.082
INTERVAL_READ                 0.0744
Name: 0, dtype: object

As above, we know there are 1575180 rows in our dataset, but zero-indexing means that the position of the last row is 1575179. If we try to select a row using the position 1575180, we again get an error, this time an IndexError.

print(data.iloc[1575180])
IndexError: single positional indexer is out-of-bounds

We can also select rows using their position relative to the last row. If we want to select the last row without already knowing how many rows are in the dataframe, we can refer to its position using [-1].

print(data.iloc[-1])
INTERVAL_TIME    2019-12-31 23:45:00
METER_FID                       8078
START_READ                 40684.475
END_READ                    40695.86
INTERVAL_READ                  0.087
Name: 1575179, dtype: object

An alternative, more roundabout way is to use the len() function. Above, we noted that position integers will range from 0 to 1 less than the number of rows in the dataframe. In combination with the len() function, we can select the last row in a dataframe using:

print(data.iloc[len(data) - 1])
INTERVAL_TIME    2019-12-31 23:45:00
METER_FID                       8078
START_READ                 40684.475
END_READ                    40695.86
INTERVAL_READ                  0.087
Name: 1575179, dtype: object
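As a brief aside, Pandas also provides the tail() method, which returns the last n rows as a dataframe. tail(1) is a convenient sanity check that we have reached the same row as data.iloc[-1]:

# returns the last row as a one-row dataframe rather than a Series
print(data.tail(1))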

Challenge: Selecting Cells

Given the lines of code below, put them in the correct order to read the data file ladpu_smart_meter_data_01 and print the starting and ending meter readings.

print(data.iloc[0]["START_READ"])

data = pd.read_csv("../data/ladpu_smart_meter_data_01.csv")

print(data.iloc[-1]["END_READ"])

Solution

data = pd.read_csv("../data/ladpu_smart_meter_data_01.csv")
print(data.iloc[0]["START_READ"])
print(data.iloc[-1]["END_READ"])

Slicing Data

So far we have used label and position based indexing to select single rows from a dataframe. We can select larger subsets using index slicing. Because this is such a common operation, we don’t have to use .loc or .iloc for a simple row slice.

print(data[1:10])
         INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
1  2017-01-01 00:15:00        285   14968.082  14979.831         0.0762
2  2017-01-01 00:30:00        285   14968.082  14979.831         0.1050
3  2017-01-01 00:45:00        285   14968.082  14979.831         0.0636
4  2017-01-01 01:00:00        285   14968.082  14979.831         0.0870
5  2017-01-01 01:15:00        285   14968.082  14979.831         0.0858
6  2017-01-01 01:30:00        285   14968.082  14979.831         0.0750
7  2017-01-01 01:45:00        285   14968.082  14979.831         0.0816
8  2017-01-01 02:00:00        285   14968.082  14979.831         0.0966
9  2017-01-01 02:15:00        285   14968.082  14979.831         0.0720

When only a single colon is used in the square brackets, the integer on the left indicates the starting position. The integer on the right indicates the ending position, but here again note that the returned rows will be up to but not including the row with the specified position.

By default, all rows between the starting and ending position will be returned. We can also specify a number of rows to increment over, using a second colon followed by the interval of rows to use. For example, if we want to return every other row out of the first twenty rows in the dataset, we would use the following:

print(data[0:20:2])
          INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
0   2017-01-01 00:00:00        285   14951.787  14968.082         0.0744
2   2017-01-01 00:30:00        285   14968.082  14979.831         0.1050
4   2017-01-01 01:00:00        285   14968.082  14979.831         0.0870
6   2017-01-01 01:30:00        285   14968.082  14979.831         0.0750
8   2017-01-01 02:00:00        285   14968.082  14979.831         0.0966
10  2017-01-01 02:30:00        285   14968.082  14979.831         0.0798
12  2017-01-01 03:00:00        285   14968.082  14979.831         0.0660
14  2017-01-01 03:30:00        285   14968.082  14979.831         0.0846
16  2017-01-01 04:00:00        285   14968.082  14979.831         0.0912
18  2017-01-01 04:30:00        285   14968.082  14979.831         0.0720

If we don’t specify a starting or ending position, Python will default to the first and last positions, respectively. The following will output the first ten rows.

print(data[:10])
         INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
0  2017-01-01 00:00:00        285   14951.787  14968.082         0.0744
1  2017-01-01 00:15:00        285   14968.082  14979.831         0.0762
2  2017-01-01 00:30:00        285   14968.082  14979.831         0.1050
3  2017-01-01 00:45:00        285   14968.082  14979.831         0.0636
4  2017-01-01 01:00:00        285   14968.082  14979.831         0.0870
5  2017-01-01 01:15:00        285   14968.082  14979.831         0.0858
6  2017-01-01 01:30:00        285   14968.082  14979.831         0.0750
7  2017-01-01 01:45:00        285   14968.082  14979.831         0.0816
8  2017-01-01 02:00:00        285   14968.082  14979.831         0.0966
9  2017-01-01 02:15:00        285   14968.082  14979.831         0.0720

We can use a negative position index to return the last ten rows.

print(data[-10:])
               INTERVAL_TIME  METER_FID  START_READ  END_READ  INTERVAL_READ
1575170  2019-12-31 21:30:00       8078   40684.475  40695.86         0.1410
1575171  2019-12-31 21:45:00       8078   40684.475  40695.86         0.1098
1575172  2019-12-31 22:00:00       8078   40684.475  40695.86         0.1020
1575173  2019-12-31 22:15:00       8078   40684.475  40695.86         0.1140
1575174  2019-12-31 22:30:00       8078   40684.475  40695.86         0.1098
1575175  2019-12-31 22:45:00       8078   40684.475  40695.86         0.1284
1575176  2019-12-31 23:00:00       8078   40684.475  40695.86         0.1194
1575177  2019-12-31 23:15:00       8078   40684.475  40695.86         0.1062
1575178  2019-12-31 23:30:00       8078   40684.475  40695.86         0.1344
1575179  2019-12-31 23:45:00       8078   40684.475  40695.86         0.0870

This works the same with position based indexing. It can also work with label based indexing, depending on the labels used. Keep in mind that a .loc slice includes the row with the end label, while an .iloc slice stops before the end position; a short check of this difference follows the examples below.

print(data.iloc[2:12:2])
          INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
2   2017-01-01 00:30:00        285   14968.082  14979.831         0.1050
4   2017-01-01 01:00:00        285   14968.082  14979.831         0.0870
6   2017-01-01 01:30:00        285   14968.082  14979.831         0.0750
8   2017-01-01 02:00:00        285   14968.082  14979.831         0.0966
10  2017-01-01 02:30:00        285   14968.082  14979.831         0.0798
print(data.loc[3:19:3])
          INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
3   2017-01-01 00:45:00        285   14968.082  14979.831         0.0636
6   2017-01-01 01:30:00        285   14968.082  14979.831         0.0750
9   2017-01-01 02:15:00        285   14968.082  14979.831         0.0720
12  2017-01-01 03:00:00        285   14968.082  14979.831         0.0660
15  2017-01-01 03:45:00        285   14968.082  14979.831         0.0726
18  2017-01-01 04:30:00        285   14968.082  14979.831         0.0720
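As noted above, the two slicing styles differ at the end of the range. A quick check using row counts (a small sketch printing lengths only, to keep the output short):

print(len(data.iloc[0:4]))   # 4 rows: positions 0, 1, 2 and 3
print(len(data.loc[0:4]))    # 5 rows: labels 0 through 4 inclusive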

We can select the values of single cells or column and row subsets by combining the methods used so far. First we specify the row index to use, then the column.

print(data.iloc[0]['INTERVAL_READ'])
0.0744

We can select multiple rows and/or columns. Note that selecting multiple columns requires us to put them into a list.

print(data.iloc[:10][['INTERVAL_TIME', 'INTERVAL_READ']])
         INTERVAL_TIME  INTERVAL_READ
0  2017-01-01 00:00:00         0.0744
1  2017-01-01 00:15:00         0.0762
2  2017-01-01 00:30:00         0.1050
3  2017-01-01 00:45:00         0.0636
4  2017-01-01 01:00:00         0.0870
5  2017-01-01 01:15:00         0.0858
6  2017-01-01 01:30:00         0.0750
7  2017-01-01 01:45:00         0.0816
8  2017-01-01 02:00:00         0.0966
9  2017-01-01 02:15:00         0.0720

We can also use position indexing to select columns, with the same slicing syntax as above. For example, to select the first ten rows of the first two columns:

print(data.iloc[1:10, 0:2])
         INTERVAL_TIME  METER_FID
1  2017-01-01 00:15:00        285
2  2017-01-01 00:30:00        285
3  2017-01-01 00:45:00        285
4  2017-01-01 01:00:00        285
5  2017-01-01 01:15:00        285
6  2017-01-01 01:30:00        285
7  2017-01-01 01:45:00        285
8  2017-01-01 02:00:00        285
9  2017-01-01 02:15:00        285
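The same selection can be written with .loc, using row labels and a list of column names instead of positions. Because a .loc slice includes the end label, the label slice 1:9 matches the position slice 1:10 used above (a sketch equivalent to the .iloc example):

# rows labeled 1 through 9 (inclusive), two named columns
print(data.loc[1:9, ["INTERVAL_TIME", "METER_FID"]])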

Challenge: Subsetting

The frequency at which meter readings were taken means that a single day’s worth of data consists of 96 rows. Which of the following lines of code would we use to select daily start and ending meter readings, plus the date for each day?

A. data.loc[::96][['START_READ', 'END_READ', 'INTERVAL_TIME']]
B. data.loc[0:96:96]['START_READ', 'END_READ', 'INTERVAL_TIME']
C. data.loc[::96]['START_READ', 'END_READ', 'INTERVAL_TIME']
D. data.loc[:-1:96][['START_READ', 'END_READ', 'INTERVAL_TIME']]

Solution

Option A returns the first row of the specified columns for each day.

Conditions

There are several ways to subset a dataframe by using conditions to select rows based on the values of specific variables. One commonly used method is boolean indexing, which returns rows for which a given condition evaluates to True. The following example tests whether the values of the “INTERVAL_READ” variable are greater than 0.16.

print(data["INTERVAL_READ"] > 0.16)
0          False
1          False
2          False
3          False
4          False
           ...  
1575175    False
1575176    False
1575177    False
1575178    False
1575179    False
Name: INTERVAL_READ, Length: 1575180, dtype: bool

Note that to see the actual rows for which the condition evaluates to True, we pass the condition itself as the indexer.

print(data[data['INTERVAL_READ'] > 0.16])
               INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
32       2017-01-01 08:00:00        285   14968.082  14979.831         0.2304
33       2017-01-01 08:15:00        285   14968.082  14979.831         0.1854
34       2017-01-01 08:30:00        285   14968.082  14979.831         0.1878
35       2017-01-01 08:45:00        285   14968.082  14979.831         0.1920
36       2017-01-01 09:00:00        285   14968.082  14979.831         0.1710
...                      ...        ...         ...        ...            ...
1575153  2019-12-31 17:15:00       8078   40684.475  40695.860         0.2616
1575154  2019-12-31 17:30:00       8078   40684.475  40695.860         0.4152
1575155  2019-12-31 17:45:00       8078   40684.475  40695.860         0.4950
1575156  2019-12-31 18:00:00       8078   40684.475  40695.860         0.3192
1575158  2019-12-31 18:30:00       8078   40684.475  40695.860         0.1980

[703542 rows x 5 columns]

The column used for the condition does not have to be included in the subset. Selecting rows and columns at the same time in this way requires label based indexing with .loc.

print(data.loc[data["INTERVAL_READ"] > 0.26, ['METER_FID', 'INTERVAL_TIME']])
         METER_FID        INTERVAL_TIME
66             285  2017-01-01 16:30:00
77             285  2017-01-01 19:15:00
134            285  2017-01-02 09:30:00
139            285  2017-01-02 10:45:00
141            285  2017-01-02 11:15:00
...            ...                  ...
1575152       8078  2019-12-31 17:00:00
1575153       8078  2019-12-31 17:15:00
1575154       8078  2019-12-31 17:30:00
1575155       8078  2019-12-31 17:45:00
1575156       8078  2019-12-31 18:00:00

[373641 rows x 2 columns]

It is possible to specify multiple conditions. When combining conditions with boolean indexing, each condition must be wrapped in parentheses, and the conditions are joined with & (and) or | (or) rather than Python’s and and or keywords.

print(data[(data["INTERVAL_READ"] > 0.16) & (data["METER_FID"] == 285)].head())
          INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
32  2017-01-01 08:00:00        285   14968.082  14979.831         0.2304
33  2017-01-01 08:15:00        285   14968.082  14979.831         0.1854
34  2017-01-01 08:30:00        285   14968.082  14979.831         0.1878
35  2017-01-01 08:45:00        285   14968.082  14979.831         0.1920
36  2017-01-01 09:00:00        285   14968.082  14979.831         0.1710
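Conditions can also be combined with | (or). For example, the sketch below keeps the readings from either of two meters; both meter numbers appear in earlier output, and the result itself is not shown here.

two_meters = data[(data["METER_FID"] == 285) | (data["METER_FID"] == 8078)]
print(two_meters["METER_FID"].unique())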

The query() method provides similar functionality with a simplified syntax. Using query(), the earlier example that combined two conditions with & becomes:

print(data.query('INTERVAL_READ > 0.16 and METER_FID == 285').head())
          INTERVAL_TIME  METER_FID  START_READ   END_READ  INTERVAL_READ
32  2017-01-01 08:00:00        285   14968.082  14979.831         0.2304
33  2017-01-01 08:15:00        285   14968.082  14979.831         0.1854
34  2017-01-01 08:30:00        285   14968.082  14979.831         0.1878
35  2017-01-01 08:45:00        285   14968.082  14979.831         0.1920
36  2017-01-01 09:00:00        285   14968.082  14979.831         0.1710

Note that in this case the conditions are passed to the query() method as a string. This requires us to pay close attention to consistent use of single and double quotes.
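A related convenience (an aside; the variable name here is invented): values held in Python variables can be referenced inside a query() string with an @ prefix, which sidesteps some of these quoting issues. Because the threshold value below is the same 0.16 used earlier, the result matches the previous query() example.

threshold = 0.16
print(data.query("INTERVAL_READ > @threshold and METER_FID == 285").head())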

We can also select specific columns when using the query() method.

print(data.query('INTERVAL_READ > 0.16 and METER_FID == 285')[["METER_FID", "INTERVAL_TIME"]].head())
    METER_FID        INTERVAL_TIME
32        285  2017-01-01 08:00:00
33        285  2017-01-01 08:15:00
34        285  2017-01-01 08:30:00
35        285  2017-01-01 08:45:00
36        285  2017-01-01 09:00:00

Challenge: Select Intervals with Above Average Power Consumption

Fill in the blanks below to select the “INTERVAL_TIME”, “INTERVAL_READ”, and “METER_FID” for all rows for which the value of the “INTERVAL_READ” variable is above the average value for that variable in the entire dataset.

data.___[data["_________"] ___ data["INTERVAL_READ"].mean(), ["___________", "___________", "METER_FID"]]

Solution

data.loc[data["INTERVAL_READ"] > data["INTERVAL_READ"].mean(), ["INTERVAL_TIME", "INTERVAL_READ", "METER_FID"]]

Key Points

  • Rows can be selected in Pandas using label or position based indexing.

  • Boolean indexing and the query() method can be used to select subsets of a dataframe using conditions.