This lesson is in the early stages of development (Alpha version)

Python for Humanities

Short Introduction to Programming in Python

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is Python?

  • Why should I learn Python?

Objectives
  • Describe the advantages of using programming vs. completing repetitive tasks by hand.

  • Define the following data types in Python: strings, integers, and floats.

  • Perform mathematical operations in Python using basic operators.

  • Define the following as it relates to Python: lists, tuples, and dictionaries.

The Basics of Python

Python is a general purpose programming language that supports rapid development of scripts and applications.

Python’s main advantages:

Interpreter

Python is an interpreted language which can be used in two ways:

user:host:~$ python
Python 3.5.1 (default, Oct 23 2015, 18:05:06)
[GCC 4.8.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 2 + 2
4
>>> print("Hello World")
Hello World
user:host:~$ python my_script.py
Hello World

Introduction to Python built-in data types

Strings, integers and floats

Python has built-in numeric types for integers, floats, and complex numbers. Strings are a built-in textual type.:

text = "Data Carpentry"
number = 42
pi_value = 3.1415

Here we’ve assigned data to variables, namely text, number and pi_value, using the assignment operator =. The variable called text is a string which means it can contain letters and numbers. Notice that in order to define a string you need to have quotes around your text. To print out the value stored in a variable we can simply type the name of the variable into the interpreter:

>>> text
"Data Carpentry"

However, in a script, a print function is needed to output the text:

example.py

# A Python script file
# Comments in Python start with #
# The next line uses the print function to print out the text string
print(text)

Running the script

$ python example.py
Data Carpentry

Tip: The print function is a built-in function in Python. Later in this lesson, we will introduce methods and user-defined functions. The Python documentation is excellent for reference on the differences between them.

Operators

We can perform mathematical calculations in Python using the basic operators +, -, /, *, %:

>>> 2 + 2   #  addition
4
>>> 6 * 7   #  multiplication
42
>>> 2 ** 16  # power
65536
>>> 13 % 5  # modulo
3

We can also use comparison and logic operators: <, >, ==, !=, <=, >= and statements of identity such as and, or, not. The data type returned by this is called a boolean.

>>> 3 > 4
False
>>> True and True
True
>>> True or False
True

Sequential types: Lists and Tuples

Lists

Lists are a common data structure to hold an ordered sequence of elements. Each element can be accessed by an index. Note that Python indexes start with 0 instead of 1:

>>> numbers = [1, 2, 3]
>>> numbers[0]
1

A for loop can be used to access the elements in a list or other Python data structure one at a time:

>>> for num in numbers:
...     print(num)
...
1
2
3
>>>

Indentation is very important in Python. Note that the second line in the example above is indented. Just like three chevrons >>> indicate an interactive prompt in Python, the three dots ... are Python’s prompt for multiple lines. This is Python’s way of marking a block of code. [Note: you do not type >>> or ....]

To add elements to the end of a list, we can use the append method. Methods are a way to interact with an object (a list, for example). We can invoke a method using the dot . followed by the method name and a list of arguments in parentheses. Let’s look at an example using append:

>>> numbers.append(4)
>>> print(numbers)
[1, 2, 3, 4]
>>>

To find out what methods are available for an object, we can use the built-in help command:

help(numbers)

Help on list object:

class list(object)
 |  list() -> new empty list
 |  list(iterable) -> new list initialized from iterable's items
 ...

Tuples

A tuple is similar to a list in that it’s an ordered sequence of elements. However, tuples can not be changed once created (they are “immutable”). Tuples are created by placing comma-separated values inside parentheses ().

# tuples use parentheses
a_tuple= (1, 2, 3)
another_tuple = ('blue', 'green', 'red')
# Note: lists use square brackets
a_list = [1, 2, 3]

Challenge - Tuples

  1. What happens when you type a_tuple[2]=5 vs a_list[1]=5 ?
  2. Type type(a_tuple) into python - what is the object type?

Dictionaries

A dictionary is a container that holds pairs of objects - keys and values.

>>> translation = {'one': 1, 'two': 2}
>>> translation['one']
1

Dictionaries work a lot like lists - except that you index them with keys. You can think about a key as a name for or a unique identifier for a set of values in the dictionary. Keys can only have particular types - they have to be “hashable”. Strings and numeric types are acceptable, but lists aren’t.

>>> rev = {1: 'one', 2: 'two'}
>>> rev[1]
'one'
>>> bad = {[1, 2, 3]: 3}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

In Python, a “Traceback” is an multi-line error block printed out for the user.

To add an item to the dictionary we assign a value to a new key:

>>> rev = {1: 'one', 2: 'two'}
>>> rev[3] = 'three'
>>> rev
{1: 'one', 2: 'two', 3: 'three'}

Using for loops with dictionaries is a little more complicated. We can do this in two ways:

>>> for key, value in rev.items():
...     print(key, '->', value)
...
1 -> one
2 -> two
3 -> three

or

>>> for key in rev.keys():
...     print(key, '->', rev[key])
...
1 -> one
2 -> two
3 -> three
>>>

Challenge - Can you do reassignment in a dictionary?

  1. First check what rev is right now (remember rev is the name of our dictionary).

    Type:

    >>> rev
    
  2. Try to reassign the second value (in the key value pair) so that it no longer reads “two” but instead reads “apple-sauce”.

  3. Now display rev again to see if it has changed.

It is important to note that dictionaries are “unordered” and do not remember the sequence of their items (i.e. the order in which key:value pairs were added to the dictionary). Because of this, the order in which items are returned from loops over dictionaries might appear random and can even change with time.

Functions

Defining a section of code as a function in Python is done using the def keyword. For example a function that takes two arguments and returns their sum can be defined as:

def add_function(a, b):
    result = a + b
    return result

z = add_function(20, 22)
print(z)
42

Key points about functions are:

Key Points


Starting With Data

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How can I import data in Python?

  • What is Pandas?

  • Why should I use Pandas to work with data?

Objectives
  • Navigate the workshop directory and download a dataset.

  • Explain what a library is and what libraries are used for.

  • Describe what the Python Data Analysis Library (Pandas) is.

  • Load the Python Data Analysis Library (Pandas).

  • Use read_csv to read tabular data into Python.

  • Describe what a DataFrame is in Python.

  • Access and summarize data stored in a DataFrame.

  • Define indexing as it relates to data structures.

  • Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.

  • Create simple plots.

Working With Pandas DataFrames in Python

We can automate the process above using Python. It’s efficient to spend time building the code to perform these tasks because once it’s built, we can use it over and over on different datasets that use a similar format. This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.

Starting in the same spot

To help the lesson run smoothly, let’s ensure everyone is in the same directory. This should help us avoid path and file name issues. At this time please navigate to the workshop directory. If you working in IPython Notebook be sure that you start your notebook in the workshop directory.

A quick aside that there are Python libraries like OS Library that can work with our directory structure, however, that is not our focus today.

Our Data

For this lesson, we will be using the EEBO catalogue data, a subset of the data from EEBO/TCP Early English Books Online/Text Creation Partnership

We will be using files from the Data repository. This section will use the eebo.csv file that can be downloaded here: https://raw.githubusercontent.com/iaine/python-humanities-lesson/gh-pages/data/eebo.csv

We are studying the authors and titles published marked up by the Text Creation Partnership. The dataset is stored as a .csv file: each row holds information for a single title, and the columns represent:

Column Description
TCP TCP identity
EEBO EEBO identity
VID VID identity
STC STC identity
status Whether the book is free or not
Author Author(s)
Date Date of publication
Title The Book title
Terms Terms associated with the text
Page Count Number of pages in the text
Place Location where published

The first few rows of our first file look like this:

TCP,EEBO,VID,STC,Status,Author,Date,Title,Terms,Page Count,Place
A00002,99850634,15849,STC 1000.5; ESTC S115415,Free,"Aylett, Robert, 1583-1655?",1625,"The brides ornaments viz. fiue meditations, morall and diuine. 1. Knowledge, 2. zeale, 3. temperance, 4. bountie, 5. ioy.",,134,London
A00005,99842408,7058,STC 10000; ESTC S106695,Free,"Higden, Ranulf, d. 1364. Polycronicon. English. Selections.; Trevisa, John, d. 1402.",1515,Here begynneth a shorte and abreue table on the Cronycles ...; Saint Albans chronicle.,Great Britain -- History -- To 1485 -- Early works to 1800.; England -- Description and travel -- Early works to 1800.,302,London
A00007,99844302,9101,STC 10002; ESTC S108645,Free,"Higden, Ranulf, d. 1364. Polycronicon.",1528,"The Cronycles of Englonde with the dedes of popes and emperours, and also the descripcyon of Englonde; Saint Albans chronicle.",Great Britain -- History -- To 1485 -- Early works to 1800.; England -- Description and travel -- Early works to 1800.,386,London
A00008,99848896,14017,STC 10003; ESTC S113665,Free,"Wood, William, fl. 1623, attributed name.",1623,Considerations vpon the treaty of marriage between England and Spain,Great Britain -- Foreign relations -- Spain.,14,The Netherlands?
A00011,99837000,1304,STC 10008; ESTC S101178,Free,,1640,"Englands complaint to Iesus Christ, against the bishops canons of the late sinfull synod, a seditious conuenticle, a packe of hypocrites, a sworne confederacy, a traiterous conspiracy ... In this complaint are specified those impieties and insolencies, which are most notorious, scattered through the canons and constitutions of the said sinfull synod. And confuted by arguments annexed hereunto.",Church of England. -- Thirty-nine Articles -- Controversial literature.; Canon law -- Early works to 1800.,54,Amsterdam
A00012,99853871,19269,STC 1001; ESTC S118664,Free,"Aylett, Robert, 1583-1655?",1623,"Ioseph, or, Pharoah's fauourite; Joseph.",Joseph -- (Son of Jacob) -- Early works to 1800.,99,London
A00014,33143147,28259,STC 10011.6; ESTC S3200,Free,,1624,Greate Brittaines noble and worthy councell of warr,"England and Wales. -- Privy Council -- Portraits.; Great Britain -- History -- James I, 1603-1625.; Broadsides -- London (England) -- 17th century.",1,London
A00015,99837006,1310,STC 10011; ESTC S101184,Free,"Jones, William, of Usk.",1607,"Gods vvarning to his people of England By the great ouer-flowing of the vvaters or floudes lately hapned in South-wales and many other places. Wherein is described the great losses, and wonderfull damages, that hapned thereby: by the drowning of many townes and villages, to the vtter vndooing of many thousandes of people.",Floods -- Wales -- Early works to 1800.,16, London
A00018,99850740,15965,STC 10015; ESTC S115521,Free,,1558,The lame[n]tacion of England; Lamentacion of England.,"Great Britain -- History -- Mary I, 1553-1558 -- Early works to 1800.",26,Germany?

About Libraries

A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up, it can be used or called to perform many tasks.

Pandas in Python

One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high quality plots with matplotlib and integrates nicely with other libraries that use NumPy (which is another Python library) arrays.

Python doesn’t load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax import libraryName. If we want to give the library a nickname to shorten the command, we can add as nickNameHere. An example of importing the pandas library using the common nickname pd is below.

import pandas as pd

Each time we call a function that’s in a library, we use the syntax LibraryName.FunctionName. Adding the library name with a . before the function name tells Python where to find the function. In the example above, we have imported Pandas as pd. This means we don’t have to type out pandas each time we call a Pandas function.

Reading CSV Data Using Pandas

We will begin by locating and reading our survey data which are in CSV format. We can use Pandas’ read_csv function to pull the file directly into a DataFrame.

So What’s a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet or an SQL table or the data.frame in R. A DataFrame always has an index (0-based). An index refers to the position of an element in the data structure.

# note that pd.read_csv is used because we imported pandas as pd
pd.read_csv("eebo.csv")

The above command yields the output below:

          TCP        EEBO     VID  \
0      A00002  99850634.0   15849   
1      A00005  99842408.0    7058   
2      A00007  99844302.0    9101   
3      A00008  99848896.0   14017   
4      A00011  99837000.0    1304
...
146  A00525  99856552        ...                854   Prentyd  London
147  A00527  99849909        ...                 72            London
148  A00535  99849912        ...                106        Saint-Omer 
[148 rows x 11 columns]

We can see that there were 149 rows parsed. Each row has 11 columns. The first column is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame. It looks like the read_csv function in Pandas read our file properly. However, we haven’t saved any data to memory so we can work with it.We need to assign the DataFrame to a variable. Remember that a variable is a name for a value, such as x, or data. We can create a new object with a variable name by assigning a value to it using =.

Let’s call the imported survey data tcp_df:

tcp_df = pd.read_csv("eebo.csv")

Notice when you assign the imported DataFrame to a variable, Python does not produce any output on the screen. We can print the value of the tcp_df object by typing its name into the Python command prompt.

tcp_df

which prints contents like above

Manipulating Our Index Data

Now we can start manipulating our data. First, let’s check the data type of the data stored in tcp_df using the type method. The type method and __class__ attribute tell us that tcp_df is <class 'pandas.core.frame.DataFrame'>.

type(tcp_df)
# this does the same thing as the above!
tcp_df.__class__

We can also enter tcp_df.dtypes at our prompt to view the data type for each column in our DataFrame. int64 represents numeric integer values - int64 cells can not store decimals. object represents strings (letters and numbers). float64 represents numbers with decimals.

tcp_df.dtypes

which returns:

TCP         object
EEBO        float64
VID         object
STC         object
Status      object
Author      object
Date        object
Title       object
Terms       object
Page Count  int64
dtype: object

We’ll talk a bit more about what the different formats mean in a different lesson.

Useful Ways to View DataFrame objects in Python

There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.

To access an attribute, use the DataFrame object name followed by the attribute name df_object.attribute. Using the DataFrame tcp_df and attribute columns, an index of all the column names in the DataFrame can be accessed with tcp_df.columns.

Methods are called in a similar fashion using the syntax df_object.method(). As an example, tcp_df.head() gets the first few rows in the DataFrame tcp_df using the head() method. With a method, we can supply extra information in the parens to control behaviour.

Let’s look at the data using these.

Challenge - DataFrames

Using our DataFrame tcp_df, try out the attributes & methods below to see what they return.

  1. tcp_df.columns
  2. tcp_df.shape Take note of the output of shape - what format does it return the shape of the DataFrame in?

    HINT: More on tuples, here.

  3. tcp_df.head() Also, what does tcp_df.head(15) do?
  4. tcp_df.tail()

Calculating Statistics From Data In A Pandas DataFrame

We’ve read our data into Python. Next, let’s perform some quick summary statistics to learn more about the data that we’re working with. We might want to know how many animals were collected in each plot, or how many of each species were caught. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by.

Let’s begin by exploring our data:

# Look at the column names
tcp_df.columns.values

which returns:

array(['TCP', 'EEBO', 'VID', 'STC', 'Status', 'Author', 'Date', 'Title',
       'Terms', 'Page Count', 'Place'], dtype=object)

Let’s get a list of all the page counts. The pd.unique function tells us all of the unique values in the Pages column.

pd.unique(tcp_df['Page Count'])

which returns:

array([134, 302, 386,  14,  54,  99,   1,  16,  26,  62,  50,  66,  30,
         6,  36,   8,  12,  24,  22,   7,  20,  40,  38,  13,  28,  10,
        23,   2, 112,  18,   4,  27,  42,  17,  46,  58, 200, 158,  65,
        96, 178,  52, 774,  81, 392,  74, 162,  56, 100, 172,  94,  79,
       107,  48, 102, 343, 136,  70, 156, 133, 228, 357, 110,  72,  44,
        43,  37,  98, 566, 500, 746, 884, 254, 618, 274, 188, 374,  47,
        34, 177,  82,  78,  64, 124,  80, 108, 182, 120,  68, 854, 106])

Challenge - Statistics

  1. Create a list of unique locations found in the index data. Call it places. How many unique location are there in the data? How many unique species are in the data?

  2. What is the difference between len(places) and tcp_df['Place'].nunique()?

Groups in Pandas

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per plot.

We can calculate basic statistics for all records in a single column using the syntax below:

tcp_df['Page Count'].describe()

gives output

count    149.000000
mean     104.382550
std      160.125398
min        1.000000
25%       16.000000
50%       52.000000
75%      108.000000
max      884.000000
Name: Page Count, dtype: float64

We can also extract one specific metric if we wish:

tcp_df['Page Count'].min()
tcp_df['Page Count'].max()
tcp_df['Page Count'].mean()
tcp_df['Page Count'].std()
tcp_df['Page Count'].count()

But if we want to summarize by one or more variables, for example sex, we can use Pandas’ .groupby method. Once we’ve created a groupby DataFrame, we can quickly calculate summary statistics by a group of our choice.

# Group data by status
grouped_data = tcp_df.groupby('Place')

The pandas function describe will return descriptive stats including: mean, median, max, min, std and count for a particular column in the data. Pandas’ describe function will only return summary values for columns containing numeric data.

# summary statistics for all numeric columns by place
grouped_data.describe()
# provide the mean for each numeric column by place
grouped_data.mean()

grouped_data.mean() OUTPUT:

                                       EEBO     ...      Page Count
Place                                           ...                
Amsterdam                      9.983700e+07     ...       54.000000
Antverpi                       9.983759e+07     ...       34.000000
Antwerp                        9.985185e+07     ...       40.000000
Cambridge                      2.445926e+07     ...        1.000000
Emden                          9.984713e+07     ...       58.000000

The groupby command is powerful in that it allows us to quickly generate summary stats.

Challenge - Summary Data

  1. What is the mean page length for books published in Amsterdam and how many for London
  2. What happens when you group by two columns using the following syntax and then grab mean values:
    • grouped_data2 = tcp_df.groupby(['EEBO','Page Count'])
    • grouped_data2.mean()
  3. Summarize the Date values in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data tcp_df['Page Count'].describe()

Did you get #3 right?

A Snippet of the Output from challenge 3 looks like:

	
	      count    149.000000
	      mean    1584.288591
	      std       36.158864
	      min     1515.000000
	      25%     1552.000000
	      50%     1583.000000
	      75%     1618.000000
	      max     1640.000000
         ...

Quickly Creating Summary Counts in Pandas

Let’s next count the number of samples for each author. We can do this in a few ways, but we’ll use groupby combined with a count() method.

# count the number of texts by authors
author_counts = tcp_df.groupby('Author')['EEBO'].count()
print(author_counts)

Or, we can also count just the rows that have the author “A. B.”:

tcp_df.groupby('Author')['EEBO'].count()['Aylett, Robert, 1583-1655?']

Challenge - Make a list

What’s another way to create a list of authors and associated count of the records in the data? Hint: you can perform count, min, etc functions on groupby DataFrames in the same way you can perform them on regular DataFrames.

Basic Math Functions

If we wanted to, we could perform math on an entire column of our data. For example let’s multiply all weight values by 2. A more practical use of this might be to normalize the data according to a mean, area, or some other value calculated from our data.

# multiply all page length values by 2
tcp_df['Page Count']*2

Quick & Easy Plotting Data Using Pandas

We can plot our summary stats using Pandas, too.

# make sure figures appear inline in Ipython Notebook
%matplotlib inline
# create a quick bar chart
author_counts.plot(kind='bar')

Weight by Species Plot Weight by species plot

We can also look at how many animals were captured in each plot:

total_count = tcp_df.groupby('Terms')['Place'].nunique()
# let's plot that too
total_count.plot(kind='bar')

Challenge - Plots

  1. Create a plot of Authors across all Places per plot.

Summary Plotting Challenge

Create a stacked bar plot, with page count on the Y axis, and the stacked variable being author. The plot should show total weight by sex for each plot. Some tips are below to help you solve this challenge:

  • For more on Pandas plots, visit this link.
  • You can use the code that follows to create a stacked bar plot but the data to stack need to be in individual columns. Here’s a simple example with some data where ‘a’, ‘b’, and ‘c’ are the groups, and ‘one’ and ‘two’ are the subgroups.
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)

shows the following data

      one  two
  a    1    1
  b    2    2
  c    3    3
  d  NaN    4

We can plot the above with

# plot stacked data so columns 'one' and 'two' are stacked
my_df = pd.DataFrame(d)
my_df.plot(kind='bar',stacked=True,title="The title of my graph")

Stacked Bar Plot

  • You can use the .unstack() method to transform grouped data into columns for each plotting. Try running .unstack() on some DataFrames above and see what it yields.

Start by transforming the grouped data (by plot and sex) into an unstacked layout, then create a stacked plot.

Solution to Summary Challenge

First we group data by plot and by sex, and then calculate a total for each plot.

tcp_author = tcp_df.groupby(['Place','Author'])
author_count = tcp_author['Page Count'].sum()

This calculates the sums for each author for each author as a table

plot  sex
plot_id  sex
Place                          Author  
Antverpi                       Evans, Lewis, fl. 1574.              34
Antwerp                        Verstegan, Richard, ca. 1550-1640.   40
<other plots removed for brevity>

Below we’ll use .unstack() on our grouped data to figure out the total weight that each sex contributed to each plot.

by_plot_sex = tcp_df.groupby(['plot_id','sex'])
plot_sex_count = by_plot_sex['weight'].sum()
plot_sex_count.unstack()

The unstack function above will display the following output:

sex          F      M
plot_id              
1        38253  59979
2        50144  57250
3        27251  28253
4        39796  49377
<other plots removed for brevity>

Now, create a stacked bar plot with that data where the weights for each sex are stacked by plot.

Rather than display it as a table, we can plot the above data by stacking the values of each sex as follows:

by_plot_sex = tcp_df.groupby(['plot_id','sex'])
plot_sex_count = by_plot_sex['weight'].sum()
spc = plot_sex_count.unstack()
s_plot = spc.plot(kind='bar',stacked=True,title="Total weight by plot and sex")
s_plot.set_ylabel("Weight")
s_plot.set_xlabel("Plot")

Stacked Bar Plot

Key Points


Indexing, Slicing and Subsetting DataFrames in Python

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How can I access specific data within my data set?

  • How can Python and Pandas help me to analyse my data?

Objectives
  • Describe what 0-based indexing is.

  • Manipulate and extract data using column headings and index locations.

  • Employ slicing to select sets of data from a DataFrame.

  • Employ label and integer-based indexing to select ranges of data in a dataframe.

  • Reassign values within subsets of a DataFrame.

  • Create a copy of a DataFrame.

  • Query /select a subset of data using a set of criteria using the following operators: =, !=, >, <, >=, <=.

  • Locate subsets of data using masks.

  • Describe BOOLEAN objects in Python and manipulate data using BOOLEANs.

In lesson 01, we read a CSV into a Python pandas DataFrame. We learned:

In this lesson, we will explore ways to access different parts of the data using:

Loading our data

We will continue to use the surveys dataset that we worked with in the last lesson. Let’s reopen and read in the data again:

# Make sure pandas is loaded
import pandas as pd

# read in the survey csv
authors_df = pd.read_csv("eebo.csv")

Indexing and Slicing in Python

We often want to work with subsets of a DataFrame object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific x,y index locations.

Selecting data using Labels (Column Headings)

We use square brackets [] to select a subset of an Python object. For example, we can select all data from a column named species_id from the surveys_df DataFrame by name. There are two ways to do this:

# Method 1: select a 'subset' of the data using the column name
authors_df['Place']

# Method 2: use the column name as an 'attribute'; gives the same output
authors_df.Place

We can also create a new object that contains only the data within the Status column as follows:

# creates an object, texts_species, that only contains the `status_id` column
texts_status = authors_df['Status']

We can pass a list of column names too, as an index to select columns in that order. This is useful when we need to reorganize our data.

NOTE: If a column name is not contained in the DataFrame, an exception (error) will be raised.

# select the author and EEBO columns from the DataFrame
authors_df[['Author', 'EEBO']]

# what happens when you flip the order?
authors_df[['EEBO', 'Author']]

#what happens if you ask for a column that doesn't exist?
authors_df['Texts']

Extracting Range based Subsets: Slicing

REMINDER: Python Uses 0-based Indexing

Let’s remind ourselves that Python uses 0-based indexing. This means that the first element in an object is located at position

  1. This is different from other tools like R and Matlab that index elements within objects starting at 1.
# Create a list of numbers:
a = [1, 2, 3, 4, 5]

indexing diagram slicing diagram

Challenge - Extracting data

  1. What value does the code below return?

    a[0]
    
  2. How about this:

    a[5]
    
  3. In the example above, calling a[5] returns an error. Why is that?

  4. What about?

    a[len(a)]
    

Slicing Subsets of Rows in Python

Slicing using the [] operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: data[start:stop]. When slicing in pandas the start bound is included in the output. The stop bound is one step BEYOND the row you want to select. So if you want to select rows 0, 1 and 2 your code would look like this:

# select rows 0, 1, 2 (row 3 is not selected)
authors_df[0:3]

The stop bound in Python is different from what you might be used to in languages like Matlab and R.

# select the first 5 rows (rows 0, 1, 2, 3, 4)
authors_df[:5]

# select the last element in the list
# (the slice starts at the last element,
# and ends at the end of the list)
authors_df[-1:]

We can also reassign values within subsets of our DataFrame.

But before we do that, let’s look at the difference between the concept of copying objects and the concept of referencing objects in Python.

Copying Objects vs Referencing Objects in Python

Let’s start with an example:

# using the 'copy() method'
true_copy_authors_df = authors_df.copy()

# using '=' operator
ref_authors_df = authors_df

You might think that the code ref_authors_df = authors_df creates a fresh distinct copy of the surveys_df DataFrame object. However, using the = operator in the simple statement y = x does not create a copy of our DataFrame. Instead, y = x creates a new variable y that references the same object that x refers to. To state this another way, there is only one object (the DataFrame), and both x and y refer to it.

In contrast, the copy() method for a DataFrame creates a true copy of the DataFrame.

Let’s look at what happens when we reassign the values within a subset of the DataFrame that references another DataFrame object:

    # Assign the value `0` to the first three rows of data in the DataFrame
    ref_authors_df[0:3] = 0
    ```

Let's try the following code:

    ```
   # ref_authors_df was created using the '=' operator
    ref_authors_df.head()

    # surveys_df is the original dataframe
    authors_df.head()

What is the difference between these two dataframes?

When we assigned the first 3 columns the value of 0 using the ref_surveys_df DataFrame, the surveys_df DataFrame is modified too. Remember we created the reference ref_survey_df object above when we did ref_survey_df = surveys_df. Remember surveys_df and ref_surveys_df refer to the same exact DataFrame object. If either one changes the object, the other will see the same changes to the reference object.

To review and recap:

Okay, that’s enough of that. Let’s create a brand new clean dataframe from the original data CSV file.

authors_df = pd.read_csv("eebo.csv")

Slicing Subsets of Rows and Columns in Python

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

To select a subset of rows and columns from our DataFrame, we can use the iloc method. For example, we can select month, day and year (columns 2, 3 and 4 if we start counting at 1), like this:

# iloc[row slicing, column slicing]
authors_df.iloc[0:3, 1:4]

which gives the output

         EEBO    VID                       STC
0  99850634.0  15849  STC 1000.5; ESTC S115415
1  99842408.0   7058   STC 10000; ESTC S106695
2  99844302.0   9101   STC 10002; ESTC S108645

Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.

Let’s explore some other ways to index and select subsets of data:

# select all columns for rows of index values 0 and 10
authors_df.loc[[0, 10], :]

# what does this do?
authors_df.loc[0, ['Author', 'Title', 'Status']]

# What happens when you type the code below?
authors_df.loc[[0, 10, 149], :]

NOTE: Labels must be found in the DataFrame or you will get a KeyError.

Indexing by labels loc differs from indexing by integers iloc. With iloc, the start bound and the stop bound are inclusive. When using loc instead, integers can also be used, but the integers refer to the index label and not the position. For example, using loc and select 1:4 will get a different result than using iloc to select rows 1:4.

We can also select a specific data value using a row and column location within the DataFrame and iloc indexing:

# Syntax for iloc indexing to finding a specific data element
dat.iloc[row, column]

In this iloc example,

authors_df.iloc[2, 6]

gives the output

'1528'

Remember that Python indexing begins at 0. So, the index location [2, 6] selects the element that is 3 rows down and 7 columns over in the DataFrame.

Challenge - Range

  1. What happens when you execute:

    • authors_df[0:1]
    • authors_df[:4]
    • authors_df[:-1]

Subsetting Data using Criteria

We can also select a subset of our data using criteria. For example, we can select all rows that have a status value of Free:

authors_df.loc[authors_df["Terms"].str.contains("sermon", na=False)]

Which produces the following output:

        TCP      EEBO  ...  Page Count   Place
23   A00156  99851064  ...           8  London
27   A00164  99851065  ...           7  London
113  A00426  99857357  ...          72  London
141  A00510  99852090  ...          48  London

[4 rows x 11 columns]

Or we can select all rows with a page length greater than 100:

authors_df[authors_df["Page Count"] > 100]

We can define sets of criteria too:

authors_df[(authors_df.Date >= 1500) & (authors_df.Date <= 1550)]

Python Syntax Cheat Sheet

Use can use the syntax below when querying data by criteria from a DataFrame. Experiment with selecting various subsets of the “surveys” data.

Challenge - Queries

  1. Select a subset of rows in the authors_df DataFrame that contain data from the year 1500 and that contain page count values less than or equal to 8. How many rows did you end up with? What did your neighbor get?

  2. You can use the isin command in Python to query a DataFrame based upon a list of values as follows:

    authors_df[authors_df['Date'].isin([listGoesHere])]
    

Use the isin function to find all plots that contain particular species in the “authors” DataFrame. How many records contain these values?

  1. Experiment with other queries. Create a query that finds all rows with a Page Count value > or equal to 1.

  2. The ~ symbol in Python can be used to return the OPPOSITE of the selection that you specify in Python. It is equivalent to is not in. Write a query that selects all rows with Date NOT equal to 1500 or 1600 in the “authors” data.

Using masks to identify a specific condition

A mask can be useful to locate where a particular subset of values exist or don’t exist - for example, NaN, or “Not a Number” values. To understand masks, we also need to understand BOOLEAN objects in Python.

Boolean values include True or False. For example,

# set x to 5
x = 5

# what does the code below return?
x > 5

# how about this?
x == 5

When we ask Python what the value of x > 5 is, we get False. This is because the condition,x is not greater than 5, is not met since x is equal to 5.

To create a boolean mask:

Let’s try this out. Let’s identify all locations in the survey data that have null (missing or NaN) data values. We can use the isnull method to do this. The isnull method will compare each cell with a null value. If an element has a null value, it will be assigned a value of True in the output object.

pd.isnull(authors_df)

A snippet of the output is below:

         TCP   EEBO    VID    STC  Status  Author   Date  Title  Terms  Pages
0      False  False  False  False   False   False  False  False   True  False
1      False  False  False  False   False   False  False  False  False  False
2      False  False  False  False   False   False  False  False  False  False
3      False  False  False  False   False   False  False  False  False  False

[149 rows x 11 columns]

To select the rows where there are null values, we can use the mask as an index to subset our data as follows:

# To select just the rows with NaN values, we can use the 'any()' method
authors_df[pd.isnull(authors_df).any(axis=1)]

Note that the weight column of our DataFrame contains many null or NaN values. We will explore ways of dealing with this in Lesson 03.

We can run isnull on a particular column too. What does the code below do?

# what does this do?
empty_authors = authors_df[pd.isnull(authors_df['Author'])]['Author']
print(empty_authors)

Let’s take a minute to look at the statement above. We are using the Boolean object pd.isnull(authors_df['Author']) as an index to authors_df. We are asking Python to select rows that have a NaN value of author.

Challenge - Putting it all together

  1. Create a new DataFrame that only contains titles with status values that are not from London. Assign each status value in the new DataFrame to a new value of ‘x’. Determine the number of null values in the subset.

  2. Create a new DataFrame that contains only observations that are of status free and where page count values are greater than 100.

Key Points


Data Types and Formats

Overview

Teaching: 20 min
Exercises: 25 min
Questions
  • What types of data can be contained in a DataFrame?

  • Why is the data type important?

Objectives
  • Describe how information is stored in a Python DataFrame.

  • Define the two main types of data in Python: text and numerics.

  • Examine the structure of a DataFrame.

  • Modify the format of values in a DataFrame.

  • Describe how data types impact operations.

  • Define, manipulate, and interconvert integers and floats in Python.

  • Analyze datasets having missing/null values (NaN values).

The format of individual columns and rows will impact analysis performed on a dataset read into python. For example, you can’t perform mathematical calculations on a string (text formatted data). This might seem obvious, however sometimes numeric values are read into python as strings. In this situation, when you then try to perform calculations on the string-formatted numeric data, you get an error.

In this lesson we will review ways to explore and better understand the structure and format of our data.

Types of Data

How information is stored in a DataFrame or a python object affects what we can do with it and the outputs of calculations as well. There are two main types of data that we’re explore in this lesson: numeric and text data types.

Numeric Data Types

Numeric data types include integers and floats. A floating point (known as a float) number has decimal points even if that decimal point value is 0. For example: 1.13, 2.0 1234.345. If we have a column that contains both integers and floating point numbers, Pandas will assign the entire column to the float data type so the decimal points are not lost.

An integer will never have a decimal point. Thus if we wanted to store 1.13 as an integer it would be stored as 1. Similarly, 1234.345 would be stored as 1234. You will often see the data type Int64 in python which stands for 64 bit integer. The 64 simply refers to the memory allocated to store data in each cell which effectively relates to how many digits it can store in each “cell”. Allocating space ahead of time allows computers to optimize storage and processing efficiency.

Text Data Type

Text data type is known as Strings in Python, or Objects in Pandas. Strings can contain numbers and / or characters. For example, a string might be a word, a sentence, or several sentences. A Pandas object might also be a plot name like ‘plot1’. A string can also contain or consist of numbers. For instance, ‘1234’ could be stored as a string. As could ‘10.23’. However strings that contain numbers can not be used for mathematical operations!

Pandas and base Python use slightly different names for data types. More on this is in the table below:

Pandas Type Native Python Type Description
object string The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings).
int64 int Numeric characters. 64 refers to the memory allocated to hold this character.
float64 float Numeric characters with decimals. If a column contains numbers and NaNs(see below), pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns] N/A (but see the datetime module in Python’s standard library) Values meant to hold time data. Look into these for time series experiments.

Checking the format of our data

Now that we’re armed with a basic understanding of numeric and text data types, let’s explore the format of our survey data. We’ll be working with the same surveys.csv dataset that we’ve used in previous lessons.

# note that pd.read_csv is used because we imported pandas as pd
authors_df = pd.read_csv("eebo.csv")

Remember that we can check the type of an object like this:

type(authors_df)

OUTPUT: pandas.core.frame.DataFrame

Next, let’s look at the structure of our surveys data. In pandas, we can check the type of one column in a DataFrame using the syntax dataFrameName[column_name].dtype:

authors_df['TCP'].dtype

OUTPUT: dtype('O')

A type ‘O’ just stands for “object” which in Pandas’ world is a string (text).

authors_df['EEBO'].dtype

OUTPUT: dtype('int64')

The type int64 tells us that python is storing each value within this column as a 64 bit integer. We can use the dat.dtypes command to view the data type for each column in a DataFrame (all at once).

authors_df.dtypes

which returns:

TCP        object
EEBO       int64
VID        object
STC        object
Status     object
Author     object
Date       object
Title      object
Terms      object
Page Count int64
Place      object
dtype: object

Note that most of the columns in our Survey data are of type object. This means that they are strings. But the EEBO column is a integer value which means it contains whole numbers.

Working With Integers and Floats

So we’ve learned that computers store numbers in one of two ways: as integers or as floating-point numbers (or floats). Integers are the numbers we usually count with. Floats have fractional parts (decimal places). Let’s next consider how the data type can impact mathematical operations on our data. Addition, subtraction, division and multiplication work on floats and integers as we’d expect.

print(5+5)
10

print(24-4)
20

If we divide one integer by another, we get a float. The result on python 3 is different than in python 2, where the result is an integer (integer division).

print(5/9)
0.5555555555555556

print(10/3)
3.3333333333333335

We can also convert a floating point number to an integer or an integer to floating point number. Notice that Python by default rounds down when it converts from floating point to integer.

# convert a to integer
a = 7.83
int(a)
7

# convert to float
b = 7
float(b)
7.0

Working With Our Index Data

Getting back to our data, we can modify the format of values within our data, if we want. For instance, we could convert the EEBO field to integer values.

# convert the record_id field from an integer to a float
authors_df['Page Count'] = authors_df['Page Count'].astype('float64')
authors_df['Page Count'].dtype

OUTPUT: dtype('float64')

Challenge - Changing Types

Try converting the column plot_id to floats using

authors_df.EEBO.astype("float")

Next try converting EEBO to an integer. What goes wrong here? What is Pandas telling you? We will talk about some solutions to this later.

Missing Data Values - NaN

What happened in the last challenge activity? Notice that this throws a value error: ValueError: Cannot convert NA to integer. If we look at the weight column in the surveys data we notice that there are NaN (Not a Number) values. NaN values are undefined values that cannot be represented mathematically. Pandas, for example, will read an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we were to average the weight column without replacing our NaNs, Python would know to skip over those cells.

authors_df['EEBO'].mean()
84280511.94630873

Dealing with missing data values is always a challenge. It’s sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.

For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.

Where Are the NaN’s?

Let’s explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (ie select meaningful weight values):

len(authors_df[pd.isnull(authors_df.EEBO)])
# how many rows have weight values?
len(authors_df[authors_df.EEBO > 0])

We can replace all NaN values with zeroes using the .fillna() method (after making a copy of the data so we don’t lose our work):

df1 = authors_df.copy()
# fill all NaN values with 0
df1['EEBO'] = df1['EEBO'].fillna(0)

However NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.

df1['EEBO'].mean()
52170900.611285985

We can fill NaN values with any value that we chose. The code below fills all NaN values with a mean for all weight values.

 df1['EEBO'] = authors_df['EEBO'].fillna(authors_df['EEBO'].mean())

We could also chose to create a subset of our data, only keeping rows that do not contain NaN values.

The point is to make conscious decisions about how to manage missing data. This is where we think about how our data will be used and how these values will impact the scientific conclusions made from the data.

Python gives us all of the tools that we need to account for these issues. We just need to be cautious about how the decisions that we make impact scientific results.

Challenge - Counting

Count the number of missing values per column. Hint: The method .count() gives you the number of non-NA observations per column. Try looking to the .isnull() method.

Recap

What we’ve learned:

Key Points


Combining DataFrames with pandas

Overview

Teaching: 20 min
Exercises: 25 min
Questions
  • Can I work with data from multiple sources?

  • How can I combine data from different data sets?

Objectives
  • Combine data from multiple files into a single DataFrame using merge and concat.

  • Combine two DataFrames using a unique ID found in both DataFrames.

  • Employ to_csv to export a DataFrame in CSV format.

  • Join DataFrames using common fields (join keys).

In many “real world” situations, the data that we want to use come in multiple files. We often need to combine these files into a single DataFrame to analyze the data. The pandas package provides various methods for combining DataFrames including merge and concat.

To work through the examples below, we first need to load the species and surveys files into pandas DataFrames. In iPython:

import pandas as pd
authors_df = pd.read_csv("authors.csv",
                         keep_default_na=False, na_values=[""])
authors_df

        TCP                                             Author
0    A00002                         Aylett, Robert, 1583-1655?
1    A00005  Higden, Ranulf, d. 1364. Polycronicon. English...
2    A00007             Higden, Ranulf, d. 1364. Polycronicon.
3    A00008          Wood, William, fl. 1623, attributed name.
4    A00011

places_df = pd.read_csv("places.csv",
                         keep_default_na=False, na_values=[""])
places_df
    A00002                         London
0   A00005                         London
1   A00007                         London
2   A00008               The Netherlands?
3   A00011                      Amsterdam
4   A00012                         London
5   A00014                         London

Take note that the read_csv method we used can take some additional options which we didn’t use previously. Many functions in python have a set of options that can be set by the user if needed. In this case, we have told Pandas to assign empty values in our CSV to NaN keep_default_na=False, na_values=[""]. More about all of the read_csv options here.

Concatenating DataFrames

We can use the concat function in Pandas to append either columns or rows from one DataFrame to another. Let’s grab two subsets of our data to see how this works.

# read in first 10 lines of surveys table
place_sub = places_df.head(10)
# grab the last 20 rows 
place_sub_last10 = places_df.tail(20)
#reset the index values to the second dataframe appends properly
place_sub_last10 = place_sub_last10.reset_index(drop=True)
# drop=True option avoids adding new index column with old index values

When we concatenate DataFrames, we need to specify the axis. axis=0 tells Pandas to stack the second DataFrame under the first one. It will automatically detect whether the column names are the same and will stack accordingly. axis=1 will stack the columns in the second DataFrame to the RIGHT of the first DataFrame. To stack the data vertically, we need to make sure we have the same columns and associated column format in both datasets. When we stack horizonally, we want to make sure what we are doing makes sense (ie the data are related in some way).

# stack the DataFrames on top of each other
vertical_stack = pd.concat([place_sub, place_sub_last10], axis=0)

# place the DataFrames side by side
horizontal_stack = pd.concat([place_sub, place_sub_last10], axis=1)

Row Index Values and Concat

Have a look at the vertical_stack dataframe? Notice anything unusual? The row indexes for the two data frames place_sub and place_sub_last10 have been repeated. We can reindex the new dataframe using the reset_index() method.

Writing Out Data to CSV

We can use the to_csv command to do export a DataFrame in CSV format. Note that the code below will by default save the data into the current working directory. We can save it to a different folder by adding the foldername and a slash to the file vertical_stack.to_csv('foldername/out.csv'). We use the ‘index=False’ so that pandas doesn’t include the index number for each line.

# Write DataFrame to CSV
vertical_stack.to_csv('out.csv', index=False)

Check out your working directory to make sure the CSV wrote out properly, and that you can open it! If you want, try to bring it back into python to make sure it imports properly.

# for kicks read our output back into python and make sure all looks good
new_output = pd.read_csv('out.csv', keep_default_na=False, na_values=[""])

Challenge - Combine Data

In the data folder, there are two catalogue data files: 1640.csv and 1641.csv. Read the data into python and combine the files to make one new data frame. Create a plot of average plot weight by year grouped by sex. Export your results as a CSV and make sure it reads back into python properly.

Joining DataFrames

When we concatenated our DataFrames we simply added them to each other - stacking them either vertically or side by side. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called “joining”. The columns containing the common values are called “join key(s)”. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.

NOTE: This process of joining tables is similar to what we do with tables in an SQL database.

The places.csv file is table that contains the place and EEBO id for some titles. When we want to access that information, we can create a query that joins the additional columns of information to the author data.

Storing data in this way has many benefits including:

Identifying join keys

To identify appropriate join keys we first need to know which field(s) are shared between the files (DataFrames). We might inspect both DataFrames to identify these columns. If we are lucky, both DataFrames will have columns with the same name that also contain the same data. If we are less lucky, we need to identify a (differently-named) column in each DataFrame that contains the same information.

>>> authors_df.columns

Index([u'TCP', u'EEBO', u'VID', u'STC', u'Status', u'Author', u'Date', u'Title', u'Terms', u'Pages'], dtype='object')

>>> places_df.columns

Index([u'EEBO', u'Place'], dtype='object')

In our example, the join key is the column containing the identifier, which is called TCP.

Now that we know the fields with the common TCP ID attributes in each DataFrame, we are almost ready to join our data. However, since there are different types of joins, we also need to decide which type of join makes sense for our analysis.

Inner joins

The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.

Inner joins yield a DataFrame that contains only rows where the value being joins exists in BOTH tables. An example of an inner join, adapted from this page is below:

Inner join -- courtesy of codinghorror.com

The pandas function for performing joins is called merge and an Inner join is the default option:

merged_inner = pd.merge(left=authors_df,right=places_df, left_on='TCP', right_on='TCP')
# in this case `species_id` is the only column name in  both dataframes, so if we skippd `left_on`
# and `right_on` arguments we would still get the same result

# what's the size of the output data?
merged_inner.shape
merged_inner

OUTPUT:

      TCP                                             Author             Place
0  A00002                         Aylett, Robert, 1583-1655?            London
1  A00005  Higden, Ranulf, d. 1364. Polycronicon. English...            London
2  A00007             Higden, Ranulf, d. 1364. Polycronicon.            London
3  A00008          Wood, William, fl. 1623, attributed name.  The Netherlands?
4  A00011                                                NaN         Amsterdam

The result of an inner join of authors_df and places_df is a new DataFrame that contains the combined set of columns from those tables. It only contains rows that have two-letter species codes that are the same in both the authos_df and place_df DataFrames. In other words, if a row in authors_df has a value of TCP that does not appear in the TCP column of TCP, it will not be included in the DataFrame returned by an inner join. Similarly, if a row in places_df has a value of TCP that does not appear in the TCP column of places_df, that row will not be included in the DataFrame returned by an inner join.

The two DataFrames that we want to join are passed to the merge function using the left and right argument. The left_on='TCP' argument tells merge to use the TCP column as the join key from places_df (the left DataFrame). Similarly , the right_on='TCP' argument tells merge to use the TCP column as the join key from authors_df (the right DataFrame). For inner joins, the order of the left and right arguments does not matter.

The result merged_inner DataFrame contains all of the columns from authors (TCP, Person) as well as all the columns from places_df (TCP, Place).

Notice that merged_inner has fewer rows than place_sub. This is an indication that there were rows in place_df with value(s) for EEBO that do not exist as value(s) for EEBO in authors_df.

Left joins

What if we want to add information from cat_sub to survey_sub without losing any of the information from survey_sub? In this case, we use a different type of join called a “left outer join”, or a “left join”.

Like an inner join, a left join uses join keys to combine two DataFrames. Unlike an inner join, a left join will return all of the rows from the left DataFrame, even those rows whose join key(s) do not have values in the right DataFrame. Rows in the left DataFrame that are missing values for the join key(s) in the right DataFrame will simply have null (i.e., NaN or None) values for those columns in the resulting joined DataFrame.

Note: a left join will still discard rows from the right DataFrame that do not have values for the join key(s) in the left DataFrame.

Left Join

A left join is performed in pandas by calling the same merge function used for inner join, but using the how='left' argument:

merged_left = pd.merge(left=places_df,right=authors_df, how='left', left_on='TCP', right_on='TCP')
merged_left

**OUTPUT:**
      TCP             Place                                             Author
0  A00002            London                         Aylett, Robert, 1583-1655?
1  A00005            London  Higden, Ranulf, d. 1364. Polycronicon. English...
2  A00007            London             Higden, Ranulf, d. 1364. Polycronicon.
3  A00008  The Netherlands?          Wood, William, fl. 1623, attributed name.
4  A00011         Amsterdam                                                NaN

The result DataFrame from a left join (merged_left) looks very much like the result DataFrame from an inner join (merged_inner) in terms of the columns it contains. However, unlike merged_inner, merged_left contains the same number of rows as the original place_sub DataFrame. When we inspect merged_left, we find there are rows where the information that should have come from authors_df (i.e., Author) is missing (they contain NaN values):

 merged_inner[ pd.isnull(merged_inner.Author) ]
**OUTPUT:**
        TCP Author      Place
4    A00011    NaN  Amsterdam
6    A00014    NaN     London
8    A00018    NaN   Germany?

These rows are the ones where the value of Author from authors_df does not occur in places_df.

Other join types

The pandas merge function supports two other join types:

Final Challenges

Challenge - Distributions

Create a new DataFrame by joining the contents of the eebo.csv and places.csv tables. Then calculate and plot the distribution of:

  1. title by author by place

Key Points


Data workflows and automation

Overview

Teaching: 40 min
Exercises: 50 min
Questions
  • Can I automate operations in Python?

  • What are functions and why should I use them?

Objectives
  • Describe why for loops are used in Python.

  • Employ for loops to automate data analysis.

  • Write unique filenames in Python.

  • Build reusable code in Python.

  • Write functions using conditional statements (if, then, else).

So far, we’ve used Python and the pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The beauty of using a programming language like Python, though, comes from the ability to automate data processing through the use of loops and functions.

For loops

Loops allow us to repeat a workflow (or series of actions) a given number of times or while some condition is true. We would use a loop to automatically process data that’s stored in multiple files (daily values with one file per year, for example). Loops lighten our work load by performing repeated tasks without our direct involvement and make it less likely that we’ll introduce errors by making mistakes while processing each file by hand.

Let’s write a simple for loop that simulates what a kid might see during a visit to the zoo:

>>> animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
>>> print(animals)
['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

>>> for creature in animals:
...    print(creature)
lion
tiger
crocodile
vulture
hippo

The line defining the loop must start with for and end with a colon, and the body of the loop must be indented.

In this example, creature is the loop variable that takes the value of the next entry in animals every time the loop goes around. We can call the loop variable anything we like. After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:

>>> animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
>>> for creature in animals:
...    pass

>>> print('The loop variable is now: ' + creature)
The loop variable is now: hippo

We are not asking python to print the value of the loop variable anymore, but the for loop still runs and the value of creature changes on each pass through the loop. The statement pass in the body of the loop just means “do nothing”.

Challenge - Loops

  1. What happens if we don’t include the pass statement?

  2. Rewrite the loop so that the animals are separated by commas, not new lines (Hint: You can concatenate strings using a plus sign. For example, print(string1 + string2) outputs ‘string1string2’).

Automating data processing using For Loops

The file we’ve been using so far, TCP.csv, contains 25 years of data and is very large. We would like to separate the data for each year into a separate file.

Let’s start by making a new directory inside the folder data to store all of these files using the module os:

    import os

    os.mkdir('data/yearly_files')

The command os.mkdir is equivalent to mkdir in the shell. Just so we are sure, we can check that the new directory was created within the data folder:

>>> os.listdir('data')
['authors.csv', 'yearly_files', 'places.csv', 'eebo.db', 'eebo.csv']

The command os.listdir is equivalent to ls in the shell.

In previous lessons, we saw how to use the library pandas to load the species data into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame into a csv file. Let’s write a script that performs those three steps in sequence for the year 2002:

import pandas as pd

# Load the data into a DataFrame
authors_df = pd.read_csv('data/eebo.csv')

# Select only data for 1636
authors1636 = authors_df[authors_df.Date == "1636"]

# Write the new DataFrame to a csv file
authors1636.to_csv('yearly_files/authors1636.csv')

To create yearly data files, we could repeat the last two commands over and over, once for each year of data. Repeating code is neither elegant nor practical, and is very likely to introduce errors into your code. We want to turn what we’ve just written into a loop that repeats the last two commands for every year in the dataset.

Let’s start by writing a loop that simply prints the names of the files we want to create - the dataset we are using covers 1977 through 2002, and we’ll create a separate file for each of those years. Listing the filenames is a good way to confirm that the loop is behaving as we expect.

We have seen that we can loop over a list of items, so we need a list of years to loop over. We can get the years in our DataFrame with:

>>> authors_df['Date']

0      1625
1      1515
2      1528
3      1623
4      1640
5      1623
    ...    
141    1635
142    1614
143    1589
144    1636
145    1562
146    1533
147    1606
148    1618
Name: Date, Length: 149, dtype: int64

but we want only unique years, which we can get using the unique function which we have already seen.

>>> authors_df['Date'].unique()
array([1625, 1515, 1528, 1623, 1640, 1624, 1607, 1558, 1599, 1622, 1613,
       1600, 1635, 1569, 1579, 1597, 1538, 1559, 1563, 1577, 1580, 1626,
       1631, 1565, 1632, 1571, 1554, 1615, 1549, 1567, 1605, 1636, 1591,
       1588, 1619, 1566, 1593, 1547, 1603, 1609, 1589, 1574, 1584, 1630,
       1621, 1610, 1542, 1534, 1519, 1550, 1540, 1557, 1606, 1545, 1537,
       1532, 1526, 1531, 1533, 1572, 1536, 1529, 1535, 1543, 1586, 1596,
       1552, 1608, 1611, 1616, 1581, 1639, 1570, 1564, 1568, 1602, 1618,
       1583, 1638, 1592, 1544, 1585, 1614, 1562])

Putting this into our for loop we get

>>> for year in authors_df['Date'].unique():
...    filename='data/yearly_files/authors' + str(year) + '.csv'
...    print(filename)
...
yearly_files/authors1625.csv
yearly_files/authors1515.csv
yearly_files/authors1528.csv
yearly_files/authors1623.csv
yearly_files/authors1640.csv
yearly_files/authors1624.csv
yearly_files/authors1607.csv

We can now add the rest of the steps we need to create separate text files:

# Load the data into a DataFrame
authors_df = pd.read_csv('data/eebo.csv')

for year in authors_df['Data'].unique():

    # Select data for the year
    publish_year = authors_df[authors_df.Date == year]

    # Write the new DataFrame to a csv file
    filename = 'yearly_files/authors' + str(year) + '.csv'
    publish_year.to_csv(filename)

Look inside the yearly_files directory and check a couple of the files you just created to confirm that everything worked as expected.

Writing Unique FileNames

Notice that the code above created a unique filename for each year.

filename = 'data/yearly_files/authors' + str(year) + '.csv'

Let’s break down the parts of this name:

Notice that we use single quotes to add text strings. The variable is not surrounded by quotes. This code produces the string data/yearly_files/authors1607.csv which contains the path to the new filename AND the file name itself.

Challenge - Modifying loops

  1. Some of the surveys you saved are missing data (they have null values that show up as NaN - Not A Number - in the DataFrames and do not show up in the text files). Modify the for loop so that the entries with null values are not included in the yearly files.

  2. What happens if there is no data for a year in the sequence (for example, imagine we had used 1800 as the start year in range)?

  3. Let’s say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1500?

  4. Instead of splitting out the data by years, a colleague wants to analyse each place separately. How would you write a unique csv file for each location?

Building reusable and modular code with functions

Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it but that would be time consuming and error prone. A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user. To do this, we are going to turn the code we’ve already written into a function.

Functions are reusable, self-contained pieces of code that are called with a single command. They can be designed to accept arguments as input and return values, but they don’t need to do either. Variables declared inside functions only exist while the function is running and if a variable within the function (a local variable) has the same name as a variable somewhere else in the code, the local variable hides but doesn’t overwrite the other.

Every method used in Python (for example, print) is a function, and the libraries we import (say, pandas) are a collection of functions. We will only use functions that are housed within the same code that uses them, but it’s also easy to write functions that can be used by different programs.

Functions are declared following this general structure:

def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

The function declaration starts with the word def, followed by the function name and any arguments in parenthesis, and ends in a colon. The body of the function is indented just like loops are. If the function returns something when it is called, it includes a return statement at the end.

This is how we call the function:

>>> product_of_inputs = this_is_the_function_name(2,5)
The function arguments are: 2 5 (this is done inside the function!)

>>> print('Their product is:', product_of_inputs, '(this is done outside the function!)')
Their product is: 10 (this is done outside the function!)

Challenge - Functions

  1. Change the values of the arguments in the function and check its output
  2. Try calling the function by giving it the wrong number of arguments (not 2) or not assigning the function call to a variable (no product_of_inputs =)
  3. Declare a variable inside the function and test to see where it exists (Hint: can you print it from outside the function?)
  4. Explore what happens when a variable both inside and outside the function have the same name. What happens to the global variable when you change the value of the local variable?

We can now turn our code for saving yearly data files into a function. There are many different “chunks” of this code that we can turn into functions, and we can even create functions that call other functions inside them. Let’s first write a function that separates data for just one year and saves that data to a file:

def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    texts_year = all_data[all_data.Date == this_year]

    # Write the new DataFrame to a csv file
    filename = 'data/yearly_files/function_authors' + str(this_year) + '.csv'
    texts_year.to_csv(filename)

The text between the two sets of triple double quotes is called a docstring and contains the documentation for the function. It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does. Docstrings in functions also become part of their ‘official’ documentation:

one_year_csv_writer?
one_year_csv_writer('1607',authors_df)

We changed the root of the name of the csv file so we can distinguish it from the one we wrote before. Check the yearly_files directory for the file. Did it do what you expect?

What we really want to do, though, is create files for multiple years without having to request them one by one. Let’s write another function that replaces the entire For loop by simply looping through a sequence of years and repeatedly calling the function we just wrote, one_year_csv_writer:

def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(str(year), all_data)

Because people will naturally expect that the end year for the files is the last year with data, the for loop inside the function ends at end_year + 1. By writing the entire loop into a function, we’ve made a reusable tool for whenever we need to break a large data file into yearly files. Because we can specify the first and last year for which we want files, we can even use this function to create files for a subset of the years available. This is how we call this function:

# Load the data into a DataFrame
authors_df = pd.read_csv('data/eebo.csv')

# Create csv files
yearly_data_csv_writer(1500, 1650, authors_df)

BEWARE! If you are using IPython Notebooks and you modify a function, you MUST re-run that cell in order for the changed function to be available to the rest of the code. Nothing will visibly happen when you do this, though, because simply defining a function without calling it doesn’t produce an output. Any cells that use the now-changed functions will also have to be re-run for their output to change.

Challenge- More functions

  1. Add two arguments to the functions we wrote that take the path of the directory where the files will be written and the root of the file name. Create a new set of files with a different name in a different directory.
  2. How could you use the function yearly_data_csv_writer to create a csv file for only one year? (Hint: think about the syntax for range)
  3. Make the functions return a list of the files they have written. There are many ways you can do this (and you should try them all!): either of the functions can print to screen, either can use a return statement to give back numbers or strings to their function call, or you can use some combination of the two. You could also try using the os library to list the contents of directories.
  4. Explore what happens when variables are declared inside each of the functions versus in the main (non-indented) body of your code. What is the scope of the variables (where are they visible)? What happens when they have the same name but are given different values?

The functions we wrote demand that we give them a value for every argument. Ideally, we would like these functions to be as flexible and independent as possible. Let’s modify the function yearly_data_csv_writer so that the start_year and end_year default to the full range of the data if they are not supplied by the user. Arguments can be given default values with an equal sign in the function declaration. Any arguments in the function without default values (here, all_data) is a required argument and MUST come before the argument with default values (which are optional in the function call).

    def yearly_data_arg_test(all_data, start_year = '1500', end_year = '1650'):
        """
        Modified from yearly_data_csv_writer to test default argument values!

        start_year --- the first year of data we want --- default: 1300
        end_year --- the last year of data we want --- default: 1700
        all_data --- DataFrame with multi-year data
        """

        return start_year, end_year


    start,end = yearly_data_arg_test (authors_df, '1600', '1660')
    print('Both optional arguments:\t', start, end)

    start,end = yearly_data_arg_test (authors_df)
    print('Default values:\t\t\t', start, end)
    Both optional arguments:	1600 1660
    Default values:			1300 1700

The “\t” in the print statements are tabs, used to make the text align and be easier to read.

But what if our dataset doesn’t start in 1300 and end in 1700? We can modify the function so that it looks for the start and end years in the dataset if those dates are not provided:

    def yearly_data_arg_test(all_data, start_year = None, end_year = None):
        """
        Modified from yearly_data_csv_writer to test default argument values!

        start_year --- the first year of data we want --- default: None - check all_data
        end_year --- the last year of data we want --- default: None - check all_data
        all_data --- DataFrame with multi-year data
        """

        if not start_year:
            start_year = min(all_data.Date)
        if not end_year:
            end_year = max(all_data.Date)

        return start_year, end_year


    start,end = yearly_data_arg_test (authors_df, '1600', '1660')
    print('Both optional arguments:\t', start, end)

    start,end = yearly_data_arg_test (authors_df)
    print('Default values:\t\t\t', start, end)
    Both optional arguments:	1600 1660
    Default values:		1500 1650

The default values of the start_year and end_year arguments in the function yearly_data_arg_test are now None. This is a build-in constant in Python that indicates the absence of a value - essentially, that the variable exists in the namespace of the function (the directory of variable names) but that it doesn’t correspond to any existing object.

Challenge - Variables

  1. What type of object corresponds to a variable declared as None? (Hint: create a variable set to None and use the function type())

  2. Compare the behavior of the function yearly_data_arg_test when the arguments have None as a default and when they do not have default values.

  3. What happens if you only include a value for start_year in the function call? Can you write the function call with only a value for end_year? (Hint: think about how the function must be assigning values to each of the arguments - this is related to the need to put the arguments without default values before those with default values in the function definition!)

If Loops

The body of the test function now has two conditional loops (if loops) that check the values of start_year and end_year. If loops execute the body of the loop when some condition is met. They commonly look something like this:

    a = 5

    if a<0: # meets first condition?

        # if a IS less than zero
        print('a is a negative number')

    elif a>0: # did not meet first condition. meets second condition?

        # if a ISN'T less than zero and IS more than zero
        print('a is a positive number')

    else: # met neither condition

        # if a ISN'T less than zero and ISN'T more than zero
        print('a must be zero!')

Which would return:

    a is a positive number

Change the value of a to see how this function works. The statement elif means “else if”, and all of the conditional statements must end in a colon.

The if loops in the function yearly_data_arg_test check whether there is an object associated with the variable names start_year and end_year. If those variables are None, the if loops return the boolean True and execute whaever is in their body. On the other hand, if the variable names are associated with some value (they got a number in the function call), the if loops return False and do not execute. The opposite conditional statements, which would return True if the variables were associated with objects (if they had received value in the function call), would be if start_year and if end_year.

As we’ve written it so far, the function yearly_data_arg_test associates values in the function call with arguments in the function definition just based in their order. If the function gets only two values in the function call, the first one will be associated with all_data and the second with start_year, regardless of what we intended them to be. We can get around this problem by calling the function using keyword arguments, where each of the arguments in the function definition is associated with a keyword and the function call passes values to the function using these keywords:

    def yearly_data_arg_test(all_data, start_year = None, end_year = None):
        """
        Modified from yearly_data_csv_writer to test default argument values!

        start_year --- the first year of data we want --- default: None - check all_data
        end_year --- the last year of data we want --- default: None - check all_data
        all_data --- DataFrame with multi-year data
        """

        if not start_year:
            start_year = min(all_data.Date)
        if not end_year:
            end_year = max(all_data.Date)

        return start_year, end_year


    start,end = yearly_data_arg_test (authors_df)
    print('Default values:\t\t\t', start, end)

    start,end = yearly_data_arg_test (authors_df, 1600, 1660)
    print('No keywords:\t\t\t', start, end)

    start,end = yearly_data_arg_test (authors_df, start_year = 1600, end_year = 1660)
    print('Both keywords, in order:\t', start, end)

    start,end = yearly_data_arg_test (authors_df, end_year = 1660, start_year = 1600)
    print('Both keywords, flipped:\t\t', start, end)

    start,end = yearly_data_arg_test (authors_df, start_year = 1600)
    print('One keyword, default end:\t', start, end)

    start,end = yearly_data_arg_test (authors_df, end_year = 1660)
    print('One keyword, default start:\t', start, end)
    Default values:			1515 1640
    No keywords:			1600 1640
    Both keywords, in order:	        1600 1660
    Both keywords, flipped:		1600 1660
    One keyword, default end:	        1600 1640
    One keyword, default start:	        1515 1660

Challenge - Modifying functions

  1. Rewrite the one_year_csv_writer and yearly_data_csv_writer functions to have keyword arguments with default values

  2. Modify the functions so that they don’t create yearly files if there is no data for a given year and display an alert to the user (Hint: use conditional statements and if loops to do this. For an extra challenge, use try statements!)

  3. The code below checks to see whether a directory exists and creates one if it doesn’t. Add some code to your function that writes out the CSV files, to check for a directory to write to.

	if 'dir_name_here' in os.listdir('.'):
	    print('Processed directory exists')
	else:
	    os.mkdir('dir_name_here')
	    print('Processed directory created')
  1. The code that you have written so far to loop through the years is good, however it is not necessarily reproducible with different datasets. For instance, what happens to the code if we have additional years of data in our CSV files? Using the tools that you learned in the previous activities, make a list of all years represented in the data. Then create a loop to process your data, that begins at the earliest year and ends at the latest year using that list.

HINT: you can create a loop with a list as follows: for years in year_list:

Key Points


Plotting with bokeh

Overview

Teaching: 20 min
Exercises: 25 min
Questions
  • Can I use Python to create plots?

  • How can I customize plots generated in Python?

Objectives
  • Create a ggplot object

  • Set universal plot settings

  • Modify an existing ggplot object

  • Change the aesthetics of a plot such as colour

  • Edit the axis labels

  • Build complex plots using a step-by-step approach

  • Create scatter plots, box plots, and time series plots

  • Use the facet_wrap and facet_grid commands to create a collection of plots splitting the data by a factor variable

  • Create customized plot styles to meet their needs

Disclaimer

Python has powerful built-in plotting capabilities such as matplotlib, but for this exercise, we will be using the bokeh package, which facilitates the creation of highly-informative plots of structured data.

import pandas as pd

authors_complete = pd.read_csv( '../data/eebo.csv', index_col=0)
authors_complete.index.name = 'X'
authors_complete
        EEBO    VID  ... Page Count             Place X                        ...                              A00002  99850634  15849  ...        134            London A00005  99842408   7058  ...        302            London A00007  99844302   9101  ...        386            London A00008  99848896  14017  ...         14  The Netherlands? A00011  99837000   1304  ...         54         Amsterdam A00012  99853871  19269  ...         99            London A00014  33143147  28259  ...          1            London A00015  99837006   1310  ...         16            London A00018  99850740  15965  ...         26          Germany?

149 rows x 10 columns

from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook

Plotting with bokeh

We will make the same plot using the bokeh package.

bokeh is a plotting package that makes it simple to create complex plots from data in a dataframe. It uses default settings, which help creating publication quality plots with a minimal amount of settings and tweaking.

bokeh graphics are built step by step by adding new elements.

To build a bokeh plot we need to:

We also set some notebook settings with a “output_notebook()” statement to get interactive and exportable plots

output_notebook()

p = figure(plot_width=400, plot_height=400)

png

We can add simple points to create a scatter plot using circle.

list_dates = authors_complete['Date']
list_numbers = authors_complete['Page Count']
p.circle(list_dates, list_numbers)
show(p)

png

Building your plots

We can add extra arguments into circle’s argument.

For comparison, we create a new figure and then add the alpha argument to circle to change the opacity of the points.

p1 = figure(plot_width=400, plot_height=400)

p1.circle(list_dates, list_numbers, alpha=0.1)

show(p1)

png

We can also add colors for all the points.

p2 = figure(plot_width=400, plot_height=400)

p2.circle(list_dates, list_numbers, color="blue", alpha=0.1)

show(p2)

png

Plotting time series data

Let’s calculate number of counts per year across the dataset. To do that we need to group data first and count records within each group.

yearly = authors_df[['Date','Place','Page Count']].groupby(['Date', 'Place']).count().reset_index()
p3 = figure(plot_width=800, plot_height=250)

p3.line(yearly['Date'], yearly['Page Count'], color='navy', alpha=0.5)

show(p3)
year 	place 	count 0 	1515 	London 	1 1 	1519 	Londini 	1 2 	1526 	London 	2 3 	1528 	London 	1 4 	1529 	Malborow i.e. Antwerp 	1 5 	1531 	London 	1

[121 rows x 3 columns]

Timelapse data can be visualised as a line plot with years on x axis and counts on y axis.

p3 = figure(plot_width=800, plot_height=250)
p3.line(yearly['Date'], yearly['Page Count'], color='navy', alpha=0.5)
show(p3)

png

Customization

Now, let’s add a title to this figure:

from bokeh.models import ColumnDataSource, Range1d, LabelSet, Label

p4 = figure(title="Plot of Page Counts by Year", plot_width=400, plot_height=400)

p4.circle(list_dates, list_numbers)

p4.xaxis[0].axis_label = 'Date'
p4.yaxis[0].axis_label = 'Page Count'

show(p4)

png

or we canadd labels to the axes and change the font size for the labels

p5 = figure(title="Plot of Page Counts by Year", plot_width=400, plot_height=400)

p5.circle(list_dates, list_numbers)

p5.xaxis[0].axis_label = 'Date'
p5.yaxis[0].axis_label = 'Page Count'
p5.xaxis[0].axis_label_text_font_size = "24pt"

show(p5)

png

With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own.

Here are some ideas:

After creating your plot, you can save it to a file as a png file:

from bokeh.io import export_png

export_png(p4, filename="plot.png")

Key Points


Data Ingest & Visualization - Matplotlib & Pandas

Overview

Teaching: 20 min
Exercises: 25 min
Questions
  • What other tools can I use to create plots apart from ggplot?

  • Why should I use Python to create plots?

Objectives
  • Import the pyplot toolbox to create figures in Python.

Putting it all together

Up to this point, we have walked through tasks that are often involved in handling and processing data using the workshop ready cleaned
files that we have provided. In this wrap-up exercise, we will perform many of the same tasks with real data sets. This lesson also covers data visualization.

As opposed to the previous ones, this lesson does not give step-by-step directions to each of the tasks. Use the lesson materials you’ve already gone through as well as the Python documentation to help you along.

1. Obtain data

There are many repositories online from which you can obtain data. We are providing you with one data file to use with these exercises, but feel free to use any data that is relevant to your research. The file TCP.csv contains the complete EEBO metadata of books between 1300-1700. If you’d like to use this dataset, please find it in the data folder.

2. Clean up your data and open it using Python and Pandas

To begin, import your data file into Python using Pandas. Did it fail? Your data file probably has a header that Pandas does not recognize as part of the data table. Remove this header, but do not simply delete it in a text editor! Use either a shell script or Python to do this - you wouldn’t want to do it by hand if you had many files to process.

If you are still having trouble importing the data as a table using Pandas, check the documentation. You can open the docstring in an ipython notebook using a question mark. For example:

    import pandas as pd
    pd.read_csv?

Look through the function arguments to see if there is a default value that is different from what your file requires (Hint: the problem is most likely the delimiter or separator. Common delimiters are ',' for comma, ' ' for space, and '\t' for tab).

Create a DataFrame that includes only the values of the data that are useful to you. In the streamgage file, those values might be the date, title, terms, and TCP. Convert and joined strings into individual ones. You can also change the name of the columns in the DataFrame like this:

    df['Author'] = df['Author'].str.split(';') #split a column in the data frame

    df = pd.DataFrame({'1stcolumn':[100,200], '2ndcolumn':[10,20]}) # this just creates a DataFrame for the example!
    print('With the old column names:\n') # the \n makes a new line, so it's easier to see
    print(df)

    df.columns = ['FirstColumn','SecondColumn'] # rename the columns!
    print('\n\nWith the new column names:\n')
    print(df)

    With the old column names:

       1stcolumn  2ndcolumn
    0        100         10
    1        200         20


    With the new column names:

       FirstColumn  SecondColumn
    0          100            10
    1          200            20

3. Make a line plot of your data

Matplotlib is a Python library that can be used to visualize data. The toolbox matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. In most cases, this is all that you will need to use, but there are many other useful tools in matplotlib that you should explore.

We will cover a few basic commands for formatting plots in this lesson. A great resource for help styling your figures is the matplotlib gallery (http://matplotlib.org/gallery.html), which includes plots in many different styles and the source code that creates them. The simplest of plots is the 2 dimensional line plot. These examples walk through the basic commands for making line plots using pyplots.

Challenge - Lots of plots

Make a variety of line plots from your data. If you are using the streamgage data, these could include (1) a hydrograph of the entire month of September 2013, (2) the discharge record for the week of the 2013 Front Range flood (September 9 through 15), (3) discharge vs. time of day, for every day in the record in one figure (Hint: use loops to combine strings and give every line a different style and color), and (4) minimum, maximum, and mean daily discharge values. Add axis labels, titles, and legends to your figures. Make at least one figure with multiple plots using the function subplot().

Using pyplot:

First, import the pyplot toolbox:

    import matplotlib.pyplot as plt

By default, matplotlib will create the figure in a separate window. When using ipython notebooks, we can make figures appear in-line within the notebook by writing:

    %matplotlib inline

We can start by plotting the values of a list of numbers (matplotlib can handle many types of numeric data, including numpy arrays and pandas DataFrames - we are just using a list as an example!):

    list_numbers = [1.5, 4, 2.2, 5.7]
    plt.plot(list_numbers)
    plt.show()

The command plt.show() prompts Python to display the figure. Without it, it creates an object in memory but doesn’t produce a visible plot. The ipython notebooks (if using %matplotlib inline) will automatically show you the figure even if you don’t write plt.show(), but get in the habit of including this command!

If you provide the plot() function with only one list of numbers, it assumes that it is a sequence of y-values and plots them against their index (the first value in the list is plotted at x=0, the second at x=1, etc). If the function plot() receives two lists, it assumes the first one is the x-values and the second the y-values. The line connecting the points will follow the list in order:

    plt.plot([6.8, 4.3, 3.2, 8.1], list_numbers)
    plt.show()

A third, optional argument in plot() is a string of characters that indicates the line type and color for the plot. The default value is a continuous blue line. For example, we can make the line red ('r'), with circles at every data point ('o'), and a dot-dash pattern ('-.'). Look through the matplotlib gallery for more examples.

    plt.plot([6.8, 4.3, 3.2, 8.1], list_numbers, 'ro-.')
    plt.axis([0,10,0,6])
    plt.show()

The command plt.axis() sets the limits of the axes from a list of [xmin, xmax, ymin, ymax] values (the square brackets are needed because the argument for the function axis() is one list of values, not four separate numbers!). The functions xlabel() and ylabel() will label the axes, and title() will write a title above the figure.

A single figure can include multiple lines, and they can be plotted using the same plt.plot() command by adding more pairs of x values and y values (and optionally line styles):

    import numpy as np

    # create a numpy array between 0 and 10, with values evenly spaced every 0.5
    t = np.arange(0., 10., 0.5)

    # red dashes with no symbols, blue squares with a solid line, and green triangles with a dotted line
    plt.plot(t, t, 'r--', t, t**2, 'bs-', t, t**3, 'g^:')

    plt.xlabel('This is the x axis')
    plt.ylabel('This is the y axis')
    plt.title('This is the figure title')

    plt.show()

We can include a legend by adding the optional keyword argument label='' in plot(). Caution: We cannot add labels to multiple lines that are plotted simultaneously by the plt.plot() command like we did above because Python won’t know to which line to assign the value of the argument label. Multiple lines can also be plotted in the same figure by calling the plot() function several times:

    # red dashes with no symbols, blue squares with a solid line, and green triangles with a dotted line
    plt.plot(t, t, 'r--', label='linear')
    plt.plot(t, t**2, 'bs-', label='square')
    plt.plot(t, t**3, 'g^:', label='cubic')

    plt.legend(loc='upper left', shadow=True, fontsize='x-large')

    plt.xlabel('This is the x axis')
    plt.ylabel('This is the y axis')
    plt.title('This is the figure title')

    plt.show()

The function legend() adds a legend to the figure, and the optional keyword arguments change its style. By default [typing just plt.legend()], the legend is on the upper right corner and has no shadow.

Like MATLAB, pyplot is stateful; it keeps track of the current figure and plotting area, and any plotting functions are directed to those axes. To make more than one figure, we use the command plt.figure() with an increasing figure number inside the parentheses:

    # this is the first figure
    plt.figure(1)
    plt.plot(t, t, 'r--', label='linear')

    plt.legend(loc='upper left', shadow=True, fontsize='x-large')
    plt.title('This is figure 1')

    plt.show()

    # this is a second figure
    plt.figure(2)
    plt.plot(t, t**2, 'bs-', label='square')

    plt.legend(loc='upper left', shadow=True, fontsize='x-large')
    plt.title('This is figure 2')

    plt.show()

A single figure can also include multiple plots in a grid pattern. The subplot() command especifies the number of rows, the number of columns, and the number of the space in the grid that particular plot is occupying:

    plt.figure(1)

    plt.subplot(2,2,1) # two row, two columns, position 1
    plt.plot(t, t, 'r--', label='linear')

    plt.subplot(2,2,2) # two row, two columns, position 2
    plt.plot(t, t**2, 'bs-', label='square')

    plt.subplot(2,2,3) # two row, two columns, position 3
    plt.plot(t, t**3, 'g^:', label='cubic')

    plt.show()

4. Make other types of plots:

Matplotlib can make many other types of plots in much the same way that it makes 2 dimensional line plots. Look through the examples in http://matplotlib.org/users/screenshots.html and try a few of them (click on the “Source code” link and copy and paste into a new cell in ipython notebook or save as a text file with a .py extension and run in the command line).

Challenge - Final Plot

Display your data using one or more plot types from the example gallery. Which ones to choose will depend on the content of your own data file. If you are using the streamgage file, you could make a histogram of the number of days with a given mean discharge, use bar plots to display daily discharge statistics, or explore the different ways matplotlib can handle dates and times for figures.

Key Points


Accessing SQLite Databases Using Python & Pandas

Overview

Teaching: 20 min
Exercises: 25 min
Questions
Objectives
  • Use the sqlite3 module to interact with a SQL database.

  • Access data stored in SQLite using Python.

  • Describe the difference in interacting with data stored as a CSV file versus in SQLite.

  • Describe the benefits of accessing data using a database compared to a CSV file.

Python and SQL

When you open a CSV in python, and assign it to a variable name, you are using your computers memory to save that variable. Accessing data from a database like SQL is not only more efficient, but also it allows you to subset and import only the parts of the data that you need.

In the following lesson, we’ll see some approaches that can be taken to do so.

The sqlite3 module

The sqlite3 module provides a straightforward interface for interacting with SQLite databases. A connection object is created using sqlite3.connect(); the connection must be closed at the end of the session with the .close() command. While the connection is open, any interactions with the database require you to make a cursor object with the .cursor() command. The cursor is then ready to perform all kinds of operations with .execute().

import sqlite3

# Create a SQL connection to our SQLite database
con = sqlite3.connect("data/eebo.db")

cur = con.cursor()

# the result of a "cursor.execute" can be iterated over by row
for row in cur.execute('SELECT * FROM eebo;'):
    print(row)

#Be sure to close the connection.
con.close()

Queries

One of the most common ways to interact with a database is by querying: retrieving data based on some search parameters. Use a SELECT statement string. The query is returned as a single tuple or a tuple of tuples. Add a WHERE statement to filter your results based on some parameter.

import sqlite3

# Create a SQL connection to our SQLite database
con = sqlite3.connect("data/eebo.db")

cur = con.cursor()

# Return all results of query
cur.execute('SELECT Title FROM eebo WHERE Status="Free"')
cur.fetchall()

# Return first result of query
cur.execute('SELECT Title FROM eebo WHERE Status="Free"')
cur.fetchone()

#Be sure to close the connection.
con.close()

Accessing data stored in SQLite using Python and Pandas

Using pandas, we can import results of a SQLite query into a dataframe. Note that you can use the same SQL commands / syntax that we used in the SQLite lesson. An example of using pandas together with sqlite is below:

import pandas as pd
import sqlite3

# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("data/eebo.db")
df = pd.read_sql_query("SELECT * from eebo", con)

# verify that result of SQL query is stored in the dataframe
print(df.head())

con.close()

Storing data: CSV vs SQLite

Storing your data in an SQLite database can provide substantial performance improvements when reading/writing compared to CSV. The difference in performance becomes more noticable as the size of the dataset grows (see for example these benchmarks).

Challenge - SQL

  1. Create a query that contains title data published between 1550 - 1650 that includes book’s Title, Author, and TCP id. How many records are returned?

Storing data: Create new tables using Pandas

We can also us pandas to create new tables within an SQLite database. Here, we run we re-do an excercise we did before with CSV files using our SQLite database. We first read in our survey data, then select only those survey results for 2002, and then save it out to its own table so we can work with it on its own later.

import pandas as pd
import sqlite3

con = sqlite3.connect("data/eebo.db")

# Load the data into a DataFrame
books_df = pd.read_sql_query("SELECT * from eebo", con)

# Select only data for 1640
titles1640 = books_df[books_df.Date == '1640']

# Write the new DataFrame to a new SQLite table
titles1640.to_sql("titles1640", con, if_exists="replace")

con.close()

Challenge - Saving your work

  1. For each of the challenges in the previous challenge block, modify your code to save the results to their own tables in the eebo database.

  2. What are some of the reasons you might want to save the results of your queries back into the database? What are some of the reasons you might avoid doing this.

Key Points