Workflows with Python and Git

Review Exercise

Overview

Teaching: 0 min
Exercises: 20 min
Questions
  • How can we put together all of yesterday’s material?

Objectives
  • Use functions, conditionals and loops together to solve a problem.

Review From Yesterday

In your notebook, write a function that determines whether a year between 1901 and 2000 is a leap year and prints a message like “1904 is a leap year” or “1905 is not a leap year” as output. Use this function to evaluate the years 1928, 1950, 1959, 1972 and 1990. Essentially, given this list of years:

years = [1928, 1950, 1959, 1972, 1990]

Produce something like:

1928 is a leap year
1950 is not a leap year.
1959 is not a leap year.
1972 is a leap year
1990 is not a leap year.

If you’re not sure where to start, see the partial answers below:

Suggested Approach

First, try to determine how to use the mod operator % to determine if a year is divisible by 4 (and thus a leap year or not).

Then, create a conditional statement to use this information, and put it into a function.

Finally, create a list of the years given in the exercise. Use a for loop and your function to evaluate these years.

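Modular Arithmetic

If a year in the range specified is divisible by four, it is a leap year (the exceptions for century years do not come into play between 1901 and 2000). If a number is divisible by 4, then the arithmetic expression “number mod four” (or num % 4 in Python) will equal zero. For example, 8 % 4 equals 0, while 10 % 4 equals 2.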
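As a quick check of the operator, the short sketch below computes the remainder for two of the years from the exercise. The variable names are only for illustration.

```python
# the % operator gives the remainder after division
for year in [1928, 1950]:
    remainder = year % 4
    print(year, "mod 4 equals", remainder)

# the same remainders collected in a list
remainders = [year % 4 for year in [1928, 1950]]
```

A remainder of zero means the year is divisible by four.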

Conditional Statement

Fill in the blanks:

year = 1904
if year % 4 == _____:
    print(year, _______________)
______:
    print(year, "is not a leap year.")

Function

Fill in the blanks:

def leap_year(year):
    _________

Loop

Fill in the blanks:

year_list = [1928, 1950, 1959, 1972, 1990]
for year in ______:
    ________(year)

Complete Solution

def leap_year(year):
    if year % 4 == 0:
        print(year, "is a leap year")
    else:
        print(year, "is not a leap year.")

year_list = [1928, 1950, 1959, 1972, 1990]
for year in year_list:
    leap_year(year)

If you have time:

  1. Expand your function so that it correctly categorizes any year from 0 onwards

  2. Instead of printing whether a year is a leap year or not, save the results to a python dictionary, where there are two keys (“leap” and “not-leap”) and the values are a list of years.
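One possible approach to both extensions is sketched below. It assumes the Gregorian calendar rule is applied to every year from 0 onwards (a year is a leap year if it is divisible by 4, except century years, which must also be divisible by 400). The function names is_leap_year and sort_years are our own choices, not part of the exercise.

```python
def is_leap_year(year):
    """Return True if year is a leap year under the Gregorian rule."""
    # divisible by 4, except century years that are not divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def sort_years(years):
    """Group years into a dictionary with 'leap' and 'not-leap' keys."""
    results = {"leap": [], "not-leap": []}
    for year in years:
        if is_leap_year(year):
            results["leap"].append(year)
        else:
            results["not-leap"].append(year)
    return results

results = sort_years([1900, 1928, 1950, 2000])
print(results)
```

Keeping the test in its own function means the dictionary-building loop does not need to know the calendar rule.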

Key Points

  • Use skills together.


Command-Line Programs

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I write Python programs that will work like Unix command-line tools?

Objectives
  • Use the values of command-line arguments in a program.

  • Read data from standard input in a program so that it can be used in a pipeline.

The Jupyter Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use that code in a program we can run from the command line. To do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a gapminder data set and plots the GDP of countries over time.

Switching to Shell Commands

In this lesson we are switching from typing commands in Jupyter notebooks to typing commands in a shell terminal window (such as bash). A $ in front of a command tells you to run that command in the shell rather than in the Python interpreter.

Converting Notebooks

The Jupyter Notebook can convert all of the cells of the current notebook into a Python program. To do this, go to File -> Download as and select Python (.py) to get the current notebook as a Python script.

Setting up your project

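Up until now, we’ve been working in the data folder directly. Because we’re going to be dealing with more files of different types in this lesson, let’s do a little rearranging: we’ll work from the folder that contains data, and save the following program there as gdp_plots.py: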

import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# display the plot
plt.show()

This program imports the pandas and matplotlib Python modules, reads some of the gapminder data into a pandas dataframe, and plots that data using matplotlib with some default settings.

We can run this program from the command line using

$ python gdp_plots.py

This is much easier than starting a notebook, going to the browser, and running each cell in the notebook to get the same result.

Initialize a repository

But before we modify our gdp_plots.py program, we are going to put it under version control so that we can track its changes as we go through this lesson.

$ git init
$ git add gdp_plots.py
$ git commit -m "First commit of analysis script"

Because we’re only concerned with changes to our analysis script, we are going to create a .gitignore file that covers all of the gapminder .csv files and any Python notebook (.ipynb) files we have created thus far.

$ echo "data/*.csv" > .gitignore
$ echo "*.ipynb" >> .gitignore
$ git add .gitignore
$ git commit -m "Adding ignore file"

Now that we have a clean repository, let’s get back to work on adding command line arguments to our program.

Changing code under Version Control

As it is, this plot isn’t bad, but let’s add some labels for clarity. We’ll use the data filename as a title for the plot and indicate what information is on each axis.

import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = 'data/gapminder_gdp_oceania.csv'

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()

Now when we run this, our plot looks a little bit nicer.

$ python gdp_plots.py

Updating the Repository

$ git add gdp_plots.py
$ git commit -m "Improving plot format"

Command-Line Arguments

This program currently only works for the Oceania set of data. How might we modify the program to work for any of the gapminder gdp data sets? We could go into the script and change the .csv filename to generate the same plot for different sets of data, but there is an even better way.

Python programs can use additional arguments provided in the following manner.

$ python <program> <argument1> <argument2> <other_arguments>

The program can then use these arguments to alter its behavior. In this case, we’ll be using arguments to tell our program to operate on a specific file.

We’ll be using the sys module to do so. sys (short for system) is a standard Python module used to store information about the program and its running environment, including what arguments were passed to the program when the command was executed. These arguments are stored as a list in sys.argv.

These arguments can be accessed in our program by importing the sys module. The first argument in sys.argv is always the name of the program, so we’ll find any additional arguments right after that in the list.

Let’s try this out in a separate script. Using the text editor of your choice, let’s write a new program called argv_list.py containing the two following lines:

import sys
print('sys.argv is', sys.argv)

The strange name argv stands for “argument vector”, a name inherited from C. Whenever Python runs a program, it takes all of the values given on the command line and puts them in the list sys.argv so that the program can determine what they were. If we run this program with no arguments:

$ python argv_list.py
sys.argv is ['argv_list.py']

the only thing in the list is the name of our script as we typed it on the command line, which is always sys.argv[0].

If we run it with a few arguments, however:

$ python argv_list.py first second third
sys.argv is ['argv_list.py', 'first', 'second', 'third']

then Python adds each of those arguments to that magic list.
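One detail worth keeping in mind is that every entry in sys.argv is a string, even when it looks like a number. The sketch below uses a hand-written list in place of sys.argv so that it can be run anywhere; the file name data.csv and the value 3 are made up for illustration.

```python
# a stand-in for sys.argv, as if we had run:
#   python argv_list.py data.csv 3
fake_argv = ['argv_list.py', 'data.csv', '3']

program_name = fake_argv[0]   # always the script name
arguments = fake_argv[1:]     # everything the user typed after it

# arguments arrive as strings, so convert before doing arithmetic
count = int(arguments[1])
print(program_name, 'received', arguments, 'and count + 1 is', count + 1)
```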

Using this new information, let’s add command line arguments to our gdp_plots.py program.

To do this, we’ll make two changes:

  1. add the import of the sys module at the beginning of the program.
  2. replace the filename (“data/gapminder_gdp_oceania.csv”) with the second entry in the sys.argv list, sys.argv[1].

Now our program should look as follows:

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()

Let’s take a look at what happens when we provide a gapminder filename to the program.

$ python gdp_plots.py data/gapminder_gdp_oceania.csv

The same plot as before is displayed, but the file is now being read from an argument we’ve provided on the command line. We can now do this for files with similar information and get the same set of plots for that data without any changes to our program’s code. Try this out for yourself now.

Update the Repository

Now that we’ve made this change to our program and seen that it works, let’s update our repository with these changes.

$ git add gdp_plots.py
$ git commit -m "Adding command line arguments"

Exercise: read multiple files

Try to run the gdp_plots.py so that it reads in all the .csv files in the data folder using the wildcard symbol. Does it work? Why or why not?

Solution

If you run it with the argument ‘data/*.csv’, you get an error on the Americas file because that file has an extra column. However, it works if you omit that file.

Key Points

  • The sys library connects a Python program to the system it is running on.

  • The variable sys.argv is a list with each item being a command-line argument.


Trying Different Methods

Overview

Teaching: 5 min
Exercises: 25 min
Questions
  • How do I plot multiple data sets using different methods?

Objectives
  • Read data from standard input in a program so that it can be used in a pipeline.

  • Compare using different methods to accomplish the same task.

  • Practice making branches and merging in a Git repository.

Handling Multiple Files

Perhaps we would like a script that can operate on multiple data files at once. This can actually be done in two different ways: (1) writing a bash script that calls our already-existing Python script in a for-loop, or (2) modifying our Python script to read multiple files in a for-loop. We will try both methods and compare the two.

Create New Branches

First we will create two new branches where we can develop each of these two different methods. We will call these branches python-multi-files and bash-multi-files.

$ git branch python-multi-files
$ git branch bash-multi-files

We can check that these two branches were created with $ git branch -a.

Handling Multiple Files with Python

First, we’ll try using just Python to loop through multiple files. Let’s switch to our Python branch.

$ git checkout python-multi-files

We want our program to process each file separately, and the easiest way to do this is a loop that executes our plotting statements once for each of the filenames provided in sys.argv.

But we need to be careful: sys.argv[0] will always be the name of our program, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.

A solution is to loop over the contents of sys.argv[1:]. The ‘1’ tells Python to start the slice at location 1, so the program’s name isn’t included. Since we’ve left off the upper bound, the slice runs to the end of the list, and includes all the filenames.
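The behaviour of the slice can be checked without running the plotting script. In the sketch below, the helper filenames_from and the argument lists are hypothetical stand-ins for sys.argv.

```python
def filenames_from(argv):
    """Return every entry after the program name."""
    return argv[1:]

# hypothetical argument lists standing in for sys.argv
no_files = filenames_from(['gdp_plots.py'])
two_files = filenames_from(['gdp_plots.py', 'a.csv', 'b.csv'])

print(no_files)   # an empty list, so a for loop over it runs zero times
print(two_files)
```

Unlike indexing with sys.argv[1], slicing does not raise an error when no filenames are given; it simply returns an empty list.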

Here is our updated program.

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # display the plot
    plt.show()

Now when the program is given multiple filenames

$ python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv

one plot for each filename is generated.

Update the Repository

$ git add gdp_plots.py
$ git commit -m "Allowing plot generation for multiple files at once"

Saving Figures

By using plt.show() with multiple files, the program stops each time a figure is generated and the user must close it to continue. To avoid this slowdown, we can replace plt.show() with plt.savefig() and view all the figures after the script finishes. This function has one required argument, the filename to save the figure as. The filename must have a valid image extension (e.g. .png or .jpg).

Let’s replace our plt.show() with plt.savefig('figs/gdp-plot.png'). First, create the figs directory using mkdir figs. Our new script should look like this:

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot
    plt.savefig('figs/gdp-plot.png')

If we look at the contents of the folder we just created, we should have a new file called gdp-plot.png. But why is there only one when we supplied multiple data files? Each time a plot is created, it is saved under the same file name and overwrites the previous plot.

We can fix this by creating a unique file name each time. A simple unique name can be based on the original file name. We can use the string method split() to split filename, which is a string, at any character we choose. This returns a list like so:

name = 'my-data.csv'
split_name = name.split('.')
print(split_name)
print(split_name[0])
['my-data', 'csv']
my-data

We’ll split the original file name and use the first part to name our plot, then concatenate .png to the name to specify our file type.

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)
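The two split() calls assume that the path contains exactly one dot and exactly one slash. As a hedged alternative (not part of the lesson's script), the standard library's pathlib module can pull out the base name directly. The helper name make_save_name and its out_dir parameter are our own.

```python
from pathlib import Path

def make_save_name(filename, out_dir='figs'):
    """Build an output path such as figs/<base name>.png."""
    # Path.stem is the final path component without its last suffix
    stem = Path(filename).stem
    return out_dir + '/' + stem + '.png'

print(make_save_name('data/gapminder_gdp_oceania.csv'))
print(make_save_name('gapminder_gdp_africa.csv'))
```

The second call shows a file name with no folder in front of it, a case where indexing the result of split('/') with [1] would raise an IndexError.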

Updating the repository

Yet another successful update to the code. Let’s commit our changes. If we do git status we see we also have image files untracked. Let’s ignore those files because they will likely change as our data changes.

$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"
$ git add gdp_plots.py
$ git commit -m "Saves each figure as a separate file."

Handling Multiple Files with Bash

Now that we’ve created a Python script to save multiple files at once, let’s try to do the same thing in Bash. We’ll leave our current branch alone and switch to our bash-multi-files branch.

$ git checkout bash-multi-files

If we look at our gdp_plots.py file, it is not in a for-loop format because this branch is not up to date with the other branch. Our script should look like this:

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()

Let’s use bash to generate multiple plots by calling our Python script in a bash for-loop. First, let’s create a bash file for us to edit.

$ touch gdp_plots.sh

In this script, we’ll write a for-loop that will call our gdp_plots.py script on multiple files. We can break up a long list of files by ending a line with a backslash \ and writing the rest on the next line.

for filename in data/gapminder_gdp_oceania.csv \
                data/gapminder_gdp_africa.csv
do
   python gdp_plots.py "$filename"
done

We can run our script to see if it works:

$ bash gdp_plots.sh

When we run this, we see that it stops to show us each plot like before. Let’s update our script to save the figure like before.

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1]
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)

When we run the script again, we should have new image files generated.

Updating the repository

Yet another successful update to the code. Let’s commit our changes. Since our Python and bash scripts had somewhat unrelated changes, let’s make two separate commits. We will also ignore our images like before.

$ git add gdp_plots.sh
$ git commit -m "Wrote bash script to call python plotter."
$ git add gdp_plots.py
$ git commit -m "Saves figures with unique name."
$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"

Comparing Methods

We have successfully developed two different methods for accomplishing the same task. This is common to do in software development when there is not a clear path forward. Let’s compare our two methods and decide which to merge into our master branch.

One comparison we might be interested in is how fast each is. We can use bash’s time command to measure how long a script takes to run. To do this, we add time before the command we want to run, and timing information is printed when it completes.

While we are on our bash branch, we’ll time that script first.

$ time bash gdp_plots.sh
real    0m6.031s
user    0m5.535s
sys     0m0.388s

We are most interested in the “real” time in the output, which is the elapsed time we experience.

Let’s check out our Python branch and time our script there.

$ git checkout python-multi-files
$ time python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
real    0m3.163s
user    0m3.002s
sys     0m0.132s

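As we can see, our Python method ran faster than the bash method. One likely reason is that the bash loop starts a new Python interpreter, and imports pandas and matplotlib again, for every file. For this reason, we will merge our Python branch into master.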

$ git checkout master
$ git merge python-multi-files

Another advantage to the Python method over the bash method is that we do not need to change a file and commit it each time we want to create plots from different data.
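Elapsed time can also be measured from inside Python, which is handy when only part of a script is of interest. The sketch below is an illustration using the standard library's time.perf_counter, not a replacement for bash's time. The helper name timed is hypothetical, and summing a range stands in for real work.

```python
import time

def timed(function, *args):
    """Run function(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = function(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# sum() over a range is only a placeholder workload
total, seconds = timed(sum, range(1_000_000))
print('sum took', round(seconds, 4), 'seconds')
```

The value measured this way corresponds most closely to the “real” time reported by bash, but it excludes the time Python takes to start up.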

More Practice with Multiple Files in Python

Finding Particular Files

Using the glob module, write a simple version of ls that shows files in the current directory with a particular suffix. A call to this script should look like this:

$ python my_ls.py py
left.py
right.py
zero.py

Solution

import sys
import glob

def main():
    '''prints names of all files with sys.argv[1] as suffix'''
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return

main()

Counting Lines

Write a program called line_count.py that works like the Unix wc command:

  • If no filenames are given, it reports the number of lines in standard input.
  • If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.

Solution

import sys

def main():
    '''print each input filename and the number of lines in it,
       and print the sum of the number of lines'''
    filenames = sys.argv[1:]
    sum_nlines = 0 #initialize counting variable

    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for f in filenames:
            n = count_file(f)
            print('%s %d' % (f, n))
            sum_nlines += n
        print('total: %d' % sum_nlines)

def count_file(filename):
    '''count the number of lines in a file'''
    # the with block closes the file even if reading fails
    with open(filename, 'r') as f:
        nlines = len(f.readlines())
    return nlines

def count_file_like(file_like):
    '''count the number of lines in a file-like object (eg stdin)'''
    n = 0
    for line in file_like:
        n = n+1
    return n

main()

Key Points

  • Make different branches in a Git repository to try different methods.

  • Use bash’s time command to time scripts.


Program Flags

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How can I make an easy shortcut to analyze all files at once using a program flag?

Objectives
  • Handle flags and files separately in a command-line program.

Handling Program Flags

Now we have a program which is capable of handling any number of data sets at once.

But what if we have 50 GDP data sets? It would be awfully tedious to type in the names of 50 files in the command line, so let’s add a flag to our program indicating that we would like it to generate a plot for each data set in the current directory.

Flags are a convention used in programming to indicate to a program that a non-default behavior is being requested by the user. In this case, we’ll be using a “-a” flag to indicate to our program we would like it to operate on all data sets in our directory.

To explore what files are in the current directory, we’ll be using Python’s glob module.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)
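In the script above, any entry that is not -a is treated as a filename, so a mistyped flag would be passed straight to pandas.read_csv. One way to keep flags and files apart is sketched below. The helper name split_arguments and the example arguments are hypothetical, and the sketch assumes that anything starting with a dash is a flag.

```python
def split_arguments(argv):
    """Separate flags (entries starting with '-') from filenames."""
    flags = [arg for arg in argv[1:] if arg.startswith('-')]
    filenames = [arg for arg in argv[1:] if not arg.startswith('-')]
    return flags, filenames

flags, filenames = split_arguments(['gdp_plots.py', '-a', 'data/extra.csv'])
print(flags, filenames)
```

With the two lists in hand, the program can decide what each flag means before it touches any file.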

Updating the repository

Yet another successful update to the code. Let’s commit our changes.

$ git add gdp_plots.py
$ git commit -m "Adding a flag to run script for all gdp data sets."

The Right Way to Do It

If our programs can take complex parameters or multiple filenames, we shouldn’t handle sys.argv directly. Instead, we should use Python’s argparse library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users. We will not cover this module in this lesson but you can go to Tshepang Lekhonkhobe’s Argparse tutorial that is part of Python’s Official Documentation.
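To give a flavour of what that looks like, here is a minimal sketch of the same interface written with argparse. It is an illustration under our own assumptions rather than the lesson's code: the function name parse_arguments, the long option --all and the help strings are all made up.

```python
import argparse

def parse_arguments(argv):
    """Parse a list of command-line arguments (without the program name)."""
    parser = argparse.ArgumentParser(
        description='Plot GDP data from gapminder .csv files.')
    # zero or more positional filenames
    parser.add_argument('filenames', nargs='*',
                        help='data files to plot')
    # -a is stored as True when present, False otherwise
    parser.add_argument('-a', '--all', action='store_true',
                        help='plot all GDP data sets in the data folder')
    return parser.parse_args(argv)

args = parse_arguments(['-a'])
print(args.all, args.filenames)

args = parse_arguments(['data/gapminder_gdp_oceania.csv'])
print(args.all, args.filenames)
```

In a real script the call would be parse_arguments(sys.argv[1:]). argparse also generates a -h help message from the descriptions given here.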

Key Points

  • Adding command line flags can be a user-friendly way to accomplish common tasks.


Defensive Programming

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How do I predict and avoid user confusion?

Objectives
  • Ensure that programs indicate use and provide meaningful output upon failure.

Defensive Programming

In our last lesson, we created a program which will plot our gapminder gdp data for an arbitrary number of files. This is great, but we didn’t cover some of the vulnerabilities of this program we’ve created.

First, let’s try running our program without any additional arguments or flags. With the earlier version of the script, which read a single filename from sys.argv[1], this is what happens:

$ python gdp_plots.py
Traceback (most recent call last):
  File "gdp_plots.py", line 12, in <module>
    filename = sys.argv[1]
IndexError: list index out of range

Python returns an error when trying to find the command line argument in sys.argv. It cannot find that argument because we haven’t provided it to the command, and as a result there is no entry in sys.argv where we’re telling it to look for this value. We may know all of this because we’re the ones who wrote the program, but another user of the program without this experience will not. The current version, which loops over sys.argv[1:], does not raise this error; it simply does nothing, which is no more helpful.

More on Function Errors/Exceptions

Python reports a runtime error when something goes wrong while a program is executing.

age = 53
remaining = 100 - aege # mis-spelled 'age'
NameError: name 'aege' is not defined
  • The message indicates a problem with the name of a variable

Python also reports a syntax error when it can’t understand the source of a program.

print("hello world"
  File "<ipython-input-6-d1cc229bf815>", line 1
    print ("hello world"
                        ^
SyntaxError: unexpected EOF while parsing
  • The message indicates a problem on the first line of the input (“line 1”).
    • In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
    • The -6- part of the filename indicates that the error occurred in cell 6 of our Notebook.
  • Next is the problematic line of code, indicating the problem with a ^ pointer.

Returning to our program, if we run it from another directory with the -a flag:

$ cd ..
$ python swc-gapminder/gdp_plots.py -a

We see no output from the program at all. This is what is referred to as a “silent failure”. The program has failed to produce a plot, but has reported no reason why. These kinds of failures are difficult to debug and should be avoided.

It is important to employ “defensive programming” in this scenario so that our program indicates to the user

  1. what is going wrong
  2. how to correct this problem

Check Input Arguments

Let’s add a section to the code which checks the number of incoming arguments to the program and returns some information to the user if there is missing information.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py < filenames >")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name = filename.split('.')
    save_name = split_name[0] + '.png'
    plt.savefig(save_name)

If we run the program without a filename argument, here’s what we’ll see

$ python gdp_plots.py
Not enough arguments have been provided
Usage: python gdp_plots.py < filenames >
Options:
-a : plot all gdp data sets in current directory

Now if someone runs this program without having used it before (or written it themselves), they will know how to change their command to get the program running properly, rather than seeing an esoteric Python error.
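Note that the version above keeps running after it prints the usage message; the loop then simply has nothing to do. A common variation is to stop right away with a non-zero exit status so that the shell, or a calling script, can tell that something went wrong. The sketch below shows that variation. The function name check_arguments and the USAGE constant are our own, and the message text is copied from the script above.

```python
import sys

USAGE = """Not enough arguments have been provided
Usage: python gdp_plots.py < filenames >
Options:
-a : plot all gdp data sets in current directory"""

def check_arguments(argv):
    """Exit with an explanation if no arguments or flags were given."""
    if len(argv) == 1:
        print(USAGE)
        # a non-zero status tells the shell that the program failed
        sys.exit(1)

# with at least one argument the check passes silently
check_arguments(['gdp_plots.py', '-a'])
```

In the real script the call would be check_arguments(sys.argv), placed before any files are read.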

Update the Repository

We’ve just made another successful change to our repository. Let’s add a commit to the repo.

$ git add gdp_plots.py
$ git commit -m "Handling case for missing filename argument"

Check for silent errors

Silent errors can be difficult to anticipate. If we try to run our program from another directory with the -a flag, we don’t see any errors, but it also doesn’t do anything. This is because, when we use the -a flag there, no .csv files match the pattern data/*gdp*.csv relative to that directory, so our filenames variable is an empty list. Let’s add a check to ensure there are files to plot.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py < filenames >")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
    # check if no files were found and print message.
    if filenames == []:
        # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure data is located in current directory.")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name = filename.split('.')
    save_name = split_name[0] + '.png'
    plt.savefig(save_name)

Now if someone runs this program in a directory with no valid data files, a message appears.
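
The same check can be tried on its own. The sketch below is only an illustration: the helper name find_data_files and the temporary directory are assumptions made for this example and are not part of gdp_plots.py. It shows that glob.glob returns an empty list when nothing matches, and that an empty list can be tested directly in an if statement.

```python
import glob
import os
import tempfile


def find_data_files(directory, pattern="*gdp*.csv"):
    """Return the files in directory that match pattern (hypothetical helper)."""
    filenames = glob.glob(os.path.join(directory, pattern))
    if not filenames:
        # an empty list counts as False, so this catches "nothing found"
        print("No files found in", directory)
    return filenames


# try the helper on an empty temporary directory, then add one file
with tempfile.TemporaryDirectory() as tmp:
    empty_result = find_data_files(tmp)
    open(os.path.join(tmp, "gapminder_gdp_test.csv"), "w").close()
    found_result = find_data_files(tmp)
    print(len(found_result), "file found")
```

Writing `if not filenames:` is equivalent to the `filenames == []` comparison used in the script when filenames is a list.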

Update the Repository

We’ve just made another successful change to our repository. Let’s add a commit to the repo.

$ git add gdp_plots.py
$ git commit -m "Handling case if no files present in directory"

Key Points

  • Avoid silent failures.

  • Avoid esoteric output when a program fails.

  • Add checkpoints in code to check for common failures.


Refactoring

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • When should I reorganize my code so it is more clear and readable for others?

  • How can I organize my code so that it is useable in other places?

  • Why do I almost always want to write my code as though it will be used somewhere else?

Objectives
  • Understand the value of refactoring code and use of functions.

  • Practice determining where code can be divided into smaller functions.

This code works nicely for generating plots of multiple data sets, but there is now a lot of code to digest in our script. Picture yourself looking at this code for the first time. Would it be immediately clear to you where arguments are handled and where plots are generated?

It would be nice to break this work into clear chunks of code. This can be accomplished by making the argument checking section and the body of the for loop their own functions. This requires surprisingly few changes to the code, but makes it much clearer. This process is called refactoring.

Exercise: make a refactoring plan

Given the guidance above, talk with your neighbors about which parts of the script should be moved into functions. Try to think of ways to make the functions the most reusable on their own.

Solution

A possible solution:

  1. A function that parses the arguments
  2. A function that makes the plots
  3. A function that calls the other functions

This isn’t the only “right” solution, but it is a reasonable way to split things up.

Let’s refactor our script

Create a Branch

Because we’re making a major change, let’s make a new branch to work in.

$ git checkout -b refactor

Let’s break the code into 4 functions:

Below is a template that will help you write these functions. The """ syntax marks a multi-line string, which is often used as a multi-line comment. If such a string is the first thing in a function, it is known as a docstring.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """


def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """


def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """


def main():
    """
    main function - does all the work
    """



# call main
main()

Adding Docstrings

To make code human-readable, it is common to include a “docstring” (“Python Documentation String”) at the top of each function that clearly lays out three things: the purpose or objective of the function, a description of the inputs and their data types, and a description of what is returned and its data type. Including these things makes later debugging much more efficient, because the developer knows the starting point (inputs), the endpoints (returns), and the intended change (purpose).
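
As a small illustration of that three-part layout, here is a sketch of a documented function. The function make_save_name is a hypothetical example written for this note and is not part of the template. Python stores the docstring on the function itself, so it can be printed or shown with help.

```python
def make_save_name(filename):
    """
    Build the name of the image file for a data file.

    Input:
    ------
        filename: string, path to a data file

    Returns:
    --------
        save_name: string, the same path ending in .png
    """
    # keep the part before the first '.' and add the new extension
    return filename.split('.')[0] + '.png'


# the docstring travels with the function
print(make_save_name.__doc__)
print(make_save_name("gapminder_gdp_oceania.csv"))
```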

Let’s move the code into the functions now:

Exercise: refactor the code

Now that we have a plan for refactoring and a template to work from, create a new script called refactored_gdp_plot.py. Paste the template from above into the new script. Then copy and paste the code from the gdp_plots.py script into the corresponding functions.

Solution

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """
    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
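    # keep only the file name, without its directory or extension,
    # and save into figs/ (this directory must already exist)
    split_name1 = filename.split('.')[0]  # data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[-1]  # gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'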
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

    for filename in filenames:
        create_plot(filename)


def main():
    """
    main function - does all the work
    """

    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    #generate plots
    create_plots(files_to_plot)

# call main
main()

The behavior of the program hasn’t changed, but it has been made more modular by separating the code into different functions, each with its own purpose.

Why the extra function? Our function main has two primary components:

  1. parsing arguments handed to the program
  2. generating the desired plots

But we’ve defined three functions above the main function: parse_arguments, create_plot, and create_plots.

If the functions in our file directly reflected the components in main we would only have the parse_arguments and create_plots functions. If we think about using these functions independently, however, a function which always takes in a list of filenames isn’t very convenient to use on its own. By defining the create_plot function, we have placed most of the plot generation work there, while allowing for a very simple definition of the create_plots function.

The importance of this design decision will be made clear in the next lesson.

Before we commit, we need to replace gdp_plots.py with our refactored script, since we don’t want to keep two copies of the script in our repo. We only made a separate script to make it easier to copy and paste. Once you’ve tested it, either rename refactored_gdp_plot.py to gdp_plots.py, or copy the contents of refactored_gdp_plot.py into gdp_plots.py and delete refactored_gdp_plot.py.

Update the Repository

We haven’t changed the behavior of our program, but our code has changed, so let’s update the repository.

$ git add gdp_plots.py
$ git commit -m "Refactoring code."

Branching and Refactoring

To demonstrate that the behavior of our program hasn’t changed, try running it a few different ways using both the master and refactor branches. Remember that the command for checking out a branch is git checkout <branch_name>.

Now that we’re satisfied with our refactor, we can merge this branch into our master branch.

$ git checkout master
$ git merge refactor

Key Points

  • Refactoring makes code more modular, easier to read, and easier to understand.

  • Refactoring requires one to consider future implications and generally enables others to use your code more easily.


Running Scripts and Importing

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can I import some of my work even if it is part of a program?

Objectives
  • Learn how to import functions from one script into another.

  • Understand the difference between using a file as a module and running it as a script or program.

In our last lesson we learned how refactoring code makes it easier to understand by organizing it into pieces that each have their own purpose. The added modularity also makes it easier to use in other places.

First, let’s start a Jupyter notebook in our current directory and see what happens if we try to import our file as it is.

import gdp_plots

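The result is an error. Importing a file runs every top-level statement in it, so Python reaches our call to the function main at the bottom of the script. Inside a notebook, sys.argv typically holds the arguments used to start the notebook kernel rather than our data filenames, so main tries to plot files that don’t exist.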
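
To see that an import really does run a call placed at the bottom of a file, here is a minimal sketch. Everything in it is an assumption made for illustration: the stand-in file demo_script.py is written to a temporary directory, and importlib is used only so the example is self-contained. In everyday use a plain import statement does the same job.

```python
import contextlib
import importlib.util
import io
import os
import tempfile

# a tiny stand-in for our script: it defines main and then calls it
source = '''
def main():
    print("main was called")

main()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_script.py")
    with open(path, "w") as handle:
        handle.write(source)

    # load the file as a module, capturing anything it prints
    spec = importlib.util.spec_from_file_location("demo_script", path)
    module = importlib.util.module_from_spec(spec)
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        spec.loader.exec_module(module)

import_output = captured.getvalue()
print("Printed during import:", import_output.strip())
```

The message from main appears even though nothing in the importing code asked for main to run.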

Running Versus Importing

Running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.

In order for a Python script to work as expected when imported or when run as a script, we typically put the part of the script that produces output in the following if statement:

if __name__ == '__main__':
    main()  # Or whatever function produces output

When you import a Python file, __name__ is the name of that file (e.g., when importing gdp_plots.py, __name__ is 'gdp_plots'). However, when running a script in bash, __name__ is always set to '__main__' in that script so that you can determine if the file is being imported or run as a script.
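
A quick way to see both values is sketched below. It uses the standard library module glob only because our script already imports it; any imported module would do.

```python
import glob

# an imported module knows itself by its own name
print(glob.__name__)

# the file or notebook being run directly sees "__main__" here
print(__name__)

if __name__ == "__main__":
    print("This line only runs when the code is executed directly")
```

If this code were saved in a file and that file were imported from somewhere else, the second print would show the file’s name and the last line would be skipped.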

Let’s move the call to the main function in our script into a section that runs only when the program is called from the command line.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py < filenames >")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames


def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
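    # keep only the file name, without its directory or extension,
    # and save into figs/ (this directory must already exist)
    split_name1 = filename.split('.')[0]  # data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[-1]  # gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'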
    plt.savefig(save_name)


def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

    for filename in filenames:
        create_plot(filename)


def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    #generate plots
    create_plots(files_to_plot)


if __name__ == "__main__":
    # call main
    main()

Now let’s go back to the Jupyter notebook and try importing the file again.

import gdp_plots

Success! You’ve just written your first Python module. Any of the functions in that module can now be accessed in our Jupyter notebook session.

%matplotlib inline
gdp_plots.create_plot("data/gapminder_gdp_oceania.csv")

Update the repository

Back in the terminal, let’s commit these changes to our repository.

$ git add gdp_plots.py
$ git commit -m "Moving call to the main function."

Writing Modular Code

In our previous lesson we refactored our code for clarity and modularity. As part of that process we created two functions, create_plot and create_plots. Our main function never calls create_plot directly, so it may seem unnecessary. But imagine importing our module and finding that the only way to generate a plot for a single file is to add that filename to a list before passing that list to create_plots.

This might seem very strange or confusing to someone importing our module for the first time. It takes time to develop an intuition for design decisions like these, but here are a few questions to ask yourself as a guide when organizing code:

  • Are my functions able to stand on their own? Do they accomplish simple tasks?
  • Is it easy to write clear function names in my module? A function with the name “create_plots_and_export_data” is likely better off being broken up into two functions, “create_plots” and “export_data”.
  • Are my function names plural? In our example, it may have felt more natural to create only one function, “create_plots”. In these cases, it is almost always useful to create a function which does our operation once (“create_plot”) and another plural form of the function (“create_plots”) which contains a loop over the singular function.
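
The singular and plural pattern from the last question can be sketched with a pair of small functions. The names file_extension and file_extensions are hypothetical and chosen only for this illustration.

```python
def file_extension(filename):
    """Return the text after the last '.' in filename."""
    return filename.split('.')[-1]


def file_extensions(filenames):
    """Return the extension of every name in filenames."""
    extensions = []
    for filename in filenames:
        # the plural function is just a loop over the singular one
        extensions.append(file_extension(filename))
    return extensions


print(file_extension("gapminder_gdp_oceania.csv"))
print(file_extensions(["gdp_plots.py", "gapminder_gdp_asia.csv"]))
```

Someone who imports these functions can handle one file without building a list first, and the plural version stays only a few lines long.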

Key Points

  • The __name__ variable allows us to know whether the file is being imported or run as a script.


Programming Style

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How can I make my programs more readable?

  • How do most programmers format their code?

  • How can programs check their own operation?

Objectives
  • Provide sound justifications for basic rules of coding style.

  • Refactor one-page programs to make them more readable and justify the changes.

  • Use Python community coding standards (PEP-8).

Follow standard Python style in your code.

Best Practice: Write programs for people and not for computers!

Use assertions to check for internal errors.

Assertions are a simple, but powerful method for making sure that the context in which your code is executing is as you expect.

def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0
    return mass / volume

If the assertion is False, the Python interpreter raises an AssertionError runtime exception. The source code for the expression that failed will be displayed as part of the error message. To ignore assertions in your code run the interpreter with the ‘-O’ (optimize) switch. Assertions should contain only simple checks and never change the state of the program. For example, an assertion should never contain an assignment.
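
Here is a minimal sketch of what a failing assertion looks like in practice. The message after the comma is an optional addition made for this example; it is included in the error when the check fails.

```python
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    # the text after the comma is shown if the assertion fails
    assert volume > 0, "volume must be positive"
    return mass / volume


print(calc_bulk_density(10.0, 4.0))

try:
    calc_bulk_density(10.0, 0)
except AssertionError as error:
    print("Assertion failed:", error)
```

Without the assertion, the second call would fail with a ZeroDivisionError, which says less about what the caller did wrong.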

Best Practice: Plan for mistakes

Use docstrings to provide online help.

Best Practice: Document design & purpose, not just mechanics

def average(values):
    "Return average of values, or None if no values are supplied."

    if len(values) == 0:
        return None
    return sum(values) / len(values)

help(average)
Help on function average in module __main__:

average(values)
    Return average of values, or None if no values are supplied.

Multiline Strings

Multiline strings are often used for documentation. These start with three quote characters (either single or double) and end with three matching characters.

"""This string spans
multiple lines.

Blank lines are allowed."""

What Will Be Shown?

Highlight the lines in the code below that will be available as online help. Are there lines that should be made available, but won’t be? Will any lines produce a syntax error or a runtime error?

"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.

def overall_max(sequences):
    '''Determine overall maximum edit distance.'''

    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)

    # Report.
    return highest

Document This

Turn the comment on the following function into a docstring and check that help displays it properly.

def middle(a, b, c):
    # Return the middle value of three.
    # Assumes the values can actually be compared.
    values = [a, b, c]
    values.sort()
    return values[1]

Clean Up This Code

  1. Read this short program and try to predict what it does.
  2. Run it: how accurate was your prediction?
  3. Refactor the program to make it more readable. Remember to run it after each change to ensure its behavior hasn’t changed.
  4. Compare your rewrite with your neighbor’s. What did you do the same? What did you do differently, and why?

import sys
n = int(sys.argv[1])
s = sys.argv[2]
print(s)
i = 0
while i < n:
    # print('at', j)
    new = ''
    for j in range(len(s)):
        left = j-1
        right = (j+1)%len(s)
        if s[left]==s[right]: new += '-'
        else: new += '*'
    s=''.join(new)
    print(s)
    i += 1

Solution

Here’s one solution.

def string_machine(input_string, iterations):
    """
    Takes input_string and generates a new string with -'s and *'s
    corresponding to characters that have identical adjacent characters
    or not, respectively.  Iterates through this procedure with the resultant
    strings for the supplied number of iterations.
    """
    print(input_string)
    old = input_string
    for i in range(iterations):
        new = ''
        # iterate through characters in previous string
        for j in range(len(old)):
            left = j-1
            right = (j+1)%len(old) # ensure right index wraps around
            if old[left]==old[right]:
                new += '-'
            else:
                new += '*'
        print(new)
        # store new string as old
        old = new

string_machine('et cetera', 10)

Key Points

  • Follow standard Python style in your code.

  • Use docstrings to provide online help.


Wrap-Up

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What have we learned?

  • What else is out there and where do I find it?

Objectives
  • Name and locate scientific Python community sites for software, workshops, and help.

Leslie Lamport once said, “Writing is nature’s way of showing you how sloppy your thinking is.” The same is true of programming: many things that seem obvious when we’re thinking about them turn out to be anything but when we have to explain them precisely.

Python supports a large community within and outwith research.

Key Points

  • Python supports a large community within and outwith research.