Review Exercise
Overview
Teaching: 0 min
Exercises: 20 min
Questions
How can we put together all of yesterday’s material?
Objectives
Apply use of functions, conditionals and loops to solve a problem.
Review From Yesterday
In your notebook, write a function that determines whether a year between 1901 and 2000 is a leap year, where it prints a message like “1904 is a leap year” or “1905 is not a leap year” as output. Use this function to evaluate the years 1928, 1950, 1959, 1972 and 1990. Essentially, given this list of years:
years = [1928, 1950, 1959, 1972, 1990]
Produce something like:
1928 is a leap year
1950 is not a leap year.
1959 is not a leap year.
1972 is a leap year
1990 is not a leap year.
8 mod 4 equals 0 10 mod 4 equals 2
If you’re not sure where to start, see the partial answers below:
Suggested Approach
First, try to determine how to use the mod operator % to determine if a year is divisible by 4 (and thus a leap year or not). Then, create a conditional statement to use this information, and put it into a function.
Finally, create a list of the years given in the exercise. Use a for loop and your function to evaluate these years.
Modular Arithmetic
If a year in the range specified is divisible by four, it is a leap year. If a number is divisible by 4, then the arithmetic expression “number mod four” (or num % 4 in Python) will equal zero.
Conditional Statement
Fill in the blanks:
year = 1904
if year % 4 == _____:
    print(year, _______________)
______:
    print(year, "is not a leap year.")
Function
Fill in the blanks:
def leap_year(year):
    _________
Loop
Fill in the blanks:
year_list = [1928, 1950, 1959, 1972, 1990]
for year in ______:
    ________(year)
Complete Solution
def leap_year(year):
    if year % 4 == 0:
        print(year, "is a leap year")
    else:
        print(year, "is not a leap year.")

year_list = [1928, 1950, 1959, 1972, 1990]
for year in year_list:
    leap_year(year)
If you have time:
Expand your function so that it correctly categorizes any year from 0 onwards
Instead of printing whether a year is a leap year or not, save the results to a python dictionary, where there are two keys (“leap” and “not-leap”) and the values are a list of years.
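One possible approach to both extensions is sketched below. It assumes the standard Gregorian rule (divisible by 4, except century years that are not divisible by 400) is applied to every year from 0 onwards, which is a simplification of real calendar history. The function names is_leap_year and sort_years are made up for this sketch.

```python
def is_leap_year(year):
    """Return True if year is a leap year under the Gregorian rule."""
    # divisible by 4, but century years must also be divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def sort_years(years):
    """Group years into a dictionary with 'leap' and 'not-leap' keys."""
    results = {"leap": [], "not-leap": []}
    for year in years:
        if is_leap_year(year):
            results["leap"].append(year)
        else:
            results["not-leap"].append(year)
    return results


# 1900 is divisible by 100 but not by 400, so it is not a leap year
print(sort_years([1900, 1928, 1950, 2000]))
```

Returning the dictionary instead of printing inside the function keeps the decision (is it a leap year?) separate from the reporting, so the same function can be reused in other programs.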
Key Points
Use skills together.
Command-Line Programs
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How can I write Python programs that will work like Unix command-line tools?
Objectives
Use the values of command-line arguments in a program.
Read data from standard input in a program so that it can be used in a pipeline.
The Jupyter Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use that code in a program we can run from the command line. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a gapminder data set and plots the GDP of countries over time.
Switching to Shell Commands
In this lesson we are switching from typing commands in Jupyter notebooks to typing commands in a shell terminal window (such as bash). A $ in front of a command tells you to run that command in the shell rather than the Python interpreter.
Converting Notebooks
The Jupyter Notebook has the ability to convert all of the cells of a current Notebook into a Python program. To do this, go to File -> Download as and select Python (.py) to get the current notebook as a Python script.
Setting up your project
Up until now, we’ve been working in the data folder directly. Because we’re going to be dealing with more files of different types in this lesson, let’s do a little rearranging:
- On your desktop, create a folder called swc-gapminder.
- Move the data folder you’ve been using into this folder.
- Inside swc-gapminder, create a folder called figs.
- To ensure that we’re all starting with the same set of code, copy the text below into a file called gdp_plots.py in the swc-gapminder folder or download the file from here.
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col = 'country').T
# create a plot of the transposed data
ax = data.plot()
# display the plot
plt.show()
This program imports the pandas
and matplotlib
Python modules, reads
some of the gapminder data into a pandas
dataframe, and plots that
data using matplotlib
with some default settings.
We can run this program from the command line using
$ python gdp_plots.py
This is much easier than starting a notebook, going to the browser, and running each cell in the notebook to get the same result.
Initialize a repository
But before we modify our gdp_plots.py
program, we are going to put it under
version control so that we can track its changes as we go through this lesson.
$ git init
$ git add gdp_plots.py
$ git commit -m "First commit of analysis script"
Because we’re only concerned with changes to our analysis script, we are going to
create a .gitignore file for all of the gapminder .csv files and any Python notebook
(.ipynb) files we have created thus far.
$ echo "data/*.csv" > .gitignore
$ echo "*.ipynb" >> .gitignore
$ git add .gitignore
$ git commit -m "Adding ignore file"
Now that we have a clean repository, let’s get back to work on adding command line arguments to our program.
Changing code under Version Control
As it is, this plot isn’t bad, but let’s add some labels for clarity. We’ll use the data filename as a title for the plot and indicate what information is on each axis.
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = 'data/gapminder_gdp_oceania.csv'
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# display the plot
plt.show()
Now when we run this, our plot looks a little bit nicer.
$ python gdp_plots.py
Updating the Repository
$ git add gdp_plots.py
$ git commit -m "Improving plot format"
Command-Line Arguments
This program currently only works for the Oceania set of data. How might we modify
the program to work for any of the gapminder gdp data sets? We could go into the
script and change the .csv
filename to generate the same plot for different
sets of data, but there is an even better way.
Python programs can use additional arguments provided in the following manner.
$ python <program> <argument1> <argument2> <other_arguments>
The program can then use these arguments to alter its behavior. In this case, we’ll be using arguments to tell our program to operate on a specific file.
We’ll be using the sys
module to do so. sys
(short for system) is a standard
Python module used to store information about the program and its running
environment, including what arguments were passed to the program when the
command was executed. These arguments are stored as a list in sys.argv
.
These arguments can be accessed in our program by importing the sys
module. The first argument in sys.argv
is always the name of the program, so
we’ll find any additional arguments right after that in the list.
Let’s try this out in a separate script. Using the text editor of your choice, let’s write
a new program called argv_list.py containing the two following lines:
import sys
print('sys.argv is', sys.argv)
The strange name argv
stands for “argument values”. Whenever Python runs a
program, it takes all of the values given on the command line and puts them in
the list sys.argv
so that the program can determine what they were. If we run
this program with no arguments:
$ python argv_list.py
sys.argv is ['argv_list.py']
the only thing in the list is the name of our script, exactly as it was typed on the command line,
which is always sys.argv[0].
If we run it with a few arguments, however:
$ python argv_list.py first second third
sys.argv is ['argv_list.py', 'first', 'second', 'third']
then Python adds each of those arguments to that magic list.
Using this new information, let’s add command line arguments to our
gdp_plots.py
program.
To do this, we’ll make two changes:
- add the import of the sys module at the beginning of the program.
- replace the filename (“data/gapminder_gdp_oceania.csv”) with the second entry in the sys.argv list.
Now our program should look as follows:
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()
Let’s take a look at what happens when we provide a gapminder filename to the program.
$ python gdp_plots.py data/gapminder_gdp_oceania.csv
The same plot as before is displayed, but the file name is now being read from an argument we’ve provided on the command line. We can now do this for files with similar information and get the same set of plots for that data without any changes to our program’s code. Try this out for yourself now.
Update the Repository
Now that we’ve made this change to our program and seen that it works, let’s update our repository with these changes.
$ git add gdp_plots.py
$ git commit -m "Adding command line arguments"
Exercise: read multiple files
Try to run the gdp_plots.py so that it reads in all the .csv files in the data folder using the wildcard symbol. Does it work? Why or why not?
Solution
If you run it with the argument ‘data/*.csv’ you get an error on the Americas file because it has an extra column. However, it works if you omit that file.
Key Points
The sys library connects a Python program to the system it is running on.
The variable sys.argv is a list with each item being a command-line argument.
Trying Different Methods
Overview
Teaching: 5 min
Exercises: 25 min
Questions
How do I plot multiple data sets using different methods?
Objectives
Read data from standard input in a program so that it can be used in a pipeline.
Compare using different methods to accomplish the same task.
Practice making branches and merging in a Git repository.
Handling Multiple Files
Perhaps we would like a script that can operate on multiple data files at once. This can actually be done in two different ways: (1) writing a bash script that calls our already-existing Python script in a for-loop, or (2) modifying our Python script to read multiple files in a for-loop. We will try both methods and compare the two.
Create New Branches
First we will create two new branches where we can develop each of these
two different methods. We will call these branches python-multi-files
and
bash-multi-files
$ git branch python-multi-files
$ git branch bash-multi-files
We can check that these two branches were created with $ git branch -a
.
Handling Multiple Files with Python
First, we’ll try using just Python to loop through multiple files. Let’s switch to our Python branch.
$ git checkout python-multi-files
We want our program to process each file separately, and the easiest way to do
this is a loop that executes our plotting statements once for each of the filenames provided in sys.argv.
But we need to be careful: sys.argv[0]
will always be the name of our program,
rather than the name of a file. We also need to handle an unknown number of
filenames, since our program could be run for any number of files.
A solution is to loop over the contents of sys.argv[1:]
. The ‘1’ tells
Python to start the slice at location 1, so the program’s name isn’t included.
Since we’ve left off the upper bound, the slice runs to the end of the list, and
includes all the filenames.
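The slicing behaviour described above can be tried without running a script from the shell. The sketch below uses ordinary lists as stand-ins for sys.argv; the names example_argv and no_files, and the file names inside them, are made up for illustration.

```python
# a stand-in for sys.argv, as if we had run:
#   python gdp_plots.py data/a.csv data/b.csv
example_argv = ["gdp_plots.py", "data/a.csv", "data/b.csv"]

# entry 0 is the program name
print(example_argv[0])

# the slice skips entry 0 and keeps everything after it
print(example_argv[1:])

# with no extra arguments the slice is simply an empty list,
# so a for loop over it runs zero times
no_files = ["gdp_plots.py"]
print(no_files[1:])
```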
Here is our updated program.
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # display the plot
    plt.show()
Now when the program is given multiple filenames
$ python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
one plot for each filename is generated.
Update the Repository
$ git add gdp_plots.py
$ git commit -m "Allowing plot generation for multiple files at once"
Saving Figures
By using plt.show() with multiple files, the program stops each
time a figure is generated and the user must close it to continue.
To avoid this slowdown, we can replace this call with plt.savefig() and
view all the figures after the script finishes. This function has one
required argument, which is the filename to save the figure as. The filename
must have a valid image extension (e.g. .png or .jpg).
Let’s replace our plt.show() with plt.savefig('figs/gdp-plot.png'). This saves the image into the figs directory we created when setting up the project; if it does not exist yet, create it first using mkdir figs.
Our new script should look like this:
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot
    plt.savefig('figs/gdp-plot.png')
If we look at the contents of our folder now, we should have a new
file called gdp-plot.png
. But why is there only one when we supplied
multiple data files? This is because each time the plot is created,
it is being saved as the same file name and overwriting the previous
plot.
We can fix this by creating a unique file name each time. A simple
unique name can be used based on the original file name. We can use
Python’s split()
function to split filename
, which is a string,
by any character. This returns a list like so:
name = 'my-data.csv'
split_name = name.split('.')
print(split_name)
print(split_name[0])
['my-data', 'csv']
my-data
We’ll split the original file name and use the first part to rename
our plot. And then we will concatenate .png
to the name to specify
our file type.
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)
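The two split() calls work for paths shaped exactly like data/gapminder_gdp_XXX.csv, but they would give unexpected results for a file name with an extra dot or a file in a deeper folder. As an alternative (not the method used in this lesson), the standard os.path module can do the same job. This is a minimal sketch; the function name make_save_name and its out_dir parameter are made up for illustration.

```python
import os


def make_save_name(filename, out_dir="figs"):
    """Build an output image name such as figs/<data file name>.png."""
    # drop any leading folders: data/gapminder_gdp_oceania.csv -> gapminder_gdp_oceania.csv
    base = os.path.basename(filename)
    # drop only the final extension: gapminder_gdp_oceania.csv -> gapminder_gdp_oceania
    stem = os.path.splitext(base)[0]
    return out_dir + "/" + stem + ".png"


print(make_save_name("data/gapminder_gdp_oceania.csv"))
```

For the file above this prints figs/gapminder_gdp_oceania.png, the same name the split() version builds.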
Updating the repository
Yet another successful update to the code. Let’s
commit our changes. If we do git status
we see we also have image
files untracked. Let’s ignore those files because they will likely
change as our data changes.
$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"
$ git add gdp_plots.py
$ git commit -m "Saves each figure as a separate file."
Handling Multiple Files with Bash
Now that we’ve created a Python script to save multiple files at once,
let’s try to do the same thing in Bash. We’ll leave our current branch
alone and switch to our bash-multi-files
branch.
$ git checkout bash-multi-files
If we look at our gdp_plots.py file, it is not in a for-loop format
because this branch is not up to date with the other branch. Our script should look
like this:
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# display the plot
plt.show()
Let’s use bash to generate multiple plots by calling our Python script in a bash for-loop. First, let’s create a bash file for us to edit.
$ touch gdp_plots.sh
In this script, we’ll write a for-loop that will call our gdp_plots.py script
on multiple files. We can break up our long list of files by using a
backslash \ and writing the rest on the next line.
for filename in data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
do
python gdp_plots.py $filename
done
We can run our script to see if it works:
$ bash gdp_plots.sh
When we run this, we see that it stops to show us each plot like before. Let’s update our script to save the figure like before.
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1]
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)
When we run the script again, we should have new image files generated.
Updating the repository
Yet another successful update to the code. Let’s commit our changes. Since our Python and bash scripts had somewhat unrelated changes, let’s make two separate commits. We will also ignore our images like before.
$ git add gdp_plots.sh
$ git commit -m "Wrote bash script to call python plotter."
$ git add gdp_plots.py
$ git commit -m "Saves figures with unique name."
$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"
Comparing Methods
We have successfully developed two different methods for accomplishing
the same task. This is common to do in software development when there is not
a clear path forward. Let’s compare our two methods and decide which to
merge into our master
branch.
One comparison we might be interested in is how fast each is. We can
use bash’s time command to measure how long each script takes to run.
To do this, we just add time before the command when we run it,
and it will give us timing information when the command has completed.
While we are on our bash branch, we’ll time that script first.
$ time bash gdp_plots.sh
real 0m6.031s
user 0m5.535s
sys 0m0.388s
We are most interested in the “real” time in the output, which is the elapsed time we experience.
Let’s checkout our python branch and time our script there.
$ git checkout python-multi-files
$ time python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
real 0m3.163s
user 0m3.002s
sys 0m0.132s
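As we can see, our Python method ran faster than the bash method. This is probably because the bash loop starts a new Python interpreter, and imports pandas and matplotlib again, for every file. For this reason, we will merge our Python branch into master.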
$ git checkout master
$ git merge python-multi-files
Another advantage to the Python method over the bash method is that we do not need to change a file and commit it each time we want to create plots from different data.
More Practice with Multiple Files in Python
Finding Particular Files
Using the glob module, write a simple version of ls that shows files in the current directory with a particular suffix. A call to this script should look like this:
$ python my_ls.py py
left.py right.py zero.py
Solution
import sys
import glob

def main():
    '''prints names of all files with sys.argv as suffix'''
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return

main()
Counting Lines
Write a program called line_count.py that works like the Unix wc command:
- If no filenames are given, it reports the number of lines in standard input.
- If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
Solution
import sys

def main():
    '''print each input filename and the number of lines in it,
       and print the sum of the number of lines'''
    filenames = sys.argv[1:]
    sum_nlines = 0 #initialize counting variable

    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for f in filenames:
            n = count_file(f)
            print('%s %d' % (f, n))
            sum_nlines += n
        print('total: %d' % sum_nlines)

def count_file(filename):
    '''count the number of lines in a file'''
    f = open(filename,'r')
    nlines = len(f.readlines())
    f.close()
    return(nlines)

def count_file_like(file_like):
    '''count the number of lines in a file-like object (eg stdin)'''
    n = 0
    for line in file_like:
        n = n+1
    return n

main()
Key Points
Make different branches in a Git repository to try different methods.
Use bash’s
time
command to time scripts.
Program Flags
Overview
Teaching: 5 min
Exercises: 5 min
Questions
How can I make an easy shortcut to analyze all files at once using a program flag?
Objectives
Handle flags and files separately in a command-line program.
Handling Program Flags
Now we have a program which is capable of handling any number of data sets at once.
But what if we have 50 GDP data sets? It would be awfully tedious to type in the names of 50 files in the command line, so let’s add a flag to our program indicating that we would like it to generate a plot for each data set in the current directory.
Flags are a convention used in programming to indicate to a program that a non-default behavior is being requested by the user. In this case, we’ll be using a “-a” flag to indicate to our program we would like it to operate on all data sets in our directory.
To explore what files are in the current directory, we’ll be using Python’s glob module.
- In Unix, the term “globbing” means “matching a set of files with a pattern”.
- The most common patterns are:
  - * meaning “match zero or more characters”
  - ? meaning “match exactly one character”
- Python contains the glob library to provide pattern matching functionality.
- The glob library contains a function also called glob to match file patterns.
- E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
- Result is a (possibly empty) list of character strings.
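To see these points in action without touching our real data, the sketch below creates a few empty placeholder files in a temporary folder and then globs for them. The file names are invented for illustration; only glob.glob itself is the function described above.

```python
import glob
import os
import tempfile

# build a throwaway folder containing two CSV files and one text file
with tempfile.TemporaryDirectory() as folder:
    for name in ["gapminder_gdp_asia.csv", "gapminder_gdp_europe.csv", "notes.txt"]:
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("placeholder\n")

    # '*' matches zero or more characters, so this finds both CSV files
    csv_files = sorted(glob.glob(os.path.join(folder, "*gdp*.csv")))
    short_names = [os.path.basename(path) for path in csv_files]

    # a pattern that matches nothing gives an empty list, not an error
    no_match = glob.glob(os.path.join(folder, "*.png"))

print(short_names)
print(no_match)
```

The results are sorted here because glob.glob does not promise any particular order.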
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)
Updating the repository
Yet another successful update to the code. Let’s commit our changes.
$ git add gdp_plots.py
$ git commit -m "Adding a flag to run script for all gdp data sets."
The Right Way to Do It
If our programs can take complex parameters or multiple filenames, we shouldn’t handle sys.argv directly. Instead, we should use Python’s argparse library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users. We will not cover this module in this lesson, but you can go to Tshepang Lekhonkhobe’s Argparse tutorial that is part of Python’s Official Documentation.
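For the curious, here is a minimal sketch of how the same two inputs (a list of file names and an -a flag) might be declared with argparse. It is not part of the lesson's script: the function name parse_arguments, the long option --all and the help texts are assumptions made for this example.

```python
import argparse


def parse_arguments(argument_list=None):
    """Parse file names and the -a flag; uses sys.argv[1:] when no list is given."""
    parser = argparse.ArgumentParser(
        description="Plot GDP data from gapminder CSV files.")
    # zero or more positional file names
    parser.add_argument("filenames", nargs="*", help="CSV files to plot")
    # a flag that is False unless -a (or --all) is given
    parser.add_argument("-a", "--all", action="store_true",
                        help="plot every GDP data set in the data folder")
    return parser.parse_args(argument_list)


# passing lists here stands in for typing arguments on the command line
args = parse_arguments(["-a"])
print(args.all, args.filenames)

args = parse_arguments(["data/gapminder_gdp_oceania.csv"])
print(args.all, args.filenames)
```

One benefit of this approach is that argparse builds a help message from these declarations, which is shown when the program is run with -h.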
Key Points
Adding command line flags can be a user-friendly way to accomplish common tasks.
Defensive Programming
Overview
Teaching: 10 min
Exercises: 5 min
Questions
How do I predict and avoid user confusion?
Objectives
Ensure that programs indicate use and provide meaningful output upon failure.
Defensive Programming
In our last lesson, we created a program which will plot our gapminder gdp data for an arbitrary number of files. This is great, but we didn’t cover some of the vulnerabilities of this program we’ve created.
- What happens if we run the program without any arguments at all?
- What happens if we run the program from another directory?
First, let’s try running our program without any additional arguments or flags.
$ python gdp_plots.py
Traceback (most recent call last):
File "gdp_plot.py", line 12, in <module>
filenames = sys.argv[1:]
IndexError: list index out of range
Python raises an error when the script tries to look up a command-line argument in
sys.argv that is not there. Indexing a single entry (for example sys.argv[1]) raises
this IndexError when no argument was given, whereas a slice such as sys.argv[1:]
simply produces an empty list, in which case the program finishes without plotting anything.
Either way, the user gets no hint about what to do next. We may know all of this because
we’re the ones who wrote the program, but another user of the program without this experience will not.
More on Function Errors/Exceptions
Python reports a runtime error when something goes wrong while a program is executing.
age = 53
remaining = 100 - aege # mis-spelled 'age'
NameError: name 'aege' is not defined
- The message indicates a problem with the name of a variable
Python also reports a syntax error when it can’t understand the source of a program.
print("hello world"
  File "<ipython-input-6-d1cc229bf815>", line 1
    print ("hello world"
                        ^
SyntaxError: unexpected EOF while parsing
- The message indicates a problem on first line of the input (“line 1”).
- In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
- The -6- part of the filename indicates that the error occurred in cell 6 of our Notebook.
- Next is the problematic line of code, indicating the problem with a ^ pointer.
And if we run the program from another directory:
$ cd ..
$ python swc-gapminder/gdp_plots.py -a
We see no output from the program at all. This is what is referred to as a “silent failure”. The program has failed to produce a plot, but has reported no reason why. These kinds of failures are difficult to debug and should be avoided.
It is important to employ “defensive programming” in this scenario so that our program indicates to the user
- what is going wrong
- how to correct this problem
Check Input Arguments
Let’s add a section to the code which checks the number of incoming arguments to the program and returns some information to the user if there is missing information.
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py < filenames >")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name = filename.split('.')
    save_name = split_name[0] + '.png'
    plt.savefig(save_name)
If we run the program without a filename argument, here’s what we’ll see
$ python gdp_plots.py
Not enough arguments have been provided
Usage: python gdp_plots.py < filenames >
Options:
-a : plot all gdp data sets in current directory
Now if someone runs this program without having used it before (or written it themselves), they will know how to change their command to get the program running properly, rather than seeing an esoteric Python error.
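The same check can also be written as a small function, which makes it easy to try with different argument lists. This is a sketch rather than the lesson's code: the function name check_arguments and the idea of returning the message (instead of printing it directly) are choices made for this example.

```python
def check_arguments(argv):
    """Return a usage message if no arguments were given, otherwise None."""
    if len(argv) == 1:
        # only the program name is present: explain what went wrong
        # and how the command can be corrected
        return "\n".join([
            "Not enough arguments have been provided",
            "Usage: python gdp_plots.py < filenames >",
            "Options:",
            "-a : plot all gdp data sets in current directory",
        ])
    return None


# lists standing in for sys.argv
message = check_arguments(["gdp_plots.py"])
if message is not None:
    print(message)
    # a real script might stop here, for example by calling sys.exit(1)

print(check_arguments(["gdp_plots.py", "-a"]))
```

Stopping the program after printing the message is optional here, because an empty list of file names means the plotting loop has nothing to do anyway.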
Update the Repository
We’ve just made another successful change to our repository. Let’s add a commit to the repo.
$ git add gdp_plots.py
$ git commit -m "Handling case for missing filename argument"
Check for silent errors
Silent errors can be difficult to anticipate. If we try to run our program
from another directory with the -a flag, we don’t see any errors, but it
also doesn’t do anything. This is because when we use the -a flag there,
no files match the data/*gdp*.csv pattern, so our filenames variable is
empty. Let’s add a check to ensure there are files to plot.
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py < filenames >")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")

    # check if no files were found and print message.
    if filenames == []:
        # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure data is located in current directory.")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name = filename.split('.')
    save_name = split_name[0] + '.png'
    plt.savefig(save_name)
Now if someone runs this program in a directory with no valid datafiles, a message appears.
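The same check can also be packaged as a small function, which makes it easy to try out on its own. This is only a sketch: the function name find_gdp_files and its directory parameter are hypothetical and do not appear in the lesson's script.

```python
import glob
import os


def find_gdp_files(directory="data"):
    """Return the GDP csv files in `directory`, warning when there are none."""
    pattern = os.path.join(directory, "*gdp*.csv")
    filenames = sorted(glob.glob(pattern))
    if not filenames:
        # report the problem instead of silently doing nothing
        print("No files found matching", pattern)
    return filenames


# look in the current directory; prints a message if nothing matches
print(find_gdp_files("."))
```

Printing the pattern that was searched gives the user a concrete hint about where the program expected to find data.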
Update the Repository
We’ve just made another successful change to our repository. Let’s add a commit to the repo.
$ git add gdp_plots.py
$ git commit -m "Handling case if no files present in directory"
Key Points
Avoid silent failures.
Avoid esoteric output when a program fails.
Add checkpoints in code to check for common failures.
Refactoring
Overview
Teaching: 10 min
Exercises: 10 min
Questions
When should I reorganize my code so it is more clear and readable for others?
How can I organize my code so that it is useable in other places?
Why do I almost always want to write my code as though it will be used somewhere else?
Objectives
Understand the value of refactoring code and use of functions.
Practice determining where code can be divided into smaller functions.
This code works nicely for generating plots of multiple data sets, but there is now a lot of code to digest in our script. Picture yourself looking at this code for the first time. Would it be immediately clear to you where arguments are handled and where plots are generated?
It would be nice to break this work into clear chunks of code. This can be accomplished by making the argument checking section and the body of the for loop their own functions. This requires surprisingly few changes to the code, but makes it much clearer. This process is called refactoring.
Exercise: make a refactoring plan
Given the guidance above, talk with your neighbors about which parts of the script should be moved into functions. Try to think of ways to make the functions the most reusable on their own.
Solution
A possible solution:
- A function that parses the arguments
- A function that makes the plots
- A function that calls the other functions
This isn’t the only “right” solution, but it is a reasonable way to split things up.
Let’s refactor our script
Create a Branch
Because we’re making a major change, let’s make a new branch to work in.
$ git checkout -b refactor
Let’s break the code into 4 functions:
- parse_arguments() - gets the input from argv[], returns a list of file names
- create_plot() - takes one file name as input, creates one plot and writes it to the figs folder
- create_plots() - takes a list of files as input, calls create_plot() for each element in the list
- main() - calls parse_arguments() and create_plots()
Below is a template that will help you write these functions. The """ syntax creates a multi-line string, which is often used as a multi-line comment.
When such a string is the first thing in a function, it is known as a docstring.
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.
    Input:
    ------
        argument list (normally sys.argv[1:])
    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.
    Input:
    ------
        filename: string, path to file to plot
    Returns:
    --------
        none
    """

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.
    Input:
    ------
        filenames: list of strings, list of files to plot
    Returns:
    --------
        none
    """

def main():
    """
    main function - does all the work
    """

# call main
main()
Adding Docstrings
In an effort to create human-readable code, it is common to include a “docstring” or “Python Documentation String” at the top of each function that clearly lays out three things: the purpose or objective of the function, a description of the inputs and their data types, and a description of what is returned and its data type. By including these things, debugging later on is much more efficient because the developer knows the starting point (inputs), the endpoint (returns), and what the intended change is (purpose).
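As an illustration of those three parts, here is a sketch of a small documented function written in the same docstring layout as the template. The function make_save_name and its fig_dir parameter are hypothetical and are not part of the lesson's script.

```python
import os


def make_save_name(filename, fig_dir="figs"):
    """
    Build the path of the figure file for a data file.

    Input:
    ------
        filename: string, path to a data file
        fig_dir: string, folder where figures are written

    Returns:
    --------
        save_name: string, path of the .png file to create
    """
    base = os.path.basename(filename)   # drop the folder part
    stem = os.path.splitext(base)[0]    # drop the .csv extension
    return fig_dir + "/" + stem + ".png"


print(make_save_name("data/gapminder_gdp_oceania.csv"))
```

A reader can tell from the docstring alone what goes in and what comes out, without reading the body.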
Let’s move the code into the functions now:
Exercise: refactor the code
Now that we have a plan for refactoring and a template to work from, create a new script called refactored_gdp_plot.py. Paste the template from above into the new script. Then copy and paste the code from the gdp_plots.py script into the corresponding functions.
Solution
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.
    Input:
    ------
        argument list (normally sys.argv[1:])
    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        # look in the data folder, as the original script does
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.
    Input:
    ------
        filename: string, path to file to plot
    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = filename.split('/')[1]
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.
    Input:
    ------
        filenames: list of strings, list of files to plot
    Returns:
    --------
        none
    """
    for filename in filenames:
        create_plot(filename)

def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    #generate plots
    create_plots(files_to_plot)

# call main
main()
The behavior of the program hasn’t changed, but it has been made more modular by separating the code into different functions, each with its own purpose.
Why the extra function? Our function main has two primary components to it:
- parsing arguments handed to the program
- generating the desired plots
But we’ve defined three functions above the main function: parse_arguments, create_plot, and create_plots.
If the functions in our file directly reflected the components in main, we would only have the parse_arguments and create_plots functions. If we think about using these functions independently, however, a function which always takes in a list of filenames isn’t very convenient to use on its own. By defining the create_plot function, we have placed most of the plot generation work there, while allowing for a very simple definition of the create_plots function.
The importance of this design decision will be made clear in the next lesson.
Before we commit we need to rename our refactored_gdp_plot.py script to gdp_plots.py, since we don’t want to keep two copies of this script around in our repo. We only made this as a separate script to make it easier to copy-paste. Once you’ve tested it, you can either rename refactored_gdp_plot.py to gdp_plots.py, or copy the contents of refactored_gdp_plot.py into gdp_plots.py and delete refactored_gdp_plot.py.
Update the Repository
We haven’t changed the behavior of our program, but our code has changed, so let’s update the repository.
$ git add gdp_plots.py
$ git commit -m "Refactoring code."
Branching and Refactoring
To demonstrate that the behavior of our program hasn’t changed, try running it a few different ways using both the master and refactor branches. Remember that the command for checking out a branch is git checkout <branch_name>.
Now that we’re satisfied with our refactor, we can merge this branch into our master branch.
$ git checkout master
$ git merge refactor
Key Points
Refactoring makes code more modular, easier to read, and easier to understand.
Refactoring requires one to consider future implications and generally enables others to use your code more easily.
Running Scripts and Importing
Overview
Teaching: 10 min
Exercises: 10 min
Questions
How can I import some of my work even if it is part of a program?
Objectives
Learn how to import functions from one script into another.
Understand the difference between using a file as a module and running it as a script or program.
In our last lesson we learned how refactoring code makes it easier to understand and organizes it into pieces that each have their own purpose. The added modularity also makes it easier to use in other places.
First, let’s start a Jupyter notebook in our current directory and see what happens if we try to import our file as it is.
import gdp_plots
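The result is an error related to how Python is attempting to interpret the file. This is because importing a file runs all of its top-level code, so Python encounters our call to the function main, which then tries to treat the notebook's own command-line arguments as files to plot.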
Running Versus Importing
Running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.
In order for a Python script to work as expected when imported or when run as a script, we typically put the part of the script that produces output in the following if statement:
if __name__ == '__main__':
    main()  # Or whatever function produces output
When you import a Python file, __name__ is the name of that file (e.g., when importing pandas_plots.py, __name__ is 'pandas_plots'). However, when running a script in bash, __name__ is always set to '__main__' in that script so that you can determine if the file is being imported or run as a script.
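To see both values of __name__ side by side, the sketch below writes a tiny throwaway module, imports it, and then runs the same file as a script. The module name name_demo and the temporary folder are assumptions made for this demonstration only.

```python
import importlib
import os
import subprocess
import sys
import tempfile

# a tiny throwaway module (hypothetical name) that records its __name__
source = (
    'seen_name = __name__\n'
    'if __name__ == "__main__":\n'
    '    print("run as script:", __name__)\n'
)

tmp_dir = tempfile.mkdtemp()
script_path = os.path.join(tmp_dir, "name_demo.py")
with open(script_path, "w") as handle:
    handle.write(source)

# importing: __name__ is the module's name and the guarded block is skipped
sys.path.insert(0, tmp_dir)
name_demo = importlib.import_module("name_demo")
print("imported:", name_demo.seen_name)

# running as a script: __name__ is "__main__" and the guarded block runs
result = subprocess.run(
    [sys.executable, script_path],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

The import prints nothing from inside the module, while running the file produces the guarded message.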
Let’s add the main function in our script to a section which identifies the program as being called from the command line.
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.
    Input:
    ------
        argument list (normally sys.argv[1:])
    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py < filenames >")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.
    Input:
    ------
        filename: string, path to file to plot
    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = filename.split('/')[1]
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.
    Input:
    ------
        filenames: list of strings, list of files to plot
    Returns:
    --------
        none
    """
    for filename in filenames:
        create_plot(filename)

def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    #generate plots
    create_plots(files_to_plot)

if __name__ == "__main__":
    # call main
    main()
Now let’s go back to the Jupyter notebook and try importing the file again.
import gdp_plots
Success! You’ve just written your first Python module. Any of the functions in that module can now be accessed in our Jupyter notebook session.
%matplotlib inline
gdp_plots.create_plot("data/gapminder_gdp_oceania.csv")
Update the repository
Back in the terminal, let’s commit these changes to our repository.
$ git add gdp_plots.py
$ git commit -m "Moving call to the main function."
Writing Modular Code
In our previous lesson we refactored our code for clarity and modularity. As part of that process we created two functions, create_plot and create_plots. While main never calls create_plot directly, imagine importing our module and finding that the only way to generate a plot for a single file is to add that filename to a list before passing that list to create_plots.
This might seem very strange or confusing to someone importing our module for the first time. It takes time to develop an intuition for design decisions like these, but here are a few questions to ask yourself as a guide when organizing code:
- Are my functions able to stand on their own? Do they accomplish simple tasks?
- Is it easy to write clear function names in my module? A function with the name “create_plots_and_export_data” is likely better off being broken up into two functions, “create_plots” and “export_data”.
- Are my function names plural? In our example, it may have felt more natural to create only one function, “create_plots”. In these cases, it is almost always useful to create a function which does our operation once (“create_plot”) and another plural form of the function (“create_plots”) which contains a loop over the singular function.
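The singular and plural pattern from the last question can be sketched with a pair of small functions. The names make_title and make_titles are hypothetical and only illustrate the shape; they are not part of gdp_plots.py.

```python
import os


def make_title(filename):
    """Return a readable plot title for one data file."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return stem.replace("_", " ")


def make_titles(filenames):
    """Return a plot title for each data file in a list."""
    # the plural form is just a loop over the singular form
    return [make_title(filename) for filename in filenames]


print(make_title("data/gapminder_gdp_oceania.csv"))
print(make_titles(["data/gapminder_gdp_asia.csv",
                   "data/gapminder_gdp_europe.csv"]))
```

Someone importing these functions can handle a single file without wrapping it in a list first, and the plural version stays a one-liner.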
Key Points
The __name__ variable allows us to know whether the file is being imported or run as a script.
Programming Style
Overview
Teaching: 5 min
Exercises: 5 min
Questions
How can I make my programs more readable?
How do most programmers format their code?
How can programs check their own operation?
Objectives
Provide sound justifications for basic rules of coding style.
Refactor one-page programs to make them more readable and justify the changes.
Use Python community coding standards (PEP-8).
Follow standard Python style in your code.
- PEP8: a style guide for Python that discusses topics such as how you should name variables, how you should use indentation in your code, how you should structure your import statements, etc. Adhering to PEP8 makes it easier for other Python developers to read and understand your code, and to understand what their contributions should look like. The PEP8 application and Python library can check your code for compliance with PEP8.
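As a rough illustration of a few of these conventions, the sketch below puts each import on its own line, uses lowercase names with underscores, indents with four spaces, and writes keyword arguments without spaces around the equals sign. The function first_files and the constant MAX_FILES are hypothetical examples, not part of the lesson's code.

```python
import glob
import os

MAX_FILES = 3  # module-level constants are written in upper case


def first_files(directory, limit=MAX_FILES):
    """Return at most `limit` csv file names found in `directory`."""
    pattern = os.path.join(directory, "*.csv")
    matches = sorted(glob.glob(pattern))
    return matches[:limit]


print(first_files("."))
```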
Best Practice: Write programs for people and not for computers!
Use assertions to check for internal errors.
Assertions are a simple, but powerful method for making sure that the context in which your code is executing is as you expect.
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0
    return mass / volume
If the assertion is False, the Python interpreter raises an AssertionError runtime exception. The source code for the expression that failed will be displayed as part of the error message. To ignore assertions in your code, run the interpreter with the ‘-O’ (optimize) switch. Assertions should contain only simple checks and never change the state of the program. For example, an assertion should never contain an assignment.
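A failing assertion can be observed safely by catching the exception, as in this sketch. The message text after the comma is an optional addition made here for illustration; the lesson's version of the function has no message.

```python
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    # the text after the comma is shown when the check fails
    assert volume > 0, "volume must be positive"
    return mass / volume


print(calc_bulk_density(10.0, 4.0))

try:
    calc_bulk_density(10.0, 0)
except AssertionError as error:
    print("AssertionError:", error)
```

The check stops the calculation before a division by zero can happen and says why.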
Best Practice: Plan for mistakes
Use docstrings to provide online help.
Best Practice: Document design & purpose, not just mechanics
- If the first thing in a function is a character string that is not assigned to a variable, Python attaches it to the function as the online help.
- Called a docstring (short for “documentation string”).
def average(values):
    "Return average of values, or None if no values are supplied."
    if len(values) == 0:
        return None
    # divide by the number of values
    return sum(values) / len(values)
help(average)
Help on function average in module __main__:
average(values)
Return average of values, or None if no values are supplied.
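The text that help displays is stored on the function itself, in its __doc__ attribute. The sketch below defines average so that it divides by len(values) and then reads the docstring back directly.

```python
def average(values):
    "Return average of values, or None if no values are supplied."
    if len(values) == 0:
        return None
    return sum(values) / len(values)


# the docstring is an ordinary string attached to the function
print(average.__doc__)
print(average([1, 2, 3]))
print(average([]))
```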
Multiline Strings
Multiline strings are often used for documentation. These start with three quote characters (either single or double) and end with three matching characters.
"""This string spans
multiple lines.

Blank lines are allowed."""
What Will Be Shown?
Highlight the lines in the code below that will be available as online help. Are there lines that should be made available, but won’t be? Will any lines produce a syntax error or a runtime error?
"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.

def overall_max(sequences):
    '''Determine overall maximum edit distance.'''

    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)

    # Report.
    return highest
Document This
Turn the comment on the following function into a docstring and check that help displays it properly.
def middle(a, b, c):
    # Return the middle value of three.
    # Assumes the values can actually be compared.
    values = [a, b, c]
    values.sort()
    return values[1]
Clean Up This Code
- Read this short program and try to predict what it does.
- Run it: how accurate was your prediction?
- Refactor the program to make it more readable. Remember to run it after each change to ensure its behavior hasn’t changed.
- Compare your rewrite with your neighbor’s. What did you do the same? What did you do differently, and why?
import sys
n = int(sys.argv[1])
s = sys.argv[2]
print(s)
i = 0
while i < n:
    # print('at', j)
    new = ''
    for j in range(len(s)):
        left = j-1
        right = (j+1)%len(s)
        if s[left]==s[right]: new += '-'
        else: new += '*'
    s=''.join(new)
    print(s)
    i += 1
Solution
Here’s one solution.
def string_machine(input_string, iterations):
    """
    Takes input_string and generates a new string with -'s and *'s
    corresponding to characters that have identical adjacent characters
    or not, respectively. Iterates through this procedure with the resultant
    strings for the supplied number of iterations.
    """
    print(input_string)
    old = input_string
    for i in range(iterations):
        new = ''
        # iterate through characters in previous string
        for j in range(len(old)):
            left = j-1
            right = (j+1)%len(old) # ensure right index wraps around
            if old[left]==old[right]:
                new += '-'
            else:
                new += '*'
        print(new)
        # store new string as old
        old = new

string_machine('et cetera', 10)
Key Points
Follow standard Python style in your code.
Use docstrings to provide online help.
Wrap-Up
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What have we learned?
What else is out there and where do I find it?
Objectives
Name and locate scientific Python community sites for software, workshops, and help.
Leslie Lamport once said, “Writing is nature’s way of showing you how sloppy your thinking is.” The same is true of programming: many things that seem obvious when we’re thinking about them turn out to be anything but when we have to explain them precisely.
Python supports a large community within and outwith research.
- The Python 3 documentation covers the core language and the standard library.
- PyCon is the largest annual conference for the Python community.
- SciPy is a rich collection of scientific utilities. It is also the name of a series of annual conferences.
- Jupyter is the home of the Jupyter Notebook.
- Pandas is the home of the Pandas data library.
- Stack Overflow’s general Python section can be helpful, as can the sections on NumPy, SciPy, Pandas, and other topics.
Key Points
Python supports a large community within and outwith research.