Content from Review Exercise
Last updated on 2025-02-10
Overview
Questions
- How can we put together all of yesterday’s material?
Objectives
- Apply use of functions, conditionals and loops to solve a problem.
Review From Yesterday - All in One Exercise
In your notebook, write a function that determines whether a year between 1901 and 2000 is a leap year, where it prints a message like “1904 is a leap year” or “1905 is not a leap year” as output. Use this function to evaluate the years 1928, 1950, 1959, 1972 and 1990. Essentially, given this list of years:
Produce something like:
OUTPUT
1928 is a leap year
1950 is not a leap year
1959 is not a leap year
1972 is a leap year
1990 is not a leap year
Hint: the percent symbol ‘%’ is the modulo operator in Python. So:
OUTPUT
8 mod 4 equals 0
10 mod 4 equals 2
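The output above can be reproduced with two print calls. This is only a minimal sketch; the wording of the messages is an assumption based on the output shown.

```python
# the % operator returns the remainder left over after division
print('8 mod 4 equals', 8 % 4)
print('10 mod 4 equals', 10 % 4)
```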
If you’re not sure where to start, see the fill-in-the-blank version of this exercise below.
Review From Yesterday - Step by Step Breakdown
First, try to determine how to use the mod operator % to determine if a year is divisible by 4 (and thus a leap year or not). If a year in the range specified is divisible by four, it is a leap year. If a number is divisible by 4, then the arithmetic expression “number mod four” (or num % 4 in Python) will equal zero.
Review From Yesterday - Step by Step Breakdown (continued)
Then, create a conditional statement to use this information, and put it into a function.
Review From Yesterday - Step by Step Breakdown (continued)
Then, create a list of the years given in the exercise.
Review From Yesterday - Step by Step Breakdown (continued)
Finally, use a for loop and your function to evaluate these years.
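Putting the steps together, one possible solution might look like the sketch below. The function name report_leap_year is hypothetical, and returning the message (in addition to printing it) is an assumption made only so the result is easy to check.

```python
def report_leap_year(year):
    """Print whether a year between 1901 and 2000 is a leap year."""
    # in this range, a year is a leap year exactly when it is divisible by 4
    if year % 4 == 0:
        message = str(year) + ' is a leap year'
    else:
        message = str(year) + ' is not a leap year'
    print(message)
    return message

years = [1928, 1950, 1959, 1972, 1990]
for year in years:
    report_leap_year(year)
```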
Additional Challenge
If you have time:
Expand your function so that it correctly categorizes any year from 0 onwards
Instead of printing whether a year is a leap year or not, save the results to a Python dictionary, where there are two keys (“leap” and “not-leap”) and the values are lists of years.
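One way to approach both challenges is sketched below. It assumes the Gregorian rule (divisible by 4, except century years that are not divisible by 400) is applied to every year, which is a simplification for years before that calendar was adopted. The function name is_leap_year and the sample years are hypothetical.

```python
def is_leap_year(year):
    """Return True if year is a leap year under the Gregorian rule."""
    # divisible by 4, except century years, unless also divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

results = {'leap': [], 'not-leap': []}
for year in [1900, 1928, 1950, 2000, 2100]:
    if is_leap_year(year):
        results['leap'].append(year)
    else:
        results['not-leap'].append(year)
print(results)
```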
Key Points
- Use skills together.
Content from Command-Line Programs
Last updated on 2025-02-10
Overview
Questions
- How can I write Python programs that will work like Unix command-line tools?
Objectives
- Use the values of command-line arguments in a program.
- Read data from standard input in a program so that it can be used in a pipeline.
The Jupyter Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use that code in a program we can run from the command line. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a gapminder data set and plots the GDP of countries over time.
Switching to Shell Commands
In this lesson we are switching from typing commands in Jupyter notebooks to typing commands in a shell terminal window (such as bash). A $ in front of a command tells you to run that command in the shell rather than the Python interpreter.
Converting Notebooks
The Jupyter Notebook has the ability to convert all of the cells of a current Notebook into a Python program. To do this, go to File -> Download as and select Python (.py) to get the current notebook as a Python script.
Setting up your project
Up until now, we’ve been working in the data folder directly. Because we’re going to be dealing with more files of different types in this lesson, let’s do a little rearranging:
- On your desktop, create a folder called swc-gapminder.
- Move the data folder you’ve been using into this folder.
- Inside swc-gapminder, create a folder called figs.
- To ensure that we’re all starting with the same set of code, copy the text below into a file called gdp_plots.py in the swc-gapminder folder or download the file from here.
PYTHON
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col = 'country').T
# create a plot of the transposed data
ax = data.plot()
# display the plot
plt.show()
This program imports the pandas and matplotlib Python modules, reads some of the gapminder data into a pandas dataframe, and plots that data using matplotlib with some default settings.
We can run this program from the command line using:
$ python gdp_plots.py
This is much easier than starting a notebook, going to the browser, and running each cell in the notebook to get the same result.
Initialize a repository
But before we modify our gdp_plots.py program, we are going to put it under version control so that we can track its changes as we go through this lesson.
Because we’re only concerned with changes to our analysis script, we are going to create a .gitignore file that excludes all of the gapminder .csv files and any Jupyter notebook (.ipynb) files we have created thus far.
$ echo "data/*.csv" > .gitignore
$ echo "*.ipynb" >> .gitignore
$ git add .gitignore
$ git commit -m "Adding ignore file"
Now that we have a clean repository, let’s get back to work on adding command line arguments to our program.
Changing code under Version Control
As it is, this plot isn’t bad, but let’s add some labels for clarity. We’ll use the data filename as a title for the plot and indicate what information is on each axis.
PYTHON
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = 'data/gapminder_gdp_oceania.csv'
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# display the plot
plt.show()
Now when we run this, our plot looks a little bit nicer.
Command-Line Arguments
This program currently only works for the Oceania set of data. How might we modify the program to work for any of the gapminder gdp data sets? We could go into the script and change the .csv filename to generate the same plot for different sets of data, but there is an even better way.
Python programs can use additional arguments provided in the following manner.
The program can then use these arguments to alter its behavior. In this case, we’ll be using arguments to tell our program to operate on a specific file.
We’ll be using the sys module to do so. sys (short for system) is a standard Python module used to store information about the program and its running environment, including what arguments were passed to the program when the command was executed. These arguments are stored as a list in sys.argv.
These arguments can be accessed in our program by importing the sys module. The first argument in sys.argv is always the name of the program, so we’ll find any additional arguments right after that in the list.
Let’s try this out in a separate script. Using the text editor of your choice, let’s write a new program called args_list.py containing the following two lines:
PYTHON
import sys
print('sys.argv is', sys.argv)
The strange name argv stands for “argument values”. Whenever Python runs a program, it takes all of the values given on the command line and puts them in the list sys.argv so that the program can determine what they were. If we run this program with no arguments:
$ python args_list.py
OUTPUT
sys.argv is ['args_list.py']
the only thing in the list is the name of our script as it was given on the command line, which is always sys.argv[0].
If we run it with a few arguments, however:
then Python adds each of those arguments to that magic list.
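To make the layout of that list concrete, here is a minimal sketch that separates the program name from the remaining arguments. The function describe_arguments and the sample argument values are hypothetical; a real script would pass sys.argv instead of a hand-written list.

```python
def describe_arguments(argv):
    """Split an argument list into the program name and the extra arguments."""
    program_name = argv[0]      # always the script name
    extra_arguments = argv[1:]  # everything typed after the script name
    return program_name, extra_arguments

# simulates running: python args_list.py first second third
name, extras = describe_arguments(['args_list.py', 'first', 'second', 'third'])
print('program:', name)
print('arguments:', extras)
```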
Using this new information, let’s add command line arguments to our
gdp_plots.py
program.
To do this, we’ll make two changes:
- add the import of the sys module at the beginning of the program.
- replace the filename (“data/gapminder_gdp_oceania.csv”) with the second entry in the sys.argv list.
Now our program should look as follows:
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# display the plot
plt.show()
Let’s take a look at what happens when we provide a gapminder filename to the program.
And the same plot as before is displayed, but this file is now being read from an argument we’ve provided on the command line. We can now do this for files with similar information and get the same set of plots for that data without any changes to our program’s code. Try this out for yourself now.
Update the Repository
Now that we’ve made this change to our program and seen that it works, let’s update our repository with these changes.
$ git add gdp_plots.py
$ git commit -m "Adding command line arguments"
Exercise: read multiple files
Try to run the gdp_plots.py so that it reads in all the .csv files in the data folder using the wildcard symbol. Does it work? Why or why not?
If you run it with the argument ‘data/*.csv’, you get an error on the Americas file because it has an extra column. However, it works if you omit that file.
Key Points
- The sys library connects a Python program to the system it is running on.
- The variable sys.argv is a list with each item being a command-line argument.
Content from Trying Different Methods
Last updated on 2025-02-10
Overview
Questions
- How do I plot multiple data sets using different methods?
Objectives
- Read data from standard input in a program so that it can be used in a pipeline.
- Compare using different methods to accomplish the same task.
- Practice making branches and merging in a Git repository.
Handling Multiple Files
Perhaps we would like a script that can operate on multiple data files at once. This can actually be done in two different ways: (1) writing a bash script that calls our already-existing Python script in a for-loop, or (2) modifying our Python script to read multiple files in a for-loop. We will try both methods and compare the two.
Handling Multiple Files with Python
First, we’ll try using just Python to loop through multiple files. Let’s switch to our Python branch.
We want our program to process each file separately, and the easiest way to do this is a loop that executes our plotting statements once for each of the filenames provided in sys.argv.
But we need to be careful: sys.argv[0] will always be the name of our program, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.
A solution is to loop over the contents of sys.argv[1:]. The ‘1’ tells Python to start the slice at location 1, so the program’s name isn’t included. Since we’ve left off the upper bound, the slice runs to the end of the list, and includes all the filenames.
Here is our updated program.
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
for filename in sys.argv[1:]:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # display the plot
    plt.show()
Now when the program is given multiple filenames, one plot is generated for each filename.
Saving Figures
By using plt.show() with multiple files, the program stops each time a figure is generated and the user must close it to continue. To avoid this slowdown, we can replace this with plt.savefig() and view all the figures after the script finishes. This function has one required argument, which is the filename to save the figure as. The filename must have a valid image extension (e.g., PNG or JPEG).
Let’s replace our plt.show() with plt.savefig('figs/gdp-plot.png'). First, create the figs directory using mkdir if it does not already exist. Our new script should look like this:
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
for filename in sys.argv[1:]:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot
    plt.savefig('figs/gdp-plot.png')
If we look at the contents of our folder now, we should have a new file called gdp-plot.png. But why is there only one when we supplied multiple data files? This is because each time a plot is created, it is saved under the same file name, overwriting the previous plot.
We can fix this by creating a unique file name each time. A simple unique name can be based on the original file name. We can use Python’s split() string method to split filename, which is a string, on any separator we choose. This returns a list of the pieces; for example, splitting 'data/gapminder_gdp_oceania.csv' on '.' gives ['data/gapminder_gdp_oceania', 'csv'].
We’ll split the original file name and use the first part to rename our plot. Then we will concatenate .png to the name to specify our file type.
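The renaming steps can be tried on their own before they go into the script. The sketch below assumes the path contains exactly one folder level and one dot, as the gapminder file names do. The os.path version at the end is an alternative that the lesson itself does not use; it is shown only because it copes with deeper folder paths.

```python
import os

filename = 'data/gapminder_gdp_oceania.csv'

# split on '.' to drop the extension, then on '/' to drop the folder
parts = filename.split('.')
base_name = parts[0].split('/')[1]
save_name = 'figs/' + base_name + '.png'
print(parts)
print(save_name)

# alternative (not used in the lesson): let os.path do the splitting
stem = os.path.splitext(os.path.basename(filename))[0]
print(stem)
```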
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
for filename in sys.argv[1:]:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)
Handling Multiple Files with Bash
Now that we’ve created a Python script to save multiple files at once, let’s try to do the same thing in Bash. We’ll leave our current branch alone and switch to our bash-multi-files branch.
If we look at our gdp_plots.py file, it is not in a for-loop format because it is not up to date with the other branch. Our script should look like this:
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# display the plot
plt.show()
Let’s use bash to generate multiple plots by calling our Python script in a bash for-loop. First, let’s create a bash file for us to edit.
In this script, we’ll write a for-loop that will call our gdp_plots.py script on multiple files. We can break up our long list of files by using a backslash \ and writing the rest on the next line.
BASH
for filename in data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
do
python gdp_plots.py $filename
done
We can run our script to see if it works:
When we run this, we see that it stops to show us each plot like before. Let’s update our script to save the figure like before.
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1]
save_name = 'figs/' + split_name2 + '.png'
plt.savefig(save_name)
When we run the script again, we should have new image files generated.
Comparing Methods
We have successfully developed two different methods for accomplishing the same task. This is common in software development when there is not a clear path forward. Let’s compare our two methods and decide which to merge into our master branch.
One comparison we might be interested in is how fast each is. We can use bash’s time command to measure how long a script takes to run. To do this, we just add time before a command when we run it, and it will give us timing information when the command completes.
While we are on our bash branch, we’ll time that script first.
real 0m6.031s
user 0m5.535s
sys 0m0.388s
We are most interested in the “real” time in the output, which is the elapsed time we experience.
Let’s check out our python branch and time our script there.
BASH
$ git checkout python-multi-files
$ time python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
real 0m3.163s
user 0m3.002s
sys 0m0.132s
As we can see, our Python method ran faster than the bash method. For this reason, we will merge our Python branch into master.
Another advantage to the Python method over the bash method is that we do not need to change a file and commit it each time we want to create plots from different data.
More Practice with Multiple Files in Python
PYTHON
import sys
import glob
def main():
    '''prints names of all files with sys.argv as suffix'''
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return
main()
Counting Lines
Write a program called line_count.py that works like the Unix wc command:
- If no filenames are given, it reports the number of lines in standard input.
- If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
PYTHON
import sys
def main():
    '''print each input filename and the number of lines in it,
    and print the sum of the number of lines'''
    filenames = sys.argv[1:]
    sum_nlines = 0 # initialize counting variable
    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for f in filenames:
            n = count_file(f)
            print('%s %d' % (f, n))
            sum_nlines += n
        print('total: %d' % sum_nlines)
def count_file(filename):
    '''count the number of lines in a file'''
    # the with statement closes the file for us
    with open(filename, 'r') as f:
        nlines = len(f.readlines())
    return nlines
def count_file_like(file_like):
    '''count the number of lines in a file-like object (eg stdin)'''
    n = 0
    for line in file_like:
        n = n + 1
    return n
main()
Key Points
- Make different branches in a Git repository to try different methods.
- Use bash’s time command to time scripts.
Content from Program Flags
Last updated on 2025-02-10
Overview
Questions
- How can I make an easy shortcut to analyze all files at once using a program flag?
Objectives
- Handle flags and files separately in a command-line program.
Handling Program Flags
Now we have a program which is capable of handling any number of data sets at once.
But what if we have 50 GDP data sets? It would be awfully tedious to type in the names of 50 files in the command line, so let’s add a flag to our program indicating that we would like it to generate a plot for each data set in the current directory.
Flags are a convention used in programming to indicate to a program that a non-default behavior is being requested by the user. In this case, we’ll be using a “-a” flag to indicate to our program we would like it to operate on all data sets in our directory.
To explore what files are in the current directory, we’ll be using Python’s glob module.
- In Unix, the term “globbing” means “matching a set of files with a pattern”.
- The most common patterns are:
  - * meaning “match zero or more characters”
  - ? meaning “match exactly one character”
- Python contains the glob library to provide pattern matching functionality.
- The glob library contains a function also called glob to match file patterns.
- E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
- Result is a (possibly empty) list of character strings.
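The pattern matching described above can be tried out safely in a throwaway folder. In this sketch the file names are made up for illustration and are created empty; only glob.glob itself is the behavior being demonstrated.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # create a few empty files to match against (hypothetical names)
    for name in ['gapminder_gdp_asia.csv', 'gapminder_gdp_europe.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()

    # * matches zero or more characters, so this finds both gdp files
    csv_paths = sorted(glob.glob(os.path.join(folder, '*gdp*.csv')))
    csv_names = [os.path.basename(path) for path in csv_paths]

    # a pattern that matches nothing gives an empty list, not an error
    no_match = glob.glob(os.path.join(folder, '*.xlsx'))

print(csv_names)
print(no_match)
```

glob.glob does not promise any particular ordering, which is why the results are wrapped in sorted() here.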
PYTHON
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]
for filename in filenames:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1]
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)
Updating the repository
Yet another successful update to the code. Let’s commit our changes.
The Right Way to Do It
If our programs can take complex parameters or multiple filenames, we shouldn’t handle sys.argv directly. Instead, we should use Python’s argparse library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users. We will not cover this module in this lesson, but you can go to Tshepang Lekhonkhobe’s Argparse tutorial, which is part of Python’s official documentation.
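Although argparse is outside the scope of this lesson, a minimal sketch of how the -a flag and the list of filenames might be declared with it is shown below. The long option name --all, the help texts, and the function name parse_arguments are assumptions made for illustration; they are not part of the lesson’s script.

```python
import argparse

def parse_arguments(argv):
    """Parse a list of command-line arguments (normally sys.argv[1:])."""
    parser = argparse.ArgumentParser(description='Plot gapminder GDP data.')
    # zero or more positional filenames
    parser.add_argument('filenames', nargs='*', help='CSV files to plot')
    # a boolean flag that is False unless -a is given
    parser.add_argument('-a', '--all', action='store_true',
                        help='plot all GDP data sets in the data folder')
    return parser.parse_args(argv)

flag_only = parse_arguments(['-a'])
print(flag_only.all, flag_only.filenames)

one_file = parse_arguments(['data/gapminder_gdp_oceania.csv'])
print(one_file.all, one_file.filenames)
```

A parser built this way also generates a usage message automatically when it is run with -h.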
Key Points
- Adding command line flags can be a user-friendly way to accomplish common tasks.
Content from Defensive Programming
Last updated on 2025-02-10
Overview
Questions
- How do I predict and avoid user confusion?
Objectives
- Ensure that programs indicate use and provide meaningful output upon failure.
Defensive Programming
In our last lesson, we created a program which will plot our gapminder gdp data for an arbitrary number of files. This is great, but we didn’t cover some of the vulnerabilities of this program we’ve created.
- What happens if we run the program without any arguments at all?
- What happens if we run the program from another directory?
First, let’s try running our program without any additional arguments or flags.
OUTPUT
Traceback (most recent call last):
  File "gdp_plots.py", line 12, in <module>
    filename = sys.argv[1]
IndexError: list index out of range
Python raises an error when it tries to look up a command-line argument in sys.argv, as the version of our script that reads sys.argv[1] does. It cannot find that argument because we haven’t provided it to the command, so there is no entry in sys.argv where we’re telling it to look for this value. (A slice such as sys.argv[1:] would not raise this error; it simply returns an empty list.) We may know all of this because we’re the ones who wrote the program, but another user of the program without this experience will not.
More on Function Errors/Exceptions
Python reports a runtime error when something goes wrong while a program is executing.
ERROR
NameError: name 'aege' is not defined
- The message indicates a problem with the name of a variable
Python also reports a syntax error when it can’t understand the source of a program.
ERROR
File "<ipython-input-6-d1cc229bf815>", line 1
print ("hello world"
^
SyntaxError: unexpected EOF while parsing
- The message indicates a problem on the first line of the input (“line 1”).
- In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
- The -6- part of the filename indicates that the error occurred in cell 6 of our Notebook.
- Next is the problematic line of code, indicating the problem with a ^ pointer.
And if we run the program from another directory:
We see no output from the program at all. This is what is referred to as a “silent failure”. The program has failed to produce a plot, but has reported no reason why. These kinds of failures are difficult to debug and should be avoided.
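The mechanism behind this silent failure can be reproduced in a few lines. The sketch below uses an empty temporary folder to stand in for “another directory”; the counter plots_made is a hypothetical stand-in for the plotting code.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as empty_folder:
    # there is no data folder here, so the pattern matches nothing
    pattern = os.path.join(empty_folder, 'data', '*gdp*.csv')
    filenames = glob.glob(pattern)

plots_made = 0
for filename in filenames:
    # the loop body never runs when the list is empty
    plots_made += 1

print(filenames)
print(plots_made)
```

No exception is raised at any point, which is exactly why the user sees nothing.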
It is important to employ “defensive programming” in this scenario so that our program indicates to the user
- what is going wrong
- how to correct this problem
Check Input Arguments
Let’s add a section to the code which checks the number of incoming arguments to the program and returns some information to the user if there is missing information.
PYTHON
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py <filenames>")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")
# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]
for filename in filenames:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot with a unique file name
    split_name = filename.split('.')
    save_name = split_name[0] + '.png'
    plt.savefig(save_name)
If we run the program without a filename argument, here’s what we’ll see:
OUTPUT
Not enough arguments have been provided
Usage: python gdp_plots.py <filenames>
Options:
-a : plot all gdp data sets in current directory
Now if someone runs this program without having used it before (or written it themselves), they will know how to change their command to get the program running properly, rather than seeing an esoteric Python error.
Update the Repository
We’ve just made another successful change to our repository. Let’s add a commit to the repo.
Check for silent errors
Silent errors can be difficult to anticipate. If we try to run our program from another directory with the -a flag, we don’t see any errors, but it also doesn’t do anything. This is because when we use the -a flag here, there are no .csv files for the pattern to match, so our filenames variable is empty. Let’s add a check to ensure there are files to plot.
PYTHON
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py <filenames>")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")
# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
    # check if no files were found and print message.
    if filenames == []:
        # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure data is located in current directory.")
else:
    filenames = sys.argv[1:]
for filename in filenames:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot with a unique file name
    split_name = filename.split('.')
    save_name = split_name[0] + '.png'
    plt.savefig(save_name)
Now if someone runs this program in a directory with no valid datafiles, a message appears.
Content from Refactoring
Last updated on 2025-02-10
Overview
Questions
- When should I reorganize my code so it is more clear and readable for others?
- How can I organize my code so that it is usable in other places?
- Why do I almost always want to write my code as though it will be used somewhere else?
Objectives
- Understand the value of refactoring code and use of functions.
- Practice determining where code can be divided into smaller functions.
This code works nicely for generating plots of multiple data sets, but there is now a lot of code to digest in our script. Picture yourself looking at this code for the first time. Would it be immediately clear to you where arguments are handled and where plots are generated?
It would be nice to break this work into clear chunks of code. This can be accomplished by making the argument checking section and the body of the for loop their own functions. This requires surprisingly few changes to the code, but makes it much clearer. This process is called refactoring.
Exercise: make a refactoring plan
Given the guidance above, talk with your neighbors about which parts of the script should be moved into functions. Try to think of ways to make the functions the most reusable on their own.
A possible solution:
- A function that parses the arguments
- A function that makes the plots
- A function that calls the other functions
This isn’t the only “right” solution, but it is a reasonable way to split things up.
Let’s refactor our script
Create a Branch
Because we’re making a major change, let’s make a new branch to work in.
Let’s break the code into 4 functions:
- parse_arguments() - gets the input from argv[], returns a list of file names
- create_plot() - takes one file name as input, creates one plot and writes it to the figs folder
- create_plots() - takes a list of files as input, calls create_plot() for each element in the list
- main() - calls parse_arguments() and create_plots()
Below is a template that will help you write these functions. The """ syntax marks a multi-line string, which is often used as a multi-line comment. If such a string is the first thing in a function, it is known as a docstring.
PYTHON
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.
    Input:
    ------
    argument list (normally sys.argv[1:])
    Returns:
    --------
    filenames: list of strings, list of files to plot
    """
def create_plot(filename):
    """
    Creates a plot for the specified
    data file.
    Input:
    ------
    filename: string, path to file to plot
    Returns:
    --------
    none
    """
def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.
    Input:
    ------
    filenames: list of strings, list of files to plot
    Returns:
    --------
    none
    """
def main():
    """
    main function - does all the work
    """
# call main
main()
Adding Docstrings
In an effort to create human-readable code, it is common to include a “docstring” or “Python Documentation String” at the top of each function that clearly lays out three things: the purpose or objective of the function, a description of the inputs and their data types, and a description of what is returned and its data type. Including these things makes later debugging much more efficient, because the developer knows the starting point (inputs), the endpoint (returns), and the intended change (purpose).
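As a small illustration of the three parts described above, here is a sketch of a docstring on a hypothetical function (celsius_to_kelvin is not part of the lesson's script; it is only an example). The standard library's inspect.getdoc reads the docstring back at run time.

```python
import inspect


def celsius_to_kelvin(temp_c):
    """
    Convert a temperature from Celsius to Kelvin.

    Input:
    ------
    temp_c: float, temperature in degrees Celsius

    Returns:
    --------
    temp_k: float, temperature in kelvin
    """
    return temp_c + 273.15


# inspect.getdoc returns the docstring with its indentation cleaned up
doc = inspect.getdoc(celsius_to_kelvin)

# the first line states the purpose of the function
print(doc.splitlines()[0])
```

The same text is what help(celsius_to_kelvin) would display in a notebook.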
Let’s move the code into the functions now:
Exercise: refactor the code
Now that we have a plan for refactoring and a template to work from, create a new script called refactored_gdp_plot.py. Paste the template from above into the new script. Then copy and paste the code from the gdp_plot.py script into the corresponding functions.
PYTHON
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.
    Input:
    ------
    argument list (normally sys.argv[1:])
    Returns:
    --------
    filenames: list of strings, list of files to plot
    """
    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plot.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")
    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("*gdp*.csv")
    else:
        filenames = argv
    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.
    Input:
    ------
    filename: string, path to file to plot
    Returns:
    --------
    none
    """
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    # take the last path component so this also works without a directory
    split_name2 = split_name1.split('/')[-1] #gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.
    Input:
    ------
    filenames: list of strings, list of files to plot
    Returns:
    --------
    none
    """
    for filename in filenames:
        create_plot(filename)

def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])
    # generate plots
    create_plots(files_to_plot)

# call main
main()
The behavior of the program hasn’t changed, but it has been made more modular by separating the code into different functions, each with its own purpose.
Why the extra function?
Our main function has two primary components:
- parsing arguments handed to the program
- generating the desired plots
But we’ve defined three functions above the main function: parse_arguments, create_plot, and create_plots.
If the functions in our file directly reflected the components in main, we would only have the parse_arguments and create_plots functions. If we think about using these functions independently, however, a function which always takes in a list of filenames isn’t very convenient to use on its own. By defining the create_plot function, we have placed most of the plot generation work there, while allowing for a very simple definition of the create_plots function.
The importance of this design decision will be made clear in the next lesson.
Before we commit, we need to rename our refactored_gdp_plot.py script to gdp_plot.py, since we don’t want to keep two copies of this script around in our repo. We only made this a separate script to make it easier to copy-paste. Once you’ve tested it, you can either rename refactored_gdp_plot.py to gdp_plot.py, or copy the contents of refactored_gdp_plot.py into gdp_plot.py and delete refactored_gdp_plot.py.
Update the Repository
We haven’t changed the behavior of our program, but our code has changed, so let’s update the repository.
Branching and Refactoring
To demonstrate that the behavior of our program hasn’t changed, try running it a few different ways using both the master and refactor branches. Remember that the command for checking out a branch is git checkout <branch_name>.
Now that we’re satisfied with our refactor, we can merge this branch into our master branch.
Key Points
- Refactoring makes code more modular, easier to read, and easier to understand.
- Refactoring requires one to consider future implications and generally enables others to use your code more easily.
Content from Running Scripts and Importing
Last updated on 2025-02-10
Overview
Questions
- How can I import some of my work even if it is part of a program?
Objectives
- Learn how to import functions from one script into another.
- Understand the difference between using a file as a module and running it as a script or program.
In our last lesson we learned how refactoring code makes it easier to understand by organizing it into pieces that each have their own purpose. The added modularity also makes the code easier to use in other places.
First, let’s start a Jupyter notebook in our current directory and see what happens if we try to import our file as it is.
The result is an error. Importing a file runs all of its top-level code, so Python reaches our call to the function main, which then tries to parse command-line arguments that were never meant for our script.
Running Versus Importing
Running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.
In order for a Python script to work as expected both when imported and when run as a script, we typically put the part of the script that produces output inside an if statement that checks whether __name__ is equal to '__main__'.
When you import a Python file, __name__ is the name of that file (e.g., when importing pandas_plots.py, __name__ is 'pandas_plots'). However, when running a script in bash, __name__ is always set to '__main__' in that script so that you can determine if the file is being imported or run as a script.
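The difference between the two cases can be seen in a small, self-contained sketch. It writes a throwaway module (the name demo_module and its contents are assumptions made for this illustration, not part of the lesson's script) to a temporary directory, imports it, and then executes the same file as a script using the standard library's runpy module.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

# source of a hypothetical module that uses the __name__ check
MODULE_SOURCE = '''
calls = []

def main():
    calls.append("main")

# remember what __name__ was when this file was executed
seen_name = __name__

if __name__ == "__main__":
    main()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_module.py"
    path.write_text(MODULE_SOURCE)

    # case 1: import the file as a module
    sys.path.insert(0, tmp)
    imported = importlib.import_module("demo_module")
    sys.path.remove(tmp)

    # case 2: execute the same file as "python demo_module.py" would
    as_script = runpy.run_path(str(path), run_name="__main__")

# when imported, __name__ is the module name and main() is not called
print(imported.seen_name, imported.calls)

# when run as a script, __name__ is "__main__" and main() is called
print(as_script["seen_name"], as_script["calls"])
```

Only the second case calls main, which is the behavior we want for a file that should work both as a module and as a program.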
Let’s move the call to the main function in our script into a section that only runs when the program is called from the command line.
PYTHON
import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.
    Input:
    ------
    argument list (normally sys.argv[1:])
    Returns:
    --------
    filenames: list of strings, list of files to plot
    """
    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plot.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")
    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv
    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.
    Input:
    ------
    filename: string, path to file to plot
    Returns:
    --------
    none
    """
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    # take the last path component so this also works without a directory
    split_name2 = split_name1.split('/')[-1] #gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.
    Input:
    ------
    filenames: list of strings, list of files to plot
    Returns:
    --------
    none
    """
    for filename in filenames:
        create_plot(filename)

def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])
    # generate plots
    create_plots(files_to_plot)

if __name__ == "__main__":
    # call main
    main()
Now let’s go back to the Jupyter notebook and try importing the file again.
Success! You’ve just written your first Python module. Any of the functions in that module can now be accessed in our Jupyter notebook session.
Update the repository
Back in the terminal, let’s commit these changes to our repository.
Writing Modular Code
In our previous lesson we refactored our code for clarity and modularity. As part of that process we created two functions, create_plot and create_plots. While the create_plot function isn’t called directly by main in our program, imagine importing our module without it and finding that the only way to generate a plot for a single file is to add that filename to a list before passing that list to create_plots.
This might seem very strange or confusing to someone importing our module for the first time. It takes time to develop an intuition for design decisions like these, but here are a few questions to ask yourself as a guide when organizing code:
- Are my functions able to stand on their own? Do they accomplish simple tasks?
- Is it easy to write clear function names in my module? A function with the name “create_plots_and_export_data” is likely better off being broken up into two functions, “create_plots” and “export_data”.
- Are my function names plural? In our example, it may have felt more natural to create only one function, “create_plots”. In these cases, it is almost always useful to create a function which does our operation once (“create_plot”) and another plural form of the function (“create_plots”) which contains a loop over the singular function.
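The singular/plural pattern from the last question can be sketched with a pair of small functions. The names make_save_name and make_save_names, and the example file names, are hypothetical and chosen only for illustration; they mirror the way create_plots loops over create_plot.

```python
def make_save_name(filename):
    """Return the figure path for one data file."""
    # drop any directory and the extension, e.g. 'data/x.csv' -> 'x'
    base = filename.split('/')[-1].split('.')[0]
    return 'figs/' + base + '.png'


def make_save_names(filenames):
    """Return the figure paths for a list of data files."""
    # the plural function is only a loop over the singular one
    return [make_save_name(filename) for filename in filenames]


# the singular form is convenient for a single file...
print(make_save_name('data/gdp_a.csv'))

# ...and the plural form handles a whole list
print(make_save_names(['data/gdp_a.csv', 'data/gdp_b.csv']))
```

Someone importing these functions can use whichever form fits their task, without wrapping a single file name in a list.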
Key Points
- The __name__ variable allows us to know whether the file is being imported or run as a script.
Content from Programming Style
Last updated on 2025-02-10
Overview
Questions
- How can I make my programs more readable?
- How do most programmers format their code?
- How can programs check their own operation?
Objectives
- Provide sound justifications for basic rules of coding style.
- Refactor one-page programs to make them more readable and justify the changes.
- Use Python community coding standards (PEP-8).
Follow standard Python style in your code.
- PEP8: a style guide for Python that discusses topics such as how you should name variables, how you should use indentation in your code, how you should structure your import statements, etc. Adhering to PEP8 makes it easier for other Python developers to read and understand your code, and to understand what their contributions should look like. The pycodestyle application and Python library (formerly named pep8) can check your code for compliance with PEP8.
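As a rough illustration of a few PEP8 conventions, the sketch below contrasts a cramped definition (kept in comments so the block still runs) with a version that follows the guide. The function name calc_ratio and its arguments are made up for this example.

```python
# Harder to read (shown as comments so this block still runs):
#   def F(X,Y):return X/Y
#   D=F(10,4)

# PEP8 style: lowercase_with_underscores names, four-space indentation,
# spaces around operators and after commas, one statement per line
def calc_ratio(numerator, denominator):
    """Return numerator divided by denominator."""
    return numerator / denominator


density = calc_ratio(10, 4)
print(density)
```

Both versions compute the same value; only the second tells a reader what the values mean.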
Best Practice: Write programs for people and not for computers!
Use assertions to check for internal errors.
Assertions are a simple, but powerful method for making sure that the context in which your code is executing is as you expect.
PYTHON
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0
    return mass / volume
If the assertion is False, the Python interpreter raises an AssertionError runtime exception. The source code for the expression that failed will be displayed as part of the error message. To ignore assertions in your code, run the interpreter with the ‘-O’ (optimize) switch. Assertions should contain only simple checks and never change the state of the program. For example, an assertion should never contain an assignment.
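To see a failing assertion in action, here is a minimal sketch that reuses calc_bulk_density and catches the resulting AssertionError. The message after the comma in the assert statement is an optional addition made for this example; the input values are made up.

```python
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    # the text after the comma becomes the error message
    assert volume > 0, "volume must be positive"
    return mass / volume


# a valid call passes the assertion
print(calc_bulk_density(10.0, 4.0))

# an invalid call raises AssertionError before the division happens
try:
    calc_bulk_density(10.0, 0)
except AssertionError as error:
    message = str(error)
    print("AssertionError:", message)
```

In normal use you would let the AssertionError stop the program; it is caught here only so the example runs to completion.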
Best Practice: Plan for mistakes
Use docstrings to provide online help.
Best Practice: Document design & purpose, not just mechanics
- If the first thing in a function is a character string that is not assigned to a variable, Python attaches it to the function as the online help.
- Called a docstring (short for “documentation string”).
PYTHON
def average(values):
    "Return average of values, or None if no values are supplied."
    if len(values) == 0:
        return None
    # divide by the number of values
    return sum(values) / len(values)

help(average)
OUTPUT
Help on function average in module __main__:
average(values)
Return average of values, or None if no values are supplied.
What Will Be Shown?
Highlight the lines in the code below that will be available as online help. Are there lines that should be made available, but won’t be? Will any lines produce a syntax error or a runtime error?
PYTHON
"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.

def overall_max(sequences):
    '''Determine overall maximum edit distance.'''
    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)
    # Report.
    return highest
Clean Up This Code
- Read this short program and try to predict what it does.
- Run it: how accurate was your prediction?
- Refactor the program to make it more readable. Remember to run it after each change to ensure its behavior hasn’t changed.
- Compare your rewrite with your neighbor’s. What did you do the same? What did you do differently, and why?
Here’s one solution.
PYTHON
def string_machine(input_string, iterations):
    """
    Takes input_string and generates a new string with -'s and *'s
    corresponding to characters that have identical adjacent characters
    or not, respectively. Iterates through this procedure with the resultant
    strings for the supplied number of iterations.
    """
    print(input_string)
    old = input_string
    for i in range(iterations):
        new = ''
        # iterate through characters in previous string
        for j in range(len(old)):
            left = j-1
            right = (j+1)%len(old) # ensure right index wraps around
            if old[left]==old[right]:
                new += '-'
            else:
                new += '*'
        print(new)
        # store new string as old
        old = new

string_machine('et cetera', 10)
Key Points
- Follow standard Python style in your code.
- Use docstrings to provide online help.
Content from Wrap-Up
Last updated on 2025-02-10
Overview
Questions
- What have we learned?
- What else is out there and where do I find it?
Objectives
- Name and locate scientific Python community sites for software, workshops, and help.
Leslie Lamport once said, “Writing is nature’s way of showing you how sloppy your thinking is.” The same is true of programming: many things that seem obvious when we’re thinking about them turn out to be anything but when we have to explain them precisely.
Python supports a large community within and outwith research.
The Python 3 documentation covers the core language and the standard library.
PyCon is the largest annual conference for the Python community.
SciPy is a rich collection of scientific utilities. It is also the name of a series of annual conferences.
Jupyter is the home of the Jupyter Notebook.
Pandas is the home of the Pandas data library.
Stack Overflow’s general Python section can be helpful, as can the sections on NumPy, SciPy, Pandas, and other topics.
Key Points
- Python supports a large community within and outwith research.