# Review Exercise

## Overview

Teaching: 0 min
Exercises: 20 min
Questions
• How can we put together all of yesterday’s material?

Objectives
• Apply use of functions, conditionals and loops to solve a problem.

## Review From Yesterday

In your notebook, write a function that determines whether a year between 1901 and 2000 is a leap year, where it prints a message like “1904 is a leap year” or “1905 is not a leap year” as output. Use this function to evaluate the years 1928, 1950, 1959, 1972 and 1990. Essentially, given this list of years:

``````years = [1928, 1950, 1959, 1972, 1990]
``````

Produce something like:

``````1928 is a leap year
1950 is not a leap year.
1959 is not a leap year.
1972 is a leap year
1990 is not a leap year.
``````
``````8 mod 4 equals 0
10 mod 4 equals 2
``````

If you’re not sure where to start, see the partial answers below:

## Suggested Approach

First, try to determine how to use the mod operator `%` to determine if a year is divisible by 4 (and thus a leap year or not).

Then, create a conditional statement to use this information, and put it into a function.

Finally, create a list of the years given in the exercise. Use a for loop and your function to evaluate these years.

## Modular Arithmetic

If a year in the range specified is divisible by four, it is a leap year. If a number is divisible by 4, then the arithmetic expression “number mod four” (or `num % 4` in Python) will equal zero.
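As a quick check of this idea, the short sketch below prints the remainder for two numbers. The helper name `is_divisible_by_four` is made up for this illustration and is not part of the exercise.

```python
def is_divisible_by_four(num):
    """Return True when num leaves no remainder after division by 4."""
    return num % 4 == 0

# show the remainder for a number that is divisible by 4 and one that is not
for num in [8, 10]:
    print(num, "mod 4 equals", num % 4)
```

A remainder of zero means the number is divisible by 4, which is the test the conditional statement needs.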

## Conditional Statement

Fill in the blanks:

``````year = 1904
if year % 4 == _____:
    print(year, _______________)
______:
    print(year, "is not a leap year.")
``````

## Function

Fill in the blanks:

``````def leap_year(year):
    _________
``````

## Loop

Fill in the blanks:

``````year_list = [1928, 1950, 1959, 1972, 1990]
for year in ______:
    ________(year)
``````

## Complete Solution

``````def leap_year(year):
    if year % 4 == 0:
        print(year, "is a leap year")
    else:
        print(year, "is not a leap year.")

year_list = [1928, 1950, 1959, 1972, 1990]
for year in year_list:
    leap_year(year)
``````

If you have time:

1. Expand your function so that it correctly categorizes any year from 0 onwards

2. Instead of printing whether a year is a leap year or not, save the results to a python dictionary, where there are two keys (“leap” and “not-leap”) and the values are a list of years.
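One possible sketch covering both extensions is shown below. It assumes the Gregorian rule (divisible by 4, except century years, which must also be divisible by 400) and applies it to every year, which is a simplification because that calendar was not in use for early years. The names `is_leap_year` and `sort_years` are made up for this example.

```python
def is_leap_year(year):
    """Return True if year is a leap year under the Gregorian rule."""
    if year % 400 == 0:
        return True
    if year % 100 == 0:
        return False
    return year % 4 == 0

def sort_years(years):
    """Group years into a dictionary with 'leap' and 'not-leap' keys."""
    results = {"leap": [], "not-leap": []}
    for year in years:
        if is_leap_year(year):
            results["leap"].append(year)
        else:
            results["not-leap"].append(year)
    return results

print(sort_years([1900, 1928, 1950, 2000]))
```

Returning a value from the function, rather than printing inside it, is what makes it easy to collect the results in a dictionary.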

## Key Points

• Use skills together.

# Command-Line Programs

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• How can I write Python programs that will work like Unix command-line tools?

Objectives
• Use the values of command-line arguments in a program.

• Read data from standard input in a program so that it can be used in a pipeline.

The Jupyter Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use that code in a program we can run from the command line. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a gapminder data set and plots the GDP of countries over time.

## Switching to Shell Commands

In this lesson we are switching from typing commands in Jupyter notebooks to typing commands in a shell terminal window (such as bash). When you see a `$` in front of a command, that tells you to run that command in the shell rather than in the Python interpreter.

## Converting Notebooks

The Jupyter Notebook can convert all of the cells of the current notebook into a Python program. To do this, go to `File` -> `Download as` and select `Python (.py)` to get the current notebook as a Python script.

Up until now, we’ve been working in the data folder directly. Because we’re going to be dealing with more files of different types in this lesson, let’s do a little rearranging:

• On your desktop, create a folder called `swc-gapminder`.
• Move the `data` folder you’ve been using into this folder.
• Inside swc-gapminder, create a folder called `figs`
• To ensure that we’re all starting with the same set of code, copy the text below into a file called `gdp_plots.py` in the `swc-gapminder` folder or download the file from here.
``````import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# display the plot
plt.show()
``````

This program imports the `pandas` and `matplotlib` Python modules, reads some of the gapminder data into a `pandas` dataframe, and plots that data using `matplotlib` with some default settings.

We can run this program from the command line using

``````$ python gdp_plots.py
``````

This is much easier than starting a notebook, going to the browser, and running each cell in the notebook to get the same result.

### Initialize a repository

But before we modify our `gdp_plots.py` program, we are going to put it under version control so that we can track its changes as we go through this lesson.

``````$ git init
$ git add gdp_plots.py
$ git commit -m "First commit of analysis script"
``````

Because we’re only concerned with changes to our analysis script, we are going to create a `.gitignore` file that covers all of the gapminder `.csv` files and any Python notebook (`.ipynb`) files we have created thus far.

``````$ echo "data/*.csv" > .gitignore
$ echo "*.ipynb" >> .gitignore
$ git add .gitignore
$ git commit -m "Adding ignore file"
``````

Now that we have a clean repository, let’s get back to work on adding command line arguments to our program.

## Changing code under Version Control

As it is, this plot isn’t bad, but let’s add some labels for clarity. We’ll use the data filename as a title for the plot and indicate what information is on each axis.

``````import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = 'data/gapminder_gdp_oceania.csv'

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()
``````

Now when we run this, our plot looks a little bit nicer.

``````$ python gdp_plots.py
``````

### Updating the Repository

``````$ git add gdp_plots.py
$ git commit -m "Improving plot format"
``````

## Command-Line Arguments

This program currently only works for the Oceania set of data. How might we modify the program to work for any of the gapminder gdp data sets? We could go into the script and change the `.csv` filename to generate the same plot for different sets of data, but there is an even better way.

Python programs can use additional arguments provided in the following manner.

``````$ python <program> <argument1> <argument2> <other_arguments>
``````

The program can then use these arguments to alter its behavior. In this case, we’ll be using arguments to tell our program to operate on a specific file.

We’ll be using the `sys` module to do so. `sys` (short for system) is a standard Python module used to store information about the program and its running environment, including what arguments were passed to the program when the command was executed. These arguments are stored as a list in `sys.argv`.

These arguments can be accessed in our program by importing the `sys` module. The first argument in `sys.argv` is always the name of the program, so we’ll find any additional arguments right after that in the list.

Let’s try this out in a separate script. Using the text editor of your choice, let’s write a new program called `argv_list.py` containing the following two lines:

``````import sys
print('sys.argv is', sys.argv)
``````

The strange name `argv` stands for “argument values”. Whenever Python runs a program, it takes all of the values given on the command line and puts them in the list `sys.argv` so that the program can determine what they were. If we run this program with no arguments:

``````$ python argv_list.py
``````
``````sys.argv is ['argv_list.py']
``````

the only thing in the list is the name of our script as we typed it on the command line, which is always `sys.argv[0]`.

If we run it with a few arguments, however:

``````$ python argv_list.py first second third
``````
``````sys.argv is ['argv_list.py', 'first', 'second', 'third']
``````

then Python adds each of those arguments to that magic list.
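To make the indexing concrete, here is a minimal sketch of picking out the first real argument. The function name `get_filename` is hypothetical and only used for this illustration. It returns `None` when nothing was passed, so the caller can decide what to do about it.

```python
import sys

def get_filename(argv):
    """Return the first command-line argument after the program name, or None."""
    if len(argv) > 1:
        return argv[1]
    return None

# sys.argv[0] is the program name; real arguments start at index 1
print("program:", sys.argv[0])
print("first argument:", get_filename(sys.argv))
```

Passing the list in as a parameter, instead of reading `sys.argv` inside the function, makes the function easy to try out with a hand-written list.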

Using this new information, let’s add command line arguments to our `gdp_plots.py` program.

To do this, we’ll make two changes:

1. add the import of the sys module at the beginning of the program.
2. replace the filename (`'data/gapminder_gdp_oceania.csv'`) with the second entry in the `sys.argv` list, `sys.argv[1]`.

Now our program should look as follows:

```import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()
```

Let’s take a look at what happens when we provide a gapminder filename to the program.

``````$ python gdp_plots.py data/gapminder_gdp_oceania.csv
``````

The same plot as before is displayed, but the file is now being read from an argument we’ve provided on the command line. We can now do this for files with similar information and get the same set of plots for that data without any changes to our program’s code. Try this out for yourself now.

### Update the Repository

Now that we’ve made this change to our program and seen that it works, let’s update our repository with these changes.

``````$ git add gdp_plots.py
$ git commit -m "Adding command line arguments"
``````

Try to run the gdp_plots.py so that it reads in all the .csv files in the data folder using the wildcard symbol. Does it work? Why or why not?

## Solution

If you run it with the argument `data/*.csv`, you get an error on the Americas file because it has an extra column. However, it works if you omit that file.

## Key Points

• The `sys` library connects a Python program to the system it is running on.

• The variable `sys.argv` is a list with each item being a command-line argument.

# Trying Different Methods

## Overview

Teaching: 5 min
Exercises: 25 min
Questions
• How do I plot multiple data sets using different methods?

Objectives
• Read data from standard input in a program so that it can be used in a pipeline.

• Compare using different methods to accomplish the same task.

• Practice making branches and merging in a Git repository.

## Handling Multiple Files

Perhaps we would like a script that can operate on multiple data files at once. This can actually be done in two different ways: (1) writing a bash script that calls our already-existing Python script in a for-loop, or (2) modifying our Python script to read multiple files in a for-loop. We will try both methods and compare the two.

### Create New Branches

First we will create two new branches where we can develop each of these two different methods. We will call these branches `python-multi-files` and `bash-multi-files`.

``````$ git branch python-multi-files
$ git branch bash-multi-files
``````

We can check that these two branches were created with `git branch -a`.

## Handling Multiple Files with Python

First, we’ll try using just Python to loop through multiple files. Let’s switch to our Python branch.

``````$ git checkout python-multi-files
``````

We want our program to process each file separately, and the easiest way to do this is a loop that executes our plotting statements once for each of the filenames provided in `sys.argv`.

But we need to be careful: `sys.argv[0]` will always be the name of our program, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.

A solution is to loop over the contents of `sys.argv[1:]`. The ‘1’ tells Python to start the slice at location 1, so the program’s name isn’t included. Since we’ve left off the upper bound, the slice runs to the end of the list, and includes all the filenames.
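The slice can be tried out on an ordinary list first. In this small sketch, `example_argv` is a made-up stand-in for what `sys.argv` would hold if the program were run with two filenames.

```python
# a stand-in for sys.argv, assuming the program was run with two filenames
example_argv = ['gdp_plots.py', 'data/a.csv', 'data/b.csv']

filenames = example_argv[1:]   # everything after the program name
print(filenames)

# slicing a list that holds only the program name gives an empty list
print(['gdp_plots.py'][1:])
```

Because slicing past the end of a list gives an empty list instead of an error, a loop over `sys.argv[1:]` simply does nothing when no filenames are provided.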

Here is our updated program.

```import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # display the plot
    plt.show()
```

Now when the program is given multiple filenames

``````$ python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
``````

one plot for each filename is generated.

#### Update the Repository

``````$ git add gdp_plots.py
$ git commit -m "Allowing plot generation for multiple files at once"
``````

### Saving Figures

When we use `plt.show()` with multiple files, the program stops each time a figure is generated, and the user must close it to continue. To avoid this slowdown, we can replace it with `plt.savefig()` and view all the figures after the script finishes. This function has one required argument, which is the filename to save the figure as. The filename must have a valid image extension (e.g. `.png` or `.jpg`).

Let’s replace our `plt.show()` with `plt.savefig('figs/gdp-plot.png')`. If the `figs` directory does not exist yet, create it first using `mkdir figs`. Our new script should look like this:

```import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot
    plt.savefig('figs/gdp-plot.png')
```

If we look at the contents of our folder now, we should have a new file called `gdp-plot.png`. But why is there only one when we supplied multiple data files? This is because each time the plot is created, it is being saved as the same file name and overwriting the previous plot.

We can fix this by creating a unique file name each time. A simple unique name can be based on the original file name. We can use the string method `split()` to split `filename`, which is a string, at any character we choose. This returns a list, like so:

``````name = 'my-data.csv'
split_name = name.split('.')
print(split_name)
print(split_name[0])
``````
``````['my-data', 'csv']
my-data
``````

We’ll split the original file name and use the first part to rename our plot. And then we will concatenate `.png` to the name to specify our file type.

```import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] # data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] # gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)
```
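Splitting on `.` and `/` works for paths shaped exactly like `data/gapminder_gdp_XXX.csv`, but it gives the wrong piece if the path has more folders or extra dots. As an alternative sketch (not what the lesson’s script uses), the standard library’s `pathlib` module can pull out the file stem directly. The function name `make_save_name` and its `out_dir` parameter are assumptions made for this example.

```python
from pathlib import Path

def make_save_name(filename, out_dir='figs'):
    """Build an output path such as figs/gapminder_gdp_oceania.png."""
    stem = Path(filename).stem   # file name without folder or extension
    return out_dir + '/' + stem + '.png'

print(make_save_name('data/gapminder_gdp_oceania.csv'))
```

Either approach gives each plot its own name. The `pathlib` version just makes fewer assumptions about how the input path is laid out.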

### Updating the repository

Yet another successful update to the code. Let’s commit our changes. If we do `git status` we see we also have image files untracked. Let’s ignore those files because they will likely change as our data changes.

``````$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"
$ git add gdp_plots.py
$ git commit -m "Saves each figure as a separate file."
``````

## Handling Multiple Files with Bash

Now that we’ve created a Python script to save multiple files at once, let’s try to do the same thing in Bash. We’ll leave our current branch alone and switch to our `bash-multi-files` branch.

``````$ git checkout bash-multi-files
``````

If we look at our `gdp_plots.py` file, it does not contain the for-loop because this branch is not up to date with the other branch. Our script should look like this:

``````import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()
``````

Let’s use bash to generate multiple plots by calling our Python script in a bash for-loop. First, let’s create a bash file for us to edit.

``````$ touch gdp_plots.sh
``````

In this script, we’ll write a for-loop that will call our `gdp_plots.py` script on multiple files. A long list of files can be broken up by ending a line with a backslash `\` and writing the rest on the next line.

``````for filename in data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
do
    python gdp_plots.py $filename
done
``````

We can run our script to see if it works:

``````$ bash gdp_plots.sh
``````

When we run this, we see that it stops to show us each plot like before. Let’s update our script to save the figure like before.

```import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# save the plot with a unique file name
split_name1 = filename.split('.')[0] # data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1] # gapminder_gdp_XXX
save_name = 'figs/' + split_name2 + '.png'
plt.savefig(save_name)
```

When we run the script again, we should have new image files generated.

### Updating the repository

Yet another successful update to the code. Let’s commit our changes. Since our Python and bash scripts had somewhat unrelated changes, let’s make two separate commits. We will also ignore our images like before.

``````$ git add gdp_plots.sh
$ git commit -m "Wrote bash script to call python plotter."
$ git add gdp_plots.py
$ git commit -m "Saves figures with unique name."
$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"
``````

## Comparing Methods

We have successfully developed two different methods for accomplishing the same task. This is common to do in software development when there is not a clear path forward. Let’s compare our two methods and decide which to merge into our `master` branch.

One comparison we might be interested in is how fast each method is. We can use bash’s `time` command to measure how long a script takes to run: we just add `time` before the command, and timing information is printed when it completes.

While we are on our bash branch, we’ll time that script first.

``````$ time bash gdp_plots.sh
``````
``````real    0m6.031s
user    0m5.535s
sys     0m0.388s
``````

We are most interested in the “real” time in the output, which is the elapsed time we experience.

Let’s check out our Python branch and time our script there.

``````$ git checkout python-multi-files
$ time python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
``````
``````real    0m3.163s
user    0m3.002s
sys     0m0.132s
``````

As we can see, our Python method ran faster than the bash method. For this reason, we will merge our Python branch into master.

``````$ git checkout master
$ git merge python-multi-files
``````

Another advantage to the Python method over the bash method is that we do not need to change a file and commit it each time we want to create plots from different data.

## Finding Particular Files

Using the `glob` module, write a simple version of `ls` that shows files in the current directory with a particular suffix. A call to this script should look like this:

``````$ python my_ls.py py
``````
``````left.py
right.py
zero.py
``````

## Solution

``````import sys
import glob

def main():
    '''prints names of all files with sys.argv[1] as suffix'''
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return

main()
``````

## Counting Lines

Write a program called `line_count.py` that works like the Unix `wc` command:

• If no filenames are given, it reports the number of lines in standard input.
• If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.

## Solution

``````import sys

def main():
    '''print each input filename and the number of lines in it,
       and print the sum of the number of lines'''
    filenames = sys.argv[1:]
    sum_nlines = 0 #initialize counting variable

    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for f in filenames:
            n = count_file(f)
            print('%s %d' % (f, n))
            sum_nlines += n
        print('total: %d' % sum_nlines)

def count_file(filename):
    '''count the number of lines in a file'''
    f = open(filename,'r')
    nlines = len(f.read().splitlines()) # read the file and count its lines
    f.close()
    return(nlines)

def count_file_like(file_like):
    '''count the number of lines in a file-like object (eg stdin)'''
    n = 0
    for line in file_like:
        n = n+1
    return n

main()

``````

## Key Points

• Make different branches in a Git repository to try different methods.

• Use bash’s `time` command to time scripts.

# Program Flags

## Overview

Teaching: 5 min
Exercises: 5 min
Questions
• How can I make an easy shortcut to analyze all files at once using a program flag?

Objectives
• Handle flags and files separately in a command-line program.

## Handling Program Flags

Now we have a program which is capable of handling any number of data sets at once.

But what if we have 50 GDP data sets? It would be awfully tedious to type in the names of 50 files in the command line, so let’s add a flag to our program indicating that we would like it to generate a plot for each data set in the current directory.

Flags are a convention used in programming to indicate to a program that a non-default behavior is being requested by the user. In this case, we’ll be using a “-a” flag to indicate to our program we would like it to operate on all data sets in our directory.

To explore what files are in the current directory, we’ll be using Python’s `glob` module.

• In Unix, the term “globbing” means “matching a set of files with a pattern”.
• The most common patterns are:
• `*` meaning “match zero or more characters”
• `?` meaning “match exactly one character”
• Python contains the `glob` library to provide pattern matching functionality
• The `glob` library contains a function also called `glob` to match file patterns
• E.g., `glob.glob('*.txt')` matches all files in the current directory whose names end with `.txt`.
• Result is a (possibly empty) list of character strings.
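The patterns above can be tried out safely in a throwaway folder. The sketch below creates a few empty files with made-up names in a temporary directory and matches them with `glob.glob`. None of these file names come from the lesson data.

```python
import glob
import os
import tempfile

# create a throwaway folder with a few empty files to match against
with tempfile.TemporaryDirectory() as folder:
    for name in ['a1.txt', 'a22.txt', 'b1.csv']:
        open(os.path.join(folder, name), 'w').close()

    # '*' matches zero or more characters
    txt_files = sorted(glob.glob(os.path.join(folder, '*.txt')))
    # '?' matches exactly one character
    one_char = sorted(glob.glob(os.path.join(folder, 'a?.txt')))
    # a pattern that matches nothing gives an empty list, not an error
    no_match = glob.glob(os.path.join(folder, '*.png'))

    print([os.path.basename(f) for f in txt_files])
    print([os.path.basename(f) for f in one_char])
    print(no_match)
```

With these patterns in mind, here is `gdp_plots.py` with the `-a` flag added: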
```import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] # data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] # gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)
```

### Updating the repository

Yet another successful update to the code. Let’s commit our changes.

``````$ git add gdp_plots.py
$ git commit -m "Adding a flag to run script for all gdp data sets."
``````

## The Right Way to Do It

If our programs can take complex parameters or multiple filenames, we shouldn’t handle `sys.argv` directly. Instead, we should use Python’s `argparse` library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users. We will not cover this module in this lesson but you can go to Tshepang Lekhonkhobe’s Argparse tutorial that is part of Python’s Official Documentation.
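To give a flavor of what that looks like, here is a minimal sketch of how the filename list and the `-a` flag might be declared with `argparse`. The function name `parse_arguments`, the long option `--all`, and the help strings are choices made for this illustration and are not taken from the lesson’s script.

```python
import argparse

def parse_arguments(argv):
    """Parse a list of command-line arguments (without the program name)."""
    parser = argparse.ArgumentParser(
        description='Plot GDP data from one or more gapminder files.')
    # zero or more positional filenames
    parser.add_argument('filenames', nargs='*',
                        help='CSV files to plot')
    # a flag that is False unless -a (or --all) is given
    parser.add_argument('-a', '--all', action='store_true',
                        help='plot all gdp data sets in the data folder')
    return parser.parse_args(argv)

args = parse_arguments(['-a'])
print(args.all, args.filenames)

args = parse_arguments(['data/gapminder_gdp_oceania.csv'])
print(args.all, args.filenames)
```

In a real script the call would be `parse_arguments(sys.argv[1:])`. One benefit of this approach is that `argparse` generates a `-h` help message from the declarations, so the usage text cannot drift out of sync with the code.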

## Key Points

• Adding command line flags can be a user-friendly way to accomplish common tasks.

# Defensive Programming

## Overview

Teaching: 10 min
Exercises: 5 min
Questions
• How do I predict and avoid user confusion?

Objectives
• Ensure that programs indicate use and provide meaningful output upon failure.

## Defensive Programming

In our last lesson, we created a program which will plot our gapminder gdp data for an arbitrary number of files. This is great, but we didn’t cover some of the vulnerabilities of this program we’ve created.

• What happens if we run the program without any arguments at all?
• What happens if we run the program from another directory?

First, let’s try running our program without any additional arguments or flags.

``````$ python gdp_plots.py
``````
``````Traceback (most recent call last):
  File "gdp_plot.py", line 12, in <module>
    filenames = sys.argv[1:]
IndexError: list index out of range
``````

Python returns an error when trying to find the command line argument in `sys.argv`. It cannot find that argument because we haven’t provided it to the command and as a result there is no entry in `sys.argv` where we’re telling it to look for this value. We may know all of this because we’re the ones who wrote the program, but another user of the program without this experience will not.

## More on Function Errors/Exceptions

Python reports a runtime error when something goes wrong while a program is executing.

``````age = 53
remaining = 100 - aege # mis-spelled 'age'
``````
``````NameError: name 'aege' is not defined
``````
• The message indicates a problem with the name of a variable

Python also reports a syntax error when it can’t understand the source of a program.

``````print("hello world"
``````
``````  File "<ipython-input-6-d1cc229bf815>", line 1
print ("hello world"
^
SyntaxError: unexpected EOF while parsing
``````
• The message indicates a problem on the first line of the input (“line 1”).
• In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
• The `-6-` part of the filename indicates that the error occurred in cell 6 of our Notebook.
• Next is the problematic line of code, indicating the problem with a `^` pointer.

And if we run the program from another directory:

``````$ cd ..
$ python swc-gapminder/gdp_plots.py -a
``````

We see no output from the program at all. This is what is referred to as a “silent failure”. The program has failed to produce a plot, but has reported no reason why. These kinds of failures are difficult to debug and should be avoided.

It is important to employ “defensive programming” in this scenario so that our program indicates to the user

1. what is going wrong
2. how to correct this problem

### Check Input Arguments

Let’s add a section to the code which checks the number of incoming arguments to the program and returns some information to the user if there is missing information.

```import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py < filenames >")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name = filename.split('.')
    save_name = split_name[0] + '.png' # e.g. data/gapminder_gdp_XXX.png
    plt.savefig(save_name)
```

If we run the program without a filename argument, here’s what we’ll see

``````$ python gdp_plots.py
``````
``````Not enough arguments have been provided
Usage: python gdp_plots.py < filenames >
Options:
-a : plot all gdp data sets in current directory
``````

Now if someone runs this program without having used it before (or written it themselves), they will know how to change their command to get the program running properly, rather than seeing an esoteric Python error.
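One possible refinement, sketched below, is to build the help text in a small function and send it to standard error, so that it does not get mixed into any output that is piped to another program. The function name `usage_message` is hypothetical. The message text is copied from the script above.

```python
import sys

def usage_message(argv):
    """Return a help text when no arguments were given, otherwise None."""
    if len(argv) == 1:
        return ("Not enough arguments have been provided\n"
                "Usage: python gdp_plots.py < filenames >\n"
                "Options:\n"
                "-a : plot all gdp data sets in current directory")
    return None

message = usage_message(['gdp_plots.py'])
if message is not None:
    # printing to stderr keeps the help text out of any piped output
    print(message, file=sys.stderr)
    # in a real script, sys.exit(1) here would stop with a non-zero status
```

Stopping with a non-zero exit status is a common convention that lets shell scripts detect that the program did not run successfully.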

### Update the Repository

We’ve just made another successful change to our repository. Let’s add a commit to the repo.

``````$ git add gdp_plots.py
$ git commit -m "Handling case for missing filename argument"
``````

### Check for silent errors

Silent errors can be difficult to anticipate. If we try to run our program from another directory with the `-a` flag, we don’t see any errors, but it also doesn’t do anything. This is because the pattern `data/*gdp*.csv` is matched relative to the directory we run the program from, and there are no matching `.csv` files there, so our `filenames` variable is an empty list. Let’s add a check to ensure there are files to plot.

```import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py <filenames>")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
    # check if no files were found and print message.
    if filenames == []:
        # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure data is located in current directory.")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    # (keep the part of the file name before the first '.')
    split_name = filename.split('.')[0]
    save_name = split_name + '.png'
    plt.savefig(save_name)
```

Now if someone runs this program in a directory with no valid data files, a message explains why nothing was plotted.
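The same check can also be written as a small function so that it is easy to try out on its own. This is only a sketch: the name `find_data_files` is hypothetical, and the search pattern is passed in as a parameter rather than fixed in the code.

```python
import glob


def find_data_files(pattern):
    """Return the files matching pattern, warning when none are found."""
    filenames = sorted(glob.glob(pattern))
    if not filenames:
        # an empty list would otherwise make the plotting loop do nothing, silently
        print("No files found matching", pattern)
        print("Make sure data is located in current directory.")
    return filenames
```

For example, `find_data_files("data/*gdp*.csv")` returns the matching file names, or prints the two messages and returns an empty list.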

### Update the Repository

We’ve just made another successful change to our repository. Let’s add a commit to the repo.

``````$ git add gdp_plots.py
$ git commit -m "Handling case if no files present in directory"
``````

## Key Points

• Avoid silent failures.

• Avoid esoteric output when a program fails.

• Add checkpoints in code to check for common failures.

# Refactoring

## Overview

Teaching: 10 min
Exercises: 10 min
Questions
• When should I reorganize my code so it is more clear and readable for others?

• How can I organize my code so that it is useable in other places?

• Why do I almost always want to write my code as though it will be used somewhere else?

Objectives
• Understand the value of refactoring code and use of functions.

• Practice determining where code can be divided into smaller functions.

This code works nicely for generating plots of multiple data sets, but there is now a lot of code to digest in our script. Picture yourself looking at this code for the first time. Would it be immediately clear to you where arguments are handled and where plots are generated?

It would be nice to break this work into clear chunks of code. This can be accomplished by making the argument checking section and the body of the for loop their own functions. This requires surprisingly few changes to the code, but makes it much clearer. This process is called refactoring.

## Exercise: make a refactoring plan

Given the guidance above, talk with your neighbors about which parts of the script should be moved into functions. Try to think of ways to make the functions the most reusable on their own.

## Solution

A possible solution:

1. A function that parses the arguments
2. A function that makes the plots
3. A function that calls the other functions

This isn’t the only “right” solution, but it is a reasonable way to split things up.

## Let’s refactor our script

### Create a Branch

Because we’re making a major change, let’s make a new branch to work in.

``````$ git checkout -b refactor
``````

Let’s break the code into 4 functions:

• `parse_arguments()` - gets the input from `sys.argv`, returns a list of file names
• `create_plot()` - takes one file name as input, creates one plot and writes it to the `figs` folder
• `create_plots()` - takes a list of files as input, calls `create_plot()` for each element in the list
• `main()` - calls `parse_arguments()` and `create_plots()`

Below is a template that will help you write these functions. The `"""` syntax marks the start and end of a multi-line string, which is often used as a multi-line comment. If such a string is the first thing in a function, it is known as a docstring.

``````import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

def main():
    """
    main function - does all the work
    """

# call main
main()
``````

In an effort to create human-readable code, it is common to include a “docstring” (“Python documentation string”) at the top of each function that clearly lays out three things: the purpose or objective of the function, a description of the inputs and their data types, and a description of what is returned and its data type. Including these things makes later debugging much more efficient, because the developer knows the starting point (inputs), the endpoint (returns), and the intended change (purpose).
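As a small illustration of those three parts, here is a sketch of a documented function. The function `rescale` and its parameters are hypothetical and are not part of the plotting script; the layout simply follows the template above.

```python
def rescale(values, factor=1.0):
    """
    Multiply every number in a list by a constant factor.

    Input:
    ------
        values: list of numbers, the data to rescale
        factor: number, the multiplier (defaults to 1.0)

    Returns:
    --------
        scaled: list of numbers, same length as values
    """
    scaled = [value * factor for value in values]
    return scaled


# The docstring travels with the function, so it can be read back later.
first_line = rescale.__doc__.strip().splitlines()[0]
print(first_line)
```

Running `help(rescale)` in a notebook shows the whole docstring, which is usually enough to use the function without reading its body.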

Let’s move the code into the functions now:

## Exercise: refactor the code

Now that we have a plan for refactoring and a template to work from, create a new script called `refactored_gdp_plot.py`. Paste the template from above into the new script. Then copy and paste the code from the `gdp_plots.py` script into the corresponding functions.

## Solution

``````import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """
    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] # data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] # gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

    for filename in filenames:
        create_plot(filename)

def main():
    """
    main function - does all the work
    """

    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    # generate plots
    create_plots(files_to_plot)

# call main
main()
``````

The behavior of the program hasn’t changed, but it has been made more modular by separating the code into different functions, each with its own purpose.

Why the extra function? Our function `main` has two primary components to it:

1. parsing arguments handed to the program
2. generating the desired plots

But we’ve defined three functions above the `main` function: `parse_arguments`, `create_plot`, and `create_plots`.

If the functions in our file directly reflected the components in `main` we would only have the `parse_arguments` and `create_plots` functions. If we think about using these functions independently, however, a function which always takes in a list of filenames isn’t very convenient to use on its own. By defining the `create_plot` function, we have placed most of the plot generation work there, while allowing for a very simple definition of the `create_plots` function.

The importance of this design decision will be made clear in the next lesson.

Before we commit, we need to replace `gdp_plots.py` with our `refactored_gdp_plot.py` script, since we don’t want to keep two copies of this script around in our repo. We only made a separate script to make copy-pasting easier. Once you’ve tested it, you can either rename `refactored_gdp_plot.py` to `gdp_plots.py`, or copy the contents of `refactored_gdp_plot.py` into `gdp_plots.py` and delete `refactored_gdp_plot.py`.

#### Update the Repository

We haven’t changed the behavior of our program, but our code has changed, so let’s update the repository.

``````$ git add gdp_plots.py
$ git commit -m "Refactoring code."
``````

## Branching and Refactoring

To demonstrate that the behavior of our program hasn’t changed, try running it a few different ways using both the `master` and `refactor` branches. Remember that the command for checking out a branch is `git checkout <branch_name>`.

Now that we’re satisfied with our refactor, we can merge this branch into our master branch.

``````$ git checkout master
$ git merge refactor
``````

## Key Points

• Refactoring makes code more modular, easier to read, and easier to understand.

• Refactoring requires one to consider future implications and generally enables others to use your code more easily.

# Running Scripts and Importing

## Overview

Teaching: 10 min
Exercises: 10 min
Questions
• How can I import some of my work even if it is part of a program?

Objectives
• Learn how to import functions from one script into another.

• Understand the difference between using a file as a module and running it as a script or program.

In our last lesson we learned how refactoring code makes it easier to understand and organizes it into pieces that each have their own purpose. The added modularity also makes the code easier to use in other places.

First, let’s start a Jupyter notebook in our current directory and see what happens if we try to import our file as it is.

``````import gdp_plots
``````

The result is an error. Importing a file runs it from top to bottom, so Python reaches our call to the function `main` at the end of the file and tries to treat the arguments that were used to start the notebook as data files to plot.

## Running Versus Importing

Running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.

In order for a Python script to work as expected when imported or when run as a script, we typically put the part of the script that produces output in the following if statement:

``````if __name__ == '__main__':
main()  # Or whatever function produces output
``````

When you import a Python file, `__name__` is the name of that file (e.g., when importing `pandas_plots.py`, `__name__` is `'pandas_plots'`). However, when running a script in bash, `__name__` is always set to `'__main__'` in that script so that you can determine if the file is being imported or run as a script.
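To see both values of `__name__` side by side, the sketch below writes a tiny throwaway module to a temporary folder, imports it, and then runs it as a script. The module name `demo_module` and the function `load_both_ways` are hypothetical; `importlib.util` and `runpy` are standard-library tools used here only to imitate `import demo_module` and `python demo_module.py` from within one program.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Source of a hypothetical throwaway module used only for this demonstration.
MODULE_SOURCE = '''
def main():
    return "main was called"

result = None
if __name__ == "__main__":
    result = main()
'''


def load_both_ways():
    """Return (__name__, result) when imported, then when run as a script."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_module.py"
        path.write_text(MODULE_SOURCE)

        # 1. Import the file as a module: __name__ is "demo_module",
        #    so the guarded call to main() is skipped.
        spec = importlib.util.spec_from_file_location("demo_module", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # 2. Run the same file as a script: __name__ is "__main__",
        #    so main() is called.
        script_globals = runpy.run_path(str(path), run_name="__main__")

    return (module.__name__, module.result,
            script_globals["__name__"], script_globals["result"])


print(load_both_ways())
```

The imported copy never calls `main`, while the copy run as a script does, which is the behavior we want for `gdp_plots.py`.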

Let’s move the call to the `main` function in our script into a section that runs only when the program is called from the command line.

```import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] # data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] # gapminder_gdp_XXX
    save_name = 'figs/' + split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

    for filename in filenames:
        create_plot(filename)

def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    # generate plots
    create_plots(files_to_plot)

if __name__ == "__main__":
    # call main
    main()
```

Now let’s go back to the Jupyter notebook and try importing the file again.

``````import gdp_plots
``````

Success! You’ve just written your first Python module. Any of the functions in that module can now be accessed in our Jupyter notebook session.

``````%matplotlib inline
gdp_plots.create_plot("data/gapminder_gdp_oceania.csv")
``````

### Update the repository

Back in the terminal, let’s commit these changes to our repository.

``````$ git add gdp_plots.py
$ git commit -m "Moving call to the main function."
``````

## Writing Modular Code

In our previous lesson we refactored our code for clarity and modularity. As part of that process we created two functions, `create_plot` and `create_plots`. Our `main` function never calls `create_plot` directly, but imagine importing our module and finding that the only way to generate a plot for a single file is to add that filename to a list before passing that list to `create_plots`.

This might seem very strange or confusing to someone importing our module for the first time. It takes time to develop an intuition for design decisions like these, but here are a few questions to ask yourself as a guide when organizing code:

• Are my functions able to stand on their own? Do they accomplish simple tasks?
• Is it easy to write clear function names in my module? A function with the name “create_plots_and_export_data” is likely better off being broken up into two functions, “create_plots” and “export_data”.
• Are my function names plural? In our example, it may have felt more natural to create only one function, “create_plots”. In these cases, it is almost always useful to create a function which does our operation once (“create_plot”) and another plural form of the function (“create_plots”) which contains a loop over the singular function.
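The singular/plural pattern from the last question can be sketched with a pair of toy functions. The names `word_count` and `word_counts` are hypothetical and unrelated to the plotting script; they only show the shape of the design.

```python
def word_count(sentence):
    """Return the number of whitespace-separated words in one sentence."""
    return len(sentence.split())


def word_counts(sentences):
    """Return a list with the word count of each sentence."""
    # the plural function is just a loop over the singular one
    return [word_count(sentence) for sentence in sentences]
```

Someone importing this code can call `word_count` on a single string without first wrapping it in a list, and `word_counts` stays short enough to understand at a glance.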

## Key Points

• The `__name__` variable allows us to know whether the file is being imported or run as a script.

# Programming Style

## Overview

Teaching: 5 min
Exercises: 5 min
Questions
• How can I make my programs more readable?

• How do most programmers format their code?

• How can programs check their own operation?

Objectives
• Provide sound justifications for basic rules of coding style.

• Refactor one-page programs to make them more readable and justify the changes.

• Use Python community coding standards (PEP-8).

• PEP8: a style guide for Python that discusses topics such as how you should name variables, how you should use indentation in your code, how you should structure your `import` statements, etc. Adhering to PEP8 makes it easier for other Python developers to read and understand your code, and to understand what their contributions should look like. The `pep8` command-line tool and Python library (since renamed `pycodestyle`) can check your code for compliance with PEP8.

Best Practice: Write programs for people and not for computers!

## Use assertions to check for internal errors.

Assertions are a simple, but powerful method for making sure that the context in which your code is executing is as you expect.

``````def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0
    return mass / volume
``````

If the assertion is `False`, the Python interpreter raises an `AssertionError` runtime exception. The source code for the expression that failed will be displayed as part of the error message. To ignore assertions in your code run the interpreter with the ‘-O’ (optimize) switch. Assertions should contain only simple checks and never change the state of the program. For example, an assertion should never contain an assignment.
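An assertion can also carry a message that is shown when the check fails. The sketch below repeats `calc_bulk_density` with such a message and adds a hypothetical helper, `try_density`, which catches the `AssertionError` only so that the outcome can be inspected. It assumes Python is run without the `-O` switch.

```python
def calc_bulk_density(mass, volume):
    """Return dry bulk density = powder mass / powder volume."""
    # the text after the comma becomes the error message
    assert volume > 0, "volume must be positive"
    return mass / volume


def try_density(mass, volume):
    """Return the density, or a description of the failed assertion."""
    try:
        return calc_bulk_density(mass, volume)
    except AssertionError as error:
        return "AssertionError: " + str(error)


print(try_density(10, 2))
print(try_density(10, 0))
```

In real code the `AssertionError` is usually left uncaught, so that the program stops at the point where its assumptions were violated.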

Best Practice: Plan for mistakes

Best Practice: Document design & purpose, not just mechanics

• If the first thing in a function is a character string that is not assigned to a variable, Python attaches it to the function as the online help.
• Called a docstring (short for “documentation string”).
``````def average(values):
    "Return average of values, or None if no values are supplied."

    if len(values) == 0:
        return None
    return sum(values) / len(values)

help(average)
``````
``````Help on function average in module __main__:

average(values)
Return average of values, or None if no values are supplied.
``````

## Multiline Strings

Multiline strings are often used for documentation. These start with three quote characters (either single or double) and end with three matching characters.

``````"""This string spans
multiple lines.

Blank lines are allowed."""
``````

## What Will Be Shown?

Highlight the lines in the code below that will be available as online help. Are there lines that should be made available, but won’t be? Will any lines produce a syntax error or a runtime error?

``````"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.

def overall_max(sequences):
    '''Determine overall maximum edit distance.'''

    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)

    # Report.
    return highest
``````

## Document This

Turn the comment on the following function into a docstring and check that `help` displays it properly.

``````def middle(a, b, c):
    # Return the middle value of three.
    # Assumes the values can actually be compared.
    values = [a, b, c]
    values.sort()
    return values[1]
``````

## Clean Up This Code

1. Read this short program and try to predict what it does.
2. Run it: how accurate was your prediction?
3. Refactor the program to make it more readable. Remember to run it after each change to ensure its behavior hasn’t changed.
4. Compare your rewrite with your neighbor’s. What did you do the same? What did you do differently, and why?
``````import sys
n = int(sys.argv[1])
s = sys.argv[2]
print(s)
i = 0
while i < n:
    # print('at', j)
    new = ''
    for j in range(len(s)):
        left = j-1
        right = (j+1)%len(s)
        if s[left]==s[right]: new += '-'
        else: new += '*'
    s=''.join(new)
    print(s)
    i += 1
``````

## Solution

Here’s one solution.

``````def string_machine(input_string, iterations):
    """
    Takes input_string and generates a new string with -'s and *'s
    corresponding to characters that have identical adjacent characters
    or not, respectively.  Iterates through this procedure with the resultant
    strings for the supplied number of iterations.
    """
    print(input_string)
    old = input_string
    for i in range(iterations):
        new = ''
        # iterate through characters in previous string
        for j in range(len(old)):
            left = j-1
            right = (j+1)%len(old) # ensure right index wraps around
            if old[left]==old[right]:
                new += '-'
            else:
                new += '*'
        print(new)
        # store new string as old
        old = new

string_machine('et cetera', 10)
``````

# Wrap-Up

## Overview

Teaching: 10 min
Exercises: 0 min
Questions
• What have we learned?

• What else is out there and where do I find it?

Objectives
• Name and locate scientific Python community sites for software, workshops, and help.

Leslie Lamport once said, “Writing is nature’s way of showing you how sloppy your thinking is.” The same is true of programming: many things that seem obvious when we’re thinking about them turn out to be anything but when we have to explain them precisely.

## Key Points

• Python supports a large community within and outwith research.