Workflows with Python and Git: All in One View

Content from Review Exercise

Last updated on 2025-02-10

Overview

Questions

How can we put together all of yesterday’s material?

Objectives

Apply use of functions, conditionals and loops to solve a problem.

Challenge

Review From Yesterday - All in One Exercise

In your notebook, write a function that determines whether a year between 1901 and 2000 is a leap year, where it prints a message like “1904 is a leap year” or “1905 is not a leap year” as output. Use this function to evaluate the years 1928, 1950, 1959, 1972 and 1990. Essentially, given this list of years:

PYTHON

years = [1928, 1950, 1959, 1972, 1990]

Produce something like:

OUTPUT

1928 is a leap year
1950 is not a leap year.
1959 is not a leap year.
1972 is a leap year
1990 is not a leap year.

Hint: the percent symbol ‘%’ is the modulo operator in Python. So:

PYTHON

print('8 mod 4 equals', 8 % 4)
print('10 mod 4 equals', 10 % 4)

OUTPUT

8 mod 4 equals 0
10 mod 4 equals 2

If you’re not sure where to start, see the fill-in-the-blank version of this exercise below.

Show me the solution

PYTHON

def leap_year(year):
    if year % 4 == 0:
        print(year, "is a leap year")
    else:
        print(year, "is not a leap year.")

year_list = [1928, 1950, 1959, 1972, 1990]
for year in year_list:
    leap_year(year)

Challenge

Review From Yesterday - Step by Step Breakdown

First, try to determine how to use the mod operator % to determine if a year is divisible by 4 (and thus a leap year or not).

Modular Arithmetic

If a year in the range specified is divisible by four, it is a leap year. If a number is divisible by 4, then the arithmetic expression “number mod four” (or num % 4 in Python) will equal zero.
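As a small sketch of this idea, the helper below (the name is_divisible_by_four is made up for illustration) checks the remainder for a few numbers:

```python
# Hypothetical helper: True when num divides evenly by 4.
def is_divisible_by_four(num):
    remainder = num % 4      # 0 when num is a multiple of 4
    return remainder == 0

for num in [8, 10, 1904, 1905]:
    print(num, "divisible by 4:", is_divisible_by_four(num))
```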

Challenge

Review From Yesterday - Step by Step Breakdown (continued)

Then, create a conditional statement to use this information, and put it into a function.

Conditional Statement

Fill in the blanks:

PYTHON

year = 1904
if year % 4 == _____:
    print(year, _______________)
______:
    print(year, "is not a leap year.")

Challenge

Review From Yesterday - Step by Step Breakdown (continued)

Then, put your conditional statement into a function so that it can be reused for any year.

Function

Fill in the blanks:

PYTHON

def leap_year(year):
    _________

Challenge

Review From Yesterday - Step by Step Breakdown (continued)

Finally, use a for loop and your function to evaluate these years.

Loop

Fill in the blanks:

PYTHON

year_list = [1928, 1950, 1959, 1972, 1990]
for year in ______:
    ________(year)

Show me the solution

PYTHON

def leap_year(year):
    if year % 4 == 0:
        print(year, "is a leap year")
    else:
        print(year, "is not a leap year.")

year_list = [1928, 1950, 1959, 1972, 1990]
for year in year_list:
    leap_year(year)

Discussion

Additional Challenge

If you have time:

Expand your function so that it correctly categorizes any year from 0 onwards.
Instead of printing whether a year is a leap year or not, save the results to a Python dictionary, where there are two keys (“leap” and “not-leap”) and the values are lists of years.
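One possible approach is sketched below. It assumes the Gregorian rule is applied to every year: a year divisible by 4 is a leap year, except century years, which must also be divisible by 400. The function names is_leap_year and sort_years are made up for this sketch.

```python
def is_leap_year(year):
    """Return True if year is a leap year under the Gregorian rule."""
    if year % 400 == 0:      # e.g. 2000
        return True
    if year % 100 == 0:      # e.g. 1900
        return False
    return year % 4 == 0     # e.g. 1928

def sort_years(years):
    """Group years into a dictionary with 'leap' and 'not-leap' keys."""
    results = {"leap": [], "not-leap": []}
    for year in years:
        if is_leap_year(year):
            results["leap"].append(year)
        else:
            results["not-leap"].append(year)
    return results

results = sort_years([1900, 1928, 1950, 2000, 2023])
print(results)
# prints {'leap': [1928, 2000], 'not-leap': [1900, 1950, 2023]}
```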

Key Points

Use skills together.

Content from Command-Line Programs

Last updated on 2025-04-28

Overview

Questions

How can I write Python programs that will work like Unix command-line tools?

Objectives

Use the values of command-line arguments in a program.
Read data from standard input in a program so that it can be used in a pipeline.

The Jupyter Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use that code in a program we can run from the command line. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a gapminder data set and plots the GDP of countries over time.

Callout

Switching to Shell Commands

In this lesson we are switching from typing commands in Jupyter notebooks to typing commands in a shell terminal window (such as bash). When you see a $ in front of a command, that tells you to run that command in the shell rather than in the Python interpreter.

Callout

Converting Notebooks

The Jupyter Notebook has the ability to convert all of the cells of the current notebook into a Python program. To do this, go to File -> Download as and select Python (.py) to get the current notebook as a Python script.

Setting up your project

Up until now, we’ve been working in the data folder directly. Because we’re going to be dealing with more files of different types in this lesson, let’s do a little rearranging:

On your desktop, create a folder called swc-gapminder.
Move the data folder you’ve been using into this folder.
Inside swc-gapminder, create a folder called figs.
To ensure that we’re all starting with the same set of code, copy the text below into a file called gdp_plots.py in the swc-gapminder folder or download the starting script file from the lesson.

PYTHON

import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col = 'country').T

# create a plot of the transposed data
ax = data.plot()

# display the plot
plt.show()

This program imports the pandas and matplotlib Python modules, reads some of the gapminder data into a pandas dataframe, and plots that data using matplotlib with some default settings.

We can run this program from the command line using

BASH

$ python gdp_plots.py

This is much easier than starting a notebook, going to the browser, and running each cell in the notebook to get the same result.

Initialize a repository

But before we modify our gdp_plots.py program, we are going to put it under version control so that we can track its changes as we go through this lesson.

BASH

$ git init
$ git add gdp_plots.py
$ git commit -m "First commit of analysis script"

Because we’re only concerned with changes to our analysis script, we are going to create a .gitignore file for all of the gapminder .csv files and any Jupyter notebook (.ipynb) files we have created thus far.

BASH

$ echo "data/*.csv" > .gitignore
$ echo "*.ipynb" >> .gitignore
$ git add .gitignore
$ git commit -m "Adding ignore file"

Now that we have a clean repository, let’s get back to work on adding command line arguments to our program.

Changing code under Version Control

As it is, this plot isn’t bad, but let’s add some labels for clarity. We’ll use the data filename as a title for the plot and indicate what information is on each axis.

PYTHON

import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = 'data/gapminder_gdp_oceania.csv'

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()

Now when we run this, our plot looks a little bit nicer.

BASH

$ python gdp_plots.py

Updating the Repository

BASH

$ git add gdp_plots.py
$ git commit -m "Improving plot format"

Command-Line Arguments

This program currently only works for the Oceania set of data. How might we modify the program to work for any of the gapminder gdp data sets? We could go into the script and change the .csv filename to generate the same plot for different sets of data, but there is an even better way.

Python programs can use additional arguments provided in the following manner.

BASH

$ python <program> <argument1> <argument2> <other_arguments>

The program can then use these arguments to alter its behavior based on those arguments. In this case, we’ll be using arguments to tell our program to operate on a specific file.

We’ll be using the sys module to do so. sys (short for system) is a standard Python module used to store information about the program and its running environment, including what arguments were passed to the program when the command was executed. These arguments are stored as a list in sys.argv.

These arguments can be accessed in our program by importing the sys module. The first argument in sys.argv is always the name of the program, so we’ll find any additional arguments right after that in the list.

Let’s try this out in a separate script. Using the text editor of your choice, let’s write a new program called argv_list.py containing the two following lines:

PYTHON

import sys
print('sys.argv is', sys.argv)

The strange name argv stands for “argument values”. Whenever Python runs a program, it takes all of the values given on the command line and puts them in the list sys.argv so that the program can determine what they were. If we run this program with no arguments:

BASH

$ python argv_list.py

OUTPUT

sys.argv is ['argv_list.py']

the only thing in the list is the name of our script as we typed it on the command line, which is always sys.argv[0].

If we run it with a few arguments, however:

BASH

$ python argv_list.py first second third

OUTPUT

sys.argv is ['argv_list.py', 'first', 'second', 'third']

then Python adds each of those arguments to that magic list.

Using this new information, let’s add command line arguments to our gdp_plots.py program.

To do this, we’ll make two changes:

add the import of the sys module at the beginning of the program.
replace the filename (“data/gapminder_gdp_oceania.csv”) with the second entry in the sys.argv list.

Now our program should look as follows:

PYTHON

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)

# display the plot
plt.show()

Let’s take a look at what happens when we provide a gapminder filename to the program.

BASH

$ python gdp_plots.py data/gapminder_gdp_oceania.csv

And the same plot as before is displayed, but this file is now being read from an argument we’ve provided on the command line. We can now do this for files with similar information and get the same set of plots for that data without any changes to our program’s code. Try this out for yourself now.
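If we forget to give a filename, sys.argv has only one entry, so sys.argv[1] does not exist and Python raises an IndexError. A minimal sketch of one way to guard against that is shown here; the helper name get_filename and the usage message are assumptions for illustration, not part of the lesson's script.

```python
import sys

def get_filename(argv):
    """Return the first command-line argument, or None if there is none."""
    if len(argv) < 2:        # only the program name was given
        return None
    return argv[1]

filename = get_filename(sys.argv)
if filename is None:
    print("usage: python gdp_plots.py <csv_file>")
else:
    print("would plot", filename)
```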

Update the Repository

Now that we’ve made this change to our program and seen that it works, let’s update our repository with these changes.

BASH

$ git add gdp_plots.py
$ git commit -m "Adding command line arguments"

Challenge

Exercise: read multiple files

Try to run the gdp_plots.py so that it reads in all the .csv files in the data folder using the wildcard symbol. Does it work? Why or why not?

Show me the solution

If you run it with the argument ‘data/*.csv’, you get an error on the Americas file because it has an extra column. However, it works if you omit that file.

Key Points

The sys library connects a Python program to the system it is running on.
The variable sys.argv is a list with each item being a command-line argument.

Content from Trying Different Methods

Last updated on 2025-04-28

Overview

Questions

How do I plot multiple data sets using different methods?

Objectives

Read data from standard input in a program so that it can be used in a pipeline.
Compare using different methods to accomplish the same task.
Practice making branches and merging in a Git repository.

Handling Multiple Files

Perhaps we would like a script that can operate on multiple data files at once. This can actually be done in two different ways: (1) writing a bash script that calls our already-existing Python script in a for-loop, or (2) modifying our Python script to read multiple files in a for-loop. We will try both methods and compare the two.

Saving plot before branching

Our code currently pops up the figure instead of saving it. Most of the time we will want to save the figure as a separate file instead of only viewing it, and saving instead of viewing will also help with the flow of our program’s for loops. With the current configuration, when we run the for loop it will pause each time it pops up a figure and wait for us to close the viewer.

PYTHON

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)

# save the plot
plt.savefig('figs/gdp-plot.png')

This version of the code overwrites the same figure file each time the script is run. This won’t work well for generating multiple figures, so we need to edit the script to write a unique name for each input file.

A simple unique name can be generated based on the original file name. We can use Python’s split() string method to split filename, which is a string, at any character we choose. This returns a list like so:

PYTHON

name = 'my-data.csv'
split_name = name.split('.')
print(split_name)
print(split_name[0])

OUTPUT

['my-data', 'csv']
my-data

We’ll split the original file name and use the first part to rename our plot. And then we will concatenate .png to the name to specify our file type.

PYTHON

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)

# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)
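The two split() calls above assume the filename has exactly one dot and one folder in front of it. As an alternative sketch, Python's standard pathlib module can pull out the file name without its folder or extension. The helper name make_save_name and its out_dir parameter are made up for this example.

```python
from pathlib import Path

def make_save_name(filename, out_dir="figs"):
    """Build 'figs/<name>.png' from a path such as 'data/<name>.csv'."""
    stem = Path(filename).stem   # file name without folders or final extension
    return out_dir + "/" + stem + ".png"

print(make_save_name("data/gapminder_gdp_oceania.csv"))
# prints figs/gapminder_gdp_oceania.png
```

Because Path.stem only looks at the last part of the path, this also works when the file sits several folders deep.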

Let’s commit the version of the code that saves a unique name for each plot.

BASH

$ git add gdp_plots.py
$ git commit -m "Saving plots to a unique name"

Create New Branches

First we will create two new branches where we can develop each of these two different methods. We will call these branches py-loop and sh-loop.

BASH

$ git branch py-loop
$ git branch sh-loop

We can check that these two branches were created with $ git branch -a.

Handling Multiple Files with Python

First, we’ll try using just Python to loop through multiple files. Let’s switch to our Python branch.

BASH

$ git checkout py-loop

We want our program to process each file separately, and the easiest way to do this is a loop that executes our plotting statements once for each of the filenames provided in sys.argv.

But we need to be careful: sys.argv[0] will always be the name of our program, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.

A solution is to loop over the contents of sys.argv[1:]. The ‘1’ tells Python to start the slice at location 1, so the program’s name isn’t included. Since we’ve left off the upper bound, the slice runs to the end of the list, and includes all the filenames.
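To see the slice in action without running a full program, here is a short sketch that uses an ordinary list (example_argv, a made-up stand-in for sys.argv):

```python
# A stand-in for what sys.argv might hold; the file names are made up.
example_argv = ["gdp_plots.py", "data/a.csv", "data/b.csv"]

filenames = example_argv[1:]   # everything after the program name
print(filenames)
# prints ['data/a.csv', 'data/b.csv']

no_files = ["gdp_plots.py"][1:]   # no filenames given: the slice is empty
print(no_files)
# prints []
```

When the slice is empty, a for loop over it simply runs zero times.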

Here is our updated program.

PYTHON

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

Now when the program is given multiple filenames

BASH

$ python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv

one plot for each filename is generated.

Update the Repository

BASH

$ git status # always check the branch before you commit
$ git add gdp_plots.py
$ git commit -m "Allowing plot generation for multiple files at once"

Updating the repository

Yet another successful update to the code. Let’s commit our changes. If we run git status, we see that we also have untracked image files. Let’s ignore those files because they will likely change as our data changes.

BASH

$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"

Handling Multiple Files with Bash

Now that we’ve created a Python script to save multiple files at once, let’s try to do the same thing in Bash. We’ll leave our current branch alone and switch to our sh-loop branch.

BASH

$ git checkout sh-loop

If we look at our gdp_plots.py file, it is not in a for-loop format because this branch is not up to date with the other branch. Our script should look like this:

PYTHON

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)

# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)

Let’s use bash to generate multiple plots by calling our Python script in a bash for-loop. First, let’s create a bash file for us to edit.

BASH

$ touch gdp_plots.sh

In this script, we’ll write a for-loop that will call our gdp_plots.py script on multiple files. If the list of files gets long, we can break it up by using a backslash \ and writing the rest on the next line.

BASH

for filename in data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
do
   python gdp_plots.py $filename
done

We can run our script to see if it works:

BASH

$ bash gdp_plots.sh

This script now generates a plot for each file. We can check by checking the time each plot was created.

BASH

$ ls -lh figs/

Updating the repository

Yet another successful update to the code. Let’s commit our changes. Since our Python and bash scripts had somewhat unrelated changes, let’s make two separate commits. We will also ignore our images like before.

BASH

$ git add gdp_plots.sh
$ git commit -m "Wrote bash script to call python plotter."
$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"

Comparing Methods

We have successfully developed two different methods for accomplishing the same task. This is common to do in software development when there is not a clear path forward. Let’s compare our two methods and decide which to merge into our main branch.

One comparison we might be interested in is how fast each method is. We can use bash’s time command to measure how long a script takes to run. To do this, we just add time before the command, and it gives us timing information when the command completes.

While we are on our bash branch, we’ll time that script first.

BASH

$ time bash gdp_plots.sh

real    0m6.031s
user    0m5.535s
sys     0m0.388s

We are most interested in the “real” time in the output, which is the elapsed time we experience.
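If we ever want a similar elapsed-time measurement from inside a Python program instead of from the shell, the standard library's time.perf_counter() can be used. The sketch below is only an illustration; the helper name time_call is made up, and the function being timed is a stand-in for real work.

```python
import time

def time_call(func, *args):
    """Run func(*args) and return its result and the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload: add up the numbers from 0 to 999,999.
total, seconds = time_call(sum, range(1_000_000))
print("result:", total)
print("elapsed: %.4f seconds" % seconds)
```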

Let’s check out our Python branch and time our script there.

BASH

$ git checkout py-loop
$ time python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv

real    0m3.163s
user    0m3.002s
sys     0m0.132s

As we can see, our Python method ran faster than the bash method. For this reason, we will merge our Python branch into main.

BASH

$ git checkout main
$ git merge py-loop

Another advantage to the Python method over the bash method is that we do not need to change a file and commit it each time we want to create plots from different data.

More Practice with Multiple Files in Python

Challenge

Finding Particular Files

Using the glob module, write a simple version of ls that shows files in the current directory with a particular suffix. A call to this script should look like this:

BASH

$ python my_ls.py py

OUTPUT

left.py
right.py
zero.py

Show me the solution

PYTHON

import sys
import glob

def main():
    '''prints names of all files with sys.argv as suffix'''
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return

main()

Challenge

Counting Lines

Write a program called line_count.py that works like the Unix wc command:

If no filenames are given, it reports the number of lines in standard input.
If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.

Show me the solution

PYTHON

import sys

def main():
    '''print each input filename and the number of lines in it,
       and print the sum of the number of lines'''
    filenames = sys.argv[1:]
    sum_nlines = 0 #initialize counting variable

    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for f in filenames:
            n = count_file(f)
            print('%s %d' % (f, n))
            sum_nlines += n
        print('total: %d' % sum_nlines)

def count_file(filename):
    '''count the number of lines in a file'''
    # the with statement closes the file for us, even if an error occurs
    with open(filename, 'r') as f:
        nlines = len(f.readlines())
    return nlines

def count_file_like(file_like):
    '''count the number of lines in a file-like object (eg stdin)'''
    n = 0
    for line in file_like:
        n = n+1
    return n

main()

Key Points

Make different branches in a Git repository to try different methods.
Use bash’s time command to time scripts.

Content from Program Flags

Last updated on 2025-04-28

Overview

Questions

How can I make an easy shortcut to analyze all files at once using a program flag?

Objectives

Handle flags and files separately in a command-line program.

Handling Program Flags

Now we have a program which is capable of handling any number of data sets at once.

But what if we have 50 GDP data sets? It would be awfully tedious to type in the names of 50 files in the command line, so let’s add a flag to our program indicating that we would like it to generate a plot for each data set in the current directory.

Flags are a convention used in programming to indicate to a program that a non-default behavior is being requested by the user. In this case, we’ll be using a “-a” flag to indicate to our program we would like it to operate on all data sets in our directory.
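As a rough sketch of how flags and filenames can be told apart, the helper below (split_args is a made-up name) treats any argument that starts with a dash as a flag and everything else as a file:

```python
def split_args(args):
    """Separate flags (starting with '-') from other arguments."""
    flags = [arg for arg in args if arg.startswith("-")]
    files = [arg for arg in args if not arg.startswith("-")]
    return flags, files

flags, files = split_args(["-a", "data/gapminder_gdp_asia.csv"])
print("flags:", flags)
print("files:", files)
# prints flags: ['-a']
# prints files: ['data/gapminder_gdp_asia.csv']
```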

To explore what files are in the current directory, we’ll be using Python’s glob module.

In Unix, the term “globbing” means “matching a set of files with a pattern”.
The most common patterns are:
- * meaning “match zero or more characters”
- ? meaning “match exactly one character”
Python contains the glob library to provide pattern-matching functionality.
The glob library contains a function, also called glob, to match file patterns.
- E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
- The result is a (possibly empty) list of character strings.
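To try these patterns safely, the sketch below creates a few empty files in a temporary folder and matches them with glob.glob. The file names are made up for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # create some empty example files
    for name in ["notes.txt", "todo.txt", "plot.png", "a.csv"]:
        open(os.path.join(tmp, name), "w").close()

    txt_paths = sorted(glob.glob(os.path.join(tmp, "*.txt")))  # any name ending in .txt
    csv_paths = sorted(glob.glob(os.path.join(tmp, "?.csv")))  # one character, then .csv
    no_match = glob.glob(os.path.join(tmp, "*.xlsx"))          # nothing matches

    # keep only the file names, without the temporary folder
    txt_names = [os.path.basename(p) for p in txt_paths]
    csv_names = [os.path.basename(p) for p in csv_paths]

print(txt_names)
print(csv_names)
print(no_match)
# prints ['notes.txt', 'todo.txt']
# prints ['a.csv']
# prints []
```

The results are sorted here because glob.glob does not promise any particular order.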

PYTHON

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

Let’s test whether our run-all flag works by running the script before we commit it. It is always good practice to run your code first so you don’t accidentally commit broken code.

BASH

$ python gdp_plots.py -a

OUTPUT

Traceback (most recent call last):
  File "gdp_plots.py", line 23, in <module>
    ax = data.plot(title = filename)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_core.py", line 1031, in __call__
    return plot_backend.plot(data, kind=kind, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/__init__.py", line 71, in plot
    plot_obj.generate()
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py", line 451, in generate
    self._compute_plot_data()
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py", line 636, in _compute_plot_data
    raise TypeError("no numeric data to plot")
TypeError: no numeric data to plot

This error is saying that the data in one or more of our files is non-numeric and pandas doesn’t know how to plot it. Let’s add a print statement to our code to check the head of our data file each time through the loop.

PYTHON

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    print(filename)
    print(data.head())

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

Let’s run the code again with our print statements

BASH

$ python gdp_plots.py -a

OUTPUT

data/gapminder_gdp_americas.csv
country           Argentina      Bolivia       Brazil  ... United States      Uruguay    Venezuela
continent          Americas     Americas     Americas  ...      Americas     Americas     Americas
gdpPercap_1952  5911.315053  2677.326347  2108.944355  ...   13990.48208  5716.766744  7689.799761
gdpPercap_1957  6856.856212  2127.686326  2487.365989  ...   14847.12712  6150.772969  9802.466526
gdpPercap_1962  7133.166023  2180.972546  3336.585802  ...   16173.14586  5603.357717  8422.974165
gdpPercap_1967  8052.953021  2586.886053  3429.864357  ...   19530.36557   5444.61962  9541.474188

[5 rows x 25 columns]
Traceback (most recent call last):
  File "gdp_plots.py", line 23, in <module>
    ax = data.plot(title = filename)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_core.py", line 1031, in __call__
    return plot_backend.plot(data, kind=kind, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/__init__.py", line 71, in plot
    plot_obj.generate()
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py", line 451, in generate
    self._compute_plot_data()
  File "/opt/anaconda3/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py", line 636, in _compute_plot_data
    raise TypeError("no numeric data to plot")
TypeError: no numeric data to plot

Now we can see that the Americas GDP file contains a row of continent data. This was originally a continent column before we transposed the data. Because pandas needs a single data type within each column, including a row of strings means that every column now mixes text with numbers, so pandas stores the columns as generic objects rather than as numbers. As a result, it doesn’t see any(!) of our values as numeric data that can be plotted.

We can add a check into our code that drops the continent row if it exists.

PYTHON

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    if "continent" in data.index:
        data.drop("continent", inplace=True)
    print(filename)
    print(data.head())

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

Now running the script again works!

BASH

$ python gdp_plots.py -a

It still prints the filename and head of each data frame, though. We should delete those lines, and our final code should be the following script:

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    if "continent" in data.index:
        data.drop("continent", inplace=True)

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

Updating the repository

Yet another successful update to the code. Let’s commit our changes.

BASH

$ git add gdp_plots.py
$ git commit -m "Adding a flag to run script for all gdp data sets."

Callout

The Right Way to Do It

If our programs can take complex parameters or multiple filenames, we shouldn’t handle sys.argv directly. Instead, we should use Python’s argparse library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users. We will not cover this module in this lesson but you can go to Tshepang Lekhonkhobe’s Argparse tutorial that is part of Python’s Official Documentation.
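For the curious, here is a minimal sketch of how the same command line handling might look with argparse. It is not the lesson’s script: the long option name --all, the help texts, and the error message are assumptions made for illustration, and the function takes the argument list without the program name.

```python
import argparse
import glob


def parse_arguments(argv):
    """Return the list of files to plot for the given argument list."""
    parser = argparse.ArgumentParser(
        prog="gdp_plots.py",
        description="Plot GDP data from one or more gapminder CSV files.",
    )
    # zero or more filenames given on the command line
    parser.add_argument("filenames", nargs="*", help="CSV files to plot")
    # the -a flag; the long name --all is an assumption for this sketch
    parser.add_argument(
        "-a", "--all", action="store_true",
        help="plot all gdp data sets in the data directory",
    )
    args = parser.parse_args(argv)

    if args.all:
        return glob.glob("data/*gdp*.csv")
    if not args.filenames:
        # parser.error prints the usage message and exits with status 2
        parser.error("provide at least one filename or the -a flag")
    return args.filenames


print(parse_arguments(["data/gapminder_gdp_oceania.csv"]))
```

With this approach, running the program with -h prints a usage message built from the help texts, so we would not have to write one by hand.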

Key Points

Adding command line flags can be a user-friendly way to accomplish common tasks.

Content from Defensive Programming

Last updated on 2025-04-28

Overview

Questions

How do I predict and avoid user confusion?

Objectives

Ensure that programs indicate use and provide meaningful output upon failure.

Defensive Programming

In our last lesson, we created a program which will plot our gapminder gdp data for an arbitrary number of files. This is great, but we didn’t cover some of the vulnerabilities of this program we’ve created.

What happens if we run the program without any arguments at all?
What happens if we run the program from another directory?

First, let’s try running our program without any additional arguments or flags.

BASH

$ python gdp_plots.py

OUTPUT

Traceback (most recent call last):
  File "gdp_plot.py", line 12, in <module>
    filenames = sys.argv[1:]
IndexError: list index out of range

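Python raises an error when the script looks for a command line argument in sys.argv. It cannot find that argument because we haven’t provided it to the command, so there is no entry in sys.argv at the position where the script looks for it. (An IndexError like this comes from reading a single entry, such as sys.argv[1]; a slice such as sys.argv[1:] returns an empty list instead, in which case the script would quietly do nothing, which is no more helpful.) We may know all of this because we’re the ones who wrote the program, but another user of the program without this experience will not.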
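A quick way to see which operations on sys.argv can fail is to try them on a stand-in list. In this sketch, argv is a hypothetical list holding only the script name, which is what sys.argv contains when no arguments are given:

```python
# stand-in for sys.argv when the script is run with no extra arguments
argv = ["gdp_plots.py"]

# slicing past the end of a list is allowed and gives an empty list
filenames = argv[1:]
print(filenames)

# reading a single position that does not exist raises IndexError
caught = None
try:
    filename = argv[1]
except IndexError as error:
    caught = str(error)
    print("IndexError:", caught)
```

Either way, the user gets no hint about what the program expected, which is the problem this lesson sets out to fix.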

Callout

More on Function Errors/Exceptions

Python reports a runtime error when something goes wrong while a program is executing.

PYTHON

age = 53
remaining = 100 - aege # mis-spelled 'age'

ERROR

NameError: name 'aege' is not defined

The message indicates a problem with the name of a variable.

Python also reports a syntax error when it can’t understand the source of a program.

PYTHON

print("hello world"

ERROR

  File "<ipython-input-6-d1cc229bf815>", line 1
    print ("hello world"
                        ^
SyntaxError: unexpected EOF while parsing

The message indicates a problem on the first line of the input (“line 1”).
In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
The -6- part of the filename indicates that the error occurred in cell 6 of our Notebook.
Next is the problematic line of code, indicating the problem with a ^ pointer.

And if we run the program from another directory:

BASH

$ cd ..
$ python swc-gapminder/gdp_plots.py -a

We see no output from the program at all. This is what is referred to as a “silent failure”. The program has failed to produce a plot, but has reported no reason why. These kinds of failures are difficult to debug and should be avoided.

It is important to employ “defensive programming” in this scenario so that our program indicates to the user:

what is going wrong
how to correct this problem

Check Input Arguments

Let’s add a section to the code which checks the number of incoming arguments to the program and returns some information to the user if there is missing information.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py <filenames>")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    if "continent" in data.index:
        data.drop("continent", inplace=True)

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

If we run the program without a filename argument, here’s what we’ll see:

BASH

$ python gdp_plots.py

OUTPUT

Not enough arguments have been provided
Usage: python gdp_plots.py <filenames>
Options:
-a : plot all gdp data sets in current directory

Now if someone runs this program without having used it before (or written it themselves), they will know how to change their command to get the program running properly, rather than seeing an esoteric Python error.
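The check above prints its messages and then carries on with the rest of the script. One possible extension, which is not part of the lesson’s script, is to stop right after the usage message. The sketch below wraps the check in a hypothetical function, check_arguments, and uses sys.exit to end the program with a non-zero status:

```python
import sys


def check_arguments(argv):
    """Print a usage message and stop the program if no arguments were given."""
    if len(argv) == 1:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")
        # a non-zero exit status tells the shell that something went wrong
        sys.exit(1)


# enough arguments were given, so the function simply returns
check_arguments(["gdp_plots.py", "-a"])
```

In a script, the call would be check_arguments(sys.argv). Stopping early makes the comment “why the program will not continue” literally true.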

Update the Repository

We’ve just made another successful change to our repository. Let’s add a commit to the repo.

BASH

$ git add gdp_plots.py
$ git commit -m "Handling case for missing filename argument"

Check for silent errors

Silent errors can be difficult to anticipate. If we try to run our program from another directory with the -a flag, we don’t see any errors, but it also doesn’t do anything. This is because the pattern data/*gdp*.csv is searched for relative to the directory we run the program from. When we use the -a flag there, no matching .csv files are found, so our filenames variable is an empty list and the loop never runs. Let’s add a check to ensure there are files to plot.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

# make sure additional arguments or flags have
# been provided by the user
if len(sys.argv) == 1:
    # why the program will not continue
    print("Not enough arguments have been provided")
    # how this can be corrected
    print("Usage: python gdp_plots.py <filenames>")
    print("Options:")
    print("-a : plot all gdp data sets in current directory")

# check for -a flag in arguments
if "-a" in sys.argv:
    filenames = glob.glob("data/*gdp*.csv")
    # check if no files were found and print message.
    if filenames == []:
        # file list is empty (no files found)
        print("No files found in this folder.")
        print("Make sure data is located in current directory.")
else:
    filenames = sys.argv[1:]

for filename in filenames:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    if "continent" in data.index:
        data.drop("continent", inplace=True)

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

Now if someone runs this program in a directory with no valid datafiles, a message appears.
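The cause of the silent failure can be reproduced on its own. The sketch below builds a throwaway folder layout (the folder and file names are made up for this example) and shows that the same glob pattern finds a file from inside the project folder but nothing from its parent:

```python
import glob
import os
import tempfile

pattern = "data/*gdp*.csv"
start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as parent_dir:
    # build parent_dir/project/data/example_gdp_test.csv
    project_dir = os.path.join(parent_dir, "project")
    os.makedirs(os.path.join(project_dir, "data"))
    with open(os.path.join(project_dir, "data", "example_gdp_test.csv"), "w"):
        pass

    try:
        # inside the project folder the pattern matches our file
        os.chdir(project_dir)
        found_inside = glob.glob(pattern)

        # one level up there is no data folder, so nothing matches
        os.chdir(parent_dir)
        found_outside = glob.glob(pattern)
    finally:
        # always return to where we started
        os.chdir(start_dir)

print(len(found_inside))
print(found_outside)
```

glob does not raise an error when nothing matches. It returns an empty list, which is why the program needs its own check.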

Update the Repository

We’ve just made another successful change to our repository. Let’s add a commit to the repo.

BASH

$ git add gdp_plots.py
$ git commit -m "Handling case if no files present in directory"

Key Points

Avoid silent failures.
Avoid esoteric output when a program fails.
Add checkpoints in code to check for common failures.

Content from Refactoring

Last updated on 2025-04-28

Overview

Questions

When should I reorganize my code so it is more clear and readable for others?
How can I organize my code so that it is useable in other places?
Why do I almost always want to write my code as though it will be used somewhere else?

Objectives

Understand the value of refactoring code and use of functions.
Practice determining where code can be divided into smaller functions.

This code works nicely for generating plots of multiple data sets, but there is now a lot of code to digest in our script. Picture yourself looking at this code for the first time. Would it be immediately clear to you where arguments are handled and where plots are generated?

It would be nice to break this work into clear chunks of code. This can be accomplished by making the argument checking section and the body of the for loop their own functions. This requires surprisingly few changes to the code, but makes it much more clear. This process is called refactoring.

Challenge

Exercise: make a refactoring plan

Given the guidance above, talk with your neighbors about which parts of the script should be moved into functions. Try to think of ways to make the functions the most reusable on their own.

Show me the solution

A possible solution:

A function that parses the arguments
A function that makes the plots
A function that calls the other functions

This isn’t the only “right” solution, but it is a reasonable way to split things up.

Let’s refactor our script

Create a Branch

Because we’re making a major change, let’s make a new branch to work in.

BASH

$ git checkout -b refactor

Let’s break the code into 4 functions:

parse_arguments() - gets the input from argv[], returns a list of file names
create_plot() - takes one file name as input, creates one plot and writes it to the figs folder
create_plots() - takes a list of files as input, calls create_plot() for each element in the list
main() - calls parse_arguments() and create_plots()

Below is a template that will help you write these functions. The """ syntax marks the start and end of a multi-line string, which is often used as a multi-line comment. If such a string is the first thing in a function, it is known as a docstring.

PYTHON

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """


def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """


def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """


def main():
    """
    main function - does all the work
    """



# call main
main()

Callout

Adding Docstrings

In an effort to create human-readable code, it is common to include a “docstring”, or “Python Documentation String”, at the top of each function that clearly lays out three things: the purpose or objective of the function, a description of the inputs and their data types, and a description of what is returned and its data type. By including these things, debugging later on is much more efficient because the developer knows the starting point (inputs), endpoints (returns), and what the intended change is (purpose).

Let’s move the code into the functions now:

Challenge

Exercise: refactor the code

Now that we have a plan for refactoring and a template to work from, create a new script called refactored_gdp_plot.py. Paste the template from above into the new script. Then copy and paste the code from the gdp_plots.py script into the corresponding functions.

Show me the solution

PYTHON

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """
    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames

def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    if "continent" in data.index:
        data.drop("continent", inplace=True)

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

    for filename in filenames:
        create_plot(filename)


def main():
    """
    main function - does all the work
    """

    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    #generate plots
    create_plots(files_to_plot)

# call main
main()

The behavior of the program hasn’t changed, but it has been made more modular by separating the code into different functions, each with its own purpose.

Why the extra function? Our function main has two primary components to it:

parsing arguments handed to the program
generating the desired plots

But we’ve defined three functions above the main function: parse_arguments, create_plot, and create_plots.

If the functions in our file directly reflected the components in main we would only have the parse_arguments and create_plots functions. If we think about using these functions independently, however, a function which always takes in a list of filenames isn’t very convenient to use on its own. By defining the create_plot function, we have placed most of the plot generation work there, while allowing for a very simple definition of the create_plots function.

The importance of this design decision will be made clear in the next lesson.

Before we commit we need to rename our refactored_gdp_plot.py script to gdp_plots.py since we don’t want to keep two copies of this script around in our repo. We only made this as a separate script to make it easier to copy-paste. Once you’ve tested it, you can either rename refactored_gdp_plot.py to gdp_plots.py or copy the contents of refactored_gdp_plot.py into gdp_plots.py and delete refactored_gdp_plot.py.

Update the Repository

We haven’t changed the behavior of our program, but our code has changed, so let’s update the repository.

BASH

$ git add gdp_plots.py
$ git commit -m "Refactoring code."

Discussion

Branching and Refactoring

To demonstrate that the behavior of our program hasn’t changed, try running it a few different ways using both the main and refactor branches. Remember that the command for checking out a branch is git checkout <branch_name>.

Now that we’re satisfied with our refactor, we can merge this branch into our main branch.

BASH

$ git checkout main
$ git merge refactor

Key Points

Refactoring makes code more modular, easier to read, and easier to understand.
Refactoring requires one to consider future implications and generally enables others to use your code more easily.

Content from Running Scripts and Importing

Last updated on 2025-04-28

Overview

Questions

How can I import some of my work even if it is part of a program?

Objectives

Learn how to import functions from one script into another.
Understand the difference between using a file as a module and running it as a script or program.

In our last lesson we learned how refactoring makes code easier to understand by organizing it into pieces that each have their own purpose. The added modularity also makes it easier to use in other places.

First, let’s start a Jupyter notebook in our current directory and see what happens if we try to import our file as it is.

PYTHON

import gdp_plots

The result is an error. Importing a file runs all of its top-level code, so Python reaches our call to the function main, which then tries to treat the notebook’s own command line arguments as data files to plot.

Callout

Running Versus Importing

Running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.

In order for a Python script to work as expected when imported or when run as a script, we typically put the part of the script that produces output in the following if statement:

PYTHON

if __name__ == '__main__':
    main()  # Or whatever function produces output

When you import a Python file, __name__ is the name of that file (e.g., when importing pandas_plots.py, __name__ is 'pandas_plots'). However, when running a script in bash, __name__ is always set to '__main__' in that script so that you can determine if the file is being imported or run as a script.
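As a small illustration, the sketch below prints __name__ from two places. It uses the standard library’s glob module as the imported file only because we have already used it in this lesson; any imported module would do.

```python
import glob

# an imported module reports its own name
imported_name = glob.__name__
print(imported_name)

# the code we are running right now reports its own __name__
print(__name__)
```

The first line printed is glob. When this code is saved in a file and run as a script, or typed into a notebook cell, the second line printed should be __main__.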

Let’s move the call to main in our script into a section that runs only when the program is called from the command line.

import sys
import glob
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt


def parse_arguments(argv):
    """
    Parse the argument list passed from the command line
    (after the program filename is removed) and return a list
    of filenames.

    Input:
    ------
        argument list (normally sys.argv[1:])

    Returns:
    --------
        filenames: list of strings, list of files to plot
    """

    # make sure additional arguments or flags have
    # been provided by the user
    if argv == []:
        # why the program will not continue
        print("Not enough arguments have been provided")
        # how this can be corrected
        print("Usage: python gdp_plots.py <filenames>")
        print("Options:")
        print("-a : plot all gdp data sets in current directory")

    # check for -a flag in arguments
    if "-a" in argv:
        filenames = glob.glob("data/*gdp*.csv")
    else:
        filenames = argv

    return filenames


def create_plot(filename):
    """
    Creates a plot for the specified
    data file.

    Input:
    ------
        filename: string, path to file to plot

    Returns:
    --------
        none
    """

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    if "continent" in data.index:
        data.drop("continent", inplace=True)

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)


def create_plots(filenames):
    """
    Takes in a list of filenames to plot
    and creates a plot for each file.

    Input:
    ------
        filenames: list of strings, list of files to plot

    Returns:
    --------
        none
    """

    for filename in filenames:
        create_plot(filename)


def main():
    """
    main function - does all the work
    """
    # parse arguments
    files_to_plot = parse_arguments(sys.argv[1:])

    #generate plots
    create_plots(files_to_plot)


if __name__ == "__main__":
    # call main
    main()

Now let’s go back to the Jupyter notebook and try importing the file again.

PYTHON

import gdp_plots

Success! You’ve just written your first Python module. Any of the functions in that module can now be accessed in our Jupyter notebook session.

PYTHON

gdp_plots.create_plot("data/gapminder_gdp_oceania.csv")

Update the repository

Back in the terminal, let’s commit these changes to our repository.

BASH

$ git add gdp_plots.py
$ git commit -m "Moving call to the main function."

Callout

Writing Modular Code

In our previous lesson we refactored our code for clarity and modularity. As part of that process we created two functions, create_plot and create_plots. Our program only ever calls create_plots from main, so create_plot may look unnecessary. But imagine importing our module without it and finding that the only way to generate a plot for a single file is to add that filename to a list before passing that list to create_plots.

This might seem very strange or confusing to someone importing our module for the first time. It takes time to develop an intuition for design decisions like these, but here are a few questions to ask yourself as a guide when organizing code:

Are my functions able to stand on their own? Do they accomplish simple tasks?
Is it easy to write clear function names in my module? A function with the name “create_plots_and_export_data” is likely better off being broken up into two functions, “create_plots” and “export_data”.
Are my function names plural? In our example, it may have felt more natural to create only one function, “create_plots”. In these cases, it is almost always useful to create a function which does our operation once (“create_plot”) and another plural form of the function (“create_plots”) which contains a loop over the singular function.
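The singular and plural pattern from the last question can be sketched without any plotting at all. The two functions below are hypothetical helpers, not part of gdp_plots.py: one builds the figure name for a single data file, and the other loops over a list by calling the first.

```python
import os


def make_save_name(filename):
    """Return the figure path for one data file."""
    # "data/gapminder_gdp_oceania.csv" -> "gapminder_gdp_oceania"
    base_name = os.path.splitext(os.path.basename(filename))[0]
    return "figs/" + base_name + ".png"


def make_save_names(filenames):
    """Return the figure paths for a list of data files."""
    save_names = []
    for filename in filenames:
        # the plural function only loops; the singular one does the work
        save_names.append(make_save_name(filename))
    return save_names


print(make_save_name("data/gapminder_gdp_oceania.csv"))
```

Someone importing these functions can handle one file directly, and the plural version stays short enough to read at a glance.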

Key Points

The __name__ variable allows us to know whether the file is being imported or run as a script.

Content from Programming Style

Last updated on 2025-02-10

Overview

Questions

How can I make my programs more readable?
How do most programmers format their code?
How can programs check their own operation?

Objectives

Provide sound justifications for basic rules of coding style.
Refactor one-page programs to make them more readable and justify the changes.
Use Python community coding standards (PEP-8).

Follow standard Python style in your code.

PEP8: a style guide for Python that discusses topics such as how you should name variables, how you should use indentation in your code, how you should structure your import statements, etc. Adhering to PEP8 makes it easier for other Python developers to read and understand your code, and to understand what their contributions should look like. The PEP8 application and Python library can check your code for compliance with PEP8.

Best Practice: Write programs for people and not for computers!

Use assertions to check for internal errors.

Assertions are a simple, but powerful method for making sure that the context in which your code is executing is as you expect.

PYTHON

def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    assert volume > 0
    return mass / volume

If the assertion is False, the Python interpreter raises an AssertionError runtime exception. The source code for the expression that failed will be displayed as part of the error message. To ignore assertions in your code run the interpreter with the ‘-O’ (optimize) switch. Assertions should contain only simple checks and never change the state of the program. For example, an assertion should never contain an assignment.
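To see a failing assertion without stopping the program, we can catch the AssertionError. In this sketch the text after the comma in the assert statement is an optional message added for illustration; it is not part of the function shown above.

```python
def calc_bulk_density(mass, volume):
    '''Return dry bulk density = powder mass / powder volume.'''
    # the message after the comma is shown if the check fails
    assert volume > 0, "volume must be positive"
    return mass / volume


# a valid call passes the check and returns a value
print(calc_bulk_density(10.0, 4.0))

# an invalid call fails the check before the division happens
message = None
try:
    calc_bulk_density(10.0, 0.0)
except AssertionError as error:
    message = str(error)
    print("AssertionError:", message)
```

Without the assertion, the second call would fail with a ZeroDivisionError instead, which says less about what the function expected.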

Best Practice: Plan for mistakes

Use docstrings to provide online help.

Best Practice: Document design & purpose, not just mechanics

If the first thing in a function is a character string that is not assigned to a variable, Python attaches it to the function as the online help.
This string is called a docstring (short for “documentation string”).

PYTHON

def average(values):
    "Return average of values, or None if no values are supplied."

    if len(values) == 0:
        return None
    return sum(values) / len(values)

help(average)

OUTPUT

Help on function average in module __main__:

average(values)
    Return average of values, or None if no values are supplied.

Callout

Multiline Strings

Multiline strings are often used for documentation. These start with three quote characters (either single or double) and end with three matching characters.

PYTHON

"""This string spans
multiple lines.

Blank lines are allowed."""

Discussion

What Will Be Shown?

Highlight the lines in the code below that will be available as online help. Are there lines that should be made available, but won’t be? Will any lines produce a syntax error or a runtime error?

PYTHON

"Find maximum edit distance between multiple sequences."
# This finds the maximum distance between all sequences.

def overall_max(sequences):
    '''Determine overall maximum edit distance.'''

    highest = 0
    for left in sequences:
        for right in sequences:
            '''Avoid checking sequence against itself.'''
            if left != right:
                this = edit_distance(left, right)
                highest = max(highest, this)

    # Report.
    return highest

Discussion

Document This

Turn the comment on the following function into a docstring and check that help displays it properly.

PYTHON

def middle(a, b, c):
    # Return the middle value of three.
    # Assumes the values can actually be compared.
    values = [a, b, c]
    values.sort()
    return values[1]

Challenge

Clean Up This Code

Read this short program and try to predict what it does.
Run it: how accurate was your prediction?
Refactor the program to make it more readable. Remember to run it after each change to ensure its behavior hasn’t changed.
Compare your rewrite with your neighbor’s. What did you do the same? What did you do differently, and why?

PYTHON

import sys
n = int(sys.argv[1])
s = sys.argv[2]
print(s)
i = 0
while i < n:
    # print('at', j)
    new = ''
    for j in range(len(s)):
        left = j-1
        right = (j+1)%len(s)
        if s[left]==s[right]: new += '-'
        else: new += '*'
    s=''.join(new)
    print(s)
    i += 1

Show me the solution

Here’s one solution.

PYTHON

def string_machine(input_string, iterations):
    """
    Takes input_string and generates a new string with -'s and *'s
    corresponding to characters that have identical adjacent characters
    or not, respectively.  Iterates through this procedure with the resultant
    strings for the supplied number of iterations.
    """
    print(input_string)
    old = input_string
    for i in range(iterations):
        new = ''
        # iterate through characters in previous string
        for j in range(len(old)):
            left = j-1
            right = (j+1)%len(old) # ensure right index wraps around
            if old[left]==old[right]:
                new += '-'
            else:
                new += '*'
        print(new)
        # store new string as old
        old = new

string_machine('et cetera', 10)

Key Points

Follow standard Python style in your code.
Use docstrings to provide online help.

Content from Wrap-Up

Last updated on 2025-02-10

Overview

Questions

What have we learned?
What else is out there and where do I find it?

Objectives

Name and locate scientific Python community sites for software, workshops, and help.

Leslie Lamport once said, “Writing is nature’s way of showing you how sloppy your thinking is.” The same is true of programming: many things that seem obvious when we’re thinking about them turn out to be anything but when we have to explain them precisely.

Python supports a large community within and outwith research.

The Python 3 documentation covers the core language and the standard library.
PyCon is the largest annual conference for the Python community.
SciPy is a rich collection of scientific utilities. It is also the name of a series of annual conferences.
Jupyter is the home of the Jupyter Notebook.
Pandas is the home of the Pandas data library.
Stack Overflow’s general Python section can be helpful, as can the sections on NumPy, SciPy, Pandas, and other topics.

Key Points

Python supports a large community within and outwith research.