Trying Different Methods

Last updated on 2025-04-28

Overview

Questions

How do I plot multiple data sets using different methods?

Objectives

Read data from standard input in a program so that it can be used in a pipeline.
Compare using different methods to accomplish the same task.
Practice making branches and merging in a Git repository.

Handling Multiple Files

Perhaps we would like a script that can operate on multiple data files at once. This can actually be done in two different ways: (1) writing a bash script that calls our already-existing Python script in a for-loop, or (2) modifying our Python script to read multiple files in a for-loop. We will try both methods and compare the two.

Saving plot before branching

Our code currently pops up the figure instead of saving it. Most of the time we will want to save the figure as a separate file rather than only view it, and saving also helps with the flow of our program’s for-loops. With the current configuration, a for-loop would pause each time a figure pops up and wait for us to close the viewer. Here is the script changed to save the figure to a file:

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)

# save the plot to a file instead of displaying it
plt.savefig('figs/gdp-plot.png')

This version of the code writes to the same figure file every time the script is run, so each new plot overwrites the previous one. That won’t work for generating multiple figures, so we need to edit the script to write a unique name for each input file.

A simple unique name can be generated from the original file name. We can use the string method split() to split filename, which is a string, at a chosen separator character. This returns a list of the pieces:

PYTHON

name = 'my-data.csv'
split_name = name.split('.')
print(split_name)
print(split_name[0])

OUTPUT

['my-data', 'csv']
my-data

We’ll split the original file name and use the first part to name our plot, then concatenate .png to specify the file type.

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)

# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)
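The two split() calls above assume that the path contains exactly one directory and that the only dot is the one before the extension. As a hedged alternative, the standard library's os.path functions can derive the same name without those assumptions. This is only a minimal sketch: the helper name make_save_name and its out_dir parameter are hypothetical and are not part of the lesson's script.

```python
import os.path

def make_save_name(filename, out_dir='figs'):
    """Build an output path like figs/<input name>.png (hypothetical helper)."""
    # drop any leading directories:
    # data/gapminder_gdp_oceania.csv -> gapminder_gdp_oceania.csv
    base = os.path.basename(filename)
    # drop the final extension:
    # gapminder_gdp_oceania.csv -> gapminder_gdp_oceania
    stem = os.path.splitext(base)[0]
    return out_dir + '/' + stem + '.png'

print(make_save_name('data/gapminder_gdp_oceania.csv'))
```

For the lesson's data files this should give the same result as the split() version, so either approach works here.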

Let’s commit the version of the code that saves a unique name for each plot.

BASH

$ git add gdp_plots.py
$ git commit -m "Saving plots to a unique name"

Create New Branches

First we will create two new branches where we can develop each of these two methods. We will call these branches py-loop and sh-loop.

BASH

$ git branch py-loop
$ git branch sh-loop

We can check that these two branches were created by running git branch -a.

Handling Multiple Files with Python

First, we’ll try using just Python to loop through multiple files. Let’s switch to our Python branch.

BASH

$ git checkout py-loop

We want our program to process each file separately, and the easiest way to do this is a loop that executes our plotting statements once for each of the filenames provided in sys.argv.

But we need to be careful: sys.argv[0] will always be the name of our program, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.

A solution is to loop over the contents of sys.argv[1:]. The ‘1’ tells Python to start the slice at location 1, so the program’s name isn’t included. Since we’ve left off the upper bound, the slice runs to the end of the list, and includes all the filenames.
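The slice can be tried on its own before it goes into the plotting script. The sketch below uses a hand-written list in place of the real sys.argv; the function name input_files and the variable example_argv are made up for illustration.

```python
def input_files(argv):
    """Return the filename arguments, skipping the program name at argv[0]."""
    return argv[1:]

# what sys.argv would contain for:
#   python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
example_argv = ['gdp_plots.py',
                'data/gapminder_gdp_oceania.csv',
                'data/gapminder_gdp_africa.csv']

for filename in input_files(example_argv):
    print(filename)

# with no filenames the slice is empty, so a loop over it never runs
print(input_files(['gdp_plots.py']))
```

Because an empty slice is simply an empty list, running the program with no filenames does nothing rather than raising an error.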

Here is our updated program.

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

for filename in sys.argv[1:]:

    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T

    # create a plot of the transposed data
    ax = data.plot(title = filename)

    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)

    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)

Now when the program is given multiple filenames

BASH

$ python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv

one plot for each filename is generated.

Update the Repository

BASH

$ git status # always check the branch before you commit
$ git add gdp_plots.py
$ git commit -m "Allowing plot generation for multiple files at once"

Yet another successful update to the code. If we run git status, we see that the generated image files are untracked. Let’s ignore those files because they will likely change as our data changes.

BASH

$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"

Handling Multiple Files with Bash

Now that we’ve created a Python script that plots multiple files at once, let’s try to do the same thing in Bash. We’ll leave our current branch alone and switch to our sh-loop branch.

BASH

$ git checkout sh-loop

If we look at our gdp_plots.py file, it does not contain the for-loop, because this branch does not include the changes we made on the py-loop branch. Our script should look like this:

PYTHON

import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt

filename = sys.argv[1]

# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T

# create a plot of the transposed data
ax = data.plot(title = filename)

# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)

# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)

Let’s use bash to generate multiple plots by calling our Python script in a bash for-loop. First, let’s create a bash file for us to edit.

BASH

$ touch gdp_plots.sh

In this script, we’ll write a for-loop that calls our gdp_plots.py script on multiple files. We can break up our long list of files by using a backslash \ and writing the rest on the next line.

BASH

for filename in data/gapminder_gdp_oceania.csv \
                data/gapminder_gdp_africa.csv
do
   python gdp_plots.py "$filename"
done

We can run our script to see if it works:

BASH

$ bash gdp_plots.sh

This script now generates a plot for each file. We can confirm this by listing the figures and checking the time each one was created.

BASH

$ ls -lh figs/

Updating the repository

Yet another successful update to the code. Let’s commit our changes. Since adding the bash script and ignoring the generated images are unrelated changes, let’s make two separate commits.

BASH

$ git add gdp_plots.sh
$ git commit -m "Wrote bash script to call python plotter."
$ echo "*.png" >> .gitignore
$ git add .gitignore
$ git commit -m "ignoring generated images"

Comparing Methods

We have successfully developed two different methods for accomplishing the same task. This is common to do in software development when there is not a clear path forward. Let’s compare our two methods and decide which to merge into our main branch.

One comparison we might be interested in is how fast each method is. We can use bash’s time command to measure how long a script takes to run: we add time before the command, and timing information is printed when the command finishes.

While we are on our bash branch, we’ll time that script first.

BASH

$ time bash gdp_plots.sh

real    0m6.031s
user    0m5.535s
sys     0m0.388s

We are most interested in the “real” time in the output, which is the elapsed time we experience.
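The difference between elapsed time and CPU time can also be seen from inside Python. The sketch below is an illustration under assumptions, not part of the lesson's scripts: the helper name measure is hypothetical, time.perf_counter() plays the role of "real", and time.process_time() reports the user plus sys CPU time of the current process.

```python
import time

def measure(func):
    """Run func() and return (wall_seconds, cpu_seconds). Hypothetical helper."""
    wall_start = time.perf_counter()   # wall-clock time, comparable to "real"
    cpu_start = time.process_time()    # CPU time of this process (user + sys)
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

# sleeping takes wall-clock time but uses almost no CPU time
wall, cpu = measure(lambda: time.sleep(0.2))
print('real: %.3fs  cpu: %.3fs' % (wall, cpu))
```

Time spent waiting, such as sleeping or waiting on a disk, adds to the elapsed time but not to the CPU time, which is why "real" can be larger than "user" and "sys" added together.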

Let’s check out our Python branch and time our script there.

BASH

$ git checkout py-loop
$ time python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv

real    0m3.163s
user    0m3.002s
sys     0m0.132s

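As we can see, our Python method ran faster than the bash method. This is expected: the bash loop starts a new Python interpreter, and imports pandas and matplotlib again, for every file, while the Python loop does this only once. For this reason, we will merge our Python branch into main.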
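To get a rough feel for how much of the bash method's time goes to simply launching Python once per file, the sketch below times a number of interpreter starts using the standard library. It is only an estimate under assumptions: the function name time_interpreter_starts is hypothetical, and because each interpreter here runs nothing, the measurement leaves out the cost of importing pandas and matplotlib.

```python
import subprocess
import sys
import time

def time_interpreter_starts(n_runs):
    """Return the seconds taken to start and exit Python n_runs times."""
    start = time.perf_counter()
    for _ in range(n_runs):
        # each call launches a fresh interpreter that does nothing and exits
        subprocess.run([sys.executable, '-c', 'pass'], check=True)
    return time.perf_counter() - start

elapsed = time_interpreter_starts(2)
print('2 interpreter starts took %.3f s' % elapsed)
```

The number printed will vary from machine to machine, so it is best read as an order-of-magnitude guide rather than a benchmark.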

BASH

$ git checkout main
$ git merge py-loop

Another advantage of the Python method over the bash method is that we do not need to change a file and commit it each time we want to create plots from different data: we simply pass different filenames on the command line.

More Practice with Multiple Files in Python

Finding Particular Files

Using the glob module, write a simple version of ls that shows files in the current directory with a particular suffix. A call to this script should look like this:

BASH

$ python my_ls.py py

OUTPUT

left.py
right.py
zero.py

Show me the solution

PYTHON

import sys
import glob

def main():
    '''print the names of all files whose suffix is sys.argv[1]'''
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return

main()

Counting Lines

Write a program called line_count.py that works like the Unix wc command:

If no filenames are given, it reports the number of lines in standard input.
If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.

Show me the solution

PYTHON

import sys

def main():
    '''print each input filename and the number of lines in it,
       and print the sum of the number of lines'''
    filenames = sys.argv[1:]
    sum_nlines = 0 #initialize counting variable

    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for f in filenames:
            n = count_file(f)
            print('%s %d' % (f, n))
            sum_nlines += n
        print('total: %d' % sum_nlines)

def count_file(filename):
    '''count the number of lines in a file'''
    # the with-statement closes the file even if reading fails
    with open(filename, 'r') as f:
        nlines = len(f.readlines())
    return nlines

def count_file_like(file_like):
    '''count the number of lines in a file-like object (eg stdin)'''
    n = 0
    for line in file_like:
        n = n+1
    return n

main()

Key Points

Make different branches in a Git repository to try different methods.
Use bash’s time command to time scripts.