Trying Different Methods
Last updated on 2025-04-28
Overview
Questions
- How do I plot multiple data sets using different methods?
Objectives
- Read data from standard input in a program so that it can be used in a pipeline.
- Compare using different methods to accomplish the same task.
- Practice making branches and merging in a Git repository.
Handling Multiple Files
Perhaps we would like a script that can operate on multiple data files at once. This can actually be done in two different ways: (1) writing a bash script that calls our already-existing Python script in a for-loop, or (2) modifying our Python script to read multiple files in a for-loop. We will try both methods and compare the two.
Saving plot before branching
Our code currently pops up the figure instead of saving it. Most of the time we will want to save the figure as a separate file rather than only view it, and saving will also help with the flow of our program’s for-loops. With the current configuration, a for-loop pauses each time a figure pops up and waits for us to close the viewer. The version below saves the figure to a file instead.
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# save the plot to a file instead of displaying it
plt.savefig('fig/gdp-plot.png')
This version of the code writes to the same figure file every time the script is run, overwriting the previous plot. This won’t work for generating multiple figures, so we need to edit the script to write a unique name for each input file.
A simple unique name can be generated based on the original file name. We can use Python’s split() string method to split filename, which is a string, at any character we choose. This returns a list of the resulting pieces.
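As a small illustration, assuming the input file is data/gapminder_gdp_oceania.csv (one of the files used later in this lesson), splitting first at the dot and then at the slash gives the following lists. The variable names dot_parts and slash_parts are only for this sketch.

```python
# Assumed example path; any "directory/name.extension" string behaves the same way.
filename = "data/gapminder_gdp_oceania.csv"

# Split at the dot: the extension is separated from the rest of the path.
dot_parts = filename.split(".")
print(dot_parts)    # ['data/gapminder_gdp_oceania', 'csv']

# Split the first piece at the slash: the directory is separated from the base name.
slash_parts = dot_parts[0].split("/")
print(slash_parts)  # ['data', 'gapminder_gdp_oceania']
```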
We’ll split the original file name and use the first part to name our plot. Then we will concatenate .png to the name to specify our file type.
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)))
ax.set_xticklabels(data.index, rotation = 45)
# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)
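The split-based naming above assumes the path has exactly one directory level and a single dot. As a possible alternative (not the lesson’s method), the sketch below uses the standard library’s pathlib to get the base name for any path. The helper name make_save_name and its out_dir parameter are hypothetical.

```python
from pathlib import Path


def make_save_name(filename, out_dir="figs"):
    """Hypothetical helper: build '<out_dir>/<base name>.png' from an input path."""
    # .stem is the file name without its directory and without its final extension
    stem = Path(filename).stem
    # .as_posix() returns the path as a string with forward slashes
    return (Path(out_dir) / (stem + ".png")).as_posix()


print(make_save_name("data/gapminder_gdp_oceania.csv"))  # figs/gapminder_gdp_oceania.png
```

With this helper, the last lines of the script could become plt.savefig(make_save_name(filename)), and files in nested directories would also work.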
Let’s commit the version of the code that saves a unique name for each plot.
Handling Multiple Files with Python
First, we’ll try using just Python to loop through multiple files. Let’s switch to our Python branch.
We want our program to process each file separately, and the easiest way to do this is a loop that executes our plotting statements once for each of the filenames provided in sys.argv.
But we need to be careful: sys.argv[0] will always be the name of our program, rather than the name of a file. We also need to handle an unknown number of filenames, since our program could be run for any number of files.
A solution is to loop over the contents of sys.argv[1:]. The ‘1’ tells Python to start the slice at location 1, so the program’s name isn’t included. Since we’ve left off the upper bound, the slice runs to the end of the list, and includes all the filenames.
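To see the slice in isolation, the sketch below uses an ordinary list named argv as a stand-in for sys.argv. The command line shown in the comment is an assumed example.

```python
# Stand-in for sys.argv after running (assumed command line):
#   python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
argv = ["gdp_plots.py",
        "data/gapminder_gdp_oceania.csv",
        "data/gapminder_gdp_africa.csv"]

program_name = argv[0]   # the script itself
filenames = argv[1:]     # everything after the script name

print(program_name)      # gdp_plots.py
print(filenames)         # ['data/gapminder_gdp_oceania.csv', 'data/gapminder_gdp_africa.csv']

# With no filenames on the command line the slice is simply empty,
# so a for-loop over it runs zero times instead of raising an error.
no_files = ["gdp_plots.py"][1:]
print(no_files)          # []
```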
Here is our updated program.
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
for filename in sys.argv[1:]:
    # load data and transpose so that country names are
    # the columns and their gdp data becomes the rows
    data = pandas.read_csv(filename, index_col = 'country').T
    # create a plot of the transposed data
    ax = data.plot(title = filename)
    # set some plot attributes
    ax.set_xlabel("Year")
    ax.set_ylabel("GDP Per Capita")
    # set the x locations and labels
    ax.set_xticks(range(len(data.index)))
    ax.set_xticklabels(data.index, rotation = 45)
    # save the plot with a unique file name
    split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
    split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
    save_name = 'figs/'+split_name2 + '.png'
    plt.savefig(save_name)
Now when the program is given multiple filenames, one plot is generated for each filename.
Handling Multiple Files with Bash
Now that we’ve created a Python script that saves multiple plots at once, let’s try to do the same thing in Bash. We’ll leave our current branch alone and switch to our sh-loop branch.
If we look at our gdp_plots.py file, it does not contain a for-loop because this branch is not up to date with the other branch. Our script should look like this:
PYTHON
import sys
import pandas
# we need to import part of matplotlib
# because we are no longer in a notebook
import matplotlib.pyplot as plt
filename = sys.argv[1]
# load data and transpose so that country names are
# the columns and their gdp data becomes the rows
data = pandas.read_csv(filename, index_col = 'country').T
# create a plot of the transposed data
ax = data.plot(title = filename)
# set some plot attributes
ax.set_xlabel("Year")
ax.set_ylabel("GDP Per Capita")
# set the x locations and labels
ax.set_xticks(range(len(data.index)) )
ax.set_xticklabels(data.index, rotation = 45)
# save the plot with a unique file name
split_name1 = filename.split('.')[0] #data/gapminder_gdp_XXX
split_name2 = split_name1.split('/')[1] #gapminder_gdp_XXX
save_name = 'figs/'+split_name2 + '.png'
plt.savefig(save_name)
Let’s use bash to generate multiple plots by calling our Python script in a bash for-loop. First, let’s create a bash file for us to edit.
In this script, we’ll write a for-loop that calls our gdp_plots.py script on multiple files. We can break up our long list of files by using a backslash \ and writing the rest on the next line.
BASH
for filename in data/gapminder_gdp_oceania.csv \
                data/gapminder_gdp_africa.csv
do
    python gdp_plots.py "$filename"
done
We can run our script to see if it works:
This script now generates a plot for each file. We can verify this by looking at the time each plot was created.
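One possible way to check those times from Python is sketched below. It assumes the plots are PNG files in a figs directory, as in the script above. It reports each file’s last-modified time, which is when the plot was most recently written. The helper name list_plot_times is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def list_plot_times(directory="figs"):
    """Hypothetical helper: return (file name, last-modified time) pairs for PNG files."""
    results = []
    for path in sorted(Path(directory).glob("*.png")):
        # st_mtime is the last-modified time in seconds since the epoch
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        results.append((path.name, modified))
    return results


for name, modified in list_plot_times():
    print(name, modified.strftime("%Y-%m-%d %H:%M:%S"))
```

If every plot shows a time from the run we just made, the loop handled every file.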
Comparing Methods
We have successfully developed two different methods for accomplishing the same task. This is common in software development when there is not a clear path forward. Let’s compare our two methods and decide which to merge into our main branch.
One comparison we might be interested in is how fast each method is. We can use bash’s time command to measure how long a script takes to run.
Let’s time each script. To do this, we add time before the command we want to run, and bash prints timing information when the command completes.
While we are on our bash branch, we’ll time that script first.
real 0m6.031s
user 0m5.535s
sys 0m0.388s
We are most interested in the “real” time in the output, which is the elapsed time we experience.
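The difference between elapsed time and CPU time can also be seen from inside Python. In the rough sketch below, time.perf_counter() plays the role of “real” time, while time.process_time() reports the combined user and system CPU time of the current process. The helper name timed is hypothetical, and sleeping stands in for any work that waits rather than computes.

```python
import time


def timed(func):
    """Hypothetical helper: run func() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed ("real") time
    cpu_start = time.process_time()    # user + system CPU time of this process
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# Sleeping takes elapsed time but almost no CPU time.
wall, cpu = timed(lambda: time.sleep(0.2))
print(f"wall-clock: {wall:.3f}s  cpu: {cpu:.3f}s")
```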
Let’s check out our Python branch and time our script there.
BASH
$ git checkout py-loop
$ time python gdp_plots.py data/gapminder_gdp_oceania.csv data/gapminder_gdp_africa.csv
real 0m3.163s
user 0m3.002s
sys 0m0.132s
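As we can see, our Python method ran faster than the bash method: about 3.2 seconds of real time versus about 6.0 seconds. A likely reason is that the bash loop starts a new Python interpreter, and imports pandas and matplotlib again, for every file. Because it is faster, we will merge our Python branch into main.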
Another advantage of the Python method over the bash method is that we do not need to edit a file and commit it each time we want to create plots from different data.
More Practice with Multiple Files in Python
PYTHON
import sys
import glob
def main():
    '''prints names of all files with sys.argv as suffix'''
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix # construct the input
    glob_output = sorted(glob.glob(glob_input)) # call the glob function
    for item in glob_output: # print the output
        print(item)
    return
main()
Counting Lines
Write a program called line_count.py that works like the Unix wc command:
- If no filenames are given, it reports the number of lines in standard input.
- If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
PYTHON
import sys
def main():
    '''print each input filename and the number of lines in it,
    and print the sum of the number of lines'''
    filenames = sys.argv[1:]
    sum_nlines = 0 # initialize counting variable
    if len(filenames) == 0: # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for f in filenames:
            n = count_file(f)
            print('%s %d' % (f, n))
            sum_nlines += n
        print('total: %d' % sum_nlines)
def count_file(filename):
    '''count the number of lines in a file'''
    # the with-statement closes the file even if reading fails
    with open(filename, 'r') as f:
        nlines = len(f.readlines())
    return nlines
def count_file_like(file_like):
    '''count the number of lines in a file-like object (eg stdin)'''
    n = 0
    for line in file_like:
        n = n+1
    return n
main()
Key Points
- Make different branches in a Git repository to try different methods.
- Use bash’s time command to time scripts.