Content from Introduction


Last updated on 2025-04-16

Estimated time: 25 minutes

Overview

Questions

  • How can I make my results easier to reproduce?

Objectives

  • Explain what SCons is for.
  • Explain why SCons differs from shell scripts.
  • Name other popular build tools.

Let’s imagine that we’re interested in testing Zipf’s Law in some of our favorite books.

Zipf’s Law

The most frequently-occurring word occurs approximately twice as often as the second most frequent word. This is Zipf’s Law.

We’ve compiled our raw data, i.e. the books we want to analyze, and have prepared several Python scripts that together make up our analysis pipeline.

Let’s take a quick look at one of the books using the command head books/isles.txt.

Our directory has the Python scripts and data files we will be working with:

OUTPUT

|- books
|  |- abyss.txt
|  |- isles.txt
|  |- last.txt
|  |- LICENSE_TEXTS.md
|  |- sierra.txt
|- plotcounts.py
|- countwords.py
|- testzipf.py

The first step is to count the frequency of each word in a book. For this purpose we will use a Python script, countwords.py, which takes two command line arguments. The first argument is the input file (books/isles.txt) and the second is the output file (here isles.dat) that is generated by processing the input.

BASH

$ python countwords.py books/isles.txt isles.dat

Let’s take a quick peek at the result.

BASH

$ head -5 isles.dat

This shows us the top 5 lines in the output file:

OUTPUT

the 3822 6.7371760973
of 2460 4.33632998414
and 1723 3.03719372466
to 1479 2.60708619778
a 1308 2.30565838181

We can see that the file consists of one row per word. Each row shows the word itself, the number of occurrences of that word, and the number of occurrences as a percentage of the total number of words in the text file.
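To make the row layout concrete, here is a minimal Python sketch that splits one row into its three fields. The names WordCount and parse_dat_line are hypothetical and are not part of the lesson's scripts; the sample row is copied from the output above.

```python
from typing import NamedTuple


class WordCount(NamedTuple):
    """One row of a .dat file (hypothetical helper type)."""
    word: str
    count: int
    percent: float


def parse_dat_line(line: str) -> WordCount:
    """Split one whitespace-separated row into word, count, and percentage."""
    word, count, percent = line.split()
    return WordCount(word, int(count), float(percent))


# Sample row copied from the head of isles.dat shown above.
row = parse_dat_line("the 3822 6.7371760973")
print(row.word, row.count, round(row.percent, 2))
```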

We can do the same thing for a different book:

BASH

$ python countwords.py books/abyss.txt abyss.dat
$ head -5 abyss.dat

OUTPUT

the 4044 6.35449402891
and 2807 4.41074795726
of 1907 2.99654305468
a 1594 2.50471401634
to 1515 2.38057825267

Let’s visualize the results. The script plotcounts.py reads in a data file and plots the 10 most frequently occurring words as a text-based bar plot:

BASH

$ python plotcounts.py isles.dat ascii

OUTPUT

the   ########################################################################
of    ##############################################
and   ################################
to    ############################
a     #########################
in    ###################
is    #################
that  ############
by    ###########
it    ###########
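The idea behind a text-based bar plot can be sketched in a few lines of Python. This is only an illustration of the scaling step, not the actual implementation of plotcounts.py; the function name ascii_bars and the default width of 72 characters are assumptions.

```python
def ascii_bars(counts, width=72):
    """Return one text bar per word, scaled so the largest count spans `width` characters."""
    largest = max(count for _, count in counts)
    label_width = max(len(word) for word, _ in counts)
    lines = []
    for word, count in counts:
        bar = "#" * round(width * count / largest)
        lines.append(f"{word:<{label_width}}  {bar}")
    return lines


# Counts copied from the head of isles.dat shown earlier.
top_words = [("the", 3822), ("of", 2460), ("and", 1723)]
for line in ascii_bars(top_words):
    print(line)
```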

plotcounts.py can also show the plot graphically:

BASH

$ python plotcounts.py isles.dat show

Close the window to exit the plot.

plotcounts.py can also create the plot as an image file (e.g. a PNG file):

BASH

$ python plotcounts.py isles.dat isles.png

Finally, let’s test Zipf’s law for these books:

BASH

$ python testzipf.py abyss.dat isles.dat

OUTPUT

Book	First	Second	Ratio
abyss	4044	2807	1.44
isles	3822	2460	1.55

So we’re not too far off from Zipf’s law.
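The Ratio column is simply the first count divided by the second. A small Python sketch, using the counts from the table above, reproduces it; the function name first_to_second_ratio is hypothetical and this is not the code of testzipf.py.

```python
def first_to_second_ratio(first, second):
    """Ratio between the counts of the most and second most frequent words."""
    return first / second


# Counts copied from the summary table above.
books = {"abyss": (4044, 2807), "isles": (3822, 2460)}
for name, (first, second) in books.items():
    ratio = first_to_second_ratio(first, second)
    print(f"{name}\t{first}\t{second}\t{ratio:.2f}")
```

Zipf’s Law as stated above predicts a ratio of about 2, so both books fall somewhat short of it.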

Together these scripts implement a common workflow:

  1. Read a data file.
  2. Perform an analysis on this data file.
  3. Write the analysis results to a new file.
  4. Plot a graph of the analysis results.
  5. Save the graph as an image, so we can put it in a paper.
  6. Make a summary table of the analyses.

Running countwords.py and plotcounts.py at the shell prompt, as we have been doing, is fine for one or two files. If, however, we had 5 or 10 or 20 text files, or if the number of steps in the pipeline were to expand, this could turn into a lot of work. Plus, no one wants to sit and wait for a command to finish, even just for 30 seconds.

The most common solution to the tedium of data processing is to write a shell script that runs the whole pipeline from start to finish.

So to reproduce the tasks that we have just done we create a new file named run_pipeline.sh in which we place the commands one by one. Using a text editor of your choice (e.g. for nano use the command nano run_pipeline.sh) copy and paste the following text and save it.

BASH

# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table for the Zipf's law tests

python countwords.py books/isles.txt isles.dat
python countwords.py books/abyss.txt abyss.dat

python plotcounts.py isles.dat isles.png
python plotcounts.py abyss.dat abyss.png

# Generate summary table
python testzipf.py abyss.dat isles.dat > results.txt

Run the script and check that the output is the same as before:

BASH

$ bash run_pipeline.sh
$ cat results.txt

This shell script solves several problems in computational reproducibility:

  1. It explicitly documents our pipeline, making communication with colleagues (and our future selves) more efficient.
  2. It allows us to type a single command, bash run_pipeline.sh, to reproduce the full analysis.
  3. It prevents us from repeating typos or mistakes. You might not get it right the first time, but once you fix something it’ll stay fixed.

Despite these benefits it has a few shortcomings.

Let’s adjust the width of the bars in our plot produced by plotcounts.py.

Edit plotcounts.py so that the bars are 0.8 units wide instead of 1 unit. (Hint: replace width = 1.0 with width = 0.8 in the definition of plot_word_counts.)

Now we want to recreate our figures. We could just bash run_pipeline.sh again. That would work, but it could also be a big pain if counting words takes more than a few seconds. The word counting routine hasn’t changed; we shouldn’t need to recreate those files.

Alternatively, we could manually rerun the plotting for each word-count file. (Experienced shell scripters can make this easier on themselves using a for-loop.)

BASH

for book in abyss isles; do
    python plotcounts.py $book.dat $book.png
done

With this approach, however, we don’t get many of the benefits of having a shell script in the first place.

Another popular option is to comment out a subset of the lines in run_pipeline.sh:

BASH

# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table for the Zipf's law tests.

# These lines are commented out because they don't need to be rerun.
#python countwords.py books/isles.txt isles.dat
#python countwords.py books/abyss.txt abyss.dat

python plotcounts.py isles.dat isles.png
python plotcounts.py abyss.dat abyss.png

# Generate summary table
# This line is also commented out because it doesn't need to be rerun.
#python testzipf.py abyss.dat isles.dat > results.txt

Then, we would run our modified shell script using bash run_pipeline.sh.

But commenting out these lines, and subsequently uncommenting them, can be a hassle and source of errors in complicated pipelines.

What we really want is an executable description of our pipeline that allows software to do the tricky part for us: figuring out what steps need to be rerun.

For our pipeline SCons can execute the commands needed to run our analysis and plot our results. Like shell scripts it allows us to execute complex sequences of commands via a single shell command. Unlike shell scripts it explicitly records the dependencies between files - what files are needed to create what other files - and so can determine when to recreate our data files or image files, if our text files change. SCons can be used for any commands that follow the general pattern of processing files to create new files, for example:

  • Run analysis scripts on raw data files to get data files that summarize the raw data (e.g. creating files with word counts from book text).
  • Run visualization scripts on data files to produce plots (e.g. creating images of word counts).
  • Parse and combine text files and plots to create papers.
  • Compile source code into executable programs or libraries.
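To illustrate what "recording the dependencies between files" buys us, the following Python sketch stores the pipeline's dependencies in a dictionary and asks which targets are affected when one file changes. It is a toy model under stated assumptions: the DEPENDENCIES table and the affected_by function are hypothetical and say nothing about how SCons is implemented.

```python
# Hypothetical record of "what is needed to create what" for this lesson's pipeline.
DEPENDENCIES = {
    "isles.dat": ["books/isles.txt"],
    "abyss.dat": ["books/abyss.txt"],
    "isles.png": ["isles.dat"],
    "abyss.png": ["abyss.dat"],
    "results.txt": ["abyss.dat", "isles.dat"],
}


def affected_by(changed, dependencies=DEPENDENCIES):
    """Return every target that depends, directly or indirectly, on `changed`."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for target, sources in dependencies.items():
            if current in sources and target not in affected:
                affected.add(target)
                frontier.append(target)
    return sorted(affected)


print(affected_by("books/isles.txt"))
```

Only the files downstream of books/isles.txt are listed; everything that depends solely on abyss.txt is left alone, which is the behavior a shell script cannot give us.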

There are now many build tools available, for example GNU Make, Apache Ant, doit, and nmake for Windows. Which is best for you depends on your requirements, intended usage, and operating system. However, they all share the same fundamental concepts.

Also, you might come across build generation scripts e.g. GNU Autoconf and CMake. Those tools do not run the pipelines directly, but rather generate files for use with the build tools.

As a Python-based build tool, SCons is available on Windows, macOS, and Linux. It can be installed with the pip and conda package managers, so it fits into the same Python scientific computing environments that are popular in computational science and engineering communities. SCons also uses Python as its configuration language, so the configuration files will feel familiar to many students.

Key Points

  • SCons allows us to specify what depends on what and how to update things that are out of date.

Content from SConscript files


Last updated on 2025-04-16

Estimated time: 40 minutes

Overview

Questions

  • How do I write a simple SConstruct file?

Objectives

  • Recognize the key parts of the SConstruct file: tasks, targets, sources, and actions.
  • Write a simple SConstruct file.
  • Run SCons from the shell.
  • Explain how to create aliases for collections of targets.
  • Explain constraints on dependencies.

Create a file, called SConstruct, with the following content:

PYTHON

import os

env = Environment(ENV=os.environ.copy())

# Count words.
env.Command(
    target=["isles.dat"],
    source=["books/isles.txt"],
    action=["python countwords.py books/isles.txt isles.dat"],
)

This is a build file, which SCons calls an SConscript file: a file that SCons executes. SConstruct is the conventional name for the root configuration file. Secondary configuration files are named SConscript by convention, but can take any filename. From now on, we refer to all SCons configuration files collectively as SConscript files, but remember that a project usually starts from a file named SConstruct.

The syntax should be familiar to Python users because SCons uses Python as the configuration language. Note how the action resembles a line from our shell script.

Let us go through each section in turn:

  • First we import the os module and create an SCons construction environment as a copy of the active shell environment. Most build managers inherit the active shell environment by default. SCons requires a little more effort, but this separation of construction environment from the external environment is valuable in complex computational science and engineering workflows which may require several mutually exclusive environments in a single workflow. For the purposes of this lesson, we will use a single construction environment inherited from the shell’s active Conda environment.
  • # denotes a comment. Any text from # to the end of the line is ignored by SCons but could be very helpful for anyone reading your SConstruct file.
  • env.Command is the generic builder method that SCons provides for defining a task. Note that the task is defined inside the construction environment we created earlier. If there were more than one construction environment available, additional tasks could use unique, task-specific construction environments.
  • isles.dat is a target, a file to be created, or built.
  • books/isles.txt is a source, also called a dependency, a file that is needed to build or update the target. Targets can have one or more dependencies.
  • python countwords.py books/isles.txt isles.dat is an action, a command to run to build or update the target using the sources. Targets can have one or more actions. These actions form a recipe to build the target from its sources and are executed similarly to a shell script.
  • Targets, sources, and actions are passed as keyword arguments and may be a string or a list of strings.
  • Together, the target, sources, and actions form a task.

Our task above describes how to build the target isles.dat using the action python countwords.py and the source books/isles.txt.

Information that was implicit in our shell script - that we are generating a file called isles.dat and that creating this file requires books/isles.txt - is now made explicit by SCons’ syntax.

Let’s first ensure we start from scratch and delete the .dat and .png files we created earlier:

BASH

$ rm *.dat *.png

By default, SCons looks for a root SConscript file, called SConstruct, and we can run SCons as follows:

BASH

$ scons

By default, SCons prints several status messages and the actions it executes:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
scons: done building targets.

The status messages can be silenced with the -Q option.

Let’s see if we got what we expected.

BASH

$ head -5 isles.dat

The first 5 lines of isles.dat should look exactly like before.

The SConstruct File Does Not Have to be Called SConstruct

We don’t have to call our root SCons configuration file SConstruct. However, if we call it something else we need to tell SCons where to find it. We can do this using the --sconstruct option. For example, if our SConstruct file is named MyOtherSConstruct:

BASH

$ scons --sconstruct=MyOtherSConstruct

SCons does not require a specific file extension. The suffix .scons can be used to identify SConscript files that are not called SConstruct or SConscript e.g. install.scons, common.scons etc.

When we run SCons again, it now informs us that:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `.' is up to date.
scons: done building targets.

SCons uses the special target alias ‘.’ to indicate ‘all targets’. No command is run because our target, isles.dat, has now been created, and SCons will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell touch command to update its timestamp (which would happen if we did edit the file):

BASH

$ touch books/isles.txt

If we compare the timestamps of books/isles.txt and isles.dat,

BASH

$ ls -l books/isles.txt isles.dat

then we see that isles.dat, the target, is now older than books/isles.txt, its dependency:

OUTPUT

-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 books/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 isles.dat

If we run SCons again,

BASH

$ scons

it does not recreate isles.dat, instead reporting that ‘all targets’ are up to date.

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `.' is up to date.
scons: done building targets.

This is a surprising result if you are already familiar with other build managers. Many build managers, such as GNU Make, use timestamps to track the state of source and target files. If we were using Make, it would have re-created the isles.dat file.

By default, SCons computes content signatures from the file content to track the state of source and target files. If the content of a file has not changed, it is considered up to date and SCons will not create it again. Computing the content signature takes more time than checking a timestamp, so SCons provides an option to use the more traditional timestamp state. However, in computational science and engineering workflows, which often contain tasks requiring hours or days to compute, the extra time spent checking file content is usually a worthwhile trade-off: a content check avoids needlessly re-launching long-running tasks more reliably than a simple timestamp check does.
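One way to opt in to timestamp-based checks is the Decider function of a construction environment. The SConstruct fragment below is a sketch only and is not used in the rest of this lesson; it assumes the single isles.dat task from above and selects the "timestamp-newer" decider, which SCons documents as the Make-like behavior.

```python
import os

env = Environment(ENV=os.environ.copy())

# Assumption for illustration: switch this environment from content
# signatures to Make-style timestamp comparison.
env.Decider("timestamp-newer")

env.Command(
    target=["isles.dat"],
    source=["books/isles.txt"],
    action=["python countwords.py books/isles.txt isles.dat"],
)
```

With this setting, a touch of books/isles.txt would be expected to trigger a rebuild, as it would with Make. The remainder of the lesson keeps the default content signatures.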

To observe SCons re-creating the target isles.dat, we must actually modify the books/isles.txt file. Any change to the file contents, even adding a newline, will change the content signature computed as an md5sum. If we run the md5sum ourselves, we can see the signature change before and after the file edit.

BASH

$ md5sum books/isles.txt

OUTPUT

6cc2c020856be849418f9d744ac1f5ee  books/isles.txt

Append a blank line to the books/isles.txt file and check the md5sum signature again.

BASH

$ echo "" >> books/isles.txt
$ md5sum books/isles.txt

OUTPUT

22b5adfc3b267e2e658ba75de4aeb74b  books/isles.txt

We can see that appending a blank newline changes the computed content signature. If we run SCons again, it will re-create isles.dat because the content of the source file books/isles.txt has changed.

BASH

$ scons

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
scons: done building targets.

When it is asked to build a target, SCons checks the ‘content signature’ of both the target and its sources and the ‘action signature’ of the associated action list. If any source or action content has changed since the target was built, then the actions are re-run to update the target. Using this approach, SCons knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
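The difference between a timestamp check and a content check can be reproduced in plain Python. The sketch below is a deliberately simplified model: it records an MD5 signature for one source file and ignores action signatures and the way SCons actually stores its bookkeeping. The function names and the placeholder file content are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def content_signature(path):
    """MD5 hex digest of a file's bytes, like the md5sum command above."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def is_out_of_date(sources, recorded):
    """True if any source's current signature differs from the recorded one."""
    return any(content_signature(path) != recorded.get(path) for path in sources)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "isles.txt")
    with open(source, "w") as handle:
        handle.write("placeholder text, not the real book\n")

    # Pretend a build just finished and record the source signature.
    recorded = {source: content_signature(source)}

    # Like `touch`: change the timestamp, leave the content alone.
    os.utime(source, (2_000_000_000, 2_000_000_000))
    after_touch = is_out_of_date([source], recorded)

    # Like `echo "" >>`: append a newline, which changes the content.
    with open(source, "a") as handle:
        handle.write("\n")
    after_edit = is_out_of_date([source], recorded)

print(after_touch, after_edit)
```

The touched file is still considered up to date, while the appended newline marks it as out of date, mirroring the two SCons runs above.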

SConscript Files as Documentation

By explicitly recording the inputs to and outputs from steps in our analysis and the dependencies between files, SConscript files act as a type of documentation, reducing the number of things we have to remember.

Let’s add another task to the end of SConstruct:

PYTHON

env.Command(
    target=["abyss.dat"],
    source=["books/abyss.txt"],
    action=["python countwords.py books/abyss.txt abyss.dat"],
)

If we run SCons,

BASH

$ scons

then we get:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/abyss.txt abyss.dat
scons: done building targets.

SCons builds the second target but not the first, because isles.dat is already up to date. By default, SCons builds all default targets and, unless otherwise specified, every target is added to the default target list.

If we do not want to build all targets, we can also build a specific target by name. First, confirm that running SCons again reports the special target ‘.’ up to date to indicate that all targets are up to date.

BASH

$ scons

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `.' is up to date.
scons: done building targets.

Then confirm that when specifying a target, SCons only reports on the requested target.

BASH

$ scons abyss.dat

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `abyss.dat' is up to date.
scons: done building targets.

“Up to Date” Versus “Nothing to be Done”

If we ask SCons to build a file that already exists and is up to date, then SCons informs us that:

OUTPUT

scons: `isles.dat' is up to date.

If we ask SCons to build a file that exists but for which there is no rule in our SConstruct file, then we get a message like:

BASH

$ scons countwords.py

OUTPUT

scons: Nothing to be done for `countwords.py'.

up to date means that the SConstruct file has a task with one or more actions whose target is the name of a file (or directory) and the file is up to date.

Nothing to be done means that the file exists but the SConstruct file has no task for it. Targets that are defined, but have no action, result in an empty ‘Building targets …’ message without issuing any commands.

We may want to remove all our data files so we can explicitly recreate them all. SCons provides the --clean command line option, which removes targets on request. We can clean all default targets:

BASH

$ scons --clean

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Cleaning targets ...
Removed abyss.dat
Removed isles.dat
scons: done cleaning targets.

or clean all targets with the special target ‘.’, regardless of the default list contents:

BASH

$ scons . --clean

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Cleaning targets ...
Removed abyss.dat
Removed isles.dat
scons: done cleaning targets.

or clean specific targets by name:

BASH

$ scons abyss.dat --clean

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Cleaning targets ...
Removed abyss.dat
scons: done cleaning targets.

We may want to simplify specification of some, but not all, targets. We can add an alias to reference all of the data files.

PYTHON

env.Alias("dats", ["isles.dat", "abyss.dat"])

This simplifies calling a non-default target list such that we do not have to write out each target by name. The following two executions of SCons are equivalent.

BASH

$ scons isles.dat abyss.dat

BASH

$ scons dats

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
python countwords.py books/abyss.txt abyss.dat
scons: done building targets.

When requesting specific targets, the requested targets are reported up to date according to the name used on the command line. Calling two targets by name results in individual reports, one per target.

BASH

$ scons isles.dat abyss.dat

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `isles.dat' is up to date.
scons: `abyss.dat' is up to date.
scons: done building targets.

Calling the collector alias dats results in a single report for the alias.

BASH

$ scons dats

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: `dats' is up to date.
scons: done building targets.

Dependencies

The order of rebuilding dependencies is arbitrary. Required sources are always built before targets, but if two targets are independent of one another, you should not assume that they will be built in the order in which they are listed.

Dependencies must form a directed acyclic graph. A target cannot depend on a dependency which itself, or one of its dependencies, depends on that target.
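Both constraints can be illustrated with Python's standard graphlib module (Python 3.9 or later). This is a sketch of the general idea of ordering a dependency graph, not of how SCons schedules work internally; the cyclic entry is a hypothetical mistake added for the demonstration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a target; each value is the set of files it depends on.
dependencies = {
    "isles.dat": {"books/isles.txt"},
    "abyss.dat": {"books/abyss.txt"},
    "dats": {"isles.dat", "abyss.dat"},
}

# A valid order always lists sources before the targets that need them,
# but the relative order of independent targets is not guaranteed.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Hypothetical mistake: a book that depends on its own word counts.
cyclic = dict(dependencies)
cyclic["books/isles.txt"] = {"isles.dat"}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```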

Our SConstruct now looks like this:

PYTHON

import os

env = Environment(ENV=os.environ.copy())

env.Command(
    target=["isles.dat"],
    source=["books/isles.txt"],
    action=["python countwords.py books/isles.txt isles.dat"],
)

env.Command(
    target=["abyss.dat"],
    source=["books/abyss.txt"],
    action=["python countwords.py books/abyss.txt abyss.dat"],
)

env.Alias("dats", ["isles.dat", "abyss.dat"])

The following figure shows a graph of the dependencies embodied within our SConstruct file, involved in building the dats alias:

Dependencies represented within the SConstruct file

Write Two New Tasks

  1. Write a new task for last.dat, created from books/last.txt.
  2. Update the dats alias with this target.
  3. Write a new task for results.txt, which creates the summary table. The task needs to:
    • Depend upon each of the three .dat files.
    • Invoke the action python testzipf.py abyss.dat isles.dat last.dat > results.txt.
  4. Add this target to the default target list so that it is the default target.

The starting SConstruct file is here.

See this file for a solution.

The following figure shows the dependencies embodied within our SConstruct file, involved in building the results.txt target:

results.txt dependencies represented within the SConstruct file

Key Points

  • SConstruct is the default name of the root SCons configuration file.
  • SCons configuration files are collectively called SConscript files.
  • SConscript files are Python files.
  • SCons tasks are attached to a construction environment, which can be inherited from the shell’s active environment.
  • Use # for comments in SConscript files.
  • Write tasks as lists of targets, sources, and actions with the Command method.
  • Use an Alias to collect targets in a convenient alias for shorter build commands.
  • Use the Default function to limit the number of default targets to a subset of all targets.

Content from Special Substitution Variables


Last updated on 2025-04-16

Estimated time: 15 minutes

Overview

Questions

  • How can I abbreviate the tasks in my SConscript files?

Objectives

  • Use SCons special substitution variables to remove duplication in SConscript files.
  • Explain why shell wildcards in dependencies can cause problems.

After the exercise at the end of the previous episode, our SConstruct file looked like this:

PYTHON

import os

env = Environment(ENV=os.environ.copy())

env.Command(
    target=["isles.dat"],
    source=["books/isles.txt"],
    action=["python countwords.py books/isles.txt isles.dat"],
)

env.Command(
    target=["abyss.dat"],
    source=["books/abyss.txt"],
    action=["python countwords.py books/abyss.txt abyss.dat"],
)

env.Command(
    target=["last.dat"],
    source=["books/last.txt"],
    action=["python countwords.py books/last.txt last.dat"],
)

env.Alias("dats", ["isles.dat", "abyss.dat", "last.dat"])

env.Command(
    target=["results.txt"],
    source=["isles.dat", "abyss.dat", "last.dat"],
    action=["python testzipf.py abyss.dat isles.dat last.dat > results.txt"],
)

env.Default(["results.txt"])

Our SConstruct file has a lot of duplication. For example, the names of text files and data files are repeated in many places throughout the file. SConscript files are a form of code and, in any code, repetition can lead to problems, e.g. we rename a data file in one part of the SConscript file but forget to rename it elsewhere.

D.R.Y. (Don’t Repeat Yourself)

In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python or R or Java, such as user-defined variables and functions, are useful in part because they mean we don’t have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the “Don’t Repeat Yourself” principle or D.R.Y.

Let us set about removing some of the repetition from our SConstruct file.

In our results.txt task we duplicate the data file names and the name of the results file:

PYTHON

env.Command(
    target=["results.txt"],
    source=["isles.dat", "abyss.dat", "last.dat"],
    action=["python testzipf.py abyss.dat isles.dat last.dat > results.txt"],
)

Looking at the results file name first, we can replace it in the action with ${TARGET}:

PYTHON

env.Command(
    target=["results.txt"],
    source=["isles.dat", "abyss.dat", "last.dat"],
    action=["python testzipf.py abyss.dat isles.dat last.dat > ${TARGET}"],
)

${TARGET} is an SCons special variable which means ‘the target of the current task’. When SCons is run it will replace this variable with the target name.

We can replace the sources in the action with ${SOURCES}:

PYTHON

env.Command(
    target=["results.txt"],
    source=["isles.dat", "abyss.dat", "last.dat"],
    action=["python testzipf.py ${SOURCES} > ${TARGET}"],
)

${SOURCES} is another special substitution variable which means ‘all the dependencies of the current task’. Again, when SCons is run it will replace this variable with the sources.

Let’s clean our workflow and re-run our task:

BASH

$ scons . --clean
$ scons results.txt

We get:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
python countwords.py books/abyss.txt abyss.dat
python countwords.py books/last.txt last.dat
python testzipf.py isles.dat abyss.dat last.dat > results.txt
scons: done building targets.

Update Dependencies

What will happen if you now execute:

BASH

$ touch *.dat
$ scons results.txt

  1. nothing
  2. all files recreated
  3. only .dat files recreated
  4. only results.txt recreated

1. Nothing.

The content of the *.dat files has not changed, so results.txt is up to date.

If you run:

BASH

$ echo "" | tee -a books/*.txt
$ scons results.txt

you will find that the .dat files as well as results.txt are recreated.

If you run:

BASH

$ echo "manually-edited-output-is-bad 1 0.1" | tee -a *.dat
$ scons results.txt

you will find that the results.txt file is recreated because the content signature of the .dat files has changed. However, the .dat files are not recreated. Despite our edit, the source and action signatures of the .dat tasks have not changed. To avoid out-of-sync and reproducibility errors arising in the middle of your workflow, never edit targets manually. You can rely on SCons to know when to rebuild targets if you have well-defined tasks with complete target and source lists.

If you ran this command or manually edited the .dat files, be sure to clean and rebuild them to remove the manually edited lines.

BASH

$ scons dats --clean
$ scons results.txt

As we saw, ${SOURCES} means ‘all the dependencies of the current task’. This works well for results.txt as its action treats all the dependencies the same - as the input for the testzipf.py script.

However, for some tasks, we may want to treat the first dependency differently. For example, our tasks for .dat use their first (and only) dependency specifically as the input file to countwords.py. If we add additional dependencies (as we will soon do) then we don’t want these being passed as input files to countwords.py as it expects only one input file to be named when it is invoked.

SCons allows Pythonic, zero-based indexing of special substitution variables ${SOURCES} and ${TARGETS} for this use case. For example, ${SOURCES[0]} means ‘the first dependency of the current task’.
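The effect of these variables can be imitated with a small Python function. This is a toy expansion written for illustration, not the SCons substitution engine, which supports far more than this; the function name expand is hypothetical, and the second source in the last call is a made-up extra dependency.

```python
import re


def expand(action, targets, sources):
    """Toy expansion of ${TARGET}, ${TARGETS}, ${SOURCES}, and indexed forms."""
    values = {"TARGET": targets[:1], "TARGETS": targets, "SOURCES": sources}

    def replace(match):
        items = values[match.group("name")]
        index = match.group("index")
        if index is not None:
            # Zero-based index, e.g. ${SOURCES[0]} is the first source.
            return items[int(index)]
        return " ".join(items)

    pattern = r"\$\{(?P<name>TARGETS|TARGET|SOURCES)(?:\[(?P<index>\d+)\])?\}"
    return re.sub(pattern, replace, action)


print(expand("python testzipf.py ${SOURCES} > ${TARGET}",
             ["results.txt"], ["isles.dat", "abyss.dat", "last.dat"]))
print(expand("python countwords.py ${SOURCES[0]} ${TARGET}",
             ["isles.dat"], ["books/isles.txt", "extra_dependency.py"]))
```

In the second call only the first source reaches the command line, even though the task lists two sources.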

Rewrite .dat Tasks to Use Special Substitution Variables

Rewrite each .dat task to use the special substitution variables ${TARGET} (‘the target of the current task’) and ${SOURCES[0]} (‘the first dependency of the current task’). This file contains the SConstruct immediately before the challenge.

See this file for a solution to this challenge.

Key Points

  • Use ${TARGET} to refer to the target of the current task.
  • Use ${SOURCES} to refer to the dependencies of the current task.
  • Use ${SOURCES[0]} to refer to the first dependency of the current task.

Content from Dependencies on Data and Code


Last updated on 2025-04-16

Estimated time: 20 minutes

Overview

Questions

  • How can I write an SConscript file to update things when my scripts have changed rather than my input files?

Objectives

  • Recognize that output files are a product not only of input files but also of the scripts or code that created them.
  • Recognize and avoid false dependencies.

Our SConstruct file now looks like this:

PYTHON

import os

env = Environment(ENV=os.environ.copy())

env.Command(
    target=["isles.dat"],
    source=["books/isles.txt"],
    action=["python countwords.py ${SOURCES[0]} ${TARGET}"],
)

env.Command(
    target=["abyss.dat"],
    source=["books/abyss.txt"],
    action=["python countwords.py ${SOURCES[0]} ${TARGET}"],
)

env.Command(
    target=["last.dat"],
    source=["books/last.txt"],
    action=["python countwords.py ${SOURCES[0]} ${TARGET}"],
)

env.Alias("dats", ["isles.dat", "abyss.dat", "last.dat"])

env.Command(
    target=["results.txt"],
    source=["isles.dat", "abyss.dat", "last.dat"],
    action=["python testzipf.py ${SOURCES} > ${TARGET}"],
)

env.Default(["results.txt"])

Our data files are produced using not only the input text files but also the script countwords.py that processes the text files and creates the data files. A change to countwords.py (e.g. adding a new column of summary data or removing an existing one) results in changes to the .dat files it outputs. So, let’s pretend to edit countwords.py, using echo to append a blank line, and re-run SCons:

BASH

$ scons dats
$ echo "" >> countwords.py
$ scons dats

Nothing happens! Though we’ve updated countwords.py our data files are not updated because our rules for creating .dat files don’t record any dependencies on countwords.py.

We need to add countwords.py as a dependency of each of our data files also:

PYTHON

env.Command(
    target=["isles.dat"],
    source=["books/isles.txt", "countwords.py"],
    action=["python countwords.py ${SOURCES[0]} ${TARGET}"],
)

env.Command(
    target=["abyss.dat"],
    source=["books/abyss.txt", "countwords.py"],
    action=["python countwords.py ${SOURCES[0]} ${TARGET}"],
)

env.Command(
    target=["last.dat"],
    source=["books/last.txt", "countwords.py"],
    action=["python countwords.py ${SOURCES[0]} ${TARGET}"],
)

If we re-run SCons,

BASH

$ scons dats

then we get:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
python countwords.py books/abyss.txt abyss.dat
python countwords.py books/last.txt last.dat
scons: done building targets.

SCons tracks the source list as part of the task signature. Adding a new source triggers a rebuild of the targets. Now if we edit the countwords.py file, the targets will re-build again.

BASH

$ echo "" >> countwords.py
$ scons dats

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
python countwords.py books/abyss.txt abyss.dat
python countwords.py books/last.txt last.dat
scons: done building targets.
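
Both rebuilds can be understood with a simplified model of a task signature. The sketch below is plain Python and is not how SCons is implemented: the task_signature helper and the in-memory file contents are hypothetical, and it only assumes that a signature combines the source names with hashes of their contents.

```python
import hashlib


def task_signature(sources):
    """Combine source names and content hashes into one signature.

    `sources` maps a file name to its content as bytes. These are
    hypothetical in-memory stand-ins for real files.
    """
    digest = hashlib.sha256()
    for name in sorted(sources):
        digest.update(name.encode())
        digest.update(hashlib.sha256(sources[name]).digest())
    return digest.hexdigest()


book = b"A tale of two isles\n"
script = b"print('count words')\n"

before = task_signature({"books/isles.txt": book})
# Adding countwords.py to the source list changes the signature.
with_script = task_signature({"books/isles.txt": book, "countwords.py": script})
# Appending a blank line to the script changes the signature again.
edited = task_signature({"books/isles.txt": book, "countwords.py": script + b"\n"})
# Running again with no further edits reproduces the same signature.
unchanged = task_signature({"books/isles.txt": book, "countwords.py": script + b"\n"})

print(before != with_script, with_script != edited, edited == unchanged)
```

In this model a task is re-run whenever its current signature differs from the one recorded after the previous build.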

Dry run

scons can show the commands it will execute without actually running them if we pass the --dry-run flag:

BASH

$ echo "" >> countwords.py
$ scons --dry-run dats

This gives the same output to the screen as without the --dry-run flag, but the commands are not actually run. Using this ‘dry-run’ mode is a good way to check that you have set up your SConscript tasks properly before actually running the commands.

You can also get an explanation for why SCons would like to recreate the targets with the --debug=explain option. This is helpful when the dry run shows commands you did not expect to run and you need help tracking down the incorrect task definition.

BASH

$ echo "" >> countwords.py
$ scons --dry-run --debug=explain dats

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
scons: rebuilding `isles.dat' because `countwords.py' changed
python countwords.py books/isles.txt isles.dat
scons: rebuilding `abyss.dat' because `countwords.py' changed
python countwords.py books/abyss.txt abyss.dat
scons: rebuilding `last.dat' because `countwords.py' changed
python countwords.py books/last.txt last.dat
scons: done building targets.

The following figure shows a graph of the dependencies that are involved in building the target results.txt. Notice the recently added dependencies countwords.py and testzipf.py. This is how the SConstruct should look after completing the rest of the exercises in this episode.

results.txt dependencies after adding countwords.py and testzipf.py as dependencies
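
The same graph can be written down as data. The sketch below is plain Python rather than SCons code; the dictionary layout and the all_dependencies helper are illustrative assumptions, with the edges taken from the task definitions this episode ends with.

```python
# Each target maps to the sources listed in its task definition.
DEPENDENCIES = {
    "results.txt": ["testzipf.py", "isles.dat", "abyss.dat", "last.dat"],
    "isles.dat": ["books/isles.txt", "countwords.py"],
    "abyss.dat": ["books/abyss.txt", "countwords.py"],
    "last.dat": ["books/last.txt", "countwords.py"],
}


def all_dependencies(target):
    """Return every file that `target` depends on, directly or indirectly."""
    found = set()
    for source in DEPENDENCIES.get(target, []):
        found.add(source)
        found |= all_dependencies(source)
    return found


# results.txt depends on countwords.py only indirectly, through the .dat files.
print("countwords.py" in all_dependencies("results.txt"))
# Input files such as the .txt books have no dependencies of their own.
print(all_dependencies("books/isles.txt"))
```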

Why Don’t the .txt Files Depend on countwords.py?

.txt files are input files and as such have no dependencies. To make these depend on countwords.py would introduce a false dependency which is not desirable.

Intuitively, we should also add countwords.py as a dependency for results.txt, because the final table should be rebuilt if we remake the .dat files. However, it turns out we don’t have to do that! Let’s see what happens to results.txt when we update countwords.py:

BASH

$ echo "" >> countwords.py
$ scons results.txt

then we get:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
python countwords.py books/abyss.txt abyss.dat
python countwords.py books/last.txt last.dat
scons: `results.txt' is up to date.
scons: done building targets.

The whole pipeline is triggered, starting with the .dat files and finishing with the requested results.txt file! To understand this, note that according to the dependency figure, results.txt depends on the .dat files. The update of countwords.py triggers an update of the *.dat files. Finally, SCons reports that the requested results.txt file is still up to date.

In a timestamp-based build manager, such as Make, the build manager would see that the dependencies (the .dat files) are newer than the target file (results.txt) and would therefore recreate results.txt.

With SCons, the results.txt file is not recreated because the recreated intermediate .dat files contain the same content as the first time we created the results.txt file, which is therefore not out-of-date.

Both behaviors are examples of the power of a build manager: updating a subset of the files in the pipeline triggers rerunning the appropriate downstream steps. The additional SCons behavior of stopping a pipeline early when intermediate file content has not changed is desirable for computational science and engineering workflows, where some tasks may require hours to days to complete.
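
The early stop can be illustrated with a toy model. This is a sketch under stated assumptions, not SCons internals: the digest, count_words, and needs_rebuild helpers are hypothetical, and the "script" is a stand-in whose output depends only on the book text.

```python
import hashlib


def digest(text):
    """Return a content hash for a string."""
    return hashlib.sha256(text.encode()).hexdigest()


def count_words(book_text, script_text):
    # Stand-in for countwords.py: the output depends on the book only,
    # so a blank line appended to the script does not change it.
    return str(len(book_text.split()))


def needs_rebuild(stored, current):
    """Rebuild when any dependency hash differs from the stored one."""
    return stored != current


book = "the isles of the north"
script_v1 = "# countwords"
script_v2 = script_v1 + "\n"

# Signatures recorded after the first full build.
dat_v1 = count_words(book, script_v1)
dat_deps = {"book": digest(book), "script": digest(script_v1)}
results_deps = {"isles.dat": digest(dat_v1)}

# The script was edited, so the .dat task is out of date and is re-run.
rebuild_dat = needs_rebuild(dat_deps, {"book": digest(book), "script": digest(script_v2)})
dat_v2 = count_words(book, script_v2)

# The regenerated .dat has identical content, so results.txt stays up to date.
rebuild_results = needs_rebuild(results_deps, {"isles.dat": digest(dat_v2)})

print(rebuild_dat, rebuild_results)
```

A timestamp comparison would answer differently at the last step, because the regenerated .dat file is newer than results.txt even though its content is unchanged.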

Updating One Input File

What will happen if you now execute:

BASH

$ echo "" >> books/last.txt
$ scons results.txt
  1. only last.dat is recreated
  2. all .dat files are recreated
  3. only last.dat and results.txt are recreated
  4. all .dat and results.txt are recreated

1. only last.dat is recreated.

Follow the dependency tree and consider the effect of an empty line on the word count calculations to understand the answer(s).

testzipf.py as a Dependency of results.txt.

What would happen if you added testzipf.py as a dependency of results.txt, and why?

If you change the rule for the results.txt file like this:

PYTHON

env.Command(
    target=["results.txt"],
    source=["isles.dat", "abyss.dat", "last.dat", "testzipf.py"],
    action=["python testzipf.py ${SOURCES} > ${TARGET}"],
)

testzipf.py becomes a part of ${SOURCES}, thus the post-substitution command becomes

BASH

python testzipf.py isles.dat abyss.dat last.dat testzipf.py > results.txt

This results in an error from testzipf.py as it tries to parse the script as if it were a .dat file. Try this by running:

BASH

$ scons results.txt

You’ll get

ERROR

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python testzipf.py isles.dat abyss.dat last.dat testzipf.py > results.txt
Traceback (most recent call last):
  File "/home/roppenheimer/scons-lesson/testzipf.py", line 19, in <module>
    counts = load_word_counts(input_file)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/roppenheimer/scons-lesson/countwords.py", line 39, in load_word_counts
    counts.append((fields[0], int(fields[1]), float(fields[2])))
                              ^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'countwords'
scons: *** [results.txt] Error 1
scons: building terminated because of errors.

We still have to add the testzipf.py script as a dependency of results.txt. Given the answer to the challenge above, we need to make a couple of small changes so that we can still use special substitution variables.

We’ll move testzipf.py to be the first source. We could then edit the action so that we pass all the dependencies as arguments to python using ${SOURCES}.

PYTHON

env.Command(
    target=["results.txt"],
    source=["testzipf.py", "isles.dat", "abyss.dat", "last.dat"],
    action=["python ${SOURCES} > ${TARGET}"],
)

But it would be helpful to clarify the unique role of the testzipf.py as a Python script. We can clarify the intended roles for different source files by indexing the sources in our action. SCons allows for Pythonic slicing when indexing special substitution variables.

PYTHON

env.Command(
    target=["results.txt"],
    source=["testzipf.py", "isles.dat", "abyss.dat", "last.dat"],
    action=["python ${SOURCES[0]} ${SOURCES[1:]} > ${TARGET}"],
)
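
The indexing follows the same rules as an ordinary Python list. As a rough analogy only (SCons performs its own substitution; the variable names here are illustrative), the action above expands like this:

```python
sources = ["testzipf.py", "isles.dat", "abyss.dat", "last.dat"]
target = "results.txt"

script = sources[0]       # plays the role of ${SOURCES[0]}
data_files = sources[1:]  # plays the role of ${SOURCES[1:]}

command = f"python {script} {' '.join(data_files)} > {target}"
print(command)
```

The script now appears exactly once in the command, and only the .dat files are passed to it as arguments.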

Index the .dat task actions

Index the sources for .dat task actions without changing the source file order. Remember that SCons allows Pythonic slicing when indexing special substitution variables.

SCons allows reverse indexing with the Python-style [-1] index.

PYTHON

    action=["python ${SOURCES[-1]} ${SOURCES[0]} ${TARGET}"],

Where We Are

This SConstruct file contains everything done so far in this topic.

Key Points

  • SCons results depend on processing scripts as well as data files.
  • Dependencies are transitive: if A depends on B and B depends on C, a change to C will indirectly trigger the pipeline to update A.
  • SCons content signatures help prevent recomputing work if intermediate targets’ contents do not change after recreation.

Content from Builders and Pseudo-builders


Last updated on 2025-04-16

Estimated time: 10 minutes

Overview

Questions

  • How can I define common task operations for similar files?

Objectives

  • Write SCons builders and pseudo-builders

Our SConstruct file still has repeated content. The tasks for each .dat file are identical apart from the text and data file names. We can replace these tasks with a single builder which can be used to build any .dat file from a .txt file in books/:

PYTHON

count_words_builder = Builder(
    action=["python ${SOURCES[-1]} ${SOURCES[0]} ${TARGET}"],
)

After creating the custom builder, we need to add it to the construction environment to make it available for task definitions.

PYTHON

env = Environment(ENV=os.environ.copy())
env.Append(BUILDERS={"CountWords": count_words_builder})

Now we can convert our .dat tasks from the Command to CountWords builder.

PYTHON

env.CountWords(
    target=["isles.dat"],
    source=["books/isles.txt", "countwords.py"],
)

env.CountWords(
    target=["abyss.dat"],
    source=["books/abyss.txt", "countwords.py"],
)

env.CountWords(
    target=["last.dat"],
    source=["books/last.txt", "countwords.py"],
)

Custom builders like CountWords allow us to apply the same action to many tasks.

If we re-run SCons,

BASH

$ scons dats --clean
$ scons dats

then we get:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/isles.txt isles.dat
python countwords.py books/abyss.txt abyss.dat
python countwords.py books/last.txt last.dat
scons: done building targets.

We can further simplify the task definition by moving the text file handling inside a pseudo-builder function. Pseudo-builders behave like builders, but allow flexibility in task construction through user-defined arguments. We will use the pathlib module to help us construct OS-agnostic paths and perform path manipulation.

At the top of your SConstruct file, update the imports as below. Then define a new count_words pseudo-builder function after the imports to replace the count_words_builder and add it to the construction environment.

PYTHON

import os
import pathlib


def count_words(env, data_file):
    """Pseudo-builder to run the `countwords.py` script and produce `.dat` target

    Assumes that the source text file is found in `books/{data_file}.txt`

    :param env: SCons construction environment. Do not provide when using this
        function with the `env.AddMethod` and `env.CountWords` access style.
    :param data_file: String name of the data file to create.
    """
    data_path = pathlib.Path(data_file)
    text_file = pathlib.Path("books") / data_path.with_suffix(".txt")
    target_nodes = env.Command(
        target=[data_file],
        source=[text_file, "countwords.py"],
        action=["python ${SOURCES[-1]} ${SOURCES[0]} ${TARGET}"],
    )
    return target_nodes


env = Environment(ENV=os.environ.copy())
env.AddMethod(count_words, "CountWords")

This pseudo-builder has further reduced the interface necessary to define the .dat tasks, which now can be re-written as

PYTHON

env.CountWords("isles.dat")
env.CountWords("abyss.dat")
env.CountWords("last.dat")
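
The path manipulation inside count_words can be tried on its own, without SCons. This sketch uses pathlib.PurePosixPath instead of the pathlib.Path used in the lesson, purely so that the printed separator is the same on every operating system.

```python
import pathlib

data_path = pathlib.PurePosixPath("isles.dat")
# Swap the extension, then place the result inside the books directory.
text_file = pathlib.PurePosixPath("books") / data_path.with_suffix(".txt")
print(text_file)
```

This is how a single data file name is enough for the pseudo-builder to find its source text file.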

For students familiar with GNU Make, pseudo-builders are similar to Make ‘pattern rules’, but pseudo-builders are both more verbose and more flexible. Pseudo-builders require full Python function definition syntax, but they are not limited to simple file extension pattern matching and can perform any task construction the user requires.

A pseudo-builder alone will not allow us to match arbitrary files using the .dat file extension. If we desire the full Make ‘pattern rule’ behavior, we can accept a target name and match it to our pseudo-builder with the SCons COMMAND_LINE_TARGETS variable.

Add the following to the bottom of your SConstruct file

PYTHON

for target in COMMAND_LINE_TARGETS:
    if pathlib.Path(target).suffix == ".dat":
        env.CountWords(target)

Now we can define tasks for new files not found in our pre-defined tasks as

BASH

$ scons sierra.dat

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/sierra.txt sierra.dat
scons: done building targets.
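
The matching logic is ordinary Python and can be checked outside SCons. In this sketch the dat_targets helper and the example target list are hypothetical; inside an SConstruct file the list would come from COMMAND_LINE_TARGETS.

```python
import pathlib


def dat_targets(command_line_targets):
    """Return only the requested targets that end in .dat."""
    return [
        target
        for target in command_line_targets
        if pathlib.PurePosixPath(target).suffix == ".dat"
    ]


print(dat_targets(["sierra.dat", "results.txt", "dats"]))
```

Targets such as results.txt and the dats alias are skipped, so only .dat requests reach the pseudo-builder.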

Where We Are

This SConstruct file contains all of our work so far.

Key Points

  • Use the Builder function and Append the construction environment BUILDERS dictionary to define common actions.
  • Use the AddMethod function and Python functions to define pseudo-builders with custom tailored task handling.
  • Use the special SCons variable COMMAND_LINE_TARGETS to perform dynamic handling that depends on command line target requests.

Content from Variables


Last updated on 2025-04-16

Estimated time: 20 minutes

Overview

Questions

  • How can I eliminate redundancy in my SConscript files?

Objectives

  • Use variables in SConscript files.
  • Explain the benefits of decoupling configuration from computation.

Despite our efforts, our SConstruct still has repeated content, i.e. the program we use to run our scripts, python. Additionally, if we renamed our scripts we’d have to hunt through our SConstruct file in multiple places.

We can introduce Python variables after the import statements in SConstruct to hold our script name:

PYTHON

COUNT_SOURCE = "countwords.py"

This is a variable assignment: COUNT_SOURCE is assigned the value "countwords.py". Because SConstruct files are Python code, it is an ordinary Python variable assignment. The all-capitals naming convention indicates that the variable is intended for use as a setting or constant value.

We can do the same thing with the interpreter language used to run the script:

PYTHON

LANGUAGE = "python"

Similar to the SCons special substitution variables, we can define any number of per-task or per-builder substitution variables with keyword arguments. The same ${...} substitution syntax tells SCons to replace a task action string variable with its value when SCons is run.

Defining the variable LANGUAGE in this way avoids repeating python in our SConstruct file, and allows us to easily change how our script is run. For example, we might want to use a different version of Python and need to change python to python2, or we might want to rewrite the script in another language (e.g. switch from Python to R).

In the count_words pseudo-builder function we will define optional arguments language and count_source, which are defined to default as LANGUAGE and COUNT_SOURCE respectively and passed through as Command task keyword arguments. This tells SCons to replace the variable language with its value python, and to replace the variable count_source with its value countwords.py.
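
As a rough analogy, the Python standard library's string.Template uses the same ${name} syntax for simple names. This is only an illustration: SCons has its own substitution engine, which also understands expressions such as ${SOURCES[0]} that string.Template does not, and the source and target keywords below are stand-ins for the special variables.

```python
import string

action = string.Template("${language} ${count_source} ${source} ${target}")
command = action.substitute(
    language="python",
    count_source="countwords.py",
    source="books/isles.txt",
    target="isles.dat",
)
print(command)
```

Changing the language or count_source value changes every command built from the template, without editing the template itself.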

We will define and use the intermediate function keyword argument variables instead of using the upper case variables directly to avoid mixing up the function’s scope with the SConstruct scope. This is a detail of good practice in Python development, and since SConscript files are Python code, you should follow the usual Python practices and style guides wherever possible.

Use Variables

Update SConstruct so that the .dat rule references the variables count_source and language. Then do the same for the testzipf.py script and the results.txt rule, using ZIPF_SOURCE as the variable name.

This SConstruct file contains a solution to this challenge.

We place variables intended for use as configuration constants at the top of an SConstruct file so they are easy to find and modify. Alternatively, we can pull them out into a new file that just holds variable definitions (i.e. delete them from the original SConstruct file). Because SCons uses Python as the configuration language, we can also move our custom builders and pseudo-builders. Let us create scons_lesson_configuration.py from the content below.

PYTHON

import pathlib

COUNT_SOURCE = "countwords.py"
LANGUAGE = "python"
ZIPF_SOURCE = "testzipf.py"


def count_words(env, data_file, language=LANGUAGE, count_source=COUNT_SOURCE):
    """Pseudo-builder to produce `.dat` targets from the `countwords.py` script

    Assumes that the source text file is found in `books/{data_file}.txt`

    :param env: SCons construction environment. Do not provide when using this
        function with the `env.AddMethod` and `env.CountWords` access style.
    :param data_file: String name of the data file to create.
    """
    data_path = pathlib.Path(data_file)
    text_file = pathlib.Path("books") / data_path.with_suffix(".txt")
    target_nodes = env.Command(
        target=[data_file],
        source=[text_file, count_source],
        action=["${language} ${count_source} ${SOURCES[0]} ${TARGET}"],
        language=language,
        count_source=count_source,
    )
    return target_nodes

An added benefit to moving our custom functions into a file with the .py extension is that we can use automated documentation tools, such as Sphinx, to build project documentation.

We can then import scons_lesson_configuration.py into the SConstruct file with a standard Python import:

PYTHON

from scons_lesson_configuration import *

Note that the above import statement merges the module namespace into the SConstruct file namespace. We must be careful to avoid re-defining variable and function names provided by scons_lesson_configuration.py in our SConstruct file, which would overwrite the names provided by our module and lead to unexpected behavior.

We can re-run SCons to see that everything still works:

BASH

$ scons . --clean
$ scons dats
$ scons results.txt

Where We Are

This SConstruct file and this Python module contain all of our work so far.

Key Points

  • Define variables by assigning values to names with Python syntax.
  • Reference variables in action strings using SCons substitution syntax ${...}.

Content from Functions


Last updated on 2025-04-16

Estimated time: 25 minutes

Overview

Questions

  • How else can I eliminate redundancy in my SConscript files?

Objectives

  • Write SConscript files that use functions to match and transform sets of files.

At this point, we have the following SConstruct file:

PYTHON

import os
import pathlib

from scons_lesson_configuration import *

env = Environment(ENV=os.environ.copy())
env.AddMethod(count_words, "CountWords")

env.CountWords("isles.dat")
env.CountWords("abyss.dat")
env.CountWords("last.dat")

env.Alias("dats", ["isles.dat", "abyss.dat", "last.dat"])

env.Command(
    target=["results.txt"],
    source=[ZIPF_SOURCE, "isles.dat", "abyss.dat", "last.dat"],
    action=["${language} ${zipf_source} ${SOURCES[1:]} > ${TARGET}"],
    language=LANGUAGE,
    zipf_source=ZIPF_SOURCE,
)

env.Default(["results.txt"])

for target in COMMAND_LINE_TARGETS:
    if pathlib.Path(target).suffix == ".dat":
        env.CountWords(target)

Python and SCons have many functions which can be used to write more complex tasks. One example is Glob. Glob gets a list of files matching some pattern, which we can then save in a variable. So, for example, we can get a list of all our text files (files ending in .txt) and save these in a variable by updating the beginning of our scons_lesson_configuration.py file:

PYTHON

import pathlib

import SCons.Script

COUNT_SOURCE = "countwords.py"
LANGUAGE = "python"
ZIPF_SOURCE = "testzipf.py"
TEXT_FILES = SCons.Script.Glob("books/*.txt")

Because our new Python module is no longer part of the SConstruct file, it does not have direct access to the special SCons namespace. We need to import SCons like a Python package to use the Glob function.

We can add a custom command-line option --variables to our SConstruct file that prints the TEXT_FILES value, using Python f-string syntax, and exits the configuration phase before any building starts:

PYTHON

AddOption(
    "--variables",
    action="store_true",
    default=False,
    help="Print the text files returned by Glob and exit (default: %default)",
)
if GetOption("variables"):
    text_file_strings = [str(node) for node in TEXT_FILES]
    print(f"TEXT_FILES: {text_file_strings}")
    Exit(0)

If we run SCons:

BASH

$ scons --variables

We get:

OUTPUT

scons: Reading SConscript files ...
TEXT_FILES: ['books/abyss.txt', 'books/isles.txt', 'books/last.txt', 'books/sierra.txt']

Note how sierra.txt is now included too. There are some progress messages missing from the output due to the early Exit. The configuration phase is exited immediately and there is no build phase.

We can add a list of data files to the scons_lesson_configuration.py module, constructed with a list comprehension that performs path manipulation on the text files list. We will use the pathlib module again for OS-agnostic path separators.

PYTHON

DATA_FILES = [
    pathlib.Path(str(text_file)).with_suffix(".dat").name
    for text_file in TEXT_FILES
]

We can extend the --variables option in the SConstruct file to show the value of DATA_FILES too. We cast the SCons node objects into a string, then create a pathlib.Path object, and finally trim the parent directory to get our data file name. These operations return a list of strings, so we can print the list directly.

PYTHON

if GetOption("variables"):
    text_file_strings = [str(node) for node in TEXT_FILES]
    print(f"TEXT_FILES: {text_file_strings}")
    print(f"DATA_FILES: {DATA_FILES}")
    Exit(0)

If we run SCons,

BASH

$ scons --variables

then we get:

OUTPUT

scons: Reading SConscript files ...
TEXT_FILES: ['books/abyss.txt', 'books/isles.txt', 'books/last.txt', 'books/sierra.txt']
DATA_FILES: ['abyss.dat', 'isles.dat', 'last.dat', 'sierra.dat']
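
The DATA_FILES names only work if the count_words pseudo-builder can map each one back to its text file. The sketch below checks that round trip with plain strings standing in for the SCons nodes returned by Glob; it uses pathlib.PurePosixPath rather than pathlib.Path so the output is the same on every operating system.

```python
import pathlib

# Plain strings stand in for the SCons nodes returned by Glob.
text_files = ["books/abyss.txt", "books/sierra.txt"]

pairs = []
for text_file in text_files:
    # Same steps as the DATA_FILES list comprehension.
    data_file = pathlib.PurePosixPath(text_file).with_suffix(".dat").name
    # Same steps as the count_words pseudo-builder.
    recovered = pathlib.PurePosixPath("books") / pathlib.PurePosixPath(
        data_file
    ).with_suffix(".txt")
    pairs.append((data_file, str(recovered)))

print(pairs)
```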

Finally, we can update our count_words function in scons_lesson_configuration.py to accept a list of data files and reduce our CountWords function calls to a single instance in SConstruct. We will have to collect the target nodes returned by Command and compile the full list of target nodes to return from our pseudo-builder.

PYTHON

def count_words(env, data_files, language=LANGUAGE, count_source=COUNT_SOURCE):
    """Pseudo-builder to produce `.dat` targets from the `countwords.py` script

    Assumes that the source text files are found in `books/{data_file}.txt`

    :param env: SCons construction environment. Do not provide when using this
        function with the `env.AddMethod` and `env.CountWords` access style.
    :param data_files: List of string names of the data files to create.
    """
    target_nodes = []
    for data_file in data_files:
        data_path = pathlib.Path(data_file)
        text_file = pathlib.Path("books") / data_path.with_suffix(".txt")
        target_nodes.extend(
            env.Command(
                target=[data_file],
                source=[text_file, count_source],
                action=["${language} ${count_source} ${SOURCES[0]} ${TARGET}"],
                language=language,
                count_source=count_source,
            )
        )
    return target_nodes
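
The function uses extend rather than append because each Command call returns a list of target nodes. The difference can be seen with a hypothetical stand-in, fake_command, that returns a list of names in place of real nodes:

```python
def fake_command(target):
    """Hypothetical stand-in for env.Command: returns a list of targets."""
    return list(target)


data_files = ["abyss.dat", "isles.dat"]

target_nodes = []
nested_nodes = []
for data_file in data_files:
    # extend adds the items of the returned list, keeping one flat list.
    target_nodes.extend(fake_command([data_file]))
    # append would add the returned list itself, giving a list of lists.
    nested_nodes.append(fake_command([data_file]))

print(target_nodes)
print(nested_nodes)
```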

PYTHON

env = Environment(ENV=os.environ.copy())
env.AddMethod(count_words, "CountWords")

env.CountWords(DATA_FILES)

env.Alias("dats", DATA_FILES)

Now, sierra.txt is processed, too. With the updated Alias function call, we can process all .txt files with the same dats alias. The COMMAND_LINE_TARGETS loop is no longer required and may be removed.

BASH

$ scons dats --clean
$ scons dats

We get:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/abyss.txt abyss.dat
python countwords.py books/isles.txt isles.dat
python countwords.py books/last.txt last.dat
python countwords.py books/sierra.txt sierra.dat
scons: done building targets.

We can also rewrite the results.txt task:

PYTHON

env.Command(
    target=["results.txt"],
    source=[ZIPF_SOURCE] + DATA_FILES,
    action=["${language} ${zipf_source} ${SOURCES[1:]} > ${TARGET}"],
    language=LANGUAGE,
    zipf_source=ZIPF_SOURCE,
)

If we re-run SCons:

BASH

$ scons . --clean
$ scons results.txt

We get:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
python countwords.py books/abyss.txt abyss.dat
python countwords.py books/isles.txt isles.dat
python countwords.py books/last.txt last.dat
python countwords.py books/sierra.txt sierra.dat
python testzipf.py abyss.dat isles.dat last.dat sierra.dat > results.txt
scons: done building targets.

Let’s check the results.txt file:

BASH

$ cat results.txt

OUTPUT

Book	First	Second	Ratio
abyss	4044	2807	1.44
isles	3822	2460	1.55
last	12244	5566	2.20
sierra	4242	2469	1.72

So the ratios of occurrences of the two most frequent words in our books range from 1.44 to 2.20, which is roughly the factor of 2 predicted by Zipf’s Law, i.e., the most frequently-occurring word occurs approximately twice as often as the second most frequent word.
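
The Ratio column is simply the First column divided by the Second column. A short check in plain Python, with the counts copied from the table above (the dictionary layout is only for illustration):

```python
counts = {
    "abyss": (4044, 2807),
    "isles": (3822, 2460),
    "last": (12244, 5566),
    "sierra": (4242, 2469),
}

# Format to two decimal places, as in the results.txt table.
ratios = {
    book: f"{first / second:.2f}" for book, (first, second) in counts.items()
}
for book, ratio in ratios.items():
    print(book, ratio)
```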

Here is our final SConstruct file:

PYTHON

import os
import pathlib

from scons_lesson_configuration import *

AddOption(
    "--variables",
    action="store_true",
    default=False,
    help="Print the text files returned by Glob and exit (default: %default)",
)
if GetOption("variables"):
    text_file_strings = [str(node) for node in TEXT_FILES]
    print(f"TEXT_FILES: {text_file_strings}")
    print(f"DATA_FILES: {DATA_FILES}")
    Exit(0)

env = Environment(ENV=os.environ.copy())
env.AddMethod(count_words, "CountWords")

env.CountWords(DATA_FILES)

env.Alias("dats", DATA_FILES)

env.Command(
    target=["results.txt"],
    source=[ZIPF_SOURCE] + DATA_FILES,
    action=["${language} ${zipf_source} ${SOURCES[1:]} > ${TARGET}"],
    language=LANGUAGE,
    zipf_source=ZIPF_SOURCE,
)

env.Default(["results.txt"])

and the supporting scons_lesson_configuration.py module:

PYTHON

import pathlib

import SCons.Script

COUNT_SOURCE = "countwords.py"
LANGUAGE = "python"
ZIPF_SOURCE = "testzipf.py"
TEXT_FILES = SCons.Script.Glob("books/*.txt")
DATA_FILES = [
    pathlib.Path(str(text_file)).with_suffix(".dat").name
    for text_file in TEXT_FILES
]


def count_words(env, data_files, language=LANGUAGE, count_source=COUNT_SOURCE):
    """Pseudo-builder to produce `.dat` targets from the `countwords.py` script

    Assumes that the source text files are found in `books/{data_file}.txt`

    :param env: SCons construction environment. Do not provide when using this
        function with the `env.AddMethod` and `env.CountWords` access style.
    :param data_files: List of string names of the data files to create.
    """
    target_nodes = []
    for data_file in data_files:
        data_path = pathlib.Path(data_file)
        text_file = pathlib.Path("books") / data_path.with_suffix(".txt")
        target_nodes.extend(
            env.Command(
                target=[data_file],
                source=[text_file, count_source],
                action=["${language} ${count_source} ${SOURCES[0]} ${TARGET}"],
                language=language,
                count_source=count_source,
            )
        )
    return target_nodes

The following figure shows the dependencies embodied within our SConstruct file, involved in building the results.txt target, now that we have introduced our function:

results.txt dependencies after introducing a function

Where We Are

This SConstruct file and this Python module contain all of our work so far.

Adding more books

We can now do a better job at testing Zipf’s Law by adding more books. The books we have used come from the Project Gutenberg website. Project Gutenberg offers thousands of free ebooks to download.

Exercise instructions:

  • go to Project Gutenberg and use the search box to find another book, for example ‘The Picture of Dorian Gray’ by Oscar Wilde.
  • download the ‘Plain Text UTF-8’ version and save it to the books folder; choose a short name for the file (that doesn’t include spaces) e.g. “dorian_gray.txt” because the filename is going to be used in the results.txt file
  • optionally, open the file in a text editor and remove extraneous text at the beginning and end (look for the phrase END OF THE PROJECT GUTENBERG EBOOK [title])
  • run scons and check that the correct commands are run, given the dependency tree
  • check the results.txt file to see how this book compares to the others

Key Points

  • SCons uses the Python programming language with access to all of Python’s many built-in functions.
  • SCons provides many functions that work natively with the internal node objects required to manage the SCons directed graph.
  • Use the SCons Glob function to get lists of SCons nodes from file names matching a pattern.
  • Use Python built-in and standard library modules to manage file names and paths.

Content from Self-Documenting SConstruct files


Last updated on 2025-04-16

Estimated time: 10 minutes

Overview

Questions

  • How should I document an SConscript file?

Objectives

  • Write self-documenting SConstruct files with built-in help.

Many bash commands, and programs that people have written that can be run from within bash, support a --help flag to display more information on how to use the commands or programs. In this spirit, it can be useful, both for ourselves and for others, to provide a --help option in our SConstruct file. This can provide a summary of the names of the key targets and what they do, so we don’t need to look at the SConstruct file itself unless we want to.

SCons provides the common --help flag and a Help function for building user customizable help messages. The less common -H flag will print the SCons help message. For our SConstruct file, running with the --help option might print:

BASH

$ scons --help

OUTPUT

Local Options:
  --variables  Print the text files returned by Glob and exit (default: False)

Default Targets:
  results.txt

Aliases:
  dats

SCons already composes the help message for our custom command-line options for us. So, how would we add the default targets and aliases? We could call Help like:

PYTHON

help_message = "\n\nDefault Targets:\n  results.txt\n\nAliases:\n  dats"
env.Help(help_message, append=True, keep_local=True)

But every time we add or remove a task, or change the default target list, we would have to update the help message string manually. It would be better if we could construct the list of default targets and aliases from the configured tasks. We can use the SCons default_ans and DEFAULT_TARGETS variables. First, update the imports at the top of the scons_lesson_configuration.py file:

PYTHON

import pathlib

import SCons.Script
from SCons.Script import DEFAULT_TARGETS
from SCons.Node.Alias import default_ans

Then add new help message construction functions at the bottom of the scons_lesson_configuration.py file.

PYTHON

def return_help_content(nodes, message="", help_content=dict()):
    """Return a dictionary of {node: message} string pairs

    Helpful in constructing help content for :meth:`project_help`. Will not
    overwrite existing keys.

    :param nodes: SCons node objects, e.g. targets and aliases
    :param str message: Help message to assign to every node in nodes
    :param dict help_content: Optional dictionary with target help messages
        ``{target: help}``

    :returns: Dictionary of {node: message} string pairs
    :rtype: dict
    """
    new_help_content = {str(node): message for node in nodes}
    new_help_content.update(help_content)
    return new_help_content


def project_help(help_content=dict()):
    """Append the SCons help message with default targets and aliases

    Must come *after* all default targets and aliases are defined.

    :param dict help_content: Optional dictionary with target help messages
        ``{target: help}``
    """
    def add_content(nodes, message="", help_content=help_content):
        """Append a help message for all nodes using provided help content if
        available.

        :param nodes: SCons node objects, e.g. targets and aliases
        :param str message: Help message to assign to every node in nodes
        :param dict help_content: Optional dictionary with target help messages
            ``{target: help}``

        :returns: appended help message
        :rtype: str
        """
        keys = [str(node) for node in nodes]
        for key in keys:
            if key in help_content:
                message += f"    {key}: {help_content[key]}\n"
            else:
                message += f"    {key}\n"
        return message

    defaults_message = add_content(
        DEFAULT_TARGETS, message="\nDefault Targets:\n"
    )
    alias_message = add_content(default_ans, message="\nTarget Aliases:\n")
    SCons.Script.Help(
        defaults_message + alias_message, append=True, keep_local=True
    )
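The formatting logic inside project_help can be tried out without SCons. The following is a minimal sketch, not part of the lesson files: format_section is a hypothetical stand-in for the nested add_content function, and plain strings stand in for the str(node) values of real SCons nodes.

```python
def format_section(title, names, help_content=None):
    """Return one help section, adding a description where one is known."""
    help_content = help_content or {}
    lines = [f"\n{title}:"]
    for name in names:
        if name in help_content:
            lines.append(f"    {name}: {help_content[name]}")
        else:
            lines.append(f"    {name}")
    return "\n".join(lines) + "\n"


# Hypothetical names standing in for str(node) on real SCons nodes.
help_content = {"results.txt": "Generate Zipf summary table."}
message = format_section("Default Targets", ["results.txt"], help_content)
message += format_section("Target Aliases", ["dats"], help_content)
print(message)
```

Names without an entry in help_content are still listed, just without a description, which is the same fallback the lesson's function uses.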

Finally, update the bottom of the SConstruct file with the new function calls. It is important that the project_help call comes after all default targets are assigned and all aliases are created.

PYTHON

dats = env.Alias("dats", DATA_FILES)
help_content = return_help_content(
    dats,
    "Count words in text files.",
)

results = env.Command(
    target=["results.txt"],
    source=[ZIPF_SOURCE] + DATA_FILES,
    action=["${language} ${zipf_source} ${SOURCES[1:]} > ${TARGET}"],
    language=LANGUAGE,
    zipf_source=ZIPF_SOURCE,
)
help_content = return_help_content(
    results,
    "Generate Zipf summary table.",
    help_content,
)

env.Default(["results.txt"])

project_help(help_content)
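The chained return_help_content calls above rely on the function never overwriting an existing key. A minimal sketch of that behavior, assuming plain strings in place of SCons nodes and a None default instead of dict() (a stylistic assumption, not the lesson's exact signature):

```python
def return_help_content(nodes, message="", help_content=None):
    """Same logic as the lesson's function; existing keys take precedence."""
    new_help_content = {str(node): message for node in nodes}
    new_help_content.update(help_content or {})
    return new_help_content


# Plain strings stand in for the nodes returned by env.Alias and env.Command.
help_content = return_help_content(["dats"], "Count words in text files.")
help_content = return_help_content(
    ["results.txt"], "Generate Zipf summary table.", help_content
)
# A later call for an existing key keeps the earlier message.
help_content = return_help_content(["dats"], "A different message.", help_content)
print(help_content)
```

Because the earlier dictionary is applied last, the first message recorded for a target is the one that appears in the help output.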

If we now run

BASH

$ scons --help

we get some SCons status messages, our help message, and the hint for the full SCons help message:

OUTPUT

scons: Reading SConscript files ...
scons: done reading SConscript files.
Local Options:
  --variables  Print the files returned by Glob and exit (default: 'False')

Default Targets:
    results.txt: Generate Zipf summary table.

Target Aliases:
    dats: Count words in text files.

Use scons -H for help about SCons built-in command-line options.

If we add, change or remove a default target or alias, we will automatically see updated lists in our help messages.

Where We Are

This SConstruct file and this Python module contain all of our work so far.

Key Points

  • Document SConstruct options, targets, and aliases with the SCons default_ans and DEFAULT_TARGETS variables and the Help function.

Content from Conclusion


Last updated on 2025-04-16

Estimated time: 35 minutes

Overview

Questions

  • What are the advantages and disadvantages of using tools like SCons?

Objectives

  • Understand the advantages of automated build tools such as SCons.

Automated build tools such as SCons can help us in a number of ways. They help us to automate repetitive commands, hence saving us time and reducing the likelihood of errors compared with running these commands manually.

They can also save time by ensuring that automatically-generated artifacts (such as data files or plots) are only recreated when the files that were used to create these have changed in some way.

Through their notion of targets, sources, and actions, they serve as a form of documentation, recording dependencies between code, scripts, tools, configurations, raw data, derived data, plots, and papers.
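The idea of recreating an artifact only when its inputs have changed can be illustrated in a few lines of plain Python. This is a simplified sketch of a content-signature check, not SCons's actual implementation: the function names, the choice of SHA-256, and the in-memory signatures dictionary are all assumptions made for illustration.

```python
import hashlib
import pathlib
import tempfile


def content_signature(path):
    """Return a hex digest of the file's bytes."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()


def needs_rebuild(source, stored_signatures):
    """True if the content differs from the signature recorded at the last build."""
    return stored_signatures.get(str(source)) != content_signature(source)


with tempfile.TemporaryDirectory() as tmp:
    book = pathlib.Path(tmp) / "book.txt"
    book.write_text("the quick brown fox\n")
    signatures = {}

    first = needs_rebuild(book, signatures)  # never built before
    signatures[str(book)] = content_signature(book)  # record after a "build"

    book.write_text("the quick brown fox\n")  # rewritten with identical content
    unchanged = needs_rebuild(book, signatures)

    book.write_text("the lazy dog\n")  # content actually changes
    changed = needs_rebuild(book, signatures)

print(first, unchanged, changed)
```

Comparing content rather than modification times means that rewriting a file with identical content does not trigger a rebuild in this sketch.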

Creating PNGs

Add new rules, update existing rules, and add new variables to:

  • Create .png files from .dat files using plotcounts.py.
  • Update the default target to include the .png files.
  • Remove all auto-generated files (.dat, .png, results.txt).

This SConstruct file and this Python module contain a simple solution to this challenge.

Can you think of a way to reduce duplication in the count_words and plot_counts functions?

To remove all targets, use the SCons special ‘.’ target and the --clean flag.

BASH

$ scons . --clean

The following figure shows the dependencies involved in building the ‘.’ or ‘all’ target, once we’ve added support for images:

results.txt dependencies once images have been added

Creating an Archive

Often it is useful to create an archive file of your project that includes all data, code and results. An archive file can package many files into a single file that can easily be downloaded and shared with collaborators. We can add steps to create the archive file inside the SConstruct itself so it’s easy to update our archive file as the project changes.

Edit the SConstruct to create an archive file of your project. Add new rules, update existing rules and add new variables to:

  • Use the SCons Tar builder to create a zipf_analysis.tar.gz archive that includes our code, data, plots, the Zipf summary table, and the SConstruct file.

  • Update the default targets list so that it creates zipf_analysis.tar.gz.

  • Print the values of any additional variables you have defined when scons --variables is called.

This SConstruct file and this Python module contain a simple solution to this challenge.

Archiving the SConstruct file

Why does the SCons task for the archive add the SConstruct file to our archive of code, data, plots, and the Zipf summary table?

Our code files (countwords.py, plotcounts.py, testzipf.py) implement the individual parts of our workflow. They allow us to create .dat files from .txt files, and results.txt and .png files from .dat files. Our SConstruct file, however, documents dependencies between our code, raw data, derived data, and plots, as well as implementing our workflow as a whole.

Key Points

  • SCons and SConscript files save time by automating repetitive work, and save thinking by documenting how to reproduce results.