HPC Workflow Management with Snakemake: All in One View

Content from Running commands with Snakemake

Last updated on 2024-05-02

Overview

Questions

“How do I run a simple command with Snakemake?”

Objectives

“Create a Snakemake recipe (a Snakefile)”

What is the workflow I’m interested in?

In this lesson we will make an experiment that takes an application which runs in parallel and investigate its scalability. To do that we will need to gather data, which in this case means running the application multiple times with different numbers of CPU cores and recording the execution time. Once we’ve done that we need to create a visualisation of the data to see how it compares against the ideal case.

From the visualisation we can then decide at what scale it makes most sense to run the application in production to maximise the use of our CPU allocation on the system.

We could do all of this manually, but there are useful tools to help us manage data analysis pipelines like we have in our experiment. Today we’ll learn about one of those: Snakemake.

To get started with Snakemake, let’s begin by taking a simple command and seeing how we can run it via Snakemake. Let’s choose the command hostname, which prints out the name of the host where the command is executed:

BASH

[ocaisa@node1 ~]$ hostname

OUTPUT

node1.int.jetstream2.hpc-carpentry.org

That prints out the result but Snakemake relies on files to know the status of your workflow, so let’s redirect the output to a file:

BASH

[ocaisa@node1 ~]$ hostname > hostname_login.txt

Making a Snakefile

Edit a new text file named Snakefile.

Contents of Snakefile:

PYTHON

rule hostname_login:
    output: "hostname_login.txt"
    input:  
    shell:
        "hostname > hostname_login.txt"

Callout

Key points about this file

The file is named Snakefile - with a capital S and no file extension.
Some lines are indented. Indents must be with space characters, not tabs. See the setup section for how to make your text editor do this.
The rule definition starts with the keyword rule followed by the rule name, then a colon.
We named the rule hostname_login. You may use letters, numbers or underscores, but the rule name must begin with a letter and may not be a keyword.
The keywords input, output, and shell are all followed by a colon (“:”).
The file names and the shell command are all in "quotes".
The output filename is given before the input filename. In fact, Snakemake doesn’t care what order they appear in but we give the output first throughout this course. We’ll see why soon.
In this use case there is no input file for the command so we leave this blank.

Back in the shell we’ll run our new rule. At this point, if there were any missing quotes, bad indents, etc., we may see an error.

BASH

snakemake -j1 -p hostname_login

Callout

`bash: snakemake: command not found...`

If your shell tells you that it cannot find the command snakemake then we need to make the software available somehow. In our case, this means searching for the module that we need to load:

BASH

module spider snakemake

OUTPUT

[ocaisa@node1 ~]$ module spider snakemake

--------------------------------------------------------------------------------------------------------
  snakemake:
--------------------------------------------------------------------------------------------------------
     Versions:
        snakemake/8.2.1-foss-2023a
        snakemake/8.2.1 (E)

Names marked by a trailing (E) are extensions provided by another module.


--------------------------------------------------------------------------------------------------------
  For detailed information about a specific "snakemake" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider snakemake/8.2.1
--------------------------------------------------------------------------------------------------------

Now we want the module, so let’s load it to make the package available:

BASH

[ocaisa@node1 ~]$ module load snakemake

and then make sure we have the snakemake command available:

BASH

[ocaisa@node1 ~]$ which snakemake

OUTPUT

/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/snakemake/8.2.1-foss-2023a/bin/snakemake

BASH

snakemake -j1 -p hostname_login

Challenge

Running Snakemake

Run snakemake --help | less to see the help for all available options. What does the -p option in the snakemake command above do?

Protects existing output files
Prints the shell commands that are being run to the terminal
Tells Snakemake to only run one process at a time
Prompts the user for the correct input file

Give me a hint

You can search in the text by pressing /, and quit back to the shell with q.

Show me the solution

Prints the shell commands that are being run to the terminal

This is such a useful thing we don’t know why it isn’t the default! The -j1 option is what tells Snakemake to only run one process at a time, and we’ll stick with this for now as it makes things simpler. Answer 4 is a total red herring, as Snakemake never prompts interactively for user input.

Key Points

“Before running Snakemake you need to write a Snakefile”
“A Snakefile is a text file which defines a list of rules”
“Rules have inputs, outputs, and shell commands to be run”
“You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule”

Content from Running Snakemake on the cluster

Last updated on 2024-12-05

Overview

Questions

“How do I run my Snakemake rule on the cluster?”

Objectives

“Define rules to run locally and on the cluster”

What happens when we want to make our rule run on the cluster rather than the login node? The cluster we are using uses Slurm, and it happens that Snakemake has built-in support for Slurm; we just need to tell it that we want to use it.

Snakemake uses the executor option to let you select the plugin that will execute the rule. The quickest way to apply this to your Snakefile is to give the option on the command line. Let’s try it out:

BASH

[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_login

OUTPUT

Building DAG of jobs...
Retrieving input from storage.
Nothing to be done (all requested files are present and up to date).

Nothing happened! Why not? When it is asked to build a target, Snakemake checks the ‘last modification time’ of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Snakemake knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
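
The timestamp comparison described above can be sketched in plain Python. This is a simplified illustration rather than Snakemake's actual implementation, which takes more than modification times into account; the function name `needs_rebuild` is made up for this example.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild if any dependency was modified after the target was written
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

For hostname_login.txt, which has no inputs, `needs_rebuild("hostname_login.txt", [])` returns False once the file exists, which corresponds to the "Nothing to be done" message.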

Callout

Incremental Builds Improve Efficiency

By only rebuilding files when required, Snakemake makes your processing more efficient.

Challenge

Running on the cluster

We now need another rule that executes hostname on the cluster. Create a new rule in your Snakefile and try to execute it on the cluster by passing the option --executor slurm to snakemake.

Show me the solution

The rule is almost identical to the previous rule save for the rule name and output file:

PYTHON

rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > hostname_remote.txt"

You can then execute the rule with

BASH

[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_remote

OUTPUT

Building DAG of jobs...
Retrieving input from storage.
Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
Provided remote nodes: 1
Job stats:
job                count
---------------  -------
hostname_remote        1
total                  1

Select jobs to execute...
Execute 1 jobs...

[Mon Jan 29 18:03:46 2024]
rule hostname_remote:
    output: hostname_remote.txt
    jobid: 0
    reason: Missing output files: hostname_remote.txt
    resources: tmpdir=<TBD>

hostname > hostname_remote.txt
No SLURM account given, trying to guess.
Guessed SLURM account: def-users
No wall time information given. This might or might not work on your cluster.
If not, specify the resource runtime in your rule or as a reasonable default
via --default-resources. No job memory information ('mem_mb' or
'mem_mb_per_cpu') is given - submitting without.
This might or might not work on your cluster.
Job 0 has been submitted with SLURM jobid 326 (log: /home/ocaisa/.snakemake/slurm_logs/rule_hostname_remote/326.log).
[Mon Jan 29 18:04:26 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-01-29T180346.788174.snakemake.log

Note all the warnings that Snakemake is giving us about the fact that the rule may not be able to execute on our cluster as we may not have given enough information. Luckily for us, this actually works on our cluster and we can take a look in the output file the new rule creates, hostname_remote.txt:

BASH

[ocaisa@node1 ~]$ cat hostname_remote.txt

OUTPUT

tmpnode1.int.jetstream2.hpc-carpentry.org

Snakemake profile

Adapting Snakemake to a particular environment can entail many flags and options. It is therefore possible to specify a configuration profile that supplies default options. This looks like:

BASH

snakemake --profile myprofileFolder ...

The profile folder must contain a file called config.yaml which is what will store our options. The folder may also contain other files necessary for the profile. Let’s create the file cluster_profile/config.yaml and insert some of our existing options:

YAML

printshellcmds: True
jobs: 3
executor: slurm

We should now be able to rerun our workflow by pointing to the profile rather than listing out the options. To force our workflow to rerun, we first need to remove the output file hostname_remote.txt, and then we can try out our new profile:

BASH

[ocaisa@node1 ~]$ rm hostname_remote.txt
[ocaisa@node1 ~]$ snakemake --profile cluster_profile hostname_remote
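
One way to think about the profile is as a set of command line options written down in a file. The sketch below shows that idea in plain Python. It is only an illustration (the helper `profile_to_args` is hypothetical and is not how Snakemake reads profiles internally), using the three settings from our config.yaml.

```python
def profile_to_args(profile):
    """Turn a dict of profile settings into equivalent command line arguments."""
    args = []
    for key, value in profile.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)  # boolean switches take no value
        elif value is False:
            continue  # disabled switches are simply omitted
        else:
            args.extend([flag, str(value)])
    return args


profile = {"printshellcmds": True, "jobs": 3, "executor": "slurm"}
print(" ".join(["snakemake"] + profile_to_args(profile)))
# prints: snakemake --printshellcmds --jobs 3 --executor slurm
```

The long options --printshellcmds and --jobs are the full names of the -p and -j flags we used earlier.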

The profile is extremely useful in the context of our cluster, as the Slurm executor has lots of options, and sometimes you need to use them to be able to submit jobs to the cluster you have access to. Unfortunately, the names of the options in Snakemake are not exactly the same as those of Slurm, so we need the help of a translation table:

SLURM	Snakemake	Description
`--partition`	`slurm_partition`	the partition a rule/job is to use
`--time`	`runtime`	the walltime per job in minutes
`--constraint`	`constraint`	may hold features on some clusters
`--mem`	`mem`, `mem_mb`	memory a cluster node must provide (`mem`: string with unit, `mem_mb`: int)
`--mem-per-cpu`	`mem_mb_per_cpu`	memory per reserved CPU
`--ntasks`	`tasks`	number of concurrent tasks / ranks
`--cpus-per-task`	`cpus_per_task`	number of cpus per task (in case of SMP, rather use `threads`)
`--nodes`	`nodes`	number of nodes
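
To make the translation concrete, here is a small Python sketch that turns a dictionary of Snakemake resources into sbatch-style flags using the table above. It is an illustration only: the real Slurm executor plugin does considerably more (unit handling, account guessing and so on), and the names `RESOURCE_TO_SBATCH` and `sbatch_flags` are invented for this example.

```python
# Snakemake resource name -> sbatch option, copied from the table above
RESOURCE_TO_SBATCH = {
    "slurm_partition": "--partition",
    "runtime": "--time",
    "constraint": "--constraint",
    "mem_mb": "--mem",
    "mem_mb_per_cpu": "--mem-per-cpu",
    "tasks": "--ntasks",
    "cpus_per_task": "--cpus-per-task",
    "nodes": "--nodes",
}


def sbatch_flags(resources):
    """Translate a dict of Snakemake resources into sbatch-style flags."""
    flags = []
    for name, value in resources.items():
        option = RESOURCE_TO_SBATCH.get(name)
        if option is not None:  # skip resources with no entry in the table
            flags.append(f"{option}={value}")
    return flags


print(sbatch_flags({"runtime": 2, "tasks": 4}))
# prints: ['--time=2', '--ntasks=4']
```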

The warnings given by Snakemake hinted that we may need to provide these options. One way to provide them is as part of the Snakemake rule using the keyword resources, e.g.,

PYTHON

rule:
    input: ...
    output: ...
    resources:
        slurm_partition = <partition name>,
        runtime = <some number>

and we can also use the profile to define default values for these options to use with our project, using the keyword default-resources. For example, the available memory on our cluster is about 4GB per core, so we can add that to our profile:

YAML

printshellcmds: True
jobs: 3
executor: slurm
default-resources:
  - mem_mb_per_cpu=3600

Challenge

We know that our problem runs in a very short time. Change the default length of our jobs to two minutes for Slurm.

Show me the solution

YAML

printshellcmds: True
jobs: 3
executor: slurm
default-resources:
  - mem_mb_per_cpu=3600
  - runtime=2
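
The idea behind default-resources is that the profile supplies a value wherever a rule does not set one itself. A minimal Python sketch of that precedence, assuming the `key=value` entries from our profile, might look as follows. The helper names are hypothetical and this is not Snakemake's own parsing code.

```python
def parse_default_resources(entries):
    """Parse entries such as 'runtime=2' into a dict, converting integers."""
    resources = {}
    for entry in entries:
        key, _, value = entry.partition("=")
        resources[key] = int(value) if value.isdigit() else value
    return resources


def effective_resources(defaults, rule_resources):
    """Rule-level values take precedence over profile defaults."""
    return {**defaults, **rule_resources}


defaults = parse_default_resources(["mem_mb_per_cpu=3600", "runtime=2"])
print(effective_resources(defaults, {"runtime": 10}))
# prints: {'mem_mb_per_cpu': 3600, 'runtime': 10}
```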

There are various sbatch options not directly supported by the resource definitions in the table above. You may use the slurm_extra resource to specify any of these additional flags to sbatch:

PYTHON

rule myrule:
    input: ...
    output: ...
    resources:
        slurm_extra="--mail-type=ALL --mail-user=<your email>"

Local rule execution

Our initial rule was to get the hostname of the login node. We always want to run that rule on the login node for that to make sense. If we tell Snakemake to run all rules via the Slurm executor (which is what we are doing via our new profile) this won’t happen any more. So how do we force the rule to run on the login node?

Well, when a Snakemake rule performs a trivial task (e.g., less than 1 minute worth of compute time), job submission might be overkill. As in our case, it would be a better idea to have these rules execute locally (i.e. where the snakemake command is run) instead of as a job. Snakemake lets you indicate which rules should always run locally with the localrules keyword. Let’s define hostname_login as a local rule near the top of our Snakefile.

PYTHON

localrules: hostname_login
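
Conceptually, localrules acts as a list of exceptions that is checked before a job is handed to the executor. The following Python sketch shows that decision in isolation; `choose_executor` is a made-up helper for illustration, not part of Snakemake.

```python
def choose_executor(rule_name, localrules, executor="slurm"):
    """Return where a rule would run: locally if it is listed in localrules."""
    return "local" if rule_name in localrules else executor


localrules = {"hostname_login"}
for rule_name in ["hostname_login", "hostname_remote"]:
    print(rule_name, "->", choose_executor(rule_name, localrules))
```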

Key Points

“Snakemake generates and submits its own batch scripts for your scheduler.”
“You can store default configuration settings in a Snakemake profile”
“localrules defines rules that are executed locally, and never submitted to a cluster.”

Content from Placeholders

Last updated on 2024-05-02

Overview

Questions

“How do I make a generic rule?”

Objectives

“See how Snakemake deals with some errors”

Our Snakefile has some duplication. For example, the names of text files are repeated in places throughout the Snakefile rules. Snakefiles are a form of code and, in any code, repetition can lead to problems (e.g. we rename a data file in one part of the Snakefile but forget to rename it elsewhere).

Callout

D.R.Y. (Don’t Repeat Yourself)

In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python, R, or Java, such as user-defined variables and functions are useful in part because they mean we don’t have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the “Don’t Repeat Yourself” principle or D.R.Y.

Let us set about removing some of the repetition from our Snakefile.

Placeholders

To make a more general-purpose rule we need placeholders. Let’s take a look at what a placeholder looks like:

PYTHON

rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > {output}"

As a reminder, here’s the previous version from the last episode:

PYTHON

rule hostname_remote:
    output: "hostname_remote.txt"
    input:
    shell:
        "hostname > hostname_remote.txt"

The new rule has replaced explicit file names with things in {curly brackets}, specifically {output} (but it could also have been {input}…if that had a value and were useful).

`{input}` and `{output}` are placeholders

Placeholders are used in the shell section of a rule, and Snakemake will replace them with appropriate values - {input} with the full name of the input file, and {output} with the full name of the output file – before running the command.

{resources} is also a placeholder, and we can access a named element of the {resources} with the notation {resources.runtime} (for example).
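
If you know Python's string formatting, placeholders will look familiar. The sketch below uses plain `str.format` to mimic the substitution. Snakemake's own formatting is similar in spirit but not identical, and `fill_placeholders` and the `SimpleNamespace` stand-in for resources are assumptions made for this example.

```python
from types import SimpleNamespace


def fill_placeholders(template, **values):
    """Substitute {name} and {name.attribute} fields in a command template."""
    return template.format(**values)


print(fill_placeholders("hostname > {output}", output="hostname_remote.txt"))
# prints: hostname > hostname_remote.txt

resources = SimpleNamespace(runtime=2)  # stand-in for a rule's resources
print(fill_placeholders("echo {resources.runtime}", resources=resources))
# prints: echo 2
```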

Key Points

“Snakemake rules are made more generic with placeholders”
“Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards”

Content from MPI applications and Snakemake

Last updated on 2024-05-02

Overview

Questions

“How do I run an MPI application via Snakemake on the cluster?”

Objectives

“Define rules to run locally and on the cluster”

Now it’s time to start getting back to our real workflow. We can execute a command on the cluster, but what about executing the MPI application we are interested in? Our application is called amdahl and is available as an environment module.

Challenge

Locate and load the amdahl module and then replace our hostname_remote rule with a version that runs amdahl. (Don’t worry about parallel MPI just yet, run it with a single CPU, mpiexec -n 1 amdahl).

Does your rule execute correctly? If not, look through the log files to find out why.

Show me the solution

BASH

module spider amdahl
module load amdahl

will locate and then load the amdahl module. We can then update/replace our rule to run the amdahl application:

PYTHON

rule amdahl_run:
    output: "amdahl_run.txt"
    input:
    shell:
        "mpiexec -n 1 amdahl > {output}"

However, when we try to execute the rule we get an error (unless you already have a different version of amdahl available in your path). Snakemake reports the location of the logs and if we look inside we can (eventually) find

OUTPUT

...
mpiexec -n 1 amdahl > amdahl_run.txt
--------------------------------------------------------------------------
mpiexec was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpiexec command
      line parameter option (remember that mpiexec interprets the first
      unrecognized command line token as the executable).

Node:       tmpnode1
Executable: amdahl
--------------------------------------------------------------------------
...

So, even though we loaded the module before running the workflow, our Snakemake rule didn’t find the executable. That’s because the Snakemake rule is running in a clean runtime environment, and we need to somehow tell it to load the necessary environment module before trying to execute the rule.

Snakemake and environment modules

Our application is called amdahl and is available on the system via an environment module, so we need to tell Snakemake to load the module before it tries to execute the rule. Snakemake is aware of environment modules, and these can be specified via (yet another) option:

PYTHON

rule amdahl_run:
    output: "amdahl_run.txt"
    input:
    envmodules:
      "mpi4py",
      "amdahl"
    shell:
        "mpiexec -n 1 amdahl > {output}"

Adding these lines is not enough to make the rule execute, however. Not only do you have to tell Snakemake what modules to load, but you also have to tell it to use environment modules in general (since the use of environment modules is considered to make your runtime environment less reproducible as the available modules may differ from cluster to cluster). This requires you to give Snakemake an additional option:

BASH

snakemake --profile cluster_profile --use-envmodules amdahl_run

Challenge

We’ll be using environment modules throughout the rest of the tutorial, so make that a default option of our profile (by setting its value to True).

Show me the solution

Update our cluster profile to

YAML

printshellcmds: True
jobs: 3
executor: slurm
default-resources:
  - mem_mb_per_cpu=3600
  - runtime=2
use-envmodules: True

If you want to test it, you need to erase the output file of the rule and rerun Snakemake.

Snakemake and MPI

We didn’t really run an MPI application in the last section as we only ran on one core. How do we request to run on multiple cores for a single rule?

Snakemake has general support for MPI, but the only executor that currently explicitly supports MPI is the Slurm executor (lucky for us!). If we look back at our Slurm to Snakemake translation table we notice the relevant options appear near the bottom:

SLURM	Snakemake	Description
…	…	…
`--ntasks`	`tasks`	number of concurrent tasks / ranks
`--cpus-per-task`	`cpus_per_task`	number of cpus per task (in case of SMP, rather use `threads`)
`--nodes`	`nodes`	number of nodes

The one we are interested in is tasks, as we are only going to increase the number of ranks. We can define these in a resources section of our rule and refer to them using placeholders:

PYTHON

rule amdahl_run:
    output: "amdahl_run.txt"
    input:
    envmodules:
      "amdahl"
    resources:
      mpi='mpiexec',
      tasks=2
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"

That worked but now we have a bit of an issue. We want to do this for a few different values of tasks, which means we would need a different output file for every run. It would be great if we could somehow indicate in the output the value that we want to use for tasks…and have Snakemake pick that up.

We could use a wildcard in the output to allow us to define the tasks we wish to use. The syntax for such a wildcard looks like:

PYTHON

output: "amdahl_run_{parallel_tasks}.txt"

where parallel_tasks is our wildcard.

Callout

Wildcards

Wildcards are used in the input and output lines of the rule to represent parts of filenames. Much like the * pattern in the shell, the wildcard can stand in for any text in order to make up the desired filename. As with naming your rules, you may choose any name you like for your wildcards, so here we used parallel_tasks. Using the same wildcards in the input and output is what tells Snakemake how to match input files to output files.

If two rules use a wildcard with the same name then Snakemake will treat them as different entities - rules in Snakemake are self-contained in this way.

In the shell line you can reference the wildcard with {wildcards.parallel_tasks}
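
To get a feel for how a requested filename is matched against an output pattern, here is a rough Python sketch that converts a pattern into a regular expression. It assumes each wildcard appears only once in the pattern and that a wildcard can match any non-empty text; the function names are invented for illustration and Snakemake's real matching has more features.

```python
import re


def pattern_to_regex(pattern):
    """Convert 'name_{wildcard}.txt' into a compiled regex with named groups."""
    # re.split with a capturing group alternates literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal part of the filename
        else:
            regex += f"(?P<{part}>.+)"  # wildcard: any non-empty text
    return re.compile(regex)


def match_wildcards(pattern, target):
    """Return a dict of wildcard values, or None if the target does not fit."""
    found = pattern_to_regex(pattern).fullmatch(target)
    return found.groupdict() if found else None


print(match_wildcards("amdahl_run_{parallel_tasks}.txt", "amdahl_run_2.txt"))
# prints: {'parallel_tasks': '2'}
```

Note that the matched value is a string, which is worth remembering whenever a wildcard is meant to represent a number.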

Snakemake order of operations

We’re only just getting started with some simple rules, but it’s worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases:

Prepares to run:
1. Reads in all the rule definitions from the Snakefile
Plans what to do:
1. Sees what file(s) you are asking it to make
2. Looks for a matching rule by looking at the outputs of all the rules it knows
3. Fills in the wildcards to work out the input for this rule
4. Checks that this input file (if required) is actually available
Runs the steps:
1. Creates the directory for the output file, if needed
2. Removes the old output file if it is already there
3. Only then, runs the shell command with the placeholders replaced
4. Checks that the command ran without errors and made the new output file as expected
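
The four checks in the final phase can be sketched in a few lines of Python. This is a simplified picture of the behaviour listed above, not Snakemake's code, and `run_step` is a hypothetical helper.

```python
import subprocess
from pathlib import Path


def run_step(command, output):
    """Run one shell command, following the four 'Runs the steps' checks."""
    out = Path(output)
    out.parent.mkdir(parents=True, exist_ok=True)  # 1. create the output directory
    out.unlink(missing_ok=True)  # 2. remove any old output file
    subprocess.run(command, shell=True, check=True)  # 3. run; raises if the command fails
    if not out.exists():  # 4. check the output file was made
        raise FileNotFoundError(f"command did not create {output}")
```

Removing the old output first means that a failed command can never leave a stale file behind that looks like a fresh result.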

Callout

Dry-run (`-n`) mode

It’s often useful to run just the first two phases, so that Snakemake will plan out the jobs to run, and print them to the screen, but never actually run them. This is done with the -n flag, e.g.:

BASH

snakemake -n ...

The amount of checking may seem pedantic right now, but as the workflow gains more steps this will become very useful to us indeed.

Using wildcards in our rule

We would like to use a wildcard in the output to allow us to define the number of tasks we wish to use. Based on what we’ve seen so far, you might imagine this could look like:

PYTHON

rule amdahl_run:
    output: "amdahl_run_{parallel_tasks}.txt"
    input:
    envmodules:
      "amdahl"
    resources:
      mpi="mpiexec",
      tasks="{parallel_tasks}"
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"

but there are two problems with this:

The only way for Snakemake to know the value of the wildcard is for the user to explicitly request a concrete output file (rather than call the rule):

BASH

  snakemake --profile cluster_profile amdahl_run_2.txt

This is perfectly valid, as Snakemake can figure out that it has a rule that can match that filename.

The bigger problem is that even doing that does not work. It seems we cannot use a wildcard for tasks:

OUTPUT

WorkflowError:
SLURM job submission failed. The error message was sbatch:
error: Invalid numeric value "{parallel_tasks}" for --ntasks.

Unfortunately for us, there is no direct way to access the wildcards for tasks. The reason for this is that Snakemake tries to use the value of tasks during its initialisation stage, which is before we know the value of the wildcard. We need to defer the determination of tasks to later on. This can be achieved by specifying an input function instead of a value for this scenario. The solution then is to write a one-time use function to manipulate Snakemake into doing this for us. Since the function is specifically for the rule, we can use a one-line function without a name. These kinds of functions are called either anonymous functions or lambda functions (both mean the same thing), and are a feature of Python (and other programming languages).

To define a lambda function in Python, the general syntax is as follows:

PYTHON

lambda x: x + 54

Since our function can take the wildcards as arguments, we can use that to set the value for tasks:

PYTHON

rule amdahl_run:
    output: "amdahl_run_{parallel_tasks}.txt"
    input:
    envmodules:
      "amdahl"
    resources:
      mpi="mpiexec",
      # No direct way to access the wildcard in tasks, so we need to do this
      # indirectly by declaring a short function that takes the wildcards as an
      # argument
      tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"

Now we have a rule that can be used to generate output from runs of an arbitrary number of parallel tasks.
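
To see why a function helps here, it can be useful to try the idea outside Snakemake. In the sketch below the resources are stored as written in the rule, and the function is only called later, once the wildcard values are known. The `resolve_resources` helper and the `SimpleNamespace` stand-in for the wildcards are assumptions made for this illustration.

```python
from types import SimpleNamespace

# The resources as written in the rule: one plain value, one function
resources = {
    "mpi": "mpiexec",
    "tasks": lambda wildcards: int(wildcards.parallel_tasks),
}


def resolve_resources(resources, wildcards):
    """Call any callable resource with the wildcards; keep plain values as-is."""
    return {
        name: value(wildcards) if callable(value) else value
        for name, value in resources.items()
    }


# Later, once a target such as amdahl_run_6.txt has been matched:
wildcards = SimpleNamespace(parallel_tasks="6")
print(resolve_resources(resources, wildcards))
# prints: {'mpi': 'mpiexec', 'tasks': 6}
```

The call to `int` matters because wildcard values are text taken from the filename, whereas the number of tasks needs to be a number.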

Callout

Comments in Snakefiles

In the above code, the lines beginning with # are comment lines. Hopefully you are already in the habit of adding comments to your own scripts. Good comments make any script more readable, and this is just as true with Snakefiles.

Since our rule is now capable of generating an arbitrary number of output files, things could get very crowded in our current directory. It’s probably best then to put the runs into a separate folder to keep things tidy. We can add the folder directly to our output and Snakemake will take care of directory creation for us:

PYTHON

rule amdahl_run:
    output: "runs/amdahl_run_{parallel_tasks}.txt"
    input:
    envmodules:
      "amdahl"
    resources:
      mpi="mpiexec",
      # No direct way to access the wildcard in tasks, so we need to do this
      # indirectly by declaring a short function that takes the wildcards as an
      # argument
      tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl > {output}"

Challenge

Create an output file (under the runs folder) for the case where we have 6 parallel tasks

(HINT: Remember that Snakemake needs to be able to match the requested file to the output from a rule)

Show me the solution

BASH

snakemake --profile cluster_profile runs/amdahl_run_6.txt

Another thing about our application amdahl is that we ultimately want to process the output to generate our scaling plot. The output right now is useful for reading but makes processing harder. amdahl has an option that actually makes this easier for us. To see the amdahl options we can use

BASH

[ocaisa@node1 ~]$ module load amdahl
[ocaisa@node1 ~]$ amdahl --help

OUTPUT

usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e]

options:
  -h, --help            show this help message and exit
  -p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
                        Parallel proportion should be a float between 0 and 1
  -w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
                        Total seconds of workload, should be an integer greater than 0
  -t, --terse           Enable terse output
  -e, --exact           Disable random jitter

The option we are looking for is --terse, and that will make amdahl print output in a format that is much easier to process, JSON. JSON format in a file typically uses the file extension .json so let’s add that option to our shell command and change the file format of the output to match our new command:

PYTHON

rule amdahl_run:
    output: "runs/amdahl_run_{parallel_tasks}.json"
    input:
    envmodules:
      "amdahl"
    resources:
      mpi="mpiexec",
      # No direct way to access the wildcard in tasks, so we need to do this
      # indirectly by declaring a short function that takes the wildcards as an
      # argument
      tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse > {output}"

There was another parameter for amdahl that caught my eye. amdahl has an option --parallel-proportion (or -p) which we might be interested in changing as it changes the behaviour of the code, and therefore has an impact on the values we get in our results. Let’s add another directory layer to our output format to reflect a particular choice for this value. We can use a wildcard so we don’t have to choose the value right away:

PYTHON

rule amdahl_run:
    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
    input:
    envmodules:
      "amdahl"
    resources:
      mpi="mpiexec",
      # No direct way to access the wildcard in tasks, so we need to do this
      # indirectly by declaring a short function that takes the wildcards as an
      # argument
      tasks=lambda wildcards: int(wildcards.parallel_tasks)
    shell:
        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"

Challenge

Create an output file for a value of -p of 0.999 (the default value is 0.8) for the case where we have 6 parallel tasks.

Show me the solution

BASH

snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json
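
To get a feel for why the parallel proportion matters, we can compute the ideal speedup that Amdahl's law predicts for a given proportion and number of tasks. This sketch assumes the amdahl application models that law, as its name and the -p option suggest. It is a back-of-the-envelope calculation, not a prediction of the timings you will measure on the cluster.

```python
def amdahl_speedup(parallel_proportion, tasks):
    """Ideal speedup predicted by Amdahl's law for a given parallel fraction."""
    serial = 1.0 - parallel_proportion
    return 1.0 / (serial + parallel_proportion / tasks)


# Compare the default proportion with the one used in the challenge
for p in (0.8, 0.999):
    print(p, round(amdahl_speedup(p, 6), 2))
```

With six tasks, a proportion of 0.8 caps the ideal speedup at about 3, while 0.999 gets close to the full factor of 6.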

Key Points

“Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target”
“Snakemake checks for various error conditions and will stop if it sees a problem”

Content from Chaining rules

Last updated on 2024-05-02

Overview

Questions

“How do I combine rules into a workflow?”
“How do I make a rule with multiple inputs and outputs?”

A pipeline of multiple rules

We now have a rule that can generate output for any value of -p and any number of tasks. We just need to call Snakemake with the parameters that we want:

BASH

snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json

That’s not exactly convenient though. To generate a full dataset we have to run Snakemake lots of times with different output file targets. Rather than that, let’s create a rule that can generate those files for us.

Chaining rules in Snakemake is a matter of choosing filename patterns that connect the rules. There’s something of an art to it - most times there are several options that will work:

PYTHON

rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input:  "p_{parallel_proportion}/runs/amdahl_run_6.json"
    shell:
        "echo {input} done > {output}"

Challenge

The new rule is doing no work; it’s just making sure we create the file we want. It’s not worth executing on the cluster. How do we ensure it runs on the login node only?

Show me the solution

We need to add the new rule to our localrules:

PYTHON

localrules: hostname_login, generate_run_files

Now let’s run the new rule (remember we need to request the output file by name as the output in our rule contains a wildcard pattern):

BASH

[ocaisa@node1 ~]$ snakemake --profile cluster_profile/ p_0.999_runs.txt

OUTPUT

Using profile cluster_profile/ for setting default command line arguments.
Building DAG of jobs...
Retrieving input from storage.
Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
Provided remote nodes: 3
Job stats:
job                   count
------------------  -------
amdahl_run                1
generate_run_files        1
total                     2

Select jobs to execute...
Execute 1 jobs...

[Tue Jan 30 17:39:29 2024]
rule amdahl_run:
    output: p_0.999/runs/amdahl_run_6.json
    jobid: 1
    reason: Missing output files: p_0.999/runs/amdahl_run_6.json
    wildcards: parallel_proportion=0.999, parallel_tasks=6
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954,
               tmpdir=<TBD>, mem_mb_per_cpu=3600, runtime=2, mpi=mpiexec, tasks=6

mpiexec -n 6 amdahl --terse -p 0.999 > p_0.999/runs/amdahl_run_6.json
No SLURM account given, trying to guess.
Guessed SLURM account: def-users
Job 1 has been submitted with SLURM jobid 342 (log: /home/ocaisa/.snakemake/slurm_logs/rule_amdahl_run/342.log).
[Tue Jan 30 17:47:31 2024]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Tue Jan 30 17:47:31 2024]
localrule generate_run_files:
    input: p_0.999/runs/amdahl_run_6.json
    output: p_0.999_runs.txt
    jobid: 0
    reason: Missing output files: p_0.999_runs.txt;
            Input files updated by another job: p_0.999/runs/amdahl_run_6.json
    wildcards: parallel_proportion=0.999
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954,
               tmpdir=/tmp, mem_mb_per_cpu=3600, runtime=2

echo p_0.999/runs/amdahl_run_6.json done > p_0.999_runs.txt
[Tue Jan 30 17:47:31 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-01-30T173929.781106.snakemake.log

Look at the logging messages that Snakemake prints in the terminal. What has happened here?

Snakemake looks for a rule to make p_0.999_runs.txt
It determines that “generate_run_files” can make this if parallel_proportion=0.999
It sees that the input needed is therefore p_0.999/runs/amdahl_run_6.json
Snakemake looks for a rule to make p_0.999/runs/amdahl_run_6.json
It determines that “amdahl_run” can make this if parallel_proportion=0.999 and parallel_tasks=6
Now that Snakemake has reached a rule whose inputs are all available (in this case, no input file is actually required), it runs both steps to get the final output
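
The resolution steps above can be imitated in a few lines of plain Python. This is only a toy sketch, not how Snakemake is implemented: the `RULES` table, `match_output` and `plan` are hypothetical names made up for illustration, every wildcard is assumed to match any non-empty text, and the sketch ignores whether files already exist on disk.

```python
import re

# Toy rule table mirroring the two rules discussed above (illustrative only).
RULES = {
    "generate_run_files": {
        "output": "p_{parallel_proportion}_runs.txt",
        "input": ["p_{parallel_proportion}/runs/amdahl_run_6.json"],
    },
    "amdahl_run": {
        "output": "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json",
        "input": [],
    },
}


def match_output(pattern, target):
    """Return the wildcard values that turn `pattern` into `target`, or None."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)        # literal piece of the filename
        else:
            regex += f"(?P<{part}>.+)"      # wildcard: any non-empty text
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


def plan(target, jobs=None):
    """Work backwards from `target` and return the jobs in execution order."""
    jobs = [] if jobs is None else jobs
    for name, rule in RULES.items():
        wildcards = match_output(rule["output"], target)
        if wildcards is None:
            continue
        # Resolve the inputs first, so they are scheduled before this job.
        for input_pattern in rule["input"]:
            plan(input_pattern.format(**wildcards), jobs)
        jobs.append((name, wildcards))
        return jobs
    raise ValueError(f"No rule produces {target}")


jobs = plan("p_0.999_runs.txt")
for name, wildcards in jobs:
    print(name, wildcards)
```

The printed order matches the log above: the amdahl_run job comes first, then generate_run_files, even though the request named only the final file.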

This, in a nutshell, is how we build workflows in Snakemake.

Define rules for all the processing steps
Choose input and output naming patterns that allow Snakemake to link the rules
Tell Snakemake to generate the final output file(s)

If you are used to writing regular scripts this takes a little getting used to. Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.

Callout

Outputs first?

The Snakemake approach of working backwards from the desired output to determine the workflow is why we’re putting the output lines first in all our rules - to remind us that these are what Snakemake looks at first!

Many users of Snakemake, and indeed the official documentation, prefer to have the input first, so in practice you should use whatever order makes sense to you.

Callout

`log` outputs in Snakemake

Snakemake has a dedicated rule field for outputs that are log files, and these are mostly treated as regular outputs except that log files are not removed if the job produces an error. This means you can look at the log to help diagnose the error. In a real workflow this can be very useful, but in terms of learning the fundamentals of Snakemake we’ll stick with regular input and output fields here.

Callout

Errors are normal

Don’t be disheartened if you see errors when first testing your new Snakemake pipelines. There is a lot that can go wrong when writing a new workflow, and you’ll normally need several iterations to get things just right. One advantage of the Snakemake approach compared to regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on and potentially running junk calculations on partial or corrupted data. Another advantage is that when a step fails we can safely resume from where we left off.
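
The "fail fast" idea can be pictured with a small plain-Python sketch: after a step runs, check that the file it promised actually exists, and stop if it does not. The helper names `run_step` and `write_file` are hypothetical and exist only for this illustration; Snakemake performs its own, more thorough checks.

```python
import os
import tempfile


def run_step(action, expected_output):
    """Run `action`, then fail unless `expected_output` exists afterwards."""
    action()
    if not os.path.exists(expected_output):
        raise RuntimeError(f"Missing output file: {expected_output}")
    return expected_output


def write_file(path):
    """Stand-in for a command that really does create its output."""
    with open(path, "w") as handle:
        handle.write("done\n")


workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "good.txt")
bad = os.path.join(workdir, "bad.txt")

# This step creates the file it promised, so it passes the check.
run_step(lambda: write_file(good), good)

# This step "runs" without error but never writes its output.
try:
    run_step(lambda: None, bad)
    failed = False
except RuntimeError as err:
    failed = True
    print(err)
```

Because the second step is reported as failed, nothing downstream would be run on a file that does not exist.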

Key Points

“Snakemake links rules by iteratively looking for rules that make missing inputs”
“Rules may have multiple named inputs and/or outputs”
“If a shell command does not yield an expected output then Snakemake will regard that as a failure”

Content from Processing lists of inputs

Last updated on 2024-06-20

Overview

Questions

“How do I process multiple files at once?”
“How do I combine multiple files together?”

Objectives

“Use Snakemake to process all our samples at once”
“Make a scalability plot that brings our results together”

We created a rule that can generate a single output file, but we’re not going to write a separate rule for every output file. We want to generate all of the run files with a single rule, and Snakemake can indeed take a list of input files:

PYTHON

rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input:  "p_{parallel_proportion}/runs/amdahl_run_2.json", "p_{parallel_proportion}/runs/amdahl_run_6.json"
    shell:
        "echo {input} done > {output}"

That’s great, but we don’t want to have to list all of the files we’re interested in individually. How can we do this?

Defining a list of samples to process

To do this, we can define some lists as Snakemake global variables.

Global variables should be added before the rules in the Snakefile.

PYTHON

# Task sizes we wish to run
NTASK_SIZES = [1, 2, 3, 4, 5]

Unlike with variables in shell scripts, we can put spaces around the = sign, but they are not mandatory.
The list items are enclosed in square brackets and comma-separated. If you know any Python you’ll recognise this as Python list syntax.
A good convention is to use capitalized names for these variables, but this is not mandatory.
Although these are referred to as variables, you can’t actually change the values once the workflow is running, so lists defined this way are more like constants.

Using a Snakemake rule to define a batch of outputs

Now let’s update our Snakefile to leverage the new global variable and create a list of files:

PYTHON

rule generate_run_files:
    output: "p_{parallel_proportion}_runs.txt"
    input:  expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
    shell:
        "echo {input} done > {output}"

The expand(...) function in this rule generates a list of filenames by taking its first argument as a template and replacing {count} with each value in NTASK_SIZES. Since there are 5 elements in the list, this yields the 5 files we want to make. Note that we had to protect our wildcard with a second set of curly braces so that it wouldn’t be interpreted as something that needed to be expanded.
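
To see what the doubled braces achieve, here is a plain-Python stand-in for this particular call. It uses ordinary string formatting rather than Snakemake's own expand(), so treat it as an approximation of the behaviour for this one template, not a replacement for the real function.

```python
NTASK_SIZES = [1, 2, 3, 4, 5]

# Same template as in the rule: {count} is filled in now, while the doubled
# braces around parallel_proportion survive as a single-brace wildcard.
template = "p_{{parallel_proportion}}/runs/amdahl_run_{count}.json"

inputs = [template.format(count=count) for count in NTASK_SIZES]
for path in inputs:
    print(path)
```

Each resulting filename still contains {parallel_proportion}, which Snakemake fills in later from the requested output file.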

In our current case we still rely on the file name to define the value of the wildcard parallel_proportion, so we can’t call the rule directly. We still need to request a specific file:

BASH

snakemake --profile cluster_profile/ p_0.999_runs.txt

If you don’t specify a target rule name or any file names on the command line when running Snakemake, the default is to use the first rule in the Snakefile as the target.

Callout

Rules as targets

Giving the name of a rule to Snakemake on the command line only works when that rule has no wildcards in the outputs, because Snakemake has no way to know what the desired wildcards might be. You will see the error “Target rules may not contain wildcards.” This can also happen when you don’t supply any explicit targets on the command line at all, and Snakemake tries to run the first rule defined in the Snakefile.

Rules that combine multiple inputs

Our generate_run_files rule is a rule which takes a list of input files. The length of that list is not fixed by the rule, but can change based on NTASK_SIZES.

In our workflow the final step is to take all the generated files and combine them into a plot. To do that, you may have heard that some people use a Python library called matplotlib. It’s beyond the scope of this tutorial to write the Python script to create a final plot, so we provide you with the script as part of this lesson. You can download it with

BASH

curl -O https://ocaisa.github.io/hpc-workflows/files/plot_terse_amdahl_results.py

The script plot_terse_amdahl_results.py needs a command line that looks like:

BASH

python plot_terse_amdahl_results.py --output <output image filename> <1st input file> <2nd input file> ...

Let’s introduce that into our generate_run_files rule, renaming the output so that it has an image file extension:

PYTHON

rule generate_run_files:
    output: "p_{parallel_proportion}_scalability.jpg"
    input:  expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
    shell:
        "python plot_terse_amdahl_results.py --output {output} {input}"

Challenge

This script relies on matplotlib. Is it available as an environment module? Add this requirement to our rule.

Show me the solution

PYTHON

rule generate_run_files:
    output: "p_{parallel_proportion}_scalability.jpg"
    input:  expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
    envmodules:
        "matplotlib"
    shell:
        "python plot_terse_amdahl_results.py --output {output} {input}"

Now we finally get to generate a scaling plot! Run the final Snakemake command:

BASH

snakemake --profile cluster_profile/ p_0.999_scalability.jpg
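
To see roughly what the rule asks the shell to run for this target, the sketch below fills in the placeholders by hand. It assumes NTASK_SIZES = [1, 2, 3, 4, 5] and uses plain Python string formatting as a stand-in for Snakemake's own substitution. A list of inputs ends up as space-separated filenames, which is the form the plotting script expects.

```python
NTASK_SIZES = [1, 2, 3, 4, 5]
parallel_proportion = "0.999"  # taken from the requested filename

output = f"p_{parallel_proportion}_scalability.jpg"
inputs = [
    f"p_{parallel_proportion}/runs/amdahl_run_{count}.json"
    for count in NTASK_SIZES
]

# The shell line from the rule, with {input} replaced by the joined list.
shell_template = "python plot_terse_amdahl_results.py --output {output} {input}"
command = shell_template.format(output=output, input=" ".join(inputs))
print(command)
```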

Challenge

Generate the scalability plot for all values from 1 to 10 cores.

Show me the solution

PYTHON

NTASK_SIZES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
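
Since global variables in a Snakefile are ordinary Python, the same list could also be generated rather than typed out. This is a small alternative sketch; note that range stops before its second argument, so 11 is needed to include 10.

```python
# Equivalent to writing out [1, 2, ..., 10] by hand.
NTASK_SIZES = list(range(1, 11))
print(NTASK_SIZES)
```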

Challenge

Rerun the workflow for a p value of 0.8.

Show me the solution

BASH

snakemake --profile cluster_profile/ p_0.8_scalability.jpg

Discussion

Bonus round

Create a final rule that can be called directly and generates a scaling plot for 3 different values of p.

Key Points

“Use the expand() function to generate lists of filenames you want to combine”
“Any {input} to a rule can be a variable-length list”
