Chaining rules


Overview

Questions

  • How do I combine rules into a workflow?
  • How do I make a rule with multiple inputs and outputs?

Objectives

  • Use Snakemake to filter and then count the lines in a FASTQ file
  • Add an RNA quantification step in the data analysis
  • See how Snakemake deals with missing outputs

For reference, this is the Snakefile you should have to start the episode.

A pipeline of multiple rules


We now have a “trimreads” rule and a “countreads” rule. Following the previous episode, the contents of the Snakefile should be:

# New generic read counter
rule countreads:
    output: "{myfile}.fq.count"
    input:  "reads/{myfile}.fq"
    shell:
        "echo $(( $(wc -l <{input}) / 4 )) > {output}"

# Trim any FASTQ reads for base quality
rule trimreads:
    output: "trimmed/{myfile}.fq"
    input:  "reads/{myfile}.fq"
    shell:
        "fastq_quality_trimmer -t 20 -l 100 -o {output} <{input}"

The problem is that there is no good way to use these rules together, that is, to trim an input file and then count the reads in the trimmed file. The countreads rule only takes its input from the reads directory, whereas the trimreads rule puts all of its output into the trimmed directory.

Chaining rules in Snakemake is a matter of choosing filename patterns that connect the rules. There’s something of an art to it - most of the time there are several options that will work. Consider the following alternative version of the countreads rule:

# New even-more-generic read counter
rule countreads:
    output: "{indir}.{myfile}.fq.count"
    input:  "{indir}/{myfile}.fq"
    shell:
        "echo $(( $(wc -l <{input}) / 4 )) > {output}"

Now, the rule no longer requires the input files to be in the “reads” directory. The directory name has been replaced by the {indir} wildcard. We can request Snakemake to create a file following this output pattern:

BASH

$ snakemake -j1 -F -p trimmed.ref1_1.fq.count
$ cat trimmed.ref1_1.fq.count
14278

Look at the logging messages that Snakemake prints in the terminal. What has happened here?

  1. Snakemake looks for a rule to make trimmed.ref1_1.fq.count
  2. It determines that “countreads” can make this if indir=trimmed and myfile=ref1_1
  3. It sees that the input needed is therefore trimmed/ref1_1.fq

  4. Snakemake looks for a rule to make trimmed/ref1_1.fq
  5. It determines that “trimreads” can make this if myfile=ref1_1
  6. It sees that the input needed is therefore reads/ref1_1.fq

  7. Now that Snakemake has reached an available input file, it runs both steps to get the final output

Here’s a visual representation of this process:

A visual representation of the above process showing the rule definitions, with arrows added to indicate the order wildcards and placeholders are substituted. Blue arrows start from the final target at the top, which is the file trimmed.ref1_1.fq.count, then point down from components of the filename to wildcards in the output of the countreads rule. Arrows from the input of this rule go down to the output of the trimreads rule. Orange arrows then track back up through the shell parts of both rules, where the placeholders are, and finally back to the target output filename at the top.
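If you want to see this plan without actually running anything, you can add the -n flag to request a dry run. The exact report varies between Snakemake versions, but it will list the trimreads and countreads jobs that would be run, along with the wildcard values Snakemake has filled in:

BASH

$ snakemake -n -F -p trimmed.ref1_1.fq.count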

This, in a nutshell, is how we build workflows in Snakemake.

  1. Define rules for all the processing steps
  2. Choose input and output naming patterns that allow Snakemake to link the rules
  3. Tell Snakemake to generate the final output files

If you are used to writing regular scripts this takes a little getting used to. Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.

Outputs first?

The Snakemake approach of working backwards from the desired output to determine the workflow is why we’re putting the output lines first in all our rules - to remind us that these are what Snakemake looks at first!

Many users of Snakemake, and indeed the official documentation, prefer to have the input first, so in practice you should use whatever order makes sense to you.
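For example, the generic countreads rule above works just the same if we put the input line first - Snakemake treats the two forms identically:

rule countreads:
    input:  "{indir}/{myfile}.fq"
    output: "{indir}.{myfile}.fq.count"
    shell:
        "echo $(( $(wc -l <{input}) / 4 )) > {output}"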

Thinking about your own workflows

Think about any data processing task you have done yourself, and write down three or four steps from that workflow.

What were the inputs to, and outputs from, each step?

How did the steps connect up, in terms of data going from one to the next? You may want to sketch this out and use arrows to indicate the linkages between the steps.


"Boil Water" : input=cold water, output=hot water

"Brew Tea" : input=[hot water, tea bag, mug], output=tea infusion

"Add Milk And Sugar" : input=[tea infusion, milk, sugar], output=cuppa

Adding an alignment step to the pipeline


Let’s add another rule to our Snakefile. The reads we have are from a yeast RNA-seq experiment so we might reasonably want to quantify transcript abundance using the kallisto aligner. The command to do so looks like this:

BASH

$ kallisto quant -i index_file -o output_dir in_1.fastq in_2.fastq

This command has three input files:

  1. The transcriptome index
  2. The first of the paired FASTQ files
  3. The second of the paired FASTQ files

And it produces a directory of output files. According to the Kallisto manual this directory will have three output files in it:

  1. abundance.h5
  2. abundance.tsv
  3. run_info.json

We’ll not worry about what the contents of these files mean just now, or how Kallisto generates them. We just know that we want to run the kallisto quant command and have it make the output files, and the output files are going to be useful once we add later steps in the analysis.

Making a rule with multiple inputs and outputs like this works much like the previous rules.

rule kallisto_quant:
    output:
        h5   = "kallisto.{sample}/abundance.h5",
        tsv  = "kallisto.{sample}/abundance.tsv",
        json = "kallisto.{sample}/run_info.json",
    input:
        index = "Saccharomyces_cerevisiae.R64-1-1.kallisto_index",
        fq1   = "trimmed/{sample}_1.fq",
        fq2   = "trimmed/{sample}_2.fq",
    shell:
        "kallisto quant -i {input.index} -o kallisto.{wildcards.sample} {input.fq1} {input.fq2}"

There are many things to note here:

  1. The individual input and output files are given names using the = syntax.
  2. Each of these lines must end with a , (optional for the last one).
  3. In the shell part, the input placeholders are now like {input.name}.
  4. We’ve chosen to only quantify the trimmed version of the reads.
  5. We’ve used the wildcard name {sample} rather than {myfile} because this will match only the sample name, e.g. ref1, not ref1_1. Snakemake doesn’t care what name we use, but carefully chosen names make for more readable rules.
  6. Because kallisto quant only takes the output directory name, we’ve used the placeholder {wildcards.sample} rather than {output}, which would give the full file names. (An alternative using a params field is sketched just after this list.)
  7. We don’t actually have the {input.index} file yet. This will need to be created using the kallisto index command.
  8. If the number of input or output files had been variable, we’d need a slightly different approach. We’ll come on to this in a later episode.
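As mentioned in point 6, Snakemake also has a params section where you can define extra values for the shell command that are not input or output files; wildcards are substituted in these too, and they are referenced as {params.name}. A version of the rule using this approach might look like the sketch below - it behaves the same as the rule above, so which form you prefer is a matter of taste.

rule kallisto_quant:
    output:
        h5   = "kallisto.{sample}/abundance.h5",
        tsv  = "kallisto.{sample}/abundance.tsv",
        json = "kallisto.{sample}/run_info.json",
    params:
        outdir = "kallisto.{sample}",
    input:
        index = "Saccharomyces_cerevisiae.R64-1-1.kallisto_index",
        fq1   = "trimmed/{sample}_1.fq",
        fq2   = "trimmed/{sample}_2.fq",
    shell:
        "kallisto quant -i {input.index} -o {params.outdir} {input.fq1} {input.fq2}"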

Even though the rule is not going to work without the index, we can still run it to check that Snakemake is happy with the rule definition.

Running Kallisto on all replicates

If you know about the Kallisto software, you may be thinking about running Kallisto on all replicates of the condition at once. We’ll look at this later in the course, but for now assume that Kallisto is run once per sample, i.e. once for each pair of FASTQ files.

Running the kallisto_quant rule

Given that the index input is missing, what would you expect Snakemake to do if the new rule was run now?

Try it by telling Snakemake to run the new rule on the files ref1_1.fq and ref1_2.fq. Since the rule defines multiple outputs, asking for any one of the output files will be enough.

BASH

$ snakemake -j1 -F -p kallisto.ref1/abundance.h5

The resulting error is “Missing input files”. Note that Snakemake does not try to run kallisto at all; it stops while still making the work plan, because it sees that a required input file is missing.
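In fact, you would see the same error even from a dry run (add the -n flag), which shows that the check happens while Snakemake is planning the jobs, before anything is executed:

BASH

$ snakemake -n -F -p kallisto.ref1/abundance.h5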

Building the index

Instruct Snakemake how to build the genome index as part of the pipeline by adding another rule. The command we need to run is:

BASH

$ kallisto index -i index_file_to_make fasta_file_to_index

The file to be indexed is transcriptome/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz. As there is only one input to the rule you don’t have to give it a name, but you may do so if you like.

Make it so that the output printed by the program is captured to a file, and therefore your rule will have two separate outputs: the index file and the log file. Note that the program prints messages on stderr, so you will need to use >& rather than > to capture the output.

The rule you write could look something like this, but there are many variations that will work just as well. Since there is only one transcriptome in the project, you may feel that use of the {strain} wildcard is overkill, but who’s to say we might not want to use another in future?

rule kallisto_index:
    output:
        idx = "{strain}.kallisto_index",
        log = "{strain}.kallisto_log",
    input:
        fasta = "transcriptome/{strain}.cdna.all.fa.gz"
    shell:
        "kallisto index -i {output.idx} {input.fasta} >& {output.log}"

Log outputs in Snakemake

Snakemake has a dedicated rule field for outputs that are log files, and these are mostly treated as regular outputs, except that log files are not removed if the job produces an error. This means you can look at the log to help diagnose the error. In a real workflow this can be very useful, but in terms of learning the fundamentals of Snakemake we’ll stick with regular input and output fields here.
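For reference, a version of the kallisto_index rule using the dedicated log field could look like the sketch below. We won’t use this form in the lesson, but it is worth knowing about:

rule kallisto_index:
    output:
        idx = "{strain}.kallisto_index",
    input:
        fasta = "transcriptome/{strain}.cdna.all.fa.gz"
    log: "{strain}.kallisto_log"
    shell:
        "kallisto index -i {output.idx} {input.fasta} >& {log}"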

Dealing with a missing files error


All being well, your new rules are now ready to run Kallisto.

BASH

$ snakemake -j1 -F -p kallisto.ref1/abundance.h5
...lots of output...
4 of 4 steps (100%) done
Complete log: /home/zenmaster/data/yeast/.snakemake/log/2021-04-23T142649.632834.snakemake.log

We’ll end the episode by looking at a common problem that can arise if you mistype a file name in a rule. Remember that we wrote the rule based on the expected output filenames given in the Kallisto manual. In an older version of this manual there was a typo where the file names were incorrectly given as abundances.h5 and abundances.tsv, with the extra s on each.

It may seem silly to break the workflow when we just got it working, but it will be instructive, so edit the Snakefile and change these names to the incorrect versions.

rule kallisto_quant:
    output:
        h5   = "kallisto.{sample}/abundances.h5",
        tsv  = "kallisto.{sample}/abundances.tsv",
        json = "kallisto.{sample}/run_info.json",
...

To keep things tidy, this time we’ll manually remove the output directory.

BASH

$ rm -rvf kallisto.ref1

And re-run. Note that the output file name you’ll need to use on the command line must match the edited Snakefile, or you will get a MissingRuleException.

BASH

$ snakemake -j1 -F -p kallisto.ref1/abundances.h5

...
kallisto quant -i Saccharomyces_cerevisiae.R64-1-1.kallisto_index -o kallisto.ref1 trimmed/ref1_1.fq trimmed/ref1_2.fq

[quant] fragment length distribution will be estimated from the data
...more Kallisto output...
[   em] the Expectation-Maximization algorithm ran for 265 rounds

Waiting at most 5 seconds for missing files.
MissingOutputException in line 24 of /home/zenmaster/data/yeast/Snakefile:
Job Missing files after 5 seconds:
kallisto.ref1/abundances.h5
kallisto.ref1/abundances.tsv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
  File "/opt/python3.7/site-packages/snakemake/executors/__init__.py", line 583, in handle_job_success
  File "/opt/python3.7/site-packages/snakemake/executors/__init__.py", line 259, in handle_job_success
Removing output files of failed job kallisto_quant since they might be corrupted:
kallisto.ref1/run_info.json
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/zenmaster/data/yeast/.snakemake/log/2021-04-23T142649.632834.snakemake.log

There’s a lot to take in here. Some of the messages are very informative. Some less so.

  1. Snakemake did actually run kallisto, as evidenced by the output from kallisto that we see on the screen.
  2. There is no obvious error message being reported by kallisto.
  3. Snakemake complains that some expected output files are missing: kallisto.ref1/abundances.h5 and kallisto.ref1/abundances.tsv.
  4. The third expected output file, kallisto.ref1/run_info.json, was found but has now been removed by Snakemake.
  5. Snakemake suggests this might be due to “filesystem latency”.

This last point is a red herring. “Filesystem latency” is not an issue here, and never will be since we are not using a network filesystem. We know what the problem is, as we deliberately caused it, but to diagnose an unexpected error like this we would investigate further by looking at the kallisto.ref1 subdirectory.

BASH

$ ls kallisto.ref1/
abundance.h5  abundance.tsv

Remember that Snakemake itself does not create any output files. It just runs the commands you put in the shell sections, then checks to see if all the expected output files have appeared.

So if the file names created by kallisto are not exactly the same as those in the Snakefile you will get this error. In this case, some of the output files are present, but the one that was named correctly in the Snakefile (run_info.json) has been cleaned up by Snakemake.

Errors are normal

Don’t be disheartened if you see errors like the one above when first testing your new Snakemake pipelines. There is a lot that can go wrong when writing a new workflow, and you’ll normally need several iterations to get things just right. One advantage of the Snakemake approach compared to regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on and potentially running junk calculations on partial or corrupted data. Another advantage is that when a step fails we can safely resume from where we left off, as we’ll see in the next episode.

Finally, edit the names in the Snakefile back to the correct version and re-run to confirm that all is well. Assure yourself that the rules are still generic by processing the temp33_1 sample too:

BASH

$ snakemake -j1 -F -p kallisto.ref1/abundance.h5  kallisto.temp33_1/abundance.h5

For reference, this is a Snakefile incorporating the changes made in this episode.

Key Points

  • Snakemake links rules by iteratively looking for rules that make missing inputs
  • Rules may have multiple named inputs and/or outputs
  • If a shell command does not yield an expected output then Snakemake will regard that as a failure