Handling awkward programs
Last updated on 2024-10-07 | Edit this page
Estimated time: 80 minutes
Overview
Questions
- How do I handle tools which don’t let me specify output file names?
- How do I define a rule where the output is a directory?
Objectives
- Add the FastQC tool to the pipeline
- Understand some different choices available when defining a rule
- Learn about the
directory()
function - Add multiple command lines to the
shell
section of a rule
For reference, this is the Snakefile you should have to start the episode.
Introducing FastQC
FastQC is a popular tool for scanning FASTQ files and producing a selection of quality plots. It’s “fast” in that it runs quickly, and also in that you can normally use the default options so there is no configuration needed.
The program can be run interactively or in batch mode, where it saves out results as an HTML file plus a ZIP file. We’ll obviously need to use the batch mode to include it as part of our workflow, and we’ll see shortly that this presents a minor problem.
In general, real bioinformatics programs like FastQC can have quirks like:
- Not allowing you to specify input or output file names
- Outputting a whole directory of files
- Creating temporary files in various places
- Not supporting robust error handling
With a little care we can handle all these problems while still keeping our workflow tidy.
Adding a FastQC rule
We’ll first run FastQC on a single input file and see what is produced.
BASH
$ fastqc reads/ref_1_1.fq
$ ls -ltr reads
...
-rw-r--r-- 1 zenmaster users 464852 Jun 9 14:31 ref_1_1_fastqc.zip
-rw-r--r-- 1 zenmaster users 654810 Jun 9 14:31 ref_1_1_fastqc.html
It’s possible to supply multiple input files, but the resulting
output is exactly the same as if processing the files one at a time. Two
files are produced for each FASTQ file, and these files appear in the
same directory as the input file. The fastqc
command does
not let us choose the output filenames, but we can set an alternative
output directory.
BASH
$ fastqc --help
...
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
...
For the countreads
rule we wrote earlier (see episode 4), we chose our preferred
output file name pattern so as to allow us to effectively link rules.
This gives us a rule that can count both trimmed and
untrimmed reads.
# Our existing countreads rule, for reference...
rule countreads:
output: "{indir}.{myfile}.fq.count"
input: "{indir}/{myfile}.fq"
shell:
"echo $(( $(wc -l <{input}) / 4 )) > {output}"
To do the same with FastQC, to report on both the trimmed and untrimmed reads, we have various options:
- Work with the default file names produced by FastQC and leave the reports in the same directory with the FASTQ files.
- Make the outputs in a new directory named, eg. “reads.fastqc.ref_1_1/” (similar to what we did with Kallisto).
- Do this, but tell Snakemake that the directory itself is the output.
- Force our preferred naming convention by renaming the FastQC output files within the rule.
We’ll try all four, and see where this gets us.
Option 1: Working with the default FastQC output files
Adding a FastQC rule using the default output file names
Fill in the ???
to make a working rule for FastQC where
indir
may be “reads” or “trimmed”. Do not change the shell
command or input pattern at all. Remember FastQC always makes two output
files, so add two named outputs.
rule fastqc:
output:
???
input: "{indir}/{myfile}.fq"
shell:
"fastqc {input}"
Since the shell
command is not to be changed, the output
names will be dictated by FastQC as we saw when running the command
directly in the terminal.
rule fastqc:
output:
html = "{indir}/{myfile}_fastqc.html",
zip = "{indir}/{myfile}_fastqc.zip",
input: "{indir}/{myfile}.fq"
shell:
"fastqc {input}"
This rule contains wildcards, so in order to run it you specify one or more target output files:
This rule is fine, but maybe we don’t want to put the reports in with the sequences. As a general principle, when writing Snakemake rules, we want to be in charge of output file names. FastQC lets us specify the output directory, so we can use that…
Option 2: Using the default output filenames in a new directory
A FastQC rule where the output files go into a new directory
Modify the rule so that the output files go into a new directory.
This will be very similar to the rule for kallisto quant
added in episode 3.
For example, when running on the file “trimmed/ref_1_1.fq” the outputs should be
OUTPUT
trimmed.fastqc.ref_1_1/ref_1_1_fastqc.html
trimmed.fastqc.ref_1_1/ref_1_1_fastqc.zip
This involves using the {myfile} wildcard twice and then constructing
the output directory name to place in the -o
option to
fastqc.
rule fastqc:
output:
html = "{indir}.fastqc.{myfile}/{myfile}_fastqc.html",
zip = "{indir}.fastqc.{myfile}/{myfile}_fastqc.zip",
input: "{indir}/{myfile}.fq"
shell:
"fastqc -o {wildcards.indir}.fastqc.{wildcards.myfile} {input}"
Option 3: Using a directory()
output
Our next option is to tell Snakemake not to worry about the individual files at all, and just say that the output of the rule is a whole directory. This makes the rule definition simpler.
We’ll amend the fastqc rule like so:
rule fastqc:
output: directory("{indir}.fastqc.{myfile}")
input: "{indir}/{myfile}.fq"
shell:
"fastqc -o {output} {input}"
Only use directory()
on outputs,
not inputs
You only use the directory()
declaration for outputs.
Any input to a rule may be a directory without the need for any special
syntax.
On running this, we get an error.
BASH
$ snakemake -j1 -p reads.fastqc.ref_1_1
...
Specified output directory 'reads.fastqc.ref_1_1' does not exist
...
This error is being printed by FastQC. FastQC requires that the
output directory must exist. (Other programs might insist that the
output directory does not already exist.) The error can be
rectified by making the directory explicitly in the shell
code.
rule fastqc:
output: directory("{indir}.fastqc.{myfile}")
input: "{indir}/{myfile}.fq"
shell:
"mkdir {output} ; fastqc -o {output} {input}"
The rationale for making output directories
Remember that in most cases it is not necessary to manually create
directories because Snakemake will auto-create the directory for every
output file listed by a rule. When using a directory()
output, Snakemake will not create the directory itself but most
applications will make the directory for you. FastQC is an exception.
The best approach is to only add a mkdir
command if you
first test the rule without it and see an error.
The modified rule works because the shell
part of the
rule can contain a whole script, with multiple commands to be run. Above
we used a semicolon to split the commands on one line. For putting
multiple lines into a shell
section there is a special
quoting syntax.
rule fastqc:
output: directory("{indir}.fastqc.{myfile}")
input: "{indir}/{myfile}.fq"
shell:
"""mkdir {output}
fastqc -o {output} {input}
"""
The “triple quoting” syntax comes from Python. Not only does it allow
multiple lines to be added within the quotes but it also allows you to
embed both single and double quotes into the shell commands. For a
further discussion of string quoting and a way to disable the
interpretation of “backslash escapes” like \n
and
\t
see the extra
episode on quoting
This rule is also fine, but because the individual files are not explicitly named as outputs we may have problems chaining later rules. Also consider that some applications won’t give you any control at all over the output filenames. The most powerful solution is to use shell commands to move and/or rename the files to exactly the names you want.
Option 4: Insisting on our own file names
Fixing FastQC to use our own output file names
Complete the rule below so that the output filenames are correctly
produced. You will need to add extra commands to the shell
part after running fastqc
. Do not alter the
output
or input
parts of the rule.
rule fastqc:
output:
html = "{indir}.{myfile}_fastqc.html",
zip = "{indir}.{myfile}_fastqc.zip"
input: "{indir}/{myfile}.fq"
shell:
"""???
"""
This is one solution, using -o .
to tell FastQC to
produce the files in the current directory, then explicitly renaming
them to match the declared rule outputs. Remember that, after Snakemake
runs all the commands in the shell
block, it checks to see
that all the expected output
files are present, but within
the block you can run any commands and make any files you like.
rule fastqc:
output:
html = "{indir}.{myfile}_fastqc.html",
zip = "{indir}.{myfile}_fastqc.zip"
input: "{indir}/{myfile}.fq"
shell:
"""fastqc -o . {input}
mv {wildcards.myfile}_fastqc.html {output.html}
mv {wildcards.myfile}_fastqc.zip {output.zip}
"""
One problem with making intermediate outputs
There is actually a problem with the above solution which only starts
to matter when we allow Snakemake to run multiple jobs in parallel.
Right now we are always using -j1
, but if we used, eg.
-j2
, then potentially Snakemake may try to make
“reads.ref_1_1_fastqc.html” and “trimmed.ref_1_1_fastqc.html” in
parallel, and both instances would be trying to write to the same
temporary files at the same time. Snakemake has an elegant solution to
this, in the form of shadow
rules, but we’re getting ahead
of ourselves. For now we’re running one job at a time, and this will
work.
Altering the Kallisto rule to have a directory()
output
We saw above that the output of a rule can be a directory and saw the
directory()
function which declares this. If you remember
the rule for kallisto quant
you may be thinking that this
could have been written with the whole directory as the output, and you
would be right.
# Existing rule for kallisto_quant
rule kallisto_quant:
output:
h5 = "kallisto.{sample}/abundance.h5",
tsv = "kallisto.{sample}/abundance.tsv",
json = "kallisto.{sample}/run_info.json",
...
# Alternative rule with directory() output
rule kallisto_quant:
output: directory("kallisto.{sample}")
...
Make the change in your Snakefile now. In other workflows this might not be the right approach but in this case it works fine and makes the Snakefile neater. It will also make sense in the next chapter where we add the remaining rules and finish the workflow.
For reference, this is a Snakefile incorporating the changes made in this episode.
Key Points
- Different bioinformatics tools will have different quirks
- If programs limit your options for choosing input and output filenames, you have several ways to deal with this
- Use triple-quote syntax to make longer shell scripts with multiple commands