Snakemake for Bioinformatics: Key Points

Running commands with Snakemake

Before running Snakemake you need to write a Snakefile
A Snakefile is a text file which defines a list of rules
Rules have inputs, outputs, and shell commands to be run
You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule
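
To make the shape of a rule concrete, here is a minimal sketch: a small Python script that writes a one-rule Snakefile into a scratch directory. The rule name `countlines` and the `reads.fq` filenames are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

# Rule name and filenames are hypothetical, chosen only for illustration.
SNAKEFILE_TEXT = '''\
rule countlines:
    input: "reads.fq"
    output: "reads.fq.count"
    shell:
        "wc -l {input} > {output}"
'''

# Write the Snakefile into a scratch directory so nothing real is overwritten.
workdir = Path(tempfile.mkdtemp())
snakefile = workdir / "Snakefile"
snakefile.write_text(SNAKEFILE_TEXT)
print(snakefile.read_text())
```

With Snakemake installed and a `reads.fq` file in that directory, running `snakemake --cores 1 reads.fq.count` there would ask Snakemake to make the target file by running the rule's shell command.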

Placeholders and wildcards

Snakemake rules are made generic with placeholders and wildcards
Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
Snakemake checks for various error conditions and will stop if it sees a problem
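
The matching step can be pictured with a small pure-Python sketch. This is an illustration of the idea, not Snakemake's own code: the helper `match_output` and the file patterns are hypothetical, and it assumes each wildcard name appears only once in the output pattern.

```python
import re

def match_output(pattern, target):
    """Return the wildcard values if target matches pattern, else None."""
    regex = ""
    pos = 0
    # Turn each {name} into a named group; escape the literal text between them.
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])
        regex += f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

# Hypothetical rule: output and input patterns plus a shell template.
output_pattern = "{sample}.fq.count"
input_pattern = "reads/{sample}.fq"
shell_template = "wc -l {input} > {output}"

# Work out the wildcard from the requested target, then fill the placeholders.
wildcards = match_output(output_pattern, "ref1_1.fq.count")
command = shell_template.format(
    input=input_pattern.format(**wildcards),
    output=output_pattern.format(**wildcards),
)
print(wildcards)
print(command)
```

A target that does not fit the pattern, such as `ref1_1.bam`, makes `match_output` return `None`, which corresponds to the rule not being chosen for that target.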

Chaining rules

Snakemake links up rules by iteratively looking for rules that make missing inputs
Careful choice of filenames allows this to work
Rules may have multiple named input files (and output files)
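
The backward search for missing inputs can be sketched as a short recursive function. All rule names, file patterns and the `EXISTING` set are hypothetical stand-ins, every pattern is assumed to use a single `{sample}` wildcard, and the real planner is considerably more thorough.

```python
# (rule name, output pattern, input patterns); all hypothetical
RULES = [
    ("trimreads", "trimmed/{sample}.fq", ["reads/{sample}.fq"]),
    ("countreads", "counts/{sample}.txt", ["trimmed/{sample}.fq"]),
]
EXISTING = {"reads/ref1.fq"}  # stand-in for files already on disk

def match(pattern, target):
    """Return the {sample} value if target fits pattern, else None."""
    prefix, suffix = pattern.split("{sample}")
    fits = target.startswith(prefix) and target.endswith(suffix)
    if fits and len(target) > len(prefix) + len(suffix):
        return target[len(prefix):len(target) - len(suffix)]
    return None

def plan(target, jobs=None):
    """Collect the jobs needed to make target, inputs first."""
    jobs = [] if jobs is None else jobs
    if target in EXISTING:
        return jobs  # nothing to do, the file is already there
    for name, out_pattern, in_patterns in RULES:
        sample = match(out_pattern, target)
        if sample is None:
            continue
        # Any missing input becomes a new target in turn.
        for in_pattern in in_patterns:
            plan(in_pattern.format(sample=sample), jobs)
        jobs.append((name, sample))
        return jobs
    raise FileNotFoundError(f"No rule makes {target}")

jobs = plan("counts/ref1.txt")
print(jobs)
```

The chain only links up because the output pattern of `trimreads` is exactly the input pattern of `countreads`, which is why consistent filenames matter.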

Complex outputs, logs and errors

Try out commands on test files before adding them to the workflow
You can build up the workflow in the order that makes sense to you, but always test as you go
Use log outputs to capture the messages printed by programs as they run
If a shell command exits with an error, or does not yield an expected output, then Snakemake will regard that as a failure and stop the workflow
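
The two failure conditions, and the role of a log file, can be tried out in plain Python. This is a sketch of the checks rather than Snakemake's implementation; the helper `run_job` and the filenames are hypothetical, and the "programs" are tiny Python one-liners so the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_job(command, output, log):
    """Run command with its messages sent to log; report success or failure."""
    with open(log, "w") as log_fh:
        result = subprocess.run(command, stdout=log_fh, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        return False  # the command exited with an error
    return Path(output).exists()  # the expected output must also be there

workdir = Path(tempfile.mkdtemp())
out = workdir / "result.txt"
log = workdir / "result.log"

# Writes its output file and prints a message to stderr.
good = [sys.executable, "-c",
        "import sys; open(sys.argv[1], 'w').write('ok'); "
        "print('finished', file=sys.stderr)",
        str(out)]
# Exits with a non-zero status.
bad = [sys.executable, "-c", "import sys; sys.exit('something went wrong')"]
# Exits cleanly but never writes the expected output.
silent = [sys.executable, "-c", "pass"]

good_ok = run_job(good, out, log)
good_log = log.read_text()
bad_ok = run_job(bad, workdir / "never_made.txt", workdir / "bad.log")
silent_ok = run_job(silent, workdir / "never_made.txt", workdir / "silent.log")
print(good_ok, bad_ok, silent_ok)
```

Note that the `silent` command counts as a failure even though it exits cleanly, because the file it was supposed to make is missing.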

How Snakemake plans its jobs

A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
If output files already exist, Snakemake can skip parts of the DAG
Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
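
The timestamp half of that decision can be sketched in a few lines. The helper `needs_rerun` and the filenames are hypothetical, and the sketch deliberately ignores the record of previous runs that Snakemake also consults.

```python
import os
import tempfile
from pathlib import Path

def needs_rerun(output, inputs):
    """True if output is missing or older than any of its inputs."""
    out = Path(output)
    if not out.exists():
        return True
    out_mtime = out.stat().st_mtime
    return any(Path(i).stat().st_mtime > out_mtime for i in inputs)

workdir = Path(tempfile.mkdtemp())
src = workdir / "reads.fq"
out = workdir / "reads.fq.count"
src.write_text("@read1\n")

missing_output = needs_rerun(out, [src])  # output has not been made yet

out.write_text("1\n")
os.utime(src, (1000, 1000))  # set explicit timestamps: input older...
os.utime(out, (2000, 2000))  # ...than the output
up_to_date = needs_rerun(out, [src])

os.utime(src, (3000, 3000))  # now pretend the input was modified later
stale = needs_rerun(out, [src])

print(missing_output, up_to_date, stale)
```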

Processing lists of inputs

Rename your input files if necessary to maintain consistent naming
List the things you want to process as global variables, or discover input files with glob_wildcards()
Use the expand() function to generate lists of filenames you want to combine
These functions can be tested in the Python interpreter
Any {input} to a rule can be a variable-length list
(But variable lists of outputs are trickier and rarely needed)
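
To see what kind of list expand() builds, here is a pure-Python stand-in that needs no Snakemake installation. `expand_sketch`, the variable names and the file pattern are hypothetical; unlike the real function, this version only accepts lists as wildcard values.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

# Hypothetical global lists of the things to process.
CONDITIONS = ["ref", "etoh60"]
REPS = ["1", "2"]

filenames = expand_sketch("trimmed/{cond}_{rep}.fq", cond=CONDITIONS, rep=REPS)
print(filenames)
```

With Snakemake installed, the real functions can be tried in the same way after `from snakemake.io import expand, glob_wildcards`.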

Handling awkward programs

Different bioinformatics tools will have different quirks
If programs limit your options for choosing input and output filenames, you have several ways to deal with this
Use triple-quote syntax to make longer shell scripts with multiple commands

Finishing the basic workflow

Once a workflow is complete and working, there will still be room for refinement
This completes the introduction to the fundamentals of Snakemake

Configuring workflows

Break out significant options into rule parameters
Use a YAML config file to separate your configuration from your workflow logic
Decide if different config items should be mandatory or else have a default
Reference the config file in your Snakefile with the configfile: directive, or else on the command line with --configfile
Override or add config values using --config name1=value1 name2=value2, then end the list with another option, e.g. -p, so that the target names that follow are not read as config values
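
The difference between mandatory and defaulted config items, and the effect of command-line overrides, can be sketched with an ordinary dictionary standing in for the parsed YAML. The setting names and the helper `apply_overrides` are hypothetical, and values are kept as plain strings here for simplicity.

```python
# Stand-in for the dictionary loaded from a YAML config file (names hypothetical).
file_config = {"salmon_kmer_len": "31"}

def apply_overrides(config, items):
    """Merge name=value items, as given after --config, over the file values."""
    merged = dict(config)
    for item in items:
        name, _, value = item.partition("=")
        merged[name] = value
    return merged

config = apply_overrides(file_config, ["trimreads_qual_threshold=22"])

# Mandatory item: indexing raises KeyError if the setting is absent.
kmer_len = config["salmon_kmer_len"]

# Optional item: .get() falls back to a default when the setting is absent.
min_length = config.get("trimreads_min_length", "100")
qual_threshold = config.get("trimreads_qual_threshold", "20")

print(kmer_len, min_length, qual_threshold)
```

Inside a Snakefile the same `config[...]` and `config.get(...)` expressions work on the `config` dictionary that Snakemake provides.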

Optimising workflow performance

To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
You also need to consider RAM, disk, and network bottlenecks
Profile your jobs to see what is taking most resources
Snakemake is great for running workflows on compute clusters
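
The arithmetic behind matching threads to cores is simple enough to sketch. The helper `plan_parallel` is hypothetical; it assumes, as Snakemake's scheduler does, that a job never gets more threads than the cores made available and that running jobs together never use more than that total.

```python
import os

def plan_parallel(requested_threads, cores):
    """Return (threads per job, number of such jobs that fit at once)."""
    threads = min(requested_threads, cores)  # a job cannot exceed the core count
    return threads, cores // threads

cores_here = os.cpu_count() or 1
print("Cores on this machine:", cores_here)

print(plan_parallel(4, 8))   # two 4-thread jobs fit side by side
print(plan_parallel(16, 8))  # request is scaled down to 8 threads, one job at a time
print(plan_parallel(3, 8))   # two jobs run, leaving 2 cores idle
```

The last case shows why thread counts that do not divide the core count evenly can leave capacity unused.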

Conda integration

Conda is a system for managing software packages in self-contained environments
Snakemake rules may be associated with specific Conda environments
When run with the --use-conda option, Snakemake will set these up for you
Conda gives you fine control over software versions, without modifying globally-installed packages
Workflows made this way are super-portable, because Conda handles installing the correct versions of everything

Constructing a whole new workflow

By now, you are ready to start using Snakemake for your own workflow tasks
You may wish to replace an existing script with a Snakemake workflow
Don’t be disheartened by errors, which are normal; use a systematic approach to diagnose the problem

Cleaning up

Cleaning up working files is good practice
Make use of the temporary() function on outputs you don’t need to keep
Protect outputs which are expensive to reproduce
Shadow rules can solve issues with commands that produce unwanted files
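
What temporary and protected outputs amount to on disk can be sketched in plain Python. The helpers `protect` and `remove_temp` and the filenames are hypothetical; Snakemake handles both automatically once outputs are marked in the Snakefile.

```python
import os
import stat
import tempfile
from pathlib import Path

def protect(path):
    """Remove write permission so the file is not overwritten by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

def remove_temp(path, pending_consumers):
    """Delete an intermediate file once no remaining job needs it."""
    if pending_consumers == 0:
        Path(path).unlink()
        return True
    return False

workdir = Path(tempfile.mkdtemp())
expensive = workdir / "assembly.fa"
intermediate = workdir / "trimmed.fq"
expensive.write_text(">contig1\nACGT\n")
intermediate.write_text("@read1\n")

protect(expensive)
still_writable = bool(os.stat(expensive).st_mode & stat.S_IWUSR)

kept = remove_temp(intermediate, pending_consumers=1)     # a job still needs it
removed = remove_temp(intermediate, pending_consumers=0)  # safe to delete now
print(still_writable, kept, removed, intermediate.exists())
```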