Running commands with Snakemake
- Before running Snakemake you need to write a Snakefile
- A Snakefile is a text file which defines a list of rules
- Rules have inputs, outputs, and shell commands to be run
- You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule
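A minimal sketch of such a Snakefile, assuming a hypothetical input file named data.txt in the working directory (the rule name and filenames are illustrative only):

```snakemake
# Hypothetical rule: count the lines in data.txt and save the result.
rule count_lines:
    input: "data.txt"
    output: "data.txt.count"
    shell:
        "wc -l data.txt > data.txt.count"
```

With this saved as Snakefile, running `snakemake -j1 -p data.txt.count` asks Snakemake to make that file; `-j1` allows one job at a time and `-p` prints the shell command being run.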
Placeholders and wildcards
- Snakemake rules are made generic with placeholders and wildcards
- Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
- Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
- Snakemake checks for various error conditions and will stop if it sees a problem
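The points above could look like this in practice. This is a sketch only; the wildcard name `myfile` and the `.txt` naming scheme are assumptions made for the example:

```snakemake
# {myfile} is a wildcard: Snakemake fills it in by matching the
# output pattern against the requested target.
rule count_lines:
    input: "{myfile}.txt"
    output: "{myfile}.txt.count"
    shell:
        # {input} and {output} are placeholders, replaced with the
        # concrete filenames once the wildcard value is known.
        "wc -l {input} > {output}"
```

Requesting notes.txt.count would set `myfile` to notes, so the command becomes `wc -l notes.txt > notes.txt.count`. If notes.txt does not exist and no rule can make it, Snakemake reports an error rather than running anything.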
Chaining rules
- Snakemake links up rules by iteratively looking for rules that make missing inputs
- Careful choice of filenames allows this to work
- Rules may have multiple named input files (and output files)
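As an illustration of chaining, here is a sketch with two hypothetical rules. The directory names, and the use of `head` as a stand-in for a real trimming tool, are assumptions for the example:

```snakemake
# First rule: makes trimmed/<sample>.fq from reads/<sample>.fq.
rule trim:
    input: "reads/{sample}.fq"
    output: "trimmed/{sample}.fq"
    shell:
        "head -n 4000 {input} > {output}"  # stand-in for a real trimmer

# Second rule: needs trimmed/<sample>.fq, so Snakemake looks for a
# rule whose output pattern matches it and finds 'trim'.
rule count:
    input:
        raw = "reads/{sample}.fq",
        trimmed = "trimmed/{sample}.fq",
    output: "counts/{sample}.txt"
    shell:
        "wc -l {input.raw} {input.trimmed} > {output}"
```

The link between the rules comes purely from the filenames: the output pattern of one rule matches a named input of the other.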
Complex outputs, logs and errors
- Try out commands on test files before adding them to the workflow
- You can build up the workflow in the order that makes sense to you, but always test as you go
- Use log outputs to capture the messages printed by programs as they run
- If a shell command exits with an error, or does not yield an expected output, then Snakemake will regard that as a failure and stop the workflow
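A sketch of a rule with a log output, assuming a hypothetical logs/ directory and naming scheme. The log filename has to contain the same wildcards as the output:

```snakemake
rule sort_lines:
    input: "{name}.txt"
    output: "{name}.sorted.txt"
    log: "logs/sort_{name}.log"
    shell:
        # Messages printed to stderr go to the log file, not the terminal.
        "sort {input} > {output} 2> {log}"
```

If `sort` exits with a non-zero status, or finishes without creating the output file, Snakemake treats the job as failed and the log file is the first place to look.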
How Snakemake plans its jobs
- A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
- Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
- If output files already exist, Snakemake can skip parts of the DAG
- Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
Processing lists of inputs
- Rename your input files if necessary to maintain consistent naming
- List the things you want to process as global variables, or discover input files with glob_wildcards()
- Use the expand() function to generate lists of filenames you want to combine
- These functions can be tested in the Python interpreter
- Any {input} to a rule can be a variable-length list
- (But variable lists of outputs are trickier and rarely needed)
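A sketch combining these functions, assuming input files named reads/<sample>_1.fq and reads/<sample>_2.fq, and assuming some other rule (not shown) produces the counts/ files:

```snakemake
# Discover sample names from the files that exist on disk.
SAMPLES = glob_wildcards("reads/{sample}_1.fq").sample

rule all_counts:
    input:
        # One filename for every combination of sample and read number.
        expand("counts/{sample}_{read}.txt", sample=SAMPLES, read=["1", "2"])
    output: "all_counts.txt"
    shell:
        # {input} expands to the whole list, separated by spaces.
        "cat {input} > {output}"
```

Because `expand()` fills in every `{...}` in its pattern, the resulting rule has no wildcards of its own, and its input list grows or shrinks with the contents of `SAMPLES`.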
Handling awkward programs
- Different bioinformatics tools will have different quirks
- If programs limit your options for choosing input and output filenames, you have several ways to deal with this
- Use triple-quote syntax to make longer shell scripts with multiple commands
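For instance, `gzip` chooses its own output filename by appending .gz to its input. One possible way to handle this, sketched with hypothetical filenames, is a multi-line shell script:

```snakemake
rule sort_and_compress:
    input: "{name}.txt"
    output: "{name}.sorted.txt.gz"
    shell:
        """
        sort {input} > {wildcards.name}.sorted.txt
        gzip {wildcards.name}.sorted.txt
        """
```

The intermediate file is named so that the name `gzip` picks is exactly the declared output. Other options include renaming files with `mv` inside the script.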
Finishing the basic workflow
- Once a workflow is complete and working, there will still be room for refinement
- This completes the introduction to the fundamentals of Snakemake
Configuring workflows
- Break out significant options into rule parameters
- Use a YAML config file to separate your configuration from your workflow logic
- Decide if different config items should be mandatory or else have a default
- Reference the config file in your Snakefile or else on the command line with --configfile
- Override or add config values using --config name1=value1 name2=value2 and end the list with a new parameter, e.g. -p
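A sketch of how these pieces might fit together. The file config.yaml, the keys `input_dir` and `min_length`, and the default of 50 are all hypothetical:

```snakemake
configfile: "config.yaml"

# Mandatory item: a missing key raises a KeyError when the Snakefile loads.
INPUT_DIR = config["input_dir"]

# Optional item: fall back to a default if the key is absent.
MIN_LEN = config.get("min_length", 50)

rule filter_lines:
    input: INPUT_DIR + "/{name}.txt"
    output: "filtered/{name}.txt"
    params:
        min_len = MIN_LEN
    shell:
        # Keep only lines at least min_len characters long.
        "awk 'length($0) >= {params.min_len}' {input} > {output}"
```

A run such as `snakemake -j1 --config min_length=80 -p filtered/notes.txt` would then override the default, with `-p` marking the end of the `--config` list.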
Optimising workflow performance
- To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
- You also need to consider RAM, disk, and network bottlenecks
- Profile your jobs to see what is taking most resources
- Snakemake is great for running workflows on compute clusters
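A sketch of declaring threads and memory for a rule. The numbers are placeholders rather than recommendations, and `pigz` (a parallel gzip) is assumed to be installed:

```snakemake
rule compress:
    input: "{name}.txt"
    output: "{name}.txt.gz"
    threads: 4
    resources:
        mem_mb = 2000
    shell:
        # {threads} passes the actual thread count to the tool.
        "pigz -p {threads} -c {input} > {output}"
```

When Snakemake is given fewer cores than a rule requests, for example with `-j2`, it reduces the value of `{threads}` to fit, so the same Snakefile can be used on machines of different sizes.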
Conda integration
- Conda is a system for managing software packages in self-contained environments
- Snakemake rules may be associated with specific Conda environments
- When run with the --use-conda option, Snakemake will set these up for you
- Conda gives you fine control over software versions, without modifying globally-installed packages
- Workflows made this way are highly portable, because Conda handles installing the correct versions of everything
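A sketch of a rule tied to a Conda environment. The path envs/compress.yaml is hypothetical and would be a Conda environment file listing the packages the rule needs (here, pigz):

```snakemake
rule compress:
    input: "{name}.txt"
    output: "{name}.txt.gz"
    # Environment file path, relative to the Snakefile.
    conda: "envs/compress.yaml"
    shell:
        "pigz -c {input} > {output}"
```

Without `--use-conda` the `conda:` line is ignored and the command runs with whatever software is already on the PATH.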
Constructing a whole new workflow
- By now, you are ready to start using Snakemake for your own workflow tasks
- You may wish to replace an existing script with a Snakemake workflow
- Don’t be disheartened by errors, which are normal; use a systematic approach to diagnose the problem
Cleaning up
- Cleaning up working files is good practice
- Make use of the temporary() function on outputs you don’t need to keep
- Protect outputs which are expensive to reproduce
- Shadow rules can solve issues with commands that produce unwanted files
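A sketch showing the three features together, with hypothetical rule names and directories:

```snakemake
rule sort_lines:
    input: "{name}.txt"
    # Deleted automatically once no remaining job needs it.
    output: temporary("work/{name}.sorted.txt")
    shell:
        "sort {input} > {output}"

rule summarise:
    input: "work/{name}.sorted.txt"
    # Made read-only after creation to guard against accidental deletion.
    output: protected("results/{name}.summary.txt")
    # Run in an isolated directory; stray files written there are discarded.
    shadow: "minimal"
    shell:
        "uniq -c {input} > {output}"
```

Here `uniq` writes no stray files, so the `shadow` line is purely illustrative; it matters for tools that leave behind temporary or index files in the working directory.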