Running commands with Snakemake


  • Before running Snakemake you need to write a Snakefile
  • A Snakefile is a text file which defines a list of rules
  • Rules have inputs, outputs, and shell commands to be run
  • You tell Snakemake what file to make and it will run the shell command defined in the appropriate rule

Placeholders and wildcards


  • Snakemake rules are made generic with placeholders and wildcards
  • Snakemake chooses the appropriate rule by replacing wildcards such that the output matches the target
  • Placeholders in the shell part of the rule are replaced with values based on the chosen wildcards
  • Snakemake checks for various error conditions and will stop if it sees a problem
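
The matching and substitution described above can be sketched in plain Python. This is only an illustration of the idea, not Snakemake's own code: the rule is modelled as a dictionary, the file names and the `wc -l` command are hypothetical, and the sketch assumes each wildcard name appears once in the output pattern.

```python
import re


def match_wildcards(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None."""
    # re.split with a capturing group alternates literal text and wildcard names.
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            regex += re.escape(piece)      # literal part of the filename
        else:
            regex += f"(?P<{piece}>.+)"    # a wildcard matches one or more characters
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


def build_command(rule, target):
    """Fill in the placeholders of a toy rule for one requested target."""
    wildcards = match_wildcards(rule["output"], target)
    if wildcards is None:
        raise ValueError(f"rule cannot make {target}")
    input_file = rule["input"].format(**wildcards)
    return rule["shell"].format(input=input_file, output=target)


# Hypothetical rule: count the lines in a FASTQ file.
count_rule = {
    "output": "{sample}.fastq.count",
    "input": "reads/{sample}.fastq",
    "shell": "wc -l {input} > {output}",
}

command = build_command(count_rule, "ref1_1.fastq.count")
print(command)
```

Requesting a target that no output pattern can match makes `match_wildcards` return None, which is the kind of condition Snakemake reports as an error before running anything.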

Chaining rules


  • Snakemake links up rules by iteratively looking for rules that make missing inputs
  • Careful choice of filenames allows this to work
  • Rules may have multiple named input files (and output files)
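
As a rough sketch of how chaining might work, the Python below walks back from a target, looking for a rule that can make each missing input. The rule names, directory layout, and the `plan` helper are all hypothetical, and the logic is deliberately simpler than Snakemake's real planner (for example, it treats any existing file as up to date).

```python
import re

# Hypothetical rules; the second one has two named inputs.
RULES = [
    {"name": "trim",
     "output": "trimmed/{sample}.fq",
     "input": {"reads": "reads/{sample}.fq"}},
    {"name": "pair_stats",
     "output": "stats/{sample}.txt",
     "input": {"fq1": "trimmed/{sample}_1.fq", "fq2": "trimmed/{sample}_2.fq"}},
]


def match(pattern, target):
    """Return wildcard values if the output pattern can produce target, else None."""
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(piece) if index % 2 == 0 else f"(?P<{piece}>.+)"
        for index, piece in enumerate(pieces)
    )
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


def plan(target, existing, rules=RULES):
    """List the (rule name, wildcards) jobs needed to make target, inputs first."""
    if target in existing:
        return []                      # nothing to do for files we already have
    for rule in rules:
        wildcards = match(rule["output"], target)
        if wildcards is None:
            continue
        jobs = []
        for pattern in rule["input"].values():
            # Each missing input becomes a new target for another rule.
            jobs += plan(pattern.format(**wildcards), existing, rules)
        jobs.append((rule["name"], wildcards))
        return jobs
    raise FileNotFoundError(f"no rule can make {target}")


jobs = plan("stats/a.txt", existing={"reads/a_1.fq", "reads/a_2.fq"})
print(jobs)
```

The chain only links up because the output pattern of `trim` lines up with the input patterns of `pair_stats`, which is why consistent file naming matters.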

Complex outputs, logs and errors


  • Try out commands on test files before adding them to the workflow
  • You can build up the workflow in the order that makes sense to you, but always test as you go
  • Use log outputs to capture the messages printed by programs as they run
  • If a shell command exits with an error, or does not yield an expected output, then Snakemake will regard that as a failure and stop the workflow
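
The two failure checks and the log capture can be imitated in a few lines of Python. This is a sketch under assumptions, not how Snakemake is implemented: `run_job` is a hypothetical helper, and it redirects messages itself, whereas in a Snakefile the shell command has to send its messages to the declared log file.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_job(command, output, log):
    """Run a command, capture its messages in a log file, then check the result."""
    with open(log, "w") as log_handle:
        result = subprocess.run(command, stdout=log_handle, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        return f"failed: exit status {result.returncode}"
    if not Path(output).exists():
        return "failed: expected output is missing"
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    out = str(Path(tmp) / "result.txt")
    log = str(Path(tmp) / "job.log")
    python = sys.executable

    # A command that prints a message and writes the expected output.
    writer = "import sys; print('working'); open(sys.argv[1], 'w').write('done')"
    good = run_job([python, "-c", writer, out], out, log)
    log_text = Path(log).read_text()

    # A command that exits with a non-zero status.
    bad_exit = run_job([python, "-c", "raise SystemExit(3)"], out, log)

    # A command that succeeds but never writes its output.
    missing = str(Path(tmp) / "missing.txt")
    no_output = run_job([python, "-c", "print('nothing written')"], missing, log)

print(good, bad_exit, no_output, sep="\n")
```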

How Snakemake plans its jobs


  • A job in Snakemake is a rule plus wildcard values (determined by working back from the requested output)
  • Snakemake plans its work by arranging all the jobs into a DAG (directed acyclic graph)
  • If output files already exist, Snakemake can skip parts of the DAG
  • Snakemake compares file timestamps and a log of previous runs to determine what needs regenerating
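
To make the planning idea concrete, here is a small sketch that orders a hypothetical DAG with Python's standard `graphlib` module and then skips files that are already up to date. It only compares timestamps (given here as plain numbers), so it leaves out the record of previous runs that Snakemake also consults; the file names and the `jobs_to_run` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each output file maps to the set of input files its job needs (hypothetical names).
DEPENDS_ON = {
    "trimmed/a.fq": {"reads/a.fq"},
    "counts/a.txt": {"trimmed/a.fq"},
    "summary.txt": {"counts/a.txt"},
}


def jobs_to_run(depends_on, mtimes):
    """Return the outputs that need (re)making, in an order that respects the DAG.

    mtimes maps each existing file to its modification time; files that do not
    exist are simply absent from the mapping.
    """
    todo = []
    stale = set()
    # static_order() yields every node after all of its dependencies.
    for path in TopologicalSorter(depends_on).static_order():
        inputs = depends_on.get(path)
        if not inputs:
            continue                   # a source file, not made by any job
        is_missing = path not in mtimes
        is_outdated = any(
            source in stale or mtimes.get(source, 0) > mtimes.get(path, 0)
            for source in inputs
        )
        if is_missing or is_outdated:
            stale.add(path)
            todo.append(path)
    return todo


# Only the final summary is missing, so the earlier jobs can be skipped.
only_summary = jobs_to_run(
    DEPENDS_ON, {"reads/a.fq": 1, "trimmed/a.fq": 2, "counts/a.txt": 3}
)

# The raw reads are newer than everything else, so the whole chain is redone.
everything = jobs_to_run(
    DEPENDS_ON,
    {"reads/a.fq": 5, "trimmed/a.fq": 2, "counts/a.txt": 3, "summary.txt": 4},
)
print(only_summary, everything, sep="\n")
```

If the dependencies contained a loop, `TopologicalSorter` would raise a `CycleError`, which mirrors the requirement that the job graph be acyclic.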

Processing lists of inputs


  • Rename your input files if necessary to maintain consistent naming
  • List the things you want to process as global variables, or discover input files with glob_wildcards()
  • Use the expand() function to generate lists of filenames you want to combine
  • These functions can be tested in the Python interpreter
  • Any {input} to a rule can be a variable-length list
  • (But variable lists of outputs are trickier and rarely needed)
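
The following sketch shows, in plain Python, roughly what list generation and input discovery do. `expand_sketch` and `discover_samples` are simplified stand-ins written for this example, not the real expand() and glob_wildcards() functions, and the file names are hypothetical.

```python
import re
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combination)))
        for combination in product(*wildcards.values())
    ]


def discover_samples(filenames, pattern=r"reads/(?P<sample>.+)_1\.fq"):
    """Collect the sample names found in a list of filenames."""
    return sorted(
        {found["sample"] for name in filenames if (found := re.fullmatch(pattern, name))}
    )


# Stand-in for a directory listing.
listing = ["reads/ref1_1.fq", "reads/ref1_2.fq", "reads/etoh60_1.fq", "reads/notes.txt"]

samples = discover_samples(listing)
inputs = expand_sketch("trimmed/{sample}_{read}.fq", sample=samples, read=["1", "2"])
print(samples)
print(inputs)
```

Because both helpers are ordinary functions returning lists, they can be tried out interactively before being used to build the input list of a rule.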

Handling awkward programs


  • Different bioinformatics tools will have different quirks
  • If programs limit your options for choosing input and output filenames, you have several ways to deal with this
  • Use triple-quote syntax to make longer shell scripts with multiple commands

Finishing the basic workflow


  • Once a workflow is complete and working, there will still be room for refinement
  • This completes the introduction to the fundamentals of Snakemake

Configuring workflows


  • Break out significant options into rule parameters
  • Use a YAML config file to separate your configuration from your workflow logic
  • Decide if different config items should be mandatory or else have a default
  • Reference the config file in your Snakefile or else on the command line with --configfile
  • Override or add config values using --config name1=value1 name2=value2, and end the list with another option, e.g. -p, so that the target name is not read as a config item
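
The distinction between mandatory items, items with defaults, and command-line overrides can be sketched with ordinary dictionaries. All of the key names and values here are hypothetical, `parse_overrides` is a helper invented for this example, and overridden values are kept as strings for simplicity.

```python
def parse_overrides(pairs):
    """Turn ['name1=value1', 'name2=value2'] into a dictionary."""
    return dict(pair.split("=", 1) for pair in pairs)


# Stand-in for the values loaded from a YAML config file.
file_config = {"salmon_kmer_len": "31", "samples_dir": "reads"}

# Values given on the command line override or add to those from the file.
config = {**file_config, **parse_overrides(["salmon_kmer_len=23", "trim_min_len=80"])}

# Mandatory item: a plain lookup fails loudly with KeyError if it is missing.
kmer_len = int(config["salmon_kmer_len"])

# Optional item: .get() supplies a default when nothing was configured.
trim_quality = int(config.get("trim_quality", 20))

print(kmer_len, trim_quality, config["trim_min_len"])
```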

Optimising workflow performance


  • To make your workflow run as fast as possible, try to match the number of threads to the number of cores you have
  • You also need to consider RAM, disk, and network bottlenecks
  • Profile your jobs to see what is taking most resources
  • Snakemake is great for running workflows on compute clusters
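
The arithmetic behind matching threads to cores is simple enough to sketch. The helper below is hypothetical and ignores RAM, disk, and network limits; it assumes that a job asking for more threads than there are cores is scaled down to the cores available.

```python
def concurrent_jobs(cores, threads_per_job):
    """Return (jobs that fit at once, threads each job actually gets)."""
    threads = min(threads_per_job, cores)   # a job cannot use more cores than exist
    return cores // threads, threads


for cores, threads in [(8, 4), (8, 3), (2, 8)]:
    jobs, used = concurrent_jobs(cores, threads)
    print(f"{cores} cores, {threads} threads requested: {jobs} job(s) with {used} thread(s) each")
```

With 8 cores and 3 threads per job, only two jobs fit and two cores sit idle, which is the kind of mismatch worth looking for when tuning a workflow.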

Conda integration


  • Conda is a system for managing software packages in self-contained environments
  • Snakemake rules may be associated with specific Conda environments
  • When run with the --use-conda option, Snakemake will set these up for you
  • Conda gives you fine control over software versions, without modifying globally-installed packages
  • Workflows made this way are highly portable, because Conda handles installing the correct versions of everything

Constructing a whole new workflow


  • By now, you are ready to start using Snakemake for your own workflow tasks
  • You may wish to replace an existing script with a Snakemake workflow
  • Don’t be disheartened by errors, which are normal; use a systematic approach to diagnose the problem

Cleaning up


  • Cleaning up working files is good practice
  • Make use of the temporary() function on outputs you don’t need to keep
  • Protect outputs which are expensive to reproduce
  • Shadow rules can solve issues with commands that produce unwanted files
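
As a sketch of the idea behind temporary outputs, the Python below works out the point at which an intermediate file is no longer needed, namely after the last job that reads it. The job list and the `cleanup_schedule` helper are hypothetical and only model the bookkeeping; nothing is actually deleted.

```python
def cleanup_schedule(jobs, temp_files):
    """Map each job name to the temporary files that can be removed after it.

    jobs is a list of (name, inputs, outputs) tuples in the order they run.
    """
    last_reader = {}
    for name, inputs, _outputs in jobs:
        for path in inputs:
            if path in temp_files:
                last_reader[path] = name    # later jobs overwrite earlier ones
    schedule = {}
    for path, name in last_reader.items():
        schedule.setdefault(name, []).append(path)
    return schedule


# Hypothetical workflow: the trimmed reads are only an intermediate step.
jobs = [
    ("trim", ["reads/a.fq"], ["trimmed/a.fq"]),
    ("count", ["trimmed/a.fq"], ["counts/a.txt"]),
    ("qc", ["trimmed/a.fq"], ["qc/a.html"]),
]

schedule = cleanup_schedule(jobs, temp_files={"trimmed/a.fq"})
print(schedule)
```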