This lesson is in the early stages of development (Alpha version)

Getting Started with Snakemake: Glossary

Key Points

Manual Data Processing Workflow
  • Bash scripts are not an efficient way of defining a workflow.

  • Snakemake is one method of managing a complex computational workflow.

  • If you have previously used make, then Snakemake will be familiar.

  • Snakemake follows Python syntax.

  • Rules can have inputs and/or outputs, and a command to be run.

  • Snakemake only executes rules when required.

  • Use {output} to refer to the output of the current rule.

  • Use {input} to refer to the dependencies of the current rule.

  • You can use Python indexing to retrieve individual outputs and inputs (example: {input[0]}).

  • Inputs and outputs can be named (example: {input.file1}).

  • Naming the code or scripts used by a rule as inputs ensures that the rule is executed if the code or script changes.
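Taken together, these points can be sketched as a single rule (the script and file names here are hypothetical):

```
rule count_words:
    input:
        script="wordcount.py",    # listing the script as an input means
        data="books/isles.txt",   # the rule re-runs when the script changes
    output: "isles.dat"
    shell: "python {input.script} {input.data} {output}"
```

Snakemake executes this rule only when isles.dat is missing or older than its inputs; {input[0]} would refer to wordcount.py by position, {input.script} by name.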

Pattern Rules
  • Use any named wildcard ({some_name}) as a placeholder in targets and dependencies. Snakemake will apply the pattern rule to all matching files.

  • You cannot execute pattern rules by name. You need to request specific files.

  • Wildcards can be used directly in input: and output: but not in actions. To use the current value of a wildcard in an action, write it as {wildcards.some_name}.
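A sketch of a pattern rule (the script name is hypothetical):

```
rule count_words:
    input: "books/{book}.txt"
    output: "{book}.dat"
    # {input} and {output} work directly in the action; the wildcard's
    # current value must be written as {wildcards.book}
    shell: "echo counting {wildcards.book} && python wordcount.py {input} {output}"
```

You cannot run this rule by name with `snakemake count_words`; instead request a file that matches the pattern, e.g. `snakemake -j1 isles.dat`.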

Snakefiles are Python code
  • Snakefiles are Python code.

  • The entire Snakefile is executed whenever you run snakemake.

  • All actual work should be done by rules.

  • A shell action executes a command-line instruction.

  • A run action executes Python code.
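A minimal sketch of the two action types (output file names are hypothetical):

```
rule hello_shell:
    output: "hello.txt"
    shell:
        "echo Hello > {output}"   # shell: runs a command-line instruction

rule hello_run:
    output: "lines.txt"
    run:
        # run: executes ordinary Python; outputs are available as a list
        with open(output[0], "w") as f:
            f.write("one\ntwo\n")
```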

Completing the Pipeline
  • Keeping output files in the top-level directory can get messy. One solution is to put files into subdirectories.

  • It is common practice to have a clean rule that deletes all intermediate and generated files, taking your workflow back to a blank slate.

  • A default rule is the rule that Snakemake runs if you don’t specify a rule on the command line. It is simply the first rule in a Snakefile.

  • Many Snakefiles define a default target called all as the first target in the file. This runs by default and typically executes the entire workflow.
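These conventions can be sketched as follows (directory and file names are hypothetical):

```
# First rule in the file = default target
rule all:
    input: "results/summary.txt"

# Conventional cleanup rule: back to a blank slate
rule clean:
    shell: "rm -rf results"
```

Running `snakemake -j1` with no target builds `all`; `snakemake -j1 clean` resets the workflow.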

Resources and parallelism
  • Use threads to indicate the number of cores required by a rule.

  • Use the -j argument to Snakemake to indicate how many CPU cores can be used for parallel tasks.

  • Resources are arbitrary labels used only for scheduling, so they can represent anything: memory, GPU slots, software licenses, and so on.

  • The && operator is a useful tool when chaining bash commands.

  • Available resources limit the total number of tasks that can execute in parallel, but Snakemake will still attempt to run at least one task even when sufficient resources are not available.

  • It is up to you to tell the applications called by Snakemake how many resources they should be using.

  • If your rule requires a minimum number of cores or resources, you can use a Bash if test to check the requirements.
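As a sketch combining these points (the tool name `some_parallel_tool` and the 2-core minimum are hypothetical):

```
rule count_words:
    input: "books/{book}.txt"
    output: "results/{book}.dat"
    threads: 4   # Snakemake may grant fewer cores if -j is small
    shell:
        """
        if [ {threads} -lt 2 ]; then
            echo "This rule needs at least 2 cores" >&2
            exit 1
        fi
        mkdir -p results && some_parallel_tool --cores {threads} {input} > {output}
        """
```

Running `snakemake -j 8` allows up to 8 cores to be shared among concurrent tasks; `{threads}` inside the action tells the application how many cores it was actually given, and `&&` ensures the tool only runs if the directory was created.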

Make your workflow portable and reduce duplication
  • Careful use of global variables can eliminate duplication of file names and patterns in your Snakefiles.

  • Consistent naming conventions help keep your code readable.

  • Configuration files can make your workflow portable.

  • In general, don’t add configuration files to source control. Instead, provide a template.

  • Take care when mixing global variables and Snakemake wildcards in a formatted string. In general, surround wildcards with double curly braces.
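For example (config.yaml and the input_dir key are hypothetical; keep a template such as config.template.yaml under source control rather than the file itself):

```
configfile: "config.yaml"   # e.g. contains:  input_dir: books

INPUT_DIR = config["input_dir"]

rule count_words:
    # In an f-string, double braces {{book}} produce the literal {book},
    # so it still reaches Snakemake as a wildcard.
    input: f"{INPUT_DIR}/{{book}}.txt"
    output: "{book}.dat"
    shell: "python wordcount.py {input} {output}"
```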

Scaling a pipeline across a cluster
  • Snakemake generates and submits its own batch scripts for your scheduler.

  • localrules defines rules that are executed locally, and never submitted to a cluster.

  • $PATH must be passed to Snakemake rules.

  • nohup <command> & prevents <command> from exiting when you log off.
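A sketch of the cluster-related points (the exact submission options depend on your scheduler and Snakemake version; SLURM's sbatch is assumed here, and recent Snakemake releases use executor plugins instead of --cluster):

```
# In the Snakefile: run cheap bookkeeping rules on the login node,
# never submitting them to the cluster.
localrules: all, clean

# On the command line (classic interface):
#   nohup snakemake -j 100 --cluster "sbatch -c {threads}" &
# nohup ... & keeps the main Snakemake process alive after you log off.
```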

Final notes
  • snakemake -n performs a dry-run.

  • Using log files can make your workflow easier to debug.

  • Put log files in the same location as the rule outputs.

  • Token files can be used to take the place of output files if none are created.

  • snakemake --unlock can unlock a directory if snakemake crashes.

  • snakemake --dag | dot -Tsvg > dag.svg creates a graphic of your workflow.

  • snakemake --gui opens a browser window with your workflow.


Zipf’s Law
In the field of quantitative linguistics, Zipf’s Law states that the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation (source: Wikipedia).
Build File
A build file describes all the steps required to execute or build your code or data. The format of the build file depends on the build system being used. Snakemake build files are called Snakefiles, and use Python 3 as the definition language.
Dependency
A file that is needed to build a target. In Snakemake, dependencies are specified as inputs to a rule.
Target
A file to be created or built. In Snakemake, targets are also called outputs.
Rule
Describes how to create outputs from inputs. Dependencies between rules are handled implicitly by matching filenames of inputs to outputs. A rule can also contain no inputs or outputs, in which case it simply specifies a command that can be run manually. Snakemake rules are composed of inputs, outputs, and an action.
Default Target
The first rule in a Snakefile defines the default target. This is the target that will be built if you do not specify any targets when running snakemake.
Action
The part of a rule that specifies the commands to run.
Incremental Builds
Incremental builds are optimized so that targets whose output files are already up to date with respect to their corresponding input files are not rebuilt.
Directed Acyclic Graph
A directed acyclic graph (DAG) is a finite graph with no directed cycles. In the context of Snakemake, it means that you cannot define any circular dependencies between rules. A rule cannot, directly or indirectly, depend on itself.
Pattern Rule
A Snakemake rule that uses wildcard patterns to describe a general rule to make any output that matches the pattern.
Wildcard
Wildcards are used in an input or output to indicate a pattern (e.g.: {file}). Snakemake matches each wildcard to the regular expression .+, although additional constraints can be specified. See the Snakemake documentation for details.
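For instance, a rule can tighten the default .+ match with a per-rule constraint (the pattern shown is illustrative):

```
rule count_words:
    input: "books/{book}.txt"
    output: "{book}.dat"
    wildcard_constraints:
        book="[A-Za-z0-9_]+"   # no dots or slashes allowed in {book}
    shell: "python wordcount.py {input} {output}"
```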
HPC Cluster
An HPC cluster is a collection of many separate servers (computers), called nodes, which are connected via a fast interconnect.
Job Scheduler
A job scheduler is a computer application for controlling unattended background program execution of jobs (source: Wikipedia).