This lesson is in the early stages of development (Alpha version)

Getting Started with Snakemake: Glossary

Key Points

Manual Data Processing Workflow
  • Bash scripts are not an efficient way of defining a workflow.

  • Snakemake is one method of managing a complex computational workflow.

  • If you have previously used make, then Snakemake will be familiar.

  • Snakemake follows Python syntax.

  • Rules can have inputs and/or outputs, and a command to be run.

  • Snakemake only executes rules when required.

  • Use {output} to refer to the output of the current rule.

  • Use {input} to refer to the dependencies of the current rule.

  • You can use Python indexing to retrieve individual outputs and inputs (example: {input[0]}).

  • Inputs and outputs can be named (example: {input.file1}).

  • Naming the code or scripts used by a rule as inputs ensures that the rule is executed if the code or script changes.
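Taken together, these points can be sketched as a single rule (the script and file names here are hypothetical):

```
rule count_words:
    input:
        script="wordcount.py",    # listing the script as an input means
        data="books/isles.txt",   # the rule re-runs when the script changes
    output: "isles.dat"
    shell: "python {input.script} {input.data} {output}"
```

Snakemake executes this rule only when isles.dat is missing or older than its inputs; {input[0]} would refer to wordcount.py by position, {input.script} by name.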

Pattern Rules
  • Use any named wildcard ({some_name}) as a placeholder in targets and dependencies. Snakemake will apply the pattern rule to all matching files.

  • You cannot execute pattern rules by name. You need to request specific files.

  • Wildcards can be used directly in input: and output: but not in actions. To use the current value of a wildcard in an action, write it as {wildcards.some_name}.
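A sketch of a pattern rule (the script name is hypothetical):

```
rule count_words:
    input: "books/{book}.txt"
    output: "{book}.dat"
    # {input} and {output} work directly in the action; the wildcard's
    # current value must be written as {wildcards.book}
    shell: "echo counting {wildcards.book} && python wordcount.py {input} {output}"
```

You cannot run this rule by name with `snakemake count_words`; instead request a file that matches the pattern, e.g. `snakemake -j1 isles.dat`.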

Snakefiles are Python code
  • Snakefiles are Python code.

  • The entire Snakefile is executed whenever you run snakemake.

  • All actual work should be done by rules.

  • A shell action executes a command-line instruction.

  • A run action executes Python code.
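A minimal sketch of the two action types (output file names are hypothetical):

```
rule hello_shell:
    output: "hello.txt"
    shell:
        "echo Hello > {output}"   # shell: runs a command-line instruction

rule hello_run:
    output: "lines.txt"
    run:
        # run: executes ordinary Python; outputs are available as a list
        with open(output[0], "w") as f:
            f.write("one\ntwo\n")
```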

Completing the Pipeline
  • Keeping output files in the top-level directory can get messy. One solution is to put files into subdirectories.

  • It is common practice to have a clean rule that deletes all intermediate and generated files, taking your workflow back to a blank slate.

  • A default rule is the rule that Snakemake runs if you don’t specify a rule on the command line. It is simply the first rule in a Snakefile.

  • Many Snakefiles define a default target called all as the first target in the file. This runs by default and typically executes the entire workflow.
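These conventions can be sketched as follows (directory and file names are hypothetical):

```
# First rule in the file = default target
rule all:
    input: "results/summary.txt"

# Conventional cleanup rule: back to a blank slate
rule clean:
    shell: "rm -rf results"
```

Running `snakemake -j1` with no target builds `all`; `snakemake -j1 clean` resets the workflow.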

Resources and parallelism
  • Use threads to indicate the number of cores required by a rule.

  • Use the -j argument to Snakemake to indicate how many CPU cores can be used for parallel tasks.

  • Resources are arbitrary labels used only for scheduling, so they can represent anything: memory, GPU slots, software licenses, and so on.

  • The && operator is a useful tool when chaining bash commands.

  • Available resources limit the total number of tasks that can execute in parallel, but Snakemake will still attempt to run at least one task even when sufficient resources are not available.

  • It is up to you to tell the applications called by Snakemake how many resources they should be using.

  • If your rule requires a minimum number of cores or resources, you can use a Bash if test to check the requirements.
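As a sketch combining these points (the tool name `some_parallel_tool` and the 2-core minimum are hypothetical):

```
rule count_words:
    input: "books/{book}.txt"
    output: "results/{book}.dat"
    threads: 4   # Snakemake may grant fewer cores if -j is small
    shell:
        """
        if [ {threads} -lt 2 ]; then
            echo "This rule needs at least 2 cores" >&2
            exit 1
        fi
        mkdir -p results && some_parallel_tool --cores {threads} {input} > {output}
        """
```

Running `snakemake -j 8` allows up to 8 cores to be shared among concurrent tasks; `{threads}` inside the action tells the application how many cores it was actually given, and `&&` ensures the tool only runs if the directory was created.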

Make your workflow portable and reduce duplication
  • Careful use of global variables can eliminate duplication of file names and patterns in your Snakefiles.

  • Consistent naming conventions help keep your code readable.

  • Configuration files can make your workflow portable.

  • In general, don’t add configuration files to source control. Instead, provide a template.

  • Take care when mixing global variables and Snakemake wildcards in a formatted string. In general, surround wildcards with double curly braces.
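For example (config.yaml and the input_dir key are hypothetical; keep a template such as config.template.yaml under source control rather than the file itself):

```
configfile: "config.yaml"   # e.g. contains:  input_dir: books

INPUT_DIR = config["input_dir"]

rule count_words:
    # In an f-string, double braces {{book}} produce the literal {book},
    # so it still reaches Snakemake as a wildcard.
    input: f"{INPUT_DIR}/{{book}}.txt"
    output: "{book}.dat"
    shell: "python wordcount.py {input} {output}"
```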

Scaling a pipeline across a cluster
  • Snakemake generates and submits its own batch scripts for your scheduler.

  • localrules defines rules that are executed locally, and never submitted to a cluster.

  • $PATH must be passed to Snakemake rules.

  • nohup <command> & prevents <command> from exiting when you log off.
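A sketch of the cluster-related points (the exact submission options depend on your scheduler and Snakemake version; SLURM's sbatch is assumed here, and recent Snakemake releases use executor plugins instead of --cluster):

```
# In the Snakefile: run cheap bookkeeping rules on the login node,
# never submitting them to the cluster.
localrules: all, clean

# On the command line (classic interface):
#   nohup snakemake -j 100 --cluster "sbatch -c {threads}" &
# nohup ... & keeps the main Snakemake process alive after you log off.
```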

Final notes
  • snakemake -n performs a dry-run.

  • Using log files can make your workflow easier to debug.

  • Put log files in the same location as the rule outputs.

  • Token files can be used to take the place of output files if none are created.

  • snakemake --unlock can unlock a directory if snakemake crashes.

  • snakemake --dag | dot -Tsvg > dag.svg creates a graphic of your workflow.

  • snakemake --gui opens a browser window with your workflow.


Zipf’s Law
In the field of quantitative linguistics, Zipf’s Law states that the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation (source: Wikipedia).
Build File
A build file describes all the steps required to execute or build your code or data. The format of the build file depends on the build system being used. Snakemake build files are called Snakefiles, and use Python 3 as the definition language.
Dependency
A file that is needed to build a target. In Snakemake, dependencies are specified as inputs to a rule.
Target
A file to be created or built. In Snakemake, targets are also called outputs.
Rule
Describes how to create outputs from inputs. Dependencies between rules are handled implicitly by matching filenames of inputs to outputs. A rule can also contain no inputs or outputs, in which case it simply specifies a command that can be run manually. Snakemake rules are composed of inputs, outputs, and an action.
Default Target
The first rule in a Snakefile defines the default target. This is the target that will be built if you do not specify any targets when running snakemake.
Action
The part of a rule that specifies the commands to run.
Incremental Builds
Incremental builds are optimized so that targets whose output files are already up to date with respect to their corresponding input files are not rebuilt.
Directed Acyclic Graph
A directed acyclic graph (DAG) is a finite graph with no directed cycles. In the context of Snakemake, it means that you cannot define any circular dependencies between rules. A rule cannot, directly or indirectly, depend on itself.
Pattern Rule
A Snakemake rule that uses wildcard patterns to describe a general rule to make any output that matches the pattern.
Wildcard
Wildcards are used in an input or output to indicate a pattern (e.g.: {file}). Snakemake matches each wildcard to the regular expression .+, although additional constraints can be specified. See the Snakemake documentation for details.
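For instance, a rule can tighten the default .+ match with a per-rule constraint (the pattern shown is illustrative):

```
rule count_words:
    input: "books/{book}.txt"
    output: "{book}.dat"
    wildcard_constraints:
        book="[A-Za-z0-9_]+"   # no dots or slashes allowed in {book}
    shell: "python wordcount.py {input} {output}"
```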
HPC Cluster
An HPC cluster is a collection of many separate servers (computers), called nodes, which are connected via a fast interconnect.
Job Scheduler
A job scheduler is a computer application for controlling unattended background program execution of jobs (source: Wikipedia).