This lesson is being piloted (Beta version)

Snakemake for Bioinformatics

A lesson introducing the Snakemake workflow system for bioinformatics analysis. Snakemake lets you write reliable, scalable and reproducible scientific workflows as a series of chained rules. Simple workflows that replace shell scripts can be written in a few lines of code, while more complex cases can draw on Conda integration, software containers, and cluster or cloud execution. You can also add Python and R code directly into your workflow.
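As a rough illustration of what "a series of chained rules" can look like, here is a minimal Snakefile sketch. The file names (`reads/sample1.fq`, `counts/...`) and rule names are hypothetical and are not taken from the lesson data; the sketch assumes a Bash-compatible shell and a FASTQ file with four lines per read.

```snakemake
# Snakefile: a minimal sketch with hypothetical file and rule names.

# Default target: the final file we ask Snakemake to produce.
rule all:
    input: "counts/sample1.total"

# Count the lines in a FASTQ file. {sample} is a wildcard that
# Snakemake fills in by matching the requested output file name.
rule count_lines:
    input: "reads/{sample}.fq"
    output: "counts/{sample}.lines"
    shell:
        "wc -l < {input} > {output}"

# Chained rule: its input is the output of count_lines.
# Assumes four lines per FASTQ record, so reads = lines / 4.
rule count_reads:
    input: "counts/{sample}.lines"
    output: "counts/{sample}.total"
    shell:
        "echo $(( $(cat {input}) / 4 )) > {output}"
```

Running `snakemake --cores 1` in the directory containing this Snakefile should make Snakemake work backwards from `counts/sample1.total`, running `count_lines` and then `count_reads`, provided `reads/sample1.fq` exists.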

This lesson introduces the core concepts of Snakemake in the context of a typical analysis task, aligning short cDNA reads to a reference transcriptome. Later episodes focus on practical questions of workflow design, debugging and configuration.

We also look at Snakemake's Conda integration feature, which lets you author reproducible, shareable workflows with a fully specified software environment.

While planning this course material, we outlined some learner profiles to describe who we think will benefit from this lesson and why.

Prerequisites

This is an intermediate lesson and assumes learners have some prior experience in bioinformatics:

  • Familiarity with the [Bash command shell](http://swcarpentry.github.io/shell-novice), including concepts like pipes, variables and loops.
  • Knowledge of bioinformatics fundamentals like the FASTQ file format and short read mapping, in order to understand the example workflow.

No previous knowledge of Snakemake or workflow systems, or Python programming, is required.

Schedule

Setup Download files required for the lesson
00:00 1. Running commands with Snakemake How do I run a simple command with Snakemake?
01:00 2. Placeholders and wildcards How do I make a generic rule?
How does Snakemake decide what rule to run?
02:10 3. Chaining rules How do I combine rules into a workflow?
How do I make a rule with multiple inputs and outputs?
03:20 4. How Snakemake plans what jobs to run How do I visualise a Snakemake workflow?
How does Snakemake avoid unnecessary work?
How do I control what steps will be run?
04:30 5. Processing lists of inputs How do I process multiple files at once?
How do I define a default set of outputs for my Snakefile?
How do I make Snakemake detect what to process?
How do I combine multiple files together?
05:50 6. Handling awkward programs How do I handle real bioinformatics tools?
How do I define a rule where the output is a directory?
07:10 7. Configuring workflows How do I separate my rules from my configuration?
08:00 8. Optimising workflow performance What compute resources are available on my system?
How do I define jobs with more than one thread?
How do I measure the compute resources being used by a workflow?
How do I run my workflow steps in parallel?
08:40 9. Input functions What is a Python function?
How can Python functions help in defining Snakemake rules?
09:50 10. Conda integration How do I install new packages with Conda?
How do I get Snakemake to manage software dependencies?
11:00 11. Constructing a whole new workflow How do I approach making a new workflow from scratch?
How do I understand and debug the errors I see?
12:10 12. Cleaning up How do I save disk space by removing temporary files?
How do I protect important outputs from deletion?
12:45 13. Robust quoting in Snakefiles How are shell commands processed before being run?
How do I avoid quoting problems in filenames and commands?
13:25 Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.