Summary and Schedule
A lesson introducing the Snakemake workflow system for bioinformatics analysis. Snakemake enables the writing of reliable, scalable and reproducible scientific workflows as a series of chained rules. Simple workflows to replace shell scripts can be written in a few lines of code, while for more complex cases there is support for conda integration, software containers, cluster execution, cloud execution, etc. You can also add Python and R code directly into your workflow.
This lesson introduces the core concepts of Snakemake in the context of a typical analysis task, aligning short cDNA reads to a reference transcriptome. Later episodes focus on practical questions of workflow design, debugging and configuration.
We also look at the Conda integration feature of Snakemake, with which you can author reproducible and shareable workflows with a fully-specified software environment.
In the planning phase of writing this course material we outlined some learner profiles, to expand on who we think will benefit from this lesson and why.
Learner Prerequisites
See the prerequisites page for a full list of skills and concepts we assume that learners will know prior to taking this lesson. In brief:
This is an intermediate lesson and assumes learners have some prior experience in bioinformatics:
- Familiarity with the Bash command shell, including concepts like pipes, variables, loops and scripts.
- Knowledge of bioinformatics fundamentals like the FASTQ file format and read mapping, in order to understand the example workflows.
No previous knowledge of Snakemake or workflow systems, or Python programming, is assumed.
Setup Instructions | Download files required for the lesson | |
Duration: 00h 00m | 1. Running commands with Snakemake | How do I run a simple command with Snakemake? |
Duration: 01h 00m | 2. Placeholders and wildcards | How do I make a generic rule? How does Snakemake decide what rule to run? |
Duration: 02h 10m | 3. Chaining rules | How do I combine rules into a workflow? How can I make a rule with multiple input files? |
Duration: 03h 00m | 4. Complex outputs, logs and errors | How can we start to analyse the sample data? What can cause a job in Snakemake to fail? |
Duration: 03h 50m | 5. How Snakemake plans its jobs | How do I visualise a Snakemake workflow? How does Snakemake avoid unnecessary work? How do I control what steps will be run? |
Duration: 05h 00m | 6. Processing lists of inputs | How do I define a default set of outputs for my Snakefile? How do I make rules which combine whole lists of files? How do I process all available input files at once? |
Duration: 06h 20m | 7. Handling awkward programs | How do I handle tools which don’t let me specify output file names? How do I define a rule where the output is a directory? |
Duration: 07h 40m | 8. Finishing the basic workflow | What does the full sample workflow look like? How can we report some initial results from this analysis? |
Duration: 09h 00m | 9. Configuring workflows | How do I separate my rules from my configuration? |
Duration: 09h 50m | 10. Optimising workflow performance | What compute resources are available on my system? How do I define jobs with more than one thread? How do I measure the compute resources being used by a workflow? How do I run my workflow steps in parallel? |
Duration: 10h 30m | 11. Conda integration | How do I install new packages with Conda? How do I get Snakemake to manage software dependencies? |
Duration: 11h 40m | 12. Constructing a whole new workflow | How do I approach making a new workflow from scratch? How do I understand and debug the errors I see? |
Duration: 13h 40m | 13. Cleaning up | How do I save disk space by removing temporary files? How do I isolate the interim files created by jobs from each other? |
Duration: 14h 15m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Software installation
These instructions set out how to obtain and install the software and data on Linux. It is assumed that you have:
- access to the Bash shell on a fairly modern Linux system
- sufficient disk space (~1GB) to store the software and data
You do not need root/administrator level access.
Note
The recommended installation method here uses a frozen conda environment so that you will be running the exact versions of Snakemake and the other tools that have been tested with this material.
There are other ways to install these tools, but Snakemake in particular is under active development and so newer or older versions may not behave the same way that the material shows.
The frozen environment in conda_env.yaml only works on Linux just now, but we would welcome contributions in this regard, such as setup instructions for Mac users. See the Keeping the course updated section for more details.
We can install all the software packages, including Snakemake and the bioinformatics tools, via Conda. For more info on Conda, see the first part of episode 11.
- If you don’t already have Conda, start with the Miniforge installer.
These are suggested commands to install and initialise Miniforge in a Linux Bash environment.
BASH
$ wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O installer.sh
$ bash ./installer.sh -b -u -p ~/miniforge3
$ rm installer.sh
$ ~/miniforge3/bin/conda init bash
$ exit
Then open a new shell. Your shell prompt should now start with (base).
Now save the following into a file named ~/.condarc:
channels:
- conda-forge
- bioconda
solver: libmamba
channel_priority: strict
Notes
- You may want to enable additional channels but all the packages needed in the course are in the two listed above.
- The use of strict channel priority when resolving dependencies is recommended by both Snakemake and Conda itself.
- Setting solver: libmamba is nowadays preferred to explicitly running the mamba command.
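If you prefer to create the file from the shell, the configuration above can be written with a here-document. This is only a sketch: the variable name condarc_path is introduced here for illustration, and the backup step is an added precaution rather than part of the lesson.

```shell
# Target file, as named in the instructions above.
condarc_path="$HOME/.condarc"

# Keep a copy of any existing file, since the redirect below overwrites it.
if [ -f "$condarc_path" ]; then
    cp "$condarc_path" "$condarc_path.bak"
fi

# Write the same settings that are listed above.
cat > "$condarc_path" <<'EOF'
channels:
  - conda-forge
  - bioconda
solver: libmamba
channel_priority: strict
EOF
```

The quoted 'EOF' delimiter stops the shell from expanding anything inside the block, so the text is written exactly as shown.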
Once this is set up, get the conda_env.yaml environment file, then create and activate the environment from it.
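A minimal sketch of creating and activating the environment, assuming conda_env.yaml has already been saved in the current directory. The environment name snakemake_lesson is hypothetical: passing --name overrides whatever name is set inside the YAML file, so leave that option out if you want to keep the name the file defines. The conda activate line is meant for an interactive shell that has been through conda init, as described above.

```shell
env_file="conda_env.yaml"      # environment file named in this lesson
env_name="snakemake_lesson"    # hypothetical name, chosen for this sketch

if ! command -v conda >/dev/null 2>&1; then
    # Conda is not on the PATH yet.
    echo "conda not found: install Miniforge first"
elif [ ! -f "$env_file" ]; then
    # The environment file has not been downloaded to this directory.
    echo "$env_file not found in $(pwd)"
else
    # Build the environment from the file, then switch to it.
    conda env create --name "$env_name" --file "$env_file"
    conda activate "$env_name"
fi
```

After activation, the shell prompt should show the environment name in place of (base).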
You will also need a text editor such as GEdit or Nano. These are standard on most Linux distros. Other text editors will work fine but we only provide specific setup instructions for GEdit and Nano here.
Obtaining the data
Download and unpack the sample dataset tarball from https://figshare.com/ndownloader/files/42467370
You may do the download in the shell. The tar file then needs to be unpacked to yield the directory of files used in the course, which you may also do in the shell.
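A minimal sketch of the download and unpack steps, wrapped in a shell function so that both can be run with one command. The local file name sample_data.tar and the function name download_and_unpack are assumptions made for this example; only the URL comes from the instructions above. GNU tar detects any compression automatically when extracting, so the file suffix does not need to match.

```shell
data_url="https://figshare.com/ndownloader/files/42467370"
tarball="sample_data.tar"    # hypothetical local file name

download_and_unpack() {
    # Fetch the tarball, then unpack it only if the download succeeded.
    wget -O "$tarball" "$data_url" &&
    tar -xvf "$tarball"
}
```

Running download_and_unpack in your working directory performs both steps and lists each file as it is extracted.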
See this link for details about this dataset and the redistribution license.
Preparing your editor
There are some settings you should change in your editor to most effectively edit Snakemake workflows. These are also good for editing most other types of script and code.
GEdit
GEdit is the text editor that comes with the GNOME desktop and is a simple-to-use general-purpose editor. Within the application menus it is normally just called “Text Editor” but you can also start it from your shell by typing “gedit”.
Running gedit from the shell
If you type “gedit” in the shell, and GEdit is already running, a new tab will be created in the existing editor and your shell prompt will return. If GEdit was not already running the shell prompt in your terminal will not come back until GEdit is closed.
You can type “gedit &” to get your shell prompt back immediately, but the program has a habit of printing sporadic warnings into that terminal, so the cleanest option is to just start a new terminal window.
Snakemake files follow Python's layout rules, which are very fussy about indentation (tab characters versus spaces) and line breaks. Before writing any code in GEdit, you need to go into the preferences and select the following settings:
- Insert spaces instead of tabs (under the “Editor” tab)
- Disable text wrapping (under the “View” tab)
The following settings are recommended but not required:
- Set the Syntax to “Python3”, rather than “Plain Text”
- Set Tab width to 4
- Enable automatic indentation
- Highlight matching brackets
- Display line numbers
Nano
The Nano editor, as introduced in the shell-novice lesson, works directly in the terminal and is found on virtually every Linux system.
Nano can be started with the suggested settings, in particular regarding Tab handling as mentioned above, and with Python syntax highlighting, by passing the appropriate options on the command line.
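One possible form of that command, written with nano's long option names and wrapped in a small shell function. The function name nano_py is hypothetical, and the option set is an assumption covering the wrapping, indentation, Tab and syntax settings mentioned in this section.

```shell
# nano_py is a hypothetical helper name, chosen for this sketch.
nano_py() {
    # No line wrapping, automatic indentation, Tab key inserts spaces,
    # tab width of 4, and Python syntax highlighting.
    nano --nowrap --autoindent --tabstospaces --tabsize=4 --syntax=python "$@"
}
```

With the function defined in your shell, typing nano_py Snakefile opens a file named Snakefile with these settings applied.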
To avoid typing all those options each time, you can add defaults to your ~/.nanorc file:
set nowrap
set autoindent
set tabstospaces
set tabsize 4
set smooth
set morespace
Some of these are already defaults in later versions of Nano, but it doesn’t hurt to have them here anyway. There isn’t a way to set the default syntax highlighting within this file.