Getting Started with Snakemake

Learn to tame your unruly data-processing workflow with Snakemake, a tool for creating reproducible and scalable data analyses. Workflows are described in a human-readable, Python-based language. They can be scaled to server, cluster, grid, and cloud environments without modifying the workflow definition.

Starting with some independent analysis tasks, you will explore the limitations of manual processing and shell scripting before building up a reproducible, automated, and efficient workflow step by step with Snakemake. Along the way, you will learn the benefits of modern workflow engines and how to apply them to your own work.

The example workflow performs a frequency analysis of several public domain books sourced from Project Gutenberg, testing how closely each book conforms to Zipf’s Law. This example has been chosen over a more complex scientific workflow as the goal is to appeal to a wide audience and to focus on building the workflow without getting distracted by the underlying science domain.
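To make the idea concrete, here is a minimal sketch of the kind of frequency analysis the workflow automates. It is not the lesson's own script: the function names `word_frequencies` and `zipf_table`, the simple regular-expression tokenization, and the sample sentence are all assumptions chosen for illustration. Zipf's Law predicts that the word at rank r occurs roughly 1/r times as often as the most frequent word, so the sketch places each observed count next to that prediction.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs sorted from most to least frequent.

    Tokenization is an assumption: lowercase runs of letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_table(text, top=3):
    """Pair each top-ranked word's observed count with the Zipf prediction.

    Zipf's Law predicts count(rank) is approximately count(rank 1) / rank.
    Returns tuples of (rank, word, observed_count, predicted_count).
    """
    ranked = word_frequencies(text)[:top]
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


# Hypothetical sample text; the lesson uses full books from Project Gutenberg.
sample = "the cat and the dog and the bird"
for row in zipf_table(sample):
    print(row)
```

Running a function like this by hand on one book is easy. Repeating it for many books, and rerunning only what has changed, is the bookkeeping that a workflow engine such as Snakemake takes over.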

At the end of this lesson, you will:

Understand the benefits of workflow engines.
Be able to create reproducible analysis pipelines with Snakemake.

All code and data are provided.

Prerequisites

Some basic Python programming experience, ideally in Python 3.
Familiarity with running programs on a command line.

If you require a refresher or introductory course, then I suggest one or more of these Carpentry courses:

Setup

Please follow the instructions on the Setup page.

The files used in this lesson can be downloaded:

Linux/Mac

Windows

Once downloaded, extract the archive into the directory you want to use for all the hands-on exercises.

Solutions for most episodes can be found in the .solutions directory inside the code download.

A requirements.txt file is included in the download. You can use it to install the required Python packages, for example by running pip install -r requirements.txt in your Python environment.

Schedule

	Setup	Download files required for the lesson
00:00	1. Manual Data Processing Workflow	How can I make my results easier to reproduce?
00:30	2. Snakefiles	How do I write a simple workflow?
01:10	3. Wildcards	How can I abbreviate the rules in my pipeline?
02:00	4. Pattern Rules	How can I define rules to operate on similar files?
02:20	5. Snakefiles are Python code	How can I automatically manage dependencies and outputs? How can I use Python code to add features to my pipeline?
03:20	6. Completing the Pipeline	How do I move generated files into a subdirectory? How do I add new processing rules to a Snakefile? What are some common practices for Snakemake? How can I get my workflow to clean up generated files? What is a default rule?
04:00	7. Resources and parallelism	How do I scale a pipeline across multiple cores? How do I manage access to resources while working in parallel?
04:45	8. Make your workflow portable and reduce duplication	How can I eliminate duplicated file names and paths in my workflows? How can I make my workflows portable and easily shared?
05:20	9. Scaling a pipeline across a cluster	How do I run my workflow on an HPC system?
06:05	10. Final notes	What are some tips and tricks I can use to make this easier?
06:25	Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.