Learn to tame your unruly data processing workflow with Snakemake, a tool for creating reproducible and scalable data analyses. Workflows are described via a human readable, Python-based language. They can be seamlessly scaled to server, cluster, grid, and cloud environments, without the need to modify the workflow definition.
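To give a flavour of what this looks like, below is a minimal sketch of a single Snakemake rule. The rule name, file paths, and `wordcount.py` script are illustrative placeholders, not files from this lesson's materials:

```
# A Snakemake rule names its inputs and outputs, plus a command that
# turns one into the other. Snakemake re-runs the command only when
# the output is missing or older than the input.
rule count_words:
    input:
        "books/isles.txt"
    output:
        "results/isles.dat"
    shell:
        "python wordcount.py {input} {output}"
```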
Starting with some independent analysis tasks, you will explore the limitations of manual processing and shell scripting before building up a reproducible, automated, and efficient workflow step by step with Snakemake. Along the way, you will learn the benefits of modern workflow engines and how to apply them to your own work.
The example workflow performs a frequency analysis of several public domain books sourced from Project Gutenberg, testing how closely each book conforms to Zipf's Law. This example was chosen over a more complex scientific workflow because the goal is to appeal to a wide audience and to focus on building the workflow itself, without getting distracted by the underlying science.
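As a rough illustration of the kind of check involved (this is a sketch, not the lesson's actual analysis code): Zipf's Law predicts that a word's frequency is roughly proportional to 1 divided by its frequency rank, so the most common word should appear about twice as often as the second most common:

```python
# Sketch: count word frequencies in a text and inspect the ratio of
# the two most common words, which Zipf's Law predicts to be ~2.
from collections import Counter

def word_frequencies(path):
    """Return (word, count) pairs sorted from most to least frequent."""
    with open(path) as f:
        words = f.read().lower().split()
    return Counter(words).most_common()

# Hypothetical usage: "dracula.txt" stands in for any downloaded book.
# freqs = word_frequencies("dracula.txt")
# print(freqs[0][1] / freqs[1][1])  # Zipf's Law predicts roughly 2.0
```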
At the end of this lesson, you will:
- Understand the benefits of workflow engines.
- Be able to create reproducible analysis pipelines with Snakemake.
All code and data are provided.
Prerequisites
- Some basic Python programming experience, ideally in Python 3.
- Familiarity with running programs on a command line.
If you need a refresher or an introductory course, the Software Carpentry lessons on Python programming and the Unix shell are good starting points.
Setup
Please follow the instructions on the Setup page.
The files used in this lesson are available for download (see the Setup page). Once downloaded, extract them to the directory you wish to work in for the hands-on exercises.
Solutions for most episodes can be found in the `.solutions` directory inside the code download. A `requirements.txt` file is also included in the download; it can be used to install the required Python packages, as shown below.
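For example, with `pip` available on your system (you may prefer to run this inside a virtual environment):

```
pip install -r requirements.txt
```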