
Introduction to Workflows with Common Workflow Language: Learner Profiles

The learners this training targets will have a wide range of previous experience, but share a common, possibly recently acquired, interest in automating their analyses. Their reasons will vary: they may want to use distributed computing or clusters, or to handle large datasets and large files. They may be looking for reproducibility, or for ways to document their analyses and cite them adequately.

Learner profiles inform what we teach by describing (fictional) representatives of our target audience. Each profile describes a learner’s background, the problem they need to solve, and how this lesson will help them.

Martha

Early this year, Martha successfully defended her PhD research into localized expression of a transcription factor during mouse embryo development. She recently began a post-doctoral position developing a new experimental method that builds on approaches she used in her doctoral studies. She knows the new method will generate a lot of data, which she will need to analyse in order to make sense of and publish her results. She hopes that, as well as a high-impact paper describing the initial findings, she’ll be able to publish a methods paper describing the novel protocol.

Martha attended a Software Carpentry workshop during her PhD studies and has used the skills she learned there to write a collection of shell scripts that will perform most of the analysis she will need. Each script combines multiple command line tools and some R scripts for statistical analysis and plotting.

In the past, Martha was able to run scripts like these on her research group’s local analysis server. However, this server is shared by all members of the group, and she knows the new, high-throughput protocol she’s developing will produce data on a scale much larger than she’s ever had to deal with before. Additionally, the funding body supporting Martha’s research requires that she and her collaborators publish the methods they use in full, including a complete, reproducible description of the data analysis pipeline.

After following this tutorial, Martha will be able to start converting her collection of shell scripts into a workflow. This will allow her to run the workflow on her university’s HPC cluster, which has far greater capacity than the local server she has been working with up to now. She can also include the workflow description when she publishes her research findings, allowing others to adopt her method in their own research. The workflow runner can also provide provenance information for all her results, allowing her to trace the origin of her findings and report this information to her collaborators and funders.
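To give a flavour of what that conversion involves, here is a minimal, illustrative sketch of a CWL tool description wrapping a single command-line step (wc -l stands in for one of Martha’s scripts; all names here are hypothetical, not part of her actual pipeline):

    cwlVersion: v1.2
    class: CommandLineTool
    # Count the lines in an input file (a stand-in for one analysis step)
    baseCommand: [wc, -l]
    inputs:
      reads:
        type: File          # the data file to process
        inputBinding:
          position: 1       # passed as the first command-line argument
    outputs:
      line_count:
        type: stdout        # capture the command's standard output
    stdout: line_count.txt

A CWL runner such as cwltool can execute this directly, for example with cwltool count_lines.cwl --reads reads.fastq, and tool descriptions like this one become the individual steps of a larger workflow.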

Jorge

Jorge completed a Master’s and a PhD in bioinformatics and loves developing scientific software. He’s just started a new position as a Research Software Engineer in a lab that combines wet- and dry-lab approaches in its research. Although he was hired primarily to lead the development of a tool for analysing images generated by cryo-EM, he’s also inherited several tools and scripts written by a departing post-doc, who had just enough time on their way out the door to show him where everything was saved on the group’s fileshare.

Jorge knows the image-processing tool he’s developing in his main project will need to run in the cloud computing environment associated with the facility where the images will be generated. The size of the images and the number collected will vary greatly over the lifetime of the project, and Jorge wants to make sure the analysis can scale with this demand. He’d love to get his teeth into that problem, but his time always seems to be taken up fixing the pipelines he inherited: updates to several key dependencies have broken two of the most-used scripts in the group’s collection, and Jorge’s boss also wants to replace the short-read mapper used in all of their single-cell RNA-seq analyses.

After following this tutorial, Jorge will be able to describe the scripts and tools in his research group’s collection as CWL workflows, which will be much easier to maintain in the medium to long term. He knows his colleagues will be grateful for the reduced downtime when problems arise, and for the ability to run these tools in other compute environments. He will also be able to describe his own image-processing software, and the workflow it will be part of, in CWL. Once he containerizes his software, the whole workflow can easily be executed in the imaging facility’s cloud environment.
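As a concrete illustration of that last point, CWL lets a tool description declare its container directly, so any runner knows which image to execute it in. A minimal sketch, assuming Jorge publishes his tool as a container image (the image name below is hypothetical):

    hints:
      DockerRequirement:
        dockerPull: quay.io/example/cryoem-tool:1.0   # hypothetical image name

With this hint in place, a CWL runner in the imaging facility’s cloud environment can pull the image and run the tool inside it, with no manual installation of dependencies.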

Shawn

Shawn racked up a lot of hours in the wet lab while studying for his Master’s in molecular biology. Understanding that biological research is becoming increasingly data-driven, he decided to gain more experience of computational research through one of the lab placements required for the course. He’s enjoyed playing around with computers and writing his own programs ever since his fifteenth birthday, when his aunt gave him a Raspberry Pi as a gift, and he likes the idea of combining this hobby with his interest in biology.

For his Master’s thesis project, Shawn has been implementing a pipeline to process and analyse data from a variation on ChIP-seq used by one of his supervisor’s collaborators. A keen follower of emerging technologies herself, his supervisor told him about the benefits of describing the analysis as a workflow, explaining that the analysis would eventually need to be deployed at large scale (though she doesn’t yet know exactly where it would be deployed). So far, Shawn has been developing and testing the pipeline on his laptop, but he knows that soon he’ll need to run it on real data, which will mean connecting to an external server and deploying it there. He’s read a lot about the complications and considerations involved in migrating between environments and is a little intimidated by the task.

After following this tutorial, Shawn will know where to find CWL descriptions for most of the tools in his pipeline, and how to combine these into a workflow. Having a workflow description for his data analysis protocol will allow him and his supervisor to deploy and scale up their analysis on whichever platform their research consortium chooses to adopt. He will also have a clear understanding of the key concepts involved in workflow design and tool description, so he can read more about how CWL describes tools and write descriptions for the remaining steps himself.
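As a rough sketch of what that combining looks like, a CWL workflow connects the inputs and outputs of individual tool descriptions; the tool file and parameter names below are hypothetical:

    cwlVersion: v1.2
    class: Workflow
    inputs:
      raw_reads: File               # the dataset to analyse
    steps:
      trim:
        run: trim_tool.cwl          # a pre-existing tool description
        in:
          reads: raw_reads          # wire the workflow input to the step
        out: [trimmed]
    outputs:
      trimmed_reads:
        type: File
        outputSource: trim/trimmed  # expose the step's output

Each additional step follows the same pattern, taking its inputs from the workflow inputs or from the outputs of earlier steps.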