Learner Profiles

Bobby the Bioinformatician


Bobby is a postdoctoral researcher in bioinformatics who finished his PhD 2 years ago. He attended a Software Carpentry workshop 3 years ago during his PhD, which covered the foundations of the UNIX shell, the Git version control system and the Python programming language.

Bobby wrote a couple of Python data analysis scripts during his PhD, but has not done much coding recently. He now needs to run a set of Python scripts written by another researcher who has recently left his group. These scripts read data in FASTQ format (a text-based file format for storing biological sequence data and the corresponding quality scores) and write out a set of CSV files with summary information. The scripts may require minor bug fixes and improved documentation so that others can more easily understand and reuse them in the future. The scripts are hosted on GitHub, but Bobby has little experience with it, having never shared any of his own work there.

In addition, Bobby’s favourite journal has started requesting that the raw data, scripts and computational environments used to generate the results are submitted along with a manuscript. This is an extra motivation for Bobby to have an up-to-date GitHub repository with the latest working code, data, documentation on how to run the code, results and any supporting information.

Gerry the Geographer


Gerry is a research assistant in human geography, with a Master's degree in geography. During her Master's studies, Gerry wrote some Python scripts for downloading and analysing data.

Gerry has joined a new research project and is now responsible for developing new R analysis scripts that take CSV and JSON geo-referenced time series data from several online sources, perform some statistical analysis and generate visualisations. Some of the data sources Gerry has to use have gaps and inconsistencies and require cleaning first.

A senior researcher has emailed Gerry some sample R code that does the analysis, but it is not well documented. Gerry needs to understand this code, include it in the analysis pipeline, write documentation and make it available to all project stakeholders for comments and reuse.

Philippa the Physicist


Philippa is a Research Fellow with a PhD in particle physics and several years of postdoc experience. She wrote some Fortran modules as an undergraduate, and is a self-taught Python programmer.

She has built a large set of Python routines which perform a novel type of simulation on an HPC system. The simulation is configured via a JSON file and writes out data in a custom binary format. Another set of scripts produces visualisations from the resulting simulation data.

Collaborators on another project have heard about Philippa’s work and would like to reuse her code, but she is the only one who fully understands the whole workflow. Most of the code is undocumented and not very readable, it has no unit tests, and it only exists on Philippa’s machine and her external backup drive.

Philippa wants to help these colleagues and get some credit for her hard work, so she needs to improve the documentation for her code and set up a GitHub project to share her work and give other collaborators (and the wider community) access to it. She also needs to obtain a DOI (Digital Object Identifier) and a citation for her code, so that people can include a reference to it in their work and publications.

Sam the Sociologist


Sam is a Lecturer in sociology. Sam has several datasets on political groups on different social media platforms, stored in JSON format with a mix of different data structures. These datasets require statistical analysis that is proving too complex to carry out in Excel or SPSS (the tools that Sam has mainly used so far).

Sam has never shared his code, apart from emailing it to a few close collaborators. He has now received some funding to employ a software engineering undergraduate student over the summer, and wants to be able to direct them towards best practices for building research software. Sam wants the student to use Python rather than the proprietary SPSS for this project, to make the work more easily reproducible.

In addition, members of Sam’s current group all use Python and could verify and contribute to the code, and maintain it after the summer placement finishes.

Finally, a colleague has told Sam that the code should be shared on GitHub so that the wider community can access and benefit from it. Sam has since learned the foundations of Git and version control, but does not know how to use GitHub or how to license the code and data for reuse.