Preparing Software for Reuse and Release

Overview

Teaching: 35 min
Exercises: 20 min

Questions

What can we do to make our programs reusable by others?

How should we document and license our code?

Objectives

Describe the different levels of software reusability

Explain why documentation is important

Describe the minimum components of software documentation to aid reuse

Create a repository README file to guide others to successfully reuse a program

Understand other documentation components and where they are useful

Describe the basic types of open source software licence

Explain the importance of conforming to data policy and regulation

Prioritise and work on improvements for release as a team

Introduction

In previous episodes we’ve looked at skills, practices, and tools to help us design and develop software in a collaborative environment. In this lesson we’ll be looking at a critical piece of the development puzzle that builds on what we’ve learnt so far - sharing our software with others.

The Levels of Software Reusability - Good Practice Revisited

Let’s begin by taking a closer look at software reusability and what we want from it.

Firstly, whilst we want to ensure our software is reusable by others, as well as ourselves, we should be clear what we mean by ‘reusable’. There are a number of definitions out there, but a helpful one written by Benureau and Rougler in 2017 offers the following levels by which software can be characterised:

Re-runnable: the code is simply executable and can be run again (but there are no guarantees beyond that)
Repeatable: the software will produce the same result more than once
Reproducible: published research results generated from the same version of the software can be generated again from the same input data
Reusable: easy to use, understand, and modify
Replicable: the software can act as an available reference for any ambiguity in the algorithmic descriptions made in the published article. That is, a new implementation can be created from the descriptions in the article that provide the same results as the original implementation, and that the original - or reference - implementation, can be used to clarify any ambiguity in those descriptions for the purposes of reimplementation

Later levels imply the earlier ones. So what should we aim for? As researchers who develop software - or developers who write research software - we should be aiming for at least the fourth one: reusability. Reproducibility is required if we are to successfully claim that what we are doing when we write software fits within acceptable scientific practice, but it is also crucial that we can write software that can be understood and ideally modified by others. If others are unable to verify that a piece of software follows published algorithms, how can they be certain it is producing correct results? Where ‘others’, of course, can include a future version of ourselves.

Documenting Code to Improve Reusability

Reproducibility is a cornerstone of science, and scientists who work in many disciplines are expected to document the processes by which they’ve conducted their research so it can be reproduced by others. In medicinal, pharmacological, and similar research fields for example, researchers use logbooks which are then used to write up protocols and methods for publication.

Many things we’ve covered so far contribute directly to making our software reproducible - and indeed reusable - by others. A key part of this we’ll cover now is software documentation, which is ironically very often given short shrift in academia. This is often the case even in fields where the documentation and publication of research method is otherwise taken very seriously.

A few reasons for this are that writing documentation is often considered:

A low priority compared to actual research (if it’s even considered at all)
Expensive in terms of effort, with little reward
Writing documentation is boring!

A very useful form of documentation for understanding our code is code commenting, and is most effective when used to explain complex interfaces or behaviour, or the reasoning behind why something is coded a certain way. But code comments only go so far.

Whilst it’s certainly arguable that writing documentation isn’t as exciting as writing code, it doesn’t have to be expensive and brings many benefits. In addition to enabling general reproducibility by others, documentation…

Helps bring new staff researchers and developers up to speed quickly with using the software
Functions as a great aid to research collaborations involving software, where those from other teams need to use it
When well written, can act as a basis for detailing algorithms and other mechanisms in research papers, such that the software’s functionality can be replicated and re-implemented elsewhere
Provides a descriptive link back to the science that underlies it. As a reference, it makes it far easier to know how to update the software as the scientific theory changes (and potentially vice versa)
Importantly, it can enable others to understand the software sufficiently to modify and reuse it to do different things

In the next section we’ll see that writing a sensible minimum set of documentation in a single document doesn’t have to be expensive, and can greatly aid reproducibility.

Writing a README

A README file is the first piece of documentation (perhaps other than publications that refer to it) that people should read to acquaint themselves with the software. It concisely explains what the software is about and what it’s for, and covers the steps necessary to obtain and install the software and use it to accomplish basic tasks. Think of it not as a comprehensive reference of all functionality, but more a short tutorial with links to further information - hence it should contain brief explanations and be focused on instructional steps.

Our repository already has a README that describes the purpose of the repository for this workshop, but let’s replace it with a new one that describes the software itself. First let’s delete the old one:

$ rm README.md

In the root of your repository create a replacement README.md file. The .md indicates this is a Markdown file, a lightweight markup language which is basically a text file with some extra syntax to provide ways of formatting them. A big advantage of them is that they can be read as plain-text files or as source files for rendering them with formatting structures, and are very quick to write. GitHub provides a very useful guide to writing Markdown for its repositories.

Let’s start writing README.md using a text editor of your choice and add the following line.

# RiverCatch

So here, we’re giving our software a name. Ideally something unique, short, snappy, and perhaps to some degree an indicator of what it does. We would ideally rename the repository to reflect the new name, but let’s leave that for now. In Markdown, the # designates a heading, two ## are used for a subheading, and so on. The Software Sustainability Institute’s guide on naming projects and products provides some helpful pointers.

We should also add a short description underneath the title.

# RiverCatch
RiverCatch is a data management system written in Python that manages measurement data collected in river catchment surveys and campaigns.

To give readers an idea of the software’s capabilities, let’s add some key features next:

# RiverCatch
RiverCatch is a data management system written in Python that manages measurement data collected in river catchment surveys and campaigns.

## Main features
Here are some key features of Inflam:

- Provide basic statistical analyses of data
- Ability to work on measurement data in Comma-Separated Value (CSV) format
- Generate plots of measurement data
- Analytical functions and views can be easily extended based on its Model-View-Controller architecture

As well as knowing what the software aims to do and its key features, it’s very important to specify what other software and related dependencies are needed to use the software (typically called dependencies or prerequisites):

# RiverCatch
RiverCatch is a data management system written in Python that manages measurement data collected in river catchment surveys and campaigns.

## Main features
Here are some key features of Inflam:

- Provide basic statistical analyses of data
- Ability to work on measurement data in Comma-Separated Value (CSV) format
- Generate plots of measurement data
- Analytical functions and views can be easily extended based on its Model-View-Controller architecture

## Prerequisites
RiverCatch requires the following Python packages:

- [NumPy](https://www.numpy.org/) - makes use of NumPy's statistical functions
- [Pandas](https://pandas.pydata.org/) - makes use of Panda's dataframes
- [GeoPandas](https://geopandas.org/) - makes use of GeoPanda's spatial operations
- [Matplotlib](https://matplotlib.org/stable/index.html) - uses Matplotlib to generate statistical plots

The following optional packages are required to run RiverCatch's unit tests:

- [pytest](https://docs.pytest.org/en/stable/) - RiverCatch's unit tests are written using pytest
- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing

Here we’re making use of Markdown links, with some text describing the link within [] followed by the link itself within ().

One really neat feature - and a common practice - of using many CI infrastructures is that we can include the status of running recent tests within our README file. Just below the # RiverCatch title on our README.md file, add the following (replacing <your_github_username> with your own:

# RiverCatch
![Continuous Integration build in GitHub Actions](https://github.com/<your_github_username>/python-intermediate-catchment/workflows/CI/badge.svg?branch=main)
...

This will embed a badge (icon) at the top of our page that reflects the most recent GitHub Actions build status of your software repository, essentially showing whether the tests that were run when the last change was made to the main branch succeeded or failed.

That’s got us started with documenting our code, but there are other aspects we should also cover:

Installation/deployment: step-by-step instructions for setting up the software so it can be used
Basic usage: step-by-step instructions that cover using the software to accomplish basic tasks
Contributing: for those wishing to contribute to the software’s development, this is an opportunity to detail what kinds of contribution are sought and how to get involved
Contact information/getting help: which may include things like key author email addresses, and links to mailing lists and other resources
Credits/acknowledgements: where appropriate, be sure to credit those who have helped in the software’s development or inspired it
Citation: particularly for academic software, it’s a very good idea to specify a reference to an appropriate academic publication so other academics can cite use of the software in their own publications and media. You can do this within a separate CITATION text file within the repository’s root directory and link to it from the Markdown
Licence: a short description of and link to the software’s licence

For more verbose sections, there are usually just highlights in the README with links to further information, which may be held within other Markdown files within the repository or elsewhere.

We’ll finish these off later. See Matias Singer’s curated list of awesome READMEs for inspiration.

Choosing an Open Source Licence

Software licensing is a whole topic in itself, so we’ll just summarise here. Your institution’s Intellectual Property (IP) team will be able to offer specific guidance that fits the way your institution thinks about software.

In IP law, software is considered a creative work of literature, so any code you write automatically has copyright protection applied. This copyright will usually belong to the institution that employs you, but this may be different for PhD students. If you need to check, this should be included in your employment/studentship contract or talk to your university’s IP team.

Since software is automatically under copyright, without a licence no one may:

Copy it
Distribute it
Modify it
Extend it
Use it (actually unclear at present - this has not been properly tested in court yet)

Fundamentally there are two kinds of licence, Open Source licences and Proprietary licences, which serve slightly different purposes:

Proprietary licences are designed to pass on limited rights to end users, and are most suitable if you want to commercialise your software. They tend to be customised to suit the requirements of the software and the institution to which it belongs - again your institutions IP team will be able to help here.
Open Source licences are designed more to protect the rights of end users - they specifically grant permission to make modifications and redistribute the software to others. The website Choose A License provides recommendations and a simple summary of some of the most common open source licences.

Within the open source licences, there are two categories, copyleft and permissive:

The permissive licences such as MIT and the multiple variants of the BSD licence are designed to give maximum freedom to the end users of software. These licences allow the end user to do almost anything with the source code.
The copyleft licences in the GPL still give a lot of freedom to the end users, but any code that they write based on GPLed code must also be licensed under the same licence. This gives the developer assurance that anyone building on their code is also contributing back to the community. It’s actually a little more complicated than this, and the variants all have slightly different conditions and applicability, but this is the core of the licence.

Which of these types of licence you prefer is up to you and those you develop code with. If you want more information, or help choosing a licence, the Choose An Open-Source Licence or tl;dr Legal sites can help.

Exercise: Preparing for Release

In a (hopefully) highly unlikely and thoroughly unrecommended scenario, your project leader has informed you of the need to release your software within the next half hour, so it can be assessed for use by another team. You’ll need to consider finishing the README, choosing a licence, and fixing any remaining problems you are aware of in your codebase. Ensure you prioritise and work on the most pressing issues first!

Merging into `main`

Once you’ve done these updates, commit your changes, and if you’re doing this work on a feature branch also ensure you merge it into develop, e.g.:

$ git switch develop
$ git merge my-feature-branch

Finally, once we’ve fully tested our software and are confident it works as expected on develop, we can merge our develop branch into main:

$ git switch main
$ git merge develop
$ git push

Tagging a Release in GitHub

There are many ways in which Git and GitHub can help us make a software release from our code. One of these is via tagging, where we attach a human-readable label to a specific commit. Let’s see what tags we currently have in our repository:

$ git tag

Since we haven’t tagged any commits yet, there’s unsurprisingly no output. We can create a new tag on the last commit we did by doing:

$ git tag -a v1.0.0 -m "Version 1.0.0"

So we can now do:

$ git tag

v.1.0.0

And also, for more information:

$ git show v1.0.0

You should see something like this:

tag v1.0.0
Tagger: <Name> <email>
Date:   Fri Dec 10 10:22:36 2021 +0000

Version 1.0.0

commit 2df4bfcbfc1429c12f92cecba751fb2d7c1a4e28 (HEAD -> main, tag: v1.0.0, origin/main, origin/develop, origin/HEAD, develop)
Author: <Name> <email>
Date:   Fri Dec 10 10:21:24 2021 +0000

	Finalising README.

diff --git a/README.md b/README.md
index 4818abb..5b8e7fd 100644
--- a/README.md
+++ b/README.md
@@ -22,4 +22,33 @@ Flimflam requires the following Python packages:
 The following optional packages are required to run Flimflam's unit tests:

 - [pytest](https://docs.pytest.org/en/stable/) - Flimflam's unit tests are written using pytest
-- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
\ No newline at end of file
+- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
+
+## Installation
+- Clone the repo ``git clone repo``
+- Check everything runs by running ``python -m pytest`` in the root directory
+- Hurray 😊
+
+## Contributing
+- Create an issue [here](https://github.com/Onoddil/python-intermediate-inflammation/issues)
+  - What works, what doesn't? You tell me
+- Randomly edit some code and see if it improves things, then submit a [pull request](https://github.com/Onoddil/python-intermediate-inflammation/pulls)
+- Just yell at me while I edit the code, pair programmer style!
+
+## Getting Help
+- Nice try
+
+## Credits
+- Directed by Michael Bay
+
+## Citation
+Please cite [J. F. W. Herschel, 1829, MmRAS, 3, 177](https://ui.adsabs.harvard.edu/abs/1829MmRAS...3..177H/abstract) if you used this work in your day-to-day life.
+Please cite [C. Herschel, 1787, RSPT, 77, 1](https://ui.adsabs.harvard.edu/abs/1787RSPT...77....1H/abstract) if you actually use this for scientific work.
+
+## License
+This source code is protected under international copyright law.  All rights
+reserved and protected by the copyright holders.
+This file is confidential and only available to authorized individuals with the
+permission of the copyright holders.  If you encounter this file and do not have
+permission, please contact the copyright holders and delete this file.
\ No newline at end of file

So now we’ve added a tag, we need this reflected in our Github repository. You can push this tag to your remote by doing:

$ git push origin v1.0.0

What is a Version Number Anyway?

Software version numbers are everywhere, and there are many different ways to do it. A popular one to consider is Semantic Versioning, where a given version number uses the format MAJOR.MINOR.PATCH. You increment the:

MAJOR version when you make incompatible API changes

MINOR version when you add functionality in a backwards compatible manner

PATCH version when you make backwards compatible bug fixes

You can also add a hyphen followed by characters to denote a pre-release version, e.g. 1.0.0-alpha1 (first alpha release) or 1.2.3-beta4 (fourth beta release)

We can now use the more memorable tag to refer to this specific commit. Plus, once we’ve pushed this back up to GitHub, it appears as a specific release within our code repository which can be downloaded in compressed .zip or .tar.gz formats. Note that these downloads just contain the state of the repository at that commit, and not its entire history.

Using features like tagging allows us to highlight commits that are particularly important, which is very useful for reproducibility purposes. We can (and should) refer to specific commits for software in academic papers that make use of results from software, but tagging with a specific version number makes that just a little bit easier for humans.

Conforming to Data Policy and Regulation

We may also wish to make data available to either be used with the software or as generated results. This may be via GitHub or some other means. An important aspect to remember with sharing data on such systems is that they may reside in other countries, and we must be careful depending on the nature of the data.

We need to ensure that we are still conforming to the relevant policies and guidelines regarding how we manage research data, which may include funding council, institutional, national, and even international policies and laws. Within Europe, for example, there’s the need to conform to things like GDPR. It’s a very good idea to make yourself aware of these aspects.

Key Points

The reuse battle is won before it is fought. Select and use good practices consistently throughout development and not just at the end.

previous episode

Intermediate Research Software Development

next episode

Preparing Software for Reuse and Release

Overview

Introduction

The Levels of Software Reusability - Good Practice Revisited

Documenting Code to Improve Reusability

Writing a README

Other Documentation

Choosing an Open Source Licence

Exercise: Preparing for Release

Merging into `main`

Tagging a Release in GitHub

What is a Version Number Anyway?

Conforming to Data Policy and Regulation

Key Points

previous episode

next episode

previous episode

Intermediate Research Software Development

next episode

Preparing Software for Reuse and Release

Overview

Introduction

The Levels of Software Reusability - Good Practice Revisited

Documenting Code to Improve Reusability

Writing a README

Other Documentation

Choosing an Open Source Licence

Exercise: Preparing for Release

Merging into main

Tagging a Release in GitHub

What is a Version Number Anyway?

Conforming to Data Policy and Regulation

Key Points

previous episode

next episode

Merging into `main`