Content from Course introduction


Last updated on 2024-07-04

Overview

Questions

  • What is open, reproducible and FAIR research?
  • Why are these practices important?

Objectives

After completing this episode, participants should be able to:

  • Understand the concept of open and reproducible research
  • Understand why these principles are of value in the research community

Jargon busting


Before we start the course, let’s cover the terminology and explain the terms, phrases, and concepts associated with software development in reproducible research that we will use throughout.

  • Reproducibility - the ability to be reproduced or copied; the extent to which consistent results are obtained when an experiment is repeated (definition from Google’s English dictionary, provided by Oxford Languages)
  • Computational reproducibility - obtaining consistent results using the same input data, computational methods (code), and conditions of analysis; work that can be independently recreated from the same data and the same code (definition by the Turing Way’s “Guide to Reproducible Research”)
  • Reproducible research - the idea that scientific results should be documented in such a way that their deduction is fully transparent (definition from Wikipedia)
  • Open research - research that is openly accessible by others; concerned with making research more transparent, more collaborative, more wide-reaching, and more efficient (definition from Wikipedia)
  • FAIR - an acronym that stands for Findable, Accessible, Interoperable, and Reusable
  • Sustainable software development - software development practice that takes into account the longevity and maintainability of code (e.g. beyond the lifetime of the project), as well as environmental impact, societal responsibility and ethics.

Computational reproducibility

In this course, we use the term “reproducibility” as a synonym for “computational reproducibility”.

What does open and reproducible research mean to you?

Think about the questions below. Your instructors may ask you to share your answers in a shared notes document and/or discuss them with other participants.

  • What do you understand by the words “open” and “reproducible” in the context of research?
  • How many people or groups can you list that might benefit from your work being open and reproducible?
  • How many times did you wish that someone else’s work you came across was more open or accessible to you? Can you provide some examples?

What is reproducible research?


The Turing Way’s “Guide to Reproducible Research” provides an excellent overview of the definitions of “reproducibility” and “replicability” found in the literature, and of their different aspects and levels.

In this course, we adopt the Turing Way’s definitions:

  • Reproducible research: a result is reproducible when the same analysis steps performed on the same data consistently produce the same answer.
    • For example, two different people drop a pen 10 times each and every time it falls to the floor. Or, we run the same code multiple times on different machines and each time it produces the same result.
  • Replicable research: a result is replicable when the same analysis performed on different data produces qualitatively similar answers.
    • For example, instead of a pen, we drop a pencil, and it also falls to the floor. Or, we collect two different datasets as part of two different studies and run the same code over these datasets with the same result each time.
  • Robust research: a result is robust when the same data is subjected to different analysis workflows to answer the same research question and a qualitatively similar or identical answer is produced.
    • For example, I lend you my pen and you drop it out the window, and it still falls to the floor. Or we run the same analysis implemented in both Python and R over the same data and it produces the same result.
  • Generalisable research: combining replicable and robust findings allows us to form generalisable results that are broadly applicable to different types of data or contexts.
    • For example, everything we drop falls; therefore, gravity exists.
The Turing Way project illustration of aspects of reproducible research by Scriberia, used under a CC-BY 4.0 licence, DOI: 10.5281/zenodo.3332807

In this course we mainly address the aspect of reproducibility - i.e. enabling others to run our code to obtain the same results.
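
For example, a script that uses random numbers is only computationally reproducible if the random seed is fixed. Here is a minimal Python sketch (not part of the course’s example project) illustrating this:

PYTHON

import random

# Fixing the seed makes the "random" numbers identical on every run,
# so anyone re-running this code obtains exactly the same result
random.seed(42)

sample = [random.uniform(0, 1) for _ in range(3)]
print(sample)  # prints the same three numbers on every run, on any machine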

We can also further differentiate between:

  • Computational reproducibility: when detailed information is provided about code, software, hardware and implementation details.
  • Empirical reproducibility: when detailed information is provided about non-computational empirical scientific experiments and observations. In practice, this is enabled by making the data and details of how it was collected freely available.
  • Statistical reproducibility: when detailed information is provided, for example, about the choice of statistical tests, model parameters, and threshold values. This mostly relates to pre-registration of study design to prevent p-value hacking and other manipulations.

In this course, we are concerned with computational reproducibility, i.e. applying computer science and software engineering to help solve research problems.

Why do reproducible research?


Scientific transparency and rigor are key factors in research. Scientific methodology and results need to be published openly and replicated and confirmed by several independent parties. However, research papers often lack the full details required for independent reproduction or replication. Many attempts at reproducing or replicating the results of scientific studies have failed in a variety of disciplines ranging from psychology (The Open Science Collaboration (2015)) to cancer sciences (Errington et al (2021)). These are called the reproducibility and replicability crises - ongoing methodological crises in which the results of many scientific studies are difficult or impossible to repeat.

Reproducible research is a practice that ensures that researchers can repeat the same analysis multiple times with the same results. It offers many benefits to those who practice it:

  • Reproducible research helps researchers remember how and why they performed specific tasks and analyses; this enables easier explanation of work to collaborators and reviewers.
  • Reproducible research enables researchers to quickly modify analyses and figures - this is often required at all stages of research and automating this process saves loads of time.
  • Reproducible research enables reusability of previously conducted tasks so that new projects that require the same or similar tasks become much easier and efficient by reusing or reconfiguring previous work.
  • Reproducible research supports researchers’ career development by facilitating the reuse and citation of all research outputs - including both code and data.
  • Reproducible research is a strong indicator of rigor, trustworthiness, and transparency in scientific research. This can increase the quality and speed of peer review, because reviewers can directly access the analytical process described in a manuscript. It increases the probability that errors are caught early on - by collaborators or during the peer-review process, helping alleviate the reproducibility crisis.

However, reproducible research often requires that researchers implement new practices and learn new tools. This course aims to teach some of these practices and tools pertaining to the use of software to conduct reproducible research.

Software in research and research software


Software is fundamental to modern research - some research would be impossible without it. From short, thrown-together temporary scripts written to help with day-to-day research tasks, through an abundance of complex data analysis spreadsheets, to the hundreds of software engineers and millions of lines of code behind international efforts such as the Large Hadron Collider, there are few areas of research where software does not have a fundamental role.

However, it is important to note that not all software that is used in research is “research software”. We define “research software” as software or code that is used to generate, process or analyse research results intended for publication. For example, software used to guide a telescope is not considered “research software”. On the other hand, formulas or macros in spreadsheets used to analyse data are considered “research code”, as they are a form of computer programming that allows one to create, calculate, and change data sets in a number of different ways.

Quote: Research Software includes source code files, algorithms, scripts, computational workflows and executables that were created during the research process or for a research purpose. Software components (e.g., operating systems, libraries, dependencies, packages, scripts, etc.) that are used for research but were not created during or with a clear research intent should be considered software in research and not Research Software.
Definition of “research software” from the FAIR4RS working group, image by the Netherlands eScience Center licensed under CC-BY 4.0

In the software survey conducted by the Software Sustainability Institute in 2014, 92% of researchers indicated they used some kind of software to aid or conduct their research. This was not limited to researchers from computational science (aka scientific computing), the “hard” sciences or to those involved in “traditional” uses of computing infrastructure such as running simulations or developing computational methods. The use of research software is ubiquitous and fairly even across all disciplines.

Research software is increasingly being developed - researchers do not just use “off the shelf” software, and the majority develop their own. To be able to produce quality software that outputs correct and verifiable results and that can be reused over time, researchers require training. This course teaches good practices and reproducible working methods that are agnostic of programming language (although we will use Python code in our examples) and aims to provide researchers with the tools and knowledge to feel confident when writing good quality and sustainable software to support their research. Typically, we think of such software as being FAIR.

In the rest of the course, we will explore what exactly we mean by “FAIR research software”, why it is important and what practices can help us along our “FAIRification” journey.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Acknowledgements and references


The content of this course borrows from or references various work.

Content from FAIR research software


Last updated on 2024-07-04

Overview

Questions

  • What are FAIR research principles?
  • How do FAIR principles apply to software (and data)?

Objectives

After completing this episode, participants should be able to:

  • Explain the FAIR research principles in the context of research software and data
  • Explain why these principles are of value in the research community

Motivation

Think about the questions below. Your instructors may ask you to share your answers in a shared notes document and/or discuss them with other participants.

  • What motivated you to attend this course? Did you come by choice or were you advised to attend?
  • What do you hope to learn or change in your current research software practice? Describe how your knowledge, work or attitude may be different afterwards.

FAIR software


FAIR stands for Findable, Accessible, Interoperable, and Reusable and comprises a set of principles designed to increase the visibility and usefulness of your research to others. The FAIR data principles, first published in 2016, are widely known and applied today. Similar FAIR principles for software have now been defined too. In general, they mean:

  • Findable - software and its associated metadata must be easy to discover by humans and machines.
  • Accessible - in order to reuse software, the software and its metadata must be retrievable by standard protocols, free and legally usable.
  • Interoperable - software must interact with other software by exchanging data and/or metadata through standardised protocols and application programming interfaces (APIs).
  • Reusable - software should be usable (can be executed) and reusable (can be understood, modified, built upon, or incorporated into other software).

Each of the above principles can be achieved by a number of practices listed below. This is not an exact science and the list below is by no means exhaustive, but any of these practices that you employ in your research software workflow will bring you closer to the gold standard of fully reproducible research.

  • Findable
    • Create a description of your software
    • Place your software in a public software repository (and ideally register it in a software registry)
    • Use a unique and persistent identifier (DOI) for your software (e.g. by depositing your code on Zenodo), which is also useful for citations - note that depositing your data/code on GitHub and similar software repositories may not be enough as they may change their open access model or disappear completely in the future
  • Accessible
    • Make sure people can freely, legally and easily get a copy of your software
    • Use coding conventions and comments to make your code readable and understandable by people (once they get a copy of it)
  • Interoperable
    • Explain the functionality of your software
    • Use standard formats for inputs and outputs
    • Communicate with other software via standard protocols and APIs
  • Reusable
    • Document your software (including its functionality, and how to install and run it)
    • Follow best practices for software development (including coding conventions, code readability and verifying its correctness)
    • Test your software and make sure it works on different platforms/operating systems
    • Give a licence to your software clearly stating how it can be reused
    • State how to cite your software

FAIR is a process, not a perfect metric

FAIR is not a binary metric - there is no such thing as “FAIR or not FAIR”.

FAIR is not a perfect metric, nor does it provide a full and exhaustive software quality checklist. Software may be FAIR but still not very good in terms of its functionality.

FAIR is not meant to criticise or discredit work.

FAIR describes a set of principles to aid open and reproducible research, and can be a helpful guide for researchers who want to improve their practices (by helping them see where they are on the FAIR spectrum and supporting them on a journey to make their software more FAIR).

FAIR as a 4-dimensional spectrum with 4 axes - findable, accessible, interoperable and reusable; image by the Netherlands eScience Center licensed under CC-BY 4.0

We are going to explore the above practices on an example software project we will be working on as part of this course.

Challenge

Think of a piece of software you use in your research - any computational tool used for data gathering, modelling & simulation, processing & visualising results or others. If you have a bit of code or software you wrote yourself, in any language, feel free to use that.

Think where on the FAIR spectrum it fits, using the following scale as a guide for each principle:

  • 1 - requires loads of improvement
  • 2 - on a good path, but improvements still needed
  • 3 - decent, a few things could still be improved
  • 4 - very good, only tiny things to improve upon
  • 5 - excellent

Software and data used in this course


We are going to follow a fairly typical experience of a new PhD or postdoc joining a research group. They were emailed some data and analysis code bundled in a .zip archive and written by another group member who works on similar things. They need to be able to install and run this code on their machine, check they understand it and then adapt it to their own project.

As part of the setup for this course, you should have downloaded a .zip archive containing the software project the new research team member was given. Let’s unzip this archive and inspect its content in VS Code. The software project contains:

  1. a JSON file (data.json) with data on extra-vehicular activities (EVAs, or spacewalks) undertaken by astronauts and cosmonauts from 1965 to 2013 (data provided by NASA via its Open Data Portal)
  2. a Python script (my code v2.py) containing some analysis code

The code in the Python script does some common research tasks:

  • Read in the data from the JSON file
  • Change the data from one data format to another and save to a file in the new format (CSV)
  • Perform some calculations to generate summary statistics about the data
  • Make a plot to visualise the data
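
In outline, the script’s logic looks something like the sketch below. This is a simplified, hypothetical version for illustration only - field names such as date are assumptions, and the actual script’s variable names and details differ:

PYTHON

import json
import csv
import matplotlib.pyplot as plt

# Read in the data from the JSON file
with open('data.json', 'r') as json_file:
    data = json.load(json_file)

# Save the data to a file in a different format (CSV)
with open('data.csv', 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=data[0].keys(),
                            extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)

# Generate a summary statistic about the data
print(f'Number of spacewalks: {len(data)}')

# Make a plot to visualise the data, e.g. spacewalks per year
years = [int(record['date'][:4]) for record in data]  # 'date' field assumed
plt.hist(years, bins=range(1965, 2015))
plt.savefig('cumulative_eva_graph.png')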

Let’s have a critical look at this code and think about how FAIR this piece of software is.

Discussion

Compare this data and code to the software you chose earlier. Do you think it is Findable, Accessible, Interoperable and Reusable? Give it a score from 1 to 5 in each category, as in the previous exercise, and then we will discuss it together.

Here are some questions to help you assess where on the FAIR spectrum the code is:

  1. Findable
  • If these files were emailed to you, or sent on a chat platform, or handed to you on a memory stick, how easy would it be to find them again in 6 months, or 3 years?
  • If you asked your collaborator to give you the files again later on, how would you describe them? Do they have a clear name?
  • If more data was added to the data set later, could you explain exactly which data you used in the original analysis?
  2. Accessible
  • If the person who gave you the files left your institution, how would you get access to the files again?
  • Once you have the files, can you understand the code? Does it make sense to you?
  • Do you need to log into anything to use this? Does it require purchase or subscription to a service, platform or tool?
  3. Interoperable
  • Is it clear what kind of input data it can read and what kind of output data is produced? Will you be able to create the input files and read the output files with the tools your community generally uses?
  • If you wanted to use this tool as part of a larger data processing pipeline, does it allow you to link it with other tools in standard ways such as an API or command-line interface?
  4. Reusable
  • Can you run the code on your platform/operating system (is there documentation that covers installation instructions)? What programs or libraries do you need to install to make it work (and which versions)? Are these commonly used tools in your field?
  • Do you have explicit permission to use your collaborator’s code in your own research and do they expect credit of some form (paper authorship, citation or acknowledgement)? Are you allowed to edit, publish or share the files with others?
  • Is the language used familiar to you and people in your research field? Can you read the variable names in the code and the column names in the data file and understand what they mean?
  • Is the code written in a way that allows you to easily modify or extend it? Can you easily see what parameters to change to make it calculate a different statistic, or run on a different input file?

I would give the following scores:

F - 1/5

  • Positive: None
  • Negative: No descriptive name, identifier or version number. No way to find it again except through one person, and they might not remember which file you mean.

A - 2/5

  • Positive: No accounts or paid services needed. Python is free, the data is free and under a shareable license
  • Negative: No way to get the code without that one person. Not clear where the data comes from or what license it has unless you check the URL in the comment.

I - 3/5

  • Positive: CSV and JSON files are common and well documented formats. They are machine- and human-readable. They could be generated by or fed into other programs in a pipeline.
  • Negative: JSON might not be widely used in some fields. No API or CLI.

R - 2/5

  • Positive: Can ask collaborator for explicit permissions for using and modifying and how to credit them, if they did not specify before. Python is a common language.
  • Negative: Python and library versions not specified. Bad variable names, hardcoded inputs, no clear structure or documentation.

Let’s now have a look at the tools and practices commonly used in research that can help us develop software in a more FAIR way.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Open research means the outputs of publicly funded research are publicly accessible with no or minimal restrictions.
  • Reproducible research means the data and software is available to recreate the analysis.
  • FAIR data and software is Findable, Accessible, Interoperable, Reusable.
  • These principles support research and researchers by saving time, reducing barriers to discovery, and increasing impact of the research output.

Content from Tools and practices for research software development


Last updated on 2024-07-04

Overview

Questions

  • What tools are available to help us develop research software in a FAIR way?
  • How do the tools fit together to enable FAIR research?

Objectives

After completing this episode, participants should be able to:

  • Identify some key tools for FAIR research software
  • Explain how these tools can help us work in a FAIR way
  • Install and run these key tools on their own machines

In this course we will introduce you to a group of tools and practices that are commonly used in research to help you develop software in a FAIR way. You should already have these tools installed on your machine following the setup instructions. Here we will give an overview of the tools, how they help you achieve the aims of FAIR research software and how they work together. In later episodes we will describe some of these tools in more detail.

The list below summarises some tools and practices and the FAIR software principles they can help with.

  • Integrated development environments (e.g. VS Code) - environments to develop (run, test, debug) code: Reusable
  • Command line terminal (e.g. Bash) - reproducible workflows/pipelines: Interoperable, Reusable
  • Version control tools: Findable
  • Testing: Interoperable, Reusable
  • Coding conventions and documentation: Accessible, Interoperable, Reusable
  • License: Accessible, Reusable
  • Citation: Findable, Reusable
  • Software repositories (e.g. GitHub): Findable, Accessible

Writing your code


Development environment

One of the first choices we make when writing code is what tool to use to do the writing. You can use a simple text editor such as Notepad, a terminal-based editor with syntax highlighting such as Vim or Emacs, or one of many Integrated Development Environments (IDEs) which give you the tools to write, run and debug your code and inspect the output all in one place. Note that you should not use word processing software such as Microsoft Word or Apple Pages to write code - these tools are designed to create nicely formatted text for humans to read, so they may add or change formatting, or insert invisible characters that the programming language cannot interpret. Try opening a Word document in Notepad to see an example.

This is mostly a personal choice, as an experienced user of any of these tools can write good, FAIR software efficiently, but IDEs are popular because they are designed specifically for writing and running code. There are some language-specific IDEs such as PyCharm, and some that can work with many languages, like VS Code (Visual Studio Code). IDEs also have add-ons that provide extra functionality, such as checking your code as you type (similar to a spell-checker in Word), highlighting when you are not following best practice, or even automatically generating bits of code for you.

In this course we will use the VS Code IDE, as it is free, available on Windows, Mac and Linux operating systems, and can be used with many programming languages. It is a single tool in which we can:

  • view our file system
  • open many kinds of files
  • edit, compile, run and debug code
  • open a terminal to run command line tools (including version control tool Git) or view code outputs

Use VS Code to open the Python script and the data file from our project.

Command line tool/shell

In VS Code and similar IDEs you can often run the code by clicking a button or pressing some keyboard shortcut. If you gave your code to a colleague or collaborator they might use the same IDE or something different, so you cannot guarantee that they will have the same buttons and shortcuts as you.

In the previous episode we mentioned that interoperable software should use standard protocols so that it can integrate with other tools. One of these standard protocols/tools is passing inputs and outputs via the command line terminal, or shell. The command line terminal provides one of the oldest ways of interacting with an operating system, so many programs have command line interfaces. Command line shells, such as Bash and Zsh, have their own language syntax that allows you to write scripts and/or group and chain commands together to build up complex workflows using several programs in different steps. They also use fewer resources than a graphical user interface tool like an IDE, so they are commonly used on high-performance computers and other shared systems where time, memory and processing power are expensive or in high demand.

In this course we will use the Bash shell, which is one of the most common and comes already installed on Mac and Linux operating systems. You can create a command line interface to your program which will allow it to be run on any system that has a Bash shell, and allow users to change things like input and output files or choose different settings or parameters without editing your code. With a command line interface, your code can be built into automated workflows so that the whole process from data gathering to analysis to saving publication-quality results can be written in one Bash script and saved and reused.
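
For example, Python’s built-in argparse module can be used to give a script a simple command line interface (a hypothetical sketch - the example project’s script does not have one yet):

PYTHON

import argparse

# Define the command line interface: which inputs, outputs and options
# a user can set without editing the code
parser = argparse.ArgumentParser(description='Analyse spacewalk data')
parser.add_argument('input_file', help='path to the JSON data file to read')
parser.add_argument('output_file', help='path to write the CSV results to')
parser.add_argument('--plot-file', default='plot.png',
                    help='path to save the output plot to')
args = parser.parse_args()

print(f'Reading data from {args.input_file}')
# ... the analysis would then use args.input_file, args.output_file and
# args.plot_file instead of hardcoded file names

A user (or an automated Bash workflow) could then run something like python eva_data_analysis.py eva-data.json eva-data.csv --plot-file graph.png without touching the code.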

Version control

Version control means knowing what changes were made to your code and when. Many people who have worked on large documents such as essays start doing this by saving files called essay_draft, essay_version1.doc, essay_version2.doc, and so on. This can work on a small scale, but most people find it quickly becomes confusing to remember in which version a certain change was made, or which version is the one a supervisor gave feedback on. Using a version control system helps you keep track of changes, including when you might be working on shared code being edited by more than one person at a time.

It also lets you assign version numbers or tags to particular versions so you can then use those to refer back to them later. For example, you can run your code and output some results and add a comment to your output that those results were produced by version 2.4 of your code, so if you try to run the same thing later and find it is different, you can check if it is a change in the code due to using a newer version, or a change in the data, or something else.
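
For example, you could record the code version in every output your script produces, so results can always be traced back to the code that generated them. A hypothetical Python sketch (in practice the version would correspond to a Git tag):

PYTHON

__version__ = '2.4'  # would normally match a Git tag, e.g. v2.4

# Stamp the output file with the version of the code that produced it
with open('results.txt', 'w') as results_file:
    results_file.write(f'Produced by analysis code version {__version__}\n')
    results_file.write('Total spacewalks: 375\n')  # illustrative value only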

We will be using the Git version control system, which can be used through the command line terminal, in a browser or in a desktop application.

Testing

Testing ensures that your code is correct and does what it sets out to do. When you write code you often feel very confident that it is perfect, but when writing bigger programs or code that is meant to perform complex operations it is very hard to consider all possible edge cases or notice every single typing mistake. Testing also gives other people confidence in your code as they can see an example of how it is meant to run and be assured that it does work correctly on their machine.

We will show different ways to test your code for different purposes. You need to think about what it is that is important to you and any future users or collaborators to decide what kind of testing is most useful for you.
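
For example, a test for a small function that converts a duration written as “HH:MM” into a number of hours might look like this (a hypothetical sketch - this function is not part of the example script):

PYTHON

def text_to_duration(duration):
    """Convert a duration string 'HH:MM' into a number of hours."""
    hours, minutes = duration.split(':')
    return int(hours) + int(minutes) / 60

def test_text_to_duration():
    # Each assert fails loudly if the function's behaviour changes
    assert text_to_duration('1:30') == 1.5
    assert text_to_duration('0:00') == 0

test_text_to_duration()  # raises an AssertionError if a test fails
print('All tests passed')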

Documentation

Documentation comes in many forms - from the names of variables and functions in your code, through additional comments that explain some lines, up to a whole website full of documentation with function definitions, usage examples, tutorials and guides. You may not need as much documentation as a large commercial software product, but making your code Reusable relies on other people being able to understand what your code does and how to use it.
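
In Python, for example, documentation can live right next to the code in the form of docstrings. Here is a minimal sketch (the function name and the use of pandas are illustrative, not taken from the example script):

PYTHON

import pandas as pd

def read_json_to_dataframe(input_file):
    """Read spacewalk data from a JSON file into a pandas DataFrame.

    Args:
        input_file (str): Path to the JSON file to read.

    Returns:
        pandas.DataFrame: One row per spacewalk record.
    """
    return pd.read_json(input_file)

# The docstring is what help(read_json_to_dataframe) and documentation
# tools such as Sphinx will display to users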

Licences and citation

A licence states what people can legally do with your code, and what restrictions you have placed on it. Whenever you want to use someone else’s code, you should check what licence it has and make sure your use is legal. You may be restricted by your institution, your grant funding, or by other tools you use that require certain licences for you to be compatible with them.

Having citation instructions is not a legal requirement, but if you want to get academic credit for your work, you can make other people’s lives much easier by telling them how you would like to be credited. Similarly, following citation instructions ensures that when you use others’ code in your research, they are credited accordingly.

Both licensing law and citation procedures can vary depending on your country and institution, so remember to check with a local team where you are. Your local research IT support or library teams would be a good place to start.

Code repositories and registries

Having somewhere to share your code is fundamental to making it Findable. Your institution might have a code repository, your research field may have a practice of sharing code via a specific website or journal, or your version control system might include an online component that makes sharing different versions of your code easy. Again, remember to check the rules of your institution and grant on publishing code, and any licenses for code you depend on or reuse.

We will discuss later how to share your code on GitHub and make it easy for others to find and use.

Summary


Findable

  • Describe your software - README
  • Software repository/registry - GitHub, registries
  • Unique persistent identifier - GitHub commits/tags/releases, Zenodo

Accessible

  • Software repository/registry
  • License
  • Language and dependencies

Interoperable

  • Explain functionality - readme, inline comments and documentation
  • Standard formats
  • Communication protocols - CLI/API

Reusable

  • Document functionality/installation/running
  • Follow best practices where appropriate
  • License
  • Citation

Checking your setup


Challenge

Open a command line terminal and look at the prompt. Compare what you see in the terminal with your neighbour: does it look the same or different? What information is it telling you and why might this be useful? What other information might you want?

Run the following commands in a terminal to check you have installed the tools listed in the Setup page. Compare the output with your neighbour and see if you can see any differences.

Checking the command line terminal:

  1. date
  2. echo $SHELL
  3. pwd
  4. whoami

Checking Python:

  1. python --version
  2. python3 --version
  3. which python
  4. which python3

Checking Git and GitHub:

  1. git --help
  2. git config --list
  3. ssh -T git@github.com

Checking VS Code:

  1. code
  2. code --list-extensions

The prompt is the $ character and any text that comes before it, shown at the start of every new line before you type in commands. Type each of the commands one at a time and press enter. They should give you a result by printing some text in the terminal.

The expected output of each command is:

  1. Today’s date
  2. bash or zsh - this tells you what shell language you are using. In this course we show examples in Bash.
  3. Your “present working directory” or the folder where your shell is running
  4. Your username
  5. In this course we are using Python 3. If python --version gives you Python 2.x you may have two versions of Python installed on your computer and need to be careful which one you are using.
  6. Use this command to be certain you are using Python version 3, not 2, if you have both installed.
  7. The file path to where the Python version you are calling is installed.
  8. If you have more than one version these should be different paths; if both 5. and 6. gave the same result then 7. and 8. should match as well (see the short Python sketch after this list).
  9. The help message explaining how to use the git command.
  10. You should have user.name, user.email and core.editor set in your Git configuration. Check that the editor listed is one you know how to use.
  11. This checks if you have set up your connection to GitHub correctly. If it says permission denied, you may need to look at the instructions for setting up SSH keys again on the Setup page.
  12. This should open VSCode in your current working directory. macOS users may need to first open VS Code and add it to the PATH.
  13. You should have the extensions GitLens, Git Graph, Python, JSON and Excel Viewer installed to use in this course.
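
You can also check which Python interpreter you are running from within Python itself, which can help diagnose a mismatch between python and python3 (a small sketch):

PYTHON

import sys

print(sys.version)     # full version string of the running interpreter
print(sys.executable)  # path to the interpreter; should match 'which python3'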

You may have noticed that our researcher has received the software project they are meant to be working on as a .zip archive via email. In the next episode, we will learn a better practice for sharing and tracking changes to a software project using the version control software Git and the project sharing and collaboration platform GitHub.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Automating your analysis with shell scripts allows you to save and reproduce your methods.
  • Version control helps you back up your work, see how data and code change over time and identify which analysis used which data and code.
  • Programming languages each have advantages and disadvantages in different situations. Use the correct tools for your own work.
  • Integrated development environments (IDEs) automate many coding tasks, provide easy access to documentation, and can identify common errors.
  • Testing helps you check that your code is behaving as expected and will continue to do so in the future or when used by someone else.

Content from Version control


Last updated on 2024-07-04

Overview

Questions

  • What is a version control system?
  • How can a version control system help make my work reproducible?
  • What does a standard version control workflow look like?

Objectives

After completing this episode, participants should be able to:

  • Create self-contained commits using Git to incrementally save work
  • Inspect logs to review the history of work
  • Push new work from a local machine to a remote server

In this episode, we will begin to cover the basics of version control and explore how this tool assists us in producing reproducible and sustainable scientific projects. We will create a new software project from our existing code, make some changes to it and track them with version control, and then push those changes to a remote server for safe-keeping.

What is a version control system?

Version control is the practice of tracking and managing changes to files. Version control systems are software tools that assist in the management of these file changes over time. They keep track of every modification to the files in a special database that allows users to “travel through time” and compare earlier versions of the files with the current state.

Motivation for using a version control system

The main motivation for us as scientists to use version control in our projects is reproducibility. As hinted at above, by tracking and storing every change we make, we can more effectively restore the state of the project at any point in time. This is incredibly useful if we want to reproduce results from a specific version of the code, or track down changes that broke some functionality.

The other benefit we gain is that version control provides us with the provenance of the project. As we make each change, we also leave a message about what the change was and why it was made. This improves the transparency of the project and makes it auditable, which is good scientific practice.

Later on in this workshop, we will also see how using a version control system allows many people to collaborate on the same project without a lot of manual effort to combine different items of work.

Git version control system

Git is one of the most widely used version control systems and the one we will be using in this course. It is primarily used for source code management in software development, but it can be used to track changes in files in general - it is particularly effective for tracking text-based files (e.g. source code files in any programming language, CSV, Markdown, HTML, CSS, TeX, etc.).

The diagram below shows a typical software development lifecycle with Git (starting from making changes locally) and the commonly used commands to interact with different parts of the Git infrastructure. We will cover all of the commands below during this course, this is just a high level overview.

Software development lifecycle with Git, showing commonly used Git commands and the flow of changes between the working directory, staging area, local repository and remote repository
  • working directory - a local directory (including any subdirectories) where your project files live and where you are currently working. It is also known as the “untracked” area of Git. Any changes to files will be marked by Git in the working directory. If you make changes to the working directory and do not explicitly tell Git to save them - you will likely lose those changes. Using git add FILENAME command, you tell Git to start tracking changes to file FILENAME within your working directory.
  • staging area (index) - once you tell Git to start tracking changes to files (with git add FILENAME command), Git saves those changes in the staging area on your local machine. Each subsequent change to the same file needs to be followed by another git add FILENAME command to tell Git to update it in the staging area. To see what is in your working directory and staging area at any moment (i.e. what changes is Git tracking), run the command git status.
  • local repository - stored within the .git directory of your project locally, this is where Git wraps together all your changes from the staging area and puts them using the git commit command. Each commit is a new, permanent snapshot (checkpoint, record) of your project in time, which you can share or revert to.
  • remote repository - this is a version of your project that is hosted somewhere on the Internet (e.g., on GitHub, GitLab or somewhere else). While your project is nicely version-controlled in your local repository, and you have snapshots of its versions from the past, if your machine crashes - you still may lose all your work. Furthermore, you cannot share or collaborate on this local work with others easily. Working with a remote repository involves pushing your local changes remotely (using git push) and pulling other people’s changes from a remote repository to your local copy (using git fetch or git pull) to keep the two in sync in order to collaborate (with a bonus that your work also gets backed up to another machine). Note that a common best practice when collaborating with others on a shared repository is to always do a git pull before a git push, to ensure you have any latest changes before you push your own.

Git is a distributed version control system allowing for multiple people to be working on the same project (even the same file) at the same time. Initially, we will use Git to start tracking changes to files on our local machines; later on we will start sharing our work on GitHub allowing other people to see and contribute to our work.

Create a new repository

Create a new directory in the Desktop folder for our work, and then change the current working directory to the newly created one:

BASH

$ cd ~/Desktop
$ mkdir spacewalks
$ cd spacewalks

We tell Git to make spacewalks a repository – a place where Git can store versions of our files:

BASH

git init

We can check everything is set up correctly by asking Git to tell us the status of our project:

BASH

$ git status

OUTPUT

On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

The exact wording of this output may be slightly different if you are using a different version of Git.

Add initial files into our repository

During the setup for this course, you have been provided with a .zip archive with two files:

  • my code v2.py
  • data.json

We need to move these files into our git folder. You can either drag and drop the files from a file explorer window into the left pane of the VS Code IDE, or you can use the mv command in the command line terminal.

BASH

mv /path/where/you/saved/the/file/my\ code\ v2.py ~/Desktop/spacewalks/
mv /path/where/you/saved/the/file/data.json ~/Desktop/spacewalks/

Let’s see what that has done to our repository by running git status again:

BASH

git status

OUTPUT

On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data.json
	my code v2.py

nothing added to commit but untracked files present (use "git add" to track)

This is telling us that Git has noticed the new files. The “untracked files” message means that there is a file in the directory that Git isn’t keeping track of. We can tell Git to track a file using git add:

BASH

$ git add my\ code\ v2.py
$ git add data.json

and then check the right thing happened:

BASH

$ git status

OUTPUT

On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   data.json
	new file:   my code v2.py

Git now knows that it’s supposed to keep track of my code v2.py and data.json, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:

BASH

$ git commit -m "Add an example script and dataset to work on"

OUTPUT

[main (root-commit) bf55eb7] Add an example script and dataset to work on
 2 files changed, 437 insertions(+)
 create mode 100644 data.json
 create mode 100644 my code v2.py

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently in a special .git directory. This permanent copy is called a commit (or revision).

We use the flag -m (for message) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we only run git commit without the -m option, Git will launch a text editor so that we can write a longer message.

Good commit messages start with a brief (<50 characters) statement about the changes made in the commit. Generally, the message should complete the sentence “If applied, this commit will…”. If you want to go into more detail, add a blank line between the summary line and your additional notes. Use this additional space to explain why you made changes and/or what their impact will be.

If we run git status now, we see:

BASH

$ git status

OUTPUT

On branch main
nothing to commit, working tree clean

This tells us that everything is up to date.

Where are my changes?

If we run ls at this point, we will still see only two files, the script and the dataset. That’s because Git saves information about files’ history in the special .git directory mentioned earlier so that our filesystem does not become cluttered (and so that we cannot accidentally edit or delete an old version).

Make a change

Did you notice how, when we were typing the name of the Python script in the terminal, we had to add a backslash before the space, like this: my\ code\ v2.py? Using a backslash in this way is called ‘escaping’ and it lets the terminal know to treat the space as part of the filename, and not a separate argument. However, it is pretty annoying and considered bad practice to have spaces in your filenames like this, especially if you will be manipulating them from the terminal. So, let’s go ahead and remove the space from the filename altogether and replace it with a hyphen instead. You can use the mv command again like so:

BASH

$ mv my\ code\ v2.py my-code-v2.py

If you run git status again, you’ll see Git has noticed the change in the filename.

BASH

$ git status

OUTPUT

On branch main
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    my code v2.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	my-code-v2.py

no changes added to commit (use "git add" and/or "git commit -a")

Add and commit the changed file

Using the Git commands demonstrated so far, save the change you just made to the Python script.

Remember, commit messages should be descriptive and complete the sentence “If applied, this commit will…”. You can also use git status to check the status of your project at any time.

To save the changes to the renamed Python file, use the following Git commands:

BASH

$ git add my\ code\ v2.py my-code-v2.py
$ git status

OUTPUT

On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	renamed:    my code v2.py -> my-code-v2.py

BASH

$ git commit -m "Replace spaces in Python filename with hyphens"

OUTPUT

[main 8ea2a0b] Replace spaces in Python filename with hyphens
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename my code v2.py => my-code-v2.py (100%)

Advanced solution

We initially renamed the Python file using the mv command, and we then had to git add both my-code-v2.py and my\ code\ v2.py. Alternatively, we could have used Git’s own mv command like so:

BASH

$ git mv my\ code\ v2.py my-code-v2.py
$ git status

OUTPUT

On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	renamed:    my code v2.py -> my-code-v2.py

git mv is the equivalent of running mv ... followed immediately by git add ... of the old and new filenames, so the changes have been staged automatically. All that needs to be done is to commit them.

BASH

$ git commit -m "Replace spaces in Python filename with hyphens"

OUTPUT

[main 6499bd7] Replace spaces in Python filename with hyphens
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename my code v2.py => my-code-v2.py (100%)

Rename our data and output files

Now that we have seen how to rename files in Git, let’s:

  1. give our input data file and script more meaningful names and
  2. choose informative file names for our output data file and plot.

First let’s update file names in our script.

PYTHON

# https://data.nasa.gov/resource/eva.json (with modifications)
data_f = open('./eva-data.json', 'r')   # input data file
data_t = open('./eva-data.csv', 'w')    # output data file
g_file = './cumulative_eva_graph.png'   # output plot file

Now, let’s actually rename our files on the file system using Git and commit our changes.

BASH

git mv data.json eva-data.json
git mv my-code-v2.py eva_data_analysis.py
git add eva_data_analysis.py
git status

OUTPUT

On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	renamed:    data.json -> eva-data.json
	renamed:    my-code-v2.py -> eva_data_analysis.py

Finally, let’s commit our changes:

BASH

git commit -m "Implement informative file names"

Commit messages

We have already met the concept of commit messages when we made and stored changes to our code files. Commit messages are short descriptions of, and the motivation for, what a commit will achieve. It is therefore important to take some time to ensure these commit messages are helpful and descriptive, as when work is reviewed (by your future self or a collaborator) they provide the context about what changes were made and why. This can make tracking down specific changes in commits much easier, without having to inspect the code or files themselves.

Generally, commit messages should complete the sentence “If applied, this commit will…”. Most often a short, 50 character (ish) title will suffice, but a longer-form description of the changes can also be provided by leaving a blank space between the summary line and the rest of the message. There are many different conventions that can be used for commit messages that range from very structured (such as conventional commits) to the fun (such as gitmoji). The important thing is that it is clear to the reader what a commit is doing and why. If a project is using a specific commit message convention, this will often be described in their contributing guidelines.

Good commit messages

Read the two commit messages below. In pairs or small groups, discuss which messages help you understand more about what the commit author did. What about the commit messages do you find helpful or not?

  1. OUTPUT

       [main 7cf85f6] Change variable
         1 file changed, 1 insertion(+), 1 deletion(-)
  2. OUTPUT

       [main 8baf69d] Change variable name from columns to column_headers
        1 file changed, 1 insertion(+), 1 deletion(-)

Commit message (2) is the better commit message since it is more descriptive about what the author did. This message could be improved further by adding a blank line and then a longer description discussing, for example, why the variable name was changed.

Self-contained commits

If we want our commit messages to be descriptive and help us understand the changes in the project over time, then we also have to make commits that are very self-contained. That is to say that each commit we make should only change one, logical thing. By “logical” here, we mean that one aspect of updating the files has been achieved to completion - such as adding docstrings or refactoring a function - we don’t mean that changes are committed line-by-line. See the “Things to avoid when creating commits” section of Openstack’s “Git Commit Good Practice” documentation for examples of logical, self-contained commits, and commits that don’t follow this practice.

Self-contained commits are important because: they make changes easier to review if each commit tackles only one step; if code breaks, tracking down the specific change that caused the break is simpler; and if you need to undo changes, you can remove them in small increments, rather than losing a lot of unrelated work along with the change you do want to remove.

Understanding commit contents

Below are the diffs of two commits. A diff shows the differences in a file (or files!) compared to the previous commit in the history so you can see what has changed. Lines that begin with + represent additions, and lines that begin with - represent deletions. Compare these two commit diffs. Can you understand what the commit author was trying to achieve in each commit? How many changes have they tried to make in each commit? Discuss in pairs or small groups.

  1. Example Diff 1
  2. Example Diff 2

To find out more about how to generate diffs, you can read the Git documentation or the Tracking Changes episode from the Software Carpentry Version control with Git lesson.

The git diff presented in option (1) is cleaner. The author has only tackled one thing: placing the import statements at the top of the file. This kind of commit is much easier to review in isolation, and will be easier to track down if git bisect (a tool that searches the history for the commit that introduced a bug) is required.

Git logs

If we want to know what we’ve done recently, we can ask Git to show us the project’s history using git log:

BASH

$ git log

OUTPUT

commit 6499bd731ab50fde2731ce2642f143cea86450b6 (HEAD -> main)
Author: Sarah Gibson <drsarahlgibson@gmail.com>
Date:   Mon Jun 17 11:55:17 2024 +0100

    Replace spaces in Python filename with hyphens

commit bf55eb7639a6508658aaa1bfeaeb9f115d1bcc40
Author: Sarah Gibson <drsarahlgibson@gmail.com>
Date:   Mon Jun 17 11:52:02 2024 +0100

    Add an example script and dataset to work on

This output demonstrates why it is important to write meaningful and descriptive commit messages. Without the messages, we will only have the commit hashes (the strings of random numbers and letters after “commit”) to identify each commit, which would make it very hard to find the changes we care about.

We may need to inspect our recent commits to establish where a bug was introduced or because we have decided that our recent work isn’t suitable and we wish to discard it and start again. Once we have identified the last commit we want to keep, we can revert the state of our project back to that commit with a few different methods:

  • git revert: This command reverts a commit by creating a new commit that reverses the action of the supplied commit or list of commits. Because this command creates new commits, your Git history is more complete and tells the story of exactly what work you did, i.e., deciding to discard some work.
  • git reset: This command will recover the state of the project at the specified commit. What is done with the changes you have made since that commit is defined by some optional flags:
    • --soft: Any changes you have made since the specified commit would be preserved and left as “Changes to be committed”
    • --mixed: Any changes you have made since the specified commit would be preserved but not marked for commit (this is the default action)
    • --hard: Any changes you have made since the specified commit are discarded.

Using the git reset command produces a “cleaner” history, but it does not tell the full story of your work.

Pushing to a Git server

Git is also a distributed version control system, allowing us to synchronise work between any two or more copies of the same repository, including ones that are not located on your machine. So far we have been working with a project on our local machines and, even though we have been incrementally saving our work in a way that is recoverable (version control), if anything happened to our laptops, the whole project would be lost. However, we can use the distribution aspect of Git to push our projects and histories to a server (someone else’s computer) so that they are accessible and retrievable if the worst were to happen to our machines.

Git - distributed version control system: two developers’ repositories linked to a central repository and one another; image from W3Docs (freely available)

Distributing our projects in this way also opens us up to collaboration, since colleagues would be able to access our projects, make their own copies on their machines, and conduct their own work.

We will now go through how to push a local project to GitHub, though other Git hosting services are available, such as GitLab and Bitbucket.

  1. In your browser, navigate to https://github.com and sign into your account

  2. In the top right hand corner of the screen, there is a menu labelled “+” with a dropdown. Click the dropdown and select “New repository” from the options.

    Creating a new GitHub repository
  3. You will be presented with some options to fill in or select while creating your repository. In the “Repository Name” field, type “spacewalks”. This is the name of your project and matches the name of your local folder.

    Naming the GitHub repository

    Ensure the visibility of the repository is “Public” and leave all other options blank. Since this repository will be connected to a local repository, it needs to be empty which is why we don’t initialise with a README or add a license or .gitignore file. Click “Create repository” at the bottom of the page.

    Complete GitHub repository creation
  4. Now you have created your repository, you need to send the files and the history you have stored on your local computer to GitHub’s servers. GitHub provides some instructions on how to do that for different scenarios. You want to use the instructions under the heading “…or push an existing repository from the command line”. These instructions will look like this:

    BASH

    git remote add origin https://github.com/<YOUR_GITHUB_HANDLE>/spacewalks.git
    git branch -M main
    git push -u origin main

    You can copy these commands using the button that looks like two overlapping squares to the right-hand side of the commands. Paste them into your terminal and run them.

    Copy the commands to sync the local and remote repositories
  5. If you refresh your browser window, you should now see the two files my-code-v2.py and data.json visible in the GitHub repository, matching what you have locally on your machine.

Let’s explain a bit more about what those commands did…

BASH

git remote add origin https://github.com/<YOUR_GITHUB_HANDLE>/spacewalks.git

This command tells Git to create a remote called “origin” and link it to the URL of your GitHub repository. A remote is a version control concept where two (or more) repositories are connected to each other in such a way that they can be kept in sync by exchanging commits. “origin” is the name used to refer to the remote repository. It could be called anything, but “origin” is a convention often used by default in Git and GitHub, since it indicates which repository is considered the “source of truth” - particularly useful when many people are collaborating on the same repository.

BASH

git branch -M main

git branch is a command used to manage branches. We will discuss branches later on in the course. This command ensures the branch we are working on is called “main”. This will be the default branch of the project for everyone working on it.

BASH

git push -u origin main

The git push command is used to update remote references with any changes you have made locally. This command tells Git to update the “main” branch on the “origin” remote. The -u flag (short for --set-upstream) will set a tracking reference, so that in the future git push can be run without the need to specify the remote and reference name.
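
For example, once the upstream has been set by this first push, subsequent updates can be published with the short form:

BASH

git push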

Terminology

In pairs or small groups, discuss the difference between the terms remote and origin. What is the definition of each term?

  • remote: a version control concept where two (or more) repositories are linked together in such a way that they can be kept in sync by exchanging commits
  • origin: a common Git/GitHub naming convention for the remote repository to designate the source of truth for collaborators

Summary

During this episode, we have covered the basics of using version control to track changes to our projects. We have seen how to: create new projects, incrementally save progress, construct informative commit messages and content, inspect the history of our projects, retrieve past states, and push our projects to distributed servers.

These skills are critical to reproducible and sustainable science since using version control makes our work self-documenting - the commit messages provide the narrative of what changed and why - and we can recover the exact state of our projects at a specific time, so we can more reliably run the “same” code again. We can also back up our projects by pushing them to distributed, remote servers and reduce the risk of data loss over the course of a project’s lifetime.

Version control is a vast topic and we have only covered the absolute basics here as a brief introduction. For a deeper and more complete dive into the subject matter, please see the “Further reading” section below.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • A version control system is software that tracks and manages changes to a project over time
  • Using version control aids reproducibility since the exact state of the software that produced an output can be recovered
  • A commit represents the smallest unit of change to a project
  • Commit messages describe what each commit contains and should be descriptive
  • Logs can be used to overview the history of a project

Content from Code readability


Last updated on 2024-07-04 | Edit this page

Overview

Questions

  • Why does code readability matter?
  • How can I organise my code to be more readable?
  • What types of documentation can I include to improve the readability of my code?

Objectives

After completing this episode, participants should be able to:

  • Organise code into reusable functions that achieve a singular purpose
  • Choose function and variable names that help explain the purpose of the function or variable
  • Write informative inline comments and docstrings to provide more detail about what the code is doing

In this episode, we will introduce the concept of readable code and consider how it can help create reusable scientific software and empower collaboration between researchers.

Motivation for code readability


When someone writes code, they do so based on requirements that are likely to change in the future. Requirements change because software interacts with the real world, which is dynamic. When these requirements change, the developer (who is not necessarily the same person who wrote the original code) must implement the new requirements. They do this by reading the original code to understand the different abstractions and identify what needs to change. Readable code makes it easier to read and understand those abstractions and, as a result, facilitates the evolution of the codebase. Readable code saves future developers’ time and effort.

In order to develop readable code, we should ask ourselves: “If I re-read this piece of code in fifteen days or one year, will I be able to understand what I have done and why?” Or even better: “If a new person who just joined the project reads my software, will they be able to understand what I have written here?”

We will now learn about a few software best practices we can follow to help create readable code.

Code layout


Our script currently places import statements throughout the code. Python convention is to place all import statements at the top of the script so that dependent libraries are clearly visible and not buried inside the code (even though there are standard ways of declaring dependencies, e.g. using a requirements.txt file). This will help the readability (accessibility) and reusability of our code.

Our code after the modification should look like the following.

PYTHON

import json
import csv
import datetime as dt
import matplotlib.pyplot as plt

# https://data.nasa.gov/resource/eva.json (with modifications)
data_f = open('./eva-data.json', 'r')
data_t = open('./eva-data.csv','w')
g_file = './cumulative_eva_graph.png'   
fieldnames = ("EVA #", "Country", "Crew    ", "Vehicle", "Date", "Duration", "Purpose")

data=[]

for i in range(374):
    line=data_f.readline()
    print(line)
    data.append(json.loads(line[1:-1]))
#data.pop(0)
## Comment out this bit if you don't want the spreadsheet

w=csv.writer(data_t)

time = []
date =[]

j=0
for i in data:
    print(data[j])
    # and this bit
    w.writerow(data[j].values())
    if 'duration' in data[j].keys():
        tt=data[j]['duration']
        if tt == '':
            pass
        else:
            t=dt.datetime.strptime(tt,'%H:%M')
            ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60)
            print(t,ttt)
            time.append(ttt)
            if 'date' in data[j].keys():
                date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d'))
                #date.append(data[j]['date'][0:10])

            else:
                time.pop(0)
    j+=1

t=[0]
for i in time:
    t.append(t[-1]+i)

date,time = zip(*sorted(zip(date, time)))

plt.plot(date,t[1:], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(g_file)
plt.show()

Let’s make sure we commit our changes.

BASH

git add eva_data_analysis.py
git commit -m "Move import statements to the top of the script"

OUTPUT

[main a97a9e1] Move import statements to the top of the script
 1 file changed, 4 insertions(+), 4 deletions(-)

Standard libraries


Our script currently reads the data line-by-line from the JSON data file and uses custom code to manipulate the data. Variables of interest are stored in lists but there are more suitable data structures (e.g. data frames) to store data in our case. By choosing custom code over standard and well-tested libraries, we are making our code less readable and understandable and more error-prone.

The main functionality of our code can be rewritten as follows using the Pandas library to load and manipulate the data in data frames.

PYTHON

import pandas as pd
import matplotlib.pyplot as plt


data_f = './eva-data.json'
data_t = './eva-data.csv'
g_file = './cumulative_eva_graph.png'

data = pd.read_json(data_f, convert_dates=['date'])
data['eva'] = data['eva'].astype(float)
data.dropna(axis=0, inplace=True)
data.sort_values('date', inplace=True)

data.to_csv(data_t, index=False)

data['duration_hours'] = data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
data['cumulative_time'] = data['duration_hours'].cumsum()
plt.plot(data['date'], data['cumulative_time'], 'ko-')
plt.xlabel('Year')
plt.ylabel('Total time spent in space to date (hours)')
plt.tight_layout()
plt.savefig(g_file)
plt.show()

We should replace the existing code in our Python script eva_data_analysis.py with the above code and commit the changes. Remember to use an informative commit message.

BASH

git status
git add eva_data_analysis.py
git commit -m "Refactor code to use standard libraries"

OUTPUT

[main 0ba9b04] Refactor code to use standard libraries
 1 file changed, 11 insertions(+), 46 deletions(-)

Command-line interface to code


Let’s add a command-line interface to our script to allow us to pass the data file to be read and the output file to be written as parameters to our script and avoid hard-coding them. This improves the interoperability and reusability of our code as it can now be run from the command-line terminal and integrated into other scripts or workflows/pipelines (e.g. another script can produce our input data and be “chained” with our code in a data analysis pipeline).

We will use Python’s sys module to read the command-line arguments passed to our script, which are made available in our code as the list sys.argv. The first element of the list is the name of the script itself, and the following elements are the arguments passed to the script.
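
As a quick illustration (demo.py is a hypothetical script name, not part of our project), printing sys.argv shows this structure:

PYTHON

import sys

# Running `python demo.py eva-data.json eva-data.csv` would print:
# ['demo.py', 'eva-data.json', 'eva-data.csv']
print(sys.argv)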

Our modified code will now look as follows.

PYTHON

import pandas as pd
import matplotlib.pyplot as plt
import sys


if __name__ == '__main__':

    if len(sys.argv) < 3:
        data_f = './eva-data.json'
        data_t = './eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        data_f = sys.argv[1]
        data_t = sys.argv[2]
        print('Using custom input and output filenames')

    g_file = './cumulative_eva_graph.png'

    print(f'Reading JSON file {data_f}')
    data = pd.read_json(data_f, convert_dates=['date'])
    data['eva'] = data['eva'].astype(float)
    data.dropna(axis=0, inplace=True)
    data.sort_values('date', inplace=True)

    print(f'Saving to CSV file {data_t}')
    data.to_csv(data_t, index=False)

    print(f'Plotting cumulative spacewalk duration and saving to {g_file}')
    data['duration_hours'] = data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    data['cumulative_time'] = data['duration_hours'].cumsum()
    plt.plot(data.date, data.cumulative_time, 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(g_file)
    plt.show()
    print("--END--")

We can now run our script from the command line, passing the JSON input data file and CSV output data file as arguments:

BASH

python eva_data_analysis.py eva-data.json eva-data.csv

Remember to commit our changes.

BASH

git status
git add eva_data_analysis.py
git commit -m "Add commandline functionality to script"

OUTPUT

[main b5883f6] Add commandline functionality to script
 1 file changed, 30 insertions(+), 16 deletions(-)

Meaningful variable names


Variables are the most common thing you will assign when coding, and it’s really important that it is clear what each variable means in order to understand what the code is doing. If you return to your code after a long time doing something else, or share your code with a colleague, it should be easy enough to understand what variables are involved in your code from their names. Therefore we need to give them clear names, but we also want to keep them concise so the code stays readable. There are no “hard and fast rules” here, and it’s often a case of using your best judgment.

Some useful tips for naming variables are:

  • Short words are better than single character names
    • For example, if we were creating a variable to store the speed at which a file is read, s (for ‘speed’) is not descriptive enough but MBReadPerSecondAverageAfterLastFlushToLog is too long to read and prone to misspellings. ReadSpeed (or read_speed) would suffice.
    • If you’re finding it difficult to come up with a variable name that is both short and descriptive, go with the short version and use an inline comment to describe it further (more on those in the next section!)
    • This guidance doesn’t necessarily apply if your variable is a well-known constant in your domain, for example, c represents the speed of light in Physics
  • Try to be descriptive where possible, and avoid names like foo, bar, var, thing, and so on

There are also some gotchas to consider when naming variables:

  • There may be some restrictions on which characters you can use in your variable names. For instance, in Python, only alphanumeric characters and underscores are permitted.
  • Particularly in Python, you cannot begin your variable names with numerical characters as this will raise a syntax error.
    • Numerical characters can be included in a variable name, just not as the first character. For example, read_speed1 is a valid variable name, but 1read_speed isn’t. (This behaviour may be different for other programming languages.)
  • In some programming languages, such as Python, variable names are case sensitive. So speed_of_light and Speed_Of_Light will not be equivalent.
  • Programming languages often have global pre-built functions, such as input, which you may accidentally overwrite if you assign a variable with the same name.
    • Again in Python, you would actually reassign the input name and no longer be able to access the original input function if you used this as a variable name. So in this case opting for something like input_data would be preferable. (This behaviour may be explicitly disallowed in other programming languages.)
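
For instance, the short sketch below shows how assigning to input breaks the built-in function for the rest of the session:

PYTHON

input = "eva-data.json"      # shadows Python's built-in input() function
input("Enter a filename: ")  # raises TypeError: 'str' object is not callable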

Give a descriptive name to a variable

Below we have a variable called var being set to the value 9.81. var is not a very descriptive name here as it doesn’t tell us what 9.81 means, yet it is a very common constant in physics! Go online and find out which constant 9.81 relates to and suggest a new name for this variable.

Hint: the units are metres per second squared!

PYTHON

var = 9.81

Yes, 9.81 m/s² is the acceleration due to gravity at the Earth’s surface. It is often referred to as “little g” to distinguish it from “big G”, the Gravitational Constant. A more descriptive name for this variable therefore might be:

PYTHON

g_earth = 9.81

Challenge

Let’s apply this to eva_data_analysis.py.

  1. Edit the code as follows to use descriptive variable names:

    • Change data_f to input_file
    • Change data_t to output_file
    • Change g_file to graph_file
    • Change data to eva_df
  2. Commit your changes to your repository. Remember to use an informative commit message.

Updated code:

PYTHON

if __name__ == '__main__':

    if len(sys.argv) < 3:
        input_file = './eva-data.json'
        output_file = './eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = './cumulative_eva_graph.png'

    print(f'Reading JSON file {input_file}')
    eva_df = pd.read_json(input_file, convert_dates=['date'])
    eva_df['eva'] = eva_df['eva'].astype(float)
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)

    print(f'Saving to CSV file {output_file}')
    eva_df.to_csv(output_file, index=False)

    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum()
    plt.plot(eva_df.date, eva_df.cumulative_time, 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()
    print("--END--")

Commit changes:

BASH

git status
git add eva_data_analysis.py
git commit -m "Use descriptive variable names"

Inline comments


Commenting is a very useful practice to help convey the context of the code. It can be helpful as a reminder for your future self or your collaborators as to why code is written in a certain way, how it is achieving a specific task, or the real-world implications of your code.

There are many ways to add comments to code, the most common of which is inline comments.

PYTHON

# In Python, inline comments begin with the `#` symbol and a single space.

Again, there are few hard and fast rules to using comments, just apply your best judgment. But here are a few things to keep in mind when commenting your code:

  • Avoid using comments to explain what your code does. If your code is too complex for other programmers to understand, consider rewriting it for clarity rather than adding comments to explain it.
  • Focus on the why and the how.
  • Make sure you’re not reiterating something that your code already conveys on its own. Comments shouldn’t echo your code.
  • Keep them short and concise. Large blocks of text quickly become unreadable and difficult to maintain.
  • Comments that contradict the code are worse than no comments. Always make a priority of keeping comments up-to-date when code changes.

Examples of helpful vs. unhelpful comments

Unhelpful:

PYTHON

statetax = 1.0625  # Assigns the float 1.0625 to the variable 'statetax'
citytax = 1.01  # Assigns the float 1.01 to the variable 'citytax'
specialtax = 1.01  # Assigns the float 1.01 to the variable 'specialtax'

The comments in this code simply tell us what the code does, which is easy enough to figure out without the inline comments.

Helpful:

PYTHON

statetax = 1.0625  # State sales tax rate is 6.25% through Jan. 1
citytax = 1.01  # City sales tax rate is 1% through Jan. 1
specialtax = 1.01  # Special sales tax rate is 1% through Jan. 1

In this case, it might not be immediately obvious what each variable represents, so the comments offer helpful, real-world context. The date in the comment also indicates when the code might need to be updated.

Add some comments to a code block

  1. Examine eva_data_analysis.py. Add as many inline comments as you think is required to help yourself and others understand what that code is doing.
  2. Commit your changes to your repository. Remember to use an informative commit message.

Hint: Inline comments in Python are denoted by a # symbol.

Some good inline comments may look like the example below.

PYTHON

import pandas as pd
import matplotlib.pyplot as plt
import sys


if __name__ == '__main__':

    if len(sys.argv) < 3:
        input_file = './eva-data.json'
        output_file = './eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = './cumulative_eva_graph.png'

    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'])
    # Clean the data by removing any incomplete rows and sort by date
    eva_df['eva'] = eva_df['eva'].astype(float)
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)

    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    eva_df.to_csv(output_file, index=False)

    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    # Plot cumulative time spent in space over years
    eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum()
    plt.plot(eva_df.date, eva_df.cumulative_time, 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()
    print("--END--")

Commit changes:

BASH

git status
git add eva_data_analysis.py
git commit -m "Add inline comments to the code"

Functions


Functions are a fundamental concept in writing software and are one of the core ways you can organise your code to improve its readability. A function is an isolated section of code that performs a single, specific task that can be simple or complex. It can then be called multiple times with different inputs throughout a codebase, but its definition only needs to appear once.

Breaking up code into functions in this manner benefits readability since the smaller sections are easier to read and understand. Since functions can be reused, codebases naturally begin to follow the Don’t Repeat Yourself principle which prevents software from becoming overly long and confusing. The software also becomes easier to maintain because, if the code encapsulated in a function needs to change, it only needs updating in one place instead of many. As we will learn in a future episode, testing code also becomes simpler when code is written in functions. Each function can be individually checked to ensure it is doing what is intended, which improves confidence in the software as a whole.
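
As a toy sketch of this idea (the values and names here are invented for illustration):

PYTHON

# Without a function, the same conversion logic is repeated...
duration_a_hours = 125 / 60
duration_b_hours = 47 / 60

# ...with a function, the logic is defined once and reused
def minutes_to_hours(minutes):
    """Convert a duration in minutes to a duration in hours."""
    return minutes / 60

duration_a_hours = minutes_to_hours(125)
duration_b_hours = minutes_to_hours(47)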

Create a function

Below is a function that reads in a JSON file into a dataframe structure using the pandas library - but the code is out of order!

Reorder the lines of code within the function so that the JSON file is read in using the read_json method, any incomplete rows are dropped, the values are sorted by date, and then the cleaned dataframe is returned. There is also a print statement that will display which file is being read in on the command line for verification.

PYTHON

import pandas as pd

def read_json_to_dataframe(input_file):
    eva_df.sort_values('date', inplace=True)
    eva_df.dropna(axis=0, inplace=True)
    print(f'Reading JSON file {input_file}')
    return eva_df
    eva_df = pd.read_json(input_file, convert_dates=['date'])

Here is the correct order of the code for the function.

PYTHON

import pandas as pd

def read_json_to_dataframe(input_file):
    print(f'Reading JSON file {input_file}')
    eva_df = pd.read_json(input_file, convert_dates=['date'])
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)
    return eva_df

We have chosen to create a function for reading in data files since this is a very common task within research software. While this isn’t that many lines of code, thanks to pandas’ built-in methods, it can be useful to package this together into a function if you need to read in a lot of similarly structured files and process them in the same way.

Docstrings


Docstrings are a specific type of documentation provided within functions and classes. A function docstring should explain what the isolated code is doing, what parameters the function needs (these are inputs) and what form they should take, what the function outputs (you may see words like ‘returns’ or ‘yields’ here), and any errors that might be raised.

Providing these docstrings helps improve code readability since it makes the function code more transparent and aids understanding. Particularly, docstrings that provide information on the input and output of functions makes it easier to reuse them in other parts of the code, without having to read the full function to understand what needs to be provided and what will be returned.

Docstrings are another case where there are no hard and fast rules for writing them. Acceptable docstring formats can range from single- to multi-line. You can use your best judgment on how much documentation a particular function needs.

Example of a single-line docstring

PYTHON

def add(x, y):
    """Add two numbers together"""
    return x + y

Example of a multi-line docstring:

PYTHON

def add(x, y=1.0):
    """
    Adds two numbers together.

    Args:
        x: A number to be included in the addition.
        y (float, optional): A float number to be included in the addition. Defaults to 1.0.

    Returns:
        float: The sum of x and y.
    """
    return x + y

Some projects may have their own guidelines on how to write docstrings, such as numpy. If you are contributing code to a wider project or community, try to follow the guidelines and standards they provide for code style.

As your code grows and becomes more complex, the docstrings can form the content of a reference guide allowing developers to quickly look up how to use the APIs, functions, and classes defined in your codebase. Hence, it is common to find tools that will automatically extract docstrings from your code and generate a website where people can learn about your code without downloading/installing and reading the code files - such as MkDocs.
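
Because docstrings are attached to the function object itself, they can also be inspected at runtime, for example:

PYTHON

def add(x, y):
    """Add two numbers together"""
    return x + y

print(add.__doc__)  # prints the docstring: Add two numbers together
help(add)           # shows the function signature and docstring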

Writing docstrings

Write a docstring for the read_json_to_dataframe function from the previous exercise. Things you may want to consider when writing your docstring are:

  • Describing what the function does
  • What kind of inputs does the function take? Are they required or optional? Do they have default values?
  • What output will the function produce?

Hint: Python docstrings are defined by enclosing the text with """ above and below. This text is also indented to the same level as the code defined beneath it, which is four spaces.

A good enough docstring for this function would look like this:

PYTHON

def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe
    Clean the data by removing any incomplete rows and sort by date
    """
    print(f'Reading JSON file {input_file}')
    eva_df = pd.read_json(input_file, 
                          convert_dates=['date'])
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)
    return eva_df

Using Google’s docstring convention, the docstring may look more like this:

PYTHON

def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any incomplete rows and sort by date

    Args:
        input_file (str): The path to the JSON file.

    Returns:
         eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
    """        
    print(f'Reading JSON file {input_file}')
    eva_df = pd.read_json(input_file, 
                          convert_dates=['date'])
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)
    return eva_df

Improving our code


Finally, let’s apply these good practices to eva_data_analysis.py and organise our code into functions with descriptive names and docstrings.

PYTHON

import pandas as pd
import matplotlib.pyplot as plt
import sys


def read_json_to_dataframe(input_file_):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any incomplete rows and sort by date

    Args:
        input_file_ (str): The path to the JSON file.

    Returns:
         eva_df (pd.DataFrame): The loaded dataframe.
    """
    print(f'Reading JSON file {input_file_}')
    eva_df = pd.read_json(input_file_, convert_dates=['date'])
    eva_df['eva'] = eva_df['eva'].astype(float)
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)
    return eva_df


def write_dataframe_to_csv(df_, output_file_):
    """
    Write the dataframe to a CSV file.

    Args:
        df_ (pd.DataFrame): The input dataframe.
        output_file_ (str): The path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file_}')
    df_.to_csv(output_file_, index=False)



def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours_variable(df_):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df_ (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df_ with the new duration_hours variable added
    """
    df_copy = df_.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def plot_cumulative_time_in_space(df_, graph_file_):
    """
    Plot the cumulative time spent in space over years

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df_ (pd.DataFrame): The input dataframe.
        graph_file_ (str): The path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file_}')
    df_ = add_duration_hours_variable(df_)
    df_['cumulative_time'] = df_['duration_hours'].cumsum()
    plt.plot(df_.date, df_.cumulative_time, 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file_)
    plt.show()


if __name__ == '__main__':

    if len(sys.argv) < 3:
        input_file = './eva-data.json'
        output_file = './eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = './cumulative_eva_graph.png'

    eva_data = read_json_to_dataframe(input_file)

    write_dataframe_to_csv(eva_data, output_file)

    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")

Finally, let’s commit these changes to our local repository and then push to our remote repository on GitHub to publish these changes. Remember to use an informative commit message.

BASH

git status
git add eva_data_analysis.py
git commit -m "Organise code into functions"
git push origin main

Summary


During this episode, we have discussed the importance of code readability and explored some software engineering practices that help facilitate this.

Code readability is important because it makes it simpler and quicker for a person (future you or a collaborator) to understand what purpose the code is serving, and therefore begin contributing to it more easily, saving time and effort.

Some best practices we have covered towards code readability include:

  • Variable naming practices for descriptive yet concise code
  • Inline comments to provide real-world context
  • Functions to isolate specific code sections for re-use
  • Docstrings for documenting functions to facilitate their re-use

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Readable code is easier to understand, maintain, debug and extend!
  • Creating functions from the smallest, reusable units of code will help compartmentalise which parts of the code are doing what actions
  • Choosing descriptive variable and function names will communicate their purpose more effectively
  • Using inline comments and docstrings to describe parts of the code will help transmit understanding, and verify that the code is correct

Content from Code testing


Last updated on 2024-07-05 | Edit this page

Overview

Questions

  • How can we verify that our code is correct?
  • How can we automate our software tests?
  • What makes a “good” test?
  • Which parts of our code should we prioritize for testing?

Objectives

After completing this episode, participants should be able to:

  • Explain why code testing is important and how this supports FAIR software.
  • Describe the different types of software tests (unit tests, integration tests, regression tests).
  • Implement unit tests to verify that function(s) behave as expected using the Python testing framework pytest.
  • Interpret the output from pytest to identify which function(s) are not behaving as expected.
  • Write tests using typical values, edge cases and invalid inputs to ensure that the code can handle extreme values and invalid inputs appropriately.
  • Evaluate code coverage to identify how much of the codebase is being tested and identify areas that need further tests.

Motivation for Code Testing


The goal of software testing is to check that the actual results produced by a piece of code meet our expectations, i.e. that they are correct.

Adopting software testing as part of our research workflow helps us to conduct better research and produce FAIR software.

Better Research

Software testing can help us be better, more productive researchers.

Testing helps us to identify and fix problems with our code early and quickly and allows us to demonstrate to ourselves and others that our code does what we claim. More importantly, we can share our tests alongside our code, allowing others to check this for themselves.

FAIR Software

Software testing also underpins the FAIR process by giving us the confidence to engage in open research practices.

If we are not sure that our code works as intended and produces accurate results, we are unlikely to feel confident about sharing our code with others. Software testing brings peace of mind by providing a step-by-step approach that we can apply to verify that our code is correct.

Software testing also supports the FAIR process by improving the accessibility and reusability of our code.

Accessible:

  • Well written software tests capture the expected behaviour of our code and can be used alongside documentation to help developers quickly make sense of our code.

Reusable:

  • A well tested codebase allows developers to experiment with new features safe in the knowledge that tests will reveal if their changes have broken any existing functionality.
  • The act of writing tests encourages us to structure our code as individual functions and often results in a more modular, readable, maintainable codebase that is easier to extend or repurpose.

Types of Software Tests

There are many different types of software testing including:

  • Unit Tests: Unit tests focus on testing individual functions in isolation. They ensure that each small part of the software performs as intended. By verifying the correctness of these individual units, we can catch errors early in the development process.

  • Integration Tests: Integration tests check how different parts of the code, e.g. functions, work together.

  • Regression Tests: Regression tests are used to ensure that new changes or updates to the codebase do not adversely affect the existing functionality. They involve checking whether a program or part of a program still generates the same results after changes have been made.

  • End-to-End Tests: End-to-end tests are a special type of integration test which checks that a program as a whole behaves as expected.

In this course, our primary focus will be on unit testing. However, the concepts and techniques we cover will provide a solid foundation applicable to other types of testing.

Types of Software Tests

Fill in the blanks in the sentences below:

  • __________ tests compare the ______ output of a program to its ________ output to demonstrate correctness.
  • Unit tests compare the actual output of a ______ ________ to the expected output to demonstrate correctness.
  • __________ tests check that results have not changed since the previous test run.
  • __________ tests check that two or more parts of a program are working together correctly.
  • End-to-end tests compare the actual output of a program to the expected output to demonstrate correctness.
  • Unit tests compare the actual output of a single function to the expected output to demonstrate correctness.
  • Regression tests check that results have not changed since the previous test run.
  • Integration tests check that two or more parts of a program are working together correctly.

Informal Testing

How should we test our code? Let’s start by considering the following scenario. A collaborator on our project has sent us the following code to add a crew_size variable to our data frame - a column which captures the number of astronauts participating in a given spacewalk. How do we know that it works as intended?

PYTHON

import re
import pandas as pd

def calculate_crew_size(crew):
    """
    Calculate crew_size for a single crew entry

    Args:
        crew (str): The text entry in the crew column

    Returns:
        int: The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_variable(df_):
    """
    Add crew size (crew_size) variable to the dataset

    Args:
        df_ (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of df_ with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df_.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy
    

One approach is to copy/paste the function(s) into a Python interpreter and check that they behave as expected with some input values for which we know what the correct return value should be.

Since add_crew_size_variable contains boilerplate code for deriving one column from another, let’s start with calculate_crew_size:

PYTHON

calculate_crew_size("Valentina Tereshkova;")
calculate_crew_size("Judith Resnik; Sally Ride;")

OUTPUT

1
2

We can then explore the behaviour of add_crew_size_variable by creating a toy data frame:

PYTHON

# Create a toy DataFrame
data = pd.DataFrame({
    'crew': ['Anna Lee Fisher;', 'Marsha Ivins; Helen Sharman;']
})

add_crew_size_variable(data)

OUTPUT

Adding crew size variable (crew_size) to dataset
                           crew  crew_size
0              Anna Lee Fisher;          1
1  Marsha Ivins; Helen Sharman;          2

Although this is an important process to go through as we draft our code for the first time, there are some serious drawbacks to this approach if used as our only form of testing.

What are the limitations of informally testing code? (5 minutes)

Think about the questions below. Your instructors may ask you to share your answers in a shared notes document and/or discuss them with other participants.

  • Why might we choose to test our code informally?
  • What are the limitations of relying solely on informal tests to verify that a piece of code is behaving as expected?

It can be tempting to test our code informally because this approach:

  • is quick and easy
  • provides immediate feedback

However, there are limitations to this approach:

  • Working interactively is error prone
  • We must repeat our tests every time we update our code; this is time consuming
  • We must rely on memory to keep track of how we have tested our code e.g. what input values we tried
  • We must rely on memory to keep track of which functions have been tested and which have not

Formal Testing


We can overcome some of these limitations by formalising our testing process. A formal approach to testing our function(s) is to write dedicated test functions to check our code. These test functions:

  • Run the function we want to test - the target function with known inputs
  • Compare the output to known, valid results
  • Raises an error if the function’s actual output does not match the expected output
  • Are recorded in a test script that can be re-run on demand.

Let’s explore this process by writing some formal tests for our text_to_duration function. (We’ll come back to our colleague’s calculate_crew_size function later).

The text_to_duration function converts a duration stored as a string (HH:MM) to a duration in hours, e.g. duration “1:15” should return a value of 1.25.

PYTHON

def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        float: The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours

Let’s create a new python file test_code.py in the root of our project folder to store our tests.

BASH

cd Spacewalks
touch test_code.py

First, we import text_to_duration into our test script. Then we add our first test function:

PYTHON


from eva_data_analysis import text_to_duration

def test_text_to_duration_integer():
    input_value = "10:00"
    test_result = text_to_duration(input_value) == 10
    print(f"text_to_duration('10:00') == 10? {test_result}")

test_text_to_duration_integer()

This test checks that when we apply text_to_duration to input value “10:00”, the output matches the expected value of 10.

In this example, we use a print statement to report whether the actual output from text_to_duration meets our expectations.

However, this does not meet our requirement to “Raise an error if the function’s output does not match the expected output” and means that we must carefully read our test function’s output to identify whether it has failed.

To ensure that our code raises an error if the function’s output does not match the expected output, we can use an assert statement.

The assert statement in Python checks whether a condition is True or False. If the condition is True, assert does not return a value, but if the condition is False, assert raises an AssertionError.

Let’s rewrite our test with an assert statement:

PYTHON


from eva_data_analysis import text_to_duration

def test_text_to_duration_integer():
    assert text_to_duration("10:00") == 10

test_text_to_duration_integer()

Notice that when we run test_text_to_duration_integer(), nothing happens - there is no output. That is because our function is working correctly and returning the expected value of 10.

Let’s see what happens when we deliberately introduce a bug into text_to_duration: in the Spacewalks data analysis script, let’s change int(hours) to int(hours)/60 and int(minutes)/60 to int(minutes) to mimic a simple mistake in our code where the wrong element is divided by 60.

PYTHON

def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours)/60 + int(minutes) # Divide the wrong element by 60
    return duration_hours

Notice that this time, our test fails noisily. Our assert statement has raised an AssertionError - a clear signal that there is a problem in our code that we need to fix.

PYTHON

test_text_to_duration_integer()

ERROR

Traceback (most recent call last):
  File "/Users/AnnResearchers/Desktop/Spacewalks/test_code.py", line 7, in <module>
    test_text_to_duration_integer()
  File "/Users/AnnResearchers/Desktop/Spacewalks/test_code.py", line 5, in test_text_to_duration_integer
    assert text_to_duration("10:00") == 10
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

What happens if we add another test to our test script? This time we’ll check that our function can handle durations with a non-zero minute component. Notice that this time our expected value is a floating-point number. Importantly, we cannot use a simple double equals sign (==) to compare the equality of floating-point numbers. Floating-point arithmetic can introduce very small differences due to how computers represent these numbers internally - as a result, we check that our floating point numbers are equal within a very small tolerance (1e-5).

PYTHON

from eva_data_analysis import text_to_duration

def test_text_to_duration_integer():
    assert text_to_duration("10:00") == 10
    
def test_text_to_duration_float():
    assert abs(text_to_duration("10:20") - 10.33333333) < 1e-5

test_text_to_duration_integer()
test_text_to_duration_float()

OUTPUT

Traceback (most recent call last):
  File "/Users/AnnResearcher/Desktop/Spacewalks/test_code.py", line 9, in <module>
    test_text_to_duration_integer()
  File "/Users/AnnResearcher/Desktop/Spacewalks/test_code.py", line 4, in test_text_to_duration_integer
    assert text_to_duration("10:00") == 10
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

What happens when we run our updated test script? Our script stops after the first test failure and the second test is not run. To run our remaining tests we would have to manually comment out our failing test and re-run the test script. As our code base and tests grow, this will become cumbersome. This is not ideal and can be overcome by automating our tests using a testing framework.

Using a Testing Framework


Our approach so far has had two major limitations:

  • We had to carefully examine the output of our test script to work out if our test failed.
  • Our test script only ran our tests up to the first test failure.

We can do better than this! Testing frameworks can automatically find all the tests in our code base, run all of them and present the test results as a readable summary.

We will use the python testing framework pytest with its code coverage plugin pytest-cov. To install these libraries, open a terminal and type:

BASH

python -m pip install pytest pytest-cov

Let’s make sure that our tests are ready to work with pytest.

  • pytest automatically discovers tests based on specific naming patterns. pytest looks for files that start with test_ or end with _test.py. Then, within these files, pytest looks for functions that start with test_.
    Our test file already meets these requirements, so there is nothing to do here. However, our script does contain lines to run each of our test functions. These are no longer required as pytest will run our tests, so we will remove them:

    PYTHON

    # Delete
    test_text_to_duration_integer()
    test_text_to_duration_float()
  • It is also conventional when working with a testing framework to place test files in a tests directory at the root of our project and to name each test file after the code file that it targets. This helps in maintaining a clean structure and makes it easier for others to understand where the tests are located.

A set of tests for a given piece of code is called a test suite. Our test suite is currently located in the root folder of our project. Let’s move it to a dedicated tests folder and rename our test_code.py file to test_eva_data_analysis.py.

BASH

mkdir tests
mv test_code.py tests/test_eva_data_analysis.py

Before we re-run our tests using pytest, let’s update our second test to use pytest’s approx function, which is specifically intended for comparing floating-point numbers within a tolerance.

PYTHON

import pytest
from eva_data_analysis import text_to_duration

def test_text_to_duration_integer():
    assert text_to_duration("10:00") == 10
    
def test_text_to_duration_float():
    assert text_to_duration("10:20") == pytest.approx(10.33333333)

Let’s also add docstrings to clarify what each test is doing and expand our syntax to highlight the logic behind our approach:

PYTHON

import pytest
from eva_data_analysis import text_to_duration

def test_text_to_duration_integer():
    """
    Test that text_to_duration returns expected ground truth values
    for typical whole hour durations 
    """
    actual_result =  text_to_duration("10:00")
    expected_result = 10
    assert actual_result == expected_result
    
def test_text_to_duration_float():
    """
    Test that text_to_duration returns expected ground truth values
    for typical durations with a non-zero minute component
    """
    actual_result = text_to_duration("10:20") 
    expected_result = 10.33333333
    assert actual_result == pytest.approx(expected_result)

Writing our tests this way highlights the key idea that each test should compare the actual results returned by our function with expected values.

Similarly, writing test docstrings that complete the sentence “Test that …” helps us to understand what each test is doing and why it is needed.

Next, let’s modify our bug to something that will affect durations with a non-zero minute component like “10:20” but not those that are whole hours, e.g. “10:00”.

Let’s change int(hours)/60 + int(minutes) to int(hours) + int(minutes)/6, a simple typo.

PYTHON

def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/6 # Divide by 6 instead of 60
    return duration_hours

Finally, let’s run our tests:

BASH

python -m pytest 

OUTPUT

========================================================== test session starts
platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/AnnResearcher/Desktop/Spacewalks
plugins: cov-5.0.0
collected 2 items

tests/test_eva_data_analysis.py .F                                                                                                 [100%]

================================================================ FAILURES
______________________________________________________ test_text_to_duration_float

    def test_text_to_duration_float():
        """
        Test that text_to_duration returns expected ground truth values
        for typical durations with a non-zero minute component
        """
        actual_result = text_to_duration("10:20")
        expected_result = 10.33333333
>       assert actual_result == pytest.approx(expected_result)
E       assert 13.333333333333334 == 10.33333333 ± 1.0e-05
E
E         comparison failed
E         Obtained: 13.333333333333334
E         Expected: 10.33333333 ± 1.0e-05

tests/test_eva_data_analysis.py:23: AssertionError
======================================================== short test summary info
FAILED tests/test_eva_data_analysis.py::test_text_to_duration_float - assert 13.333333333333334 == 10.33333333 ± 1.0e-05
====================================================== 1 failed, 1 passed in 0.32s
  • Notice how, if the test function finishes without raising an AssertionError, the test is considered successful and is marked with a dot (‘.’).
  • If an assertion fails or an error occurs, the test is marked as a failure with an ‘F’, and the output includes details about the error to help identify what went wrong.

Interpreting pytest output

A colleague has asked you to conduct a pre-publication review of their code Spacetravel which analyses time spent in space by various individual astronauts.

Inspect the pytest output provided and answer the questions below.

pytest output for Spacetravel

OUTPUT

============================================================ test session starts
platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/Desktop/AnneResearcher/projects/Spacetravel
collected 9 items

tests/test_analyse.py FF....                                              [ 66%]
tests/test_prepare.py s..                                                 [100%]

====================================================================== FAILURES
____________________________________________________________ test_total_duration

    def test_total_duration():

      durations = [10, 15, 20, 5]
      expected  = 50/60
      actual  = calculate_total_duration(durations)
>     assert actual == pytest.approx(expected)
E     assert 8.333333333333334 == 0.8333333333333334 ± 8.3e-07
E
E       comparison failed
E       Obtained: 8.333333333333334
E       Expected: 0.8333333333333334 ± 8.3e-07

tests/test_analyse.py:9: AssertionError
______________________________________________________________________________ test_mean_duration

    def test_mean_duration():
       durations = [10, 15, 20, 5]

       expected = 12.5/60
>      actual  = calculate_mean_duration(durations)

tests/test_analyse.py:15:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

durations = [10, 15, 20, 5]

    def calculate_mean_duration(durations):
        """
        Calculate the mean of a list of durations.
        """
        total_duration = sum(durations)/60
>       mean_duration = total_duration / length(durations)
E       NameError: name 'length' is not defined

Spacetravel.py:45: NameError
=========================================================================== short test summary info
FAILED tests/test_analyse.py::test_total_duration - assert 8.333333333333334 == 0.8333333333333334 ± 8.3e-07
FAILED tests/test_analyse.py::test_mean_duration - NameError: name 'length' is not defined
============================================================== 2 failed, 6 passed, 1 skipped in 0.02s
  1. How many tests has our colleague included in the test suite?
  2. The first test in test_prepare.py has a status of s; what does this mean?
  3. How many tests failed?
  4. Why did “test_total_duration” fail?
  5. Why did “test_mean_duration” fail?
  1. 9 tests were detected in the test suite
  2. s - stands for “skipped”,
  3. 2 tests failed: the first and second tests in test file test_analyse.py
  4. test_total_duration failed because the calculated total duration differs from the expected value by a factor of 10 i.e. the assertion actual == pytest.approx(expected) evaluated to False
  5. test_mean_duration failed because there is an error in calculate_mean_duration. Our colleague has used the name length (which is not a built-in Python function) instead of len. As a result, calling the function raises a NameError instead of returning a calculated value, so the test fails with an error before the assertion can be checked.

Test Suite Design


Now that we have the tooling in place to automatically run our test suite, what makes a good test suite?

Good Tests

We should aim to test that our function behaves as expected with the full range of inputs that it might encounter. It is helpful to consider each argument of a function in turn and identify the range of typical values it can take. Once we have identified this typical range or ranges (where a function takes more than one argument), we should:

  • Test at least one interior point
  • Test all values at the edge of the range
  • Test invalid values
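
For instance, a hypothetical sketch of this pattern applied to our text_to_duration function might add an edge case and an invalid input (using pytest.raises to assert that an error is raised):

PYTHON

import pytest
from eva_data_analysis import text_to_duration

def test_text_to_duration_edge_zero():
    """Test that text_to_duration handles the edge case of a zero duration"""
    assert text_to_duration("00:00") == 0

def test_text_to_duration_invalid():
    """Test that text_to_duration raises an error for an input without a ":" separator"""
    # "1020".split(":") yields a single element, so unpacking it into
    # hours and minutes raises a ValueError
    with pytest.raises(ValueError):
        text_to_duration("1020")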

Let’s revisit the crew_size functions from our colleague’s codebase. First, let’s add the additional functions to eva_data_analysis.py:

PYTHON

import pandas as pd
import matplotlib.pyplot as plt
import sys
import re

...

def calculate_crew_size(crew):
    """
    Calculate crew_size for a single crew entry

    Args:
        crew (str): The text entry in the crew column

    Returns:
        int: The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1


def add_crew_size_variable(df_):
    """
    Add crew size (crew_size) variable to the dataset

    Args:
        df_ (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of df_ with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df_.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy

if __name__ == '__main__':

    if len(sys.argv) < 3:
        input_file = './eva-data.json'
        output_file = './eva-data.csv'
        print('Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = './cumulative_eva_graph.png'

    eva_data = read_json_to_dataframe(input_file)

    eva_data_prepared = add_crew_size_variable(eva_data)  # Add this line

    write_dataframe_to_csv(eva_data_prepared, output_file)  # Modify this line

    plot_cumulative_time_in_space(eva_data_prepared, graph_file) # Modify this line

    print("--END--")

Now, let’s write some tests for calculate_crew_size.

Unit Tests for calculate_crew_size

Implement unit tests for the calculate_crew_size function. Cover typical cases and edge cases.

Hint: use the following template:

def test_MYFUNCTION (): # FIXME
    """
    Test that ...   #FIXME
    """

    # Typical value 1
    actual_result =  _______________ #FIXME
    expected_result = ______________ #FIXME
    assert actual_result == expected_result

    # Typical value 2
    actual_result =  _______________ #FIXME
    expected_result = ______________ #FIXME
    assert actual_result == expected_result
    

PYTHON

import pytest
from eva_data_analysis import (
    text_to_duration,
    calculate_crew_size
)

def test_text_to_duration_integer():
    """
    Test that text_to_duration returns expected ground truth values
    for typical whole hour durations
    """
    actual_result =  text_to_duration("10:00")
    expected_result = 10
    assert actual_result == expected_result

def test_text_to_duration_float():
    """
    Test that text_to_duration returns expected ground truth values
    for typical durations with a non-zero minute component
    """
    actual_result = text_to_duration("10:20")
    expected_result = 10.33333333
    assert actual_result == pytest.approx(expected_result)

def test_calculate_crew_size():
    """
    Test that calculate_crew_size returns expected ground truth values
    for typical crew values
    """
    actual_result = calculate_crew_size("Valentina Tereshkova;")
    expected_result = 1
    assert actual_result == expected_result

    actual_result = calculate_crew_size("Judith Resnik; Sally Ride;")
    expected_result = 2
    assert actual_result == expected_result


# Edge cases
def test_calculate_crew_size_edge_cases():
    """
    Test that calculate_crew_size returns expected ground truth values
    for edge case where crew is an empty string
    """
    actual_result = calculate_crew_size("")
    assert actual_result is None

OUTPUT

========================================================== test session starts
platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/AnnResearcher/Desktop/Spacewalks
plugins: cov-5.0.0
collected 4 items

tests/test_eva_data_analysis.py .F..                                                                                               [100%]

================================================================ FAILURES
______________________________________________________ test_text_to_duration_float

    def test_text_to_duration_float():
        """
        Test that text_to_duration returns expected ground truth values
        for typical durations with a non-zero minute component
        """
        actual_result = text_to_duration("10:20")
        expected_result = 10.33333333
>       assert actual_result == pytest.approx(expected_result)
E       assert 13.333333333333334 == 10.33333333 ± 1.0e-05
E
E         comparison failed
E         Obtained: 13.333333333333334
E         Expected: 10.33333333 ± 1.0e-05

tests/test_eva_data_analysis.py:23: AssertionError
======================================================== short test summary info
FAILED tests/test_eva_data_analysis.py::test_text_to_duration_float - assert 13.333333333333334 == 10.33333333 ± 1.0e-05
====================================================== 1 failed, 3 passed in 0.33s

Enough Tests

In this episode, so far we’ve (only) written tests for two individual functions text_to_duration and calculate_crew_size.

We can quantify the proportion of our code base that is run (also referred to as “exercised”) by a given test suite using a metric called code coverage:

\[ \text{Line Coverage} = \left( \frac{\text{Number of Executed Lines}}{\text{Total Number of Executable Lines}} \right) \times 100 \]
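
For example, the coverage report we generate below shows that 38 of the 56 executable statements in eva_data_analysis.py are missed by our tests, i.e. 18 are executed:

\[ \text{Line Coverage} = \left( \frac{18}{56} \right) \times 100 \approx 32\% \]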

We can calculate our test coverage using the pytest-cov library. Before we do so, let’s fix our bug so that our output is cleaner and we can focus on the code coverage information.

PYTHON

def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration (float): The duration in hours
    """
    hours, minutes = duration.split(":")
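    # e.g. "10:20" -> hours="10", minutes="20" -> 10 + 20/60 = 10.33... hours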
    duration_hours = int(hours) + int(minutes)/60 # Bug-free line
    return duration_hours

BASH

python -m pytest --cov 

OUTPUT

========================================================== test session starts
platform darwin -- Python 3.12.3, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/AnnResearcher/Desktop/Spacewalks
plugins: cov-5.0.0
collected 4 items

tests/test_eva_data_analysis.py ....                                                                                               [100%]

---------- coverage: platform darwin, python 3.12.3-final-0 ----------
Name                              Stmts   Miss  Cover
-----------------------------------------------------
eva_data_analysis.py                 56     38    32%
tests/test_eva_data_analysis.py      20      0   100%
-----------------------------------------------------
TOTAL                                76     38    50%


=========================================================== 4 passed in 1.04s

To get an in-depth report about which parts of our code are tested and which are not, we can add the option --cov-report=html.

BASH

python -m pytest --cov --cov-report=html 

This option generates a folder htmlcov which contains an HTML code coverage report. The report provides structured information about our test coverage, including (a) a table showing the proportion of lines in each function that are currently tested, and (b) an annotated copy of our code where untested lines are highlighted in red.

Ideally, all the lines of code in our code base should be exercised by at least one test. However, if we lack the time and resources to test every line of our code we should:

  • Avoid testing Python’s built-in functions or functions imported from well-known and well-tested libraries such as pandas or NumPy.
  • Focus on the parts of our code that carry the greatest “reputational risk”, i.e. those that could affect the accuracy of our reported results.

On the other hand, it is also important to realise that although coverage of less than 100% indicates that more testing may be helpful, test coverage of 100% does not mean that our code is bug-free!
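
As a (hypothetical) illustration of this point, the function below has 100% line coverage from a single passing test, yet it is still wrong for most inputs:

PYTHON

def is_even(n):
    return n == 2  # Buggy: recognises only one even number

def test_is_even():
    # This single test runs every line of is_even (100% coverage) and passes,
    # but is_even(4) still returns the wrong answer
    assert is_even(2)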

Evaluating Code Coverage

Generate a code coverage report for the Spacewalks test suite and extract the following information:

  1. What proportion of the code base is currently NOT exercised by the test suite?
  2. Which functions in our code base are currently untested?

BASH

python -m pytest --cov --cov-report=html
  1. The proportion of the code base NOT covered by our tests is 100% - 32% = 68% (using the 32% coverage reported for eva_data_analysis.py)
  2. The following functions in our code base are currently untested:
    • read_json_to_dataframe
    • write_dataframe_to_csv
    • add_duration_hours_variable
    • plot_cumulative_time_in_space
    • add_crew_size_variable

Implementing a minimal test suite

A member of our research team shares the following code with us to add to the Spacewalks codebase:

PYTHON

def summarise_categorical(df_, varname_):
    """
    Tabulate the distribution of a categorical variable

    Args:
        df_ (pd.DataFrame): The input dataframe.
        varname_ (str): The name of the variable

    Returns:
        pd.DataFrame: dataframe containing the count and percentage of
        each unique value of varname_
        
    Examples:
        >>> df_example = pd.DataFrame({
            'vehicle': ['Apollo 16', 'Apollo 17', 'Apollo 17'],
            }, index=[0, 1, 2])
        >>> summarise_categorical(df_example, "vehicle")
        Tabulating distribution of categorical variable vehicle
             vehicle  count  percentage
        0  Apollo 16      1        33.0
        1  Apollo 17      2        67.0
    """
    print(f'Tabulating distribution of categorical variable {varname_}')

    # Prepare statistical summary
    count_variable = df_[[varname_]].copy()
    count_summary = count_variable.value_counts()
    percentage_summary = round(count_summary / count_variable.size, 2) * 100

    # Combine results into a summary data frame
    df_summary = pd.concat([count_summary, percentage_summary], axis=1)
    df_summary.columns = ['count', 'percentage']
    df_summary.sort_index(inplace=True)


    df_summary = df_summary.reset_index()
    return df_summary

This looks like a useful tool for creating summary statistics tables, so let’s integrate it into our eva_data_analysis.py code and then write a minimal test suite to check that it behaves as expected.

PYTHON

import pandas as pd
import matplotlib.pyplot as plt
import sys
import re


...

def add_crew_size_variable(df_):
    """
    Add crew size (crew_size) variable to the dataset

    Args:
        df_ (pd.DataFrame): The input dataframe.

    Returns:
        pd.DataFrame: A copy of df_ with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df_.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


def summarise_categorical(df_, varname_):
    """
    Tabulate the distribution of a categorical variable

    Args:
        df_ (pd.DataFrame): The input dataframe.
        varname_ (str): The name of the variable

    Returns:
        pd.DataFrame: dataframe containing the count and percentage of
        each unique value of varname_
    """
    print(f'Tabulating distribution of categorical variable {varname_}')

    # Prepare statistical summary
    count_variable = df_[[varname_]].copy()
    count_summary = count_variable.value_counts() # There is a bug here that we will fix later!
    percentage_summary = round(count_summary / count_variable.size, 2) * 100

    # Combine results into a summary data frame
    df_summary = pd.concat([count_summary, percentage_summary], axis=1)
    df_summary.columns = ['count', 'percentage']
    df_summary.sort_index(inplace=True)


    df_summary = df_summary.reset_index()
    return df_summary


if __name__ == '__main__':

    if len(sys.argv) < 3:
        input_file = './eva-data.json'
        output_file = './eva-data.csv'
        print('Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = './cumulative_eva_graph.png'

    eva_data = read_json_to_dataframe(input_file)

    eva_data_prepared = add_crew_size_variable(eva_data)

    write_dataframe_to_csv(eva_data_prepared, output_file)

    table_crew_size = summarise_categorical(eva_data_prepared, "crew_size")

    write_dataframe_to_csv(table_crew_size, "./table_crew_size.csv")

    plot_cumulative_time_in_space(eva_data_prepared, graph_file)

    print("--END--")

To write tests for this function, we’ll need to be able to compare dataframes. The pandas.testing module in the pandas library provides functions and utilities for testing pandas objects and includes a function assert_frame_equal that we can use to compare two dataframes.
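
If you have not used assert_frame_equal before, here is a minimal sketch of how it behaves (the dataframes are made-up examples):

PYTHON

import pandas as pd
import pandas.testing as pdt

df_a = pd.DataFrame({"count": [1, 2]})
df_b = pd.DataFrame({"count": [1, 2]})
df_c = pd.DataFrame({"count": [1, 3]})

# Passes silently: identical shape, dtypes, index and values
pdt.assert_frame_equal(df_a, df_b)

# Raises an AssertionError describing the first mismatch
try:
    pdt.assert_frame_equal(df_a, df_c)
except AssertionError as error:
    print(error)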

Exercise 1 - Typical Inputs

First, check that the function behaves as expected with typical input values. Fill in the gaps in the skeleton test below:

PYTHON

import pandas.testing as pdt

def test_summarise_categorical_typical():
    """
    Test that summarise_categorical correctly tabulates
    distribution of values (counts, percentages) for a ground truth
    example (typical values)
    """
    test_input = pd.DataFrame({
        'country': _________________________________________, # FIX-ME
    }, index=[0, 1, 2, 3, 4])

    expected_result = pd.DataFrame({
        'country': ["Russia", "USA"],
        'count': [2, 3],
        'percentage': [40.0, 60.0],
    }, index=[0, 1])

    actual_result = ____________________________________________ # FIX-ME 
    
    pdt.__________________(actual_result, _______________) #FIX-ME

PYTHON

import pandas as pd
import pandas.testing as pdt

from eva_data_analysis import summarise_categorical

def test_summarise_categorical():
    """
    Test that summarise_categorical correctly tabulates
    distribution of values (counts, percentages) for a simple ground truth
    example
    """
    test_input = pd.DataFrame({
        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
    }, index=[0, 1, 2, 3, 4])

    expected_result = pd.DataFrame({
        'country': ["Russia", "USA"],
        'count': [2, 3],
        'percentage': [40.0, 60.0],
    }, index=[0, 1])

    actual_result = summarise_categorical(test_input, "country")

    pdt.assert_frame_equal(actual_result, expected_result)

Exercise 2 - Edge Cases

Now let’s check that the function behaves as expected with edge cases.
Does the code behave as expected when the column of interest contains one or more missing values (pd.NA)? (write a new test).

Fill in the gaps in the skeleton test below:

PYTHON

import pandas.testing as pdt

def test_summarise_categorical_missvals():
    """
    Test that summarise_categorical correctly tabulates
    distribution of values (counts, percentages) for a ground truth
    example (edge case where the column contains missing values)
    """
    test_input = _______________
    _______________
    _______________ # FIX-ME
    
    expected_result = _______________
    _______________
    _______________ # FIX-ME
    
    actual_result = summarise_categorical(test_input, "country")

    pdt.assert_frame_equal(actual_result, expected_result)

PYTHON

import numpy as np
import pandas as pd
import pandas.testing as pdt

from eva_data_analysis import summarise_categorical

def test_summarise_categorical_missvals():
    """
    Test that summarise_categorical correctly tabulates
    distribution of values (counts, percentages) for a ground truth
    example (edge case where column contains missing values)
    """
    test_input = pd.DataFrame({
        'country': ['USA', 'USA', 'USA', "Russia", pd.NA],
    }, index=[0, 1, 2, 3, 4])

    expected_result = pd.DataFrame({
        'country': ["Russia", "USA", np.nan], # np.nan because pd.NA is cast to np.nan
        'count': [1, 3, 1],
        'percentage': [20.0, 60.0, 20.0],
    }, index=[0, 1, 2])
    actual_result = summarise_categorical(test_input, "country")

    pdt.assert_frame_equal(actual_result, expected_result)

Exercise 3 - Invalid inputs

Now write a test to check that the summarise_categorical function raises an appropriate error when asked to tabulate a column that does not exist in the data frame.

Hint: lookup pytest.raises in the pytest documentation.
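
If you have not met pytest.raises before, here is a minimal, self-contained sketch (get_crew_column is a hypothetical helper, not part of Spacewalks):

PYTHON

import pytest

def get_crew_column(columns):
    # Hypothetical helper: fail loudly if the expected column is absent
    if "crew" not in columns:
        raise KeyError("crew")
    return "crew"

def test_get_crew_column_missing():
    # The test passes only if the expected exception is raised inside the block
    with pytest.raises(KeyError):
        get_crew_column(["country", "duration"])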

PYTHON


import pytest
import pandas as pd

from eva_data_analysis import summarise_categorical


def test_summarise_categorical_invalid():
    """
    Test that summarise_categorical raises an
    error when a non-existent column is input
    """
    test_input = pd.DataFrame({
        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
    }, index=[0, 1, 2, 3, 4])

    with pytest.raises(KeyError):
        summarise_categorical(test_input, "vehicle")

Improving Our Code


At the end of this episode, our test suite in tests should look like this:

PYTHON

import pytest
import pandas as pd
import pandas.testing as pdt
import numpy as np

from eva_data_analysis import (
    text_to_duration,
    calculate_crew_size,
    summarise_categorical
)

def test_text_to_duration_integer():
    """
    Test that text_to_duration returns expected ground truth values
    for typical whole hour durations
    """
    actual_result =  text_to_duration("10:00")
    expected_result = 10
    assert actual_result == expected_result

def test_text_to_duration_float():
    """
    Test that text_to_duration returns expected ground truth values
    for typical durations with a non-zero minute component
    """
    actual_result = text_to_duration("10:20")
    expected_result = 10.33333333
    assert actual_result == pytest.approx(expected_result)

def test_calculate_crew_size():
    """
    Test that calculate_crew_size returns expected ground truth values
    for typical crew values
    """
    actual_result = calculate_crew_size("Valentina Tereshkova;")
    expected_result = 1
    assert actual_result == expected_result

    actual_result = calculate_crew_size("Judith Resnik; Sally Ride;")
    expected_result = 2
    assert actual_result == expected_result


def test_calculate_crew_size_edge_cases():
    """
    Test that calculate_crew_size returns expected ground truth values
    for edge case where crew is an empty string
    """
    actual_result = calculate_crew_size("")
    assert actual_result is None


def test_summarise_categorical():
    """
    Test that summarise_categorical correctly tabulates
    distribution of values (counts, percentages) for a simple ground truth
    example
    """
    test_input = pd.DataFrame({
        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
    }, index=[0, 1, 2, 3, 4])

    expected_result = pd.DataFrame({
        'country': ["Russia", "USA"],
        'count': [2, 3],
        'percentage': [40.0, 60.0],
    }, index=[0, 1])

    actual_result = summarise_categorical(test_input, "country")

    pdt.assert_frame_equal(actual_result, expected_result)


def test_summarise_categorical_missvals():
    """
    Test that summarise_categorical correctly tabulates
    distribution of values (counts, percentages) for a ground truth
    example (edge case where column contains missing values)
    """
    test_input = pd.DataFrame({
        'country': ['USA', 'USA', 'USA', "Russia", pd.NA],
    }, index=[0, 1, 2, 3, 4])

    expected_result = pd.DataFrame({
        'country': ["Russia", "USA", np.nan],
        'count': [1, 3, 1],
        'percentage': [20.0, 60.0, 20.0],
    }, index=[0, 1, 2])
    actual_result = summarise_categorical(test_input, "country")

    pdt.assert_frame_equal(actual_result, expected_result)
    


def test_summarise_categorical_invalid():
    """
    Test that summarise_categorical raises an
    error when a non-existent column is input
    """
    test_input = pd.DataFrame({
        'country': ['USA', 'USA', 'USA', "Russia", "Russia"],
    }, index=[0, 1, 2, 3, 4])

    with pytest.raises(KeyError):
        summarise_categorical(test_input, "vehicle")

Finally, let’s commit our test suite to our codebase and push the changes to GitHub.

BASH

git add eva_data_analysis.py 
git commit -m "Add additional analysis functions"
git add tests/
git commit -m "Add test suite"
git push origin main

Continuous Integration (Optional)


Continuous Integration

So far, we have run our tests locally using:

BASH

python -m pytest

A limitation of this approach is that we must remember to run our tests each time we make any changes.

Continuous integration services provide the infrastructure to automatically run a
code’s test suite every time changes are pushed to a remote repository.

This means that each time we (or our colleagues) push to the remote, the test suite will be run to verify that our tests still pass.

If we are using GitHub, we can use the continuous integration service GitHub Actions to automatically run our tests.

To set this up:

  • Navigate to the spacewalks folder:

BASH

cd ~/Desktop/Spacewalks
  • To set up continuous integration on GitHub Actions, the dependencies of our code must be recorded in a requirements.txt file in the root of our repository.
  • You can find out more about creating requirements.txt files from CodeRefinery’s tutorial on “Recording Dependencies”.
  • For now, add the following list of code dependencies to requirements.txt in the root of the spacewalks repository:

BASH

touch requirements.txt

Content of requirements.txt:

OUTPUT

numpy
pandas
matplotlib
pytest
pytest-cov
  • Commit the changes to your repository:

BASH

git add requirements.txt
git commit -m "Add requirements.txt file"

Now let’s define our continuous integration workflow:

  • Create a hidden folder .github/workflows

BASH

mkdir -p .github/workflows
touch .github/workflows/main.yml
  • Define the continuous integration workflow to run on GitHub Actions.

YAML

name: CI

# We can specify which Github events will trigger a CI build
on: push

# now define a single job 'build' (but could define more)
jobs:

  build:

    # we can also specify the OS to run tests on
    runs-on: ubuntu-latest

    # a job is a sequence of steps
    steps:

    # Next we need to check out our repository and set up Python
    # A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Set up Python 3.12
      uses: actions/setup-python@v4
      with:
        python-version: "3.12"

    - name: Install Python dependencies
      run: |
        python3 -m pip install --upgrade pip
        python3 -m pip install -r requirements.txt

    - name: Test with PyTest
      run: |
        python3 -m pytest --cov

This workflow definition file instructs GitHub Actions to run our unit tests using Python version 3.12 each time code is pushed to our repository.

  • Let’s push these changes to our repository and see if the tests are run on GitHub.

BASH

git add .github/workflows/main.yml
git commit -m "Add GitHub actions workflow"
git push origin main
  • To find out if the workflow has run, navigate to the following page in your browser:
https://github.com/YOUR-REPOSITORY/actions
  • On the left of this page a sidebar titled “Actions” lists all the workflows that are active in our repository. You should see “CI” listed here (the name of the workflow we just added to our repository).
  • The body of the page lists the outcome of all historic workflow runs. If the workflow was triggered successfully when we pushed to the repository, you should see one workflow run listed here.

Summary

During this episode, we have covered how to use software tests to verify the correctness of our code. We have seen how to write a unit test, how to manage and run our tests using the pytest framework, and how to identify which parts of our code require additional testing using test coverage reports.

These skills reduce the probability that there will be a “mistake in our code” and support reproducible research by giving us the confidence to engage in open research practices. Tests also document the intended behaviour of our code for other developers and mean that we can experiment with changes to our code knowing that our tests will let us know if we break any existing functionality. In other words, software testing supports FAIR software by making our code more Accessible and Reusable.

To find out more about this topic, please see the “Further reading” section below.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  1. Code testing supports the FAIR principles by improving the accessibility and re-usability of research code.
  2. Unit testing is crucial as it ensures each function works correctly.
  3. Using the pytest framework, you can write basic unit tests for Python functions to verify their correctness.
  4. Identifying and handling edge cases in unit tests is essential to ensure your code performs correctly under a variety of conditions.
  5. Test coverage can help you to identify parts of your code that require additional testing.

Content from Documenting code


Last updated on 2024-07-04 | Edit this page

Overview

Questions

  • How should we document our code?
  • Why are documentation and repository metadata important and how do they support FAIR software?
  • What are the minimum elements of documentation needed to support FAIR software?

Objectives

After completing this episode, participants should be able to:

  • Use a README file to provide an overview and a CITATION.cff file to add citation instructions to a code repository
  • Describe the main types of software documentation (tutorials, how to guides, reference and explanation).
  • Apply a documentation framework to write effective documentation of any type.
  • Describe the different formats available for delivering software documentation (Markdown files, wikis, static webpages).
  • Implement MkDocs to generate and manage comprehensive project documentation

Motivation for Documenting Code


The purpose of software documentation is to communicate important information about our code to the people who need it – users and developers.

Better Research

Software documentation is often perceived as a thankless task with few tangible benefits and is often neglected in research projects. However, like software testing, documenting our code can help us become more productive researchers. Here are some advantages of documenting our code:

  • Good documentation captures important methodological details ready for when we come to publish our research
  • Good documentation can help us return to a project seamlessly after time away.
  • Documentation can facilitate collaborations by helping to onboard new project members
  • Good documentation can save us time by answering FAQs about our code for us.

FAIR Software

Software documentation supports FAIR software by improving the re-usability of our code.

  • How-to guides and Tutorials ensure that users can install our software independently and make use of its basic features

  • Reference guides and background information can help developers understand our code sufficiently to modify / extend / repurpose it.

Software-level Documentation


In previous episodes we encountered several different forms of in-code documentation including in-line comments and docstrings.

These are an excellent way to improve the readability of our code, but by themselves are insufficient to ensure that our code is easy to use, understand and modify - this requires additional software-level documentation.

Types of Documentation

There are many different types of software-level documentation.

Technical Documentation

Software-level technical documentation encompasses:

  • Tutorials - lessons that guide learners through a series of exercises to build proficiency in using the code.
  • How-To Guides - step by step instructions on how to accomplish specific goals using the code.
  • Reference - a lookup manual to help users find relevant information about the software e.g. functions and their parameters.
  • Explanation - conceptual discussion of the code to help users understand implementation decisions

Repository Metadata Files

In addition to software-level technical documentation, it is also common to see repository metadata files included in a code repository. Many of these files can be described as “social documentation” i.e. they indicate how users should “behave” in relation to our software project. Some common examples of repository metadata files and their role are tabulated below:

File                Description
CONTRIBUTING.md     Explains to developers how to contribute code to the project, including processes and standards that should be followed.
CODE_OF_CONDUCT.md  Defines expected standards of conduct when engaging in a software project.
LICENSE             Defines the (legal) terms of use of a piece of code.
CITATION.cff        Provides instructions on how and when to cite the code.

Just Enough Documentation


For many small projects, the following three pieces of documentation will be sufficient:

  • README - A document that provides an overview of the project, including installation, usage instructions, and dependencies. A README may include one or more of the technical documentation types - tutorial / how-to / explanation / reference.
  • LICENSE - A file that outlines the legal terms for using, modifying, and distributing the project.
  • CITATION.cff - A file that provides instructions on how to properly cite the project in academic or professional work.

Let’s look at each of these in turn.

README

A README file acts as a “landing page” for your code repository on GitHub and should provide sufficient information for users and developers to get started with your code.

READMEs and The FAIR Principles

Think about the question below. Your instructors may ask you to share your answer in a shared notes document and/or discuss them with other participants.

Here are some of the major sections you might find in a typical README. Which are essential to support the FAIR principles? Which are optional?

  • Purpose of the code
  • Audience (who the code is intended for)
  • Installation Instructions
  • Contribution Guide
  • How to Get Help
  • License
  • Software Citation
  • Usage Example
  • Dependencies and their versions
  • FAQs
  • Code of Conduct

To support the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), certain sections in a README file are more important than others. Below is a breakdown of the sections that are ESSENTIAL / OPTIONAL in a README to align with these principles.

Essential

  1. Purpose of the code
    • Explanation: Clearly explains what the code does. This is essential for findability and reusability.
  2. Installation Instructions
    • Explanation: Provides step-by-step instructions on how to install the software, ensuring accessibility.
  3. Usage Example
    • Explanation: Provides examples of how to use the code, helping users understand its functionality and enhancing reusability.
  4. License
    • Explanation: Specifies the terms under which the code can be used, which is crucial for legal clarity and reusability.
  5. Dependencies and their versions
    • Explanation: Lists the external libraries and tools required to run the code, including their versions. This is essential for reproducibility and interoperability.
  6. Software Citation
    • Explanation: Provides citation information for academic use, ensuring proper attribution and reusability.

Optional

  1. Audience (who the code is intended for)
    • Explanation: Helps users identify if the code is relevant to them, improving findability and usability.
  2. How to Get Help
    • Explanation: Informs users where they can get help, ensuring better accessibility.
  3. Contribution Guide
    • Explanation: Encourages and guides contributions from the community, enhancing the code’s development and reusability.
  4. FAQs
    • Explanation: Provides answers to common questions, aiding in troubleshooting and improving accessibility.
  5. Code of Conduct
    • Explanation: Sets expectations for behaviour in the community, fostering a welcoming environment and enhancing accessibility.

Let’s create a simple README for our repository.

BASH

$ cd ~/Desktop/Spacewalks
$ touch README.md

Let’s start by adding a one-liner that explains the purpose of our code and who it is for.

# Spacewalks

## Overview
Spacewalks is a Python-based analysis tool for researchers to generate visualisations
and statistical summaries of NASA's extravehicular activity datasets.

Now let’s add a list of Spacewalk’s key features:

## Features
Key features of Spacewalks:
- Generates a CSV table of summary statistics of extravehicular activity crew sizes
- Generates a line plot to show the cumulative duration of space walks over time

Now let’s tell users about any Pre-requisites required to run the software:

## Pre-requisites

Spacewalks was developed using Python version 3.12

To install and run Spacewalks you will need to have Python >=3.12
installed. You will also need the following libraries (minimum versions in brackets):

- [NumPy](https://www.numpy.org/) >=2.0.0 - Spacewalk's test suite uses NumPy's statistical functions
- [Matplotlib](https://matplotlib.org/stable/index.html) >=3.0.0  - Spacewalks uses Matplotlib to make plots
- [pytest](https://docs.pytest.org/en/8.2.x/#) >=8.2.0  - Spacewalks uses pytest for testing
- [pandas](https://pandas.pydata.org/) >= 2.2.0 - Spacewalks uses pandas for data frame manipulation 

Spacewalks README

Extend the README for Spacewalks by adding: a) installation instructions, b) a simple usage example.

Installation instructions:

NB: In the solution below the back ticks of each code block have been escaped to avoid rendering issues.

# Installation instructions

+ Clone the Spacewalks repository to your local machine using Git.
If you don't have Git installed, you can download it from the official Git website.

\`\`\`bash
git clone https://github.com/your-repository-url/spacewalks.git
cd spacewalks
\`\`\`

+ Install the necessary dependencies:
\`\`\`bash
pip install pandas==2.2.2 matplotlib==3.8.4 numpy==2.0.0 pytest==8.2.2
\`\`\`

+ To ensure everything is working correctly, run the tests using pytest.

\`\`\`bash
python -m pytest
\`\`\`

Usage instructions:

# Usage Example

To run an analysis using the eva_data_analysis.py script from the command line terminal,
launch the script using Python as follows:

\`\`\`bash
# Usage Examples
python eva_data_analysis.py eva-data.json eva-data.csv
\`\`\`

The first argument is the path to the JSON data file.
The second argument is the path to the CSV output file.

LICENSE

A license file outlines the legal terms for using, modifying, and distributing the project. We’ll talk about choosing a license in detail in a later episode.

CITATION File

We can add a citation file to our repository to provide instructions on how and when to cite our code.

Citation files use a special format called the Citation File Format (CFF), a way to include metadata about software or datasets in plain text files, making it easy for both humans and machines to use this information.

Why Use CFF?

  • For developers, using a CFF file can help to automate the process of publishing new releases on Zenodo via GitHub.

  • For users, having a CFF file makes it easy to cite the software or dataset, with formatted citation information available for copy-paste and direct import into reference managers like Zotero from GitHub.

Creating a CFF File

There are a few common formats used for citation files including markdown and plain text. We will use the Citation File Format (CFF) for our project. A CFF file is written in YAML format. At a minimum a CFF file must contain the title of the software/data, the type of asset (software/data) and at least one author:

YAML

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: My Software
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Anne
    family-names: Researcher

Additional, optional metadata includes an Abstract, repository URL and more.

Steps to Make Your Software Citable

We can create (and update) a CFF file for our software using an online Web App called cffinit.

To create a dummy citation file for a project called “Spacetravel” with author “Max Hypothesis”, follow these steps:

  1. First, head to cffinit online.
  2. Then, let’s work through the metadata input form to complete the minimum information needed to generate a CFF. We’ll also add the following abstract:

“Spacetravel - a simple python script to calculate time spent in Space by individual NASA astronauts”

  3. At the end of the process, download the CFF file and inspect it. It should look like this:

YAML

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Spacetravel
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Max
    family-names: Hypothesis
abstract: >-
    A simple python script to calculate time spent in Space by individual NASA astronauts

Updating and citing

CFF files can also be updated using cffinit.

To cite our software (or dataset), once a CFF file has been pushed to our remote repository, GitHub’s “Cite this repository” button can be used to generate a citation in various formats (APA, BibTeX).

Tools

Command line tools are also available for creating, validating, and converting CFF files. Further information is available from the Turing Way’s guide to software citation.

Spacewalks Software Citation

Write a software citation file for the Spacewalks code and add it to the root folder of our project.

  • Add the URL of the code repository as a “Related Resources”
  • Add a one-line description of the code under the “Abstract” section
  • Add at least two key words under the “Keywords” section

Use cffinit, a web application to create your citation file using a series of online forms.

YAML

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Spacewalks
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Sara
    family-names: Jaffa
  - given-names: Aleksandra
    family-names: Nenadic
  - given-names: 'Kamilla'
    family-names: Kopec-Harding
  - given-names: YOUR
    family-names: NAME-HERE
repository-code: >-
  https://github.com/YOUR-REPOSITORY-URL/spacewalks.git
abstract: >-
  A python script to analyse NASA extravehicular activity
  data
keywords:
  - NASA
  - Extravehicular Activity

Documentation Tools


Once our project reaches a certain size or level of complexity we may want to add additional documentation such as a standalone tutorial or “Background” explaining our methodological choices.

Once we move beyond using a README as our primary source of documentation, we need to consider how we will distribute our documentation to our users.

Options include:

  • A docs/ folder of Markdown files.
  • Adding a Wiki to our repository.
  • Creating a set of web pages for our documentation using a static site generator for our documentation such as Sphinx or MkDocs

Creating a static site is a popular solution as it has the key benefit of being able to generate a reference manual automatically from any docstrings we have added to our code.

MkDocs

Let’s setup MkDocs.

BASH

python -m pip install mkdocs
python -m pip install "mkdocstrings[python]"
python -m pip install mkdocs-material

Let’s check that MkDocs has been setup correctly:

BASH

python -m pip list

Let’s create a new MkDocs project in the current directory:

BASH

# In ~/Desktop/spacewalks
mkdocs new .    

OUTPUT

INFO    -  Writing config file: ./mkdocs.yml
INFO    -  Writing initial docs: ./docs/index.md

This command creates a new mkdocs project in the current directory with a docs folder containing an index.md file and a mkdocs.yml configuration file.

Now, let’s fill in the configuration file for our project.

YAML

site_name: Spacewalks Documentation

theme:
  name: "material"
  font: false
nav:
  - Spacewalks Documentation: index.md
  - Tutorials: tutorials.md
  - How-To Guides: how-to-guides.md
  - Reference: reference.md
  - Background: explanation.md

Note: font: false stops the theme from loading fonts from Google Fonts, which helps with GDPR compliance.

Let’s add support for mkdocstrings - this will allow us to automatically insert our docstrings into our documentation using a simple tag.

YAML

site_name: Spacewalks Documentation
use_directory_urls: false

theme:
  name: "material"
  font: false

nav:
  - Spacewalks Documentation: index.md
  - Tutorials: tutorials.md
  - How-To Guides: how-to-guides.md
  - Reference: reference.md
  - Background: explanation.md

plugins:
  - mkdocstrings

Let’s populate our docs/ folder to match our configuration file.

BASH

touch docs/tutorials.md
touch docs/how-to-guides.md
touch docs/reference.md
touch docs/explanation.md

Let’s populate our reference file with some preamble to include before the reference manual that will be generated from the docstrings we created.

MARKDOWN

This file documents the key functions in the Spacewalks tool.
It is provided as a reference manual.

::: eva_data_analysis

Finally, let’s build our documentation.

BASH

mkdocs build

OUTPUT

INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: /Users/AnnResearcher/Desktop/Spacewalks/site
WARNING -  griffe: eva_data_analysis.py:105: No type or annotation for returned value 'int'
WARNING -  griffe: eva_data_analysis.py:84: No type or annotation for returned value 1
WARNING -  griffe: eva_data_analysis.py:33: No type or annotation for returned value 1
INFO    -  Documentation built in 0.31 seconds

Once the build step is completed, our documentation site is saved to a site folder in the root of our project folder.

These files will be distributed with our code. We can either direct users to read these files locally on their own device using their browser, or we can choose to host our documentation as a website that our users can navigate to.

Note that we used the setting use_directory_urls: false in the mkdocs.yml file. This setting ensures that the documentation site is generated with URLs that are easy to navigate locally on a user’s device.

Finally let us commit our documentation to the main branch of our git repository and push the changes to GitHub

BASH

git add mkdocs.yml 
git add docs/
git add site/
git commit -m "Add project-level documentation"
git push origin main

Hosting Documentation (Optional)


Hosting Documentation

In the previous section, we saw how mkdocs documentation can be distributed with our repository and viewed “offline” using a browser.

We can also make our documentation available as a live website by deploying our documentation to a hosting service.

GitHub Pages

As our repository is hosted in GitHub, we can use GitHub Pages - a service that allows GitHub users to host websites directly from their GitHub repositories.

There are two types of GitHub Pages: project pages and user/organization pages. While similar, they have different deployment workflows, and we will only discuss project pages here. For information about deploying to user/organization pages, see the MkDocs Deployment pages.

Project Pages deploy site files to a branch within the project repository (default is gh-pages). To deploy our documentation:

Warning! Before we proceed to the next step, we MUST ensure that there are no uncommitted changes or untracked files in our repository.

If there are, the commands used in the upcoming steps will include them in our documentation!

  1. (If not done already), let us commit our documentation to the main branch of our git repository and push the changes to GitHub

BASH

git add mkdocs.yml 
git add docs/
git add site/
git commit -m "Add project-level documentation"
git push origin main
  2. Once we are on the main branch and all our changes are up to date, run the following command to deploy our documentation to GitHub.

BASH

# Important: 
# - This command will push the documentation to the gh-pages branch of your repository
# - It will ALSO include uncommitted changes and untracked files (read the warning above!!) <- VERY IMPORTANT!!
mkdocs gh-deploy

OUTPUT

INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: /Users/AnnResearch/Desktop/Spacewalks/site
WARNING -  griffe: eva_data_analysis.py:105: No type or annotation for returned value 'int'
WARNING -  griffe: eva_data_analysis.py:84: No type or annotation for returned value 1
WARNING -  griffe: eva_data_analysis.py:33: No type or annotation for returned value 1
INFO    -  Documentation built in 0.37 seconds
WARNING -  Version check skipped: No version specified in previous deployment.
INFO    -  Copying '/Users/AnnResearcher/Desktop/Spacewalks/site' to 'gh-pages' branch and pushing to
           GitHub.
Enumerating objects: 63, done.
Counting objects: 100% (63/63), done.
Delta compression using up to 11 threads
Compressing objects: 100% (60/60), done.
Writing objects: 100% (63/63), 578.91 KiB | 7.93 MiB/s, done.
Total 63 (delta 7), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (7/7), done.
remote:
remote: Create a pull request for 'gh-pages' on GitHub by visiting:
remote:      https://github.com/kkh451/spacewalks/pull/new/gh-pages
remote:
To https://github.com/kkh451/spacewalks-dev.git
 * [new branch]      gh-pages -> gh-pages
INFO    -  Your documentation should shortly be available at: https://kkh451.github.io/spacewalks/

This command will build our documentation with MkDocs, then commit and push the files to the gh-pages branch using the ghp-import tool, which is installed as a dependency of MkDocs.

For more options, use:

BASH

mkdocs gh-deploy --help

Notice that the deploy command did not allow us to preview the site before it was pushed to GitHub, so it is a good idea to check changes locally with the build commands before deploying.

Other Options

You can find out about other deployment options, including the free documentation hosting service ReadTheDocs, on the MkDocs deployment pages.

Documentation Guides


Once we start to consider other forms of documentation beyond the README, we can also increase the re-usability of our code by ensuring that the content and style of our documentation matches its purpose.

Documentation guides such as Write the Docs, The Good Docs Project and the Diataxis framework provide a range of resources, including documentation templates, to help us do this.

A Spacewalks How-to Guide

  1. Review the Diataxis guidance page on writing a How-to guide. Identify three features of an effective how-to guide.

  2. Following the Diataxis guidelines, add a How-to Guide to the docs folder that shows users how to change the destination filename for the output dataset generated by Spacewalks.

An effective How-to guide should:

  • be goal oriented and focus on action.
  • avoid teaching or explanation
  • use appropriate language e.g. conditional imperatives
  • have an informative title
# How to change the file path of Spacewalk's output dataset

This guide shows you how to set the file path for Spacewalk's output
data set to a location of your choice.

By default, the cleaned data set generated by Spacewalk is saved to the current
working directory with file name `eva-data.csv`.

If you would like to modify the name or location of the output dataset, set the
second command line argument to your chosen file path.

\`\`\`bash
python eva_data_analysis.py eva-data.json data/clean/eva-data-clean.csv
\`\`\`

The specified destination folder must exist before running Spacewalks.

The Diataxis framework provides guidance for developing technical documentation for different purposes. Tutorials differ in purpose and scope to How-to Guides, and as a result, differ in content and style.

We have adapted the How-to guide from the previous challenge into the example tutorial below.

Example Tutorial: Changing the File Path for the Spacewalks Output Dataset

Introduction

In this tutorial, we will learn how to change the file path for the output dataset generated by Spacewalk. By the end of this tutorial, you will be able to specify a custom file path for the cleaned dataset.

Prerequisites

Before you start, ensure you have the following:

  • Python installed on your system
  • The Spacewalk script (eva_data_analysis.py)
  • An input dataset (eva-data.json)

Prepare the Destination Directory

First, let us decide where we want to save the cleaned dataset and make sure the directory exists.

For this tutorial, we will use data/clean as the destination folder.

Let’s create the directory if it doesn’t exist:

BASH

mkdir -p data/clean

Run the Spacewalk Script with Custom Path

Next, execute the Spacewalk script and specify the custom file path for the output dataset:

BASH

python eva_data_analysis.py <input-file> <output-file>

Replace <input-file> with your input dataset (eva-data.json) and <output-file> with your desired output path (data/clean/eva-data-clean.csv).

Here is the complete command:

BASH

python eva_data_analysis.py eva-data.json data/clean/eva-data-clean.csv

Notice how the output to the command line clearly indicates that we are using a custom output file path.

OUTPUT

Using custom input and output filenames
Reading JSON file eva-data.json
Saving to CSV file data/clean/eva-data-clean.csv
Adding crew size variable (crew_size) to dataset
Saving to CSV file data/clean/eva-data-clean.csv
Plotting cumulative spacewalk duration and saving to ./cumulative_eva_graph.png

After running the script, let us check the data/clean directory to ensure the cleaned dataset has been saved correctly.

BASH

ls data/clean

You should see eva-data-clean.csv listed in the data/clean folder.

Exercise: Custom Output Path

  • Create a new directory named output/data in your working directory.
  • Run the Spacewalk script to save the cleaned dataset in the newly created output/data directory with the filename cleaned-eva-data.csv.
  • Verify that the dataset has been saved correctly.
Solution

BASH

# Create the directory:
mkdir -p output/data

# Run the script:
python eva_data_analysis.py eva-data.json output/data/cleaned-eva-data.csv

# Verify the output:
ls output/data

# You should see cleaned-eva-data.csv listed

Summary

Congratulations! You have successfully changed the file path for the Spacewalks output dataset and completed an exercise to practice the process. You can now customize the output location and filename according to your needs.

A Spacewalks Tutorial

How does the content and language of our example tutorial differ from our example how-to guide?

  • The tutorial clearly signposts what will be covered
  • Includes a narrative of each step and the expected output
  • Highlights important behaviour the learner should notice
  • Includes an exercise to practice skills

Language

  • Uses “we” language
  • Uses imperative to provide clear instructions “First do x, then do y”

Summary

During this episode, we have covered the basics of documenting our code including how to write an effective README and how to add citation instructions to our repository. We saw that documentation frameworks like Diataxis can help us to write high-quality documentation for different purposes, while static site generators like MkDocs can help us to distribute our documentation to our users.

To find out more about this topic, please see the “Further reading” section below.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Documentation allows users to run and understand software without having to work things out for themselves directly from the source code.
  • Software documentation supports the FAIR principles by improving the reusability of research code.
  • A (good) README, CITATION file and LICENSE file are the minimum documentation elements required to support FAIR research code.
  • Documentation can be provided to users in a variety of formats including a docs folder of Markdown files, a repository Wiki and static webpages.
  • A static documentation site can be created using the tool mkdocs.
  • Documentation frameworks such as Diataxis provide content and style guidelines that can help us write high-quality documentation.

Content from Open project collaboration & management


Last updated on 2024-07-08 | Edit this page

Overview

Questions

  • What licensing information should I include with my code and data?
  • How do I share my code on GitHub?
  • How do I ensure my code is citable?
  • How do I track bugs in GitHub repositories and ensure that multiple developers can work on the same files simultaneously?

Objectives

After completing this episode, participants should be able to:

  • Explain why adding licensing information to a repository is important.
  • Understand that some licenses are intended for code and others for data.
  • Understand your rights and obligations under the GPL, MIT, BSD, Apache 2 and Creative Commons licenses.
  • Recall that open source software can be sold and used in commercial products but that license declaration will need to be included and any other terms of the license adhered to.
  • Apply an appropriate open source license to a code repository that is shared on GitHub.
  • Understand how to archive code to Zenodo and write a citation file that can be included with code shared on GitHub.
  • Understand how to track issues with GitHub.
  • Understand how to use Git branches for working on code in parallel and how to merge code back using pull requests.
  • Apply issue tracking, branching and pull requests together to fix bugs while allowing other developers to work on the same code.

Licensing

What is licensing and why is it important?

The authors of any creative work such as written text, photographs, films, music and computer code are protected by copyright law. This allows them to set the terms under which their work can be copied or reproduced. This often takes the form of the creator of the work selling copies of it and receiving a fee from each person who obtains a copy. Those receiving the copies are typically forbidden from making additional copies without the permission of the creator. For example, if an author writes a book, they will receive some money for each copy of that book that is sold, and anybody who makes and sells a copy of that book would need their permission to do so.

Copyright licenses don’t necessarily require any money to be charged for copies and the creator might choose to let anybody copy their work, providing they don’t change it and keep their name on it.

Copyright applies automatically: any creative work which doesn’t specify a license should be assumed to be copyrighted. Technically the work is copyrighted the moment it is created, even without any kind of copyright statement, registration or license agreement, although if it ever came to a court case the author might have a problem proving this and enforcing their copyright.

An Open Source license is a form of copyright license that gives anybody who receives a copy of a work the right to make additional copies and distribute them to anybody else. It also allows anybody who receives a copy to make changes to the original work and to redistribute those. When applied to software this usually requires that anybody supplying the executable binary of a program must also supply the source code if requested.

A common way to declare your copyright of a program and the license you are distributing it under is to include a file called LICENSE in your code repository, to display it in comments in a code file or to display it on screen when the program is run. At the very least each source file should state what license the code is under and tell the reader to refer to the LICENSE file. This means that if the code ever gets (accidentally) redistributed without the license file then the reader will still know about the license that was used.

Free Software

In the early history of computing there were often informal agreements where programmers would share source code with each other, but this was rarely backed up with any formal copyright license. As the field grew this was first formalised in the 1980s by Richard Stallman, who formed the Free Software Foundation and defined “Free Software” as software which respects users’ freedoms by granting them four freedoms:

  1. The freedom to run the program as you wish, for any purpose.
  2. The freedom to study how the program works, and change it so it does your computing as you wish. Access to the source code is a precondition for this.
  3. The freedom to redistribute copies so you can help your neighbor.
  4. The freedom to distribute copies of your modified versions to others. By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

The term “Free” in English often causes some confusion here as it can mean both “free as in freedom” and “free as in beer” (no charge). The term “Libre” software is sometimes used instead of “Free” to help clarify this.

Open Source Definition

In the 1990s the "Open Source" movement wanted a more business-friendly term, thinking that "Free" might lead to confusion when people wanted to charge for software. In fact, despite "Free Software" often being given away for no charge (or for a small charge/suggested donation to cover media or distribution costs), nothing in the four freedoms or the software licenses designed around them prohibits charging for software.

In 1998 the Open Source Definition was created. It says that software which is open source must be distributed under a license which allows all of the following:

  1. Free Redistribution must be allowed
  2. Source Code must be included
  3. Derived Works must be allowed
  4. Integrity of The Author's Source Code - restrictions on distributing modified source code are allowed only if patches can be distributed instead; the license must explicitly permit distribution of software built from modified source code, and may require derived works to carry a different name or version number from the original software.
  5. No Discrimination Against Persons or Groups
  6. No Discrimination Against Fields of Endeavor
  7. Distribution of License - License applies to anyone the software is distributed to
  8. License Must Not Be Specific to a Product - the program can be extracted from other software it is distributed with
  9. License Must Not Restrict Other Software - e.g. it can't insist that the software is only distributed alongside other open source software.
  10. License Must Be Technology-Neutral

Free and Open Source Software

Because of these two terms, many people use the term "Free and Open Source Software" (FOSS), while others simply say "Open Source Software" and some say "Free Software". In reality these all have very similar meanings and will respect the four freedoms and the Open Source Definition. For the rest of this chapter we will refer to "Open Source", but could have equally said "Free and Open Source Software" or "Free Software".

Types of licenses

A number of licenses have been written that conform to the Four Freedoms and/or the Open Source Definition. Each of these sets out the rights of anybody receiving software that is distributed under that license. We will look at a few of the most popular ones.

Public Domain

Before we look at any licenses in detail, one other thing to consider is the public domain. This is a concept in some countries (but not all) where a work is not protected by copyright. Such works have no license from their creators and typically anybody can do anything they like with them. Copyright is usually time-limited, typically expiring between 50 and 100 years after the death of the author depending on the jurisdiction, at which point the work enters the public domain. It is for this reason that the works of Charles Dickens are no longer subject to copyright and anybody can make a copy of them. In some jurisdictions works can be placed straight into the public domain simply with a declaration within the work, and in the USA works produced by the federal government are automatically public domain. In other jurisdictions, however, there is no clear mechanism for dedicating a work to the public domain. For this reason it isn't recommended to simply declare your work public domain if you want to release it for use across the world.

If you really want to release code under something very similar to the public domain then the Unlicense could be for you. It states that anybody is free to use the software in any way they like, and in jurisdictions that recognise the public domain it "dedicates" the software to the public domain.

Permissive Licenses

Some of the simplest open source licenses are known as the "permissive licenses". Broadly speaking, these require anybody redistributing the code to include the license text and a copyright statement crediting the authors, but they do not require modified versions to be released with their source code. This allows software released under these licenses to be incorporated into a closed source program: the creator of the closed source program need only distribute the license text and perhaps a statement crediting the original author, without redistributing the source code.

MIT License

The MIT License is very simple and a very popular choice of license; it is only three paragraphs long. It states that:

  • Anybody receiving the software can copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software to anybody they like.
  • That they must include the license text and copyright statement when they redistribute it.
  • That there’s no warranty included and you can’t hold the authors liable for any problems or damages that might arise.

BSD

The BSD license is very similar to the MIT license, but it has several variants. The oldest is the 4-clause version, which requires:

  • Source code redistributions must include the copyright statement and license text.
  • Binary redistributions must include the copyright notice and license text in their documentation.
  • Any advertising for products incorporating the software must include a statement saying that it includes software written by the author.
  • That the author doesn’t endorse any derived products.
  • That the author doesn't provide a warranty (this is written in a separate section and isn't one of the 4 clauses).

The acknowledgement/advertising clause caused some controversy: some people thought that it didn't meet the Open Source Definition or comply with the four freedoms. It also became impractical, leading to very long statements when large pieces of software had multiple authors.

To address this, a new version known as the 3-clause BSD license was created, which removed the acknowledgement clause. An even simpler 2-clause version also removes the endorsement clause.

Apache 2.0

Another popular permissive license is the Apache 2.0 license. It is very similar to the MIT and BSD licenses, but also includes some clauses about patent licensing. These require any contributor to a program which is licensed under Apache 2.0 to allow their patents to be used free of charge by any user of the software. This was done to head off some concerns about software patent legislation at the time the license was developed.

Copyleft Licenses

The other common form of open source license is the "copyleft" license. These require that any modifications to the program are also released under a compatible copyleft license. This can cause complications when combining copyleft code with code under other licenses, since the entire combined codebase must then be released under the copyleft license; if the other license has any incompatible terms, the two codebases cannot be combined.

The main advantage of copyleft licenses is that anybody who incorporates the code into another product must also keep that product open. This makes it harder for the code to be subsumed into a commercial product that doesn't contribute improvements back to the community which created it.

GPL version 2

The most commonly known copyleft license is the GNU General Public License, or GPL. There are several versions of the GPL; the oldest, version 1.0, was quickly replaced by version 2.0 and was rarely used. Version 2 is much more common and is used by a lot of popular software, including the Linux kernel. Compared with the permissive licenses the GPL is quite a long license agreement, and many of its clauses can be quite difficult for non-lawyers to fully understand.

GPL version 3

GPL version 3 was introduced to try to prevent patenting of software under the license and to require that people distributing GPL code on embedded systems also give users the ability to rebuild that software and use it in their device. It is an even longer document than GPL version 2 and has proved controversial with some; despite coming out in 2007 it hasn't been adopted by many projects which used GPL version 2.

LGPL

In order to allow library software to be combined with ("linked" to, in technical terms) software that is not GPL licensed, a special version of the GPL called the LGPL (L for "Lesser" or "Library") was created. This is used by many popular libraries, including the GNU C library that is linked to every C program compiled on a Linux system.

AGPL

One problem for the GPL is that a lot of modern software doesn't run on users' own computers but on remote web servers that they connect to through their web browsers. When a user runs one of these web applications they aren't receiving a copy of the software's executable binaries but are interacting with them over a computer network such as the internet. This doesn't meet the GPL's definition of distributing the binaries, so even if the web application is under the GPL, the person running the server has no obligation to supply users with the source code. The Affero GPL (AGPL) addresses this by requiring that people running web applications licensed under the AGPL make the source code available to the users of those web applications.

Choosing a License

When choosing a license for your code there are a number of things you might want to consider:

  • Do you want to ensure that anybody modifying and redistributing your code will release the source code of their changes?
  • Or would you prefer the fewest possible restrictions, so that your code will be used as widely as possible, even if that means it might end up in commercial products that don't release their source code?
  • Are you reusing code which is already under an open source license? What obligations do you have under those licenses?
  • Is there a preferred license in your research community?

Don't be tempted to write your own license (or modify an existing one) unless you are a copyright lawyer. The common open source licenses have been carefully written by copyright lawyers; many have undergone multiple iterations in response to cases that have arisen, and the implications of many different legal jurisdictions have been considered. The common licenses are also well known, meaning that potential users and contributors have a better understanding of their rights and responsibilities, leaving less room for misunderstanding.

Tools to help you choose

The website choosealicense.com has some great resources to help you choose a license that’s appropriate for your needs.

License selection exercise

Q: You have created a library of functions that are commonly used by researchers in your field. You would like to share this code with your research community and ensure that the code remains as open as possible to benefit the community. But you would also like to see it being integrated into as many different research codes and even commercial products as possible. What would be a good choice of license? (hint: You can use the choosealicense.com website to help you)

A: The LGPL could be a good choice for library code. It can be linked to software that isn't GPL licensed, but any modifications to the library itself must be released under a GPL-compatible license.

Creative Commons Licenses

All of the licenses we've discussed so far are really only intended for source code. They are not suitable for documentation, datasets, drawings, logos, music, maps etc. To solve this problem there are the Creative Commons licenses, which are expressly built for anything other than source code. Creative Commons has now gone through four versions; generally you should use the latest, version 4.0, as it is better suited for use around the world.

There are several different types of Creative Commons licenses:

CC0

Creative Commons Zero or CC0 is the simplest license and is effectively a public domain license, but is suitable for use worldwide.

Attribution (BY)

All the Creative Commons licenses apart from CC0 require you to give credit to the original creator. This is very similar to what is required by all of the permissive code licenses.

Share-Alike (SA)

Share-Alike Creative Commons licenses require any derivative works to be released under a compatible Creative Commons license. This is very similar to the way that copyleft licenses work.

Non-Commercial (NC)

Non-commercial creative commons licenses only allow for non-commercial use of the work.

No Derivatives (ND)

No Derivatives Creative Commons licenses do not allow users of the work to redistribute modified versions of it.

Combinations of Creative Commons Licenses

Under these rules the following combinations are possible:

  • CC BY - Creative Commons Attribution
  • CC BY-SA - Creative Commons Attribution Share-Alike
  • CC BY-NC - Creative Commons Attribution Non-Commercial
  • CC BY-NC-SA - Creative Commons Attribution Non-Commercial Share-Alike
  • CC BY-ND - Creative Commons Attribution No Derivatives
  • CC BY-NC-ND - Creative Commons Attribution Non-Commercial No Derivatives

License selection exercise 2

Choose a license for your code and data from the previous exercises. Discuss your choice of license, and your reasons for choosing it, with your neighbour or the group.

Adding a license to your code

Add a LICENSE file containing the full text of the license you've chosen to the Git repository of your code from previous chapters of this lesson. Add a copyright statement, the name of the license you are using and a mention of the LICENSE file to at least one source file.
Push your changes to your Github repository, then check the "About" section of your repository's Github webpage to see if a license is now listed.

Relicensing

License Compatibility

Generally, you can relicense someone else's code from a more permissive license to a less permissive one without their permission. For example, you could relicense code from the MIT license to the GPL, because nothing in the GPL contradicts anything in the MIT license. You'll still have to display the MIT license for the original code and the GPL for your modifications.

This doesn't work the other way round, though: you can't take code released under the GPL and relicense it as MIT, since the MIT license is more liberal and doesn't match the terms of the GPL.

Creative Commons Zero or public domain code can usually be relicensed under any open source license (or a commercial license, for that matter).

Sometimes there are contradictory statements in licenses which prevent relicensing. For example, Apache 2.0 has some provisions about software patents, so you won't be able to relicense Apache 2.0 code under MIT, since MIT doesn't have an equivalent patent clause.

The GNU project has a useful guide to license compatibility on their website.

Getting agreement to relicense

Sometimes the developers of a program will wish to change the license it is distributed under. In order to do this they'll need the agreement of all the copyright holders of the program, which typically means everyone who contributed code to it. This is fine if you only have a handful of contributors to your project, but gets harder as the project grows.

Some projects work around this by having contributors agree to a "contributor license agreement". This sets out the terms under which code is contributed to the project. It might include a copyright assignment, or just grant certain rights to the project. This usually allows the company or non-profit foundation running the project (many open source projects are based on non-profit foundations) to relicense the code in future and to take legal action against people who violate the license.

Going commercial

It is possible to sell software that is licensed under an open source license commercially, though you will need to supply the source code along with the binaries. However, under any license meeting the Open Source Definition or the Four Freedoms, the person receiving the software can make copies and give (or sell) them to anybody else.

It is also possible to release software under two licenses: one open source and one commercial. This has been done by a few open source projects whose developers wish to sell the software to some customers (sometimes with extra custom modifications) and give it to others.

Relicensing exercise

Q: Find the webpage of a major open source project that is relevant to your research or the Spacewalks codebase we have been working with. See if you can find a contributor license agreement. Add a link to this in the chat/etherpad/hackmd.

Hint: try looking at Matplotlib - https://matplotlib.org which Spacewalks uses for plotting

Sharing your code to encourage collaboration

Getting Code Ready to Share

In addition to adding a license to our code there are several other important steps to consider before sharing it publicly.

Making the code public

By default repositories created on Github are private and only their creator can see them. If we’re going to be adding an open source license to our repository we probably want to make sure people can actually access it too!

Go to the Github webpage of your repository (https://github.com/<yourusername>/<yourrepositoryname>) and click on the Settings link near the top right corner. Then scroll down to the "Danger Zone" settings at the bottom of the page. Click on "Change Visibility" and you should see a message saying "Change to public" (if it says "Change to private" then the repository is already public). You'll then be asked to confirm that you want to make the repository public and agree to the warning that the code will now be publicly visible. As a security measure you'll then have to enter your Github password.

Transferring to an organisation

Currently our repository is under the Github “namespace” of our individual user. This is ok for individual projects where we are the sole or at least main author, but for bigger and more complex projects it is common to use a Github organisation named after our project. If we are a member of an organisation and have the appropriate permissions then we can transfer a repository from our personal namespace to the organisation’s. This can be done with another option in the “Danger Zone” settings, the “Transfer ownership” button. Pressing this will then prompt us as to which organisation we want to transfer the repository to.

Archiving to Zenodo and Obtaining a DOI

Zenodo is a data archive run by CERN. Anybody can upload datasets up to 50GB to it and receive a Digital Object Identifier (DOI). Zenodo’s definition of a dataset is quite broad and can include code. This gives us a way to obtain a DOI for our code. We can archive our Github repository to Zenodo by doing the following:

  1. Go to the Zenodo Login page and choose to login with Github.
  2. Authorise Zenodo to connect to Github.
  3. Go to the Github page in your Zenodo account. This can be found in the pull down menu with your user name in the top right corner of the screen.
  4. You’ll now have a list of all of your Github repositories. Next to each will be an “On” button. If you have created a new repository you might need to press the “Sync” button to update the list of repositories Zenodo knows about.
  5. Press the “On” button for the repository you want to archive. If this was successful you’ll be told to refresh the page.
  6. The repository should now appear in the list of Enabled repositories at the top of the screen, but it doesn't yet have a DOI. To get one we have to make a release on Github. Click on the repository and then press the green button to create a release. This will take you to Github's release page, where you'll be asked to give a title and description of the release. You will also have to create a tag - a way of giving a friendly name to a particular version of the code in Git, instead of using a long hash. Often we'll create a sequential version number for each release of the software and have the tag name match it, for example v1.0 or just 1.0 (a tag can also be created from the command line, as shown below).
  7. If we now refresh the Zenodo page for this repository we will see that it has been assigned a DOI.
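
If you prefer to create the tag from the command line rather than through Github's release page, a minimal sketch (assuming the release is to be called v1.0 and the remote is named origin) is:

BASH

# Tag the current commit and publish the tag to the remote
git tag v1.0
git push origin v1.0

Once the tag has been pushed, the Github release can be created from the existing tag instead of making a new one in the web interface.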

The DOI doesn't just link to Github: Zenodo will have taken a copy of our repository at the point where we tagged the release. This means that even if we delete it from Github, or if Github were ever to go away or remove it, there will still be a copy on Zenodo. Zenodo allows people to download the entire repository as a single Zip file.

Zenodo will actually have created two DOIs for you. One represents the latest version of the software and will always point to the latest version if you make more releases; the other is specific to the release you made and will always point to that version. We can see both of these by clicking on the DOI link on the Zenodo page for the repository. This page also displays a badge image; you can copy its link and add it to the README file in your Git repository so that people can find the Zenodo version of the repository. If you click on the DOI image in the Details section of the Zenodo page you will be shown instructions for obtaining a link to the DOI badge in various formats, including Markdown. Here is the badge for this repository and the corresponding Markdown:

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11869450.svg)](https://doi.org/10.5281/zenodo.11869450)

Archive your repository to Zenodo

  • Create an account on Zenodo that is linked to your Github account.
  • Use Zenodo to create a release for your repository and obtain a DOI for it.
  • Get the link to the DOI badge for your repository and add the badge image to your README file in Markdown format. Check that this is the DOI for the latest version and not the DOI for a specific version; otherwise you'll be updating it every time you make a release.

Problems with Github and Zenodo integration

The integration between Github and Zenodo does not interact well with some browser privacy features and extensions. Firefox can be particularly problematic: it might open new tabs to log in to Github and then give an error saying "Your browser did something unexpected. Please try again. If the error continues, try disabling all browser extensions." If this happens, try disabling the extra privacy features/extensions or use another browser such as Chrome.

Adding a DOI and ORCID to the citation file

Now that we have our DOI, it is good practice to include this information in our citation file. In the previous part of this lesson we created a CITATION.cff file with information about how to cite our code. There are a few fields we can add which are related to the DOI. One of these is the version field, which records the version number of the software. We can add the DOI itself in the identifiers section, with a type of doi and a value containing the DOI. Optionally we can also add a date-released field indicating the date we released this software. Here is an updated version of our CITATION.cff from the previous episode with a version number, DOI and release date added.

YAML

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: My Software
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Anne
    family-names: Researcher
version: 1.0.1
identifiers:
  - type: doi
    value: 10.5281/zenodo.1234
date-released: 2024-06-01

Add a DOI to your citation file

Add the DOI you were allocated in the previous exercise to your CITATION.cff file and commit/push the updated version to your Github repository. You can remove the commit field from the CITATION.cff file, as the DOI is a better way to point to a given version of the code.
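
A minimal sketch of the commit and push step (assuming your default branch is called main):

BASH

# Stage, commit and publish the updated citation file
git add CITATION.cff
git commit -m "Add version, DOI and release date to citation file"
git push origin main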

Going further with publishing code

We now have our code published online, licensed as open source, archived with Zenodo, accessible via a DOI and accompanied by a citation file to encourage people to cite it. What else might we want to do to improve how findable, accessible or reusable it is? One further step we could take is to publish the code with a peer-reviewed journal. Some traditional journals will accept software submissions, although usually as supplementary material for a paper. There are also journals which specialise in research software, such as the Journal of Open Research Software, the Journal of Open Source Software or SoftwareX. With these venues the submission is the software itself, not a paper, although a short abstract or description of the software is often required.

Working with collaborators

The strength of online collaboration tools such as Github doesn't just lie in the ability to share code. They also allow us to track problems with that code, let multiple developers work on it independently and bring their changes together, and review those changes before they are accepted.

Tracking Issues with Code

A key feature of Github (as opposed to Git itself) is the issue tracker. This provides us with a place to keep track of any problems or bugs in the code and to discuss them with other developers. Sometimes advanced users will also use issue trackers of public projects to report problems they are having (and sometimes this is misused by users seeking help using documented features of the program).

The code from the testing chapter earlier has a bug: an extra bracket in eva_data_analysis.py (and, if you've fixed that, a missing import of summarise_categorical in the test). Let's go ahead and create a new issue in our Github repository to describe this problem. We can find the issue tracker on the "Issues" tab near the top left of the Github page for the repository. Click on this and then click on the green "New Issue" button on the right hand side of the screen. We can then enter a title and description of our issue.

A good issue description should include:

  • What the problem is, including any error messages that are displayed.
  • What version of the software it occurred with.
  • Any relevant information about the system running it, for example the operating system being used.
  • Versions of any dependent libraries.
  • How to reproduce it.

After the issue is created it will be assigned a sequential ID number.

Write an issue to describe our bug

Create a new issue in your repository’s issue tracker by doing the following:

  • Go to the Github webpage for your code
  • Click on the Issues tab
  • Click on the “New issue” button
  • Enter a title and description for the issue
  • Click the “Submit Issue” button to create the issue.

Discussing an issue

Once the issue is created, further discussion can take place with additional comments. These can include code snippets and file attachments such as screenshots or logfiles. We can also reference other issues by writing a # symbol and the number of the other issue. This is sometimes used to identify related issues or if an issue is a duplicate.

Closing an issue

Once an issue is solved it can be closed. This can be done either by pressing the "Close" button in the Github web interface or by making a commit whose message includes the word "fixes", "fixed", "close", "closed" or "closes" followed by a # symbol and the issue number.
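
For example, a commit like the following (assuming the bug was recorded as issue number 1) will close issue #1 automatically once it reaches the default branch:

BASH

# The "fixes #1" keyword in the message links and closes the issue
git commit -m "Remove extra bracket, fixes #1" eva_data_analysis.py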

Working in Parallel with Git Branches

Branching is a feature of Git that allows two or more parallel streams of work. Commits can be made to one branch without interfering with another. Branches are commonly used as a way for one developer to work on a new feature or a bug fix while other developers work on other features. When those new features or bug fixes are complete, the branch will be merged back with the main (sometimes called master) branch.

Creating a new Branch

New Git branches are created with the git branch command, followed by the name of the branch to create. When the bug we are fixing has a corresponding issue, it is common practice to name the branch after the issue number and a short description. For example, we might call the branch 01-extra-bracket-bug instead of something less descriptive like bugfix.

For example, the command:

BASH

git branch 01-extra-bracket-bug

will create a new branch called 01-extra-bracket-bug. We can view the names of all the branches by running git branch with no parameters; by default there should be one branch called main (or perhaps master) along with our new 01-extra-bracket-bug branch. A * will be shown next to the currently active branch.

BASH

git branch

OUTPUT

  01-extra-bracket-bug
* main

We can see that creating a new branch has not activated it. To switch branches we can use either the git switch or the git checkout command followed by the branch name. For example:

BASH

git switch 01-extra-bracket-bug

To create a branch and switch to it in a single command we can use git switch with the -c option (or git checkout with the -b option; git switch is only found in more recent versions of Git).

BASH

git switch -c 02-another-bug

Committing to a branch

Once we have switched to a branch, any further commits will go to that branch. When we run a git commit command we'll see the name of the branch we're committing to in its output. Let's edit our code and fix the bug that we recorded in the issue tracker earlier on.

Change your code from

PYTHON

<call to pandas without checks identified in testing section>

to:

PYTHON

<call to pandas with checks identified in testing section>

and now commit it.

BASH

git commit -m "fixed bug" eva_data_analysis.py

In the output of git commit -m the first part of the output line will show the name of the branch we just made the commit to.

OUTPUT

[01-extra-bracket-bug 330a2b1] fixes extra bracket bug, closes #1

If we now switch back to the main branch, our fix will no longer be present in the source file, and the new commit will not appear in the output of git log.

BASH

git switch main

And if we go back to the 01-extra-bracket-bug branch it will reappear.

BASH

git switch 01-extra-bracket-bug

If we want to push our changes to a remote such as GitHub we have to tell the git push command which branch to push to. If the branch doesn’t exist on the remote (as it currently won’t) then it will be created.

BASH

git push origin 01-extra-bracket-bug

If we now refresh the Github webpage for this repository we should see the 01-extra-bracket-bug branch has appeared in the list of branches.

If we needed to pull changes from a branch on a remote (for example if we’ve made changes on another computer or via Github’s web based editor), then we can specify a branch on a git pull command.

BASH

git pull origin 01-extra-bracket-bug

Merging Branches

When we have completed working on a branch (for example fixing a bug) then we can merge our branch back into the main one (or any other branch). This is done with the git merge command.

This must be run on the TARGET branch of the merge, so we’ll have to use a git switch command to set this.

BASH

git switch main

Now we’re back on the main branch we can go ahead and merge the changes from the bugfix branch:

BASH

git merge 01-extra-bracket-bug

Pull Requests

On larger projects we might need a code review process before changes are merged, especially before they reach the main branch, which may be the publicly released version of the software. Github has a process for this called a "Pull Request"; other Git services use different names, for example GitLab calls them "Merge Requests". A pull request is where one developer requests that another merge code from a branch (or "pull" it from another copy of the repository). The person receiving the request then has the chance to review the code, write comments suggesting changes, or even change the code themselves before merging it. It is also very common for automated checks to be run on a pull request, to ensure the code is of good quality and passes automated tests.

As a simple example, we can now create a pull request for the changes we made on our 01-extra-bracket-bug branch and pushed to Github earlier on. The Github webpage for our repository will now be saying something like "01-extra-bracket-bug had recent pushes n minutes ago - Compare & pull request". Click on this button and create a new pull request.

Give the pull request a title and write a brief description of it, then click the green “Create pull request” button. Github will then check if we can merge this pull request without any problems. We’ll look at what to do when this isn’t possible later on.

There should be a green “Merge pull request” button, but if we click on the down arrow inside this button there are three options on how to handle this request:

  1. Create a merge commit
  2. Squash and merge
  3. Rebase and merge

The default is option 1, which will keep all of the commits made on our branch intact. This can be useful for seeing the whole history of our work, but if we've made a lot of minor edits or repeated attempts at fixing a single bug, keeping all of that history can be excessive. This is where the second option comes in: it places all of our changes from the branch into a single commit, which may be much clearer to other developers, who will now see our bugfix as one commit in the history. The third option merges the branch histories together in a different way that doesn't make merges as obvious; this can make the history easier to read, but it effectively rewrites the commit history and will change the commit hash IDs. Some projects you contribute to might have their own rules about which kind of merge they prefer. For the purposes of this exercise we'll stick with the default merge commit.

Go ahead and click on "Merge pull request", then "Confirm merge". The changes will now be merged together. Github gives us the option to delete the branch we were working on; since its history is preserved in the main branch there isn't any reason to keep it.

Using Forks Instead of Branches

A fork is similar to a branch, but instead of being part of the same repository it is an entirely new copy of the repository. Forks are commonly used by Github users who wish to work on a project that they're not a member of. Typically forking will copy the repository to our own namespace (e.g. github.com/username/reponame instead of github.com/projectname/reponame).

To create a fork on Github use the "Fork" button to the right of the repository name. After we create our fork we can make changes to it, even on its main branch. Github keeps track that a fork has been made and displays a "Contribute" button that creates a pull request back to the original repository. Using this we can request that the changes on our fork are incorporated by the upstream project.
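
To work on a fork locally, a common pattern (a sketch using the placeholder names above, and assuming the upstream default branch is called main) is to clone the fork and add the original repository as a second remote, conventionally called upstream, so its latest changes can be pulled in:

BASH

# Clone your fork
git clone https://github.com/username/reponame
cd reponame
# Track the original repository as a remote called "upstream"
git remote add upstream https://github.com/projectname/reponame
# Pull in the latest changes from the upstream default branch
git pull upstream main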

Pull Request Exercise

Q: Work in pairs for this exercise. Share the Github link of your repository with your partner. If you have set your repository to private, you'll need to add them as a collaborator: go to the settings page of your Github repository's webpage, click on Collaborators in the left hand menu, then click the green "Add People" button and enter the Github username or email address of your partner. They will get an email and an alert within Github to accept your invitation to work on this repository; without doing this they won't be able to access it.

  • Now make a fork of your partner's repository.
  • Edit the CITATION.cff file and add your name to it.
  • Commit these changes to your fork.
  • Create a pull request back to the original repository.
  • Your partner will now receive your pull request and can review it.

Acknowledgements

The content of this episode was inspired by / heavily borrowed from the following resources:

Further reading

We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Open source applies copyright licenses that permit others to reuse and adapt your code or data.
  • Permissive licenses allow code to be used in other products, provided the copyright statement is displayed.
  • Copyleft licenses require the source code of any modifications to be released under a copyleft license.
  • Creative Commons licenses are suitable for non-code files such as documentation and images.
  • Open source software can be sold, but you must supply the source code, and the people you sell it to can give it away to somebody else.
  • Add a LICENSE file to your repository, and add a license notice to each source file in case it gets detached from the LICENSE file.
  • Zenodo can be used to archive a Github repository and obtain a DOI for it.
  • We can include a CITATION file to tell people how to cite our code.
  • Github can track bugs or issues with a program.
  • Git branches can be used to allow multiple developers to work on the same part of a program in parallel.
  • The git branch command shows the list of branches and can create new branches.
  • The git switch command changes which branch we are working on.
  • The git merge command merges another branch into the current one.
  • Pull requests allow developers to work on their own branch/fork and then request other developers review their changes before they are merged.

Content from Ethical considerations for research software


Last updated on 2024-07-04 | Edit this page

Overview

Questions

  • What ethical considerations are there when using and writing research software?
  • How can our research software impact the rights and well-being of others?

Objectives

After completing this episode, participants should be able to:

  • Understand some ethical issues around research software development and usage and how they impact others.

Research software development and its use (along with the data it handles) do not happen in a vacuum - they can pose a number of ethical concerns and challenges for society and the environment, such as contributing to social inequality, compromising privacy and security, increasing environmental impact, and creating social and psychological issues. While these concerns do not strictly fall under the FAIR principles, they all contribute to research software quality, reliability and responsibility.

Some of the ethical considerations around software are listed below.

Software development practices

  • Transparency: openly sharing source code and methodologies to allow for scrutiny and replication of research.
  • Software quality and reliability: rigorous testing to ensure that software performs as documented and produces reliable results.
  • Accountability: ensuring that developers are accountable for the software they create, particularly in terms of accuracy and reliability.

Ethical software reuse

Good research software practices apply not only when we develop software ourselves, but also when we reuse other people's software. Discuss as a group, and write down in the shared document, your thoughts about what ethical considerations we should have in mind and the pros and cons we should weigh when choosing other people's software for reuse.

Here are some considerations:

  • How is software shared? Is it openly available on some code sharing platform or given to us on a memory stick?
  • Are developers of the software credited for their work?
  • Are software developers respectful to each other and other external contributors to or users of the codebase in their communication? Does the project have a code of conduct and contribution guidelines?
  • How is the project funded? Are there any concerns about who is financing the work?
  • Is the project active? How many open issues are there and how long have they been open?
  • When we are selecting tools and libraries to include in our research pipelines or software, we do have a duty to check the quality and reliability of those components. Are they well tested? Is the project using outdated methods? We can also look to other indicators of reliability - e.g. does the project have many open unresolved issues for bugs that might signal unreliability?
  • Do they support the users of their code, and how is that support handled?
  • Are we respecting licenses and giving proper attribution to the developers of the code we are reusing?
  • When incorporating other researchers’ software into our research workflows we should not do it blindly but should inspect it critically and look closely at how the results of that component are generated.

Note: tools such as Snyk are available to help you check software projects that you intend to reuse in your work - they are designed to help you identify weaknesses, violations, vulnerabilities and security issues in open-source software libraries and can help you decide whether or not to use some software.

Intellectual property and licensing

  • Open vs closed source: balancing the benefits of open-source software and data with the rights of creators to protect their intellectual property and data privacy.
  • License compliance: ensuring that software use complies with licensing agreements and avoiding unlicensed (and illegal) use of software and data.

Intellectual property and licensing of AI-generated code

The licensing and intellectual property (IP) status of code generated by large language models (LLMs) and AI pair programmers (such as CodeGen or GitHub Copilot, which operate with an AI system as one of the code developers) raise complex legal and ethical questions that are not clear cut. Discuss as a group, and write down in the shared document, some considerations and pros and cons of using AI-generated code.

Here are some considerations:

  • Intellectual Property (IP) ownership of the AI-generated code - typically, IP laws assign authorship to humans, not machines. However, the authorship of AI-generated code is ambiguous - it could belong to the developer who directs the AI, the developer’s employer or the entity owning the AI. Furthermore, AI pair programmers can generate results that match or closely match code that is publicly available and potentially already licensed. As a result, you might end up incorporating someone else’s (licensed) code into your code base without attribution or considering license implications. Some tools like GitHub Copilot have taken steps to reduce the risk of this by filtering out code completions that match publicly available code.
  • Intellectual Property (IP) ownership of the training data: if the AI model was trained on copyrighted code, the output could potentially be seen as a derivative work, raising IP issues regarding the training data’s licenses.
  • Licensing terms of the AI-generated code - under what terms can the generated code be used? For example, if the AI model was trained on open-source code, the generated code might need to comply with the open-source licenses involved (e.g., GPL, MIT) - and how can we attribute the original authors of the code the AI model was trained on? In addition, ethically, it is good practice to acknowledge the use of AI tools in generating code, even if not legally required.
  • Security and bias concerns of AI-generated code - AI tools can generate code that is insecure, potentially destructive (e.g. deleting files) or use biased algorithms and data (see the section below) - so we should always try to understand and check any generated code before running it.

Responsible use of software & data

  • Dual use: recognizing and mitigating the potential for software to be used for harmful purposes (e.g. cybersecurity tools being used for hacking).
  • Social impact: considering the broader social implications of software applications and data used, e.g. the use of data that has been collected without permission from the groups or individuals (e.g. historically excluded and exploited groups), or data that contains private or upsetting information.
  • Bias mitigation: implement measures to detect and mitigate biases in algorithms and data, ensuring that the software treats all users fairly.

Scraping publicly available data responsibly

You have been tasked with writing software to scrape a public forum to collect personal data for your study. Discuss as a group, and write down, what ethical issues you should consider and how you should act in this case.

  1. Compliance with Terms of Service (ToS) - most websites, including forums, have ToS that explicitly prohibit or restrict web scraping. Violating these terms can lead to legal action and is considered unethical.
  2. User privacy concerns - scraping a forum can collect a vast amount of user data, including personal information, opinions, and conversations. Avoid collecting personally identifiable information (PII) unless it is absolutely necessary and permissible, and protect user privacy (e.g. do not de-anonymize users).
  3. Lack of consent: users of the forum likely have not given explicit consent for their posts and data to be scraped and used by third parties. It’s ethical to inform users and seek consent where possible, or at least be transparent about data usage.
  4. Impact on website performance - scraping can put significant strain on the forum's servers, effectively causing a denial of service and potentially disrupting the service for regular users. Ensure your software is designed to be non-disruptive.
  5. Data use and sharing - clearly define and limit the purpose of data collection, collect only the data necessary for the intended purpose, and be cautious about how the scraped data is shared or sold. Ensure that it is used responsibly and does not harm the users from whom the data was collected.
  6. Data accuracy and integrity - ensure that the scraped data is used accurately and not misrepresented, avoid taking quotes out of context or using data in misleading ways, avoid modifying or altering the data in ways that could be deceptive.
  7. Legal considerations - forum content is often copyrighted, so scraping it without permission can lead to copyright infringement; seek permission where necessary.
  8. Ethical purpose - consider the intent behind scraping the forum and ensure that the purpose is constructive and beneficial (and not harmful and unethical - e.g. creating fake profiles, spam, or harassment).
  9. Transparency and accountability - where possible, be transparent about your research methodology, scraping activities and their purpose, take responsibility for any negative consequences that may arise from scraping and be prepared to address and mitigate any issues.

Best Practices for ethical scraping (a short code sketch illustrating some of these follows the list):

  • Respect robots.txt: Always check and respect the robots.txt file of the forum, which indicates the site’s rules regarding web crawling.
  • Rate limiting: implement rate limiting to avoid overwhelming the server.
  • Data minimization: collect only the data that is necessary for your purpose.
  • User-Agent identification: use a User-Agent string that identifies your scraper and include contact information.
  • Ethical review: if you are scraping for research, consider submitting your methodology for ethical review.
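
As an illustration only, here is a minimal Python sketch of a scraper that respects robots.txt, identifies itself with a User-Agent string, and rate-limits its requests. The forum URL and contact address are hypothetical placeholders, and the third-party requests library is assumed to be installed:

PYTHON

import time
import urllib.robotparser

import requests

BASE_URL = "https://forum.example.org"  # hypothetical forum
USER_AGENT = "MyResearchScraper/0.1 (contact: researcher@example.org)"

# Check the site's crawling rules before fetching anything
robots = urllib.robotparser.RobotFileParser()
robots.set_url(BASE_URL + "/robots.txt")
robots.read()

def fetch(path):
    """Fetch one page, respecting robots.txt and rate limits."""
    url = BASE_URL + path
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site disallows crawling this page
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    time.sleep(1)  # rate limit: at most one request per second
    return response.text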

Environmental impact of research software

  • Energy consumption: research software often relies on large-scale data centers for storage and computation, or the use of HPC systems and supercomputers for complex simulations and data analysis, which can significantly increase energy usage (much of which may come from non-renewable sources).
  • Carbon footprint: the electricity used to power data centers and HPC systems contributes to CO2 emissions, particularly if the energy comes from fossil fuels.

Using HPC systems in a more environmentally responsible manner

Often our research software is written for use on HPC systems. Discuss as a group, and write down in the shared document, ways we can improve our practices when writing, testing and running code on HPC systems to support more responsible and less wasteful use of resources.

Here are some things to consider:

  • To make the best use of resources, part of our workflow should include identifying the most appropriate job configuration, e.g. job scaling, or using workflow management tools like Snakemake to ensure that we do not repeat expensive steps unless some upstream step in our workflow has changed.
  • Avoid running untested code on an HPC system - test and debug locally first.
  • Run a test job on one case study or a small data set before running on the full data (imagine submitting hundreds of jobs and finding they have all crashed after a few hours).
  • Avoid including an expensive data processing step in every job when it could be run once and set as a dependency for other jobs.
  • Avoid using many nodes for a job that is not designed to run in parallel - consider refactoring your code instead.
  • Optimise your code and use more efficient algorithms - profiling can help you find the hot spots worth optimising, as in the sketch below.
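
As one way into the last point, Python's built-in cProfile module can show where a program actually spends its time before you scale it up on an HPC system. A minimal sketch (the analyse function here is a stand-in for your own entry point):

PYTHON

import cProfile
import pstats

def analyse():
    # Stand-in for your real analysis entry point
    return sum(i * i for i in range(10_000_000))

# Profile the call, save the results, and print the ten most expensive functions
cProfile.run("analyse()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)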

Technical topics:

  • Work-flow management
  • Profiling jobs on HPC
  • Profiling Python code

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • To act ethically, we have to be both responsible producers and consumers of research software.
  • Free and open source research software and data, along with FAIR, ethical and environmental considerations, offer viable ways to make more transparent, reliable and accountable science that not only advances scientific knowledge but also respects and protects the rights and well-being of all users.

Content from Wrap-up


Last updated on 2024-07-04 | Edit this page

Overview

Questions

  • What are the wider Research Software Development Principles and where does FAIR fit into them?

Objectives

  • Reflect on the Research Software Development Principles and their relevance to own research.

In this course we have explored the significance of reproducible research and how following the FAIR principles in our own work can help us and others do better research. Reproducible research often requires that researchers implement new practices and learn new tools - in this course we taught you some of these as a starting point but you will discover what works best for yourself, your group, community and domain. Some of these practices may take a while to implement and may require perseverance, others you can start practicing today.

An image of a Chinese proverb "The best time to plant a tree was 20 years ago. The second best time is now" by CCNULL, used under a CC-BY 2.0 licence

Research software development principles


Software, and the people who develop it, have significance within the research environment and a broader impact on society and the planet. The FAIR research software principles cover some aspects of this and operate within the wider Research Software Development Principles recommended by the Software Sustainability Institute's Director Neil Chue Hong during his keynote at RSECon23. These principles can help us explore and reflect on our own work and guide us on a path to better research.

Helping your team

Helping your team, image from RSECon2024, used under CC BY 4.0

Helping your peers

Helping your peers, image from RSECon2024, used under CC BY 4.0

Helping the world

Helping the world, image from RSECon2024, used under CC BY 4.0

Further reading


Please check out the following resources for some additional reading on the topic of this course and the full reference set.

Key Points

  • When developing software for your research, think about how it will help you and your team, your peers and domain/community and the world.