Software management & collaboration

Last updated on 2025-12-03 | Edit this page

Estimated time: 90 minutes

Overview

Questions

How do can we create unique identifiers for our code so others can cite it?
How do we archive our code for future preservation?
How do we track issues with code in GitHub?
How can we ensure that multiple developers can work on the same files simultaneously?

Objectives

After completing this episode, participants should be able to:

Create a digital object identifier (DOI) for a software project and archive it on Zenodo.
Track the work needed to be done on code (such as bugs and new features) using issues in GitHub.
Use Git branches to work on code in parallel with other team members.
Merge code changes from separate development branches into the main branch using pull requests.

Sharing code openly promotes collaboration, transparency, and innovation by allowing others to review, use, and improve the code. It fosters knowledge exchange, accelerates scientific progress, and enhances the reproducibility of research. Additionally, open sharing encourages community contributions and can lead to better-maintained, more reliable software.

Adding a license and other metadata to our code (covered in the previous episode) are the first steps towards sharing the code publicly. There are several other important steps to consider which we will cover here.

Code state

At this point, the code in your local software project’s directory should be as in: https://github.com/carpentries-incubator/bbrs-software-project/tree/08-open-collaboration

Activate your virtual environment

If it is not already active, make sure to activate your virtual environment from the root of the software project directory:

BASH

$ source venv_spacewalks/bin/activate # Mac or Linux
$ source venv_spacewalks/Scripts/activate # Windows
(venv_spacewalks) $

Making the code public

By default repositories created on GitHub are private and only their creator can see them. This prevents sensitive information from being unintentionally made public. In our case - we added an open source license to our repository and want to make sure people can actually access and use it.

To make your repository public (if it is not public already):

Go to your repository on GitHub and click on the Settings link near the top right corner.
Scroll down to the bottom of the page and the “Danger Zone” settings.
Click on “Change Visibility” and you should see a message saying “Change to public”. If it says “Change to private” then the repository is already public.
You will then be asked to confirm that you indeed want to make the repository public and agree to the warning that the code will now be publicly visible. As a security measure, you will then have to put in your GitHub password.

Transferring to an organisation

At the moment, our repository sits under an individual GitHub user account. This is fine for smaller projects where one person is the main contributor. However, larger or more complex projects are typically hosted under a dedicated GitHub organisation. A GitHub organisation can group multiple repositories, manage contributor access through GitHub teams, and keep all project-related work in one place. You can create a GitHub organisation to centrally manage all repositories for a single organisation or a multi-institution project.”

If we are a member of a GitHub organisation and have the appropriate permissions then we can transfer a repository from our personal namespace to the organisation’s. This can be done with another option in the “Danger Zone” settings - the “Transfer ownership” button. Pressing this will then prompt us as to which organisation we want to transfer the repository to (if we have access to multiple GitHub organisations).

Getting an identifier and archiving code

Putting code on GitHub or GitLab (or any similar code hosting service) is good practice for code sharing, versioning and even code packaging, it is not enough for long-term software archiving. This is because these are commercial services - if they change their policies, remove repositories (e.g. for inactivity, or security reasons), or even shut down (which has happened to code sharing platforms in the past), your code could disappear. Archival means long-term preservation independent of any one platform.

Several archival solutions for research software are emerging. Zenodo is a data archive run by CERN which allows anyone to upload data up to 50GB for free and receive a Digital Object Identifier (DOI). Zenodo’s definition of “data” is quite broad and can include code - so it also supports DOI-backed software archiving linked to publications, ensuring persistent citation and access.

DOIs provide globally unique, citable references for a data or a software artefact (both software as a whole and its different releases) that integrates with various academic and research systems. Archival of and unique identifiers for software are important as they aid reproducibility, proper citation, and long-term accessibility of software.

Let us now look into how we can archive a GitHub repository to Zenodo and get a DOI for our software project at the same time. Note that, instead of using the real Zenodo website, we will practice using Zenodo Sandbox.

Callout

Zenodo Sandbox

Zenodo Sandbox is a testing environment for Zenodo, a repository for research outputs, allowing users to safely experiment with its features without affecting the live system. It is a clone of Zenodo, created for testing purposes, that works exactly the same way as Zenodo you can use it for learning, training, experimenting, and preparing uploads without impacting the primary Zenodo repository until you are ready to publish and release your code (or other research outputs) officially. It will also not create real DOIs for a number of test repositories we use for this course and saturate the DOI space (remember that a DOI, once created, is meant to exist forever).

Optional challenge: archive our GitHub repository to Zenodo (5 min)

You can choose to do the following as an exercise or by live coding based on what you think the learners would prefer. You can copy the detailed instructions below to give them it as an exercise.

We can archive our GitHub repository to Zenodo (Sandbox) by doing the following:

Go to the Zenodo Sandbox login page and choose to login with GitHub.
Authorise Zenodo Sandbox to connect to GitHub.
Go to the GitHub page in your Zenodo Sandbox account. This can be found in the pull down menu with your user name in the top right corner of the screen.
You will now have a list of all of your GitHub repositories. Next to each will be an “On” button. If you have created a new repository you might need to press the “Sync” button to update the list of repositories Zenodo Sandbox knows about.
Press the “On” button for the repository you want to archive. If this was successful you will be told to refresh the page.
The repository should now appear in the list of “Enabled” repositories at the top of the screen, but it does not yet have a DOI. To get one we have to make a “release” on GitHub. To do so, click on the repository and then press the green button to create a release. This will take you to GitHub’s release page for your software repository where you will be asked to give a title and description of the release. You will also have to create a “tag” for your release - a way of having a friendly name for the version of some code in Git instead of using a long hash code. Often we will create a sequential version number for each release of the software and have the tag name match this, for example v1.0 or just 1.0.
If we now refresh the Zenodo Sandbox page for this repository we will see that it has been assigned a DOI.

The DOI does not just link to GitHub, Zenodo will have taken a copy (a snapshot) of our repository at the point where we created the release. This means that even if we delete it from GitHub or even if GitHub were ever to go away or remove it, there will still be a copy of our software on Zenodo. Zenodo will allow people to download the entire repository (more accurately, its state at the time it was tagged for release) as a single zip file.

Zenodo will have actually created two DOIs for you. One represents the identifier for your software as a whole and will always link to the latest (most recent) release if you make more releases. The other is specific to the release you made and will always point to that version. We can see both of these by clicking on the DOI link in the Zenodo page for the repository.

One of the things which is displayed on this page is a badge image that you can copy the link for and add to the README file in your GitHub repository so that people can find the Zenodo version of the repository. If you click on the DOI image in the “Details” section of the Zenodo page then you will be shown instructions for obtaining a link to the DOI badge in various formats including Markdown. Here is the badge for this repository and the corresponding Markdown:

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11869450.svg)](https://doi.org/10.5281/zenodo.11869450)

Problems with GitHub and Zenodo integration

The integration between GitHub and Zenodo does not interact well with some browsers’ privacy features and extensions. Firefox can be particularly problematic with this and might open new tabs to login to GitHub and then give an error saying: Your browser did something unexpected. Please try again. If the error continues, try disabling all browser extensions. If this happens try disabling the extra privacy features/extensions or using another browser such as Chrome.

Adding a DOI to the citation file

Now that we have our DOI it is good practice to include this information in our software’s citation file. Earlier we created a CITATION.cff file with information about how to cite our software. A DOI can be added to the CITATION.cff file using the dedicated doi key or by including it as an entry in the identifiers: array. Both methods are valid according to the CFF specification. We can also add the version field to CITATION.cff, which corresponds to the version number of the software related to the DOI we are adding.

Optional challenge: add a DOI and software version to the citation file (5 min)

You can choose to do the following as an exercise or by live coding based on what you think the learners would prefer. You can copy the the detailed instructions below to give it as an exercise.

We can add a DOI to the file by creating an identifiers section that includes the variables of doi and value of the Zenodo URL as a bulleted entry. Optionally, we can also add a date-released field indicating the date we released this software as a new variable. Here is an updated version of our CITATION.cff from the previous episode with a version number, DOI and release date added.

YAML

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Spacewalks
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jaffa
    name-particle: Sarah
  - given-names: Aleksandra
    family-names: Nenadic
  - given-names: Kamilla
    family-names: Kopec-Harding
repository-code: >-
  https://github.com/YOUR-REPOSITORY-URL/spacewalks.git
abstract: >-
  A Python script to analyse NASA extravehicular activity
  data
keywords:
  - NASA
  - Extravehicular activity
version: 1.0.1
identifiers:
  - type: doi
    value: 10.5281/zenodo.1234
date-released: 2024-06-01

Callout

Going further with publishing code

We now have our code published online, licensed as open source, archived with Zenodo, accessible via a DOI and with a citation file to encourage people to cite it. What else might we want to do in order to improve how findable, accessible or reusable it is? One further step we could take is to publish the code with a peer reviewed journal. Some traditional journals will accept software submissions, although these are usually as a supplementary material for a paper. There also journals which specialise in research software such as the Journal of Open Research Software, The Journal of Open Source Software or SoftwareX. With these venues, the submission will be the software itself and not a paper, although a short abstract or description of the software is often required.

Working with collaborators

The strength of online collaboration platforms such as GitHub does not just lie in the ability to share code. They also allow us to track problems with that code, for multiple developers to work on it independently and bring their changes together and to review those changes before they are accepted.

Tracking issues with code

As mentioned before in episode on version control systems, a key feature of GitHub (as opposed to Git itself) is the issue tracker.

GitHub issues is a place for keeping track of any problems or bugs in the code, feature requests, and lists of future work. They provide a record of all the problems with the code, and improvements that could be made, along with solutions and discussions around them. This helps the team to keep track of what they are working on now or need to work on later, and reduces the chance of receiving redundant reports of issues you already know about. Sometimes advanced users will also use issue trackers of public projects to report problems they are having (and sometimes this is understandably misused by users seeking help using documented features of the program).

To practice making an issue, we will file a “feature request”, where we describe new functionality that may improve the codebase. Let’s say we want to implement additional functionality in our code to create a summary table for the total/cumulative EVA/spacewalk time for each astronaut.

Before we start working on the implementation - we will create a new issue in our GitHub repository for this feature request track the work on it. Recall that we can find the issue tracker on the “Issues” tab in the top left of the GitHub home page for the repository. Click on this and then click on the green “New Issue” button on the right hand side of the screen. We can then enter a title and description of our issue.

An issue should include a short title, e.g. the key features of the new code, and a more detailed description of the feature, e.g.:

Title: Add a summary table of total EVA time by astronaut
Description: A summary table of total EVA time split by astronaut would be helpful for individual level analysis.

After the issue is created it will be assigned a sequential ID number - because we have already created one issue on our repository, for this issue it should be 02.

Discussing an issue

Once the issue is created, further discussion can take place with additional comments. These can include code snippets and file attachments such as screenshots or logfiles. We can also reference other issues (by writing a # symbol and the number of the other issue) or mention other collaborators (by writing @ followed by their GitHub username) to draw their attention to our comment. This is sometimes used to identify related issues or if an issue is a duplicate.

Working in parallel with Git branches

Next, we will learn how to suggest this change back to the repository. So far, we’ve been working on our own making changes and directly committing to the main branch. When we start to have collaborators, we may want to adopt a development workflow such as GitHub flow (aka feature branch workflow) that facilitates simultaneous editing on the project and quality control on the changes made.

Branching is a feature of Git that allows two or more parallel streams of work. Commits can be made to one branch without interfering with another. Branches are commonly used as a way for one developer to work on a new feature or a bug fix while other developers work on other features. When those new features or bug fixes are complete, the branch will be merged back with the main branch.

Creating a new branch

New Git branches are created with the git branch command. This should be followed by the name of the branch to create. It is common practice when the bug we are fixing or feature we are adding has a corresponding issue to name the branch after the issue number and name. For example, we might call the branch 02-sum-by-astro-feat instead of something less descriptive like feature-request or bugfix.

For example, the command:

BASH

(venv_spacewalks) $ git branch 02-sum-by-astro-feat

will create a new branch called 02-sum-by-astro-feat. We can view the names of all the branches by running git branch with no parameters. By default there should be one branch called main (formerly master) and our new 02-sum-by-astro-feat branch. The command will put * next to the currently active branch.

BASH

(venv_spacewalks) $ git branch

OUTPUT

  02-sum-by-astro-feat
* main

We can see that creating a new branch has not switched our working branch to that branch. To switch branches we can either use the git switch or git checkout command followed by the branch name. For example:

BASH

(venv_spacewalks) $ git switch 02-sum-by-astro-feat

To create a branch and change to it in a single command we can use git switch with the -c option (or git checkout with the -b option) - e.g. git switch -c switch 02-sum-by-astro-feat. Note that git switch command is only available in more recent versions of Git.

Committing to a branch

Once we have switched to a branch any further commits that are made will go to that branch. When we run a git commit command we will see the name of the branch we are committing to in the output of git commit. Let’s add the following function to our code to implement the requested feature.

Copy and paste this function to eva_data_analysis.py:

PYTHON

def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut and saves resulting table to a CSV file

    Args: 
        df (pd.DataFrame): Input dataframe to be summarised

    
    Returns:
        sum_by_astro (pd.DataFrame): Data frame with a row for each astronaut and a summarised column 
    """
    print(f'Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']] # subset to work with only relevant columns
    subset = add_duration_hours(subset) # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1) # dropping the extra 'duration' column as it contains string values not suitable for calulations
    subset = subset.groupby('crew').sum() 
    return subset

Then add the following line after the graph_file variable is created:

PYTHON

duration_by_astronaut_output_file = 'results/duration_by_astronaut.csv'

Next, change the signature of the main() function to main(input_file, output_file, duration_by_astronaut_output_file, graph_file) and also its invocation (in order to pass the new CSV file as an argument):

PYTHON

main(input_file, output_file, duration_by_astronaut_output_file, graph_file)

Finally, in the main() function, add the invocation of the new function (e.g. after converting and exporting original data to CSV file):

PYTHON

# Calculate summary table for total EVA per astronaut
duration_by_astronaut_df = summary_duration_by_astronaut(eva_data)
# Save summary duration data by each astronaut to CSV file
write_dataframe_to_csv(duration_by_astronaut_df, duration_by_astronaut_output_file)

Note how we reused our old write_dataframe_to_csv function to write the new data to a different CSV file and did not have to repeat that code again.

Finally, our whole script eva_data_analysis.py may look like:

PYTHON

import matplotlib.pyplot as plt
import pandas as pd
import sys
import re


def main(input_file, output_file, duration_by_astronaut_output_file, graph_file):
    print("--START--")

    # Read the data from JSON file
    eva_data = read_json_to_dataframe(input_file)

    # Convert and export data to CSV file
    write_dataframe_to_csv(eva_data, output_file)

    # Calculate summary table for total EVA per astronaut
    duration_by_astronaut_df = summary_duration_by_astronaut(eva_data)
    # Save summary duration data by each astronaut to CSV file
    write_dataframe_to_csv(duration_by_astronaut_df, duration_by_astronaut_output_file)

    # Sort dataframe by date ready to be plotted (date values are on x-axis)
    eva_data.sort_values('date', inplace=True)

    # Plot cumulative time spent in space over years
    plot_cumulative_time_in_space(eva_data, graph_file)

    print("--END--")


def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any rows where the 'duration' value is missing.

    Args:
        input_file (file or str): The file object or path to the JSON file.

    Returns:
         eva_df (pd.DataFrame): The cleaned data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    # Read the data from a JSON file into a Pandas dataframe
    eva_df = pd.read_json(input_file, convert_dates=['date'], encoding='ascii')
    eva_df['eva'] = eva_df['eva'].astype(float)
    # Clean the data by removing any rows where duration is missing
    eva_df.dropna(axis=0, subset=['duration', 'date'], inplace=True)
    return eva_df


def write_dataframe_to_csv(df, output_file):
    """
    Write the dataframe to a CSV file.

    Args:
        df (pd.DataFrame): The input dataframe.
        output_file (file or str): The file object or path to the output CSV file.

    Returns:
        None
    """
    print(f'Saving to CSV file {output_file}')
    # Save dataframe to CSV file for later analysis
    df.to_csv(output_file, index=False, encoding='utf-8')


def plot_cumulative_time_in_space(df, graph_file):
    """
    Plot the cumulative time spent in space over years.

    Convert the duration column from strings to number of hours
    Calculate cumulative sum of durations
    Generate a plot of cumulative time spent in space over years and
    save it to the specified location

    Args:
        df (pd.DataFrame): The input dataframe.
        graph_file (file or str): The file object or path to the output graph file.

    Returns:
        None
    """
    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
    df = add_duration_hours(df)
    df['cumulative_time'] = df['duration_hours'].cumsum()
    plt.plot(df['date'], df['cumulative_time'], 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(graph_file)
    plt.show()


def text_to_duration(duration):
    """
    Convert a text format duration "HH:MM" to duration in hours

    Args:
        duration (str): The text format duration

    Returns:
        duration_hours (float): The duration in hours
    """
    hours, minutes = duration.split(":")
    duration_hours = int(hours) + int(minutes)/60
    return duration_hours


def add_duration_hours(df):
    """
    Add duration in hours (duration_hours) variable to the dataset

    Args:
        df (pd.DataFrame): The input dataframe.

    Returns:
        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
    """
    df_copy = df.copy()
    df_copy["duration_hours"] = df_copy["duration"].apply(
        text_to_duration
    )
    return df_copy


def calculate_crew_size(crew):
    """
    Calculate the size of the crew for a single crew entry

    Args:
        crew (str): The text entry in the crew column containing a list of crew member names

    Returns:
        (int): The crew size
    """
    if crew.split() == []:
        return None
    else:
        return len(re.split(r';', crew))-1

def add_crew_size_column(df):
    """
    Add crew_size column to the dataset containing the value of the crew size

    Args:
        df (pd.DataFrame): The input data frame.

    Returns:
        df_copy (pd.DataFrame): A copy of the dataframe df with the new crew_size variable added
    """
    print('Adding crew size variable (crew_size) to dataset')
    df_copy = df.copy()
    df_copy["crew_size"] = df_copy["crew"].apply(
        calculate_crew_size
    )
    return df_copy


def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut and saves resulting table to a CSV file

    Args: 
        df (pd.DataFrame): Input dataframe to be summarised

    
    Returns:
        sum_by_astro (pd.DataFrame): Data frame with a row for each astronaut and a summarised column 
    """
    print(f'Calculating summary of total EVA time by astronaut')
    subset = df.loc[:,['crew', 'duration']] # subset to work with only relevant columns
    subset = add_duration_hours(subset) # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1) # dropping the extra 'duration' column as it contains string values not suitable for calulations
    subset = subset.groupby('crew').sum() 
    return subset


if __name__ == "__main__":

    if len(sys.argv) < 3:
        input_file = 'data/eva-data.json'
        output_file = 'results/eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        input_file = sys.argv[1]
        output_file = sys.argv[2]
        print('Using custom input and output filenames')

    graph_file = 'results/cumulative_eva_graph.png'
    duration_by_astronaut_output_file = 'results/duration_by_astronaut.csv'

    main(input_file, output_file, duration_by_astronaut_output_file, graph_file)

Now, let’s test run our script, to make sure we do not get any errors.

BASH

(venv_spacewalks) $ python3 eva_data_analysis.py

If we do:

BASH

(venv_spacewalks) $ ls results

we should see the new result CSV file with EVA durations summary listed.

Let’s add and commit the new version of the code to our 02-sum-by-astro-feat branch. First we will check that we are on the right branch using either git branch or git status.

BASH

(venv_spacewalks) $ git status

Closing an issue

Once an issue is solved then it can be closed. This can be done either by pressing the “Close” button in the GitHub web interface (as we did before) or by making a commit which includes the word “fixes”, “fixed”, “close”, “closed” or “closes” followed by a # symbol and the issue number in the commit message.

BASH

(venv_spacewalks) $ git branch
(venv_spacewalks) $ git add eva_data_analysis.py
(venv_spacewalks) $ git commit -m "Add duration by astronaut functionality. Fixes #2."

In the output of git commit -m command - the first part of the output line will show the name of the branch we just made the commit to.

OUTPUT

[02-sum-by-astro-feat 330a2b1] Add duration by astronaut functionality. Fixes #2.

If we now switch back to the main branch our new commit will no longer be there in the source file or the output of git log.

BASH

(venv_spacewalks) $ git switch main

And if we go back to the 02-sum-by-astro-feat branch it will re-appear.

BASH

(venv_spacewalks) $ git switch 02-sum-by-astro-feat

If we want to push our changes to a remote such as GitHub we have to tell the git push command which branch to push to. If the branch doesn’t exist on the remote (as it currently won’t) then it will be created. We need to use the -u switch to the git push command to tell GitHub to create the enw remote branch for us (which will be also linked to our local branch). The -u switch is only needed the first time we push to a new remote branch that is being created - next time we just do a normal git push.

BASH

(venv_spacewalks) $ git push -u origin 02-sum-by-astro-feat

If we now refresh the GitHub home page for our repository we should see the 02-sum-by-astro-feat branch appear in the list of branches.

How to pull changes from a remote branch

If we needed to pull changes from a branch on a remote (for example if we have made changes on another computer or via GitHub’s web based editor), then we can specify a branch on a git pull command.

BASH

git pull origin 02-sum-by-astro-feat

Merging branches

If we are working alone, when we have completed working on a branch (for example adding a feature or fixing a bug) then we can merge our branch back into the main one (or any other branch). This is done with the git merge command.

This must be run on the TARGET branch of the merge, so we will have to use a git switch command to set this.

BASH

(venv_spacewalks) $ git switch main

Now we are back on the main branch we can go ahead and merge the changes from the bugfix branch:

BASH

(venv_spacewalks) $ git merge 02-sum-by-astro-feat

Pull requests

On larger projects, with collaborators, we might include a code review process before changes are merged. A quality control step like this is especially important if the main branch of the project is being released as the public version of the software. GitHub has a process for this that is called a Pull Request. Other Git services such as GitLab have different names for this; GitLab calls them merge requests. Pull requests are situations where one developer requests that another merges code from a branch (or “pull” it from another copy of the repository). The person receiving the request then has the chance to review the code, write comments suggesting changes or even change the code themselves before merging it. It is also very common for automated checks of code to be run on a pull request to ensure the code is of good quality and is passing automated tests.

As a simple example of a pull request, we can now create a pull request for the changes we made on our 02-sum-by-astro-feat branch and pushed to GitHub earlier on. The GitHub webpage for our repository will now be saying something like “02-sum-by-astro-feat had recent pushes n minutes ago - Compare & Pull request”. Click on this button and create a new pull request.

Give the pull request a title and write a brief description of it, then click the green “Create pull request” button. GitHub will then check if we can merge this pull request without any problems.

There should be a green “Merge pull request” button, but if we click on the down arrow inside this button there are three options on how to handle this request:

Create a merge commit
Squash and merge
Rebase and merge

The default is option 1, which will keep all of the commits made on our branch intact. This can be useful for seeing the whole history of our work, but if we’ve done a lot of minor edits or attempts at creating the feature it can be excessive to have all of this history saved. This is where the second option comes in, this will place all of our changes from the branch into just a single commit, this might be much more obvious to other developers who will now see our feature addition as a single commit in the history. The third option merges the branch histories together in a different way that doesn’t make merges as obvious, this can make the history easier to read but effectively rewrites the commit history and will change the commit hash IDs. Some projects that you contribute to might have their own rules about what kind of merge they will prefer. For the purposes of this exercise we’ll stick with the default merge commit.

Go ahead and click on “Merge pull request”, then “Confirm merge”. The changes will now be merged together. GitHub gives us the option to delete the branch we were working on, since its history is preserved in the main branch there is no reason to keep it.

Callout

Using forks instead of branches

A fork is similar to a branch, but instead of it being part of the same repository it is a entirely new copy of the repository. Forks are commonly used by Github users who wish to work on a project that they are not a member of. Typically forking will copy the repository to our own namespace (e.g. github.com/username/reponame instead of github.com/projectname/reponame).

To create a fork on github use the “Fork” button to the right of the repository name. After we create our fork we can make some changes and these could even be on the main branch inside our forked repository. GitHub will track that a fork has been made displays a “Contribute” button to create a pull request back to the original repository. Using this we can request that the changes on our fork are incorporated by the upstream project.

Challenge

Practice with Issues and PRs (20 min)

We have a bug in our code! If we look at the results in results/duration_by_astronaut.csv, the crew column has groups of crew and we wanted to calculate this per astronaut.

Create an issue in GitHub to report this bug. A good issue description for a bug should include:

What the problem is, including any error messages that are displayed.
What version of the software it occurred with.
Any relevant information about the system running it, for example the operating system being used.
Versions of any dependent libraries.
How to reproduce it.
(Optionally) description of the expected behaviour, e.g. if there is not an error message but the user thinks the result or behaviour is not correct.

We might also reference the previous issue in the description, to provide even more context (e.g. “related to #N” where N is the number of the feature request issue).

Create a pull request fix the code. You can try to create the code yourself or copy the test code below.
- Hint: Do not forget to make a new branch from the main branch, not your 02-sum-by-astro-feat branch.
(Optionally) Have a partner review your pull request.
Merge your pull request
Switch your local computer back to the main branch and pull the latest changes from the remote/origin main branch.
(Bonus/optional) Delete your merged branches from your local computer and in GitHub.

Updated function to copy-paste

PYTHON

def summary_duration_by_astronaut(df):
    """
    Summarise the duration data by each astronaut and saves resulting table to a CSV file

    Args: 
        df (pd.DataFrame): Input dataframe to be summarised

    
    Returns:
        sum_by_astro (pd.DataFrame): Data frame with a row for each astronaut and a summarised column 
    """
    subset = df.loc[:,['crew', 'duration']] # subset to work with only relevant columns
    subset.crew = subset.crew.str.split(';').apply(lambda x: [i for i in x if i.strip()]) # splitting the crew into individuals and removing blank string splits from ending ;
    subset = subset.explode('crew') # separating lists of crew into individual rows
    subset = add_duration_hours(subset) # need duration_hours for easier calcs
    subset = subset.drop('duration', axis=1) # dropping the extra 'duration' column as it contains string values not suitable for calulations
    subset = subset.groupby('crew').sum() 
    return subset

Callout

Reopening an Issue

In practice, a report of incorrect implementation of a new feature might be made as a comment on the original feature request issue rather than via a new issue. The project maintainers might then reopen that issue until the correction is made. For the purposes of this workshop, we want you to get more practice by opening a separate issue.

Summary

This episode emphasises the importance of collaboration in creating sustainable research software. It covers practices like using open version control repositories (e.g., GitHub), inviting contributions through issue templates, PR and code review processes, and establishing clear governance, licensing, and codes of conduct. The episode also highlights the ethos of open collaboration so that anyone can contribute and benefit from the project, reinforcing transparency, inclusivity, and community-driven development.

Code state

At this point, the code in your local software project’s directory should be as in: https://github.com/carpentries-incubator/bbrs-software-project/tree/final

Software management & collaboration

Overview

Questions

Objectives

Code state

Activate your virtual environment

BASH

Sharing code to encourage collaboration

Making the code public

Transferring to an organisation

Getting an identifier and archiving code

Zenodo Sandbox

Optional challenge: archive our GitHub repository to Zenodo (5 min)

Problems with GitHub and Zenodo integration

Adding a DOI to the citation file

Optional challenge: add a DOI and software version to the citation file (5 min)

YAML

Going further with publishing code

Working with collaborators

Tracking issues with code

Discussing an issue

Working in parallel with Git branches

Creating a new branch

BASH

BASH

OUTPUT

BASH

Committing to a branch

PYTHON

PYTHON

PYTHON

PYTHON

PYTHON

BASH

BASH

BASH

Closing an issue

BASH

OUTPUT

BASH

BASH

BASH

How to pull changes from a remote branch

BASH

Merging branches

BASH

BASH

Pull requests

Using forks instead of branches

Practice with Issues and PRs (20 min)

Updated function to copy-paste

PYTHON

Reopening an Issue

Summary

Code state

Further reading