This lesson is still being designed and assembled (Pre-Alpha version)

Version control with git

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • What is version control? How do I use it?

  • What is the difference between git and GitHub?

  • What benefits does a version control system bring to my research?

Objectives
  • Understand the benefits of using a version control system such as git.

  • Be able to decipher git jargon: repository, commit, push, pull, branches etc.

  • Understand the basics of git and its usage in RStudio.

1. Introduction

In this episode, you will learn about the git version control system and how to use it in your R project from RStudio.

We will see how to trace edits and modifications made to your R Markdown document. Also, we will demonstrate how you can revert changes if needed or experiment safely with changes on your valuable code.

1.1 What is a version control system and why should scientists use it?

In the context of a research project, a version control system helps you manage your project history and progress. It also supports active collaboration, both with your colleagues and with yourself (your past, present and future self).

As a concrete example, this is something we might have all experienced in the past when keeping track of file versions:

Version control is an essential tool in data analysis

Version control will help you to avoid this file nightmare but also fosters other good practices related to code.

1.2 Five reasons to use a version control system in research

  1. Tell the story: The history of your commit messages will describe your project progress.
  2. Travel back in time: a version control system makes it easy to compare different time points of your project. If you want to compare the current state of your project with its state a year ago, it only takes a single command.
  3. Experiment with changes: if you want to make changes in a script, you can first make a “snapshot” of the project status before experimenting with changes. As a researcher, this might be second nature for you!
  4. Back up your work: by linking your local repository (folder) to a remote online host, a version control system lets you back up your precious work every time you synchronise it.
  5. Collaborate easily on projects: having a web-hosted synchronised version of your project will encourage collaboration with other researchers. Think about a colleague of yours being able to add a script to make a figure for your first PhD publication for instance.

There are possibly other important reasons why you could use a version control system for your research project. While originally created for software development, a common usage in scientific research is to track versions of datasets, scripts or figures easily and efficiently.

git logo

One of the most widely used version control systems is git, which is primarily used in software development. It is a cross-platform tool that is available natively on macOS and Linux and that needs to be installed on Windows (check the Setup section on how to do this).

Definition

Defined simply: git is an application that runs on your computer like a web browser or a word processor (Tom Stuart).


1.4 Collaborating with yourself with git

Using your recently acquired flashy R skills, you are now ready to apply them to your scientific project. You start by creating an R Markdown document, add code and text comments, generate an HTML report, save your R Markdown document, etc.

But how do you make sure that your changes are properly saved and tracked? What is your backup strategy? This is where git will come in handy.

2. Tell the story of your project

Compare two solutions below, one without git and one with:

timeline of files

Discussion

Can you list the potential and proven drawbacks of keeping track of changes by saving copies of the files?

In the follow-up section, we will see how to tell a story about the changes applied to our R Markdown document. This storyline will be composed of the git commit messages.

Let’s see how we can use git’s powerful file versioning from within RStudio.

2.1 Create a new RStudio project

Projects in RStudio are a great feature and work very well in combination with git.

Go to RStudio and click on File > New Project > New directory.

New project

Then select New project

New project type

We will call our new project “learning_git”.

New project type

2.2 Create a new R Markdown document

Go to File > New File > R Markdown and call it “learning git”. Click “OK”. It should open this new R Markdown document.

Below the ## R Markdown heading, add a new code chunk and copy this code into it:

library("tidyverse")

Save your document under the name learning_git.Rmd. You should see this in your File pane:

A learning_git.Rmd

2.3 Initialize git from within the folder

Great, but git is still unaware of things that happen in this R project folder. Let’s change that.

In the console pane, click on “Terminal” to get access to a Shell from within RStudio. We will initialise git in this folder.

Shell from within RStudio

This is a regular Shell in which you can type any command-line instruction. Let’s type this:

git init

This command created a hidden folder called .git/ that will contain all information needed by git to recapitulate your file changes, project history, etc.

Try typing this:

ls -l .git/

This will show you what happened under the hood:

(base) marcs-MacBook-Pro:learning_git mgalland$ ls -l .git/
total 24
-rw-r--r--   1 mgalland  staff   23 Jun 17 17:45 HEAD
-rw-r--r--   1 mgalland  staff  137 Jun 17 17:45 config
-rw-r--r--   1 mgalland  staff   73 Jun 17 17:45 description
drwxr-xr-x  14 mgalland  staff  448 Jun 17 17:45 hooks
drwxr-xr-x   3 mgalland  staff   96 Jun 17 17:45 info
drwxr-xr-x   4 mgalland  staff  128 Jun 17 17:45 objects
drwxr-xr-x   4 mgalland  staff  128 Jun 17 17:45 refs
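
To double-check from the Terminal that a folder is now a git repository, a minimal sketch could look like the one below. It works in a throwaway folder created with mktemp -d (an assumption made here so that your real project stays untouched); in your own project folder only the git commands are needed.

```shell
# Work in a scratch folder so that nothing in your real project changes
demo=$(mktemp -d) && cd "$demo"

git init --quiet                      # same as "git init", without the printed message
git rev-parse --is-inside-work-tree   # prints "true" inside a git repository
git status                            # shows the current branch and that nothing is committed yet
```

Type cd - afterwards to return to the folder you were in.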

2.4 Track file changes with git

Close and restart RStudio to show the “git” tab in the environment pane. You should see this:

git tab

For now, git does not track anything in this RStudio project folder.

We would like git to track changes in our learning_git.Rmd document. To do this, tick the empty checkbox in the “Staged” column next to the file name:

git add R Markdown document

You can see that there is now a small green “A” next to the learning_git.Rmd file under the “Status” column. This means that the file has been added (staged) and will be tracked by git from the next commit onwards.
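
For reference, ticking the checkbox corresponds to git add in the Terminal. The sketch below shows this in a throwaway folder (created with mktemp -d, an assumption so that your project is not affected), with a one-line stand-in for the R Markdown document.

```shell
# Scratch repository with a stand-in for learning_git.Rmd
demo=$(mktemp -d) && cd "$demo"
git init --quiet
echo 'library("tidyverse")' > learning_git.Rmd

git status --short          # "?? learning_git.Rmd": git sees the file but does not track it
git add learning_git.Rmd    # terminal equivalent of ticking the checkbox
git status --short          # "A  learning_git.Rmd": the file is now staged as added
```

The “A” printed by git status --short is the same status letter that RStudio shows in green.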

2.5 Making changes and visualising them

We will first:

  1. Import the gapminder dataset.
  2. Make a plot of the GDP per capita over the years for Canada.
  3. Write a small comment about the plot.

These 3 steps will all have their own commit message. Let’s start.

In your Rmd document, create a new code chunk and add this:

gapminder <- readr::read_csv('https://raw.githubusercontent.com/carpentries-incubator/open-science-with-r/gh-pages/data/gapminder.csv')

Save your learning_git.Rmd document.

Modification of the Rmd document as seen in git pane

You see a small blue “M” button next to your learning_git.Rmd file. This stands for “Modified”. You can visualise the changes in your Rmd document by selecting “diff”:

diff button in git pane

This opens a new window where you can see that 3 lines were added (shown in green). These lines are the code chunk we’ve added where we read the gapminder dataset.

Show the modification of the Rmd document with diff

While we are in this “diff” view, we can write a small commit message to describe what happened to our document in a meaningful way.

In the “Commit message” write this little message:

Import the gapminder dataset 

The gapminder dataset is imported using an online url. 
It will be used to produce a plot of the GDP per year.

Now, click on commit. This will assign a unique identifier to your commit as git takes a snapshot of your learning_git.Rmd file.

first commit
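
The same diff and commit steps can also be done from the Terminal. Here is a minimal sketch in a throwaway folder: the file content is a simplified stand-in, and the name and e-mail address are placeholders that only apply to this scratch repository.

```shell
demo=$(mktemp -d) && cd "$demo"
git init --quiet
git config user.name "Your Name"          # placeholder identity, local to this repository
git config user.email "you@example.com"   # placeholder e-mail address

# Stand-in for the document: stage a first version, then append the import line
echo 'library("tidyverse")' > learning_git.Rmd
git add learning_git.Rmd
echo 'gapminder <- readr::read_csv("gapminder.csv")' >> learning_git.Rmd

git diff                     # the appended line is shown with a leading "+"
git add learning_git.Rmd     # stage the modification
git commit --quiet -m "Import the gapminder dataset" \
  -m "The gapminder dataset is imported using an online url."
git log --oneline            # one line per commit: short identifier and subject
```

The first -m gives the subject line and the second -m the body; git separates them with a blank line.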

Let’s continue our work, add the changes and create commit messages.

Exercise

  • Step 1: Add a scatterplot of the GDP per capita per year for Canada (use geom_point). Save your Rmd document.
  • Step 2: Stage the modifications by clicking the checkbox under “Staged”; you should see the blue “M” sign in the RStudio git pane.
  • Step 3: Click on “Diff” to open the new window where you should write a small commit message. Click on “Commit” when you’re done.
  • Step 4: Write a small conclusion about the plot in your Rmd document.
  • Step 5: Save, add/stage and commit your changes with a small message.

If all went well, you can click on “History” to preview the history of commits that you have already made:

history of commits

This gives you a history of your Rmd file and your project so far. These 3 commits are nicely placed on top of each other. Each of them has a unique SHA identifier to trace it back. We will see in the next section how to move back and forth in time using these SHA ids.

history of commits

2.6 Great commits tell a great story

A good commit message

  1. Separate subject from body with a blank line.
  2. Limit the subject line to 50 characters.
  3. Capitalize the subject line.
  4. Do not end the subject line with a period.
  5. Use the imperative mood in the subject line.
  6. Wrap the body at 72 characters.
  7. Use the body to explain what and why vs. how. The how should be self-explanatory from your code.

Here is an example of a good commit message:

Fix issue with dplyr filter function

Specifying dplyr::filter() explicitly avoids
conflicts with filter() functions from other
packages.
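
Some of the rules above can be checked mechanically. As a small illustration (not an official git tool), the sketch below tests the subject line of the example against rule 2 (length) and rule 4 (no trailing period).

```shell
subject="Fix issue with dplyr filter function"

# Rule 2: the subject line has at most 50 characters
if [ "${#subject}" -le 50 ]; then
  echo "length ok"
else
  echo "subject too long"
fi

# Rule 4: the subject line does not end with a period
case "$subject" in
  *.) echo "subject ends with a period" ;;
  *)  echo "no trailing period" ;;
esac
```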


3. Travel back in time

Back to the future poster

3.1 History of commits

If all went well in the previous exercise, you have 3 nicely self-explanatory commits like this:

history of commits

In this section we will see how to move back and forth between these commits safely. This can be useful to see what happened to a file or to revert to a previous commit (because you are not happy with the current version).

3.2 Back to the past

Imagine that you are not happy with your conclusion about the GDP per capita plot for Canada. Then, it would be useful to revert to a previous commit. In the history, we would like to revert to the previous commit with the message “Add GDP per capita plot”.

Go to the Terminal in the Console pane of RStudio and type:

git hist

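This prints the commit history of the repository you are working in. Note that git hist is not a built-in git command: it is an alias for a formatted git log, so it only works once the alias has been defined. If git reports that hist is not a git command, git log --oneline gives a similar overview.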

* 21830a4 2021-06-18 | Add a small comment on the GDP plot (HEAD -> master) [Marc Galland]
* 081d7cd 2021-06-18 | Add GDP per capita plot [Marc Galland]
* a5cc728 2021-06-18 | Import the gapminder dataset [Marc Galland]
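
The output above is consistent with a commonly used alias definition, sketched below. This exact definition is an assumption, since it is not shown in this episode. The sketch defines the alias in a throwaway repository only; the name and e-mail address are placeholders.

```shell
demo=$(mktemp -d) && cd "$demo"
git init --quiet
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"

# Assumed alias definition, stored in this scratch repository only
git config alias.hist "log --pretty=format:'%h %ad | %s%d [%an]' --graph --date=short"

echo 'library("tidyverse")' > learning_git.Rmd
git add learning_git.Rmd
git commit --quiet -m "Import the gapminder dataset"

git hist   # short identifier, date, subject, branch and author on one line
```

Adding --global to the git config alias.hist command would make the alias available in all your repositories.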

The commit id 21830a4 is the most recent one (also called the HEAD). The commit we would like to revert to has the commit identifier 081d7cd.

Important note

Your exact commit identifiers will be different. Use git hist to find the identifier of the commit you need, and make sure you use your own commit identifier, otherwise the command will not work.


In git, the command to do this is called git checkout. In your terminal in RStudio, type:

git checkout 081d7cd

git prints a fairly long message:

Note: switching to '081d7cd'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 081d7cd Add GDP per capita plot

This simply tells us that HEAD, which marks the commit currently checked out, now points at commit 081d7cd, where we added the GDP plot. Again, you will have a different commit identifier and that’s totally normal.

Check your learning_git.Rmd file. It should have changed and the conclusion about the plot is now gone.

work loss

Is your work lost? Actually not: git has only hidden the commits made after the one we checked out.

Question

Can you think about another way to delete the plot conclusion?

Solution

You can also delete the plot conclusion, save your Rmd document and commit this new change. Commits are as much about deleted code/text as about additions.

3.3 Back to the present

Ok, let’s get back to the latest commit in one step:

git checkout master

We are now back at the most up-to-date version of our Rmd document.
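
The round trip to the past and back can be rehearsed safely in a throwaway repository. The sketch below uses one-word stand-ins for the plot and the conclusion, and it looks up the branch name instead of assuming master, because some git installations name the default branch main.

```shell
demo=$(mktemp -d) && cd "$demo"
git init --quiet
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"

echo "plot" > learning_git.Rmd
git add learning_git.Rmd
git commit --quiet -m "Add GDP per capita plot"
echo "conclusion" >> learning_git.Rmd
git add learning_git.Rmd
git commit --quiet -m "Add a small comment on the GDP plot"

branch=$(git symbolic-ref --short HEAD)    # "master" in this lesson
previous=$(git rev-parse --short HEAD~1)   # identifier of the commit before the latest one

git checkout --quiet "$previous"   # back to the past (detached HEAD)
cat learning_git.Rmd               # the conclusion line is gone
git checkout --quiet "$branch"     # back to the present
cat learning_git.Rmd               # the conclusion line is back
```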

4. Experiment with changes

One of the greatest features of git is that it allows you to experiment with changes without any harm to your functional R script. Imagine that you want to change the way you perform statistics and see the consequences. This is easy with git.

4.1 Create a new branch

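Create a new branch called “barplot”, in which you will turn the Canada GDP scatterplot into a bar plot. In the Terminal, the git checkout -b command followed by the branch name both creates the branch and switches to it.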
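
A minimal sketch of the branch creation, run in a throwaway repository (the file content and the identity are placeholders):

```shell
demo=$(mktemp -d) && cd "$demo"
git init --quiet
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"

echo "scatterplot" > learning_git.Rmd
git add learning_git.Rmd
git commit --quiet -m "Add GDP per capita plot"

git checkout -b barplot   # create the "barplot" branch and switch to it
git branch                # lists all branches; the current one is marked with "*"
```

In your own project, only the last two commands are needed.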

4.2 Modify the plot

Modify your code that you previously wrote to make a bar plot instead of a scatterplot. Here is a suggestion:

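# Assumes library("tidyverse") is loaded and gapminder has been imported
gapminder %>%
  filter(country == "Canada") %>%
  ggplot(aes(x = year, y = gdpPercap)) +
    geom_col()  # bar plot: bar heights are taken directly from gdpPercap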

Make sure you add (stage) and commit your changes.

work loss

4.3 Switch back to the master branch

Once your changes are committed inside the barplot branch, you can easily switch back to the main branch called the master branch. You can either use the branch tool in RStudio and select master or use the Terminal of RStudio (see below):

RStudio tool

Terminal alternative

git checkout master

This will switch your Rmd document to its original content on the master branch. The plot is now a scatterplot.
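
To see that each branch keeps its own version of the file, here is a sketch in a throwaway repository. The file contains a single word as a stand-in for the plotting code, and the name of the main branch is looked up rather than assumed.

```shell
demo=$(mktemp -d) && cd "$demo"
git init --quiet
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"

echo "scatterplot" > learning_git.Rmd
git add learning_git.Rmd
git commit --quiet -m "Add GDP per capita plot"
main_branch=$(git symbolic-ref --short HEAD)   # "master" in this lesson

git checkout --quiet -b barplot
echo "bar plot" > learning_git.Rmd
git add learning_git.Rmd
git commit --quiet -m "Turn the scatterplot into a bar plot"

git checkout --quiet "$main_branch"   # "git checkout master" in the lesson
cat learning_git.Rmd                  # prints "scatterplot"
git checkout --quiet barplot
cat learning_git.Rmd                  # prints "bar plot"
```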

Branches are key to git’s power

Branches are a great feature since they allow you to experiment with changes and test options without altering your main functional piece of work.


5. Recap of git commands

Here are the main git commands used in this episode, together with their RStudio equivalents.

| git command | Description | RStudio equivalent |
|---|---|---|
| git add | ask git to track the file changes. This is also called “staging” the file. | add button (“Staged” checkbox) |
| git commit | take a snapshot of the folder at a given time point. | commit button |
| git status | ask git to give an overview of changes to be committed, untracked files, etc. | None |
| git hist | list the history of commits (an alias for a formatted git log, not a built-in command). | history button |
| git log | show the commit history, most recent first. Use git log --oneline for a more concise output. | history button |
| git checkout -b | make a new branch and switch to it. | new branch button |
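
Since git status and git log --oneline have no dedicated step earlier in this episode, here is a short sketch of both in a throwaway repository (file content and identity are placeholders):

```shell
demo=$(mktemp -d) && cd "$demo"
git init --quiet
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"

echo 'library("tidyverse")' > learning_git.Rmd
git add learning_git.Rmd
git commit --quiet -m "Import the gapminder dataset"
echo "# a new line" >> learning_git.Rmd   # modify the file without staging it

git status --short   # " M learning_git.Rmd": modified but not yet staged
git log --oneline    # one line per commit: short identifier and subject
```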


6. Resources

6.2 Troubleshooting

Sometimes, git integration with RStudio has issues.

Key Points

  • With a version control system, file names no longer need to reflect file versions.

  • git acts as a time machine for files in a given repository under version control.

  • git allows you to test changes and discard them if not relevant.

  • A new RStudio project can be smoothly integrated with git to allow you to version control scripts and other files.