Version control with git
Overview
Teaching: 45 min
Exercises: 15 min
Questions
What is version control? How do I use it?
What is the difference between
What benefits does a version control system bring to my research?
Objectives
Understand the benefits of using a version control system such as git.
Be able to decipher git jargon: repository, commit, push, pull, branches etc.
Understand the basics of git and its usage in RStudio.
Table of contents
- 1. Introduction
- 2. Tell the story of your project
- 3. Travel back in time
- 4. Experiment with changes
- 5. Recap of git commands
- 6. Resources
In this episode, you will learn about the git version control system and how to use it in your R project from RStudio.
We will see how to trace edits and modifications made to your R Markdown document. Also, we will demonstrate how you can revert changes if needed or experiment safely with changes on your valuable code.
1.1 What is a version control system and why should scientists use it?
In the context of a research project, a version control system helps you manage your project history and progress. It also supports active collaboration with your colleagues and with yourself (your past, present and future self).
As a concrete example, this is something we might have all experienced in the past when keeping track of file versions:
[Figure: Version control is an essential tool in data analysis]
Version control helps you avoid this file nightmare and also fosters other good coding practices.
1.2 Five reasons to use a version control system in research
- Tell the story: the history of your commit messages describes the progress of your project.
- Travel back in time: a version control system makes it easy to compare different time points of your project. If you want to compare the current state of your project with its state a year ago, it only takes one command.
- Experiment with changes: if you want to make changes in a script, you can first make a “snapshot” of the project status before experimenting with changes. As a researcher, this might be second nature for you!
- Back up your work: by linking your local repository (folder) to a remote online host, a version control system lets you back up your precious work.
- Collaborate easily on projects: having a web-hosted, synchronised version of your project encourages collaboration with other researchers. Think of a colleague being able to add a script that makes a figure for your first PhD publication, for instance.
There are other good reasons to use a version control system for your research project. While version control systems were originally created for software development, they are commonly used in scientific research to track versions of datasets, scripts or figures easily and efficiently.
1.3 git is a popular version control software
One of the most widely used version control tools is git. It is a cross-platform tool that is available natively on Mac and Linux OS and that needs to be installed on Windows (check the Setup section on how to do this).
git is a version control system primarily used in software development.
git is an application that runs on your computer like a web browser or a word processor (Tom Stuart).
1.4 Collaborating with yourself with git
Using your recently acquired flashy R skills, you are now ready to apply them to your scientific project. You start by creating an R Markdown document, add code and text comments, generate an HTML report, save your R Markdown document, etc.
But how do you make sure that your changes are properly saved and tracked? What is your backup strategy? This is where git will come in handy.
2. Tell the story of your project
Compare two solutions below, one without git and one with:
Can you list the potential and proven drawbacks of keeping track of changes by saving copies of the files?
In the follow-up section, we will see how to tell a story about the changes applied to our R Markdown document. This storyline will be composed of the git commit messages.
Let’s see how we can use git’s powerful file versioning from within RStudio.
2.1 Create a new RStudio project
Projects in RStudio are a great feature and work very well in combination with git.
Go to RStudio and click on File > New Project > New directory.
Then select New project
We will call our new project “learning_git”
2.2 Create a new R Markdown document
Go to File > New File > R Markdown and call it “learning git”. Click “OK”. It should open this new R Markdown document.
## R Markdown, add a new code chunk, and copy this code:
Save your document under the name learning_git.Rmd. You should see this in your File pane:
2.3 Initialise git from within the folder
git is still unaware of things that happen in this R project folder. Let’s change that.
In the console pane, click on “Terminal” to get access to a Shell from within RStudio. We will initialise git in this folder.
This is a regular Shell in which you can type any command-line instruction. Let’s type this:
git init
This command created a hidden folder called .git/ that will contain all information needed by git to recapitulate your file changes, project history, etc.
Try typing this:
ls -l .git/
This will show you what happened under the hood:
(base) marcs-MacBook-Pro:learning_git mgalland$ ls -l .git/
total 24
-rw-r--r--   1 mgalland  staff   23 Jun 17 17:45 HEAD
-rw-r--r--   1 mgalland  staff  137 Jun 17 17:45 config
-rw-r--r--   1 mgalland  staff   73 Jun 17 17:45 description
drwxr-xr-x  14 mgalland  staff  448 Jun 17 17:45 hooks
drwxr-xr-x   3 mgalland  staff   96 Jun 17 17:45 info
drwxr-xr-x   4 mgalland  staff  128 Jun 17 17:45 objects
drwxr-xr-x   4 mgalland  staff  128 Jun 17 17:45 refs
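As a minimal sketch of what the initialisation step does, the commands below run it in a throwaway folder rather than in your real project. The temporary folder created by mktemp and the variable name repo are assumptions made for illustration; in the lesson you would work directly inside the learning_git folder.

```shell
# Work in a temporary folder so the real project is left untouched
repo=$(mktemp -d)
cd "$repo"

# Turn the folder into a git repository (this creates the hidden .git/ folder)
git init

# List the contents of .git/ to confirm that it now exists
ls -l .git/
```

The exact files listed may vary slightly between git versions, but HEAD, config, objects and refs should always be there.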
2.4 Track file changes with git
Close and restart RStudio to show the “git” tab in the environment pane. You should see this:
git does not track anything in this RStudio project folder.
We would like git to track changes in our learning_git.Rmd document. To do this, click in the empty checkbox:
You can see that there is now a small green “A” next to the learning_git.Rmd file under the “Status” column. This means that our file is now being tracked by git.
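Ticking the checkbox corresponds to staging the file from the command line. The sketch below shows the equivalent commands in a throwaway repository; the temporary folder and the placeholder file content are assumptions for illustration only.

```shell
# Throwaway repository standing in for the learning_git project
repo=$(mktemp -d)
cd "$repo"
git init

# Placeholder file standing in for the real R Markdown document
echo "# Learning git" > learning_git.Rmd

# Stage the file: this is what ticking the checkbox does
git add learning_git.Rmd

# Short status: a newly staged file is listed with an "A" in the first column
git status --short
```

The “A” printed by git status --short is the same “Added” status that RStudio displays in green.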
2.5 Making changes and visualising them.
We will first:
- Import the gapminder dataset.
- Make a plot of the GDP per capita along the years for Canada.
- Write a small comment about the plot.
These 3 steps will all have their own commit message. Let’s start.
In your Rmd document, create a new code chunk and add this:
gapminder <- readr::read_csv('https://raw.githubusercontent.com/carpentries-incubator/open-science-with-r/gh-pages/data/gapminder.csv')
You see a small blue “M” button next to your learning_git.Rmd file. This stands for “Modified”. You can visualise the changes in your Rmd document by selecting “diff”:
This opens a new window where you can see that 3 lines were added (shown in green). These lines are the code chunk we’ve added where we read the gapminder dataset.
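The “diff” view has a command-line counterpart, git diff, which prints added lines with a leading “+”. The following sketch reproduces the situation in a throwaway repository; the file content, the three placeholder lines and the user identity are assumptions, not the lesson's actual code chunk.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
# Placeholder identity, set for this repository only
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# First version of the document, committed as a starting point
echo "# Learning git" > learning_git.Rmd
git add learning_git.Rmd
git commit -m "Add R Markdown document"

# Append three placeholder lines standing in for the new code chunk
printf '%s\n' "chunk start" "read the gapminder dataset" "chunk end" >> learning_git.Rmd

# Show changes that have not been staged yet; added lines start with "+"
git diff
```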
While we are in this “diff” view, we can write a small commit message to describe what happened to our document in a meaningful way.
In the “Commit message” write this little message:
Import the gapminder dataset

The gapminder dataset is imported using an online url. It will be used to produce a plot of the GDP per year.
Now, click on commit. This will assign a unique identifier to your commit as git takes a snapshot of your learning_git.Rmd document.
Let’s continue our work, add the changes and create commit messages.
- Step 1: Add a scatterplot of the GDP per capita per year for Canada (use geom_point). Save your Rmd document.
- Step 2: Add the modifications by clicking the checkbox under “Staged” to see the blue “M” sign in the RStudio git pane.
- Step 3: Click on “Diff” to open the new window where you should write a small commit message. Click on “Commit” when you’re done.
- Step 4: Write a small conclusion about the plot in your Rmd document.
- Step 5: Save, add/stage your changes, and commit them with a small message.
If all went well, you can click on “History” to preview the history of commits that you have already made:
This gives you a history of your Rmd file and your project so far. These 3 commits are nicely placed on top of each other. Each of them has a unique SHA identifier to trace it back. We will see in the next section how to move back and forth in time using these SHA ids.
2.6 Great commits tell a great story
A good commit message
- Separate subject from body with a blank line.
- Limit the subject line to 50 characters.
- Capitalize the subject line.
- Do not end the subject line with a period.
- Use the imperative mood in the subject line.
- Wrap the body at 72 characters.
- Use the body to explain what and why vs. how. The how should be self-explanatory from your code.
Here is an example of a good commit message:
Fix issue with dplyr filter function

Specify dplyr::filter() explicitly to avoid conflicts with
filter() functions from other packages.
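From the Terminal, a message with a separate subject and body can be written by passing -m twice to git commit: git joins the two parts with a blank line. This is only a sketch in a throwaway repository; the file content and the user identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
# Placeholder identity, set for this repository only
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Placeholder content standing in for the real change
echo "dplyr::filter(gapminder, country == 'Canada')" > learning_git.Rmd
git add learning_git.Rmd

# The first -m is the subject line, the second -m becomes the body
git commit -m "Fix issue with dplyr filter function" \
           -m "Specify dplyr::filter() explicitly to avoid conflicts with filter() functions from other packages."

# Print the full message of the latest commit
git log -1 --format=%B
```

For longer bodies, running git commit without -m opens a text editor, which makes it easier to wrap lines at 72 characters.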
3. Travel back in time
3.1 History of commits
If all went well in the previous exercise, you have 3 nicely self-explanatory commits like this:
In this section we will see how to move back and forth between these commits safely. This can be useful to see what happened to a file or to revert to a previous commit (because you are not happy with the current version).
3.2 Back to the past
Imagine that you are not happy with your conclusion about the GDP per capita plot for Canada. Then, it would be useful to revert to a previous commit. In the history, we would like to revert to the previous commit with the message “Add GDP per capita plot”.
Go to the Terminal in the Console pane of RStudio and type:
git hist
This will output the commit history of your local folder where you are working.
* 21830a4 2021-06-18 | Add a small comment on the GDP plot (HEAD -> master) [Marc Galland]
* 081d7cd 2021-06-18 | Add GDP per capita plot [Marc Galland]
* a5cc728 2021-06-18 | Import the gapminder dataset [Marc Galland]
The commit id 21830a4 is the most recent one (also called the HEAD). The commit we would like to revert to has the commit identifier 081d7cd.
Your exact commit identifiers will be different. Use git hist to identify the commit identifier you need, and make sure you use your own identifier, otherwise the command will not work.
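Note that git hist is not a built-in git command, so it is presumably an alias defined during setup. Judging from the output format shown above, a definition along the following lines would produce it. This is an assumption, and your own Setup instructions take precedence; the alias is defined here for a throwaway repository only, with a placeholder identity.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Assumed alias definition (local to this repository): one line per commit
# with short id, date, subject, branch decorations and author name
git config alias.hist "log --pretty=format:'%h %ad | %s%d [%an]' --graph --date=short"

# One commit so that there is some history to display
echo "# Learning git" > learning_git.Rmd
git add learning_git.Rmd
git commit -m "Import the gapminder dataset"

# Display the history with the alias
git hist
```

If the alias is not available on your machine, git log --oneline gives a similar compact overview with the short commit identifiers.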
In git, the command to move to a given commit is called git checkout. In your terminal in RStudio, type:
git checkout 081d7cd
We get a lot of text in return.
Note: switching to '081d7cd'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 081d7cd Add GDP per capita plot
This simply tells us that the HEAD now points at commit 081d7cd, where we added the GDP plot, instead of at our latest commit. Again you will have a different commit identifier and that’s totally normal.
Look at your learning_git.Rmd file. It should have changed and the conclusion about the plot is now gone.
git has just hidden the commits made after the commit we checked out; they are not deleted.
Can you think about another way to delete the plot conclusion?
You can also delete the plot conclusion, save your Rmd document and commit this new change. Commits are as much about deleted code/text as about additions.
3.3 Back to the present
Ok, let’s get back to the latest commit in one step:
git checkout master
We now have our most up-to-date Rmd document back.
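The round trip between past and present can be sketched in a throwaway repository as follows. The file contents, the variable names and the user identity are placeholders, and the initial branch is explicitly named master to match the lesson (recent git installations may default to another name).

```shell
repo=$(mktemp -d)
cd "$repo"
git init
# Name the initial branch "master", as in the lesson
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Commit 1: the plot
echo "plot" > learning_git.Rmd
git add learning_git.Rmd
git commit -m "Add GDP per capita plot"
plot_commit=$(git rev-parse --short HEAD)

# Commit 2: the conclusion
echo "conclusion" >> learning_git.Rmd
git commit -am "Add a small comment on the GDP plot"

# Back to the past: the conclusion disappears from the working file
git checkout "$plot_commit"
past_content=$(cat learning_git.Rmd)

# Back to the present: the conclusion is there again
git checkout master
cat learning_git.Rmd
```

Nothing is lost while you are in the past: the second commit stays in the repository and reappears as soon as you check out master again.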
4. Experiment with changes
One of the greatest features of git is that it allows you to experiment with changes without any harm to your functional R script. Imagine that you want to change the way you perform statistics and see the consequences. This is easy with git.
4.1 Create a new branch
Create a branch called “barplot” in which you will turn the Canada GDP scatterplot into a bar plot.
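One way to create this branch from the Terminal is git checkout -b, which creates the branch and switches to it in one step (recent git versions also offer git switch -c barplot for the same purpose). The sketch below runs in a throwaway repository with placeholder content and identity; RStudio's branch tool is an alternative.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
# Name the initial branch "master", as in the lesson
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# A branch needs at least one commit to start from
echo "scatterplot" > learning_git.Rmd
git add learning_git.Rmd
git commit -m "Add GDP per capita plot"

# Create the "barplot" branch and switch to it
git checkout -b barplot

# List branches: the asterisk marks the branch you are currently on
git branch
```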
4.2 Modify the plot
Modify the code you previously wrote to make a bar plot instead of a scatterplot. Here is a suggestion:
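library(dplyr)
library(ggplot2)

# Bar plot of GDP per capita per year for Canada.
# geom_col() uses the y values directly as bar heights.
gapminder %>%
  filter(country == "Canada") %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_col()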
Make sure you add + commit your changes.
4.3 Switch back to the master branch
Once your changes are committed inside the barplot branch, you can easily switch back to the main branch, called master.
You can either use the branch tool in RStudio and select master or use the Terminal of RStudio (see below):
git checkout master
This will switch your Rmd document to its original content on the master branch. The plot is now a scatterplot.
Branches are key to git
Branches are a great feature since they allow you to experiment with changes and test options without altering your main functional piece of work.
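To illustrate that an experiment on a branch leaves the main line of work untouched, here is a sketch in a throwaway repository. The one-line file contents stand in for the real plotting code, and the identity is a placeholder.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
# Name the initial branch "master", as in the lesson
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Working version on master
echo "geom_point()" > learning_git.Rmd
git add learning_git.Rmd
git commit -m "Add GDP per capita plot"

# Experiment on a separate branch
git checkout -b barplot
echo "geom_col()" > learning_git.Rmd
git commit -am "Replace scatterplot with bar plot"

# Switch back: the file on master still contains the scatterplot code
git checkout master
cat learning_git.Rmd

# Compare the two branches
git diff master barplot
```

If the experiment is not convincing, the barplot branch can simply be left aside; master never changed.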
5. Recap of git commands
Before we dive in, there are a few technical terms to know.
| git command | description | RStudio equivalent |
| git add | asking git to track the file changes. This is also called “staging” the file. | |
| git commit | taking a snapshot of the folder at a given time point. | |
| git status | asking git to give an overview of changes to be committed, untracked files, etc. | None |
| git log | list the history of commits | |
| | showing the most recent commits. Do | |
| | makes a new branch | |
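The commands described in this recap can be tried together in a throwaway repository. The sketch below uses commands that match the descriptions above (git status, git add, git commit, git log and git branch); the file content and the identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

echo "# Learning git" > learning_git.Rmd

# Overview: learning_git.Rmd is reported as an untracked file
git status

# Stage the file, then take a snapshot
git add learning_git.Rmd
git commit -m "Add R Markdown document"

# History of commits, one line each
git log --oneline

# Make a new branch (without switching to it) and list the branches
git branch barplot
git branch
```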
6. Resources
- A “git for humans” presentation
- Jenny Bryan’s HappyGitWithR is very useful for troubleshooting, particularly the sections on Detect Git from RStudio and RStudio, Git, GitHub Hell (troubleshooting).
- Online game
- RStudio webinar on GitHub and RStudio
- Using git and GitHub for scientific writing
git integration with RStudio has issues.
- Issues with git and Mac OS X: https://github.com/jennybc/happy-git-with-r/issues/8
In a version control system, file names do not reflect their versions.
git acts as a time machine for files in a given repository under version control.
git allows you to test changes and discard them if not relevant.
A new RStudio project can be smoothly integrated with git to allow you to version control scripts and other files.