An introduction to version control
Last updated on 2023-02-17
Estimated time 40 minutes
Maintaining History through Version Control
Overview
Questions
- What is version control?
- Why use Git?
- How are version control systems relevant to biomedical research?
Objectives
- Get an overview of version control principles
- Understand its importance for reproducibility
- Understand the power and pitfalls of Git
Version control allows you to track history and go back to different versions as needed. The Turing Way project illustration by Scriberia for The Turing Way Community. Shared under CC-BY 4.0 License. Zenodo. http://doi.org/10.5281/zenodo.3332807
Practices and recommendations described in this lesson are applicable to all areas of biological research. What is slightly different in computational projects is that every object required to carry out the research exists in digital form: the research workflow, data, software, analysis process, resulting outcomes, and even how the researchers involved in the project communicate with each other. This means that research objects can be organised and maintained without losing their provenance or the knowledge of how each object is connected to the others in the context of your project.
Versioning Every Research Object
The management of changes or revisions to any type of information in a file or project is called versioning. Version Control Systems (VCS) are platforms and technical tools that record all changes made to a file or research object over time. They allow all collaborators to track history, review changes, give appropriate credit to all contributors, track and fix errors when they appear, and revert to earlier versions.
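To make the idea of recording, reviewing and reverting changes concrete, here is a minimal sketch in Python. It is a toy, in-memory model for a single file, not how Git or any real VCS stores data. The names `Version`, `VersionHistory`, `record` and `revert_to`, as well as the example authors and file content, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    """One recorded state of a file, with who changed it and why."""
    number: int
    content: str
    author: str
    message: str
    timestamp: datetime


@dataclass
class VersionHistory:
    """Hypothetical in-memory history of one file (not Git's data model)."""
    versions: list = field(default_factory=list)

    def record(self, content, author, message):
        """Store a new version together with its author and explanation."""
        version = Version(
            number=len(self.versions) + 1,
            content=content,
            author=author,
            message=message,
            timestamp=datetime.now(timezone.utc),
        )
        self.versions.append(version)
        return version

    def log(self):
        """Return a readable summary of every recorded version."""
        return [f"v{v.number} {v.author}: {v.message}" for v in self.versions]

    def current(self):
        """Return the content of the most recent version."""
        return self.versions[-1].content

    def revert_to(self, number, author):
        """Restore an earlier version by recording it again as a new version."""
        old = self.versions[number - 1]
        return self.record(old.content, author, f"Revert to v{number}")


history = VersionHistory()
history.record("dose = 5 mg", "Ana", "Add initial protocol")
history.record("dose = 50 mg", "Ben", "Update dose")
history.revert_to(1, "Ana")  # the mistaken v2 stays in the history

print("\n".join(history.log()))
print(history.current())
```

Note that reverting in this sketch adds a new version instead of deleting the mistaken one, so the full history, including who made each change, remains available for review.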
Different VCS can be used through web browser-based applications (such as Google Docs for documents) or, more dynamically for code and all kinds of data, through command-line tools (such as Git) and their integrations into graphical user interfaces (such as the Visual Studio Code editor, git-gui and GitKraken). The practice of versioning is particularly important because it allows non-linear or branched development of different parts of the project, for example when testing a new feature, debugging an error, or reusing code from one project on different data by different contributors.
GitLab, GitHub and Bitbucket are online platforms that host version-controlled projects and allow multiple collaborators to participate. Each member can download a copy of the online repository (the most recent version), make changes by adding their contributions locally on their computer, and push the changes to GitLab/GitHub/Bitbucket (a new version!), allowing others to build on the new development.
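The download, change and push cycle described above can be sketched with a small conceptual model in Python. This is an illustration under simplifying assumptions, not Git itself: the hypothetical `Repository` class represents a project as a plain list of commit messages, and `clone`, `commit` and `push` only mimic the roles of the corresponding Git operations.

```python
class Repository:
    """Hypothetical toy repository: an ordered list of commit messages."""

    def __init__(self, commits=()):
        self.commits = list(commits)

    def clone(self):
        """Download a full, independent copy of the repository."""
        return Repository(self.commits)

    def commit(self, message):
        """Record a change locally; the online copy is not affected yet."""
        self.commits.append(message)

    def push(self, remote):
        """Upload local commits, but only if we already have all remote commits."""
        shared = len(remote.commits)
        if self.commits[:shared] != remote.commits:
            raise RuntimeError(
                "remote has changes you do not have; update your copy first"
            )
        remote.commits.extend(self.commits[shared:])


# The online repository, as hosted on a platform such as GitLab or GitHub.
online = Repository(["Add README"])

# Ana downloads a copy, works locally, then pushes a new version.
ana = online.clone()
ana.commit("Add analysis script")
ana.push(online)

# Ben clones afterwards, so he builds on Ana's contribution.
ben = online.clone()
ben.commit("Fix plot labels")
ben.push(online)

print(online.commits)
```

In this sketch a push is refused when the online repository contains commits that the local copy lacks, which reflects why collaborators usually update their local copy before sharing new work.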
Read "All you need to know about Git, GitHub & GitLab" on Towards Data Science and the version control chapter in The Turing Way for more details on workflows, the technical details of using Git, and versioning large datasets.
Basics
We have all seen a simple file versioning approach where different versions of a file are stored under different names. Tools such as Google Drive and Microsoft Teams offer platforms to update files and share them with others collaboratively in real time. More sophisticated version control exists within tools like Google Docs or HackMD. These allow collaborators to update files while storing each version in a version history (we will discuss this in detail). Advanced version control systems (VCS) such as Git and Mercurial provide much more powerful tools to maintain versions of local files and share them with others.
Web-based Git repository hosting services like GitLab and GitHub facilitate online collaboration in research projects by making changes available online more frequently, and by enabling colleagues who don’t code to participate within a common platform. With the help of comments and commit messages, each version can explain what changes it contains compared to the previous version. This is helpful when we share our analysis (not only data) and want to make it auditable and reproducible, which is good scientific practice. In the next chapters we discuss version control for different research objects.
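As a rough illustration of how a version can document what changed compared to the previous one, the sketch below uses Python's standard `difflib` module to compare two versions of a file line by line. The file content, the labels and the commit message are invented for this example. Hosting platforms show similar line-by-line comparisons next to each commit message.

```python
import difflib

# Two hypothetical versions of the same analysis plan, one line per step.
previous = [
    "load data",
    "remove outliers above 3 SD",
    "fit linear model",
]
current = [
    "load data",
    "remove outliers above 2 SD",
    "fit linear model",
    "plot residuals",
]

# A short message explaining why the change was made.
commit_message = "Tighten outlier threshold and add residual plot"

# Lines starting with "-" were removed, lines starting with "+" were added.
diff = list(
    difflib.unified_diff(
        previous,
        current,
        fromfile="analysis (previous)",
        tofile="analysis (current)",
        lineterm="",
    )
)

print(commit_message)
print("\n".join(diff))
```

The message records the intent of the change, while the diff records exactly which lines changed. Together they make an analysis easier to audit.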
You can read more details in the Version Control and Getting Started With GitHub chapters of The Turing Way.