This lesson is in the early stages of development (Alpha version)

Introduction to Reproducible Publications with RStudio

Scientific reproducibility: What is it for?

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What is reproducible research?

  • How can RStudio help research to be more reproducible?

  • What are the benefits of using RStudio for writing academic essays and papers?

Objectives
  • Understand what scientific reproducibility entails.

  • Identify the benefits of using RStudio to create research reports.

  • Understand how RStudio supports Open Science principles.

  • Learn how RStudio can help one’s research.

Warm-up

Let’s get into breakout rooms and discuss: What is reproducible research for you? Have you ever experienced issues while trying to reproduce someone else’s study or even your own research?

Reproducible studies allow other researchers to perform the same processes and analyses and obtain the same result as the original researcher. To make this possible, the original researchers have to make the study’s data, documentation, code pipelines, and workflows available in a form that is sufficiently self-explanatory and well documented for independent investigators to recreate the original study under the same conditions, using identical materials and procedures, and arrive at consistent results. Original investigators, therefore, must produce rich and detailed documentation for themselves and others. This includes fully specifying, in both human-readable and computer-executable ways, all steps taken in the study.

The importance of Reproducibility in Research

PhD Comics cartoon

Source: Comic number 1869 from PhD Comics Copyrighted artwork by Jorge Cham.

Discussion: A scary anecdote

According to the U.S. National Science Foundation (NSF) subcommittee on replicability in science (2015):

Science should routinely evaluate the reproducibility of findings that enjoy a prominent role in the published literature. To make reproduction possible, efficient, and informative, researchers should sufficiently document the details of the procedures used to collect data, to convert observations into analyzable data, and to perform data analysis.

Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results. Reproducibility is a minimum necessary condition for a finding to be considered rigorous, believable and informative.

Why all the talk about reproducible research?

A 2016 survey in Nature revealed that irreproducible experiments are a problem across all domains of science:

Nature Report - 2016

Source: Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). doi.org/10.1038/533452a

Factors behind irreproducible research

Science is not a miracle

Source: Then a Miracle Occurs. Copyrighted artwork by Sydney Harris Inc.

Reproducible, replicable, robust, generalizable

While reproducibility is the minimum requirement and can be solved with “good enough” computational practices, replicability/robustness/generalizability of scientific findings are an even greater concern involving research misconduct, questionable research practices (p-hacking, HARKing, cherry-picking), sloppy methods, and other conscious and unconscious biases.

How science should be

Source: This image was created by Scriberia for The Turing Way community. DOI: 10.5281/zenodo.3332807

If contributing to science and other researchers seems not to be compelling enough, here are 5 selfish reasons to work reproducibly according to Markowetz (2015):

When do you need to worry about reproducibility?

Let’s assume that I have convinced you that reproducibility and transparency are in your own best interest. Then what is the best time to worry about it?

From day one, and throughout the whole research life cycle! Before you start the project, because you might have to learn tools like R or Git. While you do the analysis, because if you wait too long you might lose a lot of time trying to remember what you did two months ago. When you write the paper, because you want your numbers, tables, and figures to be up-to-date. When you co-author a paper, because you want to make sure that the analyses presented in a paper with your name on it are sound. When you review a paper, because you can’t judge the results if you don’t know how the authors got there.

Levels of Reproducibility

A published article is like the top of a pyramid: a reproducible paper or report rests on multiple levels that each contribute to its reproducibility.

Advantages of using RStudio for your project

RStudio is an integrated development environment (IDE) for R and Python. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, collaboration, and workspace management. It is a powerful tool that supports research by weaving the principles of reproducibility throughout the entire research lifecycle, from data gathering to statistical analysis, presentation, and publication of results.

It is free and open-source

Reproducibility becomes more difficult and opaque when results rely on proprietary software. Unless research code is open sourced, reproducing results on different software/hardware configurations is impossible. RStudio is dedicated to sustainable investment in free and open-source software for data science.

It is designed to make it easy to write and reuse code

As soon as you create a new script, the windows within your RStudio session adjust automatically so you can see both your script and the results in your console when you run your syntax. It also offers the ability to call up potential syntax options while you are writing just by using the tab key.

Makes it convenient to view and interact with the objects stored in your environment

RStudio has a very useful “Environment” window available which shows all of the objects that you have stored, including data; scalars, vectors, and matrices; model outputs; etc., along with a summary of the information that is stored in those objects.

Makes it easy to set your working directory and access files on your computer

With RStudio, you can navigate to folders on your computer in the “Files” window, view any files you have in that folder, and set that folder as the working directory.

Integrates with Collaboration and Publishing Tools

Another great advantage of using RStudio for your R project is that the platform integrates with GitHub. Once you connect RStudio with your GitHub account, a remote repo becomes the “upstream” remote for your local repo. In essence, it enables you to push and pull commits to GitHub, allowing more seamless collaboration and more effective version control. RStudio also connects with RPubs for easy R project web publishing.

Creates documents using R Markdown

R Markdown is a variant of Markdown, a plain-text format that is easy to read, easy to write, and easily converted to HTML.

R Markdown belongs to the field of literate programming which is about weaving text and source code into a single document to make it easy to create reproducible web-based reports. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents and much, much more. R Markdown provides the flexibility of Markdown with the implementation of R input and output. For more details on using R Markdown check http://rmarkdown.rstudio.com.

The idea of literate programming shines some light on this dark area of science. This is an idea from Donald Knuth where you combine your text with your code output to create a document. This is a blend of your literature (text) and your programming (code) to create something that you can read from top to bottom. Imagine your paper: the introduction, methods, results, discussion, and conclusion, and all the bits of code that make each section. With R Markdown, you can see all the pieces of your data analysis all together.

You can include both text and code to execute. It is a convenient tool for reproducible and dynamic reports with R! With R Markdown, you are able to:

  1. Keep an eye on text (the paper) AND the source code. These computational steps are essential to ensure computational reproducibility.
  2. Conduct the entire analysis pipeline in an R Markdown document: data (pre-)processing, analysis, outputs, visualization.
  3. Apply a formatting syntax that is part of the R ecosystem and supports LaTeX.
  4. Combine text written in Markdown and source code written in R (and other languages).
  5. Easily share R Markdown documents with colleagues, as supplemental material, or as the paper under review. Thanks to the package knitr, others can execute the document with a single click and receive, for example, HTML or PDF renderings.
  6. Get figures automatically updated if you change the underlying parameters in the code. The error-prone task of exporting figures and uploading the right figure version to another platform is thus not needed anymore.
  7. Use version control with Git, since Markdown is a text-based format.
  8. If you do not make any changes to the document after creating the output document, you can be sure that the paper was executable at least at the time of submission.
  9. Refer to the corresponding code lines in the methodology section making it unnecessary to use pseudocode, high-level textual descriptions, or just too many words to describe the computational analysis.
  10. Use packages such as rticles to use templates from publishers and create submission-ready documents.

Some Real-world Applications

Finally, three real-world examples that motivated the authors of this lesson to value and use R Markdown:

  1. Greg Janee quickly put together a simple but compelling R Markdown document describing his survey results. The ease with which he created his plots is a testament to the power of R as a data analysis environment, but the ease with which he was able to publish a page on the web is a testament to R Markdown and GitHub as a publishing environment. Notice that he did not have to: create plots in a tool and then export the plots as images; write any HTML; embed plot images in HTML; or create a site under WordPress or another web hosting service. Instead, he directly published his R code as he wrote it and, using GitHub, made it appear on the web with a button click.

  2. One of us wanted to create a short document that included some math formulas. The LaTeX document preparation system can be used for this, but it is difficult to use and is overkill for just a few formulas in otherwise plain text. R Markdown lets you use just the best part of LaTeX (math formatting) while letting you write your text in a user-friendly way.

  3. In this lesson we will be constructing a scientific paper that is based on an actual Nature publication and attendant survey and data. In trying to recreate the plots the original authors created, we found it difficult and time-consuming to figure out exactly how the authors created their plots. Out of the many columns in their data, many with similar-sounding names, which did they use? How did they handle missing data? Exactly what operations did they perform to compute aggregate values? How much easier it would have been if they had published the code they used along with their paper. R Markdown allows you to do this.

Our goal is that by the end of this workshop you will be able to create a reproducible report. This template is a short and adapted version of the data paper referenced below:

Nitsch, F. J., Sellitto, M., & Kalenscher, T. (2021). Trier social stress test and food-choice: Behavioral, self-report & hormonal data. Data in brief, 37, 107245. https://doi.org/10.1016/j.dib.2021.107245

This template is used exclusively for instruction purposes with permission from the authors.

Key Points

  • Reproducible research is key for scientific advancement.

  • RStudio can help you to organize, have better control over and produce reproducible research.


Navigating RStudio and R Markdown Documents

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How do you find your way around RStudio?

  • How do you start an R Markdown document in RStudio?

  • How is an R Markdown document configured and how do I work with it?

Objectives
  • Understand key functions in RStudio.

  • Learn about the structure of an R Markdown file.

  • Understand the workflow of an R Markdown file.

Getting Around RStudio

Throughout this lesson, we’re going to teach you some of the fundamentals of using R Markdown as part of your RStudio workflow.

We’ll be using RStudio: a free, open-source R Integrated Development Environment (IDE). It provides a built-in editor, works on all platforms (including on servers), and provides many advantages such as integration with version control and project management.

This lesson assumes you already have a basic understanding of R and RStudio but we will do a brief tour of the IDE, review R projects and the best practices for organizing your work, and how to install packages you may want to use to work with R Markdown.

Basic layout

When you first open RStudio, you will be greeted by three panels:

RStudio layout

Once you open files, such as .Rmd files or .R files, an editor panel will also open in the top left.

RStudio layout with .R file open

R Packages

It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the Comprehensive R Archive Network). R and RStudio have functionality for managing packages:

Packages can also be viewed, loaded, and detached in the Packages tab of the lower right panel in RStudio. Clicking on this tab will display all of the installed packages with a checkbox next to each. If the box next to a package name is checked, the package is loaded; if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package.

Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab. We have asked you to install a few packages prior to the workshop, following the setup instructions, using the install.packages() command. Let’s now make sure you have all of them good to go.

CHALLENGE 2.1 - Checking for Installed Packages

Which command would you use to check for packages ready for use?

SOLUTION

To see what packages are installed, use the installed.packages() command. This will return a matrix with a row for each package that has been installed.
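
As a minimal sketch of how that matrix can be used, the snippet below compares a list of required packages against the installed ones. The package list and the helper name `missing_packages` are assumptions made for illustration; substitute the packages named in your setup instructions.

```r
# Packages this sketch assumes the workshop needs (hypothetical list;
# use the one from your setup instructions).
required <- c("rmarkdown", "knitr", "rticles")

# Return the names in `pkgs` that are not rows of the installed-packages matrix.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)

if (length(to_install) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

The install step is left commented out so that running the check never changes your library by accident.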

Starting an R Markdown File

Start a new R Markdown document in RStudio by clicking File > New File > R Markdown…

Opening a new R Markdown document

Tip: Bonus! Note about R Notebooks:

You may have noticed that the menu offers the option to create an R Notebook, which is essentially an interactive execution mode for R Markdown documents. Technically, R Markdown is a file, whereas R Notebook is a way to work with R Markdown files. R Notebooks do not have their own file format, they all use .Rmd. All R Notebooks can be ‘knitted’ to R Markdown outputs, and all R Markdown documents can be interfaced as a Notebook.

If this is the first time you have ever opened an R Markdown file a dialog box will open up to tell you what packages need to be installed. You shouldn’t see the dialog box if you installed these packages before the workshop.

For Mac: First time R Markdown install packages dialog box mac

For Windows:

First time R Markdown install packages dialog box windows

Click “Yes”. The packages will take a few seconds (to a few minutes) to install. You should see that each package was installed successfully in the dialog box.

Once the package installs have completed, a dialog box will pop up and ask you to name the file and add an author name (RStudio may already know what your name is). The default output is HTML and, as the wizard indicates, it is the best way to start; in your final version or later versions you have the option of changing to PDF or Word document (among many other output formats! We’ll see this later).

Naming your new R Markdown Document

Name new .Rmd file

New R Markdown files will have a generic template unless you click “Create Empty Document” in the bottom left-hand corner of the dialog box.

Note that you have the option to select “Use the current date when rendering the document”. If you do, the date is generated dynamically each time you knit your document: the .Rmd creation date is replaced with the inline R expression “`r Sys.Date()`”. You may also consider exploring changing date formats following these tips.
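
To illustrate what changing the date format can look like, here is a short sketch using base R's format() on a Date. The fixed example date and the particular format strings are assumptions chosen for illustration, not the formats recommended in the linked tips.

```r
# A fixed date is used so the example does not depend on today's date.
example_date <- as.Date("2021-06-25")

# ISO 8601 style (year-month-day)
format(example_date, "%Y-%m-%d")

# Day/month/year with numeric fields
format(example_date, "%d/%m/%Y")

# Full month name; the month is spelled according to your system's locale
format(example_date, "%B %d, %Y")

# In the YAML header, the same idea applied to the rendering date would be
# written as an inline R expression, for example:
#   date: "`r format(Sys.Date(), '%d/%m/%Y')`"
```

Numeric formats are the safer choice when collaborators work in different languages, because month names depend on the locale of the machine that knits the document.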

If you see this default text you’re good to go: .Rmd new file generic template

Visual Editor vs. Source Editor

RStudio released a major update to their IDE in January 2021, which includes a new “visual editor” for R Markdown to supplement their original editor (which we will call the source editor) for authoring with R Markdown syntax. The visual editor is friendlier, with a graphical user interface similar to Word or Google Docs that lets you choose styling options from the menu (before, you had to either have the R Markdown syntax memorized or look it up for each of your styling choices). Another major benefit is that the new editor renders the R Markdown styling in real time, so you can preview your paper before rendering to your output format.

Source Editor

The image below displays the default R Markdown template in the “source editor” mode. Notice the symbols scattered throughout the text (#, *, <>). Those are examples of R Markdown syntax, which is a flavor of Markdown syntax, an easy and quick, human-readable markup language for document styling.

Add image source editor

CHALLENGE 2.2 - Formatting with Symbols (optional)

In Rmd, certain symbols are used to denote formatting that should happen to the text (after we “knit” or render). Before we knit, these symbols will show up seemingly “randomly” throughout the text and don’t contribute to the narrative in a logical way. In the template Rmd document, there are three types of such symbols (##, **, <>). Each symbol represents a different kind of formatting (think of the text formatting buttons you use in Word). Can you deduce from the surrounding text how these symbols format it?

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

SOLUTION

## is a heading, ** bolds the enclosed text, and <> is for hyperlinks. Don’t worry about this too much right now! This is an example of R Markdown syntax for styling. You won’t need it if you stick to the visual editor, but we recommend getting at least a basic understanding of R Markdown syntax if you plan to work with .Rmd documents frequently.

Switch to the visual editor

The new visual editor is accessible through a small button on the far right side of the script/document pane in RStudio. The icon is a protractor, but from further away it just looks like a squiggly “A”. See the image below to find the visual editor button; it isn’t the most obvious!

change to visual editor

Thankfully, the newer versions of RStudio (2022 onward) have made it easier to find the button to change to visual mode:

new visual editor

Visual Editor

We’ve already touched on the visual editor and its useful features, but now that we’ve switched to the visual editor, take another look at your document and see what’s changed. You’ll notice that formatting elements like headings, hyperlinks, and bold have been generated automatically, giving us a preview of how our text will render. However, the visual editor does not run any code automatically; we’ll have to do that manually (we will learn how later on).

Add image visual editor

We will proceed using the visual editor during this workshop as it is more user-friendly and allows us to talk about styling without needing to teach the whole R Markdown syntax system. However, we highly encourage you to become familiar with markdown syntax (specifically the R Markdown flavor) as it increases your abilities to format and style your paper without relying on the visual editor options.

Tip: Resources to learn R Markdown

If you want to learn how to use the source editor (as we call it), please see the Pandoc Markdown Documentation. You will need to know Markdown formatting (specifically R-flavored Markdown).

Now we’ll get into how our R Markdown file & workflow is organized and then on to editing and styling!

Key Points

  • RStudio has four panes to organize your code and environment.

  • Manage packages in RStudio using specific functions.

  • R Markdown documents combine text and code.


Introduction to Working with R Markdown Files

Overview

Teaching: 20 min
Exercises: 5 min
Questions
  • What is the breakdown of an R Markdown file?

  • What are templates in R Markdown?

  • How can you render the input file to the specified output format?

  • How can you find existing templates for R Markdown files?

Objectives
  • Learn about the structure of an R Markdown file.

  • Learn how an R Markdown file works.

  • Learn how to knit/render a .Rmd file into an output format.

  • Understand what templates are and the advantage of using them.

  • Learn how to start a document from a template.

Anatomy of an R Markdown File

The key to our reproducible workflow is using R Markdown files in RStudio rather than basic scripts to dynamically “knit” both code and paper narrative. So let’s do a quick anatomy lesson on the components of an R Markdown file (YAML header, R Markdown-formatted text, R code chunks) and how to render them into our final formatted document. There are four distinct steps in the R Markdown workflow:

  1. create a YAML header (optional)
  2. write R Markdown-formatted text
  3. add R code chunks for embedded analysis
  4. render the document with knitr

R Markdown Workflow

Let’s dig in to those more:

1. YAML header:

What is YAML anyway?

YAML, pronounced “Yeah-mul”, stands for “YAML Ain’t Markup Language”. YAML is a human-readable data-serialization language which, as its name suggests, is not a markup language. YAML has no executable commands, though it is compatible with all programming languages and virtually any application that deals with storing or transmitting data. YAML itself is made up of bits of many languages including Perl, MIME, C, & HTML. YAML is also a superset of JSON. When used as a stand-alone file, the file ending is .yml or .yaml.

R Markdown’s default YAML header includes the following metadata surrounded by three dashes ---:

yaml highlighted in Rmd document

The first three (title, author, and date) are self-explanatory, but what’s the output? We saw this in the wizard for starting a new document: by default you are able to pick from PDF, HTML, and Word document. Basically, this allows you to export your Rmd file as a file type of your choice. There are other options for output, and even more can be added by installing certain packages, but these are the three default options.

We’ll see other formatting options for YAML later on, including how to add bibliography information, customize our output, and change the default settings of the knit function. Below is an example of how our YAML header will look at the end of this workshop.
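
For orientation, a header of the kind the new-document wizard generates might look like the sketch below. The title, author, and date values are placeholders chosen for illustration; they are not transcribed from the lesson's screenshot.

```yaml
---
title: "Untitled"
author: "Your Name"
date: "2021-06-25"
output: html_document
---
```

Changing the last field to pdf_document or word_document selects one of the other two default output types.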

---
title: "Data Article: Trier social stress test and food-choice: Behavioral, self-report & hormonal data"
author: "Felix Jan Nitsch; Manuela Sellitto; Tobias Kalenscher"
date: "June, 25 2021"
output:
  html_document:
    df_print: paged
bibliography: references.bib
knit: (function(rmdFile, encoding) {
      out_dir <- '../output';
      rmarkdown::render(rmdFile,
                        encoding=encoding,
                        output_file=file.path(dirname(rmdFile),
                        out_dir,
                        'DataPaper-ReproducibilityWorkshop.html'))})
---
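
The knit field in this header is the least obvious part: it replaces the default Knit behavior with a function that sends the rendered HTML to a sibling output folder. The sketch below isolates only the path arithmetic so you can see where the file would land. The helper name `output_path_for` and the example location `report/paper.Rmd` are hypothetical.

```r
# Mirror of the path logic inside the custom knit function above; the
# function name `output_path_for` is hypothetical.
output_path_for <- function(rmd_file,
                            out_dir = "../output",
                            out_name = "DataPaper-ReproducibilityWorkshop.html") {
  # dirname() gives the folder holding the .Rmd; "../output" then steps up
  # one level and into a folder called "output".
  file.path(dirname(rmd_file), out_dir, out_name)
}

# Hypothetical location of the .Rmd file inside a project
output_path_for("report/paper.Rmd")
```

For an .Rmd stored in a folder named report, the rendered file is therefore written next to that folder, in output, rather than beside the source document.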

2. Formatted text:

This one is simple: it’s just narrative text formatted using Markdown (more on Markdown syntax later). Markdown-formatted text is one of the benefits added above and beyond the capabilities of a regular R script. Any text section will have the default white background in the Rmd document. The two file types treat plain text in opposite ways. In a regular R script, everything is read as code unless it follows a #, which starts a comment. In R Markdown, plain text is simply narrative that appears in the document, and it is the code that you need to enclose in special characters. Any symbols you do see that aren’t regular grammar components are for formatting, such as ##, ** **, and < >.

Tip: Bonus! You can use a variety of languages to format text and images in R Markdown:

  • R Markdown
  • HTML
  • LaTeX
  • CSS

Rmd template text

3. Code Chunks:

R code chunks appear highlighted in gray throughout the Rmd document. In source mode they are surrounded by three backticks on either side (```), with the opening backticks followed by curly brackets {} that contain some other code. The backticks mark the start and end of a code section, and the contents of the curly brackets {} indicate how R should read and display the code (more on this in the knitr syntax episodes). These are the sections where you add R code such as summary statistics, analysis, tables, and plots. If you’ve already written an R script, you can copy and paste your code between the few lines of required formatting to embed and run whichever piece you want at that particular spot in the document.
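
To make the three parts concrete, the sketch below assembles a tiny .Rmd file from R and prints it as it would look in the source editor. The document title, the chunk label `cars-summary`, and the temporary file name are assumptions made for illustration; `cars` is a dataset that ships with R.

```r
# Three backticks, built with strrep() so this example does not contain a
# literal code fence of its own.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                         # YAML header starts
  'title: "Chunk demo"',
  "output: html_document",
  "---",                                         # YAML header ends
  "",
  "Some narrative text introducing the analysis.",
  "",
  paste0(fence, "{r cars-summary, echo=TRUE}"),  # opening backticks + chunk header
  "summary(cars)",                               # R code run when knitting
  fence                                          # closing backticks
)

# Write the document to a temporary file (hypothetical name and location).
rmd_file <- file.path(tempdir(), "chunk_demo.Rmd")
writeLines(rmd_lines, rmd_file)

# Show the file as it would appear in the source editor.
cat(readLines(rmd_file), sep = "\n")
```

Inside the curly brackets, `r` names the language, `cars-summary` is an optional label, and `echo=TRUE` is one example of an option controlling how the code is displayed.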

Tip: Bonus! You may code with many different languages in RStudio:

  • R
  • Python
  • Bash
  • SQL

A complete list of compatible languages can be found at: https://rmarkdown.rstudio.com/lesson-5.html

rmd template code

4. Rendering your Rmd document:

This is called “knitting”, and the button looks like a spool of yarn with a knitting needle. Clicking the Knit button runs the code, stops if it hits an error, and finally outputs the type of file indicated in your YAML header. One nice thing about the Knit button is that it saves the .Rmd document each time you run it. If there are any errors in the document, it may not run and render to the output you indicated, so knitting also functions somewhat as a code checker.

Try it yourself

We’re going to pause here and see what the R Markdown does when it’s rendered. We’ll just use the generic template, but when we’re working on our own project, knitting periodically while we’re editing allows us to catch errors early. We’ll continue rendering our Rmd throughout the lesson to see what happens when we add our markdown and knitr syntax and to make sure we aren’t making any errors.

This is a little preview of what’s to come in the later episodes: Click the “knit” button.

Add or update image

Before you can render your document, you’ll need to give it a file name and choose what folder you want to save it to. Choose my_first_rmd.rmd as your file name and save it to an easily accessible directory in your file system.

This is how our html document will render after clicking the knit button and choosing a file name: Knit html output

CHALLENGE 3.1 - Rendering the document in another format

Suppose you want this .Rmd document to render as a word document. What options would you have?

Solution:

1) You may change the output format in the YAML to word_document, or 2) select “Knit” > “Knit to Word” on the menu.
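
The same choice can also be made from the console. The following is a minimal sketch, assuming the rmarkdown package is installed and the file saved earlier sits in the current working directory; `render_as` is a hypothetical helper name, while rmarkdown::render() does the actual work.

```r
# Render an .Rmd file to a chosen format. `render_as` is a hypothetical
# helper name; rmarkdown::render() does the actual work.
render_as <- function(rmd_file, format = "word_document") {
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("The rmarkdown package is not installed; nothing rendered.")
    return(invisible(NULL))
  }
  if (!file.exists(rmd_file)) {
    message("File not found: ", rmd_file)
    return(invisible(NULL))
  }
  # output_format takes precedence over the `output` field of the YAML header
  rmarkdown::render(rmd_file, output_format = format)
}

# Assumes the file saved earlier is in the current working directory.
render_as("my_first_rmd.rmd")
```

Passing "html_document" or "pdf_document" instead renders the other default formats without editing the header; PDF output additionally needs a LaTeX installation.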

Finding and Applying Existing Templates

We have learned how to start a new document in RStudio, and we will learn good practices for project organization next. But let’s say you are writing a paper and you already know which journal you are submitting it to. Writing it in your own style and then formatting prior to submission is time-consuming, right? The good news is that RStudio makes our lives easier. Through a package called “rticles” you can access a number of existing journals’ templates that will let you easily and quickly format and prepare your paper draft for peer review. Even if the journal you submit to does not have a template, it may be good to review several of the templates to get an idea of the formatting options available to you in R Markdown.

Let’s take a look at that! In RStudio, load the rticles package by using the function library(rticles) (remember, we’ve already installed the package earlier!). Once you’ve loaded the package, use the plus icon at the upper-left side of your screen to create a new document or proceed with File > New File > R Markdown. This will open the window for creating a new R Markdown document as we saw earlier.

Clicking on “From Template” will show a couple of dozen templates listed as {rticles}. Let’s choose the Biometrics Journal template and then click OK.

Rticles Templates (Step 2)

Note that along with the skeleton of the paper you will see a message on top indicating additional packages you may need to install for that particular template.
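
The dialog steps above can also be sketched from the console. This assumes the rticles package is installed; the helper name `start_article`, the file name `my_article.Rmd`, and the template name "biometrics" are assumptions for illustration, so check the template name against the list printed by rticles::journals() before relying on it.

```r
# Draft a new document from an rticles template, if the package is
# installed. `start_article` is a hypothetical helper name.
start_article <- function(file, template = "biometrics") {
  if (!requireNamespace("rticles", quietly = TRUE)) {
    message("The rticles package is not installed; no draft created.")
    return(invisible(NULL))
  }
  # edit = FALSE creates the files without opening them in the editor
  rmarkdown::draft(file, template = template, package = "rticles",
                   edit = FALSE)
}

if (requireNamespace("rticles", quietly = TRUE)) {
  print(rticles::journals())  # names of the available journal templates
}

# Hypothetical file name, created in a temporary folder:
# start_article(file.path(tempdir(), "my_article.Rmd"))
```

The drafting call is left commented out so that running the snippet only lists templates and does not create files until you decide where the draft should live.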

Tip: Create your own template

Please remember that for this workshop we are producing a report in HTML that is not tied to a particular journal template. You may choose other output formats such as Word or PDF. Creating templates and adding other templates is beyond the scope of this workshop, but that is also possible. If you submit to the same journal frequently or use the same formatting for many of your publications, it may be worth creating your own template to save time. To learn more about how you can create templates in RStudio:

Challenge: Find a template (optional):

Find the template for the Bioinformatics Journal. What does the template look like? What sections does it contain?

Discussion: What may be the pros and cons of using an existing template? (optional)

Solution:

Pros:

  • Formatting papers according to journals’ guidelines can be very cumbersome and time-consuming. So, using a template for a specific journal will save you time!

Cons:

  • If, along the way, you change your mind about the journal you were planning to submit to, there is no easy conversion to another template. Overwriting will cause problems.
  • There are only a few journal titles available.

Tips:

  • Always check if the template meets the most updated guidelines on the journal website. Since the rticles package is maintained by a community, we advise you to check their GitHub page (https://github.com/rstudio/rticles) for more details.
  • Did not find a particular template? You can recommend one to the community or become a contributor!

Key Points

  • An R Markdown file is composed of a YAML header, formatted text in Rmd, and code chunks.

  • The knit function renders the file into the chosen output format.

  • RStudio has some journals’ templates that can save you some formatting time, or you can make your own for frequent submissions.


Good Practices for Managing Projects in RStudio

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What are good research project management practices?

  • What is an R Project file?

  • How do you start a new or open an existing R Project?

  • How do you use version control to keep track of your work?

Objectives
  • Best practices for working on research projects involving data.

  • The purpose of using .Rproj files.

  • Using version control in RStudio.

  • Starting or continuing an R project.

Managing Research Projects in R

Now that we’ve learned some of the basics of authoring in RStudio with R Markdown documents, let’s take a step back and talk about research project management as a whole.

The ability to integrate code and narratives is a major advantage of the RStudio environment, especially considering the scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything ends up a bit mixed together. To complicate things further, we are often working with other collaborators, lab members, graduate students, faculty from the same or different institutions, which makes it that much more difficult to keep projects organized. When you throw data into the mix (sometimes huge amounts of it!), it’s integral to use best practices to maintain the integrity of your analysis and to be able to publish high quality and reproducible research. Using R Markdown is a powerful tool, but it can’t be fully utilized unless your project documents, scripts and other files are well-organized. So let’s take a look at RStudio’s features to manage projects and discuss some of the best practices when working with data and collaborators.

Research Project Stress Points

We often have organizational or logistical stress points in our research that may become breaking points, especially when it comes to working with collaborators, returning to a project after a hiatus, or dealing with data and scripts. Let’s discuss three of those common stress points:

Discussion

To what extent do these stress points affect your research projects? Are there additional issues that you’ve encountered that slow down or derail your work due to issues with project management?

Discussion: Antidotes

What are some practices you implement to keep your project materials organized?

Antidotes

A good project layout will ultimately make your life easier:

We’ll discuss three aspects of project management and then implement those practices for the remainder of this workshop in the RStudio environment.

  1. File/Folder Organization
  2. Storage & Sharing
  3. Using Version Control

Then, we’ll get started on our project!

Project File/Folder Organization

Important principles:

Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:

Practice good file-organization

Good Enough Practices for Scientific Computing gives the following recommendations for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.
  6. Additionally, we’d recommend including README, LICENSE, and CITATION files!
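The recommendations above can be sketched as a small base R helper that sets up an empty project skeleton. This is only an illustration: the function name create_project_skeleton and the demo folder name are hypothetical, and the .md extensions are an assumption borrowed from the workshop project.

```r
# Hypothetical helper: create the directories and top-level files recommended above.
create_project_skeleton <- function(root) {
  subdirs <- c("doc", "data", "results", "src", "bin")
  for (d in subdirs) {
    # recursive = TRUE also creates the project directory itself if needed
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  top_files <- c("README.md", "LICENSE.md", "CITATION.md")
  for (f in top_files) {
    path <- file.path(root, f)
    # Never overwrite a file that already exists
    if (!file.exists(path)) file.create(path)
  }
  invisible(root)
}

# Try it out in a temporary directory so nothing real is touched
demo_root <- file.path(tempdir(), "my-project")
create_project_skeleton(demo_root)
created <- list.files(demo_root)
print(created)
```

Running the helper a second time is harmless, because existing directories and files are left alone.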

For our project we’re working in today, we used the following setup for folders and files:

directory tree

Challenge 4.1: Take a few minutes to look through the workshop project files

Please take some time to look through the project files. You may use the screenshot above or browse the files on GitHub at <https://github.com/UCSBCarpentry/Reproducible-rmd>. What does each of the directories (folders) contain? What is its purpose?

See the solution drop-down for an explanation of each directory’s contents.

Solution:

  • code: contains the scripts that generate the plots and analysis (found in output/plots)
    • /functions: contains custom functions written for the data pre-processing
  • data: this folder contains the raw and cleaned data files
    • /foodchoice_data: contains the individual data files from food choice trials
  • output: contains processed/transformed data and all plots generated
    • /data: contains the output data file after applying custom pre-processing function
    • /plots: contains pdfs of the plots generated from the plot scripts in the code folder
  • report: all files needed for the publication of the research project
    • /source: .Rmd file for the paper and additional files needed for rendering the paper
    • /fig: contains the images created specifically (not through the analysis scripts) for the paper
    • /output: contains the final output of the Rmd paper
  • R-repro-pub.Rproj: the R project file that lives in the root directory.
  • README.md: a detailed project description with all collaborators listed.
  • CITATION.md: directions to cite the project.
  • LICENSE.md: instructions on how the project or any components can be reused.

Practice good file-naming

The three principles of file-naming are:

  1. Machine-readable
    • Friendly for searching (using regular expressions/globbing)
    • No spaces, unsupported punctuation, accented characters, or case-sensitive file names
      • Friendly for computing
      • Deliberate use of delimiters (e.g. for splitting file names)
    • e.g. data-analyses-fig1.R with - used consistently as a separator
  2. Human-readable
    • Name contains a brief description of the content
    • Borrow from clean URL practices:
      • “slug”, i.e. the part of a URL that is human-readable
    • e.g. data-analyses-fig1.R
  3. Plays nice with default ordering
    • Use chronological or logical order:
      • chronological: filename starts with the date.
      • logical: filename starts with a number or keyword/number combo.
        • e.g. 01_data_preprocessing.R (see the code directory)
        • e.g. CC-101_1_data.csv (see the data directory)

Adapted from https://datacarpentry.org/rr-organization1/01-file-naming/index.html. For more tips on file naming, check: The Dos and Don’ts of File Naming.

Challenge 4.2: File name syntax

Given the filename CC-101_1_data.csv and 2022-01-01_data_analyses.R, why does it make sense to use both - and _ as delimiters/separators?

Solution:

In CC-101_1_data.csv, the - is part of the keyword that is shared between a number of files. The _ separates the keyword from the trial number and the description. If one were to split the filename on the _, the keyword would be kept intact and the trial number would be separated out. In 2022-01-01_data_analyses.R, - is the delimiter within the date, between year, month, and day, and _ is used between the remaining parts. This allows us to split on _, which preserves the date (separate from the other file info).

It’s good to strategize on the best way to name files to anticipate future uses of the information contained within the filename.
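As a quick illustration of why the two delimiters pay off, the sketch below splits such file names on _ in base R. The second file name (CC-102_2_data.csv) is made up for the example, and the column names are our own choice.

```r
# File names following the keyword_trial_description pattern
# (the second name is a hypothetical example)
file_names <- c("CC-101_1_data.csv", "CC-102_2_data.csv")

# Drop the extension, then split on "_" only; the "-" inside the keyword survives
stems <- tools::file_path_sans_ext(file_names)
parts <- strsplit(stems, "_", fixed = TRUE)

parsed <- data.frame(
  keyword = vapply(parts, `[`, character(1), 1),
  trial = as.integer(vapply(parts, `[`, character(1), 2)),
  description = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(parsed)

# The same split keeps an ISO date in one piece, ready to be parsed as a date
dated <- strsplit("2022-01-01_data_analyses", "_", fixed = TRUE)[[1]]
file_date <- as.Date(dated[1])
print(file_date)
```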

Use relative paths

This goes hand-in-hand with keeping your project within one “root” directory. If you use complete paths to say, read in your data to RStudio and then share your code with a collaborator, they won’t be able to run it because the complete path you used is unique to your system and they will receive an error that the file is not found. That is why one should always use relative paths to link to other files in the project. I.e. “where is my data file in relation to the script I’m reading the data into?” The practice of using relative paths is made easier by having a logical directory set up and keeping all project files within one root project folder.

Assuming your R script is in a code directory and your data file is in a data directory then an example of a relative path to read your data would be:

df <- read.csv("../data/foodchoice_budgetlines.csv", encoding = "UTF-8")

whereas a complete path might look like:

df <- read.csv("C:/users/flintstone/wilma/Desktop/project23/data/foodchoice_budgetlines.csv", encoding = "UTF-8")

In the complete path example you can see that the code is not going to be portable. If someone other than Wilma Flintstone wanted to run the R script, they would have to alter the path to match their system.
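To see a relative path at work without touching your own files, here is a minimal sketch that builds a throwaway project in a temporary directory. The project and file names are placeholders, and the script is assumed to run from the code directory.

```r
# Build a throwaway project layout (illustrative names only)
project_root <- file.path(tempdir(), "project23")
dir.create(file.path(project_root, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "data"), showWarnings = FALSE)
write.csv(data.frame(id = 1:3),
          file.path(project_root, "data", "example.csv"),
          row.names = FALSE)

# Pretend our script lives in code/ and is run from there
old_wd <- setwd(file.path(project_root, "code"))

# file.path() assembles the relative path with the right separators
relative_path <- file.path("..", "data", "example.csv")
df <- read.csv(relative_path)

# Restore the previous working directory
setwd(old_wd)
print(relative_path)
```

Because the path only describes how to get from code to data, the same line works wherever the project folder is copied.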

Challenge 4.3: relative paths

What would be the relative path needed to refer to the bronars.pdf plot (located in the plots directory) from R-repro-pub.Rproj (located in the root directory)? What is the inverse relative path?

Solution:

R-repro-pub.Rproj to bronars.pdf: “output/plots/bronars.pdf”

bronars.pdf to R-repro-pub.Rproj: “../..”. Each “..” points to the directory that contains the current one, so two of them lead from output/plots back to the root directory.

Tip: Level up your relative paths

We just discussed how relative paths are a better practice when coding because our code is far more likely to work on somebody else’s system. However, relative paths can still be quite confusing to deal with, especially when you have many sub-directories in your project. One way to make things a bit easier on ourselves is to make sure the starting point of every relative path is always the same. We can employ a package called “here” to do this. The here package has a function that always references the root (or top-level) directory of your project (i.e. where the .Rproj file lives). Conveniently, that function is called here(). here() gives us a consistent starting path when building relative paths. We’ll see how this is used later in the lesson.
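The idea behind here() can be sketched in base R: start somewhere inside the project and walk upwards until a folder containing an .Rproj file is found. The function find_project_root below is a hypothetical, simplified illustration of that idea, not the package’s actual implementation, which checks more criteria.

```r
# Hypothetical helper: walk up from `start` until a directory with an .Rproj file is found
find_project_root <- function(start = getwd()) {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (length(list.files(current, pattern = "\\.Rproj$")) > 0) {
      return(current)
    }
    parent <- dirname(current)
    # dirname() of the file system root returns the root itself: nothing left to search
    if (identical(parent, current)) {
      stop("No .Rproj file found above ", start)
    }
    current <- parent
  }
}

# Demo in a temporary project with nested sub-directories
demo_root <- file.path(tempdir(), "demo-root")
dir.create(file.path(demo_root, "code", "functions"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "demo.Rproj"))

found <- find_project_root(file.path(demo_root, "code", "functions"))

# Paths are then always built from the same starting point
data_path <- file.path(found, "data", "foodchoice_budgetlines.csv")
print(found)
```

With the here package installed, the equivalent of the last step would be here::here("data", "foodchoice_budgetlines.csv"), regardless of which sub-directory the script sits in.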

Treat data as read only

This is probably the most important goal of setting up a project. Data is typically time-consuming and/or expensive to collect. Working with it interactively (e.g., in Excel or R) where it can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”. However, in many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. Storing the preprocessing scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets. You should have separate folders for each: raw data, code, and output data/analyses. You wouldn’t mix your clean laundry with your dirty laundry, right?

Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts. There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.
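The two principles above (read-only raw data, disposable output) can be sketched together. In this minimal example, which uses made-up file names and a toy data set in a temporary folder, the raw file is only ever read, the cleaned copy goes to a separate output folder that can be deleted and regenerated, and a checksum confirms the raw file was left untouched.

```r
# Toy project in a temporary directory (all names are placeholders)
project <- file.path(tempdir(), "read-only-demo")
raw_dir <- file.path(project, "data")
out_dir <- file.path(project, "output", "data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Pretend this file came straight from data collection
raw_file <- file.path(raw_dir, "trial_raw.csv")
write.csv(data.frame(id = c(1, 2, 3), choice = c("a", NA, "b")),
          raw_file, row.names = FALSE)
checksum_before <- unname(tools::md5sum(raw_file))

# Cleaning happens in memory; the raw file is read, never written
raw <- read.csv(raw_file)
clean <- raw[!is.na(raw$choice), ]

# Output is disposable: wipe it and regenerate it from the script
unlink(out_dir, recursive = TRUE)
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
clean_file <- file.path(out_dir, "trial_clean.csv")
write.csv(clean, clean_file, row.names = FALSE)

# The raw file should be byte-for-byte identical to what we started with
checksum_after <- unname(tools::md5sum(raw_file))
print(identical(checksum_before, checksum_after))
```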

Include a README file

For more information about the README file and a customizable template, check this handout. Make sure to include citation and license information for both your data (see Creative Commons licenses) and your software (see license types on GitHub). This information will be critical for others to reuse and correctly attribute your work. You may also consider adding separate citation and license files to your project folder.

Again, there are no hard and fast rules here, but remember, it is important at least to keep your raw data files separate and to make sure they don’t get overwritten after you use a script to clean your data. It’s also very helpful to keep the different files generated by your analysis organized in a folder.

*what’s this .Rproj file? We’ll explain in a bit.

Storage and Sharing

Backup your work

Having a solid backup plan in case of emergencies (say the hard drive on your computer fails) is essential. The general guideline for backups is to adhere to the 3-2-1 principle, which dictates that you should have 3 copies, on 2 different media, with 1 copy offsite. Your decision on backups will be based on your own personal risk tolerance, but at a minimum we recommend never keeping the only copy of your project on a single personal, work, or lab computer.

At the very least, you should backup your project into cloud storage (either provided by your university or paid for yourself). Common cloud storage platforms include Google drive, Box, OneDrive, Dropbox, etc. Backing up a project on a local device to cloud storage allows you to meet two of the 3-2-1 criteria (2 different media and 1 offsite). If you’re working with at least one collaborator and they also keep an up-to-date copy of the project on their computer, you’re set!

Version Control hosting services

If your research project involves code, the best way to make sure you have your work backed up AND keep track of your code and data is to use a version control hosting service such as GitHub - though we’d recommend using version control for any large projects.

The three main version control hosting services are GitHub, GitLab, and Bitbucket. To see how the available options compare, see this comparison on LinkedIn.

We will proceed using GitHub because it is the most used version control platform to date.

Using Version Control

Ok, now let’s talk about implementing version control in your project through RStudio! But first… let’s quickly clarify the difference between Git and GitHub. We already said that GitHub is the version control hosting platform. Git is the version control system and does not have to be used with GitHub. You can use Git and then host your code on Bitbucket for example, or save to your Google drive. In fact, you can use Git on your local system only and never save it to a cloud storage platform. However, version control hosting platforms such as GitHub enhance the benefits of version control and offer incredible collaboration features. The difference between the two can be a bit confusing because they are so often used together, but the more you use them the more it will make sense. Soon enough you’ll be wondering how you even completed a code project without version control.

There are actually many ways to use Git. You could use it on GitHub only (though that suffers from a lack of options and is a bit clunky), there is a desktop interface, and many serious programmers use it on the command line. However, RStudio has Git controls built in, so we’ll use it there, all in one place!

Before we use Git in an RStudio project, we must have an R Project file (.Rproj), so let’s talk about how R Projects work in RStudio.

Who has used R Projects before?

Working in R Projects

One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using an R project today to complement our R Markdown document and bundle all the files needed for our paper into one self-contained, reproducible bundle. An .Rproj file helps keep your R scripts, data and other files together: just navigate through your file system to get to your project directory and double click on the .Rproj file. The added benefit is that the .Rproj file will automatically open RStudio, start your R session in the same directory as the .Rproj file, and remember exactly where you left off. .Rproj files are a powerful way to stay organized on their own, but they also unlock the additional benefit of being able to use Git within RStudio.

Challenge 4.4: R Project in “root” folder

.Rproj files must be in the root directory of your project folder/directory. What is the root directory again (look back at the relative paths intro)?

Key Points

  • Use best practices for file and folder organization. This includes using relative file paths as opposed to complete file paths.

  • Make sure that all data are backed up on multiple devices and that you treat raw data as read-only.

  • We can use Git and GitHub to keep track of what we’ve done in the past, and what we plan to do in the future.

  • Rproj files are pivotal to keeping everything bundled and organized.


Getting Your project set up with Version Control in RStudio

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How do I start or continue a project with Git versioning?

  • What are the features in the RStudio Interface for working with Git?

  • What are the basics of the Git versioning workflow?

Objectives
  • Copy an existing project on GitHub to make contributions

  • Open a project with Git versioning in RStudio

  • Learn the basics of Git - pull, add, commit, push

  • Make our first edits in a version-controlled project

Using R projects and Version Control in RStudio

Using version control is a powerful feature to make your research more reproducible and better organized. In order to use versioning while working in RStudio the first step is to make sure your work is set up as an R Project, because you may not use the versioning features in RStudio without one. There are three options for doing this depending on your given scenario.

Three Methods of Setting up Versioning with an R Project

There are several options for working with R projects in RStudio. If you aren’t already working in an R Project, you can create a new one. There are three options here:

  1. New Directory - start a brand new R project (with the option of version control).
  2. Existing Directory - add existing work to an R project (with the option of setting up version control).
  3. Version Control - continue an existing R project that already uses version control (i.e. download from GitHub).

new r project options

The third option would be a project already under version control but options 1 and 2 will also give you the opportunity to use or add versioning to the project. Let’s see how that would work.

Starting an R Project with Version Control

To start an R project, navigate to File > New Project rather than File > New File.

New directory

After choosing New Directory, choose New Project on the next menu.

Then, to use version control, make sure to check the “Create a git repository” box as highlighted in this screenshot:

new project w/ version control

*Note that when you choose a directory name, RStudio will create a new directory inside the directory you specified, along with an .Rproj file of the same name. Avoid spaces here. Underscores “_”, dashes “-”, or camel case “NewProject” are the recommended ways to name this directory/file.

*Optionally, check the box in the bottom left corner “Open in new session” if you want it to appear in a new RStudio window.

Add versioning to an existing project

existing project

We won’t take the time to cover this here, but if you’ve already started an R project WITHOUT version control, you have the option to add version control retrospectively. You can also add existing R files to a project and setup version control if you’ve done neither. To see a tutorial of this process, please see episode 14 “Using Git from RStudio” in Version Control with Git.

This is by far the most labor intensive way to do it, so remember to add version control at the beginning of any new project.

Continue a version-controlled project

version controlled

The final option is to continue a version-controlled project. This is the option we will use for our workshop.

First, indicate which version control system you will be using (Subversion is another version control system, though less popular than Git).

Git or Subversion

When you choose this option there will be a place to paste the URL of the GitHub (or other hosting platform) repository. The name of the repository will populate automatically. Just choose the directory on your computer where you wish to save the project, and you’re good to go!

continue project from GitHub

Our turn!

We have a repository already prepared for this workshop at https://github.com/carpentries-incubator/Reproducible-Publications-with-RStudio-Example. We are going to use the third option to download a repository from GitHub and work hands-on. This is a prerequisite if you would like to follow along. Let’s take a second to acquaint ourselves with GitHub. At this link, you may sign into your GitHub account or create one if you have not already.

code

GitHub

The two main sections are files and directories and the README which should contain a narrative description of the project.

We are each going to make a copy of this repository to use for this workshop. To do so we will do what’s called “forking” on GitHub. A Fork is a copy of a repository that you get to experiment with without disrupting the original project.

In the upper right hand corner of the repository, click on the button that says “Fork” - see highlighted example below:

fork on GitHub

If you are a member of any organizations on GitHub, you will be asked whether you want to fork to your account or to an organization. Choose your personal account for this workshop. GitHub will process for a few moments and voila! You have a copy of the workshop repository.

Now, click on the green Code drop-down and then click on the copy icon next to the repository url:

copy GitHub repository url

Now, let’s return to RStudio:

Click File > New Project > Version Control > Git. Paste in the URL you copied from GitHub and choose “Desktop” as your directory.

start my R project

Woo hoo! We have the project we’re working on for this workshop opened in RStudio and set to use version control!

Git not detected on system path

If you are using Git for the first time in RStudio at this point you may be getting a notification that Git isn’t set up to work with RStudio.

See the solution below:

Solution:

Git not detected on system path

To set it up we need to go to Tools > Global Options and select Git/SVN:

Global Options Git/SVN setup

First, make sure “Enable version control interface for RStudio projects” is checked. Next, you must make sure that the Git executable path is correct. On Macs, the path will most likely have populated automatically, typically as /usr/bin/git. Windows users may find that the correct path is also pre-populated, but you may need to add it manually by clicking “Browse”. Most likely your path will be something like C:/Program Files/Git/bin/git.exe. If not, search for where Git for Windows was installed (the Git folder), go into the bin folder, and select the git.exe file.
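If you are unsure where Git lives on your machine, base R can look it up for you. The short sketch below uses Sys.which(), which searches the system path for an executable; the wording of the messages is our own. An empty result usually means Git is either not installed or not on the path R can see.

```r
# Ask R where the git executable is; returns "" if it cannot be found
git_path <- Sys.which("git")

if (nzchar(git_path)) {
  message("Git found at: ", git_path)
} else {
  message("Git was not found on the system path; locate it manually with 'Browse'.")
}
```

You can run this in the RStudio console and paste the reported path into the Git executable field.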

Ok! Now that we have set that up (by the way, this is a one-time setup that will work for all future projects in RStudio on your device), we should be able to open our project from GitHub in RStudio.

Now, let’s dive in to how to use version control.

Using Version Control in RStudio

There are two places we can interact with Git in the RStudio interface.

  1. Menu bar Git menu bar
  2. Environment/History pane git environment panel

Ok, but what do all the options mean? We won’t go through them all, but here are the basics to get started versioning your project.

Git Workflow

The simplest workflow for version control (working on your computer only) is referred to as “add” and “commit”:

But what do those words even mean?

Add: choosing a file or files to take a “snapshot” of. In other words, which files do you want to add to your next version?

Commit: taking a “snapshot” of a selected version of your project. The snapshot will only include the files you “added”, typically only files that you’ve edited since your last commit.

You may make anywhere from a few to many commits in a single work session.

When you commit, you add a “commit message” aka a short line of text (recommended 50 characters or less) that describes the changes that were made to the file(s) you added. This helps keep your versions organized and makes it easier to go back to remember what you did or to restore your work to exactly the version needed if you make a mistake or want to implement a change.
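As a small aid for the 50-character recommendation, here is a sketch of a helper that checks the first line of a commit message before you use it. The function check_commit_message and its return format are hypothetical; neither Git nor RStudio enforces this limit.

```r
# Hypothetical helper: check the summary line of a commit message against a length limit
check_commit_message <- function(msg, max_chars = 50) {
  # Only the first line counts as the short summary
  summary_line <- strsplit(msg, "\n", fixed = TRUE)[[1]][1]
  if (is.na(summary_line)) summary_line <- ""
  n <- nchar(summary_line)
  list(ok = n > 0 && n <= max_chars, length = n)
}

short_msg <- check_commit_message("Add workshop version to report title")
long_msg <- check_commit_message(
  "Edited the title of the report in the YAML header so that it mentions the workshop"
)
print(short_msg$ok)
print(long_msg$ok)
```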

git add commit workflow

Git Workflow with GitHub

If we are saving our work to a version control hosting platform such as GitHub, our workflow gets a bit more complex: we add a “pull” step at the beginning and a “push” step at the end of a work session.

Pull > add > commit > push

Pull: download the most recent version of the repository from GitHub to your local computer.

Push: upload the most recent version of the repository to GitHub from your local computer.

Put a pin in pulling and pushing for now. For the time being, as we edit our paper, we will just stick to adding and committing. At the end we’ll see how to push to GitHub, and you can experiment with pulling later on.

Tips for working with Git

This pull, add, commit, push routine will become second nature. Pulling at the beginning and pushing at the end of your work session becomes a sort of ritual that marks the beginning and end of your work session.

Your first edit

Now, let’s open up the report in this repository that’s already been drafted. The R Markdown document for the report is located in report/source. It is called DataPaper-ReproducibilityWorkshop.rmd. The first edit we will make is to the YAML header of this draft report so we can practice using version control.

In the title add “(Carpentry Workshop Version)” and make sure to save.

first edit

Now, in the Environment panel, toggle to the Git tab. You’ll see the file that was edited with a checkbox next to it. Click the checkbox to “add” the file. Note that if you edited more than one file you could choose any or all of the documents to “add”.

git panel add

Now, click commit. A dialogue box will pop up. You’ll need to add a commit message to proceed. Add something about editing the title. The difference between your files will show in the bottom panel.

commit in RStudio

Hit commit and a dialogue box will show a completed commit.

You made your first commit!

Discussion: (optional) Utilizing .gitignore files

A .gitignore file is used to signal to Git NOT to track versions of specific files. One instance where this is used in a data analysis project is with data files that are too large to be uploaded to GitHub.

Now, there are some caveats to this, so in what situations would it make sense to add data to the .gitignore and what situations would it not? What else could you imagine you wouldn’t want to track in your research project?

Solution:

Why and when would it be a good idea to add data files to the .gitignore?

  • With raw data files - since they will not be modified (remember: raw data = read only).
  • With sensitive data - This should absolutely not be pushed to GitHub

Why and when would it not make sense to add data files to the .gitignore, so that they remain available in the Git repository?

  • pre-processed data files - these are the data files that are edited - processed from the raw data
  • small data files - may not make much of a difference whether they are tracked or not
  • the first time you add data files - Git can only push files that it tracks, so data files must be committed at least once if you want them on GitHub. Keep in mind that adding a file to .gitignore afterwards does not stop Git from tracking it: .gitignore only applies to files that have never been committed.

Challenge: (optional) Add the data files/directories to .gitignore

Add the data (all of the raw data files) to the .gitignore.
Hint: there are two ways to do this.
Hint2: add a forward slash / after directories.

Solution:

1) Open the .gitignore file by double-clicking on it in the file view pane. On a new line add data/. Save the file and don’t forget to commit it.

.gitignore file

2) Click on the settings gear in the Git tab of the environment pane. Click on gitignore. On a new line add data/ and click save. Don’t forget to commit the .gitignore file.

.gitignore in git pane
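The same edit can also be scripted. The sketch below appends an entry to a .gitignore file only if it is not already listed; the helper name add_to_gitignore is hypothetical, and the demo writes to a temporary folder rather than to your project.

```r
# Hypothetical helper: append an entry to .gitignore unless it is already there
add_to_gitignore <- function(entry, gitignore = ".gitignore") {
  existing <- if (file.exists(gitignore)) readLines(gitignore, warn = FALSE) else character(0)
  if (entry %in% existing) {
    return(invisible(FALSE))  # nothing to do
  }
  writeLines(c(existing, entry), gitignore)
  invisible(TRUE)
}

# Demo on a temporary .gitignore file
demo_ignore <- file.path(tempdir(), "demo-gitignore", ".gitignore")
dir.create(dirname(demo_ignore), recursive = TRUE, showWarnings = FALSE)
unlink(demo_ignore)

first <- add_to_gitignore("data/", demo_ignore)   # adds the entry
second <- add_to_gitignore("data/", demo_ignore)  # already present, nothing added
ignored <- readLines(demo_ignore)
print(ignored)
```

As with the manual routes, the change to .gitignore still needs to be committed.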

Key Points

  • RStudio has Git version control functionality built in.

  • Forking a GitHub repository makes a copy of the repository in your personal account on GitHub.

  • You can clone a Git repository from GitHub to your local disk using RStudio.

  • For this workshop each learner will work with their own fork of the “R-Repro-pub” repository.


Writing and Styling Rmd Documents

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • What is the Visual Editor in RStudio?

  • Which features does the Visual Editor have?

  • How can I apply styling and formatting to Rmd documents in RStudio more easily?

  • How do I add inline code?

Objectives
  • Learn how to enable the visual editor.

  • Get familiar with its basic functionalities.

  • Apply Rmd formatting and styling using the visual editor.

  • Learn how to add inline code to your Rmd document.

Formatting Rmd Documents with the Visual Editor

As we mentioned earlier, the visual editor in RStudio has made R Markdown formatting much more effortless. It provides improved productivity for composing longer-form articles and analyses with R Markdown. Visual markdown editing is available in RStudio v1.4 or higher. Markdown documents can be edited in either source or visual mode. To switch into visual mode for a given document, toggle on the visual option at the top-left of the document toolbar (alternatively, use the ⌘⇧ F4 keyboard shortcut, or Ctrl+Shift+F4 on Windows). This will bring up a formatting bar through which you can apply styling, add links, create tables, and more, similar to the functions you find in Google Docs and other document editors. Note that you can switch between source and visual mode at any time (editing location and undo/redo state will be preserved when you switch). Let’s try it! Feel free to follow along or just watch this quick demo. But first, make sure to have your visual editor enabled on your screen. Also, make sure to open your DataPaper-ReproducibilityWorkshop.rmd file located in the report/source folder.

Editor Toolbar

The editor toolbar includes buttons for the most commonly used formatting commands:

Fig. 6.1 - Toolbar

Additional commands are available on the Format, Insert, and Table menus:

Fig. 6.2 - Menu

Tip: Inserting anything with shortcuts

You can also use the catch-all ⌘ / shortcut to insert just about anything. Just execute the shortcut then type what you want to insert. For example: /lis will prompt listing options.

Applying Emphasis

At the very top of the document we have a recommended citation for the sample data paper (FIXME1). We want to emphasize the title of the journal, “Data in brief”, in italics. Select the text and click on the I icon and voilà! Remember to delete (FIXME1).

In the same citation we have just worked on, let’s now add a link by selecting and copying the DOI address (FIXME2). Then, click on the link icon and paste the address in the URL field. Simple, right? If you prefer, you can also use the drop-down Insert menu, or even use shortcuts. By hovering the mouse over the desired icon, you will see which keys you should use. For a complete list of editing shortcuts, check this link. Tip: if you didn’t intend to use a shortcut and want to reverse its effect, just press the backspace key.

Adding Headings

Adding headings to an R Markdown document in RStudio is as simple as applying links. Let’s say we want the abstract section as a Heading Level 2. We can select “abstract” and then, under “Normal” on the left-hand side of the menu, choose the desired level. Again, all the shortcuts will be listed next to the styling in the menu. Now apply the same heading to keywords and Level 2 to “Specification Table” (FIXME3).

Creating Tables

Because creating tables manually in Rmd documents could be a little painful for beginners, RStudio released an add-in functionality for tables back in 2018. The new visual editor, however, has made the process of creating Rmd tables more similar to other editors we use daily. In our template, we have the specification table with 10 rows and two columns. If we wanted to add that table, we could do so by inserting a table at a selected part of the document and specifying the desired number of rows and columns. Including a caption is optional, but recommended. We can add or delete rows and columns, add a header that will be set in bold by default but can be changed, and set the desired alignment. Select the desired text and click on the crossed T icon if you wish to clear formatting.

Clear Formatting Option

Creating Bullet and Numbered Lists

Again, similarly to other document editors, RStudio allows you to turn text into bullet or numbered lists. Let’s apply a bullet list to the paragraphs specifying the “Values of the Data” reported in the data paper (FIXME4). If we wanted to create a numbered list instead, we could follow the same process and choose the other icon. We can also sink or lift the listed items.

Adding Images

You may need to include static images in your manuscripts. For that, you can use the insert image function, click on the painting icon, or even use the shortcut shown right next to the function in the menu. After browsing and uploading the desired image, you may also specify the caption and the image title, as well as adjust dimensions if needed. Let’s insert two images, Fig. 1 (FIXME5) and Fig. 2 (FIXME6).

Adding Formulas

If you have a math formula in your manuscript, there are three different ways to insert it. Let’s look at (FIXME7) for an example. Point and click in the Insert menu, use the catch-all ⌘ / keyboard shortcut and then choose inline math, or type the formula content between dollar signs $. You will notice that the color and font type change, as RStudio identifies the block as an inline equation.

Keyboard Shortcuts

As you become a more regular RStudio user, you may also consider using keyboard shortcuts for all basic editing tasks. Visual mode supports both traditional keyboard shortcuts (e.g. ⌘ B for bold) and markdown shortcuts (using markdown syntax directly). For example, enclose text in double asterisks (**) to make it bold, or type ## and press space to create a second-level heading. Here are some of the most commonly used shortcuts for Mac users:

Fig. 6.3 - Shortcuts

Tip: Windows users should replace ⌘ in the shortcuts above with Ctrl and ⌥⌘ with Alt + Ctrl.

Other Editing Features

The visual editor allows users to insert images by browsing their location or copying and pasting it to the Rmd document directly. There are also options to add html, line blocks, blockquotes, and footnotes. Up next we will learn more about how to add code chunks. In further episodes we will also learn how to insert citations and create a bibliography.

Time to Commit!

Make sure to commit your changes to GitHub. Add your changed files and commit with the following message: “Added Formatting”

Key Points

  • The visual editor has made formatting much easier.

  • You can apply Rmd styling without prior R Markdown knowledge.

  • You can include inline code to narratives for basic calculations and dynamic information.


Adding Code-Generated Plots and Figures

Overview

Teaching: 50 min
Exercises: 20 min
Questions
  • What is Knitr?

  • What are code chunks and how are they structured?

  • How can you run code from your rmd document?

  • What are global knitr options?

  • What are global chunk options?

Objectives
  • Understand the syntax of a code chunk.

  • Learn how to insert run-able blocks of code to integrate into your report

  • Learn how to source external scripts to run within an rmd document.

  • Learn about using global knitr options and global chunk options

Utilizing the Code Features of R Markdown

We’ve learned about the text-formatting options of R Markdown; now let’s dive into the code portion of R Markdown documents. R Markdown flips around the defaults of code and text. Instead of prioritizing the code and making you comment out (#) text, as an R script does, it prioritizes text and makes you explicitly signal the code portions. How do you signal to R the difference between code and text when you’re not using code comments (#)? That’s where “code chunks” come into play (yes, that’s RStudio’s technical name for them). Code chunks are not handled by the part of R Markdown that turns markdown styling into the final output. Instead, they go through an earlier stage of processing by Knitr, which runs the code and “knits” its output together with the text. Then rmarkdown converts the knitted result into the document format of our choice. For example, Knitr runs the lines of code for a plot in a code chunk and joins the result to the markdown text portions, and rmarkdown outputs that as an html document.
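
To make the first stage concrete, here is a minimal sketch that calls knitr directly on a tiny in-memory document. The document text and the chunk label mean-demo are made up for illustration, and the sketch assumes the knitr package is installed. In everyday use the Knit button runs both stages for you (rmarkdown::render() calls knitr and then hands the resulting markdown to pandoc).

```r
library(knitr)

# Three backticks, built with strrep() so this example can sit inside a code fence.
fence <- strrep("`", 3)

# A tiny, made-up R Markdown document: one sentence and one code chunk.
rmd <- c(
  "The mean of 1, 2 and 3 is computed below.",
  "",
  paste0(fence, "{r mean-demo}"),
  "mean(c(1, 2, 3))",
  fence
)

# Stage 1: knitr runs the chunk and weaves its output into plain markdown.
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

The printed markdown contains the original sentence, the chunk's source code, and the value R computed for it. Converting that markdown to html, Word or PDF is the second stage.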

What is Knitr?

But what is Knitr? Knitr is the engine RStudio uses to create the “dynamic” part of R Markdown reports. Specifically, it is an R package that allows the integration of R code into the HTML, Word, PDF, or LaTeX document you have specified as your output for R Markdown. It uses literate programming to make research more reproducible. There are two main ways to process code with Knitr in R Markdown documents:

  1. Code Chunks
  2. Inline Code

First, we’re going to talk about code chunks, which bring more substantial portions of code, such as figures and plots, into our narrative. There are a plethora of options that become available to us when using code chunks, so this tends to be the more complex part of R Markdown documents. Now, sometimes you just need to do a quick calculation - like a count of total observations in your data or the mean of one of your variables. In those cases, it may not be worth setting up a code chunk to calculate those values, so after code chunks we will see how to add inline code - which allows one to add a quick line of code or a single function to be executed within the text portion of the document. But let’s start with code chunks.

Inserting Code Chunks

Code chunks are better when you need to do something more sophisticated with your code than inline code, such as building plots or tables. They also incorporate syntax which allows modifications to how that code is rendered and styled in your final output. We’ll learn more about that as we walk through the “anatomy” of a code chunk.

Start a new .Rmd File

First, though, let’s open a new .rmd document to get a look at how code chunks work before integrating them into our paper.

Again, open a new document by navigating to File > New File > R Markdown. Add the title Code-Chunk-Test.

Let’s first delete the generic text because we don’t need it at this point (all except the first code chunk that is - we’ll get back to that in a second).

Default RMD just setup chunk

Basic Anatomy of the Code Chunk

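You can quickly insert chunks like these into your file with the keyboard shortcut Ctrl+Alt+I (Cmd+Option+I on macOS) or with the insert-chunk button in the RStudio toolbar.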

The most basic (and empty) code chunk looks like so:

blank Rmd code chunk

Other than our backticks ``` for code chunks that surround the code top and bottom, the only required piece is the specified language (r) placed between the curly brackets. This indicates that the language to read the code is R.

Let’s all start a new code chunk by typing out our starting backticks and r between curly brackets (in your own workflow you may want to add the ending three backticks right away so you don’t forget after adding your code - it’s a common mistake):

Fun fact: Other Programming Languages

Although we will (mostly) be using R in this workshop, it’s possible to use other programming or markup languages. For example, we have seen that we can use LaTeX code for equations. You can also use python and a handful of other languages, so if R is not your preferred programming language, but you like working in the RStudio environment, don’t despair! Other options include sql, julia, bash, and c. It should be noted, however, that some languages (like python) will require installing and loading additional packages.

Add a Code Chunk

Ok, let’s add some code! There are already some plots included in our paper, but as static images. This time, we are going to add these plots as code chunks, which are more reproducible and easier to update. This is because, as with our inline code, if there are any changes to the data, the plots update automatically. This also makes our life easier because when there’s a change we don’t have to re-generate plots, save them as images and then add them back into our paper. This will potentially help prevent version errors as well! So we’re actually going to go ahead and add a few plots with code chunks.

Now, let’s open our 03_HR_analysis.R script in our code folder. We will insert the code of this script into our current working file. To do this, copy the code and paste it in-between the two lines with backticks and {r}.

heartrate code in chunk

Tip:

There’s actually a button you can use in the RStudio menu to generate the code chunks automatically. Automatic code chunk generation is available for several other languages as well. Also, you can use the keyboard shortcut Ctrl+Alt+I on Windows and Cmd+Option+I on Mac.

auto create code chunk

Run the code in a code chunk

Now, to make sure our code renders, we could click the “knit” button as we have been doing to check on the output of our R Markdown file. However, with code chunks we have options for running and debugging code that don’t require us to wait for the file to render.

1) Run from code chunk (green play button on the right top corner). This allows us to run one specific code chunk.

run from code chunk

2) Run menu - this gives more options for running code chunks including the current one, the next one, all chunks, etc.

run code menu

3) Keyboard shortcuts:

Task | Windows & Linux | macOS
Create a code chunk | Ctrl+Alt+I | Cmd+Option+I
Run all chunks above | Ctrl+Alt+P | Cmd+Option+P
Run current chunk | Ctrl+Alt+C | Cmd+Option+C
Run current chunk | Ctrl+Shift+Enter | Cmd+Shift+Enter
Run next chunk | Ctrl+Alt+N | Cmd+Option+N
Run all chunks | Ctrl+Alt+R | Cmd+Option+R
Go to next chunk/title | Ctrl+PgDown | Cmd+PgDown
Go to previous chunk/title | Ctrl+PgUp | Cmd+PgUp

Run your code with one of the given methods.

Did it work? Look under the code chunk. You should now see a plot preview displayed beneath the code chunk if all went well.

Code Chunk Plot Preview

Knitting with Code Chunks

We just saw how to run our code in our code chunks to see a preview of the code output that will render in our html document but to actually render it we need to use the Knit button. Using the knit button with code chunks is a two step process - first the code is run (all code chunks will run automatically). Second, (if there are no code errors) the document of choice will render for our whole R Markdown document.

Time to Knit!

Now, let’s knit the R Markdown file and see how our code output looks in the final html page.

code chunk with plot1 code

Wait… what’s all that output in our document? We don’t want that in our paper!

Heart rate code no options for code chunk

This happens because the output from running code (messages, results, warnings, etc.) gets added to the R Markdown document instead of being printed to the console. Let’s adjust the output to make it look better with code chunk rendering options.

Code Chunk Naming and Options

Name Your Code Chunk

Before we get to fixing how our code output looks, let’s pause a second and give our code chunk a name (also called a label). While not necessary for running your code, it is good practice to name each code chunk, because the name gives the chunk a unique identifier that more advanced features (such as cross-referencing) rely on later:

{r chunk-name}

Some things to keep in mind

We’ll see in a bit where this code chunk label comes in handy. But, for now let’s go back and give our first code chunk a name:

{r fig3-heartrate}

Tip: Don’t use spaces, periods or underscores in code chunk labels

Try to avoid spaces, periods (.), and underscores (_) in chunk labels and paths. If you need separators, you are recommended to use hyphens (-) instead. For example, setup-options is a good label, whereas setup.options and chunk 1 are bad; fig.path = 'figures/mcmc-' is a good path for figure output, and fig.path = 'markov chain/monte carlo' is bad. See more at: https://yihui.org/knitr/options/
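
If you want to check a batch of labels against this advice, the rule can be written as a one-line test. The sketch below is only an illustration: is_tidy_label() is a hypothetical helper, not a knitr function, and the labels are examples.

```r
# Hypothetical helper (not part of knitr): TRUE when a chunk label uses only
# letters, digits and hyphens, as the tip above recommends.
is_tidy_label <- function(label) {
  grepl("^[A-Za-z0-9-]+$", label)
}

labels <- c("setup-options", "fig3-heartrate", "setup.options", "chunk 1", "fig_3")

# Show each label next to the result of the check.
data.frame(label = labels, ok = is_tidy_label(labels))
```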

Code Chunk Options

There are over 50 different code chunk options!!! Obviously we will not go over all of them, but they fall into several larger categories including: code evaluation, text output, code style, cache options, plot output and animation. We’ll talk about a few options for code evaluation, text output and plot output specifically.

Tip: Learn more about code chunk options

Find a complete list of code chunk options in the online guide to knitr by its developer, Yihui Xie. Or, you can find a brief list of all options on page 3 of the R Markdown Reference Guide, accessible through the RStudio interface by navigating to Help > Cheat Sheets > R Markdown Reference Guide in the main menu bar.

The chunk name is the only value other than r in the chunk header that doesn’t require a tag (i.e. the “= VALUE” part of option = VALUE). Every other chunk option requires a tag, so the syntax will be in the form:

{r chunk-label, option = VALUE}

The options always follow the code chunk label (don’t forget to add a , after the label either).

Common text output & code evaluation options:

Code evaluation option

eval = (logical or numeric) TRUE/FALSE to evaluate (or not) or a numeric value like c(1,3) (only evaluate expressions 1 and 3).

Text output options

include = (logical) whether to include the chunk output in the output document (defaults to TRUE). If FALSE, the code still runs, but nothing from the chunk is written to the document.
echo = (logical or numeric - following the same rules as eval) whether to display source code or not.
results = (logical or character) text output of the code can be hidden ('hide' or FALSE), or delineated in a certain way (default 'markup').
warning = (logical) whether to display the warnings in the output (default TRUE). FALSE will output warnings to the console only.
message = (logical) whether or not to display messages that appear when running the code (default TRUE).
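
To see two of these options in action without opening a file, the sketch below knits the same one-line chunk three times with different headers. It assumes the knitr package is installed. knit_chunk() is a helper written just for this illustration, and the chunk labels are made up.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built indirectly to stay fence-safe

# Helper for this sketch only: knit a one-chunk document with the given header.
knit_chunk <- function(header) {
  knit(text = c(paste0(fence, header), "mean(c(1, 2, 3))", fence), quiet = TRUE)
}

shown   <- knit_chunk("{r demo-default}")             # source code and result
no_code <- knit_chunk("{r demo-echo, echo = FALSE}")  # result only
not_run <- knit_chunk("{r demo-eval, eval = FALSE}")  # source code only, never run

cat(shown, no_code, not_run, sep = "\n\n-----\n\n")
```

Comparing the three pieces of markdown shows that echo controls whether the source appears and eval controls whether it runs at all.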

CHALLENGE 7.1 - Rendering Codes (optional)

How will some hypothetical code render given the following options? {r global-chunk-challenge, eval = TRUE, include = FALSE}

SOLUTION

The expressions in the code chunk will be evaluated, but nothing from the chunk (neither the source code nor its output, including figures/plots) will be included in the knit document.
When might you want to use this?
If you need to calculate some value or do something on your dataset for a further calculation or plot, but the output is not important to be included in your paper narrative.

CHALLENGE 7.2 - add options to your code

Add the following options to your code:
echo = FALSE, message = FALSE, warning = FALSE, results = FALSE

What will this do?

SOLUTION

solution to 7.2

These options mean that the source code, messages, and warnings will not be printed in the knit html document (warnings will still be output to the console), and results = FALSE additionally hides any text results printed by the code. Plots and figures WILL still show up in the final html document.

Caption your code chunk output:

The options we just looked at focus on code evaluation and text output. However, we have another set of options that deal with how plot or figure outputs look and act. Many of these options start with fig. The one we will use today allows us to add a caption to our figure. Again, this is an optional feature, but if you need (or want) to add captions to your publication, it is straightforward to do in code chunks.

The caption information also resides between your brackets at the beginning of the chunk: {r}

The tag is fig.cap followed by a = and the caption within quotes: "caption for figure".

CHALLENGE 7.3: Add a caption to Figure 3

Let’s add a caption to our heartrate figure. Add the caption:

“Fig 3: Mean heart rate of stress and control groups at baseline and during intervention.”

SOLUTION

So, you should end up with the following in your code chunk:

{r fig3-heartrate, echo = FALSE, message = FALSE, warning = FALSE, results = FALSE, 
fig.cap = "Fig 3: Mean heart rate of stress and control groups at baseline and during intervention."}

Set the option fig.cap to equal the text in double quotes.

More plot/figure options

Other options that change how a plot or figure appears often use the syntax fig.xxx, similar to fig.cap. Some other useful plot/figure code options include (from Yihui Xie’s page):

  • fig.width, fig.height: (both are 7; numeric) Width and height of the plot (in inches), to be used in the graphics device.
  • out.width, out.height: (NULL; character) Width and height of the plot in the output document, which can be different with its physical fig.width and fig.height, i.e., plots can be scaled in the output document.
  • fig.align: (‘default’; character) Alignment of figures in the output document. Possible values are default, left, right, and center. The default is not to make any alignment adjustments.
  • fig.link: (NULL; character) A link to be added onto the figure.
  • fig.alt: (NULL; character) The alternative text to be used in the alt attribute of the <img> tags of figures in HTML output. By default, the chunk option fig.cap will be used as the alternative text if provided.
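
You can look up the defaults listed above from the R console. The sketch below assumes the knitr package is installed and is run in a plain R session. Output formats may apply their own figure sizes when you knit, so treat the values as knitr's starting point rather than what every document ends up with. The chunk header in the comment uses hypothetical values.

```r
library(knitr)

# Look up knitr's current defaults for a few figure-related chunk options.
fig_defaults <- opts_chunk$get(c("fig.width", "fig.height", "fig.align"))
str(fig_defaults)

# A chunk header that overrides them for one figure could look like this
# (values chosen only for illustration):
#   {r fig3-heartrate, fig.width = 6, fig.height = 4, fig.align = "center"}
```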

Let’s knit one more time to see if our figure outputs how we’d like and has a caption.

Time to Knit!

Let’s try that again

Global Code Chunk Options:

Now we’ve learned how to create a code chunk and learned about options for adjusting how that code renders in our output document. However, let’s direct our attention back to the first code chunk in this document that I asked you not to delete.

The code looks like:

Code Chunk Option Setup

This is an option to globally set options for the entire R Markdown document.

With our first plot we set the four options that adjust how that one chunk renders. However, we may end up with quite a few code chunks in our paper. For example, if we have 10 code chunks in the final paper, can you imagine how much work it would be to add the options manually each time? And if we need different options for different figures, it could be a lot of work to keep track of which options we’re using throughout the paper. We can automate setting options by adding this special code chunk at the beginning of the document. Then, each code chunk we add will refer to those “global” options when it runs.

To test this, let’s add the options we put in our code chunk (and make sure to delete them from the code chunk with the heart rate code)

In the () after the knitr::opts_chunk$set() add the options:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, results = FALSE)

Tip: Overriding global options

What if you want most of your code chunks to render with the same options (e.g. echo = FALSE), but you have one or two chunks where you want to tweak the options (e.g. display code with echo = TRUE)? Good news! The global options can be overridden on a case-by-case basis in each individual code chunk. Test this by adding echo = TRUE to your code chunk in your document and knitting. Did you override the global settings successfully?
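
Here is a minimal sketch of that override, using a small made-up document that is knit in memory (it assumes the knitr package is installed, and the chunk labels are hypothetical). The setup chunk hides source code globally, the first chunk inherits that setting, and the second chunk switches echo back on for itself.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built indirectly to stay fence-safe

rmd <- c(
  paste0(fence, "{r setup, include = FALSE}"),
  "knitr::opts_chunk$set(echo = FALSE)",   # global default: hide source code
  fence,
  "",
  paste0(fence, "{r uses-global}"),        # inherits echo = FALSE
  "1 + 1",
  fence,
  "",
  paste0(fence, "{r overrides-global, echo = TRUE}"),  # local override
  "2 + 2",
  fence
)

md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

In the printed markdown, both results appear, but only the source of the second chunk is shown.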

Add global options to our paper

Now, let’s navigate back to our paper and add global options there.

To set global options that apply to every chunk in your file, we will call knitr::opts_chunk$set() in a new code chunk right after our yaml header (name the new code chunk setup).

Knitr will treat each option that we add to this call as a default setting for all code chunks. However, we also need to set the options for this setup chunk itself, so make sure to use include = FALSE as in the generic R Markdown document.

In the () after the knitr::opts_chunk$set() add the options:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, results = FALSE)

Alright! That sets us up well for adding code chunks into our paper (which we will do next)

CHALLENGE 7.4 (optional) global & individual code chunk options

What would appear in our html document if we knit a code chunk with the following options?
{r challenge-5, warning = TRUE, echo = TRUE}

…considering the global chunk settings were as listed: knitr::opts_chunk$set(echo = FALSE, include = FALSE)

SOLUTION

In this case, the global settings are set so that neither the code nor the output will display. The individual chunk sets echo = TRUE and warning = TRUE, but because include = FALSE still applies to it, nothing from the chunk (source code, warnings, or other output) is written to the html document; the code is still run. To actually see the code and any warnings, for example while debugging the rmd document, the chunk would also need include = TRUE.

Key Points

  • Knitr will render your code and R markdown-formatted text and output your document format of choice

  • Code chunks are runnable pieces of R code. Each time you knit the document, calculations and plots will be run and displayed

  • Options for code chunks can be set at the individual level or at the global level


Reproducible & Efficient Methods of Using Code Chunks

Overview

Teaching: 50 min
Exercises: 20 min
Questions
  • How do I run external scripts in an R Markdown document?

  • How can I avoid issues with relative paths?

  • How can I get my R Markdown document to render faster?

  • What is inline code and when to use it?

Objectives
  • Learn how to source external scripts to run within an rmd document to modularize your code.

  • Learn about using global knitr options to change your .rmd file’s working directory.

  • Learn how to load libraries and data for use throughout the whole .rmd document.

  • Learn the syntax for inline code.

Reproducible Methods for Code in R Markdown

Now that we’ve learned the core benefit of using R Markdown documents - the integration of code with text - let’s learn some more reproducible methods of working with code in R Markdown. We’ll cover how to run external R scripts from within the R Markdown document, additional Global Knitr options, including setting the working directory for the R Markdown document and loading packages and data globally, as well as how to use inline code and change the knit directory for R Markdown Documents. Whew! that’s a lot! Let’s dig in.

Run Code from an external script in a code chunk

Let’s learn another technique for adding code-generated plots and figures into our document. This time around, let’s see how to run code in a code chunk from an external R script instead of somewhat awkwardly copying and pasting the code from an R script into a code chunk in our .rmd.

There are at least a few benefits to running code in this modular fashion instead of copy/pasting:

  1. Automatic updates: if the code gets updated in the R script, it will automatically be updated in the rmd document as well. We won’t need to copy/paste code updates, which would make it easy to end up with discrepancies between our .r scripts and our .rmd paper.

  2. Readability: calling code externally only takes several lines of code - versus copy/pasting 50+ lines of code from our scripts.

  3. Less fussing with relative paths* - we had to change the code slightly in the first example to update the file path to the data set, which introduces variations and inconsistencies. With this method we won’t have to modify the source code.

*unfortunately you will never be free of relative paths, but you can make it a bit easier on yourself.

Again, let’s test this out in our generic Rmd Document. After our first figure add a new code chunk:

basic code chunk

We’re just going to test out the same figure again so we can verify this new method works. So, add the following code to your new chunk:

# run the code from 03_HR_analysis.R in the code directory
source("code/03_HR_analysis.R", local = knitr::knit_global())
# display the plot created by code in 03_HR_analysis.R
plot 
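
If you are curious what the local argument does, the sketch below runs a small stand-in script and shows where its objects end up. The script content and object names are made up for illustration. Inside a knitted document, knitr::knit_global() supplies the environment that env plays here.

```r
# Write a small stand-in script to a temporary file (hypothetical content).
script <- tempfile(fileext = ".R")
writeLines(c(
  "heart_rate <- c(72, 88, 95)",
  "mean_hr <- mean(heart_rate)"
), script)

# Run the script; objects it creates land in the environment given by `local`.
env <- new.env()
source(script, local = env)

ls(env)       # names of the objects the script created
env$mean_hr   # a value computed by the script, now available to later code
```

This is why the plot object created by 03_HR_analysis.R can be displayed in the chunk right after the source() call.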

Time to Knit!

Let’s see if our code worked when generated from an external script

Our plot should look exactly the same as the first copy-pasted one.

Success! And you’ll notice that the global code chunk options were applied to this second code chunk as well.

Now that we’ve tested this code, let’s add it to our actual paper:

Add the code to our Rmd document

First, find FIXME 9 in the rmd document for Fig 3 (ctrl-f “FIXME 9”).

Add the same code from our generic document where FIXME 9 is located under “Preview of Research Results”.

# run the code from 03_HR_analysis.R in the code directory
source("code/03_HR_analysis.R", local = knitr::knit_global())
# display the plot created by code in 03_HR_analysis.R
plot 

It should look like this:

03-HR-analysis.R externally sourced in Rmd Document

Add a chunk name and caption for Figure 3 (you can use the same ones as in the copy/pasted code chunk we just tested). Remember we don’t need to add options since we defined them globally.

Time to Run!

Let’s see if our code worked when generated from an external script

Error in file

Shoot, we got an error! It’s a file connection error, i.e. RStudio cannot find the R script file we are trying to run code from. This is because the .rmd document for the paper we are writing is located in the report/source directory, while our relative file path only gives directions from the root folder. The default .rmd document we created for testing was located in the root directory of the project (i.e. the directory where the .Rproj file is located), so the path worked there. However, we decided to implement a better directory structure by creating a separate directory for the publication we’re writing, so we need to amend our relative file path.

The logical fix for this would be to adjust the relative file path to read the external script, from code/03_HR_analysis.R to ../../code/03_HR_analysis.R. But hmmm… that doesn’t work either (try it for yourself!)

Challenge 8.1 Relative Path Madness

Change the relative path for the externally sourced code from code/03_HR_analysis.R to ../../code/03_HR_analysis.R

What was going on here? Why did we get an error when running the code this time?

Solution:

The issue is that the code we are calling from within the rmd document contains file paths to read and save the data that are relative to the code directory where the 03_HR_analysis.R script resides, so those paths aren’t correct when the script is run from the .rmd file. Yeesh!
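
The underlying rule is that a relative path is resolved against the current working directory, not against the file that contains the path. The sketch below shows this with a throwaway folder structure under tempdir(). The folder and file names (demo-project, analysis.R) are made up and only mirror the layout of our project.

```r
# Build a throwaway project: <root>/code/analysis.R and <root>/report/source/
root <- file.path(tempdir(), "demo-project")
dir.create(file.path(root, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "report", "source"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "code", "analysis.R"))

# From the project root, the short relative path is found.
old_wd <- setwd(root)
from_root <- file.exists("code/analysis.R")             # TRUE

# From report/source the same path fails; it needs "../../" in front.
setwd(file.path(root, "report", "source"))
from_report    <- file.exists("code/analysis.R")        # FALSE
from_report_up <- file.exists("../../code/analysis.R")  # TRUE

setwd(old_wd)  # restore the original working directory
c(from_root = from_root, from_report = from_report, from_report_up = from_report_up)
```

The same thing happens to every path written inside a sourced script: it is resolved from wherever the rmd document is being run, which is why fixing only the path in source() is not enough.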

Thankfully, there is a solution for this! (As with almost every obstacle you run into with R.) We will use a handy feature of R Markdown that allows us to change the working directory of our R Markdown document. This means that relative paths in our project can always* be relative to the root directory, allowing us to standardize our relative file paths and alleviate this file connection error. To do this we will learn additional global Knitr options.

*almost always

More Global Knitr Options

We already know about one of the benefits of global knitr options - using code chunk options that can be applied consistently for the whole document as we saw in the previous episode.

What are some of the additional features of global knitr options? There are many, but we’ll cover two more:

  1. Set working directory so file paths (for code chunks) can be relative to the root instead of our .Rmd file

  2. Load libraries and data once at the beginning of the document instead of in each code chunk (more concise and less rendering time)

Set working directory to project directory:

Ok, so let’s get back to fixing those path issues we get when we try to run externally sourced code. By definition, relative paths are relative to your current document or working directory. So we are having issues with connections trying to read our data files because the R scripts in our code directory are one level below the ‘root’ or .Rproj directory (../ to get there), while our rmd document is two levels below it (../..). What we want to do is direct RStudio to change the default working directory for the rmd document from the directory where the document is located to the project directory (which is the root directory of our project where the .Rproj file is located). We actually have several methods to do it.

Option #1: Change the working directory through the RStudio IDE

The first option is a simple one - we click the menu next to the knit button and change Knit Directory to Project Directory.
change working directory rmd file

Now, what might be the issue with changing the working directory through the IDE? Think about your project collaborators. Yes, if another colleague pulls this project down from GitHub and starts working on it, but doesn’t have the same RStudio IDE settings as you, this paper won’t knit. Rather than telling your colleague (and all your other current or future collaborators) to change their IDE settings, it would be better for this setting to be self-contained within the paper/project. So let’s see how to do that.

Option #2: Change the working directory with a bit of code

The second option requires a bit of code, but will overall be more reproducible (because it’s not dependent on your personal RStudio IDE settings). To accomplish this, we will now use the here package introduced previously.

This is also a setting option in our global knitr settings:

We will now navigate back up to the top of our .rmd document to the setup code chunk. Then, we’ll add the following line of code before or after our code for code chunk options (it wouldn’t hurt to add a comment explaining what it does either):

knitr::opts_knit$set(root.dir = here::here())

*notice again this code uses a function from one of our pre-installed packages here.

global knitr settings

Finally, let’s re-adjust the path in the source() function after our working directory change (if you haven’t already).

source("../../code/03_HR_analysis.R") to source("code/03_HR_analysis.R")

Looks neater already!

Note: setting the knit directory globally with opts_knit$set

Setting the knit directory to the project directory with the setup code chunk as we just did adjusts the working directory for all code in the R Markdown document (code chunks and inline code), but NOT for any markdown text elements (images and hyperlinks).
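
Note that root.dir is a package-level option set through opts_knit, which is separate from the chunk-level options set through opts_chunk. As a rough sketch (assuming the knitr package is installed), you can set and inspect it from the console. Here tempdir() is only a stand-in for the project root that here::here() returns in the lesson.

```r
library(knitr)

old_opts <- opts_knit$get()             # remember the current package options

# tempdir() is only a stand-in here; the lesson uses here::here() instead.
project_root <- tempdir()
opts_knit$set(root.dir = project_root)

# Confirm which directory code chunks would now be evaluated in.
current_root <- opts_knit$get("root.dir")
current_root

opts_knit$restore(old_opts)             # put the options back as they were
```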

Time to Knit!

First, run the code to make sure that our file paths are correct and our code runs without errors. Good? Time to knit the document!

Now, we can have some more fun with global options.

Globally load data and packages

We can make our lives easier in one other way too. So far we’ve loaded the library tidyverse and the data frame df we need in the first code chunk. Now if we want to add another figure (say from the hormone analysis code 02_hormone_analysis.R), which uses the same data as our first figure, we would be loading tidyverse and the data a second time unnecessarily, because once libraries and data are loaded they are available for the rest of the rmd document.

Instead, we can load libraries and data once at the beginning of our document making it available for all other figures or calculations throughout the document - allowing us to avoid repetition in our code and saving us rendering time. This also makes it easier for us to keep track of all the libraries and data we need to use in any given document. If anything needs to be tweaked, we don’t need to search through every code chunk in our rmd document to make a change - it’s listed right at the top.

# load libraries
library(tidyverse)
library(BayesFactor)
library(patchwork)

# load data
df <- read_csv("./output/data/preprocessed-GARP-TSST-data.csv")

Challenge 8.2: Order matters (optional)

What would happen if we loaded the data before we loaded the libraries? Try it out!

Solution:

We would get an error because we haven’t loaded tidyverse yet!

load data error

At this point we could go back through our R scripts and comment out (or delete) the beginning sections where we load the data and libraries. That will save some time for the rmd document to render, because the data and libraries will only load once instead of twice. You can imagine that the more code chunks you have the more time taking this step would save. Bonus that this also works to load the data before it is called in inline code as well!

Tip: Many ways to run external code

There are at least 3-4 methods one can use to run external code, the best choice may just depend on the context or on your personal preference. All are a bit awkward because of relative paths, but better than copy/pasting code from elsewhere in your project (in our humble opinion):

  1. source() – see more at bookdown.org
  2. sys.source() – see more at bookdown.org
  3. knitr::read_chunk() – see more at stackoverflow
  4. the code chunk option in the {r} header – see more at stackoverflow
  • another helpful page: http://zevross.com/blog/2014/07/09/making-use-of-external-r-code-in-knitr-and-r-markdown/
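
As a minimal sketch of the first two options, the snippet below writes a tiny throwaway script to a temporary file (a stand-in for an external analysis file; the script contents and object names are made up for illustration) and runs it with source() and with sys.source(). Each call evaluates the script in its own environment, so the objects the script creates can be inspected afterwards.

```r
# Write a small throwaway script; it stands in for an external analysis file.
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "total <- sum(x)"), script_path)

# Option 1: source() evaluates the script in the environment given by `local`.
env_source <- new.env()
source(script_path, local = env_source)

# Option 2: sys.source() does the same, with the environment passed as `envir`.
env_sys <- new.env()
sys.source(script_path, envir = env_sys)

# Both environments now hold the objects created by the script.
env_source$total
env_sys$total
```

Inside an R Markdown chunk, passing local = knitr::knit_global() to source() evaluates the script in the environment knitr uses for the document, so later chunks can see the objects it creates.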

8.3: Your turn! Create Figure 4 with the external code

First, find FIXME 10 in the rmd document for Fig 4 (ctrl-f “FIXME 10”). We need to add the code for the hormone analysis.

Make sure to give the code chunk a name: fig4-hormones and a caption: "Fig 4: Cortisol and Amylase levels in stress and control groups"

Solution:

{r fig4-hormones, fig.cap = "Fig 4: Cortisol and Amylase levels in stress and control groups" }
# run the code from 02_hormone_analysis.R in the code directory
source("code/02_hormone_analysis.R", local = knitr::knit_global())
# display the plot created by code in 02_hormone_analysis.R
plot 

Inline Code

What if you only need to make a quick calculation and adding a code chunk seems a little overkill?

You can also include R code directly in the text portion of your document. Say you are discussing some of the summary statistics in your manuscript: R Markdown makes this possible through inline code, which lets you calculate simple expressions as part of your narrative. Inline code enables you to insert R code into your document to dynamically update portions of your text. In other words, if your data set changes for any reason, the calculation will update automatically the next time you knit.

This can be helpful when referring to specific variables in your data. For example, you should include numbers that are derived from the data as code, not as typed-in numbers. Thus, rather than writing “The CSV file contains choice consistency data for 10.000 simulated participants” (FIXME8), replace the static number with a bit of code that, when evaluated, gives you a number that stays correct if anything changes in your dataset. Note that there is no insert option for this in the visual editor menu, so we need to type inline code manually: a pair of backticks enclosing the letter r, a space, and the expression, for example:

The CSV file contains choice consistency data for `r nrow(bronars_simulation_data)` simulated participants.

When you knit you might get an error. Any idea why? That is because we need to make sure to import the dataset we are referring to before the inline code can work. Let’s add the following to our chunk at the beginning of the document where we loaded our other data:

bronars_simulation_data <- read_csv("data/bronars_simulation_data.csv")

Time to Knit! If you update your dataset this value will match the number of rows.

CHALLENGE 8.3 - Adding inline code

Suppose we would like to add some information to the sentence we have just adjusted in our manuscript. We would like to include the average of the variable violation_count from the same dataset. Which inline code would we have to add to the following sentence?

The CSV file contains choice consistency data for `r nrow(bronars_simulation_data)` simulated participants, that have been used to determine the power of our food-choice task design to detect choice consistency violations, which averaged ` enter inline code here `.

What inline code would you enter? What number would replace the inline code?

Tip: we will need to use a dataset$variable syntax!

Solution:

`r mean(bronars_simulation_data$violation_count)`, which renders as 5.3924

Important Note:

Make sure the file you are calling is in the right subdirectory and your working directory is set appropriately.

More on inline codes:

R Markdown will always display the results of inline code, but not the code. Inline expressions do not take knitr options.
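
Because inline expressions do not take chunk options, any rounding or number formatting has to happen inside the expression itself. The sketch below uses a small made-up data frame (demo_data and its values are hypothetical, not the workshop data) to show the values that inline expressions would insert into the text when the document is knit.

```r
# Hypothetical stand-in for a data set loaded in the setup chunk.
demo_data <- data.frame(
  participant = 1:5,
  violation_count = c(4, 6, 5, 7, 4)
)

# What the inline expressions would evaluate to when the document is knit.
n_participants <- nrow(demo_data)
mean_violations <- round(mean(demo_data$violation_count), 2)

# format() can add a thousands separator, handy for large counts.
big_count <- format(10000, big.mark = ",")

# The sentence as it would read after knitting.
sentence <- paste0(
  "The file contains data for ", n_participants,
  " participants, who averaged ", mean_violations, " violations."
)
sentence
```

In the manuscript, each of these expressions would sit between backticks after the letter r, exactly like the nrow() example above.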

Tip: YAML chunk options

We can also tweak some settings in our YAML that change how code chunks are displayed. We’re not going to get into this in the workshop, but many of the same options you set in your global code chunk settings are also configurable in the YAML.

Adjust rendered html output directory

Ok, we’ll adjust one thing in the YAML. You know how we said it’s good practice to keep code and the output from that code in separate directories? Well, the html file that renders from our .rmd file is written to the same report/source directory, which violates our standards. It might not be the end of the world, but let’s see how to change the directory that R Markdown documents output to after knitting.

This is unfortunately more difficult than one would like, but we can use the following code in the YAML to create a custom function that changes the output directory for the .rmd file. The code for our document is as follows:

knit: (function(rmdFile, encoding) { 
      out_dir <- '../output';
      rmarkdown::render(rmdFile,
                        encoding=encoding, 
                        output_file=file.path(dirname(rmdFile), 
                        out_dir, 
                        'DataPaper-ReproducibilityWorkshop.html'))})

Simply copy and paste it into the YAML.
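
To see what the custom knit function computes, here is a sketch of just the path-building step, run outside the YAML. The input file name is an assumption made for illustration (the lesson only states that the .rmd lives in report/source); the output file name is the one used in the YAML above.

```r
# Hypothetical location of the R Markdown file, relative to the project root.
rmd_file <- "report/source/DataPaper-ReproducibilityWorkshop.Rmd"

# Same construction as in the YAML: directory of the .Rmd, then '../output'.
out_dir <- "../output"
output_file <- file.path(
  dirname(rmd_file),
  out_dir,
  "DataPaper-ReproducibilityWorkshop.html"
)

output_file
```

The ".." steps up from report/source to report, so the rendered HTML ends up in report/output rather than next to the source file.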

Key Points

  • Learn how to source external code with source()

  • Learn how to modularize your code to make it more reproducible

  • There are options for changing the working directory of your .rmd document with package rprojroot

  • Use a chunk at the beginning of your document to load libraries and data globally to make your document more efficient.


Bibliography, Citations & Cross-Referencing

Overview

Teaching: 25 min
Exercises: 10 min
Questions
  • How can you insert citations to your manuscripts using RStudio’s visual editor?

  • How can you change citation styles?

  • What are the options to display cited and uncited bibliography?

  • How can you cross-reference content?

Objectives
  • Inserting citations and listing bibliography in an R Markdown file.

  • Changing citation styles.

  • Customizing how citations and bibliography are displayed.

  • Add cross-referencing directing your readers through your document.

Why cite?

Correctly citing and attributing publications is key to academic writing. Older versions of RStudio require you to type Pandoc’s citation syntax by hand to render bibliographies correctly. We won’t be covering this approach extensively in this workshop, since the new visual editor has made this process much simpler. You can refer to our previous workshop on R Markdown pre-visual editor for more information.

The new visual editor in RStudio 1.4 has made citations and cross-referencing much easier by offering different options for referencing various types of sources. Before getting into these different features, let’s first learn how to open the citation dialog in RStudio and how to navigate these options.

Calling Citation Options in RStudio

After placing your cursor where you want to insert the citation you can either click the @ icon in the toolbar, or select Insert, and from the drop down menu choose Citation. Alternatively you can use the keyboard shortcut ⇧⌘ F8 on Mac, or Ctrl+Shift+F8 for Windows.

Citation Window

The citation window displays different options for inserting citations. You can find items listed in your own sources through your Bibliography folder (you should already have one in your project folder, provided by us), browse your Zotero libraries if you have the reference manager installed on your computer, or use the lookup feature to search for publications by DOI (Digital Object Identifier), Crossref, DataCite, or PubMed ID.

Understanding How RStudio Stores and Organizes Citations

Have you noticed that the YAML header contains “bibliography: references.bib”? Any idea why? That’s because our paper template has some existing citations and a references.bib file in our project folder. RStudio adds that line to the YAML automatically once you insert the first citation into your manuscript. But let’s first open the references.bib file and see how citations are presented there:

references.bib

A file with the BIB file extension is a BibTeX Bibliographical Database file. It’s a specially formatted text file that lists references pertaining to a particular source of information. These files normally use the .BIB file extension but might instead use .BIBTEX. BibTeX files might hold references for things like research papers, articles, books, etc. Each entry typically includes the author name, year, title, page numbers, and other related content. Each item can be edited in case any metadata is incorrect or missing.

Most citation and reference management tools such as Refworks, Endnote, Mendeley and Zotero, as well as some search engines (e.g. Google Scholar), most scientific databases, and our UCSB library catalog allow us to export citations as .bib BibTeX files. These files are used to describe and process lists of references, mostly in conjunction with LaTeX documents. Each entry in a .bib file has a citation key or ID that uniquely identifies the item; when you cite the item in your manuscript, you type the key preceded by an @. Citation keys can be customized, as we will learn in a bit, but be advised that your manuscript will render a citation correctly only if the key you type matches the key of the entry exactly.
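
Since a citation only renders when its key matches an entry in the .bib file, it can help to list the keys a file contains. The sketch below does this with base R on two made-up BibTeX entries (the entries, their keys, and the cited vector are all hypothetical); with a real project you would read the lines from your own references.bib with readLines() instead.

```r
# Two made-up BibTeX entries, one line per element, as readLines() would return.
bib <- c(
  "@article{doe2020,",
  "  author = {Doe, Jane},",
  "  title = {A Hypothetical Article},",
  "  year = {2020}",
  "}",
  "@book{roe2018,",
  "  author = {Roe, Richard},",
  "  title = {A Hypothetical Book},",
  "  year = {2018}",
  "}"
)

# Each entry starts with '@type{key,': keep those lines and extract the key.
entry_lines <- grep("^@", bib, value = TRUE)
keys <- sub("^@[A-Za-z]+[{]([^,]+),.*$", "\\1", entry_lines)

# Hypothetical keys cited in the manuscript; report any without a .bib entry.
cited <- c("doe2020", "smith2021")
missing_keys <- setdiff(cited, keys)

keys
missing_keys
```

Any key reported as missing would need either a new entry in the .bib file or a corrected citation in the text.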

Inserting Citations

Note that we have different blocks in this file, each starting with an @ and ending with a closing curly bracket }, and each one representing a unique citable source. We will now add a new item using the DOI lookup function so that you can see how the magic works!

Let’s assume you want to include a citation in the first line of the Value of the Data section, because you would like to provide a reference for how the salivary cortisol technique has been used in research. So let’s click where we want to insert the citation, call the citation function, and look up the following DOI: https://doi.org/10.1016/B978-012373947-6.00334-2

Salivary Cortisol

The DOI lookup uses the persistent identifier which connects to the DOI resolver service and retrieves the .bibtex file with resource metadata1. You should insert the whole DOI address, including the resolver service, the prefix and the suffix which is specific to the resource as illustrated below:

DOI Lookup

RStudio will query the DOI API and list the single matching result, which you can then insert. After confirming this is the citation you would like to include, you can modify the key if you would like to simplify it. You can also choose whether to insert it as an in-text citation, meaning the last name of the author(s) appears in the narrative followed by the year enclosed in parentheses, e.g., Kirschbaum and Hellhammer (2007). In this case, we will uncheck that option since we won’t include the authors as part of the narrative. Instead we would like to insert a parenthetical citation, where authors and year are both displayed inside the parentheses, such as (Kirschbaum and Hellhammer 2007). You may insert more than one citation by selecting multiple items.

This item will be automatically listed in your Bibliography folder, and if you want to cite the same item again you can type @ and the first letters of its key, which RStudio will auto-complete. For parenthetical citations you type the key between square brackets, as in [@key]; for in-text narrative citations you type only @key. Note that when you hover over the citation, you will see a preview of the full reference as it will be listed at the bottom of the manuscript. This preview helps you spot anything you need to edit in the .bib entry your citation is calling. Also note that all citations will be included at the end of your document in a reference list.

Challenge 9.1 - Insert a Citation Using the DOI Lookup Function

Following the same process, insert a parenthetical citation to the publication “Welcome to the tidyverse” (https://doi.org/10.21105/joss.01686) where this package is mentioned in the data paper. Look for: insert citation here

Solution

[@wickham2019]
(Wickham et al., 2019)

Inserting citations using Crossref, DataCite or PubMed follows a very similar process to the DOI lookup; however, to search their APIs you will need to enter different information. For Crossref, you may use keywords and author information to identify an item (e.g., Cortisol Stress Oken), and RStudio will connect to Crossref search and return related results. These are often less specific than a DOI search, which works best when you know exactly what you are looking for. DataCite allows searches by persistent identifiers or keywords, while PubMed searches exclusively in the biomedical literature indexed in that database. If you have the PMID (PubMed reference number), which is uniquely assigned by the NIH National Library of Medicine to papers indexed in PubMed, it will save you time in the same way a DOI does.

Editing Metadata & Citation Key

Not all citations are in perfect shape when we import them. Sometimes we will need to make some adjustments (e.g., include missing metadata, move content to another metadata field). If needed, we can do so by modifying the .bib file. As briefly mentioned, you can also edit citation keys. By default, most citation keys consist of the first author’s last name, or the first word of the title if there are no authors, followed by the year of publication. You may consider editing the citation key if you want to simplify the entry and speed up the autocomplete option. If you choose to do so, you can simply click on the key in the .bib file and edit it. Use this option with caution, and update any existing citations in your manuscript so that they match the new key in the .bib file.

Changing Citation Styles

You might have noticed that all citations are inserted in a specific style. Can you guess which one? If you answered Chicago that is correct! By default, Rstudio via Pandoc will use a Chicago author-date format for citations and references. To use another style, you will need to specify a CSL (Citation Style Language) file in the csl metadata field in the YAML.

But how can you identify which CSL you should use? You can find required formats on the Zotero Style Repository, which makes it easy to search for and download your desired style.

Download the format you wish to use and call it out in the YAML. We have pre-saved the APA CSL file in the project folder for you. But if you would like to follow the process or try another style, go to the Zotero Style Repository and select American Psychological Association 7th edition or any other style of your choice. You will notice that it will automatically download the file (e.g. apa.csl). Make sure to save it in the report/source folder of your project directory. In the YAML we have to give the exact name of the file, preceded by “csl:”

csl: apa.csl

Save and knit the document to see how citations and references have changed. This same process could be followed for any citation style required by the university, the journal or conference you are planning to submit your manuscript to.

Challenge 9.2 - Changing the Citation Style

How can you go back to using Chicago Style?

Solution

You can either delete the csl line, since Chicago author-date is the default, or download a Chicago CSL file and call it in the YAML (the name must match the file you saved in your project):

csl: chicago.csl

Adding Items to the References without Citing them

All cited items will be listed under the section References which you created before while practicing headings and subheadings. Items will be placed automatically in alphabetical order for most citation styles. However, there might be cases that you will be referencing supporting literature which you have not necessarily cited in the document.

By default, the bibliography will only display items that are directly referenced in the document. If you want to include items in the bibliography without actually citing them in the body text, you can define a dummy nocite metadata field in the YAML and put the citations there.

nocite: |
  @item1, @item2

To demonstrate this, let’s add to the YAML the BibTeX @key of an item that was not cited in the text. Note that every item added to the YAML this way will be listed in the bibliography.

Let’s try!

Challenge 9.3 - Adding references you have not cited

We have used a few packages in our paper that we do not necessarily cite in text. However, it is a good practice to add them to the reference list. This practice is recommended for giving proper credit to package developers and also to inform your readers about the exact version you have used to produce your paper.

Ideally we would follow this process for all packages we have installed and used to produce the paper. But for the sake of time, let’s do that process for the Tidyverse package only.

1) First, you would go to the CRAN page for Tidyverse, click on the citation info link, and copy the corresponding BibTeX entry.
2) Paste the BibTeX entry into the “references.bib” file inside the “report/source” folder and adjust the key as you wish.

We have pre-saved the BibTeX entry for Tidyverse for you. Now that you have the citation info and know that the key for the package is @Tidyverse2019, how would you go about adding this item to the bibliography as a no-cite?

Solution

You should call @Tidyverse2019 in the YAML, using the nocite field:

nocite: |
 @Tidyverse2019

Important Info:

Does the indentation matter? Yes, you have to indent by at least one space, and the citation key should turn green once it is recognized.
If you add a citation to nocite that you have not cited in the document, first make sure that its entry is in the .bib file in your Bibliography folder.
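
As an alternative to copying the entry from CRAN, R can usually generate BibTeX for a package itself. A minimal sketch using base R's citation() and toBibtex(): called without arguments, citation() returns the citation for R itself; citation("tidyverse") would do the same for the tidyverse, assuming that package is installed. A generated entry may come without a citation key, in which case you would add one yourself (such as Tidyverse2019) before pasting it into references.bib.

```r
# Citation for R itself; citation("tidyverse") works the same way
# if the tidyverse package is installed.
r_citation <- citation()

# Convert the citation to BibTeX lines that can be pasted into a .bib file.
bib_entry <- toBibtex(r_citation)
print(bib_entry)

# To write it to a file instead (the file name here is only an example):
# writeLines(bib_entry, "r-citation.bib")
```

Generating the entry on the machine that ran the analysis also records the version details of what was actually installed, which supports the goal of telling readers exactly what you used.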

Adding Cross-referencing

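Cross-referencing is a useful way of directing your readers through your document, and can be done automatically within R Markdown once you have the bookdown package installed. To use cross-references, you will need:

  • A bookdown output format (e.g., bookdown::html_document2) set in the YAML.

  • A labeled (named) code chunk for the figure or table.

  • A caption for the figure or table.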

After these conditions are met, we can make cross-references within the text using the syntax \@ref(type:label), where label is the chunk label and type is the environment being referenced (e.g., tab or fig).
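
The pattern is easy to get slightly wrong, so here is a small sketch that assembles a cross-reference string from its two parts. make_ref() is a hypothetical helper written for this illustration, not a bookdown function; the label fig4-hormones is the chunk name used for Figure 4 earlier. The character check reflects bookdown's advice that labels stick to letters, digits, and a few symbols such as hyphens (underscores in particular tend to break references); treat the exact pattern as an assumption.

```r
# Hypothetical helper: build a bookdown cross-reference from type and label.
make_ref <- function(type, label) {
  # Allow only letters, digits, '/', ':' and '-' in the label (assumed rule).
  if (!grepl("^[A-Za-z0-9/:-]+$", label)) {
    stop("Label contains characters that may break cross-references: ", label)
  }
  # The doubled backslash in the R string yields a single literal backslash.
  sprintf("\\@ref(%s:%s)", type, label)
}

fig_ref <- make_ref("fig", "fig4-hormones")

# cat() shows the text exactly as it would be typed into the manuscript.
cat("See Figure", fig_ref, "\n")
```

In practice you type the reference directly into the text; the helper only makes the two parts of the syntax explicit.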

Challenge 9.4 - Cross-referencing a Figure

We would like to include a cross-reference to Figure 3, where the paper says “as via the mean heart-rate measured before and during the experimental manipulation”. Based on the information about the required elements for cross-references in R Markdown, how would you go about adding a cross-reference to Figure 3?

Solution

1) Make sure the YAML includes
`output:
     bookdown::html_document2: default`
2) Find the passage "as via the mean heart-rate measured before and during the experimental manipulation" in your paper and add `see Figure  \@ref(fig:fig3-heartrate)`

Time to Commit!

Make sure to commit your changes to GitHub. Add your changed files and commit with the following message: “Added Bibliography”

Key Points

  • RStudio supports different lookup strategies to ease the citation process.

  • Rstudio supports different citation styles.

  • The YAML can be adjusted to display uncited items in the reference list.

  • Use bookdown to cross-reference content.


Collaborating via Github

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How do I authenticate with Github?

  • How do I put my project on Github?

  • How do I "push" my latest changes to Github ?

Objectives
  • Authenticate with Github.

  • Connecting your project to Github.

  • Make changes locally and push them to Github.

In episode 5 we learned about using version control as you write your publication. In this part of the workshop we’ll set up RStudio to authenticate with GitHub, which is necessary to push your changes to GitHub.

Terminology: Git Push and Pull

Definition: The process of synchronizing your local Git repository with your Git repository on GitHub (or another Git server).

GitHub used to allow simple username and password authentication, but it now requires a more secure method. For this workshop we’ll be using the Personal Access Token method. The Personal Access Token (PAT) must be created for your account on GitHub. You’ll use the PAT to authenticate instead of your GitHub account password.

  1. Login to your Github account with your web browser. https://github.com
  2. On Github.com go to your account Settings -> Developer Settings -> Personal access tokens or this link: https://github.com/settings/tokens
  3. Any tokens you have created in the past will be listed there and you can click “Generate new token” button. Set an expiration date for your new token and a scope. For this workshop the “Repo” scope should be sufficient. Then Generate your new Token with the button at the bottom of the page.

PAT options on GitHub

Note:

If you cloned the repository from Github in Episode 5 the next 2 steps are not necessary as the “origin” is set as part of the cloning process.

Getting your repository’s URL from Github

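You can get the address of your repository from GitHub by navigating to your repository on Github.com and clicking the green “Code” button. Make sure to copy the HTTPS form of the URL: the Personal Access Token we created is used in place of a password over HTTPS, whereas the SSH form relies on SSH keys instead.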

Copy Repo URL from GitHub

With that address you can complete setting the origin URL in the next step.

Checking and Setting the “Origin” for the local copy of your repository

If you forked and cloned the demonstration publication for this workshop as covered in an earlier episode then your copy of the repository should already have the “origin” set. Once the “origin” is set properly you should be able to push and pull your changes to and from Github. When you clone a repository from Github your local copy of the repository should have Github set as the “origin”.

You can check this in Rstudio –> Tools –> Project Options –> Git/SVN

If the “Origin” field is blank then you’ll need to add it from the terminal with a couple of terminal commands like this:

git remote add origin <paste your repository address here>
git fetch --set-upstream origin main

Be sure your Github username is part of the URL.
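
If you would rather check the result from the R console than from the Project Options dialog, the sketch below shows one way to do it with base R's system2(). It works on a throwaway repository in a temporary directory so that nothing in your project is touched; the repository URL is a hypothetical placeholder, and the code assumes the git command-line tool is installed (it does nothing otherwise).

```r
# Hypothetical placeholder URL; replace with the address copied from GitHub.
repo_url <- "https://github.com/your-username/your-repo.git"

origin <- NA_character_
if (nzchar(Sys.which("git"))) {
  # Create a throwaway repository in a temporary directory.
  repo_dir <- tempfile("demo-repo-")
  dir.create(repo_dir)

  # Small wrapper: run git inside repo_dir and return what it prints.
  git <- function(...) {
    system2("git", c("-C", shQuote(repo_dir), ...), stdout = TRUE, stderr = FALSE)
  }

  git("init", "--quiet")
  git("remote", "add", "origin", repo_url)

  # Ask git which URL "origin" points to.
  origin <- git("remote", "get-url", "origin")
}

origin
```

For your real project you would skip the temporary directory and run only the last git call from the project folder.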

After you’ve updated the Origin URL from the command line, go back to RStudio –> Tools –> Project Options –> Git/SVN to verify that the “Origin” field is filled in. It should look like this.

RStudio Project Options Git/SVN

Push your local changes up to your repository on GitHub

With authentication set up and your local copy of your repository pointing to Github as the “Origin” you should be able to make changes locally and push them up to Github.

When prompted, enter your GitHub username. Then, when prompted for your GitHub password, paste in your Personal Access Token (PAT) instead.

RStudio PAT Password Prompt

Let’s try it and see if it works.

Challenge: Push to Github

  1. Make a change to one of the files in your project or add a new file.
  2. In R Studio’s Git panel check the box to Stage the changed file.
  3. Commit the change to your Git repository.
  4. Click the green up arrow to Push your repository changes up to GitHub.
  5. Look on Github.com to verify your changes are there.

With the ability to synchronize your changes between GitHub and your local copy, the next step is to explore options for publishing your research paper.

Key Points

  • Setting up RStudio to authenticate with GitHub using a Personal Access Token (PAT).

  • Setting the Git repository Origin in your R Studio project enables pushing and pulling from your local copy of the repository to the repository on Github.


Publishing your project

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What are the options for you to publish your project?

  • What free and open publishing resources are available?

  • What aspects should guide your choice?

Objectives
  • Identify different ways you can publish your project.

  • Overview of some free and open resources available.

  • Learn which factors should guide your decision-making process.

What is Next?

Once you have completed your Rmd manuscript following all the best practices for reproducibility, including organizing your project files, what is next? The answer depends on your plans to move forward. Let’s explore some scenarios:

If you plan to share your insights with your community right away

Publishing with Rpubs

Notice the “Publish” button in the upper right corner of your Knit output. Click this to publish to RPubs. This is where you’ll need an RPubs account, as mentioned in the setup for this workshop.

Click the Publish button:

Publish button in RStudio

You’ll then be presented with the following panels:

Publish to RPubs or RStudio Connect

(The other option in the dialog box, RStudio Connect, is a standalone publishing platform for teams to share content. It requires purchase to host and use.)

Confirm Publish to RPubs

The first time you publish, RStudio will likely ask if you want to install some needed packages; say yes. RStudio will then open a web browser to allow you to sign in to rpubs.com.

At the end of the publish process your paper will be live on the internet with a URL similar to: https://rpubs.com/yourname/678624

RStudio also saves an HTML version of your knit document to your local file system. Look for it in the results directory in the same directory as the R Markdown file in your RStudio project directory.
This html document is self-contained and highly portable. Images are encoded directly into the HTML so you can easily move it to any web hosting you have available.

Publishing as website on GitHub

Another, better, but slightly more involved option for publishing an R Markdown document on the web is to use GitHub and GitHub Pages. It is out of the scope of this lesson to cover GitHub Pages in depth, but briefly, GitHub is a widely used version control and collaboration platform. RStudio has built-in support for Git: in the upper right panel of your RStudio window, look for the Git tab, which allows you to sync your R Markdown project with a remote repository stored on github.com. To enable publishing to GitHub Pages, go to the Settings page of your repository on GitHub and select a branch (“branch” is a Git term) to publish. Name your main R Markdown file index.Rmd, and render it to HTML as index.html. With GitHub Pages enabled, the index.html file at the root of your repository https://github.com/myusername/myrepo will appear on the web at https://myusername.github.io/myrepo/.

Other document types

When you create a new R Markdown file in RStudio you are presented with a choice of Output Formats:

RStudio output formats

For the purposes of this workshop we’re using HTML as the output format, but other types are available. You can render your R Markdown as a document, a presentation or a Shiny app. With the default installation of RStudio, HTML output is the most likely to work. Other formats may require additional R packages and/or code libraries to be installed on your computer. RStudio also has a templating system to help with creating R Markdown files with common elements, YAML metadata and rendering instructions. This can be very helpful, for example, if you want to create a weekly or monthly report documenting an ongoing experiment, study or other changing data.

If you are willing to publish your manuscript through a peer-reviewed journal

Key Points

  • You may choose to share and publish your data project before publishing its associated manuscript.

  • Sharing the code, data and documentation is necessary to allow for inspection and research reproducibility.