This lesson is still being designed and assembled (Pre-Alpha version)

R & RStudio, R Markdown

Overview

Teaching: 50 min
Exercises: 10 min
Questions
  • How do I orient myself in the RStudio interface?

  • How can I work with R in the console?

  • What are built-in R functions and how do I use their help page?

  • How can I generate an R Markdown notebook?

Objectives
  • Learn what an Integrated Development Environment (IDE) is.

  • Learn to work in the R console interactively.

  • Learn how to generate a reproducible code notebook with R Markdown.

  • Learn how to create an HTML or PDF document from an R Markdown notebook.

  • Understand that R Markdown notebooks foster literate programming, reproducibility and open science.


1. Introduction


This episode focuses on the concept of literate programming, which is supported by the ability to combine code, its output and human-readable descriptions in a single R Markdown document.

Literate programming

More generally, the mixture of code, documentation (conclusion, comments) and figures in a notebook is part of the so-called “literate programming” paradigm (Donald Knuth, 1984). Your code and logical steps should be understandable for human beings. In particular these four tips are related to this paradigm:

  • Do not write your program only for R but think also of code readers (that includes you).
  • Focus on the logic of your workflow. Describe it in plain language (e.g. English) to explain the steps and why you are doing them.
  • Explain the “why” and not the “how”.
  • Create a report from your analysis using an R Markdown notebook to wrap together the data + code + text.

1.1 The R Markdown format

Dr. Jenny Bryan’s lectures from STAT545 at R Studio Education

Leave your mark

R Markdown allows you to convert your complete analysis into a single report that is easy to share and that should recapitulate the logic of your code and related outputs.
A variety of output formats are supported:

  • Word document
  • PowerPoint
  • HTML
  • PDF

R Markdown conversion to different formats

In practice, a PDF document is often a convenient output for your analysis, as PDF documents are easy to open and can be viewed online, for example on GitHub.

1.2 Why learn R with RStudio?

You are all here today to learn how to code. Coding made me a better scientist because I was able to think more clearly about analyses, and become more efficient in doing so. Data scientists are creating tools that make coding more intuitive for new coders like us, and there is a wealth of awesome instruction and resources available to learn more and get help.

Here is an analogy to start us off. Think of yourself as a pilot, and R is your airplane. You can use R to go places! With practice you’ll gain skills and confidence; you can fly further distances and get through tricky situations. You will become an awesome pilot and can fly your plane anywhere.

And if R were an airplane, RStudio would be the airport. RStudio provides support! Runways, communication, community, and other services that make your life as a pilot much easier. So it’s not only the infrastructure (the user interface or IDE), although it is a great way to learn and interact with your variables and files, and to interact directly with GitHub. It’s also a data science philosophy, R packages, community, and more. So although you can fly your plane without an airport and we could learn R without RStudio, that’s not what we’re going to do.

Take-home message

We are learning R together with RStudio because it offers the power of a programming language with the comfort of an Integrated Development Environment.

Something else to start us off is to mention that you are learning a new language here. It’s an ongoing process, it takes time, you’ll make mistakes, it can be frustrating, but it will be overwhelmingly awesome in the long run. We all speak at least one language; it’s a similar process, really. And no matter how fluent you are, you’ll always be learning, you’ll be trying things in new contexts, learning words that mean the same as others, etc, just like everybody else. And just like any form of communication, there will be miscommunications that can be frustrating, but hands down we are all better off because of it.

While language is a familiar concept, programming languages are in a different context from spoken languages, but you will get to know this context with time. For example: you have a concept that there is a first meal of the day, and there is a name for that: in English it’s “breakfast”. So if you’re learning Spanish, you could expect there is a word for this concept of a first meal. (And you’d be right: ‘desayuno’). We will get you to expect that programming languages also have words (called functions in R) for concepts as well. You’ll soon expect that there is a way to order values numerically. Or alphabetically. Or search for patterns in text. Or calculate the median. Or reorganize columns to rows. Or subset exactly what you want. We will get you to raise your expectations and learn to ask for and find what you’re looking for.


2. A quick touR

2.1 RStudio panes

Like a medieval window, RStudio has several panes (sections that divide the entire window).

Launch RStudio/R and identify the different panes.


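Notice the default panes: the Console, the Environment pane in the upper right, and the Files and Help tabs in the lower right.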

Customizing RStudio appearance

You can change the default location of the panes, among many other things: Customizing RStudio.

2.2 Locating yourself

An important first question: where are we inside the computer file system?

If you have opened RStudio for the first time, you’ll be in your home directory. This is noted by the ~/ at the top of the console. You can also see that the Files pane in the lower right shows what is in the home directory where you are. You can navigate around within that Files pane and explore, but note that you won’t change where you are: even as you click through, you’ll still be in Home: ~/.

2.3 First step in the console

OK let’s go into the Console, where we interact with the live R process.

Make an assignment and then inspect the object you created by typing its name on its own.

x <- 3 * 4
x

In my head, I hear e.g., “x gets 12”.

All R statements where you create objects – “assignments” – have this form: objectName <- value.

I’ll write it in the console with a hashtag #, which is the way R comments so it won’t be evaluated.

## objectName <- value

## This is also how you write notes in your code to explain what you are doing.

Object names cannot start with a digit and cannot contain certain other characters such as a comma or a space. You will be wise to adopt a convention for demarcating words in names.

# i_use_snake_case
# other.people.use.periods
# evenOthersUseCamelCase

Make an assignment

this_is_a_really_long_name <- 2.5

To inspect this variable, instead of typing it, we can press the up arrow key and call your command history, with the most recent commands first. Let’s do that, and then delete the assignment:

this_is_a_really_long_name

Another way to inspect this variable is to begin typing this_…and RStudio will automagically have suggested completions for you that you can select by hitting the tab key, then press return.

One more:

science_rocks <- "yes it does!"

You can see that we can assign an object to be a word, not a number. In R, this is called a “string”, and R knows it’s a word and not a number because it has quotes " ". You can work with strings in your data in R pretty easily, thanks to the stringr and tidytext packages. We won’t talk about strings very much specifically, but know that R can handle text, and it can work with text and numbers together (this is a huge benefit of using R).
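As a small illustration of the point above, here is a sketch that uses only base R functions (class(), nchar(), toupper() and paste()); the object n_fish is a made-up example and is not part of the lesson data.

```r
science_rocks <- "yes it does!"

class(science_rocks)      # "character": the type R uses for strings
nchar(science_rocks)      # 12 characters, spaces and punctuation included
toupper(science_rocks)    # "YES IT DOES!"

# Text and numbers can be combined: paste() converts the number to text
n_fish <- 4               # hypothetical value, for illustration only
sentence <- paste("We weighed", n_fish, "fish")
sentence                  # "We weighed 4 fish"
```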

Let’s try to inspect:

sciencerocks
# Error: object 'sciencerocks' not found

2.4 Make your life easier with keyboard shortcuts

You will quickly find that typing the assignment operator <- is laborious in the long run. Instead, we can create a keyboard shortcut to make our life easier.

With RStudio, this is relatively straightforward. Follow the screenshots to change the default to Alt + L for instance.

Go to “Tools” followed by “Modify Keyboard Shortcuts”:

Then in the “Filter” text box, type “assign” to find the current keyboard shortcut for the assign operator. Change it to Alt + L or any other convenient key combination.

Lovely keyboard shortcuts:

RStudio offers many handy keyboard shortcuts.
Also, Alt + Shift + K brings up a keyboard shortcut reference card.

2.5 Error messages are your friends

Implicit contract with the computer / scripting language: Computer will do tedious computation for you. In return, you will be completely precise in your instructions. Typos matter. Case matters. Pay attention to how you type.

Remember that this is a language, not unlike English! There are times you aren’t understood – it’s going to happen. There are different ways this can happen. Sometimes you’ll get an error. This is like someone saying ‘What?’ or ‘Pardon?’. Error messages can also be more useful, like when they say ‘I didn’t understand what you said, I was expecting you to say blah’. That is a great type of error message. Error messages are your friend. Google them (copy-and-paste!) to figure out what they mean.

And also know that there are errors that can creep in more subtly, when you are giving information that is understood, but not in the way you meant. Like if I am telling a story about suspenders that my British friend hears but silently interprets in a very different way (true story). This can leave me thinking I’ve gotten something across that the listener (or R) might have silently interpreted very differently. And as I continue telling my story you get more and more confused… Clear communication is critical when you code: write clean, well-documented code and check your work as you go to minimize these circumstances!

2.6 Logical operators and expressions

A moment about logical operators and expressions. We can ask questions about the objects we made.

x == 2
x <= 30
x != 5
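To make the answers explicit, here is a minimal sketch of what these comparisons return, assuming x still holds the value 12 assigned earlier. The object is_small is a hypothetical name used only for illustration.

```r
x <- 3 * 4     # x is 12, as assigned earlier in the console

x == 2         # FALSE: 12 is not equal to 2
x <= 30        # TRUE: 12 is less than or equal to 30
x != 5         # TRUE: 12 is different from 5

# The answer to a comparison is a logical value that can be stored like any object
is_small <- x <= 30
class(is_small)   # "logical"
```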

2.7 Variable assignment

Let’s assign a number to a variable called weight_kg.

When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:

weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parentheses around the call prints the value of `weight_kg`
weight_kg          # and so does typing the name of the object

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):

weight_kg * 2.2

We can also change a variable’s value by assigning it a new one:

weight_kg <- 57.5
weight_kg * 2.2

And when we multiply it by 2.2, the outcome is based on the value currently assigned to the variable.

OK, let’s store the animal’s weight in pounds in a new variable, weight_lb:

weight_lb <- weight_kg * 2.2

and then change weight_kg to 100.

weight_kg <- 100

What do you think is the current content of the object weight_lb? 126.5 or 220? Why? It’s 126.5. Why? Because assigning a value to one variable does not change the values of other variables. If you want weight_lb updated to reflect the new value of weight_kg, you will have to re-execute that code. This is why we recommend working in scripts and documents rather than the Console, and will introduce those concepts shortly and work there for the rest of the day.
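The whole sequence can be replayed as a short sketch. The names weight_lb_before and weight_lb_after are hypothetical helpers added here only to keep both results visible side by side.

```r
weight_kg <- 57.5
weight_lb <- weight_kg * 2.2     # computed once, from the value of weight_kg at this moment

weight_kg <- 100                 # changing weight_kg afterwards...
weight_lb_before <- weight_lb    # ...leaves weight_lb untouched (prints as 126.5)

weight_lb <- weight_kg * 2.2     # re-run the conversion to bring weight_lb up to date
weight_lb_after <- weight_lb     # now prints as 220

weight_lb_before
weight_lb_after
```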

We can create a vector of multiple values using c().

c(weight_lb, weight_kg)

names <- c("Jamie", "Melanie", "Julie")
names

Exercise

  1. Create a vector that contains the different weights of four fish (you pick the object name!):
    • one fish: 12 kg
    • two fish: 34 kg
    • red fish: 20 kg
    • blue fish: 6.6 kg
  2. Convert the vector of kilos to pounds (hint: 1 kg = 2.2 pounds).
  3. Calculate the total weight.

Solution

# Q1 
fish_weights <- c(12, 34, 20, 6.6)
# Q2
fish_weights_lb <- fish_weights * 2.2
# Q3
# we haven't gone over functions like `sum()` yet but this is covered in the next section. 
sum(fish_weights_lb) 


3. Diving deepeR

3.1 Functions and help pages

R has a mind-blowing collection of built-in functions that are used with the same syntax: function name with parentheses around what the function needs to do what it is supposed to do.

function_name(argument1 = value1, argument2 = value2, ...). When you see this syntax, we say we are “calling the function”.

Let’s try using seq() which makes regular sequences of numbers and, while we’re at it, demo more helpful features of RStudio.

Type se and hit TAB. A pop up shows you possible completions. Specify seq() by typing more to disambiguate or using the up/down arrows to select. Notice the floating tool-tip-type help that pops up, reminding you of a function’s arguments. If you want even more help, press F1 as directed to get the full documentation in the help tab of the lower right pane.

Type the arguments 1, 10 and hit return.

seq(1, 10)

We could probably infer that the seq() function makes a sequence, but let’s learn for sure. Type (and you can autocomplete) and let’s explore the help page:

?seq 
help(seq) # same as ?seq

Help page

The help page shows the name of the package in the top left and is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values.
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

seq(from = 1, to = 10) # same as seq(1, 10); R assumes by position
seq(from = 1, to = 10, by = 2)

The above also demonstrates something about how R resolves function arguments. You can always specify in name = value form. But if you do not, R attempts to resolve by position. So above, it is assumed that we want a sequence from = 1 that goes to = 10. Since we didn’t specify step size, the default value of by in the function definition is used, which ends up being 1 in this case. For functions I call often, I might use this resolve by position for the first argument or maybe the first two. After that, I always use name = value.
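The matching rules described above can be checked with a quick sketch. The four object names are hypothetical; each call is expected to produce the same sequence 1 3 5 7 9.

```r
by_position <- seq(1, 10, 2)                     # resolved by position: from, to, by
by_name     <- seq(from = 1, to = 10, by = 2)    # every argument named
mixed       <- seq(1, 10, by = 2)                # first two by position, the rest by name
reordered   <- seq(by = 2, to = 10, from = 1)    # named arguments can come in any order

by_name
identical(by_position, by_name)   # TRUE
identical(mixed, reordered)       # TRUE
```

Naming the later arguments costs a few keystrokes but makes the call readable without opening the help page.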

The examples from the help pages can be copy-pasted into the console for you to understand what’s going on. Remember we were talking about expecting there to be a function for something you want to do? Let’s try it.

Exercise

Talk to your neighbor(s) and look up the help file for a function that you know or expect to exist. Here are some ideas:

  1. ?getwd()
  2. ?plot()
  3. ?min()
  4. ?max()
  5. ?mean()
  6. ?log()

Solution

  1. Gets and prints the current working directory.
  2. Plotting function.
  3. Minimum value in a vector or dataframe column.
  4. Maximum value in a vector or dataframe column.
  5. Arithmetic mean (average) of a vector or dataframe column. Generic function for the (trimmed) arithmetic mean.
  6. Logarithm function. Specific functions exist for log2 and log10 calculations.

And there’s also help for when you only sort of remember the function name: double-question mark:

??install 

Not all functions have (or require) arguments:

date()

3.2 Packages

So far we’ve been using a couple functions from base R, such as seq() and date(). But, one of the amazing things about R is that a vast user community is always creating new functions and packages that expand R’s capabilities. In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. They increase the power of R by improving existing base R functionalities, or by adding new ones.

The traditional place to download packages is from CRAN, the Comprehensive R Archive Network, which is where you downloaded R. You can also install packages from GitHub, which we’ll do tomorrow.

You don’t need to go to CRAN’s website to install packages; this can be accomplished within R using the command install.packages("package-name-in-quotes"). Let’s install a small, fun package, praise. You need to use quotes around the package name:

install.packages("praise")

Now we’ve installed the package, but we need to tell R that we are going to use the functions within the praise package. We do this by using the function library().

What’s the difference between a package and a library?
Sometimes there is confusion between a package and a library, and you will find people referring to packages as “libraries”.

Please don’t get confused: library() is the command used to load a package, and it refers to the place where the package is contained, usually a folder on your computer, while a package is the collection of functions bundled conveniently.

library(praise)

Now that we’ve loaded the praise package, we can use the single function in the package, praise(), which returns a randomized praise to make you feel better.

praise()
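If you are not sure whether a package is already installed, one possible pattern (a sketch, not the only way) is to test for it with requireNamespace() before calling library(). The package::function() form shown here calls a function without attaching the whole package.

```r
# requireNamespace() returns TRUE when the package is installed, without attaching it
if (requireNamespace("praise", quietly = TRUE)) {
  library(praise)            # attach the package for the rest of the session
  print(praise())            # call the function directly once attached
  print(praise::praise())    # or call it with an explicit package prefix
} else {
  message('Package "praise" is not installed; run install.packages("praise") first.')
}
```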

3.3 Clearing the environment

Now look at the objects in your environment (workspace) – in the upper right pane. The workspace is where user-defined objects accumulate.

RStudio objects in environment

You can also get a listing of these objects with a few different R commands:

objects()
ls()

If you want to remove the object named weight_kg, you can do this:

rm(weight_kg)

To remove everything:

rm(list = ls())

or click the broom 🧹 in the RStudio Environment pane.
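A short sketch of how ls() and rm() interact, using the same object names as earlier in this episode. It removes a single object only, so it is safe to run in a session that contains other work.

```r
weight_kg <- 55
weight_lb <- weight_kg * 2.2

"weight_kg" %in% ls()   # TRUE: the object is listed in the workspace

rm(weight_kg)           # remove one object by name

"weight_kg" %in% ls()   # FALSE: it is gone
"weight_lb" %in% ls()   # TRUE: other objects are untouched
```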

For reproducibility, it is critical that you delete your objects and restart your R session frequently. You don’t want your whole analysis to only work in whatever way you’ve been working right now — you need it to work next week, after you upgrade your operating system, etc. Restarting your R session will help you identify and account for anything you need for your analysis.

We will keep coming back to this theme but let’s restart our R session together: Go to the top menus: Session > Restart R.

Exercise

Clear your workspace and create a few new variables. Create a variable that is the mean of a sequence from 1 to 20.

  1. What’s a good name for your variable?
  2. Does it matter what your “by” argument is? Why?

Solution

  1. Any meaningful and relatively short name is good. As a suggestion mean_seq could work.
  2. Yes it does. By default “by” is equal to 1 but it can be changed to any increment number.
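The effect of the “by” argument can be verified with a few lines. The variable names below are suggestions only.

```r
mean_seq     <- mean(seq(from = 1, to = 20))          # 1, 2, ..., 20  -> 10.5
mean_seq_by2 <- mean(seq(from = 1, to = 20, by = 2))  # 1, 3, ..., 19  -> 10
mean_seq_by5 <- mean(seq(from = 1, to = 20, by = 5))  # 1, 6, 11, 16   -> 8.5

mean_seq
mean_seq_by2
mean_seq_by5
```

Changing the increment changes which values are in the sequence, and therefore its mean.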

4. R Markdown notebook

R Markdown will allow you to create your own workflow, save it and generate a high quality report that you can share. It supports collaboration and reproducibility of your work. This is really key for collaborative research, so we’re going to get started with it early and then use it for the rest of the day.

4.1 R Markdown video (1-minute)

What is R Markdown? from RStudio, Inc. on Vimeo.

A minute long introduction to R Markdown

This is also going to introduce us to the fact that RStudio is a sophisticated text editor (among all the other awesome things). You can use it to keep your files and scripts organized within one place (the RStudio IDE) while getting the support that you expect from text editors (spell-checking and syntax coloring, to name a few).

An R Markdown file will allow us to weave markdown text with chunks of R code to be evaluated and output content like tables and plots.

4.2 Create an R Markdown document

To do so, go to: File -> New File -> R Markdown… -> Document of output format HTML -> click OK.

You can give it a Title like “R tutorial”. Then click OK.

Knit button

Let’s have a look at this file. It is not blank: some initial text is already provided for you. You can already notice a few parts:

4.3 The YAML header

The header of your R Markdown document will allow you to personalize the related report from your R Markdown document.
The header follows the YAML syntax (“YAML Ain’t Markup Language”) which usually follows a key:value syntax.

A few YAML parameters are all you need to know to start using R Markdown. Here is an opinionated list of the key parameters:

---
title: "R tutorial"
output: html_document
author: "John Doe"
date: "Tuesday, February 15 2021"
---

The three dashes --- before and after the option: value are important to delimit the YAML header. Do not forget them!

A note on output format: if you search online, you will find tons of potential output formats available from one R Markdown document. Some of them require additional packages or software installation. For instance, compiling your document to produce a PDF will require LaTeX libraries etc.

Exercise

Open the output formats of the R Markdown definitive guide: https://bookdown.org/yihui/rmarkdown/output-formats.html.
Instead of output: html_document, specify pdf_document to compile into a PDF (because it is easier to share for instance).
Press the knit button. Is it working? If not, what is missing?

For PDF, you might need to install a distribution of LaTeX for which several options exist. The recommended one is to install TinyTeX from Yihui Xie. Other more comprehensive LaTeX distributions can be obtained from the LaTeX project directly for your OS.

If you feel adventurous, you can try other formats. There are many things you can generate from an R Markdown document, even slides for a presentation.

Exercise

Instead of hard-coding the date in the YAML section, search online for a way to insert today’s date dynamically.

Solution

In the YAML header, write:
date: "`r Sys.Date()`"
This will add today’s date in the YYYY-MM-DD format when compiling.

More generally, you can use the syntax option: "`r <some R command>`" to have options automatically updated by some R command when compiling your R Markdown notebook into a report.
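You can try the command in the console first to see what it returns. The sketch below also shows format(), a base R function, as one way to obtain a longer date; the objects iso_date and long_date are hypothetical names, and the weekday and month names depend on your system locale.

```r
today <- Sys.Date()                          # current date, stored as a Date object
class(today)                                 # "Date"

iso_date  <- format(today)                   # default text form: YYYY-MM-DD
long_date <- format(today, "%A, %B %d %Y")   # weekday, month name, day, year

iso_date
long_date
```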

4.4 Code chunks

Code chunks appear in grey and will execute the R code when you compile the document. The following chunk will create a summary of the cars dataframe.

simple code chunk

A code chunk is opened by three backticks followed by curly braces with r inside, ```{r}, to indicate the coding language.
It is closed by three backticks ```.

```{r}
summary(cars)
```

The code chunk will be executed when compiling the report. You can also run it by clicking on the green arrow.

simple code chunk

To insert a new code chunk, you can either:

  1. Use a keyboard shortcut: Ctrl + Alt + I: to add a code chunk. Use Cmd + Alt + I on Mac OS.
  2. Click on “Add chunk” in the toolbar.
  3. Type the chunk delimiters by hand: ```{r} to open the code chunk and ``` to close it.

Exercise

Introduce a new code chunk to produce a histogram of the speed column of the cars dataframe.
Compile your R Markdown document and visualise the results.
In the final document, can you find a way to hide the code chunk that generates the plot?

Solution

Add a new code chunk:

```{r}
hist(cars$speed)
```

To hide the code in the final document, add echo = FALSE inside the curly braces:

```{r, echo = FALSE}
hist(cars$speed)
```

4.5 Text markdown syntax

You might wonder what the “markdown” in R Markdown stands for.

Between code chunks, you can write normal plain text to comment on figures and code outputs. To format titles and paragraphs, put text in italics, etc., you can make use of the markdown syntax, which is a simple but efficient method to format text. Altogether, it means that an R Markdown document has 2 different languages within it: R and Markdown.

Markdown is a formatting language for plain text, and there are only about 15 rules to know.

Have a look at your own document. Notice the syntax for:

There are some good cheatsheets to get you started, and here is one built into RStudio: Go to Help > Markdown Quick Reference

Exercise

In Markdown:

  1. Format text in italics,
  2. Make a numbered list,
  3. Add a web link to the RStudio website in your document,
  4. Add a “this is a subheader” subheader with the level 2 or 3.
    Reknit your document.

Solution

  1. Add one asterisk or one underscore on both sides of the text.
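  2. To make a numbered list, write 1. followed by the first item, then start a new line with 2. for the second item.
  3. Place the link text between square brackets, followed by the URL in parentheses, e.g. [RStudio link](https://www.rstudio.com/).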
  4. Subheaders can be written with ### or ## depending on the level that you want to write.

A complete but short guide on Markdown syntax from Yihui Xie is available here.

4.6 Compile your R Markdown document

Now that we are all set, we can compile the document to generate the corresponding HTML document. Press the “Knit” button.

Knit button

This will compile your R Markdown document and open a new window.

What do you notice between the two? So much of learning to code is looking for patterns.

Notice how the grey R code chunks are surrounded by 3 backticks and {r LABEL}. These are evaluated and return the output text in the case of summary(cars) and the output plot in the case of plot(pressure).

Notice how the code plot(pressure) is not shown in the HTML output because of the R code chunk option echo=FALSE.

Compiling takes place in a separate R workspace

When compiling, you will be redirected to the R Markdown tab next to your Console. This is normal as your R Markdown document is compiled in a separate new R workspace.
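The Knit button is not the only way to compile. As a sketch, assuming the rmarkdown package and pandoc are installed (RStudio ships with pandoc), the same document can be compiled from the console with rmarkdown::render(). The temporary file and its content below are made up for illustration. Note that, unlike the Knit button, render() evaluates the code in your current session by default.

```r
# Build a small R Markdown file in a temporary location (example content only)
fence <- strrep("`", 3)                  # three backticks, built programmatically
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"R tutorial\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_file)

# Compile only if the rmarkdown package and pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)  # returns the output path
  print(file.exists(html_file))
}
```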

4.7 Useful tips and common issues

Here is a list of useful keyboard shortcuts:

Useful shortcuts

Place the cursor in the script editor pane. Then type:

  • Ctrl + Alt + I: to add a code chunk.
  • Ctrl + Shift + K: compile the R Markdown document to create the related output.
  • Ctrl + Alt + C to run the current code chunk (your cursor has to be inside a code chunk).
  • Ctrl + Alt + R: to run all the code chunks in the document.

For Mac OS users, replace Ctrl with Cmd (Command).

All these shortcuts can be seen in Code > Run Region > …

Code run shortcuts

As seen before, you can modify these shortcuts to anything you find convenient: Tools > Modify keyboard shortcuts.
Type “chunk” to filter the shortcuts for code chunks.

modify keyboard shortcut panel

Common issues

Separate workspace when compiling: when you compile your R Markdown document, it will start from a clean R workspace. Anything you have in your current interactive R session will not be available in the R Markdown tab.

This is often a source of bugs and of compilation halting.

Exercise

Step 1: In the R console, type:

library(dplyr)   
tooth_filtered <- dplyr::filter(ToothGrowth, len > 1) 

You should see the tooth_filtered R object in your current environment.

Step 2: in your R Markdown document, add this line:

with(tooth_filtered, hist(x = len, col = "darkgrey"))

Try to knit your document. What bug do you experience?

Solution

Since your R Markdown workspace starts from scratch and creates a new environment, it ignores the tooth_filtered object you created in your R console.
The solution is to add the tooth_filtered <- dplyr::filter(ToothGrowth, len > 1) line inside a code chunk.
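As a sketch, the body of a self-contained chunk could look like the lines below: the object is created and used in the same document. To keep this example free of extra packages, it swaps in base R subset() as a stand-in for dplyr::filter(); that substitution is an assumption of this sketch, not part of the exercise.

```r
# Create the object inside the document, so knitting does not depend on the console
tooth_filtered <- subset(ToothGrowth, len > 1)   # base R stand-in for dplyr::filter()

# The plotting line now finds tooth_filtered in the knitting workspace
with(tooth_filtered, hist(x = len, col = "darkgrey"))
```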


5. Import your own data

5.1 Functions available

To import your own data, you can use different functions depending on your input format:

Some important parameters in data import functions:

5.2 Important tips

Taken from Anna Krystalli workshop:

read.csv

read.csv(file,
         na.strings = c("NA", "-999"),
         strip.white = TRUE,
         blank.lines.skip = TRUE,
         fileEncoding = "mac")
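To see what these parameters do, here is a minimal sketch that writes a small, made-up CSV file to a temporary location and reads it back. The column names and values are invented for illustration, and fileEncoding is left out because the right value depends on where your file comes from.

```r
# Write a small example file containing a blank line, a -999 code and stray spaces
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "site,depth,species",
  "A,10, cod ",
  "",
  "B,-999,NA",
  "C,25,herring"
), csv_file)

fish <- read.csv(csv_file,
                 na.strings = c("NA", "-999"),  # treat both codes as missing values
                 strip.white = TRUE,            # trim spaces around unquoted text fields
                 blank.lines.skip = TRUE)       # ignore empty lines

fish
sum(is.na(fish$depth))   # number of missing depths after import
```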

5.3 Large tables

If you have very large tables (1000s of rows and/or columns), use the fread() function from the data.table package.


6. Credits and additional resources

6.1 Jenny Bryan

6.2 RStudio materials

6.3 The definitive R Markdown guide

“The R Markdown definitive guide” by Yihui Xie, J. J. Allaire and Garrett Grolemund: https://bookdown.org/yihui/rmarkdown/

6.4 Others


Key Points

  • R and RStudio make a powerful duo to create R scripts and R Markdown notebooks.

  • RStudio offers a text editor, a console and some extra features (environment, files, etc.).

  • R is a functional programming language: everything revolves around functions.

  • R Markdown notebooks support code execution, report creation and reproducibility of your work.

  • Literate programming is a paradigm to combine code and text so that it remains understandable to humans, not only to machines.