Content from What is a reprex and why is it useful?


Last updated on 2025-05-27

Overview

Questions

  • What steps can you take to solve problems in your code?
  • What is a minimal reproducible example?
  • Why are minimal reproducible examples important?
  • What variables are included in the Portal Project dataset?

Objectives

  • Understand the high-level process for getting unstuck in R.
  • Define each key characteristic of a minimal reproducible example.
  • Explain why minimal reproducible examples are central to getting help from others.
  • Load in the rodent survey data and briefly explain its contents.

Mickey is an ecologist working with data from The Portal Project, a long-term research study of rodents in Portal, Arizona. Mickey has just started in a new lab at their university. They are interested in learning about rodent morphology. For now, they are learning about the dataset by doing some descriptive analyses and visualizations of the data.

Mickey starts by loading the data so they can begin to explore it. They also load the {tidyverse}, a set of packages that will be useful for wrangling and visualizing the data.

::: instructor note Loading the entire {tidyverse} here, rather than a few component packages, is an intentional over-complication so that we can teach learners to simplify their packages later. Learners should have {tidyverse} installed, as per the setup instructions.

Would it be better to have the surveys dataset as a downloaded file for them to load in, or does loading it from a url make sense? :::

R

library(tidyverse)

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R

surveys <- read_csv("https://raw.githubusercontent.com/carpentries-incubator/R-help-reprexes/refs/heads/main/episodes/data/surveys_complete_77_89.csv") 

OUTPUT

Rows: 16878 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Mickey has some past experience in R, but this project will require more data analysis than they have done before. Mickey attended a Carpentries workshop, “Data Analysis and Visualization in R for Ecologists,” and they feel comfortable with the fundamentals of coding in R. Still, they are a little nervous about starting this project.

Prerequisites and target audience

This workshop assumes some prior experience with working in R and RStudio. We will assume you’ve taken the equivalent of the Data Analysis and Visualization in R for Ecologists workshop and are comfortable with basic commands, and we won’t necessarily explain every line of code in detail.

If you’re much more experienced in R, this workshop is still for you! Even expert coders may not always know how to get unstuck. We hope this workshop will be useful to people with a variety of coding backgrounds.

Sometimes, Mickey’s code doesn’t work as expected and they go to their colleague, Remy, for help. Remy has spent many hours sitting with Mickey, helping to work through various errors. But soon, Remy will be starting a big project, and they won’t have as much time to help with debugging.

::: instructor note The following exercises are optional, but they can be useful for getting learners settled in. :::

Think, pair, share: When you get stuck

When you’re coding in R and you get stuck, what are some things that you do to get help or get unstuck?

Think, pair, share: Helping someone else

Think about a time that you helped someone else with their code. What information did you need to know in order to help? (If you have never helped someone else with their code, think about a time that someone helped you. What information did they need to know in order to help?)

To help Mickey get more comfortable troubleshooting their own code, Remy suggests some steps to follow the next time they get stuck. Remy calls this the “Road Map to Getting Unstuck in R.”

Remy explains that the road map includes two main phases. First, there is guidance about “code first aid.” This includes understanding types of errors, reading function documentation, investigating error messages, and running through code line by line to diagnose problems.

Sometimes, these first aid steps are not enough to solve your problem. One of the most frustrating parts of learning to code is getting stuck and not knowing what to do! Luckily, there are many people in the R and data science communities who are happy to help, as long as you give them the right information. But figuring out how to ask a good question can feel even harder than the original problem that got you stuck in the first place. That’s why the second part of Remy’s road map includes guidance on how to create a minimal reproducible example (also known as a “reprex”).

A minimal reproducible example is a piece of code that demonstrates the problem you are facing, includes all necessary information to show the problem but nothing extra, and will run easily on someone else’s computer.

Minimal reproducible examples are very important tools to get help when you’re stuck on a coding problem.

  • Stripping the code and data down to their simplest (minimal) parts makes it easy for a helper to zero in on what might be going wrong.

  • Making your example reproducible allows a helper to run your code on their own computer so they can “feel your pain” and understand what’s going wrong. Even experts often have to “tinker” with code in order to fix it. Providing a reprex makes that “tinkering” easy, which makes it more likely that a helper will take the time to assist you.

  • The process of making a minimal reproducible example often gives you insight into your own code. Often, you might end up solving the problem yourself, without even needing to ask for help.
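As a rough illustration of these characteristics, the sketch below shows what a very small reprex might look like. The data frame `toy` and its values are invented for this example and are not taken from the Portal dataset. The point is only that the data are built inline, the code is short, and the surprising result sits next to the expected one.

```r
# Toy data built inline, so a helper needs no external files (values are invented)
toy <- data.frame(
  species = c("ordii", "spectabilis", "ordii", "spectabilis"),
  weight  = c(48, 120, NA, 52)
)

# The behaviour being asked about: the mean comes back as NA
mean(toy$weight)

# What was expected instead: the mean of the non-missing weights
mean(toy$weight, na.rm = TRUE)
```

A helper can paste these few lines into a fresh R session and see the same result without installing anything.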

Callout

The phenomenon of solving one’s own problem during the process of trying to explain it to someone else is often called “rubber duck debugging.” This is a reference to a story about programmers who would keep rubber ducks on their desks to explain coding problems to. Jenny Bryan refers to reprexes as “basically the rubber duck in disguise,” because they force you to explain your problem to someone else, often solving it in the process.

Jenny Bryan shares many other insights about reprexes in her 2018 talk “Help me help you: creating reproducible examples.”

Helpers

There are lots of people who might help you with your code: friends, colleagues, mentors, or total strangers online. In this lesson, we will use the term “helper” to refer to the person who is helping you to debug your code. Helpers are the target audience for your minimal reproducible example.

Remy emphasizes to Mickey that they are still happy to be a helper, but that since they won’t have as much time to devote to debugging in the future, following this road map first will make the helping process more efficient. Hopefully, it will also make Mickey into a more confident coder!

Before heading off to their own work, Remy also introduces Mickey to the dataset they’ve just loaded in.

R

glimpse(surveys)

OUTPUT

Rows: 16,878
Columns: 13
$ record_id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ month           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ day             <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16…
$ year            <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, …
$ plot_id         <dbl> 2, 3, 2, 7, 3, 1, 2, 1, 1, 6, 5, 7, 3, 8, 6, 4, 3, 2, …
$ species_id      <chr> "NL", "NL", "DM", "DM", "DM", "PF", "PE", "DM", "DM", …
$ sex             <chr> "M", "M", "F", "M", "M", "M", "F", "M", "F", "F", "F",…
$ hindfoot_length <dbl> 32, 33, 37, 36, 35, 14, NA, 37, 34, 20, 53, 38, 35, NA…
$ weight          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ genus           <chr> "Neotoma", "Neotoma", "Dipodomys", "Dipodomys", "Dipod…
$ species         <chr> "albigula", "albigula", "merriami", "merriami", "merri…
$ taxa            <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Rod…
$ plot_type       <chr> "Control", "Long-term Krat Exclosure", "Control", "Rod…

R

min(surveys$year)

OUTPUT

[1] 1977

R

max(surveys$year)

OUTPUT

[1] 1989

Remy explains that the dataset is made up of many individual rodent records (record_id). The date of each record is given by the month, day, and year columns.

The dataset includes data from a number of different study plots that had different treatments applied: plot IDs are given by the plot_id column, and the type of treatment is specified in plot_type.

There is information about the genus and species of each individual caught, as well as higher-level taxa information and a short-form species_id code.

For each individual caught, the field crew took weight, sex and hindfoot_length measurements, although those measurements are sometimes missing.

The dataset contains 16,878 rodent observations ranging across years from 1977 through 1989.
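To make "sometimes missing" concrete, here is a minimal sketch of how missing measurements could be counted. It uses a small stand-in data frame, `first_records` (a name chosen for this example), built from the first eight records visible in the `glimpse()` output above, so it runs without downloading anything.

```r
# Stand-in for surveys: the first eight records shown in the glimpse() output
first_records <- data.frame(
  record_id       = 1:8,
  sex             = c("M", "M", "F", "M", "M", "M", "F", "M"),
  hindfoot_length = c(32, 33, 37, 36, 35, 14, NA, 37),
  weight          = rep(NA_real_, 8)
)

# Count the missing values in each measurement column
measurement_cols <- c("sex", "hindfoot_length", "weight")
missing_counts <- colSums(is.na(first_records[, measurement_cols]))
missing_counts

# Proportion of records with a missing hindfoot_length
mean(is.na(first_records$hindfoot_length))
```

On the full dataset, the same idea would be `colSums(is.na(surveys))`.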

Callout

More information about the Portal Project and the surveys dataset is available at [LINK].

With an introduction to the dataset and a road map to guide them if they get stuck, Mickey feels ready to start coding!

Key Points

  • kp1
  • kp2
  • kp3

Content from Identify the problem and make a plan


Last updated on 2025-05-27

Overview

Questions

  • What do I do when I encounter an error?
  • What do I do when my code outputs something I don’t expect?
  • Why do errors and warnings appear in R?
  • How can I find which areas of code are responsible for errors?
  • How can I fix my code? What other options exist if I can’t fix it?

Objectives

After completing this episode, participants should be able to…

  • Describe how the desired code output differs from the actual output
  • Categorize an error message (e.g. syntax error, semantic errors, package-specific errors, etc.)
  • Decode and describe what an error message is trying to communicate
  • Identify specific lines and/or functions generating the error message
  • Use R Documentation to look up function syntax and examples
  • Quickly fix commonly-encountered R errors using the internet
  • Identify when a problem is better suited for asking for further help, including online help and reprex

The first step we’ll cover is what to do when encountering an error or other undesired output from your code. With this episode, we hope to teach you the basics about identifying errors, rectifying them if possible, and if not, how to isolate the problem for others to look at. This is the first step in our “roadmap” of how to solve coding problems – recognizing when something you don’t intend is happening with your code, and then identifying the problem (to a lesser or greater degree) in order to solve it yourself or be able to succinctly describe it to a helper.

3.1 What do I do when I encounter an error message?


Error messages can be frustrating to read, but R will often let us know when a problem occurs by generating one that tells us why it was unable to run our code. In this lesson, we refer to this type of problem as a syntax error. When R is unable to run your code, it returns an error message and stops (as opposed to issuing a warning, or attempting to run further lines despite the problem). Errors can happen for many reasons, and deciphering their messages is not always as easy as we might hope. While we can’t review every reason your code might generate an error, we will try to teach you some tools for interpreting and resolving syntax errors yourself.

The accompanying error message attempts to tell you exactly how your code failed. For example, consider the following error message that occurs when I run this command in the R console:

R

ggplot(x = taxa) + geom_bar()

ERROR

Error: object 'taxa' not found

Though we know somewhere there is an object called taxa (it is actually a column of the dataset rodents), R is trying to communicate that it cannot find any such object in the local environment. Let’s try again, appropriately pointing ggplot to the rodents dataset and taxa column using the $ operator. For the sake of argument, let’s say we also remember that geom_bar expects an aesthetic (aes).

R

ggplot(aes(x = rodents$taxa)) + geom_bar()

ERROR

Error in `fortify()`:
! `data` must be a <data.frame>, or an object coercible by `fortify()`,
  or a valid <data.frame>-like object coercible by `as.data.frame()`, not a
  <uneval> object.
ℹ Did you accidentally pass `aes()` to the `data` argument?

Whoops! Here we see another error message, and this time R responds with one that is perhaps even harder to interpret.

Let’s go over each part briefly. First, we see an error from a function called fortify, which we didn’t even call! Then, there’s a more helpful informational message: Did we accidentally pass aes() to the data argument? This does seem to relate to our function call, as we do pass aes into the ggplot function. But what is this “data argument?” A helpful starting place when attempting to decipher an error message is checking the documentation for the function which caused the error:

R

?ggplot

Here, a Help window pops up in RStudio which provides some more information. Skipping the general description at the top, we see that ggplot takes the positional arguments data and then mapping, and that mapping is where the aes call belongs. The “Arguments” section tells us that whatever is passed as data will be converted to a data.frame by fortify. Now the picture is clear! We accidentally passed our mapping argument (telling ggplot how to map variables to the plot) into the position where it expected data in the form of a data frame, so fortify tried, and failed, to convert aes(x = rodents$taxa). And if we scroll down to “Examples”, to “Pattern 1”, we can see exactly how ggplot expects these arguments in practice. Let’s amend our code:

R

ggplot(rodents, aes(x = taxa)) + geom_bar()

Here we see our desired plot.
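The "object not found" pattern from the first attempt is not specific to ggplot. The following base R sketch reproduces it with a tiny invented data frame, `rodents_toy` (a hypothetical stand-in, not the real data), and uses `tryCatch()` to capture the error message as text so it can be read without stopping the script.

```r
# Tiny invented stand-in for the rodents data
rodents_toy <- data.frame(taxa = c("Rodent", "Bird", "Rodent", NA))

# `taxa` exists only as a column, not as an object in the environment,
# so referring to it by its bare name fails; tryCatch() returns the message
msg <- tryCatch(
  table(taxa),
  error = function(e) conditionMessage(e)
)
msg

# Pointing R at the data frame makes the column reachable
table(rodents_toy$taxa)
```

Capturing the message this way is also handy when you want to paste the exact wording into a search engine or a help request.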

Stop no. 1 on our roadmap: Identifying the problem


Let’s pause here to highlight some patterns we’re starting to see in the course of fixing our code:

  1. Seeing a problem arise in our code (in this case, R is explicitly telling us it has a problem running it).

  2. Reading and interpreting the error message R gives us.

Other steps we might take then include:

  1. Acting on parts of the error we can understand, such as changing input to a function.

  2. Pulling up the R Documentation for that function, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

  3. Copying and pasting the error message into a search engine / generative LLM for more interpretable explanations.

And, when all else fails, we can prepare our code into a reproducible example for expert help.
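The documentation step can also be done partly from the console. As a small sketch using only base R functions, the calls below list the arguments a function accepts; `table()` is used purely as a convenient example.

```r
# Print the full argument list of a function without opening the Help pane
args(table)

# Just the argument names, as a character vector
arg_names <- names(formals(table))
arg_names

# The default choices recorded for a single argument
formals(table)$useNA

# Running the documented examples is another option (prints a lot of output):
# example(table)
```

These calls complement `?table` rather than replace it: they show what the arguments are called, while the help page explains what they do.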

The steps above may be new or may seem familiar. We formalize them here to make one point explicit: recognizing when a problem arises and attempting to interpret what is going wrong is essential to fixing it. This is true whether you fix the problem on your own or communicate it to an expert. The latter steps we listed are attempts to address the problem immediately, and we’ll call them code first aid. These steps might fix the problem, give you greater insight into what the problem is (and how R is interpreting your code), or not be helpful at all.

In any case, we want to emphasize that these skills are essential to being a practiced coder who can effectively seek help. These examples may seem too trivial to warrant a whole checklist, but below we will see problems that are trickier to both recognize and interpret. In those cases, we’ll apply the same framework.

3.2 What do I do when my code outputs something I don’t expect?


Another type of problem you may encounter is when your R code runs without errors, but does not produce the desired output. You may sometimes see these called semantic errors. As with syntax errors, semantic errors may occur for a variety of non-intuitive reasons, and are often harder to solve as there is no description of the error – you must work out where your code is defective yourself!

Let’s go back to our rodent analysis. The next step in the plan is to subset to just the Rodent taxa (as opposed to other taxa: Bird, Rabbit, Reptile or NA). Let’s quickly check to see how much data we’d be throwing out by doing so:

R

table(rodents$taxa)

OUTPUT


   Bird  Rabbit Reptile  Rodent
    300      69       4   16148 

We’re interested in the Rodents, and thankfully it seems like a majority of our observations will be maintained when subsetting to rodents. But wait… In our plot above, we can clearly see the presence of NA values. Why are we not seeing them here? Our command was correctly executed, but the output is not everything we intended. Having no error message to interpret, let’s jump straight to the function documentation:

R

?table


Here, the documentation provides some clues: there seems to be an argument called useNA that accepts “no”, “ifany”, and “always”, but it’s not immediately apparent which one we should use to show our NA values. As a second approach, let’s go to Examples to see if we can find any quick fixes. Here we see a couple lines further down:

R

table(a)                 # does not report NA's
table(a, exclude = NULL) # reports NA's

That seems like it should be inclusive. Let’s try again:

R

table(rodents$taxa, exclude = NULL)

OUTPUT


   Bird  Rabbit Reptile  Rodent    <NA>
    300      69       4   16148     357 

Now our NA values show up in the table. We see that by subsetting to the “Rodent” taxa, we would be losing 357 NAs, which could themselves be rodents! However, in this case, it seems a small enough portion to safely omit. Let’s subset our data to the rodent taxon.

R

rodents <- rodents %>% filter(taxa == "Rodent")
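Returning to the `useNA` argument that the `table()` documentation mentioned: a toy vector makes the difference between its three settings easier to see. The vectors `toy_taxa` and `complete` below are invented for this sketch.

```r
# Invented vector with two missing values
toy_taxa <- c("Rodent", "Rodent", "Bird", NA, "Rodent", NA)

no_na     <- table(toy_taxa)                    # default, useNA = "no": NAs dropped
if_any    <- table(toy_taxa, useNA = "ifany")   # NA column only when NAs exist
always_na <- table(toy_taxa, useNA = "always")  # NA column even when its count is 0

sum(no_na)   # counts only the non-missing values
sum(if_any)  # counts every element

# With no missing values, "ifany" and "always" differ
complete <- c("Rodent", "Bird")
length(table(complete, useNA = "ifany"))   # no NA column
length(table(complete, useNA = "always"))  # extra NA column with count 0

# The comparison used for filtering yields NA, not FALSE, for missing taxa
toy_taxa == "Rodent"
```

The last line is why the subsetting step drops the missing taxa: `filter()` keeps only rows where the condition is `TRUE`, so rows where it evaluates to `NA` are removed along with the other taxa.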

Challenge

There are 3 lines of code below, and each attempts to create the same plot. Identify which produces a syntax error, which produces a semantic error, and which correctly creates the plot (hint: this may require you to infer what type of graph we’re trying to create!)

  A. ggplot(rodents) + geom_bin_2d(aes(month, plot_type))

  B. ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")

  C. ggplot(rodents) + geom_tile(aes(month, plot_type))

In this case, A correctly creates the graph: the fill color of each tile shows the number of observations for that combination of month and plot type. It essentially runs the following lines of code:

R

rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())

OUTPUT

`summarise()` has grouped output by 'plot_type'. You can override using the
`.groups` argument.

R

ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))

B is a syntax error, and will produce the following error:

R

ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")

ERROR

Error in `geom_tile()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_count()` must only have an x or y aesthetic.

Finally, C is a semantic error. It produces a plot, but a rather meaningless one:

R

ggplot(rodents) + geom_tile(aes(month, plot_type))

Summary


In general, encountering semantic errors can make our job more difficult, but the roadmap remains the same:

  1. Seeing a problem arise in our code.

  2. Interpreting the problem.

Other steps we might take then include:

  1. Acting on parts of the error we can understand, such as changing input to a function.

  2. Pulling up the R Documentation for relevant functions, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

  3. Describing our problem to a search engine / generative LLM for more interpretable explanations.

And, when all else fails, we can prepare our code into a reproducible example for expert help.

The steps for identifying the problem and applying code first aid match what we’ve seen above. However, seeing the problem arise in our code may now be much more subtle, because it depends on us recognizing output that we don’t expect or know to be wrong. Even when the code runs, R may give us warnings or informational messages. Most of the time, however, it’s up to the coder to be vigilant and make sure each step is running as it should. Interpreting the problem may also be more difficult, as R gives us little or no indication of how it’s misinterpreting our intent.
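As a small sketch of code that runs but still deserves attention, consider converting text to numbers in base R. The vector `weights_raw` is invented for this example. `as.numeric()` completes and returns a result, but it signals a warning and quietly inserts an `NA`.

```r
# Invented raw values, one of which is not a number
weights_raw <- c("48", "120", "unknown")

# The conversion finishes and returns a result; suppressWarnings() is used
# here only so that the sketch runs quietly
weights <- suppressWarnings(as.numeric(weights_raw))
weights

# Capture the warning text so it can be read alongside the result
warning_text <- tryCatch(
  as.numeric(weights_raw),
  warning = function(w) conditionMessage(w)
)
warning_text

# A cheap check that turns a silent surprise into a visible one
any(is.na(weights))
```

In real analysis code you would normally leave the warning visible; the check on the last line is the kind of vigilance that catches the problem even if the warning scrolls past unnoticed.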

Callout

Generally, the more your code relies on specialized packages rather than base R functions, the worse both the documentation and the online help available through search engines tend to become. While base R errors will often be solvable in a couple of minutes with a quick ?help check or an existing discussion on a website like Stack Overflow, errors arising from little-used packages applied in bespoke analyses might merit isolating your specific problem into a reproducible example for online help, or even getting in touch with the developers! Such community input and questions are often how packages and documentation improve over time.

3.3 How can I find where my code is failing?


Isolating your problem may not be as simple as assessing the output from a single function call on the line of code which produces your error. Often, it may be difficult to determine which lines or logical sections of code (e.g. functions) are producing the error.

Consider the example below, where we are now attempting to see which species of kangaroo rats appear in different plot types over the years. To do so, we’ll filter our dataset to include only the genus Dipodomys. Then we’ll plot a histogram of how many observations are seen in each plot type, with year on the x axis.

R

krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus

ggplot(krats, aes(year, fill=plot_type)) + 
geom_histogram() +
facet_wrap(~species)

ERROR

Error in `combine_vars()`:
! Faceting variables must have at least one value.

Uh-oh. Another error here, when we try to make a ggplot. But what is “combine_vars?” And then: “Faceting variables must have at least one value.” What does that mean?

This is not an easily-interpretable error message from ggplot, and our code looks like it should run. Perhaps we can take a step back and see whether our error is actually not in the ggplot code itself. Often, when trying to isolate the problem area, it is a good idea to look back at the original input. So let’s take a look at our krats dataset.

R

krats

OUTPUT

# A tibble: 0 × 13
# ℹ 13 variables: record_id <dbl>, month <dbl>, day <dbl>, year <dbl>,
#   plot_id <dbl>, species_id <chr>, sex <chr>, hindfoot_length <dbl>,
#   weight <dbl>, genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

It’s empty! What went wrong with our “Dipadomys” filter? Let’s use a print statement to see which genera are included in the original rodents dataset.

R

print(rodents %>% count(genus))

OUTPUT

# A tibble: 12 × 2
   genus                n
   <chr>            <int>
 1 Ammospermophilus   136
 2 Baiomys              3
 3 Chaetodipus        382
 4 Dipodomys         9573
 5 Neotoma            904
 6 Onychomys         1656
 7 Perognathus        553
 8 Peromyscus        1271
 9 Reithrodontomys   1412
10 Rodent               4
11 Sigmodon           103
12 Spermophilus       151

We see two things here. For one, we’ve misspelled Dipodomys, which we can now amend. This quick function call also tells us we should expect a data frame with 9573 rows after subsetting to the genus Dipodomys.

R

krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

OUTPUT

[1] 9573   13

R

ggplot(krats, aes(year, fill=plot_type)) + 
geom_histogram() +
facet_wrap(~species)

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Our improved code here looks good. Checking the dimensions of our subsetted data frame using the dim() function confirms we now have all the Dipodomys observations, and our plot is looking better. In general, having a ‘print’ statement or some other output after we manipulate data or other major steps can be a good way to check your code is producing intermediate results consistent with your expectations.
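The same habit can be sketched in base R on a toy data frame. `toy_rodents` and its values are invented for this example, and the checks shown are only some of the possible ones.

```r
# Invented stand-in for the rodents data
toy_rodents <- data.frame(
  genus = c("Dipodomys", "Neotoma", "Dipodomys", "Onychomys"),
  year  = c(1977, 1977, 1978, 1978)
)

# A misspelled value silently returns zero rows rather than an error
krats_bad <- toy_rodents[toy_rodents$genus == "Dipadomys", ]
nrow(krats_bad)

# Checks that would have flagged the problem right away
"Dipadomys" %in% toy_rodents$genus
sort(unique(toy_rodents$genus))

# With the corrected spelling, the row count matches what we expect
krats_ok <- toy_rodents[toy_rodents$genus == "Dipodomys", ]
stopifnot(nrow(krats_ok) == 2)
```

Unlike a print statement, `stopifnot()` halts the script when the expectation is not met, so a later plotting step never receives an empty data frame.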

Callout

Often, giving your expert helpers access to the entire problem, with a detailed description of your desired output allows you to directly improve your coding skills and learn about new functions and techniques.

Summary


Our roadmap to identifying problems in our code may now look like:

  1. Seeing a problem arise in our code.

  2. Isolating our code to the problem area.

  3. Interpreting the problem.

Now we can see the need to isolate the specific areas of code causing the bug or problem. There is no general rule of thumb as to how large this area needs to be. But, unless our problem occurs on the first line, we should be able to isolate our code a bit: any early lines which we know run correctly and as intended may not need to be included. Isolating the problem area as much as we can makes it more understandable to others, even if that does not help us solve the problem ourselves.

Let’s add to our code first aid:

  1. Identify the problem area – add print statements immediately upstream or downstream of problem areas, check the desired output from functions, and see whether any intermediate output can be further isolated.

  2. Acting on parts of the error we can understand, such as changing input to a function.

  3. Pulling up the R Documentation for relevant functions, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

  4. Describing our problem to a search engine / generative LLM for more interpretable explanations.

And, when all else fails, we can prepare our code into a reproducible example for expert help.

While this is similar to our previous checklists, we can now understand these steps as a continuous cycle of isolating the problem into smaller and smaller chunks for a reproducible example. Whenever a step above helps us identify the specific area or aspect of our code that is failing, we can zoom in on it and restart the checklist. We can stop once we can no longer work out why our code fails. At that point, we can excise that area and seek further help using a reprex.

3.4 When should I prepare my code for a reprex?


There may be some point at which our code first aid no longer helps, and we still cannot figure out the problem with our code. In that case, it may be time to turn to expert help by asking a coworker, mentor, or someone online for aid.

While it is common practice in intro coding courses to call over the instructor with a raised hand and a statement such as “I don’t know what’s wrong,” in reality people have limited time, bandwidth, or requisite knowledge to help with every problem that might arise. Even if they can’t figure out a bug on their own, the practiced coder can identify and articulate the problem effectively, so that someone with available time and expertise can help out.

Content from Minimal Reproducible Code


Last updated on 2025-05-27

Overview

Questions

  • Why is it important to make a minimal code example?
  • Which part of my code is causing the problem?
  • Which parts of my code should I include in a minimal example?
  • How can I tell whether a code snippet is reproducible or not?
  • How can I make my code reproducible?

Objectives

  • Explain the value of a minimal code snippet.
  • Identify the problem area of a script.
  • Identify supporting parts of the code that are essential to include.
  • Simplify a script down to a minimal code example.
  • Evaluate whether a piece of code is reproducible as is or not. If not, identify what is missing.
  • Edit a piece of code to make it reproducible
  • Have a road map to follow to simplify your code.
  • Describe the {reprex} package and its uses

Mickey is interested in understanding how kangaroo rat weights differ across species and sexes, so they create a quick visualization.

R

ggplot(rodents, aes(x = species, fill = sex))+
  geom_bar()

Whoa, this is really overwhelming! Mickey forgot that the dataset includes data for a lot of different rodent species, not just kangaroo rats. Mickey is only interested in two kangaroo rat species: Dipodomys ordii (Ord’s kangaroo rat) and Dipodomys spectabilis (Banner-tailed kangaroo rat).

Mickey also notices that there are three categories for sex: F, M, and what looks like a blank field when there is no sex information available. For the purposes of comparing weights, Mickey wants to focus only on rodents of known sex.

Mickey filters the data to include only the two focal species and only rodents whose sex is F or M.

R

rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

Because these scientific names are long, Mickey also decides to add common names to the dataset. They start by creating a data frame with the common names, which they will then join to the rodents_subset dataset:

R

## Adding common names
common_names <- data.frame(species = unique(rodents_subset$species), common_name = c("Ord's", "Banner-tailed"))
common_names

OUTPUT

      species   common_name
1 spectabilis         Ord's
2       ordii Banner-tailed

But looking at the common names dataset reveals a problem!

Exercise 1a: Applying code first aid

  1. Is this a syntax error or a semantic error? Explain why.
  2. What “code first aid” steps might be appropriate here? Which ones are unlikely to be helpful?

Mickey re-orders the names and tries the code again. This time, it works! Now they can join the common names to rodents_subset.

R

common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-tailed"))
rodents_subset <- left_join(rodents_subset, common_names, by = "species")

Before moving on to answering their research question about kangaroo rat weights, Mickey also wants to create a date column, since they realized that having the dates stored in three separate columns (month, day, and year) might be hard for future analysis. They want to use lubridate to parse the dates. But here, too, they run into trouble.

R

rodents_subset <- rodents_subset %>%
  mutate(date = lubridate(paste(year, month, day, sep = "-")))

ERROR

Error in `mutate()`:
ℹ In argument: `date = lubridate(paste(year, month, day, sep = "-"))`.
Caused by error in `lubridate()`:
! could not find function "lubridate"

Exercise 1b: Applying code first aid, part 2

  1. Is this a syntax error or a semantic error? Explain why.
  2. What “code first aid” steps might be appropriate here?
  3. What would be your next step to fix this error, if you were Mickey?

Exercise 1c: Applying code first aid, part 2 (extra challenge)

Mickey tried several methods to create a date column. Here’s one of them.

R

test <- rodents_subset %>%
  mutate(date = lubridate::as_date(paste(day, month, year)))

WARNING

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = lubridate::as_date(paste(day, month, year))`.
Caused by warning:
! All formats failed to parse. No formats found.

  1. What type of error is this?
  2. What do you learn from the warning message? Why do you think this code causes a warning message, rather than an error message?
  3. Try some code first aid steps. What do you think happened here? How did you figure it out?

Mickey reads some of the lubridate documentation and changes their code so that the date column is created correctly.

R

rodents_subset <- rodents_subset %>%
  mutate(date = lubridate::ymd(paste(year, month, day, sep = "-")))
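
As a minimal sketch of why the corrected call works: ymd() expects the pieces in year, month, day order, so the pasted string must follow that same order. The toy values below are invented for illustration and are not taken from the rodents data.

```r
# Toy year/month/day values, invented for illustration only
toy <- data.frame(year = c(1977, 1989), month = c(7, 12), day = c(16, 5))

# Paste the pieces in year-month-day order...
date_strings <- paste(toy$year, toy$month, toy$day, sep = "-")

# ...then parse them with the lubridate parser that matches that order
toy$date <- lubridate::ymd(date_strings)

toy
class(toy$date)
```

The new column has class Date, so it can be sorted, plotted on a time axis, or used in date arithmetic.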

Now that the dataset is cleaned, Mickey is ready to start learning about kangaroo rat weights!

They start by running a quick linear regression to predict weight based on species and sex.

R

weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model) 

OUTPUT


Call:
lm(formula = weight ~ common_name + sex, data = rodents_subset)

Residuals:
     Min       1Q   Median       3Q      Max
-111.201   -6.466    2.534   10.799   45.799

Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      123.2007     0.8061  152.83   <2e-16 ***
common_nameOrd's -74.7342     1.3352  -55.97   <2e-16 ***
sexM                   NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.71 on 939 degrees of freedom
  (35 observations deleted due to missingness)
Multiple R-squared:  0.7694,	Adjusted R-squared:  0.7691
F-statistic:  3133 on 1 and 939 DF,  p-value: < 2.2e-16

The negative coefficient for common_nameOrd's tells Mickey that Ord’s kangaroo rats are significantly lighter than Banner-tailed kangaroo rats.

But something is wrong with the coefficients for sex. Why is everything NA for sexM?

When Mickey visualizes the data, they see a problem in the graph, too. As the model showed, Ord’s kangaroo rats are significantly smaller than Banner-tailed kangaroo rats. But something is definitely wrong! Because the boxes are colored by sex, we can see that all of the Banner-tailed kangaroo rats are male and all of the Ord’s kangaroo rats are female. That can’t be right! What are the chances of catching all one sex for two different species?

R

rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot()

WARNING

Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Mickey confirms this with a two-way frequency table.

R

table(rodents_subset$sex, rodents_subset$species)

OUTPUT


    ordii spectabilis
  F   350           0
  M     0         626

To double check, Mickey looks at the original dataset.

R

table(rodents$sex, rodents$species)

OUTPUT


    albigula eremicus flavus fulvescens fulviventer harrisi hispidus
          62       14     15          0           0     136        2
  F      474      372    222         46           3       0       68
  M      368      468    302         16           2       0       42

    leucogaster maniculatus megalotis merriami ordii penicillatus  sp.
             16           9        33       45     3            6   10
  F         373         160       637     2522   690          221    4
  M         397         248       680     3108   792          155    5

    spectabilis spilosoma taylori torridus
             42       149       0       28
  F        1135         1       0      390
  M        1232         1       3      441

Not only were there originally males and females present from both ordii and spectabilis, but the original numbers were way, way higher! It looks like somewhere along the way, Mickey lost a lot of observations.

[WORKING THROUGH CODE FIRST AID STEPS HERE] Mickey is feeling overwhelmed and not sure where their code went wrong. They were able to fix the errors and warning messages that they encountered so far, but this one seems more complicated, and there has been no clear indication of what went wrong. They work their way through the code first aid steps, but they are not able to solve the problem.

They decide to consult Remy’s road map to figure out what to do next.

Since code first aid was not enough to solve this problem, it looks like it’s time to ask for help using a reprex.

Making a reprex

Simplify the code

When asking someone else for help, it is important to simplify your code as much as possible to make it easier for the helper to understand what is wrong. Simplifying code helps to reduce frustration and overwhelm when debugging an error in a complicated script. The more that we can make the process of helping easy and painless for the helper, the more likely that they will take the time to help.

Callout

Depending on how closely you have been following the lesson and which challenges you have attempted, your script may not look exactly like Mickey’s. That’s okay!

Mickey has written a lot of code so far. The code is also a little messy–for example, after fixing the previous errors, they sometimes commented out the old code and kept it for future reference.

Create a new script

To make the task of simplifying the code less overwhelming, let’s create a separate script for our reprex. This will let us experiment with simplifying our code while keeping the original script intact.

Let’s create and save a new, blank R script and give it a name, such as “reprex-script.R”

Callout: Making an R script

There are several ways to make an R script:

  • File > New File > R Script
  • Click the white square with a green plus sign at the top left corner of your RStudio window
  • Use a keyboard shortcut: Cmd + Shift + N (on a Mac) or Ctrl + Shift + N (on Windows)

We’re going to start by copying over all of our code, so we have an exact copy of the full analysis script.

R

# Minimal reproducible example script
# Load packages and data
library(ggplot2)
library(dplyr)
rodents <- read.csv("data/surveys_complete_77_89.csv")

# XXX ADD PETER'S EPISODE CODE HERE

## Filter to only rodents
rodents <- rodents %>% filter(taxa == "Rodent")

# Visualize sex by species
ggplot(rodents, aes(x = species, fill = sex))+
  geom_bar()

# Subset to species and sexes of interest
rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

# Add common names
# common_names <- data.frame(species = unique(rodents_subset$species), common_name = c("Ord's", "Banner-tailed"))
# common_names # oops, this looks wrong!
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
common_names
rodents_subset <- left_join(rodents_subset, common_names)

# Add a date column
# rodents_subset <- rodents_subset %>%
#   mutate(date = lubridate(paste(year, month, day, sep = "-"))) # that didn't work!

rodents_subset <- rodents_subset %>%
  mutate(date = lubridate::ymd(paste(year, month, day, sep = "-")))

# Predict weight by species and sex, and make a plot
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model) 
rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?

# Investigate
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

Now, we will follow a process:

  1. Identify the symptom of the problem.
  2. Remove a piece of code to make the reprex more minimal.
  3. Re-run the reprex to make sure the reduced code still demonstrates the problem (check that the symptom is still present).

In this case, the symptom is that we are missing rows in rodents_subset that were present in rodents and should not have been removed!

Let’s start by identifying pieces of code that we can probably remove. A good start is to look for lines of code that do not create variables for later use, or lines that add complexity to the analysis that is not relevant to the problem at hand.

We can start by removing the broken code that we commented out earlier. Also, adding the date column is not directly relevant to the current problem. Let’s go ahead and remove those pieces of code. Now our script looks like this:

R

# Minimal reproducible example script
# Load packages and data
library(ggplot2)
library(dplyr)
rodents <- read.csv("data/surveys_complete_77_89.csv")

# XXX ADD PETER'S EPISODE CODE HERE

## Filter to only rodents
rodents <- rodents %>% filter(taxa == "Rodent")

# Visualize sex by species
ggplot(rodents, aes(x = species, fill = sex))+
  geom_bar()

# Subset to species and sexes of interest
rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

# Add common names
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
common_names
rodents_subset <- left_join(rodents_subset, common_names)

# Predict weight by species and sex, and make a plot
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model) 
rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?

# Investigate
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

When we run this code, we can confirm that it still demonstrates our problem. There are still many rows missing from rodents_subset.

We’ve made progress on minimizing our code, but we still have a ways to go. This script is still pretty long! Let’s identify more pieces of code that we can remove.

Exercise 2: Minimizing code

Which other lines of code can you remove to make this script more minimal? After removing each one, be sure to re-run the code to make sure that it still reproduces the problem.

  • [Peter’s episode code]
  • Visualizing sex by species (ggplot) can be removed because it generates a plot but does not create any variables that are used later.
  • Filtering to only rodents can be removed because later we filter to only two species in particular.
  • Adding common names can be removed because we didn’t actually use those common names. This one is tricky because technically we did use the common names in the rodents_subset plot. But is that plot really necessary? We can still demonstrate the problem using the table() lines of code at the end. Also, we could still make the equivalent plot using the species column instead of the common_name column, and it would demonstrate the same thing!
  • The weight model and the summary can be removed because the table() calls already demonstrate the problem without them.

A totally minimal script would look like this:

R

rodents <- read.csv("data/surveys_complete_77_89.csv")

rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

Great, now we have a totally minimal script!

However, we’re not done yet.

Exercise 3: The problem area is not enough

Let’s suppose that Mickey has created the minimal problem area script shown above. They email this script to Remy so that Remy can help them debug the code.

Remy opens up the script and tries to run it on their computer, but it doesn’t work.

  • What do you think will happen when Remy tries to run the code from this reprex script?
  • What do you think Mickey should do next to improve the minimal reproducible example?

We haven’t yet included enough code to allow a helper, such as Remy, to run the code on their own computer. If Remy tries to run the reprex script in its current state, they will encounter errors because they don’t have access to the same R environment that Mickey does.

Include dependencies

R code consists primarily of functions and variables. In order to make our minimal examples truly reproducible, we have to give our helpers access to all the functions and variables that are necessary to run our code.

First, let’s talk about functions. Functions in R typically come from packages. You can access them by loading the package into your environment.

To make sure that your helper has access to the packages necessary to run your reprex, you will need to include calls to library() for whichever packages are used in the code. For example, if your code uses the function lmer from the lme4 package, you would have to include library(lme4) at the top of your reprex script to make sure your helper has the lme4 package loaded and can run your code.

Callout: Default packages

Some packages, such as {base} and {stats}, are loaded in R by default, so you might not have realized that a lot of functions, such as dim, colSums, factor, and length actually come from those packages!

You can see a complete list of the functions that come from the {base} and {stats} packages by running library(help = "base") or library(help = "stats").
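
If you are not sure which package a function comes from, one option is to ask R where it found the name. This is a sketch using find() from the {utils} package, which is attached by default in a standard R session; it only searches packages that are currently attached. The header of a help page gives the same information (for example, ?table shows "table {base}").

```r
# Ask R which attached package provides each function name
utils::find("table")  # "package:base"
utils::find("lm")     # "package:stats"

# The same lookup for several function names at once
funs <- c("dim", "colSums", "factor", "length")
origins <- vapply(funs, function(f) utils::find(f)[1], character(1))
origins
```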

Let’s do this for our own reprex. We can start by identifying all the functions used, and then we can figure out where each function comes from to make sure that we tell our helper to load the right packages.

The first function used in our example is ggplot(), which comes from the package ggplot2. Therefore, we know we will need to add library(ggplot2) at the top of our script.

The function geom_boxplot() also comes from ggplot2. The pipe operator %>% is a function too: it comes from the {magrittr} package and is made available when {dplyr} is loaded, so we will also add library(dplyr). We also used the function table(). Running ?table tells us that the table function comes from the package {base}, which is automatically installed and loaded when you use R, so we don’t need to include library(base) in our script.

Our reprex script now looks like this:

R

# Mickey's reprex script

# Load necessary packages to run the code
library(ggplot2)
library(dplyr) # provides the %>% pipe used below

rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?

WARNING

Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_boxplot()`).

R

# Investigate
table(rodents_subset$sex, rodents_subset$species)

OUTPUT


    ordii spectabilis
  F   350           0
  M     0         626

R

table(rodents$sex, rodents$species)

OUTPUT


    albigula eremicus flavus fulvescens fulviventer harrisi hispidus
          62       14     15          0           0     136        2
  F      474      372    222         46           3       0       68
  M      368      468    302         16           2       0       42

    leucogaster maniculatus megalotis merriami ordii penicillatus  sp.
             16           9        33       45     3            6   10
  F         373         160       637     2522   690          221    4
  M         397         248       680     3108   792          155    5

    spectabilis spilosoma taylori torridus
             42       149       0       28
  F        1135         1       0      390
  M        1232         1       3      441

Callout: Installing vs. loading packages

But what if our helper doesn’t have all of these packages installed? Doesn’t that make the code non-reproducible?

Typically, we don’t include install.packages() in our code for each of the packages that we include in the library() calls, because install.packages() is a one-time piece of code that doesn’t need to be repeated every time the script is run. We assume that our helper will see library(specialpackage) and know that they need to go install “specialpackage” on their own.

Technically, this makes that part of the code not reproducible! But it’s also much more “polite”. Our helper might have their own way of managing package versions, and forcing them to install a package when they run our code risks messing up their workflow. It is a common convention to stick with library() and let them figure it out from there.
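
If you want to make a helper’s life a little easier without installing anything on their behalf, one hedged option is to check which packages are missing and simply report them. This sketch uses requireNamespace() from base R; the names in pkgs are placeholders for illustration, and the second one is deliberately made up.

```r
# Packages the reprex needs; "notARealPackage123" is a made-up name for illustration
pkgs <- c("stats", "notARealPackage123")

# requireNamespace() returns TRUE or FALSE and never installs anything
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report what is missing and leave the installing to the helper
missing_pkgs <- pkgs[!available]
if (length(missing_pkgs) > 0) {
  message("Please install before running: ", paste(missing_pkgs, collapse = ", "))
}
```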

Exercise 4: Which packages are essential?

In each of the following code snippets, identify the necessary packages (or other code) to make the example reproducible.

a.

weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
tab_model(weight_model)

b.

mod <- lmer(weight ~ hindfoot_length + (1|plot_type), data = rodents)
summary(mod)

c.

rodents_processed <- process_rodents_data(rodents)
glimpse(rodents_processed)

This exercise should take about 10 minutes.

:::solution

  a. lm comes from {stats}, which is loaded by default, so there’s no package needed for that. tab_model comes from the package sjPlot. You could add library(sjPlot) to this code to make it reproducible.
  b. lmer is a linear mixed modeling function that comes from the package lme4. summary is from base R. You could add library(lme4) to this code to make it reproducible.
  c. process_rodents_data is not from any package that we know of, so it is probably a function that the author wrote themselves. In order to make this example reproducible, you would have to include the definition of process_rodents_data. glimpse is probably from dplyr, but it’s worth noting that there is also a glimpse function in the pillar package, so this might be ambiguous. This is another reason it’s important to specify your packages: if you leave your helper guessing, they might load the wrong package and misunderstand your error!

:::::::::::::::::::::::::::::::::::::::::::

Including library() calls will definitely help Remy run the code. But this code still won’t work as written because Remy does not have access to the same objects that Mickey used in the code.

The code as written relies on rodents_subset, which Remy will not have access to if they try to run the code. That means that we’ve succeeded in making our example minimal, but it is not reproducible: it does not allow someone else to reproduce the problem!
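
To see what Remy would run into, here is a small sketch of the same kind of failure. The object name below is hypothetical and is deliberately never created, so it stands in for rodents_subset in a fresh R session.

```r
# `object_only_on_mickeys_computer` is a hypothetical name that is never defined,
# mimicking an object that exists in Mickey's environment but not in Remy's
helper_result <- tryCatch(
  table(object_only_on_mickeys_computer$sex),
  error = function(e) conditionMessage(e)
)

# The captured error message says that the object could not be found
helper_result
```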

[Transition to minimal data episode]

Exercise 5: Reflection

Let’s take a moment to reflect on this process.

  • What’s one thing you learned in this episode? An insight; a new skill; a process?

  • What is one thing you’re still confused about? What questions do you have?

This exercise should take about 5 minutes.

Key Points

  • Making a reprex is the next step after trying code first aid.
  • In order to make a good reprex, it is important to simplify your code.
  • Simplify code by removing parts not directly related to the question.
  • Give helpers access to the functions used in your code by loading all necessary packages.

Content from Minimal Reproducible Data


Last updated on 2025-05-27

Overview

Questions

  • What is a minimal reproducible dataset, and why do I need it?
  • How do I create a minimal reproducible dataset?
  • Can I just use my own data?

Objectives

  • Describe a minimal reproducible dataset
  • Identify the aspects of your data necessary to replicate your issue
  • Create a dataset from scratch to replicate your issue
  • Share your own dataset in a way that is minimal, accessible, and reproducible
  • Subset an existing dataset to replicate your issue

3.1 What is a minimal reproducible dataset and why do I need it?


Now that you have narrowed down your problem area and stripped down your code to make it minimal, you need to ensure it is reproducible. This means it needs to be accessible and runnable, so that anyone else can copy-paste it into their system, run the code, and replicate your issue. Importantly, example code will always require example data in order to run! Therefore, every reprex requires you to provide a minimal reproducible dataset to use with the code.

Furthermore, as we have seen previously, sometimes the source of the problem isn’t actually your code, but rather your data! By providing an example dataset that, when used in your example code, still replicates your issue, you also give your helper the opportunity to better investigate and manipulate that data to fix your issue. It would be great if we could give the helper our entire computer so they could just take over where we left off, but usually we can’t.

Just as we did with our code, when providing such an example dataset we also want to make sure we keep it minimal–free of unnecessary data. This will allow your helper to more clearly see what the data looks like and what the source of your issue may be. Furthermore, it will allow you to not only better understand your data but also potentially work out the source of your issue. When extraneous information is removed and only the parts that replicate the issue are kept, we can begin to see where the issues arise.

In short, a minimal reproducible dataset must be:

  • minimal: it only contains the necessary information to run your minimal code. You can also think of this as being relevant to the problem: keep only what is necessary.
  • reproducible: it must be accessible to someone without your computer, and it must consistently replicate your output/issue. This means it also needs to be complete, meaning there are no dependencies that have been omitted.

Remember: your helper may not be in the room with you or have access to your computer and the files that are on it!

You might be used to always loading data from separate files, but helpers can’t access those files. Even if you sent someone a file, they would need to put it in the right directory, make sure to load it in exactly the same way, make sure their computer can open it, etc. Since the goal is to make everyone’s lives as easy as possible, we need to think about our data in a different way: as a dummy object created in the script itself.

Pro-tip

Examples of minimal reproducible examples can also be found in R’s help pages, which you can open in RStudio with ?function_name. Just scroll all the way down to the Examples section. These examples will usually be minimal and reproducible.

For example, let’s look at the function mean:

R

?mean

We see examples that can be run directly on the console, with no additional code.

R

x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))

OUTPUT

[1] 8.75 5.50

In this case, x is the dummy dataset consisting of just 1 variable. Notice how it was created as part of the example.

Exercise 1

These datasets are not well suited for use in a reprex. For each one, try to reproduce the dataset on your own in R (copy-paste). Does it work? What happened? Explain.

  1. Screenshot of the ratdat complete_old dataset.
  2. sample_data <- read_csv("/Users/kaija/Desktop/RProjects/ResearchProject/data/sample_dataset.csv")
  3. dput(complete_old[1:100,])
  4. sample_data <- data.frame(species = species_vector, weight = c(10, 25, 14, 26, 30, 17))

  1. Not reproducible because it is a screenshot: the values cannot be copied and pasted into R.
  2. Not reproducible because it is a path to a file that only exists on someone else’s computer and therefore you do not have access to it using that path.
  3. Not minimal, it has far too many columns and probably too many rows. It is also not reproducible because we were not given the source for complete_old.
  4. Not reproducible because we are not given the source for species_vector.

Exercise 2

Let’s say we want to know the average weight of all the species in our rodents dataset. We try to use the following code…

R

mean(rodents$weight)

OUTPUT

[1] NA

…but it returns NA! We don’t know why that is happening and we want to ask for help.

Which of the following represents a minimal reproducible dataset for this code? Can you describe why the other ones are not?

  1. sample_data <- data.frame(month = rep(7:9, each = 2), hindfoot_length = c(10, 25, 14, 26, 30, 17))
  2. sample_data <- data.frame(weight = rnorm(10))
  3. sample_data <- data.frame(weight = c(100, NA, 30, 60, 40, NA))
  4. sample_data <- sample(rodents$weight, 10)
  5. sample_data <- rodents_modified[1:20,]

The correct answer is 3!

  1. does not include the variable of interest (weight).
  2. does not produce the same problem (an NA result): rnorm() never generates NA values, so the code runs just fine.
  3. minimal and reproducible.
  4. is not reproducible. sample() randomly draws 10 items; sometimes they may include NAs, sometimes they may not, so it is not guaranteed to reproduce the problem. It also relies on the rodents object, which your helper does not have. Random sampling can be used if a seed is set (see next section for more info).
  5. uses a dataset that isn’t accessible without previous data wrangling code–the object rodents_modified doesn’t exist.
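
A quick check of option 3, as a sketch: running the problem code on the small dataset gives the same NA result as the original, so the dataset reproduces the issue.

```r
# Option 3 from the exercise: a tiny dataset that includes missing weights
sample_data <- data.frame(weight = c(100, NA, 30, 60, 40, NA))

# Same call as the original problem, now on the minimal dataset
mean(sample_data$weight)  # returns NA, just like mean(rodents$weight)
```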

3.2 Can I just use my own data?


At this point you may be wondering why you need a separate dataset. Can’t you just use your own data, as long as you make sure it is minimal and your helper can access it?

Callout

There are several reasons why you might need to create a separate dataset that is minimal and reproducible instead of trying to use your actual dataset. The original dataset may be:

  • too large - the Portal dataset is ~35,000 rows with 13 columns and contains data for decades. That’s a lot!
  • private - your dataset might not be published yet, or maybe you’re studying an endangered species whose locations can’t easily be shared. Another example: many medical datasets cannot be shared publicly.
  • hard to send - on most online forums, you can’t attach supplemental files (more on this later). Even if you are just sending data to a colleague, file paths can get complicated, the data might be too large to attach, etc.
  • complicated - it would be hard to locate the relevant information. One example to steer away from is a dataset that needs a ‘data dictionary’ to explain what the columns mean (e.g. what is “plot type” in ratdat?). We don’t want our helper to waste valuable time figuring out what everything means.
  • highly derived/modified from the original file. You may have already done a bunch of preliminary data wrangling you don’t want to include when you send the example, so you would need to provide the intermediate dataset directly to your helper.

If you were wondering whether you could just use your own data, you wouldn’t be entirely wrong. While there could be several ways in which your original data may be inaccessible or hard to derive or subset, there are likely just as many ways you could make it anonymous, minimal, and reproducible. And we will show you how! Nevertheless, making your own data minimal and reproducible isn’t necessarily easier than creating a new dataset from scratch. Furthermore, creating a dataset from scratch can often highlight the source of your issue, which means you might not need to ask for help after all, or you can ask a more specific question.

3.3 How do I create a minimal reproducible dataset?


In general, there are 3 common ways to provide minimal reproducible data for a reprex.

  • You can write a script that creates a new “dummy” dataset with the same key characteristics as your original data.

  • You can make your own data minimal and reproducible.

  • You can use a dataset that is already embedded in R and is therefore already accessible.

For the purpose of this lesson, we believe each coder is entitled to all the options, so we will walk you through how to provide a minimal reproducible dataset using each of these 3 methods. However, opinions generally differ on which method is best for which situations. Below we compiled a summary of the advantages and disadvantages of each method, based on many conversations with several data science groups.

Data from Scratch

Advantages:

  • Often the most concise
  • Easiest for helpers
  • Helps you problem-solve along the way (e.g., identify what data aspects are generating the problem)
  • Easy to share, collaborate, teach, and understand
  • More universally applicable
  • Avoids privacy/security concerns
  • Lets you clearly illustrate the sought outcome
  • Uses important-to-learn skills
  • Easier for more skilled individuals

Disadvantages (while some disadvantages are universal, many apply mostly to novices):

  • Can be intimidating
  • Requires good understanding of your data
  • Harder to generate if the error is idiosyncratic or dependent on having a large dataset
  • Time-consuming if unskilled
  • Iterative (you may need to trial and error a few times to replicate the problem)
  • More likely to require analogies–less interpretable/connected to real problems, more likely to require greater context
  • Risks generating XY problems–make sure you are asking the right question/replicating the right problem

R-built Data

Advantages:

  • Simple and easy to share–no need to provide additional data
  • Familiar–helpers already know the data structure
  • Potentially faster than starting from scratch (i.e. faster than generating rows/columns de novo)
  • No privacy/security concerns
  • Can easily be manipulated or simplified
  • Generalizes the problem

Disadvantages:

  • May require a good mental model of the problem
  • Harder if the error is idiosyncratic
  • Greatest risk of generating XY problems–make sure you are asking the right question/replicating the right problem
  • Iterative (you may need to trial and error a few times to replicate the problem)
  • Need to re-think the question so it matches a different dataset and context–mental gymnastics
  • Still need to choose which dataset and variables are more appropriate

Your Data

Advantages:

  • Can require the least mental effort
  • Problem is easy to replicate even if you don’t understand it at all
  • Accurately represents the actual problem; avoids XY problems
  • Richer context may intrigue/motivate helpers
  • Can be quicker if dataset is small
  • May be required for idiosyncratic problems that are based on aspects of the data itself that you don’t know about (e.g. when the data itself, rather than the code per se, is central to the problem)
  • Captures data structures that are difficult or time-consuming to replicate if you are a novice

Disadvantages:

  • Less growth-minded if chosen as the “easy way out”–skips the learning process of trying to construct a dataset and any insights that that process might give you
  • Easy to do poorly and think that you’re doing it well; perceived “easiness” leads to overcomplication/confusion
  • Leaves all the work to the helpers if you don’t also work to minimize it–less motivation for harder work
  • Could obfuscate the problem if not minimized–less likely to find a quick answer
  • More difficult to share–may be large/messy
  • Security/privacy/sanitization problems

3.4 Creating a “dummy” dataset from scratch


While this might be the most daunting option for novices, it tends to be the preferred method for experts. That’s probably because, once you really understand the basic building blocks, it becomes the most straightforward method of creating a minimal reproducible dataset. This is also the method that makes most sense when doing other activities that also require a reprex (e.g., teaching, collaborating, developing). Lastly, in this lesson we believe there are greater problem-solving insights to be gained by creating a new “dummy” dataset. So let’s start by making this scary process more easily digestible!

Usually, at this stage, you would have 2 pressing questions:

  1. How do I create a dataset?
  2. How do I recreate the key elements of my data that replicate my issue?

Let’s start with the first.

There are many ways one can create a dataset in R.

You can start by creating vectors using c()

R

vector <- c(1,2,3,4) 
vector

OUTPUT

[1] 1 2 3 4

You can also add some randomness by sampling from a vector using sample().

For example you can sample numbers 1 through 10 in a random order

R

x <- sample(1:10)
x

OUTPUT

 [1]  6  9 10  4  1  7  3  8  2  5

Or you can randomly sample from a normal distribution

R

x <- rnorm(10)
x

OUTPUT

 [1] -0.7645173 -0.9526317 -1.2161516  0.6193248 -0.2540016  0.9599602
 [7]  0.1201105  0.9103763  0.5288797  0.7116196

You can also sample from letters to create a categorical variable.

R

x <- sample(letters[1:4], 20, replace=T)
x

OUTPUT

 [1] "d" "c" "c" "a" "d" "c" "d" "a" "d" "b" "c" "b" "b" "d" "c" "a" "d" "b" "d"
[20] "d"

Remember that a data frame is just a collection of vectors. You can create a data frame using data.frame() (or tibble() from the {tibble} package, which is loaded with the tidyverse), supplying one vector for each variable.

R

data <- data.frame(x = sample(letters[1:3], 20, replace = TRUE), 
                   y = rnorm(20)) # 20 draws from a standard normal distribution
head(data)

OUTPUT

  x          y
1 b -0.4722334
2 a -0.6265888
3 b  1.6929309
4 a  1.6969200
5 a  0.8845575
6 a  1.3272655

However, when sampling at random, remember to call set.seed() first and include it in the code you send, so that you and your helper both get the same numbers!

Callout

For more handy functions for creating data frames and variables, see the cheatsheet. Some questions need variables in a specific format. For these, you can use one of the as.*() conversion functions, such as as.factor(), as.integer(), as.numeric(), as.character(), or as.Date() (and as.xts() from the {xts} package for time series).
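As a rough sketch of a few of those conversion functions in action, the snippet below converts character vectors to the types a question might call for. The vector names (`raw_ids`, `raw_sex`, `raw_dates`) and their values are invented for illustration.

```r
# Hypothetical raw values, all stored as character strings
raw_ids   <- c("1", "2", "3")
raw_sex   <- c("M", "F", "M")
raw_dates <- c("1977-07-18", "1977-08-19", "1977-08-20")

ids   <- as.integer(raw_ids)  # character -> integer
sex   <- as.factor(raw_sex)   # character -> factor (levels sorted alphabetically)
dates <- as.Date(raw_dates)   # character in year-month-day format -> Date

# Check the resulting types
str(list(ids = ids, sex = sex, dates = dates))
```

Checking the result with str() is a quick way to confirm that each variable ended up in the format you intended before you share it.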

Exercise 3: Your turn! (Optional)

Create a data frame with:

  A. One categorical variable with 2 levels and one continuous variable.
  B. One continuous variable that is normally distributed.
  C. Name, sex, age, and treatment type.

3.5 Identifying the key elements of your data


No matter which approach we take for providing a dataset, we need to identify which elements of our original data are necessary. To do so, we propose starting with a few simple questions to investigate your data:

  • How many variables do we need?
  • What data type (discrete or continuous) is each variable?
  • How many levels and/or observations are necessary?
  • Should the values be distributed in a specific way?
  • Are there any NAs that could be relevant?
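The questions above can usually be answered with a handful of base R functions. The sketch below uses `airquality`, a dataset that ships with R, purely as a stand-in for your own data; the choice of columns is an assumption made for illustration.

```r
# airquality ships with R, so anyone can run this
str(airquality)                           # how many variables, and the type of each
summary(airquality$Ozone)                 # rough distribution of a continuous variable
table(airquality$Month, useNA = "ifany")  # levels and number of observations per level
colSums(is.na(airquality))                # number of NAs in each column
```

Running the same calls on your own data frame gives a quick inventory of what your minimal dataset may need to preserve.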

Let’s come back to our kangaroo rats example. Here is the minimal code we settled on:

R

# Minimal code
krats_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(krats_subset$sex, krats_subset$species)

OUTPUT


    ordii spectabilis
  F   350           0
  M     0         626

So we want to create a minimal reproducible ‘dummy’ version of krats_subset. Let’s start by taking a quick look, then answering the questions.

R

head(krats_subset)

OUTPUT

  record_id month day year plot_id species_id sex hindfoot_length weight
1        58     7  18 1977      12         DS   M              45     NA
2        80     8  19 1977       1         DS   M              48     NA
3       104     8  20 1977      11         DS   M              43     NA
4       144     8  21 1977      15         DS   M              40     NA
5       176     9  11 1977      12         DS   M              45     NA
6       182     9  11 1977      21         DS   M              49     NA
      genus     species   taxa                plot_type
1 Dipodomys spectabilis Rodent                  Control
2 Dipodomys spectabilis Rodent        Spectab exclosure
3 Dipodomys spectabilis Rodent                  Control
4 Dipodomys spectabilis Rodent Long-term Krat Exclosure
5 Dipodomys spectabilis Rodent                  Control
6 Dipodomys spectabilis Rodent Long-term Krat Exclosure

Exercise 4

Try to answer the following questions on your own to determine what we need to include in our minimal reproducible dataset:

  1. How many variables do we need?
  2. What data type (discrete or continuous) is each variable?
  3. How many levels and/or observations are necessary?
  4. Should the values be distributed in a specific way?
  5. Are there any NAs that could be relevant?

Let’s go over the answers together and build a dataset as we go along!

  1. How many variables do we need?

We need species, sex, and maybe a third identifier like record_id. This means we potentially need 3 vectors (remember, each column in a data frame is essentially just a vector).

  2. What data type (discrete or continuous) is each variable?

Species and sex are both discrete (categorical) variables, while record_id is numeric, so we will treat it as continuous.

  3. How many levels and/or observations are necessary?

Since we are filtering our dataset down to 2 categories for both species and sex, we need at least 3 levels in each to start with. In terms of the number of observations, there don’t seem to be specific restrictions, other than that we probably want at least 1 observation per original category, so 2*3=6, or we can just pick a nice round number like 10.

  4. Should the values be distributed in a specific way?

This question probably isn’t going to be relevant most of the time, but it is certainly worth considering. If we needed a longer dataset of measurements, we may have wanted to make sure it was normally distributed. If we needed a longer dataset of counts, we may have wanted to make sure it was Poisson distributed. Or maybe we had bimodal data. But in this case, we had a short dataset and I don’t think it matters. We can always come back to this if we are unable to replicate our issue (hint: in which case the distribution may be related to the issue).
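If the distribution did matter, base R has a random-number function for each of the cases mentioned above. This is only a sketch: the variable names, means, standard deviations, and lambda are arbitrary values chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the draws are repeatable

measurements <- rnorm(100, mean = 35, sd = 5)    # roughly normal, e.g. lengths
counts       <- rpois(100, lambda = 3)           # Poisson-distributed counts
bimodal      <- c(rnorm(50, mean = 20, sd = 2),  # two separate clusters give
                  rnorm(50, mean = 45, sd = 2))  # a bimodal shape

# Quick text-only checks of the shapes
summary(measurements)
table(counts)
hist(bimodal, plot = FALSE)$counts  # bin counts without opening a plot window
```

Dropping `plot = FALSE` draws the histogram instead, which is often the fastest way to confirm the shape is what you intended.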

  5. Are there any NAs that could be relevant?

We don’t have any NAs but we do have a blank category under sex. For all we know that could be important, so maybe we want to also make one of our categories blank.

R

# We need 3 variables: species, sex, and record_id
# species and sex are categorical with at least 3 levels, one of which is blank for sex
species <- sample(letters, 3, replace=F)
print(species)

OUTPUT

[1] "o" "g" "k"

R

sex <- c('M','F','')
print(sex)

OUTPUT

[1] "M" "F" "" 

R

# record_id is continuous 
record_id <- 1:10
print(record_id)

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

R

# Now let's go sampling and put it all together
sample_data <- data.frame(
  record_id = record_id,
  species = sample(species, 10, replace=T),
  sex = sample(sex, 10, replace=T)
)
print(sample_data)

OUTPUT

   record_id species sex
1          1       g   M
2          2       o   M
3          3       o
4          4       g   F
5          5       g   F
6          6       g
7          7       k
8          8       k
9          9       k   M
10        10       k   F

And just like that we created a ‘dummy’ dataset from scratch! Notice that we could also have built the same type of dataset in a single step by creating each vector directly within data.frame():

R

sample2_data <- data.frame(
  record_id = 1:10,
  species = sample(letters[1:3], 10, replace=T),
  sex = sample(c('M','F',''), 10, replace=T)
)
print(sample2_data)

OUTPUT

   record_id species sex
1          1       c   M
2          2       c   M
3          3       c   F
4          4       c
5          5       c   M
6          6       c
7          7       c   F
8          8       a
9          9       c   M
10        10       b   F

Important: sample() is an inherently random process, so the output will differ each time you run it. If you want the output to be EXACTLY the same each time, you must first use set.seed() and share that with your helper too.

R

set.seed(1) # set seed before recreating the sample
sample_data <- data.frame(
  record_id = 1:10,
  species = sample(letters[1:3], 10, replace=T),
  sex = sample(c('M','F',''), 10, replace=T)
)
print(sample_data)

OUTPUT

   record_id species sex
1          1       a
2          2       c   M
3          3       a   M
4          4       b   M
5          5       a   F
6          6       c   F
7          7       c   F
8          8       b   F
9          9       b
10        10       c   M

Callout

Adding set.seed() at the start of your reprex ensures that anyone else who runs the same code in the same order will get the same results. However, if you use it more generally, you may want to read more about it. For example, in the example below we set a seed of 2 and then run sample(10) twice. You will notice that the two calls give different outputs. However, if you run the whole code again, starting from set.seed(2), each call gives the same output as it did the first time.

R

set.seed(2)
sample(10)

OUTPUT

 [1]  5  6  9  1 10  7  4  8  3  2

R

sample(10)

OUTPUT

 [1]  1  3  6  2  9 10  7  5  4  8
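One way to convince yourself of this behaviour is to store the draws and compare them with identical(). This is a small sketch; the object names (`first_a`, `second_a`, and so on) are made up for the comparison.

```r
set.seed(2)
first_a  <- sample(10)
second_a <- sample(10)

set.seed(2)  # reset the seed and repeat the same two calls in the same order
first_b  <- sample(10)
second_b <- sample(10)

identical(first_a, first_b)    # same seed, same position in the sequence
identical(second_a, second_b)  # second call after the same seed
identical(first_a, second_a)   # consecutive calls after one seed
```

The first two comparisons return TRUE because the seed fixes the whole sequence of random draws, not just the first one. The third compares two different points in that sequence, which is why the two sample(10) outputs shown above differ.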

Great! Now we need to check whether it works within our code and whether it reproduces our issue.

R

# Minimal code
sample_subset <- sample_data %>% # replace rodents with our sample dataset
  filter(species == c("a", "b"), # replace species with those from our sample dataset
         sex == c("F", "M")) # this can stay the same because we recreated it the same

table(sample_subset$sex, sample_subset$species)

OUTPUT


    a b
  F 1 0
  M 0 1

It works! Our sample has unexpectedly been reduced to just 2 observations, when we would have expected 4, based on the sample_data output above (rows 3, 4, 5, and 8 have species a or b and sex F or M). Wherever the issue lies, we were able to successfully replicate it in our minimal ‘dummy’ dataset.
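Rather than counting rows by eye, one way to work out how many rows to expect is to cross-tabulate the dummy data before filtering. The sketch below rebuilds sample_data with the same seed; the names `counts` and `expected_rows` are introduced here for illustration, and the exact numbers may depend on your R version's random number generator.

```r
set.seed(1)
sample_data <- data.frame(
  record_id = 1:10,
  species = sample(letters[1:3], 10, replace = TRUE),
  sex = sample(c("M", "F", ""), 10, replace = TRUE)
)

# Cross-tabulate every species/sex combination in the dummy data.
# Explicit factor levels keep empty combinations in the table as zeros.
counts <- table(
  species = factor(sample_data$species, levels = c("a", "b", "c")),
  sex = factor(sample_data$sex, levels = c("F", "M", ""))
)
counts

# Number of rows with species a or b AND sex F or M
expected_rows <- sum(counts[c("a", "b"), c("F", "M")])
expected_rows
```

Comparing `expected_rows` with the number of rows the filter actually returns makes the mismatch explicit, which is useful evidence to include when asking for help.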

Exercise 5: Your turn!

Now practice doing it yourself. Create a data frame with:

  A. One categorical variable with 2 levels and one continuous variable.
  B. One continuous variable that is normally distributed.
  C. Name, sex, age, and treatment type.
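If you get stuck, here is one possible sketch for each part; many other answers are equally valid. All names and values (`df_a`, `group`, `height`, the people in `df_c`, and so on) are invented for illustration.

```r
set.seed(123)  # arbitrary seed so the random draws are repeatable

# A. One categorical variable with 2 levels and one continuous variable
df_a <- data.frame(
  group = rep(c("control", "treated"), each = 5),  # guarantees both levels appear
  value = runif(10, min = 0, max = 100)
)

# B. One continuous variable that is normally distributed
df_b <- data.frame(height = rnorm(10, mean = 170, sd = 10))

# C. Name, sex, age, and treatment type
df_c <- data.frame(
  name = c("Ana", "Ben", "Chen", "Dee"),
  sex = c("F", "M", "M", "F"),
  age = c(34, 27, 45, 52),
  treatment = c("placebo", "drug", "drug", "placebo")
)

str(df_a)
str(df_b)
str(df_c)
```

Using rep() instead of sample() for the categorical variable in part A is a deliberate choice: with a small random sample, one of the two levels could be missing by chance.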

3.6 Using your own data set


Even once you master the art of creating ‘dummy’ datasets, there may be occasions when your data or your issue is too complex and you can’t seem to replicate the issue. Or maybe you still think using your own data would just be easier.

In cases when you want to make your own data minimal and reproducible, you will want to take a similar approach to what we did in Episode 2 when making our code minimal. Keep what is essential, get rid of the rest. In other words, we will want to subset our data into a smaller, more digestible chunk.

The question still arises: how do I know what is essential?

Use the same guiding questions that we used earlier!

  1. How many (or rather which) variables do we need?
  2. What data type is each variable? (less necessary, since we are keeping the actual variables)
  3. How many levels and/or observations are necessary? (potentially still useful, we don’t want to get rid of more than we need)
  4. Should the values be distributed in a specific way? (they are as they are, but worth keeping in mind in terms of how we are removing observations)
  5. Are there any NAs that could be relevant?

Based on our previous answers we end up with:

  1. We need species, sex, and maybe record_id
  2. Species and sex are categorical, record_id is a continuous count of our observations.
  3. As we said earlier, we want 3 each for species and sex, which happens to already be the case. And we could reduce our record_id size to ~10.
  4. Not really, but we want to make sure that when we reduce the number of observations we still have observations in each of the 3 levels in species and sex.
  5. No NA’s, but we still don’t know if the blanks are relevant, so let’s make sure we keep at least one.

Now that we have a clearer goal, let’s subset our data.

Useful functions for subsetting a dataset include subset(), head(), tail(), and indexing with [] (e.g., iris[1:4,]). Alternatively, you can use tidyverse functions like select() and filter(). You can also use the same sample() function we covered earlier.
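As a quick reminder of how the base R options behave, here is a sketch using the built-in iris dataset. The particular rows, columns, and condition are arbitrary choices for illustration.

```r
head(iris, 3)                            # first 3 rows
tail(iris, 3)                            # last 3 rows
iris[1:4, c("Species", "Sepal.Length")]  # rows 1-4 and two columns, by indexing

# Rows matching a condition, keeping only two columns
subset(iris, Species == "setosa" & Sepal.Length > 5.5,
       select = c(Species, Sepal.Length))

# A random subset of 5 rows; set a seed so a helper gets the same rows
set.seed(1)
iris[sample(nrow(iris), 5), ]
```

Note that head() and tail() only see the ends of the data, so a random or condition-based subset is often a better fit when the rows you need are scattered throughout the dataset.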

Note: you should already have an understanding of how to subset or wrangle data using the tidyverse from the R for Ecology lesson. If not, go check it out! [insert link to lesson]

R

# Remember your minimal code
krats_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(krats_subset$sex, krats_subset$species)

OUTPUT


    ordii spectabilis
  F   350           0
  M     0         626

Exercise 6: Think quick!

Which dataset are we trying to make minimal and reproducible? Hint: the two datasets we can see are krats_subset and rodents

Given that the code that is going wrong is the code that creates krats_subset, we need to create a minimal reproducible version of rodents! We can then insert our new_rodents dataset in place of the original rodents one.

Step 1: select the variables of interest

R

# subset rodents into new_rodents to make it minimal
# Note: there are many ways you could do this!
new_rodents <- rodents %>% 
  # 1. select the variables of interest
  select(record_id, species, sex)
# PAUSE. Does this work so far?
print(new_rodents)

OUTPUT

   record_id      species sex
1          1     albigula   M
2          2     albigula   M
3          3     merriami   F
4          4     merriami   M
5          5     merriami   M
6          6       flavus   M
7          7     eremicus   F
8          8     merriami   M
9          9     merriami   F
10        10       flavus   F
11        11  spectabilis   F
12        12     merriami   M
13        13     merriami   M
14        14     merriami
15        15     merriami   F
16        16     merriami   F
17        17  spectabilis   F
18        18 penicillatus   M
19        19       flavus
20        20  spectabilis   F
21        21     merriami   F
22        22     albigula   F
23        23     merriami   M
24        24     hispidus   M
25        25     merriami   M
26        26     merriami   M
27        27     merriami   M
28        28     merriami   M
29        29 penicillatus   M
30        30  spectabilis   F
31        31     merriami   F
32        32     merriami   F
33        33     merriami   F
 [ reached 'max' / getOption("max.print") -- omitted 16115 rows ]

Steps 2-5: reduce the number of observations to ~10 while making sure the dataset still contains at least 3 species and at least 3 sexes

While the rest is just one step, it is the trickiest, because this is where we want to ensure the key elements of our original dataset, as defined earlier, are preserved.

Exercise 7: Try it yourself

How would you continue the subsetting pipeline? How could you reduce the number of observations while making sure you still have at least 3 species and 3 sexes left? Hint: there is no single right answer! Trial and error works wonders.

R

set.seed(1)
new_rodents <- rodents %>% 
  # 1. select the variables of interest
  select(record_id, species, sex) %>%
  slice_sample(n=4, replace = F, by='sex')
print(new_rodents)

OUTPUT

   record_id     species sex
1       2359    merriami   M
2      16335    albigula   M
3       9910       ordii   M
4       8278       ordii   M
5      12038    merriami   F
6       7862   megalotis   F
7       9221    albigula   F
8       1335 spectabilis   F
9       9862     harrisi
10     14979    merriami
11     11333   spilosoma
12       351 leucogaster    

The code ran without issues, yay! But do we end up with what we were looking for?

  1. Do we have ~10 observations? Yes! 12 is close enough
  2. Do we have at least 3 species? Yes! We have 8 (we could choose to reduce this further)
  3. Do we have at least 3 sexes? Yes! M, F, and blank

Great! All of our requirements are fulfilled. Now let’s see if it replicates our issue when we add it to our minimal code.

Note: slice_sample() and related functions such as slice_min() and slice_max() let you customize exactly how the sample is taken (check the documentation!). For example, you can select a proportion of rows rather than a fixed number, sample within groups, give some rows a higher chance of being drawn by weighting them with a variable, or, with slice_min() and slice_max(), order rows by a variable and choose whether ties (rows that share the same value) are all kept. All of this allows you to keep aspects of your dataset that may be relevant and hard to replicate otherwise.
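To make those options concrete, here is a sketch using the built-in mtcars dataset as a stand-in. It assumes dplyr 1.1.0 or later (needed for the `by` argument); the columns and sample sizes are arbitrary, and the object names are invented for illustration.

```r
library(dplyr)

set.seed(1)

# Keep 10% of the rows instead of a fixed number
by_prop <- slice_sample(mtcars, prop = 0.1)

# Sample 2 rows within each cylinder group
by_group <- slice_sample(mtcars, n = 2, by = cyl)

# Make heavier cars more likely to be drawn
weighted <- slice_sample(mtcars, n = 5, weight_by = wt)

# slice_max() orders rows by a variable; with_ties = FALSE returns exactly
# n rows even when several rows share the same value
top_mpg <- slice_max(mtcars, order_by = mpg, n = 3, with_ties = FALSE)

nrow(by_prop)
nrow(by_group)
nrow(weighted)
nrow(top_mpg)
```

Sampling within groups is the option closest to what we did with sex above, since it guarantees that every group is still represented in the reduced dataset.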

Remember your minimal code:

R

krats_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(krats_subset$sex, krats_subset$species)

OUTPUT


    ordii spectabilis
  F   350           0
  M     0         626

We now want to replace rodents with our new_rodents. Do we need to change anything else?

We actually still have ordii and spectabilis as species in our list, so we can keep it as is. Same for sex. So we’re all set!

R

new_subset <- new_rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

The code ran without any issues! But does it replicate our issue?

Take a step back to remind yourself of what you are looking for. What was the issue we had identified?

The number of rows we end up with after the filter is lower than expected.

So what would we expect to see with this new dataset? Since it is nice and short, this makes it a lot easier to predict the outcome.

We are asking for the 2 ordii rows, both males, and the 1 spectabilis row, which is female.

R

table(new_subset$sex, new_subset$species)

OUTPUT

< table of extent 0 x 0 >

Instead we end up with nothing! Why aren’t we getting the rows we are asking for?

Maybe our table is just wrong, let’s look at the actual dataset we end up with

R

print(new_subset)

OUTPUT

[1] record_id species   sex
<0 rows> (or 0-length row.names)

Still nothing! What is going on?? Well, we certainly replicated our issue. Time to ask for help!

But wait, our dataset is now minimal and relevant, but is it reproducible (accessible outside your device)? Not yet. We created a subset of our original dataset rodents, but that came from a file on our computer. We could share our csv file and add code to load it… but that’s not ideal, and it makes it hard to share our problem on a community site. Remember, the more steps required, the less likely someone will want to help.

Thankfully, there is a nifty function dput() that can help us out. Let’s try it and see what happens.

R

dput(new_rodents)

OUTPUT

structure(list(record_id = c(2359L, 16335L, 9910L, 8278L, 12038L,
7862L, 9221L, 1335L, 9862L, 14979L, 11333L, 351L), species = c("merriami",
"albigula", "ordii", "ordii", "merriami", "megalotis", "albigula",
"spectabilis", "harrisi", "merriami", "spilosoma", "leucogaster"
), sex = c("M", "M", "M", "M", "F", "F", "F", "F", "", "", "",
"")), class = "data.frame", row.names = c(NA, -12L))

It spit out a hard-to-read but not excessively long chunk of code. This code, when run, will recreate our new_rodents dataset! We can also break it down and label it further to help the reader.

R

reprex_data <- structure(list(
  
# a unique identifier
record_id = c(2359L, 16335L, 9910L, 8278L, 12038L, 7862L, 9221L, 1335L, 9862L, 14979L, 11333L, 351L), 

# a list of species
species = c("merriami", "albigula", "ordii", "ordii", "merriami", "megalotis", "albigula", "spectabilis", "harrisi", "merriami", "spilosoma", "leucogaster"), 

# a list of sexes. Note: this includes some blanks!
sex = c("M", "M", "M", "M", "F", "F", "F", "F", "", "", "", "")),

class = "data.frame", row.names = c(NA, -12L))

print(reprex_data)

OUTPUT

   record_id     species sex
1       2359    merriami   M
2      16335    albigula   M
3       9910       ordii   M
4       8278       ordii   M
5      12038    merriami   F
6       7862   megalotis   F
7       9221    albigula   F
8       1335 spectabilis   F
9       9862     harrisi
10     14979    merriami
11     11333   spilosoma
12       351 leucogaster    

Ta-da! Now they can easily recreate our minimal dataset and use it to run the minimal code. However, was that really easier than creating a dataset from scratch?
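If you want to check that the dput() output really does rebuild your data before sharing it, you can do the round trip yourself. This is a minimal sketch with a small made-up data frame (`small`) standing in for your minimal dataset; capture.output() is used here only to grab the text that dput() would print.

```r
# A small hypothetical data frame standing in for your minimal dataset
small <- data.frame(
  record_id = c(10L, 20L, 30L),
  species = c("ordii", "merriami", "ordii"),
  sex = c("M", "F", "")
)

# Capture what dput() would print, as lines of text
dput_code <- capture.output(dput(small))

# Evaluate that text, as a helper would after pasting it into their console
rebuilt <- eval(parse(text = dput_code))

all.equal(small, rebuilt)  # TRUE when the rebuilt copy matches the original
```

The same check works on any object you plan to share with dput(), and it catches cases where the pasted code was accidentally truncated.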

And sure, you could just use dput() on your original dataset. It would work. But that wouldn’t be very considerate to those who are trying to help. Try it.

R

#dput(rodents)

It becomes a huge chunk of code, when clearly we don’t need all of that!

Remember, we want to keep everything minimal for many reasons:

  • to make it easy for our helpers to understand our data and code
  • to allow helpers to quickly focus their efforts on the right factors
  • to make the problem-solving process as easy and painless as possible
  • bonus: to help us better understand and zero-in on the source of our issue, often stumbling upon a solution along the way

Nevertheless, it remains an option for when your data appears too complex or you are not quite sure where your error lies and therefore are not sure what minimal components are needed to reproduce the example.

3.7 Using an R built-in dataset


The last approach we mentioned is to build a minimal reproducible dataset based on the datasets that already exist within R (and therefore everyone would have access to).

A list of readily available datasets can be found using library(help="datasets"). You can then use ? in front of the dataset name to get more information about the contents of the dataset.
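If you prefer to browse the available datasets from code, the following sketch lists them and inspects one candidate. HairEyeColor is used only as an example, and `dataset_names` is a name introduced here for illustration.

```r
# List the names of the datasets bundled with the datasets package
dataset_names <- data(package = "datasets")$results[, "Item"]
head(dataset_names)

# Check the structure of a candidate before committing to it
class(HairEyeColor)  # a "table", not a data frame
str(HairEyeColor)

# Convert to a data frame to see one row per combination of categories
head(as.data.frame(HairEyeColor))
```

Checking class() first matters because not every built-in dataset is a data frame; some need converting before functions like select() or filter() will work on them.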

For a more detailed discussion of the benefits of using this approach see [insert something]

This approach essentially blends the skills we already learned in the first two. We need to identify a dataset with appropriate variables that match the “key elements” of our original dataset. We then need to further reduce that dataset to a minimal, relevant number of rows. Once again, we can use the previously learned functions such as select(), filter(), or sample().

Since we already practiced everything you need, why not try it yourself?

Exercise 8: Extra Challenge

Using the “HairEyeColor” dataset, create a minimal reproducible dataset for the same issue and minimal code we have been exploring.

  1. Start by using ?HairEyeColor to read a description of the dataset and View(HairEyeColor) to see the actual dataset.
  2. Which variables would be a good match for our situation? What are our requirements?
  3. How can we subset this dataset to make it minimal and still replicate our issue?

Remember, there are many possible solutions! The most important feature is that the example dataset can replicate the issue when used within our minimal code.

The following is one possible solution:

We selected Hair and Eye as replacements for species and sex because they are both categorical and have at least 3 levels. We don’t strictly need anything else. We will call our new rodents replacement hyc. We set a seed because we want a random sample.

R

set.seed(1)

# the dummy dataset
hyc <- as.data.frame(HairEyeColor) %>% # HairEyeColor is a table, so convert it to a data frame first
  select(Hair, Eye) %>%
  slice_sample(n=10)
print(hyc)

OUTPUT

    Hair   Eye
1  Black Hazel
2  Blond Brown
3    Red  Blue
4  Black Brown
5  Brown Brown
6    Red  Blue
7    Red Hazel
8  Brown Green
9  Brown Brown
10   Red Brown

R

# the minimal code
hyc_subset <- hyc %>%
  filter(Hair == c('Red','Blond'),
         Eye == c('Blue', 'Brown'))

# illustrating the issue
table(hyc_subset$Hair, hyc_subset$Eye) 

OUTPUT


        Brown Blue Hazel Green
  Black     0    0     0     0
  Brown     0    0     0     0
  Red       0    1     0     0
  Blond     1    0     0     0

R

# the whole subset
print(hyc_subset)

OUTPUT

   Hair   Eye
1 Blond Brown
2   Red  Blue

R

# But we know there are more!
table(hyc$Hair, hyc$Eye) # Reds have 2 Blue and 1 Brown, and Blonds have 1 Brown!

OUTPUT


        Brown Blue Hazel Green
  Black     1    0     1     0
  Brown     2    0     0     1
  Red       1    2     1     0
  Blond     1    0     0     0

Callout

What about NAs? If your data has NAs and they may be causing the problem, it is important to include them in your example dataset. You can find where there are NAs in your dataset by using is.na(), for example: is.na(krats$weight). This will return a logical vector that is TRUE where a cell contains an NA and FALSE where it does not. The simplest way to include NAs in your dummy dataset is to include them directly in vectors: x <- c(1,2,3,NA). You can also subset a dataset that already contains NAs, change some of the values to NAs using mutate() with ifelse(), or replace all the values in a column by sampling from a vector that contains NAs.
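Here is a short base R sketch of those three ways of getting NAs into a dummy dataset. The vectors (`weight`, `sex`) and the threshold of 200 are invented for illustration.

```r
# 1. Type NAs directly into a vector
x <- c(1, 2, 3, NA)
is.na(x)  # logical vector: TRUE only in the last position

# 2. Turn selected values into NA with ifelse()
weight <- c(40, 48, 250, 43)  # hypothetical weights; 250 is implausible
weight_clean <- ifelse(weight > 200, NA, weight)

# 3. Sample from a vector that contains NA
set.seed(1)
sex <- sample(c("M", "F", NA), 10, replace = TRUE)

# Count the NAs in each vector
sum(is.na(weight_clean))
sum(is.na(sex))
```

Inside a tidyverse pipeline, the second approach would be written as a mutate() call wrapping the same ifelse() expression.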

One important thing to note when subsetting a dataset with NAs is that subsetting methods that use a condition to match rows won’t necessarily match NA values in the way you expect. For example:

R

test <- data.frame(x = c(NA, NA, 3, 1), y = rnorm(4))
test %>% filter(x != 3) 

OUTPUT

  x          y
1 1 -0.3053884

R

# you might expect that the NA values would be included, since “NA is not equal to 3”. But actually, the expression NA != 3 evaluates to NA, not TRUE. So the NA rows will be dropped!
# Instead you should use is.na() to match NAs
test %>% filter(x != 3 | is.na(x))

OUTPUT

   x          y
1 NA  0.4874291
2 NA  0.7383247
3  1 -0.3053884

Key Points

  • A minimal reproducible dataset contains (a) the minimum number of lines, variables, and categories, in the correct format, to replicate your issue; and (b) it must be fully reproducible, meaning that someone else can access or run the same code to reproduce the dataset needed for your reprex.
  • To make it accessible, you can create a dataset from scratch using data.frame(), you can use a built-in R dataset like cars, or you can use a subset of your own dataset and then use dput() to generate reproducible code.

Bonus: Additional Practice


Here are some more practice exercises if you wish to test your knowledge

Extra Practice? Would need to change from mpg, since that’s from ggplot

For each of the following, identify which data are necessary to create a minimal reproducible dataset using mpg.

  1. We want to know how the highway mpg has changed over the years
  2. We need a list of all “types” of cars and their fuel type for each manufacturer
  3. We want to compare the average city mpg for a compact car from each manufacturer

(I copied these from exercise 6 in the google doc… but I’m not sure that they are getting at the point of the lesson…)

Another Exercise

Each of the following examples needs your help to create a dataset that will correctly reproduce the given result and/or warning message when the code is run. Fix the dataset shown or fill in the blanks so it reproduces the problem.

  1. set.seed(1)
     sample_data <- data.frame(fruit = rep(c("apple", "banana"), 6), weight = rnorm(12))
     ggplot(sample_data, aes(x = fruit, y = weight)) + geom_boxplot()
     HELP: how do I insert an image from clipboard?? Is it even possible?
  2. bodyweight <- c(12, 13, 14, , )
     max(bodyweight)
     [1] NA
  3. sample_data <- data.frame(x = 1:3, y = 4:6)
     mean(sample_data$x)
     [1] NA
     Warning message: In mean.default(sample_data$x): argument is not numeric or logical: returning NA
  4. sample_data <- ____
     dim(sample_data)
     NULL

Possible answers:

  1. “fruit” needs to be a factor and the order of the levels must be specified: sample_data <- data.frame(fruit = factor(rep(c("apple", "banana"), 6), levels = c("banana", "apple")), weight = rnorm(12))
  2. one of the blanks must be an NA
  3. ?? + what’s really the point of this one?
  4. sample_data <- data.frame(x = factor(1:3), y = 4:6)

Great work! We’ve created a minimal reproducible example. In the next episode, we’ll learn about reprex, a package that can help us double-check that our example is reproducible by running it in a clean environment. (As an added bonus, reprex will format our example nicely so it’s easy to post to places like Slack, GitHub, and StackOverflow.)

Content from Asking your question


Last updated on 2025-05-27

Overview

Questions

  • How can I verify that my example is reproducible?
  • How can I easily share a reproducible example with a mentor or helper, or online?
  • How do I ask a good question?

Objectives

  • Use the reprex package to test whether an example is reproducible.
  • Use the reprex package to format reprexes for posting online.
  • Understand the benefits and drawbacks of different help forums.
  • Have a road map to follow when posting a question to make sure it’s a good question.
  • Understand what the {reprex} package does and doesn’t do.

Congratulations on finishing your reprex! In this episode, we will introduce a tool, the reprex package. This package will help you check that your example is truly reproducible and format it nicely to make it easy to present to a helper, either in person or online.

There are three principles to remember when you think about sharing your reprex with other people: Reproducibility, formatting, and context.

1. Reproducibility


Haven’t we already talked a lot about reproducibility?

Yes! We have discussed variables and packages, minimal datasets, and making sure that the problem is meaningfully reproduced by the data that you choose. But there are some reasons that a code snippet that appears reproducible in your own R session might not actually be runnable by someone else.

  • You forgot to account for the origins of some functions and/or variables. We went through our code methodically, but what if we missed something? It would be nice to confirm that the code is as self-contained as we thought it was.

  • Your code accidentally relies on objects in your R environment that won’t exist for other people. For example, imagine you defined a function my_awesome_custom_function() in a project-specific functions.R script, and your code calls that function.

A function called "my_awesome_custom_function" is lurking in my R environment. I must have defined it a while ago and forgotten! Code that includes this function will not run for someone else unless the function definition is also included in the reprex.

R

my_awesome_custom_function("the kangaroo rat dataset")

ERROR

Error in my_awesome_custom_function("the kangaroo rat dataset"): could not find function "my_awesome_custom_function"

I might conclude that this code is reproducible–after all, it works when I run it! But unless I remembered to include the function definition in the reprex itself, nobody will be able to run the code.

A corrected reprex would look like this:

R

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")

OUTPUT

[1] "the kangaroo rat dataset is awesome!"
  • Your code depends on some particular characteristic of your R or RStudio environment that is not the same as your helper’s environment. [more details here]

There are so many components to remember when thinking about reproducibility, especially for more complex problems. Wouldn’t it be nice if we had a way to double check our examples? Luckily, the reprex package will help you test your reprexes in a clean, isolated environment to make sure they’re actually reproducible.

The most important function in the reprex package is called reprex(). Here’s how to use it.

First, install and load the reprex package.

R

#install.packages("reprex")
library(reprex)

Second, write some code. This is your reproducible example.

R

(y <- 1:4)

OUTPUT

[1] 1 2 3 4

R

mean(y)

OUTPUT

[1] 2.5

Third, highlight that code and copy it to your clipboard (e.g. Cmd + C on Mac, or Ctrl + C on Windows).

Finally, type reprex() into your console.

# (with the target code snippet copied to your clipboard already...)
# In the console:
reprex()

reprex will grab the code that you copied to your clipboard and run that code in an isolated environment. It will return a nicely formatted reproducible example that includes your code and any results, plots, warnings, or errors generated.

The generated output will be on your computer’s clipboard by default. Then, you can paste it into GitHub, StackOverflow, Slack, or another venue.

Callout: The reprex package workflow

The reprex package workflow takes some getting used to. Instead of copying your code into the function, you simply copy it to the clipboard (a mysterious, invisible place to most of us) and then let the blank, empty reprex() function go over to the clipboard by itself and find it.

The completed, rendered reprex then replaces the original code on the clipboard, so all you need to do is paste. There is no need to copy anything again.
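If the clipboard feels too invisible, reprex() can also be handed the code directly. The sketch below shows two alternatives: passing a braced expression, and pointing the input argument at a script file. The file name mean_example.R is made up for illustration, and the calls are wrapped in a check so that nothing is rendered unless reprex is installed and you are working interactively.

```r
# Save the small example to a temporary script (hypothetical file name)
script <- file.path(tempdir(), "mean_example.R")
writeLines(c("(y <- 1:4)", "mean(y)"), script)

# Only try to render when working interactively with reprex installed
if (interactive() && requireNamespace("reprex", quietly = TRUE)) {
  # Option 1: pass the code as a braced expression instead of copying it
  reprex::reprex({
    (y <- 1:4)
    mean(y)
  })

  # Option 2: point reprex() at a script file
  reprex::reprex(input = script)
}
```

Either way, the code still runs in a separate, clean session, so the result should match what the clipboard workflow produces.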

Let’s practice this one more time. Here’s some very simple code:

R

library(ggplot2)
library(dplyr)
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

Let’s highlight the code snippet, copy it to the clipboard, and then run reprex() in the console.

# In the console:
reprex()

The result, which was automatically placed onto my clipboard and which I pasted here, looks like this:

R

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

Created on 2024-12-29 with reprex v2.1.1

Nice and neat! It even includes the plot produced, so I don’t have to take screenshots and figure out how to attach them to an email or something.

The formatting is great, but reprex really shines when you treat it as a helpful collaborator in your process of building a reproducible example (including all dependencies, providing minimal data, etc.).

Let’s see what happens if we forget to include library(ggplot2) in our small reprex above.

R

library(dplyr)
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

As before, let’s copy that code to the clipboard, run reprex() in the console, and paste the result here.

# In the console:
reprex()

R

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()
#> Error in ggplot(., aes(x = factor(cyl), y = displ)): could not find function "ggplot"

Created on 2024-12-29 with reprex v2.1.1

Now we get an error message indicating that R cannot find the function ggplot! That’s because we forgot to load the ggplot2 package in the reprex.

This happened even though we had ggplot2 already loaded in our own current RStudio session. reprex deliberately ignores any packages already loaded, running the code in a clean, isolated R session that’s different from the R session we’ve been working in. This simulates the experience of someone else trying to run your reprex on their own computer.
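Before running reprex(), it can help to see which packages your own session has attached that a fresh session would not. The following is a rough sketch using only base R. It assumes the defaultPackages option has not been customized on your machine, and the variable names are made up for illustration.

```r
# Packages attached in the current session
attached <- sub("^package:", "", grep("^package:", search(), value = TRUE))

# Packages a fresh R session attaches on its own
fresh <- c(getOption("defaultPackages"), "base")

# Anything left over needs an explicit library() call in the reprex,
# if the example code actually uses it
needs_library_call <- setdiff(attached, fresh)
needs_library_call
```

Not every package on this list belongs in the reprex. Only include the ones your minimal example really depends on.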

Let’s return to our previous example with the custom function.

R

my_awesome_custom_function("the kangaroo rat dataset")

OUTPUT

[1] "the kangaroo rat dataset is awesome!"

# In the console:
reprex()

R

my_awesome_custom_function("the kangaroo rat dataset")
#> Error in my_awesome_custom_function("the kangaroo rat dataset"): could not find function "my_awesome_custom_function"

Created on 2024-12-29 with reprex v2.1.1

By contrast, if we include the function definition:

R

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")

OUTPUT

[1] "the kangaroo rat dataset is awesome!"

# In the console:
reprex()

R

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")
#> [1] "the kangaroo rat dataset is awesome!"

Created on 2024-12-29 with reprex v2.1.1
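To build intuition for what “a clean, isolated session” means, here is a rough base R analogue. It is not a description of how reprex works internally. The sketch writes the one-line example to a temporary file and runs it with Rscript --vanilla, which starts a brand-new R process that knows nothing about the objects in your current session.

```r
# Write the one-line example (without the function definition) to a script
script <- tempfile(fileext = ".R")
writeLines('my_awesome_custom_function("the kangaroo rat dataset")', script)

# Run the script in a fresh R process, separate from the current session
rscript <- file.path(R.home("bin"), "Rscript")
result <- suppressWarnings(
  system2(rscript, c("--vanilla", shQuote(script)), stdout = TRUE, stderr = TRUE)
)

# The captured output reports that the function could not be found
print(result)
```

Even if my_awesome_custom_function() exists in your own environment, the fresh process fails, just as the reprex did.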

Testing it out


Now that we’ve met our new reprex-making collaborator, let’s use it to test out the reproducible example we created in the previous episode.

Here’s the code we wrote:

R

# Mickey's reprex script
# XXX THIS IS NOT FINISHED--NEED TO INSERT FINAL DATA EXAMPLE!

# Load necessary packages to run the code
library(ggplot2)
library(dplyr) # provides the %>% pipe used below

rodents_subset %>% # XXX replace with simulated dataset
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?

# Investigate
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

Time to find out if our example is actually reproducible! Let’s copy it to the clipboard and run reprex(). Since we want to give Jordan a runnable R script, we can use venue = "r".

# In the console:
reprex(venue = "r")

It worked!

R

#replace with final output

Now we have a beautifully-formatted reprex that includes runnable code and all the context needed to reproduce the problem.

Callout: Including information about your R session

Another nice thing about reprex is that you can choose to include information about your R session, in case your error has something to do with your R settings rather than the code itself. You can do that using the session_info argument to reprex().

For example, try running the following reprex, setting session_info = TRUE, and observe what happens.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

# In the console:
reprex(session_info = TRUE)
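To get a feel for the kind of information a session summary contains, you can look at base R’s sessionInfo() in your own session. This is only a sketch of the general idea; the summary that reprex appends may be formatted differently.

```r
# Collect details about the current R session
info <- sessionInfo()

info$R.version$version.string  # which version of R is running
info$platform                  # the platform R was built for
names(info$otherPkgs)          # add-on packages currently attached (NULL if none)
```

Differences in any of these between you and your helper can explain why code behaves differently on two computers.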

Formatting

The output of reprex() is markdown, which can easily be copied and pasted into many sites/apps. However, different places have slightly different formatting conventions for markdown. reprex lets you customize the output of your reprex according to where you’re planning to post it.

The default, venue = "gh", gives you “GitHub-Flavored Markdown”, which is a particular type of markdown that works well when posted on GitHub. Another format you might want is “r”, which gives you a runnable R script, with commented output interleaved with pieces of code.

Check out the formatting options in the help file with ?reprex, and try out a few depending on the destination of your reprex!

Callout: reprex can’t do everything for you

People often mention reprex as a useful tool for creating reproducible examples, but it can’t do the work of crafting the example for you! The package doesn’t locate the problem, pare down the code, create a minimal dataset, or automatically include package dependencies.

A better way to think of reprex is as a tool to check your work as you go through the process of creating a reproducible example, and to help you polish up the result.
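For instance, building a small dataset is still your job. A minimal sketch, using a made-up data frame called toy with hypothetical values, might look like this. dput() prints R code that recreates the object, which you can paste at the top of your example before running reprex().

```r
# Hypothetical toy data standing in for a real dataset
toy <- data.frame(
  species = c("A", "A", "B", "B"),
  weight  = c(40, 42, NA, 120)
)

# Print code that rebuilds `toy`, ready to paste into a reprex
dput(toy)
```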

Context

The final thing to consider when preparing your reproducible example is adding some context so that helpers know a little about your problem and what you’re trying to achieve.

Some context to include:

  1. Tell us a little bit about your problem. One sentence should be enough. What domain are you working in? What are these data about? What do the relevant variables mean?

This is particularly important if you have provided a subset of your own data instead of creating a minimal dataset from scratch. Your helper will need to interpret the column names and understand what type of data they are looking at.

  2. Explain what you expected to happen, or what you were trying to achieve, and how it is different from what happened instead.

The contrast between what happened and what was supposed to happen is particularly important for semantic errors, in which the “error” is not always obvious when running the code. The code ran–but you have decided that the output is “wrong” somehow, or that it “didn’t work”. Why? How do you know that? Your helper needs to know that what you got was not what you expected, and they need to know what you expected in order to help you achieve that outcome.

For example, let’s say you made the following plot:

R

rodents %>%
  ggplot(aes(x = plot_type, y = hindfoot_length, color = plot_type))+
  geom_boxplot()

WARNING

Warning: Removed 2003 rows containing non-finite outside the scale range
(`stat_boxplot()`).

This plot doesn’t look the way you want it to look, and you’re not sure why, so you decide to make a reprex. You load the required packages (ggplot2 and dplyr), and you substitute an existing dataset, mpg, instead of rodents, which you know your helpers won’t have access to. Your reprex looks like this:

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

It’s minimal! It’s reproducible! But… what is the problem? This is a perfectly reasonable plot, so without context, your helper won’t know what’s wrong. Let’s explain it to them.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

R

# I want to make a boxplot where each of the categories has its own color. But even though I set color = class here, only the outlines of the boxplots got colored in, and the inside is still white. How do I change this so that the whole box is colored in?
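The same pattern works for problems that don’t involve plots. Here is a small sketch with made-up weights (the values are hypothetical) where the code runs without any error but the result is not what was wanted. Stating the expected value next to the actual one tells the helper exactly what “wrong” means.

```r
# Hypothetical weights in grams; one measurement is missing
weights <- c(40, 42, NA, 46)

# What I expected: the average of the three recorded weights (about 42.67)
# What I got instead: NA
mean(weights)
```

With that contrast spelled out, a helper can quickly see that the missing value is the issue and suggest something like the na.rm argument to mean().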

Exercise 1: What makes a good description?

For each of the following reprexes, improve the description given.

  a. I’m trying to plot the displacements of different cars. I made this boxplot, but the boxes are showing up in the wrong order. How do I fix this? Here is my minimal reproducible example.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

  b. I’m working with this data about cars. The class column refers to the type of car–for example, “compact” class means that the car is quite small, while “pickup” would be a pickup truck. For each car, I also have information about the city and highway mileage, and the transmission, and the number of cylinders, as well as the displacement. This dataset has 234 rows and 11 columns, although this is an example dataset because my real dataset is much larger and has more like 500,000 rows. Anyway, in this example, I want to make a boxplot of displacement where each of the categories has its own color. But even though I set color = class here, only the outlines of the boxplots got colored in, and the inside is still white. How do I make the inside a different color? Here’s a reprex.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

  c. Help, my code isn’t working! It says I have too many elements. I made a reprex so you can see the data and the error message. I hope that’s helpful. Thank you so much!

R

library(ggplot2)
table(mpg)

ERROR

Error in table(mpg): attempt to make a table with >= 2^31 elements

As we wrap up this lesson, let’s work on adding some context for Mickey’s reprex so they’ll be ready to send it to Remy or post it online.

Exercise 2: Adding context

Working with the person next to you, write a brief description of Mickey’s problem that they could include with their reprex when they post it online.

Make sure that the description gives a little bit of background, describes what Mickey was trying to achieve, and describes what happened instead.

When you’re done, compare notes between the groups and see if you can come up with a final reprex for Mickey!

Key Points

  • The reprex package makes it easy to format and share your reproducible examples.
  • The reprex package helps you test whether your reprex is reproducible, and also helps you prepare the reprex to share with others.
  • Following a certain set of steps will make your questions clearer and likelier to get answered.