Content from What is a reprex and why is it useful?

Last updated on 2025-07-22

Overview

Questions

What steps can you take to solve problems in your code?
What is a minimal reproducible example?
Why are minimal reproducible examples important?
What variables are included in the Portal Project dataset?

Objectives

Understand the high-level process for getting unstuck in R.
Define each key characteristic of a minimal reproducible example.
Explain why minimal reproducible examples are central to getting help from others.
Load in the rodent survey data and briefly explain its contents.

Mickey is an ecology researcher who has just started in a new lab. Mickey’s lab has been working for many years with data from the Portal Project, a long-term research study of rodents in Portal, Arizona. Mickey is particularly interested in learning about rodent morphology. For now, they are getting familiar with the dataset by doing some descriptive analyses and visualizations.

Mickey starts by loading the data so they can begin to explore it. They also load the {tidyverse}, a set of packages that will be useful for wrangling and visualizing the data.

R

# Load packages and data
library(tidyverse)

surveys <- read_csv("data/surveys_complete_77_89.csv")

Mickey has some past experience in R, but this project will require more data analysis than they have done before. Mickey attended a Carpentries workshop, “Data Analysis and Visualization in R for Ecologists,” and they feel comfortable with the fundamentals of coding in R. Still, they are a little nervous about starting this project.

Prerequisites and target audience

This workshop assumes some prior experience with working in R and RStudio. We will assume you’ve taken the equivalent of the Data Analysis and Visualization in R for Ecologists workshop and are comfortable with basic commands, and we won’t necessarily explain every line of code in detail.

If you’re much more experienced in R, this workshop is still for you! Even expert coders may not always know how to get unstuck. We hope this workshop will be useful to people with a variety of coding backgrounds.

Sometimes, Mickey’s code doesn’t work as expected and they go to their colleague, Remy, for help. Remy has spent many hours sitting with Mickey, helping to work through various errors. But soon, Remy will be starting a big project and won’t have as much time to help with debugging.

To help Mickey get more comfortable troubleshooting their own code, Remy suggests some steps to follow the next time they get stuck. Remy calls this the “Road Map to Getting Unstuck in R.”

Remy explains that the road map includes two main phases. First, there is guidance about “code first aid.” This includes understanding types of errors, reading function documentation, investigating error messages, and running through code line by line to diagnose problems.

Sometimes, these first aid steps are not enough to solve your problem. One of the most frustrating parts of learning to code is getting stuck and not knowing what to do! Luckily, there are many people in the R and data science communities who are happy to help, as long as you give them the right information. But figuring out how to ask a good question can feel even harder than the original problem that got you stuck in the first place. That’s why the second part of Remy’s road map includes guidance on how to create a minimal reproducible example (also known as a reprex).

A minimal reproducible example is a piece of code that demonstrates the problem you are facing, includes all necessary information to show the problem but nothing extra, and will run easily on someone else’s computer.

Minimal reproducible examples are very important tools to get help when you’re stuck on a coding problem.

Stripping the code and data down to their simplest (minimal) parts makes it easy for a helper to zero in on what might be going wrong.
Making your example reproducible allows a helper to run your code on their own computer so they can “feel your pain” and understand what’s going wrong. Even experts often have to “tinker” with code in order to fix it. Providing a reprex makes that “tinkering” easy, which makes it more likely that a helper will take the time to assist you.
The process of making a minimal reproducible example often gives you insight into your own code. Often, you might end up solving the problem yourself, without even needing to ask for help.
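
As a rough illustration of those characteristics, here is a sketch of what a very small reproducible example might look like. The toy_rodents data frame and its values are invented for this illustration and are not taken from the Portal data. The point is only that the snippet builds its own tiny dataset, needs nothing beyond base R, and shows both the surprising result and the expected one in a few lines.

```r
# A tiny, made-up dataset that stands in for the real one (hypothetical values)
toy_rodents <- data.frame(
  species = c("ordii", "ordii", "spectabilis"),
  weight  = c(48, NA, 120)
)

# The problem: I expected an average weight, but the result is NA
mean_weight <- mean(toy_rodents$weight)
print(mean_weight)

# What I expected: the average of the two recorded weights
expected <- (48 + 120) / 2
print(expected)
```

A helper can paste this into a fresh R session and see the same result immediately, without needing any of your files.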

Rubber duck debugging

The phenomenon of solving one’s own problem during the process of trying to explain it to someone else is often called “rubber duck debugging.” This is a reference to a story about programmers who would keep rubber ducks on their desks to explain coding problems to. Jenny Bryan refers to reprexes as “basically the rubber duck in disguise,” because they force you to explain your problem to someone else, often solving it in the process.

Jenny Bryan shares many other insights about reprexes in her 2018 talk “Help me help you: Creating reproducible examples.”

Helpers

There are lots of people who might help you with your code: friends, colleagues, mentors, or total strangers online. In this lesson, we will use the term “helper” to refer to the person who is helping you to debug your code. Helpers are the target audience for your minimal reproducible example.

Remy emphasizes to Mickey that they are still happy to be a helper, but that since they won’t have as much time to devote to debugging in the future, following this road map first will make the helping process more efficient. Hopefully, it will also make Mickey into a more confident coder!

Before heading off to their own work, Remy also introduces Mickey to the dataset they’ve just loaded in.

R

# Take a look at the data
glimpse(surveys)
min(surveys$year)
max(surveys$year)

OUTPUT

Rows: 16,878
Columns: 13
$ record_id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ month           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ day             <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16…
$ year            <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, …
$ plot_id         <dbl> 2, 3, 2, 7, 3, 1, 2, 1, 1, 6, 5, 7, 3, 8, 6, 4, 3, 2, …
$ species_id      <chr> "NL", "NL", "DM", "DM", "DM", "PF", "PE", "DM", "DM", …
$ sex             <chr> "M", "M", "F", "M", "M", "M", "F", "M", "F", "F", "F",…
$ hindfoot_length <dbl> 32, 33, 37, 36, 35, 14, NA, 37, 34, 20, 53, 38, 35, NA…
$ weight          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ genus           <chr> "Neotoma", "Neotoma", "Dipodomys", "Dipodomys", "Dipod…
$ species         <chr> "albigula", "albigula", "merriami", "merriami", "merri…
$ taxa            <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Rod…
$ plot_type       <chr> "Control", "Long-term Krat Exclosure", "Control", "Rod…
[1] 1977
[1] 1989

Remy explains that the dataset is made up of many individual rodent records (record_id). The date of each record is given by the month, day, and year columns.

The dataset includes data from a number of different study plots that had different treatments applied: plot IDs are given by the plot_id column, and the type of treatment is specified in plot_type.

There is information about the genus and species of each individual caught, as well as higher-level taxa information and a short-form species_id code.

For each individual caught, the field crew took weight, sex and hindfoot_length measurements, although those measurements are sometimes missing.

The dataset contains 16,878 observations spanning the years 1977 through 1989.
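
To make the point about missing measurements concrete, here is a small sketch that counts missing values per column. It rebuilds only the first eight records shown in the glimpse() output above (the name first_records is made up for this illustration), so the counts describe those eight rows and not the full dataset.

```r
# First eight records, copied from the glimpse() output above
first_records <- data.frame(
  record_id       = 1:8,
  species_id      = c("NL", "NL", "DM", "DM", "DM", "PF", "PE", "DM"),
  sex             = c("M", "M", "F", "M", "M", "M", "F", "M"),
  hindfoot_length = c(32, 33, 37, 36, 35, 14, NA, 37),
  weight          = rep(NA_real_, 8)
)

# Count missing values in each measurement column
measurement_cols <- c("sex", "hindfoot_length", "weight")
missing_counts <- colSums(is.na(first_records[, measurement_cols]))
print(missing_counts)
```

With the full dataset loaded, the same colSums(is.na(...)) call can be applied to surveys directly.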

With an introduction to the dataset and a road map to guide them if they get stuck, Mickey feels ready to start coding!

Key Points

Applying some “code first aid” can help address problems in your code.
Helpers can more easily debug your code if you provide them with a small example of the problem (a “reprex”) that they can tinker with themselves.
In the process of building a reprex, you may find the solution yourself.
In the rest of this lesson, we will be working through a “road map” to getting unstuck that includes code first aid and the process of making a reprex.
The surveys dataset includes records of rodents captured in a variety of experimental plots over a 12-year period, including some data about each rodent’s sex and morphology.

Content from Identify the problem and make a plan

Last updated on 2025-07-22

Overview

Questions

What do I do when I encounter an error?
What do I do when my code outputs something I don’t expect?
Why do errors and warnings appear in R?
How can I find which areas of code are responsible for errors?
How can I fix my code? What other options exist if I can’t fix it?

Objectives

After completing this episode, participants should be able to…

Describe how the desired code output differs from the actual output
Categorize an error message (e.g. syntax error, semantic errors, package-specific errors, etc.)
Describe what an error message is trying to communicate
Identify specific lines and/or functions generating the error message
Use R Documentation to look up function syntax and examples
Quickly fix commonly-encountered R errors using ‘code first aid’
Identify when a problem is better suited for asking for further help by making a reprex

The first step we’ll cover is what to do when encountering an error or other undesired output from your code. With this episode, we hope to teach you the basics about identifying errors, rectifying them if possible, and if not, how to isolate the problem for others to look at. This is the first step in our “roadmap” of how to solve coding problems – recognizing when something you don’t intend is happening with your code, and then identifying the problem (to a lesser or greater degree) in order to solve it yourself or be able to succinctly describe it to a helper.

3.1 What do I do when I encounter an error message?

R will often let us know when a problem occurs by generating an error message that tells us why it was unable to run our code, even if that message is sometimes frustrating to read. In this lesson, we refer to this type of problem as a syntax error. When R hits one, it prints the error message and stops, as opposed to issuing a warning and carrying on with later lines. Errors can happen for many reasons, and deciphering their messages is not always as easy as we might hope. While we can’t review every reason your code might generate an error, we will teach you some tools to interpret and resolve syntax errors for yourself.

The accompanying error message attempts to tell you exactly how your code failed. For example, consider the following error message that occurs when I run this command in the R console:

R

# Make some plots
ggplot(x = taxa) + geom_bar()

ERROR

Error: object 'taxa' not found

Though we know somewhere there is an object called taxa (it is actually a column of the dataset surveys), R is trying to communicate that it cannot find any such object in the local environment. Let’s try again, appropriately pointing ggplot to the surveys dataset and taxa column using the $ operator. For the sake of argument, let’s say we also remember that geom_bar expects an aesthetic (aes).

R

ggplot(aes(x = surveys$taxa)) + geom_bar()

ERROR

Error in `fortify()`:
! `data` must be a <data.frame>, or an object coercible by `fortify()`,
  or a valid <data.frame>-like object coercible by `as.data.frame()`, not a
  <uneval> object.
ℹ Did you accidentally pass `aes()` to the `data` argument?

Whoops! Here we see another error message. This time, R responds with a message that is perhaps even harder to interpret.

Let’s go over each part briefly. First, we see an error from a function called fortify, which we didn’t even call! Then, there’s a more helpful informational message: Did we accidentally pass aes() to the data argument? This does seem to relate to our function call, as we do pass aes into the ggplot function. But what is this “data argument?” A helpful starting place when attempting to decipher an error message is checking the documentation for the function which caused the error:

?ggplot

Here, a Help window pops up in RStudio which provides some more information. Skipping the general description at the top, we see that ggplot takes its arguments by position: data first, then mapping, which is where the aes call belongs. The “Arguments” section explains that whatever is supplied as data gets converted to a data.frame by fortify, and in our call that was the aes(x = surveys$taxa) object. Now the picture is clear! We accidentally passed our mapping argument (which tells ggplot how to map variables to the plot) in the position where ggplot expected data in the form of a data frame. If we scroll down to “Examples”, to “Pattern 1”, we can see exactly how ggplot expects these arguments in practice. Let’s amend our code:

R

ggplot(surveys, aes(x = taxa)) + geom_bar()

Here we see our desired plot.
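
The first error above comes down to where R looks for names. As a small sketch (using a made-up toy_surveys data frame instead of the real data), a column name is not an object in its own right, so R only finds it when we go through the data frame that holds it:

```r
# A made-up stand-in for the surveys data (hypothetical values)
toy_surveys <- data.frame(taxa = c("Rodent", "Bird", NA, "Rodent"))

# Asking for the column by its bare name fails, as in the first ggplot call
bare_name <- tryCatch(taxa, error = function(e) conditionMessage(e))
print(bare_name)

# The column is reachable through the data frame
has_column <- "taxa" %in% names(toy_surveys)
print(has_column)
print(toy_surveys$taxa)
```

Functions such as aes() look up bare column names inside the data frame supplied to ggplot(), which is why aes(x = taxa) works once surveys is passed as the data.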

Stop no. 1 on our roadmap: Identifying the problem

Let’s pause here to highlight some patterns we’re starting to see in the course of fixing our code:

Seeing a problem arise in our code (in this case, R is explicitly telling us it has a problem running it).
Reading and interpreting the error message R gives us.

Other steps we might take then include:

Acting on parts of the error we can understand, such as changing input to a function.
Pulling up the R Documentation for that function, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
Copying and pasting the error message into a search engine / generative LLM for more interpretable explanations.

And, when all else fails, we can prepare our code into a reproducible example for expert help.
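
As a sketch of the documentation step, the snippet below shows a few base R ways to inspect a function before or instead of opening its full help page. It uses base R’s table() purely as an example function; the variable name table_args is made up for this illustration.

```r
# Print a function's arguments and their defaults in the console
args(table)

# The argument names are also available as a character vector
table_args <- names(formals(table))
print(table_args)

# Open the full help page and run its Examples section (interactive use only)
if (interactive()) {
  help("table")
  example("table")
}
```

Checking the argument names first is often enough to spot an argument passed in the wrong position or under the wrong name.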

Whether these steps are new or familiar to you, we spell them out to make one point explicit: recognizing when a problem arises and attempting to interpret what is going wrong are essential to fixing it. This is true whether you fix the problem on your own or communicate it to an expert. The later steps we listed are attempts to address the problem immediately, and we’ll call them code first aid. These steps might fix the problem, give you greater insight into what the problem is (and how R is interpreting your code), or not be helpful at all.

In any case, we want to emphasize that these skills are essential to being a practiced coder who can effectively seek help. These examples may seem too trivial to warrant pulling up a whole checklist, but below we will see problems that are trickier to both recognize and interpret. In those cases, we’ll apply the same framework.

Challenge

Below we see an error message pop up when trying to …. Which of the following interpretations of the error message most aptly describes the problem?

Show me the solution

3.2 What do I do when my code outputs something I don’t expect?

Another type of problem you may encounter is when your R code runs without errors, but does not produce the desired output. You may sometimes see these called semantic errors. As with syntax errors, semantic errors may occur for a variety of non-intuitive reasons, and are often harder to solve as there is no description of the error – you must work out where your code is defective yourself!
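
A minimal sketch of a semantic error, using base R only and unrelated to the rodent data: both lines below run without complaint, but only one of them does what the comment says was intended.

```r
n <- 3

# Intended: the sequence from 1 up to n - 1
# Written: the colon operator is applied before the minus, so this is (1:n) - 1
unintended <- 1:n - 1
print(unintended)

# Parentheses express what was actually meant
intended <- 1:(n - 1)
print(intended)
```

R has no way of knowing which sequence we wanted, so it cannot warn us. Only comparing the output against our expectation reveals the problem.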

Let’s go back to our rodent analysis. The next step in the plan is to subset to just the Rodent taxa (as opposed to other taxa: Bird, Rabbit, Reptile or NA). Let’s quickly check to see how much data we’d be throwing out by doing so:

R

table(surveys$taxa)

OUTPUT


   Bird  Rabbit Reptile  Rodent
    300      69       4   16148

We’re interested in the rodents, and thankfully it seems like a majority of our observations will be maintained when subsetting to rodents. But wait… In our plot above, we can clearly see the presence of NA values. Why are we not seeing them here? Our command was correctly executed, but the output is not everything we intended. Having no error message to interpret, let’s jump straight to the function documentation:

R

?table

OUTPUT

Help on topic 'table' was found in the following packages:

  Package               Library
  vctrs                 /home/runner/.local/share/renv/cache/v5/linux-ubuntu-jammy/R-4.5/x86_64-pc-linux-gnu/vctrs/0.6.5/c03fa420630029418f7e6da3667aac4a
  base                  /home/runner/.cache/R/renv/sandbox/linux-ubuntu-jammy/R-4.5/x86_64-pc-linux-gnu/9a444a72


Using the first match ...

Here, the documentation provides some clues: there seems to be an argument called useNA that accepts “no”, “ifany”, and “always”, but it’s not immediately apparent which one we should use to show our NA values. As a second approach, let’s go to Examples to see if we can find any quick fixes. Here we see a couple lines further down:

R

table(a)                 # does not report NA's
table(a, exclude = NULL) # reports NA's

That seems like it should be inclusive. Let’s try again:

R

table(surveys$taxa, exclude = NULL)

OUTPUT


   Bird  Rabbit Reptile  Rodent    <NA>
    300      69       4   16148     357

Now our NA values show up in the table. We see that by subsetting to the “Rodent” taxa, we would be losing 357 NAs, which themselves could be rodents! However, in this case, it seems a small enough portion to safely omit. Let’s subset our data to the rodent taxon.

R

# Just rodents
rodents <- surveys %>% filter(taxa == "Rodent")
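
The useNA argument mentioned in the help page offers another route to the same table. The sketch below uses a small made-up vector (toy_taxa) to show how its three settings behave.

```r
# A made-up vector with two missing values (hypothetical data)
toy_taxa <- c("Rodent", "Rodent", "Bird", NA, "Rodent", NA)

# Default ("no"): missing values are silently left out
default_counts <- table(toy_taxa)

# "ifany": an <NA> entry is added only when missing values are present
ifany_counts <- table(toy_taxa, useNA = "ifany")

# "always": the <NA> entry is shown even when its count is zero
always_counts <- table(c("Rodent", "Bird"), useNA = "always")

print(default_counts)
print(ifany_counts)
print(always_counts)
```

When several loaded packages document a topic with the same name, ?base::table asks for the base R help page specifically.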

Challenge

There are 3 lines of code below, and each attempts to create the same plot. Identify which produces a syntax error, which produces a semantic error, and which correctly creates the plot. (Hint: this may require you to infer what type of graph we’re trying to create!)

A. ggplot(rodents) + geom_bin_2d(aes(month, plot_type))
B. ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")
C. ggplot(rodents) + geom_tile(aes(month, plot_type))

Show me the solution

In this case, A correctly creates the graph, plotting as colors in the tile the number of times an observation is seen. It essentially runs the following lines of code:

R

rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())

OUTPUT

`summarise()` has grouped output by 'plot_type'. You can override using the
`.groups` argument.

R

ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))

B is a syntax error, and will produce the following error:

R

ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")

ERROR

Error in `geom_tile()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_count()` must only have an x or y aesthetic.

Finally, C is a semantic error. It produces a plot, which is rather meaningless:

R

ggplot(rodents) + geom_tile(aes(month, plot_type))

Summary

In general, encountering semantic errors can make our job more difficult, but the roadmap remains the same:

Seeing a problem arise in our code.
Interpreting the problem.

Other steps we might take then include:

Acting on parts of the error we can understand, such as changing input to a function.
Pulling up the R Documentation for relevant functions, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
Describing our problem to a search engine / generative LLM for more interpretable explanations.

And, when all else fails, we can prepare our code into a reproducible example for expert help.

The steps for identifying the problem and applying code first aid match what we’ve seen above. Here, however, seeing the problem arise in our code may be much more subtle, and depends on us recognizing output that we don’t expect or know to be wrong. Even when the code runs, R may give us warnings or informational messages as it executes. Most of the time, however, it’s up to the coder to be vigilant and make sure each step is running as it should. Interpreting the problem may also be more difficult, as R gives us little or no indication of how it’s misinterpreting our intent.
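
As a rough sketch of how these signals differ, the snippet below uses base R only. The helper name classify is made up for this illustration; it simply reports whether a piece of code ran cleanly or raised a message, a warning, or an error.

```r
# A warning: R still returns a result (with NA for the bad value) and keeps going
converted <- suppressWarnings(as.numeric(c("12", "7", "n/a")))
print(converted)

# Hypothetical helper: report which kind of condition an expression signals
classify <- function(expr) {
  tryCatch(
    {
      expr
      "ran cleanly"
    },
    message = function(m) "message",
    warning = function(w) "warning",
    error   = function(e) "error"
  )
}

print(classify(as.numeric("n/a")))       # a warning
print(classify(message("just a note")))  # an informational message
print(classify(log("a")))                # an error
print(classify(sqrt(4)))                 # nothing signalled
```

Only the error stops execution. After a warning or message the script carries on, which is why a surprising NA can travel a long way before anyone notices it.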

Callout

Generally, the further your code moves away from base R functions and toward specialized packages, the thinner both the documentation and the online help available through search engines tend to become. While base R errors can often be solved in a couple of minutes with a quick ?help check or an existing discussion on a website like Stack Overflow, errors arising from little-used packages applied in bespoke analyses might merit isolating your specific problem in a reproducible example for online help, or even getting in touch with the developers! Such community input and questions are often how packages and documentation improve over time.

3.3 How can I find where my code is failing?

Isolating your problem may not be as simple as assessing the output from a single function call on the line of code which produces your error. Often, it may be difficult to determine which lines or logical sections of code (e.g. functions) are producing the error.

Consider the example below, where we are now attempting to see which species of kangaroo rats appear in different plot types over the years. To do so, we’ll filter our dataset to include just the genus Dipodomys. Then we’ll plot a histogram of how many observations are seen in each plot type, with year on the x axis.

R

# Just k-rats
krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus

ggplot(krats, aes(year, fill=plot_type)) + 
geom_histogram() +
facet_wrap(~species)

ERROR

Error in `combine_vars()`:
! Faceting variables must have at least one value.

Uh-oh. Another error here, when we try to make a ggplot. But what is “combine_vars?” And then: “Faceting variables must have at least one value” What does that mean?

This is not an easily-interpretable error message from ggplot, and our code looks like it should run. Perhaps we can take a step back and see whether our error is actually not in the ggplot code itself. Often, when trying to isolate the problem area, it is a good idea to look back at the original input. So let’s take a look at our krats dataset.

R

krats

OUTPUT

# A tibble: 0 × 13
# ℹ 13 variables: record_id <dbl>, month <dbl>, day <dbl>, year <dbl>,
#   plot_id <dbl>, species_id <chr>, sex <chr>, hindfoot_length <dbl>,
#   weight <dbl>, genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

It’s empty! What went wrong with our “Dipadomys” filter? Let’s use a print statement to see which genera are included in the original rodents dataset.

R

print(rodents %>% count(genus))

OUTPUT

# A tibble: 12 × 2
   genus                n
   <chr>            <int>
 1 Ammospermophilus   136
 2 Baiomys              3
 3 Chaetodipus        382
 4 Dipodomys         9573
 5 Neotoma            904
 6 Onychomys         1656
 7 Perognathus        553
 8 Peromyscus        1271
 9 Reithrodontomys   1412
10 Rodent               4
11 Sigmodon           103
12 Spermophilus       151

We see two things here. For one, we’ve misspelled Dipodomys, which we can now amend. This quick function call also tells us that we should expect a data frame with 9573 rows after subsetting to the genus Dipodomys.

R

krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

OUTPUT

[1] 9573   13

R

ggplot(krats, aes(year, fill=plot_type)) + 
geom_histogram() +
facet_wrap(~species)

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Our improved code here looks good. Checking the dimensions of our subsetted data frame using the dim() function confirms we now have all the Dipodomys observations, and our plot is looking better. In general, having a ‘print’ statement or some other output after we manipulate data or other major steps can be a good way to check your code is producing intermediate results consistent with your expectations.
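
One way to build such checks into a script is sketched below, using base R and a made-up toy_rodents data frame in place of the real data. The misspelling is deliberate so that the check has something to catch. The named-message form of stopifnot() assumes R 4.0 or later.

```r
# A made-up stand-in for the rodents data (hypothetical values)
toy_rodents <- data.frame(
  genus = c("Dipodomys", "Dipodomys", "Neotoma", "Onychomys"),
  year  = c(1977, 1978, 1977, 1979)
)

target_genus <- "Dipadomys"  # misspelled on purpose, as in the example above

# Check the spelling against the values that actually occur in the data
spelling_ok <- target_genus %in% unique(toy_rodents$genus)
print(spelling_ok)

# Filter, then confirm the result is not empty before moving on
toy_krats <- toy_rodents[toy_rodents$genus == target_genus, ]
check <- tryCatch(
  stopifnot("filter returned no rows" = nrow(toy_krats) > 0),
  error = function(e) conditionMessage(e)
)
print(check)
```

In a real script the stopifnot() call would stand on its own line right after the filter, so that the script halts at the step that went wrong, not several lines later inside the plotting code.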

Callout

Often, giving your expert helpers access to the entire problem, with a detailed description of your desired output, allows you to directly improve your coding skills and learn about new functions and techniques.

Summary

With the length of a full script of code, it may be difficult to find exactly where our code is falling short, even if we can identify the proximal problem that arises (e.g. a plot not showing up).

Our roadmap to identifying problems in our code may now look like:

Seeing a problem arise in our code.
Isolating our code to the problem area.
Interpreting the problem.

Now we can see the need to isolate the specific areas of code causing the bug or problem, even if that does not solve the problem itself. There is no general rule of thumb as to how large the isolated area needs to be. But, unless our problem occurs on the first line, we should be able to narrow our code down a bit: any early lines which we know run correctly and as intended may not need to be included. Isolating the problem area as much as we can also makes it more understandable to others.

Let’s add to our code first aid:

Identify the problem area – add print statements immediately upstream or downstream of problem areas, check the desired output from functions, and see whether any intermediate output can be further isolated.
Acting on parts of the error we can understand, such as changing input to a function.
Pulling up the R Documentation for relevant functions, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
Describing our problem to a search engine / generative LLM for more interpretable explanations.

And, when all else fails, we can prepare our code into a reproducible example for expert help.

While this is similar to our previous checklists, we can now understand these steps as a continuous cycle of isolating the problem into smaller and smaller chunks for a reproducible example. Whenever a step above helps us identify the specific area or aspect of our code that is failing, we can zoom in on it and restart the checklist. We can stop once we reach a piece of code whose failure we no longer understand. At that point, we can excise that area for further help using a reprex.

Challenge

With our new skills in mind, try to isolate the problem area as much as you can with the following lines of code:

R

# first line, no problem

# second line -- problem

# third line -- no problem

# print or graph output -- does not work

Show me the solution

3.4 When should I prepare my code for a reprex?

There may be some point at which our code first aid does not help us anymore, and we still cannot figure out the problem our code is giving us, even if we’ve isolated it as much as we can – in that case, it may be time to turn to expert help, by asking a coworker, mentor, or someone online for aid.

While in a classroom setting we may be used to raising our hand, pointing at our code, and saying “I’m not sure what’s wrong,” in reality, people have limited time, bandwidth, or the requisite knowledge to help with any problem that might arise. That’s why capturing the problem in a reproducible example is an essential skill for getting unstuck: it allows you to ask for expert help with a problem that’s clearly identified, self-contained, and reproducible. We’ve already covered skills pertaining to the first two items above! We’ll see how to apply these below as we start to create a reproducible example.

Mickey is interested in understanding how kangaroo rat weights differ across species and sexes, so they create a quick visualization.

R

ggplot(rodents, aes(x = species, fill = sex))+
  geom_bar()

Whoa, this is really overwhelming! Mickey forgot that the dataset includes data for a lot of different rodent species, not just kangaroo rats. Mickey is only interested in two kangaroo rat species: Dipodomys ordii (Ord’s kangaroo rat) and Dipodomys spectabilis (Banner-tailed kangaroo rat).

Mickey also notices that there are three categories for sex: F, M, and what looks like a blank field when there is no sex information available. For the purposes of comparing weights, Mickey wants to focus only on rodents of known sex.

Mickey filters the data to include only the two focal species and only rodents whose sex is F or M.

R

rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

Because these scientific names are long, Mickey also decides to add common names to the dataset. They start by creating a data frame with the common names, which they will then join to the rodents_subset dataset:

R

# Add common names
common_names <- data.frame(species = unique(rodents_subset$species), common_name = c("Ord's", "Banner-tailed"))
common_names

OUTPUT

      species   common_name
1 spectabilis         Ord's
2       ordii Banner-tailed

But looking at the common names dataset reveals a problem! The common names are not properly matched to the scientific names. For example, the species ordii should correspond to Ord’s kangaroo rat, but currently, it is matched with the Banner-tailed kangaroo rat instead.

Challenge

Is this a syntax error or a semantic error? Explain why.
What “code first aid” steps might be appropriate here? Which ones are unlikely to be helpful?

Mickey re-orders the names and tries the code again. This time, it works! The common names are joined to the correct scientific names. Mickey joins the common names to rodents_subset.

R

common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
rodents_subset <- left_join(rodents_subset, common_names, by = "species")
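
Sorting works here, but it still relies on the two name vectors happening to line up. One alternative, sketched below with a made-up toy_subset data frame, is to write each scientific name next to its common name so that the pairing no longer depends on the order in which unique() encounters the species. Base R’s merge() stands in for left_join() so that the sketch runs without extra packages.

```r
# A made-up stand-in for rodents_subset (hypothetical values)
toy_subset <- data.frame(
  species = c("spectabilis", "ordii", "spectabilis"),
  weight  = c(120, 48, 131)
)

# unique() keeps values in order of first appearance, not alphabetical order
first_seen <- unique(toy_subset$species)
print(first_seen)

# Spell out each pair explicitly so the match does not depend on row order
common_names <- data.frame(
  species     = c("ordii", "spectabilis"),
  common_name = c("Ord's", "Banner-tailed")
)

toy_named <- merge(toy_subset, common_names, by = "species")
print(toy_named)
```

Because every species is written on the same line of thought as its common name, a mismatch would be visible in the code itself instead of only in the output.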

Now, Mickey is ready to start learning about kangaroo rat weights. They start by running a quick linear regression to predict weight based on species and sex.

R

# Explore k-rat weights
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model)

OUTPUT


Call:
lm(formula = weight ~ common_name + sex, data = rodents_subset)

Residuals:
     Min       1Q   Median       3Q      Max
-111.201   -6.466    2.534   10.799   45.799

Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      123.2007     0.8061  152.83   <2e-16 ***
common_nameOrd's -74.7342     1.3352  -55.97   <2e-16 ***
sexM                   NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.71 on 939 degrees of freedom
  (35 observations deleted due to missingness)
Multiple R-squared:  0.7694,	Adjusted R-squared:  0.7691
F-statistic:  3133 on 1 and 939 DF,  p-value: < 2.2e-16

The negative coefficient for common_nameOrd's tells Mickey that Ord’s kangaroo rats are significantly lighter than Banner-tailed kangaroo rats.

But something is wrong with the coefficients for sex. Why are there so many NA values for sexM?

When Mickey visualizes the data, they see a problem in the graph, too. As the model showed, Ord’s kangaroo rats are significantly smaller than Banner-tailed kangaroo rats. But something is definitely wrong! Because the boxes are colored by sex, we can see that all of the Banner-tailed kangaroo rats are male and all of the Ord’s kangaroo rats are female. That can’t be right! What are the chances of catching all one sex for two different species?

R

rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot()

WARNING

Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_boxplot()`).

To verify that the problem comes from the data, not from the plot code, Mickey creates a two-way frequency table, which confirms that there are no observations of female spectabilis or male ordii in rodents_subset. Something definitely seems wrong. Those rows should not be missing.

R

table(rodents_subset$sex, rodents_subset$species)

OUTPUT


    ordii spectabilis
  F   350           0
  M     0         626
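The link between a table like this and the NA coefficient in the model can be sketched with a small made-up dataset. The object toy and all of its values are hypothetical and are not taken from the Portal data. When every animal in one group is male and every animal in the other group is female, sex carries no information beyond the group, so lm() cannot estimate a separate sex effect:

```r
# Hypothetical toy data: all "A" animals are male, all "B" animals are female
toy <- data.frame(
  weight = c(120, 125, 118, 50, 48, 52),
  group  = c("A", "A", "A", "B", "B", "B"),
  sex    = c("M", "M", "M", "F", "F", "F")
)

# Two of the four sex-by-group cells are empty, as in rodents_subset
table(toy$sex, toy$group)

# sex is perfectly confounded with group, so its coefficient is reported as NA
toy_model <- lm(weight ~ group + sex, data = toy)
coef(toy_model)
```

This only illustrates why the model output looks the way it does. It does not explain how the rows went missing from rodents_subset in the first place.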

To double check, Mickey looks at the original dataset.

R

table(rodents$sex, rodents$species)

OUTPUT


    albigula eremicus flavus fulvescens fulviventer harrisi hispidus
  F      474      372    222         46           3       0       68
  M      368      468    302         16           2       0       42

    leucogaster maniculatus megalotis merriami ordii penicillatus  sp.
  F         373         160       637     2522   690          221    4
  M         397         248       680     3108   792          155    5

    spectabilis spilosoma taylori torridus
  F        1135         1       0      390
  M        1232         1       3      441

Not only were there originally males and females present from both ordii and spectabilis, but the original numbers were way, way higher! It looks like somewhere along the way, Mickey lost a lot of observations.

Mickey is feeling overwhelmed and not sure where their code went wrong. They were able to fix the errors and warning messages that they encountered so far, but this one seems more complicated, and there has been no clear indication of what went wrong. They work their way through the code first aid steps, but they are not able to solve the problem.

They decide to consult Remy’s road map to figure out what to do next.

Since code first aid was not enough to solve this problem, it looks like it’s time to ask for help using a reprex.

Key Points

The first step to getting unstuck is identifying a problem, isolating the problem area, and interpreting the problem.
Often, using “code first aid” (acting on error messages, looking at data and inputs, pulling up documentation, or asking a search engine or LLM) can help us quickly fix the error on our own.
If code first aid doesn’t work, we can ask for help and prepare a reproducible example (reprex) with a defined problem and isolated code.
We’ll cover the remaining steps for preparing a reproducible example (reprex) in later episodes.

Content from Minimal reproducible code

Last updated on 2025-07-22

Overview

Questions

Why is it important to make a minimal code example?
Which part of my code is causing the problem?
Which parts of my code should I include in a minimal example?
How can I tell whether a code snippet is reproducible or not?
How can I make my code reproducible?

Objectives

Explain the value of a minimal code snippet.
Identify the problem area of a script.
Identify supporting parts of the code that are essential to include.
Simplify a script down to a minimal code example.
Evaluate whether a piece of code is reproducible as is or not. If not, identify what is missing.
Edit a piece of code to make it reproducible
Have a road map to follow to simplify your code.
Describe the {reprex} package and its uses

Making a reprex

Simplify the code

When asking someone else for help, it is important to simplify your code as much as possible to make it easier for the helper to understand what is wrong. Simplifying code helps to reduce frustration and overwhelm when debugging an error in a complicated script. The more that we can make the process of helping easy and painless for the helper, the more likely that they will take the time to help.

Callout

Depending on how closely you have been following the lesson and which challenges you have attempted, your script may not look exactly like Mickey’s. That’s okay!

Mickey has written a lot of code so far. The code is also a little messy–for example, after fixing the previous errors, they sometimes commented out the old code and kept it for future reference.

Create a new script

To make the task of simplifying the code less overwhelming, let’s create a separate script for our reprex. This will let us experiment with simplifying our code while keeping the original script intact.

Let’s create and save a new, blank R script and give it a name, such as “reprex-script.R”.

Making an R script

There are several ways to make an R script:

File > New File > R Script
Click the white square with a green plus sign at the top left corner of your RStudio window
Use a keyboard shortcut: Cmd + Shift + N (on a Mac) or Ctrl + Shift + N (on Windows)

We’re going to start by copying over all of our code, so we have an exact copy of the full analysis script.

R

# Minimal reproducible example script
# Load packages and data
library(tidyverse)
surveys <- read_csv("data/surveys_complete_77_89.csv")

# Take a look at the data
glimpse(surveys)
min(surveys$year)
max(surveys$year)

# Make some plots
ggplot(surveys, aes(x = taxa)) + geom_bar()
table(surveys$taxa)
?table
table(surveys$taxa, exclude = NULL)

# Just rodents
rodents <- surveys %>% filter(taxa == "Rodent")
rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())
ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))

# Just k-rats
krats <-rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

ggplot(krats, aes(year, fill=plot_type)) +
  geom_histogram() +
  facet_wrap(~species)
ggplot(rodents, aes(x = species, fill = sex))+
  geom_bar()
rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

# Add common names
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
rodents_subset <- left_join(rodents_subset, common_names, by = "species")

# Explore k-rat weights
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model) # this looks weird
rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # still looks weird
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

Now, we will follow a process:

1. Identify the symptom of the problem.
2. Remove a piece of code to make the reprex more minimal.
3. Re-run the reprex to make sure the reduced code still demonstrates the problem (check that the symptom is still present).

In this case, the symptom is that we are missing rows in rodents_subset that were present in rodents and should not have been removed!
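Because the symptom has to be re-checked after every deletion, it can help to write the check down once. The following is a minimal sketch, not part of Mickey's script: the function symptom_present() and the two tiny data frames are hypothetical stand-ins for rodents and rodents_subset, built from made-up rows.

```r
# Hypothetical stand-ins for rodents and rodents_subset (made-up rows)
full_data <- data.frame(
  species = c("ordii", "ordii", "spectabilis", "spectabilis"),
  sex     = c("F", "M", "F", "M")
)
reduced_data <- full_data[c(1, 4), ]  # two combinations have gone missing

# TRUE if some sex/species combination present in `full` is absent from `sub`
symptom_present <- function(full, sub) {
  full_tab <- table(full$sex, full$species)
  sub_tab <- table(
    factor(sub$sex, levels = rownames(full_tab)),
    factor(sub$species, levels = colnames(full_tab))
  )
  any(full_tab > 0 & sub_tab == 0)
}

symptom_present(full_data, reduced_data)  # TRUE: the symptom is still there
symptom_present(full_data, full_data)     # FALSE: nothing is missing
```

Comparing the two table() outputs by eye, as Mickey does, works just as well. The point is simply to run the same check after each piece of code is removed.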

Let’s start by identifying pieces of code that we can probably remove. A good start is to look for lines of code that do not create variables for later use, or lines that add complexity to the analysis that is not relevant to the problem at hand.

We can start by removing the broken code that we commented out earlier. Also, adding the date column is not directly relevant to the current problem. Let’s go ahead and remove those pieces of code. Now our script looks like this:

R

# Minimal reproducible example script
library(tidyverse)
surveys <- read_csv("data/surveys_complete_77_89.csv")
glimpse(surveys)
min(surveys$year)
max(surveys$year)
# Read in the data
ggplot(x = taxa) + geom_bar()
ggplot(aes(x = surveys$taxa)) + geom_bar()
ggplot(surveys, aes(x = taxa)) + geom_bar()
table(surveys$taxa)
?table
table(surveys$taxa, exclude = NULL)
rodents <- surveys %>% filter(taxa == "Rodent")
rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())
ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))
ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")
ggplot(rodents) + geom_tile(aes(month, plot_type))
krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus

ggplot(krats, aes(year, fill=plot_type)) +
  geom_histogram() +
  facet_wrap(~species)
krats
print(rodents %>% count(genus))
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

ggplot(krats, aes(year, fill=plot_type)) +
  geom_histogram() +
  facet_wrap(~species)
ggplot(rodents, aes(x = species, fill = sex))+
  geom_bar()
rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))
## Adding common names
common_names <- data.frame(species = unique(rodents_subset$species), common_name = c("Ord's", "Banner-tailed"))
common_names
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
rodents_subset <- left_join(rodents_subset, common_names, by = "species")
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model)
rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot()
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

When we run this code, we can confirm that it still demonstrates our problem. There are still many rows missing from rodents_subset.

We’ve made progress on minimizing our code, but we still have a ways to go. This script is still pretty long! Let’s identify more pieces of code that we can remove.

Exercise 2: Minimizing code

Which other lines of code can you remove to make this script more minimal? After removing each one, be sure to re-run the code to make sure that it still reproduces the error.

Show me the solution

Visualizing sex by species (ggplot) can be removed because it generates a plot but does not create any variables that are used later.
Filtering to only rodents can be removed because later we filter to only two species in particular.
Adding common names can be removed because we didn’t actually use those common names. This one is tricky because technically we did use the common names in the rodents_subset plot. But is that plot really necessary? We can still demonstrate the problem using the table() lines of code at the end. Also, we could still make the equivalent plot using the species column instead of the common_name column, and it would demonstrate the same thing!
The weight model and the summary can be removed because the table() calls already demonstrate the problem.

A totally minimal script would look like this:

R

rodents <- read.csv("data/surveys_complete_77_89.csv")

rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

Great, now we have a totally minimal script!

However, we’re not done yet.

Exercise 3: The problem area is not enough

Let’s suppose that Mickey has created the minimal problem area script shown above. They email this script to Remy so that Remy can help them debug the code.

Remy opens up the script and tries to run it on their computer, but it doesn’t work.

What do you think will happen when Remy tries to run the code from this reprex script?
What do you think Mickey should do next to improve the minimal reproducible example?

We haven’t yet included enough code to allow a helper, such as Remy, to run the code on their own computer. If Remy tries to run the reprex script in its current state, they will encounter errors because they don’t have access to the same R environment that Mickey does.

Include dependencies

R code consists primarily of functions and variables. In order to make our minimal examples truly reproducible, we have to give our helpers access to all the functions and variables that are necessary to run our code.

First, let’s talk about functions. Functions in R typically come from packages. You can access them by loading the package into your environment.

To make sure that your helper has access to the packages necessary to run your reprex, you will need to include calls to library() for whichever packages are used in the code. For example, if your code uses the function lmer from the lme4 package, you would have to include library(lme4) at the top of your reprex script to make sure your helper has the lme4 package loaded and can run your code.

Default packages

Some packages, such as {base} and {stats}, are loaded in R by default, so you might not have realized that a lot of functions, such as dim, colSums, factor, and length actually come from those packages!

You can see a complete list of the functions that come from the {base} and {stats} packages by running library(help = "base") or library(help = "stats").
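If you are unsure where a particular function comes from, one option is find() from the {utils} package, which is also loaded by default. This is a small sketch run in a fresh R session with no extra packages loaded; the results could differ if another loaded package defines a function with the same name.

```r
# find() reports where a function name is defined among the loaded packages
find("table")   # "package:base"
find("lm")      # "package:stats"

# The same check for the functions mentioned above
sapply(c("dim", "colSums", "factor", "length"), find)
```

Note that find() only searches packages that are currently loaded, so a function from a package you have not loaded yet returns an empty result.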

Let’s do this for our own reprex. We can start by identifying all the functions used, and then we can figure out where each function comes from to make sure that we tell our helper to load the right packages.

The first function used in our example is ggplot(), which comes from the package ggplot2. Therefore, we know we will need to add library(ggplot2) at the top of our script.

The function geom_boxplot() also comes from ggplot2. The pipe operator %>% is not part of ggplot2; it becomes available when we load dplyr, so we will add library(dplyr) too. We also used the function table(). Running ?table tells us that the table function comes from the package {base}, which is automatically installed and loaded when you use R. That means we don’t need to include library(base) in our script.

Our reprex script now looks like this:

R

# Mickey's reprex script

# Load necessary packages to run the code
library(dplyr)
library(ggplot2)

rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?

ERROR

Error: object 'rodents_subset' not found

R

# Investigate
table(rodents_subset$sex, rodents_subset$species)

ERROR

Error: object 'rodents_subset' not found

R

table(rodents$sex, rodents$species)

OUTPUT


    albigula eremicus flavus fulvescens fulviventer harrisi hispidus
  F      474      372    222         46           3       0       68
  M      368      468    302         16           2       0       42

    leucogaster maniculatus megalotis merriami ordii penicillatus  sp.
  F         373         160       637     2522   690          221    4
  M         397         248       680     3108   792          155    5

    spectabilis spilosoma taylori torridus
  F        1135         1       0      390
  M        1232         1       3      441

Installing vs. loading packages

But what if our helper doesn’t have all of these packages installed? Won’t that make the code non-reproducible?

Typically, we don’t include install.packages() in our code for each of the packages that we include in the library() calls, because install.packages() is a one-time piece of code that doesn’t need to be repeated every time the script is run. We assume that our helper will see library(specialpackage) and know that they need to go install “specialpackage” on their own.

Technically, this makes that part of the code not reproducible! But it’s also much more “polite”. Our helper might have their own way of managing package versions, and forcing them to install a package when they run our code risks messing up their workflow. It is a common convention to stick with library() and let them figure it out from there.

Exercise 4: Which packages are essential?

In each of the following code snippets, identify the necessary packages (or other code) to make the example reproducible.

a.
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
tab_model(weight_model)

b.
mod <- lmer(weight ~ hindfoot_length + (1|plot_type), data = rodents)
summary(mod)

c.
rodents_processed <- process_rodents_data(rodents)
glimpse(rodents_processed)

This exercise should take about 10 minutes.

Show me the solution

a. lm is part of base R, so there’s no package needed for that. tab_model comes from the package sjPlot. You could add library(sjPlot) to this code to make it reproducible.
b. lmer is a linear mixed modeling function that comes from the package lme4. summary is from base R. You could add library(lme4) to this code to make it reproducible.
c. process_rodents_data is not from any package that we know of, so it was probably a function that the author wrote themselves. In order to make this example reproducible, you would have to include the definition of process_rodents_data. glimpse is probably from dplyr, but it’s worth noting that there is also a glimpse function in the pillar package, so this might be ambiguous. This is another reason it’s important to specify your packages: if you leave your helper guessing, they might load the wrong package and misunderstand your error!

Including library() calls will definitely help Remy run the code. But this code still won’t work as written because Remy does not have access to the same objects that Mickey used in the code.

The code as written relies on rodents_subset, which Remy will not have access to if they try to run the code. That means that we’ve succeeded in making our example minimal, but it is not reproducible: it does not allow someone else to reproduce the problem!
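One rough way to see this from Remy's side is to ask R whether the objects exist. The sketch below assumes it is run in a fresh R session where nothing has been created yet; in Mickey's own session the same code would return TRUE for both objects.

```r
# In a fresh R session (like Remy's), objects from Mickey's session do not exist
objects_needed <- c("rodents", "rodents_subset")
vapply(objects_needed, exists, logical(1))
```

Any FALSE in the result marks an object that the reprex script has to create itself before the rest of the code can run.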

[PULL UP ROAD MAP]

Exercise 5: Reflection

Let’s take a moment to reflect on this process.

What’s one thing you learned in this episode? An insight; a new skill; a process?
What is one thing you’re still confused about? What questions do you have?

This exercise should take about 5 minutes.

Key Points

Making a reprex is the next step after trying code first aid.
In order to make a good reprex, it is important to simplify your code.
Simplify code by removing parts not directly related to the question.
Give helpers access to the functions used in your code by loading all necessary packages.

Content from Minimal reproducible data

Last updated on 2025-07-22

Overview

Questions

What is a minimal reproducible dataset, and why do I need it?
How do I create a minimal reproducible dataset?
Can I just use my own data?

Objectives

Describe a minimal reproducible dataset
Identify the aspects of your data necessary to replicate your issue
Create a dataset from scratch to replicate your issue
Share your own dataset in a way that is minimal, accessible, and reproducible
Subset an existing dataset to replicate your issue

3.1 What is a minimal reproducible dataset and why do I need it?

[INSERT ROADMAP]

Now that Mickey has narrowed down their problem area and stripped down their code to make it minimal, they need to ensure it is reproducible. This means it needs to be accessible and executable, so that anyone else can copy and paste it into their own R session, run the code, and replicate the issue. Importantly, example code needs example data in order to run! Therefore, every reprex requires a minimal reproducible dataset to use with the code.

Furthermore, as we have seen previously, sometimes the source of the problem isn’t actually the code, but rather the data! Providing an appropriate example (or mock) dataset allows a helper to better investigate and manipulate that data to fix the problem.

Remember: your helper may not have access to your computer and files!

You might be used to always loading data from separate files, but helpers can’t access those files. Even if you sent someone a file, they would need to put it in the right directory, make sure to load it in exactly the same way, make sure their computer can open it, etc. Since the goal is to make everyone’s lives as easy as possible, we need to think about our data in a different way: as a mock object created in the script itself.

As with the example code, an example dataset should also be minimal–free of unnecessary information. By removing extraneous information and only keeping what is required to replicate the issue, a helper can more clearly see how the data is structured and where the problem arises. While it may sometimes feel like unnecessary effort, the process of creating a minimal dataset will not only help others help you, but also allow you to better understand your data and often discover the source of the problem without the need for external help.

In summary, a minimal reproducible dataset must be:

minimal: it only contains information required to run your minimal code. You can also think of this as being relevant to the problem (keep only what is necessary).
reproducible: it must be accessible to someone without your computer, and it must consistently replicate your problem. This means it also needs to be complete, meaning there are no dependencies that have been omitted (e.g., packages).

Pro-tip

Examples of minimal reproducible examples can also be found in R’s help pages (opened with ?, for example in RStudio). Just scroll all the way down to the Examples section. These examples are usually minimal and reproducible, since they are intended to be copied, pasted, and run by anyone.

For example, let’s look at the function mean:

R

?mean

We see examples that can be run directly on the console, with no additional code.

R

x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))

OUTPUT

[1] 8.75 5.50

In this case, x is the mock dataset consisting of just 1 variable. Notice how it was created as part of the example. This will be your goal with your reprex.

Exercise 1: Test your knowledge!

The datasets listed below are not well suited for use in a reprex. Can you explain why? Copy each one into your own R script to check whether they are reproducible. What did you find?

A. [screenshot of a dataset; image not shown]
B. sample_data <- read_csv("/Users/kaija/Desktop/RProjects/ResearchProject/data/sample_dataset.csv")
C. dput(complete_old[1:100,])
D. sample_data <- data.frame(species = species_vector, weight = c(10, 25, 14, 26, 30, 17))

Show me the solution

A. Not reproducible because it is a screenshot.
B. Not reproducible because it is a path to a file that only exists on someone else’s computer and therefore you do not have access to it using that path.
C. Not minimal, it has far too many columns and probably too many rows. It is also not reproducible because we were not given the source for complete_old.
D. Not reproducible because we are not given the source for species_vector.

Extra Practice (optional)

Let’s say we want to know the average weight of all the species in our rodents dataset. We try to use the following code…

R

mean(rodents$weight)

OUTPUT

[1] NA

…but it returns NA! We don’t know why that is happening and we want to ask for help.

Which of the following represents a minimal reproducible dataset for this code? Can you describe why the other ones are not?

A. sample_data <- data.frame(month = rep(7:9, each = 2), hindfoot_length = c(10, 25, 14, 26, 30, 17))
B. sample_data <- data.frame(weight = rnorm(10))
C. sample_data <- data.frame(weight = c(100, NA, 30, 60, 40, NA))
D. sample_data <- sample(rodents$weight, 10)
E. sample_data <- rodents_modified[1:20,]

Show me the solution

The correct answer is C!

A. does not include the variable of interest (weight).
B. does not produce the same problem (an NA result); the code runs just fine.
C. is minimal and reproducible.
D. is not reproducible. sample() randomly draws 10 items; sometimes they may include NAs, sometimes they may not, so it is not guaranteed to reproduce the problem. It can be used if a seed is set (see next section for more info).
E. uses a dataset that isn’t accessible without previous data wrangling code: the object rodents_modified doesn’t exist.
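To check an answer like this, run the candidate dataset through the problem code and see whether the symptom appears. A minimal sketch, using only the options listed in the exercise:

```r
# Option C: the mock data reproduces the symptom
sample_data <- data.frame(weight = c(100, NA, 30, 60, 40, NA))
mean(sample_data$weight)   # NA, the same symptom as mean(rodents$weight)

# Option B, by contrast, contains no missing values, so the symptom disappears
is.na(mean(rnorm(10)))     # FALSE
```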

3.2 Can I just use my own data?

While Mickey is grateful to Remy for providing them with a roadmap to follow when they need help, they still feel it would be much easier to just send Remy their data rather than creating a different dataset.

Exercise 2: Reflect

When Mickey feels like sharing their own data would be easier, who do you think it would be easier for: Mickey or Remy?
Can you think of any reasons why sharing the original data may not be possible or recommended?

Show me the solution

Mickey is thinking that it would be easier for themselves, not necessarily for Remy.

Remember: one of the goals of creating a reprex is to help the helpers. They don’t have to help, they are volunteering their time. As such, they deserve to be treated with kindness and respect. If you find yourself getting frustrated at how much time and effort creating a reprex might be taking, remember that (1) trusting the process may reveal the solution along the way; and (2) being kind, clear, and helpful will reward you with a quicker, more accurate solution.

There are several reasons why sharing the original data may not be possible or recommended. The original dataset may be:

too large - the Portal dataset is ~35,000 rows with 13 columns and contains data for decades. That’s a lot!
private - the dataset might not be published yet, it may not be yours to share, or maybe it includes protected information such as personal medical information or the location of endangered species.
hard to send - on most online forums, you can’t attach supplemental files (more on this later). Even if you are just sending data to a colleague, file paths can get complicated, the data might be too large to attach, etc.
complicated - it would be hard to locate the relevant information. One thing to steer away from is needing a ‘data dictionary’ to understand what all the columns mean (e.g. what is “plot type” in ratdat?). We don’t want our helper to waste valuable time figuring out what everything means.
highly derived/modified from the original file. You may have already done a bunch of preliminary data wrangling you don’t want to include when you send the example, so you would need to provide the intermediate dataset directly to your helper.

Mickey is not entirely wrong. While there are instances when it is not possible or advisable to share original data, there are also many ways around such challenges and some instances may indeed benefit from using the original data. However, it is still important to provide helpers with data that is minimal and reproducible. Therefore, while Mickey does not have to create a brand new example dataset, they should at least work to make their original data minimal and accessible (see the above reflection exercise), which isn’t necessarily easier or faster than creating a mock dataset from scratch.

In summary, there are multiple ways to provide a minimal dataset for a reprex, including using a simplified version of the original dataset. The key is that any data provided remains minimal and reproducible. In the following section we will highlight 3 common approaches.

3.3 How do I create a minimal reproducible dataset?

In general, there are 3 common ways to provide minimal reproducible data for a reprex.

Write a script that creates a new mock dataset with the same key characteristics as the original data.
Edit the original data to be minimal and reproducible.
Use a dataset that is already provided by R (e.g., cars, npk, penguins, etc.). For a complete list, use library(help = "datasets").
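As a rough illustration of the second and third options, the sketch below uses the built-in cars dataset as a stand-in for real data. The object names small_cars and rebuilt are made up for this example. dput() prints R code that rebuilds an object, which is one way to share a small slice of data as text:

```r
# Option 3: a built-in dataset needs no file and no extra code
small_cars <- head(cars, 3)
small_cars

# Option 2: dput() prints R code that rebuilds an object,
# so a small slice of real data can be pasted straight into a script
dput(small_cars)

# Round trip: evaluating the dput() text recreates the same data frame
rebuilt <- eval(parse(text = capture.output(dput(small_cars))))
all.equal(rebuilt, small_cars)
```

The round trip at the end is only there to show that the printed text really does contain the whole object. In a reprex you would paste the dput() output directly into the script.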

The developers of this lesson believe everyone is entitled to use any option they prefer, and the rest of this episode will expand on each of the 3 approaches listed above. However, within the data science community, opinions differ on which method is best. Below is a summary of the advantages and disadvantages of each approach, based on many conversations with several data science groups.

Data from Scratch

Advantages:
Often the most concise
Easiest for helpers
Helps problem-solve along the way (e.g., identify what data aspects are generating the problem)
More universally applicable: easy to share, collaborate, teach, and understand
Avoids privacy/security concerns
Lets you clearly illustrate the sought outcome
Uses important-to-learn skills
Easier for more skilled individuals

Disadvantages (while some are universal, many apply mostly to novices):
Can be intimidating
Requires good understanding of your data
Harder to generate if the error is idiosyncratic or dependent on having a large dataset
Time-consuming if unskilled
Iterative (you may need to trial and error a few times to replicate the problem)
More likely to require analogies: less interpretable/connected to real problems, more likely to require greater context
Risks generating XY problems: make sure you are asking the right question/replicating the right problem

R-built Data

Advantages:
Simple and easy to share: no need to provide additional data
Familiar: helpers already know the data structure
Potentially faster than starting from scratch (i.e. faster than generating rows/columns de novo)
No privacy/security concerns
Can easily be manipulated or simplified
Generalizes the problem

Disadvantages:
May require a good mental model of the problem
Harder if the error is idiosyncratic
Greatest risk of generating XY problems: make sure you are asking the right question/replicating the right problem
Iterative (you may need to trial and error a few times to replicate the problem)
Need to re-think the question so it matches a different dataset and context (mental gymnastics)
Still need to choose which dataset and variables are more appropriate

Your Data

Advantages:
Can require the least mental effort
Problem is easy to replicate even if you don’t understand it at all
Accurately represents the actual problem; avoids XY problems
Richer context may intrigue/motivate helpers
Can be quicker if dataset is small
May be required for idiosyncratic problems that are based on aspects of the data itself that you don’t know about (e.g. when the data itself, rather than the code per se, is central to the problem)
Captures data structures that are difficult or time-consuming to replicate if you are a novice

Disadvantages:
Less growth-minded if chosen as the “easy way out”: skips the learning process of trying to construct a dataset and any insights that that process might give you
Easy to do poorly and think that you’re doing it well; perceived “easiness” leads to overcomplication/confusion
Leaves all the work to the helpers if you don’t also work to minimize it (less motivation for harder work)
Could obfuscate the problem if not minimized (less likely to find a quick answer)
More difficult to share: may be large/messy
Security/privacy/sanitization problems

3.4 Creating a mock dataset from scratch

While starting from scratch can be daunting at first, it often becomes the easier and faster option once you are familiar with the process; once you understand the basic building blocks and start practicing, it becomes the most straightforward method of creating a minimal reproducible dataset. This is also the preferred method for other activities that require a reprex (e.g., teaching, collaborating, developing, etc.), and it often provides valuable problem-solving insights. So let’s break down this process to be more digestible!

Mickey is still new at this and has 2 pressing questions:

How do I create a dataset from scratch?
How do I know which key aspects of my data to recreate?

Let’s start with the first.

There are many ways one can create a dataset in R.

You can start by creating vectors using c()

R

vector <- c(1,2,3,4) 
vector

OUTPUT

[1] 1 2 3 4

You can also add some randomness by sampling from a vector using sample().

For example you can sample numbers 1 through 10 in a random order

R

x <- sample(1:10)
x

OUTPUT

 [1]  7  9  3 10  5  2  8  4  1  6

Or you can randomly sample from a normal distribution

R

x <- rnorm(10)
x

OUTPUT

 [1]  1.19818610  0.01704092  2.07772031 -0.96906642 -0.70851763 -0.44057679
 [7] -0.02585938 -0.40951825 -0.11618711  1.47806367

You can also use letters to create categorical variables.

R

x <- sample(letters[1:4], 20, replace = TRUE)
x

OUTPUT

 [1] "a" "d" "b" "a" "a" "b" "b" "c" "a" "a" "c" "b" "c" "a" "d" "d" "c" "a" "c"
[20] "d"

Remember that a data frame is just a collection of vectors. You can create a data frame using data.frame() (or tibble(), which is available after loading dplyr). You can then define a vector for each variable.

R

data <- data.frame(x = sample(letters[1:3], 20, replace = TRUE),
                   y = rnorm(20))
head(data)

OUTPUT

  x           y
1 a  1.27293263
2 b -1.67935435
3 c  0.09416562
4 b -1.13942537
5 a  0.70867496
6 a  0.11686550

However, when sampling at random, remember to call set.seed() before generating the data, and include that call in the script you send, so that you and your helper get the same numbers!
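A minimal sketch of what setting a seed does. The seed value 123 and the object names are arbitrary choices for illustration:

```r
set.seed(123)           # any fixed number works; 123 is an arbitrary choice
first_draw <- rnorm(5)

set.seed(123)           # resetting the seed restarts the same random sequence
second_draw <- rnorm(5)

identical(first_draw, second_draw)   # TRUE
```

As long as the set.seed() call sits directly above the code that generates the random data, anyone who runs the script should end up with the same values.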

Callout

For more handy functions for creating data frames and variables, see the [cheatsheet]. Some questions may require specific formats. For these, you can use any of the provided as.someType functions: as.factor, as.integer, as.numeric, as.character, as.Date, as.xts.
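A few of these conversion functions in action, as a small sketch with made-up values (as.xts is left out because it requires the separate xts package):

```r
# Made-up example values, converted to specific types
plot_type <- as.factor(c("Control", "Rodent Exclosure", "Control"))
levels(plot_type)

survey_date <- as.Date("1977-07-16")
class(survey_date)

as.integer(c("7", "8", "9"))
```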

Extra Practice (optional)

Create a data frame with:

A. One categorical variable with 2 levels and one continuous variable.

B. One continuous variable that is normally distributed.

C. Name, sex, age, and treatment type.

3.5 Identifying the key aspects of the data

No matter which approach you choose for providing a dataset, the key is always to identify which elements of the original data are necessary to replicate the problem. To do so, here are a few guiding questions:

Which variables are necessary to the problem?
What data type (discrete or continuous) is each variable?
How many levels and/or observations are necessary?
Do the values need to be distributed in a specific way?
Are there any NAs that could be relevant?
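Each of the guiding questions above can usually be answered with one or two base R calls. The sketch below uses a made-up data frame called toy as a stand-in for your real data; the names and values are assumptions for illustration only.

```r
# Toy stand-in for a real dataset (values are made up)
toy <- data.frame(
  id     = 1:6,
  group  = c("a", "b", "c", "a", NA, "b"),
  weight = c(10.2, 11.5, NA, 9.8, 12.1, 10.9)
)

str(toy)                                        # variable names and data types
n_obs     <- nrow(toy)                          # number of observations
n_groups  <- length(unique(na.omit(toy$group))) # number of non-missing levels
na_counts <- colSums(is.na(toy))                # missing values per column
table(toy$group, useNA = "ifany")               # how observations spread across levels
summary(toy$weight)                             # rough look at the distribution
```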

Let’s check back with Mickey and the minimal code they settled on:

R

# Mickey's minimal code

library(dplyr)
library(ggplot2)

rodents <- read.csv('data/surveys_complete_77_89.csv')

rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(rodents_subset$sex, rodents_subset$species)

OUTPUT


    ordii spectabilis
  F   333           0
  M     0         610

R

table(rodents$sex, rodents$species)

OUTPUT


    albigula audubonii bilineata brunneicapillus chlorurus clarki eremicus
          62        69       223              23        11      1       14
  F      474         0         0               0         0      0      372
  M      368         0         0               0         0      0      468

    flavus fulvescens fulviventer fuscus gramineus harrisi hispidus leucogaster
        15          0           0      2         3     136        2          16
  F    222         46           3      0         0       0       68         373
  M    302         16           2      0         0       0       42         397

    leucophrys maniculatus megalotis melanocorys merriami ordii penicillatus
             2           9        33          13       45     3            6
  F          0         160       637           0     2522   690          221
  M          0         248       680           0     3108   792          155

    scutalatus  sp. spectabilis spilosoma squamata taylori torridus viridis
             1   18          42       149       16       0       28       1
  F          0    4        1135         1        0       0      390       0
  M          0    5        1232         1        0       3      441       0

To make sure Remy can run this code on any computer, Mickey needs to provide the dataset that the code requires.

Exercise 3: Quick, think!

Based on the current minimal code, which dataset does Mickey need to recreate? Hint: they currently have two datasets, rodents and rodents_subset.

Show me the solution

Mickey needs to provide a mock dataset to replace the original rodents dataset.

Let’s take a closer look at the dataset we need to substitute and then answer the questions outlined earlier.

R

head(rodents)

OUTPUT

  record_id month day year plot_id species_id sex hindfoot_length weight
1         1     7  16 1977       2         NL   M              32     NA
2         2     7  16 1977       3         NL   M              33     NA
3         3     7  16 1977       2         DM   F              37     NA
4         4     7  16 1977       7         DM   M              36     NA
5         5     7  16 1977       3         DM   M              35     NA
6         6     7  16 1977       1         PF   M              14     NA
        genus  species   taxa                plot_type
1     Neotoma albigula Rodent                  Control
2     Neotoma albigula Rodent Long-term Krat Exclosure
3   Dipodomys merriami Rodent                  Control
4   Dipodomys merriami Rodent         Rodent Exclosure
5   Dipodomys merriami Rodent Long-term Krat Exclosure
6 Perognathus   flavus Rodent        Spectab exclosure

Exercise 4

Try to answer the following questions on your own to determine what we need to include in our minimal reproducible dataset:

Which variables does Mickey need to reproduce their problem?
What data type (discrete or continuous) is each variable?
How many levels and/or observations are necessary?
Do the values need to be distributed in a specific way?
Are there any NAs that could be relevant?

Let’s go over the answers together and help Mickey build a dataset as we go along!

How many variables does Mickey need to reproduce their problem?

They need species, sex, and maybe a third identifier like record_id. This means they potentially need 3 vectors (remember, each column in a dataframe is essentially a vector, and in “tidy data” should correspond with a variable; each row is then an observation).

What data type (discrete or continuous) is each variable?

Species and sex are both discrete (categorical) variables, while record_id is numeric (a running integer count of the observations).

How many levels and/or observations are necessary?

Since Mickey is filtering their dataset down to 2 categories for both species and sex, that means they need at least 3 levels in each to start with. In terms of number of observations there don’t seem to be specific restrictions other than they probably want at least 1 observation per original category, so 2*3=6, or they can just pick a generally nice number like 10. This is where creating a reprex dataset becomes a bit more of an art than a science; it is common to use trial and error until the problem is replicated accurately.

Do the values need to be distributed in a specific way?

This question probably isn’t going to be relevant most of the time, but certainly worth considering. If Mickey needed a longer dataset of measurements they may have wanted to make sure it was normally distributed. If they needed a longer dataset of counts they may have wanted to make sure it was Poisson distributed. Or maybe they had binary data. But in this case, Mickey has a fairly short dataset and the code doesn’t include anything that should vary depending on the distribution, so it probably doesn’t matter. Again, this process can be one of trial and error. They can always come back to this question if they are unable to replicate their problem (hint: in which case the distribution may be related to the problem they are having!).
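If the distribution does turn out to matter, R's random number functions make it easy to mock up each of the cases mentioned above. This is only a sketch: the column names, seed, and parameter values are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the draws can be repeated

n <- 20
mock <- data.frame(
  measurement = rnorm(n, mean = 50, sd = 5),     # continuous, roughly normal
  count       = rpois(n, lambda = 3),            # non-negative integer counts
  present     = rbinom(n, size = 1, prob = 0.5)  # binary 0/1 outcome
)

head(mock)
```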

Are there any NAs that could be relevant?

Mickey’s data does have missing values for the sex variable (the unlabeled row in the table above). It might not matter or it could be important, so let’s have them put NAs in the mock dataset just in case.

R

# We need 3 variables: species, sex, and record_id
# species and sex are categorical with at least 3 levels, one of which is blank for sex
species <- sample(letters, 3, replace=F) 
          # or name 3 categories like we do with sex below
species

OUTPUT

[1] "b" "m" "l"

R

sex <- c('M','F',NA)
sex

OUTPUT

[1] "M" "F" NA

R

# record_id is continuous 
record_id <- 1:10
record_id

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

R

# Now let's go sampling and put it all together
sample_data <- data.frame(
  record_id = record_id,
  species = sample(species, 10, replace=T),
  sex = sample(sex, 10, replace=T)
)
sample_data

OUTPUT

   record_id species  sex
1          1       b    F
2          2       b <NA>
3          3       l <NA>
4          4       l <NA>
5          5       l <NA>
6          6       l    M
7          7       m <NA>
8          8       l    M
9          9       m <NA>
10        10       l    F

And just like that, we helped Mickey create a mock dataset from scratch! Notice that they could also have built the same type of dataset in a single step by creating each vector directly inside data.frame():

R

sample2_data <- data.frame(
  record_id = 1:10,
  species = sample(letters[1:3], 10, replace=T),
  sex = sample(c('M','F', NA), 10, replace=T)
)
sample2_data

OUTPUT

   record_id species sex
1          1       a   F
2          2       c   F
3          3       c   F
4          4       a   M
5          5       a   F
6          6       a   M
7          7       c   F
8          8       a   F
9          9       c   F
10        10       a   F

Important: Notice that the outputs of the two datasets are not the same. If you want the outputs to be EXACTLY the same each time, but you are using sample() which is an inherently random process, you must first use set.seed() and share that with your helper too.

R

set.seed(1) # set seed before recreating the sample
sample_data <- data.frame(
  record_id = 1:10,
  species = sample(letters[1:3], 10, replace=T),
  sex = sample(c('M','F', NA), 10, replace=T)
)
sample_data

OUTPUT

   record_id species  sex
1          1       a <NA>
2          2       c    M
3          3       a    M
4          4       b    M
5          5       a    F
6          6       c    F
7          7       c    F
8          8       b    F
9          9       b <NA>
10        10       c    M

Callout

Adding set.seed() at the start of your reprex ensures that anyone who runs the same code in the same order gets the same results. However, if you use it more generally, you may want to read more about it. For example, below we set a seed of 2 and then run sample(10) twice. You will notice that the two sample() calls produce different output. However, if you run the whole block again, each call produces the same output as it did the first time.

R

set.seed(2)
sample(10)

OUTPUT

 [1]  5  6  9  1 10  7  4  8  3  2

R

sample(10)

OUTPUT

 [1]  1  3  6  2  9 10  7  5  4  8
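One way to convince yourself of this behavior is to store the draws and compare them. The sketch below repeats the two sample(10) calls after resetting the seed; the object names first_run and second_run are just illustrative.

```r
set.seed(2)
first_run <- list(sample(10), sample(10))   # two draws after setting the seed once

set.seed(2)                                 # reset to the same seed
second_run <- list(sample(10), sample(10))  # the same two calls, in the same order

# Compare the complete sequence of draws between the two runs
identical(first_run, second_run)
```

The comparison is made on the whole sequence, which is why the order in which you run your code matters as much as the seed itself.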

Great! Now we need to check whether the mock dataset works with the minimal code Mickey created earlier. Does it run? Does it reproduce the problem they were having?

R

# Minimal code, adapted to the mock dataset
sample_subset <- sample_data %>% # replace rodents with our sample dataset
  filter(species == c("a", "b"), # replace species with those from our sample dataset
         sex == c("F", "M")) # this can stay the same because we recreated it the same

table(sample_subset$sex, sample_subset$species)

OUTPUT


    a b
  F 1 0
  M 0 1

It works! The sample has unexpectedly been reduced to just 2 observations, when we would have expected 4 based on the sample_data output above (rows 3, 4, 5, and 8 have species a or b and sex F or M). Wherever the issue may lie, we were able to successfully replicate it in this minimal reproducible example.

Exercise 5: Your turn!

Now practice doing it yourself. Create a data frame with:

A. One categorical variable with 2 levels and one continuous variable.

B. One continuous variable that is normally distributed.

C. Name, sex, age, and treatment type.

3.6 Using the original data set

Even once you master the art of creating mock datasets, there may be occasions when your data or problem is too complex and you can’t seem to replicate the issue. Or maybe you still think using your original data would just be easier.

In cases when you want to make your own data minimal and reproducible, you will want to take a similar approach to what we did in Episode 3 when making the code minimal. Keep what is essential, get rid of the rest. In other words, we will want to subset our data into a smaller, more digestible chunk.

The question still arises: how do I know what is essential?

Use the same guiding questions that we used earlier!

Which variables are necessary to replicate the problem?
What data type (discrete or continuous) is each variable? (perhaps less necessary, since you are keeping the original variables)
How many levels and/or observations are necessary? (we don’t want to get rid of more than we need)
Do the values need to be distributed in a specific way? (worth keeping in mind in terms of how we are removing observations)
Are there any NAs that could be relevant?

Based on our previous answers we end up with:

We need species, sex, and maybe record_id
Species and sex are categorical, record_id is a continuous count of our observations.
As we said earlier, we want at least 3 levels each for species and sex, which is already the case. And we could reduce the number of observations (record_id values) to ~10.
Not really, but we want to make sure that when we reduce the number of observations we still have observations in each of the 3 levels in species and sex.
Missing values are present in the sex variable (as blanks), so let’s make sure we keep at least one.

Now that we have a clearer goal, let’s subset the data.

Useful functions for subsetting a dataset include subset(), head(), tail(), and indexing with [] (e.g., iris[1:4,]). Alternatively, you can use tidyverse functions like select() and filter(). You can also use the same sample() function we covered earlier.
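Here is a quick sketch of the base R subsetting functions listed above, using the built-in iris dataset so that anyone can run it. The object names and the Sepal.Length threshold are arbitrary and only for illustration.

```r
# iris ships with R, so no data file is needed
first_rows <- head(iris, 4)                             # first 4 rows
last_rows  <- tail(iris, 4)                             # last 4 rows
by_index   <- iris[1:4, c("Species", "Sepal.Length")]   # rows 1-4, two columns

# subset() keeps rows matching a condition and, optionally, only some columns
by_condition <- subset(iris,
                       Species == "setosa" & Sepal.Length > 5.5,
                       select = c(Species, Sepal.Length))

by_condition
```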

Note: you should already have an understanding of how to subset or wrangle data using the tidyverse from the R for Ecology lesson. If not, go check it out! [insert link to lesson]

R

# Mickey's minimal code

library(dplyr)
library(ggplot2)

rodents <- read.csv('data/surveys_complete_77_89.csv')

rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(rodents_subset$sex, rodents_subset$species)

OUTPUT


    ordii spectabilis
  F   333           0
  M     0         610

R

table(rodents$sex, rodents$species)

OUTPUT


    albigula audubonii bilineata brunneicapillus chlorurus clarki eremicus
          62        69       223              23        11      1       14
  F      474         0         0               0         0      0      372
  M      368         0         0               0         0      0      468

    flavus fulvescens fulviventer fuscus gramineus harrisi hispidus leucogaster
        15          0           0      2         3     136        2          16
  F    222         46           3      0         0       0       68         373
  M    302         16           2      0         0       0       42         397

    leucophrys maniculatus megalotis melanocorys merriami ordii penicillatus
             2           9        33          13       45     3            6
  F          0         160       637           0     2522   690          221
  M          0         248       680           0     3108   792          155

    scutalatus  sp. spectabilis spilosoma squamata taylori torridus viridis
             1   18          42       149       16       0       28       1
  F          0    4        1135         1        0       0      390       0
  M          0    5        1232         1        0       3      441       0

Given that the code that is going wrong is that which creates rodents_subset, we need to create a minimal reproducible version of rodents! We can then insert our new_rodents dataset in place of the original rodents one.

Step 1: select the variables of interest

R

# subset rodents into new_rodents to make it minimal
# Note: there are many ways you could do this!
new_rodents <- rodents %>% 
  # 1. select the variables of interest
  select(record_id, species, sex)
  # PAUSE. Does this work so far?
new_rodents

OUTPUT

   record_id      species sex
1          1     albigula   M
2          2     albigula   M
3          3     merriami   F
4          4     merriami   M
5          5     merriami   M
6          6       flavus   M
7          7     eremicus   F
8          8     merriami   M
9          9     merriami   F
10        10       flavus   F
11        11  spectabilis   F
12        12     merriami   M
13        13     merriami   M
14        14     merriami
15        15     merriami   F
16        16     merriami   F
17        17  spectabilis   F
18        18 penicillatus   M
19        19       flavus
20        20  spectabilis   F
21        21     merriami   F
22        22     albigula   F
23        23     merriami   M
24        24     hispidus   M
25        25     merriami   M
26        26     merriami   M
27        27     merriami   M
28        28     merriami   M
29        29 penicillatus   M
30        30  spectabilis   F
31        31     merriami   F
32        32     merriami   F
33        33     merriami   F
 [ reached 'max' / getOption("max.print") -- omitted 16845 rows ]

Steps 2-5: reduce the number of observations to ~10 while making sure the dataset still contains at least 3 species and at least 3 sexes

While the rest is just one step, it is the trickiest, because this is where we want to ensure the key elements of our original dataset, as defined earlier, are preserved.

Exercise 6: Your Turn!

How would you continue the subsetting pipeline? How could you reduce the number of observations while making sure you still have at least 3 species and 3 sexes left? Hint: there is no single right answer! Trial and error works wonders.

R

set.seed(1)
new_rodents <- rodents %>% 
  # 1. select the variables of interest
  select(record_id, species, sex) %>%
  slice_sample(n=4, replace = F, by='sex')
new_rodents

OUTPUT

   record_id     species sex
1       2359    merriami   M
2      16335    albigula   M
3       9910       ordii   M
4       8278       ordii   M
5      12038    merriami   F
6       7862   megalotis   F
7       9221    albigula   F
8       1335 spectabilis   F
9       3320 melanocorys
10       343      flavus
11     14482        <NA>
12      9376   spilosoma

The code ran without issues, yay! But do we end up with what we were looking for?

Do we have ~10 observations? Yes! 12 seems good enough
Do we have at least 3 species? Yes! We have 8 (we could choose to reduce this further)
Do we have at least 3 sexes? Yes! M, F, and blank

Great! All of our requirements are fulfilled. Now let’s see if it replicates our issue when we add it to our minimal code.

Note: slice_sample() and related functions such as slice_min() and slice_max() allow you to customize how the sample is taken (check the documentation!). For example, you can select a proportion of rows instead of a fixed number, order rows by a variable, decide whether tied values should be kept together, or weight the sampling by a variable. All of this allows you to keep aspects of your dataset that may be relevant and hard to replicate otherwise.
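The same idea of drawing a fixed number of rows from each group can also be sketched in base R, which may help if you want to see what is happening step by step. This is not how slice_sample() is implemented; it is just one possible approach, and the toy data and object names are made up.

```r
set.seed(1)  # arbitrary seed so the same rows are drawn each time

# Toy data with three groups, one of them blank (values are made up)
toy <- data.frame(
  id  = 1:30,
  sex = rep(c("M", "F", ""), each = 10)
)

# Split the row numbers by group, draw 2 row numbers per group, keep those rows
rows_by_group <- split(seq_len(nrow(toy)), toy$sex)
picked_rows   <- unlist(lapply(rows_by_group, function(rows) {
  rows[sample.int(length(rows), 2)]
}))
toy_small <- toy[sort(picked_rows), ]

table(toy_small$sex)
```

Sampling within each group guarantees that every level, including the blank one, survives the reduction.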

Remember the minimal code:

R

rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

table(rodents_subset$sex, rodents_subset$species)

OUTPUT


    ordii spectabilis
  F   333           0
  M     0         610

R

table(rodents$sex, rodents$species)

OUTPUT


    albigula audubonii bilineata brunneicapillus chlorurus clarki eremicus
          62        69       223              23        11      1       14
  F      474         0         0               0         0      0      372
  M      368         0         0               0         0      0      468

    flavus fulvescens fulviventer fuscus gramineus harrisi hispidus leucogaster
        15          0           0      2         3     136        2          16
  F    222         46           3      0         0       0       68         373
  M    302         16           2      0         0       0       42         397

    leucophrys maniculatus megalotis melanocorys merriami ordii penicillatus
             2           9        33          13       45     3            6
  F          0         160       637           0     2522   690          221
  M          0         248       680           0     3108   792          155

    scutalatus  sp. spectabilis spilosoma squamata taylori torridus viridis
             1   18          42       149       16       0       28       1
  F          0    4        1135         1        0       0      390       0
  M          0    5        1232         1        0       3      441       0

We now want to replace rodents with our new_rodents. Do we need to change anything else?

We actually still have ordii and spectabilis as species in our list, so we can keep it as is. Same for sex. So we’re all set!

R

new_subset <- new_rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))

The code ran without any issues! But does it replicate our issue?

Take a step back to remind yourself of what you are looking for. What was the issue we had identified?

The number of rows we end up with after the filter is lower than expected.

So what would we expect to see with this new dataset? Since it is nice and short, this makes it a lot easier to predict the outcome.

We are asking for the 2 ordii rows, both males, and the 1 spectabilis row, which is female.

R

table(new_subset$sex, new_subset$species)

OUTPUT

< table of extent 0 x 0 >

Instead we end up with nothing! Why aren’t we getting the rows we are asking for?

Maybe our table is just wrong. Let’s look at the actual dataset we end up with:

R

new_subset

OUTPUT

[1] record_id species   sex
<0 rows> (or 0-length row.names)

Still nothing! What is going on?? We don’t have an answer, but we certainly replicated a problem that occurs when we filter the data. Time to ask for help!

But wait, our dataset is now minimal and relevant, but is it reproducible (accessible outside your device)? Not yet. We created a subset of our original dataset rodents, but this came from a file on our computer. We could share our csv file and add code to load it… but that’s not ideal, and it makes it hard to share our problem on a community site. Remember, the more steps required, the less likely someone will want to help.

Thankfully, there is a nifty function dput() that can help us out. Let’s try it and see what happens.

R

dput(new_rodents)

OUTPUT

structure(list(record_id = c(2359L, 16335L, 9910L, 8278L, 12038L,
7862L, 9221L, 1335L, 3320L, 343L, 14482L, 9376L), species = c("merriami",
"albigula", "ordii", "ordii", "merriami", "megalotis", "albigula",
"spectabilis", "melanocorys", "flavus", NA, "spilosoma"), sex = c("M",
"M", "M", "M", "F", "F", "F", "F", "", "", "", "")), class = "data.frame", row.names = c(NA,
-12L))

It spit out a hard-to-read but not excessively long chunk of code. This code, when run, will recreate our new_rodents dataset! We can also break it down and label it further to help the reader.
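To see that the text printed by dput() really is runnable code, you can capture it and evaluate it. The sketch below does this round trip on a made-up data frame (toy) shaped like new_rodents; capturing and parsing the text is only for demonstration, since in practice you would simply copy and paste the output.

```r
# Toy data frame with a blank and an NA, similar in shape to new_rodents (values are made up)
toy <- data.frame(
  record_id = c(11L, 42L, 7L),
  species   = c("ordii", "spectabilis", NA),
  sex       = c("M", "F", "")
)

# Capture the text that dput() would print to the console
dput_text <- paste(capture.output(dput(toy)), collapse = "\n")

# Running that text as code rebuilds the data frame
rebuilt <- eval(parse(text = dput_text))

all.equal(toy, rebuilt)
```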

R

reprex_data <- structure(list(
  
# a unique identifier
record_id = c(2359L, 16335L, 9910L, 8278L, 12038L, 7862L, 9221L, 1335L, 3320L, 343L, 14482L, 9376L), 

# a list of species. Note: one value is NA
species = c("merriami", "albigula", "ordii", "ordii", "merriami", "megalotis", "albigula", "spectabilis", "melanocorys", "flavus", NA, "spilosoma"), 

# a list of sexes. Note: this includes some blanks!
sex = c("M", "M", "M", "M", "F", "F", "F", "F", "", "", "", "")),

class = "data.frame", row.names = c(NA, -12L))

print(reprex_data)

OUTPUT

   record_id     species sex
1       2359    merriami   M
2      16335    albigula   M
3       9910       ordii   M
4       8278       ordii   M
5      12038    merriami   F
6       7862   megalotis   F
7       9221    albigula   F
8       1335 spectabilis   F
9       3320 melanocorys
10       343      flavus
11     14482        <NA>
12      9376   spilosoma

Ta-da! Now they can easily recreate our minimal dataset and use it to run the minimal code. However, was that really easier than creating a dataset from scratch?

And sure, you could just use dput() on your original dataset. It would work. But that wouldn’t be very considerate to those who are trying to help. Try running dput(rodents) in your script.

It becomes a huge chunk of code! When clearly we don’t need all of that.
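To get a feel for the difference without flooding your console, you can count how many lines dput() prints for a full dataset versus a small subset. This sketch uses the built-in iris dataset as a stand-in for rodents.

```r
# Number of lines dput() prints for a full built-in dataset vs. a small subset
lines_full  <- length(capture.output(dput(iris)))
lines_small <- length(capture.output(dput(head(iris, 5))))

c(full = lines_full, small = lines_small)
```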

Remember, we want to keep everything minimal for many reasons:

to make it easy for our helpers to understand our data and code
to allow helpers to quickly focus their efforts on the right factors
to make the problem-solving process as easy and painless as possible
bonus: to help us better understand and zero-in on the source of our issue, often stumbling upon a solution along the way

Nevertheless, it remains an option for when your data appears too complex or you are not quite sure where your problem lies and therefore are not sure what minimal components are needed to reproduce the example. In other words, when you don’t have a good mental model of what the problem is even after going through the initial steps we outlined earlier in the lesson.

3.7 Using an R built-in dataset

The last approach we mentioned is to build a minimal reproducible dataset based on the datasets that already exist within R (and therefore everyone would have access to).

A list of readily available datasets can be found using library(help="datasets"). You can then use ? in front of the dataset name to get more information about the contents of the dataset.
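If you would rather browse the list as an object than in a help window, data() can return it. The sketch below pulls the dataset names and titles into a matrix; the object names listing and available are just illustrative.

```r
# data() with a package argument returns a listing; its $results matrix
# has one row per dataset, including "Item" (name) and "Title" columns
listing   <- data(package = "datasets")$results
available <- listing[, c("Item", "Title")]

head(available)

# Quick look at one dataset's structure before reading its help page
str(HairEyeColor)
```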

For a more detailed discussion of the benefits of using this approach see [insert something]

This approach essentially blends the skills we already learned in the first two. We need to identify a dataset with appropriate variables that match the “key elements” of our original dataset. We then need to further reduce that dataset to a minimal, relevant number of rows. Once again, we can use the previously learned functions such as select(), filter(), or sample().

Since we already practiced everything you need, why not try it yourself?

Exercise 8: Extra Challenge

Using the “HairEyeColor” dataset, create a minimal reproducible dataset for the same issue and minimal code we have been exploring.

1. Start by using ?HairEyeColor to read a description of the dataset and View(HairEyeColor) to see the actual dataset.
2. Which variables would be a good match for our situation? What are our requirements?
3. How can we subset this dataset to make it minimal and still replicate our issue?

Show me the solution

Remember, there are many possible solutions! The most important feature is that the example dataset can replicate the issue when used within our minimal code.

The following is 1 possible solution:

We selected Hair and Eye as replacements for species and sex because they are both categorical and have at least 3 levels. We don’t strictly need anything else. We will call our new rodents replacement hyc. We set a seed because we want a random sample.

R

set.seed(1)

# the dummy dataset
hyc <- as.data.frame(HairEyeColor) %>% # HairEyeColor is stored as a table, so convert it to a data frame first
  select(Hair, Eye) %>%
  slice_sample(n=10)
print(hyc)

OUTPUT

    Hair   Eye
1  Black Hazel
2  Blond Brown
3    Red  Blue
4  Black Brown
5  Brown Brown
6    Red  Blue
7    Red Hazel
8  Brown Green
9  Brown Brown
10   Red Brown

R

# the minimal code
hyc_subset <- hyc %>%
  filter(Hair == c('Red','Blond'),
         Eye == c('Blue', 'Brown'))

# illustrating the issue
table(hyc_subset$Hair, hyc_subset$Eye)

OUTPUT


        Brown Blue Hazel Green
  Black     0    0     0     0
  Brown     0    0     0     0
  Red       0    1     0     0
  Blond     1    0     0     0

R

# the whole subset
print(hyc_subset)

OUTPUT

   Hair   Eye
1 Blond Brown
2   Red  Blue

R

# But we know there are more!
table(hyc$Hair, hyc$Eye) # Reds have 2 Blue and 1 Brown, and Blonds have 1 Brown!

OUTPUT


        Brown Blue Hazel Green
  Black     1    0     1     0
  Brown     2    0     0     1
  Red       1    2     1     0
  Blond     1    0     0     0

Callout

What about NAs? If your data has NAs and they may be causing the problem, it is important to include them in your example dataset. You can find where there are NAs in your dataset by using is.na(), for example: is.na(krats$weight). This returns a logical vector that is TRUE wherever the value is NA and FALSE otherwise. The simplest way to include NAs in your dummy dataset is to type them directly into vectors: x <- c(1,2,3,NA). You can also subset a dataset that already contains NAs, change some of the values to NAs using mutate() with ifelse(), or replace all the values in a column by sampling from a vector that contains NAs.
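The three ways of adding NAs described in this callout can be sketched in base R as follows. The data frame, column names, and the cutoff of 100 are made up for illustration; with dplyr you would wrap the ifelse() call in mutate() instead of assigning with $.

```r
set.seed(3)  # arbitrary seed for the random sex column

# Option 1: type the NA directly into a vector
x <- c(1, 2, 3, NA)

# Option 2: turn selected values of an existing column into NA
toy <- data.frame(id = 1:6, weight = c(40, 42, 150, 39, 41, 160))
toy$weight_with_na <- ifelse(toy$weight > 100, NA, toy$weight)

# Option 3: sample from a vector that already contains NA
toy$sex <- sample(c("M", "F", NA), nrow(toy), replace = TRUE)

is.na(toy$weight_with_na)  # TRUE where the value is missing
colSums(is.na(toy))        # number of NAs in each column
```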

One important thing to note when subsetting a dataset with NAs is that subsetting methods that use a condition to match rows won’t necessarily match NA values in the way you expect. For example:

R

test <- data.frame(x = c(NA, NA, 3, 1), y = rnorm(4))
test %>% filter(x != 3)

OUTPUT

  x          y
1 1 -0.3053884

R

# you might expect that the NA values would be included, since “NA is not equal to 3”. But actually, the expression NA != 3 evaluates to NA, not TRUE. So the NA rows will be dropped!
# Instead you should use is.na() to match NAs
test %>% filter(x != 3 | is.na(x))

OUTPUT

   x          y
1 NA  0.4874291
2 NA  0.7383247
3  1 -0.3053884
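It can help to look at the comparison on its own, outside of filter(). The sketch below evaluates the same condition on a plain vector and shows two base R ways of handling the NA entries; the object names are just illustrative.

```r
x <- c(NA, NA, 3, 1)

# Comparing NA with anything gives NA, not TRUE or FALSE
comparison <- x != 3
comparison

# which() returns only the positions where the condition is TRUE, so NA entries are dropped
kept_with_which <- x[which(x != 3)]

# Adding is.na() to the condition explicitly keeps the NA entries
kept_with_is_na <- x[x != 3 | is.na(x)]
```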

Key Points

A minimal reproducible dataset contains (a) the minimum number of lines, variables, and categories, in the correct format, to replicate your issue; and (b) it must be fully reproducible, meaning that someone else can access or run the same code to reproduce the dataset needed for your reprex.
To make it accessible, you can create a dataset from scratch using data.frame(), you can use a built-in R dataset like cars, or you can use a subset of your own dataset and then use dput() to generate reproducible code.

Bonus: Additional Practice

Here are some more practice exercises if you wish to test your knowledge

Extra Practice

For each of the following, identify which data are necessary to create a minimal reproducible dataset using mpg (note that mpg comes from the {ggplot2} package, so ggplot2 must be loaded first).

We want to know how the highway mpg has changed over the years
We need a list of all “types” of cars and their fuel type for each manufacturer
We want to compare the average city mpg for a compact car from each manufacturer


Another Exercise

Each of the following examples needs your help to create a dataset that will correctly reproduce the given result and/or warning message when the code is run. Fix the dataset shown or fill in the blanks so it reproduces the problem.

set.seed(1); sample_data <- data.frame(fruit = rep(c("apple", "banana"), 6), weight = rnorm(12)); ggplot(sample_data, aes(x = fruit, y = weight)) + geom_boxplot() [image of the expected plot missing]
bodyweight <- c(12, 13, 14, ___, ___); max(bodyweight) returns [1] NA
sample_data <- data.frame(x = 1:3, y = 4:6); mean(sample_data$x) returns [1] NA with the warning: In mean.default(sample_data$x): argument is not numeric or logical: returning NA
sample_data <- ____; dim(sample_data) returns NULL

Show me the solution

“fruit” needs to be a factor and the order of the levels must be specified: sample_data <- data.frame(fruit = factor(rep(c("apple", "banana"), 6), levels = c("banana", "apple")), weight = rnorm(12))
one of the blanks must be an NA
?? + what’s really the point of this one?
sample_data <- data.frame(x = factor(1:3), y = 4:6)

Great work! We’ve created a minimal reproducible example. In the next episode, we’ll learn about reprex, a package that can help us double-check that our example is reproducible by running it in a clean environment. (As an added bonus, reprex will format our example nicely so it’s easy to post to places like Slack, GitHub, and StackOverflow.)

Content from Asking your question

Last updated on 2025-07-22

Overview

Questions

How can I verify that my example is reproducible?
How can I easily share a reproducible example with a mentor or helper, or online?
How do I ask a good question?

Objectives

Use the reprex package to test whether an example is reproducible.
Use the reprex package to format reprexes for posting online.
Understand the benefits and drawbacks of different help forums.
Have a road map to follow when posting a question to make sure it’s a good question.
Understand what the {reprex} package does and doesn’t do.

Congratulations on finishing your reprex! In this episode, we will introduce a tool, the reprex package. This package will help you check that your example is truly reproducible and format it nicely to make it easy to present to a helper, either in person or online.

There are three principles to remember when you think about sharing your reprex with other people: Reproducibility, formatting, and context.

1. Reproducibility

Haven’t we already talked a lot about reproducibility?

Yes! We have discussed variables and packages, minimal datasets, and making sure that the problem is meaningfully reproduced by the data that you choose. But there are some reasons that a code snippet that appears reproducible in your own R session might not actually be runnable by someone else.

You forgot to account for the origins of some functions and/or variables. We went through our code methodically, but what if we missed something? It would be nice to confirm that the code is as self-contained as we thought it was.
Your code accidentally relies on objects in your R environment that won’t exist for other people. For example, imagine you defined a function my_awesome_custom_function() in a project-specific functions.R script, and your code calls that function.

A function called "my_awesome_custom_function" is lurking in my R environment. I must have defined it a while ago and forgotten! Code that includes this function will not run for someone else unless the function definition is also included in the reprex.

R

my_awesome_custom_function("the kangaroo rat dataset")

OUTPUT

[1] "the kangaroo rat dataset is awesome!"

I might conclude that this code is reproducible. After all, it works when I run it! But unless I include the function definition in the reprex itself, nobody else will be able to run the code.

A corrected reprex would look like this:

R

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")

OUTPUT

[1] "the kangaroo rat dataset is awesome!"

Your code depends on some particular characteristic of your R or RStudio environment that is not the same as your helper’s environment. [more details here]
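One rough way to catch the first two problems by hand is to list every function a snippet calls and see which ones do not come from base R. The sketch below uses findGlobals() from the codetools package (a recommended package that ships with most R installations). The wrapper function snippet() and the toupper() call are made up for illustration, and this is only a quick check, not a substitute for running the code in a clean session.

```r
# Wrap the code we plan to share in a function so it can be inspected
# without being run. (snippet is a hypothetical name.)
snippet <- function() {
  result <- my_awesome_custom_function("the kangaroo rat dataset")
  toupper(result)
}

# Names of every function the snippet calls
called <- codetools::findGlobals(snippet, merge = FALSE)$functions

# Keep only the functions that are NOT defined in base R
in_base <- vapply(called, exists, logical(1), envir = baseenv(), inherits = FALSE)
needs_attention <- called[!in_base]
needs_attention
```

Anything listed in needs_attention needs either a definition or a library() call inside the reprex. Note that this simple check would also flag functions from other default packages such as stats, so treat the list as a prompt to double-check rather than a verdict.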

There are so many components to remember when thinking about reproducibility, especially for more complex problems. Wouldn’t it be nice if we had a way to double check our examples? Luckily, the reprex package will help you test your reprexes in a clean, isolated environment to make sure they’re actually reproducible.

The most important function in the reprex package is called reprex(). Here’s how to use it.

First, install and load the reprex package.

R

#install.packages("reprex")
library(reprex)

Second, write some code. This is your reproducible example.

R

(y <- 1:4)

OUTPUT

[1] 1 2 3 4

R

mean(y)

OUTPUT

[1] 2.5

Third, highlight that code and copy it to your clipboard (e.g. Cmd + C on Mac, or Ctrl + C on Windows).

Finally, type reprex() into your console.

# (with the target code snippet copied to your clipboard already...)
# In the console:
reprex()

reprex will grab the code that you copied to your clipboard and run that code in an isolated environment. It will return a nicely formatted reproducible example that includes your code and any results, plots, warnings, or errors generated.

The generated output will be on your computer’s clipboard by default. Then, you can paste it into GitHub, StackOverflow, Slack, or another venue.

Callout: The `reprex` package workflow

The reprex package workflow takes some getting used to. Instead of copying your code into the function, you simply copy it to the clipboard (a mysterious, invisible place to most of us) and then let the blank, empty reprex() function go over to the clipboard by itself and find it.

And then the completed, rendered reprex replaces the original code on the clipboard and all you need to do is paste, not copy and paste.
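If the clipboard step feels too magical, reprex() can also take the code directly. The sketch below passes the same two lines as an expression inside curly braces. It assumes the reprex package is installed and that pandoc is available (RStudio bundles pandoc); the first lines simply skip the example otherwise. The names can_render and rendered are placeholders.

```r
# Skip the example on machines without reprex or pandoc
can_render <- requireNamespace("reprex", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # Pass the code as an expression instead of copying it to the clipboard
  rendered <- reprex::reprex({
    (y <- 1:4)
    mean(y)
  }, html_preview = FALSE)

  # reprex() invisibly returns the rendered lines, so we can print them
  cat(rendered, sep = "\n")
}
```

In an interactive session the rendered result still lands on the clipboard, so the paste step works the same way as before.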

Let’s practice this one more time. Here’s some very simple code:

R

library(ggplot2)
library(dplyr)
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

Let’s highlight the code snippet, copy it to the clipboard, and then run reprex() in the console.

# In the console:
reprex()

The result, which was automatically placed onto my clipboard and which I pasted here, looks like this:

R

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

Created on 2024-12-29 with reprex v2.1.1

Nice and neat! It even includes the plot produced, so I don’t have to take screenshots and figure out how to attach them to an email or something.

The formatting is great, but reprex really shines when you treat it as a helpful collaborator in your process of building a reproducible example (including all dependencies, providing minimal data, etc.).

Let’s see what happens if we forget to include library(ggplot2) in our small reprex above.

R

library(dplyr)
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

As before, let’s copy that code to the clipboard, run reprex() in the console, and paste the result here.

# In the console:
reprex()

R

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()
#> Error in ggplot(., aes(x = factor(cyl), y = displ)): could not find function "ggplot"

Created on 2024-12-29 with reprex v2.1.1

Now we get an error message indicating that R cannot find the function ggplot! That’s because we forgot to load the ggplot2 package in the reprex.

This happened even though we had ggplot2 already loaded in our own current RStudio session. reprex deliberately ignores any packages already loaded, running the code in a clean, isolated R session that’s different from the R session we’ve been working in. This simulates the experience of someone else trying to run your reprex on their own computer.

Let’s return to our previous example with the custom function.

R

my_awesome_custom_function("the kangaroo rat dataset")

OUTPUT

[1] "the kangaroo rat dataset is awesome!"

# In the console:
reprex()

R

my_awesome_custom_function("the kangaroo rat dataset")
#> Error in my_awesome_custom_function("the kangaroo rat dataset"): could not find function "my_awesome_custom_function"

Created on 2024-12-29 with reprex v2.1.1

By contrast, if we include the function definition:

R

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")

OUTPUT

[1] "the kangaroo rat dataset is awesome!"

# In the console:
reprex()

R

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")
#> [1] "the kangaroo rat dataset is awesome!"

Created on 2024-12-29 with reprex v2.1.1
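To get a feel for why the version without the function definition fails, here is a rough base R sketch of the "clean session" idea: the same snippet is evaluated once in the current session and once in a brand-new R process started with Rscript. This is not how reprex is implemented, just an illustration of the principle; the names snippet, script, here, and fresh_output are placeholders.

```r
# Define a helper in the current session only
my_awesome_custom_function <- function(x) paste0(x, " is awesome!")

# The code we want to share, as text. Note that it does NOT define the helper.
snippet <- 'my_awesome_custom_function("the kangaroo rat dataset")'

# In this session the helper exists, so the snippet runs fine
here <- eval(parse(text = snippet))
here

# Now run the same snippet in a fresh R process that knows nothing
# about the objects in our session
script <- tempfile(fileext = ".R")
writeLines(snippet, script)
rscript <- file.path(R.home("bin"), "Rscript")
fresh_output <- suppressWarnings(
  system2(rscript, c("--vanilla", shQuote(script)), stdout = TRUE, stderr = TRUE)
)

# The fresh process reports an error because the helper is missing there
fresh_output
```

The snippet succeeds where the helper is defined and fails in the fresh process, which is the same contrast reprex() surfaced above.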

Testing it out

Now that we’ve met our new reprex-making collaborator, let’s use it to test out the reproducible example we created in the previous episode.

Here’s the code we wrote:

R

# Mickey's reprex script
# XXX THIS IS NOT FINISHED--NEED TO INSERT FINAL DATA EXAMPLE!

# Load necessary packages to run the code
library(ggplot2)
library(dplyr) # provides the %>% pipe used below

rodents_subset %>% # XXX replace with simulated dataset
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?

# Investigate
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)

Time to find out if our example is actually reproducible! Let’s copy it to the clipboard and run reprex(). Since we want to give Jordan a runnable R script, we can use venue = "r".

# In the console:
reprex(venue = "r")

It worked!

R

#replace with final output

Now we have a beautifully-formatted reprex that includes runnable code and all the context needed to reproduce the problem.

Callout: Including information about your R session

Another nice thing about reprex is that you can choose to include information about your R session, in case your error has something to do with your R settings rather than the code itself. You can do that using the session_info argument to reprex().

For example, try running the following reprex, setting session_info = TRUE, and observe what happens.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

# In the console:
reprex(session_info = TRUE)
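The exact session details that reprex appends are not reproduced here, but base R's sessionInfo() reports a similar kind of information. A small sketch of the pieces that most often matter when code behaves differently on two machines (the name info is a placeholder):

```r
info <- sessionInfo()

# R version and operating system
info$R.version$version.string
info$running

# Packages attached with library(), beyond the ones R always loads
# (NULL if none are attached)
names(info$otherPkgs)
```

If a helper cannot reproduce your problem even with a working reprex, differences in these values are a good place to start looking.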

2. Formatting

The output of reprex() is markdown, which can easily be copied and pasted into many sites/apps. However, different places have slightly different formatting conventions for markdown. reprex lets you customize the output of your reprex according to where you’re planning to post it.

The default, venue = "gh", gives you “GitHub-Flavored Markdown”, which is a particular type of markdown that works well when posted on GitHub. Another format you might want is “r”, which gives you a runnable R script, with commented output interleaved with pieces of code.

Check out the formatting options in the help file with ?reprex, and try out a few depending on the destination of your reprex!
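As a quick comparison of two venues, the sketch below renders the same two lines of code as GitHub-flavored markdown and as a runnable R script. It supplies the code through the input argument rather than the clipboard, and it assumes reprex is installed and pandoc is available; the guard at the top skips the example otherwise. The names code, as_markdown, and as_script are placeholders.

```r
# Skip the example on machines without reprex or pandoc
can_render <- requireNamespace("reprex", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  code <- c("(y <- 1:4)", "mean(y)")

  # Default venue: markdown, with the code inside a fenced block
  as_markdown <- reprex::reprex(input = code, venue = "gh", html_preview = FALSE)

  # "r" venue: a runnable R script with the output included as comments
  as_script <- reprex::reprex(input = code, venue = "r", html_preview = FALSE)

  cat(as_markdown, sep = "\n")
  cat(as_script, sep = "\n")
}
```

The markdown version is the one to paste into a GitHub issue, while the script version can be saved as a .R file and sent to a helper.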

Callout: `reprex` can’t do everything for you

People often mention reprex as a useful tool for creating reproducible examples, but it can’t do the work of crafting the example for you! The package doesn’t locate the problem, pare down the code, create a minimal dataset, or automatically include package dependencies.

A better way to think of reprex is as a tool to check your work as you go through the process of creating a reproducible example, and to help you polish up the result.

3. Context

The final thing to consider when preparing your reproducible example is adding some context so that helpers know a little about your problem and what you’re trying to achieve.

Some context to include:

1. Tell us a little bit about your problem. One sentence should be enough. What domain are you working in? What are these data about? What do the relevant variables mean?

This is particularly important if you have provided a subset of your own data instead of creating a minimal dataset from scratch. Your helper will need to interpret the column names and understand what type of data they are looking at.

2. Explain what you expected to happen, or what you were trying to achieve, and how it is different from what happened instead.

The contrast between what happened and what was supposed to happen is particularly important for semantic errors, in which the “error” is not always obvious when running the code. The code ran, but you have decided that the output is “wrong” somehow, or that it “didn’t work”. Why? How do you know that? Your helper needs to know that what you got was not what you expected, and they need to know what you expected in order to help you achieve that outcome.

For example, let’s say you made the following plot:

R

rodents %>%
  ggplot(aes(x = plot_type, y = hindfoot_length, color = plot_type))+
  geom_boxplot()

WARNING

Warning: Removed 2003 rows containing non-finite outside the scale range
(`stat_boxplot()`).

This plot doesn’t look the way you want it to look, and you’re not sure why, so you decide to make a reprex. You load the required packages (ggplot2 and dplyr), and you substitute an existing dataset, mpg (included with ggplot2), for rodents, which you know your helpers won’t have access to. Your reprex looks like this:

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

It’s minimal! It’s reproducible! But… what is the problem? This is a perfectly reasonable plot, so without context, your helper won’t know what’s wrong. Let’s explain it to them.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

R

# I want to make a boxplot where each of the categories has its own color. But even though I set color = class here, only the outlines of the boxplots got colored in, and the inside is still white. How do I change this so that the whole box is colored in?
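The same pattern works when the surprising result is a value rather than a plot. Here is a small made-up example (the weights vector is hypothetical, not taken from the rodent data) in which the comments spell out what was run, what happened, and what was expected:

```r
# Hypothetical minimal data standing in for a column with missing values
weights <- c(40, 48, NA, 52)

# What I ran:
avg <- mean(weights)
avg

# What happened: the result is NA.
# What I expected: the average of the three recorded weights.
# My question: how do I compute the mean while skipping the missing value?
```

With those three comments, a helper can see right away that the code ran without error and that the issue is the unexpected NA.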

Exercise 1: What makes a good description?

For each of the following reprexes, improve the description given.

a. I’m trying to plot the displacements of different cars. I made this boxplot, but the boxes are showing up in the wrong order. How do I fix this? Here is my minimal reproducible example.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

b. I’m working with this data about cars. The class column refers to the type of car–for example, “compact” class means that the car is quite small, while “pickup” would be a pickup truck. For each car, I also have information about the city and highway mileage, and the transmission, and the number of cylinders, as well as the displacement. This dataset has 234 rows and 11 columns, although this is an example dataset because my real dataset is much larger and has more like 500,000 rows. Anyway, in this example, I want to make a boxplot of displacement where each of the categories has its own color. But even though I set color = class here, only the outlines of the boxplots got colored in, and the inside is still white. How do I make the inside a different color? Here’s a reprex.

R

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class))+
  geom_boxplot()

c. Help, my code isn’t working! It says I have too many elements. I made a reprex so you can see the data and the error message. I hope that’s helpful. Thank you so much!

R

library(ggplot2)
table(mpg)

ERROR

Error in table(mpg): attempt to make a table with >= 2^31 elements

As we wrap up this lesson, let’s work on adding some context for Mickey’s reprex so they’ll be ready to send it to Remy or post it online.

Exercise 2: Adding context

Working with the person next to you, write a brief description of Mickey’s problem that they could include with their reprex when they post it online.

Make sure that the description gives a little bit of background, describes what Mickey was trying to achieve, and describes what happened instead.

When you’re done, compare notes between the groups and see if you can come up with a final reprex for Mickey!

Key Points

The reprex package makes it easy to format and share your reproducible examples.
The reprex package helps you test whether your reprex is reproducible, and also helps you prepare the reprex to share with others.
Following a certain set of steps will make your questions clearer and likelier to get answered.