Content from What is a reprex and why is it useful?
Last updated on 2025-05-27
Overview
Questions
- What steps can you take to solve problems in your code?
- What is a minimal reproducible example?
- Why are minimal reproducible examples important?
- What variables are included in the Portal Project dataset?
Objectives
- Understand the high-level process for getting unstuck in R.
- Define each key characteristic of a minimal reproducible example.
- Explain why minimal reproducible examples are central to getting help from others.
- Load in the rodent survey data and briefly explain its contents.
Mickey is an ecologist working with data from The Portal Project, a long-term research study of rodents in Portal, Arizona. Mickey has just started in a new lab at their university. They are interested in learning about rodent morphology. For now, they are learning about the dataset by doing some descriptive analyses and visualizations of the data.
Mickey starts by loading the data so they can begin to explore it. They also load the {tidyverse}, a set of packages that will be useful for wrangling and visualizing the data.
::: instructor note Loading the entire {tidyverse} here, rather than a few component packages, is an intentional over-complication so that we can teach learners to simplify their packages later. Learners should have {tidyverse} installed, as per the setup instructions.
Would it be better to have the surveys dataset as a downloaded file for them to load in, or does loading it from a url make sense? :::
R
library(tidyverse)
OUTPUT
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R
surveys <- read_csv("https://raw.githubusercontent.com/carpentries-incubator/R-help-reprexes/refs/heads/main/episodes/data/surveys_complete_77_89.csv")
OUTPUT
Rows: 16878 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Mickey has some past experience in R, but this project will require more data analysis than they have done before. Mickey attended a Carpentries workshop, “Data Analysis and Visualization in R for Ecologists,” and they feel comfortable with the fundamentals of coding in R. Still, they are a little nervous about starting this project.
Prerequisites and target audience
This workshop assumes some prior experience with working in R and RStudio. We will assume you’ve taken the equivalent of the Data Analysis and Visualization in R for Ecologists workshop and are comfortable with basic commands, and we won’t necessarily explain every line of code in detail.
If you’re much more experienced in R, this workshop is still for you! Even expert coders may not always know how to get unstuck. We hope this workshop will be useful to people with a variety of coding backgrounds.
Sometimes, Mickey’s code doesn’t work as expected and they go to their colleague, Remy, for help. Remy has spent many hours sitting with Mickey, helping to work through various errors. But soon, Remy will be starting a big project, and they won’t have as much time to help with debugging.
::: instructor note The following exercises are optional, but they can be useful for getting learners settled in. :::
To help Mickey get more comfortable troubleshooting their own code, Remy suggests some steps to follow the next time they get stuck. Remy calls this the “Road Map to Getting Unstuck in R.”

Remy explains that the road map includes two main phases. First, there is guidance about “code first aid.” This includes understanding types of errors, reading function documentation, investigating error messages, and running through code line by line to diagnose problems.
Sometimes, these first aid steps are not enough to solve your problem. One of the most frustrating parts of learning to code is getting stuck and not knowing what to do! Luckily, there are many people in the R and data science communities who are happy to help, as long as you give them the right information. But figuring out how to ask a good question can feel even harder than the original problem that got you stuck in the first place. That’s why the second part of Remy’s road map includes guidance on how to create a minimal reproducible example (also known as a “reprex”).
A minimal reproducible example is a piece of code that demonstrates the problem you are facing, includes all necessary information to show the problem but nothing extra, and will run easily on someone else’s computer.

Minimal reproducible examples are very important tools to get help when you’re stuck on a coding problem.
Stripping the code and data down to their simplest (minimal) parts makes it easy for a helper to zero in on what might be going wrong.
Making your example reproducible allows a helper to run your code on their own computer so they can “feel your pain” and understand what’s going wrong. Even experts often have to “tinker” with code in order to fix it. Providing a reprex makes that “tinkering” easy, which makes it more likely that a helper will take the time to assist you.
The process of making a minimal reproducible example often gives you insight into your own code. Often, you might end up solving the problem yourself, without even needing to ask for help.
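As a rough sketch of what such an example can look like, the snippet below reproduces a made-up problem (a mean that comes back as NA) in a handful of lines. The data frame `toy` and its values are illustrative assumptions, not part of the Portal Project data.

```r
# Minimal: a tiny, made-up data frame instead of the full dataset
# (hypothetical values; not taken from the Portal Project data)
toy <- data.frame(
  species_id = c("DM", "DM", "NL", "PF"),
  weight = c(40, NA, 180, 8)
)

# Reproducible: base R only, so it runs as-is in a fresh R session.
# The problem: the mean is NA because one weight is missing.
problem <- mean(toy$weight)
print(problem)

# What we hoped to see instead
expected <- mean(toy$weight, na.rm = TRUE)
print(expected)
```

A helper can paste these lines into their own console and see the same result, without needing any of our files or packages.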
Callout
The phenomenon of solving one’s own problem during the process of trying to explain it to someone else is often called “rubber duck debugging.” This is a reference to a story about programmers who would keep rubber ducks on their desks to explain coding problems to. Jenny Bryan refers to reprexes as “basically the rubber duck in disguise,” because they force you to explain your problem to someone else, often solving it in the process.
Jenny Bryan shares many other insights about reprexes in her 2018 talk “Help me help you: creating reproducible examples.”

Helpers
There are lots of people who might help you with your code: friends, colleagues, mentors, or total strangers online. In this lesson, we will use the term “helper” to refer to the person who is helping you to debug your code. Helpers are the target audience for your minimal reproducible example.
Remy emphasizes to Mickey that they are still happy to be a helper, but that since they won’t have as much time to devote to debugging in the future, following this road map first will make the helping process more efficient. Hopefully, it will also make Mickey into a more confident coder!
Before heading off to their own work, Remy also introduces Mickey to the dataset they’ve just loaded in.
R
glimpse(surveys)
OUTPUT
Rows: 16,878
Columns: 13
$ record_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ month <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ day <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16…
$ year <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, …
$ plot_id <dbl> 2, 3, 2, 7, 3, 1, 2, 1, 1, 6, 5, 7, 3, 8, 6, 4, 3, 2, …
$ species_id <chr> "NL", "NL", "DM", "DM", "DM", "PF", "PE", "DM", "DM", …
$ sex <chr> "M", "M", "F", "M", "M", "M", "F", "M", "F", "F", "F",…
$ hindfoot_length <dbl> 32, 33, 37, 36, 35, 14, NA, 37, 34, 20, 53, 38, 35, NA…
$ weight <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ genus <chr> "Neotoma", "Neotoma", "Dipodomys", "Dipodomys", "Dipod…
$ species <chr> "albigula", "albigula", "merriami", "merriami", "merri…
$ taxa <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Rod…
$ plot_type <chr> "Control", "Long-term Krat Exclosure", "Control", "Rod…
R
min(surveys$year)
OUTPUT
[1] 1977
R
max(surveys$year)
OUTPUT
[1] 1989
Remy explains that the dataset is made up of many individual rodent records (record_id). The date of each record is given by the month, day, and year columns.
The dataset includes data from a number of different study plots that had different treatments applied: plot IDs are given by the plot_id column, and the type of treatment is specified in plot_type.
There is information about the genus and species of each individual caught, as well as higher-level taxa information and a short-form species_id code.
For each individual caught, the field crew took weight, sex, and hindfoot_length measurements, although those measurements are sometimes missing.
The dataset contains 16,878 rodent observations ranging across years from 1977 through 1989.
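One quick way to check claims like these (the range of years, how often measurements are missing) is sketched below. It uses a small made-up data frame, `toy_surveys`, as a stand-in so that it runs without downloading anything. The same two calls should work on the real surveys object, although the numbers will of course differ.

```r
# Made-up stand-in with a few of the same column names as the surveys data
# (hypothetical values, for illustration only)
toy_surveys <- data.frame(
  record_id = 1:5,
  year = c(1977, 1977, 1978, 1989, 1989),
  weight = c(NA, 40, NA, 52, 48),
  hindfoot_length = c(32, NA, 37, 36, 35)
)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy_surveys))
print(missing_counts)

# First and last year in one call
year_range <- range(toy_surveys$year)
print(year_range)
```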
Callout
More information about the Portal Project and the surveys dataset is available at [LINK].
With an introduction to the dataset and a road map to guide them if they get stuck, Mickey feels ready to start coding!
Key Points
- kp1
- kp2
- kp3
Content from Identify the problem and make a plan
Last updated on 2025-05-27
Overview
Questions
- What do I do when I encounter an error?
- What do I do when my code outputs something I don’t expect?
- Why do errors and warnings appear in R?
- How can I find which areas of code are responsible for errors?
- How can I fix my code? What other options exist if I can’t fix it?
Objectives
After completing this episode, participants should be able to…
- Describe how the desired code output differs from the actual output
- Categorize an error message (e.g. syntax error, semantic errors, package-specific errors, etc.)
- Decode and describe what an error message is trying to communicate
- Identify specific lines and/or functions generating the error message
- Use R Documentation to look up function syntax and examples
- Quickly fix commonly-encountered R errors using the internet
- Identify when a problem is better suited for asking for further help, including online help and reprex
The first step we’ll cover is what to do when you encounter an error or other undesired output from your code. In this episode, we will cover the basics of identifying errors, fixing them where possible, and, where that is not possible, isolating the problem for others to look at. This is the first step in our road map for solving coding problems: recognizing when your code is doing something you don’t intend, and then identifying the problem well enough to either solve it yourself or describe it succinctly to a helper.
3.1 What do I do when I encounter an error message?
R will often let us know when a problem occurs by generating an error message that tells us why it was unable to run our code. In this lesson, we refer to this type of problem as a syntax error. When R cannot run a piece of code, it returns an error message and stops, rather than issuing a warning and carrying on with later lines. Errors can happen for many reasons, and deciphering what an error message means is not always as easy as we might hope. We can’t review every reason your code might generate an error, but we will teach you some tools for interpreting and resolving syntax errors yourself.
The accompanying error message attempts to tell you exactly how your code failed. For example, consider the following error message that occurs when we run this command in the R console:
R
ggplot(x = taxa) + geom_bar()
ERROR
Error: object 'taxa' not found
Though we know somewhere there is an object called taxa (it is actually a column of the dataset rodents), R is trying to communicate that it cannot find any such object in the local environment. Let’s try again, appropriately pointing ggplot to the rodents dataset and taxa column using the $ operator. For the sake of argument, let’s say we also remember that geom_bar expects an aesthetic (aes).
R
ggplot(aes(x = rodents$taxa)) + geom_bar()
ERROR
Error in `fortify()`:
! `data` must be a <data.frame>, or an object coercible by `fortify()`,
or a valid <data.frame>-like object coercible by `as.data.frame()`, not a
<uneval> object.
ℹ Did you accidentally pass `aes()` to the `data` argument?
Whoops! Here we see another error message. This time, R responds with a message that is perhaps even harder to interpret.
Let’s go over each part briefly. First, we see an error from a function called fortify, which we didn’t even call! Then, there’s a more helpful informational message: Did we accidentally pass aes() to the data argument? This does seem to relate to our function call, as we do pass aes into the ggplot function. But what is this “data” argument? A helpful starting place when attempting to decipher an error message is checking the documentation for the function which caused the error:
?ggplot
Here, a Help window pops up in RStudio which provides some more information. Skipping the general description at the top, we see that ggplot takes data as its first positional argument and mapping, which is where the aes call belongs, as its second. The “Arguments” section tells us that anything supplied as data that is not already a data frame will be converted to one by fortify. Now the picture is clear! We accidentally passed our mapping argument (telling ggplot how to map variables to the plot) into the position where it expected data in the form of a data frame. And if we scroll down to “Examples”, to “Pattern 1”, we can see exactly how ggplot expects these arguments in practice. Let’s amend our code:
R
ggplot(rodents, aes(x = taxa)) + geom_bar()

Here we see our desired plot.
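The mix-up between data and mapping comes down to how R matches unnamed arguments by position. The sketch below illustrates this with `describe_plot()`, a hypothetical stand-in written only for this example (it is not a ggplot2 function), which takes its arguments in the same order as described above.

```r
# Hypothetical stand-in for a function whose first argument is `data`
# and whose second is `mapping` (not part of ggplot2)
describe_plot <- function(data = NULL, mapping = NULL) {
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame, not an object of class ",
         class(data)[1])
  }
  paste0("columns: ", paste(names(data), collapse = ", "),
         "; mapping: ", mapping)
}

toy <- data.frame(taxa = c("Rodent", "Bird", "Rodent"))

# Unnamed arguments are matched by position, so "taxa" lands in `data`
positional_mistake <- tryCatch(
  describe_plot("taxa"),
  error = function(e) conditionMessage(e)
)
print(positional_mistake)

# Naming the arguments, or supplying them in the documented order, works
named <- describe_plot(mapping = "taxa", data = toy)
in_order <- describe_plot(toy, "taxa")
print(named)
```

Naming arguments explicitly is slightly more typing, but it removes any dependence on argument order.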
Stop no. 1 on our roadmap: Identifying the problem
Let’s pause here to highlight some patterns we’re starting to see in the course of fixing our code:
- Seeing a problem arise in our code (in this case, R is explicitly telling us it has a problem running it).
- Reading and interpreting the error message R gives us.
Other steps we might take then include:
- Acting on parts of the error we can understand, such as changing input to a function.
- Pulling up the R Documentation for that function, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
- Copying and pasting the error message into a search engine / generative LLM for more interpretable explanations.
- And, when all else fails, we can prepare our code into a reproducible example for expert help.
Whether these steps are new or familiar to you, we spell them out to make one point explicit: recognizing when a problem arises and attempting to interpret what is going wrong are essential to fixing it. This is true whether you fix the problem on your own or communicate it to an expert. The later steps we listed are attempts to address the problem immediately, and we’ll call them code first aid. These steps might fix the problem, give you greater insight into what the problem is (and how R is interpreting your code), or not be helpful at all.
In any case, we want to emphasize that these skills are essential to being a practiced coder who can seek help effectively. A whole checklist may seem excessive for examples this simple, but below we will see problems that are trickier to both recognize and interpret. In those cases, we’ll apply the same framework.
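When the plan is to paste an error message into a search engine, it can help to capture the message as text first. The following is a minimal sketch using base R’s `tryCatch()`. The failing call, `log("ten")`, is just a convenient made-up way to trigger an error.

```r
# Run a call that fails and keep the error message as a character string
# (log() of a character value is only a convenient way to trigger an error)
error_text <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)

# The message can now be printed, copied, or searched for
print(error_text)
```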
3.2 What do I do when my code outputs something I don’t expect?
Another type of problem you may encounter is when your R code runs without errors, but does not produce the desired output. You may sometimes see these called semantic errors. As with syntax errors, semantic errors may occur for a variety of non-intuitive reasons, and are often harder to solve as there is no description of the error – you must work out where your code is defective yourself!
Let’s go back to our rodent analysis. The next step in the plan is to subset to just the Rodent taxa (as opposed to other taxa: Bird, Rabbit, Reptile or NA). Let’s quickly check to see how much data we’d be throwing out by doing so:
R
table(rodents$taxa)
OUTPUT
Bird Rabbit Reptile Rodent
300 69 4 16148
We’re interested in the Rodents, and thankfully it seems like a majority of our observations will be maintained when subsetting to rodents. But wait… In our plot above, we can clearly see the presence of NA values. Why are we not seeing them here? Our command was correctly executed, but the output is not everything we intended. Having no error message to interpret, let’s jump straight to the function documentation:
R
?table
OUTPUT
Help on topic 'table' was found in the following packages:
Package Library
vctrs /home/runner/.local/share/renv/cache/v5/linux-ubuntu-jammy/R-4.5/x86_64-pc-linux-gnu/vctrs/0.6.5/c03fa420630029418f7e6da3667aac4a
base /home/runner/.cache/R/renv/sandbox/linux-ubuntu-jammy/R-4.5/x86_64-pc-linux-gnu/9a444a72
Using the first match ...
Here, the documentation provides some clues: there seems to be an argument called useNA that accepts “no”, “ifany”, and “always”, but it’s not immediately apparent which one we should use to show our NA values. As a second approach, let’s go to Examples to see if we can find any quick fixes. Here we see a couple lines further down:
R
table(a) # does not report NA's
table(a, exclude = NULL) # reports NA's
That seems like it should be inclusive. Let’s try again:
R
table(rodents$taxa, exclude = NULL)
OUTPUT
Bird Rabbit Reptile Rodent <NA>
300 69 4 16148 357
Now our NA values show up in the table. We see that by subsetting to the “Rodent” taxa, we would be losing 357 NAs, which themselves could be rodents! However, in this case, it seems a small enough portion to safely omit. Let’s subset our data to the rodent taxon.
R
rodents <- rodents %>% filter(taxa == "Rodent")
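The useNA argument mentioned in the documentation is another way to get the same table. The sketch below tries it on a small made-up vector, so the counts are not those of the real data. Because the help lookup above matched more than one package, `?base::table` opens the base R page directly, and `formals()` lists the arguments without opening the help at all.

```r
# List the arguments of base R's table() without opening the help page
arg_names <- names(formals(base::table))
print(arg_names)

# Made-up vector with two missing values (illustrative only)
taxa <- c("Rodent", "Rodent", "Bird", NA, "Rodent", NA)

default_counts <- table(taxa)                  # NA values silently dropped
ifany_counts <- table(taxa, useNA = "ifany")   # NA column only if NAs exist
exclude_counts <- table(taxa, exclude = NULL)  # the approach used above

print(default_counts)
print(ifany_counts)
```

With useNA = "always", the NA column would appear even when there are no missing values, which can be handy when comparing tables across subsets.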
Challenge
There are 3 lines of code below, labeled A, B, and C, and each attempts to create the same plot. Identify which produces a syntax error, which produces a semantic error, and which correctly creates the plot (hint: this may require you to infer what type of graph we’re trying to create!)
A. ggplot(rodents) + geom_bin_2d(aes(month, plot_type))
B. ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")
C. ggplot(rodents) + geom_tile(aes(month, plot_type))
In this case, A correctly creates the graph, using the fill color of each tile to show the number of observations for that combination of month and plot type. It is essentially equivalent to running the following lines of code:
R
rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())
OUTPUT
`summarise()` has grouped output by 'plot_type'. You can override using the
`.groups` argument.
R
ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))

B is a syntax error, and will produce the following error:
R
ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")
ERROR
Error in `geom_tile()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_count()` must only have an x or y aesthetic.
Finally, C is a semantic error. It produces a plot, but a rather meaningless one:
R
ggplot(rodents) + geom_tile(aes(month, plot_type))

Summary
In general, encountering semantic errors can make our job more difficult, but the roadmap remains the same:
- Seeing a problem arise in our code.
- Interpreting the problem.
Other steps we might take then include:
- Acting on parts of the error we can understand, such as changing input to a function.
- Pulling up the R Documentation for relevant functions, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
- Describing our problem into a search engine / generative LLM for more interpretable explanations.
- And, when all else fails, we can prepare our code into a reproducible example for expert help.
The steps for identifying the problem and applying code first aid match what we’ve seen above. However, seeing the problem arise in our code may be much more subtle here, since it depends on us recognizing output that we don’t expect or know to be wrong. Even when the code runs, R may give us warnings or informational messages. Most of the time, however, it’s up to the coder to be vigilant and make sure each step is running as it should. Interpreting the problem may also be more difficult, as R gives us little or no indication of how it’s misinterpreting our intent.
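To illustrate how a warning differs from an error, the sketch below converts a made-up character vector to numbers. The code finishes and returns a result, but R signals a warning along the way. Here the warning is recorded with `withCallingHandlers()` so that we can inspect it afterwards. The vector `values` is an assumption for illustration.

```r
# Made-up input: the last value cannot be converted to a number
values <- c("12", "7", "unknown")

# Collect any warning messages while still letting the code finish
caught <- character(0)
converted <- withCallingHandlers(
  as.numeric(values),
  warning = function(w) {
    caught <<- c(caught, conditionMessage(w))  # record the message
    invokeRestart("muffleWarning")             # then carry on
  }
)

print(converted)  # the result exists, but contains an NA
print(caught)     # the warning explains where the NA came from
```

Unlike an error, the warning did not stop the code, so an unexpected NA like this can quietly travel into later steps unless we look for it.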
Callout
Generally, the more your code relies on specialized packages rather than base R functions, the less documentation and online help you are likely to find. While base R errors will often be solvable in a couple of minutes with a quick ?help check or an existing discussion on a website like Stack Overflow, errors arising from little-used packages applied in bespoke analyses might merit isolating your specific problem in a reproducible example for online help, or even getting in touch with the developers! Such community input and questions are often how packages and documentation improve over time.
3.3 How can I find where my code is failing?
Isolating your problem may not be as simple as assessing the output from a single function call on the line of code which produces your error. Often, it may be difficult to determine which lines or logical sections of code (e.g. functions) are producing the error.
Consider the example below, where we are now attempting to see which species of kangaroo rats appear in different plot types over the years. To do so, we’ll filter our dataset to just include the genus Dipodomys. Then we’ll plot a histogram of how many observations are seen in each plot type, with year on the x axis.
R
krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus
ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
ERROR
Error in `combine_vars()`:
! Faceting variables must have at least one value.
Uh-oh. Another error here, when we try to make a ggplot. But what is “combine_vars”? And what does “Faceting variables must have at least one value” mean?
This is not an easily-interpretable error message from ggplot, and our code looks like it should run. Perhaps we can take a step back and see whether our error is actually not in the ggplot code itself. Often, when trying to isolate the problem area, it is a good idea to look back at the original input. So let’s take a look at our krats dataset.
R
krats
OUTPUT
# A tibble: 0 × 13
# ℹ 13 variables: record_id <dbl>, month <dbl>, day <dbl>, year <dbl>,
# plot_id <dbl>, species_id <chr>, sex <chr>, hindfoot_length <dbl>,
# weight <dbl>, genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
It’s empty! What went wrong with our “Dipadomys” filter? Let’s use a print statement to see which genera are included in the original rodents dataset.
R
print(rodents %>% count(genus))
OUTPUT
# A tibble: 12 × 2
genus n
<chr> <int>
1 Ammospermophilus 136
2 Baiomys 3
3 Chaetodipus 382
4 Dipodomys 9573
5 Neotoma 904
6 Onychomys 1656
7 Perognathus 553
8 Peromyscus 1271
9 Reithrodontomys 1412
10 Rodent 4
11 Sigmodon 103
12 Spermophilus 151
We see two things here. First, we’ve misspelled Dipodomys, which we can now amend. Second, this quick function call tells us we should expect a data frame with 9573 rows after subsetting to the genus Dipodomys.
R
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)
OUTPUT
[1] 9573 13
R
ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Our improved code here looks good. Checking the dimensions of our subsetted data frame using the dim() function confirms we now have all the Dipodomys observations, and our plot is looking better. In general, having a ‘print’ statement or some other output after we manipulate data or other major steps can be a good way to check your code is producing intermediate results consistent with your expectations.
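A check like this can also be written so that the script stops by itself when an intermediate result looks wrong. The sketch below shows the idea on a made-up data frame, `toy_rodents`, using base R subsetting in place of filter() so that it runs without extra packages.

```r
# Made-up stand-in for the rodents data (illustrative values only)
toy_rodents <- data.frame(
  genus = c("Dipodomys", "Dipodomys", "Neotoma", "Dipodomys"),
  year = c(1977, 1978, 1977, 1989)
)

# Filter with the misspelled genus: no error, but also no rows
krats_typo <- toy_rodents[toy_rodents$genus == "Dipadomys", ]
print(nrow(krats_typo))

# Check the spelling against the values actually present
print("Dipadomys" %in% toy_rodents$genus)
print(sort(unique(toy_rodents$genus)))

# Corrected filter, followed by a check that stops the script if it is empty
krats <- toy_rodents[toy_rodents$genus == "Dipodomys", ]
stopifnot(nrow(krats) > 0)
print(dim(krats))
```

Had the `stopifnot()` line followed the misspelled filter, the script would have halted right there, pointing at the filter rather than at the plot.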
Callout
Often, giving your expert helpers access to the entire problem, with a detailed description of your desired output allows you to directly improve your coding skills and learn about new functions and techniques.
Summary
Our roadmap to identifying problems in our code may now look like:
- Seeing a problem arise in our code.
- Isolating our code to the problem area.
- Interpreting the problem.
Now we can see the need to isolate the specific areas of code causing the bug or problem. There is no general rule of thumb as to how large this area needs to be. But unless our problem occurs on the first line, we should be able to narrow our code down a bit: any early lines which we know run correctly and as intended may not need to be included. Isolating the problem area as much as we can makes it more understandable to others, even if that does not help us solve the problem ourselves.
Let’s add to our code first aid:
- Identify the problem area – add print statements immediately upstream or downstream of problem areas, check the desired output from functions, and see whether any intermediate output can be further isolated.
- Acting on parts of the error we can understand, such as changing input to a function.
- Pulling up the R Documentation for relevant functions, and reading the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.
- Describing our problem into a search engine / generative LLM for more interpretable explanations.
- And, when all else fails, we can prepare our code into a reproducible example for expert help.
While this is similar to our previous checklists, we can now understand these steps as a continuous cycle of isolating the problem into smaller and smaller chunks for a reproducible example. Whenever a step helps us identify the specific area or aspect of our code that is failing, we can zoom in on it and restart the checklist. We can stop once we reach a point where we no longer understand how our code is failing. At this point, we can excise that area for further help using a reprex.
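One way to put this cycle into practice is to split a multi-step calculation into named intermediate objects and check each one in turn. The following sketch does this with a made-up data frame. The object names (`step1`, `step2`, `step3`) are purely illustrative.

```r
# Made-up stand-in for the rodents data (illustrative values only)
toy_rodents <- data.frame(
  genus = c("Dipodomys", "Dipodomys", "Neotoma", "Dipodomys"),
  weight = c(40, NA, 180, 52)
)

# Step 1: keep one genus, then check how many rows survived
step1 <- toy_rodents[toy_rodents$genus == "Dipodomys", ]
print(nrow(step1))

# Step 2: drop missing weights, then check again
step2 <- step1[!is.na(step1$weight), ]
print(nrow(step2))

# Step 3: the final calculation, now on input we have already inspected
step3 <- mean(step2$weight)
print(step3)
```

If the final number looks wrong, the intermediate row counts show which step to zoom in on next.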
3.4 When should I prepare my code for a reprex?
There may be some point at which our code first aid does not help us anymore, and we still cannot figure out the problem our code is giving us. In that case, it may be time to turn to expert help by asking a coworker, mentor, or someone online for aid.
While it is common practice in intro coding courses to call over the instructor with a raised hand and a statement such as “I don’t know what’s wrong,” in reality people have limited time and bandwidth, and may lack the requisite knowledge, to help with every problem that might arise. Even if they can’t figure out a bug on their own, the practiced coder can identify and articulate the problem effectively, so that someone with available time and expertise can help out.
Content from Minimal Reproducible Code
Last updated on 2025-05-27
Overview
Questions
- Why is it important to make a minimal code example?
- Which part of my code is causing the problem?
- Which parts of my code should I include in a minimal example?
- How can I tell whether a code snippet is reproducible or not?
- How can I make my code reproducible?
Objectives
- Explain the value of a minimal code snippet.
- Identify the problem area of a script.
- Identify supporting parts of the code that are essential to include.
- Simplify a script down to a minimal code example.
- Evaluate whether a piece of code is reproducible as is or not. If not, identify what is missing.
- Edit a piece of code to make it reproducible
- Have a road map to follow to simplify your code.
- Describe the {reprex} package and its uses
Mickey is interested in understanding how kangaroo rat weights differ across species and sexes, so they create a quick visualization.
R
ggplot(rodents, aes(x = species, fill = sex))+
geom_bar()
Whoa, this is really overwhelming! Mickey forgot that the dataset includes data for a lot of different rodent species, not just kangaroo rats. Mickey is only interested in two kangaroo rat species: Dipodomys ordii (Ord’s kangaroo rat) and Dipodomys spectabilis (Banner-tailed kangaroo rat).
Mickey also notices that there are three categories for sex: F, M, and what looks like a blank field when there is no sex information available. For the purposes of comparing weights, Mickey wants to focus only on rodents of known sex.
Mickey filters the data to include only the two focal species and only rodents whose sex is F or M.
R
rodents_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
Because these scientific names are long, Mickey also decides to add common names to the dataset. They start by creating a data frame with the common names, which they will then join to the rodents_subset dataset:
R
## Adding common names
common_names <- data.frame(species = unique(rodents_subset$species), common_name = c("Ord's", "Banner-tailed"))
common_names
OUTPUT
species common_name
1 spectabilis Ord's
2 ordii Banner-tailed
But looking at the common names dataset reveals a problem!
Exercise 1a: Applying code first aid
- Is this a syntax error or a semantic error? Explain why.
- What “code first aid” steps might be appropriate here? Which ones are unlikely to be helpful?
Mickey re-orders the names and tries the code again. This time, it works! Now they can join the common names to rodents_subset.
R
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
rodents_subset <- left_join(rodents_subset, common_names, by = "species")
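The mismatch arose because unique() returns values in the order they first appear, which need not be alphabetical. The sketch below shows this on a made-up vector, along with one alternative that does not depend on ordering at all: writing each species next to its common name. The vector `species_seen` and the `lookup` object are assumptions for illustration, not part of Mickey’s script.

```r
# Made-up vector in which "spectabilis" happens to appear first
species_seen <- c("spectabilis", "ordii", "spectabilis", "ordii")

first_seen_order <- unique(species_seen)          # order of first appearance
alphabetical_order <- sort(unique(species_seen))  # alphabetical order
print(first_seen_order)
print(alphabetical_order)

# Writing each pair out explicitly does not depend on either ordering
common_names <- data.frame(
  species = c("ordii", "spectabilis"),
  common_name = c("Ord's", "Banner-tailed")
)

# Look up each observed species by name rather than by position
lookup <- setNames(common_names$common_name, common_names$species)
labeled <- unname(lookup[species_seen])
print(labeled)
```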
Before moving on to answering their research question about kangaroo rat weights, Mickey also wants to create a date column, since they realized that having the dates stored in three separate columns (month, day, and year) might make future analysis harder. They want to use lubridate to parse the dates. But here, too, they run into trouble.
R
rodents_subset <- rodents_subset %>%
mutate(date = lubridate(paste(year, month, day, sep = "-")))
ERROR
Error in `mutate()`:
ℹ In argument: `date = lubridate(paste(year, month, day, sep = "-"))`.
Caused by error in `lubridate()`:
! could not find function "lubridate"
Exercise 1b: Applying code first aid, part 2
- Is this a syntax error or a semantic error? Explain why.
- What “code first aid” steps might be appropriate here?
- What would be your next step to fix this error, if you were Mickey?
Exercise 1c: Applying code first aid, part 2 (extra challenge)
Mickey tried several methods to create a date column. Here’s one of them.
R
test <- rodents_subset %>%
mutate(date = lubridate::as_date(paste(day, month, year)))
WARNING
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date = lubridate::as_date(paste(day, month, year))`.
Caused by warning:
! All formats failed to parse. No formats found.
- What type of error is this?
- What do you learn from the warning message? Why do you think this code causes a warning message, rather than an error message?
- Try some code first aid steps. What do you think happened here? How did you figure it out?
Mickey reads some of the lubridate documentation and changes their code so that the date column is created correctly.
R
rodents_subset <- rodents_subset %>%
  mutate(date = lubridate::ymd(paste(year, month, day, sep = "-")))
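The key idea is that the order of the pieces in the pasted string has to match what the parser expects. The sketch below illustrates this with base R's as.Date() as a stand-in for lubridate (an assumption made so the example runs without extra packages); the single-row date_parts data frame is made up.

```r
# One made-up survey date, stored in three separate columns
date_parts <- data.frame(year = 1977, month = 7, day = 18)

# Year-month-day order matches the default format that as.Date() tries
ymd_text <- paste(date_parts$year, date_parts$month, date_parts$day, sep = "-")
parsed_ymd <- as.Date(ymd_text)

# Day-month-year order needs an explicit format that describes the string
dmy_text <- paste(date_parts$day, date_parts$month, date_parts$year)
parsed_dmy <- as.Date(dmy_text, format = "%d %m %Y")

identical(parsed_ymd, parsed_dmy)
```

Both routes give the same date once the parser is told, implicitly or explicitly, which component comes first.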
Now that the dataset is cleaned, Mickey is ready to start learning about kangaroo rat weights!
They start by running a quick linear regression to predict weight based on species and sex.
R
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model)
OUTPUT
Call:
lm(formula = weight ~ common_name + sex, data = rodents_subset)
Residuals:
Min 1Q Median 3Q Max
-111.201 -6.466 2.534 10.799 45.799
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.2007 0.8061 152.83 <2e-16 ***
common_nameOrd's -74.7342 1.3352 -55.97 <2e-16 ***
sexM NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.71 on 939 degrees of freedom
(35 observations deleted due to missingness)
Multiple R-squared: 0.7694, Adjusted R-squared: 0.7691
F-statistic: 3133 on 1 and 939 DF, p-value: < 2.2e-16
The negative coefficient for common_nameOrd's tells Mickey that Ord’s kangaroo rats are significantly less heavy than Banner-tailed kangaroo rats.
But something is wrong with the coefficients for sex. Why is everything NA for sexM?
When Mickey visualizes the data, they see a problem in the graph, too. As the model showed, Ord’s kangaroo rats are significantly smaller than Banner-tailed kangaroo rats. But something is definitely wrong! Because the boxes are colored by sex, we can see that all of the Banner-tailed kangaroo rats are male and all of the Ord’s kangaroo rats are female. That can’t be right! What are the chances of catching all one sex for two different species?
R
rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot()
WARNING
Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Mickey confirms this with a two-way frequency table.
R
table(rodents_subset$sex, rodents_subset$species)
OUTPUT
ordii spectabilis
F 350 0
M 0 626
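This table also explains the NA row for sexM in the model summary: when every animal of one species has the same sex, the two predictors carry identical information and lm() cannot estimate separate effects. Here is a toy sketch with made-up weights (not Portal data) that reproduces the same pattern.

```r
# Made-up data in which sex is perfectly confounded with species
toy <- data.frame(
  weight  = c(120, 125, 118, 45, 50, 48),
  species = c("spectabilis", "spectabilis", "spectabilis",
              "ordii", "ordii", "ordii"),
  sex     = c("M", "M", "M", "F", "F", "F")
)

toy_model <- lm(weight ~ species + sex, data = toy)

# The sex coefficient cannot be separated from the species coefficient
coef(toy_model)
```

The coefficient for sexM comes back as NA, just as in Mickey's model, which is a hint that the data (not the modeling code) needs a closer look.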
To double check, Mickey looks at the original dataset.
R
table(rodents$sex, rodents$species)
OUTPUT
albigula eremicus flavus fulvescens fulviventer harrisi hispidus
62 14 15 0 0 136 2
F 474 372 222 46 3 0 68
M 368 468 302 16 2 0 42
leucogaster maniculatus megalotis merriami ordii penicillatus sp.
16 9 33 45 3 6 10
F 373 160 637 2522 690 221 4
M 397 248 680 3108 792 155 5
spectabilis spilosoma taylori torridus
42 149 0 28
F 1135 1 0 390
M 1232 1 3 441
Not only were there originally males and females present from both ordii and spectabilis, but the original numbers were way, way higher! It looks like somewhere along the way, Mickey lost a lot of observations.
[WORKING THROUGH CODE FIRST AID STEPS HERE] Mickey is feeling overwhelmed and not sure where their code went wrong. They were able to fix the errors and warning messages that they encountered so far, but this one seems more complicated, and there has been no clear indication of what went wrong. They work their way through the code first aid steps, but they are not able to solve the problem.
They decide to consult Remy’s road map to figure out what to do next.

Since code first aid was not enough to solve this problem, it looks like it’s time to ask for help using a reprex.
Making a reprex
Simplify the code
When asking someone else for help, it is important to simplify your code as much as possible to make it easier for the helper to understand what is wrong. Simplifying code helps to reduce frustration and overwhelm when debugging an error in a complicated script. The more that we can make the process of helping easy and painless for the helper, the more likely that they will take the time to help.
Callout
Depending on how closely you have been following the lesson and which challenges you have attempted, your script may not look exactly like Mickey’s. That’s okay!
Mickey has written a lot of code so far. The code is also a little messy–for example, after fixing the previous errors, they sometimes commented out the old code and kept it for future reference.
Create a new script
To make the task of simplifying the code less overwhelming, let’s create a separate script for our reprex. This will let us experiment with simplifying our code while keeping the original script intact.
Let’s create and save a new, blank R script and give it a name, such as “reprex-script.R”.
Callout: Making an R script
There are several ways to make an R script:
- File > New File > R Script
- Click the white square with a green plus sign at the top left corner of your RStudio window
- Use a keyboard shortcut: Cmd + Shift + N (on a Mac) or Ctrl + Shift + N (on Windows)
We’re going to start by copying over all of our code, so we have an exact copy of the full analysis script.
R
# Minimal reproducible example script
# Load packages and data
library(ggplot2)
library(dplyr)
rodents <- read.csv("data/surveys_complete_77_89.csv")
# XXX ADD PETER'S EPISODE CODE HERE
## Filter to only rodents
rodents <- rodents %>% filter(taxa == "Rodent")
# Visualize sex by species
ggplot(rodents, aes(x = species, fill = sex))+
geom_bar()
# Subset to species and sexes of interest
rodents_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
# Add common names
# common_names <- data.frame(species = unique(rodents_subset$species), common_name = c("Ord's", "Banner-tailed"))
# common_names # oops, this looks wrong!
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
common_names
rodents_subset <- left_join(rodents_subset, common_names)
# Add a date column
# rodents_subset <- rodents_subset %>%
# mutate(date = lubridate(paste(year, month, day, sep = "-"))) # that didn't work!
rodents_subset <- rodents_subset %>%
mutate(date = lubridate::ymd(paste(year, month, day, sep = "-")))
# Predict weight by species and sex, and make a plot
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model)
rodents_subset %>%
ggplot(aes(y = weight, x = common_name, fill = sex)) +
geom_boxplot() # wait, why does this look weird?
# Investigate
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)
Now, we will follow a process:
1. Identify the symptom of the problem.
2. Remove a piece of code to make the reprex more minimal.
3. Re-run the reprex to make sure the reduced code still demonstrates the problem: check that the symptom is still present.
In this case, the symptom is that we are missing rows in rodents_subset that were present in rodents and should not have been removed!
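Because we will re-check the symptom after every removal, it can help to phrase it as a single TRUE/FALSE check. The sketch below is one possible way to do that, using made-up toy data; toy_full, toy_subset, and the helper count_cells() are hypothetical names, and toy_subset simply drops rows by hand to stand in for a subset that lost observations.

```r
# Made-up "full" data: two animals in each sex-by-species combination
toy_full <- data.frame(
  species = rep(c("ordii", "spectabilis"), each = 4),
  sex     = rep(c("F", "F", "M", "M"), times = 2)
)

# Stand-in for a subset that is unexpectedly missing rows
toy_subset <- toy_full[c(1, 7), ]

# Count rows per sex-by-species cell, keeping cells that have zero rows
count_cells <- function(df) {
  table(factor(df$sex, levels = c("F", "M")),
        factor(df$species, levels = c("ordii", "spectabilis")))
}

# The symptom: some cell has fewer rows in the subset than in the full data
symptom_present <- any(count_cells(toy_subset) < count_cells(toy_full))
symptom_present
```

If a check like this ever flips to FALSE after a removal, the piece of code just removed was needed to reproduce the problem.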
Let’s start by identifying pieces of code that we can probably remove. A good start is to look for lines of code that do not create variables for later use, or lines that add complexity to the analysis that is not relevant to the problem at hand.
We can start by removing the broken code that we commented out earlier. Also, adding the date column is not directly relevant to the current problem. Let’s go ahead and remove those pieces of code. Now our script looks like this:
R
# Minimal reproducible example script
# Load packages and data
library(ggplot2)
library(dplyr)
rodents <- read.csv("data/surveys_complete_77_89.csv")
# XXX ADD PETER'S EPISODE CODE HERE
## Filter to only rodents
rodents <- rodents %>% filter(taxa == "Rodent")
# Visualize sex by species
ggplot(rodents, aes(x = species, fill = sex))+
geom_bar()
# Subset to species and sexes of interest
rodents_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
# Add common names
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
common_names
rodents_subset <- left_join(rodents_subset, common_names)
# Predict weight by species and sex, and make a plot
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model)
rodents_subset %>%
ggplot(aes(y = weight, x = common_name, fill = sex)) +
geom_boxplot() # wait, why does this look weird?
# Investigate
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)
When we run this code, we can confirm that it still demonstrates our problem. There are still many rows missing from rodents_subset.
We’ve made progress on minimizing our code, but we still have a ways to go. This script is still pretty long! Let’s identify more pieces of code that we can remove.
Exercise 2: Minimizing code
Which other lines of code can you remove to make this script more minimal? After removing each one, be sure to re-run the code to make sure that it still reproduces the error.
- [Peter’s episode code]
- Visualizing sex by species (ggplot) can be removed because it generates a plot but does not create any variables that are used later.
- Filtering to only rodents can be removed because later we filter to only two species in particular.
- Adding common names can be removed because we didn’t actually use those common names. This one is tricky because technically we did use the common names in the rodents_subset plot. But is that plot really necessary? We can still demonstrate the problem using the table() lines of code at the end. Also, we could still make the equivalent plot using the species column instead of the common_name column, and it would demonstrate the same thing!
- The weight model and the summary can be removed.
A totally minimal script would look like this:
R
rodents <- read.csv("data/surveys_complete_77_89.csv")
rodents_subset <- rodents %>%
  filter(species == c("ordii", "spectabilis"),
         sex == c("F", "M"))
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)
Great, now we have a totally minimal script!
However, we’re not done yet.
Exercise 3: The problem area is not enough
Let’s suppose that Mickey has created the minimal problem area script shown above. They email this script to Remy so that Remy can help them debug the code.
Remy opens up the script and tries to run it on their computer, but it doesn’t work.
- What do you think will happen when Remy tries to run the code from this reprex script?
- What do you think Mickey should do next to improve the minimal reproducible example?
We haven’t yet included enough code to allow a helper, such as Remy, to run the code on their own computer. If Remy tries to run the reprex script in its current state, they will encounter errors because they don’t have access to the same R environment that Mickey does.
Include dependencies
R code consists primarily of functions and variables. In order to make our minimal examples truly reproducible, we have to give our helpers access to all the functions and variables that are necessary to run our code.
First, let’s talk about functions. Functions in R typically come from packages. You can access them by loading the package into your environment.
To make sure that your helper has access to the packages necessary to run your reprex, you will need to include calls to library() for whichever packages are used in the code. For example, if your code uses the function lmer() from the lme4 package, you would have to include library(lme4) at the top of your reprex script to make sure your helper has the lme4 package loaded and can run your code.
Callout: Default packages
Some packages, such as {base} and {stats}, are loaded in R by default, so you might not have realized that a lot of functions, such as dim, colSums, factor, and length, actually come from those packages!
You can see a complete list of the functions that come from the {base} and {stats} packages by running library(help = "base") or library(help = "stats").
Let’s do this for our own reprex. We can start by identifying all the functions used, and then we can figure out where each function comes from to make sure that we tell our helper to load the right packages.
The first function used in our example is ggplot(), which comes from the package ggplot2. Therefore, we know we will need to add library(ggplot2) at the top of our script.
The function geom_boxplot() also comes from ggplot2. We also used the function table(). Running ?table tells us that the table function comes from the package {base}, which is automatically installed and loaded when you use R. That means we don’t need to include library(base) in our script.
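Our code also uses the pipe operator %>%, which comes from {magrittr} and is re-exported by {dplyr}, so we load {dplyr} as well. Our reprex script now looks like this:
R
# Mickey's reprex script
# Load necessary packages to run the code
library(ggplot2)
library(dplyr) # provides the %>% pipe
rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?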
WARNING
Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_boxplot()`).

R
# Investigate
table(rodents_subset$sex, rodents_subset$species)
OUTPUT
ordii spectabilis
F 350 0
M 0 626
R
table(rodents$sex, rodents$species)
OUTPUT
albigula eremicus flavus fulvescens fulviventer harrisi hispidus
62 14 15 0 0 136 2
F 474 372 222 46 3 0 68
M 368 468 302 16 2 0 42
leucogaster maniculatus megalotis merriami ordii penicillatus sp.
16 9 33 45 3 6 10
F 373 160 637 2522 690 221 4
M 397 248 680 3108 792 155 5
spectabilis spilosoma taylori torridus
42 149 0 28
F 1135 1 0 390
M 1232 1 3 441
Callout: Installing vs. loading packages
But what if our helper doesn’t have all of these packages installed? Won’t the code not be reproducible?
Typically, we don’t include install.packages() in our code for each of the packages that we include in the library() calls, because install.packages() is a one-time piece of code that doesn’t need to be repeated every time the script is run. We assume that our helper will see library(specialpackage) and know that they need to go install “specialpackage” on their own.
Technically, this makes that part of the code not reproducible! But it’s also much more “polite”. Our helper might have their own way of managing package versions, and forcing them to install a package when they run our code risks messing up their workflow. It is a common convention to stick with library() and let them figure it out from there.
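If you want to check which packages are available without installing anything, one option is requireNamespace(), which only reports whether a package can be loaded. The sketch below is optional and not part of Mickey's script; the needed vector uses {stats} and {utils} (which ship with R) as stand-ins for a reprex's real dependencies.

```r
# Stand-ins for the packages a reprex depends on
needed <- c("stats", "utils")

# TRUE for each package that is installed; nothing is installed or attached
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- needed[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Please install before running: ",
          paste(missing_pkgs, collapse = ", "))
}
```

This keeps the decision about how and when to install in the helper's hands.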
Exercise 4: Which packages are essential?
In each of the following code snippets, identify the necessary packages (or other code) to make the example reproducible.
a.
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
tab_model(weight_model)
b.
mod <- lmer(weight ~ hindfoot_length + (1|plot_type), data = rodents)
summary(mod)
c.
rodents_processed <- process_rodents_data(rodents)
glimpse(rodents_processed)
This exercise should take about 10 minutes.
a. lm comes from {stats}, which is loaded by default, so there’s no package needed for that. tab_model comes from the package sjPlot. You could add library(sjPlot) to this code to make it reproducible.
b. lmer is a linear mixed modeling function that comes from the package lme4. summary is from base R. You could add library(lme4) to this code to make it reproducible.
c. process_rodents_data is not from any package that we know of, so it was probably a function the author wrote themselves. In order to make this example reproducible, you would have to include the definition of process_rodents_data. glimpse is probably from dplyr, but it’s worth noting that there is also a glimpse function in the pillar package, so this might be ambiguous. This is another reason it’s important to specify your packages: if you leave your helper guessing, they might load the wrong package and misunderstand your error!
Including library() calls will definitely help Remy run the code. But this code still won’t work as written because Remy does not have access to the same objects that Mickey used in the code.
The code as written relies on rodents_subset, which Remy will not have access to if they try to run the code. That means that we’ve succeeded in making our example minimal, but it is not reproducible: it does not allow someone else to reproduce the problem!
[Transition to minimal data episode]
Exercise 5: Reflection
Let’s take a moment to reflect on this process.
What’s one thing you learned in this episode? An insight; a new skill; a process?
What is one thing you’re still confused about? What questions do you have?
This exercise should take about 5 minutes.
Key Points
- Making a reprex is the next step after trying code first aid.
- In order to make a good reprex, it is important to simplify your code
- Simplify code by removing parts not directly related to the question
- Give helpers access to the functions used in your code by loading all necessary packages
Content from Minimal Reproducible Data
Last updated on 2025-05-27
Overview
Questions
- What is a minimal reproducible dataset, and why do I need it?
- How do I create a minimal reproducible dataset?
- Can I just use my own data?
Objectives
- Describe a minimal reproducible dataset
- Identify the aspects of your data necessary to replicate your issue
- Create a dataset from scratch to replicate your issue
- Share your own dataset in a way that is minimal, accessible, and reproducible
- Subset an existing dataset to replicate your issue
3.1 What is a minimal reproducible dataset and why do I need it?
Now that you have narrowed down your problem area and stripped down your code to make it minimal, we need to ensure it is reproducible. This means it needs to be accessible and runnable, such that anyone else can copy-paste it into their system, run the code, and replicate your issue. Importantly, example code will always require example data in order to run! Therefore, every reprex requires you to provide a minimal reproducible dataset to use with the code.
Furthermore, as we have seen previously, sometimes the source of the problem isn’t actually your code, but rather your data! By providing an example dataset that, when used in your example code, still replicates your issue, you also give your helper the opportunity to better investigate and manipulate that data to fix your issue. It would be great if we could give the helper our entire computer so they could just take over where we left off, but usually we can’t.
Just as we did with our code, when providing such an example dataset we also want to make sure we keep it minimal–free of unnecessary data. This will allow your helper to more clearly see what the data looks like and what the source of your issue may be. Furthermore, it will allow you to not only better understand your data but also potentially work out the source of your issue. When extraneous information is removed and only the parts that replicate the issue are kept, we can begin to see where the issues arise.
In short, a minimal reproducible dataset must be:
- minimal: it only contains the necessary information to run your minimal code. You can also think of this as being relevant to the problem: keep only what is necessary.
- reproducible: it must be accessible to someone without your computer, and it must consistently replicate your output/issue. This means it also needs to be complete, meaning there are no dependencies that have been omitted.
Remember: your helper may not be in the room with you or have access to your computer and the files that are on it!
You might be used to always uploading data from separate files, but helpers can’t access those files. Even if you sent someone a file, they would need to put it in the right directory, make sure to load it in exactly the same way, make sure their computer can open it, etc. Since the goal is to make everyone’s lives as easy as possible, we need to think about our data in a different way–as a dummy object created in the script itself.
Pro-tip
Examples of minimal reproducible examples can also be found in the help pages (e.g. ?mean) in RStudio. Just scroll all the way down to where the examples are listed. These will usually be minimal and reproducible.
For example, let’s look at the function mean:
R
?mean
We see examples that can be run directly on the console, with no additional code.
R
x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))
OUTPUT
[1] 8.75 5.50
In this case, x is the dummy dataset consisting of just 1 variable. Notice how it was created as part of the example.
Exercise 1
These datasets are not well suited for use in a reprex. For each one, try to reproduce the dataset on your own in R (copy-paste). Does it work? What happened? Explain.
sample_data <- read_csv("/Users/kaija/Desktop/RProjects/ResearchProject/data/sample_dataset.csv")
dput(complete_old[1:100,])
sample_data <- data.frame(species = species_vector, weight = c(10, 25, 14, 26, 30, 17))
- Not reproducible because it is a screenshot.
- Not reproducible because it is a path to a file that only exists on someone else’s computer and therefore you do not have access to it using that path.
- Not minimal: it has far too many columns and probably too many rows. It is also not reproducible because we were not given the source for complete_old.
- Not reproducible because we are not given the source for species_vector.
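As an illustration, the last dataset becomes reproducible as soon as species_vector is defined in the script itself. The species values below are made up for the sketch; any six values would do.

```r
# Define the missing object in the script (made-up values)
species_vector <- c("ordii", "spectabilis", "ordii",
                    "merriami", "spectabilis", "merriami")

sample_data <- data.frame(species = species_vector,
                          weight = c(10, 25, 14, 26, 30, 17))
sample_data
```

Now anyone can copy-paste these lines into a fresh R session and get the same data frame.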
Exercise 2
Let’s say we want to know the average weight of all the species in our rodents dataset. We try to use the following code…
R
mean(rodents$weight)
OUTPUT
[1] NA
…but it returns NA! We don’t know why that is happening and we want to ask for help.
Which of the following represents a minimal reproducible dataset for this code? Can you describe why the other ones are not?
A. sample_data <- data.frame(month = rep(7:9, each = 2), hindfoot_length = c(10, 25, 14, 26, 30, 17))
B. sample_data <- data.frame(weight = rnorm(10))
C. sample_data <- data.frame(weight = c(100, NA, 30, 60, 40, NA))
D. sample_data <- sample(rodents$weight, 10)
E. sample_data <- rodents_modified[1:20,]
The correct answer is C!
- A does not include the variable of interest (weight).
- B does not produce the same problem (an NA result): the code runs just fine.
- C is minimal and reproducible.
- D is not reproducible. sample() randomly draws 10 items; sometimes they may include NAs, sometimes they may not, so it is not guaranteed to reproduce the problem. It can be used if a seed is set (see next section for more info) and the helper has access to rodents.
- E uses a dataset that isn’t accessible without previous data wrangling code: the object rodents_modified doesn’t exist.
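To see that the dataset with explicit NA values really does reproduce the problem, we can run the original call on it. This is a quick sketch; the na.rm = TRUE line is included only to show that the missing values are what drive the result.

```r
# The candidate dataset with explicit missing values
sample_data <- data.frame(weight = c(100, NA, 30, 60, 40, NA))

# Same symptom as with the full rodents data: the result is NA
mean(sample_data$weight)

# Dropping the missing values gives a number, pointing at the NAs as the cause
mean(sample_data$weight, na.rm = TRUE)
```

Six rows are enough for a helper to see both the problem and its likely source.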
3.2 Can I just use my own data?
At this point you may be wondering why you need a separate dataset. Can’t you just use your own data if you made sure it was minimal and your helper could access it?
Callout
There are several reasons why you might need to create a separate dataset that is minimal and reproducible instead of trying to use your actual dataset. The original dataset may be:
- too large - the Portal dataset is ~35,000 rows with 13 columns and contains data for decades. That’s a lot!
- private - your dataset might not be published yet, or maybe you’re studying an endangered species whose locations can’t easily be shared. Another example: many medical datasets cannot be shared publicly.
- hard to send - on most online forums, you can’t attach supplemental files (more on this later). Even if you are just sending data to a colleague, file paths can get complicated, the data might be too large to attach, etc.
- complicated - it would be hard to locate the relevant information. One example to steer away from is needing a ‘data dictionary’ to understand all the meanings of the columns (e.g. what is “plot type” in ratdat?). We don’t want our helper to waste valuable time figuring out what everything means.
- highly derived/modified from the original file. You may have already done a bunch of preliminary data wrangling you don’t want to include when you send the example, so you would need to provide the intermediate dataset directly to your helper.
If so, you wouldn’t be entirely wrong. While there could be several ways in which your original data may be inaccessible or hard to derive or subset, there are likely just as many ways you could make it anonymous, minimal, and reproducible. And we will show you how! Nevertheless, making your own data minimal and reproducible isn’t necessarily easier than creating a new dataset from scratch. Furthermore, creating a dataset from scratch can often highlight the source of your issue, which means you might not need to ask for help after all, or you can ask a more specific question.
3.3 How do I create a minimal reproducible dataset?
In general, there are 3 common ways to provide minimal reproducible data for a reprex.
1. You can write a script that creates a new “dummy” dataset with the same key characteristics as your original data.
2. You can make your own data minimal and reproducible.
3. You can use a dataset that is already embedded in R and is therefore already accessible.
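As a rough preview, here is what each of the three approaches can look like in code. This is only a sketch: dummy_data and my_data are made-up objects (my_data stands in for your real data), and iris is one of the datasets that ships with R.

```r
# 1. Build a "dummy" dataset from scratch (made-up values)
dummy_data <- data.frame(
  species = c("a", "b", "a", "c"),
  weight  = c(10, NA, 14, 26)
)

# 2. Share a small piece of your own data: dput() prints R code
#    that recreates the object when pasted into another session
my_data <- data.frame(plot_id = 1:50, weight = seq(20, 69))
my_small <- head(my_data, 3)
dput(my_small)

# 3. Use a dataset that is already available in R
builtin_small <- head(iris, 3)
builtin_small
```

In all three cases the helper only needs the script itself; no files have to change hands.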
For the purpose of this lesson, we believe each coder is entitled to all the options, therefore we will walk you through how to provide a minimal reproducible dataset using each of these 3 methods. However, opinions generally differ on which method is best for which situations. Below we compiled a summary table of advantages and disadvantages of each method based on many conversations with several data science groups.
| | Advantages | Disadvantages |
| Data from Scratch | | While some disadvantages are universal, many apply mostly to novices. |
| R-built Data | | |
| Your Data | | |
3.4 Creating a “dummy” dataset from scratch
While this might be the most daunting option for novices, it tends to be the preferred method for experts. That’s probably because, once you really understand the basic building blocks, it becomes the most straightforward method of creating a minimal reproducible dataset. This is also the method that makes most sense when doing other activities that also require a reprex (e.g., teaching, collaborating, developing). Lastly, in this lesson we believe there are greater problem-solving insights to be gained by creating a new “dummy” dataset. So let’s start by making this scary process more easily digestible!
Usually, at this stage, you would have 2 pressing questions:
- How do I create a dataset?
- How do I recreate the key elements of my data that replicate my issue?
Let’s start with the first.
There are many ways to create a dataset in R.
You can start by creating vectors using c().
R
vector <- c(1,2,3,4)
vector
OUTPUT
[1] 1 2 3 4
You can also add some randomness by sampling from a vector using sample().
For example, you can sample numbers 1 through 10 in a random order.
R
x <- sample(1:10)
x
OUTPUT
[1] 6 9 10 4 1 7 3 8 2 5
Or you can randomly sample from a normal distribution
R
x <- rnorm(10)
x
OUTPUT
[1] -0.7645173 -0.9526317 -1.2161516 0.6193248 -0.2540016 0.9599602
[7] 0.1201105 0.9103763 0.5288797 0.7116196
You can also use letters to create a categorical variable (wrap the result in factor() if you need an actual factor).
R
x <- sample(letters[1:4], 20, replace=T)
x
OUTPUT
[1] "d" "c" "c" "a" "d" "c" "d" "a" "d" "b" "c" "b" "b" "d" "c" "a" "d" "b" "d"
[20] "d"
Remember that a data frame is just a collection of vectors. You can create a data frame using data.frame() (or tibble(), which comes from the {tibble} package and is also available after loading {dplyr}). You can then create a vector for each variable.
R
data <- data.frame(x = sample(letters[1:3], 20, replace = TRUE),
                   y = rnorm(20))
head(data)
OUTPUT
x y
1 b -0.4722334
2 a -0.6265888
3 b 1.6929309
4 a 1.6969200
5 a 0.8845575
6 a 1.3272655
However, when sampling at random you must remember to set.seed() before sending it to someone to make sure you both get the same numbers!
Callout
For more handy functions for creating data frames and variables, see the cheatsheet. Some questions need data in a specific format. For these, you can use any of the as.someType functions: as.factor, as.integer, as.numeric, as.character, as.Date, or as.xts (the last one comes from the {xts} package).
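A few of these conversion functions in action, as a minimal sketch with made-up values (only base R functions are used here):

```r
# Character vector converted to a factor with two levels
sex_chr <- c("F", "M", "F")
sex_fct <- as.factor(sex_chr)
levels(sex_fct)

# Text that looks like numbers converted to integers
day_int <- as.integer(c("7", "18"))
day_int

# A number converted to text
year_chr <- as.character(1977)
year_chr

# Text in year-month-day order converted to a Date
survey_date <- as.Date("1977-07-18")
class(survey_date)
```

Matching the type of each dummy variable to the type in your real data makes it more likely that the reprex shows the same behavior.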
Exercise 3: Your turn! (Optional)
Create a data frame with:
A. One categorical variable with 2 levels and one continuous variable.
B. One continuous variable that is normally distributed.
C. Name, sex, age, and treatment type.
3.5 Identifying the key elements of your data
No matter which approach we take for providing a dataset, we need to identify which elements of our original data are necessary. To do so, we propose starting with a few simple questions to investigate your data:
- How many variables do we need?
- What data type (discrete or continuous) is each variable?
- How many levels and/or observations are necessary?
- Should the values be distributed in a specific way?
- Are there any NAs that could be relevant?
Let’s come back to our kangaroo rats example. Here is the minimal code we settled on:
R
# Minimal code [Or whatever we end up with]
krats_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
table(krats_subset$sex, krats_subset$species)
OUTPUT
ordii spectabilis
F 350 0
M 0 626
So we want to create a minimal reproducible ‘dummy’ version of krats_subset. Let’s start by taking a quick look, then answering the questions.
R
head(krats_subset)
OUTPUT
record_id month day year plot_id species_id sex hindfoot_length weight
1 58 7 18 1977 12 DS M 45 NA
2 80 8 19 1977 1 DS M 48 NA
3 104 8 20 1977 11 DS M 43 NA
4 144 8 21 1977 15 DS M 40 NA
5 176 9 11 1977 12 DS M 45 NA
6 182 9 11 1977 21 DS M 49 NA
genus species taxa plot_type
1 Dipodomys spectabilis Rodent Control
2 Dipodomys spectabilis Rodent Spectab exclosure
3 Dipodomys spectabilis Rodent Control
4 Dipodomys spectabilis Rodent Long-term Krat Exclosure
5 Dipodomys spectabilis Rodent Control
6 Dipodomys spectabilis Rodent Long-term Krat Exclosure
Exercise 4
Try to answer the following questions on your own to determine what we need to include in our minimal reproducible dataset:
- How many variables do we need?
- What data type (discrete or continuous) is each variable?
- How many levels and/or observations are necessary?
- Should the values be distributed in a specific way?
- Are there any NAs that could be relevant?
Let’s go over the answers together and build a dataset as we go along!
- How many variables do we need?
We need species, sex, and maybe a third identifier like record_id. This means we potentially need 3 vectors (remember, each column in a data frame is essentially just a vector).
- What data type (discrete or continuous) is each variable?
Species and sex are both discrete (categorical) variables, while record ID would be more continuous.
- How many levels and/or observations are necessary?
Since we are filtering our dataset down to 2 categories for both species and sex, that means we need at least 3 levels in each to start with. In terms of number of observations, there don’t seem to be specific restrictions other than that we probably want at least 1 observation per original category, so 2*3=6, or we can just pick another typical nice number like 10.
- Should the values be distributed in a specific way?
This question probably isn’t going to be relevant most of the time, but it is certainly worth considering. If we needed a longer dataset of measurements, we may have wanted to make sure it was normally distributed. If we needed a longer dataset of counts, we may have wanted to make sure it was Poisson distributed. Or maybe we had bimodal data. But in this case, we have a short dataset and the distribution probably doesn’t matter. We can always come back to this if we are unable to replicate our issue (hint: in which case the distribution may be related to the issue).
- Are there any NAs that could be relevant?
We don’t have any NAs but we do have a blank category under sex. For all we know that could be important, so maybe we want to also make one of our categories blank.
R
# We need 3 variables: species, sex, and record_id
# species and sex are categorical with at least 3 levels, one of which is blank for sex
species <- sample(letters, 3, replace=F)
print(species)
OUTPUT
[1] "o" "g" "k"
R
sex <- c('M','F','')
print(sex)
OUTPUT
[1] "M" "F" ""
R
# record_id is continuous
record_id <- 1:10
print(record_id)
OUTPUT
[1] 1 2 3 4 5 6 7 8 9 10
R
# Now let's go sampling and put it all together
sample_data <- data.frame(
record_id = record_id,
species = sample(species, 10, replace=T),
sex = sample(sex, 10, replace=T)
)
print(sample_data)
OUTPUT
record_id species sex
1 1 g M
2 2 o M
3 3 o
4 4 g F
5 5 g F
6 6 g
7 7 k
8 8 k
9 9 k M
10 10 k F
And just like that we created a ‘dummy’ dataset from scratch! Notice that we could also have compiled the same type of dataset in a single call by creating each vector directly within data.frame().
R
sample2_data <- data.frame(
record_id = 1:10,
species = sample(letters[1:3], 10, replace=T),
sex = sample(c('M','F',''), 10, replace=T)
)
print(sample2_data)
OUTPUT
record_id species sex
1 1 c M
2 2 c M
3 3 c F
4 4 c
5 5 c M
6 6 c
7 7 c F
8 8 a
9 9 c M
10 10 b F
Important: if you want the outputs to be EXACTLY the same each time, but you are using sample(), which is an inherently random process, you must first use set.seed() and share that with your helper too.
R
set.seed(1) # set seed before recreating the sample
sample_data <- data.frame(
record_id = 1:10,
species = sample(letters[1:3], 10, replace=T),
sex = sample(c('M','F',''), 10, replace=T)
)
print(sample_data)
OUTPUT
record_id species sex
1 1 a
2 2 c M
3 3 a M
4 4 b M
5 5 a F
6 6 c F
7 7 c F
8 8 b F
9 9 b
10 10 c M
Callout
Adding a set.seed() at the start of your reprex will ensure anyone else who runs the same code in the same order will always get the same results. However, if using it more generally, you may want to read more about it. For example, in the example below we set a seed of 2 and then run sample(10) twice. You will notice that the output of each sample run is not the same. However, if you run the whole code again, you will see that each of the outputs actually does stay the same.
R
set.seed(2)
sample(10)
OUTPUT
[1] 5 6 9 1 10 7 4 8 3 2
R
sample(10)
OUTPUT
[1] 1 3 6 2 9 10 7 5 4 8
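To see this for yourself, one small check (a sketch, not part of the original lesson code) is to reset the seed and repeat both draws. The object names first_run and second_run are just placeholders.

```r
set.seed(2)
first_run <- list(sample(10), sample(10))   # two consecutive draws

set.seed(2)                                 # reset the seed to the same value
second_run <- list(sample(10), sample(10))  # repeat the same two draws

# TRUE: the whole sequence of draws repeats once the seed is reset
identical(first_run, second_run)
```

The seed fixes the sequence of random numbers, not each individual call, which is why the code has to be run in the same order.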
Great! Now we need to check whether it works within our code and whether it reproduces our issue.
R
# Minimal code [or whatever we end up with]
sample_subset <- sample_data %>% # replace rodents with our sample dataset
filter(species == c("a", "b"), # replace species with those from our sample dataset
sex == c("F", "M")) # this can stay the same because we recreated it the same
table(sample_subset$sex, sample_subset$species)
OUTPUT
a b
F 1 0
M 0 1
It works! Our sample size has unexpectedly been reduced to just 2 observations, when we would have expected 4 (the rows with species a or b and sex F or M), based on the sample_data output above. Wherever the issue lies, we were able to successfully replicate it in our minimal ‘dummy’ dataset.
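One way to double-check how many rows we expected is to tabulate the data and add up the cells we were asking for. This is only a sketch: the data frame below simply retypes the sample_data printed above by hand, so that it does not depend on the random number generator, and the names counts and wanted are placeholders.

```r
# sample_data retyped by hand from the printed output above
sample_data <- data.frame(
  record_id = 1:10,
  species = c("a", "c", "a", "b", "a", "c", "c", "b", "b", "c"),
  sex = c("", "M", "M", "M", "F", "F", "F", "F", "", "M")
)

counts <- table(sample_data$sex, sample_data$species)
wanted <- counts[c("F", "M"), c("a", "b")]  # cells for the sexes and species we asked for
print(wanted)
sum(wanted)  # number of rows we expected the filter to keep
```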
Exercise 5: Your turn!
Now practice doing it yourself. Create a data frame with:
A. One categorical variable with 2 levels and one continuous variable.
B. One continuous variable that is normally distributed.
C. Name, sex, age, and treatment type.
3.6 Using your own data set
Even once you master the art of creating ‘dummy’ datasets, there may be occasions when your data or your issue is too complex for you to replicate the issue. Or maybe you still think using your own data would just be easier.
In cases when you want to make your own data minimal and reproducible, you will want to take a similar approach to what we did in Episode 2 when making our code minimal. Keep what is essential, get rid of the rest. In other words, we will want to subset our data into a smaller, more digestible chunk.
The question still arises: how do I know what is essential?
Use the same guiding questions that we used earlier!
- How many (or rather which) variables do we need?
- What data type is each variable? (less necessary, since we are keeping the actual variables)
- How many levels and/or observations are necessary? (potentially still useful, we don’t want to get rid of more than we need)
- Should the values be distributed in a specific way? (they are as they are, but worth keeping in mind in terms of how we are removing observations)
- Are there any NAs that could be relevant?
Based on our previous answers we end up with:
- We need species, sex, and maybe record_id
- Species and sex are categorical, record_id is a continuous count of our observations.
- As we said earlier, we want 3 each for species and sex, which happens to already be the case. And we could reduce our record_id size to ~10.
- Not really, but we want to make sure that when we reduce the number of observations we still have observations in each of the 3 levels in species and sex.
- No NA’s, but we still don’t know if the blanks are relevant, so let’s make sure we keep at least one.
Now that we have a clearer goal, let’s subset our data.
Useful functions for subsetting a dataset include subset(), head(), tail(), and indexing with [] (e.g., iris[1:4,]). Alternatively, you can use tidyverse functions like select() and filter(). You can also use the same sample() functions we covered earlier.
Note: you should already have an understanding of how to subset or wrangle data using the tidyverse from the R for Ecology lesson. If not, go check it out! [insert link to lesson]
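As a quick refresher, here is a sketch of the base R options on iris, a dataset that ships with R. The object names (first_rows, small_setosa, and so on) are placeholders chosen for this illustration.

```r
# iris ships with base R, so everyone can run this
first_rows <- head(iris, 4)   # first 4 rows
same_rows  <- iris[1:4, ]     # the same rows, using [] indexing
last_rows  <- tail(iris, 3)   # last 3 rows

# subset() can filter rows and pick columns in one call
small_setosa <- subset(iris,
                       Species == "setosa" & Sepal.Length < 4.5,
                       select = c(Species, Sepal.Length))
print(small_setosa)
```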
R
# Remember your minimal code
krats_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
table(krats_subset$sex, krats_subset$species)
OUTPUT
ordii spectabilis
F 350 0
M 0 626
Exercise 6: Think quick!
Which dataset are we trying to make minimal and reproducible? Hint: the two datasets we can see are krats_subset and rodents.
Given that the code that is going wrong is the code that creates krats_subset, we need to create a minimal reproducible version of rodents! We can then insert our new, smaller dataset in place of the original rodents one.
Step 1: select the variables of interest
R
# subset rodent into new_rodent to make it minimal
# Note: there are many ways you could do this!
new_rodent <- rodents %>%
# 1. select the variables of interest
select(record_id, species, sex)
# PAUSE. Does this work so far?
print(new_rodent)
OUTPUT
record_id species sex
1 1 albigula M
2 2 albigula M
3 3 merriami F
4 4 merriami M
5 5 merriami M
6 6 flavus M
7 7 eremicus F
8 8 merriami M
9 9 merriami F
10 10 flavus F
11 11 spectabilis F
12 12 merriami M
13 13 merriami M
14 14 merriami
15 15 merriami F
16 16 merriami F
17 17 spectabilis F
18 18 penicillatus M
19 19 flavus
20 20 spectabilis F
21 21 merriami F
22 22 albigula F
23 23 merriami M
24 24 hispidus M
25 25 merriami M
26 26 merriami M
27 27 merriami M
28 28 merriami M
29 29 penicillatus M
30 30 spectabilis F
31 31 merriami F
32 32 merriami F
33 33 merriami F
[ reached 'max' / getOption("max.print") -- omitted 16115 rows ]
Step 2-5: reduce the number of observations to ~10 while making sure the dataset still contains at least 3 species and at least 3 sexes
While the rest is just one step, it is the trickiest, because this is where we want to ensure the key elements of our original dataset, as defined earlier, are preserved.
Exercise 7: Try it yourself
How would you continue the subsetting pipeline? How could you reduce the number of observations while making sure you still have at least 3 species and 3 sexes left? Hint: there is no single right answer! Trial and error works wonders.
R
set.seed(1)
new_rodents <- rodents %>%
# 1. select the variables of interest
select(record_id, species, sex) %>%
slice_sample(n=4, replace = F, by='sex')
print(new_rodents)
OUTPUT
record_id species sex
1 2359 merriami M
2 16335 albigula M
3 9910 ordii M
4 8278 ordii M
5 12038 merriami F
6 7862 megalotis F
7 9221 albigula F
8 1335 spectabilis F
9 9862 harrisi
10 14979 merriami
11 11333 spilosoma
12 351 leucogaster
The code ran without issues, yay! But do we end up with what we were looking for?
- Do we have ~10 observations? Yes! 12 seems good enough
- Do we have at least 3 species? Yes! We have 8 (we could choose to reduce this further)
- Do we have at least 3 sexes? Yes! M, F, and blank
Great! All of our requirements are fulfilled. Now let’s see if it replicates our issue when we add it to our minimal code.
Note: slice_sample() and similar functions allow you to specify and customize how exactly you want that sample to be taken (check the documentation!). For example, you can specify a proportion of rows to select, specify how to order variables, whether ties should be kept together, or even whether to weight rows by a certain variable. All of this allows you to keep aspects of your dataset that may be relevant and hard to replicate otherwise.
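A few of these options can be sketched on mtcars, a dataset built into R. This is an illustration rather than part of Mickey’s workflow; it assumes a recent version of dplyr, and the object names are placeholders.

```r
library(dplyr)

set.seed(1)  # slice_sample() is random, so fix the seed

# keep a proportion of rows instead of a fixed number
tenth <- mtcars %>% slice_sample(prop = 0.1)

# make rows with higher horsepower more likely to be drawn
weighted <- mtcars %>% slice_sample(n = 5, weight_by = hp)

# slice_max() orders by a variable; with_ties controls whether tied rows are all kept
heaviest <- mtcars %>% slice_max(wt, n = 2, with_ties = FALSE)

nrow(tenth)
nrow(weighted)
nrow(heaviest)
```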
Remember your minimal code:
R
krats_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
table(krats_subset$sex, krats_subset$species)
OUTPUT
ordii spectabilis
F 350 0
M 0 626
We now want to replace rodents with our new_rodents. Do we need to change anything else?
We actually still have ordii and spectabilis as species in our list, so we can keep it as is. Same for sex. So we’re all set!
R
new_subset <- new_rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
The code ran without any issues! But does it replicate our issue?
Take a step back to remind yourself of what you are looking for. What was the issue we had identified?
The number of rows we end up with after the filter is lower than expected.
So what would we expect to see with this new dataset? Since it is nice and short, this makes it a lot easier to predict the outcome.
We are asking for the 2 ordii rows, both males, and the 1 spectabilis row, which is female.
R
table(new_subset$sex, new_subset$species)
OUTPUT
< table of extent 0 x 0 >
Instead we end up with nothing! Why aren’t we getting the rows we are asking for?
Maybe our table is just wrong. Let’s look at the actual dataset we end up with.
R
print(new_subset)
OUTPUT
[1] record_id species sex
<0 rows> (or 0-length row.names)
Still nothing! What is going on?? Well, we certainly replicated our issue. Time to ask for help!
But wait, our dataset is now minimal and relevant, but is it reproducible (accessible outside your device)? Not yet. We created a subset of our original dataset rodents, but this came from a file on our computer. We could share our csv file and add code to load it… but that’s not ideal, and it makes it hard to share our problem on a community site. Remember, the more steps required, the less likely someone will want to help.
Thankfully, there is a nifty function, dput(), that can help us out. Let’s try it and see what happens.
R
dput(new_rodents)
OUTPUT
structure(list(record_id = c(2359L, 16335L, 9910L, 8278L, 12038L,
7862L, 9221L, 1335L, 9862L, 14979L, 11333L, 351L), species = c("merriami",
"albigula", "ordii", "ordii", "merriami", "megalotis", "albigula",
"spectabilis", "harrisi", "merriami", "spilosoma", "leucogaster"
), sex = c("M", "M", "M", "M", "F", "F", "F", "F", "", "", "",
"")), class = "data.frame", row.names = c(NA, -12L))
It spit out a hard-to-read but not excessively long chunk of code. This code, when run, will recreate our new_rodents dataset!
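If you want to convince yourself that dput() output really rebuilds the same object, a small round-trip check is one option. This is a sketch on a tiny made-up data frame; the names small, code_text, and rebuilt are placeholders.

```r
# a tiny made-up data frame (hypothetical values)
small <- data.frame(id = 1:3,
                    species = c("ordii", "merriami", "ordii"))

code_text <- capture.output(dput(small))  # the text that dput() prints
rebuilt <- eval(parse(text = code_text))  # run that text as R code

all.equal(small, rebuilt)                 # TRUE when the rebuilt copy matches
```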
We can also break it down and label it further to help the reader.
R
reprex_data <- structure(list(
# a unique identifier
record_id = c(2359L, 16335L, 9910L, 8278L, 12038L, 7862L, 9221L, 1335L, 9862L, 14979L, 11333L, 351L),
# a list of species
species = c("merriami", "albigula", "ordii", "ordii", "merriami", "megalotis", "albigula", "spectabilis", "harrisi", "merriami", "spilosoma", "leucogaster"),
# a list of sexes. Note: this includes some blanks!
sex = c("M", "M", "M", "M", "F", "F", "F", "F", "", "", "", "")),
class = "data.frame", row.names = c(NA, -12L))
print(reprex_data)
OUTPUT
record_id species sex
1 2359 merriami M
2 16335 albigula M
3 9910 ordii M
4 8278 ordii M
5 12038 merriami F
6 7862 megalotis F
7 9221 albigula F
8 1335 spectabilis F
9 9862 harrisi
10 14979 merriami
11 11333 spilosoma
12 351 leucogaster
Ta-da! Now they can easily recreate our minimal dataset and use it to run the minimal code. However, was that really easier than creating a dataset from scratch?
And sure, you could just use dput() on your original dataset. It would work. But that wouldn’t be very considerate to those who are trying to help. Try it.
R
#dput(rodents)
It becomes a huge chunk of code, when clearly we don’t need all of that.
Remember, we want to keep everything minimal for many reasons:
- to make it easy for our helpers to understand our data and code
- to allow helpers to quickly focus their efforts on the right factors
- to make the problem-solving process as easy and painless as possible
- bonus: to help us better understand and zero-in on the source of our issue, often stumbling upon a solution along the way
Nevertheless, it remains an option for when your data appears too complex or you are not quite sure where your error lies and therefore are not sure what minimal components are needed to reproduce the example.
3.7 Using a built-in R dataset
The last approach we mentioned is to build a minimal reproducible dataset based on the datasets that already exist within R (and that everyone therefore has access to).
A list of readily available datasets can be found using library(help="datasets"). You can then use ? in front of the dataset name to get more information about the contents of the dataset.
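If you prefer to browse the list as an object rather than in a help window, data() can return the same information. The sketch below is one way to do that; the variable name available is a placeholder.

```r
# data() returns an object whose $results matrix lists the available datasets
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# a quick look at one candidate before opening its help page with ?HairEyeColor
class(HairEyeColor)
dim(HairEyeColor)
```

Checking class() early is useful because not every built-in dataset is a data frame.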
For a more detailed discussion of the benefits of using this approach see [insert something]
This approach essentially blends the skills we already learned in the first two. We need to identify a dataset with appropriate variables that match the “key elements” of our original dataset. We then need to further reduce that dataset to a minimal, relevant number of rows. Once again, we can use the previously learned functions such as select(), filter(), or sample().
Since we already practiced everything you need, why not try it yourself?
Exercise 8: Extra Challenge
Using the “HairEyeColor” dataset, create a minimal reproducible dataset for the same issue and minimal code we have been exploring.
1. Start by using ?HairEyeColor to read a description of the dataset and View(HairEyeColor) to see the actual dataset.
2. Which variables would be a good match for our situation? What are our requirements?
3. How can we subset this dataset to make it minimal and still replicate our issue?
Remember, there are many possible solutions! The most important feature is that the example dataset can replicate the issue when used within our minimal code.
The following is 1 possible solution:
We selected Hair and Eye as replacements for species and sex because they are both categorical and have at least 3 levels. We don’t strictly need anything else. We will call our new rodents replacement hyc. We set a seed because we are taking a random sample and want it to be repeatable.
R
set.seed(1)
# the dummy dataset
hyc <- as.data.frame(HairEyeColor) %>% # HairEyeColor is a table, so convert it to a data frame first
select(Hair, Eye) %>%
slice_sample(n=10)
print(hyc)
OUTPUT
Hair Eye
1 Black Hazel
2 Blond Brown
3 Red Blue
4 Black Brown
5 Brown Brown
6 Red Blue
7 Red Hazel
8 Brown Green
9 Brown Brown
10 Red Brown
R
# the minimal code
hyc_subset <- hyc %>%
filter(Hair == c('Red','Blond'),
Eye == c('Blue', 'Brown'))
# illustrating the issue
table(hyc_subset$Hair, hyc_subset$Eye)
OUTPUT
Brown Blue Hazel Green
Black 0 0 0 0
Brown 0 0 0 0
Red 0 1 0 0
Blond 1 0 0 0
R
# the whole subset
print(hyc_subset)
OUTPUT
Hair Eye
1 Blond Brown
2 Red Blue
R
# But we know there are more!
table(hyc$Hair, hyc$Eye) # Reds have 2 Blue and 1 Brown, and Blonds have 1 Brown!
OUTPUT
Brown Blue Hazel Green
Black 1 0 1 0
Brown 2 0 0 1
Red 1 2 1 0
Blond 1 0 0 0
Callout
What about NAs? If your data has NAs and they may be causing the problem, it is important to include them in your example dataset. You can find where there are NAs in your dataset by using is.na, for example: is.na(krats$weight). This will return a logical vector with TRUE where the cell contains an NA and FALSE where it does not. The simplest way to include NAs in your dummy dataset is to directly include them in vectors: x <- c(1,2,3,NA). You can also subset a dataset that already contains NAs, change some of the values to NAs using mutate(ifelse()), or substitute all the values in a column by sampling from within a vector that contains NAs.
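The mutate(ifelse()) route could look something like the sketch below. The data frame dummy and its values are hypothetical, made up purely for illustration.

```r
library(dplyr)

# hypothetical dummy data
dummy <- data.frame(id = 1:6,
                    weight = c(40, 52, 47, 61, 38, 55))

# turn the weights of records 2 and 5 into NA, keep the rest unchanged
dummy_na <- dummy %>%
  mutate(weight = ifelse(id %in% c(2, 5), NA, weight))

print(dummy_na)
is.na(dummy_na$weight)  # TRUE where the value is now NA
```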
One important thing to note when subsetting a dataset with NAs is that subsetting methods that use a condition to match rows won’t necessarily match NA values in the way you expect. For example:
R
test <- data.frame(x = c(NA, NA, 3, 1), y = rnorm(4))
test %>% filter(x != 3)
OUTPUT
x y
1 1 -0.3053884
R
# you might expect that the NA values would be included, since “NA is not equal to 3”. But actually, the expression NA != 3 evaluates to NA, not TRUE. So the NA rows will be dropped!
# Instead you should use is.na() to match NAs
test %>% filter(x != 3 | is.na(x))
OUTPUT
x y
1 NA 0.4874291
2 NA 0.7383247
3 1 -0.3053884
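The same behaviour can be seen without any packages. This base R sketch uses a plain vector (the names x and keep are placeholders) to show what the comparison returns and two ways of handling it.

```r
NA != 3             # NA, not TRUE

x <- c(NA, NA, 3, 1)
keep <- x != 3      # NA NA FALSE TRUE

x[which(keep)]      # which() ignores the NAs, so only 1 is kept
x[keep | is.na(x)]  # explicitly asking for NAs keeps them: NA NA 1
```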
Key Points
- A minimal reproducible dataset contains (a) the minimum number of lines, variables, and categories, in the correct format, to replicate your issue; and (b) it must be fully reproducible, meaning that someone else can access or run the same code to reproduce the dataset needed for your reprex.
- To make it accessible, you can create a dataset from scratch using data.frame(), you can use an R dataset like cars, or you can use a subset of your own dataset and then use dput() to generate reproducible code.
Bonus: Additional Practice
Here are some more practice exercises if you wish to test your knowledge
Extra Practice? Would need to change from mpg, since that’s from ggplot
For each of the following, identify which data are necessary to create a minimal reproducible dataset using mpg.
- We want to know how the highway mpg has changed over the years
- We need a list of all “types” of cars and their fuel type for each manufacturer
- We want to compare the average city mpg for a compact car from each manufacturer
(I copied these from exercise 6 in the google doc… but I’m not sure that they are getting at the point of the lesson…)
Another Exercise
Each of the following examples needs your help to create a dataset that will correctly reproduce the given result and/or warning message when the code is run. Fix the dataset shown or fill in the blanks so it reproduces the problem.
-
set.seed(1)
sample_data <- data.frame(fruit = rep(c("apple", "banana"), 6), weight = rnorm(12))
ggplot(sample_data, aes(x = fruit, y = weight)) + geom_boxplot()
HELP: how do I insert an image from clipboard?? Is it even possible?
-
bodyweight <- c(12, 13, 14, , )
max(bodyweight)
[1] NA
-
sample_data <- data.frame(x = 1:3, y = 4:6)
mean(sample_data$x)
[1] NA
Warning message: In mean.default(sample_data$x): argument is not numeric or logical: returning NA
-
sample_data <- ____
dim(sample_data)
NULL
- “fruit” needs to be a factor and the order of the levels must be specified:
sample_data <- data.frame(fruit = factor(rep(c("apple", "banana"), 6), levels = c("banana", "apple")), weight = rnorm(12))
- one of the blanks must be an NA
- ?? + what’s really the point of this one?
sample_data <- data.frame(x = factor(1:3), y = 4:6)
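A quick way to check candidate answers is simply to run them. The sketch below shows one possible set. The vector used for the last item is our own guess at an answer, since any object without dimensions would do, and names such as factor_mean and plain_vector are placeholders.

```r
# one of the blanks must be NA for max() to return NA
bodyweight <- c(12, 13, 14, NA, 15)
max(bodyweight)

# mean() of a factor column returns NA (with a warning, silenced here)
sample_data <- data.frame(x = factor(1:3), y = 4:6)
factor_mean <- suppressWarnings(mean(sample_data$x))
factor_mean

# a plain vector has no dimensions, so dim() returns NULL (assumed answer)
plain_vector <- c(1, 2, 3)
dim(plain_vector)
```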
Great work! We’ve created a minimal reproducible example. In the next episode, we’ll learn about reprex, a package that can help us double-check that our example is reproducible by running it in a clean environment. (As an added bonus, reprex will format our example nicely so it’s easy to post to places like Slack, GitHub, and StackOverflow.)
Content from Asking your question
Last updated on 2025-05-27
Overview
Questions
- How can I verify that my example is reproducible?
- How can I easily share a reproducible example with a mentor or helper, or online?
- How do I ask a good question?
Objectives
- Use the reprex package to test whether an example is reproducible.
- Use the reprex package to format reprexes for posting online.
- Understand the benefits and drawbacks of different help forums.
- Have a road map to follow when posting a question to make sure it’s a good question.
- Understand what the {reprex} package does and doesn’t do.
Congratulations on finishing your reprex! In this episode, we will introduce a tool, the reprex package. This package will help you check that your example is truly reproducible and format it nicely to make it easy to present to a helper, either in person or online.
There are three principles to remember when you think about sharing your reprex with other people: Reproducibility, formatting, and context.
1. Reproducibility
Haven’t we already talked a lot about reproducibility?
Yes! We have discussed variables and packages, minimal datasets, and making sure that the problem is meaningfully reproduced by the data that you choose. But there are some reasons that a code snippet that appears reproducible in your own R session might not actually be runnable by someone else.
- You forgot to account for the origins of some functions and/or variables. We went through our code methodically, but what if we missed something? It would be nice to confirm that the code is as self-contained as we thought it was.
- Your code accidentally relies on objects in your R environment that won’t exist for other people. For example, imagine you defined a function my_awesome_custom_function() in a project-specific functions.R script, and your code calls that function.

"my_awesome_custom_function" is lurking in my R environment. I must have defined it a while ago and forgotten! Code that includes this function will not run for someone else unless the function definition is also included in the reprex.
R
my_awesome_custom_function("the kangaroo rat dataset")
ERROR
Error in my_awesome_custom_function("the kangaroo rat dataset"): could not find function "my_awesome_custom_function"
I might conclude that this code is reproducible–after all, it works when I run it! But unless I remembered to include the function definition in the reprex itself, nobody will be able to run the code.
A corrected reprex would look like this:
R
my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")
OUTPUT
[1] "the kangaroo rat dataset is awesome!"
- Your code depends on some particular characteristic of your R or RStudio environment that is not the same as your helper’s environment. [more details here]
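An informal way to spot objects your code may be leaning on is to look at what is currently defined in your session. The sketch below uses a made-up helper called my_helper; it is only an illustration, not a full dependency check.

```r
# define a helper, as if it had been created earlier in the session (hypothetical name)
my_helper <- function(x) x + 1

ls()                                  # names of the objects in the current environment
exists("my_helper")                   # TRUE here, so code calling my_helper() runs for us
exists("a_function_i_never_defined")  # a name that was never defined is not found
```

Anything that shows up in ls() and is used by your example needs to be defined inside the example itself.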
There are so many components to remember when thinking about reproducibility, especially for more complex problems. Wouldn’t it be nice if we had a way to double check our examples? Luckily, the reprex package will help you test your reprexes in a clean, isolated environment to make sure they’re actually reproducible.
The most important function in the reprex package is called reprex(). Here’s how to use it.
First, install and load the reprex package.
R
#install.packages("reprex")
library(reprex)
Second, write some code. This is your reproducible example.
R
(y <- 1:4)
OUTPUT
[1] 1 2 3 4
R
mean(y)
OUTPUT
[1] 2.5
Third, highlight that code and copy it to your clipboard (e.g. Cmd + C on Mac, or Ctrl + C on Windows).
Finally, type reprex() into your console.
# (with the target code snippet copied to your clipboard already...)
# In the console:
reprex()
reprex will grab the code that you copied to your clipboard and run that code in an isolated environment. It will return a nicely formatted reproducible example that includes your code and any results, plots, warnings, or errors generated.
The generated output will be on your computer’s clipboard by default. Then, you can paste it into GitHub, StackOverflow, Slack, or another venue.
Callout: The reprex package workflow
The reprex package workflow takes some getting used to. Instead of copying your code into the function, you simply copy it to the clipboard (a mysterious, invisible place to most of us) and then let the blank, empty reprex() function go over to the clipboard by itself and find it.
And then the completed, rendered reprex replaces the original code on the clipboard and all you need to do is paste, not copy and paste.
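If the clipboard feels too mysterious, reprex() can also take the code directly as an expression wrapped in curly braces. The sketch below assumes the reprex package is installed and that pandoc (which RStudio bundles) is available for rendering; html_preview = FALSE just stops a preview window from opening. The names can_render and rendered are placeholders.

```r
# only try to render if reprex and pandoc are available
can_render <- requireNamespace("reprex", quietly = TRUE) &&
  rmarkdown::pandoc_available()

rendered <- NULL
if (can_render) {
  rendered <- reprex::reprex({
    y <- 1:4
    mean(y)
  }, html_preview = FALSE)
  cat(rendered, sep = "\n")  # the formatted reprex, as lines of text
}
```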
Let’s practice this one more time. Here’s some very simple code:
R
library(ggplot2)
library(dplyr)
mpg %>%
ggplot(aes(x = factor(cyl), y = displ))+
geom_boxplot()

Let’s highlight the code snippet, copy it to the clipboard, and then run reprex() in the console.
# In the console:
reprex()
The result, which was automatically placed onto my clipboard and which I pasted here, looks like this:
R
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mpg %>%
ggplot(aes(x = factor(cyl), y = displ))+
geom_boxplot()

Created on 2024-12-29 with reprex v2.1.1
Nice and neat! It even includes the plot produced, so I don’t have to take screenshots and figure out how to attach them to an email or something.
The formatting is great, but reprex really shines when you treat it as a helpful collaborator in your process of building a reproducible example (including all dependencies, providing minimal data, etc.).
Let’s see what happens if we forget to include library(ggplot2) in our small reprex above.
R
library(dplyr)
mpg %>%
ggplot(aes(x = factor(cyl), y = displ))+
geom_boxplot()

As before, let’s copy that code to the clipboard, run reprex() in the console, and paste the result here.
# In the console:
reprex()
R
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mpg %>%
ggplot(aes(x = factor(cyl), y = displ))+
geom_boxplot()
#> Error in ggplot(., aes(x = factor(cyl), y = displ)): could not find function "ggplot"
Created on 2024-12-29 with reprex v2.1.1
Now we get an error message indicating that R cannot find the function ggplot! That’s because we forgot to load the ggplot2 package in the reprex.
This happened even though we had ggplot2 already loaded in our own current RStudio session. reprex deliberately ignores any packages already loaded, running the code in a clean, isolated R session that’s different from the R session we’ve been working in. This simulates the experience of someone else trying to run your reprex on their own computer.
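To see which packages are attached in your own session (and so might be silently propping up your code), base R’s search() is one quick option. This is a sketch; the variable name attached is a placeholder.

```r
attached <- search()  # everything attached in THIS session
print(attached)

# is ggplot2 attached here? (it may not be in your helper's session)
"package:ggplot2" %in% attached
```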
Let’s return to our previous example with the custom function.
R
my_awesome_custom_function("the kangaroo rat dataset")
OUTPUT
[1] "the kangaroo rat dataset is awesome!"
# In the console:
reprex()
R
my_awesome_custom_function("the kangaroo rat dataset")
#> Error in my_awesome_custom_function("the kangaroo rat dataset"): could not find function "my_awesome_custom_function"
Created on 2024-12-29 with reprex v2.1.1
By contrast, if we include the function definition:
R
my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")
OUTPUT
[1] "the kangaroo rat dataset is awesome!"
# In the console:
reprex()
R
my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("the kangaroo rat dataset")
#> [1] "the kangaroo rat dataset is awesome!"
Created on 2024-12-29 with reprex v2.1.1
Testing it out
Now that we’ve met our new reprex-making collaborator, let’s use it to test out the reproducible example we created in the previous episode.
Here’s the code we wrote:
R
# Mickey's reprex script
# XXX THIS IS NOT FINISHED--NEED TO INSERT FINAL DATA EXAMPLE!
# Load necessary packages to run the code
library(ggplot2)
library(dplyr) # provides the %>% pipe used below
rodents_subset %>% # XXX replace with simulated dataset
ggplot(aes(y = weight, x = common_name, fill = sex)) +
geom_boxplot() # wait, why does this look weird?
# Investigate
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)
Time to find out if our example is actually reproducible! Let’s copy it to the clipboard and run reprex(). Since we want to give Jordan a runnable R script, we can use venue = "r".
# In the console:
reprex(venue = "r")
It worked!
R
#replace with final output
Now we have a beautifully-formatted reprex that includes runnable code and all the context needed to reproduce the problem.
Callout: Including information about your R session
Another nice thing about reprex is that you can choose to include information about your R session, in case your error has something to do with your R settings rather than the code itself. You can do that using the session_info argument to reprex().
For example, try running the following reprex, setting session_info = TRUE, and observe what happens.
R
library(ggplot2)
library(dplyr)
mpg %>%
ggplot(aes(x = factor(cyl), y = displ))+
geom_boxplot()

# In the console:
reprex(session_info = TRUE)
2. Formatting
The output of reprex() is markdown, which can easily be copied and pasted into many sites/apps. However, different places have slightly different formatting conventions for markdown. reprex lets you customize the output of your reprex according to where you’re planning to post it.
The default, venue = "gh", gives you “GitHub-Flavored Markdown”, which is a particular type of markdown that works well when posted on GitHub. Another format you might want is “r”, which gives you a runnable R script, with commented output interleaved with pieces of code.
Check out the formatting options in the help file with ?reprex, and try out a few depending on the destination of your reprex!
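To compare a few venues side by side, one option is to render the same tiny example several times. This is a sketch that assumes reprex and pandoc are available; it passes the code as an expression rather than through the clipboard, and advertise = FALSE simply drops the "Created on" footer.

```r
# only try to render if reprex and pandoc are available
can_render <- requireNamespace("reprex", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  for (v in c("gh", "r", "slack")) {
    out <- reprex::reprex({
      y <- 1:4
      mean(y)
    }, venue = v, html_preview = FALSE, advertise = FALSE)
    cat("---- venue =", v, "----\n")
    cat(out, sep = "\n")  # the rendered reprex for this venue
  }
}
```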
Callout: reprex can’t do everything for you
People often mention reprex as a useful tool for creating reproducible examples, but it can’t do the work of crafting the example for you! The package doesn’t locate the problem, pare down the code, create a minimal dataset, or automatically include package dependencies.
A better way to think of reprex is as a tool to check your work as you go through the process of creating a reproducible example, and to help you polish up the result.
3. Context
The final thing to consider when preparing your reproducible example is adding some context so that helpers know a little about your problem and what you’re trying to achieve.
Some context to include:
- Tell us a little bit about your problem. One sentence should be enough. What domain are you working in? What are these data about? What do the relevant variables mean?
This is particularly important if you have provided a subset of your own data instead of creating a minimal dataset from scratch. Your helper will need to interpret the column names and understand what type of data they are looking at.
- Explain what you expected to happen, or what you were trying to achieve, and how it is different from what happened instead.
The contrast between what happened and what was supposed to happen is particularly important for semantic errors, in which the “error” is not always obvious when running the code. The code ran–but you have decided that the output is “wrong” somehow, or that it “didn’t work”. Why? How do you know that? Your helper needs to know that what you got was not what you expected, and they need to know what you expected in order to help you achieve that outcome.
For example, let’s say you made the following plot:
R
rodents %>%
ggplot(aes(x = plot_type, y = hindfoot_length, color = plot_type))+
geom_boxplot()
WARNING
Warning: Removed 2003 rows containing non-finite outside the scale range
(`stat_boxplot()`).

This plot doesn’t look the way you want it to look, and you’re not sure why, so you decide to make a reprex. You load the required packages (ggplot2 and dplyr), and you substitute an existing dataset, mpg, instead of rodents, which you know your helpers won’t have access to. Your reprex looks like this:
R
library(ggplot2)
library(dplyr)
mpg %>%
ggplot(aes(x = class, y = displ, color = class))+
geom_boxplot()
It’s minimal! It’s reproducible! But… what is the problem? This is a perfectly reasonable plot, so without context, your helper won’t know what’s wrong. Let’s explain it to them.
R
library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class)) +
  geom_boxplot()

R
# I want to make a boxplot where each of the categories has its own color. But even though I set color = class here, only the outlines of the boxplots got colored in, and the inside is still white. How do I change this so that the whole box is colored in?
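With that context, a helper can see that the question is about the inside of the boxes rather than their outlines. As a hedged sketch of the kind of answer this description makes possible, the code below maps the fill aesthetic to class as well. It assumes the asker wants both the outline and the interior colored by class; the alpha value is an illustrative choice, not part of the original question.

```r
library(ggplot2)

# Map fill as well as color so the inside of each box takes the class color
p <- ggplot(mpg, aes(x = class, y = displ, color = class, fill = class)) +
  geom_boxplot(alpha = 0.5)  # semi-transparent fill keeps the median line visible
p
```

Without the sentence describing the expected result, a helper would have no way to know that this is the change being asked for.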
Exercise 1: What makes a good description?
For each of the following reprexes, improve the description given.
- I’m trying to plot the displacements of different cars. I made this boxplot, but the boxes are showing up in the wrong order. How do I fix this? Here is my minimal reproducible example.
R
library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class)) +
  geom_boxplot()

- I’m working with this data about cars. The class column refers to the type of car–for example, “compact” class means that the car is quite small, while “pickup” would be a pickup truck. For each car, I also have information about the city and highway mileage, and the transmission, and the number of cylinders, as well as the displacement. This dataset has 234 rows and 11 columns, although this is an example dataset because my real dataset is much larger and has more like 500,000 rows. Anyway, in this example, I want to make a boxplot of displacement where each of the categories has its own color. But even though I set color = class here, only the outlines of the boxplots got colored in, and the inside is still white. How do I make the inside a different color? Here’s a reprex.
R
library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = class, y = displ, color = class)) +
  geom_boxplot()

- Help, my code isn’t working! It says I have too many elements. I made a reprex so you can see the data and the error message. I hope that’s helpful. Thank you so much!
R
library(ggplot2)
table(mpg)
ERROR
Error in table(mpg): attempt to make a table with >= 2^31 elements
As we wrap up this lesson, let’s work on adding some context for Mickey’s reprex so they’ll be ready to send it to Remy or post it online.
Exercise 2: Adding context
Working with the person next to you, write a brief description of Mickey’s problem that they could include with their reprex when they post it online.
Make sure that the description gives a little bit of background, describes what Mickey was trying to achieve, and describes what happened instead.
When you’re done, compare notes between the groups and see if you can come up with a final reprex for Mickey!
Key Points
- The reprex package makes it easy to format and share your reproducible examples.
- The reprex package helps you test whether your reprex is reproducible, and also helps you prepare the reprex to share with others.
- Following a certain set of steps will make your questions clearer and likelier to get answered.