Minimal reproducible code
Last updated on 2025-06-24
Estimated time: 75 minutes
Overview
Questions
- Why is it important to make a minimal code example?
- Which part of my code is causing the problem?
- Which parts of my code should I include in a minimal example?
- How can I tell whether a code snippet is reproducible or not?
- How can I make my code reproducible?
Objectives
- Explain the value of a minimal code snippet.
- Identify the problem area of a script.
- Identify supporting parts of the code that are essential to include.
- Simplify a script down to a minimal code example.
- Evaluate whether a piece of code is reproducible as is or not. If not, identify what is missing.
- Edit a piece of code to make it reproducible.
- Have a road map to follow to simplify your code.
- Describe the {reprex} package and its uses.
Making a reprex
Simplify the code
When asking someone else for help, it is important to simplify your code as much as possible to make it easier for the helper to understand what is wrong. Simplifying code helps to reduce frustration and overwhelm when debugging an error in a complicated script. The more that we can make the process of helping easy and painless for the helper, the more likely that they will take the time to help.
Callout
Depending on how closely you have been following the lesson and which challenges you have attempted, your script may not look exactly like Mickey’s. That’s okay!
Mickey has written a lot of code so far. The code is also a little messy–for example, after fixing the previous errors, they sometimes commented out the old code and kept it for future reference.
Create a new script
To make the task of simplifying the code less overwhelming, let’s create a separate script for our reprex. This will let us experiment with simplifying our code while keeping the original script intact.
Let’s create and save a new, blank R script and give it a name, such as “reprex-script.R”
Making an R script
There are several ways to make an R script:
- File > New File > R Script
- Click the white square with a green plus sign at the top left corner of your RStudio window
- Use a keyboard shortcut: Cmd + Shift + N (on a Mac) or Ctrl + Shift + N (on Windows)
We’re going to start by copying over all of our code, so we have an exact copy of the full analysis script.
R
# Minimal reproducible example script
# Load packages and data
library(tidyverse)
surveys <- read_csv("data/surveys_complete_77_89.csv")
# Take a look at the data
glimpse(surveys)
min(surveys$year)
max(surveys$year)
# Make some plots
ggplot(surveys, aes(x = taxa)) + geom_bar()
table(surveys$taxa)
?table
table(surveys$taxa, exclude = NULL)
# Just rodents
rodents <- surveys %>% filter(taxa == "Rodent")
rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())
ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))
# Just k-rats
krats <-rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)
ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
ggplot(rodents, aes(x = species, fill = sex))+
geom_bar()
rodents_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
# Add common names
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
rodents_subset <- left_join(rodents_subset, common_names, by = "species")
# Explore k-rat weights
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model) # this looks weird
rodents_subset %>%
ggplot(aes(y = weight, x = common_name, fill = sex)) +
geom_boxplot() # still looks weird
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)
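If you would rather work from the console than copy and paste, the copying step could be sketched with base R's file.copy(). The file names and the temporary folder below are assumptions made only to keep the example self-contained; substitute the path to your own analysis script.

```r
# Hypothetical file names inside a temporary folder (assumptions for illustration)
work_dir <- tempfile("reprex-demo-")
dir.create(work_dir)
original_script <- file.path(work_dir, "analysis-script.R")
reprex_script <- file.path(work_dir, "reprex-script.R")

# Stand-in for the full analysis script
writeLines(c("library(tidyverse)", "glimpse(surveys)"), original_script)

# Copy the script; overwrite = FALSE refuses to clobber an existing file
copied <- file.copy(original_script, reprex_script, overwrite = FALSE)

# The original is untouched and the copy has the same contents
same_contents <- identical(readLines(original_script), readLines(reprex_script))
```

From here on, all edits happen in the copy, so the original analysis stays intact.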
Now, we will follow a process:
1. Identify the symptom of the problem.
2. Remove a piece of code to make the reprex more minimal.
3. Re-run the reprex to make sure the reduced code still demonstrates the problem–check that the symptom is still present.
In this case, the symptom is that we are missing rows in rodents_subset that were present in rodents and should not have been removed!
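It can help to turn the symptom into a check that you re-run after every deletion. The sketch below uses a small made-up data frame (toy, an assumption for illustration, not the real survey data) and base R subsetting in place of filter(), so that it runs without any packages. It compares the number of rows in the subset with the number the full table says we should have.

```r
# Made-up stand-in for the rodents data (hypothetical values)
toy <- data.frame(
  species = c("ordii", "ordii", "spectabilis", "spectabilis",
              "ordii", "spectabilis", "merriami", "merriami"),
  sex = c("F", "M", "F", "M", "M", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# Same comparison as in the script, written with base R subsetting
toy_subset <- toy[toy$species == c("ordii", "spectabilis") &
                    toy$sex == c("F", "M"), ]

# How many rows of these species and sexes exist in the full data?
full_counts <- table(toy$sex, toy$species)
expected_rows <- sum(full_counts[c("F", "M"), c("ordii", "spectabilis")])

# The symptom: the subset has fewer rows than the full table says it should
symptom_present <- nrow(toy_subset) < expected_rows
print(c(subset = nrow(toy_subset), expected = expected_rows))
```

As long as symptom_present stays TRUE after you delete a line, the reduced script still demonstrates the problem.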
Let’s start by identifying pieces of code that we can probably remove. A good start is to look for lines of code that do not create variables for later use, or lines that add complexity to the analysis that is not relevant to the problem at hand.
We can start by removing the broken code that we commented out earlier. Also, adding the date column is not directly relevant to the current problem. Let’s go ahead and remove those pieces of code. Now our script looks like this:
R
# Minimal reproducible example script
library(tidyverse)
surveys <- read_csv("data/surveys_complete_77_89.csv")
glimpse(surveys)
min(surveys$year)
max(surveys$year)
# Read in the data
ggplot(x = taxa) + geom_bar()
ggplot(aes(x = surveys$taxa)) + geom_bar()
ggplot(surveys, aes(x = taxa)) + geom_bar()
table(surveys$taxa)
?table
table(surveys$taxa, exclude = NULL)
rodents <- surveys %>% filter(taxa == "Rodent")
rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())
ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))
ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")
ggplot(rodents) + geom_tile(aes(month, plot_type))
krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus
ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
krats
print(rodents %>% count(genus))
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)
ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
ggplot(rodents, aes(x = species, fill = sex))+
geom_bar()
rodents_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
## Adding common names
common_names <- data.frame(species = unique(rodents_subset$species), common_name = c("Ord's", "Banner-tailed"))
common_names
common_names <- data.frame(species = sort(unique(rodents_subset$species)), common_name = c("Ord's", "Banner-Tailed"))
rodents_subset <- left_join(rodents_subset, common_names, by = "species")
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
summary(weight_model)
rodents_subset %>%
ggplot(aes(y = weight, x = common_name, fill = sex)) +
geom_boxplot()
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)
When we run this code, we can confirm that it still demonstrates our problem. There are still many rows missing from rodents_subset.
We’ve made progress on minimizing our code, but we still have a ways to go. This script is still pretty long! Let’s identify more pieces of code that we can remove.
Exercise 2: Minimizing code
Which other lines of code can you remove to make this script more minimal? After removing each one, be sure to re-run the code to make sure that it still reproduces the error.
- [Peter’s episode code]
- Visualizing sex by species (ggplot) can be removed because it generates a plot but does not create any variables that are used later.
- Filtering to only rodents can be removed because later we filter to only two species in particular.
- Adding common names can be removed because we don’t need them to show the problem. This one is tricky because technically we did use the common names in the rodents_subset plot. But is that plot really necessary? We can still demonstrate the problem using the table() lines of code at the end. Also, we could still make the equivalent plot using the species column instead of the common_name column, and it would demonstrate the same thing!
- The weight model and the summary can be removed.
A totally minimal script would look like this:
R
rodents <- read.csv("data/surveys_complete_77_89.csv")
rodents_subset <- rodents %>%
filter(species == c("ordii", "spectabilis"),
sex == c("F", "M"))
table(rodents_subset$sex, rodents_subset$species)
table(rodents$sex, rodents$species)
Great, now we have a totally minimal script!
However, we’re not done yet.
Exercise 3: The problem area is not enough
Let’s suppose that Mickey has created the minimal problem area script shown above. They email this script to Remy so that Remy can help them debug the code.
Remy opens up the script and tries to run it on their computer, but it doesn’t work.
- What do you think will happen when Remy tries to run the code from this reprex script?
- What do you think Mickey should do next to improve the minimal reproducible example?
We haven’t yet included enough code to allow a helper, such as Remy, to run the code on their own computer. If Remy tries to run the reprex script in its current state, they will encounter errors because they don’t have access to the same R environment that Mickey does.
Include dependencies
R code consists primarily of functions and variables. In order to make our minimal examples truly reproducible, we have to give our helpers access to all the functions and variables that are necessary to run our code.
First, let’s talk about functions. Functions in R typically come from packages. You can access them by loading the package into your environment.
To make sure that your helper has access to the packages necessary to run your reprex, you will need to include calls to library() for whichever packages are used in the code. For example, if your code uses the function lmer from the lme4 package, you would have to include library(lme4) at the top of your reprex script to make sure your helper has the lme4 package loaded and can run your code.
Default packages
Some packages, such as {base} and {stats}, are loaded in R by default, so you might not have realized that a lot of functions, such as dim, colSums, factor, and length, actually come from those packages!
You can see a complete list of the functions that come from the {base} and {stats} packages by running library(help = "base") or library(help = "stats").
Let’s do this for our own reprex. We can start by identifying all the functions used, and then we can figure out where each function comes from to make sure that we tell our helper to load the right packages.
The first function used in our example is ggplot(), which comes from the package ggplot2. Therefore, we know we will need to add library(ggplot2) at the top of our script.
The function geom_boxplot() also comes from ggplot2. We also used the function table(). Running ?table tells us that the table function comes from the package {base}, which is automatically installed and loaded when you use R–that means we don’t need to include library(base) in our script.
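If you have several functions to look up, reading each help page gets slow. One possible shortcut is find() from the {utils} package, which reports where a name lives on the search path. This is only a sketch: find() can only see packages that are currently attached, so a function from a package you have not loaded comes back as NA here. The vector functions_used is a name chosen for this illustration.

```r
# Names of the functions we want to trace back to a package
functions_used <- c("table", "dim", "colSums", "lm")

# find() returns entries such as "package:base"; keep the first match for each
origins <- vapply(functions_used, function(f) find(f)[1], character(1))
print(origins)
```

Anything reported as package:base or package:stats needs no library() call; anything else should get one at the top of the reprex.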
Our reprex script now looks like this:
R
# Mickey's reprex script
# Load necessary packages to run the code
library(ggplot2)
library(dplyr) # provides the %>% pipe used below
rodents_subset %>%
  ggplot(aes(y = weight, x = common_name, fill = sex)) +
  geom_boxplot() # wait, why does this look weird?
ERROR
Error: object 'rodents_subset' not found
R
# Investigate
table(rodents_subset$sex, rodents_subset$species)
ERROR
Error: object 'rodents_subset' not found
R
table(rodents$sex, rodents$species)
OUTPUT
    albigula eremicus flavus fulvescens fulviventer harrisi hispidus
  F      474      372    222         46           3       0       68
  M      368      468    302         16           2       0       42

    leucogaster maniculatus megalotis merriami ordii penicillatus  sp.
  F         373         160       637     2522   690          221    4
  M         397         248       680     3108   792          155    5

    spectabilis spilosoma taylori torridus
  F        1135         1       0      390
  M        1232         1       3      441
Installing vs. loading packages
But what if our helper doesn’t have all of these packages installed? Doesn’t that mean the code is not reproducible?
Typically, we don’t include install.packages() in our code for each of the packages that we include in the library() calls, because install.packages() is a one-time piece of code that doesn’t need to be repeated every time the script is run. We assume that our helper will see library(specialpackage) and know that they need to go install “specialpackage” on their own.
Technically, this makes that part of the code not reproducible! But it’s also much more “polite”. Our helper might have their own way of managing package versions, and forcing them to install a package when they run our code risks messing up their workflow. It is a common convention to stick with library() and let them figure it out from there.
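On the helper's side, a quick check of which packages are missing might look like the sketch below. requireNamespace() only reports whether a package can be loaded; it does not install anything, so it respects the helper's own setup. The name "specialpackage" is the placeholder from the text above and is assumed not to be a real, installed package.

```r
# Packages named in the library() calls of a reprex ("specialpackage" is a placeholder)
needed <- c("stats", "specialpackage")

# TRUE if the package is available on this machine, FALSE otherwise; nothing is installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Packages the helper would still have to install themselves
missing_packages <- names(installed)[!installed]
```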
Exercise 4: Which packages are essential?
In each of the following code snippets, identify the necessary packages (or other code) to make the example reproducible.
a.
weight_model <- lm(weight ~ common_name + sex, data = rodents_subset)
tab_model(weight_model)
b.
mod <- lmer(weight ~ hindfoot_length + (1|plot_type), data = rodents)
summary(mod)
c.
rodents_processed <- process_rodents_data(rodents)
glimpse(rodents_processed)
This exercise should take about 10 minutes.
Solution
a. lm is part of base R (it comes from the {stats} package, which is loaded by default), so there’s no package needed for that. tab_model comes from the package sjPlot. You could add library(sjPlot) to this code to make it reproducible.
b. lmer is a linear mixed modeling function that comes from the package lme4. summary is from base R. You could add library(lme4) to this code to make it reproducible.
c. process_rodents_data is not from any package that we know of, so it is probably a function that the author wrote themselves. In order to make this example reproducible, you would have to include the definition of process_rodents_data. glimpse is probably from dplyr, but it’s worth noting that there is also a glimpse function in the pillar package, so this might be ambiguous. This is another reason it’s important to specify your packages–if you leave your helper guessing, they might load the wrong package and misunderstand your error!
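To illustrate answer c, here is a sketch of what including a function definition in a reprex could look like. The body of process_rodents_data below is entirely hypothetical (we do not know what the real function does), and the three-row rodents data frame is made up so that the example runs on its own. str() stands in for glimpse() to avoid needing a package.

```r
# Hypothetical definition: the real function's behavior is unknown
process_rodents_data <- function(data) {
  # Drop rows with a missing weight
  data[!is.na(data$weight), ]
}

# Made-up stand-in data
rodents <- data.frame(
  species = c("ordii", "spectabilis", "ordii"),
  weight = c(48, NA, 52)
)

rodents_processed <- process_rodents_data(rodents)
str(rodents_processed)
```

With the definition included, the helper no longer has to guess what the function does before they can run the code.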
Including library() calls will definitely help Remy run the code. But this code still won’t work as written because Remy does not have access to the same objects that Mickey used in the code.
The code as written relies on rodents_subset, which Remy will not have access to if they try to run the code. That means that we’ve succeeded in making our example minimal, but it is not reproducible: it does not allow someone else to reproduce the problem!
[PULL UP ROAD MAP]
Exercise 5: Reflection
Let’s take a moment to reflect on this process.
What’s one thing you learned in this episode? An insight; a new skill; a process?
What is one thing you’re still confused about? What questions do you have?
This exercise should take about 5 minutes.
Key Points
- Making a reprex is the next step after trying code first aid.
- In order to make a good reprex, it is important to simplify your code.
- Simplify code by removing parts not directly related to the question.
- Give helpers access to the functions used in your code by loading all necessary packages.