Content from What is a reprex and why is it useful?


Last updated on 2025-02-04 | Edit this page

Estimated time: 25 minutes

Overview

Questions

  • How is the process of getting help in R different from getting help with other things?
  • Why is a minimal reproducible example an important tool for getting help in R?
  • What will we be learning in the rest of the course?

Objectives

  • Recognize what it takes to debug someone else’s code.
  • Define a minimal reproducible example.
  • Describe the general workflow that we will cover in the rest of this lesson.

Welcome and introductions

  • Welcome to “RRRR, I’m Stuck!” We’re glad you’re here! Let’s first take care of a few setup steps. You should have all followed the setup instructions on the workshop website, and you should have both R and RStudio installed.

  • You should also have the following packages installed: {reprex}, {ratdat}, {dplyr}, and {ggplot2}.

  • We have a range of levels of experience here. This workshop assumes that you are researcher in ecology or biology who has some prior experience working with R in RStudio.

  • The examples in this lesson follow an example data wrangling, visualization, and analysis workflow similar to one that might be used by an ecology researcher. We will not spend much time discussing the code itself. [Reference guide?]

  • It’s okay if you’re not an R expert–but even more experienced R coders might be less familiar with how to get unstuck, so we hope this workshop will be useful to many levels of coding background.

Think, pair, share: When you get stuck

When you’re coding in R and you get stuck, what are some things that you do to get help or get unstuck?

Think, pair, share: Helping someone else

Think about a time that you helped someone else with their code. What information did you need to know in order to help? (If you have never helped someone else with their code, think about a time that someone helped you–what information did they need to know in order to help?)

Fixing problems can be one of the most challenging parts of learning to code. Luckily, there are many people in the R and data science communities that will be able to help when you get stuck. However, learners and potential helpers alike often run into trouble when they try to communicate about broken code. Helping someone to fix their code can feel impossible when they don’t give you enough information to replicate the problem. But as a novice, figuring out how to ask a good question can feel even harder than the original problem that got you stuck in the first place!

When you get stuck, often the first step is to try some strategies on your own to fix your code. This might include reading help resources, investigating error messages, and methodically walking through each line of your code to figure out what might have gone wrong. But it’s also very common to ask for help from someone else, such as a colleague or strangers online.

Asking for help with your code is a little different than asking for help with many other things. That’s because it is usually not enough to describe the problem in general or theoretical terms. Most help forums (StackOverflow, the Posit Community, and the R for Data Science Slack workspace are common places to ask for help with R code!) require users to post a description of their problem along with a minimal reproducible example, or “reprex”, of their code to make it easier for helpers to figure out what the problem is.

Callout

A minimal reproducible example (MRE) is also sometimes called a minimal working example (MWE), or a reprex (which is short for “reproducible example”). We will mostly use the term reprex in this lesson.

The term reprex was coined by Romain François in a 2014 tweet: The origin of the term “reprex”, as a word smash of “reproducible” and “example”.

As the name suggests, a reprex should be:

  1. Minimal. It’s important to strip the code and data down to their simplest parts and remove anything that is not directly relevant to the problem. This makes it easy for a helper to zero in on what might be going wrong. It also makes the helping process simpler for the helper, making them more likely to take the time to work on your problem.

  2. Reproducible. The reprex should recreate the problem you’re encountering, and it needs to be runnable by someone other than you, on a different computer.

Why is reproducibility important? In many disciplines, experts can give helpful abstract advice about problems. Coding is very hands-on. Even experts usually have to “tinker” with a problem before they can determine what is happening. Without the ability to “tinker”, debugging is both difficult and frustrating, which means that you are less likely to get help.

The Tidyverse documentation puts it simply:

“The goal of a reprex is to package your problematic code in such a way that other people can run it and feel your pain. Then, hopefully, they can provide a solution and put you out of your misery.” - Get help! (Tidyverse)

The wrong ways to ask questions

  • Screenshots
  • Descriptions of the data
  • “It doesn’t work” XXX expand on this–not sure about the best way/time to introduce these “don’t”s. Maybe not at all?

As an added bonus, the process of simplifying your problem and creating a reprex often leads to a better understanding of your own code!

A tweet from Dr. Sam Tyner-Monroe, describing her experience solving her own problem through the process of making a reprex (December 12, 2019)
A tweet from Dr. Sam Tyner-Monroe, describing her experience solving her own problem through the process of making a reprex (December 12, 2019)

In fact, the phenomenon of solving one’s own problem during the process of trying to explain it clearly to someone else is well known–it’s often called “rubber duck debugging”. Jenny Bryan, who created a great video about reprexes called “Help me help you”, called reprexes “basically the rubber duck in disguise”.

A tweet from Jenny Bryan comparing reprexes to rubber duck debugging (January 4, 2018)
A tweet from Jenny Bryan comparing reprexes to rubber duck debugging (January 4, 2018)

Making a reprex might seem simple in theory, but it can be challenging to put into practice. In this lesson, we will walk through the process of creating a minimal reproducible example. We’ll talk about each of the steps and provide a workflow that you can follow when you get stuck in the future. At the end, we’ll introduce you to the reprex package, a useful tool for creating good minimal reproducible examples.

Overview of this lesson


[XXX visual for the roadmap? or just an outline?]

Key Points

  • A helper usually needs to run your code in order to debug it.
  • Minimal reproducible examples make it possible for helpers to run your code, which lets them “feel your pain” and figure out what’s wrong.
  • Making a minimal reproducible example helps you understand your own problem and often leads to finding the answer yourself!

Content from Identify the problem and make a plan


Last updated on 2025-02-04 | Edit this page

Estimated time: 70 minutes

Overview

Questions

  • What do I do when I encounter an error?
  • What do I do when my code outputs something I don’t expect?
  • Why do errors and warnings appear in R?
  • How can I find which areas of code are responsible for errors?
  • How can I fix my code? What other options exist if I can’t fix it?

Objectives

After completing this episode, participants should be able to…

  • Describe how the desired code output differs from the actual output
  • Categorize an error message (e.g. syntax error, semantic errors, package-specific errors, etc.)
  • decode/describe what an error message is trying to communicate
  • Identify specific lines and/or functions generating the error message
  • Use R Documentation to look up function syntax and examples
  • Quickly fix commonly-encountered R errors using the internet
  • Identify when a problem is better suited for asking for further help, including online help and reprex

OUTPUT


Attaching package: 'dplyr'

OUTPUT

The following objects are masked from 'package:stats':

    filter, lag

OUTPUT

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

OUTPUT

Rows: 16878 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): species_id, sex, genus, species, taxa, plot_type
dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(initial intro – edit upon looking at intro episode)

The first step we’ll cover is what to do when encountering an error or other undesired output from your code. With this episode, we hope to teach you the basics about identifying errors, rectifying them if possible, and if not, how to isolate the problem for others to look at. While the “RRRR I’m Stuck” lesson outlines a complete workflow on solving coding problems from errors to minimum reproducible examples, it is our experience that many seemingly-impossible errors can be understood and fixed without the full workflow. Here, we’ll go over errors, including how to solve many by yourself.

3.1 What do I do when I encounter an error?


Something that may go wrong is an error in your code. By this, we mean any code which generates an error message. This happens when R is unable to run your code, for a variety of reasons. While we can’t review every type of reason your code generates an error, we will try to teach you some tools for you to interpret and figure out errors yourself.

The accompanying error message attempts to tell you exactly how your code failed. For example, consider the following error message that occurs when I run this command in the R console:

R

ggplot(x = taxa) + geom_bar()

ERROR

Error: object 'taxa' not found

Though we know somewhere there is an object called taxa (it is actually a column of the dataset rodents), R is trying to communicate that it cannot find any such object in the local environment. Let’s try again, appropriately pointing ggplot to the rodents dataset and taxa column using the $ operator.

R

ggplot(aes(x = rodents$taxa)) + geom_bar()

ERROR

Error in `fortify()`:
! `data` must be a <data.frame>, or an object coercible by `fortify()`,
  or a valid <data.frame>-like object coercible by `as.data.frame()`, not a
  <uneval> object.
ℹ Did you accidentally pass `aes()` to the `data` argument?

Whoops! Here we see another error message – this time, R responds with a perhaps more-uninterpretable message.

Let’s go over each part briefly. First, we see an error from a function called fortify, which we didn’t even call! Then a much more helpful informational message: Did we accidentally pass aes() to the data argument? This does seem to relate to our function call, as we do pass aes, potentially where our data should go. A helpful starting place when attempting to decipher an error message is checking the documentation for the function which caused the error:

?ggplot

Here, a Help window pops up in RStudio which provides some more information. Skipping the general description at the top, we see ggplot takes positional arguments of data, then mapping, which uses the aes call. We can see in “Arguments” that the aes(x = rodents$taxa) object used in the plot is attempted by fortify to be converted to a data.frame: now the picture is clear! We accidentally passed our mapping argument (telling ggplot how to map variables to the plot) into the position it expected data in the form of a data frame. And if we scroll down to “Examples”, to “Pattern 1”, we can see exactly how ggplot expects these arguments in practice. Let’s amend our result:

R

ggplot(rodents, aes(x = taxa)) + geom_bar()

Here we see our desired plot.

Some common errors you might run into include R expecting different input types for a function than those given, being unable to process data in the way the code is written (e.g. divide by zero errors), or being unable to find a variable or function you’ve specified.

Summary


In general, when encountering an error message for which a remedy is not immediately apparent, some steps to take include:

  1. Reading each part of the error message, to see if we can interpret and act on any.

  2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).

  3. Reading through the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

  4. Copying and pasting the error message into a search engine for more interpretable explanations.

And, when all else fails, preparing our code into a reproducible example for expert help.

3.2 What do I do when my code outputs something I don’t expect


Another type of ‘error’ which you may encounter is when your R code runs without errors, but does not produce the desired output. You may sometimes see these called “semantic errors” (as opposed to syntax errors, though these term themselves are vague within computer science and describe a variety of different scenarios). As with actual R errors, semantic errors may occur for a variety of non-intuitive reasons, and are often harder to solve as there is no description of the error – you must work out where your code is defective yourself!

With our rodent analysis, the next step in the plan is to subset to just the Rodent taxa (as opposed to other taxa: Bird, Rabbit, Reptile or NA). Let’s quickly check to see how much data we’d be throwing out by doing so:

R

table(rodents$taxa)

OUTPUT


   Bird  Rabbit Reptile  Rodent
    300      69       4   16148 

We’re interested in the Rodents, and thankfully it seems like a majority of our observations will be maintained when subsetting to rodents. Except wait. In our plot, we can clearly see the presence of NA values. Why are we not seeing them here? This is an example of a semantic error – our command is executed correctly, but the output is not everything we intended. Having no error message to interpret, let’s jump straight to the documentation:

R

?table

OUTPUT

Help on topic 'table' was found in the following packages:

  Package               Library
  vctrs                 /home/runner/.local/share/renv/cache/v5/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/vctrs/0.6.5/c03fa420630029418f7e6da3667aac4a
  base                  /home/runner/.cache/R/renv/sandbox/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/9a444a72


Using the first match ...

Here, the documentation provides some clues: there seems to be an argument called useNA that accepts “no”, “ifany”, and “always”, but it’s not immediately apparent which one we should use. As a second approach, let’s go to Examples to see if we can find any quick fixes. Here we see a couple lines further down:

R

table(a)                 # does not report NA's
table(a, exclude = NULL) # reports NA's

That seems like it should be inclusive. Let’s try again:

R

table(rodents$taxa, exclude = NULL)

OUTPUT


   Bird  Rabbit Reptile  Rodent    <NA>
    300      69       4   16148     357 

Now, we do see that by subsetting to the “Rodent” taxa, we are losing about 357 NAs, which themselves could be rodents! However, in this case, it seems a small enough portion to safely omit. Let’s subset our data to the rodent taxa

R

rodents <- rodents %>% filter(taxa == "Rodent")

Challenge

There are 3 lines of code below, and each attempts to create the same plot. Identify which produces a syntax error, which produces a semantic error, and which correctly creates the plot (hint: this may require you inferring what type of graph we’re trying to create!)

  1. ggplot(rodents) + geom_bin_2d(aes(month, plot_type))

  2. ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")

  3. ggplot(rodents) + geom_tile(aes(month, plot_type))

In this case, A correctly creates the graph, plotting as colors in the tile the number of times an observation is seen. It essentially runs the following lines of code:

R

rodents_summary <- rodents %>% group_by(plot_type, month) %>% summarize(count=n())

OUTPUT

`summarise()` has grouped output by 'plot_type'. You can override using the
`.groups` argument.

R

ggplot(rodents_summary) + geom_tile(aes(month, plot_type, fill=count))

B is a syntax error, and will produce the following error:

R

ggplot(rodents) + geom_tile(aes(month, plot_type), stat = "count")

ERROR

Error in `geom_tile()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_count()` must only have an x or y aesthetic.

Finally, C is a semantic error. It produces a plot, which is rather meaningless:

R

ggplot(rodents) + geom_tile(aes(month, plot_type))

Summary


In general, when encountering a semantic error for which a remedy is not immediately apparent, some steps to take include:

  1. Reading any warning or informational messages that may pop up when executing your code.

  2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).

  3. Reading through the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

  4. Changing the input to your function call to see if the output or semantic error can be made more apparent (maybe delete).

And, when all else fails, preparing our code into a reproducible example for expert help. Note, there are fewer options available with semantic errors as when an error message prevents your code from running. You may find yourself having to isolate or ask for help more often than with easily-solvable syntax errors.

Callout

Generally, the more your code deviates from just using base R functions, or the more you use specific packages, both the quality of documentation and online help available from search engines and Googling gets worse and worse. While base R errors will often be solvable in a couple of minutes from a quick ?help check or a long online discussion and solutions on a website like Stack Overflow, errors arising from little-used packages applied in bespoke analyses might merit isolating your specific problem to a reproducible example for online help, or even getting in touch with the developers! Such community input and questions are often the way packages and documentation improves over time.

3.3 How can I find where my code is failing?


Isolating your problem may not be as simple as assessing the output from a single function call on the line of code which produces your error. Often, it may be difficult to determine which line(s) in your code are producing the error.

Consider the example below, where we now are attempting to see how which species of kangaroo rodents appear in different plot types over the years. To do so, we’ll filter our dataset to just include the genus Dipodomys. Then we’ll plot a histogram of which how many observations are seen in each plot type over an x axis of years.

R

krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus

ggplot(krats, aes(year, fill=plot_type)) + 
geom_histogram() +
facet_wrap(~species)

ERROR

Error in `combine_vars()`:
! Faceting variables must have at least one value.

Uh-oh. Another error here, this time in “combine_vars?” What is that? “Faceting variables must have at least one value”: What does that mean?

Well it may be clear enough that we seem to be missing “species” values where we intend. Maybe we can try to make a different graph looking at what species values are present? Or perhaps there’s an error earlier – our safest approach may actually be seeing what krats looks like:

R

krats

OUTPUT

# A tibble: 0 × 13
# ℹ 13 variables: record_id <dbl>, month <dbl>, day <dbl>, year <dbl>,
#   plot_id <dbl>, species_id <chr>, sex <chr>, hindfoot_length <dbl>,
#   weight <dbl>, genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

It’s empty! What went wrong with our “Dipadomys” filter? Let’s use a print statement to see which genuses are included in the original rodents dataset.

R

print(rodents %>% count(genus))

OUTPUT

# A tibble: 12 × 2
   genus                n
   <chr>            <int>
 1 Ammospermophilus   136
 2 Baiomys              3
 3 Chaetodipus        382
 4 Dipodomys         9573
 5 Neotoma            904
 6 Onychomys         1656
 7 Perognathus        553
 8 Peromyscus        1271
 9 Reithrodontomys   1412
10 Rodent               4
11 Sigmodon           103
12 Spermophilus       151

We see two things here. For one, we’ve misspelled Dipodomys, which we can now amend. This quick function call also tells us we should expect a data frame with 9573 values resulting after subsetting to the genus Dipodomys.

R

krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

OUTPUT

[1] 9573   13

R

ggplot(krats, aes(year, fill=plot_type)) + 
geom_histogram() +
facet_wrap(~species)

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Our improved code here looks good. Checking the dimensions of our subsetted data frame using the dim() function confirms we now have all the Dipodomys observations, and our plot is looking better. In general, having a ‘print’ statement or some other output before plots or other major steps can be a good way to check your code is producing intermediate results consistent with your expectations.

However, there’s something funky happening here. The bins are definitely weirdly spaced – we can see some bins are not filled with any observations, while those exactly containing one of the integer years happens to contain all the observations for that year.

A solution here might be to tinker with the bin width in the histogram code, but let’s step back a minute. Do we necessarily need to dive into the depths of tinkering with the plot? We can evalulate this problem not in terms of the plot having a problem, but with our data type having a problem. There’s an opportunity to encode the observation times outside of coarse, somewhat arbitrary year groupings with the real, interpretable date they were collected. Let’s do that using the tidyverse’s ‘lubridate’ package. The important details here are that we are creating a ‘datetime’-type variable using the recorded years, months, and days, which are currently all encoded as numeric types.

R

krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

OUTPUT

[1] 9573   13

R

krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))

ggplot(krats, aes(date, fill=plot_type)) + 
  geom_histogram() +
  facet_wrap(~species)

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This looks much better, and is easier to see the trends over time as well. Note our x axis still shows bins with year labelings, but the continuous spread of our data over these bins shows that dates are treated more continuously and fall more continuously within histogram bins.

One aspect we can see with this exercise above is that sometimes the ‘problem’ we need to solve is not super intuitive or can be isolated to one line. Before attempting to hack a bespoke solution, sometimes it’s helpful to consider whether this might be an often-encountered problem (even if you can’t exactly articulate the problem yourself) or one where you have some intuition there is a simpler or more elegant solution.

By setting up a reproducible example, we can isolate the problem with real data rather than simply asking a proximal question online or to a friend (e.g. ‘how can i change my plot to look like so’). This allows the community to assist in identifying the problem – you don’t always need to have it figured out before asking for help!

Callout

Often, giving your expert helpers access to the entire problem, with a detailed description of your desired output allows you to directly improve your coding skills and learn about new functions and techniques.

Summary


In general, we need to isolate the specific areas of code causing the bug or problem. There is no general rule of thumb as to how large this needs to be, but in general, think about what we would want to include in a reprex. Any early lines which we know run correctly and as intended may not need to be included, and we should seek to isolate the problem area as much as we can to make it understandable to others.

Let’s add to our list of steps for identifying the problem:

  1. Identify the problem area – add print statements immediately upstream or downstream of problem areas, check the desired output from functions, and see whether any intermediate output can be further isolated.

  2. Read each part of the error or warning message (if applicable), to see if we can immediately interpret and act on any.

  3. Pull up R Documentation for any function calls causing the error (which may be specific to a package, such as with ggplot).

  4. Read through the documentation’s Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

  5. Copy and paste the error message into a search engine for more interpretable explanations.

And, when all else fails, prepare our code into a reproducible example for expert help.

While this is similar to our previous checklists, we can now understand these steps as a continuous cycle of isolating the problem into more and more discrete chunks for a reproducible example. Any step in the above that helps us identify the specific areas or aspects of our code that are failing in particular, we can zoom in on and restart the checklist. We can stop as soon as we don’t understand anymore how our code fails. At this point, we can excise that area for further help using a reprex.

Finally, let’s make our plot publication-ready by changing some aesthetics. Let’s also add a vertical line to show when the study design changed on the exclosures.

R

krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)

OUTPUT

[1] 9573   13

R

krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))


krats %>%
  ggplot(aes(x = date, fill = plot_type)) +
  geom_histogram()+
  facet_wrap(~species, ncol = 1)+ 
  theme_bw()+
  scale_fill_viridis_d(option = "plasma")+
  geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue")

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It looks like the study change helped to reduce merriami sightings in the Rodent and Short-term Krat exclosures.

Content from Minimal Reproducible Data


Last updated on 2025-02-04 | Edit this page

Estimated time: 96 minutes

Overview

Questions

  • What is a minimal reproducible dataset, and why do I need it?
  • What do I need to include in a minimal reproducible dataset?
  • How do I create a minimal reproducible dataset?
  • How do I make my own dataset reproducible?

Objectives

  • Describe a minimal reproducible dataset
  • List the requirements for a minimal reproducible dataset
  • Identify the important aspects of the data that are necessary to reproduce your coding problem
  • Create a dataset from scratch to reproduce a given example
  • Subset an existing dataset to reproduce a given example
  • Share your own dataset in a way that is accessible and reproducible

4.1 What is a minimal reproducible dataset and why do I need it?


Now that we understand some basic errors and how to fix them, let’s look at what to do when we can’t figure out a solution to our coding problem.

This is when you really need to know how to create a minimal reproducible example (MRE) as we talked about in episode 1.

In general, an MRE will need:

  • A minimal dataset that can reproduce the error (or access to a such a dataset)
  • Minimal runnable code that can reproduce the error using the minimal dataset
  • Basic information about the system, R version, and packages being used
  • In case of random functions (e.g., sample()), a seed that will produce the same results each time (e.g., use set.seed())

The first step in creating an MRE is to create a shareable dataset that your helper can manipulate and use to reproduce your error and fix your issue.

Why? Remember our IT problem? It would be a lot easier for the IT support person to fix your computer if they could actually touch it, see the screen, and click around.

You can use the next example as well if you choose.

You’re knitting a sweater and one of the sleeves looks wonky. You call a friend and ask why it’s messed up. They can’t possibly help without being able to hold the sweater and look at the stitches themselves.

It would be great if we could give the helper our entire computer so they could just take over where we left off, but usually we can’t.

Callout

There are several reasons why you might need to create a separate dataset that is minimal and reproducible instead of trying to use your actual dataset. The original dataset may be:

  • too large - the Portal dataset is ~35,000 rows with 13 columns and contains data for decades. That’s a lot!
  • private - your dataset might not be published yet, or maybe you’re studying an endangered species whose locations can’t easily be shared. Another example: many medical datasets cannot be shared publically.
  • hard to send - on most online forums, you can’t attach supplemental files (more on this later). Even if you are just sending data to a colleague, file paths can get complicated, the data might be too large to attach, etc.
  • complicated - it would be hard to locate the relevant information. One example to steer away from are needing a ‘data dictionary’ to understand all the meanings of the columns (e.g. what is “plot type” in ratdat?) We don’t our helper to waste valuable time to figure out what everything means.
  • highly derived/modified from the original file. As an example, you may have already done a bunch of preliminary data wrangling and you don’t want to include all that code when you send the example (see later: the minimal code section), so you need to provide the intermediate dataset directly to your helper.

It’s useful to strip the dataset to its essential parts to identify where exactly the problem area is. A minimal dataset is a dataset that includes the information necessary to run the code, but removes all other unnecessary parts (extra columns/rows, extra context, etc.)

We need minimal reproducible datasets to make it easy/simple/fast for the helper to focus in on the problem at hand and “get their hands dirty” tinkering with the dataset.

4.2 What do I need to include in a minimal reproducible dataset?


Ask the audience, wait for them to respond

It’s actually all in the name:

  • it needs to be minimal, which means it only contains the necessary information to run the piece of code with which you are struggling. You can also think of this as being relevant to the problem. Only keep the necessary elements/variables.
  • it needs to be reproducible. The data you provide must consistently reproduce the output or error with which you are struggling.
  • For it to truly be reproducible, it also needs to be complete, meaning there are no dependencies that have been omitted, and accessible, which means the helper must have access to the relevant data and code (more on this later).

Remember: your helper may not be in the room with you or have access to your computer and the files that are on it!

You might be used to always uploading data from separate files, but helpers can’t access those files. Even if you sent someone a file, they would need to put it in the right directory, make sure to load it in exactly the same way, make sure their computer can open it, etc. Since the goal is to make everyone’s lives as easy as possible, we need to think about our data in a different way–as a dummy object created in the script itself.

Pro-tip

An example of what minimal reproducible examples look like can also be found in the ?help section, in R Studio. Just scroll all the way down to where there are examples listed. These will usually be minimal and reproducible.

For example, let’s look at the function mean:

R

?mean

We see examples that can be run directly on the console, with no additional code.

R

x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))

OUTPUT

[1] 8.75 5.50

In this case, x is the dummy dataset consisting of just 1 variable. Notice how it was created as part of the example.

Exercise 1

These datasets are not well suited for use in a reprex. For each one, try to reproduce the dataset on your own in R (copy-paste). Does it work? What happened? Explain.

  1. Screenshot of the ratdat comple_old dataset.
  2. sample_data <- read_csv(“/Users/kaija/Desktop/RProjects/ResearchProject/data/sample_dataset.csv”)
  3. dput(complete_old[1:100,])
  4. sample_data <- data.frame(species = species_vector, weight = c(10, 25, 14, 26, 30, 17))
  1. Not reproducible because it is a screenshot.
  2. Not reproducible because it is a path to a file that only exists on someone else’s computer and therefore you do not have access to it using that path.
  3. Not minimal, it has far too many columns and probably too many rows. It is also not reproducible because we were not given the source for complete_old.
  4. Not reproducible because we are not given the source for species_vector.

Exercise 2

Let’s say we want to know the average weight of all the species in our rodents dataset. We try to use the following code…

R

mean(rodents$weight)

OUTPUT

[1] NA

…but it returns NA! We don’t know why that is happening and we want to ask for help.

Which of the following represents a minimal reproducible dataset for this code? Can you describe why the other ones are not?

  1. sample_data <- data.frame(month = rep(7:9, each = 2), hindfoot_length = c(10, 25, 14, 26, 30, 17))
  2. sample_data <- data.frame(weight = rnorm(10))
  3. sample_data <- data.frame(weight = c(100, NA, 30, 60, 40, NA))
  4. sample_data <- sample(rodents$weight, 10)
  5. sample_data <- rodents_modified[1:20,]

The correct answer is C!

  1. does not include the variable of interest (weight).
  2. does not produce the same problem (NA result with a warning message)–the code runs just fine.
  3. minimal and reproducible.
  4. is not reproducible. Sample randomly samples 10 items; sometimes it may include NAs, sometime it may not (not guaranteed to reproduce the error). It can be used if a seed is set (see next section for more info).
  5. uses a dataset that isn’t accessible without previous data wrangling code–the object rodents_modified doesn’t exist.

4.3 How do I create a minimal reproducible dataset?


This is where many often get stuck: how do I recreate the key elements of my dataset in order to reproduce my error? That seems really hard! If you also find this initially overwhelming, don’t worry. We will break it down into smaller steps.

First, there are three approaches to providing a dataset. You can (1) create one from scratch, (2) use a dataset that is already available, (3) copy/recreate your actual dataset in a way that is minimal and reproducible. The approach you choose to take will depend largely on the nature and source of the problem as well as the complexity of your original dataset. Therefore, no matter which approach we take we first need to know which elements of our dataset are necessary:

  • How many variables do we need?
  • What data type is each variable?
  • How many levels and/or observations are necessary?
  • How many of the values need to be the same/different?
  • Are there any NAs that could be relevant?

Keep these questions in mind as we move through our examples.

OR should we start with an example right here where they need to answer those questions?

Let’s start from scratch.

4.3.1 Create a dummy dataset from scratch

There are many ways one can create a dummy dataset from scratch.

You can create vectors using c()

R

vector <- c(1,2,3,4) 
vector

OUTPUT

[1] 1 2 3 4

You can add some randomness by sampling from a vector using sample().

For example you can sample numbers 1 through 10 in a random order

R

x <- sample(1:10)
x

OUTPUT

 [1] 10  6  1  3  9  7  4  8  2  5

Or you can randomly sample from a normal distribution

R

x <- rnorm(10)
x

OUTPUT

 [1]  0.33485421  0.85806706  0.41260017  1.43141250  0.04204059 -1.21948214
 [7]  0.93282754 -1.50675878  0.48008962 -1.48332834

You can also use letters to create factors.

R

x <- sample(letters[1:4], 20, replace=T)
x

OUTPUT

 [1] "c" "c" "d" "c" "d" "c" "c" "d" "c" "b" "d" "a" "c" "d" "b" "d" "a" "c" "d"
[20] "b"

Remember that a data frame is just a collection of vectors. You can create a data frame using data.frame (or tibble in the dplyr package). You can then create a vector for each variable.

R

data <- data.frame (x = sample(letters[1:3], 20, replace=T), 
                    y = rnorm(1:20))
head(data)

OUTPUT

  x          y
1 a  2.4107260
2 a  0.1644751
3 c -0.4531877
4 b  1.9294096
5 a  0.4410250
6 c  0.1544100

However, when sampling at random you must remember to set.seed() before sending it to someone to make sure you both get the same numbers!

Callout

For more handy functions for creating data frames and variables, see the cheatsheet. For some questions, specific formats can be needed. For these, one can use any of the provided as.someType functions: as.factor, as.integer, as.numeric, as.character, as.Date, as.xts.

Let’s come back to our kangaroo rats example.

Since we will be working with the same dataset this year, we want to know how many kangaroo rats of each species were found in each plot type in past years so that we can better estimate what sample size we can expect.

Here is the code we use:

R

krats %>%
  ggplot(aes(x = date, fill = plot_type)) +
  geom_histogram(alpha=0.6)+
  facet_wrap(~species)+ 
  theme_bw()+
  scale_fill_viridis_d(option = "plasma")+
  geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "red")

Now let’s say we saw this and decided we wanted to get rid of “sp.” but didn’t know how. We want to ask someone online but we first need to create a minimal reproducible example. Remember our questions from earlier…

Excercise 3

Try to answer the following questions oon your own and see if you can determine what we need to include in our minimal reproducible dataset:

  1. How many variables do we need?
  2. What data type is each variable?
  3. How many levels and/or observations are necessary?
  4. How many of the values need to be the same/different?
  5. Are there any NAs that could be relevant?
  1. We will need 3 variables to represent species, plot type, and date.
  2. Two of our variables will need to be categorical (factors) and one of them continuous.
  3. To reproduce the figure, we can use 2-4 levels for one factors (species), and maybe 2 levels for the other factor (plot type) to keep it minimal. Our continuous variable could range 1 to 10 (date). We don’t need too many observations, but we do have 2 categories, one with 4 levels. Let’s make it an even 100.
  4. NAs are not relevant to our problem

Maybe we don’t need to include the solution and we just walk through it in the following section.

  1. What variables would we need to reproduce this figure?

We will need 3 variables to represent species, plot type, and date.

  1. What data type is each variable?

Two of our variables (species and plot type) will need to be categorical (factors) and one of them continuous (date).

  1. How many levels and/or observations are necessary?

For species, our original figure has 4 levels. We could reduce this to 2, but let’s keep it at 4. Let’s call these species A, B, C, and D.

R

species <- c('A','B','C','D')

For plot type, our original figure has 5 levles, but we could cut it down to 2. Let’s call them P1 and P2. In reality, we probably don’t even need this for this question, but for the sake of practicing let’s add it in.

R

plot.type <- c('P1','P2')

Lastly, date is our continuous variable. To mimic our original figure, we probably want it long enough to show multiple bars along the x axis, but we still want to keep it minimal. Let’s just call it days and make it 1-10.

R

days <- c(1:10)

Great! Now we have all of our variables, we need to go sampling. How many observations do we need? Again, we want enough to show a similar graph, but also keep it minimal. We need to sample each plot for 10 days, and each plot should give us a varying number of species, of which we have 4. Let’s say we find 20 individuals.

We can simulate the data collected each day by using sample() and specifying our number of observations (we need to sample 20 times). Since we want species and plots to repeat, we will also set replace to T.

All together we get:

R

sample_data <- data.frame(
  Day = days,
  Plot = sample(plot.type, 20, replace=T),
  Species = sample(species, 20, replace=T)
)
sample_data

OUTPUT

   Day Plot Species
1    1   P1       C
2    2   P1       B
3    3   P1       C
4    4   P1       A
5    5   P2       B
6    6   P2       B
7    7   P2       A
8    8   P2       C
9    9   P2       A
10  10   P1       B
11   1   P2       B
12   2   P2       A
13   3   P1       B
14   4   P1       B
15   5   P1       A
16   6   P1       D
17   7   P1       D
18   8   P1       A
19   9   P1       D
20  10   P1       D

Great! Now we have a sample data set that is minimal, but is is reproducible?

Give them time to think about it and answer the question.

It isn’t! Why?

Remember: sample() creates a random dataset! This will not be consistently reproducible. In order to make this example fully reproducible we should first set.seed().

R

set.seed(1)
sample_data <- data.frame(
  Species = sample(species, 20, replace=T),
  Plot = sample(plot.type, 20, replace=T),
  Day = days
)
head(sample_data)

OUTPUT

  Species Plot Day
1       A   P1   1
2       D   P1   2
3       C   P1   3
4       A   P1   4
5       B   P1   5
6       A   P1   6

Now we have our minimal reproducible example! But are we sure it reproduces what we are trying to reproduce? Let’s test it out.

R

sample_data %>%
  ggplot(aes(x = Day, fill = Plot)) +
  geom_histogram(alpha=0.6)+
  facet_wrap(~Species)+ 
  theme_bw()+
  scale_fill_viridis_d(option = "plasma")

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Yes!

It is certainly simplified, but it has the elements we want it to have. And now we can ask how to get rid of “C”.

Given that this was a very simple question, we could have simplified this example even further; we could have used 2 species and even just 2 days, in which case a simple solution could be

R

sample_data2 <- data.frame(
species = sample(c('A','B'), 6, replace = T),
days = 1:2
)

sample_data2 %>%
  ggplot(aes(x=days)) +
  geom_histogram()+
  facet_wrap(~species)

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

which is even more simplistic than the one before but still contains the elements we are interested in–we have a set of “species” separated into facets and we want to get rid of one of them. In reality, had we realized that we needed to get rid of the rows with “sp.” in them, we could have ignored the figure entirely and posed the question about the data alone. E.g., “how do I remove rows that contain a specific name?” Then give just the example dataset we created.

Exercise 4 (10 minutes) – optional?

Now practice doing it yourself. Create a data frame with:

A. One categorical variable with 2 levels and one continuous variable. B. One continuous variable that is normally distributed. C. Name, sex, age, and treatment type.

4.3.2 Create a dataset using an existing dataset

If you don’t want to create a dataset from scratch, maybe because you have too many variables or it’s a more complicated structure and you are not sure where the error is, you can subset from an existing dataset. Useful functions for subsetting a dataset include subset(), head(), tail(), and indexing with [] (e.g., iris[1:4,]). Alternatively, you can use tidyverse functions like select(), and filter() from the tidyverse. You can also use the same sample() functions we covered earlier.

A list of readily available datasets can be found using library(help="datasets"). You can then use ? in front of the dataset name to get more information about the contents of the dataset.

When working with a built-in dataset you still have to edit your code to fit the new data, but it is probably faster than building a large dataset from scratch, and it gets easier with practice!

Let’s keep using our previous example, how can we reproduce that figure using the existing dataset mpg. First, let’s interrogate this dataset to see what we are working with.

R

?mpg

Which variable from mpg do you think we could use to replace our variables? Remember: we need one for species, one for plot type, and one for date.

There are certainly multiple options! Let’s go with model for species, manufacturer for plot type, and year for date.

R

data <- mpg %>% select(model, manufacturer, year)
dim(data)

OUTPUT

[1] 234   3

R

glimpse(data)

OUTPUT

Rows: 234
Columns: 3
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…

We only need 4 species, and 5 plots. How many do we have here?

R

length(unique(data$model)) 

OUTPUT

[1] 38

R

length(unique(data$manufacturer))

OUTPUT

[1] 15

Certainly more than we need. Then let’s simplify.

R

set.seed(1)
data <- data %>%
    filter(model %in% sample(model, 4, replace = F))

Cool, now we have just 4 models. BUT we also only have 2 years… so maybe year wasn’t the best choice afterall, let’s change it to hwy

R

data <- mpg %>% select(model, manufacturer, hwy) %>%
  filter(model %in% sample(model, 4, replace = F))

Now we can try our plot

R

data %>%
  ggplot(aes(x = hwy, fill = manufacturer)) +
  geom_histogram(alpha=0.6)+
  facet_wrap(~model)+ 
  theme_bw()+
  scale_fill_viridis_d(option = "plasma")

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Do you think that works?

It turns out that maybe manufacturer was not the best representation for plot, since we do need each car model to appear in each “plot”. What would all cars have?

Let’s change model to manufacturer, and let’s add class.

R

set.seed(1)
data2 <- mpg %>% select(manufacturer, class, hwy) %>%
  filter(manufacturer %in% sample(manufacturer, 4, replace = F))
  
data2 %>% 
  ggplot(aes(x = hwy, fill = class)) +
  geom_histogram(alpha=0.6)+
  facet_wrap(~manufacturer)+ 
  theme_bw()+
  scale_fill_viridis_d(option = "plasma")

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

That’s more like it! You can keep playing around with it or you can give it more thought apriori, but either way you get the idea. While what we get is not an exact replica, it’s an analogy. The important thing is that we created a figure whose basic elements/structure or “key features” remain intact–namely, the number and type of variables and categories.

Now it is your turn!

Excercise 4

For each of the following, identify which data are necessary to create a minimal reproducible dataset using mpg.

  1. We want to know how the highway mpg has changed over the years
  2. We need a list of all “types” of cars and their fuel type for each manufacturer
  3. We want to compare the average city mpg for a compact car from each manufacturer

OR change the above challenge to be about ratdat

OR move to…

Now that we know how many of each species were captured over the years, we want to know how many of each species you might expect to catch per day.

Let’s practice how we would do this with our data.

Let the students try it out and discuss outloud

We end up with the following code:

R

krats_per_day <- krats %>%
  group_by(date, year, species) %>%
  summarize(n = n()) %>%
  group_by(species)

OUTPUT

`summarise()` has grouped output by 'date', 'year'. You can override using the
`.groups` argument.

R

krats_per_day %>%
  ggplot(aes(x = species, y = n))+
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width = 0.2, alpha = 0.2)+
  theme_classic()+
  ylab("Number per day")+
  xlab("Species")

Excercise 5

How might you reproduce this using the mpg dataset?

Substitute krats with cars, species with class, date with year. The question becomes, how many cars of each class are produced per year?

R

  set.seed(1)
    cars_per_y <- mpg %>%
    filter(class %in% sample(class, 4, replace=F)) %>%
    group_by(class, year) %>%
    summarize(n=n()) %>%
    group_by(class)

OUTPUT

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

R

  cars_per_y %>%
    ggplot(aes(x = class, y = n))+
    geom_boxplot(outlier.shape = NA)+
    geom_jitter(width = 0.2, alpha=0.2)+
    theme_classic()+
    ylab("Cars per year")+
    xlab("Class")

R

  # this is only giving us 3 classes even though we asked for 4, why?
  
  # Because it is sampling from the column "class" which has many of the same class. 
  # Therefore, we need to specify that we want to sample from within the unique values in "class".
  
  cars_per_y <- mpg %>%
    filter(class %in% sample(unique(mpg$class), 4, replace=F)) %>%
    group_by(class, year) %>%
    summarize(n=n()) %>%
    group_by(class)

OUTPUT

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

R

  cars_per_y %>%
    ggplot(aes(x = class, y = n))+
    geom_boxplot(outlier.shape = NA)+
    geom_jitter(width = 0.2, alpha=0.2)+
    theme_classic()+
    ylab("Cars per year")+
    xlab("Class")

4.4 Using your own data by creating a minimal subset


Perhaps you are now thinking that if you can use a subset of an existing dataset, wouldn’t it be easier to just subset my own data to make it minimal? You are not wrong. There are cases when you can subset your own data in the same way you would subset an existing dataset to make a minimal dataset, the key is to then make it reproducible. That’s when we use the function dput, which essentially takes your dataframe and give you code to reproduce it!

For example, using our previous data2

R

dput(cars_per_y)

OUTPUT

structure(list(class = c("midsize", "midsize", "pickup", "pickup",
"subcompact", "subcompact", "suv", "suv"), year = c(1999L, 2008L,
1999L, 2008L, 1999L, 2008L, 1999L, 2008L), n = c(20L, 21L, 16L,
17L, 19L, 16L, 29L, 33L)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L), groups = structure(list(
    class = c("midsize", "pickup", "subcompact", "suv"), .rows = structure(list(
        1:2, 3:4, 5:6, 7:8), ptype = integer(0), class = c("vctrs_list_of",
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE))

As you can see, even with our minimal dataset, it is still quite a chunk of code. What if you tried putting in krats_per_day? It is clear that either way you will still need to considerably minimize your data. Even then, it will often be simpler to provide an existing dataset or provide one from scratch. Furthermore, often we are able to discover the source of our error or solve our own problem when we have to go through the process of breaking it down into its essential components!

Nevertheless, it remains an option for when your data appears too complex or you are not quite sure where your error lies and therefore are not sure what minimal components are needed to reproduce the example.

Callout

What about NAs? If your data has NAs and they may be causing the problem, it is important to include them in your MR dataset. You can find where there are NAs in your dataset by using is.na, for example: is.na(krats$weight). This will return a logical vector or TRUE if the cell contains an NA and FALSE if not. The simplest way to include NAs in your dummy dataset is to directly include it in vectors: x <- c(1,2,3,NA). You can also subset a dataset that already contains NAs, or change some of the values to NAs using mutate(ifelse()) or substitute all the values in a column by sampling from within a vector that contains NAs.

One important thing to note when subsetting a dataset with NAs is that subsetting methods that use a condition to match rows won’t necessarily match NA values in the way you expect. For example

R

test <- data.frame(x = c(NA, NA, 3, 1), y = rnorm(4))
test %>% filter(x != 3) 

OUTPUT

  x         y
1 1 0.7635935

R

# you might expect that the NA values would be included, since “NA is not equal to 3”. But actually, the expression NA != 3 evaluates to NA, not TRUE. So the NA rows will be dropped!
# Instead you should use is.na() to match NAs
test %>% filter(x != 3 | is.na(x))

OUTPUT

   x            y
1 NA -0.294720447
2 NA -0.005767173
3  1  0.763593461

Here are some more practice exercises if you wish to test your knowledge

(I copied these from excercise 6 in the google doc… but I’m not sure that they are getting at the point of the lesson…)

Excercise 6

Each of the following examples needs your help to create a dataset that will correctly reproduce the given result and/or warning message when the code is run. Fix the dataset shown or fill in the blanks so it reproduces the problem.

  1. set.seed(1) sample_data <- data.frame(fruit = rep(c(“apple”, “banana”), 6), weight = rnorm(12)) ggplot(sample_data, aes(x = fruit, y = weight)) + geom_boxplot() HELP: how do I insert an image from clipboard?? Is it even possible?
  2. bodyweight <- c(12, 13, 14, , ) max(bodyweight) [1] NA
  3. sample_data <- data.frame(x = 1:3, y = 4:6) mean(sample_data\(x) [1] NA Warning message: In mean.default(sample_data\)x): argument is not numeric or logical: returning NA
  4. sample_data <- ____ dim(sample_data) NULL
  1. “fruit” needs to be a factor and the order of the levels must be specified: sample_data <- data.frame(fruit = factor(rep(c("apple", "banana"), 6), levels = c("banana", "apple")), weight = rnorm(12))
  2. one of the blanks must be an NA
  3. ?? + what’s really the point of this one?
  4. sample_data <- data.frame(x = factor(1:3), y = 4:6)

Key Points

  • A minimal reproducible dataset contains (a) the minimum number of lines, variables, and categories, in the correct format, to reproduce a certain problem; and (b) it must be fully reproducible, meaning that someone else can reproduce the same problem using only the information provided.
  • You can create a dataset from scratch using as.data.frame, you can use available datasets like iris or you can use a subset of your own dataset
  • You can share your own data by first subsetting it into its minimal components and then using dput() to create it via reproducible code

Content from Minimal Reproducible Code


Last updated on 2025-02-04 | Edit this page

Estimated time: 75 minutes

Overview

Questions

  • Why is it important to make a minimal code example?
  • Which part of my code is causing the problem?
  • Which parts of my code should I include in a minimal example?
  • How can I tell whether a code snippet is reproducible or not?
  • How can I make my code reproducible?

Objectives

  • Explain the value of a minimal code snippet.
  • Identify the problem area of a script.
  • Identify supporting parts of the code that are essential to include.
  • Simplify a script down to a minimal code example.
  • Evaluate whether a piece of code is reproducible as is or not. If not, identify what is missing.
  • Edit a piece of code to make it reproducible
  • Have a road map to follow to simplify your code.
  • Describe the {reprex} package and its uses

R

library(tidyverse)

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R

library(ratdat)
library(here)

OUTPUT

here() starts at /home/runner/work/R-help-reprexes/R-help-reprexes/site/built

R

source(here("scripts/krat-analysis.R"))

WARNING

Warning in file(filename, "r", encoding = encoding): cannot open file
'/home/runner/work/R-help-reprexes/R-help-reprexes/site/built/scripts/krat-analysis.R':
No such file or directory

ERROR

Error in file(filename, "r", encoding = encoding): cannot open the connection

You’re happy with the progress you’ve made in exploring your data, and you’re excited to move on to your second research question: Do the k-rat exclusion plots actually exclude k-rats?

You start by making a couple predictions.

Prediction 1: If the k-rat plots work, then we would expect to see fewer k-rats in the k-rat exclusion plots than in the control, with the fewest in the long-term k-rat exclosure plot.

Second, if the k-rat plots work, then we would expect a lower proportion of spectabilis in the spectabilis-specific exclosure than in the control plot, regardless of the absolute number of k-rats caught.

You start by computing summary counts of the number of k-rats per day in the control and the two k-rat exclosure plots, and then using boxplots to show the differences between daily counts in each plot type.

R

# Prediction 1: expect fewer k-rats in exclusion plots than in control plot
# Count number of k-rats per day, per plot type
counts_per_day <- krats %>%
  filter(plot_type %in% c("Control", "Long-term Krat Exclosure", "Short-term Krat Exclosure")) %>%
  group_by(year, plot_id, plot_type, month, day) %>%
  summarize(count_per_day = n(), .groups = "drop")

ERROR

Error: object 'krats' not found

R

# Visualize
counts_per_day %>%
  ggplot(aes(x = plot_type, y = count_per_day, fill = plot_type))+
  geom_boxplot(outlier.size = 0.5)+
  theme_minimal()+
  labs(title = "Kangaroo rat captures, all years",
       x = "Plot type",
       y = "Individuals per day")+
  theme(legend.position = "none")

ERROR

Error: object 'counts_per_day' not found

This is a coarse analysis that doesn’t compare numbers on the same day against each other, but even so, our prediction is supported. The fewest k-rats are caught per day in the Long-term Krat Exclosure plots, followed by the Short-term Krat Exclosure. The Control plots have the largest number of Krats.

Next, you decide to test the second prediction: that there will be proportionally fewer spectabilis caught in the spectabilis exclusion plots than in the control plot.

You restrict the data to just the control and spectabilis exclosure plots. Then you calculate the number and proportion of each species of k-rat caught in each plot and visualize the proportions for spectabilis.

R

# Focus in on the control and spectab exclosure plots
control_spectab <- krats %>%
  filter(plot_type %in% c("Control", "Spectab exclosure"))

ERROR

Error: object 'krats' not found

R

# Calculate proportion of spectabilis caught
head(control_spectab)

ERROR

Error: object 'control_spectab' not found

R

prop_spectab <- control_spectab %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  filter(species_id == "DS") # keep only spectabilis

ERROR

Error: object 'control_spectab' not found

R

# Visualize proportions
prop_spectab %>%
  ggplot(aes(x = year, y = prop, col = plot_type))+
  geom_point()+
  geom_line()+
  theme_minimal()+
  labs(title = "Spectab exclosures reduced the proportion of spectab captures",
       y = "Spectabilis proportion",
       x = "Year",
       color = "Plot type")

ERROR

Error: object 'prop_spectab' not found

Interesting! In this analysis, we do see, as expected, that the spectabilis exclosure seems to have a consistently lower proportion of spectabilis than we see in the control plots. This supports our second prediction, and it also does a better job of getting at our research question because it’s looking at proportions, not raw numbers.

But speaking of proportions… looking closer at the y-axis, we can see that something looks wrong. You remember from the previous plot that merriami was generally more common than the other two species, but you don’t remember that spectabilis was quite this rare! Could it really be true that only 3% of the k-rats caught were spectabilis at the absolute highest?

XXX FIXME:

Here, I’ve presented this as though part of the mystery is figuring out whether there’s something wrong with the data. That type of data sleuthing (is my code bad, or is my data bad??) is a really important skill, but I think it might be a bit tangential for us to focus on here. The easy way to have this stick to the LOs would be to have some ground truth that shows us for absolutely sure that <3% is wrong, which means it’s the code that’s bad, not the original data.

This seems wrong. But your code is running fine, so you’re not really sure where your error is or how to start fixing it.

You’re feeling overwhelmed, so you decide to ask your labmate, Jordan, for help. > Hi Jordan, > I ran into a problem with my code today, and I’m wondering if you can help. I made this plot about kangaroo rat abundance in different plot types, but the proportions look weirdly low. I don’t know what’s wrong–can you help me figure it out? I’m attaching my script to this email. > Thanks, > Mishka > Attachment: krat-analysis.R

You know that Jordan is a very experienced coder. If they look through your script, surely they’ll be able to figure out what’s going on much more quickly than you can!

Making sense of someone else’s code

Imagine you are Jordan, and you’ve in the middle of your own analysis when you receive this email. What reaction do you have? How do you feel about helping your lab mate? How would you feel if it were a complete stranger asking you for help instead?

Now, reflect on a time when you have had to look through and run someone else’s code, or a time when you’ve opened your own coding project after a long time away from it. (If you have easy access to one of your past projects, maybe try opening it now and taking a look through it right now!)

How do situations like this make you feel? Turn to the person next to you and discuss for a minute (or write some reflections on the Etherpad).

This exercise should take about 3 minutes. :::solution FIXME: I think this is way too long and needs to be more targeted. We are trying to get them to realize that this is a frustrating/overwhelming/scary experience for the person in the helper role.

I am worried that people are going to use this time to start reflecting on the substance of the actual error. Maybe that’s a reason to just focus it on a general time you’ve dealt with someone else’s code, instead of “put yourself in Jordan’s shoes”? I feel like I need to demo it with people to find out, though.

Suggestions for revising this challenge are welcome!

:::::::::::::::::::::::::::::::::::::::::::

Jordan wants to help their labmate, but they are really busy. They’ve looked through enough code in the past to know that it will take a long time to make sense of yours. They email you back:

Hi Mishka, I’d be happy to help, but this is a lot of code to go through without much context. Could you just show me the part that’s most relevant? Maybe make a minimal example? Thanks, Jordan

Why is it important to simplify code?

In order to get help from Jordan, we’re going to need to do some work up front to make things easier for them. In general, learning how to simplify your code is one of the most important parts of making a minimal reproducible example, asking others for help, and helping yourself.

Debugging is a time when it’s common to have to read through long and complex code (either your own or someone else’s). The more we can reduce their frustration and overwhelm and make the experience of solving errors easy and painless, the more likely that others will want to take the time to help us. Helpers are doing us a favor–why put barriers in their way?

A road map for simplifying your code

Sometimes, it can be challenging to know where to start simplifying your code. In this episode, we’re going to walk through a road map for breaking your code down to its simplest form while making sure that 1) it still runs, and 2) it reproduces the problem you care about solving.

For now, we’ll go through this road map step by step. At the end, we’ll review the whole thing. One takeaway from this lesson is that there is a step by step process to follow, and you can refer back to it if you feel lost in the future.

Step 0. Create a separate script

To begin creating a simpler piece of code to show Jordan, let’s create a separate script. We can separate out small pieces of the code and craft the perfect minimal example while still keeping the original analysis intact.

Create and save a new, blank R script and give it a name, such as “reprex-script.R”

R

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

Creating a new R script

There are several ways to make an R script - File > New File > R Script - Click the white square with a green plus sign at the top left corner of your RStudio window - Use a keyboard shortcut: Cmd + Shift + N (on a Mac) or Ctrl + Shift + N (on Windows)

Step 1. Identify the problem area

Now that we have a script, let’s zero in on what’s broken.

First, we should use some of the techniques we learned in the “Identify the Problem” episode and see if they help us solve our error.

XXX FIXME: Peter, let’s make sure this follows nicely from your lesson. - Look for error messages or warnings –> there aren’t any - Google it –> not really informative

In this case, the debugging tips we learned weren’t enough to help us figure out what was wrong with our code.

This is quite common, especially when you’re faced with a semantic error (your code runs, but it doesn’t produce the correct result).

In this case, we can start by identifying the piece of code that showed us there was a problem. We noticed the problem by looking at the plot, so that would be an obvious place to start. But is the plot code really the problem area? Maybe, or maybe not. We know that some weird values are showing up on the plot. That means that either there are weird values in the data that created the plot, or the values in the data looked okay and something happened when we created the plot.

To distinguish between those possibilities, let’s take a look at the plot again:

R

# Visualize proportions
prop_spectab %>%
  ggplot(aes(x = year, y = prop, col = plot_type))+
  geom_point()+
  geom_line()+
  theme_minimal()+
  labs(title = "Spectab exclosures did not reduce proportion of\nspectab captures",
       y = "Spectabilis proportion",
       x = "Year",
       color = "Plot type")

ERROR

Error: object 'prop_spectab' not found

…and then at the data used to generate that plot

R

head(prop_spectab)

ERROR

Error: object 'prop_spectab' not found

Aha! The prop column of prop_spectab shows very small values, and those values correspond to the plot we created.

So it looks like the problem is in the data, not in the plot code itself. The plot was just what allowed us to see the problem in our data. This is often the case. Visualizations are important!

So instead of focusing on the plot, let’s zoom in on the step right before the plot as our “problem area”: the code that created the prop_spectab object. Let’s copy and paste that section of code into the new script that we just created.

R

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Calculate proportion of spectabilis caught
head(control_spectab)

ERROR

Error: object 'control_spectab' not found

R

prop_spectab <- control_spectab %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  filter(species_id == "DS") # keep only spectabilis

ERROR

Error: object 'control_spectab' not found

Now the code is much smaller and easier to dig through! You decide to send this updated script along to Jordan.

Hi Jordan, Sorry about that! You’re right, that was a lot of code to dig through. I’ve narrowed it down now to just focus on the part of the code that was causing the problem. The proportions of k-rats in the prop_spectab data frame are too small to make sense. Can you help me figure out why the proportions are wrong? Thank you, Mishka Attachment: reprex-script.R

Jordan gets your email and opens the script. They try to run your code, but they immediately run into some problems. They write back,

Hey Mishka, This is almost there, but I can’t run your code because I don’t have the objects and packages it relies on. Can you elaborate on your example to make it runnable? Jordan

Step 2. Identify dependencies

Jordan is right, of course–you gave them a minimal example of your code, but you didn’t provide the context around that code to make it runnable for someone else. You need to provide the dependencies for your code, including the functions it uses and the datasets it relies on.

R code consists primarily of functions and variables. Making sure that we give our helpers access to all the functions and variables we use in our minimal code will make our examples truly reproducible.

Let’s talk about functions first. When we code in R, we use a lot of different functions. Functions in R typically come from packages, and you get access to them by loading the package into your environment. (Some packages, such as {base} and {stats}, are loaded in R by default, so you might not have realized that a lot of functions, such as dim, colSums, factor, and length actually come from those packages!)

Callout

You can see a complete list of the functions that come from the {base} and {stats} packages by running library(help = "base") or library(help = "stats").

To make sure that your helper has access to the functions necessary for your reprex, you can include a library() call in your reprex.

Let’s do this for our own reprex. We can start by identifying all the functions used, and then we can figure out where each function comes from to make sure that we tell our helper to load the right packages.

Here’s what our script looks like so far.

R

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Calculate proportion of spectabilis caught
head(control_spectab)

ERROR

Error: object 'control_spectab' not found

R

prop_spectab <- control_spectab %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  filter(species_id == "DS") # keep only spectabilis

ERROR

Error: object 'control_spectab' not found

You look through the script and list out the following functions: head() group_by() summarize() n() mutate() sum() filter()

Callout

%>% is technically an operator, not a function… FIXME note: %>% is an operator that also needs dplyr to be loaded. It might be relevant because the literal error that Jordan will get first is one about %>% if they don’t have that loaded. But it’s also exhausting to talk about and it might confuse people. By the time they finish loading {dplyr} for the sake of the other functions, the %>% will be included with {dplyr} and it will no longer be a problem. Hmm…

You remember from your introductory R lesson that group_by() and summarize() are both functions that come from the dplyr package, so you can go ahead and add library(dplyr) to the top of your script.

R

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Load packages needed for the analysis
library(dplyr)

# Calculate proportion of spectabilis caught
head(control_spectab)

ERROR

Error: object 'control_spectab' not found

R

prop_spectab <- control_spectab %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  filter(species_id == "DS") # keep only spectabilis

ERROR

Error: object 'control_spectab' not found

You’re pretty sure that mutate() and filter() also come from dplyr, but you’re not sure. And you don’t know where n() and sum() come from at all!

Some functions might be created, or “defined”, in your code itself, instead of being contained in a package.

It’s pretty common not to remember where functions come from, especially if they belong to packages you use regularly. Let’s try searching for the sum() function in the documentation.

In the “Help” tab of your RStudio window (which should be next to the “Packages” and “Viewer” tabs in the bottom right pane of RStudio if you have the default layout), search for “sum”.

[INCLUDE SCREENSHOT OF SUM HELP DOCS HERE]

The top of the help file says sum {base}, which means that sum() is a function that comes from the {base} package. That’s good for us–everyone has {base} loaded by default in R, so we don’t need to tell our helper to load any additional packages in order to be able to access sum.

Let’s confirm that we remembered correctly about mutate() and filter() by searching for the documentation on those as well.

Searching for “mutate” quickly shows us the help file for mutate {dplyr}, but searching for “filter” gives a long list of possible functions, all called filter. You look around a little bit and realize that the filter function you’re using is also from the dplyr package. It might be a good idea to explicitly declare that in your code, in case your helper already has one of these other packages loaded.

Callout

Maybe something about functions masking other functions? Or is that too much?

You can explicitly declare which package a function comes from using a double colon ::–for example, dplyr::filter(). (Declaring the function with a double colon also allows you to use that function even if the package is not loaded, as long as it’s installed.)

Let’s add that to our script, so that it’s really clear which filter() we’re using.

R

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Load packages needed for the analysis
library(dplyr)

# Calculate proportion of spectabilis caught
head(control_spectab)

ERROR

Error: object 'control_spectab' not found

R

prop_spectab <- control_spectab %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  dplyr::filter(species_id == "DS") # keep only spectabilis

ERROR

Error: object 'control_spectab' not found

The last function we need to find a source for is n(). Searching that in the Help files shows that it also comes from dplyr. Great! We don’t need to consider any other packages, and the library(dplyr) line in our reprex script will tell our helper that the dplyr package is necessary to run our code.

Installing vs. loading packages

But what if our helper doesn’t have all of these packages installed? Won’t the code not be reproducible?

Typically, we don’t include install.packages() in our code for each of the packages that we include in the library() calls, because install.packages() is a one-time piece of code that doesn’t need to be repeated every time the script is run. We assume that our helper will see library(specialpackage) and know that they need to go install “specialpackage” on their own.

Technically, this makes that part of the code not reproducible! But it’s also much more “polite”. Our helper might have their own way of managing package versions, and forcing them to install a package when they run our code risks messing up our workflow. It is a common convention to stick with library() and let them figure it out from there.

Which packages are essential?

In each of the following code snippets, identify the necessary packages (or other code) to make the example reproducible.

  • [Example (including an ambiguous function: dplyr::select() is a good one because it masks plyr::select())]
  • [Example where you have to look up which package a function comes from]
  • [Example with a user-defined function that doesn’t exist in any package]

This exercise should take about 10 minutes. :::solution FIXME

:::::::::::::::::::::::::::::::::::::::::::

Installing packages conditionally

There is an alternative approach to installing packages [insert content/example of the if(require()) thing–but note that explaining this properly requires explaining why require() is different from library(), why it returns a logical, etc. and is kind of a rabbit hole that I don’t want to go down here.]

Including library() calls will definitely help Jordan be able to run your code. But another reason that your code still won’t work as written is that Jordan doesn’t have access to the same objects, or variables, that you used in the code.

The piece of code that we copied over to “reprex-script.R” from “krat-analysis.R” came from line 117. We had done a lot of analyses before then, starting from the raw dataset and creating and modifying new objects along the way.

You realize that Jordan doesn’t have access to control_spectab, the dataset that the reprex relies on.

One way to fix this would be to add the code that creates control_spectab to the reprex, like this:

R

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Load packages needed for the analysis
library(dplyr)

# B. For Spectabilis-specific exclosure, we expect a lower proportion of spectabilis there than in the other plots.
control_spectab <- krats %>%
  filter(plot_type %in% c("Control", "Spectab exclosure"))

ERROR

Error: object 'krats' not found

R

# Calculate proportion of spectabilis caught
head(control_spectab)

ERROR

Error: object 'control_spectab' not found

R

prop_spectab <- control_spectab %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  dplyr::filter(species_id == "DS") # keep only spectabilis

ERROR

Error: object 'control_spectab' not found

But that doesn’t fix the problem either, because now Jordan doesn’t have access to krats. Let’s go back to where krats was created, on line 40:

R

# Just kangaroo rats because this is what we are studying
krats <- rodents %>%
  filter(genus == "Dipodomys")

ERROR

Error: object 'rodents' not found

R

dim(krats) # okay, so that's a lot smaller, great.

ERROR

Error: object 'krats' not found

R

glimpse(krats)

ERROR

Error: object 'krats' not found

But there are several other places in the code where we modified krats, and we need to include those as well if we want our code to be truly reproducible. For example, on line 47, a date column was added, and on line 70 we removed unidentified k-rats. And even after including those, we would need to go back even farther, because the rodents object also isn’t something that Jordan has access to! We called the raw data rodents after reading it in, and we also made several modifications, such as when we removed non-rodent taxa on line 35. XXX This example would be better illustrated if we could actually show it breaking… like, if we think we’ve succeeded in tracing back far enough, and then it doesn’t run because e.g. an essential column is missing/different. XXX This is definitely too much detail… but I do want this section to at least partially feel like “aaaah information overload, too complicated!” because the point is to show how hard it is to trace every object backwards and convince them of the value of using minimal data.

Identifying variables

For each of the following code snippets, identify all the variables used

  • [Straightforward example]
  • [Example where they use a built-in dataset but it contains a column that that dataset doesn’t actually contain, i.e. because it’s been modified previously. Might be good to use the date column that we put into krats for this]

This exercise should take about 5 minutes. ::: solution TBH maybe we don’t need this exercise at all, since we want to just redirect them to minimal data anyway?

:::::::::::::::::::::::::::::::::::::::::::

This process is confusing and difficult! If you keep tracing each variable back through your script, before you know it, you end up needing to include the entire script to give context, and then your code is no longer minimal.

We can make our lives easier if we realize that helpers don’t always need the exact same variables and datasets that we were using, just reasonably good stand-ins. Let’s think back to the last episode, where we talked about different ways to create minimal reproducible datasets. We can lean on those skills here to make our example reproducible and greatly reduce the amount of code that we need to include.

Incorporating minimal datasets

What are some ways that you could use a minimal dataset to make this reprex better? What are the advantages and disadvantages of each approach?

This exercise should take about 5 minutes. ::: solution Could provide the control_spectab file directly to Jordan, e.g. via email attachment. Advantages: less work, keeps the context, Jordan is a coworker so they probably understand it. Disadvantages: file might be large, relies on ability to email file, won’t be able to use this method if you post the question online, file contains extra rows/columns that aren’t actually necessary to this problem and might therefore be confusing.

Could create a new dataset from scratch. Advantages: Disadvantages:

Could take a minimal subset of the control_spectab data and use dput or similar to include it directly Advantages: Disadvantages:

Could use a built-in dataset from R Advantages: Disadvantages:

:::::::::::::::::::::::::::::::::::::::::::

You decide that the easiest way to approach this particular problem would be to use a sample dataset to reproduce the problem.

Let’s think about how to create a sample dataset that will accurately reproduce the problem. What columns are needed to create prop_spectab, and what do we know about those columns?

R

head(control_spectab)

ERROR

Error: object 'control_spectab' not found

R

length(unique(control_spectab$year)) # 13 years

ERROR

Error: object 'control_spectab' not found

R

length(unique(control_spectab$species_id)) # three species

ERROR

Error: object 'control_spectab' not found

R

length(unique(control_spectab$plot_type)) # two plot types

ERROR

Error: object 'control_spectab' not found

Looks like we use year, which is numeric, and plot_type and species_id, which are both character types but have discrete levels like a factor. We have data from 13 years across two plot types, and there are three species that might occur.

Let’s create some vectors. I’m only going to use four years here because 13 seems a little excessive.

I’m going to arbitrarily decide how many species records are present for each year, assign every other row to treatment or control, and randomly select species for each row.

R

years <- 1:4
species <- c("Species A", "Species B", "Species C")
plots <- c("Control", "Treatment")

total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily 
total_rows <- sum(total_records_per_year) # how many total rows will we have?

# Create the fake data using `rep()` and `sample()`.
minimal_data <- data.frame(year = rep(years, times = total_records_per_year),
                      plot_type = rep(plots, length.out = total_rows),
                      species_id = sample(species, size = total_rows, replace = TRUE))

We can go ahead and paste that into our reprex script.

R

# Provide a minimal dataset
years <- 1:4
species <- c("Species A", "Species B", "Species C")
plots <- c("Control", "Treatment")

total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily 
total_rows <- sum(total_records_per_year) # how many total rows will we have?

# Create the fake data using `rep()` and `sample()`.
minimal_data <- data.frame(year = rep(years, times = total_records_per_year),
                      plot_type = rep(plots, length.out = total_rows),
                      species_id = sample(species, size = total_rows, replace = TRUE))

And finally, let’s change our code to use minimal_data instead of control_spectab. (While we’re at it, let’s remove the piece of code that creates control_spectab–including a minimal dataset directly means that we don’t need that code.)

Let’s calculate the proportion of Species A caught, instead of spectabilis.

R

# Calculate proportion of Species A caught
head(minimal_data)

OUTPUT

  year plot_type species_id
1    1   Control  Species A
2    1 Treatment  Species C
3    1   Control  Species B
4    1 Treatment  Species A
5    1   Control  Species C
6    1 Treatment  Species B

R

prop_speciesA <- minimal_data %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  dplyr::filter(species_id == "Species A") # keep only Species A

And now let’s check to make sure that the code actually reproduces our problem. Remember, the problem was that the proportions of Species A caught seemed to be way too low based on what we knew about the frequency of occurrence of each k-rat species.

When we created our sample data, we randomly allocated species A, B, and C to each row. Even after accounting for the stratification of the years and plot types, we would expect the proportion of Species A caught to be somewhere vaguely around 0.33 in each plot type and year. If we’ve reproduced our problem correctly, we would see values much lower than 0.33.

R

head(prop_speciesA)

OUTPUT

# A tibble: 6 × 5
   year plot_type species_id total_count   prop
  <int> <chr>     <chr>            <int>  <dbl>
1     1 Control   Species A            2 0.0364
2     1 Treatment Species A            2 0.0364
3     2 Control   Species A            1 0.0182
4     2 Treatment Species A            3 0.0545
5     3 Control   Species A            1 0.0182
6     3 Treatment Species A            1 0.0182

Indeed, those proportions look way too low! 3%, 1%… that’s an order of magnitude off from what we expect to see here. I think we have successfully reproduced the problem using a minimal dataset. To make things extra easy for Jordan, let’s add some comments in the script to point out the problem.

R

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Load packages needed for the analysis
library(dplyr)

# Provide a minimal dataset
years <- 1:4
species <- c("Species A", "Species B", "Species C")
plots <- c("Control", "Treatment")

total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily 
total_rows <- sum(total_records_per_year) # how many total rows will we have?

# Create the fake data using `rep()` and `sample()`.
minimal_data <- data.frame(year = rep(years, times = total_records_per_year),
                      plot_type = rep(plots, length.out = total_rows),
                      species_id = sample(species, size = total_rows, replace = TRUE)) # Because I assigned the species randomly, we should expect Species A to occur *roughly* 33% of the time.

# Calculate proportion of Species A caught
head(minimal_data)

OUTPUT

  year plot_type species_id
1    1   Control  Species C
2    1 Treatment  Species C
3    1   Control  Species B
4    1 Treatment  Species B
5    1   Control  Species A
6    1 Treatment  Species C

R

prop_speciesA <- minimal_data %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  dplyr::filter(species_id == "Species A") # keep only Species A

head(prop_speciesA) # Species A only occurs 1-3% of the time in each plot type in each year. Why is this off by an order of magnitude? (This is the same problem I was seeing in my real data--the occurrence proportions were way, way too small.)

OUTPUT

# A tibble: 6 × 5
   year plot_type species_id total_count   prop
  <int> <chr>     <chr>            <int>  <dbl>
1     1 Control   Species A            2 0.0364
2     1 Treatment Species A            1 0.0182
3     2 Control   Species A            1 0.0182
4     2 Treatment Species A            1 0.0182
5     3 Control   Species A            1 0.0182
6     4 Control   Species A            5 0.0909

Step 3. Simplify

We’re almost done! Now we have code that runs because it includes the necessary library() calls and makes use of a minimal dataset that still allows us to showcase the problem. Our script is almost ready to send to Jordan.

One last thing we can check is whether there are any other places where we can trim down the minimal example even more to eliminate distractions.

Often, analysis code contains exploratory steps or other analyses that don’t directly relate to the problem, such as calls to head(), View(), str(), or similar functions. (Exception: if you’re using these directly to show things like dimension changes that help to illustrate the problem).

Some other common examples are exploratory analyses, or extra formatting added to plots that doesn’t change their interpretation.

Callout

When cutting extra lines of code, we have to be careful not to remove anything that would cause the code to no longer reproduce our problem. In general, it’s a good idea to comment out the line you think is extraneous, re-run the code, and check that the focal problem persists before removing it entirely.

In our case, the code looks pretty minimal. We do have a call to head() at the end, but that’s being used to clearly demonstrate the problem, so it should be left in. Trimming down the minimal dataset any further, for example by removing plot_type or year, would involve changing the analysis code and possibly not reproducing the problem anymore.

Great work! We’ve created a minimal reproducible example. In the next episode, we’ll learn about reprex, a package that can help us double-check that our example is reproducible by running it in a clean environment. (As an added bonus, reprex will format our example nicely so it’s easy to post to places like Slack, GitHub, and StackOverflow.)

More on that soon. For now, let’s review the road map that we just practiced.

Road map review

Step 0. Create a separate script

  • It helps to have a separate place to work on your minimal code snippet.

Step 1. Identify the problem area

  • Which part of the code is causing the problem? Move it over to the reprex script so we can focus on it.

Step 2. Identify dependencies

  • Make sure that helpers have access to all the functions they’ll need to run your code snippet.
  • Make sure helpers can access the variables they’ll need to run the code, or reasonable stand-ins.

Step 3. Simplify

  • Remove any extra code that isn’t absolutely necessary to demonstrate your problem.

Reflection

Let’s take a moment to reflect on this process.

  • What’s one thing you learned in this episode? An insight; a new skill; a process?

  • What is one thing you’re still confused about? What questions do you have?

This exercise should take about 5 minutes.

XXX maybe this story should end with Mishka solving their own problem.

run an ANOVA or something

XXX Note: There’s more to the analysis, and I only managed to walk through one example of an error. Is this example juicy enough? Should we just discard the rest of the analysis? XXX Note: I actually think the analysis here is wrong and might need to be re-worked–would need to do pairwise comparisons.

MODELING

counts_mod <- lm(count_per_day ~ plot_type + species_id, data = counts_per_day) summary(counts_mod)

with interaction term:

counts_mod_interact <- lm(count_per_day ~ plot_type*species_id, data = counts_per_day) summary(counts_mod_interact)

summary(counts_mod) summary(counts_mod_interact)

Stepping through code, line by line

Placing your cursor on a line of code and using the keyboard shortcut Cmd + Enter (Mac) or Ctrl + Enter (Windows) will run that line of code and it will automatically advance your cursor to the next line. This makes it easy to “step through” your code without having to click or highlight.

Computational reproducibility

Every object should be able to map back to either a file, a built-in dataset in R, or another intermediate step. If you found any variables where you weren’t able to answer the “Where did this come from?” question, then that’s a problem! Did you build a script that mistakenly relied on an object that was in your environment but was never properly defined?

Mapping exercises like this can be a great way to check whether entire script is reproducible. Reproducibility is important in more cases than just debugging! More and more journals are requiring full analysis code to be posted, and if that code isn’t reproducible, it will severely hamper other researchers’ efforts to confirm and expand on what you’ve done.

Various packages can help you keep track of your code and make it more reproducible. Check out the {targets} and {renv} packages in particular if you’re interested in learning more.

Content from Asking your question


Last updated on 2025-02-04 | Edit this page

Estimated time: 12 minutes

Overview

Questions

  • How can I make sure my minimal reproducible example will actually run correctly for someone else?
  • How can I easily share a reproducible example with a mentor or helper, or online?
  • How do I ask a good question?

Objectives

  • Use the reprex package to test whether an example is reproducible.
  • Use the reprex package to format reprexes for posting online.
  • Understand the benefits and drawbacks of different places to get help.
  • Have a road map to follow when posting a question to make sure it’s a good question.
  • Understand what the {reprex} package does and doesn’t do.

Key Points

  • The {reprex} package makes it easy to format and share your reproducible examples.
  • The {reprex} package helps you test whether your reprex is reproducible, and also helps you prepare the reprex to share with others.
  • Following a certain set of steps will make your questions clearer and likelier to get answered.

Congratulations on making it this far. Creating a good reproducible example is a lot of work! In this episode, we’ll talk about how to finish up your reprex, make sure it’s ready to send to your helper or post online, and make absolutely double sure that it actually runs.

There are three principles to remember when you think about sharing your reprex with other people:

  1. Reproducibility
  2. Formatting
  3. Context

1. Reproducibility


You might be thinking, Haven’t we already talked a lot about reproducibility? We have! We discussed variables and packages, minimal datasets, and making sure that the problem is meaningfully reproduced by the data that you choose. But there are some reasons that a code snippet that appears reproducible in your own R session might not actually be runnable by someone else.

Some possible reasons:

  • You forgot to account for the origins of some functions and/or variables. We went through our code methodically, but what if we missed something? It would be nice to confirm that the code is as self-contained as we thought it was.

  • Your code accidentally relies on objects in your R environment that won’t exist for other people. For example, imagine you defined a function my_awesome_custom_function() in a project-specific functions.R script, and your code calls that function.

A function called "my_awesome_custom_function" is lurking in my R environment. I must have defined it a while ago and forgotten! Code that includes this function will not run for someone else unless the function definition is also included in the reprex.
A function called "my_awesome_custom_function" is lurking in my R environment. I must have defined it a while ago and forgotten! Code that includes this function will not run for someone else unless the function definition is also included in the reprex.
my_awesome_custom_function("rain")
# [1] "rain is awesome!"

I might conclude that this code is reproducible–after all, it works when I run it! But unless I include the function defintion in the reprex itself, nobody who doesn’t already have that custom function defined in their R environment will be able to run the code.

A corrected reprex would look like this:

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("rain")
#[1] "rain is awesome!"

Wouldn’t it be nice if we had a way to check whether any traps like this are lurking in our draft reprex?

Luckily, there’s a package that can help you test out your reprexes in a clean, isolated environment to make sure they’re actually reproducible. Introducing reprex!

The most important function in the reprex package is called reprex(). Here’s how to use it.

First, install and load the reprex package.

#install.packages("reprex")
library(reprex)

Second, write some code. This is your reproducible example.

(y <- 1:4)
mean(y)

Third, highlight that code and copy it to your clipboard (e.g. Cmd + C on Mac, or Ctrl + C on Windows).

Finally, type reprex() into your console.

# (with the target code snippet copied to your clipboard already...)
# In the console:
reprex()

reprex will grab the code that you copied to your clipboard and run that code in an isolated environment. It will return a nicely formatted reproducible example that includes your code and and any results, plots, warnings, or errors generated.

The generated output will be on your clipboard by default, so all you have to do is paste it into GitHub, StackOverflow, Slack, or another venue.

Callout

This workflow is unusual and takes some getting used to! Instead of copying your code into the function, you simply copy it to the clipboard (a mysterious, invisible place to most of us) and then let the blank, empty reprex() function go over to the clipboard by itself and find it.

And then the completed, rendered reprex replaces the original code on the clipboard and all you need to do is paste, not copy and paste.

Try to practice this workflow a few times with very simple reprexes so you can get a sense for the steps.

Let’s practice this one more time. Here’s some very simple code:

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

Let’s highlight the code snippet, copy it to the clipboard, and then run reprex() in the console.

# In the console:
reprex()

The result, which was automatically placed onto my clipboard and which I pasted here, looks like this:

R

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

Created on 2024-12-29 with reprex v2.1.1

Nice and neat! It even includes the plot produced, so I don’t have to take screenshots and figure out how to attach them to an email or something.

The formatting is great, but reprex really shines when you treat it as a helpful collaborator in your process of building a reproducible example (including all dependencies, providing minimal data, etc.)

Let’s see what happens if we forget to include library(ggplot2) in our small reprex above.

library(dplyr)
mpg %>%
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

As before, let’s copy that code to the clipboard, run reprex() in the console, and paste the result here.

# In the console:
reprex()

R

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()
#> Error in ggplot(., aes(x = factor(cyl), y = displ)): could not find function "ggplot"

Created on 2024-12-29 with reprex v2.1.1

Now we get an error message indicating that R cannot find the function ggplot! That’s because we forgot to load the ggplot2 package in the reprex.

This happened even though we had ggplot2 already loaded in our own current RStudio session. reprex deliberately “plays dumb”, running the code in a clean, isolated R session that’s different from the R session we’ve been working in. This keeps us honest and makes sure we don’t forget any important packages or function definitions.

Let’s return to our previous example with the custom function.

my_awesome_custom_function("rain")
# In the console:
reprex()

R

my_awesome_custom_function("rain")
#> Error in my_awesome_custom_function("rain"): could not find function "my_awesome_custom_function"

Created on 2024-12-29 with reprex v2.1.1

By contrast, if we include the function definition:

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("rain")
# In the console:
reprex()

R

my_awesome_custom_function <- function(x){print(paste0(x, " is awesome!"))}
my_awesome_custom_function("rain")
#> [1] "rain is awesome!"

Created on 2024-12-29 with reprex v2.1.1

Other features of {reprex}


Session Info

Another nice thing about reprex is that you can choose to include information about your R session, in case your error has something to do with your R settings rather than the code itself. You can do that using the session_info argument to reprex().

For example:

library(ggplot2)
library(dplyr)
mpg %>%
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()
# In the console:
reprex(session_info = TRUE)

R

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
mpg %>% 
  ggplot(aes(x = factor(cyl), y = displ))+
  geom_boxplot()

Created on 2024-12-29 with reprex v2.1.1

Session info

R

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       macOS Monterey 12.7.6
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-12-29
#>  pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.0)
#>  colorspace    2.1-1   2024-07-26 [1] CRAN (R 4.4.0)
#>  curl          6.0.1   2024-11-20 [1] https://jeroen.r-universe.dev (R 4.4.2)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      1.0.1   2024-10-10 [1] CRAN (R 4.4.1)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
#>  farver        2.1.2   2024-05-13 [1] CRAN (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.5   2024-10-30 [1] CRAN (R 4.4.1)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  ggplot2     * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
#>  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
#>  gtable        0.3.6   2024-10-25 [1] CRAN (R 4.4.1)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  knitr         1.49    2024-11-08 [1] CRAN (R 4.4.1)
#>  labeling      0.4.3   2023-08-29 [1] CRAN (R 4.4.0)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  munsell       0.5.1   2024-04-01 [1] CRAN (R 4.4.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.0)
#>  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
#>  rmarkdown     2.29    2024-11-04 [1] CRAN (R 4.4.1)
#>  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.1)
#>  scales        1.3.0   2023-11-28 [1] CRAN (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.1)
#>  xfun          0.49    2024-10-31 [1] CRAN (R 4.4.1)
#>  xml2          1.3.6   2023-12-04 [1] CRAN (R 4.4.0)
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Formatting

The output of reprex() is markdown, which can easily be copied and pasted into many sites/apps. However, different places have slightly different formatting conventions for markdown. reprex lets you customize the output of your reprex according to where you’re planning to post it.

The default, venue = "gh", gives you “GitHub-Flavored Markdown”. Another format you might want is “r”, which gives you a runnable R script, with commented output interleaved.

Check out the formatting options in the help file with ?reprex, and try out a few depending on the destination of your reprex!

{reprex} can’t do everything for you!

People often mention reprex as a useful tool for creating reproducible examples, but it can’t do the work of crafting the example for you! The package doesn’t locate the problem, pare down the code, create a minimal dataset, or automatically include package dependencies.

A better way to think of reprex is as a tool to check your work as you go through the process of creating a reproducible example, and to help you polish up the result.

Testing it out


Now that we’ve met our new reprex-making collaborator, let’s use it to test out the reproducible example we created in the previous episode.

Here’s the code we wrote:

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Load packages needed for the analysis
library(dplyr)

# Provide a minimal dataset
years <- 1:4
species <- c("Species A", "Species B", "Species C")
plots <- c("Control", "Treatment")

total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily
total_rows <- sum(total_records_per_year) # how many total rows will we have?

# Create the fake data using `rep()` and `sample()`.
minimal_data <- data.frame(year = rep(years, times = total_records_per_year),
                      plot_type = rep(plots, length.out = total_rows),
                      species_id = sample(species, size = total_rows, replace = TRUE)) # Because I assigned the species randomly, we should expect Species A to occur *roughly* 33% of the time.

# Calculate proportion of Species A caught
head(minimal_data)
prop_speciesA <- minimal_data %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  dplyr::filter(species_id == "Species A") # keep only Species A

head(prop_speciesA) # Species A only occurs 1-3% of the time in each plot type in each year. Why is this off by an order of magnitude? (This is the same problem I was seeing in my real data--the occurrence proportions were way, way too small.)

Time to find out if our example is actually reproducible! Let’s copy it to the clipboard and run reprex(). Since we want to give Jordan a runnable R script, we can use venue = "r".

# In the console:
reprex(venue = "r")

It worked!

# This script will contain my minimal reproducible example.
# Created by Mishka on 2024-12-17

# Load packages needed for the analysis
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

# Provide a minimal dataset
years <- 1:4
species <- c("Species A", "Species B", "Species C")
plots <- c("Control", "Treatment")

total_records_per_year <- c(10, 12, 3, 30) # I chose these arbitrarily
total_rows <- sum(total_records_per_year) # how many total rows will we have?

# Create the fake data using `rep()` and `sample()`.
minimal_data <- data.frame(year = rep(years, times = total_records_per_year),
                      plot_type = rep(plots, length.out = total_rows),
                      species_id = sample(species, size = total_rows, replace = TRUE)) # Because I assigned the species randomly, we should expect Species A to occur *roughly* 33% of the time.

# Calculate proportion of Species A caught
head(minimal_data)
#>   year plot_type species_id
#> 1    1   Control  Species A
#> 2    1 Treatment  Species A
#> 3    1   Control  Species A
#> 4    1 Treatment  Species B
#> 5    1   Control  Species C
#> 6    1 Treatment  Species A
prop_speciesA <- minimal_data %>%
  group_by(year, plot_type, species_id) %>%
  summarize(total_count = n(), .groups = "drop") %>%
  mutate(prop = total_count/sum(total_count)) %>%
  dplyr::filter(species_id == "Species A") # keep only Species A

head(prop_speciesA) # Species A only occurs 1-3% of the time in each plot type in each year. Why is this off by an order of magnitude? (This is the same problem I was seeing in my real data--the occurrence proportions were way, way too small.)
#> # A tibble: 6 × 5
#>    year plot_type species_id total_count   prop
#>   <int> <chr>     <chr>            <int>  <dbl>
#> 1     1 Control   Species A            2 0.0364
#> 2     1 Treatment Species A            2 0.0364
#> 3     2 Control   Species A            1 0.0182
#> 4     2 Treatment Species A            1 0.0182
#> 5     3 Control   Species A            2 0.0364
#> 6     4 Control   Species A            4 0.0727

Now we have a beautifully-formatted reprex that includes runnable code and all the context needed to reproduce the problem.

It’s time to send another email to Jordan to finally get help with our problem. It’s important to include a brief description of the problem, including what we were hoping to achieve and what the problem seems to be.

Hi Jordan, I took some time to learn how to write reproducible examples. I’ve simplified my problem down to a “reprex” and formatted it with the reprex package (attached). I’m using a simplified dataset here, but hopefully it illustrates the problem. Basically, my data has three species, and I’m trying to calculate the proportion of each species that was caught in each plot type in each of several years. In the example, I would expect each species to come up approximately one third of the time, but instead I’m seeing really tiny proportions, like 1-3%. My real data shows a similar pattern. I think something must be wrong with how I’m calculating the proportions, but I can’t quite figure it out. Can you shed some light on what might be going on? Thanks so much! Mishka Attachment: reprex-script.R

XXX how should this end? Some possibilities: - no solution! The point wasn’t to find the solution, it was to create a reprex. - Jordan writes back and solves the problem - Mishka figures out what was wrong in the process of drafting the email - Something else?