Data

Last updated on 2024-12-03

Estimated time: 60 minutes

Overview

Questions

  • How can I include data in my package?

Objectives

  • Learn how to use the data folder

Packaging data


In some situations, it is a good idea to include data sets as part of your package. Some packages, in fact, contain nothing but data.

Take a look, for instance, at the babynames package. According to the package description, it contains “US baby names provided by the SSA”.

In this chapter we will learn how to include data in our package, so that the data ships together with the code and is installed along with it.

Is your data too big?

Packages are typically not larger than a few megabytes.

If you need to deal with large datasets, adding them to the package is not advisable. Instead, consider hosting them on Figshare or a similar service.
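A quick way to judge whether an object is small enough to ship is to measure it before adding it to the package. The sketch below uses only base R. The 1 MB threshold is an arbitrary assumption chosen for illustration, not an official limit.

```r
# Hypothetical size check before adding an object to a package
example_names <- c("Luke", "Vader", "Leia", "Chewbacca", "Solo", "R2D2")

# Size of the object in memory, in bytes
size_in_memory <- as.numeric(object.size(example_names))

# Size of the compressed file that would end up on disk, in bytes
tmp_file <- tempfile(fileext = ".rda")
save(example_names, file = tmp_file)
size_on_disk <- file.size(tmp_file)

# Arbitrary threshold chosen for this example (1 MB)
threshold <- 1024^2
small_enough <- size_on_disk < threshold
small_enough
```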

Using R data files


R has its own native data format, the R data file. These files are recognizable by their extension: .rda or .RData. Using R data files is the simplest approach to data management inside R packages.
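To get a feel for the format, here is a minimal round trip with base R's save() and load(), written to a temporary folder rather than to a package. The vector is the same one used in the steps that follow. The environment name restored is just an illustrative choice.

```r
example_names <- c("Luke", "Vader", "Leia", "Chewbacca", "Solo", "R2D2")

# Write the object to an .rda file in a temporary folder
rda_file <- file.path(tempdir(), "example_names.rda")
save(example_names, file = rda_file)

# load() restores objects under their original names; loading into a
# fresh environment avoids overwriting anything in the workspace
restored <- new.env()
loaded_names <- load(rda_file, envir = restored)

# Names of the objects found in the file
loaded_names

# Check that the restored object matches the original
identical(restored$example_names, example_names)
```

Note that an .rda file stores the object together with its name, which is why load() does not need an assignment.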

Step 1: let R know that you’ll use data

We can add data to our project with the help of the usethis package. In the snippet below, we generate some data and then use usethis to store it as part of the package:

R

example_names <- c("Luke", "Vader", "Leia", "Chewbacca", "Solo", "R2D2")

R

usethis::use_data(example_names)

What happened?

Type and run the code above in your console. What does it do? Does it require any further action from our side?

It provides a very informative output. You will probably see something like this:

OUTPUT

✔ Setting active project to '<working folder>/mysterycoffee'
✔ Saving 'example_names' to 'data/example_names.rda'
• Document your data (see 'https://r-pkgs.org/data.html')

So, what happened is that it created the file example_names.rda inside the data folder. Additionally, it activated the project, but that’s not very relevant because most likely the project was already active.

The last element in the list shows something that didn’t happen: the data documentation. Actually, we are asked to do it ourselves. usethis is kind enough to provide us with a link with further information, in case we need it.

So let’s move to the second (and last) step, and document our data.

Step 2: document your data

Everything you put inside your package needs documentation, and data is no exception. How do we document it? In much the same way as we documented functions in the documentation episode.

An example documentation string for our data could be:

#' Example names
#'
#' An example data set containing six names from the Star Wars universe
#'
#' @format A vector of strings
#' @source Star Wars
"example_names"

We will save this text in R/example_names.R, and we are ready to go.
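Data frames are usually documented column by column. The sketch below assumes a hypothetical data set, example_characters, which is not part of the lesson's package and is built from the same six names. The \describe block with one \item per column is standard roxygen2 markup for this purpose.

```r
# Hypothetical data set; in a package it would be created once and
# stored with usethis::use_data(example_characters)
example_names <- c("Luke", "Vader", "Leia", "Chewbacca", "Solo", "R2D2")
example_characters <- data.frame(
  name = example_names,
  n_chars = nchar(example_names)
)

# The documentation below would live in R/example_characters.R

#' Example characters
#'
#' A hypothetical data set with six names from the Star Wars universe
#' and the length of each name.
#'
#' @format A data frame with 6 rows and 2 columns:
#' \describe{
#'   \item{name}{Name, as a character string}
#'   \item{n_chars}{Number of characters in the name}
#' }
#' @source Star Wars
"example_characters"
```

As with the vector above, the quoted name on the last line tells roxygen2 which data set the comment block describes.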

Checking that everything went ok


In the build panel, press install and restart. Now, type ?example_names in the console. Do you see some information about the dataset?

Tip: if not, make sure that you activated Generate documentation with Roxygen in the Build/More/Configure build tools tab.
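The data sets of an installed package can also be checked from the console. Because mysterycoffee has to be installed for that, the sketch below uses the datasets package that ships with R as a stand-in. Replacing "datasets" and "iris" with "mysterycoffee" and "example_names" is the assumed equivalent for our package.

```r
# List the data sets a package provides (stand-in package: datasets)
available <- data(package = "datasets")$results[, "Item"]
"iris" %in% available

# Load one data set into a separate environment and inspect it
env <- new.env()
data("iris", package = "datasets", envir = env)
str(env$iris)
```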

Using raw data


Sometimes you need to use data in formats other than .rda. Examples of this are .csv or .txt files.

In order to store raw data in your package, you have to save them in inst/extdata. For example, we can add our example names vector here as a text file:

R

dir.create("inst/extdata", recursive = TRUE)
writeLines(example_names, "inst/extdata/names.txt")

Then, after we reload the package, our users will be able to access this file path using system.file:

R

filepath <- system.file("extdata", "names.txt", package = "mysterycoffee")
readLines(filepath)

OUTPUT

[1] "Luke"      "Vader"     "Leia"      "Chewbacca" "Solo"      "R2D2"     
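One detail worth knowing: system.file returns an empty string rather than an error when a file cannot be found, which can lead to confusing failures further down. Below is a minimal sketch of a defensive lookup. It uses the base stats package as a stand-in because it is always installed, and no_such_file.txt is a deliberately nonexistent, hypothetical file name.

```r
# A file that exists in every installed package
found <- system.file("DESCRIPTION", package = "stats")
nzchar(found)

# A missing file silently yields an empty string
missing_path <- system.file("extdata", "no_such_file.txt", package = "stats")
missing_path == ""

# mustWork = TRUE turns the silent failure into an error,
# which we catch here only to keep the example running
result <- tryCatch(
  system.file("extdata", "no_such_file.txt", package = "stats",
              mustWork = TRUE),
  error = function(e) "file not found"
)
result
```

Using mustWork = TRUE inside package functions is a reasonable default, since a missing file is then reported where the lookup happens.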

Discussion

When do you think it is useful for a package to include data that do not have the .rda or .RData extensions?

Having files without the R extensions is useful when one of the main purposes of the package is to read external files. For instance, the readr package loads rectangular data from files where the values are comma- or tab-separated.

Summary


Data handling inside R packages can be a bit tricky. The diagram below summarizes the most common cases:

The diagram was created with mermaid. This is the original code:

flowchart LR
    id1(Does the user need access?) --Yes--> id6(Store it in data/)
    id3(Is the data in .rda format?)--Yes--> id1
    id1 --No, but tests do--> id5(Store it in tests/)
    id1 --No, but functions do--> id4(Store it in R/sysdata.rda*)
    id3 --No--> id8(But can it be?)
    id8 --Yes, with some work --> id9(Document the process in data-raw/**)
    id8 --No, it shouldn't--> id7(Store it in inst/extdata)

*) R/sysdata.rda is a file dedicated to (larger) data needed by your functions. Read more about it here.

**) data-raw/ is a folder dedicated to the origin and cleanup of your data. Read more about it here.
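The two starred cases can be sketched with base R alone. In a real package one would more likely call usethis::use_data_raw() to set up the data-raw/ script and usethis::use_data() with internal = TRUE to write the internal data file. The version below only imitates the resulting layout in a temporary folder so that it runs anywhere. The folder pkg, the messy raw file and the object name_lookup are hypothetical.

```r
# Stand-in for the package root; a real package would use its own folder
pkg <- file.path(tempdir(), "mysterycoffee-sketch")
dir.create(file.path(pkg, "data-raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg, "data"), showWarnings = FALSE)
dir.create(file.path(pkg, "R"), showWarnings = FALSE)

# data-raw/: keep the raw file next to the code that cleans it
raw_file <- file.path(pkg, "data-raw", "names.txt")
writeLines(c(" Luke", "Vader ", "Leia", "", "Solo"), raw_file)

# Cleanup step: trim whitespace and drop empty lines
example_names <- trimws(readLines(raw_file))
example_names <- example_names[nzchar(example_names)]

# data/: the cleaned object that users can access
save(example_names, file = file.path(pkg, "data", "example_names.rda"))

# R/sysdata.rda: an internal object, meant for package functions only
name_lookup <- toupper(example_names)
save(name_lookup, file = file.path(pkg, "R", "sysdata.rda"))

# Show the resulting layout
list.files(pkg, recursive = TRUE)
```

Keeping the cleanup code in data-raw/ means the step from raw file to packaged object can be rerun and reviewed later.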

If you need further help, please take a look at section 14.3 of the excellent R Packages tutorial by Hadley Wickham.

Key Points

  • R packages can also include data