Working with External Files
Last updated on 2024-12-13 | Edit this page
Estimated time: 12 minutes
Overview
Questions
- How can we load external data?
Objectives
- Be able to load external data into a workflow
- Configure the workflow to rerun if the contents of the external data change
Episode summary: Show how to read and write external files
Treating external files as a dependency
Almost all workflows will start by importing data, which is typically stored as an external file.
As a simple example, let’s create an external data file in RStudio
with the “New File” menu option. Enter a single line of text, “Hello
World” and save it as “hello.txt” text file in
_targets/user/data/
.
We will read in the contents of this file and store it as
some_data
in the workflow by writing the following plan and
running tar_make()
:
Save your progress
You can only have one active _targets.R
file at a time
in a given project.
We are about to create a new _targets.R
file, but you
probably don’t want to lose your progress in the one we have been
working on so far (the penguins bill analysis). You can temporarily
rename that one to something like _targets_old.R
so that
you don’t overwrite it with the new example _targets.R
file
below. Then, rename them when you are ready to work on it again.
R
library(targets)
library(tarchetypes)
tar_plan(
some_data = readLines("_targets/user/data/hello.txt")
)
OUTPUT
▶ dispatched target some_data
● completed target some_data [0 seconds, 64 bytes]
▶ ended pipeline [0.089 seconds]
If we inspect the contents of some_data
with
tar_read(some_data)
, it will contain the string
"Hello World"
as expected.
Now say we edit “hello.txt”, perhaps add some text: “Hello World. How are you?”. Edit this in the RStudio text editor and save it. Now run the pipeline again.
R
library(targets)
library(tarchetypes)
tar_plan(
some_data = readLines("_targets/user/data/hello.txt")
)
OUTPUT
✔ skipped target some_data
✔ skipped pipeline [0.087 seconds]
The target some_data
was skipped, even though the
contents of the file changed.
That is because right now, targets is only tracking the
name of the file, not its contents. We need to use a
special function for that, tar_file()
from the
tarchetypes
package. tar_file()
will calculate
the “hash” of a file—a unique digital signature that is determined by
the file’s contents. If the contents change, the hash will change, and
this will be detected by targets
.
R
library(targets)
library(tarchetypes)
tar_plan(
tar_file(data_file, "_targets/user/data/hello.txt"),
some_data = readLines(data_file)
)
OUTPUT
▶ dispatched target data_file
● completed target data_file [0.001 seconds, 26 bytes]
▶ dispatched target some_data
● completed target some_data [0 seconds, 78 bytes]
▶ ended pipeline [0.109 seconds]
This time we see that targets
does successfully re-build
some_data
as expected.
A shortcut (or, About target factories)
However, also notice that this means we need to write two targets
instead of one: one target to track the contents of the file
(data_file
), and one target to store what we load from the
file (some_data
).
It turns out that this is a common pattern in targets
workflows, so tarchetypes
provides a shortcut to express
this more concisely, tar_file_read()
.
R
library(targets)
library(tarchetypes)
tar_plan(
tar_file_read(
hello,
"_targets/user/data/hello.txt",
readLines(!!.x)
)
)
Let’s inspect this pipeline with tar_manifest()
:
R
tar_manifest()
OUTPUT
# A tibble: 2 × 2
name command
<chr> <chr>
1 hello_file "\"_targets/user/data/hello.txt\""
2 hello "readLines(hello_file)"
Notice that even though we only specified one target in the pipeline
(hello
, with tar_file_read()
), the pipeline
actually includes two targets, hello_file
and hello
.
That is because tar_file_read()
is a special function
called a target factory, so-called because it makes
multiple targets at once. One of the main purposes of
the tarchetypes
package is to provide target factories to
make writing pipelines easier and less error-prone.
Non-standard evaluation
What is the deal with the !!.x
? That may look unfamiliar
even if you are used to using R. It is known as “non-standard
evaluation,” and gets used in some special contexts. We don’t have time
to go into the details now, but just remember that you will need to use
this special notation with tar_file_read()
. If you forget
how to write it (this happens frequently!) look at the examples in the
help file by running ?tar_file_read
.
Other data loading functions
Although we used readLines()
as an example here, you can
use the same pattern for other functions that load data from external
files, such as readr::read_csv()
,
xlsx::read_excel()
, and others (for example,
read_csv(!!.x)
, read_excel(!!.x)
, etc.).
This is generally recommended so that your pipeline stays up to date with your input data.
Challenge: Use tar_file_read()
with the penguins example
We didn’t know about tar_file_read()
yet when we started
on the penguins bill analysis.
How can you use tar_file_read()
to load the CSV file
while tracking its contents?
R
source("R/packages.R")
source("R/functions.R")
tar_plan(
tar_file_read(
penguins_data_raw,
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
penguins_data = clean_penguin_data(penguins_data_raw)
)
OUTPUT
▶ dispatched target penguins_data_raw_file
● completed target penguins_data_raw_file [0.001 seconds, 53.098 kilobytes]
▶ dispatched target penguins_data_raw
● completed target penguins_data_raw [0.099 seconds, 10.403 kilobytes]
▶ dispatched target penguins_data
● completed target penguins_data [0.015 seconds, 1.495 kilobytes]
▶ ended pipeline [0.369 seconds]
Writing out data
Writing to files is similar to loading in files: we will use the
tar_file()
function. There is one important caveat: in this
case, the second argument of tar_file()
(the command to
build the target) must return the path to the file. Not
all functions that write files do this (some return nothing; these treat
the output file is a side-effect of running the function), so you may
need to define a custom function that writes out the file and then
returns its path.
Let’s do this for writeLines()
, the R function that
writes character data to a file. Normally, its output would be
NULL
(nothing), as we can see here:
R
x <- writeLines("some text", "test.txt")
x
OUTPUT
NULL
Here is our modified function that writes character data to a file
and returns the name of the file (the ...
means “pass the
rest of these arguments to writeLines()
”):
R
write_lines_file <- function(text, file, ...) {
writeLines(text = text, con = file, ...)
file
}
Let’s try it out:
R
x <- write_lines_file("some text", "test.txt")
x
OUTPUT
[1] "test.txt"
We can now use this in a pipeline. For example let’s change the text to upper case then write it out again:
R
library(targets)
library(tarchetypes)
source("R/functions.R")
tar_plan(
tar_file_read(
hello,
"_targets/user/data/hello.txt",
readLines(!!.x)
),
hello_caps = toupper(hello),
tar_file(
hello_caps_out,
write_lines_file(hello_caps, "_targets/user/results/hello_caps.txt")
)
)
OUTPUT
▶ dispatched target hello_file
● completed target hello_file [0 seconds, 26 bytes]
▶ dispatched target hello
● completed target hello [0 seconds, 78 bytes]
▶ dispatched target hello_caps
● completed target hello_caps [0.001 seconds, 78 bytes]
▶ dispatched target hello_caps_out
● completed target hello_caps_out [0 seconds, 26 bytes]
▶ ended pipeline [0.111 seconds]
Take a look at hello_caps.txt
in the
results
folder and verify it is as you expect.
Challenge: What happens to file output if its modified?
Delete or change the contents of hello_caps.txt
in the
results
folder. What do you think will happen when you run
tar_make()
again? Try it and see.
targets
detects that hello_caps_out
has
changed (is “invalidated”), and re-runs the code to make it, thus
writing out hello_caps.txt
to results
again.
So this way of writing out results makes your pipeline more robust:
we have a guarantee that the contents of the file in
results
are generated solely by the code in your plan.
Key Points
-
tarchetypes::tar_file()
tracks the contents of a file - Use
tarchetypes::tar_file_read()
in combination with data loading functions likeread_csv()
to keep the pipeline in sync with your input data - Use
tarchetypes::tar_file()
in combination with a function that writes to a file and returns its path to write out data