Parallel Processing
Last updated on 2024-09-24 | Edit this page
Overview
Questions
- How can we build targets in parallel?
Objectives
- Be able to build targets in parallel
Once a pipeline starts to include many targets, you may want to think about parallel processing. This takes advantage of multiple processors in your computer to build multiple targets at the same time.
When to use parallel processing
Parallel processing should only be used if your workflow has independent tasks—if your workflow only consists of a linear sequence of targets, then there is nothing to parallelize. Most workflows that use branching can benefit from parallelism.
targets
includes support for high-performance computing,
cloud computing, and various parallel backends. Here, we assume you are
running this analysis on a laptop and so will use a relatively simple
backend. If you are interested in high-performance computing, see the
targets
manual.
Set up workflow
To enable parallel processing with crew
you only need to
load the crew
package, then tell targets
to
use it using tar_option_set
. Specifically, the following
lines enable crew, and tells it to use 2 parallel workers. You can
increase this number on more powerful machines:
R
library(crew)
tar_option_set(
controller = crew_controller_local(workers = 2)
)
Make these changes to the penguins analysis. It should now look like this:
R
source("R/functions.R")
source("R/packages.R")
# Set up parallelization
library(crew)
tar_option_set(
controller = crew_controller_local(workers = 2)
)
tar_plan(
# Load raw data
tar_file_read(
penguins_data_raw,
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
# Clean data
penguins_data = clean_penguin_data(penguins_data_raw),
# Build models
models = list(
combined_model = lm(
bill_depth_mm ~ bill_length_mm, data = penguins_data),
species_model = lm(
bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
interaction_model = lm(
bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
),
# Get model summaries
tar_target(
model_summaries,
glance_with_mod_name(models),
pattern = map(models)
),
# Get model predictions
tar_target(
model_predictions,
augment_with_mod_name(models),
pattern = map(models)
)
)
There is still one more thing we need to modify only for the purposes of this demo: if we ran the analysis in parallel now, you wouldn’t notice any difference in compute time because the functions are so fast.
So let’s make “slow” versions of glance_with_mod_name()
and augment_with_mod_name()
using the
Sys.sleep()
function, which just tells the computer to wait
some number of seconds. This will simulate a long-running computation
and enable us to see the difference between running sequentially and in
parallel.
Add these functions to functions.R
(you can copy-paste
the original ones, then modify them):
R
glance_with_mod_name_slow <- function(model_in_list) {
Sys.sleep(4)
model_name <- names(model_in_list)
model <- model_in_list[[1]]
broom::glance(model) |>
mutate(model_name = model_name)
}
augment_with_mod_name_slow <- function(model_in_list) {
Sys.sleep(4)
model_name <- names(model_in_list)
model <- model_in_list[[1]]
broom::augment(model) |>
mutate(model_name = model_name)
}
Then, change the plan to use the “slow” version of the functions:
R
source("R/functions.R")
source("R/packages.R")
# Set up parallelization
library(crew)
tar_option_set(
controller = crew_controller_local(workers = 2)
)
tar_plan(
# Load raw data
tar_file_read(
penguins_data_raw,
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
# Clean data
penguins_data = clean_penguin_data(penguins_data_raw),
# Build models
models = list(
combined_model = lm(
bill_depth_mm ~ bill_length_mm, data = penguins_data),
species_model = lm(
bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
interaction_model = lm(
bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
),
# Get model summaries
tar_target(
model_summaries,
glance_with_mod_name_slow(models),
pattern = map(models)
),
# Get model predictions
tar_target(
model_predictions,
augment_with_mod_name_slow(models),
pattern = map(models)
)
)
Finally, run the pipeline with tar_make()
as normal.
OUTPUT
✔ skip target penguins_data_raw_file
✔ skip target penguins_data_raw
✔ skip target penguins_data
✔ skip target models
• start branch model_predictions_5ad4cec5
• start branch model_predictions_c73912d5
• start branch model_predictions_91696941
• start branch model_summaries_5ad4cec5
• start branch model_summaries_c73912d5
• start branch model_summaries_91696941
• built branch model_predictions_5ad4cec5 [4.884 seconds]
• built branch model_predictions_c73912d5 [4.896 seconds]
• built branch model_predictions_91696941 [4.006 seconds]
• built pattern model_predictions
• built branch model_summaries_5ad4cec5 [4.011 seconds]
• built branch model_summaries_c73912d5 [4.011 seconds]
• built branch model_summaries_91696941 [4.011 seconds]
• built pattern model_summaries
• end pipeline [15.153 seconds]
Notice that although the time required to build each individual target is about 4 seconds, the total time to run the entire workflow is less than the sum of the individual target times! That is proof that processes are running in parallel and saving you time.
The unique and powerful thing about targets is that we did not need to change our custom function to run it in parallel. We only adjusted the workflow. This means it is relatively easy to refactor (modify) a workflow for running sequentially locally or running in parallel in a high-performance context.
Now that we have demonstrated how this works, you can change your analysis plan back to the original versions of the functions you wrote.
Key Points
- Dynamic branching creates multiple targets with a single command
- You usually need to write custom functions so that the output of the branches includes necessary metadata
- Parallel computing works at the level of the workflow, not the function