MPI applications and Maestro
Last updated on 2024-06-04
Estimated time: 50 minutes
Overview
Questions
- “How do I run an MPI application via Maestro on the cluster?”
Objectives
- “Define rules to run parallel applications on the cluster”
Now it’s time to start getting back to our real workflow. We can execute a command on the cluster, but how do we effectively leverage cluster resources and actually run in parallel? In this episode, we’ll learn how to execute an application that can be run in parallel.
Our application is called amdahl and is available in the Python virtual environment we’re already using for maestro.
Check that you have access to this binary by running which amdahl at the command line:
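BASH
which amdahl

You should see something like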
OUTPUT
/usr/global/docs/training/janeh/maestro_venv/bin/amdahl
We’ll use this binary to see how efficiently we can run a parallel program on our cluster – i.e. how the amount of work done per processor changes as we use more processors.
If you run amdahl on the login node (with no arguments), you should see output that looks roughly like
OUTPUT
Doing 30.000000 seconds of 'work' on 1 processor,
which should take 30.000000 seconds with 0.800000
parallel proportion of the workload.
Hello, World! I am process 0 of 1 on pascal83.
I will do all the serial 'work' for 5.243022 seconds.
Hello, World! I am process 0 of 1 on pascal83.
I will do parallel 'work' for 25.233023 seconds.
Total execution time (according to rank 0): 30.537750 seconds
In short, this program prints the amount of time it spends working serially and the amount of time it spends on the parallel section of the code. On the login node, only a single task is created, so there shouldn’t be any speedup from running in parallel, but soon we’ll use more tasks to run this program!
We just saw how the amdahl executable behaves when run on the login node. In the next challenge, let’s get amdahl running on the login node using maestro.
Challenge
Using what you learned in episode 1, create a Maestro YAML file that runs amdahl on the login node and captures the output in a file.
Next, let’s get amdahl running in batch, on a compute node.
Challenge
Update amdahl.yaml from the last challenge so that this workflow runs on a compute node with a single task. Use the examples from episode 2!
Once you’ve done this, examine the output and verify that only a single task is reporting on its work in your output file.
The contents of amdahl.yaml should now look something like
YML
description:
    name: Amdahl
    description: Run on the cluster

batch:
    type: slurm
    host: ruby # machine
    bank: guests # bank
    queue: pbatch # partition
    reservation: HPCC1B # reservation for this workshop

study:
    - name: amdahl
      description: run on the cluster
      run:
          cmd: |
              amdahl >> amdahl.out
          nodes: 1
          procs: 1
          walltime: "00:00:30"
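To launch this study, use maestro run just as in the previous episodes:

BASH
maestro run amdahl.yaml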
In your amdahl.out file, you should see that only a single task – task 0 of 1 – is mentioned.
Note – Exact wording for names and descriptions is not important, but should help you to remember what this file and its study are doing.
Challenge
After checking that amdahl.yaml looks similar to the solution above, update the number of nodes and procs each to 2 and rerun maestro run amdahl.yaml. How many processes report their work in the output file?
Your YAML file’s contents are now updated to
YML
description:
    name: Amdahl
    description: Run on the cluster

batch:
    type: slurm
    host: ruby # machine
    bank: guests # bank
    queue: pbatch # partition
    reservation: HPCC1B # reservation for this workshop

study:
    - name: amdahl
      description: run on the cluster
      run:
          cmd: |
              amdahl >> amdahl.out
          nodes: 2
          procs: 2
          walltime: "00:00:30"
In the study folder, amdahl.slurm.sh will look something like
BASH
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --partition=pbatch
#SBATCH --account=guests
#SBATCH --time=00:00:30
#SBATCH --job-name="amdahl"
#SBATCH --output="amdahl.out"
#SBATCH --error="amdahl.err"
#SBATCH --comment "run Amdahl on the cluster"
amdahl >> amdahl.out
and in amdahl.out, you will probably see something like
OUTPUT
Doing 30.000000 seconds of 'work' on 1 processor,
which should take 30.000000 seconds with 0.800000
parallel proportion of the workload.
Hello, World! I am process 0 of 1 on pascal17.
I will do all the serial 'work' for 5.324555 seconds.
Hello, World! I am process 0 of 1 on pascal17.
I will do parallel 'work' for 22.349517 seconds.
Total execution time (according to rank 0): 27.755552 seconds
Notice that this output refers to only “1 processor” and mentions only one process. We requested two processes, but only a single one reports back! Additionally, we requested two nodes, but only one is mentioned in the above output (pascal17).
So what’s going on?
If your job were really using both tasks and nodes that were assigned to it, then both would have written to amdahl.out. The amdahl binary is enabled to run in parallel, but it’s also able to run in serial. If we want it to run in parallel, we’ll have to tell it so more directly.
Here’s the takeaway from the challenges above: It’s not enough to have both parallel resources and a binary/executable/program that is enabled to run in parallel. We actually need to invoke MPI in order to force our parallel program to use parallel resources.
Maestro and MPI
We didn’t really run an MPI application in the last section as we only ran on one processor. How do we request to run using multiple processes for a single step?
The answer is that we have to tell Slurm that we want to use MPI. In the Intro to HPC lesson, the episodes introducing Slurm and running parallel jobs showed that commands to run in parallel need to use srun. srun talks to MPI and allows multiple processes to coordinate work. A call to srun might look something like
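BASH
srun -n 2 -N 2 amdahl >> amdahl.out

Here -n gives the total number of tasks (processes) to launch and -N the number of nodes to spread them across.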
To make this easier, Maestro offers the shorthand $(LAUNCHER). Maestro will replace instances of $(LAUNCHER) with a call to srun, specifying as many nodes and processes as we’ve already told Slurm we want to use.
Challenge
Update amdahl.yaml to include $(LAUNCHER) in the call to amdahl so that your study’s cmd field includes
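YML
cmd: |
    $(LAUNCHER) amdahl >> amdahl.out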
Run maestro with the updated YAML and explore the outputs. How many tasks are mentioned in amdahl.out? In the Slurm submission script created by Maestro (included in the same subdirectory as amdahl.out), what text was used to replace $(LAUNCHER)?
The updated YAML should look something like
YML
description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby # machine
    bank: guests # bank
    queue: pbatch # partition
    reservation: HPCC1B # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
              $(LAUNCHER) amdahl >> amdahl.out
          nodes: 2
          procs: 2
          walltime: "00:00:30"
Your output file Amdahl_.../amdahl/amdahl.out should include “Doing 30.000000 seconds of ‘work’ on 2 processors” and the submission script Amdahl_.../amdahl/amdahl.slurm.sh should include the line srun -n 2 -N 2 amdahl >> amdahl.out. Maestro substituted srun -n 2 -N 2 for $(LAUNCHER)!
Commenting Maestro YAML files
In the solution from the last challenge, the line beginning with # is a comment line. Hopefully you are already in the habit of adding comments to your own scripts. Good comments make any script more readable, and this is just as true with our YAML files.
Customizing amdahl output
Another thing about our application amdahl is that we ultimately want to process its output to generate our scaling plot. The output right now is useful for reading but makes processing harder. amdahl has an option that actually makes this easier for us. To see the amdahl options, we can run
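BASH
amdahl --help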
OUTPUT
usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e]
options:
-h, --help show this help message and exit
-p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
Parallel proportion should be a float between 0 and 1
-w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
Total seconds of workload, should be an integer > 0
-t, --terse Enable terse output
-e, --exact Disable random jitter
The option we are looking for is --terse, and that will make amdahl print output in a format that is much easier to process: JSON. JSON output is typically stored in files with the extension .json, so let’s add that option to our shell command and change the name of the output file to match our new command:
YML
description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby # machine
    bank: guests # bank
    queue: pbatch # partition
    reservation: HPCC1B # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
              $(LAUNCHER) amdahl --terse >> amdahl.json
          nodes: 2
          procs: 2
          walltime: "00:01:30"
There was another parameter for amdahl that caught my eye. amdahl has an option --parallel-proportion (or -p) which we might be interested in changing, as it changes the behavior of the code and therefore has an impact on the values we get in our results. Let’s try specifying a parallel proportion of 90%:
YML
description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby # machine
    bank: guests # bank
    queue: pbatch # partition
    reservation: HPCC1B # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
              $(LAUNCHER) amdahl --terse -p .9 >> amdahl.json
          nodes: 2
          procs: 2
          walltime: "00:00:30"
Challenge
Create a YAML file for a value of -p of 0.999 (the default value is 0.8) for the case where we have a single node and 4 parallel processes. Run this workflow with Maestro to make sure your script is working.
YML
description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby # machine
    bank: guests # bank
    queue: pbatch # partition
    reservation: HPCC1B # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
              $(LAUNCHER) amdahl --terse -p .999 >> amdahl.json
          nodes: 1
          procs: 4
          walltime: "00:00:30"
Environment variables
Our current directory is probably starting to fill up with directories starting with Amdahl_..., distinguished only by dates and timestamps. It’s probably best to group runs into separate folders to keep things tidy. One way we can do this is by specifying an env section in our YAML file with a variable called OUTPUT_PATH specified in this format:
This env block goes above our study block; env is at the same level of indentation as study. In this case, directories created by runs using this OUTPUT_PATH will all be grouped inside the directory Episode3, to help us group runs by where we are in the lesson.
Challenge
Modify your YAML so that subsequent runs will be grouped into a shared parent directory (for example, Episode3, as above).
YML
description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby # machine
    bank: guests # bank
    queue: pbatch # partition
    reservation: HPCC1B # reservation for this workshop

env:
    variables:
        OUTPUT_PATH: ./Episode3

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
              $(LAUNCHER) amdahl --terse -p .999 >> amdahl.json
          nodes: 1
          procs: 4
          walltime: "00:00:30"
Dry-run (--dry) mode
It’s often useful to run Maestro in --dry mode, which causes Maestro to create scripts and the directory structure without actually running jobs. You will see this parameter if you run maestro run --help.
Challenge
Do a couple of dry runs using the script created in the last challenge. This should help you verify that a new directory “Episode3” gets created for runs from this episode.
Note: --dry is an input for maestro run, not for amdahl. To do a dry run, you shouldn’t need to update your YAML file at all. Instead, you just run
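BASH
maestro run --dry amdahl.yaml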
Key Points
- “Adding $(LAUNCHER) before commands signals to Maestro to use MPI via srun.”
- “New Maestro runs can be grouped within a new directory specified by the environment variable OUTPUT_PATH.”
- “You can use --dry to verify that the expected directory structure and scripts are created by a given Maestro YAML file.”