Content from Running commands with Maestro


Last updated on 2024-08-22

Overview

Questions

  • “How do I run a simple command with Maestro?”

Objectives

  • “Create a Maestro YAML file”

What is the workflow I’m interested in?


In this lesson we will design an experiment that takes an application that runs in parallel and investigates its scalability. To do that we will need to gather data; in this case, that means running the application multiple times with different numbers of CPU cores and recording the execution time. Once we've done that, we need to create a visualization of the data to see how it compares against the ideal case.

From the visualization we can then decide at what scale it makes the most sense to run the application in production to maximize the use of our CPU allocation on the system.

We could do all of this manually, but there are useful tools to help us manage data analysis pipelines like we have in our experiment. Today we’ll learn about one of those: Maestro.

In order to get started with Maestro, let's begin by taking a simple command and seeing how we can run it via Maestro. Let's choose the command hostname, which prints out the name of the host where the command is executed:

BASH

hostname

OUTPUT

pascal83

That prints out the result but Maestro relies on files to know the status of your workflow, so let’s redirect the output to a file:

BASH

janeh@pascal83:~$ hostname > hostname_login.txt

Writing a Maestro YAML


Edit a new text file named hostname.yaml. The extension refers to YAML, a recursive initialism for "YAML Ain't Markup Language" and a popular format for configuration files and key-value data serialization. For more, see the Wikipedia page, esp. YAML Syntax.

Contents of hostname.yaml (spaces matter!):

YML

description:
  name: Hostnames
  description: Report a node's hostname.

study:
  - name: hostname-login
    description: Write the login node's hostname to a file.
    run:
      cmd: |
        hostname > hostname_login.txt

Key points about this file

  1. The name of hostname.yaml is not very important; it gives us information about file contents and type, but maestro will behave the same if you rename it to hostname or foo.txt.
  2. The file specifies fields in a hierarchy. For example, name, description, and run all belong to a single step under study and are at the same level in the hierarchy. description and study are both at the top level in the hierarchy.
  3. Indentation indicates the hierarchy and should be consistent. For example, all the fields passed directly to study are indented relative to study and their indentation is all the same.
  4. The commands executed during the study are given under cmd. Starting this entry with | and a newline character allows us to specify multiple commands; see the sketch just after this list.
  5. The example YAML file above is pretty minimal; all fields shown are required.
  6. The names given to study can include letters, numbers, and special characters.
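
For example, a single step's cmd can chain several shell commands. Here is a minimal sketch of just the study block; the step name hostname-and-date and the file info.txt are made up for illustration:

YML

study:
  - name: hostname-and-date
    description: Write the hostname and the date to a file.
    run:
      cmd: |
        hostname > info.txt
        date >> info.txt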

Back in the shell we’ll run our new rule. At this point, we may see an error if a required field is missing or if our indentation is inconsistent.

BASH

janeh@pascal83:~$ maestro run hostname.yaml

bash: maestro: command not found...

If your shell tells you that it cannot find the command maestro, then we need to make the software available somehow. In our case, this means activating the Python virtual environment where Maestro is installed.

BASH

source /usr/global/docs/training/janeh/maestro_venv/bin/activate

You can tell this command has already been run when (maestro_venv) appears before your command prompt:

BASH

janeh@pascal83:~$ source /usr/global/docs/training/janeh/maestro_venv/bin/activate
(maestro_venv) janeh@pascal83:~$

Now that the maestro_venv virtual environment has been activated, the maestro command should be available, but let's double-check:

BASH

(maestro_venv) janeh@pascal83:~$ which maestro

OUTPUT

/usr/global/docs/training/janeh/maestro_venv/bin/maestro

Running maestro


Once you have maestro available to you, run maestro run hostname.yaml and enter y when prompted:

BASH

(maestro_venv) janeh@pascal83:~$ maestro run hostname.yaml

OUTPUT

[2024-03-20 15:39:34: INFO] INFO Logging Level -- Enabled
[2024-03-20 15:39:34: WARNING] WARNING Logging Level -- Enabled
[2024-03-20 15:39:34: CRITICAL] CRITICAL Logging Level -- Enabled
[2024-03-20 15:39:34: INFO] Loading specification -- path = hostname.yaml
[2024-03-20 15:39:34: INFO] Directory does not exist. Creating directories to ~/Hostnames_20240320-153934/logs
[2024-03-20 15:39:34: INFO] Adding step 'hostname-login' to study 'Hostnames'...
[2024-03-20 15:39:34: INFO]
------------------------------------------
Submission attempts =       1
Submission restart limit =  1
Submission throttle limit = 0
Use temporary directory =   False
Hash workspaces =           False
Dry run enabled =           False
Output path =               ~/Hostnames_20240320-153934
------------------------------------------
Would you like to launch the study? [yn] y
Study launched successfully.

Then look at the outputs. You should have a new directory whose name includes a date and timestamp and that starts with the name given under description at the top of hostname.yaml.

In that directory will be a subdirectory for every study run from hostname.yaml. The subdirectories for each study include all output files for that study.

BASH

(maestro_venv) janeh@pascal83:~$ cd Hostnames_20240320-153934/
(maestro_venv) janeh@pascal83:~/Hostnames_20240320-153934$ ls

OUTPUT

batch.info      Hostnames.pkl        Hostnames.txt  logs  status.csv
hostname-login  Hostnames.study.pkl  hostname.yaml  meta

BASH

(maestro_venv) janeh@pascal83:~/Hostnames_20240320-153934$ cd hostname-login/
(maestro_venv) janeh@pascal83:~/Hostnames_20240320-153934/hostname-login$ ls

OUTPUT

hostname-login.2284862.err  hostname-login.2284862.out  hostname-login.sh  hostname_login.txt

Challenge

To which file will the login node’s hostname, pascal83, be written?

  1. hostname-login.2284862.err
  2. hostname-login.2284862.out
  3. hostname-login.sh
  4. hostname_login.txt
The answer is 4: hostname_login.txt.

In the original hostname.yaml file that we ran, we specified that hostname would be written to hostname_login.txt, and this is where we’ll see that output, if the run worked!

Challenge

This one is tricky! In the example above, pascal83 was written to ~/Hostnames_{date}_{time}/hostname-login/hostname_login.txt.

Where would Hello be written for the following YAML?

YML

description:
    name: MyHello
    description: Report a node's hostname.

study:
    - name: give-salutation
      description: Write the login node's hostname to a file
      run:
          cmd: |
            echo "hello" > greeting.txt
  1. ~/give-salutation_{date}_{time}/greeting/greeting.txt
  2. ~/greeting_{date}_{time}/give_salutation/greeting.txt
  3. ~/MyHello_{date}_{time}/give-salutation/greeting.txt
  4. ~/MyHello_{date}_{time}/greeting/greeting.txt
The answer is 3: ~/MyHello_{date}_{time}/give-salutation/greeting.txt.

The top-level folder created starts with the name field under description; here, that’s MyHello. Its subdirectory is named after the study; here, that’s give-salutation. The file created is greeting.txt and this stores the output of echo "hello".
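
After a real (non-dry) run of this YAML, you could confirm the answer with something like the following, substituting the actual date and time of your run:

BASH

cat ~/MyHello_{date}_{time}/give-salutation/greeting.txt

OUTPUT

hello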

Callout

After running a workflow with Maestro, you can check the status via maestro status --disable-theme <directory name>. For example, for the directory Hostnames_20240821-165341 created via maestro run hostname.yaml:

BASH

maestro status --disable-theme Hostnames_20240821-165341

OUTPUT

                               Study: /usr/WS1/janeh/maestro-tut/Hostnames_20240821-165341
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃            ┃         ┃           ┃          ┃            ┃ Elapsed   ┃            ┃ Submit    ┃            ┃ Number    ┃
┃ Step Name  ┃ Job ID  ┃ Workspace ┃ State    ┃ Run Time   ┃ Time      ┃ Start Time ┃ Time      ┃ End Time   ┃ Restarts  ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ hostname-l │ 2593210 │ hostname- │ FINISHED │ 0d:00h:00m │ 0d:00h:00 │ 2024-08-21 │ 2024-08-2 │ 2024-08-21 │ 0         │
│ ogin       │         │ login     │          │ :01s       │ m:01s     │ 16:53:44   │ 1         │ 16:53:45   │           │
│            │         │           │          │            │           │            │ 16:53:44  │            │           │
└────────────┴─────────┴───────────┴──────────┴────────────┴───────────┴────────────┴───────────┴────────────┴───────────┘

Key Points

  • You execute maestro run with a YAML file including information about your run.
  • Your run includes a description and at least one study (a step in your run).
  • Your maestro run creates a directory with subdirectories and outputs for each study.
  • Check the status of a run via maestro status --disable-theme <directory>.

Content from Running Maestro on the cluster


Last updated on 2024-06-06

Callout

If you’ve opened a new terminal, make sure Maestro is available.

BASH

source /usr/global/docs/training/janeh/maestro_venv/bin/activate

How do I run Maestro on the cluster?


What happens when we want to run on the cluster ("to run a batch job") rather than the login node? The cluster we are using runs Slurm, and Maestro has built-in support for Slurm. We just need to tell Maestro which resources we need Slurm to grab for our run.

First, we need to add a batch block to our YAML file, where we'll provide the names of the machine, bank, and queue in which our jobs should run.

YML

batch:
    type: slurm
    host: ruby          # enter the machine you'll run on
    bank: guests        # enter the bank to charge
    queue: pbatch       # partition in which your job should run
    reservation: HPCC1B # reservation for this workshop

Second, we need to specify the number of nodes, number of processes, and walltime for each step in our YAML file. This information goes under each step’s run field. Here we specify 1 node, 1 process, and a time limit of 30 seconds:

YML

(...)
    run:
        cmd: |
          hostname >> hostname.txt
        nodes: 1
        procs: 1
        walltime: "00:00:30"

Whereas run previously held only info about the command we wanted to execute, steps run on the cluster include a specification of the resources needed in order to execute. Note that the format of the walltime includes quotation marks – "{Hours}:{Minutes}:{Seconds}".
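
For example, a limit of 1 hour and 30 minutes would be written as follows:

YML

          walltime: "01:30:00"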

With these changes, our updated YAML file might look like

YML

---
description:
    name: Hostnames
    description: Report a node's hostname.

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: hostname-login
      description: Write the login node's hostname to a file
      run:
          cmd: |
            hostname > hostname_login.txt
    - name: hostname_batch
      description: Write the node's hostname to a file
      run:
          cmd: |
            hostname >> hostname.txt
          nodes: 1
          procs: 1
          walltime: "00:00:30"

Note that we left the rule hostname-login as is. Because we do not specify any info for Slurm under this original step's run field – like nodes, processes, or walltime – this step will continue running on the login node, and only hostname_batch will be handed off to Slurm.

Running on the cluster

Modify your YAML file, hostname.yaml, to execute hostname on the cluster. Run with 1 node and 1 process using the bank guests on the partition pbatch on ruby.

If you run this multiple times, do you always run on the same node? (Is the hostname printed always the same?)

The contents of hostname.yaml should look something like:

YML

description:
    name: Hostnames
    description: Report a node's hostname.

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: hostname-login
      description: Write the login node's hostname to a file
      run:
          cmd: |
              hostname > hostname_login.txt
    - name: hostname_batch
      description: Write the node's hostname to a file
      run:
          cmd: |
               hostname >> hostname.txt
          nodes: 1
          procs: 1
          walltime: "00:00:30"

Go ahead and run the job (here the updated YAML was saved as batch-hostname.yaml):

BASH

maestro run batch-hostname.yaml

A directory called Hostnames_... will be created. If you look in the subdirectory hostname_batch, you'll find a file called hostname.txt with info about the compute node where the hostname command ran. If you run the job multiple times, you will probably land on different nodes; this means you'll see different node numbers in different hostname.txt files. If you see the same number more than once, don't worry! (If you want to double-check that the hostnames printed are not for login nodes, you can run nodeattr -c login to check the IDs of all login nodes on the system.)

Outputs from a batch job


When running in batch, maestro run ... will create a new directory with the same naming scheme as seen in episode 1, and that directory will contain subdirectories for all studies. The hostname_batch subdirectory has four output files, but this time the file ending with extension .sh is a Slurm submission script:

BASH

cd Hostnames_20240320-170150/hostname_batch
ls

OUTPUT

hostname.err  hostname.out  hostname.slurm.sh  hostname.txt

BASH

cat hostname.slurm.sh

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=pvis
#SBATCH --account=lc
#SBATCH --time=00:00:30
#SBATCH --job-name="hostname"
#SBATCH --output="hostname.out"
#SBATCH --error="hostname.err"
#SBATCH --comment "Write the node's hostname to a file"

hostname > hostname.txt

Maestro uses the info from your YAML file to write this script and then submits it to the scheduler for you. Soon after you run on the cluster via maestro run batch-hostname.yaml, you should be able to see the job running or finishing up in the queue with the command squeue -u «your username».

BASH

cd ~
maestro run batch-hostname.yaml

OUTPUT

[2024-03-20 17:31:37: INFO] INFO Logging Level -- Enabled
[2024-03-20 17:31:37: WARNING] WARNING Logging Level -- Enabled
[2024-03-20 17:31:37: CRITICAL] CRITICAL Logging Level -- Enabled
[2024-03-20 17:31:37: INFO] Loading specification -- path = batch-hostname.yaml
[2024-03-20 17:31:37: INFO] Directory does not exist. Creating directories to ~/Hostnames_20240320-173137/logs
[2024-03-20 17:31:37: INFO] Adding step 'hostname-login' to study 'Hostnames'...
[2024-03-20 17:31:37: INFO] Adding step 'hostname_batch' to study 'Hostnames'...
[2024-03-20 17:31:37: INFO]
------------------------------------------
Submission attempts =       1
Submission restart limit =  1
Submission throttle limit = 0
Use temporary directory =   False
Hash workspaces =           False
Dry run enabled =           False
Output path =               ~/Hostnames_20240320-173137
------------------------------------------
Would you like to launch the study? [yn] y
Study launched successfully.

BASH

squeue -u janeh

OUTPUT

 JOBID  PARTITION      NAME   USER  ST  TIME   NODES  NODELIST(REASON)
718308       pvis  hostname  janeh   R   0:01      1  pascal13

Key Points

  • You can run on the cluster by including info for Slurm in your Maestro YAML file.
  • Maestro generates and submits its own batch scripts to your scheduler.
  • Steps without Slurm parameters will run on the login node by default.

Content from MPI applications and Maestro


Last updated on 2024-06-04

Overview

Questions

  • “How do I run an MPI application via Maestro on the cluster?”

Objectives

  • “Define rules to run parallel applications on the cluster”

Callout

If you’ve opened a new terminal, make sure Maestro is available.

BASH

source /usr/global/docs/training/janeh/maestro_venv/bin/activate

Now it’s time to start getting back to our real workflow. We can execute a command on the cluster, but how do we effectively leverage cluster resources and actually run in parallel? In this episode, we’ll learn how to execute an application that can be run in parallel.

Our application is called amdahl and is available in the Python virtual environment we're already using for Maestro. Check that you have access to this binary by running which amdahl at the command line. You should see something like

BASH

which amdahl

OUTPUT

/usr/global/docs/training/janeh/maestro_venv/bin/amdahl

We’ll use this binary to see how efficiently we can run a parallel program on our cluster – i.e. how the amount of work done per processor changes as we use more processors.

Challenge

Without using Maestro, simply run amdahl on your login node by typing amdahl on the command line and hitting enter.

BASH

amdahl

What is amdahl doing?

You should see output that looks roughly like

BASH

amdahl

OUTPUT

Doing 30.000000 seconds of 'work' on 1 processor,
which should take 30.000000 seconds with 0.800000
parallel proportion of the workload.

  Hello, World! I am process 0 of 1 on pascal83.
  I will do all the serial 'work' for 5.243022 seconds.

  Hello, World! I am process 0 of 1 on pascal83.
  I will do parallel 'work' for 25.233023 seconds.

Total execution time (according to rank 0): 30.537750 seconds

In short, this program prints the amount of time spent working serially and the amount of time it spends on the parallel section of the code. On the login node, only a single task is created, so there shouldn't be any speedup from running in parallel, but soon we'll use more tasks to run this program!
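
This behavior follows Amdahl's law: if a workload takes w seconds on one processor and a proportion P of it can run in parallel, the expected time on n processors is roughly

    T(n) = (1 - P) * w + (P * w) / n

With w = 30 and P = 0.8, one processor gives T(1) = 6 + 24 = 30 seconds, matching the output above (the small differences come from random jitter that amdahl adds to the work times).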

In the last challenge, we saw how the amdahl executable behaves when run on the login node. In the next challenge, let’s get amdahl running on the login node using maestro.

Challenge

Using what you learned in episode 1, create a Maestro YAML file that runs amdahl on the login node and captures the output in a file.

Your YAML file might be named amdahl.yaml and its contents might look like

YML

description:
    name: Amdahl
    description: Run a parallel program

study:
    - name: amdahl
      description: run the amdahl executable
      run:
          cmd: |
               amdahl >> amdahl.out

and you would run with maestro run amdahl.yaml.

Next, let’s get amdahl running in batch, on a compute node.

Challenge

Update amdahl.yaml from the last challenge so that this workflow runs on a compute node with a single task. Use the examples from episode 2!

Once you've done this, examine the output and verify that only a single task is reporting on its work in your output file.

The contents of amdahl.yaml should now look something like

YML

description:
    name: Amdahl
    description: Run on the cluster

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: amdahl
      description: run on the cluster
      run:
          cmd: |
               amdahl >> amdahl.out
          nodes: 1
          procs: 1
          walltime: "00:00:30"

In your amdahl.out file, you should see that only a single task – task 0 of 1 – is mentioned.

Note – Exact wording for names and descriptions is not important, but should help you to remember what this file and its study are doing.

Challenge

After checking that amdahl.yaml looks similar to the solution above, update the number of nodes and procs each to 2 and rerun maestro run amdahl.yaml. How many processes report their work in the output file?

Your YAML files contents are now updated to

YML

description:
    name: Amdahl
    description: Run on the cluster

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: amdahl
      description: run on the cluster
      run:
          cmd: |
               amdahl >> amdahl.out
          nodes: 2
          procs: 2
          walltime: "00:00:30"

In the study folder, amdahl.slurm.sh will look something like

BASH

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --partition=pdebug
#SBATCH --account=guests
#SBATCH --time=00:01:00
#SBATCH --job-name="amdahl"
#SBATCH --output="amdahl.out"
#SBATCH --error="amdahl.err"
#SBATCH --comment "run Amdahl on the cluster"

amdahl >> amdahl.out

and in amdahl.out, you probably see something like

OUTPUT

Doing 30.000000 seconds of 'work' on 1 processor,
which should take 30.000000 seconds with 0.800000
parallel proportion of the workload.

  Hello, World! I am process 0 of 1 on pascal17.
  I will do all the serial 'work' for 5.324555 seconds.

  Hello, World! I am process 0 of 1 on pascal17.
  I will do parallel 'work' for 22.349517 seconds.

Total execution time (according to rank 0): 27.755552 seconds

Notice that this output refers to only “1 processor” and mentions only one process. We requested two processes, but only a single one reports back! Additionally, we requested two nodes, but only one is mentioned in the above output (pascal17).

So what’s going on?

If your job were really using both of the tasks (and both of the nodes) assigned to it, then both processes would have written to amdahl.out.

The amdahl binary can run in parallel, but it can also run in serial. If we want it to run in parallel, we'll have to tell it so more directly.

Here's the takeaway from the challenges above: It's not enough to have both parallel resources and a binary/executable/program that can run in parallel. We actually need to invoke MPI in order to force our parallel program to use parallel resources.

Maestro and MPI


We didn’t really run an MPI application in the last section as we only ran on one processor. How do we request to run using multiple processes for a single step?

The answer is that we have to tell Slurm that we want to use MPI. In the Intro to HPC lesson, the episodes introducing Slurm and running parallel jobs showed that commands to run in parallel need to use srun. srun talks to MPI and allows multiple processors to coordinate work. A call to srun might look something like

BASH

srun -N {# of nodes} -n {number of processes} amdahl >> amdahl.out

To make this easier, Maestro offers the shorthand $(LAUNCHER). Maestro will replace instances of $(LAUNCHER) with a call to srun, specifying as many nodes and processes as we've already told Slurm we want to use.

Challenge

Update amdahl.yaml to include $(LAUNCHER) in the call to amdahl so that your study’s cmd field includes

BASH

$(LAUNCHER) amdahl >> amdahl.out

Run maestro with the updated YAML and explore the outputs. How many tasks are mentioned in amdahl.out? In the Slurm submission script created by Maestro (included in the same subdirectory as amdahl.out), what text was used to replace $(LAUNCHER)?

The updated YAML should look something like

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl >> amdahl.out
          nodes: 2
          procs: 2
          walltime: "00:00:30"

Your output file Amdahl_.../amdahl/amdahl.out should include “Doing 30.000000 seconds of ‘work’ on 2 processors” and the submission script Amdahl_.../amdahl/amdahl.slurm.sh should include the line srun -n 2 -N 2 amdahl >> amdahl.out. Maestro substituted srun -n 2 -N 2 for $(LAUNCHER)!

Commenting Maestro YAML files

In the solution from the last challenge, the line beginning # is a comment line. Hopefully you are already in the habit of adding comments to your own scripts. Good comments make any script more readable, and this is just as true with our YAML files.

Customizing amdahl output


One more thing about our application amdahl: we ultimately want to process its output to generate our scaling plot. The current output is easy to read but harder to process programmatically. amdahl has an option that makes this easier for us. To see the available options, we can use

BASH

amdahl --help

OUTPUT

usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e]

options:
  -h, --help            show this help message and exit
  -p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
                        Parallel proportion should be a float between 0 and 1
  -w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
                        Total seconds of workload, should be an integer > 0
  -t, --terse           Enable terse output
  -e, --exact           Disable random jitter

The option we are looking for is --terse, which makes amdahl print its output in JSON, a format that is much easier to process. JSON files typically use the extension .json, so let's add that option to our shell command and change the output file's name to match:

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse >> amdahl.json
          nodes: 2
          procs: 2
          walltime: "00:01:30"

There was another parameter for amdahl that caught my eye. amdahl has an option --parallel-proportion (or -p) which we might be interested in changing as it changes the behavior of the code, and therefore has an impact on the values we get in our results. Let’s try specifying a parallel proportion of 90%:

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p .9 >> amdahl.json
          nodes: 2
          procs: 2
          walltime: "00:00:30"

Challenge

Create a YAML file for a value of -p of 0.999 (the default value is 0.8) for the case where we have a single node and 4 parallel processes. Run this workflow with Maestro to make sure your script is working.

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p .999 >> amdahl.json
          nodes: 1
          procs: 4
          walltime: "00:00:30"

Environment variables


Our current directory is probably starting to fill up with directories starting with Amdahl_..., distinguished only by dates and timestamps. It’s probably best to group runs into separate folders to keep things tidy. One way we can do this is by specifying an env section in our YAML file with a variable called OUTPUT_PATH specified in this format:

YML

env:
    variables:
      OUTPUT_PATH: ./Episode3

This env block goes above our study block; env is at the same level of indentation as study. In this case, directories created by runs using this OUTPUT_PATH will all be grouped inside the directory Episode3, to help us group runs by where we are in the lesson.
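
Schematically, the top-level blocks of the updated YAML are ordered like this (a sketch, with (...) standing in for the content shown earlier):

YML

description:
    (...)

batch:
    (...)

env:
    variables:
      OUTPUT_PATH: ./Episode3

study:
    (...)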

Challenge

Modify your YAML so that subsequent runs will be grouped into a shared parent directory (for example, Episode3, as above).

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      OUTPUT_PATH: ./Episode3

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p .999 >> amdahl.json
          nodes: 1
          procs: 4
          walltime: "00:00:30"

Dry-run (--dry) mode


It’s often useful to run Maestro in --dry mode, which causes Maestro to create scripts and the directory structure without actually running jobs. You will see this parameter if you run maestro run --help.

Challenge

Do a couple dry runs using the script created in the last challenge. This should help you verify that a new directory “Episode3” gets created for runs from this episode.

Note: --dry is an input for maestro run, not for amdahl. To do a dry run, you shouldn’t need to update your YAML file at all. Instead, you just run

BASH

maestro run --dry «YAML filename»

After running

BASH

maestro run --dry amdahl.yaml

a directory path of the form Episode3/Amdahl_{DATE}_{TIME}/amdahl should be created.

Key Points

  • “Adding $(LAUNCHER) before commands signals to Maestro to use MPI via srun.”
  • “New Maestro runs can be grouped within a new directory specified by the environment variable OUTPUT_PATH.”
  • You can use --dry to verify that the expected directory structure and scripts are created by a given Maestro YAML file.

Content from Placeholders


Last updated on 2024-08-22

Overview

Questions

  • “How do I make a generic rule?”

Objectives

  • “Learn to use variables as placeholders”
  • “Learn to run many similar Maestro runs at once”

Callout

If you’ve opened a new terminal, make sure Maestro is available.

BASH

source /usr/global/docs/training/janeh/maestro_venv/bin/activate

D.R.Y. (Don’t Repeat Yourself)


Callout

In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python, R, or Java, such as user-defined variables and functions, are useful in part because they mean we don't have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the "Don't Repeat Yourself" principle or D.R.Y.

Maestro YAML files are a form of code and, in any code, repetition can lead to problems (e.g. we rename a data file in one part of the YAML but forget to rename it elsewhere).

In this episode, we’ll set ourselves up with ways to avoid repeating ourselves by using environment variables as placeholders.

Placeholders


Over the course of this lesson, we want to use the amdahl binary to show how the execution time of a program changes with the number of processes used. In our current setup, to run amdahl for multiple values of procs, we would need to run our workflow, change procs, rerun, and so forth. We'd be repeating our workflow a lot, so let's first try fixing that by defining multiple rules.

At the end of our last episode, our amdahl.yaml file contained the sections

YML

(...)

env:
    variables:
      OUTPUT_PATH: ./Episode3

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p .999 >> amdahl.json
          nodes: 1
          procs: 4
          walltime: "00:00:30"

Let’s call our existing step amdahl-1 (name under study) and create a second step called amdahl-2 which is exactly the same, except that it will define procs: 8. While we’re at it, let’s update OUTPUT_PATH so that it is ./Episode4.

The updated part of the script now looks like

YML

(...)

env:
    variables:
      OUTPUT_PATH: ./Episode4

study:
    - name: amdahl-1
      description: run in parallel
      run:
          cmd: |
               $(LAUNCHER) amdahl --terse -p .999 >> amdahl.json
          nodes: 1
          procs: 4
          walltime: "00:00:30"
    - name: amdahl-2
      description: run in parallel
      run:
          cmd: |
               $(LAUNCHER) amdahl --terse -p .999 >> amdahl.json
          nodes: 1
          procs: 8
          walltime: "00:00:30"

Challenge

Update amdahl.yaml to include the new info shown above. Run a dry run to see what your output directory structure looks like.

Now let’s start to get rid of some of the redundancy in our new workflow.

First off, defining the parallel proportion (-p .999) in two places makes our lives harder: if we want to change this value, we have to update it in both places. We can make this easier by using an environment variable.

Let's create another environment variable in the variables section under env. We can define a new parallel proportion as P: .999. Then, under run's cmd for each step, we can call this environment variable with the syntax $(P). $(P) is a placeholder that will be substituted by .999 when Maestro creates a Slurm submission script for us.

Let's also create an environment variable for our output file, amdahl.json, called OUTPUT, and then use that variable in our cmd fields.

Our updated section will now look like this:

YML

(...)

env:
    variables:
      P: .999
      OUTPUT_PATH: ./Episode4
      OUTPUT: amdahl.json

study:
    - name: amdahl-1
      description: run in parallel
      run:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: 4
          walltime: "00:00:30"
    - name: amdahl-2
      description: run in parallel
      run:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: 8
          walltime: "00:00:30"

We've added two new placeholders to make our YAML script a tad more efficient. Note that we had already been using a placeholder given to us by Maestro: $(LAUNCHER) holds the place of a call to srun <insert resource requests>.

Callout

Note that in the context of Maestro, the general term for the “placeholders” declared in the env block is “tokens”.

Challenge

Run your updated amdahl.yaml and check results, to verify your workflow is working with the changes you’ve made so far.

The full YAML text is

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      P: .999
      OUTPUT_PATH: ./Episode4
      OUTPUT: amdahl.json

study:
    - name: amdahl-1
      description: run in parallel
      run:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: 4
          walltime: "00:00:30"
    - name: amdahl-2
      description: run in parallel
      run:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: 8
          walltime: "00:00:30"

Maestro’s global.parameters


We’re almost ready to perform our scaling study – to see how the execution time changes as we use more processors in the job. Unfortunately, we’re still repeating ourselves a lot because, in spite of the environment variables we created, most of the information defined for steps amdahl-1 and amdahl-2 is the same. Only the procs field changes!

A great way to avoid repeating ourselves here is to define a parameter that lists multiple values of tasks and runs a separate job step for each value. We do this by adding a global.parameters section at the bottom of the script. We then define individual parameters within this section. Each parameter includes a list of values (each element is used in its own job step) and a label (the label helps determine how the output directories are named).

This is what it looks like to define a global parameter:

YML

global.parameters:
    TASKS:
        values: [2, 4, 8, 18, 24, 36]
        label: TASKS.%%

Note that the label should include %% as above; the %% is itself a placeholder! The directory created for the output of each job step will be identified by the value of each parameter it used, and the parameter’s value will be inserted to replace the %%.
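
As a small illustration (using a shortened list of values), the following definition would produce one parameterized run of each step that uses $(TASKS), with workspaces named TASKS.2 and TASKS.4:

YML

global.parameters:
    TASKS:
        values: [2, 4]
        label: TASKS.%%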

Next, we should update the line under run defining procs so that it uses the name of the parameter enclosed in $():

YML

          procs: $(TASKS)

If we make this change for steps amdahl-1 and amdahl-2, they will now look exactly the same, so we can simply condense them to one step.

The full YAML file will look like

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      P: .999
      OUTPUT: amdahl.json
      OUTPUT_PATH: ./Episode4

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: $(TASKS)
          walltime: "00:00:30"

global.parameters:
    TASKS:
        values: [2, 4, 8, 18, 24, 36]
        label: TASKS.%%

Challenge

Run maestro run --dry amdahl.yaml using the above YAML file and investigate the resulting directory structure. How does the list of task values under global.parameters change the output directory organization?

Under your current working directory, you should see a directory structure created with the following format – Episode4/Amdahl_<Date>-<Time>/amdahl. Within the amdahl subdirectory, you should see one output directory for each of the values listed for TASKS under global.parameters:

BASH

ls Episode4/Amdahl_<Date>_<Time>/amdahl

OUTPUT

TASKS.18  TASKS.2  TASKS.24  TASKS.36  TASKS.4  TASKS.8

Each TASKS... subdirectory will contain the Slurm submission script to be used if this Maestro job is run:

BASH

ls Episode4/Amdahl_<Date>_<Time>/amdahl/TASKS.18/

OUTPUT

amdahl_TASKS.18.slurm.sh

Run

BASH

maestro run amdahl.yaml

before moving on to the next episode, to generate the results for various task numbers. You’ll be able to see your jobs queuing and running via squeue -u <username>.

Key Points

  • Environment variables are placeholders defined under the env section of a Maestro YAML.
  • Parameters defined under global.parameters require lists of values and labels.

Content from Chaining rules


Last updated on 2024-06-04

Overview

Questions

  • How do I combine steps into a workflow?
  • How do I make a step that uses the outputs from a previous step?

Objectives

  • Create workflow pipelines with dependencies.
  • Use the outputs of one step as inputs to the next.

Callout

If you’ve opened a new terminal, make sure Maestro is available.

BASH

source /usr/global/docs/training/janeh/maestro_venv/bin/activate

A pipeline of multiple rules


In this episode, we will plot the scaling results generated in the last episode using the script plot_terse_amdahl_results.py. These results report how long the work of running amdahl took when using between 2 and 36 processes.

We want to plot these results automatically, i.e. as part of our workflow. In order to do this correctly, we need to make sure that our python plotting script runs only after we have finished calculating all results.

We can control the order and relative timing of when steps execute by defining dependencies.

For example, consider the following YAML, depends.yaml:

YML

description:
    name: Dependency-exploration
    description: Experiment with adding dependencies

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      OUTPUT_PATH: ./Episode5
      OUTPUT: date.txt

study:
    - name: date-login
      description: Write the date and login node's hostname to a file
      run:
          cmd: |
              echo "From login node:" >> $(OUTPUT)
              hostname >> $(OUTPUT)
              date >> $(OUTPUT)
              sleep 10
    - name: date-batch
      description: Write the date and node's hostname to a file
      run:
          cmd: |
              echo "From batch node:" >> $(OUTPUT)
              hostname >> $(OUTPUT)
              date >> $(OUTPUT)
              sleep 10
          nodes: 1
          procs: 1
          walltime: "00:00:30"

This script has two steps, each of which writes the type of node of its host, the particular hostname, and the date/time before waiting for 10 seconds. The step date-login is run on a login node and the step date-batch is run on a “batch” or “compute” node.

Challenge

Run the above script with

BASH

maestro run depends.yaml

You can then cat the output of both steps to stdout via a command of the form

BASH

cat Episode5/Dependency-exploration-{FILL}/date-*/*.txt

where {FILL} should be replaced by the date/time info of this run. This will print the output of the two steps to stdout.

Which step's output is printed first, and how close together do the two steps complete?

You’ll likely see output somewhat similar to

BASH

cat Episode5/Dependency-exploration_20240326-143559/date-*/date.txt

OUTPUT

From batch node:
pascal16
Tue Mar 26 14:36:04 PDT 2024
From login node:
pascal83
Tue Mar 26 14:36:03 PDT 2024

though the times/dates and hostnames will be different. We didn’t try to control the order of operations, so each step will have run as soon as it had resources. Probably you’ll see that the timestamp on the login node is earlier because it will not have to wait for resources from the queue.

Next, let's add a dependency to ensure that the batch step runs before the login node step. To add a dependency, a line with the following format must be added to a step's run block:

YML

        depends: [{STEP NAME}]

{STEP NAME} is replaced by the name of the step from the study that you want the current step to depend upon.

If we update date-login to include a dependency, we’ll see

YML

    - name: date-login
      description: Write the date and login node's hostname to a file
      run:
          cmd: |
              echo "From login node:" >> $(OUTPUT)
              hostname >> $(OUTPUT)
              date >> $(OUTPUT)
              sleep 10
          depends: [date-batch]

Now date-login will not run until date-batch has finished.

Challenge

Update depends.yaml to make date-login wait for date-batch to complete before running. Then rerun maestro run depends.yaml.

How has the output of the two date.txt files changed?

This time, you should see that the date printed from the login node is at least 10 seconds later than the date printed on the batch node. For example, on Pascal I see

BASH

cat Episode5/Dependency-exploration_20240326-150950/date-*/date.txt

OUTPUT

From batch node:
pascal16
Tue Mar 26 15:09:54 PDT 2024
From login node:
pascal83
Tue Mar 26 15:10:53 PDT 2024

Callout

Slurm can also be used to define dependencies. How is using Maestro to define the order of our steps any different?

One difference is that we can use Maestro to order steps that are not seen by Slurm. Above, date-login was run on the login node. It wasn’t submitted to the queue, and Slurm never saw that step. (It wasn’t given a Slurm job ID, for example.) Maestro can control the order of steps running both in and outside the batch queue, whereas Slurm can only enforce dependencies between Slurm-scheduled jobs.

A step that waits for all iterations of its dependency


Let’s return to our Amdahl scaling study and the YAML with which we ended in the last episode:

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      P: .999
      OUTPUT: amdahl.json
      OUTPUT_PATH: ./Episode5

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: $(TASKS)
          walltime: "00:00:30"

global.parameters:
    TASKS:
        values: [2, 4, 8, 18, 24, 36]
        label: TASKS.%%

Ultimately we want to add a plotting step that depends upon amdahl, but for now let’s create a placeholder that will go under study and beneath amdahl:

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          # We'll update this `cmd` later
          cmd: |
               echo "This is where we plot"

Based on what we saw before, we might think that we just need to add

YML

          depends: [amdahl]

to the end of this block. Let’s try this to see what happens.

Challenge

Add the plot step

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          # We'll update this `cmd` later
          cmd: |
               echo "This is where we plot"

          depends: [amdahl]

to your amdahl.yaml. Then perform a dry run via

BASH

maestro run --dry amdahl.yaml

What does the directory structure look like for your plot step?

Doing a dry run with amdahl.yaml (text below) should generate an Amdahl_{Date & time stamp} directory with a subdirectory for the plot step. Within the ‘plot’ subdirectory, there will be several TASKS.%% subdirectories – one for each of the values of TASKS defined under global.parameters in amdahl.yaml. For example,

BASH

pwd

OUTPUT

~/Episode5/Amdahl_20240429-153515

BASH

ls

OUTPUT

amdahl      Amdahl.study.pkl  batch.info  meta                plot
Amdahl.pkl  Amdahl.txt        logs        pascal-amdahl.yaml  status.csv

BASH

ls plot/

OUTPUT

TASKS.18  TASKS.2  TASKS.24  TASKS.36  TASKS.4  TASKS.8

BASH

ls plot/TASKS.18/

OUTPUT

plot_TASKS.18.slurm.sh

Each of these TASKS.%% subdirectories – such as plot/TASKS.18 – represents a separate workflow step that is related to its corresponding amdahl/TASKS.%% workflow step and directory.

The text of the yaml used is below.

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      P: .999
      OUTPUT: amdahl.json
      OUTPUT_PATH: ./Episode5

study:
    - name: amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: $(TASKS)
          walltime: "00:00:30"
    - name: plot
      description: Create a plot from `amdahl` results
      run:
          # We'll update this `cmd` later
          cmd: |
               echo "This is where we plot"
          depends: [amdahl]
global.parameters:
    TASKS:
        values: [2, 4, 8, 18, 24, 36]
        label: TASKS.%%

The takeaway from the above challenge is that a step depending upon a parameterized step becomes parameterized by default. In this case, creating a plotting step that depends on the amdahl step will lead to a series of plots, each of which will use data only from a single run of amdahl.

This is not what we want! Instead, we want to generate a plot that uses data from several runs of amdahl – each of which will use a different number of tasks. This means that plot cannot run until all runs of amdahl have completed.

So the syntax for defining a dependency changes when parameterized steps are involved. To indicate that we want plot to run after ALL amdahl steps, we'll add a _* to the end of the step name:

YML

          depends: [amdahl_*]

Now our new step definition will look like

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          # We'll update this `cmd` later
          cmd: |
               echo "This is where we plot"
          depends: [amdahl_*]

Challenge

Update your amdahl.yaml so that plot runs after amdahl has run with all values of TASKS. Perform a dry run to verify that plot will run only once.

Your new directory structure should look something like

BASH

ls Episode5/Amdahl_20240429-161330

OUTPUT

amdahl      Amdahl.study.pkl  batch.info  meta                plot
Amdahl.pkl  Amdahl.txt        logs        pascal-amdahl.yaml  status.csv

BASH

ls plot

OUTPUT

plot.slurm.sh

In other words, the directory plot will have no subdirectories.

Using the outputs from a previous step


Manually plotting scaling results

In your working directory, you should have a copy of plot_terse_amdahl_results.py. The syntax for running this script is

BASH

python3 plot_terse_amdahl_results.py {output image name} {input json filenames}

Before trying to add this command to our workflow, let's run it manually to see how it works. We can call the output image output.jpg. As for the input names, we can use the .json files created at the end of episode 4. In particular, if you run

BASH

ls Episode4/Amdahl_<Date>_<Time>/amdahl/TASKS.*/amdahl.json

using the <Date> and <Time> of your last run in Episode 4, you should see a list of files named amdahl.json:

BASH

ls Episode4/Amdahl_20240326-155434/amdahl/TASKS.*/amdahl.json

OUTPUT

Episode4/Amdahl_20240326-155434/amdahl/TASKS.18/amdahl.json
Episode4/Amdahl_20240326-155434/amdahl/TASKS.24/amdahl.json
Episode4/Amdahl_20240326-155434/amdahl/TASKS.2/amdahl.json
Episode4/Amdahl_20240326-155434/amdahl/TASKS.36/amdahl.json
Episode4/Amdahl_20240326-155434/amdahl/TASKS.4/amdahl.json
Episode4/Amdahl_20240326-155434/amdahl/TASKS.8/amdahl.json

You can use this same filepath with wildcards to specify this list of JSON files as inputs to our python script:

BASH

python3 plot_terse_amdahl_results.py output.jpg Episode4/Amdahl_<Date>_<Time>/amdahl/TASKS.*/amdahl.json

Challenge

Generate a scaling plot by manually specifying the JSON files produced from a previous run of amdahl.yaml.

The resulting JPEG should look something like

[Figure: Scaling study for 95% parallel task]

Adding plotting to our workflow

Let’s update our plot step in amdahl.yaml to include python plotting rather than a placeholder echo command. We want the updated step to look something like

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 plot_terse_amdahl_results.py output.jpg Episode5/Amdahl_<Date>_<Time>/amdahl/TASKS.*/amdahl.json
          depends: [amdahl_*]

The trouble is that we don't know the exact value of Episode5/Amdahl_<Date>_<Time>/amdahl for a job that we haven't run yet. Luckily, Maestro gives us a placeholder for exactly this path in the current job: $(amdahl.workspace). This is the workspace for the amdahl step, where all outputs for amdahl, including our TASKS.* directories, are written.

This means we can update the plot step as follows:

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 plot_terse_amdahl_results.py output.jpg $(amdahl.workspace)/TASKS.*/amdahl.json
          depends: [amdahl_*]

Callout

Where does the plot_terse_amdahl_results.py script live? In Maestro, $(SPECROOT) specifies the root directory from which you originally ran maestro run .... This is where plot_terse_amdahl_results.py should live, so let’s be more precise:

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 $(SPECROOT)/plot_terse_amdahl_results.py output.jpg $(amdahl.workspace)/TASKS.*/amdahl.json
          depends: [amdahl_*]

Challenge

Update amdahl.yaml so that

  • one step definition runs amdahl for 50% parallelizable code using [2, 4, 8, 16, 32] tasks
  • a second step plots the results.

Your YAML file should look something like

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      P: .5
      OUTPUT: amdahl.json
      OUTPUT_PATH: ./Episode5

study:
    - name: run-amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: $(TASKS)
          walltime: "00:01:30"
    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 $(SPECROOT)/plot_terse_amdahl_results.py output.jpg $(run-amdahl.workspace)/TASKS.*/amdahl.json
          depends: [run-amdahl_*]

global.parameters:
    TASKS:
        values: [2, 4, 8, 16, 32]
        label: TASKS.%%

Errors are normal

Don’t be disheartened if you see errors when first testing your new Maestro pipelines. There is a lot that can go wrong when writing a new workflow, and you’ll normally need several iterations to get things just right. Luckily, Maestro will do some checks for consistency at the outset of the run. If you specify a dependency that doesn’t exist (because of a rename, for example), the job will fail before submitting work to the queue.

Key Points

  • We can control the order of steps run in a study by creating dependencies.
  • You can create a dependency with the depends: [{step name}] syntax.
  • Dependency syntax changes to depends: [{step name}_*] for parameterized steps.

Content from Multiple parameters


Last updated on 2024-06-04

Overview

Questions

  • “How do I specify multiple parameters?”
  • “How do multiple parameters interact?”

Objectives

  • Create scaling results for different proportions of parallelizable code.
  • Create and compare scalability plots for codes with different amounts of parallel work.

Callout

If you’ve opened a new terminal, make sure Maestro is available.

BASH

source /usr/global/docs/training/janeh/maestro_venv/bin/activate

Adding a second parameter


In this episode, we want to vary P, the fraction of parallel code, as part of our workflow. To do this, we will add a second entry under global.parameters and remove the definition for P under env:

YML

(...)
env:
    variables:
      OUTPUT: amdahl.json
      OUTPUT_PATH: ./Episode6

(...)

global.parameters:
    TASKS:
        values: [2, 4, 8, 16, 32]
        label: TASKS.%%
    P:
        values: [<Insert values>]
        label: P.%%

How many values do we want to include for P? We need to have the same number of values listed for all global.parameters. This means that to run the previous scaling study with P=.5 and five values for TASKS, our global parameters section would specify .5 for P five times:

YML

global.parameters:
    TASKS:
        values: [2, 4, 8, 16, 32]
        label: TASKS.%%
    P:
        values: [.5, .5, .5, .5, .5]
        label: P.%%

Let’s say we want to perform the same scaling study for a second value of P, .99. This means that we’d have to repeat the same 5 values for TASKS and then provide the second value of P 5 times:

YML

global.parameters:
    TASKS:
        values: [2, 4, 8, 16, 32, 2, 4, 8, 16, 32]
        label: TASKS.%%
    P:
        values: [.5, .5, .5, .5, .5, .99, .99, .99, .99, .99]
        label: P.%%

At this point, our entire YAML file should look something like

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      OUTPUT: amdahl.json
      OUTPUT_PATH: ./Episode6

study:
    - name: run-amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: $(TASKS)
          walltime: "00:01:30"
    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 $(SPECROOT)/plot_terse_amdahl_results.py output.jpg $(run-amdahl.workspace)/TASKS.*/amdahl.json
          depends: [run-amdahl_*]

global.parameters:
    TASKS:
        values: [2, 4, 8, 16, 32, 2, 4, 8, 16, 32]
        label: TASKS.%%
    P:
        values: [.5, .5, .5, .5, .5, .99, .99, .99, .99, .99]
        label: P.%%

Challenge

Run the workflow above. Do you generate output.jpg? If not, why not?

If you use the YAML above, your run-amdahl steps should work, but your plot step will fail. plot's failure will be evident both because output.jpg will be missing from the plot subdirectory and because the plot.*.err file in the same directory will contain an error:

OUTPUT

Traceback (most recent call last):
  File "plot_terse_amdahl_results.py", line 46, in <module>
    process_files(filenames, output=output)
  File "plot_terse_amdahl_results.py", line 10, in process_files
    with open(filename, 'r') as file:
FileNotFoundError: [Errno 2] No such file or directory: '~/Episode6/Amdahl_20240328-163359/run-amdahl/TASKS.*/amdahl.json'

The problem is that the directory path for .json files has changed. This will be discussed more below!

The trouble with the YAML above is that our output directory structure changed when we added a second global parameter, but we didn’t update the directory path specified under plot.

If we look inside the run-amdahl output folder (identified as $(run-amdahl.workspace) in our workflow YAML), its subdirectory names now include information about both global parameters:

BASH

ls Episode6/Amdahl_20240328-163359/run-amdahl

OUTPUT

P.0.5.TASKS.16  P.0.5.TASKS.8    P.0.99.TASKS.4
P.0.5.TASKS.2   P.0.99.TASKS.16  P.0.99.TASKS.8
P.0.5.TASKS.32  P.0.99.TASKS.2
P.0.5.TASKS.4   P.0.99.TASKS.32

whereas the directory path for our .json files is still specified as $(run-amdahl.workspace)/TASKS.*/amdahl.json under the plot step.

We could get the plot step to work by simply adding a wildcard, *, in front of TASKS so that the path to .json files would be

YML

$(run-amdahl.workspace)/*TASKS.*/amdahl.json

and the definition for plot would be

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 $(SPECROOT)/plot_terse_amdahl_results.py output.jpg $(run-amdahl.workspace)/*TASKS.*/amdahl.json
          depends: [run-amdahl_*]

This would allow plot to terminate happily and to produce an output.jpg file, but that image would plot the output from all .json files as a single line, and we wouldn't be able to tell which data points corresponded to a parallel fraction P=0.5 and which corresponded to P=0.99.

If we can generate two plots – one for each value of P – we’ll more clearly be able to see scaling behavior for these two situations. We can generate two separate plots by calling python3 plot_terse_amdahl_results.py ... on two sets of input files – those in the P.0.5.TASKS* subdirectories of $(run-amdahl.workspace) and those in the P.0.99.TASKS* subdirectories.
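
Manually, that would amount to two separate invocations of the plotting script, roughly like the following (the output image names output_p0.5.jpg and output_p0.99.jpg are illustrative):

BASH

python3 plot_terse_amdahl_results.py output_p0.5.jpg Episode6/Amdahl_<Date>_<Time>/run-amdahl/P.0.5.TASKS.*/amdahl.json
python3 plot_terse_amdahl_results.py output_p0.99.jpg Episode6/Amdahl_<Date>_<Time>/run-amdahl/P.0.99.TASKS.*/amdahl.json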

That means we can generate these two plots by inserting the variable $(P) into the path –

YML

$(run-amdahl.workspace)/P.$(P).TASKS.*/amdahl.json

making the definition for plot

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 $(SPECROOT)/plot_terse_amdahl_results.py output.jpg $(run-amdahl.workspace)/P.$(P).TASKS.*/amdahl.json
          depends: [run-amdahl_*]

Challenge

Modify your workflow as discussed above to generate output plots for two different values of P. Open these plots and verify they are different from each other. How does changing the workflow to generate two separate plots change the directory structure?

(Feel free to use .5 and .99 or to modify to other values of your choosing.)

Your full YAML file should be similar to

YML

description:
    name: Amdahl
    description: Run a parallel program

batch:
    type: slurm
    host: ruby           # machine
    bank: guests         # bank
    queue: pbatch        # partition
    reservation: HPCC1B  # reservation for this workshop

env:
    variables:
      OUTPUT: amdahl.json
      OUTPUT_PATH: ./Episode6

study:
    - name: run-amdahl
      description: run in parallel
      run:
          # Here's where we include our MPI wrapper:
          cmd: |
               $(LAUNCHER) amdahl --terse -p $(P) >> $(OUTPUT)
          nodes: 1
          procs: $(TASKS)
          walltime: "00:01:30"
    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 $(SPECROOT)/plot_terse_amdahl_results.py output.jpg $(run-amdahl.workspace)/P.$(P).TASKS.*/amdahl.json
          depends: [run-amdahl_*]

global.parameters:
    TASKS:
        values: [2, 4, 8, 16, 32, 2, 4, 8, 16, 32]
        label: TASKS.%%
    P:
        values: [.5, .5, .5, .5, .5, .99, .99, .99, .99, .99]
        label: P.%%

Modifying plot to include $(P) caused this step to run twice. As a result, two subdirectories under plot were created – one for each value of P:

BASH

ls Episode6/Amdahl_20240328-171343/plot

OUTPUT

P.0.5  P.0.99

Callout

Instead of modifying the path to our amdahl.json files to $(run-amdahl.workspace)/P.$(P).TASKS.*/amdahl.json, we could have equivalently updated it to $(run-amdahl.workspace)/$(P.label).TASKS.*/amdahl.json.

In other words, P.$(P) is equivalent to $(P.label). Similarly, in Maestro, TASKS.$(TASKS) is equivalent to $(TASKS.label). This syntax works for every global parameter in Maestro.
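
For instance, the plot step from above could equivalently be written as:

YML

    - name: plot
      description: Create a plot from `amdahl` results
      run:
          cmd: |
               python3 $(SPECROOT)/plot_terse_amdahl_results.py output.jpg $(run-amdahl.workspace)/$(P.label).TASKS.*/amdahl.json
          depends: [run-amdahl_*]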

Key Points

  • “Multiple parameters can be defined under global.parameters.”
  • “Lists of values for all global parameters must have the same length; the Nth entries in the lists of values for all global parameters are used in a single job.”