Summary and Setup

Callout

🚧 Under Construction 🚧

This course is still being constructed—please be patient.

Outlining the course

  • Targeted audience (see learner profiles: New HPC users, Research Software Engineers with users on HPC systems, researchers in HPC.NRW)
  • Estimated length and recommended formats (e.g. X full days, X * 2 half days, in-person/online, live-coding)
  • Course intentions (focus on the learners’ perspective!):
    • Speed up research (efficient computations, more work per unit of time, shorter iteration times, “less in the way”)
    • Convey intuition about job sizes: what is considered large, and what is small?
    • Improve batch utilization by matching application requirements to requested hardware (minimal resource requirements, maximum resource utilization)
    • Raise awareness of why it matters not to waste time and energy on a shared system
    • Teach common concepts and terms of performance at a beginner level
    • Take first steps in performance optimization (at the cluster, node, and application level)
  • Course context for learners:
    • Working on HPC Systems (Batch system, shared file systems, software modules, …)
    • Performance of scheduled batch jobs
    • Application performance is addressed only briefly, where it relates to job efficiency; an in-depth treatment is out of scope. The episode “Next Steps” should point towards deeper performance analyses, e.g. with tracers and profilers, and how to get started there

Learning Objectives


After attending this training, participants will be able to:

  • Explain efficiency in the context of High Performance Computing (HPC) systems
  • Use batch system tools and third-party tools to measure job efficiency
  • Distinguish between better- and worse-performing jobs
  • Describe common concepts and terms related to performance on HPC systems
  • Identify hardware components involved in performance considerations
  • Achieve first results in performance optimization of their application
  • Recall next steps to take towards learning performance optimization
Callout

Additional Note

In this course, job efficiency refers to how effectively an application uses the computational resources allocated to it, such as CPU cores, memory, GPUs, runtime, interconnect bandwidth, and energy.
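
As one concrete illustration of this definition, CPU efficiency is commonly expressed as the CPU time a job actually consumed divided by the core time it had allocated (allocated cores multiplied by elapsed wall time). The sketch below computes that ratio in plain Bash. The function name cpu_efficiency and the example numbers are made up for illustration and are not taken from the course material.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the course material):
# CPU efficiency in percent = used CPU seconds / (cores * wall seconds) * 100
cpu_efficiency() {
  local cpu_seconds="$1" cores="$2" wall_seconds="$3"
  # Integer arithmetic only; the result is rounded down to a whole percent.
  echo $(( 100 * cpu_seconds / (cores * wall_seconds) ))
}

# Illustrative numbers: 4 cores allocated for 600 s, 1800 CPU-seconds used.
echo "CPU efficiency: $(cpu_efficiency 1800 4 600)%"
```

The same idea carries over to the other resources in the list, for example peak memory used relative to memory requested.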

Prerequisites


Prerequisite
  • Access to an HPC system
  • Access to an example workload setup
  • Basic understanding of HPC systems including batch schedulers, parallel file systems, and environment modules
  • Ability to submit basic jobs and understand typical HPC execution workflows
  • Knowledge of tools and workflows used in HPC environments:
    • Bash shell scripting
    • Secure remote access and file transfer using SSH and SCP
    • Basic Slurm job scripts and workload management commands (srun, sbatch, squeue, scancel)
    • Version control systems: Git, GitHub, and GitLab

HPC Access

You will need access to an HPC cluster to run the examples in this lesson. Discuss how to find out where to apply for access as a researcher (in general, in the EU, in Germany, in NRW?). To learn how to access and use a compute cluster, refer to the HPC Introduction lessons.

  • Executive summary of typical HPC workflow? Or refer to other HPCC courses that cover this
  • “HPC etiquette”
    • E.g. don’t run benchmarks and other computationally heavy workloads on login nodes. Emphasise their purpose
    • Don’t disturb jobs on shared nodes (note: this phrasing is hard to grasp for newcomers and should be avoided. It may stop them from trying things if they are afraid of breaking something. Maybe this is more the responsibility of admins, and users should just be aware that they may affect other users?)
  • Setup of the example workflow below (next section)
Discussion

Common Software on HPC Systems

Working on an HPC system commonly involves:

  • a batch system used to schedule and manage jobs (e.g. Slurm, PBS Pro, HTCondor, …)
  • a module system to load and manage centrally provided software packages and software versions
  • a secure method to connect to a login node of the cluster, typically using SSH

To log in via SSH, you can use one of the following (remove this since it’s discussed in HPC introduction?):

  • PuTTY
  • ssh in PowerShell
  • ssh in Terminal.app
  • ssh in Terminal

Example Workload: Snowman Raytracer

Throughout the course, we will use an example application to learn workflows and tools for evaluating job performance. The example is a ray tracer used to render a predefined scene. It supports multiple parallelization models, including distributed-memory parallelism using MPI, shared-memory multithreading, and GPU acceleration using CUDA. MPI and multithreading can also be combined. The GPU-accelerated version uses MPI processes primarily for process management and coordination, while all computational work is performed on one or more GPUs.

We do not have to study and understand the example code in detail. After compilation, all necessary options are exposed as separate binaries or through command-line arguments.

We do, however, need to prepare a build environment with all required libraries and build the code with CMake. This is a common occurrence in scientific software as well. Researchers often depend on existing software, and their first interaction with a new project frequently occurs in a situation like this, where they have to build and prepare unfamiliar code. Their first question is typically: “Is this project useful for my research?”

The example application should be prepared in a central location, such as the parallel file system of your cluster, to ensure that it is accessible from multiple worker nodes during distributed job execution.

Let’s get started by cloning the repository:

BASH

# Log in to your cluster via ssh first
mkdir jobefficiencyguide && cd jobefficiencyguide
git clone --recursive https://codeberg.org/HPC-NRW/SnowmanRaytracer.git
cd SnowmanRaytracer
Callout

Do not forget --recursive

Our example project depends on another project, implementing the basic ray tracing methods. This dependency is introduced as a git submodule, so recursive cloning is necessary, otherwise we cannot build the project.
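
If the repository was already cloned without --recursive, the submodule can usually be fetched afterwards with git submodule update --init --recursive instead of cloning again. The sketch below wraps this in a small check. The function name ensure_submodules is hypothetical, and it assumes it is called with the path of an existing clone.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fetch missing submodules of an existing clone.
ensure_submodules() {
  local repo="${1:-.}"
  # In "git submodule status", a leading "-" marks a submodule that is not initialized.
  if git -C "$repo" submodule status --recursive | grep -q '^-'; then
    git -C "$repo" submodule update --init --recursive
  else
    echo "all submodules are initialized"
  fi
}

# Example: run from the directory that contains the clone, if it exists.
if [ -d SnowmanRaytracer/.git ]; then
  ensure_submodules SnowmanRaytracer
fi
```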

CPU Build

The example application can perform computations on CPUs using shared-memory multithreading and distributed-memory parallelism across multiple nodes via MPI communication. To prepare the out-of-source build:

BASH

# Assuming you are still in the SnowmanRaytracer source directory
cd ..
mkdir build && cd build
Dependencies

To build the example, you need to provide the following dependencies:

  • Compiler, e.g. GCC
  • MPI, e.g. OpenMPI
  • CMake
  • Boost
  • libpng

On HPC systems, these dependencies are usually provided by loading software modules that your administrators maintain centrally. The exact module names and required software stacks can vary significantly depending on the configuration of the cluster. In one particular environment, the setup may look like this:

BASH

# Only one example, consult your cluster documentation or ask the instructor or your HPC support
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 Boost/1.83.0 CMake/3.27.6 libpng/1.6.40 buildenv/default
Callout

Software management differs widely on HPC systems

The details of how different compiler and library versions are loaded depend strongly on the configuration of your particular HPC system. Follow the instructor’s guidance or consult your site’s documentation or support staff if you have questions.
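
After loading modules, it can help to confirm that the expected tools are actually on your PATH before starting a build. A minimal sketch follows. The function name missing_tools is hypothetical, and the command names checked (gcc, mpirun, cmake) are assumptions that match the example dependencies above; they may differ on your system. Libraries such as Boost and libpng are not commands and are therefore not covered by this check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every given command that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools gcc mpirun cmake)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so that the names are printed on one line.
  echo "Not found (load the matching modules first):" $missing
else
  echo "Compiler, MPI launcher, and CMake are available."
fi
```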

Building the Software

Typically, it is recommended to build software on the same hardware architecture on which it will later be executed. For HPC systems, you should consider whether the login nodes provide sufficient resources for software compilation and whether they use the same hardware architecture as the worker nodes. Check your cluster documentation for any recommendations!
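
One simple way to compare login and worker nodes is to print a short CPU description on each and compare the results. The sketch below is an assumption about how you might do this, not a prescribed procedure. The function name cpu_fingerprint is hypothetical, and the model name lookup relies on the Linux-specific file /proc/cpuinfo.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a short description of the CPU of the current node.
cpu_fingerprint() {
  echo "arch=$(uname -m)"
  # /proc/cpuinfo is Linux-specific; skip it where it does not exist.
  if [ -r /proc/cpuinfo ]; then
    # Not every architecture reports a "model name" line, so ignore a failed match.
    grep -m 1 'model name' /proc/cpuinfo || true
  fi
}

cpu_fingerprint
```

Run it once on the login node and once inside a job, for example from a job script, and compare the two outputs.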

Here, we will build and test the software in a first Slurm job script, build_snowman.sbatch:

BASH

#!/usr/bin/env bash
#SBATCH --job-name=build-and-test-Snowman
#SBATCH --nodes=1
#SBATCH --ntasks=4

# Prepare your environment with the dependencies
# This will likely look different in your case!
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 Boost/1.83.0 CMake/3.27.6 libpng/1.6.40 buildenv/default

# Assuming you are submitting from the "build" directory
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=OFF ../SnowmanRaytracer

# Building the software in parallel
cmake --build . --parallel

# First test run with 4 MPI processes
mpirun -n 4 ./raytracer -width=800 -height=800 -spp=128 -threads=1 -png=snowman.png
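
To run the job script, it is submitted to the batch system with sbatch. The sketch below assumes the script was saved as build_snowman.sbatch in the build directory and that Slurm writes the job output to its default file slurm-<jobid>.out; your site may configure a different location. The function name submit_and_watch is hypothetical, and the guard lets the snippet exit cleanly on machines where Slurm is not installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: submit a job script and show how to follow the job.
submit_and_watch() {
  local script="$1" jobid
  if ! command -v sbatch >/dev/null 2>&1 || [ ! -f "$script" ]; then
    echo "cannot submit: sbatch or $script not found"
    return 1
  fi
  # --parsable makes sbatch print only the job ID (followed by ";cluster" on some setups).
  jobid="$(sbatch --parsable "$script" | cut -d';' -f1)"
  echo "Submitted job $jobid"
  # Show whether the job is pending or running.
  squeue --jobs="$jobid"
  echo "Output will appear in slurm-${jobid}.out"
}

submit_and_watch build_snowman.sbatch || true
```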
Running the Ray Tracer

The mpirun command from our first test run above:

  • starts the raytracer binary with the prepared scene,
  • computes the ray-traced image with \(N = 4\) MPI processes, each using a single thread (-threads=1),
  • computes \(128 / N = 32\) samples per pixel (-spp=128) in each MPI process,
  • sets the height and width of the resulting image to \(800\) pixels, and finally
  • stores the generated image as snowman.png.

A ray tracer computes the interaction of straight “light rays” with objects placed in a 3D scene. Each object can have different material properties, resulting in different optical effects, such as matte or partially translucent surfaces. Light rays that reach the “camera” contribute to rendering the image by accumulating their effects across all pixels.

Computationally, ray tracing primarily consists of many independent geometric and vector operations, including matrix-vector and matrix-matrix calculations. Because these operations are independent, they can be evaluated in parallel across large numbers of rays and pixels, which allows for different parallelization strategies. One option is to divide the pixels of the final image into regions, with each parallel process computing one region. Another strategy, which is applied here, is to divide the number of samples per pixel (spp) across all parallel processes. Each pixel of the final image is computed from spp samples. For example, with -spp=128 and \(4\) MPI processes, each MPI process computes \(\frac{128}{4}=32\) samples for every pixel of the final image.
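
The two strategies can be compared with a little arithmetic. The sketch below is an illustration using the numbers from the text (128 samples per pixel, an image height of 800 pixels); it is not code from the ray tracer. The function names are hypothetical, and the integer division assumes the values divide evenly. How the application treats a remainder is not covered here.

```bash
#!/usr/bin/env bash
# Illustration only: arithmetic on the numbers from the text,
# not code taken from the ray tracer.

# Strategy used here: every process renders all pixels with spp / N samples each.
samples_per_process() {
  local spp="$1" nprocs="$2"
  echo $(( spp / nprocs ))
}

# Alternative strategy: every process renders height / N image rows with all samples.
rows_per_process() {
  local height="$1" nprocs="$2"
  echo $(( height / nprocs ))
}

for n in 1 2 4 8; do
  echo "N=$n: $(samples_per_process 128 "$n") samples per pixel per process, or $(rows_per_process 800 "$n") rows per process"
done
```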

In addition to MPI-based parallelization, the application also supports shared-memory parallelism through \(T\) threads using the -threads=T parameter. Threads share the same address space and therefore may require less memory overhead than multiple processes.
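
Because MPI processes and threads can be combined, the same number of cores can be used in several ways. The sketch below lists the combinations of \(N\) processes and \(T\) threads that exactly fill a given core count and prints a possible launch line for each. The function name hybrid_layouts is hypothetical. The launch line reuses the options from the first test run, and pairing it with --ntasks=N and --cpus-per-task=T is a common Slurm convention that you should verify against your site's documentation, including how processes and threads are bound to cores.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list process/thread layouts that fill a number of cores.
hybrid_layouts() {
  local cores="$1" n t
  for (( n = 1; n <= cores; n++ )); do
    # Keep only process counts that divide the cores evenly.
    (( cores % n == 0 )) || continue
    t=$(( cores / n ))
    echo "--ntasks=$n --cpus-per-task=$t : mpirun -n $n ./raytracer -width=800 -height=800 -spp=128 -threads=$t -png=snowman.png"
  done
}

hybrid_layouts 4
```

Which layout performs best depends on the application and the hardware, which is the kind of question the later episodes explore.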

CUDA Build

The example application can also utilize NVIDIA GPUs via CUDA. In this case, the ray-tracing computations are performed on GPUs, which provide an ideal environment for this type of workload, provided that the problem size and rendering complexity are sufficiently large to benefit from massive parallelism.

CUDA support requires a separate build, which we will execute in a separate Slurm job. In this case, it may be especially important to build on the target hardware, since your login nodes may not contain the accelerators intended for execution.

Let us prepare the build directory again on our login node:

BASH

# Assuming you are still in the SnowmanRaytracer source directory or inside the CPU build directory
cd ..
mkdir build_gpu && cd build_gpu

In addition to the dependencies listed above, this build relies on CUDA and you may have to load the corresponding modules for your HPC system. The application still uses MPI for process management and coordination, for example, to assign one MPI process per GPU when multiple GPUs are used.

Our build script (build_gpu_snowman_cuda.sbatch) may look like this:

BASH

#!/usr/bin/env bash
#SBATCH --job-name=build-gpu-and-test-Snowman
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --partition=gpu
#SBATCH --gpus=2

# Prepare your environment with the dependencies
# This will likely look different in your case!
module load 2025 GCC/13.2.0 OpenMPI/4.1.6 Boost/1.83.0 CMake/3.27.6 libpng/1.6.40 buildenv/default CUDA/12.6.0

# Assuming you are submitting from the "build_gpu" directory
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=ON ../SnowmanRaytracer

# Building the software in parallel
cmake --build . --parallel

# First test run with 2 MPI processes
export CUDA_VISIBLE_DEVICES=0,1 # Some HPC systems configure GPU visibility automatically
mpirun -n 2 ./raytracer -width=800 -height=800 -spp=128 -threads=1 -png=snowman_gpu.png
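
Since the job starts one MPI process per GPU, it can be useful to check how many GPUs the job actually sees before launching. The sketch below only inspects the CUDA_VISIBLE_DEVICES variable used in the script. The function name visible_gpu_count is hypothetical, and whether your batch system sets this variable for you is site-specific.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count the entries in CUDA_VISIBLE_DEVICES.
visible_gpu_count() {
  # An unset variable means CUDA does not restrict device visibility at all.
  if [ -z "${CUDA_VISIBLE_DEVICES+set}" ]; then
    echo "unrestricted"
    return
  fi
  local entries
  # Split the comma-separated list into an array; an empty value gives zero entries.
  IFS=',' read -r -a entries <<< "$CUDA_VISIBLE_DEVICES"
  echo "${#entries[@]}"
}

echo "GPUs visible to CUDA: $(visible_gpu_count)"
```

On nodes with the NVIDIA driver installed, nvidia-smi -L additionally lists the physical GPUs of the node.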

We are all set to learn about job efficiency!


With the example application in place, we are now ready to explore the many factors that affect job performance. We will repeatedly use this application in different configurations throughout the course, so make sure to keep it in a central location that remains accessible during the entire course.

Acknowledgements


Course created in the context of HPC.NRW.