Pinning
Overview
Questions
- What is “pinning” of job resources?
- How can pinning improve performance?
- How can I see whether pinning resources would help?
- What requirement hints can I give to the scheduler?
Objectives
After completing this episode, participants should be able to …
- Define the concept of “pinning” and how it can affect job performance.
- Name Slurm's options for memory and CPU binding.
- Use hints to tell Slurm how to optimize their job allocation.
Binding / pinning options (combined in the sketch below):
- `--mem-bind=[{quiet|verbose},]<type>`
- `-m, --distribution={*|block|cyclic|arbitrary|plane=<size>}[:{*|block|cyclic|fcyclic}[:{*|block|cyclic|fcyclic}]][,{Pack|NoPack}]`
- `--hint=`: hints for CPU-bound (`compute_bound`) and memory-bound (`memory_bound`) jobs, but also `multithread`, `nomultithread`
- `--cpu-bind=[{quiet|verbose},]<type>` (`srun`): mapping of application <-> job resources
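These options can be combined in a batch script. A minimal sketch, assuming a generic MPI application `./my_app` (a placeholder, not part of this lesson):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8            # 8 MPI ranks
#SBATCH --cpus-per-task=2     # reserve 2 cores per rank
#SBATCH --hint=nomultithread  # use physical cores only, no hardware threads

# Report the binding chosen for every task, then pin each task to its cores
srun --cpu-bind=verbose,cores ./my_app
```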
Motivation
Exercise
Case 1: 1 thread per rank

```bash
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=1 -alloc_mode=3 -png=snowman.png
```

Case 2: 2 threads per rank

```bash
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png
```
Questions:
- Do you notice any difference in runtime between the two cases?
- Is the increase in threads providing a speedup as expected?
- Observation: The computation times are almost the same.
- Expected behavior: Increasing threads should ideally reduce runtime.
- Hypothesis: Additional threads do not contribute.
How to investigate?
You can verify the actual core usage in two ways (both shown below):
1. Use `--report-bindings` with `mpirun`
2. Use the `htop` command on the compute node
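A quick sketch of both options; `<compute-node>` is a placeholder for the node your job actually runs on:

```bash
# Option 1: let mpirun report where each rank is bound at startup
mpirun -n 8 --report-bindings ./raytracer -width=512 -height=512 -spp=128 \
    -threads=2 -alloc_mode=3 -png=snowman.png

# Option 2: while the job is running, watch per-core load on the compute node
ssh <compute-node>
htop   # the per-core meters at the top show which cores are busy
```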
Exercise
Follow either of the options above and run with 2 threads per rank:

```bash
mpirun -n 8 ./raytracer -width=512 -height=512 -spp=128 -threads=2 -alloc_mode=3 -png=snowman.png
```
Questions:
- Did you find any justification for the hypothesis we made?
Only 8 cores are active instead of 16.
Explanation:
- Even though we requested 2 threads per MPI rank, both threads are pinned to the same core.
- The second thread waits for the first thread to finish, so no actual thread-level parallelization is achieved.
How to achieve proper pinning?
Exercise: Understanding Process and Thread Binding
Pinning (or binding) means locking a process or thread to a specific hardware resource such as a CPU core, socket, or NUMA region. Without pinning, the operating system may move tasks between cores, which can reduce cache reuse and increase memory latency, directly diminishing performance.
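The same idea exists outside MPI launchers. A minimal sketch with standard Linux tools (`./my_app` is a placeholder application):

```bash
# Pin a process to core 3 (taskset is part of util-linux)
taskset -c 3 ./my_app

# Restrict a process to the cores and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_app
```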
In this exercise we will explore how MPI process and thread binding works. We will try binding to `core`, `socket`, and `numa`, and observe the resulting timings and bindings.
Exercise
Case 1: `--bind-to numa`

```bash
mpirun -n 8 --bind-to numa ./raytracer -width=512 -height=512 -spp=128 -threads=12 -alloc_mode=3 -png=snowman.png
```

Case 2: `--bind-to socket`

```bash
mpirun -n 4 --bind-to socket ./raytracer -width=512 -height=512 -spp=128 -threads=48 -alloc_mode=3 -png=snowman.png
```
Questions:
- What is the difference between Case 1 and Case 2? Is there any difference in performance? How many workers do you get?
- How could you adjust process/thread counts to better utilize the hardware in Case 2?
- MPI and thread pinning is hardware-aware.
- If the number of processes matches the number of domains (sockets or NUMA regions), then the number of threads should equal the cores per domain to fully utilize the node (see the topology check below).
- No speedup in Case 2: oversubscription occurs because we requested 4 processes, each with 48 threads, on a system with only 2 sockets.
- Threads compete for the same cores → they are time-sliced and have to wait for one another, so no extra parallelism is gained.
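To apply that rule you need the node's actual layout. Standard Linux tools report it when run on the compute node:

```bash
# Sockets, cores per socket, hardware threads, and NUMA domains
lscpu | grep -E '^(Socket|Core|Thread|NUMA)'

# Detailed NUMA view: which cores and how much memory belong to each domain
numactl --hardware
```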
Best Practices for MPI Process and Thread Pinning
Mapping vs. Binding Analogy
Think of running MPI processes and threads like booking seats for a group of friends:

- Mapping is like planning where your group will sit in the theatre or on a flight.
  - Example: You decide some friends sit in Economy, some in Premium Economy, and some in Business.
  - Similarly, `--map-by` distributes MPI ranks across nodes, sockets, or NUMA regions.
- Binding is like reserving the exact seats for each friend in the planned area.
  - Example: Once the seating area is chosen, you assign specific seat numbers to each friend.
  - Similarly, `--bind-to` pins each MPI process or thread to a specific core or hardware unit to avoid movement.

This analogy helps illustrate why mapping defines placement and binding enforces it.
We will use `--bind-to core` (the smallest hardware unit) and `--map-by` to distribute MPI processes efficiently across sockets, NUMA regions, or nodes.
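As a small illustration of the two options working together (`./my_app` is again a placeholder):

```bash
# Map: distribute 4 ranks round-robin across the sockets.
# Bind: pin each rank to one dedicated core inside its socket.
mpirun -n 4 --map-by socket --bind-to core --report-bindings ./my_app
```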
Choosing the Smallest Hardware Unit
Binding processes to the smallest unit (core) is recommended because:

- Exclusive use of resources: Each process or thread is pinned to its own core, preventing multiple threads or processes from competing for the same CPU.
- Predictable performance: When processes share cores, execution times can fluctuate due to scheduling conflicts. Binding to cores ensures consistent timing across runs.
- Best practice: Always bind processes to the smallest unit (core) and spread processes evenly across the available hardware using `--map-by`.
- Example options:
  - `--bind-to core` → binds each process to a dedicated core (avoids oversubscription).
  - `--map-by socket:PE=<threads>` → spreads processes across sockets, assigning `<threads>` processing elements (cores) per process.
  - `--map-by numa:PE=<threads>` → spreads processes across NUMA domains, assigning `<threads>` cores per process.
  - `--cpus-per-rank <n>` → assigns `<n>` cores (hardware threads) to each MPI rank, ensuring that all threads within a rank occupy separate cores.
Exercise
Use the best practices given above for Case 1 (`-n 8`, `-threads=1`) and Case 2 (`-n 8`, `-threads=4`), and answer the following questions.
Questions:
- How many cores do the two jobs use?
- Did you get more workers than you requested?
- Did you see the expected scaling when running with 4 threads?
- 8 and 32 (one possible invocation for each case is sketched below).
- No.
- Yes.
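One possible invocation per case, assuming a node whose NUMA domains contain at least 4 cores each; exact counts depend on your cluster:

```bash
# Case 1: 8 ranks, 1 thread each -> 8 dedicated cores
mpirun -n 8 --map-by numa --bind-to core --report-bindings \
    ./raytracer -width=512 -height=512 -spp=128 -threads=1 -alloc_mode=3 -png=snowman.png

# Case 2: 8 ranks, 4 threads each -> 32 dedicated cores (PE=4 cores per rank)
mpirun -n 8 --map-by numa:PE=4 --bind-to core --report-bindings \
    ./raytracer -width=512 -height=512 -spp=128 -threads=4 -alloc_mode=3 -png=snowman.png
```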
Summary
- Always check how pinning works: Use verbose reporting (e.g., `--report-bindings`) to see how MPI processes and threads are mapped to cores and sockets.
- Documentation is your friend: For OpenMPI with `mpirun`, consult the manual: https://www.open-mpi.org/doc/v4.1/man1/mpirun.1.php
- Know your hardware: Understanding the number of sockets, cores per socket, and NUMA regions on your cluster helps you make effective binding decisions.
- Avoid oversubscription: Assigning more threads or processes than available cores hurts performance; it causes contention and idle waits.
- Recommended practice for OpenMPI: Use `--bind-to core` along with `--map-by` to control placement and threads per process to maximize throughput.