Introduction


Resource Requirements


Figure 1


Figure 2


Scheduler Tools


Scaling Study


Figure 1

Speedup and efficiency of strong scaling example

Figure 2

Three snowmen in 800x800 with 128 samples per pixel

Figure 3

Three snowmen in 800x800 with 8192 samples per pixel

Figure 4

Speedup and efficiency of weak scaling example
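For reference when reading the scaling figures, speedup and parallel efficiency can be computed from run times as sketched below. The timings in the example are hypothetical placeholders, not measurements from the figures:

```python
def speedup(t1, tp):
    """Strong-scaling speedup S(p) = T(1) / T(p): fixed problem size, p cores."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency E(p) = S(p) / p; 1.0 means ideal scaling."""
    return speedup(t1, tp) / p

def weak_efficiency(t1, tp):
    """Weak-scaling efficiency: the problem size grows with p,
    so the ideal run time stays constant, T(p) = T(1)."""
    return t1 / tp

# Hypothetical example: a job that takes 100 s serially and 30 s on 4 cores.
t1, tp, p = 100.0, 30.0, 4
s = speedup(t1, tp)        # about 3.33
e = efficiency(t1, tp, p)  # about 0.83
```

In a strong-scaling study, efficiency typically drops as the core count grows; in a weak-scaling study, a flat `weak_efficiency` close to 1.0 indicates good scaling.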

Performance Overview


Figure 1

Diagram visualizing the data hierarchy of CPU architectures. Network, local disks, memory, and CPU caches have decreasing storage capacities, but increasing bandwidths and shorter latencies. Calculations occur in CPUs, possibly in multiple CPU cores, which may have multiple threads each and may even apply vectorized instructions.
The underlying hardware frames any performance analysis. Calculations are performed in multiple cores, potentially in multiple threads per core, and even in vectorized instructions, where a single instruction applies the same operation to multiple data elements. Data moves through the data hierarchy to the CPU cores; each level “closer” to the CPU has a smaller storage capacity, but higher bandwidth and lower latency, improving access performance.
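The effect of data layout and vectorization described above can be illustrated with NumPy (an illustrative sketch; the array shape and values are arbitrary, and NumPy is assumed to be available):

```python
import numpy as np

# A row-major (C-ordered) 2D array: elements of a row are contiguous
# in memory, elements of a column are strided.
a = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

# The same reduction, once traversing contiguous rows and once
# traversing strided columns. Both are mathematically identical, but
# the contiguous traversal makes better use of caches.
row_sum = a.sum(axis=1).sum()   # contiguous memory access
col_sum = a.sum(axis=0).sum()   # strided memory access
assert row_sum == col_sum

# A scalar Python loop performs the same operation one element at a
# time; NumPy's vectorized sum processes whole blocks of data at once,
# analogous to vectorized CPU instructions.
scalar_sum = 0.0
for x in a.flat:
    scalar_sum += x
assert scalar_sum == row_sum
```

The results are identical, but the access pattern and the use of vectorized kernels determine how efficiently the data hierarchy is exploited.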

Figure 2

"My Jobs" tab in the ClusterCockpit web UI
ClusterCockpit's main menu. Select “My Jobs” to see a list of the jobs associated with your account.

Figure 3

ClusterCockpit Job Info panel

Figure 4

ClusterCockpit Footprint panel summarizing central job characteristics

Figure 5

ClusterCockpit Roofline plot of a job

Figure 6

Select Metrics button in the ClusterCockpit job view

Figure 7

Linaro perf-report overview 1

Figure 8

ClusterCockpit cpu load per core

Figure 9

ClusterCockpit flops_any metric

Figure 10


Figure 11


Figure 12


Figure 13


Figure 14


Figure 15


Figure 16


Figure 17


Figure 18


Figure 19


Figure 20

Performance Reports also summarizes the application's behavior in terms of MPI calls, e.g. time spent in collective calls involving all processes, or in point-to-point communication.


Figure 21

The I/O block summarizes measurements of interactions with the local file systems. Here, I/O operations do not affect the application's performance at all.


Pinning


How to identify a bottleneck?


Performance of Accelerators


Next Steps