Performance Basics
2024-08-29
The goal is right enough, fast enough — not flop/s.
Performance is not all that matters.
The road to good performance starts with a single core.
Parallel efficiency is hard!
The Top 500 benchmark reports two numbers per machine: Rmax (measured LINPACK performance) and Rpeak (theoretical peak).
We can measure the first; how do we know the second?
Start with: what do we mean by a floating point operation?
There is lots of interest in 16-bit formats for ML, but LINPACK results are in double precision.
Processor does more than one thing at a time. On one CPU core of Perlmutter (AMD EPYC 7763 (Milan)):
\[ 2 \frac{\mbox{flops}}{\mbox{FMA}} \times 4 \frac{\mbox{FMA}}{\mbox{vector FMA}} \times 2 \frac{\mbox{vector FMA}}{\mbox{cycle}} = 16 \frac{\mbox{flops}}{\mbox{cycle}} \]
At standard clock (2.45 GHz)
\[ 16 \frac{\mbox{flops}}{\mbox{cycle}} \times 2.45 \times 10^9 \frac{\mbox{cycle}}{\mbox{s}} = 39.2 \frac{\mbox{Gflop}}{\mbox{s}} \]
At max boost clock (3.5 GHz)
\[ 16 \frac{\mbox{flops}}{\mbox{cycle}} \times 3.5 \times 10^9 \frac{\mbox{cycle}}{\mbox{s}} = 56 \frac{\mbox{Gflop}}{\mbox{s}} \]
Each CPU has 64 cores, at standard clock
\[ 64 \times 39.2 \frac{\mbox{Gflop}}{\mbox{s}} = 2508.8 \frac{\mbox{Gflop}}{\mbox{s}} \approx 2.5 \frac{\mbox{Tflop}}{\mbox{s}} \]
Peak CPU flop/s by partition:
Rpeak \(>\) Rmax \(>\) Gordon Bell \(>\) Typical
Consider HPCG (June 2024 list), a benchmark that stresses memory access: achieved rates are a small fraction of Rpeak.
Problem: Data movement is expensive!
void square_dgemm(int n, double* C, double* A, double* B)
{
    // Accumulate C += A*B for n-by-n column-major matrices
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}
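To see why data movement rather than arithmetic is the worry here, compare the operation count to the data size (a standard back-of-the-envelope count, added here for context):
\[ \mbox{flops} = 2n^3, \qquad \mbox{data} = 3n^2 \mbox{ doubles} = 24 n^2 \mbox{ bytes} \]
The ratio of arithmetic to required data grows like \(n\), but the naive loop nest above re-reads entries from memory far more often than this minimum; blocked (tiled) implementations get much closer to it.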
Two pieces to the cost of fetching data:
Latency: time from operation start to first result (s)
Bandwidth: rate at which data arrives (bytes/s)
Start with a simple model
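A common choice is the latency-bandwidth model (the symbols \(\alpha\) and \(\beta\) below are notation added here):
\[ T(n) \approx \alpha + \beta n \]
where \(\alpha\) is the latency (s), \(\beta\) is the inverse bandwidth (s/byte), and \(n\) is the number of bytes moved.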
Even this model is too simple: the real picture gets more complicated still!
Why is parallel time not \(T(n)/p\)?
\[\begin{aligned} \mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\ \mbox{Efficiency} &= \frac{\mbox{Speedup}}{p} \end{aligned}\]
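A quick worked example with made-up numbers: if the serial code runs in 100 s and the parallel code runs in 20 s on \(p = 8\) processors, then
\[ \mbox{Speedup} = \frac{100}{20} = 5, \qquad \mbox{Efficiency} = \frac{5}{8} = 0.625 \]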
Perfect (linear) speedup is \(p\). Barriers:
If \(s\) is the fraction that is serial:
\[\mbox{Speedup} < \frac{1}{s}\]
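This bound is Amdahl's law; the intermediate step, with \(p\) processors and the parallel fraction \(1-s\) sped up perfectly, is
\[ \mbox{Speedup} = \frac{1}{s + (1-s)/p} < \frac{1}{s}, \]
with the bound approached as \(p \to \infty\).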
Looks bad for strong scaling!
Strong scaling: fix the problem size, vary \(p\)
Weak scaling: fix the work per processor, vary \(p\)
Scaled speedup \[ S(p) = \frac{T_{\mbox{serial}}(n(p))}{T_{\mbox{parallel}}(n(p),p)} \] Gustafson: \[ S(p) \leq p - \alpha(p-1) \] where \(\alpha\) is fraction of serial work.
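One way to see where the Gustafson bound comes from (a standard derivation, spelled out here): suppose the parallel run on \(p\) processors spends a fraction \(\alpha\) of its time on serial work and \(1-\alpha\) on perfectly parallel work. In units of the parallel run time, a serial run of the same scaled problem takes \(\alpha + (1-\alpha)p\), so
\[ S(p) = \frac{\alpha + (1-\alpha)p}{\alpha + (1-\alpha)} = p - \alpha(p-1). \]
Anything less than perfect parallelism in the remaining work pushes this down to the inequality above.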
The problem is not just with purely serial work, but with anything that limits how much can run at once.
The main pain point: dependencies between computations.
a = f(x)
b = g(x)
c = h(a,b)
Can compute \(a\) and \(b\) in parallel with each other.
But not with \(c\)!
This is a true dependency (read-after-write). We can also have issues with false dependencies (write-after-read and write-after-write); we will deal with those later.
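A minimal sketch of a false dependency (an illustration added here, with sin and cos standing in for the f and g above): reusing a temporary creates an ordering that renaming would remove.

#include <math.h>

double false_dep_example(double x)
{
    double t, y1, y2;
    t  = sin(x);    // write t
    y1 = t + 1.0;   // read t
    t  = cos(x);    // write-after-read and write-after-write on t:
                    // a false dependency, not a data-flow requirement
    y2 = t + 2.0;
    // Renaming the second temporary (say t2) makes the two halves
    // independent, so they could run in parallel.
    return y1 + y2;
}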
“Pleasingly parallel” (aka “embarrassingly parallel”) tasks require very little coordination, e.g.:
Result is “high-throughput” computing – easy to get impressive speedups!
Says nothing about hard-to-parallelize tasks.
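A minimal sketch of a pleasingly parallel loop, assuming OpenMP as the tool (the notes do not name one): each iteration touches its own data and needs no coordination with the others.

#include <stdio.h>

#define NTASKS 1000

// Independent work for task i -- a stand-in for one sample, simulation, etc.
double run_task(int i)
{
    double s = 0.0;
    for (int j = 1; j <= 1000; ++j)
        s += 1.0 / (double) (i + j);
    return s;
}

int main(void)
{
    double results[NTASKS];

    // Iterations write disjoint entries and never communicate:
    // pleasingly parallel.
    #pragma omp parallel for
    for (int i = 0; i < NTASKS; ++i)
        results[i] = run_task(i);

    printf("results[0] = %g\n", results[0]);
    return 0;
}

Compile with OpenMP enabled (e.g. -fopenmp with GCC) to run the loop across threads; without it, the pragma is ignored and the code runs serially.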
If your task is not pleasingly parallel, you ask:
Matrix-matrix multiply:
Look at examples somewhat like yours – a parallel pattern – and maybe seek an informative benchmark. Better yet: reduce to a previously well-solved problem (build on tuned kernels).
NB: Uninformative benchmarks will lead you astray.
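For the matrix multiply above, "reduce to a previously well-solved problem" usually means calling a tuned BLAS rather than the naive loops. A sketch using the CBLAS interface (header name and link flags depend on the BLAS implementation):

#include <cblas.h>   // vendor headers differ, e.g. MKL uses mkl.h

// C += A*B for n-by-n column-major matrices, delegated to a tuned kernel.
void square_dgemm_blas(int n, double* C, double* A, double* B)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n, B, n,
                1.0, C, n);
}

A good BLAS implementation blocks for cache and uses the vector units, which is exactly the "build on tuned kernels" advice.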
Speed-of-light “Rpeak” is hard to reach
Want a realistic sense of what performance is achievable, and what limits it.