CS 5220

Performance Basics

David Bindel

2024-08-29

Soap Box

The Goal

The goal is right enough, fast enough — not flop/s.

More than Speed

Performance is not all that matters.

  • Portability, readability, ease of debugging, ...
  • Want to make intelligent tradeoffs

Start at the Beginning

The road to good performance
starts with a single core.

  • Even single-core performance is hard
  • Helps to build well-engineered libraries

Fair Comparisons

Parallel efficiency is hard!

  • \(p\) processors \(\neq\) speedup of \(p\)
  • Different algorithms parallelize differently
  • Speedup vs. untuned serial code is cheating!

Peak Performance

Whence Rmax?

Top 500 benchmark reports:

  • Rmax: Linpack flop/s
  • Rpeak: Theoretical peak flop/s

Measure the first; how do we know the second?

What is a float?

Start with what floating point is:

  • (Binary) scientific notation
  • Extras: inf, NaN, de-normalized numbers
  • IEEE 754 standard: encodings, arithmetic rules

Formats

  • 64-bit double precision (DP)
  • 32-bit single precision (SP)
  • Extended precisions (often 80 bits)
  • 128-bit quad precision
  • 16-bit half precision (multiple variants, e.g. IEEE fp16 and bfloat16)
  • Decimal formats

Lots of interest in 16-bit formats for ML. Linpack results are double precision.

What is a flop?

  • Basic floating point operations: \(+, -, \times, /, \sqrt{\cdot}\)
  • FMA (fused multiply-add): \(d = ab + c\) (sketch after this list)
  • Costs depend on precision and op
  • Often focus on add, multiply, FMA (“flams”)
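
As a small illustration (a sketch, not a recipe): C99 exposes fused multiply-add as fma() in <math.h>; whether it compiles to a single hardware FMA instruction depends on the compiler, flags, and target.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 2.0, b = 3.0, c = 1.0;
    /* fma computes a*b + c with a single rounding step; the compiler may
       or may not issue an actual hardware FMA instruction for it. */
    printf("fma(%g, %g, %g) = %g\n", a, b, c, fma(a, b, c));
    return 0;
}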

Perlmutter specs

Consider Perlmutter

Flops / cycle / core

The processor does more than one thing at a time. On one CPU core of Perlmutter (an AMD EPYC 7763 “Milan”):

\[ 2 \frac{\mbox{flops}}{\mbox{FMA}} \times 4 \frac{\mbox{FMA}}{\mbox{vector FMA}} \times 2 \frac{\mbox{vector FMA}}{\mbox{cycle}} = 16 \frac{\mbox{flops}}{\mbox{cycle}} \]

Flops / sec / core

At standard clock (2.45 GHz)

\[ 16 \frac{\mbox{flops}}{\mbox{cycle}} \times 2.45 \times 10^9 \frac{\mbox{cycle}}{\mbox{s}} = 39.2 \frac{\mbox{Gflop}}{\mbox{s}} \]

At max boost clock (3.5 GHz)

\[ 16 \frac{\mbox{flops}}{\mbox{cycle}} \times 3.5 \times 10^9 \frac{\mbox{cycle}}{\mbox{s}} = 56 \frac{\mbox{Gflop}}{\mbox{s}} \]

Flops / sec / CPU

Each CPU has 64 cores. At the standard clock:

\[ 39.2 \frac{\mbox{Gflop}}{\mbox{s} \cdot \mbox{core}} \times 64 \frac{\mbox{cores}}{\mbox{CPU}} = 2508.8 \frac{\mbox{Gflop}}{\mbox{s}} \approx 2.5 \frac{\mbox{Tflop}}{\mbox{s}} \]

Peak CPU flop/s by partition (see the sketch below):

  • GPU: \(2.5088\) Tflop/s/CPU \(\times 1536\) CPU \(\approx\) 3.9 Pflop/s
  • CPU: \(2.5088\) Tflop/s/CPU \(\times 2\) CPU/node \(\times 3072\) nodes \(\approx 15.4\) Pflop/s
    • NERSC docs inconsistent re 2 CPU/node?
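
A minimal sketch of the same peak arithmetic in C, with the clock rate, core count, and node counts above hard-coded as assumptions rather than queried from the machine:

#include <stdio.h>

int main(void)
{
    /* Per-core peak at the standard clock (numbers from the slides) */
    double flops_per_cycle = 2 * 4 * 2;          /* FMA x vector width x issue */
    double ghz = 2.45;                           /* standard clock (GHz) */
    double gflops_core = flops_per_cycle * ghz;  /* ~39.2 Gflop/s */
    double tflops_cpu = gflops_core * 64 / 1000; /* 64 cores per CPU */

    /* Partition peaks (CPU and node counts assumed from the slides) */
    double pflops_gpu_part = tflops_cpu * 1536 / 1000;     /* 1536 CPUs */
    double pflops_cpu_part = tflops_cpu * 2 * 3072 / 1000; /* 2 CPU/node x 3072 nodes */

    printf("Per core:      %6.1f Gflop/s\n", gflops_core);
    printf("Per CPU:       %6.2f Tflop/s\n", tflops_cpu);
    printf("GPU partition: %6.2f Pflop/s (CPUs only)\n", pflops_gpu_part);
    printf("CPU partition: %6.2f Pflop/s\n", pflops_cpu_part);
    return 0;
}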

Flops / sec / GPU

  • GPU partition nodes have 4 NVIDIA A100 each.
  • Different peak performance depending on FP type (9.7 Tflop/s FP64)

But…

Rpeak \(>\) Rmax \(>\) Gordon Bell \(>\) Typical

  • Performance is application dependent
  • Hard to get more than a few percent of peak on most applications

Consider HPCG - June 2024.
Problem: Data movement is expensive!

Serial Costs

Naive Matmul

void square_dgemm(int n, double* C, double* A, double* B)
{
    // Accumulate C += A*B for n-by-n matrices in column-major layout
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}
  • Inner product formulation of matrix multiply
  • Takes \(2n^3\) flops
  • Runs far slower in practice than Rpeak suggests!
  • The problem is communication cost / memory traffic (see the timing sketch below)
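
A minimal timing sketch for this kernel (hedged: it assumes the square_dgemm above is linked in, uses zero-filled matrices, and skips the repeated trials and warm-up a real benchmark needs):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void square_dgemm(int n, double* C, double* A, double* B);

int main(void)
{
    int n = 1024;
    double* A = calloc(n*n, sizeof(double));
    double* B = calloc(n*n, sizeof(double));
    double* C = calloc(n*n, sizeof(double));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    square_dgemm(n, C, A, B);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9*(t1.tv_nsec - t0.tv_nsec);
    printf("%.2f Gflop/s (vs ~39.2 Gflop/s peak per core)\n", 2.0*n*n*n/secs/1e9);

    free(A); free(B); free(C);
    return 0;
}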

Price to Fetch

Two pieces to the cost of fetching data:

Latency

Time from operation start to first result (s)

Bandwidth

Rate at which data arrives (bytes/s)

Price to Fetch

  • Usually latency \(\gg\) bandwidth\(^{-1} \gg\) time per flop (see the cost model below)
  • Latency to L3 cache is 10s of ns
  • DRAM is \(3-4 \times\) slower
  • Partial solution: caches (to discuss next time)

See: Latency numbers every programmer should know
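
A common way to package these two costs (a standard model, stated here as a convention rather than notation used elsewhere in these slides): the time to move \(n\) bytes is roughly

\[ T(n \mbox{ bytes}) \approx \mbox{latency} + \frac{n}{\mbox{bandwidth}} \]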

Price to Fetch

  • Lose orders of magnitude if too many memory refs
  • And getting full vectorization is also not easy!
  • We’ll talk more about (single-core) arch next time

Takeaways

Start with a simple model

  • But flop counting is too simple
  • Counting every detail complicates life
  • Want enough detail to predict something

Watch for Hidden Costs

  • Flops are not the only cost!
  • Memory/communication costs are often killers
  • Integer computation may play a role, too

Parallelism?

Picture gets even more complicated!

Parallel Costs

Naive model

Too simple:

  • Serial task takes time \(T(n)\)
  • Deploy \(p\) processors
  • Parallel time is \(T(n)/p\)

What’s Wrong?

Why is parallel time not \(T(n)/p\)?

  • Overheads: Communication, synchronization, extra computation and memory overheads
  • Intrinsically serial work
  • Idle time due to synchronization
  • Contention for resources

Quantifying Performance

  • Start with good serial performance
  • (Strong) scaling study: compare parallel vs serial time as a function of \(p\) for a fixed problem

\[\begin{aligned} \mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\ \mbox{Efficiency} &= \frac{\mbox{Speedup}}{p} \end{aligned}\]
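
A made-up example to fix the definitions (illustrative numbers, not measurements): if the serial code takes 10 s and the parallel code takes 2 s on \(p = 8\) processors, then

\[ \mbox{Speedup} = \frac{10}{2} = 5, \qquad \mbox{Efficiency} = \frac{5}{8} \approx 62\% \]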

Quantifying Performance

Perfect (linear) speedup is \(p\). Barriers:

  • Serial work (Amdahl’s law)
  • Parallel overheads (communication, synchronization)

Amdahl

If \(s\) is the fraction that is serial:

\[\mbox{Speedup} < \frac{1}{s}\]

Looks bad for strong scaling!
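
The reasoning in one line (the standard derivation, with \(T\) the serial time): the serial fraction takes \(sT\) no matter how many processors we add, so

\[ T_{\mbox{parallel}} \geq sT + \frac{(1-s)T}{p}, \qquad \mbox{Speedup} \leq \frac{1}{s + (1-s)/p} < \frac{1}{s} \]

For example, \(s = 0.05\) caps the speedup at 20 for any \(p\).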

Strong and Weak Scaling

Strong scaling

Fix problem size, vary \(p\)

Weak scaling

Fix work per processor, vary \(p\)

Scaled Speedup

Scaled speedup:

\[ S(p) = \frac{T_{\mbox{serial}}(n(p))}{T_{\mbox{parallel}}(n(p),p)} \]

Gustafson:

\[ S(p) \leq p - \alpha(p-1) \]

where \(\alpha\) is the fraction of serial work.
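
An illustrative instance (made-up numbers): with serial fraction \(\alpha = 0.1\) and \(p = 100\),

\[ S(100) \leq 100 - 0.1 \times 99 = 90.1 \]

so roughly 90% efficiency is still possible, provided the problem grows with \(p\).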

Imperfect Parallelism

The problem is not just purely serial work, but also

  • Work that offers limited parallelism
  • Coordination overheads

Dependencies

Main pain point: dependencies between computations

a = f(x)
b = g(x)
c = h(a,b)

Can compute \(a\) and \(b\) in parallel with each other.
But not with \(c\)!

This is a true dependency (read-after-write). False dependencies (write-after-read and write-after-write) cause trouble too; we will deal with those later.
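
A hedged sketch of one way to express this structure with OpenMP tasks (f, g, h, x, and compute are placeholders, and OpenMP is only one of several ways to do it):

/* Compile with e.g. cc -fopenmp; f, g, h stand in for the functions above. */
double f(double x) { return x + 1; }
double g(double x) { return 2 * x; }
double h(double a, double b) { return a * b; }

double compute(double x)
{
    double a, b, c;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a) firstprivate(x)
        a = f(x);             /* independent of b: may run concurrently */
        #pragma omp task shared(b) firstprivate(x)
        b = g(x);
        #pragma omp taskwait  /* true (read-after-write) dependency: c waits for a and b */
        c = h(a, b);
    }
    return c;
}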

Granularity

  • Coordination is expensive
    • including parallel start/stop!
  • Need to do enough work to amortize parallel costs
  • Not enough to have parallel work, need big chunks!
  • Chunk size depends on the machine (see the sketch below).
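
A hedged OpenMP sketch of the chunking idea (the chunk size 256 and the function expensive_update are arbitrary placeholders):

/* Compile with e.g. cc -fopenmp. */

/* Placeholder for a per-element computation with nontrivial cost. */
double expensive_update(double xi) { return xi * xi + 1.0; }

void update_all(int n, const double* x, double* y)
{
    /* Threads grab 256 iterations at a time, so per-chunk scheduling cost is
       amortized over 256 elements.  The right chunk size is machine-dependent:
       too small and overhead dominates, too large and load imbalance grows. */
    #pragma omp parallel for schedule(dynamic, 256)
    for (int i = 0; i < n; ++i)
        y[i] = expensive_update(x[i]);
}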

Patterns and Benchmarks

Pleasing Parallelism

“Pleasingly parallel” (aka “embarrassingly parallel”) tasks require very little coordination, e.g.:

  • Monte Carlo computations with independent trials (sketch below)
  • Mapping many data items independently

Result is “high-throughput” computing – easy to get impressive speedups!

Says nothing about hard-to-parallelize tasks.
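
A minimal sketch of the Monte Carlo flavor (estimating \(\pi\) from independent trials; the trial count, seeding, and use of rand_r are placeholder choices, not recommendations):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long n_trials = 100000000, n_inside = 0;

    /* Trials are independent, so the only coordination is one reduction. */
    #pragma omp parallel reduction(+:n_inside)
    {
        unsigned seed = 1234u + 17u * omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n_trials; ++i) {
            double x = rand_r(&seed) / (double) RAND_MAX;
            double y = rand_r(&seed) / (double) RAND_MAX;
            if (x*x + y*y < 1.0)
                ++n_inside;
        }
    }
    printf("pi is approximately %g\n", 4.0 * n_inside / n_trials);
    return 0;
}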

Displeasing Parallelism

If your task is not pleasingly parallel, you ask:

  • What is the best performance I can reasonably expect?
  • How do I get that performance?

Partly Pleasing Parallelism?

Matrix-matrix multiply:

  • Is not pleasingly parallel.
  • Admits high-performance code.
  • Is a prototype for much dense linear algebra.
  • Is the key to the Linpack benchmark.

Patterns and Kernels

Look at examples somewhat like yours – a parallel pattern – and maybe seek an informative benchmark. Better yet: reduce to a previously well-solved problem (build on tuned kernels).

NB: Uninformative benchmarks will lead you astray.

Recap

Speed-of-light “Rpeak” is hard to reach

  • Communication (even on one core!)
  • Other overhead costs to parallelism
  • Dependencies limiting parallelism

Want

  • Models to understand real performance
  • Building blocks for getting high performance