CS 5220

Applications of Parallel Computers

Performance Basics

Prof David Bindel

Starting on the Soap Box

Starting on the Soap Box

The goal is right enough, fast enough — not flop/s.

Starting on the Soap Box

Performance is not all that matters.

  • Portability, readability, debuggability matter too!
  • Want to make intelligent trade-offs.

Starting on the Soap Box

The road to good performance
starts with a single core.

  • Even single-core performance is hard.
  • Helps to build on well-engineered libraries.

Starting on the Soap Box

Parallel efficiency is hard!

  • \(p\) processors \(\neq\) speedup of \(p\)
  • Different algorithms parallelize differently.
  • Speedup vs. untuned serial code is cheating!

Peak performance

Whence Rmax?

Top 500 benchmark reports:

  • Rmax: Linpack flop/s
  • Rpeak: Theoretical peak flop/s

Measure the first; how do we know the second?

What is a float?

Start with what floating point is:

  • (Binary) scientific notation
  • Extras: inf, NaN, de-normalized numbers
  • IEEE 754 standard: encodings, arithmetic rules

What is a float?

Common floating point formats

  • 64-bit double precision (DP)
  • 32-bit single precision (SP)

Linpack results are double precision

What is a float?

Less common

  • Extended precisions (often 80 bits)

  • 128-bit quad precision

  • 16-bit half precision (multiple variants)

  • Decimal formats

Lots of interest in 16-bit formats for ML

What is a flop?

  • Basic floating point operations: \[ +, -, \times, /, \sqrt{\cdot} \]

  • FMA (fused multiply-add): \[ d = ab + c \]

  • Costs depend on precision and op

  • Often focus on add, multiply, FMA (flams)

Flops / cycle / core

Processor does more than one thing at a time. On my laptop (2018 MacBook Air):

  • Two vector FMAs can start in one cycle

  • Vector FMA does four DP FMAs at once

  • Often count an FMA as two flops

Flops / cycle (one core)

\[ 2 \frac{\mbox{flops}}{\mbox{FMA}} \times 4 \frac{\mbox{FMA}}{\mbox{vector FMA}} \times 2 \frac{\mbox{vector FMA}}{\mbox{cycle}} = 16 \frac{\mbox{flops}}{\mbox{cycle}}\]

Flops / sec (one core)

\[\begin{aligned} 16 \frac{\mbox{flops}}{\mbox{cycle}} \times (1.6 \times 10^9) \frac{\mbox{cycle}}{\mbox{sec}} &= 25.6 \times 10^9 \frac{\mbox{flops}}{\mbox{sec}} \\ &= 25.6~\mbox{Gflop/s} \end{aligned}\]

Flops / sec

\[ 25.6 \frac{\mbox{Gflop/s}}{\mbox{core}} \times 2~\mbox{cores} = 51.2~\mbox{Gflop/s} \]

Things get more complicated if there are different core types (e.g. CPU cores and GPU cores)

Some historical context

Note: CM-5/1024 peak was 131 Gflop/s.

This was the top machine on the first Linpack benchmark list (July 1993).

Sanity check

What is the peak (CPU) flop rate on your machine? This StackOverflow thread might help you figure out your flops/cycle.
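
As a rough starting point, here is a minimal (and admittedly crude) C sketch that times a loop of multiply-adds and reports Gflop/s. The loop body, iteration count, and constants are illustrative choices, and the result will usually land well below the theoretical peak unless the compiler vectorizes across the accumulator chains.


#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Four independent accumulator chains so the multiply-adds can overlap. */
    const long reps = 100000000L;
    double a0 = 1.0, a1 = 1.0, a2 = 1.0, a3 = 1.0;
    double x = 1.000000001, y = 1e-9;

    clock_t start = clock();
    for (long i = 0; i < reps; ++i) {
        a0 = a0*x + y;   /* one multiply + one add = 2 flops (may contract to an FMA) */
        a1 = a1*x + y;
        a2 = a2*x + y;
        a3 = a3*x + y;
    }
    double secs  = (double) (clock() - start) / CLOCKS_PER_SEC;
    double flops = 2.0 * 4.0 * (double) reps;   /* 2 flops per line, 4 lines per iteration */
    printf("%.2f Gflop/s (checksum %g)\n", flops/secs/1e9, a0+a1+a2+a3);
    return 0;
}


Compile with something like gcc -O3 -march=native; without optimization (or without the checksum, which lets the compiler delete the loop) the number is not meaningful.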

The Cost of Computing

(Single core)

The Cost of Computing

Consider a simple serial code:


// Accumulate C += A*B for n-by-n column-major matrices
for (int i = 0; i < n; ++i)
  for (int j = 0; j < n; ++j)
    for (int k = 0; k < n; ++k)
      C[i+j*n] += A[i+k*n] * B[k+j*n];
    

The Cost of Computing

Simplest model:

  1. Dominant cost is \(2n^3\) flops (adds and multiplies)
  2. One flop per clock cycle
  3. Expected time is \[ \mbox{Time (s)} \approx \frac{2n^3 \mbox{ flops}} {25.6 \cdot 10^9 \mbox{ flop/s}} \]
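
For a sense of scale, a hypothetical \(n = 2000\) gives \[ \mbox{Time (s)} \approx \frac{2 \cdot 2000^3 \mbox{ flops}}{25.6 \cdot 10^9 \mbox{ flop/s}} = \frac{1.6 \cdot 10^{10}}{25.6 \cdot 10^9} \approx 0.6 \mbox{ s} \]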

Problem: Model assumptions are wrong!

dbindel@MacBook-Air-5 codes % gcc naive-matmul.c
dbindel@MacBook-Air-5 codes % time ./a.out
./a.out  112.04s user 0.43s system 99% cpu 1:53.25 total
    

The Cost of Computing

Dominant cost is \(2n^3\) flops (adds and multiplies)?

  • Dominant cost is often memory traffic!

  • Special case of a communication cost

The Cost of Computing

Two pieces to cost of fetching data

Latency

Time from operation start to first result (s)

Bandwidth

Rate at which data arrives (bytes/s)
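
A simple way to combine the two (a model, not a law) is to estimate the time to move \(m\) bytes as \[ T(m) \approx \alpha + \frac{m}{\beta} \] where \(\alpha\) is the latency (s) and \(\beta\) is the bandwidth (bytes/s). Latency dominates small transfers; bandwidth dominates large ones.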

The Cost of Computing

  • Usually latency \(\gg\) bandwidth\(^{-1}\) \(\gg\) time per flop

  • Latency to L3 cache is 10s of ns

  • DRAM is \(3\)–\(4\times\) slower

  • Partial solution: caches (to discuss next time)

See: Latency numbers every programmer should know

The Cost of Computing

Makes DRAM (\(\sim 100\) ns) look even worse: \[ 100 \mbox{ ns} \times 25.6 \mbox{ Gflop/s} = 2560 \mbox{ flops} \]

The Cost of Computing

  • Lose orders of magnitude if too many memory refs

  • And getting full vectorization is also not easy!

  • We’ll talk more about (single-core) arch next time

The Cost of Computing

What to take away from this example?

The Cost of Computing

Start with a simple model

  • Simplest: asymptotic complexity (e.g. \(O(n^3)\) flops)

  • Counting every detail just complicates life

  • But we want enough detail to predict something

The Cost of Computing

Watch out for hidden costs

  • Flops are not the only cost!

  • Memory/communication costs are often killers

  • Integer computation may play a role as well

The Cost of Computing

Haven’t even talked about > 1 core yet!

The Cost of Computing

(in parallel)

The Cost of (Parallel) Computing

Simple model:

  • Serial task takes time \(T\) (or \(T(n)\))

  • Deploy \(p\) processors

  • Parallel time is \(T(n)/p\)

... and you should be suspicious by now!

The Cost of (Parallel) Computing

Why is parallel time not \(T/p\)?

  • Overheads: Communication, synchronization, extra computation and memory overheads

  • Intrinsically serial work

  • Idle time due to synchronization

  • Contention for resources

Quantifying Parallel Performance

  • Starting point: good serial performance

  • Scaling study: compare parallel to serial time as a function of the number of processors (\(p\)) \[\begin{aligned} \mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\[2mm] \mbox{Efficiency} &= \frac{\mbox{Speedup}}{p} \end{aligned}\] (numeric example after this list)

  • Ideally, speedup = \(p\). Usually, speedup \(< p\).

  • Barriers to perfect speedup

    • Serial work (Amdahl’s law)

    • Parallel overheads (communication, synchronization)
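
As a small worked example (made-up numbers): if the tuned serial code runs in 10 s and the parallel code runs in 2 s on \(p = 8\) processors, then \[ \mbox{Speedup} = \frac{10}{2} = 5, \qquad \mbox{Efficiency} = \frac{5}{8} \approx 0.63 \]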

Amdahl’s Law

Parallel scaling study where some serial code remains: \[\begin{aligned} p = & \mbox{ number of processors} \\ s = & \mbox{ fraction of work that is serial} \\ t_s = & \mbox{ serial time} \\ t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p \end{aligned}\]

\[\mbox{Speedup} = \frac{t_s}{t_p} = \frac{1}{s + (1-s) / p} < \frac{1}{s}\]

Amdahl’s Law

\[\mbox{Speedup} < \frac{1}{s}\]

So \(1\%\) serial work \(\implies\) max speedup < \(100 \times\), regardless of \(p\).
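
For example, with \(s = 0.01\) and \(p = 100\) (illustrative numbers), \[ \mbox{Speedup} = \frac{1}{0.01 + 0.99/100} \approx 50, \] so at \(p = 100\) we already get only about half of the asymptotic bound of \(100\times\).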

Strong and weak scaling

Amdahl looks bad! But there are two types of scaling studies:

Strong scaling

Fix problem size, vary \(p\)

Weak scaling

Fix work per processor, vary \(p\)

Strong and weak scaling

For weak scaling, study scaled speedup \[ S(p) = \frac{T_{\mbox{serial}}(n(p))}{T_{\mbox{parallel}}(n(p), p)} \] Gustafson’s Law: \[ S(p) \leq p - \alpha(p-1) \] where \(\alpha\) is the fraction of work that is serial.
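
One way to see the bound (a sketch of the standard argument): normalize the parallel run time on \(p\) processors to 1, with a fraction \(\alpha\) of that time spent in serial work. The same job on one processor would take \(\alpha + (1-\alpha)p\), so \[ S(p) = \frac{\alpha + (1-\alpha)p}{1} = p - \alpha(p-1) \]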

Imperfect parallelism

The problem is not just purely serial work, but also

  • Work that offers limited parallelism
  • Coordination overheads

Dependencies

Main pain point: dependencies between computations


        a = f(x)
        b = g(x)
        c = h(a,b)
    

Compute a and b in parallel, but finish both before c!
Limits amount of parallel work available.

This is a true dependency (read-after-write). Also have false dependencies (write-after-read and write-after-write) that can be dealt with more easily.
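
As a concrete sketch (hypothetical functions; OpenMP is just one way to express the pattern), the fragment above might look like this. Compile with -fopenmp.


#include <stdio.h>

static double f(double x) { return x * x; }           /* stand-in work */
static double g(double x) { return x + 1.0; }         /* stand-in work */
static double h(double a, double b) { return a + b; } /* stand-in work */

int main(void)
{
    double x = 2.0, a, b, c;

    #pragma omp parallel sections
    {
        #pragma omp section
        a = f(x);    /* independent of b: may run in parallel */
        #pragma omp section
        b = g(x);    /* independent of a: may run in parallel */
    }
    /* Implicit barrier at the end of the region:
       both a and b are done before c is computed. */
    c = h(a, b);
    printf("c = %g\n", c);
    return 0;
}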

Granularity

  • Coordination is expensive
    - including parallel start/stop!
  • Need to do enough work to amortize parallel costs
  • Not enough to have parallel work, need big chunks! (See the sketch below.)
  • Chunk size depends on the machine.
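
Here is a hypothetical sketch of the chunking effect: the same reduction run twice, once handing out iterations one at a time and once in large chunks. On most machines the fine-grained version pays far more scheduling overhead. Sizes and schedules are arbitrary; compile with -fopenmp.


#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;
    double sum1 = 0.0, sum2 = 0.0;

    /* Fine-grained: hand out iterations one at a time. */
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum1) schedule(dynamic,1)
    for (long i = 0; i < n; ++i)
        sum1 += 1.0 / (double) (i+1);
    double t1 = omp_get_wtime();

    /* Coarse-grained: hand out big chunks of iterations. */
    #pragma omp parallel for reduction(+:sum2) schedule(dynamic,100000)
    for (long i = 0; i < n; ++i)
        sum2 += 1.0 / (double) (i+1);
    double t2 = omp_get_wtime();

    printf("chunk 1: %.3f s   chunk 100000: %.3f s   (%g, %g)\n",
           t1-t0, t2-t1, sum1, sum2);
    return 0;
}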

Patterns and Benchmarks

Pleasing Parallelism

“Pleasingly parallel” (aka “embarrassingly parallel”) tasks require very little coordination, e.g.:

  • Monte Carlo computations with independent trials
  • Mapping many data items independently

Result is “high-throughput” computing – easy to get impressive speedups!
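
For instance, a minimal Monte Carlo sketch (hypothetical sizes and seeding; rand_r is just a quick-and-dirty per-thread generator) where the only coordination is a final reduction. Compile with -fopenmp.


#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long trials = 10000000L;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        /* Independent trials: each thread gets its own seed. */
        unsigned int seed = 1234u + 999u * (unsigned) omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < trials; ++i) {
            double x = rand_r(&seed) / (double) RAND_MAX;
            double y = rand_r(&seed) / (double) RAND_MAX;
            if (x*x + y*y <= 1.0)
                ++hits;
        }
    }
    printf("pi is approximately %g\n", 4.0 * (double) hits / (double) trials);
    return 0;
}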

Says nothing about hard-to-parallelize tasks.

Patterns and Benchmarks

If your task is not pleasingly parallel, you should ask:

  • What is the best performance I reasonably expect?
  • How do I get that performance?

Patterns and Benchmarks

Matrix-matrix multiply:

  • Is not pleasingly parallel.
  • Admits high-performance code.
  • Is a prototype for much dense linear algebra.
  • Is the key to the Linpack benchmark.

Patterns and Benchmarks

Look at examples somewhat like yours – a parallel pattern – and maybe seek an informative benchmark. Better yet: reduce to a previously well-solved problem (build on tuned kernels).

NB: Uninformative benchmarks will lead you astray.

Onward!