Peak performance
In the last deck, we looked up the Linpack benchmark numbers for
a few top machines. It's useful to know the theoretical "speed
of light" for a given platform, and to understand how it is
computed, so let's dig into these numbers a bit now.
Whence Rpeak?
Top 500 benchmark reports:
Rmax: Linpack flop/s
Rpeak: Theoretical peak flop/s
Measure the first; how do we know the second?
More specifically, the first two numbers reported for each
machine in the Top 500 list are the maximum flop rate on the
benchmark (Rmax) and the theoretical peak flop rate (Rpeak).
We measure the former experimentally. Where does the latter
come from?
What is a float?
Start with what floating point is:
(Binary) scientific notation
Extras: inf, NaN, de-normalized numbers
IEEE 754 standard: encodings, arithmetic rules
Before we talk about floating point operations per second, we
have to talk about floating point operations. And before that,
we have to talk about floating point, though briefly. Floating
point arithmetic is how we approximate real arithmetic on
computers. It's essentially scientific notation, but in base 2
instead of base 10. The IEEE 754 standard defines how the
encodings work for floating point numbers, including special
encodings for denormalized representations near zero,
infinity, and not-a-number. It also defines the rules used for
floating point operations. We'll talk about the details a bit
more later in the class.
David Goldberg wrote a great article about What Every Computer
Scientist Should Know About Floating-Point Arithmetic. I highly
recommend it if you are fuzzy about how floating point works.
What is a float?
Common floating point formats
64-bit double precision (DP)
32-bit single precision (SP)
Linpack results are double precision
The two most common floating point formats in numerical codes
are the IEEE 754 double and single precision formats, which are
64 bits and 32 bits in memory, respectively. When we specify
flop rates, we need to specify a precision. The Linpack results
are all reported in double precision.
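If you want to poke at this in C, a quick sketch along these lines (not from the slides) prints the sizes and machine epsilons of the two formats; float and double are the C names for single and double precision:

#include <stdio.h>
#include <float.h>

// Print the storage size and machine epsilon of the two common formats.
int main(void)
{
    printf("float:  %zu bytes, epsilon = %g\n", sizeof(float),  FLT_EPSILON);
    printf("double: %zu bytes, epsilon = %g\n", sizeof(double), DBL_EPSILON);
    return 0;
}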
What is a float?
Less common
Lots of interest in 16-bit formats for ML
Of course, there are other formats as well. Some have more than
64-bits, like the 128-bit quad format or the more-flexible
extended precision specification (usually 80 bits). In
addition, there is a lot of recent interest in 16-bit half
precision formats. There are two of these -- Float16 and
BFloat16 -- and there is a lot of interest in using them for
machine learning tasks. But there are a lot of things you can
get away with in double or single precision that become very dangerous in
half precision, so these formats should be treated with some
care.
In addition, the 754 standard also specifies decimal floating
point formats. This used to be a completely different standard
(IEEE 854), but both formats appeared together in IEEE 754-2008. I was
actually active on the 754-2008 committee for a time while I was
a graduate student at Berkeley. It was a learning experience!
The most recent version of the standard is IEEE 754-2019.
What is a flop?
Basic floating point operations:
\[
+, -, \times, /, \sqrt{\cdot}
\]
FMA (fused multiply-add):
\[
d = ab + c
\]
Costs depend on precision and op
Often focus on add, multiply, FMA (flams)
Floating point numbers are pretty useless if you can't use
them for arithmetic! The standard describes a few basic
operations: addition, subtraction, negation, multiplication, division,
and square root. Many modern processors also support fused
multiply-add (FMA), which does an addition and a
multiplication with a single rounding error.
The rate at which we can do floating point operations depends
on the format and the type of operation. In much of linear
algebra, most of the work goes into additions and
multiplications. Pete Stewart calls these flams.
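C99 exposes the fused operation directly as fma in math.h. Here is a small sketch (mine, not from the deck) with inputs chosen so that the singly-rounded and doubly-rounded results differ; link with -lm if your toolchain needs it:

#include <math.h>
#include <stdio.h>

// fma(x, y, z) computes x*y + z with a single rounding; writing x*y + z
// out rounds twice.  Compile with -ffp-contract=off so the compiler does
// not quietly fuse the unfused version itself.
int main(void)
{
    double x = 1.0 + ldexp(1.0, -30);     // 1 + 2^-30, exactly representable
    double y = -(1.0 + ldexp(1.0, -29));  // -(the rounded value of x*x)
    printf("x*x + y      = %g\n", x*x + y);       // 0: the 2^-60 term is lost
    printf("fma(x, x, y) = %g\n", fma(x, x, y));  // 2^-60, about 8.7e-19
    return 0;
}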
Flops / cycle / core
Processor does more than one thing at a time. On my laptop
(2018 MacBook Air):
Two vector FMAs can start in one cycle
Vector FMA does four DP FMAs at once
Often count an FMA as two flops
Even single-core systems have instruction level parallelism. My
laptop, a 2018 MacBook Air, has a chip with two vector units
capable of simultaneous fused multiply-add operations on 256
bits of data -- four double precision numbers -- per operand.
More than one instruction can be started in a single cycle, and
the units are pipelined. We'll talk more about what this means
in the next deck, but for now you should just know that, at
theoretical peak, we can actually start two vector FMAs
per cycle. And we often count an FMA as two flops.
Flops / cycle (one core)
\[
2 \frac{\mbox{flops}}{\mbox{FMA}} \times
4 \frac{\mbox{FMA}}{\mbox{vector FMA}} \times
2 \frac{\mbox{vector FMA}}{\mbox{cycle}} =
16 \frac{\mbox{flops}}{\mbox{cycle}}\]
So, putting this together, we have two flops per FMA, times four
FMAs per vector FMA instruction, times two vector FMA
instructions per cycle, or 16 flops per cycle.
Flops / sec (one core)
\[\begin{aligned}
16 \frac{\mbox{flops}}{\mbox{cycle}} \times
(1.6 \times 10^9) \frac{\mbox{cycle}}{\mbox{sec}}
&= 25.6 \times 10^9 \frac{\mbox{flops}}{\mbox{sec}} \\
&= 25.6~\mbox{Gflop/s}
\end{aligned}\]
Multiply 16 flops per cycle by a clock rate of 1.6 GHz or 1.6
billion cycles per second, and we have a single core flop rate
of 25.6 GFlop/s.
Flops / sec
\[
25.6 \frac{\mbox{Gflop/s}}{\mbox{core}} \times
2~\mbox{cores} =
51.2~\mbox{Gflop/s}
\]
Things get more complicated if there are different core types
(e.g. CPU cores and GPU cores)
My laptop has two cores, so the theoretical peak is 51.2
GFlop/s. Many high-performance systems have a mix of different
types of cores -- CPUs and GPUs -- and so the computation for
them is more complicated. But the basic picture remains the
same. To compute the peak flop rate, we want to figure out how
many flops per cycle we can manage, and then multiply by the
clock rate.
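The same bookkeeping fits in a few lines of C. The parameters below are
the ones from my laptop; treat them as placeholders and substitute your
own machine's numbers:

#include <stdio.h>

// Recompute the theoretical peak from the factors discussed above.
// Parameters are for the 2018 MacBook Air example; substitute your own.
int main(void)
{
    double flops_per_fma   = 2.0;   // count an FMA as two flops
    double fmas_per_vector = 4.0;   // 256-bit vector = 4 doubles
    double vfmas_per_cycle = 2.0;   // two vector FMAs can start per cycle
    double ghz             = 1.6;   // clock rate (billions of cycles/s)
    double cores           = 2.0;

    double per_core = flops_per_fma * fmas_per_vector * vfmas_per_cycle * ghz;
    printf("Peak: %.1f Gflop/s per core, %.1f Gflop/s total\n",
           per_core, per_core * cores);
    return 0;
}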
Some historical context
Note: CM-5/1024
peak was 131 Gflop/s.
This was the top machine on the first Linpack benchmark list
(July 1993).
It is sort of impressive that my dinky MacBook Air has a
theoretical peak of 51.2 GFlop/s. For historical context, the
first Linpack benchmark list came out in July 1993, and the top
machine at the time (the CM-5/1024 from Thinking Machines) came
in at a theoretical peak of 131 GFlop/s. Maybe that seems like
a long time ago to you, but I was in high school in 1993!
Sanity check
What is the peak (CPU) flop rate on your machine?
This StackOverflow thread might help you figure out your flops/cycle.
All right. Assuming that you are not using the same model of
MacBook Air, take a moment: what is the theoretical peak (double
precision) flop rate on the CPUs on your machine, whatever it
may be? I've linked a StackOverflow thread that discusses some
of the relevant parameters for various families of processors.
I suggest going to look it up!
The Cost of Computing
(Single core)
All right. So I have cores capable of a theoretical peak of
25.6 GFlop/s. Let's talk about what fraction of that peak I
might reasonably expect to get.
The Cost of Computing
Consider a simple serial code:
// Accumulate C += A*B for n-by-n matrices (column-major storage)
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
It helps to be concrete, so let's consider a numerical method
that multiplies two square n-by-n matrices. We will use the
classic three nested loops approach that you probably first
learned when you learned how to multiply matrices. If you no
longer remember how to multiply two matrices, now is a good time
to remind yourself! It will come up again soon enough.
The innermost loop computes a dot product between a row of
matrix A and a column of matrix B; that takes n multiplies and n
adds. The outer two loops iterate over n^2 such products. The
total cost is therefore 2n^3 flops.
The code assumes the matrices A, B, and C are laid out in
column-major order in memory: that is, all the entries of the
first column appear first, followed by all the entries of the
second column, and so forth. This is the order used by Fortran and
MATLAB (NumPy defaults to row-major, though it supports both). C
uses row-major ordering, to the extent that it supports
multi-dimensional arrays at all. So we have to
manually compute the access function that maps from row and
column indices into a one-dimensional representation.
We'll have more to say about memory layouts for
multi-dimensional arrays in future lectures.
The Cost of Computing
Simplest model:
Dominant cost is \(2n^3\)
flops (adds and multiplies)
Flops run at the single-core peak rate (25.6 Gflop/s)
Expected time is
\[
\mbox{Time (s)} \approx
\frac{2n^3 \mbox{ flops}}
{25.6 \cdot 10^9 \mbox{ flop/s}}
\]
Problem: Model assumptions are wrong!
So how long does it actually take to multiply two matrices? A
very simple estimate might go as follows: let's assume that we
are close to peak performance. Then the time (in seconds) is
the number of flops divided by the flop rate (in flops per
second). Certainly the units work out! For my laptop, with a
theoretical peak on one core of 25.6 Gflop/s, this model would
predict that I could multiply two 2400-by-2400 matrices in a bit
over a second (1.08 seconds). Of course, the assumption that we
are running at peak flop rate is probably wrong. So how long
does it actually take?
dbindel@MacBook-Air-5 codes % gcc naive-matmul.c
dbindel@MacBook-Air-5 codes % time ./a.out
./a.out 112.04s user 0.43s system 99% cpu 1:53.25 total
The answer: it takes almost two minutes on my laptop. Our naive
estimate was off by almost two orders of magnitude. Clearly,
something was missing from our performance model.
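I haven't shown naive-matmul.c itself in these notes; a minimal driver
along the following lines (the matrix size and the use of clock() are
choices I'm making here, not necessarily what the original file does)
would reproduce the experiment:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Sketch of a timing driver for the naive three-loop matrix multiply.
// The matrices are column-major, as in the code above.
int main(void)
{
    int n = 2400;
    double* A = calloc((size_t) n*n, sizeof(double));
    double* B = calloc((size_t) n*n, sizeof(double));
    double* C = calloc((size_t) n*n, sizeof(double));

    clock_t start = clock();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
    double elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;

    printf("n = %d: %.2f s (%.2f Gflop/s)\n",
           n, elapsed, 2.0*n*n*n / elapsed / 1e9);
    free(A); free(B); free(C);
    return 0;
}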
The Cost of Computing
Dominant cost is \(2n^3\) flops
(adds and multiplies)?
The main problem with our naive estimate is that we neglected
the cost to fetch the data from memory. In fact, memory
accesses cost a lot more than flops! And the way we wrote our
code makes it so that we will have to do a slow memory access
for almost every flop, so the computation is not the dominant
cost. There are alternate ways of writing the code that make it
so that we can re-use many of the slow memory accesses, and
actually get close to the peak speed. We will talk about these
alternate approaches in future lectures.
More generally, communication -- whether with memory or with
other processors -- has not improved at the same rate that peak
flop rates have improved. So a lot of the games that we will
play have to do with minimizing communication with slow memory,
or communication between processors.
The Cost of Computing
Two pieces to cost of fetching data
Latency
Time from operation start to first result (s)
Bandwidth
Rate at which data arrives (bytes/s)
When we think about the cost of memory accesses, or other
communications, we usually think about two distinct things. The
first is the latency, or the time between when an operation
starts and when we get the first result. The second is
bandwidth, or the rate at which data arrives (in steady state).
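One standard way to put these two numbers together is the so-called
\(\alpha\)-\(\beta\) model: the time to move \(n\) bytes is roughly
\[
T(n) \approx \alpha + \frac{n}{\beta},
\]
where \(\alpha\) is the latency (seconds) and \(\beta\) is the
bandwidth (bytes per second).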
The Cost of Computing
Usually latency \(\gg\) bandwidth\(^{-1}\) \(\gg\) time per flop
Latency to L3 cache is 10s of ns
DRAM is \(3\)–\(4\times\) slower
Partial solution: caches (to discuss next time)
See: Latency numbers every programmer should know
Your computer has several different types of memory. For the
larger, slower memories, the latency can be really long. Once
one has started getting data, the bandwidth seems OK -- though
what I've written on the slide, that latency is much greater
than inverse bandwidth, is not really sensible (because they have
different units). But the inverse bandwidth is certainly much bigger
than the time per flop.
Fortunately, modern computers also come with small, fast memories with
lower latencies and higher bandwidths than the main memory.
These fast memories, called caches, are key to single-core
performance of things like matrix-matrix products.
It's worthwhile having some idea of how long it takes to fetch
data from different types of memory. I recommend taking a minute or two
to look over the numbers on the web page linked from this slide.
The Cost of Computing
Makes DRAM (\(\sim 100\) ns)
look even worse:
\[
100 \mbox{ ns} \times
25.6 \mbox{ Gflop/s} = 2560 \mbox{ flops}
\]
The main memory typically uses synchronous dynamic random-access
memory or SDRAM. There are several different latency numbers
for SDRAM, but the right order of magnitude for a new, random
read to memory is on the order of a few tens of nanoseconds
(ns). A good ballpark estimate is 100 ns. This is pessimistic,
but it is certainly not off by an order of magnitude.
How long is 100 ns in flops? On my machine, it is 2560 flops,
which is a lot.
The Cost of Computing
Lose orders of magnitude if too many memory refs
And getting full vectorization is also not easy!
We’ll talk more about (single-core) arch next time
So: we lose orders of magnitude in performance if we have too
many random references to slow memory (DRAM). And even if we
manage to finesse this issue, getting enough vectorization to
approach 16 flops per cycle turns out to be nontrivial. We'll
talk about these things in more detail over the next few
lectures, starting with the upcoming lecture on single-core
architecture.
The Cost of Computing
What to take away from this example?
All right. Aside from the fact that compute is fast and memory
is slow, what should we take away from this example?
The Cost of Computing
Start with a simple model
Simplest: asymptotic complexity (e.g. \(O(n^3)\) flops)
Counting every detail just complicates life
But we want enough detail to predict something
Even more than the numbers or the conclusions about the relative
speed of compute and memory, I'd like you to remember how we
thought about modeling performance in this example. We started
off just keeping track of flops, and that gave us a lower bound
on the time it would take to do a computation. But because our
model was too simple, we missed something important; namely, the
memory. It's natural to skip over this detail when we think
about algorithm complexity; and, really, it only affects the
constants in a big-O view of the world. But constants are
important in HPC, and if we want a model that will be at all
predictive, we need some details about communication and memory
costs.
The Cost of Computing
Watch out for hidden costs
Flops are not the only cost!
Memory/communication costs are often killers
Integer computation may play a role as well
There are a lot of situations in which memory and communication
costs a lot more than computation! It's also worth noting that
floating point operations are not the only part of a
computation -- things like integer computation costs may also
play a role.
The Cost of Computing
Haven’t even talked about > 1 core yet!
And this depressing complexity all came about from discussing a
serial code written in four lines of C! Things get even more
complicated when we talk about parallel code, of course.
The Cost of Computing
(in parallel)
So, now that we've depressed ourselves with the single-core
case, let's talk about some basic concepts for reasoning about the
complexity and performance of parallel codes.
The Cost of (Parallel) Computing
Simple model:
\[
T_{\mbox{parallel}} = \frac{T_{\mbox{serial}}}{p}
\]
... and you should be suspicious by now!
A naive view of the world, one that is trotted out far too
often by people who ought to know better, is that p processors
will result in a p-fold improvement over the execution time of a
serial code for the same task. You should be suspicious of this
reasoning by now. In fact, it rarely holds.
The Cost of (Parallel) Computing
Why is parallel time not \(T/p\)?
Overheads: Communication, synchronization, extra computation and memory overheads
Intrinsically serial work
Idle time due to synchronization
Contention for resources
One problem in the parallel case is communication overhead.
Just as it takes time to ask a memory for data, it takes time to
send messages from one processor to another. In addition, just
because we have p processors available, that does not mean that
we can use them all effectively at once! Sometimes we have
intrinsically serial tasks, or tasks that offer little
parallelism. We might have issues of load balance that leave
some processors idle much of the time. Or we might spend all our
time contending for access to shared resources (which is a type
of communication overhead, I suppose). In each case, we are
going to have factors that keep us from getting close to
perfect parallel efficiency.
Quantifying Parallel Performance
Starting point: good serial performance
Scaling study: compare parallel to serial time as a function of the number of processors (\(p\)):
\[\begin{aligned}
\mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\[2mm]
\mbox{Efficiency} &= \frac{\mbox{Speedup}}{p}
\end{aligned}\]
Ideally, speedup = \(p\). Usually, speedup \(< p\).
Barriers to perfect speedup
A (strong) scaling study is an experiment in which we compare
the performance of a parallel code to the performance of a
well-tuned serial code. In strong scaling, we look at speedup,
or the ratio of serial to parallel time, versus the number of
processors. The efficiency is the speedup relative to the
number of processors; 100 percent efficiency means that you get
the p-fold speedup of your dreams.
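As a made-up worked example: if a well-tuned serial code takes 10
seconds and the parallel version takes 2 seconds on \(p = 8\) cores, then
\[
\mbox{Speedup} = \frac{10}{2} = 5, \qquad
\mbox{Efficiency} = \frac{5}{8} \approx 62\%.
\]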
The tuning of the serial code matters! You can get
beautiful-looking speedups if you compare to a bad serial code,
but those attractive speedup curves don't actually tell you that
you are going to get answers fast as you add processors to
your parallel code. Rather, they tell you that you are going to
get answers slower than you ought to, pretty much across the board.
Amdahl’s Law
Parallel scaling study where some serial code remains:
\[\begin{aligned}
p = & \mbox{ number of processors} \\
s = & \mbox{ fraction of work that is serial} \\
t_s = & \mbox{ serial time} \\
t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p
\end{aligned}\]
\[\mbox{Speedup} =
\frac{t_s}{t_p} \leq \frac{1}{s + (1-s) / p} <
\frac{1}{s}\]
Last time, we talked about one of the fundamental modeling
results for strong scaling: Amdahl's law. Amdahl says that if
some fraction s of the serial code cannot be parallelized, then
our speedup will be no more than 1/s, no matter how many
processors we use.
Amdahl’s Law
\[\mbox{Speedup} < \frac{1}{s}\]
So \(1\%\) serial work \(\implies\) max speedup \(< 100\times\), regardless of \(p\).
So, for example, if our serial code spends one percent of our
time loading up a problem and the other 99 percent computing,
then parallelizing only the computational section will limit our
overall speedup to at most 100. That's bad news if we're trying
to impress a sponsor with parallel efficiency of our code on
their 10000-core supercomputer!
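And the \(1/s\) bound is itself optimistic: plugging \(s = 0.01\) and
\(p = 100\) into the formula gives
\[
\mbox{Speedup} \leq \frac{1}{0.01 + 0.99/100} \approx 50,
\]
so a hundred processors only get us about halfway to that limiting
factor of 100.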
Strong and weak scaling
Amdahl looks bad! But there are two types of scaling studies:
Strong scaling
Fix problem size, vary \(p\)
Weak scaling
Fix work per processor, vary \(p\)
When we talk about Amdahl's law, we're talking about performance
for a fixed problem, or strong scaling. But often, we want a
bigger computer so that we can solve bigger problems, not just
to solve the same problem repeatedly. In a weak scaling study,
we fix the amount of work per processor rather than fixing the
overall problem size. This leads to very different reasoning.
Strong and weak scaling
For weak scaling, study scaled speedup
\[
S(p) =
\frac{T_{\mbox{serial}}(n(p))}{T_{\mbox{parallel}}(n(p), p)}
\]
Gustafson’s Law:
\[
S(p) \leq p - \alpha(p-1)
\]
where \(\alpha\) is the
fraction of work that is serial.
The analogue of Amdahl's law for weak scaling is Gustafson's
law. Gustafson says that when we scale the problem size with
the number of processors, the scaled speedup can grow linearly with
the processor count, with slope \(1-\alpha\).
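For example, with a serial fraction of \(\alpha = 0.01\) and \(p = 100\),
Gustafson's bound is
\[
S(100) \leq 100 - 0.01 \times 99 = 99.01,
\]
a much rosier picture than the roughly 50-fold speedup Amdahl's law
allows at \(p = 100\) for the same serial fraction in a strong scaling
study.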
Imperfect parallelism
Problem is not just with purely serial work, but
Work that offers limited parallelism
Coordination overheads.
Amdahl's law and Gustafson's law are phrased in terms of serial
work. But many algorithms have sections that, while not serial,
have limited parallelism. For example, consider processing a
tree from the leaves to the root: lots of parallelism at the
leaves, little close to the root. There are also cases where
we have to worry about coordination overheads that might
actually grow with the number of processors! Amdahl and
Gustafson capture the spirit of all these problems, if not all
the details. The point is that we have to keep the
less-parallel overheads small relative to the work being done
per processor if we want to see reasonable parallel efficiency.
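To put a number on the tree example: summing \(n\) values with a
balanced binary reduction tree takes \(n - 1\) additions in total, but
its \(\lceil \log_2 n \rceil\) levels must run one after another, so the
speedup is at most
\[
\frac{n-1}{\lceil \log_2 n \rceil}
\]
no matter how many processors we have -- roughly \(100\times\) for
\(n = 1024\), even though no single step is purely serial.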
Dependencies
Main pain point: dependency between computations
a = f(x)
b = g(x)
c = h(a,b)
Compute a and b in parallel, but finish both before c!
Limits amount of parallel work available.
This is a true dependency (read-after-write). Also have false
dependencies (write-after-read and write-after-write) that can be
dealt with more easily.
What keeps us from achieving high parallelism? In a word:
dependencies. If the output of one computation is an input to
another computation, those computations cannot run in parallel.
This is a true dependency, but there are also false dependencies
-- things that shouldn't really matter, but cause problems with
so-called data races in practice. Copied data structures or
synchronization strategies can address false dependencies, but
the only way to deal with a true dependency is to rethink the
algorithm.
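To make the slide's pseudocode concrete, here is one way the pattern
might look with OpenMP tasks (the pragmas and the placeholder
definitions of f, g, and h are mine; compile with OpenMP support,
e.g. -fopenmp):

#include <stdio.h>

// Placeholder computations standing in for the slide's f, g, and h.
double f(double x) { return x + 1.0; }
double g(double x) { return 2.0 * x; }
double h(double a, double b) { return a * b; }

int main(void)
{
    double x = 3.0, a, b, c;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a)
        a = f(x);             // no dependency between a and b,
        #pragma omp task shared(b)
        b = g(x);             // so these two tasks may run in parallel
        #pragma omp taskwait  // but both must finish before we compute c
        c = h(a, b);
    }
    printf("c = %g\n", c);
    return 0;
}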
Granularity
Coordination is expensive
- including parallel start/stop!
Need to do enough work to amortize parallel costs
Not enough to have parallel work, need big chunks!
Chunk size depends on the machine.
As it turns out, having lots of latent parallelism isn't
enough. We need long, expensive chunks of work that can be done
completely independently of other long, expensive chunks of
work. Otherwise, we spend more time coordinating than we do
getting work done. Of course, as we will see, there can also be
downsides to big independent chunks of work. This will come up
in particular when we talk about issues surrounding problem
partitioning and load balancing.