Peak performance
In the last deck, we looked up the Linpack benchmark numbers for
a few top machines. It's useful to know the theoretical "speed
of light" for a given platform, and to understand how it is
computed, so let's dig into these numbers a bit now.
Whence Rpeak?
Top 500 benchmark reports:
Rmax: Linpack flop/s
Rpeak: Theoretical peak flop/s
Measure the first; how do we know the second?
More specifically, the first two numbers reported for each
machine in the Top 500 list are the maximum flop rate on the
benchmark (Rmax) and the theoretical peak flop rate (Rpeak).
We measure the former experimentally. Where does the latter
come from?
What is a float?
Start with what floating point is:
(Binary) scientific notation
Extras: inf, NaN, de-normalized numbers
IEEE 754 standard: encodings, arithmetic rules
Before we talk about floating point operations per second, we
have to talk about floating point operations. And before that,
we have to talk about floating point, though briefly. Floating
point arithmetic is how we approximate real arithmetic on
computers. It's essentially scientific notation, but in base 2
instead of base 10. The IEEE 754 standard defines how the
encodings work for floating point numbers, including special
encodings for denormalized representations near zero,
infinity, and not-a-number. It also defines the rules used for
floating point operations. We'll talk about the details a bit
more later in the class.
David Goldberg wrote a great article about What Every Computer
Scientist Should Know About Floating-Point Arithmetic. I highly
recommend it if you are fuzzy about how floating point works.
What is a float?
Common floating point formats
64-bit double precision (DP)
32-bit single precision (SP)
Linpack results are double precision
The two most common floating point formats in numerical codes
are the IEEE 754 double and single precision formats, which are
64 bits and 32 bits in memory, respectively. When we specify
flop rates, we need to specify a precision. The Linpack results
are all reported in double precision.
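If you want to poke at this in C, a quick sketch along these lines (not from the slides) prints the sizes and machine epsilons of the two formats; float and double are the C names for single and double precision:

#include <stdio.h>
#include <float.h>

// Print the storage size and machine epsilon of the two common formats.
int main(void)
{
    printf("float:  %zu bytes, epsilon = %g\n", sizeof(float),  FLT_EPSILON);
    printf("double: %zu bytes, epsilon = %g\n", sizeof(double), DBL_EPSILON);
    return 0;
}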
What is a float?
Less common
Lots of interest in 16-bit formats for ML
Of course, there are other formats as well. Some have more than
64-bits, like the 128-bit quad format or the more-flexible
extended precision specification (usually 80 bits). In
addition, there is a lot of recent interest in 16-bit half
precision formats. There are two of these -- Float16 and
BFloat16 -- and there is a lot of interest in using them for
machine learning tasks. But there are a lot of things you can
get away with in double or single precision that become very dangerous in
half precision, so these formats should be treated with some
care.
In addition, the 754 standard also specifies decimal floating
point formats. This used to be a completely different standard
(IEEE 854), but both formats appeared together in IEEE 754-2008. I was
actually active on the 754-2008 committee for a time while I was
a graduate student at Berkeley. It was a learning experience!
The most recent version of the standard is IEEE 754-2019.
What is a flop?
Basic floating point operations:
\[
+, -, \times, /, \sqrt{\cdot}
\]
FMA (fused multiply-add):
\[
d = ab + c
\]
Costs depend on precision and op
Often focus on add, multiply, FMA (flams)
Floating point numbers are pretty useless if you can't use
them for arithmetic! The standard describes a few basic
operations: addition, subtraction, negation, multiplication, division,
and square root. Many modern processors also support fused
multiply-add (FMA), which does an addition and a
multiplication with a single rounding error.
The rate at which we can do floating point operations depends
on the format and the type of operation. In much of linear
algebra, most of the work goes into additions and
multiplications. Pete Stewart calls these flams.
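C99 exposes the fused operation directly as fma in math.h. Here is a small sketch (mine, not from the deck) with inputs chosen so that the singly-rounded and doubly-rounded results differ; link with -lm if your toolchain needs it:

#include <math.h>
#include <stdio.h>

// fma(x, y, z) computes x*y + z with a single rounding; writing x*y + z
// out rounds twice.  Compile with -ffp-contract=off so the compiler does
// not quietly fuse the unfused version itself.
int main(void)
{
    double x = 1.0 + ldexp(1.0, -30);     // 1 + 2^-30, exactly representable
    double y = -(1.0 + ldexp(1.0, -29));  // -(the rounded value of x*x)
    printf("x*x + y      = %g\n", x*x + y);       // 0: the 2^-60 term is lost
    printf("fma(x, x, y) = %g\n", fma(x, x, y));  // 2^-60, about 8.7e-19
    return 0;
}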
Flops / cycle / core
Processor does more than one thing at a time. On my laptop
(2018 MacBook Air):
Two vector FMAs can start in one cycle
Vector FMA does four DP FMAs at once
Often count an FMA as two flops
Even single-core systems have instruction level parallelism. My
laptop, a 2018 MacBook Air, has a chip with two vector units
capable of simultaneous fused multiply-add operations on 256
bits of data -- four double precision numbers -- per operand.
More than one instruction can be started in a single cycle, and
the units are pipelined. We'll talk more about what this means
in the next deck, but for now you should just know that, at
theoretical peak, we can actually start two vector FMAs
per cycle. And we often count an FMA as two flops.
Flops / cycle (one core)
\[
2 \frac{\mbox{flops}}{\mbox{FMA}} \times
4 \frac{\mbox{FMA}}{\mbox{vector FMA}} \times
2 \frac{\mbox{vector FMA}}{\mbox{cycle}} =
16 \frac{\mbox{flops}}{\mbox{cycle}}\]
So, putting this together, we have two flops per FMA, times four
FMAs per vector FMA instruction, times two vector FMA
instructions per cycle, or 16 flops per cycle.
Flops / sec (one core)
\[\begin{aligned}
16 \frac{\mbox{flops}}{\mbox{cycle}} \times
(1.6 \times 10^9) \frac{\mbox{cycle}}{\mbox{sec}}
&= 25.6 \times 10^9 \frac{\mbox{flops}}{\mbox{sec}} \\
&= 25.6~\mbox{Gflop/s}
\end{aligned}\]
Multiply 16 flops per cycle by a clock rate of 1.6 GHz or 1.6
billion cycles per second, and we have a single core flop rate
of 25.6 GFlop/s.
Flops / sec
\[
25.6 \frac{\mbox{Gflop/s}}{\mbox{core}} \times
2~\mbox{cores} =
51.2~\mbox{Gflop/s}
\]
Things get more complicated if there are different core types
(e.g. CPU cores and GPU cores)
My laptop has two cores, so the theoretical peak is 51.2
GFlop/s. Many high-performance systems have a mix of different
types of cores -- CPUs and GPUs -- and so the computation for
them is more complicated. But the basic picture remains the
same. To compute the peak flop rate, we want to figure out how
many flops per cycle we can manage, and then multiply by the
clock rate.
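The same bookkeeping fits in a few lines of C. The parameters below are
the ones from my laptop; treat them as placeholders and substitute your
own machine's numbers:

#include <stdio.h>

// Recompute the theoretical peak from the factors discussed above.
// Parameters are for the 2018 MacBook Air example; substitute your own.
int main(void)
{
    double flops_per_fma   = 2.0;   // count an FMA as two flops
    double fmas_per_vector = 4.0;   // 256-bit vector = 4 doubles
    double vfmas_per_cycle = 2.0;   // two vector FMAs can start per cycle
    double ghz             = 1.6;   // clock rate (billions of cycles/s)
    double cores           = 2.0;

    double per_core = flops_per_fma * fmas_per_vector * vfmas_per_cycle * ghz;
    printf("Peak: %.1f Gflop/s per core, %.1f Gflop/s total\n",
           per_core, per_core * cores);
    return 0;
}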
Some historical context
Note: CM-5/1024
peak was 131 Gflop/s.
This was the top machine on the first Linpack benchmark list
(July 1993).
It is sort of impressive that my dinky MacBook Air has a
theoretical peak of 51.2 GFlop/s. For historical context, the
first Linpack benchmark list came out in July 1993, and the top
machine at the time (the CM-5/1024 from Thinking Machines) came
in at a theoretical peak of 131 GFlop/s. Maybe that seems like
a long time ago to you, but I was in high school in 1993!
Sanity check
What is the peak (CPU) flop rate on your machine?
This StackOverflow thread might help you figure out your flops/cycle.
All right. Assuming that you are not using the same model of
MacBook Air, take a moment: what is the theoretical peak (double
precision) flop rate on the CPUs on your machine, whatever it
may be? I've linked a StackOverflow thread that discusses some
of the relevant parameters for various families of processors.
I suggest going to look it up!
The Cost of Computing
(Single core)
All right. So I have cores capable of a theoretical peak of
25.6 GFlop/s. Let's talk about what fraction of that peak I
might reasonably expect to get.
The Cost of Computing
Consider a simple serial code:
// Accumulate C += A*B for n-by-n matrices (column-major storage)
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
It helps to be concrete, so let's consider a numerical method
that multiplies two square n-by-n matrices. We will use the
classic three nested loops approach that you probably first
learned when you learned how to multiply matrices. If you no
longer remember how to multiply two matrices, now is a good time
to remind yourself! It will come up again soon enough.
The innermost loop computes a dot product between a row of
matrix A and a column of matrix B; that takes n multiplies and n
adds. The outer two loops iterate over n^2 such products. The
total cost is therefore 2n^3 flops.
The code assumes the matrices A, B, and C are laid out in
column-major order in memory: that is, all the entries of the
first column appear first, followed by all the entries of the
second column, and so forth. This is the order used by Fortran and
MATLAB (NumPy defaults to row-major, though it supports both). C
uses row-major ordering, to the extent that it supports
multi-dimensional arrays at all. So we have to
manually compute the access function that maps from row and
column indices into a one-dimensional representation.
We'll have more to say about memory layouts for
multi-dimensional arrays in future lectures.
The Cost of Computing
Simplest model:
Dominant cost is \(2n^3\)
flops (adds and multiplies)
Flops run at the single-core peak rate (25.6 Gflop/s)
Expected time is
\[
\mbox{Time (s)} \approx
\frac{2n^3 \mbox{ flops}}
{25.6 \cdot 10^9 \mbox{ flop/s}}
\]
Problem: Model assumptions are wrong!
So how long does it actually take to multiply two matrices? A
very simple estimate might go as follows: let's assume that we
are close to peak performance. Then the time (in seconds) is
the number of flops divided by the flop rate (in flops per
second). Certainly the units work out! For my laptop, with a
theoretical peak on one core of 25.6 Gflop/s, this model would
predict that I could multiply two 2400-by-2400 matrices in a bit
over a second (1.08 seconds). Of course, the assumption that we
are running at peak flop rate is probably wrong. So how long
does it actually take?
dbindel@MacBook-Air-5 codes % gcc naive-matmul.c
dbindel@MacBook-Air-5 codes % time ./a.out
./a.out 112.04s user 0.43s system 99% cpu 1:53.25 total
The answer: it takes almost two minutes on my laptop. Our naive
estimate was off by almost two orders of magnitude. Clearly,
something was missing from our performance model.
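I haven't shown naive-matmul.c itself in these notes; a minimal driver
along the following lines (the matrix size and the use of clock() are
choices I'm making here, not necessarily what the original file does)
would reproduce the experiment:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Sketch of a timing driver for the naive three-loop matrix multiply.
// The matrices are column-major, as in the code above.
int main(void)
{
    int n = 2400;
    double* A = calloc((size_t) n*n, sizeof(double));
    double* B = calloc((size_t) n*n, sizeof(double));
    double* C = calloc((size_t) n*n, sizeof(double));

    clock_t start = clock();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
    double elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;

    printf("n = %d: %.2f s (%.2f Gflop/s)\n",
           n, elapsed, 2.0*n*n*n / elapsed / 1e9);
    free(A); free(B); free(C);
    return 0;
}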
The Cost of Computing
Dominant cost is \(2n^3\) flops
(adds and multiplies)?
The main problem with our naive estimate is that we neglected
the cost to fetch the data from memory. In fact, memory
accesses cost a lot more than flops! And the way we wrote our
code makes it so that we will have to do a slow memory access
for almost every flop, so the computation is not the dominant
cost. There are alternate ways of writing the code that make it
so that we can re-use many of the slow memory accesses, and
actually get close to the peak speed. We will talk about these
alternate approaches in future lectures.
More generally, communication -- whether with memory or with
other processors -- has not improved at the same rate that peak
flop rates have improved. So a lot of the games that we will
play have to do with minimizing communication with slow memory,
or communication between processors.
The Cost of Computing
Two pieces to cost of fetching data
Latency
Time from operation start to first result (s)
Bandwidth
Rate at which data arrives (bytes/s)
When we think about the cost of memory accesses, or other
communications, we usually think about two distinct things. The
first is the latency, or the time between when an operation
starts and when we get the first result. The second is
bandwidth, or the rate at which data arrives (in steady state).
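One standard way to put these two numbers together is the so-called
\(\alpha\)-\(\beta\) model: the time to move \(n\) bytes is roughly
\[
T(n) \approx \alpha + \frac{n}{\beta},
\]
where \(\alpha\) is the latency (seconds) and \(\beta\) is the
bandwidth (bytes per second).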
The Cost of Computing
Usually latency \(\gg\) bandwidth\(^{-1}\) \(\gg\) time per flop
Latency to L3 cache is 10s of ns
DRAM is \(3\)–\(4\times\) slower
Partial solution: caches (to discuss next time)
See: Latency numbers every programmer should know
Your computer has several different types of memory. For the
larger, slower memories, the latency can be really long. Once
one has started getting data, the bandwidth seems OK -- though
what I've written on the slide, that latency is much greater
than inverse bandwidth, is not really sensible (because they have
different units). But the inverse bandwidth is certainly much bigger
than the time per flop.
Fortunately, modern computers also come with small, fast memories with
lower latencies and higher bandwidths than the main memory.
These fast memories, called caches, are key to single-core
performance of things like matrix-matrix products.
It's worthwhile having some idea of how long it takes to fetch
data from different types of memory. I recommend taking a minute or two
to look over the numbers on the web page linked from this slide.
The Cost of Computing
Makes DRAM (\(\sim 100\) ns)
look even worse:
\[
100 \mbox{ ns} \times
25.6 \mbox{ Gflop/s} = 2560 \mbox{ flops}
\]
The main memory typically uses synchronous dynamic random-access
memory or SDRAM. There are several different latency numbers
for SDRAM, but the right order of magnitude for a new, random
read to memory is on the order of a few tens of nanoseconds
(ns). A good ballpark estimate is 100 ns. This is pessimistic,
but it is certainly not off by an order of magnitude.
How long is 100 ns in flops? On my machine, it is 2560 flops,
which is a lot.
The Cost of Computing
Lose orders of magnitude if too many memory refs
And getting full vectorization is also not easy!
We’ll talk more about (single-core) arch next time
So: we lose orders of magnitude in performance if we have too
many random references to slow memory (DRAM). And even if we
manage to finesse this issue, getting enough vectorization to
approach 16 flops per cycle turns out to be nontrivial. We'll
talk about these things in more detail over the next few
lectures, starting with the upcoming lecture on single-core
architecture.
The Cost of Computing
What to take away from this example?
All right. Aside from the fact that compute is fast and memory
is slow, what should we take away from this example?
The Cost of Computing
Start with a simple model
Simplest: asymptotic complexity (e.g. \(O(n^3)\) flops)
Counting every detail just complicates life
But we want enough detail to predict something
Even more than the numbers or the conclusions about the relative
speed of compute and memory, I'd like you to remember how we
thought about modeling performance in this example. We started
off just keeping track of flops, and that gave us a lower bound
on the time it would take to do a computation. But because our
model was too simple, we missed something important; namely, the
memory. It's natural to skip over this detail when we think
about algorithm complexity; and, really, it only affects the
constants in a big-O view of the world. But constants are
important in HPC, and if we want a model that will be at all
predictive, we need some details about communication and memory
costs.
The Cost of Computing
Watch out for hidden costs
Flops are not the only cost!
Memory/communication costs are often killers
Integer computation may play a role as well
There are a lot of situations in which memory and communication
costs a lot more than computation! It's also worth noting that
floating point operations are not the only part of a
computation -- things like integer computation costs may also
play a role.
The Cost of Computing
Haven’t even talked about > 1 core yet!
And this depressing complexity all came about from discussing a
serial code written in four lines of C! Things get even more
complicated when we talk about parallel code, of course.
The Cost of Computing
(in parallel)
So, now that we've depressed ourselves with the single-core
case, let's talk about some basic concepts for reasoning about the
complexity and performance of parallel codes.
The Cost of (Parallel) Computing
Simple model:
\[
T_{\mbox{parallel}} = \frac{T_{\mbox{serial}}}{p}
\]
... and you should be suspicious by now!
A naive view of the world, one that is trotted out far too
often by people who ought to know better, is that p processors
will result in a p-fold improvement over the execution time of a
serial code for the same task. You should be suspicious of this
reasoning by now. In fact, it rarely holds.
The Cost of (Parallel) Computing
Why is parallel time not \(T/p\)?
Overheads: Communication, synchronization, extra computation and memory overheads
Intrinsically serial work
Idle time due to synchronization
Contention for resources
One problem in the parallel case is communication overhead.
Just as it takes time to ask a memory for data, it takes time to
send messages from one processor to another. In addition, just
because we have p processors available, that does not mean that
we can use them all effectively at once! Sometimes we have
intrinsically serial tasks, or tasks that offer little
parallelism. We might have issues of load balance that leave
some processors idle much of the time. Or we might spend all our
time contending for access to shared resources (which is a type
of communication overhead, I suppose). In each case, we are
going to have factors that keep us from getting close to
perfect parallel efficiency.
Quantifying Parallel Performance
Starting point: good serial performance
Scaling study: compare parallel to serial time as a function of the number of processors (\(p\)):
\[\begin{aligned}
\mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\[2mm]
\mbox{Efficiency} &= \frac{\mbox{Speedup}}{p}
\end{aligned}\]
Ideally, speedup = \(p\). Usually, speedup \(< p\).
Barriers to perfect speedup
A (strong) scaling study is an experiment in which we compare
the performance of a parallel code to the performance of a
well-tuned serial code. In strong scaling, we look at speedup,
or the ratio of serial to parallel time, versus the number of
processors. The efficiency is the speedup relative to the
number of processors; 100 percent efficiency means that you get
the p-fold speedup of your dreams.
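As a made-up worked example: if a well-tuned serial code takes 10
seconds and the parallel version takes 2 seconds on \(p = 8\) cores, then
\[
\mbox{Speedup} = \frac{10}{2} = 5, \qquad
\mbox{Efficiency} = \frac{5}{8} \approx 62\%.
\]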
The tuning of the serial code matters! You can get
beautiful-looking speedups if you compare to a bad serial code,
but those attractive speedup curves don't actually tell you that
you are going to get answers fast as you add processors to
your parallel code. Rather, they tell you that you are going to
get answers slower than you ought to, pretty much across the board.
Amdahl’s Law
Parallel scaling study where some serial code remains:
\[\begin{aligned}
p = & \mbox{ number of processors} \\
s = & \mbox{ fraction of work that is serial} \\
t_s = & \mbox{ serial time} \\
t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p
\end{aligned}\]
\[\mbox{Speedup} =
\frac{t_s}{t_p} \leq \frac{1}{s + (1-s) / p} <
\frac{1}{s}\]
Last time, we talked about one of the fundamental modeling
results for strong scaling: Amdahl's law. Amdahl says that if
some fraction s of the serial code cannot be parallelized, then
our speedup will be no more than 1/s, no matter how many
processors we use.
Amdahl’s Law
\[\mbox{Speedup} < \frac{1}{s}\]
So \(1\%\) serial work \(\implies\) max speedup \(< 100\times\), regardless of \(p\).
So, for example, if our serial code spends one percent of our
time loading up a problem and the other 99 percent computing,
then parallelizing only the computational section will limit our
overall speedup to at most 100. That's bad news if we're trying
to impress a sponsor with parallel efficiency of our code on
their 10000-core supercomputer!
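And the \(1/s\) bound is itself optimistic: plugging \(s = 0.01\) and
\(p = 100\) into the formula gives
\[
\mbox{Speedup} \leq \frac{1}{0.01 + 0.99/100} \approx 50,
\]
so a hundred processors only get us about halfway to that limiting
factor of 100.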
Strong and weak scaling
Amdahl looks bad! But there are two types of scaling studies:
Strong scaling
Fix problem size, vary \(p\)
Weak scaling
Fix work per processor, vary \(p\)
When we talk about Amdahl's law, we're talking about performance
for a fixed problem, or strong scaling. But often, we want a
bigger computer so that we can solve bigger problems, not just
to solve the same problem repeatedly. In a weak scaling study,
we fix the amount of work per processor rather than fixing the
overall problem size. This leads to very different reasoning.
Strong and weak scaling
For weak scaling, study scaled speedup
\[
S(p) =
\frac{T_{\mbox{serial}}(n(p))}{T_{\mbox{parallel}}(n(p), p)}
\]
Gustafson’s Law:
\[
S(p) \leq p - \alpha(p-1)
\]
where \(\alpha\) is the
fraction of work that is serial.
The analogue of Amdahl's law for weak scaling is Gustafson's
law. Gustafson says that when we scale the problem size with
the number of processors, the scaled speedup can grow linearly with
the processor count, with slope \(1-\alpha\).
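For example, with a serial fraction of \(\alpha = 0.01\) and \(p = 100\),
Gustafson's bound is
\[
S(100) \leq 100 - 0.01 \times 99 = 99.01,
\]
a much rosier picture than the roughly 50-fold speedup Amdahl's law
allows at \(p = 100\) for the same serial fraction in a strong scaling
study.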
Imperfect parallelism
Problem is not just with purely serial work, but
Work that offers limited parallelism
Coordination overheads.
Amdahl's law and Gustafson's law are phrased in terms of serial
work. But many algorithms have sections that, while not serial,
have limited parallelism. For example, consider processing a
tree from the leaves to the root: lots of parallelism at the
leaves, little close to the root. There are also cases where
we have to worry about coordination overheads that might
actually grow with the number of processors! Amdahl and
Gustafson capture the spirit of all these problems, if not all
the details. The point is that we have to keep the
less-parallel overheads small relative to the work being done
per processor if we want to see reasonable parallel efficiency.
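To put a number on the tree example: summing \(n\) values with a
balanced binary reduction tree takes \(n - 1\) additions in total, but
its \(\lceil \log_2 n \rceil\) levels must run one after another, so the
speedup is at most
\[
\frac{n-1}{\lceil \log_2 n \rceil}
\]
no matter how many processors we have -- roughly \(100\times\) for
\(n = 1024\), even though no single step is purely serial.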
Dependencies
Main pain point: dependency between computations
a = f(x)
b = g(x)
c = h(a,b)
Compute a and b in parallel, but finish both before c!
Limits amount of parallel work available.
This is a true dependency (read-after-write). Also have false
dependencies (write-after-read and write-after-write) that can be
dealt with more easily.
What keeps us from achieving high parallelism? In a word:
dependencies. If the output of one computation is an input to
another computation, those computations cannot run in parallel.
This is a true dependency, but there are also false dependencies
-- things that shouldn't really matter, but cause problems with
so-called data races in practice. Copied data structures or
synchronization strategies can address false dependencies, but
the only way to deal with a true dependency is to rethink the
algorithm.
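To make the slide's pseudocode concrete, here is one way the pattern
might look with OpenMP tasks (the pragmas and the placeholder
definitions of f, g, and h are mine; compile with OpenMP support,
e.g. -fopenmp):

#include <stdio.h>

// Placeholder computations standing in for the slide's f, g, and h.
double f(double x) { return x + 1.0; }
double g(double x) { return 2.0 * x; }
double h(double a, double b) { return a * b; }

int main(void)
{
    double x = 3.0, a, b, c;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a)
        a = f(x);             // no dependency between a and b,
        #pragma omp task shared(b)
        b = g(x);             // so these two tasks may run in parallel
        #pragma omp taskwait  // but both must finish before we compute c
        c = h(a, b);
    }
    printf("c = %g\n", c);
    return 0;
}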
Granularity
Coordination is expensive
- including parallel start/stop!
Need to do enough work to amortize parallel costs
Not enough to have parallel work, need big chunks!
Chunk size depends on the machine.
As it turns out, having lots of latent parallelism isn't
enough. We need long, expensive chunks of work that can be done
completely independently of other long, expensive chunks of
work. Otherwise, we spend more time coordinating than we do
getting work done. Of course, as we will see, there can also be
downsides to big independent chunks of work. This will come up
in particular when we talk about issues surrounding problem
partitioning and load balancing.