Applications of Parallel Computers
Intro to Shared Memory
Prof David Bindel
Please click the play button below.
Welcome to another exciting week of 5220! The topic this week
is high-performance shared memory programming.
Message passing pain
Common message passing pattern
Logical global structure
Local representation per processor
Local data may have redundancy
Example: Data in ghost cells
Example: Replicated book-keeping data
Last week, we discussed message passing programming. This is a
“shared nothing” model where all the data is stored locally,
processor-by-processor. There is a global structure, but it is
purely notional; there is no abstraction available for accessing a
shared array, for example. And the per-processor local
representations may have some redundancy, like extra “ghost cells”
in our hyperbolic PDE example.
Message passing pain
Big pain point:
Many partly-overlapping representations
Need consistent picture across processes
Wouldn’t it be nice to have just one representation?
For most people, the painful part of message-passing programming
is keeping the many overlapping local representations of the program
data consistent with a notional global representation. OK, MPI
syntax is painful as well, but you can write wrapper functions to
help you with that. So if you get started with message passing, it’s
tempting to think that life would be so much better with a single
representation of the global state, stored in shared memory.
Shared memory vs message passing
Implicit comm via memory vs explicit messages
Still need separate global vs local picture?
In a shared memory programming system, communication happens
implicitly through reading and writing to shared global data
structures, rather than explicitly through message passing. So
this means that we don’t need to distinguish between the global
and local pictures, right?
You might be tempted by this point of view if you hadn’t already
done the matrix multiply exercise and found yourself at some point
thinking “wouldn’t it be nice if I could have explicit control over
the cache”? As it is, I expect you know already that the question of
whether we need to worry about local vs global data is a trick. If
we care about performance, we need to care about this
distinction.
Shared memory vs message passing
Still need separate global vs local picture?
No: One thread-safe data structure may be easier
Yes: More sharing can hurt performance
Synchronization costs even with no contention
Contention for locks reduces parallelism
Cache coherency slows even non-contending access
It is indeed possible to write correct shared memory code in a
“shared everything” style, where all the major data structures
exist in shared memory. We have to be careful about
synchronization so that we don’t get that data structure into an
inconsistent state, but we can do it.
However, too much careless sharing can hurt performance, at least
when we are writing to our data structure (it’s less of an issue for
the read-only case). The problem is that, just as message latency
costs can eat all our performance in message passing codes,
synchronization costs can eat all our performance with shared
memory. Actually, matters are even worse than that: even if we are
doing work that requires no synchronization, we might be slowed down
by cache coherency effects if we’re not careful about data
layout.
Shared memory vs message passing
Still need separate global vs local picture?
“Easy” approach: add multi-threading to serial code
Better performance: design like message-passing
Let’s dig a little deeper on the HW side
The takeaway for all this is that high-performance shared-memory
code often looks an awful lot like high-performance message
passing code. The easy approach of starting with a serial code
and adding OpenMP directives to parallelize a few expensive
loops often just slows everything down. It’s best to think about
natural chunks of work and compact local data structures early
on.
I don’t mean to trash-talk shared memory here. I’d rather write
OpenMP code than MPI code on a single processor. I actually also
like programming in parallel global address space (PGAS) languages
more than I like MPI. There is often less code to write. But
there’s less code to write because some of the things that matter
to performance are implicit, and we need to have the mental model
of those implicit operations in order to reason about how to
organize our code for performance.
So let’s talk for a moment now about the hardware, and what makes
high-performance shared memory difficult.
Memory model
Single processor: return last write
What about DMA and memory-mapped I/O?
Simplest generalization: sequential consistency
The programmer interacts with the hardware through an
abstraction. In the case of memory, that abstraction is the memory
model. On a single processor, we usually think of memory as having
the property that a read from a memory location will return the
last thing that we wrote to that location. There are some
exceptions to this even in the case of a single core, like direct
memory access (DMA) by the network card, but apart from those
special cases, the single-processor memory interface is pretty
simple.
The natural generalization of the single-processor memory
abstraction to the multi-processor case is called sequential
consistency.
Sequential consistency
A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
– Lamport, 1979
The idea of sequential consistency is almost as old as I am. It
goes back to the work of Leslie Lamport in the late 70s. Lamport
said that sequential consistency means that memory looks like
multiple processors took turns reading and writing. That is, we
could explain the outcome of a multiprocessor program by writing an
interleaving of the program order executions on each processor, and
executing that interleaved code in order.
Sequential consistency
Program behaves as if :
Each process runs in program order
Instructions from different processes are interleaved
Interleaved instructions ran on one processor
All right. So the rules of the game are that the instructions
for different programs run in order, but interleaved. Note that
this says nothing about speed - this is all about reasoning
about code correctness.
We also are assuming that reads and writes are atomic - that is,
we can’t get halfway through a write and have another processor
start reading while half the bits are in the old state and half
the bits in the new state.
Example: Spin lock
Initially, flag = 0
and sum = 0
Processor 1:
sum += p1;
flag = 1;
Processor 2:
while (!flag);
sum += p2;
Let’s give an example of reasoning about multiprocessor code
using sequential consistency. Suppose we have two processors
trying to update a shared variable (sum). We start with a shared
flag variable and a shared sum variable, both set to zero. After
each processor makes its contribution, we want to have them both
update the sum. In order to avoid the problem of both of them
reading the initial state of the sum variable, we force processor
1 to go before processor 2, by saying that processor 2 will wait
until the flag changes state before applying its update. This
waiting loop is sometimes called a spin loop.
So processor 1 applies an update and raises a flag, and then
processor 2 is cleared to get past the spin loop to apply its
update. And this is a very reasonable thing to do if we have
sequential consistency.
Example: Spin lock
Without sequential consistency support, what if
Processor 2 caches flag?
Compiler optimizes away loop?
Compiler reorders assignments on P1?
Starts to look restrictive!
But let’s think for a moment about the implications of
sequential consistency for this code. Somehow, processor 2 has
to know that it can’t just read a cached version of the flag
variable; otherwise, it might get stuck forever. And the
compiler has to know that another processor might be doing
something behind the scenes, otherwise it could optimize away
the loop, or potentially reorder things as part of the
instruction scheduler. After all, if we look at the code running
on either processor 1 or processor 2 independently, the two
statements that each runs look like they could be safely
interchanged.
When we start to think about all the hardware and compiler
optimizations we have to give up for sequential consistency,
suddenly it starts to look restrictive!
Sequential consistency
Program behavior is “intuitive”:
Nobody sees garbage values
Time always moves forward
One issue is cache coherence :
Coherence: different copies, same value
Requires (nontrivial) hardware support
Also an issue for optimizing compiler!
There are cheaper relaxed consistency models.
So, the good part of sequential consistency is that there’s an
intuitive mental model. Time moves forward, programs take turns,
nobody sees garbage values that are inconsistent with a serial
timeline. But there are performance issues. At the hardware level,
we have to worry about cache coherence: when one processor writes to
a shared variable, the other processor has to see the write, not
just keep reading from a locally cached copy. And at the software
level, the compiler is restricted in the types of reorderings it can
do.
Fortunately, there are alternate memory models that are easier
for performance, though they make reasoning about correctness
harder. These are usually called relaxed consistency models.
Before we talk about relaxed consistency, though, let’s talk for
a moment about how the hardware can support sequential
consistency.
Snoopy bus protocol
Basic idea:
Broadcast operations on memory bus
Cache controllers “snoop” on all bus transactions
Memory writes induce serial order
Act to enforce coherence (invalidate, update, etc)
Maybe the simplest hardware protocols for cache coherence are
the “snoopy bus” protocols. The idea is that all the processors
and the memory are attached to a shared bus, and every memory
request (and every reply from the memory) are sent across that
bus. All the processors can see the bus, and so the cache
controllers on each processor “snoop” on every bus transaction,
even the ones that they didn’t initiate. If one processor sends
a memory write across the bus, any other processor that has the
associated line in cache will either invalidate or update that
cache line. The fact that all the processors have to share the
same bus, and only one at a time can be actively transmitting on
the bus, means that the bus serves as a synchronization
point. So it is relatively straightforward to maintain cache
coherence, even if it means that we have to have some extra
hardware around for a cache controller to monitor the bus.
Snoopy bus protocol
Problems:
Bus bandwidth limits scaling
Contending writes are slow
Have other protocol options (e.g. directory-based).
But usually give up on full sequential consistency.
As we discussed a few lectures ago, though, we can only really
use buses with relatively small multiprocessors. Contention for the
bus slows everything down, and the bus also limits the aggregate
memory bandwidth available to the processors. There are more
sophisticated things that hardware architects can do, such as
directory-based protocols that involve distributing memory among the
processors and passing around messages for cache coherence; but
those often involve giving up full sequential consistency. So even
at the hardware level, without taking into account compiler
optimizations, full sequential consistency may be too much to ask
for.
Weakening sequential consistency
Try to reduce to the true cost of sharing
volatile
tells compiler: worry about sharing
Atomic operations do reads/writes as a single op
Memory fences tell when to force consistency
Synchronization ops (lock/unlock) include fences
Unprotected data races give undefined behavior.
The idea with weakened consistency models is to reduce the true
cost of sharing by only doing things that enforce consistency
when they are needed. For example, the volatile
type qualifier in C tells the compiler that a variable is shared
(with both reads and writes involved), and so statements
involving that variable cannot be reordered by the compiler,
even if it looks like they are independent of neighboring
statements. We also might use hardware-supported atomic
operations like compare and swap or atomic increments that are
guaranteed to read and write to memory as part of a single
operation. Then there are memory fence operations, which are
like little barriers: the program can execute up to a fence in
whatever order, but it is guaranteed that all writes to memory
from before the fence are committed before any instructions
after the fence. And synchronization primitives like locks
generally include memory fence operations, or maybe they rely on
hardware-supported atomic operations.
These types of operations provide sequential “checkpoints” in the
program execution, but the weaker consistency models don’t make any
guarantees about the apparent interleaving of instructions between
those checkpoints.
Sharing
True sharing:
Frequent writes cause a bottleneck.
Idea: make independent copies (if possible).
Example problem: malloc
/free
data structure.
So we weaken the memory model to make it easier to get good
compiler and hardware performance while still giving ourselves a
way to reason about the correctness of (correctly synchronized)
shared access. What’s the remaining cost associated with true
sharing? Shared reads are essentially free; the problem is
sharing data structures where we have frequent writes. This
suggests that when it’s possible, we might want to make
independent copies of some of these shared data structures, or
explicitly partition the data structures the same way that we
would in a message-passing code.
There are sometimes cases of shared data structures that might
not be entirely obvious. One of these is the data structures that
keep track of free space on the heap - the ones that are updated
by malloc and free in C, or by new and delete in C++, and are
almost completely invisible in some garbage-collected
languages. So having lots of processors dynamically allocating
memory for heap data structures - even if those data structures
are shared - can sometimes cause surprising performance
bottlenecks, depending on how the standard library implements
these tracking data structures. Some allocators are indeed pretty
clever about avoiding these bottlenecks, but I still tend to code
in a way that treats malloc and free as expensive.
Sharing
False sharing:
Distinct variables on same cache block
So make processor memory contiguous (if possible)
Example problem: array of ints, one per processor
Even with weak consistency models, there can be hidden costs
when there isn’t really any sharing. The most common example is
the cost of cache coherence overhead in the case of “false
sharing.” This happens when there are distinct variables on the
same cache line, so that multiple processors read and write to
the same cache line even if they are reading and writing to
different variables on that line. The memory controller
organizes everything by cache lines, so the hardware can’t
really distinguish this type of access pattern from shared reads
and writes. For example, we might see this if several processors
were all trying to write to a shared array of ints, one per
processor; even if each processor reads and writes a subset of
the ints that are disjoint from those accessed by any other
processor, false sharing might cause the memory controller to
start generating lots of cache coherence traffic, thus slowing
everything down.
We can usually avoid false sharing effects by having each
processor keep its local data in a contiguous chunk of memory,
rather than sharing an array in a fine-grained way.
Take-home message
Sequentially consistent shared memory useful
“Natural” analogue to serial case
Architects work hard to support it
... but implementation is costly!
Makes life hard for optimizing compilers
Coherence traffic slows things down
Helps to limit sharing
Think about these things to get good performance.
What’s the take-home message from all this? First, if you want
to write correct code, you need to have a model for how shared
memory behaves. Sequential consistency is maybe the most natural
and intuitive model, and computer architects work hard to
support it as well as they can. But the cost of maintaining such
a strong consistency model is high, both in terms of hardware
and in terms of lost opportunities for optimizing compilers. So
we usually go with weaker models of consistency where the
behavior under data races is undefined. Even in these weaker
memory models, if we want performance, it helps to limit sharing
- or at least to limit sharing that involves both reads and
writes.
Just like in the single-core case, there’s a lot about the
caches and the hardware that you don’t have to think about if
you only want correctness. But if you want high performance, you
have to have at least a rough notion of what is going on behind
the scenes.
With that out of the way, we’ll turn in the next deck to the
nuts and bolts of shared memory programming with OpenMP. Until
then!