CS 5220

Applications of Parallel Computers

Intro to Shared Memory

Prof David Bindel

Please click the play button below.

Message passing pain

Common message passing pattern

Logical global structure
Local representation per processor
Local data may have redundancy
- Example: Data in ghost cells
- Example: Replicated book-keeping data

Message passing pain

Big pain point:

Many partly-overlapping representations
Need consistent picture across processes

Wouldn’t it be nice to have just one representation?

Shared memory vs message passing

Implicit comm via memory vs explicit messages
Still need separate global vs local picture?

Shared memory vs message passing

Still need separate global vs local picture?

No: One thread-safe data structure may be easier
Yes: More sharing can hurt performance
- Synchronization costs even with no contention
- Contention for locks reduces parallelism
- Cache coherency slows even non-contending access

It is indeed possible to write correct shared memory code in a “shared everything” style, where all the major data structures exist in shared memory. We have to be careful about synchronization so that we don’t get that data structure into an inconsistent state, but we can do it.

However, too much careless sharing can hurt performance, at least when we are writing to our data structure (it’s less of an issue for the read-only case). The problem is that, just as message latency costs can eat all our performance in message passing codes, synchronization costs can eat all our performance with shared memory. Actually, matters are even worse than that: even if we are doing work that requires no synchronization, we might be slowed down by cache coherency effects if we’re not careful about data layout.

Shared memory vs message passing

Still need separate global vs local picture?

“Easy” approach: add multi-threading to serial code
Better performance: design like message-passing

Let’s dig a little deeper on the HW side

The takeaway for all this is that high-performance shared-memory code often looks an awful lot like high-performance message passing code. The easy approach of starting with a serial code and adding OpenMP directives to parallelize a few expensive loops often just slows everything down. It’s best to think about natural chunks of work and compact local data structures early on.

I don’t mean to trash-talk shared memory here. I’d rather write OpenMP code than MPI code on a single processor. I actually also like programming in parallel global address space (PGAS) languages more than I like MPI. There is often less code to write. But there’s less code to write because some of the things that matter to performance are implicit, and we need to have the mental model of those implicit operations in order to reason about how to organize our code for performance.

So let’s talk for a moment now about the hardware, and what makes high-performance shared memory difficult.

Memory model

Single processor: return last write
- What about DMA and memory-mapped I/O?
Simplest generalization: sequential consistency

Sequential consistency

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
– Lamport, 1979

Sequential consistency

Program behaves as if:

Each process runs in program order
Instructions from different processes are interleaved
Interleaved instructions ran on one processor

Example: Spin lock

Initially, flag = 0 and sum = 0

Processor 1:

sum += p1;
flag = 1;

Processor 2:

while (!flag);
sum += p2;

Let’s give an example of reasoning about multiprocessor code using sequential consistency. Suppose we have two processors trying to update a shared variable (sum). We start with a shared flag variable and a shared sum variable, both set to zero. After each processor makes its contribution, we want to have them both update the sum. In order to avoid the problem of both of them reading the initial state of the sum variable, we force processor 1 to go before processor 2, by saying that processor 2 will wait until the flag changes state before applying its update. This waiting loop is sometimes called a spin loop.

So processor 1 applies an update and raises a flag, and then processor 2 is cleared to get past the spin loop to apply its update. And this is a very reasonable thing to do if we have sequential consistency.

Example: Spin lock

Without sequential consistency support, what if

Processor 2 caches flag?
Compiler optimizes away loop?
Compiler reorders assignments on P1?

Starts to look restrictive!

But let’s think for a moment about the implications of sequential consistency for this code. Somehow, processor 2 has to know that it can’t just read a cached version of the flag variable; otherwise, it might get stuck forever. And the compiler has to know that another processor might be doing something behind the scenes, otherwise it could optimize away the loop, or potentially reorder things as part of the instruction scheduler. After all, if we look at the code running on either processor 1 or processor 2 independently, the two statements that each runs look like they could be safely interchanged.

When we start to think about all the hardware and compiler optimizations we have to give up for sequential consistency, suddenly it starts to look restrictive!

Sequential consistency

Program behavior is “intuitive”:

Nobody sees garbage values
Time always moves forward

One issue is cache coherence:

Coherence: different copies, same value
Requires (nontrivial) hardware support

Also an issue for optimizing compiler!

There are cheaper relaxed consistency models.

So, the good part of sequential consistency is that there’s an intuitive mental model. Time moves forward, programs take turns, nobody sees garbage values that are inconsistent with a serial timeline. But there are performance issues. At the hardware level, we have to worry about cache coherence: when one processor writes to a shared variable, the other processor has to see the write, not just keep reading from a locally cached copy. And at the software level, the compiler is restricted in the types of reorderings it can do.

Fortunately, there are alternate memory models that are easier for performance, though they make reasoning about correctness harder. These are usually called relaxed consistency models.

Before we talk about relaxed consistency, though, let’s talk for a moment about how the hardware can support sequential consistency.

Snoopy bus protocol

Basic idea:

Broadcast operations on memory bus
Cache controllers “snoop” on all bus transactions
- Memory writes induce serial order
- Act to enforce coherence (invalidate, update, etc)

Maybe the simplest hardware protocols for cache coherence are the “snoopy bus” protocols. The idea is that all the processors and the memory are attached to a shared bus, and every memory request (and every reply from the memory) are sent across that bus. All the processors can see the bus, and so the cache controllers on each processor “snoop” on every bus transaction, even the ones that they didn’t initiate. If one processor sends a memory write across the bus, any other processor that has the associated line in cache will either invalidate or update that cache line. The fact that all the processors have to share the same bus, and only one at a time can be actively transmitting on the bus, means that the bus serves as a synchronization point. So it is relatively straightforward to maintain cache coherence, even if it means that we have to have some extra hardware around for a cache controller to monitor the bus.

Snoopy bus protocol

Problems:

Bus bandwidth limits scaling
Contending writes are slow

Have other protocol options (e.g. directory-based).
But usually give up on full sequential consistency.

Weakening sequential consistency

Try to reduce to the true cost of sharing

volatile tells compiler: worry about sharing
Atomic operations do reads/writes as a single op
Memory fences tell when to force consistency
Synchronization ops (lock/unlock) include fences

Unprotected data races give undefined behavior.

The idea with weakened consistency models is to reduce the true cost of sharing by only doing things that enforce consistency when they are needed. For example, the volatile type qualifier in C tells the compiler that a variable is shared (with both reads and writes involved), and so statements involving that variable cannot be reordered by the compiler, even if it looks like they are independent of neighboring statements. We also might use hardware-supported atomic operations like compare and swap or atomic increments that are guaranteed to read and write to memory as part of a single operation. Then there are memory fence operations, which are like little barriers: the program can execute up to a fence in whatever order, but it is guaranteed that all writes to memory from before the fence are committed before any instructions after the fence. And synchronization primitives like locks generally include memory fence operations, or maybe they rely on hardware-supported atomic operations.

These types of operations provide sequential “checkpoints” in the program execution, but the weaker consistency models don’t make any guarantees about the apparent interleaving of instructions between those checkpoints.

Sharing

True sharing:

Frequent writes cause a bottleneck.
Idea: make independent copies (if possible).
Example problem: malloc/free data structure.

So we weaken the memory model to make it easier to get good compiler and hardware performance while still giving ourselves a way to reason about the correctness of (correctly synchronized) shared access. What’s the remaining cost associated with true sharing? Shared reads are essentially free; the problem is sharing data structures where we have frequent writes. This suggests that when it’s possible, we might want to make independent copies of some of these shared data structures, or explicitly partition the data structures the same way that we would in a message-passing code.

There are sometimes cases of shared data structures that might not be entirely obvious. One of these is the data structures that keep track of free space on the heap - the ones that are updated by malloc and free in C, or by new and delete in C++, and are almost completely invisible in some garbage-collected languages. So having lots of processors dynamically allocating memory for heap data structures - even if those data structures are shared - can sometimes cause surprising performance bottlenecks, depending on how the standard library implements these tracking data structures. Some allocators are indeed pretty clever about avoiding these bottlenecks, but I still tend to code in a way that treats malloc and free as expensive.

Sharing

False sharing:

Distinct variables on same cache block
So make processor memory contiguous (if possible)
Example problem: array of ints, one per processor

Even with weak consistency models, there can be hidden costs when there isn’t really any sharing. The most common example is the cost of cache coherence overhead in the case of “false sharing.” This happens when there are distinct variables on the same cache line, so that multiple processors read and write to the same cache line even if they are reading and writing to different variables on that line. The memory controller organizes everything by cache lines, so the hardware can’t really distinguish this type of access pattern from shared reads and writes. For example, we might see this if several processors were all trying to write to a shared array of ints, one per processor; even if each processor reads and writes a subset of the ints that are disjoint from those accessed by any other processor, false sharing might cause the memory controller to start generating lots of cache coherence traffic, thus slowing everything down.

We can usually avoid false sharing effects by having each processor keep its local data in a contiguous chunk of memory, rather than sharing an array in a fine-grained way.

Take-home message

Sequentially consistent shared memory useful
- “Natural” analogue to serial case
- Architects work hard to support it
... but implementation is costly!
- Makes life hard for optimizing compilers
- Coherence traffic slows things down
- Helps to limit sharing

Think about these things to get good performance.

What’s the take-home message from all this? First, if you want to write correct code, you need to have a model for how shared memory behaves. Sequential consistency is maybe the most natural and intuitive model, and computer architects work hard to support it as well as they can. But the cost of maintaining such a strong consistency model is high, both in terms of hardware and in terms of lost opportunities for optimizing compilers. So we usually go with weaker models of consistency where the behavior under data races is undefined. Even in these weaker memory models, if we want performance, it helps to limit sharing - or at least to limit sharing that involves both reads and writes.

Just like in the single-core case, there’s a lot about the caches and the hardware that you don’t have to think about if you only want correctness. But if you want high performance, you have to have at least a rough notion of what is going on behind the scenes.

With that out of the way, we’ll turn in the next deck to the nuts and bolts of shared memory programming with OpenMP. Until then!