Thinking about many partly-overlapping representations
Maintaining consistent picture across processes
Wouldn’t it be nice to have just one representation?
Shared memory vs message passing
Implicit communication via memory vs explicit messages
Still need separate global vs local picture?
Shared memory vs message passing
Still need separate global vs local picture?
No: One thread-safe data structure may be easier
Yes: More sharing can hurt performance
Synchronization costs cycles even with no contention
Contention for locks reduces parallelism
Cache coherency can slow even non-contending access
Shared memory vs message passing
Still need separate global vs local picture?
“Easy” approach: add multi-threading to serial code
Better performance: design like a message-passing code
Let’s dig a little deeper on the HW side
Memory model
Single processor: return last write
What about DMA and memory-mapped I/O?
Simplest generalization: sequential consistency
Sequential consistency
A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
– Lamport, 1979
Sequential consistency
Program behaves as if:
Each process runs in program order
Instructions from different processes are interleaved
Interleaved instructions ran on one processor
Example: Spin lock
Initially, flag = 0 and sum = 0
Processor 1:
sum += p1; flag = 1;
Processor 2:
while (!flag); sum += p2;
Example: Spin lock
Without sequential consistency support, what if
Processor 2 caches flag?
Compiler optimizes away loop?
Compiler reorders assignments on P1?
Starts to look restrictive!
Sequential consistency
Program behavior is “intuitive”:
Nobody sees garbage values
Time always moves forward
One issue is cache coherence:
Coherence: different copies, same value
Requires (nontrivial) hardware support
Also an issue for optimizing compiler!
There are cheaper relaxed consistency models.
Snoopy bus protocol
Basic idea:
Broadcast operations on memory bus
Cache controllers “snoop” on all bus transactions
Memory writes induce serial order
Act to enforce coherence (invalidate, update, etc.)
Snoopy bus protocol
Problems:
Bus bandwidth limits scaling
Contending writes are slow
There are other protocol options (e.g. directory-based).
But usually give up on full sequential consistency.
Weakening sequential consistency
Try to reduce overhead to the true cost of sharing
volatile tells compiler when to worry about sharing
Atomic operations do reads/writes as a single op
Memory fences tell when to force consistency
Synchronization primitives (lock/unlock) include fences
Unprotected data races give undefined behavior.
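As a concrete illustration (not from the slides), here is a minimal sketch of the earlier spin-lock example written with C11 atomics so the release/acquire ordering on flag is explicit; the two thread functions and the p1/p2 arguments are assumptions for illustration.

#include <stdatomic.h>

double sum = 0;            /* handed off from P1 to P2 via the flag */
atomic_int flag = 0;

void processor1(double p1)  /* assumed thread body for P1 */
{
    sum += p1;
    /* Release store: the update to sum is visible before flag becomes 1 */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void processor2(double p2)  /* assumed thread body for P2 */
{
    /* Acquire load: the read of sum cannot be hoisted above the loop */
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;  /* spin */
    sum += p2;
}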
Sharing
True sharing:
Frequent writes cause a bottleneck.
Idea: make independent copies (if possible).
Example problem: malloc/free data structure.
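A hedged sketch of the "independent copies" idea, using a shared histogram as a stand-in for the malloc/free structure on the slide (the function and its arguments are illustrative):

#include <stdlib.h>

void histogram(const int* keys, int n, int nbins, int* bins)
{
    #pragma omp parallel
    {
        /* Independent copy per thread: no contended writes in the hot loop */
        int* local = calloc(nbins, sizeof(int));
        #pragma omp for
        for (int i = 0; i < n; ++i)
            local[keys[i]]++;
        /* Merge the copies once at the end */
        #pragma omp critical
        for (int b = 0; b < nbins; ++b)
            bins[b] += local[b];
        free(local);
    }
}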
Sharing
False sharing:
Distinct variables on same cache block
Idea: make processor memory contiguous (if possible)
Example problem: array of ints, one per processor
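A minimal sketch of that example and the padding fix; the 64-byte line size, thread count, and counting workload are assumptions:

#include <omp.h>

#define NTHREADS 8
#define LINE 64                        /* assumed cache line size in bytes */

int counts_bad[NTHREADS];              /* adjacent ints share cache lines */

struct padded { int c; char pad[LINE - sizeof(int)]; };
struct padded counts_ok[NTHREADS];     /* one counter per cache line */

void count_events(int n)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        for (int i = 0; i < n; ++i) {
            counts_bad[me]++;          /* false sharing: lines ping-pong */
            counts_ok[me].c++;         /* each thread stays on its own line */
        }
    }
}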
Take-home message
Sequentially consistent shared memory is a useful idea...
“Natural” analogue to serial case
Architects work hard to support it
... but implementation is costly!
Makes life hard for optimizing compilers
Coherence traffic slows things down
Helps to limit sharing
Have to think about these things to get good performance.
#pragma omp parallel
for (int i = 0; i < nsteps; ++i) {   /* every thread runs all steps */
    do_stuff();
    #pragma omp barrier              /* sync the team between steps */
}
Work sharing
Work sharing constructs split work across a team
Parallel for: split by loop iterations
sections: non-iterative tasks
single: only one thread executes (synchronized)
master: master executes, others skip (no sync)
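A small sketch of these constructs in use; the task functions are hypothetical placeholders.

#include <stdio.h>

void setup_boundary(void);            /* hypothetical tasks */
void setup_interior(void);
void report_progress(void);

void run_tasks(void)
{
    #pragma omp parallel
    {
        #pragma omp sections          /* non-iterative tasks split across threads */
        {
            #pragma omp section
            setup_boundary();
            #pragma omp section
            setup_interior();
        }                             /* implicit barrier after sections */

        #pragma omp single            /* exactly one thread runs this; others wait */
        report_progress();

        #pragma omp master            /* master thread only; no barrier */
        printf("checkpoint reached\n");
    }
}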
Parallel iteration
Idea: Map independent iterations onto different threads
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    a[i] += b[i];

#pragma omp parallel
{
    // Stuff can go here...
    #pragma omp for
    for (int i = 0; i < N; ++i)
        a[i] += b[i];
}
Implicit barrier at end of loop (unless nowait clause)
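For instance (illustrative; the second loop is assumed not to read a, and c is an assumed array), nowait drops that barrier:

#pragma omp parallel
{
    #pragma omp for nowait            /* no barrier after this loop */
    for (int i = 0; i < N; ++i)
        a[i] += b[i];

    #pragma omp for                   /* threads may arrive here early; safe only
                                         because this loop does not touch a[] */
    for (int i = 0; i < N; ++i)
        c[i] = 2*b[i];
}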
Parallel iteration
The iteration can also go across a higher-dim index set
#pragma omp parallel for collapse(2)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        a[i*M+j] = foo(i,j);
Restrictions
for loop must be in “canonical form”
Loop var is an integer, pointer, random access iterator (C++)
Test compares loop var to loop-invariant expression
Increment or decrement by a loop-invariant expression
No code between loop starts in collapse set
Needed to split iteration space (maybe in advance)
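For example (illustrative; keep_going is a hypothetical predicate), the first loop below is in canonical form and the second is not:

/* Canonical: integer loop var, loop-invariant bound, fixed increment --
   the iteration space 0..n-1 can be split before the loop runs. */
#pragma omp parallel for
for (int i = 0; i < n; i += 2)
    a[i] = 2*a[i];

/* Not canonical: the exit test depends on values computed inside the
   loop, so the iteration count is unknown in advance. */
for (int i = 0; keep_going(a, i); ++i)
    a[i] = 2*a[i];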
Restrictions
Iterations should be independent
Compiler may not stop you if you screw this up!
Iterations may be assigned out-of-order on one thread!
Unless the loop is declared monotonic
Reduction loops
How might we parallelize something like this?
double sum = 0;
for (int i = 0; i < N; ++i)
    sum += big_hairy_computation(i);
Reduction loops
How might we parallelize something like this?
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
    sum += big_hairy_computation(i);
Ordered
OK, what about something like this?
for (int i = 0; i < N; ++i) {
    int result = big_hairy_computation(i);
    add_to_queue(q, result);
}
Work is mostly independent, but not wholly.
Ordered
Solution: ordered directive in loop with ordered clause
#pragma omp parallel for ordered
for (int i = 0; i < N; ++i) {
    int result = big_hairy_computation(i);
    #pragma omp ordered
    add_to_queue(q, result);
}
Ensures the ordered code executes in loop order.
Parallel loop scheduling
Partition index space different ways:
static[(chunk)]: decide at start; default chunk is n/nthreads. Lowest overhead, most potential load imbalance.
dynamic[(chunk)]: each thread takes chunk (default 1) iterations when it runs out of work. Higher overhead, but balances load automatically.
guided: chunk sizes start at (unassigned iterations)/nthreads and shrink toward the end of the loop. Between static and dynamic.
auto: up to the system!
Default behavior is implementation-dependent.
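For example (the chunk size is an arbitrary illustration, and c is an assumed output array):

/* Uniform, cheap iterations: static partitioning keeps overhead minimal */
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; ++i)
    a[i] += b[i];

/* Wildly varying iteration costs: dynamic chunks of 4 balance the load
   at the price of extra scheduling overhead */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; ++i)
    c[i] = big_hairy_computation(i);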
SIMD loops
As of OpenMP 4.0:
#pragma omp simd reduction(+:sum) aligned(a:64)
for (int i = 0; i < N; ++i) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}