CS 5220

Shared memory

David Bindel

2024-10-01

Logistics

  • HW1 and P1 due tonight
    • Spend time on your writeup!
  • P2 posted today
    • Lesson from P1: start early!

Shared memory

Message passing pain

Common message passing pattern

  • Logical global structure
  • Local representation per processor
  • Local data may have redundancy
    • Example: Data in ghost cells
    • Example: Replicated book-keeping data

Message passing pain

Big pain point:

  • Thinking about many partly-overlapping representations
  • Maintaining consistent picture across processes

Wouldn’t it be nice to have just one representation?

Shared memory vs message passing

  • Implicit communication via memory vs explicit messages
  • Still need separate global vs local picture?

Shared memory vs message passing

Still need separate global vs local picture?

  • No: One thread-safe data structure may be easier
  • Yes: More sharing can hurt performance
    • Synchronization costs cycles even with no contention
    • Contention for locks reduces parallelism
    • Cache coherency can slow even non-contending access

Shared memory vs message passing

Still need separate global vs local picture?

  • “Easy” approach: add multi-threading to serial code
  • Better performance: design like a message-passing code

Let’s dig a little deeper on the HW side

Memory model

  • Single processor: return last write
    • What about DMA and memory-mapped I/O?
  • Simplest generalization: sequential consistency

Sequential consistency

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
– Lamport, 1979

Sequential consistency

Program behaves as if:

  • Each process runs in program order
  • Instructions from different processes are interleaved
  • Interleaved instructions ran on one processor

Example: Spin lock

Initially, flag = 0 and sum = 0

Processor 1:

sum += p1;
flag = 1;

Processor 2:

while (!flag);
sum += p2;

Example: Spin lock

Without sequential consistency support, what if

  1. Processor 2 caches flag?
  2. Compiler optimizes away loop?
  3. Compiler reorders assignments on P1?

Starts to look restrictive!
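
To make (2) concrete: nothing tells the compiler that flag can change behind its back, so it may legally hoist the load out of the loop. A hypothetical (but standard) transformation of P2's code:

int tmp = flag;  /* read flag once... */
while (!tmp);    /* ...then spin forever if it was 0 */
sum += p2;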

Sequential consistency

Program behavior is “intuitive”:

  • Nobody sees garbage values
  • Time always moves forward

One issue is cache coherence:

  • Coherence: different copies, same value
  • Requires (nontrivial) hardware support

Also an issue for optimizing compiler!

There are cheaper relaxed consistency models.

Snoopy bus protocol

Basic idea:

  • Broadcast operations on memory bus
  • Cache controllers “snoop” on all bus transactions
    • Memory writes induce serial order
    • Act to enforce coherence (invalidate, update, etc)

Snoopy bus protocol

Problems:

  • Bus bandwidth limits scaling
  • Contending writes are slow

There are other protocol options (e.g. directory-based),
but most systems give up on full sequential consistency.

Weakening sequential consistency

Try to pay only the true cost of sharing:

  • volatile tells compiler when to worry about sharing
  • Atomic operations do reads/writes as a single op
  • Memory fences tell when to force consistency
  • Synchronization primitives (lock/unlock) include fences

Unprotected data races give undefined behavior.
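
As a minimal sketch (assuming C11 <stdatomic.h>), the earlier spin-lock example can be repaired with an atomic flag and explicit release/acquire ordering, which is weaker than full sequential consistency but enough here:

#include <stdatomic.h>

atomic_int flag = 0;
double sum = 0;

/* Processor 1: everything before the release store... */
sum += p1;
atomic_store_explicit(&flag, 1, memory_order_release);

/* Processor 2: ...is visible once the acquire load sees 1 */
while (!atomic_load_explicit(&flag, memory_order_acquire))
    ;
sum += p2;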

Sharing

True sharing:

  • Frequent writes cause a bottleneck.
  • Idea: make independent copies (if possible).
  • Example problem: malloc/free data structure.
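
A minimal sketch of the independent-copies idea (a hypothetical shared counter, simpler than the malloc case): accumulate privately, write back once per thread.

long partial[NTHREADS];              /* one slot per thread (NTHREADS assumed) */

void worker(int t)                   /* thread t of NTHREADS */
{
    long local = 0;                  /* private: no coherence traffic */
    for (int i = t; i < N; i += NTHREADS)
        local += work_item(i);       /* hypothetical per-item work */
    partial[t] = local;              /* one shared write per thread */
}
/* One thread sums partial[0..NTHREADS-1] after all finish. */

Note that partial itself packs the threads' slots onto shared cache lines, which is exactly the false-sharing problem on the next slide.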

Sharing

False sharing:

  • Distinct variables on same cache block
  • Idea: make processor memory contiguous (if possible)
  • Example problem: array of ints, one per processor
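
One common fix, sketched here assuming 64-byte cache lines: pad each thread's slot to a full line so no two counters share a block.

/* Each counter gets its own (assumed 64-byte) cache line. */
struct padded_count {
    long count;
    char pad[64 - sizeof(long)];
};
struct padded_count counts[NTHREADS];  /* one per thread */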

Take-home message

  • Sequentially consistent shared memory is a useful idea...
    • “Natural” analogue to serial case
    • Architects work hard to support it
  • ... but implementation is costly!
    • Makes life hard for optimizing compilers
    • Coherence traffic slows things down
    • Helps to limit sharing

Have to think about these things to get good performance.

Programming model

Shared memory programming model

Program consists of threads of control.

  • Can be created dynamically
  • Each has private variables (e.g. local)
  • Each has shared variables (e.g. heap)
  • Communication through shared variables
  • Coordinate by synchronizing on variables
  • Examples: pthreads, C11 threads, OpenMP, Cilk, Java threads

Wait, what’s a thread?

Processes have separate state. Threads share some:

  • Instruction pointer (per thread)
  • Register file (per thread)
  • Call stack (per thread)
  • Heap memory (shared)

Wait, what’s a thread?

  • Threads for parallelism
  • Threads for concurrency

Mechanisms for thread birth/death

  • Statically allocate threads
  • Fork/join
  • Fork detached threads
  • Cobegin/coend (OpenMP?)
    • Like fork/join, but lexically scoped
  • Futures
    • v = future(somefun(x))
    • Attempts to use v wait on evaluation

Mechanisms for synchronization

  • Atomic operations
  • Locks/mutexes (enforce mutual exclusion)
  • Condition variables (notification)
  • Monitors (like locks with lexical scoping)
  • Barriers

OpenMP: Open spec for MultiProcessing

  • Standard API for multi-threaded code
    • Only a spec — multiple implementations
    • Lightweight syntax
    • C or Fortran (with appropriate compiler support)
  • High level:
    • Preprocessor/compiler directives (80%)
    • Library calls (19%)
    • Environment variables (1%)
  • Basic syntax: #pragma omp construct [clause ...]
    • Usually affects structured block (one way in/out)
    • OK to have exit() in such a block

A logistical note

# Intel compiler
icc -c -qopenmp foo.c
icc -qopenmp -o mycode.x foo.o

# GCC
gcc -c -fopenmp foo.c
gcc -fopenmp -o mycode.x foo.o

# LLVM / Clang (Linux)
clang -c -fopenmp foo.c
clang -fopenmp -o mycode.x foo.o

# Apple LLVM / Clang (with libomp via Homebrew)
clang -c -Xpreprocessor -fopenmp foo.c
clang -Xpreprocessor -fopenmp -o mycode.x foo.o -lomp

Parallel “hello world”

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    printf("Hello world from %d\n", 
           omp_get_thread_num());

    return 0;
}

Data in parallel regions

  • Basic model: fork-join / cobegin-coend
  • Each thread runs same code block
  • Annotations distinguish shared/private data
  • Relaxed consistency for shared data

Shared and private

Annotations distinguish between different types of sharing:

  • shared(x) (default): One x shared everywhere
  • private(x): Thread gets own x (indep. of master)
  • firstprivate(x): Each thread gets its own x, initialized by x from before parallel region
  • lastprivate(x): After the parallel region, private x set to the value last left by one of the threads (used in loops and parallel sections)
  • reduction(op:x): Does reduction on all thread x on exit of parallel region
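
A small sketch exercising several of these clauses at once (the values are illustrative):

int n = 10;
int last;
double total = 0;
#pragma omp parallel for firstprivate(n) lastprivate(last) reduction(+:total)
for (int i = 0; i < 100; ++i) {
    last = i + n;    /* survives from the last iteration (i == 99) */
    total += i;      /* per-thread copies summed on exit */
}
/* Here last == 109 and total == 4950. */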

Parallel regions

double s[MAX_THREADS];
int i;
#pragma omp parallel shared(s) private(i)
{
  i = omp_get_thread_num();
  s[i] = i;
}
// Implicit barrier here

Parallel regions

double s[MAX_THREADS];  // default shared
#pragma omp parallel
{
  int i = omp_get_thread_num();  // local, so private
  s[i] = i;
}
// Implicit barrier here

Parallel regions

Several ways to control num threads

  • Default: System chooses (= number cores?)
  • Environment: export OMP_NUM_THREADS=4
  • Function call: omp_set_num_threads(4)
  • Clause: #pragma omp parallel num_threads(4)

Can also nest parallel regions.
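
For example (a small sketch; the clause takes precedence over the earlier library call):

omp_set_num_threads(4);              /* request 4 threads by default */
#pragma omp parallel num_threads(2)  /* clause overrides: team of 2 */
{
    #pragma omp single
    printf("team size: %d\n", omp_get_num_threads());  /* prints 2 */
}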

Parallel regions

What to do with parallel regions alone? Maybe Monte Carlo:

double result = 0;
#pragma omp parallel reduction(+:result)
  result = run_mc(trials) / omp_get_num_threads();
printf("Final result: %f\n", result);

Anything more interesting needs synchronization.

OpenMP synchronization

High-level synchronization:

  • critical: Critical sections
  • atomic: Atomic update
  • barrier: Barrier
  • ordered: Ordered access (later)

OpenMP synchronization

Low-level synchronization:

  • flush
  • Locks (simple and nested)

We will stay high-level.

Critical sections

  • Automatically lock/unlock at ends of critical section
  • Automatic memory flushes for consistency
  • Locks are still there if you really need them...

Critical sections

#pragma omp parallel
{
    //...
    #pragma omp critical(my_data_cs)
    {
        // ... modify data structure here ...
    }
}

Critical sections

void list_push(link_t** l, int data)
{
    link_t* link = (link_t*) malloc(sizeof(link_t));
    link->data = data;
    #pragma omp critical(list_cs)
    {
        link->next = *l;
        *l = link;
    }
}

Atomic updates

#pragma omp parallel
{
    // ...
    double my_piece = foo();
    #pragma omp atomic
    x += my_piece;
}

Only simple ops: increment/decrement, or updates like x += expr

Atomic captures

void list_push2(link_t** l, int data)
{
    link_t* link = (link_t*) malloc(sizeof(link_t));
    link->data = data;
    #pragma omp atomic capture
    {
        link->next = *l;
        *l = link;
    }
}

Barriers

#pragma omp parallel
for (i = 0; i < nsteps; ++i) {
    do_stuff();
    #pragma omp barrier
}

Work sharing

Work sharing constructs split work across a team

  • Parallel for: split by loop iterations
  • sections: non-iterative tasks
  • single: only one thread executes (synchronized)
  • master: master executes, others skip (no sync)

Parallel iteration

Idea: Map independent iterations onto different threads

#pragma omp parallel for
for (int i = 0; i < N; ++i)
    a[i] += b[i];

#pragma omp parallel
{
    // Stuff can go here...
    #pragma omp for
    for (int i = 0; i < N; ++i)
        a[i] += b[i];
}

Implicit barrier at end of loop (unless nowait clause)
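
For instance, nowait lets threads move straight into a second, independent loop (f and g are hypothetical element-wise updates):

#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < N; ++i)
        a[i] = f(a[i]);        /* no barrier after this loop */

    #pragma omp for
    for (int j = 0; j < M; ++j)
        b[j] = g(b[j]);        /* OK: touches b, never a */
}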

Parallel iteration

The iteration can also go across a higher-dim index set

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        a[i*M+j] = foo(i,j);

Restrictions

for loop must be in “canonical form”

  • Loop var is an integer, pointer, or random-access iterator (C++)
  • Test compares loop var to loop-invariant expression
  • Increment or decrement by a loop-invariant expression
  • No code between loop starts in collapse set
  • Needed to split iteration space (maybe in advance)

Restrictions

  • Iterations should be independent
    • Compiler may not stop you if you screw this up!
  • Iterations may be assigned out-of-order on one thread!
    • Unless the schedule is declared monotonic

Reduction loops

How might we parallelize something like this?

double sum = 0;
for (int i = 0; i < N; ++i)
    sum += big_hairy_computation(i);

Reduction loops

How might we parallelize something like this?

double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
    sum += big_hairy_computation(i);

Ordered

OK, what about something like this?

for (int i = 0; i < N; ++i) {
    int result = big_hairy_computation(i);
    add_to_queue(q, result);
}

Work is mostly independent, but not wholly.

Ordered

Solution: ordered directive in loop with ordered clause

#pragma omp parallel for ordered
for (int i = 0; i < N; ++i) {
    int result = big_hairy_computation(i);
    #pragma omp ordered
    add_to_queue(q, result);
}

Ensures the ordered code executes in loop order.

Parallel loop scheduling

Partition index space different ways:

  • static[(chunk)]: decide at start; default chunk is n/nthreads. Lowest overhead, most potential imbalance.
  • dynamic[(chunk)]: each takes chunk (default 1) iterations when it has time. Higher overhead, auto balances.
  • guided: take chunks of size unassigned iterations/threads; get smaller toward end of loop. Between static and dynamic.
  • auto: up to the system!

Default behavior is implementation-dependent.
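
For example, when per-iteration cost varies wildly (as with the earlier big_hairy_computation), dynamic scheduling with small chunks balances the load; a sketch:

#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; ++i)
    results[i] = big_hairy_computation(i);  /* uneven cost per i */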

SIMD loops

As of OpenMP 4.0:

#pragma omp simd reduction(+:sum) aligned(a:64)
for (int i = 0; i < N; ++i) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}

SIMD loops

Can also declare vectorized functions:

#pragma omp declare simd
float myfunc(float a, float b, float c)
{
    return a*b + c;
}
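
The declared function can then be called from a vectorized loop; a minimal sketch (arrays a, b, c, d assumed):

#pragma omp simd
for (int i = 0; i < N; ++i)
    d[i] = myfunc(a[i], b[i], c[i]);  /* uses the SIMD variant */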

Other parallel work divisions

  • sections: like cobegin/coend
  • single: do only in one thread (e.g. I/O)
  • master: do only in master thread; others skip

Sections

#pragma omp parallel
{
    #pragma omp sections nowait
    {
        #pragma omp section
        do_something();

        #pragma omp section
        and_something_else();

        #pragma omp section
        and_this_too();
        // No implicit barrier here
    }
    // Implicit barrier here
}

Add nowait to sections to remove the barrier at the end of the construct (the parallel region's barrier remains).

Task-based parallelism

  • Work-sharing so far is rather limited
    • Work cannot be produced/consumed dynamically
    • Fine for data parallel array processing...
    • ... but what about tree walks and such?
  • Alternate approach (OpenMP 3.0+): Tasks

Tasks

Task involves:

  • Task construct: task directive plus structured block
  • Task: Task construct + data

Tasks are handled by the runtime and complete at barriers or at taskwait.

Example: List traversal

#pragma omp parallel
{
    #pragma omp single nowait
    {
        for (link_t* link = head; link; link = link->next)
            #pragma omp task firstprivate(link)
            process(link);
    }
    // Implicit barrier
}

One thread generates tasks, others execute them.

Example: Tree traversal

int tree_max(node_t* n)
{
    int lmax, rmax;
    if (n->is_leaf)
        return n->value;
    
    #pragma omp task shared(lmax)
        lmax = tree_max(n->l);
    #pragma omp task shared(rmax)
        rmax = tree_max(n->r);
    #pragma omp taskwait

    return max(lmax, rmax);
}

The taskwait waits for all child tasks.

Task dependencies

What happens if one task produces what another needs?

#pragma omp task depend(out:x)
x = foo();
#pragma omp task depend(in:x)
y = bar(x);
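
In context, the tasks live inside a parallel region; a hedged sketch with one thread generating both (foo and bar as above):

#pragma omp parallel
#pragma omp single
{
    double x, y;
    #pragma omp task depend(out:x) shared(x)
    x = foo();
    #pragma omp task depend(in:x) shared(x, y)
    y = bar(x);          /* runs only after the first task completes */
    #pragma omp taskwait
    printf("%g\n", y);
}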

Topics not addressed

  • Low-level synchronization (locks, flush)
  • OpenMP 4.x constructs for accelerator interaction
  • A variety of more specialized clauses

See http://www.openmp.org/

A final reminder

Parallelism is not performance!