Applications of Parallel Computers
OpenMP programming
Prof David Bindel
Please click the play button below.
In this slide deck, we get to talk about OpenMP programming.
There is a lot here to talk about! This is material that I
usually walk through in about two hours of lecture, and even
that seems fast. Everything goes even faster when I narrate.
Feel free to take breaks to look at working code in the middle
of the deck. Or just take a break to take a break! There will
be no new decks posted on Thursday, so this is all you need to
finish digesting this week.
Shared memory programming model
Program consists of threads of control.
Can be created dynamically
Each has private variables (e.g. locals on the stack)
Each has shared variables (e.g. heap)
Communication through shared variables
Coordinate by synchronizing on variables
Examples: pthreads, C11 threads, OpenMP, Cilk, Java threads
A shared memory program has “threads” of control that
communicate via shared memory. Each thread has some
thread-private data and shared variables for communication with
other threads. We also need some way of synchronizing, either
through fine-grain locking on variables or through coarser
synchronization constructs like barriers.
Examples of shared memory programming environments include POSIX
threads, C11 threads, OpenMP, Cilk and related models, and Java
threads. We will focus on programming with OpenMP.
Wait, what’s a thread?
Processes have separate state. Threads share some:
Instruction pointer (per thread)
Register file (per thread)
Call stack (per thread)
Heap memory (shared)
I keep using the word “thread” as though it’s self-evident what
that means. But it shouldn’t be obvious at all if you haven’t
seen it enough before!
Modern operating systems run many processes simultaneously. A
process is a complete program execution context: stack, heap,
open files, program counter, etc. The operating system is
responsible for sharing the hardware (and other resources) across
processes.
Processes are completely independent of each other. In contrast,
threads have some state that is independent, and some state that
is shared. In particular, each thread has its own program
counter, register file, and call stack. But all the threads in a
program share the same heap and the same global segment (the part of
memory where static and global variables live).
Wait, what’s a thread?
Threads for parallelism
Threads for concurrency
There are typically two reasons why people program with
threads. The reason that we are interested in threads is the
opportunity for parallelism. But even on a single core, we might
organize a program around multiple threads
for concurrency. This is particularly useful when there
are many logically independent tasks that may spend a
significant amount of time waiting (for I/O, for example). A
typical example might be a server for a web application, with
each thread representing a different session.
Why do I highlight the difference? In part, because there are
some cases where threads are used for concurrency and are
useless for parallelism. In particular, with the built-in Python
threads, only one thread executes Python code at a time (thanks to the
global interpreter lock); this is great for concurrency, less great
for parallelism. Python does have a
multiprocessing library that actually supports parallelism, but
it’s a different beast. Nor is Python the only language or
library that makes this distinction.
All right, now that you’re warned that not all threads are
useful for getting parallelism, let’s move on assuming that
we’re in the case we care about.
Mechanisms for thread birth/death
Statically allocate threads
Fork/join
Fork detached threads
Cobegin/coend (OpenMP?)
Like fork/join, but lexically scoped
Futures
v = future(somefun(x))
Attempts to use v
wait on evaluation
If we are going to use threads, the first thing we have to
figure out is how to start and stop them. There are some
languages that start off with a fixed-size team of threads, just
as MPI starts with a fixed number of processes
in MPI_COMM_WORLD. But often there is some
mechanism to start and stop threads more dynamically.
One option is to make a call to “fork” a new thread, or start it
executing some given routine. Usually, a call to fork a thread
returns some type of handle that we can use later to refer to
that thread. The main use for this handle is to “join” on the
forked thread; that is, to wait for the forked thread to
complete its execution. By default, we usually assume that we
need to join a forked thread at some point in the future, but in
some systems we also can fork “detached” threads that do not
need to be joined later.
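To make the fork/join pattern concrete, here is a minimal POSIX
threads sketch (not from the demos directory; the worker routine and
the thread count are purely illustrative). Compile and link
with -pthread.

#include <pthread.h>
#include <stdio.h>

/* Illustrative worker: each forked thread just announces itself. */
void* worker(void* arg)
{
    int id = *(int*) arg;
    printf("Hello from forked thread %d\n", id);
    return NULL;
}

int main()
{
    pthread_t handles[4];  /* handles returned by the "fork" calls */
    int ids[4];

    /* Fork: start four threads, each running worker. */
    for (int i = 0; i < 4; ++i) {
        ids[i] = i;
        pthread_create(&handles[i], NULL, worker, &ids[i]);
    }

    /* Join: wait for every forked thread to finish. */
    for (int i = 0; i < 4; ++i)
        pthread_join(handles[i], NULL);

    return 0;
}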
In OpenMP, we do something like forking and joining, but we
don’t do it with a function call. Instead, everything is
lexically scoped: you fork a team of threads at the start of an
OpenMP parallel section, and join all those threads at the end
of the section. This type of construction is sometimes called a
co-begin and co-end.
Another useful mechanism to know about is the so-called
“future.” A future looks a lot like an ordinary function call,
but instead of executing in the same thread and then returning
an output (and control) to the caller, it executes in a new
thread and returns a “promise” object to the caller. When the
promise object is evaluated in some expression or as the
argument to some function, the system joins on the thread that
was running the function call, and then returns the function
call output.
Mechanisms for synchronization
Atomic operations
Locks/mutexes (enforce mutual exclusion)
Condition variables (notification)
Monitors (like locks with lexical scoping)
Barriers
Apart from starting the thread and reading or writing through
memory - which usually doesn’t involve any special syntax or
function calls - the other thing we need for shared memory
programming is synchronization. There are several different ways
to handle this.
At the most fine-grained level, we can use atomic
operations. These are operations that combine a read and a
write; examples include compare-and-swap operations or atomic
increment and decrement operations. Usually, there are a few
atomic operations that get direct hardware support. These are
useful for things like updating a shared counter, and can also
be used as a building block for higher-level synchronization
constructs.
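As a concrete illustration (not an OpenMP construct; the counter and
function names are placeholders), here is a minimal sketch of an
atomic update and a compare-and-swap using C11's stdatomic.h:

#include <stdatomic.h>
#include <stdbool.h>

/* A counter that many threads may update concurrently. */
atomic_int counter = 0;

/* Atomic increment: the read-modify-write happens as one indivisible
   operation, so no updates are lost even without a lock. */
void bump_counter(void)
{
    atomic_fetch_add(&counter, 1);
}

/* Compare-and-swap: install desired only if the counter still holds
   expected; returns true on success. This is the building block for
   many higher-level synchronization constructs. */
bool try_swap(int expected, int desired)
{
    return atomic_compare_exchange_strong(&counter, &expected, desired);
}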
At the next level up is synchronization with locks and condition
variables. If you’re a CS major and have taken an operating
systems class, you’ve probably seen this type of
synchronization. The idea is that only one thread can “hold” a
lock at a time; when the thread releases the lock, another
thread picks it up. Condition variables are used for threads to
signal each other about things like the availability of a
lock. But this is still pretty low-level stuff, and I don’t
intend to do anything with explicit locking protocols in this
class.
To reason about the correctness of threaded codes, we usually
think about invariants that all the threads need for
correctness. For example, for a shared doubly-linked list, we
might have the invariant that the forward pointers and the
backward pointers are supposed to be consistent with each
other. Code that adds or removes items will briefly break the
invariant, and during those periods it is important that nobody
be trying to access the data structure. Those pieces of code are
called critical sections, and a natural way to protect them is
with a construction called a monitor.
Monitors are to locks as cobegin/coend constructs are to
forks. A monitor is a synchronization construct that surrounds a
critical section of code, or a piece of code that only one
thread ought to enter at a time. Internally, monitors are
implemented with locks and condition variables.
Finally, we have barrier-based synchronization. We’ve discussed
barriers before, but just to remind you, the idea is that a
barrier represents a synchronization time in your code: everyone
agrees what happened before the barrier and what happens
after. So, for example, everyone might write into their local
part of a shared data structure before a barrier, and after the
barrier they can read from both their part of the data structure
and the parts belonging to other threads. Barriers are the
central synchronization construct in BSP (bulk synchronous parallel)
programming.
OpenMP: Open spec for MultiProcessing
Standard API for multi-threaded code
Only a spec — multiple implementations
Lightweight syntax
C or Fortran (with appropriate compiler support)
High level:
Preprocessor/compiler directives (80%)
Library calls (19%)
Environment variables (1%)
Basic syntax: #pragma omp construct [clause ...]
Usually affects structured block (one way in/out)
OK to have exit() in such a block
So far, we’ve been pretty abstract. Let’s get down to brass
tacks now, and talk about programming in OpenMP.
OpenMP is a standard programming interface for multi-threaded
code. As with MPI, OpenMP defines an interface, and not an
implementation. Also like MPI, there are multiple versions of
OpenMP for C, C++, and Fortran. Unlike MPI, OpenMP isn’t just a
library: we need explicit compiler support for OpenMP
programming.
Most of OpenMP goes into compiler directives. In C, we tell the
compiler about these directives with the #pragma
construct, which is just a way of telling the compiler about all
sorts of things that lie outside the scope of the usual language
standard. In addition to the compiler directives, there are also
some support library routines. Some of these library routines do
things directly relevant to parallel programming, like managing
explicit locks; others are useful extras, like the wall-clock
timing routines that we’ve already seen in our demos.
Finally, there are a handful of environment variables that
control things about OpenMP, such as the default number of
threads.
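For instance, here is a minimal sketch (not one of the demos) using a
couple of the library routines: a thread-count query and the
wall-clock timer.

#include <stdio.h>
#include <omp.h>

int main()
{
    /* Library call: how many threads would a parallel region get? */
    printf("Max threads: %d\n", omp_get_max_threads());

    /* Library call: wall-clock timing around a (trivial) parallel region. */
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        /* ... real work would go here ... */
    }
    double elapsed = omp_get_wtime() - t0;
    printf("Parallel region took %g s\n", elapsed);
    return 0;
}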
A logistical note
# Intel compiler
icc -c -qopenmp foo.c
icc -qopenmp -o mycode.x foo.o
# GCC
gcc -c -fopenmp foo.c
gcc -fopenmp -o mycode.x foo.o
# LLVM / CLang (Linux)
clang -c -fopenmp foo.c
clang -fopenmp -o mycode.x foo.o
# Apple LLVM / CLang (with libomp via Homebrew)
clang -c -Xpreprocessor -fopenmp foo.c
clang -Xpreprocessor -fopenmp -o mycode.x foo.o -lomp
Ideally, you’ll start jumping back and forth between these slides
and some examples in the demo directory. So now seems like a good
time to tell you how to actually compile and link an OpenMP
program. If you are using the Intel compiler, GCC, or CLang, you
just need to include a command line flag when you compile and
link. This is -fopenmp
in GCC and
CLang, -qopenmp
for the Intel compiler. But if you
are on Mac OS, you are probably by default using Apple’s own
version of CLang and LLVM, and for a long time, that version did
not support OpenMP. Now it does, at least sort of. But you have to
include an extra flag (-Xpreprocessor
) before
the -fopenmp
flag; and you have to explicitly install
the OpenMP support library and link it in by
adding -lomp
to the end of your link line. The
easiest way I know to install the OpenMP support library is via
Homebrew.
Parallel “hello world”
#include <stdio.h>
#include <omp.h>
int main()
{
#pragma omp parallel
printf("Hello world from %d\n",
omp_get_thread_num());
return 0;
}
Let’s make sure that you are able to actually follow
along. The demos/openmp
subdirectory includes this
“Hello world” program (called hello.c
). Using the
Makefile in that directory, or by invoking the compiler directly
with OpenMP enabled, build the hello executable and run it.
The pragma line in the C code starts an OpenMP parallel region. At
the start of the region, we create a team of threads, and each
thread runs the code inside the region in parallel. Unless we say
otherwise, there is an implicit barrier at the end of the parallel
region. We can call omp_get_thread_num
from within a
thread to get the thread index, which runs from zero up
to omp_get_num_threads() minus one. So you should get out
something like “Hello from 0”, “Hello from 1,” etc. You can change
the number of threads used by default by setting
the OMP_NUM_THREADS
environment variable; there are a
couple examples in the Makefile.
Data in parallel regions
Basic model: fork-join / cobegin-coend
Each thread runs same code block
Annotations distinguish shared/private data
Relaxed consistency for shared data
The “hello world” example shows off the basic concept of a
parallel region where every thread runs the code within the
region. Of course, the hello world code is not so interesting in
that it does nothing with memory! OpenMP lets us distinguish
between different types of variables, some of which may be
shared and some of which may be private. For shared data, we are
not guaranteed sequential consistency, so we should be careful
to synchronize. We’ll get to that presently.
We illustrate most of the OpenMP constructs described below in
the demos.c
file in demos/openmp
.
Shared and private
shared(x) (default): one x shared everywhere
private(x): thread gets its own x (indep. of master)
firstprivate(x): each thread gets its own x, initialized by x from before the parallel region
lastprivate(x): after the parallel region, private x set to the value last left by one of the threads (used in loops and parallel sections)
reduction(op:x): does reduction on all thread x on exit of parallel region
In fact, there’s some nuance beyond just saying some data is
shared and some is private. In addition to declaring variables
to be shared (the default for variables outside the region) or
private (the default for variables declared inside the parallel
region), we can also say that the private variables are
initialized by similarly-named data from outside the region, or
that when we exit the region we will copy data out of one of our
private variables. We can also say that we are doing
a reduction operation on a variable, like adding all
the contributions over a parallel loop.
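To give a flavor of the copy-in and copy-out clauses, here is a small
sketch (the variable names are just for illustration, and it uses a
parallel for loop of the kind we cover properly later in the deck):

int scale = 2;   /* initialized before the region */
int last  = -1;

#pragma omp parallel for firstprivate(scale) lastprivate(last)
for (int i = 0; i < 100; ++i) {
    /* Each thread starts with its own scale, copied in with value 2. */
    last = i * scale;
}
/* lastprivate copies out the value from the sequentially last
   iteration (i == 99), so here last == 198; scale is still 2. */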
Let’s stick to the simple annotations for the moment, though.
Parallel regions
double s[MAX_THREADS];
int i;
#pragma omp parallel shared(s) private(i)
{
i = omp_get_thread_num();
s[i] = i;
}
// Implicit barrier here
Here’s a simple code that uses both a shared and a private
variable. This is a fairly silly way to fill in the array s, but
it fits on one slide!
Parallel regions
double s[MAX_THREADS]; // default shared
#pragma omp parallel
{
int i = omp_get_thread_num(); // local, so private
s[i] = i;
}
// Implicit barrier here
We can make that silly array fill even terser if we use the fact
that definitions outside a parallel region default to shared,
while definitions inside are always private. So we don’t really
need any sharing annotations.
Parallel regions
Several ways to control num threads
Default: System chooses (= number of cores?)
Environment: export OMP_NUM_THREADS=4
Function call: omp_set_num_threads(4)
Clause: #pragma omp parallel num_threads(4)
Can also nest parallel regions.
There are several different ways to choose how many threads will
run in an OpenMP region. By default, the number of threads is
usually equal to the number of hardware thread contexts (which
may be greater than the number of cores because of
hyperthreading). One can also set the number of threads via an
environment variable (we show this off in the demo Makefile), or
via a function call, or as a clause in the parallel pragma that
starts the region.
Parallel regions
Parallel regions alone? Maybe Monte Carlo:
double result = 0;
#pragma omp parallel reduction(+:result)
result = run_mc(trials) / omp_get_num_threads();
printf("Final result: %f\n", result);
Anything more interesting needs synchronization.
Of course, there are pretty severe limits to what we can do with
the constructs we’ve described so far. Unless we have
synchronization, we are pretty much limited to embarrassingly
parallel tasks like Monte Carlo computations. Maybe that’s
interesting because it shows how to use the OpenMP reduction
clause? But if anything more interesting requires synchronization,
we’d best talk about synchronization next.
OpenMP synchronization
High-level synchronization:
critical: Critical sections
atomic: Atomic update
barrier: Barrier
ordered: Ordered access (later)
The main synchronization constructs that we will consider are
the four listed here: critical sections, atomic updates,
barriers, and ordered access specifiers. We’ll go through these
in order.
OpenMP synchronization
Low-level synchronization:
flush
Locks (simple and nested)
We will stay high-level.
These aren’t the only synchronization constructs in OpenMP. But
we’ll try to avoid the low-level flush and lock constructs.
Critical sections
Automatically lock/unlock at ends of critical section
Automatically memory flushes for consistency
Locks are still there if you really need them...
A critical section in OpenMP is what I would call a monitor in
some other contexts. The idea is that only one thread can enter
a critical section at a time. On entering the section, we
acquire a lock; on exiting the section, we release the lock. We
also have a memory flush at the end of the section, so that
different threads entering the critical section see a
sequentially consistent picture of the data being protected.
Critical sections
#pragma omp parallel
{
// ...
#pragma omp critical(my_data_cs)
{
// ... modify data structure here ...
}
}
A critical section usually protects just a few lines of code
called inside a parallel region. The name of the critical section
in the OpenMP pragma is associated with a lock that protects the
operations within the critical section.
That’s pretty abstract. Let’s take a concrete example.
Critical sections
void list_push(link_t** l, int data)
{
link_t* link = (link_t*) malloc(sizeof(link_t));
link->data = data;
#pragma omp critical(list_cs)
{
link->next = *l;
*l = link;
}
}
Suppose that we have several threads pushing data onto a linked
list. We don’t really care about the order that the data goes onto
the list, but we do care that no data is lost. In this case, a
critical section is needed to avoid two threads trying to
simultaneously add data to the front of the list. Even under
sequential consistency, without the critical section we could have
both P0 and P1 set their new link next pointers to the same old
head pointer, and then each try to update the head pointer. In
this case, either one or the other of the new entries will get
lost from the list.
As an aside: see how the critical section name does not depend on
what list we are working with? This solution arguably does more
synchronization than needed, as we cannot update even completely
distinct linked lists concurrently with this approach - all the
updates are serialized through the same critical section. We’ll
see an alternate approach momentarily.
Atomic updates
#pragma omp parallel
{
...
double my_piece = foo();
#pragma omp atomic
x += my_piece;
}
Only simple ops: increment/decrement, x += expr, and the like
The OpenMP atomic
pragma statements cause a single,
simple statement to be performed atomically. Probably the most
common atomic operations are updates, like the one shown here: we
are guaranteed that the read, update, and write back represented
by the increment operation will all be done together, with no
weird interleaving or sequentially inconsistent effects.
The default clause for the atomic
pragma is an
update. But we can also have atomic reads, writes, and captures
(though captures only work in OpenMP 3.1 and later).
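Here is a tiny sketch of the read and write forms (the variable x is
just for illustration):

double x = 0.0;  /* shared */

#pragma omp parallel
{
    double snapshot;

    #pragma omp atomic write
    x = 3.14;      /* the whole store happens atomically */

    #pragma omp atomic read
    snapshot = x;  /* we see an old value or a new one, never a torn mix */
}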
Atomic captures
void list_push2(link_t** l, int data)
{
link_t* link = (link_t*) malloc(sizeof(link_t));
link->data = data;
#pragma omp atomic capture
{
link->next = *l;
*l = link;
}
}
An atomic
capture operation consists of reading and
replacing a data item in one step. Here, we see it used to read
and replace the head pointer in our linked list. This is a very
specific pattern - but one that happens often enough that explicit
support is really useful. In fact, this operation can be done with
compare-and-swap operations that have hardware support in almost
all modern multiprocessors. It’s the building block that we use
for lots of other synchronization primitives (like locks) as well.
Unlike the critical section version of this code, the version that
uses the atomic capture operation does not cause updates to one
linked list and updates to another linked list to contend with
each other in any way.
Barriers
#pragma omp parallel
for (int i = 0; i < nsteps; ++i) {
    do_stuff();
    #pragma omp barrier
}
Mucking around with synchronization primitives at the level of an
atomic update is a lot like mucking around with explicit vector
operations. It’s sort of entertaining, and sometimes it makes a
big difference. But it’s also a time sink, and at the end you may
be left questioning your life choices.
Barriers are at the other end of the spectrum. Good old bulk
synchronous programming! The usual way this shows up is as shown
here. If each step only depends on data written in the previous
step, like in the explicit wave equation time-stepping scheme
we’ve seen so much of in the past week or two, then a barrier
between time steps is all the synchronization we need.
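Here is a rough sketch of that pattern (not the actual wave equation
demo; the update rule, sizes, and boundary data are placeholders): two
buffers, a strided update, and a barrier between steps.

#include <stdlib.h>
#include <omp.h>

/* Barrier-based time stepping: each step computes unew from uold,
   everyone synchronizes, then the buffers are swapped. */
void run_steps(int n, int nsteps)
{
    double* uold = calloc(n, sizeof(double));
    double* unew = calloc(n, sizeof(double));
    uold[0] = uold[n-1] = unew[0] = unew[n-1] = 1.0; /* placeholder BCs */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        for (int step = 0; step < nsteps; ++step) {

            /* Each thread updates a strided slice of unew from uold. */
            for (int i = 1 + tid; i < n-1; i += nthreads)
                unew[i] = 0.5 * (uold[i-1] + uold[i+1]);

            /* Nobody may read unew (or clobber uold) until all writes land. */
            #pragma omp barrier

            /* One thread swaps the buffer pointers; the barrier implied at
               the end of single makes the swap visible before the next step. */
            #pragma omp single
            {
                double* tmp = uold; uold = unew; unew = tmp;
            }
        }
    }

    free(uold);
    free(unew);
}

The only synchronization per step is the explicit barrier plus the one
implied at the end of the single construct.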
Work sharing
Work sharing constructs split work across a team
Parallel for: split by loop iterations
sections: non-iterative tasks
single: only one thread executes (synchronized)
master: master executes, others skip (no sync)
All right, perhaps that’s enough synchronization for now. Let’s
turn to the topic of work sharing - OpenMP constructs to split
work across a team of threads. The most important such construct
for our purposes is the parallel for
loop, where
independent loop iterations are assigned to different
processors. OpenMP sections are used to share non-iterative work
items. The single and master constructs are really about work
non-sharing, and are often used to handle things that we only
want to do on one thread (like I/O).
We’ll spend the most time, though, on the parallel for loop.
Parallel iteration
Map independent iterations across threads
#pragma omp parallel for
for (int i = 0; i < N; ++i)
a[i] += b[i];
#pragma omp parallel
{
// Stuff can go here...
#pragma omp for
for (int i = 0; i < N; ++i)
a[i] += b[i];
}
Implicit barrier at end of loop (unless nowait clause)
The parallel for loop is used when iterations of a loop are
independent from each other. It won’t do what you expect when
there are dependencies across loop iterations!
The parallel for loop can appear alone in a parallel region, in
which case we start with the omp parallel for
pragma;
or it can appear along with other work inside a parallel region,
in which case the parallel region is started with omp parallel and
the for loop to be parallelized across the region's thread team is
marked with just omp for.
Unless we explicitly say otherwise, parallel for loops have an
implied barrier at the end.
Parallel iteration
The iteration can also go across a higher-dim index set
#pragma omp parallel for collapse(2)
for (int i = 0; i < N; ++i)
for (int j = 0; j < M; ++j)
a[i*M+j] = foo(i,j);
Sometimes, we are interested in iterating over multi-dimensional
index sets, like the entries of a matrix or the cells in a 2D
mesh. There’s a version of parallel for in this case, too. We just
need to use the collapse clause, specifying the number of nested
loops that specify the index space to be partitioned.
Of course, there have to be some restrictions for a thing like
this to work!
Restrictions
for loop must be in “canonical form”
Loop var is an integer, pointer, random access iterator (C++)
Test compares loop var to loop-invariant expression
Increment/decrement by loop-invariant expression
No code between loop starts in collapse set
Needed to split iteration space (maybe in advance)
The first restriction is that the loop must be in canonical
form. That means that we have a simple integer or pointer,
incremented in a simple way, up to a loop-invariant bound. And
in the case of a collapsed loop, we can’t have any code in the
outer loop except for the inner loop. All these restrictions are
there to make it so that the compiler can figure out the
iteration space in advance in order to partition it across the
different threads.
Restrictions
Iterations should be independent
Compiler may not stop you if you screw this up!
Iterations may be assigned out-of-order
Unless the loop is declared monotonic
There are a couple other restrictions, too. The first, which
we’ve already mentioned, is that the iterations should be
independent. The second is that the iteration order should
totally not matter - iterations may be assigned out of order
even on a single thread, unless we include an
OpenMP ordered
clause to say otherwise (you’ll
recall that this is the one OpenMP synchronization construct we
said we’d talk about and haven’t yet discussed).
Reduction loops
How might we parallelize something like this?
double sum = 0;
for (int i = 0; i < N; ++i)
sum += big_hairy_computation(i);
All right, let’s get concrete. Suppose we have an iteration space
from 0 to N, and we want to sum up the results of big computations
for each point in the iteration space. How would we parallelize
something like this in OpenMP?
Reduction loops
How might we parallelize something like this?
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
sum += big_hairy_computation(i);
Summing a large number of independently computed components is an
example of a reduction operation. We saw this idea in
MPI, as well. Like MPI, OpenMP provides special support for
reduction operations; in this case, we have
a reduction
clause for the OpenMP for loop that tells
us that we should sum up all the contributions to
the sum
variable encountered in the loop. We can also
do reductions involving subtraction, multiplication, and bitwise
or logical and, or, and xor operations.
Ordered
OK, what about something like this?
for (int i = 0; i < N; ++i) {
int result = big_hairy_computation(i);
add_to_queue(q, result);
}
Work is mostly independent, but not wholly.
All right, but what if we want to do something like this? The
heavy computations in each iteration are independent, but we want
the queued results to appear in loop order. We could ensure that
the queue got the result of every loop iteration by properly
synchronizing the queue update function, but it wouldn’t make sure
that the queue received all the results in order. For that, we
need a new construct.
Ordered
Solution: ordered directive in loop with ordered clause
#pragma omp parallel for ordered
for (int i = 0; i < N; ++i) {
int result = big_hairy_computation(i);
#pragma omp ordered
add_to_queue(q, result);
}
Ensures the ordered code executes in loop order.
Maybe unsurprisingly, the construct is
called ordered
. There are two pieces to this: we need
to add the ordered
clause to the parallel for pragma,
and we need to decorate the specific line or code block that is to
be executed in loop order. The stuff not under the ordered pragma
- the big hairy computation, in this case - can be parallelized.
Parallel loop scheduling
static[(chunk)]: decide at start of loop; default chunk is n/nthreads. Lowest overhead, most potential load imbalance.
dynamic[(chunk)]: each thread takes chunk iterations when it has time; default chunk is 1. Higher overhead, but automatically balances load.
guided: take chunks of size unassigned iterations/threads; chunks get smaller toward end of loop. Somewhere between static and dynamic.
auto: up to the system!
In addition to everything else, the parallel for loop has
clauses to specify how the iteration space should be
partitioned. One option is static partitioning of the iteration
space. This is great if the work for every iteration is about
the same, but if different iterations can cost different
amounts, static partitioning can cause load imbalances where one
thread ends up taking on much more work than the others. Dynamic
partitioning, in contrast, means that work is assigned
dynamically: each thread grabs a few more iterations when it is
ready for more. Dynamic partitioning requires that the compiler
maintain some synchronized data structure for sharing out the
work, and that leads to some overhead. But this strategy is
great for load balance.
Guided (or tapered) strategies sit somewhere between static and
dynamic strategies. The system starts by assigning big blocks of
work, but the pieces of the iteration space that we assign get
smaller and smaller as the loop gets ever closer to
completion. This is often a good strategy to get both reasonable
load balance and not spend too much time scheduling iterations.
Finally, the auto strategy leaves the choice entirely up to the
implementation (as does the default behavior when you give no
schedule clause at all).
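In code, the strategy goes into a schedule clause on the loop pragma.
A quick sketch (the arrays, the bound N, and the work routines are
placeholders):

/* Static, fixed chunks of 8 iterations: cheap when work is uniform. */
#pragma omp parallel for schedule(static, 8)
for (int i = 0; i < N; ++i)
    a[i] = cheap_uniform_work(i);

/* Dynamic, 4 iterations at a time: threads grab more as they free up. */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; ++i)
    b[i] = expensive_irregular_work(i);

/* Guided: big chunks early, tapering off toward the end of the loop. */
#pragma omp parallel for schedule(guided)
for (int i = 0; i < N; ++i)
    c[i] = mildly_irregular_work(i);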
SIMD loops
As of OpenMP 4.0:
#pragma omp simd reduction(+:sum) aligned(a:64)
for (int i = 0; i < N; ++i) {
a[i] = b[i] * c[i];
sum = sum + a[i];
}
As we mentioned in earlier meetings, OpenMP 4.0 supports not only
thread-level parallelism, but also instruction-level parallelism
in the form of loop vectorization. Here’s an example of an OpenMP
loop vectorization.
SIMD loops
Can also declare vectorized functions:
#pragma omp declare simd
float myfunc(float a, float b, float c)
{
return a*b + c;
}
We can also declare SIMD-friendly versions of functions to be
called from OpenMP SIMD loops. And there are a number of other
things we can do with the OpenMP SIMD support, too.
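Putting the two slides together, the declared function might be
called from a SIMD loop like this (assuming arrays a, b, c, and d of
length N):

#pragma omp simd
for (int i = 0; i < N; ++i)
    d[i] = myfunc(a[i], b[i], c[i]);  /* compiler may use the vector version */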
Other parallel work divisions
sections: like cobegin/coend
single: do only in one thread (e.g. I/O)
master: do only in master thread; others skip
All right. Let’s turn away from loops, and talk about the other
three work division constructs that I promised to talk
about. Well, really I will talk about sections. I’ll leave you to
look up single and master; there isn't much more to either of them
than the one-sentence explanation in the slide text.
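That said, here is a tiny sketch of the usual pattern for both (the
I/O and work routines are placeholders):

#pragma omp parallel
{
    /* single: exactly one (arbitrary) thread does the setup; the implied
       barrier at the end means everyone waits before moving on. */
    #pragma omp single
    read_input_file("input.dat");

    /* master: only thread 0 runs this; the others skip past it
       immediately, with no implied barrier. */
    #pragma omp master
    printf("Running with %d threads\n", omp_get_num_threads());

    do_parallel_work();
}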
Sections
#pragma omp parallel
{
#pragma omp sections nowait
{
#pragma omp section
do_something();
#pragma omp section
and_something_else();
#pragma omp section
and_this_too();
// No implicit barrier here
}
// Implicit barrier here
}
sections nowait to kill barrier.
The sections
command is used to deal with static,
non-iterative partitioning of work. If we have three or four
different things to do and they can all be done independently, we
can put those things into parallel sections as above.
As with the parallel for loop, there is an implied barrier at the
end of a sections construct unless we get rid of it with
a nowait
clause.
Task-based parallelism
Work-sharing so far is rather limited
Work cannot be produced/consumed dynamically
Fine for data parallel array processing...
... but what about tree walks and such?
Alternate approach (OpenMP 3.0+): Tasks
The work-sharing constructs we’ve described so far all work with
fairly static chunks of independent work. The size of the
iteration space for the parallel for construct can’t change in
the middle of the loop, nor are we going to dynamically change
the set of sections in a sections construct. If we want to do
something more dynamic, like what happens during a parallel tree
walk or linked list traversal, we need another mechanism. This
is the reason that OpenMP 3.0 added task parallelism.
Tasks
Task involves:
Task construct: task directive plus structured block
Task: Task construct + data
Tasks are handled by the runtime; they complete at barriers or at taskwait.
A task is essentially a piece of work that a thread can pick up
and run, plus the data required to do that work. In a
task-parallel approach, we break our work into tasks that are
added on-the-fly to a task queue, and the threads pull tasks off
the queue and execute them until the queue empties out. We
require that “emptying out” at barriers or
at taskwait
constructs.
Let’s make this more explicit with an example.
Example: List traversal
#pragma omp parallel
{
#pragma omp single nowait
{
for (link_t* link = head; link; link = link->next)
#pragma omp task firstprivate(link)
process(link);
}
// Implicit barrier
}
One thread generates tasks, others execute them.
Consider a linked list, where we want to do some processing on
each item in the list. We can’t traverse the list with a parallel
for loop; it doesn’t fit into the restrictions. Instead, we’ll
start with a single thread that goes through and creates a task
for processing each link. This omp single
section is
effectively a producer in a producer-consumer system with many
possible consumers. Depending on how quickly the tasks are
handled, the thread responsible for producing all the tasks may
then turn to doing some of the processing afterward, along with
the processing being done by all the other threads. There is
a nowait
clause on the single section that produces
the tasks, but the tasks must all complete in order to pass the
implicit barrier at the end of the parallel region.
Example: Tree traversal
int tree_max(node_t* n)
{
int lmax, rmax;
if (n->is_leaf)
return n->value;
#pragma omp task shared(lmax)
lmax = tree_max(n->l);
#pragma omp task shared(rmax)
rmax = tree_max(n->r);
#pragma omp taskwait
return max(lmax, rmax);
}
The taskwait
waits for all child tasks.
Here is another example with a little more structure. A binary
tree walker creates a pair of tasks at each non-leaf node: one to
process the left subtree, another to process the right subtree. We need
the data from the children to compute the return value for the
parent, so we have a taskwait directive after launching the left and
right child tasks; we can only pass the taskwait once both the child
tasks are
done. Note that we are only waiting for child tasks - we don’t
have to wait for every task to clear the system!
Task dependencies
What if one task produces what another needs?
#pragma omp task depend(out:x)
x = foo();
#pragma omp task depend(in:x)
y = bar(x);
There are more complicated things we can do, too. For example, we
might give the task scheduler constraints, like telling it that
one task generates an output (x) that another task needs as
input. In this case, the first task had better finish before the
second, though any number of other tasks might be scheduled on top
of or in between these two tasks.
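For context, here is a slightly fuller sketch of how those two tasks
might sit inside a region (foo and bar are the placeholders from the
slide):

double x, y;
#pragma omp parallel
#pragma omp single
{
    /* Producer task: writes x. */
    #pragma omp task depend(out:x) shared(x)
    x = foo();

    /* Consumer task: cannot start until the task writing x finishes. */
    #pragma omp task depend(in:x) shared(x, y)
    y = bar(x);

    /* Wait for both tasks before using y after this block. */
    #pragma omp taskwait
}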
Topics not addressed
Low-level synchronization (locks, flush)
OpenMP 4.x constructs for accelerator interaction
A variety of more specialized clauses
See http://www.openmp.org/
And the list goes on. You can do a lot with what we’ve covered
in this deck, but of course there are a lot of specialized
clauses we haven’t talked about. We haven’t discussed some of
the low-level synchronization primitives, such as locks and
memory flushes; and we haven’t talked about using OpenMP to
program for GPUs and other accelerators, though we might get to
that presently.
In any case, you can see the OpenMP documentation if you are
impatient for a thorough run-down. But perhaps a better approach
is to dive in and start coding.
A final reminder
Parallelism is not performance!
Finally, I remind you again: OpenMP is not magic pixie dust! If
you sprinkle parallel for directives atop all the loops in your
serial code, you are likely to get broken code in the worst case
(because of dependencies you didn’t consider) and slow code in
the not-as-bad case. You have to amortize your synchronization
costs if you want any reasonable sort of speedup! The easiest
way to achieve this is often to program as if you were writing
in a message-passing environment, doing big chunks of work
between accesses to global memory and synchronization events.
Alas, previous experience teaching this class suggests that many
of you won’t really get what I’m saying about synchronization
overheads until you are faced with an example or two. Don’t
worry, we’ll get those examples. It’ll be fun.
Until next time!