Applications of Parallel Computers
Beyond MPI and OpenMP
Prof David Bindel
Please click the play button below.
A programmatic note
Lots of programming systems!
Playing with one can be a good project
Key: model, measure, tune performance!
This slide deck is not exactly what I planned, or when I planned it, at the start of the semester – in two ways. First, I originally planned to talk about transform methods in this slot; I only thought better of it as I started thinking about what to say about the final project. And second, I originally planned to finish recording and posting by Thursday, not by the end of the following weekend! What can I say, it has been an eventful week.
I’m going to spend the bulk of this lecture talking about partitioned global address space (PGAS) languages generally, and UPC and UPC++ in particular. But before that, I want to take a step back and talk about the broader parallel programming picture. Indeed, there are a lot of parallel programming languages and frameworks out there! And I want to point to a few of them, because playing with one of these can form the basis for a fine class project. The key, though, should you choose to follow this path, is that you should do your playing with an eye to performance. Model, measure, and tune!
Parallel programming
Where we stand
MPI (shared-nothing, message passing)
OpenMP (shared memory, thread based)
Before we talk about the big picture, let’s talk about where we are. We’ve spent some time on the two biggest modern programming environments for high-performance parallel computing: MPI and OpenMP. Both of these systems have been around since the 1990s, so over 20 years. I don’t think we’ve done them full credit – there are books worth of stuff on both systems. But we’ve at least gotten started. We’ve also spent a lot of time talking about how to think about performance and parallelism more generally, which I think is useful independent of the specific programming environment in question.
With that said, what else should we be talking about when it comes to the landscape of parallel programming environments?
Beyond MPI and OpenMP
Heterogeneity and accelerators?
Data analytics systems (Spark, Hadoop, …)?
ML systems (TensorFlow, PyTorch, Keras, …)?
Other parallel languages for sci comp?
Mixing and matching?
There are a couple things that you might think are obvious holes. First, we haven’t said anything about accelerator programming, even though that’s a pretty giant part of the HPC landscape right now; the dominant programming systems there are CUDA and OpenCL, though there are lots of competitors (including OpenMP, which I think we mentioned in passing).
Second, I’ve said nothing about any of the big parallel systems enjoying wide use for data analytics and machine learning. No Spark, no Hadoop, no TensorFlow or PyTorch or Keras or anything like that. This is mostly a matter of scope: if I was doing a course on systems for ML, I might talk about all these things (and maybe Chris De Sa does talk about these things in his ML systems course!). But this course is deliberately oriented toward the HPC side of the world.
Third, parallel scientific computing has been a concern for as long as I’ve been alive – literally – and yet I’ve said very little about the many other parallel systems that have been explored. That’s partly because of something you’ve doubtless already noticed: high-performance programming is an enormous amount of work, and getting the best performance puts a lot of pressure on compiler writers. There are only so many cycles the world is willing to spend on producing (or learning) production-grade HPC tools, and lots of ideas never really make it to launch, or never really catch on once they are out in the world. Ideally, the ideas get taken up by existing systems, though. And there are some good ideas that are worth talking about that aren’t yet all that well handled by the MPI + OpenMP combination, and we’ll talk about a few of those shortly.
Fourth, I haven’t said that much about how to mix-and-match different types of programming models, and that definitely is worth a few words.
Plan for today
Accelerator programming
MPI + X
Libraries and frameworks
Special-use languages
PGAS languages, UPC and UPC++
So here’s the plan for today. First, we’ll say a few words about accelerator programming; not enough to really tell you how to program a GPU, but enough to tell you where to get started in reading or looking around. Second, we’ll talk about the “MPI + X” pattern of MPI between nodes and something else on node. Third, we’ll talk about some libraries and frameworks that are out there that provide parallel abstractions for common tasks – these tend to focus on linear algebra and tensor types of operations. Finally, we’ll talk about the partitioned global address space languages UPC and UPC++.
Accelerator programming
GPUs and manycore devices
Early: shaders for general LA
NVidia: CUDA for beyond graphics (esp ML)
Other attempts: Xeon Phi, FPGA, etc
No Magic Bullets
Be thoughtful when comparing!
Use libraries when possible!
OK. Let’s start with accelerators, by which I typically mean GPUs and related many-core devices.
The predecessor to modern accelerator programming was known as “general-purpose graphics processing unit” (GPGPU) programming. Graphics cards do lots of linear-algebra-flavored operations, and so people started using shaders – usually used for producing pixels – to encode general linear algebra problems instead. The general style is very reminiscent of the old vector machines – single instruction, multiple threads (SIMT).
NVidia introduced CUDA in 2007 as a way to make their graphics cards “easy” for general-purpose programming. Needless to say, it caught on, and I think it’s fair to say that the introduction of GPUs and other accelerators has been one of the big enablers of a lot of the big ML efforts, particularly those involving deep learning. GPUs are not the only type of accelerator out there, either. Intel introduced the Xeon Phi accelerators as a competitor to the GPUs coming out of NVidia and AMD, for example, though the path for these chips has been rocky. And there are Google’s Tensor Processing Units (or TPUs) and various FPGA-based accelerators out there in the wild now, too.
Accelerators are a huge part of the current landscape. But it’s important to think about them carefully. GPUs aren’t magic. They are highly parallel, but there’s also a lot of parallelism available in a high-end CPU with a bunch of cores. Naive code on a GPU is going to be slow, just as naive code on a CPU is going to be slow; and there have been a lot of bad takes over time from people who carefully tuned a GPU implementation of a code that was originally somewhat naively written for a CPU, and hence got an exaggerated notion of the raw performance difference.
Anyhow, if you want to do standard types of things, the advice is the same for CPU and for GPU systems: use someone else’s library! Fortunately, a lot of linear algebra libraries (for example) now include accelerator support, so you can benefit from having a GPU or other accelerator without writing any new GPU code.
CUDA, OpenCL, OpenACC, OpenMP
CUDA: the grandfather of them all (2007)
OpenCL: open standard, multiple devices (2009)
OpenACC: OpenMP-style directives (2012)
OpenMP: added accel directives in 4.0 (2013)
NVidia introduced CUDA in 2007. CUDA started life as a C/C++-like language for GPU “kernels” that would be compiled to a low-level VM code called PTX (Parallel Thread Execution). Then the graphics driver converts the PTX code into something the GPU can run. I always think of “CUDA” as referring to the first part of this pipeline – converting from high-level code in C/C++ (or Fortran, now) down to PTX code. But there are now a lot of other systems with PTX backends, including Python’s Numba and the GPU support system in Julia.
CUDA is an NVidia-specific thing, but NVidia was never the sole manufacturer of graphics cards, and there were a number of competitors to the CUDA programming system. OpenCL started at Apple (with early backing from AMD and others), and it is now an open standard run by the Khronos Group. OpenCL shares some ideas in common with CUDA.
OpenACC was designed to be an OpenMP-like system of compiler pragmas for compiling with accelerators. But it had hardly gotten off the ground before OpenMP also added accelerator support with version 4.0. There are enough differences between OpenMP and OpenACC that both will doubtless continue for a while, though Cray (one of the original developers of the OpenACC standards) has apparently announced that they aren’t going to support it in their compilers going forward. My guess is that OpenMP is where we’re going to go for directive-based accelerator programming in the next few years.
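To make that concrete, here is a minimal sketch of what an offloaded loop looks like with the OpenMP target directives. The function and array names here are mine, not from any particular code, and you need an offload-capable OpenMP compiler for the pragma to actually target an accelerator (otherwise it just runs on the host).

/* y <- y + a*x, with the loop offloaded to an accelerator if available. */
void saxpy(int n, double a, double* x, double* y)
{
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}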
Kokkos, SYCL, DPC++, oneAPI
Kokkos: Portable (?) many-core performance
SYCL: Higher-level programming on top of OpenCL
DPC++ and oneAPI: Intel’s SYCL versions
CUDA and OpenCL are pretty low-level. Kokkos and SYCL are attempts to gain higher-level, more portable performance in C++.
Unlike CUDA and OpenCL, Kokkos is fundamentally based on standard C++, and offers benefits even if you aren’t doing GPU programming. Kokkos separates parallel patterns (loop, scan, reduction, etc), memory spaces where the data resides, execution spaces where functions execute, and – of course – the actual functions that are involved. Kokkos came out of Sandia’s Trilinos project, but now has a life of its own, and continues to be developed and supported.
SYCL is the new thing out of the Khronos group, the same group that manages OpenCL, though it can now run on top of substrates other than OpenCL. It’s also standard C++, with simple syntax for some common parallel patterns. Like Kokkos, SYCL provides some advantages even for running on CPU architectures. Intel has thrown their considerable weight behind SYCL with their data-parallel C++ (DPC++) and oneAPI initiatives, and it will be interesting to see how this evolves.
Unlike CUDA and OpenCL, which have bindings into other languages, the Kokkos and SYCL systems are very closely tied to C++, and it’s hard to see how we would have PyKokkos or SYCL Fortran any time soon.
DO CONCURRENT? pSTL?
Fortran 2008 DO CONCURRENT: Loop iterations with no interdependencies
Parallel STL: Parallel algorithms in C++
Over time, some of the standard parallel constructs are getting support in core language standards. We’ll talk about one such extension (co-arrays in Fortran) in a few slides, but for now I want to mention language features relevant to multi-core and potentially to heterogeneous computing. Specifically: Fortran 2008 added a “do concurrent” loop, basically a version of the do loop (a for loop in C) in which the programmer promises there are no dependencies between loop iterations. And in the C++ 2017 standard, there are a bunch of parallel algorithms that were added to the standard template library. At the moment, I don’t know of any compilers that attempt to target accelerators with these constructs, but at least we’re starting to acknowledge in our language designs that we’re living in a parallel world!
Simit, Halide, and company
Idea: mini-language for LA and tensor ops
Standalone mini-languages (Simit)
Embedded in another language (Halide, Numba, Jax)
Of course, there are other languages beyond Fortran and C++, and it’s interesting to see what role they increasingly play in accelerator programming. There are now a number of systems that effectively introduce small, special-purpose languages for certain classes of accelerator programs. Simit is one that has its own syntax and compiler, and produces remarkably fast simulation code (either for multicore or GPUs). Other systems effectively embed little expression languages inside of another language, like C++ or Python. Halide does this for C++, while Numba and Jax do this in Python. None of these is a complete general-purpose programming language, but that’s part of the appeal; by restricting the scope, it’s easier to figure out how to compile to something fast without worrying about all the things that a C++ or Fortran compiler has to worry about (e.g. aliasing).
MPI+X?
MPI for inter-node plus
MPI-3 RMA locally?
OpenMP locally?
Have to coordinate with MPI
Some other thread system locally?
Accelerator integration?
When people talk about heterogeneous programming languages and frameworks in 2020, they still mostly seem to mean languages and frameworks for codes that run within the confines of a single box, though maybe one with a bunch of chips and boards in it. For cross-node communication, the standard answer remains “use MPI.”
Of course, you can program a single box using MPI! At least, you can program everything but accelerator boards that way. If you use MPI-3, you can even get shared-memory style programming while on-node, though maybe this isn’t the most common approach. More common is something like MPI + OpenMP, though even that combination is not so well-trodden that building and running the code is completely painless. One also has to think a bit about how the two parallel systems interact. One possibility is to call MPI only from outside an OpenMP region, but there is also special support for multi-threaded MPI processes (you need to declare whether only the master thread will call MPI, only one thread at a time will call, or all the threads may call).
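Here is a minimal sketch of the funneled pattern, where only the master thread makes MPI calls; the work inside the OpenMP loop is just a stand-in, but the MPI_Init_thread call and the check on the provided threading level are the important parts.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    /* Request FUNNELED: threads exist, but only the master calls MPI. */
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0;   /* stand-in for real on-node work */

    /* MPI call made outside the OpenMP region, from the master thread. */
    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Total: %g\n", total);
    MPI_Finalize();
    return 0;
}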
And… there is no pure MPI way of doing most accelerator programming. Moreover, a pure MPI code that involves one MPI rank per core is going to require some coordination in the common case that all those cores are in a box with a couple GPUs. So despite the pain points, there are some advantages to the MPI-plus-OpenMP combination, maybe together with something else for accelerators or maybe with the OpenMP 4 offload directives.
Trilinos and PETSc
Trilinos: C++ framework for parallel scientific codes (Sandia)
PETSc: Portable Extensible Toolkit for Scientific Computing (Argonne)
There are two giant parallel scientific computing libraries out of the national labs that support distributed representations of vectors, matrices, and the like. The PETSc system out of Argonne is based on object-oriented C, and is completely MPI-centric (no threads), though it does have some GPU support (in the master branch, not the current release). Trilinos has native support for MPI + something heterogeneous on the local node, based on Kokkos and the TPetra vector library; the Trilinos developers have a thing for Greek names, I suppose, and petra (rock) is the foundation on which a lot of the system is built. TPetra is the templated version of the vector library.
FEniCS and domain-specific
FEniCS: High-level finite element codes in Python
There are a number of libraries that sit atop libraries like PETSc and Trilinos, like the FEniCS finite element system. The idea in FEniCS is that you declaratively specify your equations and all the things that go into a finite element code (the element types, the geometry, etc), and the system generates a high-performance code for you. And part of the bargain is that you get interfaces to the solver packages in PETSc! At least, you do if you’ve built the code with PETSc support and figured out how to run everything.
The point of all this is that sometimes writing a high-performance numerical code doesn’t involve diving down to the lowest levels of the stack and writing a bunch of things with MPI and OpenMP and CUDA. You can do good work living off higher-level library interfaces, too, though understanding performance (and understanding how to run the code) usually involves understanding what’s going on at the lower layers of the stack even in this case.
PGAS languages
Partitioned Global Address Space
Local addresses on local proc
Remote addresses on other procs
May also have private space
Programmer controls placement
All right. Let’s talk now about one of the dominant multi-node competitors to MPI-style message-passing programming: the Partitioned Global Address Space idea. The idea with the PGAS languages and libraries is that in addition to private data, we may have shared data structures (arrays being the simplest) that are spread across the memory on several processors. Programmers read and write to these objects in a similar way to how they would in shared memory programs, with one exception: there is an explicit notion of who owns what part of memory, and programmers control data placement.
PGAS languages
Fortran 2008
UPC
Chapel
UPC++
Many library-based systems
There are a number of partitioned global address space systems out there.
The most established is probably the co-array feature in Fortran, which began as part of a dedicated language (co-array Fortran) and then entered the Fortran standard in 2008. A close runner-up, though, may be Unified Parallel C, which grew out of a number of earlier parallel C variants (e.g. Active C, Split-C, PCP). The Chapel programming language was introduced by Cray during a DARPA project on “High Productivity Computing Systems”; unlike other languages from that effort, such as X10 and Fortress, Chapel continues to have a life well after that project. Of more recent vintage is UPC++, a PGAS system out of Berkeley based on modern standard C++ (and not an extended language as in UPC). And there are a number of library-based systems as well.
I have never actually written anything in UPC++, and we’ve mostly stuck to C for this course anyhow. So for today I will stick with telling you about UPC.
UPC
Explicit parallel extension to ANSI C
A PGAS language
C philosophy: concise, low level, …
And “enough rope to hang yourself!”
UPC is C plus a little bit. That little bit gives you the partitioned global address space approach. The philosophy is very C-like – you have a lot of low-level control, and the system does very little to protect you from your worst impulses.
Execution model
THREADS parallel threads, MYTHREAD local index
Thread count specified at compile or run time
Synchronization primitives (barriers, locks)
Parallel iteration (forall)
Parallel memory access/mgmt
Parallel library routines
The parallel execution model for UPC is a lot like MPI in that it is single program, multiple data (SPMD). You can’t change the number of threads during the execution, but you can either freeze it in at compile time or choose it at run time. There are global variables on each processor to specify the total number of threads and the local thread index in a linear ordering. And then there are all the amenities surrounding parallel work partitioning and remote memory access and synchronization, which we’ll spend most of the remainder of the deck on.
Hello world
#include <upc.h>
#include <stdio.h>
int main()
{
printf("Hello from %d of %d\n",
MYTHREAD, THREADS);
}
Here’s our “hello world code” in UPC. Note the first line that includes upc.h; this is needed for any of the UPC extensions. The only other thing here that is specific to UPC is the use of MYTHREAD and THREADS in the print statement.
Shared variables
shared int ours;
int mine;
We have two types of variables in UPC: shared and local. Normal variables are completely thread local, and behave like you’d expect in any other C program. Shared variables, though, are more interesting.
Shared variables
shared int ours;
Allocated once on thread 0
Cannot have dynamic lifetime
More expensive to access
By default, a shared variable will be allocated once, on thread 0. It is remote to all other threads. And it cannot have dynamic lifetime – it lives for the entire duration of the program, like a variable in the global segment in a normal C program.
Shared variables are more expensive to access than local variables, even when the variables are local to the processor that wants to access them.
Shared arrays
shared int x[THREADS]; /* 1 per thread */
shared double y[3*THREADS]; /* 3 per thread */
shared int z[10]; /* Varies */
Elements have affinity
Default cyclic layout
In addition to shared scalar variables, we have support for shared arrays. These can have size that depends on the number of threads (or not). Different elements have different “affinities,” or places where they live. The default is cyclic: that is, variable x[i] has affinity to thread i mod THREADS.
Piece of pi
Write
\[ \pi = 4 \, (\mbox{area of unit circle quadrant}). \]
If \((X,Y)\) are uniform on \([0,1]^2\), then
\[ \pi = 4 \, P\{X^2 + Y^2 < 1\}. \]
MC: Sample square, compute fraction inside circle.
Let’s do something a little more complex than hello world, and do a Monte Carlo code for computing pi. The idea is that we will sample the unit square, and the probability of any given sample falling inside the unit circle is pi/4. This is Monte Carlo, so pleasingly parallel.
pi in C
int main()
{
int i, hits = 0, trials = 1000000;
srand(17);
for (i = 0; i < trials; ++i)
hits += trial_in_disk();
printf("Pi approx %g\n", 4.0*hits/trials);
return 0;
}
Here’s the Monte Carlo computation in code, based on a million trials. The trial_in_disk function (not shown) checks whether a random X, Y pair drawn uniformly from the square will actually lie inside the disk.
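Since trial_in_disk isn’t shown on the slide, here is one way it might look – treat this as an illustrative sketch rather than the exact code behind the slides.

#include <stdlib.h>

/* Draw (X,Y) uniformly from the unit square; return 1 if the point
   lands inside the quarter disk, 0 otherwise. */
int trial_in_disk(void)
{
    double x = rand() / (double) RAND_MAX;
    double y = rand() / (double) RAND_MAX;
    return x*x + y*y < 1.0;
}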
pi in UPC, v1
shared int all_hits[THREADS];
int main() {
int i, hits = 0, trials = 1000000;
srand(1+MYTHREAD*17);
for (i = 0; i < trials; ++i)
hits += trial_in_disk();
all_hits[MYTHREAD] = hits;
upc_barrier;
if (MYTHREAD == 0) {
hits = 0;
for (i = 0; i < THREADS; ++i)
hits += all_hits[i];
printf("Pi approx %g\n", 4.0*hits/trials/THREADS);
}
return 0;
}
Here’s an initial UPC version. We give each thread a different random seed, and then have them all do Monte Carlo trials in parallel. After completing the trials, they write into their piece of a shared array, and then there’s a barrier. After the barrier, thread 0 tallies the final sum and prints the estimate for pi.
Synchronization
Barriers
Normal: upc_barrier
Split-phase: upc_notify and upc_wait
Locks (protect critical sections)
The barrier construct in UPC is interesting. In addition to the basic barrier that we used in the previous slide, one can also have “split-phase” barriers consisting of a notify that starts the barrier and a wait that ends the barrier. In between the notify and wait, the threads can do independent work that doesn’t do anything with the shared data structures.
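Here is a minimal sketch of the split-phase pattern; do_local_work is just a placeholder for computation that touches no shared data.

#include <upc.h>

static void do_local_work(void) { /* purely local computation goes here */ }

void overlap_example(void)
{
    upc_notify;        /* announce arrival at the barrier */
    do_local_work();   /* overlap: work on private data only */
    upc_wait;          /* block until every thread has notified */
}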
We also have low-level locks, used to protect critical sections. Nothing like the OpenMP support for lexically-scoped critical sections, alas, so I suppose I should say how the locks work.
Locks
upc_lock_t* lock = upc_all_lock_alloc();
upc_lock(lock); /* Get lock */
upc_unlock(lock); /* Release lock */
upc_lock_free(lock); /* Free */
Locks are dynamically allocated objects. Everyone has to call the upc_all_lock_alloc function to set them up, but only one thread needs to free the lock at the end. Calls to the upc_lock and upc_unlock routines grab and release the lock.
pi in UPC, take 2
shared int tot = 0;
int main() {
int i, hits = 0, trials = 1000000;
upc_lock_t* tot_lock = upc_all_lock_alloc();
srand(1+MYTHREAD*17);
for (i = 0; i < trials; ++i)
hits += trial_in_disk();
upc_lock(tot_lock);
tot += hits;
upc_unlock(tot_lock);
upc_barrier;
if (MYTHREAD == 0) {
upc_lock_free(tot_lock);
printf("Pi approx %g\n", 4.0*tot/trials/THREADS);
}
return 0;
}
Here’s our lock-based version of the pi code. The basic picture is the same as before, but rather than having each processor write to a shared array, we have each processor increment a total counter, with a lock used to protect access. Note again: everyone allocates the lock, but only one thread needs to free it.
Collectives
#include <upc.h>
#include <bupc_collective.h>
int main() {
int i, hits = 0, trials = 1000000;
srand(1 + MYTHREAD*17);
for (i = 0; i < trials; ++i)
hits += trial_in_disk();
hits = bupc_allv_reduce(int, hits, 0, UPC_ADD);
if (MYTHREAD == 0)
printf("Pi approx %g\n", 4.0*tot/trials/THREADS);
return 0;
}
UPC also supports a typical list of collective operations. Here’s an example of using a library reduction operation to accumulate the hits in our pi example. Note that we need another header file for this.
Forall loops
upc_forall (init; test; update; affinity)
statement;
Assume independent iterations
Just run iters that match affinity expr
Integer: affinity % THREADS == MYTHREAD
Pointer: upc_threadof(affinity) == MYTHREAD
Syntactic sugar (could use for)
UPC has a parallel for loop construct, just like OpenMP. The main difference is that upc_forall has a fourth clause to specify affinity; each iteration only runs on the thread with which it has affinity.
Example
shared double x[N], y[N], z[N];
int main() {
int i;
upc_forall (i = 0; i < N; ++i; i)
z[i] = x[i] + y[i];
}
Here’s an example of a UPC parallel loop with the affinity clause set up so that no communication is required – each processor takes care of the pieces of x, y, and z that are stored locally.
Array layouts
Layout specifiers allow block cyclic
Block sizes must be compile-time constants
Element i has affinity with thread (i / blocksize) % THREADS
Higher-dim: affinity by linearized index
Sometimes, of course, we don’t want cyclic layout. For example, think of advancing a PDE with a nearest-neighbor finite difference stencil. A cyclic layout would be disastrous in this case! For this reason, UPC also supports block cyclic layouts. The block size has to be a compile-time constant, except that it can depend on the THREADS variable, and we rotate affinity through the threads in chunks of the specified size. And there is specialized support for higher-dim arrays.
Array layouts
shared double a[N]; /* Block cyclic */
shared[*] double a[N]; /* Blocks of N/THREADS */
shared[] double a[N]; /* All elements on thread 0 */
shared[M] double a[N]; /* Block cyclic, block size M */
shared[M1*M2] double a[N][M1][M2]; /* Blocks of M1*M2 */
Here are some examples of different array layouts. The first one we’ve seen – this is the default cyclic layout. If we put a star in brackets after the shared keyword, we get big blocks of size N/THREADS. Just empty brackets after shared means that we want all the data to live at thread 0. Then we can specify a fixed block size for a 1D or multi-dimensional array by putting a constant expression in brackets after the shared keyword.
1D Jacobi Poisson example
shared[*] double u_old[N], u[N], f[N];
void jacobi_sweeps(int nsweeps) {
upc_barrier;
for (int it = 0; it < nsweeps; ++it) {
upc_forall(int i = 1; i < N-1; ++i; &(u[i]))
u[i] = (u_old[i-1] + u_old[i+1] - h*h*f[i])/2; /* h = mesh spacing, assumed defined elsewhere */
upc_barrier;
upc_forall(int i = 1; i < N-1; ++i; &(u[i]))
u_old[i] = u[i];
upc_barrier;
}
}
Here’s a UPC version of a Jacobi iteration for a 1D Poisson equation. As in some of our other codes, we do a double-step, going from the old mesh to the new and then back again. In each piece, the forall loop with the affinity clause, together with the blocked layout, results in minimal traffic across the interfaces, and everything gets synced up at the barriers.
The good thing about this is that it fits in one slide, and the block layout minimizes communication. But shared array access is relatively slow. And we are doing two barriers per step.
Jacobi take 2
shared double ubound[2][THREADS]; /* Ghost cells */
double uold[N_PER+2], uloc[N_PER+2], floc[N_PER+2];
void jacobi_sweep(double h2) {
if (MYTHREAD>0) ubound[1][MYTHREAD-1]=uold[1];
if (MYTHREAD<THREADS) ubound[0][MYTHREAD+1]=uold[N_PER];
upc_barrier;
uold[0] = ubound[0][MYTHREAD];
uold[N_PER+1] = ubound[1][MYTHREAD];
for (int i = 1; i < N_PER+1; ++i)
uloc[i] = (uold[i-1]+uold[i+1] - h2*floc[i])/2;
for (int i = 1; i < N_PER; ++i)
uold[i] = uloc[i];
}
Here’s a more efficient version of one step. This is more MPI-like in that most of the data is completely local; we only keep a little data for exchange of boundary information.
Jacobi take 3
void jacobi_sweep(double h2) {
if (MYTHREAD>0) ubound[1][MYTHREAD-1]=uold[1];
if (MYTHREAD<THREADS) ubound[0][MYTHREAD+1]=uold[N_PER];
upc_notify;
for (int i = 2; i < N_PER; ++i)
uloc[i] = (uold[i-1]+uold[i+1] - h2*floc[i])/2;
upc_wait;
uold[0] = ubound[0][MYTHREAD];
uold[N_PER+1] = ubound[1][MYTHREAD];
for (int i = 1; i < N_PER+1; i += N_PER)
uloc[i] = (uold[i-1]+uold[i+1] - h2*floc[i])/2;
for (int i = 1; i < N_PER; ++i)
uold[i] = uloc[i];
}
We can be a little more efficient still by updating the interior nodes at the same time that we are doing the data transfer. The split phase barrier gives us something like what we would get with the non-blocking sends and receives in MPI.
Sharing pointers
int* p; /* Ordinary pointer */
shared int* p; /* Local pointer to shared data */
shared int* shared p; /* Shared pointer to shared data */
int* shared p; /* Legal, but bad idea */
Pointers to shared are larger, slower than standard.
In addition to static shared arrays, we can also have pointers to shared objects. Either the pointer itself or the data it points to might be shared. A shared pointer is bigger than a standard pointer, and accesses via a shared pointer are slower than accesses via a local pointer (even when retrieving data stored in the local thread’s memory).
UPC pointers
Three fields in shared pointer
Thread number
Local address of block
Phase (position in block)
Access with upc_threadof and upc_phaseof; go to start with upc_resetphase
Digging in a bit more, a pointer to a shared object has three fields: the thread number, the local address of the block, and the phase (or position in the block).
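To make that concrete, here is a small sketch that queries those fields; the array and block size are made up for illustration, and it assumes at least two threads so that element 6 really does land on thread 1.

#include <upc.h>
#include <stdio.h>

shared [4] double a[4*THREADS];   /* block size 4, one block per thread */

int main()
{
    shared [4] double* p = &a[6];   /* second block, third slot */
    if (MYTHREAD == 0)
        printf("a[6] lives on thread %d at phase %d\n",
               (int) upc_threadof(p), (int) upc_phaseof(p));
    return 0;
}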
Global allocation
shared void*
upc_global_alloc(size_t nblocks, size_t nbytes);
Non-collective
Layout is shared [nbytes] char[nblocks * nbytes]
To allocate data in the shared space from a single thread, we call upc_global_alloc. This stripes a given number of blocks across threads, where each block has the given size nbytes.
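As a sketch, a call might look like the following; the block size of 100 doubles is purely illustrative.

#include <upc.h>

/* Allocate THREADS blocks of 100 doubles each, striped one block per
   thread; the caller gets back a (private) pointer-to-shared. */
shared [100] double* make_array(void)
{
    return (shared [100] double*)
        upc_global_alloc(THREADS, 100 * sizeof(double));
}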
Global allocation
shared void*
upc_all_alloc(size_t nblocks, size_t nbytes);
Collective - all call, all get same pointer
Layout is shared [nbytes] char[nblocks * nbytes]
We can also do a collective allocation call with upc_all_alloc. Everyone must call the function at the same time, and everyone gets a pointer to the same shared data object. The layout is the same as in the previous slide.
Free
void upc_free(shared void* p);
The upc_free call frees dynamically allocated shared memory. It is not collective – just like with locks, only one thread should upc_free a given allocation.
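Putting the collective allocation and the free together, a minimal sketch (again with an illustrative block size):

#include <upc.h>

void alloc_use_free(void)
{
    /* Collective: every thread calls this and gets the same pointer. */
    shared [100] double* b = (shared [100] double*)
        upc_all_alloc(THREADS, 100 * sizeof(double));
    /* ... use b ... */
    upc_barrier;        /* make sure everyone is done with b */
    if (MYTHREAD == 0)
        upc_free(b);    /* exactly one thread frees it */
}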
Example: Shared integer stack
typedef struct list_t {
int x;
shared struct list_t* next;
} list_t;
shared struct list_t* shared head;
upc_lock_t* list_lock;
Let’s do a slightly more elaborate example, a shared linked-list representation of a stack. All data will be kept at thread 0.
Example: Shared integer stack
void push(int x) {
shared list_t* item =
upc_global_alloc(1, sizeof(list_t));
upc_lock(list_lock);
item->x = x;
item->next = head;
head = item;
upc_unlock(list_lock);
}
When we push a data item, we allocate it in the global space, grab a lock, push it onto the list, and release the lock. We need the locking in case someone else wanted to push at the same time. Actually, it’s a little more complex than that, as UPC does not guarantee a strongly sequentially consistent view of memory in general; we need the lock as well to make sure that we are not seeing a stale version of the head pointer, for whatever reason.
Example: Shared integer stack
int pop(int* x) {
shared list_t* item;
upc_lock(list_lock);
if (head == NULL) {
upc_unlock(list_lock);
return -1;
}
item = head;
head = head->next;
*x = item->x;
upc_free(item);
upc_unlock(list_lock);
return 0;
}
Now let’s look at popping the stack. First, we grab the list lock and check if there’s anything on the stack; if not, we release the lock and return an indicator flag. Bad things would happen if we forgot to release the lock before returning! This is one of the disadvantages of explicit locking protocols. If there is something in the list, though, we pull it off, update the head pointer, free the link, and release the lock before returning an “all OK” flag.
Memory consistency
UPC access:
Strict (sequential consistency)
Relaxed (may appear out of order to other threads)
Via <upc_relaxed.h>
, strict
or relaxed
type qualifiers, or pragma
s.
The upc_fence
is a strict null reference.
The UPC memory model can either be strictly sequentially consistent or relaxed. There are several ways to choose: via including a header file, via type qualifiers, or via pragmas. If we are being relaxed, we can include shared memory fence operations with upc_fence.
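Here is a small sketch of how the qualifiers and the fence fit together; the variable names are mine, and in this particular handoff the strict write to flag should already impose the ordering, so the fence is mostly there to show the syntax.

#include <upc_relaxed.h>   /* relaxed shared accesses by default in this file */

relaxed shared int data;       /* relaxed: may appear out of order elsewhere */
strict  shared int flag = 0;   /* strict: sequentially consistent */

void publish(void)
{
    if (MYTHREAD == 0) {
        data = 42;    /* relaxed write */
        upc_fence;    /* complete earlier shared accesses before moving on */
        flag = 1;     /* strict write that other threads can poll */
    }
}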
Performance
Won’t be used if it’s too slow! So:
Maximize single-node perf
Fast comm (GASNet for Berkeley UPC)
Manage details intelligently
Performance tuning is nontrivial – no magic bullets!
All right. So far, we’ve talked about how to express things, but what about performance? The good news is that it’s basically C, and so we know the things that we need to do in order to maximize single-node performance: use the right compiler flags, call tuned libraries, etc. For communication, there is a layer called GASNet that is pretty fast and is used for several PGAS language implementations – or we could do MPI RMA operations from MPI 3. And, of course, we have to be intelligent about how we set up the shared data and communication; the language gives us ways to express low-level details like memory layout, but doesn’t force us to do the “right” thing in terms of performance.
There are case studies as part of UPC tutorial slides. With care, we can sometimes get better performance than MPI! But performance tuning in any setting is nontrivial. There are no silver bullets.