CS 5220

Applications of Parallel Computers

Beyond MPI and OpenMP

Prof David Bindel

A programmatic note

  • Lots of programming systems!
  • Playing with one can be a good project
  • Key: model, measure, tune performance!

Parallel programming

Where we stand

  • MPI (shared-nothing, message passing)
  • OpenMP (shared memory, thread based)

Beyond MPI and OpenMP

  • Heterogeneity and accelerators?
  • Data analytics systems (Spark, Hadoop, …)?
  • ML systems (TensorFlow, PyTorch, Keras, …)?
  • Other parallel languages for sci comp?
  • Mixing and matching?

Plan for today

  • Accelerator programming
  • MPI + X
  • Libraries and frameworks
  • Special-use languages
  • PGAS languages, UPC and UPC++

Accelerator programming

  • GPUs and manycore devices
    • Early: shaders for general LA
    • NVIDIA: CUDA for computing beyond graphics (esp. ML)
    • Other attempts: Xeon Phi, FPGAs, etc.
  • No Magic Bullets
    • Be thoughtful when comparing!
    • Use libraries when possible!

CUDA, OpenCL, OpenACC, OpenMP

  • CUDA: the grandfather of them all (2007)
  • OpenCL: open standard, multiple devices (2009)
  • OpenACC: OpenMP-style directives (2012)
  • OpenMP: added accel directives in 4.0 (2013)
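To give the flavor of the directive-based approach, here is a minimal sketch (not from the slides) of offloading a saxpy loop with OpenMP target directives; it assumes a compiler built with offload support, and otherwise the loop simply runs on the host:

#include <stdio.h>

/* Offload y = a*x + y; the map clauses control host/device transfers */
void saxpy(int n, float a, const float* x, float* y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int main()
{
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(N, 3.0f, x, y);
    printf("y[0] = %g (expect 5)\n", y[0]);
    return 0;
}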

Kokkos, SYCL, DPC++, oneAPI

  • Kokkos: Portable (?) many-core performance
  • SYCL: Higher-level programming on top of OpenCL
  • DPC++ and oneAPI: Intel’s SYCL implementation and surrounding toolkit

DO CONCURRENT? pSTL?

  • Fortran 2008 DO CONCURRENT: Loop iterations with no interdependencies
  • Parallel STL: Parallel algorithms in C++

Simit, Halide, and company

Idea: mini-language for LA and tensor ops

  • Standalone mini-languages (Simit)
  • Embedded in another language (Halide, Numba, Jax)

MPI+X?

MPI for inter-node communication, plus:

  • MPI-3 RMA locally?
  • OpenMP locally?
    • Have to coordinate with MPI (see sketch below)
  • Some other thread system locally?
  • Accelerator integration?
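A common MPI+X starting point is MPI across nodes with OpenMP threads within each rank. A minimal sketch (not from the slides) of the coordination step: request a threading level when initializing MPI, then keep MPI calls outside the threaded regions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    /* FUNNELED: only the main thread will make MPI calls */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        printf("Warning: requested threading level not available\n");

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP parallelism within the rank; no MPI calls inside */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; ++i)
        local += 1.0/(i+1);

    /* Back on the main thread: combine across ranks with MPI */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Sum over ranks: %g\n", total);

    MPI_Finalize();
    return 0;
}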

Trilinos and PETSc

  • Trilinos: C++ framework for parallel scientific codes (Sandia)
  • PETSc: Portable Extensible Toolkit for Scientific Computing (Argonne)
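For a taste of the library route, a minimal PETSc sketch (illustrative only; the vector size and the norm computation are arbitrary choices, not from the slides):

#include <petscvec.h>

int main(int argc, char** argv)
{
    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Create a vector distributed across all ranks */
    Vec x;
    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, 1000000);
    VecSetFromOptions(x);

    VecSet(x, 1.0);
    PetscReal nrm;
    VecNorm(x, NORM_2, &nrm);
    PetscPrintf(PETSC_COMM_WORLD, "||x|| = %g\n", (double) nrm);

    VecDestroy(&x);
    PetscFinalize();
    return 0;
}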

FEniCS and domain-specific

FEniCS: High-level finite element codes in Python

PGAS languages

Partitioned Global Address Space

  • Local addresses on local proc
  • Remote addresses on other procs
  • May also have private space
  • Programmer controls placement

PGAS languages

  • Fortran 2008 (coarrays)
  • UPC
  • Chapel
  • UPC++
  • Many library-based systems

UPC

  • Explicit parallel extension to ANSI C
  • A PGAS language
  • C philosophy: concise, low level, …
    • And “enough rope to hang yourself!”

Execution model

  • THREADS parallel threads, MYTHREAD local index
  • Thread count specified at compile or run time
  • Synchronization primitives (barriers, locks)
  • Parallel iteration (forall)
  • Parallel memory access/mgmt
  • Parallel library routines

Hello world

#include <upc.h>
#include <stdio.h>

int main()
{
    printf("Hello from %d of %d\n",
           MYTHREAD, THREADS);
}

Shared variables

shared int ours;
int mine;

Shared variables

shared int ours;
  • Allocated once on thread 0
  • Cannot have automatic (stack) lifetime
  • More expensive to access
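For illustration (not from the slides), a minimal sketch of the usual pattern: thread 0 writes the shared variable, and every thread reads it after a barrier.

#include <upc.h>
#include <stdio.h>

shared int ours;   /* One copy, with affinity to thread 0 */

int main()
{
    int mine = MYTHREAD;   /* Private: one copy per thread */
    if (MYTHREAD == 0)
        ours = 42;         /* Cheap local write on thread 0 */
    upc_barrier;
    printf("Thread %d: ours = %d, mine = %d\n", MYTHREAD, ours, mine);
    return 0;
}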

Shared arrays

shared int x[THREADS];      /* 1 per thread */
shared double y[3*THREADS]; /* 3 per thread */
shared int z[10];           /* Elements per thread varies */
  • Elements have affinity
  • Default cyclic layout
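A small sketch (not from the slides) of the default cyclic layout in action: each thread touches only the elements it owns.

#include <upc.h>
#include <stdio.h>

shared int z[10];   /* z[i] has affinity to thread i % THREADS */

int main()
{
    /* Each thread initializes its own elements */
    for (int i = MYTHREAD; i < 10; i += THREADS)
        z[i] = 100*MYTHREAD + i;
    upc_barrier;
    if (MYTHREAD == 0)
        for (int i = 0; i < 10; ++i)
            printf("z[%d] = %d (owner %d)\n", i, z[i], i % THREADS);
    return 0;
}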

Piece of pi

Write
\[ \pi = 4 \, (\mbox{area of unit circle quadrant}). \]
If \((X,Y)\) are uniform on \([0,1]^2\), then
\[ \pi = 4 P\{X^2 + Y^2 < 1\}. \]
Monte Carlo: sample points uniformly in the square and compute the fraction that land inside the quarter circle.

pi in C

int main()
{
    int i, hits = 0, trials = 1000000;
    srand(17);
    for (i = 0; i < trials; ++i)
        hits += trial_in_disk();
    printf("Pi approx %y\n", 4.0*hits/trials);
    return 0;
}
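The pi examples call a helper that is not shown on the slides. A minimal sketch of one way to write it (the name trial_in_disk and the use of rand() are assumptions here):

#include <stdlib.h>

/* One Monte Carlo trial: 1 if a uniform point in the unit square
   lands inside the quarter disk, 0 otherwise. */
int trial_in_disk()
{
    double x = rand() / (double) RAND_MAX;
    double y = rand() / (double) RAND_MAX;
    return (x*x + y*y < 1) ? 1 : 0;
}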

pi in UPC, v1

shared int all_hits[THREADS];
int main() {
    int i, hits = 0, trials = 1000000;
    srand(1+MYTHREAD*17);
    for (i = 0; i < trials; ++i)
        hits += trial_in_disk();
    all_hits[MYTHREAD] = hits;
    upc_barrier;
    if (MYTHREAD == 0) {
        hits = 0;
        for (i = 0; i < THREADS; ++i)
            hits += all_hits[i];
        printf("Pi approx %y\n", 4.0*hits/trials/THREADS);
    }
    return 0;
}

Synchronization

  • Barriers
    • Normal: upc_barrier
    • Split-phase: upc_notify and upc_wait
  • Locks (protect critical sections)

Locks

upc_lock_t* lock = upc_all_lock_alloc();
upc_lock(lock);      /* Get lock */
upc_unlock(lock);    /* Release lock */
upc_lock_free(lock); /* Free */

pi in UPC, take 2

shared int tot = 0;
int main() {
    int i, hits = 0, trials = 1000000;
    upc_lock_t* tot_lock = upc_all_lock_alloc();
    srand(1+MYTHREAD*17);
    for (i = 0; i < trials; ++i)
        hits += trial_in_disk();
    upc_lock(tot_lock);
    tot += hits;
    upc_unlock(tot_lock);
    upc_barrier;
    if (MYTHREAD == 0) {
        upc_lock_free(tot_lock);
        printf("Pi approx %g\n", 4.0*tot/trials/THREADS);
    }
    return 0;
}

Collectives

#include <upc.h>
#include <bupc_collective.h>
int main() {
    int i, hits = 0, trials = 1000000;
    srand(1 + MYTHREAD*17);
    for (i = 0; i < trials; ++i)
        hits += trial_in_disk();
    hits = bupc_allv_reduce(int, hits, 0, UPC_ADD);
    if (MYTHREAD == 0)
        printf("Pi approx %g\n", 4.0*tot/trials/THREADS);
    return 0;
}

Forall loops

upc_forall (init; test; update; affinity)
    statement;
  • Assume independent iterations
  • Just run iters that match affinity expr
    • Integer: affinity % THREADS == MYTHREAD
    • Pointer: upc_threadof(affinity) == MYTHREAD
  • Syntactic sugar (could use for)

Example

shared double x[N], y[N], z[N];
int main() {
    int i;
    upc_forall (i = 0; i < N; ++i; i)
        z[i] = x[i] + y[i];
}

Array layouts

  • Layout specifiers allow block cyclic
  • Block size expressions must be compile-time constants
  • Element i has affinity with thread (i / blocksize) % THREADS
  • Higher-dim: affinity by linearized index

Array layouts

shared double a[N];                 /* Cyclic (default, block size 1) */
shared[*] double a[N];              /* Blocks of N/THREADS */
shared[] double a[N];               /* All elements on thread 0 */
shared[M] double a[N];              /* Block cyclic, block size M */
shared[M1*M2] double a[N][M1][M2];  /* Blocks of M1*M2 */

1D Jacobi Poisson example

shared[*] double u_old[N], u[N], f[N];
void jacobi_sweeps(int nsweeps) {
    upc_barrier;
    for (int it = 0; it < nsweeps; ++it) {
        upc_forall(int i = 1; i < N-1; ++i; &(u[i]))
            u[i] = (u_old[i-1] + u_old[i+1] - h*h*f[i])/2;
        upc_barrier;
        upc_forall(int i = 1; i < N-1; ++i; &(u[i]))
            u_old[i] = u[i];
        upc_barrier;
    }
}

Jacobi take 2

shared double ubound[2][THREADS]; /* Ghost cells */
double uold[N_PER+2], uloc[N_PER+2], floc[N_PER+2];
void jacobi_sweep(double h2) {
    if (MYTHREAD>0)       ubound[1][MYTHREAD-1]=uold[1];
    if (MYTHREAD<THREADS-1) ubound[0][MYTHREAD+1]=uold[N_PER];
    upc_barrier;
    uold[0]       = ubound[0][MYTHREAD];
    uold[N_PER+1] = ubound[1][MYTHREAD];
    for (int i = 1; i < N_PER+1; ++i)
        uloc[i] = (uold[i-1]+uold[i+1] - h2*floc[i])/2;
    for (int i = 1; i < N_PER+1; ++i)
        uold[i] = uloc[i];
}

Jacobi take 3

void jacobi_sweep(double h2) {
    if (MYTHREAD>0)       ubound[1][MYTHREAD-1]=uold[1];
    if (MYTHREAD<THREADS-1) ubound[0][MYTHREAD+1]=uold[N_PER];
    upc_notify;
    for (int i = 2; i < N_PER; ++i)
        uloc[i] = (uold[i-1]+uold[i+1] - h2*floc[i])/2;
    upc_wait;
    uold[0]       = ubound[0][MYTHREAD];
    uold[N_PER+1] = ubound[1][MYTHREAD];
    for (int i = 1; i <= N_PER; i += N_PER-1) /* Boundary cells 1 and N_PER */
        uloc[i] = (uold[i-1]+uold[i+1] - h2*floc[i])/2;
    for (int i = 1; i < N_PER+1; ++i)
        uold[i] = uloc[i];
}

Sharing pointers

int* p;                /* Ordinary pointer */
shared int* p;         /* Local pointer to shared data */
shared int* shared p;  /* Shared pointer to shared data */
int* shared p;         /* Legal, but bad idea */

Pointers to shared are larger and slower than ordinary C pointers.

UPC pointers

Three fields in shared pointer

  • Thread number
  • Local address of block
  • Phase (position in block)

Query with upc_threadof and upc_phaseof; reset to the start of a block with upc_resetphase.
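For illustration (not from the slides), a small sketch that queries these fields on a block-cyclic array:

#include <upc.h>
#include <stdio.h>

#define M 4
shared [M] double a[M*THREADS];  /* Block cyclic, block size M */

int main()
{
    if (MYTHREAD == 0) {
        shared [M] double* p = &a[5];   /* Phase 1 within its block */
        printf("a[5]: thread %d, phase %d\n",
               (int) upc_threadof(p), (int) upc_phaseof(p));
        p = (shared [M] double*) upc_resetphase(p);  /* Back to block start */
        printf("After reset: phase %d\n", (int) upc_phaseof(p));
    }
    return 0;
}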

Global allocation

shared void*
upc_global_alloc(size_t nblocks, size_t nbytes);
  • Non-collective
  • Layout is shared [nbytes] char[nblocks * nbytes]

Global allocation

shared void*
upc_all_alloc(size_t nblocks, size_t nbytes);
  • Collective - all call, all get same pointer
  • Layout is shared [nbytes] char[nblocks * nbytes]

Free

void upc_free(shared void* p);
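A sketch (not from the slides) that allocates a cyclic shared array at run time, fills it in parallel, and frees it:

#include <upc.h>
#include <stdio.h>

#define N 1000

int main()
{
    /* Collective: every thread calls this and gets the same pointer.
       N blocks of one double each gives the default cyclic layout. */
    shared double* x = (shared double*) upc_all_alloc(N, sizeof(double));

    upc_forall (int i = 0; i < N; ++i; &x[i])
        x[i] = i;        /* Each thread fills the elements it owns */

    upc_barrier;
    if (MYTHREAD == 0) {
        printf("x[N-1] = %g\n", x[N-1]);
        upc_free(x);     /* Free once, from a single thread */
    }
    return 0;
}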

Example: Shared integer stack

typedef struct list_t {
    int x;
    shared struct list_t* next;
} list_t;

shared struct list_t* shared head;
upc_lock_t* list_lock;

Example: Shared integer stack

void push(int x) {
    shared list_t* item =
        upc_global_alloc(1, sizeof(list_t));
    upc_lock(list_lock);
    item->x = x;
    item->next = head;
    head = item;
    upc_unlock(list_lock);
}

Example: Shared integer stack

int pop(int* x) {
    shared list_t* item;
    upc_lock(list_lock);
    if (head == NULL) {
        upc_unlock(list_lock);
        return -1;
    }
    item = head;
    head = head->next;
    *x = item->x;
    upc_free(item);
    upc_unlock(list_lock);
    return 0;
}

Memory consistency

UPC access:

  • Strict (sequential consistency)
  • Relaxed (may appear out of order to other threads)

Via <upc_relaxed.h>, strict or relaxed type qualifiers, or pragmas.

A upc_fence acts as a null strict access, i.e., a memory fence.
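A minimal producer/consumer sketch (not from the slides; assumes at least two threads): the strict flag orders the relaxed write and read of data.

#include <upc_relaxed.h>   /* Relaxed accesses by default */
#include <stdio.h>

shared int data;           /* Relaxed */
strict shared int flag;    /* Strict; zero-initialized */

int main()
{
    if (MYTHREAD == 0) {
        data = 42;         /* Relaxed write completes before ... */
        flag = 1;          /* ... this strict write becomes visible */
    } else if (MYTHREAD == 1) {
        while (!flag);     /* Spin on the strict flag */
        printf("Thread 1 sees data = %d\n", data);
    }
    return 0;
}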

Performance

Won’t be used if it’s too slow! So:

  • Maximize single-node perf
  • Fast comm (GASNet for Berkeley UPC)
  • Manage details intelligently

Performance tuning is nontrivial – no magic bullets!