CS 5220

Shared memory

David Bindel

2024-10-03

OpenMP

Standard API for multi-threaded code
- Only a spec — multiple implementations
- Lightweight syntax
- C or Fortran (with appropriate compiler support)
High level:
- Preprocessor/compiler directives (80%)
- Library calls (19%)
- Environment variables (1%)

OpenMP

Basic syntax: #omp construct [ clause … ]

Usually affects structured block (one way in/out)
OK to have exit() in such a block

Last time

Creating parallel regions
Sharing annotations
Synchronization
Parallel for

Parallel for

void accumulate_a2a(std::vector<double>& f, const std::vector<double>& x)
{
    int n = f.size();

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double fi = 0.0;
        double xi = x[i];
        for (int j = 0; j < n; ++j) {
            double dij = xi-x[j];
            if (dij != 0.0)
                fi += 1.0/(dij*dij);
        }
        f[i] = fi;
    }
}

Aside: tiling transformation on this code?

What about symmetry?

void accumulate_a2a(std::vector<double>& f, const std::vector<double>& x)
{
    int n = f.size();

    for (int i = 0; i < n; ++i) {
        double xi = x[i];
        for (int j = i+1; j < n; ++j) {
            double dij = xi-x[j];
            f[i] += 1.0/(dij*dij);
            f[j] += 1.0/(dij*dij);
        }
    }
}

Why would omp parallel for fail here?

Other parallel work divisions

sections: like cobegin/coend
single: do only in one thread (e.g. I/O)
master: do only in master thread; others skip

Sections

#pragma omp parallel
{
    #pragma omp sections nowait
    {
        #pragma omp section
        do_something();

        #pragma omp section
        and_something_else();

        #pragma omp section
        and_this_too();
        // No implicit barrier here
    }
    // Implicit barrier here
}

sections nowait to kill barrier.

Task-based parallelism

Work-sharing so far is rather limited
- Work cannot be produced/consumed dynamically
- Fine for data parallel array processing...
- ... but what about tree walks and such?
Alternate approach (OpenMP 3.0+): Tasks

Tasks

Task involves:

Task construct: task directive plus structured block
Task: Task construct + data

Tasks are handled by run time, complete at barriers or taskwait.

Example: List traversal

#pragma omp parallel
{
    #pragma omp single nowait
    {
        for (link_t* link = head; link; link = link->next)
            #pragma omp task firstprivate(link)
            process(link);
    }
    // Implicit barrier
}

One thread generates tasks, others execute them.

Example: Tree traversal

int tree_max(node_t* n)
{
    int lmax, rmax;
    if (n->is_leaf)
        return n->value;
    
    #pragma omp task shared(lmax)
        lmax = tree_max(n->l);
    #pragma omp task shared(rmax)
        rmax = tree_max(n->l);
    #pragma omp taskwait

    return max(lmax, rmax);
}

The taskwait waits for all child tasks.

Example: Quicksort

void omp_qsort(int* a, int lo, int hi)
{
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    #pragma omp task shared(a)
    omp_qsort(a, lo, p);
    #pragma omp task shared(a)    
    omp_qsort(a, p, hi);
}

void call_qsort(int* a, int lo, int hi)
{
    #pragma omp parallel
    {
        #pragma omp single
        omp_qsort(a, lo, hi);
        #pragma omp taskwait
    }
}

Task dependencies

What happens if one task produces what another needs?

#pragma omp task depend(out:x)
x = foo();
#pragma omp task depend(in:x)
y = bar(x);

Topics not addressed

Low-level synchronization (locks, flush)
OpenMP 4.x constructs for accelerator interaction
A variety of more specialized clauses

See http://www.openmp.org/

SPH

Your next project!

Smoothed Particle Hydrodynamics

Particle-based method for fluid simulation
- Representative of other particle-based methods
- More visually interesting than MD with Lennard-Jones?
Particle \(i\) (a fluid blob) evolves as \[ m \mathbf{a}_i = \sum_{|\mathbf{x}_j-\mathbf{x}_i| \leq h} \mathbf{f}_{ij} \] where force law satisfies \(\mathbf{f}_{ij} = -\mathbf{f}_{ji}\).
Chief performance challenge: fast force evaluation!

Task 1: Binning / spatial hashing

Partition space into bins of size \(\geq h\) (interaction radius)
Only check interactions in nearby bins
Tradeoff between bin size, number of interaction checks

Task 1: Binning / spatial hashing

Keep particles in an array as usual
Also keep array of head pointers for bins
Thread linked list structures for bin contents

Task 1: Binning / spatial hashing

struct particle_t {
    float rho;        // Density
    float x[3];       // Position
    float v[3];       // Velocity (full step)
    float vh[3];      // Velocity (half step)
    float a[3];       // Accelerations
    particle_t* next; // Link for hashing
};

struct sim_state_t {
    float mass;                     // Particle mass
    std::vector<particle_t> part;   // Particles
    std::vector<particle_t*> hash;  // Hash table
};

Task 1: Binning / spatial hashing

Typical situation: lots of empty bins
Empty boxes take space!
Alternative: spatial hashing
- Map multiple bins to one storage location
- Avoid collisions (several bins map to same place)
Idea: figure out potential neighbors good for a few steps?

Ordering

Bins naturally identified with two or three indices
Want to map to a single index
- To serve as a hash key
- For ordering computations

Ordering

Row/column major: mediocre locality
Better: Z-Morton ordering
- Interleave bits of \((x,y,z)\) indices
- Efficient construction a little fidly
- Basic picture is simple!

Z Morton ordering

Equivalent to height-balanced quadtree/octtree

Task 2: Profiling

Computation involves several different steps:

Finding nearest neighbors
Computing local densities
Computing local pressure / viscous forces
Time stepping
I/O

How much does each cost? Scaling?

Profiling

Want to act based on timing data
- Manual instrumentation?
- Profilers?
Fix what takes most time first
Consider both algorithm improvements and tuning

Task 3: Parallelization

How do we decompose the problem?

Processors own regions of space?
Processors own fixed set of particles?
Processors own sets of force interactions?
Hybrid recommended!

Task 3: Parallelization

How do we synchronize force computation?

Compute \(\mathbf{f}_{ij}\) and \(\mathbf{f}_{ji}\) simultaneously in reference code
Could keep multiple updates per processor and reduce?
Use multi-color techniques?
Interacts with problem decomposition!

Example: Multi-color ordering

No blue cell neighbors another blue.

Things to think about

How do we make sure we don’t break the code?
How fast is the parallel code for \(p=1\)?
Are there load balance issues?
Do we synchronize often?
Do we get good strong scaling? weak scaling?

Task 4: Play

How fast can we make the serial code?
How should we improve initialization?
Could we time step smarter?
What about surface tension?
What about better BCs?
Swishing, pouring, etc?
Improvements for incompressible flow?