CS 5220

Parallelism and Locality in Simulations

David Bindel

2024-09-17

Intro

Parallelism and Locality

The world exhibits parallelism and locality

  • Particles, people, etc. function independently
  • Near-field interactions stronger than far-field
  • Can often simplify dependence on distant things

Parallelism and Locality

Get more parallelism / locality through model

  • Limited dependency between adjacent time steps
  • Can neglect or approximate far-field effects

Parallelism and Locality

Often get parallelism at multiple levels

  • Hierarchical circuit simulation
  • Interacting models for climate
  • Parallelizing individual experiments in MC or optimization

Styles of Simulation

  • Discrete event systems (continuous or discrete time)
  • Particle systems
  • Lumped parameter models (ODEs)
  • Distributed parameter models (PDEs / IEs)

Often more than one type of simulation is appropriate.
(Sometimes more than one at a time!)

Discrete Event Systems

Discrete Event Systems

May be discrete or continuous time.

  • Game of life
  • Logic-level circuit simulation
  • Network simulation

Discrete Events

  • Finite set of variables, transition function updates
  • Synchronous case: finite state machine
  • Asynchronous case: event-driven simulation
  • Synchronous (?) example: Game of Life
  • Nice starting point – no discretization concerns!
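
A minimal sketch of the asynchronous case: an event-driven loop that processes timestamped events from a priority queue in time order. The Event type, handler, and end time are illustrative assumptions, not part of the slides.

#include <functional>
#include <queue>
#include <vector>

struct Event {
    double time;                    // simulated time at which the event fires
    std::function<void()> action;   // state update; may schedule new events
};

struct Later {                      // order the queue by earliest time first
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

int main() {
    std::priority_queue<Event, std::vector<Event>, Later> pending;
    const double t_end = 100.0;
    // ... schedule initial events into pending ...
    while (!pending.empty() && pending.top().time <= t_end) {
        Event e = pending.top();
        pending.pop();
        e.action();                 // process the earliest event; it may enqueue more
    }
    return 0;
}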

Game of Life

Game of life (John Conway):

  • Live cell dies with < 2 live neighbors
  • Live cell dies with > 3 live neighbors
  • Live cell lives with 2-3 live neighbors
  • Dead cell becomes live with exactly 3 live neighbors
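
In code, the update rule for a single cell might look like the sketch below, assuming a dense 0/1 board with a one-cell padding border (the bitwise version comes later).

#include <vector>

// board[i][j] is 1 if cell (i,j) is live, 0 if dead; rows/columns 0 and n+1
// form a padding border so every interior cell has eight valid neighbors.
int next_state(const std::vector<std::vector<int>>& board, int i, int j) {
    int live = 0;
    for (int di = -1; di <= 1; ++di)
        for (int dj = -1; dj <= 1; ++dj)
            if (di != 0 || dj != 0)
                live += board[i + di][j + dj];
    if (board[i][j])
        return (live == 2 || live == 3) ? 1 : 0;   // survival
    return (live == 3) ? 1 : 0;                    // birth
}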

Game of Life

What to do if I really cared?

  • Tile the problem for memory
  • Try for high operational intensity
  • Use instruction-level parallelism
  • Don’t output the board too often!

Before doing anything with OpenMP/MPI!

Game of Life

Easy to parallelize by domain decomposition

  • Update work scales with the volume of each subdomain
  • Communication per step scales with the surface area

Also works with tiling.

Game of Life

Sketch of a kernel for tiled implementation:

  • Bitwise representation of cells (careful with endian-ness)
  • A “tile” is a 64-by-64 piece (64 uint64_t)
    • Keep two tiles (ref and tmp)
  • Think of inner 48-by-48 as “live”
  • Buffer of size 8 on all sides
  • Compute saturating 3-bit neighbor counters
  • Batches of eight steps (four ref to tmp, four back)
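
Ignoring the bit-packing, the batching idea might look like the sketch below: after one halo exchange, each plain step shrinks the region known to be correct by one cell, so eight steps still leave the inner 48-by-48 valid. The names (Tile, step_cell) are illustrative, not from a reference implementation.

#include <cstdint>

constexpr int T = 64;     // full tile width, including the halo
constexpr int HALO = 8;   // the inner (T - 2*HALO) = 48-by-48 region is "live"

struct Tile { std::uint8_t cells[T][T]; };   // one byte per cell; bit-packing elided

// Plain Life rule for one cell; assumes all eight neighbors are valid.
std::uint8_t step_cell(const Tile& t, int i, int j) {
    int live = 0;
    for (int di = -1; di <= 1; ++di)
        for (int dj = -1; dj <= 1; ++dj)
            if (di != 0 || dj != 0) live += t.cells[i + di][j + dj];
    return t.cells[i][j] ? (live == 2 || live == 3) : (live == 3);
}

// Update src into dst, skipping cells within `margin` of the tile edge
// (their neighbors are no longer trustworthy).
void step_tile(const Tile& src, Tile& dst, int margin) {
    for (int i = margin; i < T - margin; ++i)
        for (int j = margin; j < T - margin; ++j)
            dst.cells[i][j] = step_cell(src, i, j);
}

// One batch of HALO = 8 steps, ping-ponging ref -> tmp -> ref (four each
// way); only the inner 48-by-48 of ref is trusted afterwards, and the halo
// must be refreshed before the next batch.
void batch(Tile& ref, Tile& tmp) {
    for (int s = 0; s < HALO; s += 2) {
        step_tile(ref, tmp, s + 1);
        step_tile(tmp, ref, s + 2);
    }
}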

Game of Life

Some areas are more eventful than others!

Game of Life

What if the pattern is dilute?

  • Few or no live cells at surface at each step
  • Think of live cell at a surface as an “event”
  • Only communicate events!
    • This is asynchronous
    • Harder with message passing – when to receive?

Asynchronous Life

How do we manage events?

  • Speculative – assume no communication across boundary for many steps, back up if needed
  • Conservative – wait when communication possible
    • Possible \(\neq\) guaranteed!
    • Deadlock: everyone waits for a send
    • Can get around this with NULL messages
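
A rough sketch of the conservative bookkeeping (Chandy-Misra-style null messages); the Channel type is an assumption for illustration.

#include <algorithm>
#include <vector>

struct Channel {
    double promised;   // neighbor has promised to send nothing earlier than this
    // ... queue of incoming (time, event) messages elided ...
};

// A rank may safely simulate up to the earliest time any neighbor might still
// send an event for.  Null messages carry no events; they exist only to
// advance `promised` so that this horizon keeps moving and nobody deadlocks.
double safe_horizon(const std::vector<Channel>& incoming) {
    double t = 1e300;
    for (const Channel& c : incoming) t = std::min(t, c.promised);
    return t;
}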

Asynchronous Life

How do we manage load balance?

  • No need to simulate quiescent parts of the game!
  • Maybe dynamically assign smaller blocks to processors?

HashLife

  • There are also other algorithms!

Beyond Life

Particle Systems

  • Billiards, electrons, galaxies, …
  • Ants, cars, agents, …?

Particle Simulation

Particles move via Newton (\(F = ma\)) with

  • External forces: ambient gravity, currents, etc
  • Local forces: collisions, Van der Waals (\(r^{-6}\)), etc
  • Far-field forces: gravity and electrostatics (\(r^{-2}\)), etc
    • Simple approximations often apply (Saint-Venant's principle)

Forced Example

\[\begin{aligned} f_i &= \sum_j G m_i m_j \frac{(x_j-x_i)}{r_{ij}^3} \left( 1-\left( \frac{a}{r_{ij}} \right)^4 \right), \\ r_{ij} &= \|x_i-x_j\| \end{aligned}\]

  • Long-range attractive force (\(r^{-2}\))
  • Short-range repulsive force (\(r^{-6}\))
  • Go from attraction to repulsion at radius \(a\)
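
A direct all-pairs evaluation of this force law might look like the sketch below; the Particle layout is an assumption.

#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { double m, x[3], f[3]; };   // mass, position, accumulated force

// O(n^2) evaluation of the forced example: attraction ~ r^{-2} far away,
// repulsion ~ r^{-6} up close, switching sign at radius a.
void compute_forces(std::vector<Particle>& p, double G, double a) {
    for (Particle& pi : p) pi.f[0] = pi.f[1] = pi.f[2] = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            double d[3], r2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                d[k] = p[j].x[k] - p[i].x[k];
                r2 += d[k] * d[k];
            }
            double r = std::sqrt(r2);
            double ar4 = (a / r) * (a / r) * (a / r) * (a / r);
            double c = G * p[i].m * p[j].m * (1.0 - ar4) / (r2 * r);
            for (int k = 0; k < 3; ++k) {
                p[i].f[k] += c * d[k];   // force on i from j
                p[j].f[k] -= c * d[k];   // equal and opposite on j
            }
        }
    }
}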

Simple Serial Simulation

Using Boost.Numeric.Odeint, we can write

// integrate comes from <boost/numeric/odeint.hpp> (namespace boost::numeric::odeint)
integrate(particle_system, x0, tinit, tfinal, h0,
          [](const auto& x, double t) {
              std::cout << "t=" << t << ": x=" << x << std::endl;
          });

where

  • particle_system defines the ODE system
  • x0 is the initial condition
  • tinit and tfinal are start and end times
  • h0 is the initial step size

and the final lambda is an observer function.
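
For completeness, a self-contained version might look like the sketch below, with a trivial stand-in for particle_system (one particle falling under constant gravity); the real system would evaluate the forces above.

#include <boost/numeric/odeint.hpp>
#include <iostream>
#include <vector>

using state_type = std::vector<double>;

// Stand-in ODE system: x[0] is position, x[1] is velocity, constant gravity.
void particle_system(const state_type& x, state_type& dxdt, double /*t*/) {
    dxdt[0] = x[1];      // dx/dt = v
    dxdt[1] = -9.81;     // dv/dt = external acceleration
}

int main() {
    state_type x0 = {0.0, 10.0};   // initial position and velocity
    boost::numeric::odeint::integrate(
        particle_system, x0, 0.0, 2.0, 0.01,
        [](const state_type& x, double t) {
            std::cout << "t=" << t << ": x=" << x[0] << " v=" << x[1] << "\n";
        });
    return 0;
}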

Beyond Serial Simulation

Can parallelize in

  • Time (tricky): Parareal methods, asynchronous methods
  • Space: Our focus!

Plotting Particles

Smoothed Particle Hydrodynamics (SPH) – Project 2

Pondering Particles

  • Where do particles “live” (distributed mem)?
    • Decompose in space? By particle number?
    • What about clumping?
  • How are long-range force computations organized?
  • How are short-range force computations organized?
  • How is force computation load balanced?
  • What are the boundary conditions?
  • How are potential singularities handled?
  • Choice of integrator? Step control?

External Forces

Simplest case: no particle interactions.

  • Pleasingly parallel (like Monte Carlo!)
  • Could just split particles evenly across processors
  • Is it that easy?
    • Maybe some trajectories need short time steps?
    • Even with MC, load balance may not be trivial!
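
Setting those caveats aside, the basic update really is an independent loop over particles; a hedged OpenMP sketch with an assumed Particle layout and a forward Euler step under constant gravity:

#include <cstddef>
#include <vector>

struct Particle { double x[3], v[3]; };

// Every particle advances independently, so a plain parallel loop suffices.
void step_external(std::vector<Particle>& p, const double g[3], double dt) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t) p.size(); ++i) {
        for (int k = 0; k < 3; ++k) {
            p[i].x[k] += dt * p[i].v[k];
            p[i].v[k] += dt * g[k];
        }
    }
}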

Local Forces

  • Simplest all-pairs check is \(O(n^2)\) (expensive)
  • Or only check close pairs (via binning, quadtrees?)
  • Communication required for pairs checked
  • Usual model: domain decomposition
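
A common way to limit the pairs checked is spatial binning with cells at least as wide as the interaction cutoff; a serial sketch (the Particle type and a square box of side `box` are assumptions):

#include <algorithm>
#include <vector>

struct Particle { double x, y; };

// Hash each particle index into a cell of width >= cutoff; any interacting
// pair must then share a cell or sit in adjacent cells.
std::vector<std::vector<int>>
bin_particles(const std::vector<Particle>& p, double box, double cutoff)
{
    int nc = std::max(1, (int) (box / cutoff));   // cells per side
    double h = box / nc;                          // cell width >= cutoff
    std::vector<std::vector<int>> bins(nc * nc);
    for (int i = 0; i < (int) p.size(); ++i) {
        int cx = std::min(nc - 1, (int) (p[i].x / h));
        int cy = std::min(nc - 1, (int) (p[i].y / h));
        bins[cx + nc * cy].push_back(i);
    }
    return bins;
}

Checking each particle only against its own and the eight neighboring bins then costs roughly O(n) for bounded particle density, instead of O(n^2).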

Local Forces: Communication

Minimize communication:

  • Send particles that might affect a neighbor “soon”
  • Trade extra computation against communication
  • Want low surface area-to-volume ratios on domains

Local Forces: Load Balance

  • Are particles evenly distributed?
  • Do particles remain evenly distributed?
  • Can divide space unevenly (e.g. quadtree/octree)

Far-Field Forces

  • Every particle affects every other particle
  • All-to-all communication required
    • Overlap communication with computation
    • Poor memory scaling if everyone keeps everything!
  • Idea: pass particles in a round-robin manner

Passing Particles (Far-Field Forces)

copy particles to current buf
for phase = 1 to p
  send current buf to rank+1 (mod p)
  recv next buf from rank-1 (mod p)
  interact local particles with current buf
  swap current buf with next buf
end
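
A hedged MPI sketch of the same loop (C bindings from C++); particle_type, interact, and the buffer layout are assumptions. Nonblocking sends and receives let the interaction with the current buffer overlap the transfer of the next one (reading the send buffer during a pending Isend is allowed as of MPI-3).

#include <mpi.h>
#include <utility>

struct Particle { double x[3], v[3], m; };

// Placeholder: interact the local particles with a buffer of remote ones.
void interact(Particle* local, int nlocal, const Particle* buf, int n);

void ring_pass(Particle* local, int nlocal, Particle* current, Particle* next,
               int n, MPI_Datatype particle_type, int rank, int p)
{
    for (int phase = 0; phase < p; ++phase) {
        MPI_Request reqs[2];
        MPI_Isend(current, n, particle_type, (rank + 1) % p,     0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(next,    n, particle_type, (rank + p - 1) % p, 0,
                  MPI_COMM_WORLD, &reqs[1]);
        interact(local, nlocal, current, n);   // compute while messages fly
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        std::swap(current, next);              // received buffer becomes current
    }
}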

Passing Particles (Far-Field Forces)

Suppose \(n = N/p\) particles in buffer. At each phase \[\begin{aligned} t_{\mathrm{comm}} & \approx \alpha + \beta n \\ t_{\mathrm{comp}} & \approx \gamma n^2 \end{aligned}\]

Masking communication with computation requires \(t_{\mathrm{comp}} \geq t_{\mathrm{comm}}\), i.e. \(\gamma n^2 \geq \alpha + \beta n\); solving for \(n\) gives \[ n \geq \frac{1}{2\gamma} \left( \beta + \sqrt{\beta^2 + 4 \alpha \gamma} \right). \]

Passing Particles (Far-Field Forces)

More efficient serial code
\(\implies\) larger \(n\) needed to mask communication!
\(\implies\) worse speed-up as \(p\) gets larger (fixed \(N\))
but scaled speed-up (\(n\) fixed) remains unchanged.

Far-Field Forces: Particle-Mesh

Consider \(r^{-2}\) electrostatic potential interaction

  • Enough charges look like a continuum!
  • Poisson maps charge distribution to potential
  • Fast Poisson for regular grids (FFT, multigrid)
  • Approx depends on mesh and particle density
  • Can clean up leading part of approximation error

Far-Field Forces: Particle-Mesh

  • Map particles to mesh points (multiple strategies)
  • Solve potential PDE on mesh
  • Interpolate potential to particles
  • Add correction term – acts like local force
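
A sketch of the first step with the simplest (nearest-grid-point) strategy; the mesh size, box, and function names are assumptions. Cloud-in-cell and other schemes instead spread each particle over several nearby mesh points.

#include <cstddef>
#include <vector>

// Deposit particle charges q at positions (x, y) in [0, box)^2 onto an
// m-by-m mesh: each particle contributes to the single cell containing it.
std::vector<double> deposit_charge(const std::vector<double>& x,
                                   const std::vector<double>& y,
                                   const std::vector<double>& q,
                                   int m, double box)
{
    std::vector<double> rho(m * m, 0.0);
    double h = box / m;
    for (std::size_t i = 0; i < q.size(); ++i) {
        int cx = ((int) (x[i] / h)) % m;
        int cy = ((int) (y[i] / h)) % m;
        rho[cx + m * cy] += q[i] / (h * h);   // accumulate charge density
    }
    return rho;
}

The resulting density feeds the fast Poisson solve on the mesh; the same weights are reused to interpolate the potential back to the particles.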

Far-Field Forces: Tree Methods

  • Distance simplifies things
    • Andromeda looks like a point mass from here?
  • Build tree, approx descendants at each node
  • Variants: Barnes-Hut, FMM, Anderson’s method
  • More on this later in the semester

Summary of Particle Example

  • Model: Continuous motion of particles
    • Could be electrons, cars, whatever
  • Step through discretized time

Summary of Particle Example

  • Local interactions
    • Relatively cheap
    • Load balance a pain
  • All-pairs interactions
    • Obvious algorithm is expensive (\(O(n^2)\))
    • Particle-mesh and tree-based algorithms help

An important special case of lumped/ODE models.