Applications of Parallel Computers
Particle Systems
Prof David Bindel
Welcome to our discussion of a second class of physical
simulations in our lightning introduction to locality and
parallelism in simulations. In this deck, we are talking about
particle systems.
Particle systems
Billiards, electrons, galaxies, ...
Ants, cars, agents, ...?
When I say particles, I mean all types of discrete interacting
entities. That might include simple physical things, from billiard balls
to electrons to galaxies; or it might include ants in a colony,
cars on a highway, or agents in the Matrix (or in an agent-based
simulation, if you're not a Keanu Reeves fan).
Particle simulation
Particles move via Newton (\(F = ma\)), with
External forces: ambient gravity, currents, etc.
Local forces: collisions, van der Waals (\(1/r^6\)), etc.
Far-field forces: gravity and electrostatics (\(1/r^2\)), etc.
Simple approximations often apply (Saint-Venant)
When we are dealing with physical particles, we generally have
them moving around according to Newton's force balance law. The
interesting thing is where the forces come from! Sometimes they
are external, things like gravity or currents that come from
outside the scope of the things we are trying to simulate.
Sometimes, they involve interactions between particles that are
very close to each other. And sometimes they involve far-field
forces like gravitational attraction or electrostatic attraction
or repulsion - though, as we have mentioned before, these types of
far-field forces sometimes admit convenient approximations.
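To make the structure concrete, the generic equation of motion we have in mind looks something like the following, with the force on particle \(i\) split into the three categories above (the cutoff radius \(r_c\) on the local part is just an illustrative way of writing it):
\[
  m_i \ddot{x}_i = f_i^{\mathrm{ext}}(x_i, t)
    + \sum_{j \,:\, r_{ij} \le r_c} f^{\mathrm{local}}(x_i, x_j)
    + \sum_{j \ne i} f^{\mathrm{far}}(x_i, x_j),
  \qquad r_{ij} = \|x_i - x_j\|.
\]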
By the way: the "far field vs near field" distinction is
ubiquitous in physics. In some areas, like mechanics, it has
special names - St. Venant's principle, for example, in solid
mechanics. Mathematicians refer to the property as elliptic
regularity, usually in the context of talking about PDEs. But
even if you only care about discrete systems like social
networks, the same effect can occur - just with less
directly-physical notions of what "near" and "far" mean.
A forced example
Example force: \[f_i = \sum_j Gm_i m_j
\frac{(x_j-x_i)}{r_{ij}^3}
\left(1 - \left(\frac{a}{r_{ij}}\right)^{4} \right), \qquad
r_{ij} = \|x_i-x_j\|\]
Long-range attractive force (\(r^{-2}\))
Short-range repulsive force (\(r^{-6}\))
Go from attraction to repulsion at radius \(a\)
So here's the type of thing we might consider. Suppose that our
particles have a short-range repulsive force and a long-range
attractive force. They can't get away from each other, and can't
get too close. What happens? Interesting things! Depends a lot
on how much energy there is in the initial state of the system.
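As a concrete sketch (my own, not the official course code), here is how the pairwise force above might be accumulated in C for particles in 2D; the interleaved coordinate layout, the function name, and the use of Newton's third law to halve the work are choices I'm making purely for illustration.
#include <math.h>

/* Sketch: accumulate f_i = sum_j G m_i m_j (x_j - x_i)/r^3 (1 - (a/r)^4).
 * Positions and forces are n interleaved 2D points (x[2*i], x[2*i+1]);
 * this layout and the names are illustrative, not from the course codes. */
void accum_forces(int n, const double* x, const double* m,
                  double G, double a, double* f)
{
    for (int k = 0; k < 2*n; ++k)
        f[k] = 0.0;
    for (int i = 0; i < n; ++i) {
        for (int j = i+1; j < n; ++j) {
            double dx = x[2*j+0] - x[2*i+0];
            double dy = x[2*j+1] - x[2*i+1];
            double r2 = dx*dx + dy*dy;
            double r  = sqrt(r2);
            double a2r2 = (a*a)/r2;                       /* (a/r)^2 */
            double s = G*m[i]*m[j]/(r2*r) * (1.0 - a2r2*a2r2);
            f[2*i+0] += s*dx;  f[2*i+1] += s*dy;          /* force on i */
            f[2*j+0] -= s*dx;  f[2*j+1] -= s*dy;          /* reaction on j */
        }
    }
}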
A simple serial simulation
In MATLAB, we can write
% Solve the n-body ODE with ode113 (assumes tfinal, initial positions x,
% velocities v, masses m, parameter g, and force function fnbody exist).
npts = 100;
t = linspace(0, tfinal, npts);
[tout, xyv] = ode113(@fnbody, t, [x; v], [], m, g);
xout = xyv(:,1:length(x))';
... but I can’t call ode113 in C in parallel (or can I?)
At the end of the day, the particles are governed by a collection
of ordinary differential equations. So maybe one could consider
using a standard ODE solver, like one of the ones in MATLAB, to
solve the system. This is not such a stupid idea, really. But it
turns out that we might want to think again, both for parallelism
and to get the properties that we want out of our ODE solver.
A simple serial simulation
Maybe a fixed step leapfrog will do?
% Fixed-step integration (same workspace assumptions as above; x and v
% hold the coordinates and velocities for all particles).
npts = 100;
steps_per_pt = 10;
dt = tfinal/(steps_per_pt*(npts-1));
xout = zeros(2*n, npts);
xout(:,1) = x;
for i = 1:npts-1
  for ii = 1:steps_per_pt
    x = x + v*dt;          % update positions with current velocities
    a = fnbody(x, m, g);   % recompute accelerations
    v = v + a*dt;          % update velocities with new accelerations
  end
  xout(:,i+1) = x;
end
The standard ODE solvers are pretty accurate over moderate stretches of
time, but they do have some error from approximating an ODE by
something that involves discrete time steps. Unfortunately,
varying the step size to control the local error often doesn't get
us quite what we want. It is OK if we mess up some details, as
long as others are well preserved. In physics, for example, we
usually want conservation of energy and momentum, and some simple
fixed-step integrators - like the leapfrog scheme shown above -
may preserve those even better than the MATLAB ODE solvers do.
Fixed-step leapfrog is easier to code, too!
Plotting particles
Smoothed Particle Hydrodynamics (SPH) simulation
You'll get to play with this integrator soon enough for yourself.
The plan is that the next project will involve a smoothed particle
hydrodynamics or SPH simulation, and we'll use leapfrog for the
time integration.
Pondering particles
Where do particles live (distributed mem)?
Decompose in space? By particle number?
What about clumping?
How are long-range force computations organized?
How are short-range force computations organized?
How is force computation load balanced?
What are the boundary conditions?
How are potential singularities handled?
What integrator is used? What step control?
We'll say more about time steppers on Thursday. But the more
interesting thing to talk about at the moment is how we manage the
data for the particles and the computation of the forces between
them at any given step. We also have to worry about things like
what happens at boundaries - do we bounce off walls, or does
something else happen? And sometimes we have to worry about the
possibility that our model has singular behavior, like infinite
attraction between particles that are zero distance apart. Of
course, that type of interaction usually means that we have
neglected some important part of the physics, but that's cold
comfort when we're trying to sort out why NaNs are suddenly
appearing in our simulations!
External forces
Simplest case: no particle interactions.
Embarrassingly parallel (like Monte Carlo)!
Could just split particles evenly across processors
Is it that easy?
Maybe some trajectories need short time steps?
Even with MC, load balance may not be trivial!
All right, let's turn now to the topic of dealing with forces.
By far the easiest class of forces to deal with is external
forces. Think here of the Earth's gravitational field in a
physical context, or the lure of students (or faculty!) to free food
if we are modeling crowd dynamics. If external forces are the
only thing we have to worry about, we have pleasing (or
embarrassing) parallelism: each
trajectory can be computed independently of the others. Of
course, it can't be that easy, because there is usually some
inhomogeneity in how much each simulation costs, and that
affects things like load balance. But still, this is much
easier to deal with than cases where the particles actually interact.
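For a minimal sketch of what "split particles evenly" might mean in a distributed-memory code (the block-partition convention and the name my_block are mine, not the course's):
/* Sketch: give each of p ranks a contiguous block of roughly N/p
 * particles; with external forces only, each rank can then integrate
 * its block with no communication at all. */
void my_block(long N, int rank, int p, long* start, long* end)
{
    long base = N / p, rem = N % p;          /* first rem ranks get one extra */
    *start = rank*base + (rank < rem ? rank : rem);
    *end   = *start + base + (rank < rem ? 1 : 0);
}
Of course, as noted above, equal particle counts need not mean equal work if some trajectories demand smaller time steps than others.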
Local forces
Simplest all-pairs check is \(O(n^2)\) (expensive)
Or only check close pairs (via binning, quadtrees?)
Communication required for pairs checked
Usual model: domain decomposition
Things get more interesting when particles interact locally.
The naive thing to do here would be to check every pair of
particles for interaction, which is O(n^2) where n is the number
of particles. But we can go much faster if we instead partition
space into bins, and only have to check for interactions between
particles in neighboring bins.
Often, we would assign big regions of space (maybe broken into
smaller bins internally) to each processor; this is a spatial
decomposition of the problem. We could also choose to partition
the problem by assigning a fixed subset of the particles to each
processor, or by assigning a subset of the particle-particle
interactions to each processor. The fastest particle
simulations out there often end up using a combination of these
approaches.
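Here is a rough sketch of the binning idea, with names and data layout of my own choosing: build a linked-cell structure with bin width at least the interaction cutoff, so each particle only needs to be checked against particles in its own bin and the eight neighboring bins.
#include <math.h>

/* Sketch: sort particles into nx-by-ny bins of width h using a
 * head/next linked-list representation.  Positions are interleaved
 * 2D points as before; layout and names are illustrative. */
void build_cells(int n, const double* x, double h, int nx, int ny,
                 int* head /* length nx*ny */, int* next /* length n */)
{
    for (int c = 0; c < nx*ny; ++c) head[c] = -1;
    for (int i = 0; i < n; ++i) {
        int bx = (int) floor(x[2*i+0]/h);
        int by = (int) floor(x[2*i+1]/h);
        if (bx < 0) bx = 0;  if (bx >= nx) bx = nx-1;   /* clamp to domain */
        if (by < 0) by = 0;  if (by >= ny) by = ny-1;
        int c = bx + nx*by;
        next[i] = head[c];   /* push particle i onto bin c's list */
        head[c] = i;
    }
}
To find interaction partners for a particle in bin (bx, by), we would then loop over the 3-by-3 block of bins around (bx, by) and walk the corresponding lists.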
Local forces: Communication
Minimize communication:
Send particles that might affect a neighbor “soon”
Trade extra computation against communication
Want low surface area-to-volume ratios on domains
If we use a spatial decomposition, we have to communicate the
position of any particles that are near the boundaries of the
domains owned by each processor. As with the Game of Life
example, we like decomposing into big blocks because they have
a low surface-area-to-volume ratio, with communication
proportional to surface area and computation proportional to
volume.
We could communicate just those particles that are within an
interaction radius of the boundary, but then we'd have to send new
updates at every time step. Instead, we often communicate particles
that are within a somewhat larger radius of the boundary. That way,
we only have to communicate once every few steps. Each message might
involve more particles, but often that's a small price to pay
for reducing the number of messages (and the associated
latencies).
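As a rough rule of thumb (my back-of-the-envelope, not from the slides): if we send particles within an extra "skin" of width \(s\) beyond the interaction radius, and particle speeds are bounded by \(v_{\max}\), then two particles can close that extra gap only after roughly
\[
  k \approx \frac{s}{2 \, v_{\max} \, \Delta t}
\]
steps, so the communicated neighbor information can be reused for about \(k\) steps before we need to exchange again.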
Local forces: Load balance
Are particles evenly distributed?
Do particles remain evenly distributed?
Can divide space unevenly (e.g. quadtree/octree)
Decomposing space into big blocks seems great when we think that
particles might be spread out evenly. But what if they are not
evenly distributed? And what if the distribution changes over
time? In our hypothetical simulation of students and faculty
swarming to the refreshments table before a colloquium, we are
likely to see a concentration around the refreshments table
after a short time, regardless of the initial distribution.
One way to deal with uneven distributions of particles is to
consider uneven partitions of space. There are classic spatial
data structures that do this, such as quadtrees (in 2D) and
octrees (in 3D). We'll talk more about these later in the course.
Far-field forces
Every particle affects every other particle
All-to-all communication required
Overlap communication with computation
Poor memory scaling if everyone keeps everything!
Idea: pass particles in a round-robin manner
Far-field forces are more complicated, because everything
affects everything else. If we don't know about simplifying
approximations for the far-field, this looks like an all-to-all
communication pattern, and that's expensive. Of course, we
could have every processor keep track of every particle, and
split up the interactions processor-by-processor; but it's not
good for memory scalability if everyone has to store
everything.
So here's an idea: what if we run our computation like we would
run a potluck lunch, and pass the dishes (or particles)?
Passing particles (far-field forces)
copy local particles to current buf
for phase = 1:p
send current buf to rank+1 (mod p)
recv next buf from rank-1 (mod p)
interact local particles with current buf
swap current buf with next buf
end
Here's that idea in pictures and pseudocode. We start off with
each processor holding a set of particles that it owns. The
processor copies those particles into a buffer that holds the
particles with which we want to compute interactions. When we
are done with one batch of particles, we send the current set
that we're interacting with on to the next processor, and
receive a new set of particles from the previous processor (in a
ring ordering). If we have some memory to spare, we can do
these data transfers at the same time that we are computing,
hiding the communication costs (at least partly). This strategy
of overlapping communication and computation is a good approach
across a lot of different types of parallel patterns.
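Here is one way the pseudocode above might look in MPI, with the overlap made explicit; the packed double buffers, the interact() callback, and the assumption that every rank holds the same number of values are simplifications of mine, not the official course interface.
#include <mpi.h>
#include <string.h>

/* Sketch of the ring ("potluck") pass: in each phase, start shipping the
 * current buffer to the next rank, compute interactions with it while the
 * message is in flight, then swap in the buffer that just arrived. */
void ring_pass(int count, const double* my_buf, double* cur, double* nxt,
               void (*interact)(const double* mine, const double* others, int count),
               MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int right = (rank+1) % p, left = (rank+p-1) % p;

    memcpy(cur, my_buf, count * sizeof(double));   /* start with our own particles */
    for (int phase = 0; phase < p; ++phase) {
        MPI_Request req[2];
        MPI_Isend(cur, count, MPI_DOUBLE, right, phase, comm, &req[0]);
        MPI_Irecv(nxt, count, MPI_DOUBLE, left,  phase, comm, &req[1]);
        interact(my_buf, cur, count);              /* overlap compute with comm */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        double* tmp = cur; cur = nxt; nxt = tmp;   /* swap current and next */
    }
}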
Passing particles (far-field forces)
Suppose \(n = N/p\) particles in buffer. At each phase \[\begin{aligned}
t_{\mathrm{comm}} & \approx \alpha + \beta n \\
t_{\mathrm{comp}} & \approx \gamma n^2
\end{aligned}\] So we can mask communication with computation if \[n \geq
\frac{1}{2\gamma} \left( \beta + \sqrt{\beta^2 + 4 \alpha \gamma} \right)
> \frac{\beta}{\gamma}\]
Let's do a little back-of-the-envelope modeling here. Suppose we
are doing our round-robin protocol with n = N/p particles in each
buffer. Each of the p phases has a message cost involving a latency
alpha and an inverse bandwidth beta per particle, plus a local
computation cost of gamma for each of the n^2 interactions we need
to accumulate. To find where the computation cost exceeds the
communication cost, we can solve a little quadratic inequality. An
unsurprising lower bound comes from dropping the latency term to
find that we need at least beta over gamma particles per processor
in order to avoid being communication bound. But, of course, things
get worse when there is a nontrivial amount of latency, as there
often is.
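Filling in the algebra that the slide skips: per phase, we want
\[
  \gamma n^2 \geq \alpha + \beta n
  \quad\Longleftrightarrow\quad
  n \geq \frac{\beta + \sqrt{\beta^2 + 4\alpha\gamma}}{2\gamma},
\]
and dropping the latency term \(\alpha\) gives the weaker requirement \(n \geq \beta/\gamma\).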
Passing particles (far-field forces)
More efficient serial code
\(\implies\) larger \(n\) needed to mask communication!
\(\implies\) worse speed-up as \(p\) gets larger (fixed \(N\))
but scaled speed-up (\(n\) fixed) remains unchanged.
One of the takeaways from our back-of-the-envelope analysis is
that the better we are at optimizing the local computation
(decreasing gamma), the more particles per processor we need to
mask communication costs! And things look even worse for strong
scaling experiments, since the number of particles per processor
decreases as we increase the number of processors. Depending on how
we set it up, things might look better for a weak scaling
experiment - that depends on whether we try to keep the same work
per processor (where the total work scales like N^2) or the same
number of particles per processor.
Far-field forces: particle-mesh methods
Consider \(r^{-2}\) electrostatic potential interaction
Enough charges looks like a continuum!
Poisson maps charge distribution to potential
Fast Poisson for regular grids (FFT, multigrid)
Approx depends on mesh and particle density
Can clean up leading part of approximation error
OK, now let's talk about what happens when we have long-range
forces where we can get some type of simplification in the far
field. This happens, for example, for electrostatic and
gravitational forces. When we have enough charges in a region,
it starts to look less like a discrete set of charges and more
like a continuous distribution. This is great, because the
electrostatic potential from a continuous distribution of
charges can be computed by solving a Poisson equation, one of
the standard classes of partial differential equations that we
know how to solve fairly quickly, using methods like fast
Fourier transforms or multigrid solvers that we will talk about
later in the semester.
Long-range forces: particle-mesh
Map particles to mesh points (multiple strategies)
Solve potential PDE on mesh
Interpolate potential to particles
Add correction term – acts like local force
The basic strategy of particle-mesh methods, then, is to
approximate the distribution of particles by "virtual" particles
sitting on a mesh, and then to compute interactions for that
distribution of virtual particles via a fast Poisson solve.
Of course, this is an approximation, and how good it is depends
some on the mesh size and the particle density. Usually, we
would improve the approximation with a local correction for
interactions between particles that are close to each other.
We know how to handle that local correction efficiently using
the types of techniques we talked about earlier in the deck
for dealing with short-range forces.
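To give a flavor of the first step, here is a minimal sketch of particle-to-mesh deposition in 1D with linear ("cloud-in-cell") weights; the 1D setting, the names, and the handling of out-of-range particles are simplifications of mine, and the fast Poisson solve itself is not shown.
#include <math.h>

/* Sketch: deposit particle charges q[i] at positions x[i] onto a uniform
 * 1D mesh with spacing h and nodes at k*h, k = 0..nmesh-1, using linear
 * cloud-in-cell weights.  rho is the resulting charge density. */
void deposit_cic(int n, const double* x, const double* q,
                 int nmesh, double h, double* rho)
{
    for (int k = 0; k < nmesh; ++k) rho[k] = 0.0;
    for (int i = 0; i < n; ++i) {
        double s = x[i]/h;
        int k = (int) floor(s);
        double w = s - k;                    /* fraction toward node k+1 */
        if (k < 0 || k+1 >= nmesh) continue; /* ignore out-of-range here  */
        rho[k]   += q[i]*(1.0-w)/h;          /* split the charge between  */
        rho[k+1] += q[i]*w/h;                /* the two neighboring nodes */
    }
}
Interpolating the mesh potential back to the particles typically uses the same weights, which helps keep the overall scheme consistent.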
Far-field forces: tree methods
Distance simplifies things
Andromeda looks like a point mass from here?
Build a tree, approx descendants at each node
Variants: Barnes-Hut, FMM, Anderson’s method
More on this later in the semester
Particle-mesh methods work pretty well when there are a lot of
particles and they are fairly evenly distributed in space. But
those aren't always valid assumptions. An alternative that
works even with less even particle distributions is the idea of
a so-called tree code. In tree codes, we organize our
approximation in boxes of increasing size, and keep a "summary"
approximation of the potential from things inside the box on
things that are sufficiently far away from the box (relative to
the box size). For example, in the solar system, we might need
to keep track of the sun and all the different planets and their
interactions; but if we care about the pull of the Andromeda
galaxy at all, we will probably approximate it as a single giant
mass at a point very far away.
There are a variety of these tree-code approximation schemes out
there, and we'll try to talk about some of them in more detail
later in the semester.
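For concreteness: in Barnes-Hut, for example, the usual way to make "sufficiently far away relative to the box size" precise is an opening-angle test of the form
\[
  \frac{s}{d} < \theta,
\]
where \(s\) is the width of the box, \(d\) is the distance from the evaluation point to the box's center of mass, and \(\theta\) is a tolerance (often around 0.5); if the test fails, we descend to the box's children instead of using its summary.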
Summary of particle example
Model: Continuous motion of particles
Could be electrons, cars, whatever...
Step through discretized time
OK, what have we talked about in this deck? We've been talking
about continuous motions of "particles" for a very loose meaning
of particles - they could be electrons, cars, or people, and the
pattern still holds. We have a set of differential equations
that govern the motion of these particles, and use some
time-stepping method to approximate the solutions of these
differential equations (where we'll talk more about
time-stepping in the Thursday deck).
Summary of particle example
Local interactions
Relatively cheap
Load balance a pain
All-pairs interactions
Obvious algorithm is expensive (\(O(n^2)\) )
Particle-mesh and tree-based algorithms help
An important special case of lumped/ODE models.
The key to making particle simulations go fast is the
organization of data and computations in order to quickly
compute the forces acting on the particles. External forces are
the easy part; the harder part is computing particle-particle
interactions. We can be pretty efficient about short-range
interactions by using appropriate spatial data structures in
order to only compute interactions that matter, but the
all-to-all computation involved in long-range forces is more
expensive. The obvious algorithms are O(n^2), but there are
alternative methods that give a very good approximation that
cost less; these include particle-mesh and tree-based
algorithms.
Particle systems are a special case of an ODE-based model. The
structure of the force interactions (which often varies over
time) is interesting enough for us to treat these as a thing on
their own. But there are some concerns in common with other
ODE-based simulations, as we will see when we pick up with our
discussion in the slides for Thursday.