Applications of Parallel Computers
Intro to Message Passing
Prof David Bindel
Plan for this week
This week: distributed memory
HW issues (topologies, cost models)
Message-passing concepts and MPI
Some simple examples
Next week: shared memory (and PGAS?)
The plan this week is to talk about message passing generally in
this deck, and how it is supported in hardware. Then we'll talk
a little bit about MPI programming in the next deck. Then we'll
talk next week about shared memory programming in OpenMP.
I know I said this last week, but I think there's some
pedagogical value to thinking about the message passing style of
programming before thinking about shared memory. That's because
the types of things you have to think about to get
high-performance shared memory codes are often the same types of
things you have to think through to get high-performance message
passing codes. It's just that with MPI, you're forced into it
in a way that you aren't always with OpenMP.
There are other types of programming models out there, too,
things that sit somewhere between traditional message-passing
and traditional shared memory. For that matter, MPI sometimes
sits somewhere between those two traditions, thanks to the
features in MPI-3! Anyhow, I hope to do something with one of
the Partitioned Global Address Space (PGAS) languages or
libraries, too.
But for today: message passing!
Basic questions
How much does a message cost?
Latency : time to get between processors
Bandwidth : data transferred per unit time
How does contention affect communication?
This is a combined hardware-software question!
We want to understand just enough for reasonable modeling.
One of the basic questions in designing message-passing code for
performance is how much a message will cost. Remember, the time
for a message in isolation is a latency cost (time to get out of
the sending process, across the network, and into the receiver)
together with a bandwidth cost (time per byte during the
transfer). So we need to understand enough about the network to
get a sense for how to think about these two terms.
Of course, life is more complicated than that! Often, there are
many messages traversing the network simultaneously, and so we
have to worry about congestion effects - we can't assume that the
effective bandwidth one processor gets is equal to the
max bandwidth the network provides. Also, there are several
pieces to latency: the time to make it across the network wires
is one part, but there's also the time to get a message through
the operating system and into the network in the first place.
And we can try to use the processor for other work while data is
in flight across the wires, though not while the CPU is busy with
the overhead of initiating the message. So the potential for
overlapping communication with computation is worth considering, too.
The point here, though, is not to turn you into networking
experts. It is to give you a rough sense of what's possible so
that you can understand how to think about modeling the
performance of your code.
Thinking about interconnects
Several features characterize an interconnect:
Topology : who do the wires connect?
Routing : how do we get from A to B?
Switching : circuits, store-and-forward?
Flow control : how do we manage limited resources?
Let's start by thinking about the network hardware: we call this
the network, or the network fabric, or the interconnect,
depending on who is doing the talking. Several features
characterize the interconnect, down at the bottom level. First,
there's the topology: how are the wires, nodes, and switches
connected to each other? Then, there are the rules for routing
the messages: do we use the same path for every packet from a
given source to a given destination, or can it change? And is
it a shortest path, or are there other policies that come into
play? We also have to decide whether we are going to set up a
circuit and communicate on it for a while - the way that the old
phone connections used to work - or if we are going to break our
messages into packets and send them in a store-and-forward
fashion, which is what happens in lots of modern networks. And
then, if there may be more traffic sometimes than the network
can gracefully handle, we have to have flow control mechanisms
that let us share the bandwidth resources of the network.
Thinking about interconnects
Links are like streets
Switches are like intersections
Hops are like blocks traveled
Routing algorithm is like a travel plan
Stop lights are like flow control
Short packets are like cars, long ones like buses?
At some point the analogy breaks down...
Let's try an analogy to the physical world, one familiar to
anyone who has seen busy traffic. Network nodes that produce
and receive traffic are like buildings. Vehicles / messages
move from one building to another via the streets / links, but
there usually isn't a direct connection from every source to
every destination. So we have intersections where streets come
together; the network analogy here is switches. Routing
algorithms are like travel plans for individual vehicles. And
flow control algorithms are like the algorithms that govern stop
lights.
Of course, it isn't a perfect analogy. But it's a pretty
reasonable one, as these things go.
Bus topology
One set of wires (the bus)
Only one processor allowed to send at any given time
Contention for the bus is an issue
Example: basic Ethernet, some SMPs
Let's start with the simplest possible case: a bus topology.
A bus is one set of wires that everyone connects to. In
general, only one processor is allowed to send on the bus at a
time, though everyone connected to the bus can see the message.
So figuring out how to deal with contention, or the case of
multiple processors that want to send simultaneously, is a big
issue with buses. In fact, it's such a big issue that we rarely
find buses with lots of nodes. The original Ethernet systems
used buses, but mostly we use switched Ethernet these days.
There are, however, buses that connect processors to memory on
most machines. There are also buses to connect processors to
peripherals: after all, USB is short for Universal Serial Bus.
Crossbar
Dedicated path from every input to every output
Takes O(p²) switches and wires!
Example: recent AMD/Intel multicore chips
(older: front-side bus)
Buses represent one extreme: potentially many nodes, and one
shared link. This is pretty good for minimizing the number of
wires we use, but not so good for sharing. The opposite extreme
is the crossbar: a dedicated channel for every
source-destination pair in the network. This completely gets
rid of the problem of contention, but the hardware gets
expensive quickly as we add processors. But a crossbar is not a
bad way to go at all when the number of processors is modest,
and this type of interconnect is pretty common inside modern
multicore processors with not too many cores.
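To put a number on that quadratic growth: a full crossbar for p = 64
nodes needs on the order of 64² = 4096 crosspoint switches, where a
bus would need just the one shared link.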
Bus vs. crossbar
Crossbar: more hardware
Bus: more contention (less capacity?)
Generally seek happy medium
Less contention than bus
Less hardware than crossbar
May give up one-hop routing
Neither buses nor crossbars are that attractive when we have a
lot of processors to connect. Crossbars require too much
hardware; buses suffer too much from contention. Other
topologies balance the contention and hardware costs in various
ways, though usually at the cost of having more complicated
routing. Or any routing, really. There's not much to say about
routing on a bus or a crossbar.
Network properties
Think about latency and bandwidth via two quantities:
Diameter : max distance between nodes
Latency depends on distance (weakly?)
Bisection bandwidth : smallest BW cut to bisect
Particularly key for all-to-all comm
We can get a long way in performance modeling by characterizing
networks by two numbers: the network diameter, and the bisection
bandwidth.
The network diameter is the maximum number of hops
between nodes, and the latency is generally largest along these
longest paths. So we can build a pessimistic model of
latency with just this one number, though it's sometimes useful
to have more details about the distances between pairs of
nodes. For some networks, the latency is dominated by the
software stacks at the endpoints anyhow; in that case, we hardly
care about these distinctions. But it depends.
The bisection bandwidth is the minimum total bandwidth we'd have to
cut in order to partition the network into two equal-size subsets of nodes.
Bisection bandwidth tells us something about the limiting
bandwidth we'd see in all-to-all communication patterns, which
is pretty much the worst-case scenario for contention.
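A concrete example: a ring of p nodes has diameter p/2, and any cut
into two equal halves severs exactly two links, so the bisection
bandwidth is twice one link's bandwidth no matter how big p gets. A
bus is even more constrained: its bisection bandwidth is just the
bandwidth of the single shared link.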
MANY networks
In a typical cluster
Ethernet, Infiniband, Myrinet
Buses within boxes?
Something between cores?
All with different behaviors.
But maybe the key thing to realize about modern parallel
machines is that there is not just one network. There
are many networks! In a typical cluster, we have a
"conventional" network like Ethernet (or Infiniband or Myrinet,
if we can afford a bit more) between boxes, but then we have a
different network - maybe a bus - connecting the different chips
on the motherboard, and an on-chip network that connects the
several cores that fit onto a single chip. And these networks
have dramatically different characteristics.
Modeling picture
DO distinguish different networks
Otherwise, want simple perf models
So: we do want to distinguish between different networks. But
otherwise, we want as simplified a model of the network as we
can get away with, something that captures the most salient
features with only a few parameters. We'll talk briefly about
two such models: the Hockney model, and the LogP model. There
are a bunch of others out there, and I've linked to a recent
survey that comments on a good 25 communication models; if you
want to read it, knock yourself out, but I won't blame you if
you skip!
All models are wrong, but some are useful. -- George E. P. Box
Of course, our models are not going to be perfect! But we
really want something that is going to help us reason about
possible parallel code organizations, not something that is
going to give exact predictions. A simple approximation is
sometimes more useful for this purpose than a more complicated model.
α-β model (Hockney 94)
Crudest model: t_comm = α + βM
t_comm = communication time
α = latency
β = inverse bandwidth
M = message size
Works pretty well for basic guidance!
Typically α ≫ β ≫ t_flop. More money on
network, lower α.
My favorite model is one you've already seen before. Formally,
this is sometimes attributed to Hockney, but it's a pretty old
idea; namely, the time to send a message is the latency plus the
message size divided by bandwidth. So it's a simple affine
function of message size; and this is often a pretty
good model! It doesn't give us any guidance about contention
effects or about the potential for overlapping communication and
computation, but it often does a reasonable job at capturing the
cost of an end-to-end message in isolation - for useful values
of "often" and "reasonable."
LogP model
Like α-β, but includes CPU time on send/recv:
Latency: the usual
Overhead: CPU time to send/recv
Gap: min time between send/recv
P: number of processors
Assumes small messages (gap ∼ β for fixed message size).
A lot of communication models build off the so-called LogP
model (also from a mid-90s paper). LogP is an acronym standing
for latency, overhead, gap, and number of processors. The gap
is something like a bandwidth cost for smallish fixed-size
messages; the real difference between LogP and the Hockney model
is that LogP distinguishes between the time that messages take
"in the system" (without intervention from the sender or
receiver processors) and the overhead time that the processors
at the send or receive ends have to spend on dealing with the
messages - making system calls, copying data between buffers,
and so forth. So this tells us a little more about the
potential for overlapping communication and computation, though
we still have nothing in the model to describe bandwidth
effects.
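One standard consequence of the model: in LogP terms, delivering a
single small message from one processor to another costs o + L + o
end to end - overhead on the sending side, latency across the
network, and overhead on the receiving side - and only the L in the
middle is time the processors could spend doing something else.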
And many others
Recent survey lists
25 models!
More complexity, more parameters
Most miss some things (see Box quote)
Still useful for guidance!
Needs to go with experiments
You can draw a line through two points, and you probably have a
sense of the direction that this line is going. The past 25
years have seen a small but steady line of work on communication
performance models that capture more and more about the network
behavior.
My own preference is to look for the simplest model that helps
with making design decisions, and then to back that up with
experiments to reveal when we are starting to run into effects
like contention that might not be well described by our models.
Ping-pong
Process 0:
  for i = 1:ntrials
    send b bytes to 1
    recv b bytes from 1
  end

Process 1:
  for i = 1:ntrials
    recv b bytes from 0
    send b bytes to 0
  end
One of the simplest experiments out there for understanding
network performance is a so-called ping-pong test. The idea is
that we just send a message back and forth between a pair of
processors, and then repeat for different message sizes. If we
want to get, for example, a linear model fit for the time to
send a message of a given size, we can just run ordinary least
squares on the data.
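To make this concrete, here is a minimal sketch of a ping-pong timer
in C with MPI - not the actual code in the demos/ping directory, just
an illustration with a hypothetical fixed message size b. A real
benchmark would sweep b over a range of sizes and fit α and β to the
resulting times.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int rank, ntrials = 1000, b = 1024;  /* hypothetical defaults */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char* buf = (char*) malloc(b);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ntrials; ++i) {
        if (rank == 0) {        /* ping: send, then wait for the echo */
            MPI_Send(buf, b, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, b, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) { /* pong: receive, then echo back */
            MPI_Recv(buf, b, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, b, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)  /* each trial is two messages, so divide by 2*ntrials */
        printf("%d bytes: %g us per message\n", b,
               1e6*(t1-t0)/(2*ntrials));
    free(buf);
    MPI_Finalize();
    return 0;
}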
Laptop experiment
Open MPI on dual-core MacBook Air
This all gets much more exciting when we do it with machines
that you actually use, so let's start with a ping-pong
experiment on my laptop. I have a MacBook Air with two cores.
The code for this measurement (and the Python script to produce
the plot) are in the demos/ping subdirectory; I recommend trying
this out on your own machine!
Laptop ping-pong times
α = 0.485 microseconds; β = 0.215 s/GB
A linear model actually does a pretty good job of describing the
ping-pong timings on my laptop! Note that even here, I pay
almost half a microsecond for message latency (where, of course,
I am lumping overhead into the latency). We computed that my
laptop has a peak of about 25.6 GFlop/s, so we could get
about 12.4 Kflops of work done in one message latency if we're
running close to peak. This is more than an order of magnitude
worse than the worst memory access cost I saw on this machine in
the strided access memory benchmark we tried out a little while
ago. The punchline to all of this: latencies are expensive!
Graphite experiment
Open MPI on old Totient (now Graphite)
Two six-core chips per node, eight nodes
Heterogeneous network
Crossbar between cores (?)
Bus between chips
Gig-E between nodes
Now let's try the same experiment on something a bit more
complicated. One of the computing resources we will have for
this class is the Graphite cluster, and particularly the nodes
that used to make up the Totient cluster that Intel gave us for
teaching 5220. There are eight nodes, each with two six-core
chips. So we really have three different networks in this one
system: the on-chip network between cores, which I believe is a
crossbar; a bus that connects the chips and main memory; and
Ethernet between the nodes. We want to run the experiment on
each of these networks.
Layout (P=24)
With --map-by core and --bind-to core
P 0-5: First chip, first node
P 6-11: Second chip, first node
P 12-17: First chip, second node
P 18-23: Second chip, second node
We're going to want to be a little careful here, because an MPI
process is not identical to a CPU core. We have to tell
the system that we want the MPI processes to be bound to specific
cores, and that we want to map nearby cores to adjacent MPI ranks
(i.e. process numbers) as much as possible. If we do that, we get
that 24 cores will span two different boxes, and we can probe all
three networks by looking at messages from rank 0 to rank 1, 11,
and 23, respectively.
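For reference, a typical Open MPI invocation along these lines (the
executable name here is made up) would be

  mpirun -np 24 --map-by core --bind-to core ./pingpong

With that binding in place, ranks 0 and 1 share a chip, ranks 0 and
11 share a node but not a chip, and ranks 0 and 23 sit on different
nodes.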
On chip (0-1)
α = 0.849 microseconds; β = 0.299 s/GB
The on-chip network here is a little slower than the one on my
laptop - worse latency and worse bandwidth, both. The
latency is the better part of a microsecond in this case.
Cross chip (0-11)
α = 2.29 microseconds; β = 1.15 s/GB
When we go off-chip - but stay on the same box - the latency
is more than double the on-chip case, and the bandwidth is less
than half. This isn't an order-of-magnitude difference, but
it's big enough to be worth noticing.
Cross node (0-23)
α = 63.1 microseconds; β = 1.99 s/GB
What about when we go across nodes? Well - Ethernet is more
complicated, and a simple linear model doesn't fit anywhere near as
well. We can certainly see, though, that while the bandwidth
isn't that terrible, the latency really is. We're now up to
many tens of microseconds.
OK, the latency for nerve signals in the human body is something like
tens of milliseconds in the best case - top speed is around 200
miles per hour in certain types of nerve fibers, and I'm about
six feet, so that gives me 20 milliseconds. From that
perspective, tens of microseconds doesn't seem so bad. But,
again: you can do a lot of flops in that amount of time!
Takeaways
Prefer larger to smaller messages (amortize latency,
overhead)
More care with slower networks
Avoid communication when possible
Great speedup for Monte Carlo and other embarrassingly parallel codes!
Overlap communication with computation
Models tell us roughly how much computation masks comm
Really want to measure, though
All right, what should you take away from this?
First, message latencies are big, particularly for networks like
Ethernet. If we don't want all our time to go to latency, we
had better not send too many messages. As we discussed when
talking about PDE solvers, a few bigger messages might be
preferable to a larger number of smaller messages, even when we
might need some redundant work in the former case.
We can make pleasingly parallel codes fast because we don't need
much communication, but we aren't always that lucky.
Second, we'd really like to overlap communication with
computation when we can. A fairly rough model can tell us about
how much work we have to have between messages in order to
mask as much of the message cost as we possibly can - remember,
there may be overhead costs that we just can't overlap. Even if
we believe our models - and it isn't always wise to take them
too seriously - an experiment or two is often worthwhile to try
to sort out these types of effects.
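As a sketch of what that overlap looks like in MPI code - the
function and variable names here are placeholders, not part of any
particular application - nonblocking operations let us post a
transfer, compute, and only then wait:

#include <mpi.h>

/* Hypothetical application hooks, declared but not defined here. */
void do_interior_work(void);
void do_boundary_work(double* ghost);

void exchange_and_work(double* ghost, int n, int nbr, MPI_Comm comm)
{
    MPI_Request req;
    /* Post the receive early so the transfer proceeds in the background */
    MPI_Irecv(ghost, n, MPI_DOUBLE, nbr, 0, comm, &req);
    do_interior_work();                 /* compute on data we already have */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* block only when we need the data */
    do_boundary_work(ghost);            /* now safe to use the received data */
}

Even here, the overhead portion of the message cost stays on the CPU
no matter what, so the overlap is never perfect.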