Applications of Parallel Computers
Intro to Message Passing
Prof David Bindel
Plan for this week
- This week: distributed memory
- Hardware issues (topologies, cost models)
- Message-passing concepts and MPI
- Some simple examples
- Next week: shared memory (and PGAS?)
Basic questions
How much does a message cost?
- Latency: time for a message to get between processors
- Bandwidth: data transferred per unit time
- How does contention affect communication?
This is a combined hardware-software question!
We want to understand just enough for reasonable modeling.
Thinking about interconnects
Several features characterize an interconnect:
- Topology: who do the wires connect?
- Routing: how do we get from A to B?
- Switching: circuit switching or store-and-forward?
- Flow control: how do we manage limited resources?
Thinking about interconnects
- Links are like streets
- Switches are like intersections
- Hops are like blocks traveled
- Routing algorithm is like a travel plan
- Stop lights are like flow control
- Short packets are like cars, long ones like buses?
At some point the analogy breaks down...
Bus topology
- One set of wires (the bus)
- Only one processor may use the bus at a time
- Contention for the bus is an issue
- Example: basic Ethernet, some SMPs
Crossbar
- Dedicated path from every input to every output
- Takes $O(p^2)$ switches and wires!
- Example: recent AMD/Intel multicore chips (older: front-side bus)
Bus vs. crossbar
- Crossbar: more hardware
- Bus: more contention (less capacity?)
- Generally seek happy medium:
  - Less contention than bus
  - Less hardware than crossbar
  - May give up one-hop routing
Network properties
Think about latency and bandwidth via two quantities:
- Diameter: maximum distance between nodes
  - Latency depends on distance (weakly?)
- Bisection bandwidth: minimum bandwidth over all cuts that split the network in half
  - Particularly key for all-to-all communication
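For example (standard results for common topologies, not specific to any machine here): a ring of $p$ nodes has diameter $\lfloor p/2 \rfloor$ but a bisection of only two links, while a hypercube on $p = 2^d$ nodes has diameter $d$ and a bisection of $p/2$ links.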
MANY networks
In a typical cluster
- Ethernet, InfiniBand, Myrinet
- Buses within boxes?
- Something between cores?
All with different behaviors.
Modeling picture
- DO distinguish different networks
- Otherwise, want simple performance models
All models are wrong, but some are useful.
-- George E. P. Box
$\alpha$-$\beta$ model (Hockney 94)
Crudest model: $t_{\mathrm{comm}} = \alpha + \beta M$
- $t_{\mathrm{comm}} =$ communication time
- $\alpha =$ latency
- $\beta =$ inverse bandwidth
- $M =$ message size
Works pretty well for basic guidance!
Typically $\alpha \gg \beta \gg t_{\mathrm{flop}}$. Spending more money on the network mainly buys a lower $\alpha$.
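To make the arithmetic concrete, here is a minimal sketch of the model in C; the $\alpha$ and $\beta$ values are placeholders, to be replaced by measured fits like those later in this deck:

#include <stdio.h>

/* Predicted time (seconds) to send M bytes under the alpha-beta model */
double t_comm(double alpha, double beta, double M)
{
    return alpha + beta*M;
}

int main(void)
{
    double alpha = 1e-6;  /* placeholder latency: 1 microsecond */
    double beta  = 1e-9;  /* placeholder inverse bandwidth: 1 s/GB */

    /* Latency dominates below roughly M = alpha/beta bytes */
    printf("crossover = %g bytes\n", alpha/beta);
    printf("t(1 MB)   = %g us\n", 1e6 * t_comm(alpha, beta, 1e6));
    return 0;
}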
LogP model
Like $\alpha$-$\beta$, but includes CPU time on send/recv:
- Latency: the usual
- Overhead: CPU time to send/recv
- Gap: minimum time between successive sends (or receives)
- P: number of processors
Assumes small messages (gap $\sim \beta$ for fixed message size).
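For example (standard LogP accounting, not from the slides): a single small message takes about $L + 2o$ end to end (send overhead, wire latency, receive overhead), and a processor can inject a new message only every $g$ time units.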
And many others
A recent survey lists 25 models!
- More complexity, more parameters
- Most miss some things (see Box quote)
- Still useful for guidance!
- Needs to go with experiments
Ping-pong
Process 0:
  for i = 1:ntrials
    send b bytes to 1
    recv b bytes from 1
  end

Process 1:
  for i = 1:ntrials
    recv b bytes from 0
    send b bytes to 0
  end
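A runnable MPI version of this pseudocode might look like the sketch below; the message size b and trial count ntrials are placeholders, and a real benchmark would sweep over message sizes:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int rank, b = 1024, ntrials = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char* buf = malloc(b);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ntrials; ++i) {
        if (rank == 0) {        /* Process 0: send, then recv */
            MPI_Send(buf, b, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, b, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) { /* Process 1: recv, then send */
            MPI_Recv(buf, b, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, b, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)  /* one-way time is half the round trip */
        printf("%d bytes: %g us one-way\n",
               b, (t1-t0)/ntrials/2*1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}

Run with something like mpirun -np 2 ./pingpong; fitting the times over a range of b gives $\alpha$ and $\beta$ estimates like those on the next slides.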
Laptop experiment
Open MPI on dual-core MacBook Air
Laptop ping-pong times
$\alpha = 0.485$ microseconds; $\beta = 0.215$ s/GB
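Sanity check with these fits: a 1 MB message should cost about $0.485\,\mu\mathrm{s} + (0.215\ \mathrm{s/GB})(10^{-3}\ \mathrm{GB}) \approx 215\,\mu\mathrm{s}$, so bandwidth dominates; latency matters mainly below $\alpha/\beta \approx 2.3$ KB.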
Graphite experiment
Open MPI on old totient (now Graphite)
- Two six-core chips per node, eight nodes
- Heterogeneous network:
  - Crossbar between cores (?)
  - Bus between chips
  - Gig-E between nodes
Layout (P=24)
With --map-by core and --bind-to core
- P 0-5: First chip, first node
- P 6-11: Second chip, first node
- P 12-17: First chip, second node
- P 18-23: Second chip, second node
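That layout comes from a launch line along these lines (the executable name is a placeholder):

mpirun -np 24 --map-by core --bind-to core ./pingpong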
On chip (0-1)
$\alpha = 0.849$ microseconds; $\beta = 0.299$ s/GB
Cross chip (0-11)
$\alpha = 2.29$ microseconds; $\beta = 1.15$ s/GB
Cross node (0-23)
$\alpha = 63.1$ microseconds; $\beta = 1.99$ s/GB
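Note the spread across levels: cross-node $\alpha$ is roughly $70\times$ the on-chip $\alpha$, while $\beta$ differs by less than a factor of 7, so slower networks punish small messages hardest.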
Takeaways
- Prefer larger to smaller messages (amortize latency and overhead)
- More care with slower networks
- Avoid communication when possible
  - Great speedup for Monte Carlo and other embarrassingly parallel codes!
- Overlap communication with computation (see the sketch after this list)
  - Models tell us roughly how much computation is needed to mask communication
  - Really want to measure, though
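One standard pattern for overlap is nonblocking sends and receives; below is a minimal sketch (a ring exchange with stand-in local work, not code from the slides):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int rank, size, n = 1 << 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int next = (rank+1) % size, prev = (rank+size-1) % size;
    double* sbuf = calloc(n, sizeof(double));
    double* rbuf = calloc(n, sizeof(double));

    /* Start the exchange, but do not wait for it yet */
    MPI_Request reqs[2];
    MPI_Irecv(rbuf, n, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sbuf, n, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent local work proceeds while messages are in flight */
    double sum = 0;
    for (int i = 0; i < n; ++i)
        sum += 1.0/(1.0 + i);

    /* rbuf is valid (and sbuf reusable) only after the wait */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    if (rank == 0)
        printf("local work result: %g\n", sum);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Whether this actually hides the communication depends on the MPI implementation making progress in the background; as the last bullet says, measure.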