CS 5220

Applications of Parallel Computers

Intro to Message Passing

Prof David Bindel


Plan for this week

  • This week: distributed memory
    • HW issues (topologies, cost models)
    • Message-passing concepts and MPI
    • Some simple examples
  • Next week: shared memory (and PGAS?)

Basic questions

How much does a message cost?

  • Latency: time for a message to get between processors
  • Bandwidth: data transferred per unit time
  • How does contention affect communication?

This is a combined hardware-software question!

We want to understand just enough for reasonable modeling.

Thinking about interconnects

Several features characterize an interconnect:

  • Topology: who do the wires connect?
  • Routing: how do we get from A to B?
  • Switching: circuits, store-and-forward?
  • Flow control: how do we manage limited resources?

Thinking about interconnects

  • Links are like streets
  • Switches are like intersections
  • Hops are like blocks traveled
  • Routing algorithm is like a travel plan
  • Stop lights are like flow control
  • Short packets are like cars, long ones like buses?

At some point the analogy breaks down...

Bus topology

Diagram of bus topology
  • One set of wires (the bus)
  • Only one processor may use the bus at any given time
    • Contention for the bus is an issue
  • Example: basic Ethernet, some SMPs

Crossbar

Diagram of crossbar topology
  • Dedicated path from every input to every output
    • Takes $O(p^2)$ switches and wires!
  • Example: recent AMD/Intel multicore chips
    (older: front-side bus)
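
For example, a full crossbar connecting $p = 64$ ports needs on the order of $64^2 = 4096$ crosspoint switches.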

Bus vs. crossbar

  • Crossbar: more hardware
  • Bus: more contention (less capacity?)
  • Generally seek happy medium
    • Less contention than bus
    • Less hardware than crossbar
    • May give up one-hop routing

Other topologies

Diagrams of linear, ring, mesh, torus, hypercube, and fat tree topologies

Network properties

Think about latency and bandwidth via two quantities:

  • Diameter: max distance between nodes
    • Latency depends on distance (weakly?)
  • Bisection bandwidth: minimum bandwidth over any cut that bisects the network
    • Particularly key for all-to-all comm
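
For example, a bidirectional ring of $p$ nodes has diameter $\lfloor p/2 \rfloor$ but a bisection of only two links, so nearest-neighbor traffic is cheap while all-to-all traffic is not.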

MANY networks

In a typical cluster

  • Ethernet, Infiniband, Myrinet
  • Buses within boxes?
  • Something between cores?

All with different behaviors.

Modeling picture

All models are wrong, but some are useful.
-- George E. P. Box

$\alpha$-$\beta$ model (Hockney 94)

Crudest model: $t_{\mathrm{comm}} = \alpha + \beta M$

  • $t_{\mathrm{comm}} =$ communication time
  • $\alpha =$ latency
  • $\beta =$ inverse bandwidth
  • $M =$ message size

Works pretty well for basic guidance!

Typically $\alpha \gg \beta \gg t_{\mathrm{flop}}$. More money on network, lower $\alpha$.
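
The model is easy to turn into a back-of-the-envelope estimator. A minimal C sketch, using the laptop measurements reported later as example parameters and assuming 1 GB $= 10^9$ bytes when converting $\beta$:

#include <stdio.h>

/* alpha-beta estimate of point-to-point message time:
   alpha in seconds, beta in seconds/byte, m in bytes */
double t_comm(double alpha, double beta, double m)
{
    return alpha + beta*m;
}

int main(void)
{
    double alpha = 0.485e-6;  /* 0.485 microseconds (laptop measurement below) */
    double beta  = 0.215e-9;  /* 0.215 s/GB = 0.215e-9 s/byte */
    for (int k = 0; k <= 20; k += 4)
        printf("M = %7d bytes: t_comm ~ %.3g s\n",
               1 << k, t_comm(alpha, beta, (double) (1 << k)));
    return 0;
}

For these numbers the two terms balance at roughly $M = \alpha/\beta \approx 2$ KB; below that, latency dominates.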

LogP model

Like $\alpha$-$\beta$, but includes CPU time on send/recv:

  • Latency: the usual
  • Overhead: CPU time to send/recv
  • Gap: min time between send/recv
  • P: number of processors

Assumes small messages (gap $\sim \beta$ for fixed message size).
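
As a rough sketch of how the parameters combine, here are the standard LogP estimates: $L + 2o$ for a single small message, and gap-limited pipelining for a burst of messages (assuming $g \ge o$).

/* Rough LogP estimates; all times in seconds.
   L = network latency, o = CPU overhead per send or recv, g = gap */
double logp_one_msg(double L, double o)
{
    return o + L + o;  /* send overhead, wire time, receive overhead */
}

double logp_burst(double L, double o, double g, int k)
{
    /* k back-to-back small messages from one sender, assuming g >= o:
       the last message is injected by (k-1)*g + o and lands L + o later */
    return (k - 1)*g + o + L + o;
}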

And many others

Recent survey lists 25 models!

  • More complexity, more parameters
  • Most miss some things (see Box quote)
  • Still useful for guidance!
  • Needs to go with experiments

Ping-pong

Process 0:

for i = 1:ntrials
  send b bytes to 1
  recv b bytes from 1
end

Process 1:

for i = 1:ntrials
  recv b bytes from 0
  send b bytes to 0
end
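
In MPI, the same ping-pong pattern might look like the sketch below; the message size, trial count, and output here are illustrative choices, not the exact benchmark behind the timings that follow.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Ping-pong between ranks 0 and 1; run with at least two ranks. */
int main(int argc, char** argv)
{
    int rank, b = 1024, ntrials = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char* buf = malloc(b);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ntrials; ++i) {
        if (rank == 0) {         /* Process 0: send, then recv */
            MPI_Send(buf, b, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, b, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {  /* Process 1: recv, then send */
            MPI_Recv(buf, b, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, b, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %g s per round trip\n", b, (t1-t0)/ntrials);

    free(buf);
    MPI_Finalize();
    return 0;
}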

Laptop experiment

Open MPI on dual-core MacBook Air

Laptop ping-pong times

$\alpha = 0.485$ microseconds; $\beta = 0.215$ s/GB

Timing of 0-1 pings
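
The $\alpha$ and $\beta$ values reported on these slides presumably come from fitting ping-pong timings to the $\alpha$-$\beta$ model; one simple way to do such a fit is ordinary least squares, sketched below (the data arrays stand in for measured sizes and times).

/* Fit t ~ alpha + beta*m by ordinary least squares.
   msize[i] in bytes, t[i] in seconds, n points. */
void fit_alpha_beta(int n, const double* msize, const double* t,
                    double* alpha, double* beta)
{
    double sm = 0, st = 0, smm = 0, smt = 0;
    for (int i = 0; i < n; ++i) {
        sm  += msize[i];
        st  += t[i];
        smm += msize[i]*msize[i];
        smt += msize[i]*t[i];
    }
    *beta  = (n*smt - sm*st) / (n*smm - sm*sm);
    *alpha = (st - (*beta)*sm) / n;
}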

Graphite experiment

Open MPI on old totient (now Graphite)

  • Two six-core chips per node, eight nodes
  • Heterogeneous network
    • Crossbar between cores (?)
    • Bus between chips
    • Gig-E between nodes

Layout (P=24)

With --map-by core and --bind-to core

  • P 0-5: First chip, first node
  • P 6-11: Second chip, first node
  • P 12-17: First chip, second node
  • P 18-23: Second chip, second node
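
Concretely, such a run might be launched with something like the line below (the executable name is hypothetical; the mapping and binding flags are Open MPI's):

mpirun -np 24 --map-by core --bind-to core ./pingpong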

On chip (0-1)

$\alpha = 0.849$ microseconds; $\beta = 0.299$ s/GB

Timing of 0-1 pings

Cross chip (0-11)

$\alpha = 2.29$ microseconds; $\beta = 1.15$ s/GB

Timing of 0-11 pings

Cross node (0-23)

$\alpha = 63.1$ microseconds; $\beta = 1.99$ s/GB

Timing of 0-23 pings

Takeaways

  • Prefer larger to smaller messages (amortize latency, overhead)
  • More care with slower networks
  • Avoid communication when possible
    • Great speedup for Monte Carlo and other embarrassingly parallel codes!
  • Overlap communication with computation (see the sketch below)
    • Models tell us roughly how much computation is needed to mask communication
    • Really want to measure, though
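
A minimal sketch of the overlap pattern using MPI nonblocking calls; the buffers, counts, neighbor ranks, and local_work callback are illustrative placeholders, and MPI is assumed to be initialized already.

#include <mpi.h>

/* Start nonblocking sends/recvs, do independent computation, then wait.
   local_work must not touch either buffer until MPI_Waitall returns. */
void exchange_and_compute(double* sendbuf, double* recvbuf, int n,
                          int dest, int src, void (*local_work)(void))
{
    MPI_Request reqs[2];
    MPI_Isend(sendbuf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, src,  0, MPI_COMM_WORLD, &reqs[1]);
    local_work();  /* computation overlapped with messages in flight */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

Whether the overlap actually hides the communication depends on the MPI implementation making progress during local_work, which is exactly the kind of thing to measure.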