CS 5220

Distributed memory

David Bindel

2024-09-24

Logistics

Matrix multiply

  • Deadline extended to Oct 1
  • Please use c4-standard-2 instances
  • I also recommend the Intel compilers
  • MKL gets 80-100 GFlop/s

Scoring

  • Performance: Linearly interpolate
    • 10 GFlop/s = 0 points (baseline)
    • 60 GFlop/s = 5 points (max)
  • Writeup (5 points)
    • Describe strategy
    • What you tried
    • What worked or didn’t work
    • Analysis of improvements
    • Performance plots

Advice

  • Intel compiler vectorization guidelines
  • ICX align: __declspec(align(16, 8)) static double Abuf[BUF_SIZE];
  • Report with -qopt-report
  • OK with -fp-model fast=2
  • Recommend -march=emeraldrapids

More advice

  • Start with a small kernel working on aligned memory
  • Get advice from the compiler (-qopt-report)
  • Build bigger matmul by copying a tile in, doing kernel matmuls, and accumulating the result out (see the sketch after this list)
  • Build timing harnesses for things
  • Use Intel Advisor and any other tools you can find!
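
A rough (untuned) sketch of this structure, assuming a column-major layout, an n that is a multiple of the block size, GCC-style alignment attributes, and illustrative names (BS, kernel_dgemm, square_dgemm):

#define BS 32   /* assumed kernel block size; tune for your cache */

/* Simple BS-by-BS kernel: C += A*B on contiguous column-major tiles.
   This inner loop nest is the piece to hand-tune and vectorize. */
static void kernel_dgemm(const double* restrict A,
                         const double* restrict B,
                         double* restrict C)
{
    for (int j = 0; j < BS; ++j)
        for (int k = 0; k < BS; ++k)
            for (int i = 0; i < BS; ++i)
                C[i+BS*j] += A[i+BS*k] * B[k+BS*j];
}

/* Blocked matmul: copy tiles into aligned buffers, run the kernel,
   write the finished C tile back out.  Assumes n % BS == 0. */
void square_dgemm(int n, const double* A, const double* B, double* C)
{
    static double Abuf[BS*BS] __attribute__((aligned(64)));
    static double Bbuf[BS*BS] __attribute__((aligned(64)));
    static double Cbuf[BS*BS] __attribute__((aligned(64)));
    for (int j = 0; j < n; j += BS)
        for (int i = 0; i < n; i += BS) {
            for (int jj = 0; jj < BS; ++jj)       /* copy C tile in */
                for (int ii = 0; ii < BS; ++ii)
                    Cbuf[ii+BS*jj] = C[(i+ii)+n*(j+jj)];
            for (int k = 0; k < n; k += BS) {
                for (int kk = 0; kk < BS; ++kk)   /* copy A tile in */
                    for (int ii = 0; ii < BS; ++ii)
                        Abuf[ii+BS*kk] = A[(i+ii)+n*(k+kk)];
                for (int jj = 0; jj < BS; ++jj)   /* copy B tile in */
                    for (int kk = 0; kk < BS; ++kk)
                        Bbuf[kk+BS*jj] = B[(k+kk)+n*(j+jj)];
                kernel_dgemm(Abuf, Bbuf, Cbuf);   /* accumulate into Cbuf */
            }
            for (int jj = 0; jj < BS; ++jj)       /* write C tile back */
                for (int ii = 0; ii < BS; ++ii)
                    C[(i+ii)+n*(j+jj)] = Cbuf[ii+BS*jj];
        }
}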

HW1

  • Also due Oct 1
  • Main point: get to know Perlmutter!
  • Also write just a little MPI (telephone)
  • This is an individual assignment

Distributed memory

Plan for this week

  • This week: distributed memory
    • HW issues (topologies, cost models)
    • Message-passing concepts and MPI
    • Some simple examples
  • Next week: shared memory

Basic questions

How much does a message cost?

  • Latency: time to get between processors
  • Bandwidth: data transferred per unit time
  • How does contention affect communication?

This is a combined hardware-software question!

We want to understand just enough for reasonable modeling.

Thinking about interconnects

Several features characterize an interconnect:

  • Topology: who do the wires connect?
  • Routing: how do we get from A to B?
  • Switching: circuits, store-and-forward?
  • Flow control: how do we manage limited resources?

Thinking about interconnects

  • Links are like streets
  • Switches are like intersections
  • Hops are like blocks traveled
  • Routing algorithm is like a travel plan
  • Stop lights are like flow control
  • Short packets are like cars, long ones like buses?

At some point the analogy breaks down…

Bus topology

  • One set of wires (the bus)
  • Only one processor may use the bus at a time
    • Contention for the bus is an issue
  • Example: basic Ethernet, some SMPs

Crossbar

  • Dedicated path from every input to every output
    • Takes \(O(p^2)\) switches and wires!
  • Example: recent AMD/Intel multicore chips
    (older: front-side bus)

Bus vs. crossbar

  • Crossbar: more hardware
  • Bus: more contention (less capacity?)
  • Generally seek happy medium
    • Less contention than bus
    • Less hardware than crossbar
    • May give up one-hop routing

Other topologies

Network properties

Think about latency and bandwidth via two quantities:

  • Diameter: max distance between nodes
    • Latency depends on distance (weakly?)
  • Bisection bandwidth: smallest BW cut to bisect
    • Particularly key for all-to-all comm

MANY networks

In a typical cluster

  • Ethernet, Infiniband, Myrinet
  • Buses within boxes?
  • Something between cores?

All with different behaviors.

Modeling picture

\(\alpha\)-\(\beta\) model (Hockney 94)

Crudest model: \(t_{\mathrm{comm}} = \alpha + \beta M\)

  • Communication time \(t_{\mathrm{comm}}\)
  • Latency \(\alpha\)
  • Inverse bandwidth \(\beta\)
  • Message size \(M\)

Works pretty well for basic guidance!

Typically \(\alpha \gg \beta \gg t_{\mathrm{flop}}\). More money on network, lower \(\alpha\).
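
For example, with the single-node Perlmutter fit shown later (\(\alpha \approx 0.365\) microseconds, \(\beta \approx 0.0189\) s/GB): an 8-byte message costs about \(0.37\) microseconds (almost pure latency), a 1 MB message costs about \(0.37 + 18.9 \approx 19\) microseconds (almost pure bandwidth), and the two terms balance near \(M = \alpha/\beta \approx 19\) KB.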

LogP model

Like \(\alpha\)-\(\beta\), but includes CPU time on send/recv:

  • Latency: the usual
  • Overhead: CPU time to send/recv
  • Gap: min time between send/recv
  • P: number of processors

Assumes small messages (gap \(\sim \beta\) for fixed message size).
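
A sketch of the usual back-of-envelope accounting with these parameters: one small message costs roughly \(o + L + o = L + 2o\) from send call to receive completion, and since a processor can inject at most one message per gap \(g\), a burst of \(n\) messages costs roughly \(L + 2o + (n-1)g\).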

And many others

Recent survey lists 25 models!

  • More complexity, more parameters
  • Most miss some things (see Box quote)
  • Still useful for guidance!
  • Needs to go with experiments

Ping-pong

Process 0:

for i = 1:ntrials
  send b bytes to 1
  recv b bytes from 1
end

Process 1:

for i = 1:ntrials
  recv b bytes from 0
  send b bytes to 0
end

Laptop ping-pong times

\(\alpha = 0.240\) microseconds; \(\beta = 0.141\) s/GB

Perlmutter ping-pong times

\(\alpha = 0.365\) microseconds; \(\beta = 0.0189\) s/GB

This is on one node.

Takeaways

  • Prefer larger to smaller messages (amortize latency, overhead)
  • More care with slower networks
  • Avoid communication when possible
    • Great speedup for Monte Carlo and other embarrassingly parallel codes!
  • Overlap communication with computation
    • Models tell roughly how much computation is needed to mask comm
    • Really want to measure, though

MPI programming

Message passing programming

Basic operations:

  • Pairwise messaging: send/receive
  • Collective messaging: broadcast, scatter/gather
  • Collective computation: parallel prefix (sum, max)
  • Barriers (no need for locks!)
  • Environmental inquiries (who am I? do I have mail?)

(Much of what follows is adapted from Bill Gropp.)

MPI

  • Message Passing Interface
  • An interface spec — many implementations (OpenMPI, MPICH, MVAPICH, Intel, …)
  • Single Program Multiple Data (SPMD) style
  • Bindings to C, C++, Fortran

MPI

  • Versions
    • 1.0 in 1994
    • 2.2 in 2009
    • 3.0 in 2012
    • 4.1 in 2023
    • 4.2, 5.0 are pending
  • Will mostly stick to MPI-2 today

Hello world

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Building, queueing, running

Several steps to actually run

cc -o foo.x foo.c      # Compile the program
sbatch foo.sub         # Submit to queue (SLURM)
# srun -n 2 ./foo.x    # (in foo.sub) Run on 2 procs

Sending and receiving

Need to specify:

  • What’s the data?
    • Different machines use different encodings (e.g. endian-ness)
    • \(\implies\) “bag o’ bytes” model is inadequate
  • How do we identify processes?
  • How does receiver identify messages?
  • What does it mean to “complete” a send/recv?

MPI datatypes

Message is (address, count, datatype). Allow:

  • Basic types (MPI_INT, MPI_DOUBLE)
  • Contiguous arrays
  • Strided arrays (see the sketch below)
  • Indexed arrays
  • Arbitrary structures

Complex data types may hurt performance?
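
A minimal sketch of the strided case: sending row i of a column-major n-by-n double matrix A (A, n, i, dest, and tag assumed in scope).

/* Row i of a column-major n-by-n matrix: n doubles with stride n */
MPI_Datatype row_type;
MPI_Type_vector(n, 1, n, MPI_DOUBLE, &row_type);
MPI_Type_commit(&row_type);
MPI_Send(&A[i], 1, row_type, dest, tag, MPI_COMM_WORLD);
MPI_Type_free(&row_type);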

MPI tags

Use an integer tag to label messages

  • Help distinguish different message types
  • Can screen messages with wrong tag
  • MPI_ANY_TAG is a wildcard

MPI Send/Recv

Basic blocking point-to-point communication:

int
MPI_Send(void *buf, int count,
         MPI_Datatype datatype,
         int dest, int tag, MPI_Comm comm);

int
MPI_Recv(void *buf, int count,
         MPI_Datatype datatype,
         int source, int tag, MPI_Comm comm,
         MPI_Status *status);

MPI send/recv semantics

  • Send returns when data gets to system
    • … might not yet arrive at destination!
  • Recv ignores messages that mismatch source/tag
    • MPI_ANY_SOURCE and MPI_ANY_TAG wildcards
  • Recv status contains more info (tag, source, size)
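
A sketch of reading that information back (buf and MAX_LEN assumed):

MPI_Status status;
int count;
MPI_Recv(buf, MAX_LEN, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &count);  /* actual message size */
printf("Got %d doubles from %d (tag %d)\n",
       count, status.MPI_SOURCE, status.MPI_TAG);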

Ping-pong pseudocode

Process 0:

for i = 1:ntrials
  send b bytes to 1
  recv b bytes from 1
end

Process 1:

for i = 1:ntrials
  recv b bytes from 0
  send b bytes to 0
end

Ping-pong MPI

void ping(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Send(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

(Pong is similar)
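
For completeness, a minimal sketch of the matching pong (just the mirror image):

void pong(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Recv(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD);
    }
}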

Ping-pong MPI

for (int sz = 1; sz <= MAX_SZ; sz += 1000) {
    if (rank == 0) {
        double t1 = MPI_Wtime();
        ping(buf, sz, NTRIALS, 1);
        double t2 = MPI_Wtime();
        printf("%d %g\n", sz, t2-t1);
    } else if (rank == 1) {
        pong(buf, sz, NTRIALS, 0);
    }
}

Blocking and buffering

Block until data “in system” — maybe in a buffer?

Blocking and buffering

Alternative: don’t copy, block until done.

Problem 1: Potential deadlock

Both processors wait to send before they receive!
May not happen if lots of buffering on both sides.
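
A sketch of the hazardous pattern, with both ranks running the same SPMD code (sendbuf, recvbuf, n, other assumed):

/* Both ranks send first; if neither send is buffered,
   neither rank ever reaches its MPI_Recv. */
MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
         MPI_STATUS_IGNORE);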

Solution 1: Alternating order

Could alternate who sends and who receives.
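
A sketch of that fix for the same exchange: order the calls by rank parity so one side always receives first.

if (rank % 2 == 0) {  /* even ranks send first ... */
    MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
} else {              /* ... odd ranks receive first */
    MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
}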

Solution 2: Combined send/recv

Common operations deserve explicit support!

Combined sendrecv

MPI_Sendrecv(sendbuf, sendcount, sendtype,
             dest, sendtag,
             recvbuf, recvcount, recvtype,
             source, recvtag,
             comm, status);

Blocking operation, combines send and recv to avoid deadlock.
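
A sketch of the same pairwise exchange with the combined call (same assumed names as before); MPI handles the ordering internally, so neither rank needs to know who goes first.

MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, other, 0,
             recvbuf, n, MPI_DOUBLE, other, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);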

Problem 2: Communication overhead

Partial solution: nonblocking communication

Blocking vs non-blocking

  • MPI_Send and MPI_Recv are blocking
    • Send does not return until data is in system
    • Recv does not return until data is ready
    • Cons: possible deadlock, time wasted waiting
  • Why blocking?
    • Overwrite buffer during send \(\implies\) evil!
    • Read buffer before data ready \(\implies\) evil!

Blocking vs non-blocking

Alternative: nonblocking communication

  • Split into distinct initiation/completion phases
  • Initiate send/recv and promise not to touch buffer
  • Check later for operation completion

Overlap communication and computation

Nonblocking operations

Initiate message:

MPI_Isend(start, count, datatype, dest,
          tag, comm, request);
MPI_Irecv(start, count, datatype, source,
          tag, comm, request);

Wait for message completion:

MPI_Wait(request, status);

Test for message completion:

MPI_Test(request, status);

Multiple outstanding requests

Sometimes useful to have multiple outstanding messages:

MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);

Multiple versions of test as well.
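
Putting the pieces together, a sketch of overlapping a neighbor exchange with local work (neighbor, sendbuf, recvbuf, n, and do_local_work are assumed names):

MPI_Request reqs[2];
MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);
do_local_work();  /* computation that touches neither buffer */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
/* Only now is it safe to reuse sendbuf or read recvbuf */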

Other send/recv variants

Other variants of MPI_Send

  • MPI_Ssend (synchronous) – do not complete until receive has begun
  • MPI_Bsend (buffered) – user provides buffer (via MPI_Buffer_attach; see the sketch below)
  • MPI_Rsend (ready) – user guarantees receive has already been posted
  • Can combine modes (e.g. MPI_Issend)

MPI_Recv receives anything.
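
A sketch of the buffered mode, which requires attaching buffer space first (msg, count, dest assumed):

int size;
MPI_Pack_size(count, MPI_DOUBLE, MPI_COMM_WORLD, &size);
size += MPI_BSEND_OVERHEAD;
char* bsend_buf = malloc(size);
MPI_Buffer_attach(bsend_buf, size);
MPI_Bsend(msg, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
MPI_Buffer_detach(&bsend_buf, &size);  /* waits for buffered sends to drain */
free(bsend_buf);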

Another approach

  • Send/recv is one-to-one communication
  • An alternative is one-to-many (and vice-versa):
    • Broadcast to distribute data from one process
    • Reduce to combine data from all processors
    • Operations are called by all processes in communicator

Broadcast and reduce

MPI_Bcast(buffer, count, datatype,
          root, comm);
MPI_Reduce(sendbuf, recvbuf, count, datatype,
           op, root, comm);

  • buffer is copied from root to others
  • recvbuf receives result only at root
  • op \(\in \{\) MPI_MAX, MPI_SUM, …\(\}\)

Example: basic Monte Carlo

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char** argv) {
    int nproc, myid, ntrials = atoi(argv[1]);
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Bcast(&ntrials, 1, MPI_INT,
              0, MPI_COMM_WORLD);
    run_mc(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}

Example: basic Monte Carlo

Let sum[0] = \(\sum_i X_i\) and sum[1] = \(\sum_i X_i^2\).

void run_mc(int myid, int nproc, int ntrials) {
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX = sums[0]/N;
        double EX2 = sums[1]/N;
        printf("Mean: %g; err: %g\n",
               EX, sqrt((EX2-EX*EX)/N));
    }
}

Collective operations

  • Involve all processes in communicator
  • Basic classes:
    • Synchronization (e.g. barrier)
    • Data movement (e.g. broadcast)
    • Computation (e.g. reduce)

Barrier

MPI_Barrier(comm);

Not much more to say. Not needed that often.
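
One place it does show up is timing: a sketch of bracketing a phase so all ranks start the clock together (phase is an assumed stand-in for the code being timed).

MPI_Barrier(MPI_COMM_WORLD);       /* everyone starts together */
double t0 = MPI_Wtime();
phase();
MPI_Barrier(MPI_COMM_WORLD);       /* everyone has finished */
double elapsed = MPI_Wtime() - t0;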

Broadcast

Scatter/gather

Allgather

Alltoall

Reduce

Scan

The kitchen sink

  • In addition to the above: vector variants (v suffix), more All variants (Allreduce), Reduce_scatter, …
  • MPI-2 added (and MPI-3 revamped) one-sided communication (put/get)
  • MPI is not a small library!
  • But a small number of calls goes a long way
    • Init/Finalize
    • Comm_rank, Comm_size
    • Send/Recv variants and Wait
    • Allreduce, Allgather, Bcast
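
For example, a global dot product is one call beyond the local work (x, y, n assumed to be the local pieces; MPI_IN_PLACE avoids a separate send buffer):

double dot = 0;
for (int i = 0; i < n; ++i)  /* local partial dot product */
    dot += x[i]*y[i];
MPI_Allreduce(MPI_IN_PLACE, &dot, 1, MPI_DOUBLE,
              MPI_SUM, MPI_COMM_WORLD);
/* every rank now holds the global dot product */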