CS 5220

Applications of Parallel Computers

MPI Programming

Prof David Bindel

Message passing programming

Basic operations:

  • Pairwise messaging: send/receive
  • Collective messaging: broadcast, scatter/gather
  • Collective computation: parallel prefix (sum, max)
  • Barriers (no need for locks!)
  • Environmental inquiries (who am I? do I have mail?)

(Much of what follows is adapted from Bill Gropp.)

MPI

  • Message Passing Interface
  • An interface spec — many implementations
    (OpenMPI, MPICH, MVAPICH, Intel, ...)
  • Single Program Multiple Data (SPMD) style
  • Bindings to C, C++, Fortran

MPI

  • Version 1.0 in 1994, 2.2 in 2009, 3.0 in 2012
  • MPI 3 goodies:
    • Nonblocking collectives
    • Neighborhood collectives
    • RMA and one-sided comm
  • Will stick to MPI-2 today

Hello world

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Building, queueing, running

Several steps to actually run

mpicc -o foo.x foo.c   # Compile the program
sbatch foo.sub         # Submit to queue (SLURM)
# mpirun -n 2 ./foo.x  # (in foo.sub) Run on 2 procs
  

This is all platform-specific.

Communicators

  • Processes form groups
  • Messages sent in contexts
    • Separate communication for libraries
  • Group + context = communicator
  • Identify process by rank in group
  • Default is MPI_COMM_WORLD
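
As a sketch of how "group + context" becomes concrete: MPI_Comm_split partitions an existing communicator by a color value. The even/odd split below is purely illustrative.

/* Sketch: split MPI_COMM_WORLD into even- and odd-rank groups. */
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

MPI_Comm sub_comm;
MPI_Comm_split(MPI_COMM_WORLD,
               world_rank % 2,  /* color: which new group     */
               world_rank,      /* key: ordering within group */
               &sub_comm);

int sub_rank, sub_size;
MPI_Comm_rank(sub_comm, &sub_rank);  /* rank within the subgroup */
MPI_Comm_size(sub_comm, &sub_size);
/* ... collectives on sub_comm involve only this group ... */
MPI_Comm_free(&sub_comm);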

Sending and receiving

Need to specify:

  • What’s the data?
    • Different machines use different encodings (e.g. endian-ness)
    • $\implies$ “bag o’ bytes” model is inadequate
  • How do we identify processes?
  • How does receiver identify messages?
  • What does it mean to “complete” a send/recv?

MPI datatypes

Message is (address, count, datatype). Allow:

  • Basic types (MPI_INT, MPI_DOUBLE)
  • Contiguous arrays
  • Strided arrays
  • Indexed arrays
  • Arbitrary structures

Caveat: complex data types may hurt performance.
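
As an example of a derived type: one column of a row-major matrix is a strided array, which MPI_Type_vector can describe once so the column can be sent as a single unit. A minimal sketch, assuming a double matrix A with n rows, leading dimension lda, column index j, and destination rank dest:

/* Sketch: send column j of a row-major n x lda double matrix A. */
MPI_Datatype col_type;
MPI_Type_vector(n,          /* number of blocks      */
                1,          /* one double per block  */
                lda,        /* stride between blocks */
                MPI_DOUBLE, &col_type);
MPI_Type_commit(&col_type);
MPI_Send(&A[j], 1, col_type, dest, 0, MPI_COMM_WORLD);
MPI_Type_free(&col_type);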

MPI tags

Use an integer tag to label messages

  • Help distinguish different message types
  • Can screen messages with wrong tag
  • MPI_ANY_TAG is a wildcard

MPI Send/Recv

Basic blocking point-to-point communication:

int
MPI_Send(void *buf, int count,
         MPI_Datatype datatype,
         int dest, int tag, MPI_Comm comm);

int
MPI_Recv(void *buf, int count,
         MPI_Datatype datatype,
         int source, int tag, MPI_Comm comm,
         MPI_Status *status);
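
A minimal use of these two calls might look like the sketch below: rank 0 sends one integer to rank 1 (assumes at least two ranks and <stdio.h>).

/* Sketch: rank 0 sends an int to rank 1. */
int rank, x = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    x = 42;
    MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    MPI_Status status;
    MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("Rank 1 got %d\n", x);
}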

MPI send/recv semantics

  • Send returns when data gets to system
    • ... might not yet arrive at destination!
  • Recv ignores messages that mismatch source/tag
    • MPI_ANY_SOURCE and MPI_ANY_TAG wildcards
  • Recv status contains more info (tag, source, size)
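
For example, after a wildcard receive the status object tells us who actually sent the message and how long it was. A sketch, assuming a double buffer buf of length MAXLEN:

/* Sketch: receive from anyone, then inspect the status. */
MPI_Status status;
int count;
MPI_Recv(buf, MAXLEN, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &count);
printf("Got %d doubles from rank %d (tag %d)\n",
       count, status.MPI_SOURCE, status.MPI_TAG);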

Ping-pong pseudocode

Process 0:

for i = 1:ntrials
  send b bytes to 1
  recv b bytes from 1
end

Process 1:

for i = 1:ntrials
  recv b bytes from 0
  send b bytes to 0
end

Ping-pong MPI

void ping(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Send(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

(Pong is similar)
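
A matching pong might look like the sketch below (receive first, then echo back):

/* Sketch of pong: mirror image of ping. */
void pong(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Recv(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_CHAR, p, 0,
                 MPI_COMM_WORLD);
    }
}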

Ping-pong MPI

for (int sz = 1; sz <= MAX_SZ; sz += 1000) {
    if (rank == 0) {
        double t1 = MPI_Wtime();
        ping(buf, sz, NTRIALS, 1);
        double t2 = MPI_Wtime();
        printf("%d %g\n", sz, t2-t1);
    } else if (rank == 1) {
        pong(buf, sz, NTRIALS, 0);
    }
}

Blocking and buffering

Diagram of buffered message send

Block until data “in system” — maybe in a buffer?

Blocking and buffering

Diagram of unbuffered message send

Alternative: don’t copy, block until done.

Problem 1: Potential deadlock

Diagram of deadlock scenario

Both processors wait to send before they receive!
May not happen if lots of buffering on both sides.
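
In code, the hazard is the pattern sketched below: every rank posts a blocking send before its receive (partner, sendbuf, recvbuf, and n are assumed).

/* Sketch of the hazard: both partners send before receiving.
   May hang if MPI does not buffer messages of this size. */
MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
         MPI_STATUS_IGNORE);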

Solution 1: Alternating order

Diagram of breaking deadlock scenario by alternate send/recv

Could alternate who sends and who receives.
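
One way to code the alternation is to key the order on rank parity, as in this sketch (same assumed variables as before):

/* Sketch: even ranks send first, odd ranks receive first. */
if (rank % 2 == 0) {
    MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
} else {
    MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
}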

Solution 2: Combined send/recv

Diagram of send-receive primitive

Common operations deserve explicit support!

Combined sendrecv

MPI_Sendrecv(sendbuf, sendcount, sendtype,
             dest, sendtag,
             recvbuf, recvcount, recvtype,
             source, recvtag,
             comm, status);

Blocking operation, combines send and recv to avoid deadlock.
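
For example, a ring shift (send to rank+1, receive from rank-1) becomes one call with no ordering worries. A sketch, assuming rank, size, sendbuf, recvbuf, and n are defined:

/* Sketch: shift data one step around a ring. */
int next = (rank + 1) % size;
int prev = (rank + size - 1) % size;
MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, next, 0,
             recvbuf, n, MPI_DOUBLE, prev, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);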

Problem 2: Communication overhead

Diagram of nonblocking send primitive

Partial solution: nonblocking communication

Blocking vs non-blocking

  • MPI_Send and MPI_Recv are blocking
    • Send does not return until data is in system
    • Recv does not return until data is ready
    • Cons: possible deadlock, time wasted waiting
  • Why blocking?
    • Overwrite buffer during send $\implies$ evil!
    • Read buffer before data ready $\implies$ evil!

Blocking vs non-blocking

Alternative: nonblocking communication

  • Split into distinct initiation/completion phases
  • Initiate send/recv and promise not to touch buffer
  • Check later for operation completion

Overlap communication and computation

Diagram of nonblocking send/receive primitive

Nonblocking operations

Initiate message:

MPI_Isend(start, count, datatype, dest,
          tag, comm, request);
MPI_Irecv(start, count, datatype, source,
          tag, comm, request);

Wait for message completion:

MPI_Wait(request, status);

Test for message completion:

MPI_Test(request, status);
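
A typical pattern, sketched below: post both operations, do work that touches neither buffer, then wait. Here partner, sendbuf, recvbuf, and n are assumed, and do_local_work is a placeholder.

/* Sketch: nonblocking exchange overlapped with local work. */
MPI_Request reqs[2];
MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0,
          MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0,
          MPI_COMM_WORLD, &reqs[1]);

do_local_work();   /* placeholder: must not touch the buffers */

MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);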

Multiple outstanding requests

Sometimes useful to have multiple outstanding messages:

MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);

Multiple versions of test as well.
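
For instance, exchanging boundary data with several neighbors at once and then waiting for the whole batch. A sketch, assuming nneighbors, neighbor[], sbuf[], rbuf[], n, and a bound MAX_NEIGHBORS:

/* Sketch: post all exchanges, then wait for everything. */
MPI_Request reqs[2*MAX_NEIGHBORS];
for (int k = 0; k < nneighbors; ++k) {
    MPI_Irecv(rbuf[k], n, MPI_DOUBLE, neighbor[k], 0,
              MPI_COMM_WORLD, &reqs[2*k]);
    MPI_Isend(sbuf[k], n, MPI_DOUBLE, neighbor[k], 0,
              MPI_COMM_WORLD, &reqs[2*k+1]);
}
MPI_Waitall(2*nneighbors, reqs, MPI_STATUSES_IGNORE);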

Other send/recv variants

Other variants of MPI_Send

  • MPI_Ssend (synchronous) – do not complete until receive has begun
  • MPI_Bsend (buffered) – user provides buffer (via MPI_Buffer_attach)
  • MPI_Rsend (ready) – user guarantees receive has already been posted
  • Can combine modes (e.g. MPI_Issend)

MPI_Recv matches sends of any mode.

Another approach

  • Send/recv is one-to-one communication
  • An alternative is one-to-many (and vice-versa):
    • Broadcast to distribute data from one process
    • Reduce to combine data from all processors
    • Operations are called by all processes in communicator

Broadcast and reduce

MPI_Bcast(buffer, count, datatype,
          root, comm);
MPI_Reduce(sendbuf, recvbuf, count, datatype,
           op, root, comm);
  • buffer is copied from root to others
  • recvbuf receives result only at root
  • op $\in \{$ MPI_MAX, MPI_SUM, …$\}$

Example: basic Monte Carlo

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

void run_mc(int myid, int nproc, int ntrials);

int main(int argc, char** argv) {
    int nproc, myid, ntrials = atoi(argv[1]);
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Bcast(&ntrials, 1, MPI_INT,
              0, MPI_COMM_WORLD);
    run_mc(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}

Example: basic Monte Carlo

Let sum[0] = $\sum_i X_i$ and sum[1] = $\sum_i X_i^2$.

void run_mc(int myid, int nproc, int ntrials) {
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX = sums[0]/N;
        double EX2 = sums[1]/N;
        printf("Mean: %g; err: %g\n",
               EX, sqrt((EX*EX-EX2)/N));
    }
}

Collective operations

  • Involve all processes in communicator
  • Basic classes:
    • Synchronization (e.g. barrier)
    • Data movement (e.g. broadcast)
    • Computation (e.g. reduce)

Barrier

MPI_Barrier(comm);

Not much more to say. Not needed that often.
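
One place it does show up: lining everyone up before a timing measurement. A sketch (do_work is a placeholder for the timed phase):

/* Sketch: synchronize before and after a timed region. */
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();
do_work();                      /* placeholder for timed phase */
MPI_Barrier(MPI_COMM_WORLD);
double elapsed = MPI_Wtime() - t0;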

Broadcast

Diagram of MPI broadcast

Scatter/gather

Diagram of MPI gather
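
A sketch of the call pattern: the root hands each rank a chunk-sized piece of a big array and later collects the results. Here bigbuf, localbuf, and chunk are assumed; bigbuf only needs to be significant at the root.

/* Sketch: scatter chunks from root, gather results back. */
MPI_Scatter(bigbuf, chunk, MPI_DOUBLE,
            localbuf, chunk, MPI_DOUBLE,
            0, MPI_COMM_WORLD);
/* ... each rank works on its own chunk ... */
MPI_Gather(localbuf, chunk, MPI_DOUBLE,
           bigbuf, chunk, MPI_DOUBLE,
           0, MPI_COMM_WORLD);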

Allgather

Diagram of MPI all-gather

Alltoall

Diagram of MPI all-to-all

Reduce

Diagram of MPI reduce

Scan

Diagram of MPI scan
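
MPI_Scan computes an inclusive prefix over ranks; after the sketch below, mysum on rank i holds the sum of x over ranks 0 through i (local_value is a placeholder).

/* Sketch: inclusive prefix sum across ranks. */
double x = local_value();   /* placeholder for this rank's value */
double mysum;
MPI_Scan(&x, &mysum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);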

The kitchen sink

  • In addition to above, have vector variants (v suffix), more All variants (Allreduce), Reduce_scatter, ...
  • MPI-3 extends one-sided communication (put/get)
  • MPI is not a small library!
  • But a small number of calls goes a long way
    • Init/Finalize
    • Comm_rank, Comm_size
    • Send/Recv variants and Wait
    • Allreduce, Allgather, Bcast
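
As one example of how far those few calls go, here is a sketch of a distributed 2-norm with a single Allreduce (xlocal and nlocal are assumed; sqrt needs <math.h>).

/* Sketch: 2-norm of a distributed vector via Allreduce. */
double partial = 0.0, total;
for (int i = 0; i < nlocal; ++i)
    partial += xlocal[i]*xlocal[i];
MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE,
              MPI_SUM, MPI_COMM_WORLD);
double norm = sqrt(total);   /* same value on every rank */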