Due: Monday, Nov 14 by 11:59 pm.
The purpose of this assignment is to get some more experience with MPI programming, and to do some simple comparisons between an OpenMP code and an MPI code. Your goal is to profile a reference implementation of an all-pairs shortest path algorithm using OpenMP, and to convert that algorithm to use MPI. The reference implementation is documented here. For this assignment, you should do the following:
Do a scaling study to characterize the performance of the OpenMP code as a function of the graph size and the number of threads. Can you explain the performance you see? Note that different graphs will require different numbers of outer iterations; you should take this into account in your analysis.
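If the reference kernel is the usual repeated (min, +) "squaring" of the distance matrix — consistent with the "matrix-multiply-like kernel" and the graph-dependent iteration counts mentioned above — then path lengths roughly double on each pass, so convergence takes about log2(diameter) outer iterations. A minimal serial sketch of that scheme (function names and the row-major layout are mine, not taken from the handout code):

```c
#include <limits.h>
#include <string.h>

/* "Infinity" chosen so that INF + INF does not overflow int. */
static const int INF = INT_MAX / 2;

/* One (min,+) squaring step over a row-major n-by-n distance matrix:
 *   dnew[i][j] = min(d[i][j], min_k (d[i][k] + d[k][j])).
 * Returns 1 if any entry improved, 0 if the matrix is already stable. */
int square(int n, const int *d, int *dnew)
{
    int changed = 0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int best = d[i*n + j];
            for (int k = 0; k < n; ++k) {
                int via = d[i*n + k] + d[k*n + j];
                if (via < best) best = via;
            }
            dnew[i*n + j] = best;
            if (best < d[i*n + j]) changed = 1;
        }
    }
    return changed;
}

/* Repeated squaring: legal path lengths double each pass, so at most
 * about ceil(log2(diameter)) + 1 passes are needed before the matrix
 * stabilizes.  Returns the number of passes actually taken. */
int shortest_paths(int n, int *d, int *scratch)
{
    int iters = 0;
    for (;;) {
        int changed = square(n, d, scratch);
        memcpy(d, scratch, (size_t) n * n * sizeof(int));
        ++iters;
        if (!changed) break;
    }
    return iters;
}
```

The graph-dependent iteration count falls straight out of this: a long path graph needs more passes than a dense graph of the same size, which is why raw runtimes for different graphs are not directly comparable.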
Write a naive MPI implementation in which each processor keeps a complete copy of the data and communicates it with neighboring processors using MPI_Allgatherv.
Write a more sophisticated MPI implementation in which you improve the memory scalability by not keeping a full copy of the distance matrix at every processor. Instead, you should rotate the matrices around from one processor to another in some systematic way. I suggest adapting the simple 1D ring matrix multiply from the dense linear algebra lectures, but you are allowed to do something more sophisticated if you wish.
Repeat your scaling study for your two MPI implementations. Again, try to explain the performance in terms of a simplified model. Make sure you do at least a few runs that involve more than one node!
The current OpenMP version is fairly naive (I didn't attempt any blocking of the matrix-multiply-like kernel), and I'm sure there is room for significant improvement if you feel so inclined. However, I'm really more interested in what you get working in MPI; if you find you have time left over, it is better spent on the final project.
You may start with the serial code, which you should retrieve on the cluster with the command
wget http://www.cs.cornell.edu/~bindel/class/cs5220-f11/code/path.tgz
You can get started with the serial code by reading the annotated source code, which I generated from the source files using dsbweb, my home-grown literate programming tool.
You may work in groups of 2 or 3. You do not have to have the same groups as last time.
You will submit your code, but the main deliverable for this project is a report. Here is a list of items you might show in your report:
A log-log plot that shows the asymptotic scaling of your codes as the problem size grows.
Speedup plots that show how closely your parallel codes approach the idealized p-times speedup, and a discussion of whether it is possible to do better. You should consider both weak scaling (where the work per processor is fixed) and strong scaling (where the problem size is fixed).
Where does the time go? Consider breaking down the runtime into computation time, synchronization time and/or communication time, and describing each with a simple model. How do they scale with the number of processors? The graph size?