Due: Monday, Oct 31 by 11:59 pm.
The purpose of this assignment is to get a somewhat more serious parallel programming experience. Your goal is to parallelize a toy smoothed particle hydrodynamics simulation. A reference serial implementation is documented here. The current version uses a naive algorithm that does not use spatial locality. Your mission, should you choose to accept it, is to fix that. You should report on your progress in two weeks, including three things:
Improve the complexity by changing from the naive O(n2) algorithm to an O(n) (roughly) force evaluation algorithm based on spatial partitioning. You will not necessarily need to repartition at every step!
Parallelize your code using OpenMP, and spend some time tuning the
parallel implementation. You will need to go beyond just adding
pragmas before your for
loops in order to get reasonable
speed!
Do a scaling study to characterize the performance of both your serial and parallel codes as a function of the number of particles in the system and the number of processors. Try to explain what you see. Where are the bottlenecks? What are the parallel overheads? Do you have good load balance? Are you making good use of the memory system? For several of these questions, you may find it useful to try timing your code with HPCToolkit and TAU (both of which are now installed on the cluster).
If you have extra time, play a little! Improve or extend the code in some way that appeals to you, either by doing something clever with the time integrator (e.g. automatic adaptation of the time step), or adding a feature (I deliberately left out the surface tension calculations), or by doing something else. If this project appeals to you, extensions to it could easily be the basis of a class project.
You may start with the serial code, which you should retrieve on the cluster with the command
wget http://www.cs.cornell.edu/~bindel/class/cs5220-f11/code/sph.tgz
To visualize the output files generated by the code, use the Java viewer Bouncy.jar. If you feel like hacking on it, feel free: Bouncy.java.
You can get started with the serial code by reading the annotated source code, which I generated from the file sources using dsbweb, my home-grown literate programming tool.
You may work in groups of 2 or 3. One person in your group should be a non-CS student (if possible), but otherwise you're responsible for finding a group. You do not have to have the same groups as last time.
You will submit your code, but the main deliverable for this project is a report. Here is a list of items you might show in your report:
A plot in log-log scale that shows that your serial and parallel codes run in O(n) time, and a description of the data structures that you used to achieve it.
A description of the synchronization you used in the OpenMP code.
A description of the design choices that you tried and how did they affect the performance.
Speedup plots that show how closely your parallel codes approach the idealized p-times speedup and a discussion on whether it is possible to do better.
Where does the time go? Consider breaking down the runtime into computation time, synchronization time and/or communication time. How do they scale with the number of processors? The number of particles?