Due: Wednesday, February 17 by 5 pm.
Problem
For this assignment, you will optimize a routine to multiply two
double-precision square matrices. As discussed in class, the naive
implementation is short, sweet, and horrifyingly slow. A naive blocked
code is only marginally better. You will need to use what you have
learned about tuning to get your code to run as fast as possible on a
single core on one node of the crocus cluster (Intel Xeon E5504).
We provide:
- A trivial unoptimized implementation and a simple blocked
implementation
- A timing harness and tester
- A version of the interface that calls the ATLAS BLAS
Implementation
Your function must have the following signature:
    void square_dgemm(unsigned M, const double* A,
                      const double* B, double* C);
The three arrays should be interpreted as matrices in column-major
order with leading dimension M. The operation implemented will actually
be a multiply-add:
C := C + A*B
Look at the code in basic_dgemm.c if you find this
confusing.
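To make the interface concrete, here is a minimal sketch of the naive multiply-add in the spirit of basic_dgemm.c (a hand-written illustration, not the provided file): entry (i, j) of a column-major M-by-M matrix X with leading dimension M lives at X[i + j*M].

```c
/* Naive column-major multiply-add: C := C + A*B.
 * All three matrices are M-by-M, column-major, leading dimension M. */
void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    for (unsigned j = 0; j < M; ++j)         /* column of C and B */
        for (unsigned i = 0; i < M; ++i) {   /* row of C and A    */
            double cij = C[i + j*M];
            for (unsigned k = 0; k < M; ++k)
                cij += A[i + k*M] * B[k + j*M];
            C[i + j*M] = cij;
        }
}
```

Note that the routine accumulates into C rather than overwriting it, matching C := C + A*B above.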
The necessary files are in matmul.tar.gz.
Included are:
- Makefile: a sample Makefile, with some basic rules
- matmul.c: the driver program
- basic_dgemm.c: a very simple square_dgemm implementation
- blocked_dgemm.c: a slightly more complex square_dgemm implementation
- blas_dgemm.c: another wrapper that lets the C driver program call the
dgemm routine in BLAS implementations
- timing.gnuplot: a sample gnuplot script to display the results
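For reference, a blocked implementation in the style of blocked_dgemm.c can be sketched as follows (again a hedged illustration, not the provided file; BLOCK_SIZE is a hypothetical tuning parameter you would pick to fit the cache):

```c
#define BLOCK_SIZE 32  /* tile edge; a tuning knob, not a magic value */

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* Blocked column-major multiply-add: C := C + A*B.
 * Tiles the three loops so each working set fits in cache. */
void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    for (unsigned bj = 0; bj < M; bj += BLOCK_SIZE)
        for (unsigned bk = 0; bk < M; bk += BLOCK_SIZE)
            for (unsigned bi = 0; bi < M; bi += BLOCK_SIZE) {
                unsigned jmax = min_u(bj + BLOCK_SIZE, M);
                unsigned kmax = min_u(bk + BLOCK_SIZE, M);
                unsigned imax = min_u(bi + BLOCK_SIZE, M);
                for (unsigned j = bj; j < jmax; ++j)
                    for (unsigned k = bk; k < kmax; ++k) {
                        double bkj = B[k + j*M];
                        /* innermost loop walks down a column: unit stride */
                        for (unsigned i = bi; i < imax; ++i)
                            C[i + j*M] += A[i + k*M] * bkj;
                    }
            }
}
```

Keeping the innermost loop over i preserves unit-stride access to A and C in column-major storage, which is one of the choices you will want to experiment with when tuning.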
We will be testing on the 2.0 GHz Xeon machines on the crocus cluster.
Each node has two quad-core chips, but you will only be using a single
core for this assignment. See the wiki for more
information on the cluster.
Submission
Your group should submit your dgemm.c, your Makefile
(so we can see compiler optimizations), and a write-up. Your write-up
should contain:
- the names of the group members
- a description of optimizations used or attempted
- the results of those optimizations
- your explanations for any odd behavior (e.g. performance dips)
To document the effect of your optimizations, include a graph
comparing your code with basic_dgemm.c. Your explanations should
rely heavily on your knowledge of the memory hierarchy (benchmark graphs
help).
Resources
- I have mentioned the wiki,
right?
- Much of this assignment was shamelessly borrowed from CS 267, Fall
2006, which in turn borrowed from previous years into the mists of
antiquity. Many of the resources listed there will be useful.
- You can find out more about the processor by running Todd Allen's cpuid
utility and by running cat /proc/cpuinfo.
- The optimizations used in PHiPAC
and ATLAS may be
interesting. Note: You cannot use PHiPAC or ATLAS to generate your
matrix multiplication kernels. You can write your own code
generator, however. You might want to skim the tech report
(.pdf) on PHiPAC. The section on C coding guidelines will be
particularly relevant.
- There is a classic paper (PostScript) on the idea of
cache blocking in the context of matrix multiply by Monica Lam, et
al.
- Several folks have tried to obtain cache locality automatically by
basing their matrix storage organization on space-filling curves.
See the paper
by Chatterjee and the classic paper by Gustavson,
Recursion
leads to automatic variable blocking for dense linear
algebra.