Due: Wednesday, February 17 by 5 pm.
Problem
For this assignment, you will optimize a routine to multiply two
double-precision square matrices. As discussed in class, the naive
implementation is short, sweet, and horrifyingly slow. A naive blocked
code is only marginally better. You will need to use what you have
learned about tuning to get your code to run as fast as possible on a
single core on one node of the crocus cluster (Intel Xeon E5504).
We provide:
- A trivial unoptimized implementation and a simple blocked
implementation
- A timing harness and tester
- A version of the interface that calls the ATLAS BLAS
Implementation
Your function must have the following signature:
    void square_dgemm(unsigned M, const double* A,
                      const double* B, double* C);
The three arrays should be interpreted as matrices in column-major
order with leading dimension M. The operation implemented will actually
be a multiply-add:
C := C + A*B
Look at the code in basic_dgemm.c if you find this
confusing.
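To make the interface concrete, here is a minimal sketch of the naive multiply-add in the spirit of basic_dgemm.c (a hand-written illustration, not the provided file): entry (i, j) of a column-major M-by-M matrix X with leading dimension M lives at X[i + j*M].

```c
/* Naive column-major multiply-add: C := C + A*B.
 * All three matrices are M-by-M, column-major, leading dimension M. */
void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    for (unsigned j = 0; j < M; ++j)         /* column of C and B */
        for (unsigned i = 0; i < M; ++i) {   /* row of C and A    */
            double cij = C[i + j*M];
            for (unsigned k = 0; k < M; ++k)
                cij += A[i + k*M] * B[k + j*M];
            C[i + j*M] = cij;
        }
}
```

Note that the routine accumulates into C rather than overwriting it, matching C := C + A*B above.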
The necessary files are in matmul.tar.gz.
Included are:
- Makefile: a sample Makefile, with some basic rules
- matmul.c: the driver program
- basic_dgemm.c: a very simple square_dgemm implementation
- blocked_dgemm.c: a slightly more complex square_dgemm implementation
- blas_dgemm.c: another wrapper that lets the C driver program call the
dgemm routine in BLAS implementations
- timing.gnuplot: a sample gnuplot script to display the results
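For reference, a blocked implementation in the style of blocked_dgemm.c can be sketched as follows (again a hedged illustration, not the provided file; BLOCK_SIZE is a hypothetical tuning parameter you would pick to fit the cache):

```c
#define BLOCK_SIZE 32  /* tile edge; a tuning knob, not a magic value */

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* Blocked column-major multiply-add: C := C + A*B.
 * Tiles the three loops so each working set fits in cache. */
void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    for (unsigned bj = 0; bj < M; bj += BLOCK_SIZE)
        for (unsigned bk = 0; bk < M; bk += BLOCK_SIZE)
            for (unsigned bi = 0; bi < M; bi += BLOCK_SIZE) {
                unsigned jmax = min_u(bj + BLOCK_SIZE, M);
                unsigned kmax = min_u(bk + BLOCK_SIZE, M);
                unsigned imax = min_u(bi + BLOCK_SIZE, M);
                for (unsigned j = bj; j < jmax; ++j)
                    for (unsigned k = bk; k < kmax; ++k) {
                        double bkj = B[k + j*M];
                        /* innermost loop walks down a column: unit stride */
                        for (unsigned i = bi; i < imax; ++i)
                            C[i + j*M] += A[i + k*M] * bkj;
                    }
            }
}
```

Keeping the innermost loop over i preserves unit-stride access to A and C in column-major storage, which is one of the choices you will want to experiment with when tuning.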
We will be testing on the 2.0 GHz Xeon machines on the crocus cluster.
Each node has two quad-core chips, but you will only be using a single
core for this assignment. See the wiki for more
information on the cluster.
Submission
Your group should submit your dgemm.c, your Makefile
(so we can see compiler optimizations), and a write-up. Your write-up
should contain:
- the names of the group members
- a description of optimizations used or attempted
- the results of those optimizations
- your explanations for any odd behavior (e.g. performance dips)
To document the effect of your optimizations, include a graph
comparing your code with basic_dgemm.c. Your explanations should
rely heavily on your knowledge of the memory hierarchy (benchmark graphs
help).
Resources
- I have mentioned the wiki,
right?
- Much of this assignment was shamelessly borrowed from CS 267, Fall
2006, which in turn borrowed from previous years into the mists of
antiquity. Many of the resources listed there will be useful.
- You can find out more about the processor by running Todd Allen's cpuid
utility and by running cat /proc/cpuinfo.
- The optimizations used in PHiPAC
and ATLAS may be
interesting. Note: You cannot use PHiPAC or ATLAS to generate your
matrix multiplication kernels. You can write your own code
generator, however. You might want to skim the tech report
(.pdf) on PHiPAC. The section on C coding guidelines will be
particularly relevant.
- There is a classic paper (PostScript) on the idea of
cache blocking in the context of matrix multiply by Monica Lam, et
al.
- Several folks have tried to obtain cache locality automatically by
basing their matrix storage organization on space-filling curves.
See the paper
by Chatterjee and the classic paper by Gustavson,
Recursion
leads to automatic variable blocking for dense linear
algebra.