Due: Monday, Sep 26 by 11:59 pm.
For this assignment, you will optimize a routine to multiply two double-precision square matrices. As discussed in class, the naive implementation is short, sweet, and horrifyingly slow. A naive blocked code is only marginally better. You will need to use what you have learned about tuning to get your code to run as fast as possible on a single core on one node of the crocus cluster (Intel Xeon E5504).
We provide a timing harness, a naive implementation, a simple blocked implementation, and wrappers for calling tuned and reference BLAS routines; the individual files are listed below.
Your function must have the following signature:
void square_dgemm(unsigned M, const double* A, const double* B,
                  double* C);
The three arrays should be interpreted as matrices in column-major order with leading dimension M. The operation implemented will actually be a multiply-add:

C := C + A*B

Look at the code in basic_dgemm.c if you find this confusing.
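For concreteness, here is a minimal naive implementation in the spirit of basic_dgemm.c (the shipped file may differ in details); it just makes the column-major indexing and the multiply-add explicit.

/* Naive sketch: C := C + A*B, all matrices M-by-M, column-major,
 * leading dimension M, so entry (i,j) of X lives at X[i + j*M]. */
void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    unsigned i, j, k;
    for (j = 0; j < M; ++j)
        for (k = 0; k < M; ++k) {
            double bkj = B[k + j*M];
            for (i = 0; i < M; ++i)
                C[i + j*M] += A[i + k*M] * bkj;
        }
}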
The necessary files are in matmul.tgz, which you should retrieve on the cluster with the command
wget http://www.cs.cornell.edu/~bindel/class/cs5220-f11/code/matmul.tgz
Included are:

- Makefile: a sample Makefile with some basic rules
- Makefile.in: default settings for optimization flags and the like
- matmul.c: the driver program
- basic_dgemm.c: a very simple square_dgemm implementation
- blocked_dgemm.c: a slightly more complex square_dgemm implementation
- blas_dgemm.c: a wrapper that lets the C driver program call the dgemm routine in a tuned BLAS implementation
- f2c_dgemm.c: a wrapper that illustrates how to call the reference FORTRAN dgemm routine from C
- fdgemm.f: the reference FORTRAN dgemm called by f2c_dgemm.c
- tplot.sh: a sample script that uses gnuplot to plot timing results. If the raw output is in (for example) timing-basic.out, run ./tplot.sh basic to generate timing-basic.pdf.
- make_sge.sh: a helper shell script that generates SGE batch system submission scripts for matmul timing runs.

We will be testing on the 2.0 GHz Xeon machines on the crocus cluster. Each node has two quad-core chips, but you will only be using a single core for this assignment. See here for more information on the cluster.
Your group should submit your dgemm.c, your Makefile (which should include any overrides to the default Makefile.in, so that we can see your compiler optimizations), and a write-up. Your write-up should describe the optimizations you used or attempted and the effect they had on performance. To document the effect of your optimizations, include a graph comparing your code with basic_dgemm.c. Your explanations should rely heavily on your knowledge of the memory hierarchy (benchmark graphs help).
Roughly speaking, a fast matrix multiplication routine will likely have two ingredients: a fast "kernel" capable of multiplying small matrices quickly, and then routines that use the kernel in a cache-friendly way to multiply larger matrices. One way to approach the assignment is top-down: sketch out the kernel first, and make sure you understand the global structure of the calls you want to set up, then start tuning parameters. Another way is bottom-up: build a fast kernel, play with parameters until you're happy with it; then build a matrix multiply for small-ish matrices (things that fit in L1 cache) out of the kernel, and play with parameters until happy; and then repeat for the L2 (and maybe L3) cache. I strongly recommend the second path, if only for the sake of sanity while debugging! It's easier to focus on getting a small thing right first than to try to get it all right on the first try.
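To make that structure concrete, here is a rough sketch of the blocked organization described above, with a trivial kernel standing in for your tuned one. The names do_block and square_dgemm_blocked and the block size are placeholders, and the fringe case where M is not a multiple of the block size is omitted for brevity.

#define BLOCK_SIZE 32  /* placeholder; tune to the cache sizes */

/* Multiply one BLOCK_SIZE x BLOCK_SIZE tile: a stand-in for a tuned kernel */
static void do_block(unsigned M, const double* A, const double* B, double* C,
                     unsigned i0, unsigned j0, unsigned k0)
{
    unsigned i, j, k;
    for (j = j0; j < j0 + BLOCK_SIZE; ++j)
        for (k = k0; k < k0 + BLOCK_SIZE; ++k)
            for (i = i0; i < i0 + BLOCK_SIZE; ++i)
                C[i + j*M] += A[i + k*M] * B[k + j*M];
}

/* Outer routine: walk over the matrices one tile at a time */
void square_dgemm_blocked(unsigned M, const double* A, const double* B, double* C)
{
    unsigned i, j, k;
    for (j = 0; j < M; j += BLOCK_SIZE)
        for (k = 0; k < M; k += BLOCK_SIZE)
            for (i = 0; i < M; i += BLOCK_SIZE)
                do_block(M, A, B, C, i, j, k);
}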
There are a few lower-level logistical issues that you should probably be aware of as you proceed. You may want to read my notes on serial tuning before going through these points, since some of them address things you have to know in order to implement my suggestions there using gcc on the crocus cluster.
The restrict keyword used in the code segment below is a C99 feature, and does not occur in previous versions of the language standard. GCC supports the C99 standard, but only if you ask it to do so. Unfortunately, if you ask it to use the C99 standard strictly, the compiler becomes unhappy with the clock_gettime calls we use for timing. Fortunately, there is nothing to prevent you from compiling some files as C99 and others as GNU C / C89. Alternately, you can use GNU99 mode, which combines C99 with some POSIX features supported by GCC. To compile with the strict C99 standard, add the flag -std=c99 to the GCC command line; to compile in GNU99 mode, use the flag -std=gnu99.
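As an illustration of the kind of code that needs the newer standard, here is a small hypothetical routine using restrict; a file containing it needs -std=c99 or -std=gnu99 on the gcc command line, while a file that calls clock_gettime (like timing.c) is happier under -std=gnu99 so the POSIX declarations stay visible.

/* Hypothetical example: "restrict" is only accepted in C99/GNU99 mode */
void axpy(int n, double a,
          const double* restrict x,
          double* restrict y)
{
    int i;
    for (i = 0; i < n; ++i)
        y[i] += a * x[i];
}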
GCC is still not all that smart about how it makes use of the SSE units. It's possible that some of you have coaxed GCC into actually getting automatic vectorization of your code, but I have not (this is, by the way, one of the things that's nice about the Intel compilers). However, if you tell GCC where it makes sense to use SSE instructions, the compiler can then help you do the heavy lifting of efficiently scheduling those instructions. For example, the very inner loop of my current matrix multiply kernel looks like this:
/* Include SSE2 intrinsics (double-precision SSE) */
#include <emmintrin.h>

/* Compute x += A*b for a 2-by-2 column-major matrix A.
 *
 * The "restrict" keyword in C99 tells the compiler
 * there is no aliasing.
 *
 * Note that A and x must be aligned on 16-byte boundaries. This can
 * be done where they are declared with something like:
 *
 *   double A[4] __attribute__((aligned(16)));
 *   double x[2] __attribute__((aligned(16)));
 */
void matvec2by2(const double* restrict A,
                const double* restrict b,
                double* restrict x)
{
    /* Load each component of b into an SSE register.
     * The registers hold two doubles each; each double gets
     * the same value. The __m128d type is the type used by
     * the compiler to mean "short vector with which we can
     * do SSE stuff."
     */
    __m128d b0 = _mm_load1_pd(b+0);
    __m128d b1 = _mm_load1_pd(b+1);

    /* Load x into a register */
    __m128d xr = _mm_load_pd(x);

    /* Now, update x:
     *  - load the columns of A into a0 and a1
     *  - multiply a0 by b0 and a1 by b1 (vectorized ops) to get sum0 and sum1
     *  - add both into x
     */
    __m128d a0 = _mm_load_pd(A+0);
    __m128d a1 = _mm_load_pd(A+2);
    __m128d sum0 = _mm_mul_pd(a0,b0);
    __m128d sum1 = _mm_mul_pd(a1,b1);
    xr = _mm_add_pd(xr,sum0);
    xr = _mm_add_pd(xr,sum1);

    /* It's good to store the result! */
    _mm_store_pd(x,xr);
}
My code looks a little different (for one thing, it is sitting inside a larger loop nest), but this is the basic idea.
Feel free to use this code directly! More generally, you should feel free to use other codes that you find online, though you should provide an appropriate citation (e.g. via a comment). The details of how to use the SSE units aren't really the thing I hoped you'd get out of this -- and in any case, this code still needs some fiddling in order to get great performance.
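If it helps to see how such a kernel slots into a surrounding routine, here is one hypothetical wrapper: it accumulates a 2-by-1 sliver of C by sweeping the kernel across packed 2-by-2 blocks of A and the matching entries of b. The packing scheme and the name small_multiply are illustrative only, not how my code is organized.

/* Hypothetical use of the kernel: C_col (2 doubles, 16-byte aligned) +=
 * A_blocks * b, where A_blocks holds nblocks consecutive 2x2 column-major
 * blocks (4 doubles per block) and b holds 2*nblocks entries. */
void small_multiply(int nblocks, const double* A_blocks,
                    const double* b, double* C_col)
{
    int k;
    for (k = 0; k < nblocks; ++k)
        matvec2by2(A_blocks + 4*k, b + 2*k, C_col);
}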
There are different types of places where variables can live in memory: on the stack, in the static (global) data segment, or on the heap.
In a lot of day-to-day stuff, we use the stack for almost everything. But the stack is a finite resource! If you start allocating large arrays on the stack, you will very quickly have a stack overflow (and a program crash). This is probably not what you want, so I suggest using the following strategy if you want a little side buffer for copy optimization:
void my_function(...)
{
    /* static means that this goes in the global segment
     * (only one copy -- not thread safe!), and the alignment
     * attribute is helpful if you're going to use my_space
     * with SSE stuff.
     */
    static double my_space[BLOCK_SIZE*BLOCK_SIZE]
        __attribute__((aligned(16)));
    ...
}
You can also dynamically allocate space on the heap -- but you won't want to do that anywhere near an inner loop!
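If you do want heap storage (say, one scratch buffer allocated at program start and reused for every call), a sketch like the following works with gcc on Linux; the helper name is made up, and posix_memalign is used so the buffer is safe to hand to the aligned SSE loads and stores.

#include <stdlib.h>

/* Allocate a 16-byte-aligned scratch buffer of n doubles on the heap
 * (hypothetical helper; call it once, well away from the inner loops). */
double* alloc_scratch(size_t n)
{
    void* p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return (double*) p;
}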
Timing on SMP Linux systems is a pain, both because the clocks seen on different cores need not agree and because the scheduler can migrate your process between cores mid-run. The code I have given you should usually work, but a more reliable approach is to define CLOCK as CLOCK_PROCESS_CPUTIME_ID in timing.c and to use taskset to ensure that your process is pinned to a single physical processor: i.e. use taskset -c 0 ./matmul (or change $1 >> timing.raw to taskset -c 0 $1 >> timing.raw in make_sge.sh).
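If you want to see what that change amounts to, here is a sketch in the spirit of the timer in timing.c (the real file's structure may differ); on older glibc you may also need to link with -lrt for clock_gettime.

#include <time.h>

/* Use the per-process CPU-time clock rather than wall-clock time */
#define CLOCK CLOCK_PROCESS_CPUTIME_ID

/* Return elapsed process CPU time in seconds */
double cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}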
You can find out more about the processor (cache sizes and the like) with the cpuid utility and by running cat /proc/cpuinfo. Note that you should do this using the batch queueing system, since the front-end node has a different processor than the workers.