CS 612 -- Homework 1
Due: February 22, 2005
Introduction
In this assignment, you will be asked to discover the cache parameters
of your machine, analyze the performance effects of various types of
codes on your machine, obtain an optimized MMM code for your machine
and finally use this experience to create your own optimized MMM
kernel. Note that this assignment is easiest to complete if it is all
done on a P6-class machine (e.g. a Pentium II or Pentium III). The files
for completing this assignment are linked to within the assignment.
Problem 1 -- Cache Parameters
In this portion, you will use a tool developed by Kamen Yotov, called
X-Ray, to determine the cache parameters of the machine you are
using. First, read this paper to learn
how to download and use X-Ray. Then run X-Ray to determine the cache
parameters of your system. Find all parameters you think relevant to
optimization for cache performance. If possible, verify the results
returned by X-Ray by comparing to hardware manuals, or information on
the web. Turn this information in as part 1. You will also need it to
complete later parts of the assignment.
For further discussion of X-Ray (the papers referenced in the above
document, etc), check here.
Problem 2 -- Miss Ratio as a Measure of Cache Effectiveness
The miss ratio of a program provides a measure of how
effectively the program utilizes your cache, and, in some sense, how
efficient the program is. Intuitively, the fewer memory accesses that
must go all the way to main memory, the faster the program will run;
the miss ratio measures exactly this. In this problem, we will examine the
miss ratio for a common linear algebra problem, Matrix Matrix Multiply
(MMM), and see if the results give us some ideas as to how to best
exploit knowledge of cache parameters.
In order to compute the miss ratio, we will use performance
counters. Most architectures provide access to counters which measure
such things as memory accesses and cache misses. We will use this
information to determine the miss ratio for various versions of MMM.
If the machine you are testing on is in the P6 family, you
can use this library to access the
hardware counters. If you are using a Pentium 4 class machine, you
will need to determine how to access the performance counters
yourself. If you would like to use a PowerPC-based machine, email
Milind for information regarding the performance counters.
Once you are familiar with accessing and using the performance
counters on your machine, write a simple triply-nested-loop MMM
program, run all 6 loop orders for varying array sizes, and plot the
miss ratio against matrix size for each order.
Explain the changes in the miss ratio for the graph. Discuss why some
loop orders perform better than others, and explain any sudden changes
in miss ratio. Finally, comment on how this information might provide
optimization opportunities.
Problem 3 -- Empirically Optimized Cache-efficient Libraries
Rather than hand-tuning linear algebra libraries for each architecture
individually, a common approach is to use empirical optimization to
automatically generate a high performance implementation of the
library. The standard linear algebra library is BLAS (Basic Linear Algebra
Subprograms), and the most common empirical optimization tool for BLAS
is ATLAS (Automatically
Tuned Linear Algebra Software). In this problem, you will use ATLAS to
generate the BLAS libraries optimized for your system.
First, download and install ATLAS, following the documentation on
their website. Note
that if you are using a Windows machine, you will need to also
download and install Cygwin. After installing (and running) ATLAS, you
will have generated a static library which contains the BLAS
subroutines.
Now write a simple wrapper to perform MMM
on square matrices by calling the BLAS routine for matrix-matrix
multiply. Reference this
document to figure out how to call the generated routines (note:
you are looking for the BLAS operation that computes C =
alpha*A*B + beta*C without any constraints on A, B or C). For further
help, here
is a description of what the various arguments mean, and here
is a sample program that shows how to call a similar BLAS function.
Once you get your code working, run it for various matrix
sizes, and plot your results.
Problem 4 -- Write Your Own
Now you will try your hand at writing an efficient MMM kernel. Read
this paper
by Yotov et al. Focus on the technique of writing parameterized
MMM. Once you are comfortable with tiling and unrolling to efficiently
use registers and cache, write your own implementation of MMM. Find
the parameter values that give the best results in any way you
like, e.g. using a model or experimentation. Run your kernel for
various matrix sizes, and plot your results. Compare them to the
results of Problem 3.
If you were able to complete Problem 2
on the same machine that you completed the rest of the assignment, for
extra credit, comment on how the parameters you found relate to the
cache performance you noted in Problem 2.
Submission
For Problem 1, submit the cache information you discovered. For
Problems 2 and 3, submit your discussions along with the plots. For
Problem 4, submit your discussion, plots of data and your source code.