Group Project 5 - Cache Design

CS 3410 Fall 2018


Project Due: November 19, 2018 at 11:59pm

Please answer the questions and submit all code via CMS.

This is a partner project. You can ONLY work with your assigned partner. Failure to adhere to this rule will result in an Academic Integrity Violation.


Overview

In this assignment you will explore the effectiveness of caches and the impact of different cache configurations. To do this, you will write a C program that simulates the behavior of a cache. The simulator will be configurable, meaning that it can model the behavior of a variety of cache configurations. When exploring cache designs, a performance simulator like this is faster to work with than modeling the caches at the circuit level (for example, in Logisim). Your cache simulator will take as input a memory trace: a list of load and store instructions that were executed during a single run of a single program. You will use this single memory trace to evaluate the impact of cache size, associativity, block size, etc. on cache performance.

Deliverables

In addition to turning in your code on CMS, you will be asked to run 5 Experiments with your cache simulator. You will be asked to answer questions about these experiments on CMS. For tasks 5-8, you will be told:

  • The experiment(s) you need to run for that task.
  • The questions for each experiment (to be submitted via CMS quiz).

The Trace

The memory trace you will be using was generated by running a program called art and recording all of the loads and stores issued by the processor during that run. It is one of many benchmarks used by processor designers to measure the quality of their designs. (Normally you would not just look at a single benchmark, but for the purposes of this assignment, one is sufficient.) The trace file looks like this:

0 3019b6c8
0 300e20d8
0 3026e398
0 30192830
1 30192830
0 300049d0
...
A leading 0 indicates a load and a 1 indicates a store. The second number is the address of the memory operation (in hex). Taking a look at is_load and is_store variable in simulator.c will help you further understand the format of the trace file.

Assume that addresses are 32 bits for this entire project.

For our purposes a kilobyte (KB) is equivalent to a KiB, which is equal to 1024 (210) bytes. A megabyte (MB) is equivalent to an MiB, which is 1,048,576 (220) bytes. For example, 32KB is equal to 32*210B = 25*210B = 215B and 64MB is equal to 64*220B = 26*220B = 226B.

Tasks

We have broken up the assignment into the following tasks:

Task 1: Cache Tag/Index/Offset Calculations

A cache has three primary configuration parameters:

  • A: Associativity (number of ways per set)
  • B: Block size (size of a single block, a block is also referred to as a cache line)
  • C: Capacity or cache size (total amount of data in the cache)

These three parameters determine the number of sets and ways in the cache. These parameters also specify how a memory reference address is split into its three components: tag, index, and offset.

Paper Exercise: calculate the following values for a two-way set associative 32KB cache with 64B blocks (that is: A=2, B=64B, C=32KB):

  • Number of offset bits in the address
  • Number of sets in the cache
  • Total number of cache lines in the cache
  • Number of index bits in the address
  • Number of tag bits in the address

You do not need to turn this in, but for the sake of your understanding this assignment and your performance on Prelim 2, please take it seriously!

Task 2: Generalizing the Calculations

Write formulas in terms of A, B, and C for calculating the following quantities. You may find using log(A), log(B), and log(C) in the formulas easier than A, B, and C directly. For this assignment, you may assume A, B, and C are all powers of 2. Write the formulas for:

  • Number of offset bits in the address
  • Number of sets in the cache
  • Total number of cache lines in the cache
  • Number of index bits in the address
  • Number of tag bits in the address

Hint: you may find it helpful to first work out the numbers for a direct-mapped cache (that is, consider just B and C), and then extend it to set-associative caches (in which multiple cache lines reside in a single set).

Task 3: Initializing the Cache

Now use your formulas from Task 2 to complete 5 lines of the make_cache function in cache.c. You need to set the variables for n_offset_bit, n_set, n_total_cache_line, n_index_bit, and n_tag_bit.

Notice there is a log2 function in the included <math.h> library.

Because we're not actually manipulating the actual data values (as a real processor would), but just calculating hit/miss for each memory reference, there is no need to store data values in the cache (in fact, the trace doesn't even give you the data values).

The cache can be represented by a 2-dimensional array of tags. (The second dimension only becomes relevant when the associativity of the cache is higher than 1.) Finish the make_cache function by creating the array of tags that will be used to determine whether a cache hit or miss has occurred. You must complete and use the helper function make_2d_matrix. It will malloc an array, with each element holding a pointer to another malloced array to create an n_row by n_col matrix whose elements are of size size.

To avoid using a valid bit, we initialize all the tag values to zero; we assume that all blocks are valid and remain valid throughout the simulation. (We can make this simplification because we happen to know that the trace we're using doesn't have any addresses in the address range that would create an all-zero tag for the caches we simulate.)

Task 4: Extracting Bits from Addresses

Continue coding up your cache.c by completing the three methods used to extract the tag bits, index bits, and block address from any given address. The method names are as follows:

  • get_cache_tag()
  • get_cache_index()
  • get_cache_block_addr(): the original address with the offset bits zeroed out

For example, if your address were 0b111101010001 and there were 4 bits each in the tag, index, and offset, get_cache_tag(0b111101010001) returns 0b1111, get_cache_index(0b111101010001) returns 0b0101, and get_cache_block_addr(0b111101010001) returns 0b111101010000.

In decimal notation, the above example can be understood as: if your address were 3921 and there were 4 bits each in the tag, index, and offset,

  • get_cache_tag(3921) returns 15
  • get_cache_index(3921) returns 5
  • get_cache_block_addr(3921) returns 3920

IMPORTANT: Do not convert the integer values to character strings (which is slow and unnecessarily verbose). Integer bit manipulation instructions (shifts, bitwise and, or, not, etc.) can be used to easily and efficiently accomplish such bit extraction operations. Hint #1: to generate an integer value with all bits set to one, you can use ~0 (bit-wise not of zero). Hint #2: shifting your address in both directions may prove useful.

The autograder will call these three methods twice, beginning with the address 0b111101010001 and print out the values for these variables. Note, they will not match the above example because the tag, offset, and index bits will not each be 4 bits long (the addresses in the simulator are all 32 bits).

Completing the Access Method

The most challenging part of the assignment will be to write the access method access_cache, which performs the cache lookup for each load or store; it returns true if the access is a hit, false if the access results in a miss. Read the rest of the assignment and the code you have been given to correctly implement the access method. Do not try to implement the entire method at once. Instead, build on the method as each new concept is introduced below.

Task 5: Implementing a Direct-Mapped Cache to explore Hit Rate vs. Capacity

Complete the access method for a direct-mapped cache by following the steps described in the function's comment (you can ignore lru_way and dirty_bits for now). Your simulator can now simulate a direct-mapped cache parameterized by B and C. Let's test it out!

The first experiment is to gauge the impact of cache size on cache miss rate.

For now, handle loads and stores in the same fashion. In the next task you will set dirty_bits depending on whether or not it is a store. Additionally, if the tag is in cache_tags then it is a hit, regardless of if it was a load or a store

Use your cache simulator to produce cache miss rates for varying cache sizes. Generate the data for caches capacity from 256 bytes (28) to 4MB (222). Configure the block size to 64 bytes. Below are the first two commands you should run. The first sets the cache capacity to 28 = 256 bytes and the block size to 26 = 64 bytes. The second increases the cache capacity to 29 = 512 bytes.

./p5 -t art-full.trace -cache 8 6 1
./p5 -t art-full.trace -cache 9 6 1
Generate a line plot of this data. On the y-axis, plot the "cache miss rate (percent of all memory references)". The smaller the miss rate, the better. Use log of the cache size for x-axis..

Optional: Running this command 15 times can get tedious (especially if you don't use the up arrow). You can automate this process using python or bash scripting. Feel free to try it out. Here is an example of a bash for-loop that you can modify and then directly copy into the terminal, or save in some file myscript.sh and then run any time by running bash myscript.sh.

for i in `seq 1 10`;
do
  echo count: $i
done


Experiment 1

  • Associativity = 1
  • Block size = 64B
  • Capacity = 256B, 512B, 1KB, 2KB, ..., 4MB

Questions

  1. What’s the smallest capacity that makes the miss rate of direct-mapped cache less than 15%?
  2. What’s the smallest capacity that makes the miss rate of direct-mapped cache less than 5%?
  3. Ratio of Hit Rate: 32KB vs. 64KB

Task 6: Implementing Set Associativity to explore Hit Rate vs. Associativity

Before moving on: create a copy of your working cache up to this point. Run the following command cp cache.c cache_direct_mapped.c to copy your cache.c into a new file cache_direct_mapped.c. This will be to ensure that you can receive partial credit if your new changes break your direct-mapped-cache implementation. You will have the opportunity to submit both versions, but because your final cache.c will be a super-set of cache_direct_mapped.c, you are allowed to submit your final cache.c code for both.

Now modify your simulator to support n-way set associative caches. Each set in the cache will now need an array of "tag" entries (one for each way), and a single "LRU" (least recently used) bit used for cache replacement. An LRU update method has been provided for you. See the code to understand how it works and how you should call it. When replacing a block, replace the LRU entry (and don't forget to update the LRU bit to point to the other entry).

Generate miss rate data for the same block size and caches sizes as in the previous question, but simulate two-way set-associative caches. Plot this result with the original data collected for a direct-mapped cache (two lines on the graph: one for direct-mapped caches and one for the 2-way set-associative cache).


Experiment 2

  • Associativity = 2
  • Block size = 64B
  • Capacity = 256B, 512B, ..., 4MB

Experiment 3 (Optional)

  • Associativity = 4
  • Block size = 64B
  • Capacity = 256B, 512B, ..., 4MB

Questions

  1. What’s the smallest capacity that makes the miss rate of 2-way cache less than 15%?
  2. What’s the smallest capacity that makes the miss rate of 2-way cache less than 5%?
  3. How large must the direct-mapped cache before it equals or exceeds the performance of the 1 KB 2-way assoc?
  4. If two address reside in the same cache set, what do they have in common?
    1. Set index bits
    2. Tag bits
    3. Block index bits
    4. Valid bit
  5. Memory address m is 4-bit binary. Suppose we have a cache of capacity 8 bytes, blocksize = 2 bytes. For the following memory access pattern, will you prefer a direct-mapped cache or a 2-way cache design if you want a higher hit rate?
    • 0100
    • 1100
    • 0100
    • 1100
    1. Direct-mapped
    2. 2-way

Because the set-associative cache generally performs better than the direct-mapped cache, all subsequent experiments use the two-way set associative cache.

Task 7: Tracking bytes transferred to explore Write-Back vs. Write-Through Caches

Traffic into the cache is easy to calculate: it is just the number of misses multiplied by the block size because for each miss you must bring into the cache that entire block of memory. Traffic out of the cache is write traffic, and it depends on whether the cache is write-back or write-through. Your simulator can calculate the traffic for both write-back and write-through caches in a simulation as follows:

  • Write-through: For the write-through cache, the traffic is the average number of bytes written by each store multiplied by the number of stores. The trace doesn't specify how many bytes each store writes, but for your calculations assume that each store writes 4 bytes.
  • Write-back: For the write-back cache system, the traffic is the number of dirty evictions multiplied by the block size. Because at most one eviction happens per miss, this write-back traffic can at most double the overall traffic (in the case in which all evicted blocks were dirty). To calculate the number of dirty evictions, modify your simulator by adding a "dirty" bit for each block in the cache. Set the dirty bit whenever a store operation writes to the block. This dirty bit is not per word or per byte; it is one dirty bit for the entire cache line. (You'll need a dirty bit for each of the ways in the set; be sure to set the correct dirty bit.)

Compute and store the total traffic for both cache configurations in the cache_stats.c variables total_bytes_transferred_wb and total_bytes_transferred_wt.

For both cases, use a write-allocate policy. This implies that the number of misses for the write-through and write-back caches are the same, only the traffic is different.

For a two-way set-associative cache with 64-byte blocks, generate a graph plotting the same cache sizes as the previous question (on the x-axis) versus total data traffic per memory operation (total number of bytes transferred divided by the total number of memory references).

Plot two lines: (1) the write-through cache (miss fill traffic + write-through traffic) and (2) the write-back cache (miss fill traffic + write-back of dirty blocks).


Experiment 4

  • Associativity = 2
  • Block size = 64B
  • Capacity = 256B, 512B, ..., 4MB

Questions

  1. Which of the following plot of cache traffic (traffic out + traffic in) is the closest of the one you produced (both for write-back and write-through)?
    1. Graph A
    2. Graph B
    3. Graph C
    4. Graph D
  2. If we only plot the traffic out of the write-through cache, as capacity of the cache increases, how does the traffic change?
    1. Increase
    2. Decrease
    3. Stay the same
  3. If we only plot the traffic out of the write-back cache, as capacity of the cache increases, how does the traffic change?
    1. Increase
    2. Decrease
    3. Stay the same

Task 8: Hit Rate vs. Block size

Now let's explore the impact of different cache block sizes. Use the simulator to calculate the miss rate for a 32KB, two-way set associative, write-back cache with several different block sizes: 8B, 16B, 32B, 64B, 128B, 256B, and 512B blocks. Create two graphs from this data:

  • Miss Rate Graph. Plot the data on a graph with miss rate on the y-axis, log of the cache block size on the x-axis, and the miss rate plotted as a line.
  • Traffic Graph. Plot the data on a graph the total write-back traffic on the y-axis, log of the cache block size on the x-axis, and the total write-back traffic plotted as a line.

Experiment 5

  • Associativity = 2
  • Block size = 8B, 16B, 32B, ..., 512B
  • Capacity = 32KB

Questions

  1. What is the block size with the highest hit rate?
  2. What is the block size with the lowest total write-back traffic (transferred in + write-back transferred out)
  3. Based on the calculation, if a cache design uses a relatively large block size (say 2KB), which metric is the design trying to minimize?
    1. Miss rate
    2. Traffic
  4. Based on the calculation, if a cache design uses a relatively small block size (say 8B), which metric is the design trying to minimize?
    1. Miss rate
    2. Traffic

What to Turn In

Source Code. Submit your completed code via CMS under the P5 assignment.

  • cache.c
  • cache_direct_mapped.c
  • cache_stats.c

Answers. You will submit your answers to all of the above questions by completing the P5 Quiz on CMS.

Comments and Hints

Test Cases / Debugging. All of the data you plot and discuss should be generated from the art-full.trace input trace.

Makefile. A Makefile can be your good friend to help you compile your source code more conveniently. We provided the Makefile for you in this project. If you are interested reading more about Makefiles, Section F.6 in this chapter is a good read.

Sanity Test To help you debug your cache simulator, we've provided a sanity test. The way to use it is to run the sample input command and compare your result with the sample output command. Comparing your results to these sample outputs should help ensure your code is compiling properly and satisfies basic functionality. These debug runs are a completely optional portion of the assignment, but it is one way to make sure your are off to a good start.

Here are the three sample input command:

  • make
  • ./p5 -t art-full.trace -cache 16 4 2
  • ./p5 -t art-full.trace -cache 8 4 4
  • ./p5 -t art-full.trace -cache 24 2 8

The three sample outputs are as follows:

Expected output for command 1:
-----------------------------------------------------
Trace name                  art-full.trace
Cache size  = 65536B. Each block = 16B.
2-way set associative cache.
Tag = 17 bits, Index = 11 bits, Offset = 4 bits
There are 2048 sets total in the cache.
At this associativity, that means a total of 4096 cache lines.
Instruction Limit           none
art-full.trace
Processed 1957764 trace lines.
Num Instructions:           1957764
Cache Hit Rate = 72.69% (Miss Rate: 27.31%)
Total number of bytes transferred write-back: 9516240
Total number of bytes transferred write-thru: 9459396
Expected output for command 2:
-----------------------------------------------------
Trace name                  art-full.trace
Cache size  = 256B. Each block = 16B.
4-way set associative cache.
Tag = 26 bits, Index = 2 bits, Offset = 4 bits
There are 4 sets total in the cache.
At this associativity, that means a total of 16 cache lines.
Instruction Limit           none
art-full.trace
Processed 1957764 trace lines.
Num Instructions:           1957764
Cache Hit Rate = 69.18% (Miss Rate: 30.82%)
Total number of bytes transferred write-back: 10615264
Total number of bytes transferred write-thru: 10558244
Expected output for command 3:
-----------------------------------------------------
Trace name                  art-full.trace
Cache size  = 16777216B. Each block = 4B.
8-way set associative cache.
Tag = 11 bits, Index = 19 bits, Offset = 2 bits
There are 524288 sets total in the cache.
At this associativity, that means a total of 4194304 cache lines.
Instruction Limit           none
art-full.trace
Processed 1957764 trace lines.
Num Instructions:           1957764
Cache Hit Rate = 88.76% (Miss Rate: 11.24%)
Total number of bytes transferred write-back: 880208
Total number of bytes transferred write-thru: 1783732

You can check your output against the sample output for sanity purpose. To compare your output with ours systematically, we recommend using the command diff in the terminal. diff file1 file2 will output contents that are different between file1 and file2.

Acknowledgements. This assignment was originally written by Milo Martin at University of Pennsylvania and subsequently modified to include the coding infrastructure and an autograder by Anne Bracy at Washington University in St. Louis. The assignment code base was converted from Java to C by CS 3410 course staff at Cornell University to meet the needs of the class. The Makefile reading material used in this project is from Operating Systems: Three Easy Pieces authored by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau at University of Wisconsin-Madison.