Caches

The Memory Bottleneck

Remember our overview of computer architecture styles, where we assumed that each step in an instruction execution could happen in about one clock cycle? The assumption then was that it took about the same length of time to: fetch an instruction; decode it into control signals and access the register file; actually perform an arithmetic/logic operation like adding or multiplying two numbers; load or store to memory, if necessary; and write results back to the registers.

We can now tell you that this was a convenient fiction. While many of these stages do take about a cycle, there are important exceptions. For example, while it is easy to implement an integer addition circuit within one clock period (even at today’s multi-gigahertz clock frequencies), multiplication and division can often take several cycles. Think something like 3 to 15 cycles, depending on the complexity of the operation and the clock frequency.

But most importantly, accessing a computer’s memory is way slower than everything else. Loading or storing a single value to/from main memory takes hundreds of cycles on a modern computer. Because practical programs access memory every few instructions, this means that the performance of the memory system is an enormous factor in the performance of a computer system.

There are two big reasons why main memory is so slow: it is far away from the processor (both physically and metaphorically), and it uses a different physical technology. The result is that on-chip memory is fast, small, and expensive; off-chip (main) memory is slow, large, and cheap. For more on this fundamental trade-off, see our previous notes on the memory hierarchy.

SRAM vs. DRAM

One of the features of the memory hierarchy’s trade-off is a difference in manufacturing technology. Data storage on the CPU uses a technology called static RAM (SRAM), which is just built out of transistors—the same stuff that we make logic gates and registers out of. The ubiquitous technology for off-chip memory is dynamic RAM (DRAM). DRAM is a completely different technology that works by manufacturing arrays of tiny capacitors and periodically filling them with charge.

We already mentioned that SRAM is small, fast, and expensive while DRAM is large, slow, and cheap. But it’s worth dwelling for a moment on the sheer magnitude of the differences between the two.

  • Speed: Accessing a value in SRAM typically takes on the order of 0.5 nanoseconds, and in general, accessing any element in an SRAM is equally fast. In DRAM, accessing the first value in an array can take tens of nanoseconds; subsequently accessing nearby values can be faster.
  • Size: A typical size for an on-chip SRAM is on the order of 1 MB. Even an entry-level laptop in 2024 comes with 16 GB of DRAM.
  • Cost: A rough estimate for the cost of DRAM storage is $3 per GB. It’s hard to pin down a good estimate for the cost of SRAM alone, because it usually comes with logic, but a good ballpark is on the order of thousands of dollars per GB.

Because the trade-off is so extreme, it makes sense that computers would want to have some of each. An all-DRAM computer would be way too slow, and an all-SRAM computer would be way too expensive. Carefully combining memories of different speeds can have a huge impact on the cost/performance trade-off of a system.

Locality

This lecture is about caching, a technique that adds an intermediate-sized memory between registers and main memory. The idea is to build, out of SRAM, a place to put data that we access frequently. Then we’ll automatically transfer data from main memory (DRAM) to the cache (SRAM) so that most accesses, on average, can find their data in the cache.

To make this work, we will need a policy for automatically predicting which data is likely to be accessed frequently in the future. The key principle that caches will exploit is locality: a common pattern in real software where similar data tends to be accessed close together in time.

Computer architects distinguish between two different forms of locality. Both of them are assumptions about how “normal” programs are likely to behave:

  • Temporal locality: If a program accesses a given value, it is likely to need to access the same value again sometime soon.
  • Spatial locality: If a program accesses a given value, it is likely to access nearby values in memory (i.e., addresses that are numerically close to the original address) sometime soon.

To illustrate the difference, consider this program:

int sum(const int *a, int n) {
  int total = 0;
  for (int i = 0; i < n; i++) {
    total += a[i];
  }
  return total;
}

Let’s think about the accesses to total and a[i]. Do these accesses exhibit spatial or temporal locality?

  • The accesses to total have high temporal locality because we access the same variable (the same address in memory) on every iteration of the loop—i.e., separated by only a few instructions.
  • The a[i] accesses have high spatial locality because we are repeatedly, and close by in time, accessing nearby addresses in memory. When the program loads a[i], it will very soon load a[i+1], whose address is only 4 bytes away.

Locality is an extremely general principle. Maybe you can think a little bit about other situations in your life that seem to exhibit temporal or spatial locality. Common examples of mechanisms for exploiting locality in everyday life include refrigerators, backpacks, and laundry hampers.

Hits & Misses

The idea with a cache is to try to “intercept” most of a program’s memory accesses. A cache wants to fulfill as many loads and stores as it can directly, using its limited pool of fast SRAM. In the (hopefully rare) cases where it does not already have the data, it reluctantly forwards the request on to the larger, slower main memory.

In the presence of a cache, every memory access that a program executes is either a cache hit or a cache miss:

  • A hit happens when the data already exists in the cache, so we can fulfill the request quickly.
  • A miss is the other case: the data is not already in the cache, so we have to send the request on to DRAM.

A cache’s purpose in life is to maximize the hit rate (or, equivalently, minimize the miss rate).

A Hierarchy of Caches

A single cache is good, so multiple caches must be better! Remember, there is a fundamental trade-off between memory size and speed. So modern computers don’t just have one cache at a single point in this trade-off space; they use several different caches of different sizes (and therefore different speeds). These are layered into a hierarchy.

It is common for modern machines to have three levels of caching, called the L1, L2, and L3 caches. The L1 cache is closest to the processor, smallest, and fastest. It is not unheard of to tack on an L4 cache. There are diminishing returns eventually, so this doesn’t go on forever.

In the L1 cache, it is also common for computers to separate the data and the instructions into separate caches. The data and instructions coexist in main memory, so it is totally reasonable to have a single L1 cache for both. But it turns out that the locality patterns for accessing instructions and data are so different that, to maximize performance, computer architects have found it helpful to keep them separate. You will sometimes see these separate caches abbreviated as the L1I and L1D cache.

Direct-Mapped Cache

We have talked a lot about the goals of a cache; let’s finally talk about how caches work. We’ll start with a simple style of cache called a direct-mapped cache. In this kind of cache, every address in main memory is mapped to exactly one location in the cache.

Let’s say we have 64-bit memory addresses, and we have a cache that can store \(2^n \ll 2^{64}\) values. To state the obvious, it is impossible for every memory address to get its own entry in the cache! So we need some policy to map memory locations onto cache locations. In a direct-mapped cache, this is a many-to-one mapping.

Here’s the policy: we will split up the memory address, and we will use the least significant \(n\) bits of the address to determine the cache index, i.e., the location within the cache where this data will go. We have \(2^n\) cache locations, and there are \(2^n\) possible values of these \(n\) bits, so each value gets its own entry in the cache. We will then call the other \(64-n\) bits the tag; we will need these to disambiguate which address a given cache entry is currently holding.
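To make the split concrete, here is a minimal C sketch. The constant N is hypothetical, standing in for the \(n\) above (so the cache has \(2^N\) entries); real caches do this split in hardware, of course.

#include <stdint.h>

// Address split for a direct-mapped cache with 2^N entries.
// N is a hypothetical compile-time constant standing in for the n above.
#define N 10  // e.g., a 1024-entry cache

uint64_t cache_index(uint64_t addr) {
  return addr & ((1ULL << N) - 1);  // the least-significant N bits
}

uint64_t cache_tag(uint64_t addr) {
  return addr >> N;                 // the remaining 64 - N bits
}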

We’ll implement the hardware for our cache so that each of the entries has 3 values: the tag, a valid bit, and the actual data. Let’s visualize a tiny 4-entry (\(n=2\)) cache like this:

index | valid? | tag | data
------|--------|-----|-----
 00   |        |     |
 01   |        |     |
 10   |        |     |
 11   |        |     |

Here’s what these columns mean:

  • The index is literally just the index of the cache entry. (This never changes.)
  • The valid bit indicates whether that cache entry currently holds meaningful data at all. 0 means invalid (“don’t pay attention to this at all; nothing to see here”) and 1 means valid (“I am currently holding some cached data”). The invalid state is useful at program startup, when the cache doesn’t hold anything at all (all entries are invalid).
  • The tag is the other \(64-n\) bits of the address whose value the cache entry currently holds. That is, every cache entry could be holding data from any one of \(2^{64-n}\) different memory addresses; the tag tells us which one it currently is.
  • The data is the current value at that memory address. (This is the raison d’être of the cache!)

Now, to access a memory address \(a\), we’ll execute this algorithm:

  1. Split the address \(a\) into an index \(i\) (\(n\) bits) and a tag \(t\) (the other \(64-n\) bits).
  2. Look in entry \(i\) of the cache.
  3. Is the entry valid (is the valid bit 1)? If not, stop and go to main memory (this is a miss).
  4. Does the entry’s tag equal \(t\)? If not, stop and go to main memory (this is also a miss).
  5. The line is valid and the tag matches, so this is a hit. We can use the data from this cache entry and avoid going to main memory.
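Continuing the C sketch from above (same hypothetical N, cache_index, and cache_tag), the lookup might look like this, with one byte of data per entry for now:

#include <stdbool.h>
#include <stdint.h>

// One entry of the direct-mapped cache: valid bit, tag, and one byte of data.
struct entry {
  bool valid;
  uint64_t tag;
  uint8_t data;
};

struct entry cache[1ULL << N];  // 2^N entries

// Returns true on a hit (and writes the byte to *out); false on a miss,
// in which case the caller must go to main memory.
bool cache_load(uint64_t addr, uint8_t *out) {
  uint64_t i = cache_index(addr);  // step 1: split the address
  uint64_t t = cache_tag(addr);
  struct entry *e = &cache[i];     // step 2: look in entry i
  if (!e->valid) return false;     // step 3: invalid entry, so it's a miss
  if (e->tag != t) return false;   // step 4: tag mismatch, so it's also a miss
  *out = e->data;                  // step 5: hit!
  return true;
}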

Filling the Cache

On a cache miss, we need to fetch the value from main memory. (Let’s only consider loads for now; we’ll handle stores later.) Because this is slow, we want to avoid doing this again in the future. So, we want to do something called filling the cache entry. After fetching the data from main memory, do these things:

  1. Look in entry \(i\) of the cache (again).
  2. Is the entry valid? If so, there is already some data here, and we will take its place. This is called an eviction. (We will discuss more about what to do about evictions in the next section.)
  3. Set the valid bit to 1 (regardless of what it was before), to indicate that it contains real data now.
  4. Set the tag to \(t\), to disambiguate which data it holds.
  5. Set the data to the value we got from main memory.

This way, subsequent accesses to the same address will hit. This is the way that caches exploit temporal locality, i.e., nearby-in-time accesses to the same address.
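In the same sketch, the fill step might look like this (evictions here just overwrite whatever was in the entry):

// Fill entry i after a miss, once the data has arrived from main memory.
void cache_fill(uint64_t addr, uint8_t data_from_memory) {
  uint64_t i = cache_index(addr);
  uint64_t t = cache_tag(addr);
  // If cache[i].valid is already 1, the old contents are evicted here.
  cache[i].valid = true;             // the entry now holds real data
  cache[i].tag = t;                  // remember which address it caches
  cache[i].data = data_from_memory;  // the value we fetched from memory
}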

Example

To keep this example tractable, let’s pretend we only have 4-bit addresses (not 64). We’ll stick with a 4-entry cache, so the least-significant 2 bits are the index.

What happens when you execute this sequence of loads? Assume you start with an empty cache, where every entry is invalid. Label each access as a hit or a miss. Also, note each time an eviction occurs.

  • load 1100
  • load 1101
  • load 0100
  • load 1100

It can be helpful to draw out the four-column table above and update it after every access.

Larger Blocks

Our little cache is already pretty good at exploiting temporal locality, but we haven’t yet done anything about spatial locality. In our example above, when we access address 1100 and then immediately access 1101, both are misses even though the memory locations are “neighbors.” Under the hypothesis that many accesses in real applications will have spatial locality, we can extend the cache design to hit more often.

Here’s the idea. So far, every entry in our cache has only held a single memory address (and therefore only a single byte of data). Let’s generalize it to hold an entire block (a.k.a. line) of data, i.e., \(2^b\) bytes.

Before, we split the address into two pieces: the tag and the \(n\)-bit index. We will now split it into three. Listing from most-significant position to least-significant: the tag, the \(n\)-bit index, and the \(b\)-bit offset within the block.

You can visualize all of memory being broken up into \(2^b\)-byte blocks. The block is the unit of data that we will transfer to and from the cache. For example, when we fill data from main memory into the cache, we will fetch the entire \(2^b\)-byte block that contains the requested address \(a\) and put it into the cache. Now, loading a single byte brings in a bunch of neighbors—on the assumption that it’s likely that the program will soon need to access those neighbors.

The algorithm for accessing the cache remains the same; we just have to change the way we chunk up the address. And when we return data from the cache, we will use the least-significant \(b\) bits as an offset to decide which byte from the block to return.
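Here is the same kind of C sketch for the three-way split, with a hypothetical constant B standing in for \(b\), on top of the N from before:

// Three-way address split for a cache with 2^N entries and 2^B-byte blocks.
// B is a hypothetical constant standing in for the b above.
#define B 6  // e.g., 64-byte blocks

uint64_t block_offset(uint64_t addr) {
  return addr & ((1ULL << B) - 1);         // least-significant B bits
}

uint64_t block_index(uint64_t addr) {
  return (addr >> B) & ((1ULL << N) - 1);  // the next N bits
}

uint64_t block_tag(uint64_t addr) {
  return addr >> (B + N);                  // the remaining 64 - N - B bits
}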

Example

Let’s return to our 4-byte cache from above. Let’s keep the design using 4 entries, but let’s make every entry store a 2-byte block instead of a single byte. That means our little 4-bit addresses now consist of 1 tag bit, 2 index bits, and 1 offset bit.

If you visualize this cache as a table, it looks exactly the same:

index | valid? | tag | data
------|--------|-----|-----
 00   |        |     |
 01   |        |     |
 10   |        |     |
 11   |        |     |

The big difference now is that the “data” column stores 2-byte blocks. (The tag column now only stores 1 bit.)

Try simulating the same sequence of accesses again. Label the hits and misses:

  • load 1100
  • load 1101
  • load 0100
  • load 1100

Keeping Comparisons Fair

In this example, we cheated a bit: by doubling the size of the blocks, we double the total size of the cache. This means the cache is twice as big and twice as expensive. To make a fair comparison between two cache designs, you’ll want to keep the total number of bytes the same. So if you double the block size, you should halve the number of entries.
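To see why, note that the total data capacity is the number of entries times the block size:

\[ \text{capacity} = 2^n \times 2^b = 2^{n+b} \text{ bytes} \]

so adding one offset bit (doubling the block size) while removing one index bit (halving the number of entries) keeps the capacity unchanged.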

Handling Stores

So far, we have only talked about loads (reads from memory). What about stores?

Writing to a cache works mostly the same as reading, except that we have a few choices to make.

  • When we store to a block that is not already in the cache (a store miss), should we fill it (bring the block into the cache), or just send the write to memory? Bringing the block in is called a write-allocate policy. Write-allocate caches make the (very reasonable) hypothesis that programs that write a given memory location are likely to read it again in the near future.
  • When we store, should we just update the data in the cache, or should we also immediately send it to memory? The “immediately send all stores to main memory” policy is called write-through and it’s pretty simple. The other policy, where we just update the cache, is called write-back and it’s slightly more complicated.

The rest of this section will be about write-back caches. The write-back policy is a good idea in general because it means that you can avoid a lot of costly stores to main memory. It’s extremely popular for this reason. But it requires extra bookkeeping to deal with the fact that main memory and the cache can get “out of sync.”

Here’s the idea for keeping the cache and main memory in sync. We will add yet another value to our cache entries (another column in our table): the dirty bit. A cache entry is clean when it is in sync with main memory and dirty when it might disagree with main memory. Here’s how you can visualize the write-back cache:

index | valid? | dirty? | tag | data
------|--------|--------|-----|-----
 00   |        |        |     |
 01   |        |        |     |
 10   |        |        |     |
 11   |        |        |     |

We will need to add these details to our algorithm for accessing the cache:

  • When you fill a cache entry, initially set its dirty bit to 0. (The entry currently agrees with main memory.)
  • Whenever you store to an entry in the cache, set its dirty bit to 1. (We are avoiding writing to main memory, so now a disagreement is possible.)
  • Whenever you evict an entry from the cache, check its dirty bit. If the entry is clean, do nothing. If it’s dirty, write the data back to main memory then.
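Putting these rules together, here is a rough C sketch of the store path for a write-back, write-allocate version of our direct-mapped cache. It continues the earlier sketch (same hypothetical N, cache_index, and cache_tag), still with one byte of data per entry; memory_read and memory_write are stand-ins for talking to main memory, not real library calls.

// A write-back entry adds a dirty bit.
struct wb_entry {
  bool valid;
  bool dirty;
  uint64_t tag;
  uint8_t data;
};

struct wb_entry wb_cache[1ULL << N];

uint8_t memory_read(uint64_t addr);              // stand-ins for main memory
void memory_write(uint64_t addr, uint8_t value);

void cache_store(uint64_t addr, uint8_t value) {
  uint64_t i = cache_index(addr);
  uint64_t t = cache_tag(addr);
  struct wb_entry *e = &wb_cache[i];

  if (!e->valid || e->tag != t) {                // store miss: write-allocate
    if (e->valid && e->dirty) {
      // Evicting a dirty entry: write the old data back to memory first.
      uint64_t old_addr = (e->tag << N) | i;
      memory_write(old_addr, e->data);
    }
    // Fill from main memory. (With one-byte entries this fill is redundant,
    // but with real multi-byte blocks it brings in the rest of the block.)
    e->data = memory_read(addr);
    e->tag = t;
    e->valid = true;
    e->dirty = false;                            // freshly filled entries start clean
  }

  e->data = value;                               // update only the cache...
  e->dirty = true;                               // ...so memory may now be stale
}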

Example

Let’s try out a write-back policy with this sequence of accesses. Use our cache setup with 2-byte blocks as above.

  • load 1100
  • store 1101
  • load 0100
  • load 1100

Fully Associative Cache

All the caches we’ve seen so far have been direct-mapped: every block in main memory has exactly one cache entry where it might live. You may have noticed that these caches have a lot of evictions. Even when there is theoretically plenty of space in the cache, the fact that every block has only one option for where to live means that conflicts on these entries seem to happen all the time.

The opposite style of cache is a fully associative cache, where any memory address could use any entry in the cache. The index is no longer relevant at all; every cache entry could hold any address. When we divide up the address, you no longer take \(n\) bits for the index; the remaining \(64-b\) bits form one gigantic tag.

We will also change the cache-access algorithm. Where the direct-mapped algorithm says “look at entry \(i\),” the fully associative version must look at every single entry in the cache, because the block we’re interested in might be anywhere.
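As a sketch, a fully associative lookup might look like this in C (reusing the hypothetical B-bit block offset from above). The loop over every entry is exactly the part that makes this style expensive in hardware, as discussed below.

#define NUM_FA_ENTRIES 4  // a tiny, hypothetical fully associative cache

struct fa_entry {
  bool valid;
  uint64_t tag;
  uint8_t data[1 << B];   // one whole 2^B-byte block per entry
};

struct fa_entry fa_cache[NUM_FA_ENTRIES];

bool fa_load(uint64_t addr, uint8_t *out) {
  uint64_t t = addr >> B;                      // everything above the offset is the tag
  uint64_t off = addr & ((1ULL << B) - 1);     // offset within the block
  for (int e = 0; e < NUM_FA_ENTRIES; e++) {   // the block could be in any entry
    if (fa_cache[e].valid && fa_cache[e].tag == t) {
      *out = fa_cache[e].data[off];            // hit
      return true;
    }
  }
  return false;                                // miss: go to main memory
}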

Example

Let’s return to our 4-entry cache (with 2-byte blocks). In a fully associative version, because the indices are irrelevant, we can visualize it this way:

valid? | tag | data
-------|-----|-----
       |     |
       |     |
       |     |
       |     |

There are 4 entries, all created equal, and they all might hold any address in all of memory. Let’s try the same sequence of loads again. Label the hits and misses:

  • load 1100
  • load 1101
  • load 0100
  • load 1100

Replacement Policies

When you fill a block in a direct-mapped cache, there is only one choice of which existing block you should evict: the one that is in the (unique) entry where the block must live. In a fully associative cache, when the cache is full, you are now faced with a choice: which of the entries in the entire cache should we evict? An engineer designing a cache must decide on a replacement policy to answer this question.

There is an entire world of science dedicated to inventing cool eviction policies. The goal is to guess which block is least likely to be used again in the near future. And critically, it must make this decision efficiently—you can’t spend a lot of time thinking about which block to evict.

Some popular options include:

  • Least-recently used (LRU): Keep track of the last time each block was accessed, and evict the one that was used longest ago. The hypothesis is that the longer a program goes without accessing a given block, the less likely it is to access it again soon. Unfortunately, LRU has a lot of overhead because you have to keep track of some kind of timestamp on every single block.
  • Not most-recently used (NMRU): Like LRU, but only keep track of the most recently accessed block. When it comes time to evict, randomly pick some block that is not the most recent one you accessed. This makes somewhat worse decisions than LRU, but it’s a lot cheaper to implement and is popular for this reason.
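As a sketch of what LRU bookkeeping might look like, suppose each entry of the fully associative cache above also stored a (hypothetical) last_used timestamp that gets bumped on every access. Picking a victim then amounts to a scan:

struct lru_entry {
  bool valid;
  uint64_t tag;
  uint64_t last_used;  // hypothetical timestamp, updated on every access
};

struct lru_entry lru_cache[NUM_FA_ENTRIES];

// Choose which entry to evict: any invalid entry is free for the taking;
// otherwise, pick the one with the oldest (smallest) timestamp.
int choose_victim(void) {
  int victim = 0;
  for (int e = 0; e < NUM_FA_ENTRIES; e++) {
    if (!lru_cache[e].valid) return e;
    if (lru_cache[e].last_used < lru_cache[victim].last_used) victim = e;
  }
  return victim;
}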

The Costs of Associativity

Associativity is great! It leads to far fewer evictions. The problem is that it’s costly to implement in hardware. Because any block could go in any entry, we have to check all entries on every access to the cache. The hardware structure for implementing this “search all entries” operation is called a content-addressable memory (CAM). Because of the “search everywhere” nature of this operation, CAMs are expensive: large, hot, and slow. The cost scales with the number of entries, so it is only really practical to build fully associative caches when they are very small.

Set-Associative Cache

The final cache design we’ll consider strikes a balance between the direct-mapped and fully-associative extremes. A given address may live in exactly one entry in a direct-mapped cache; it may go in any entry in a fully associative cache; in a set-associative cache, it may live in one of a small number of entries grouped together into a set.

Let the number of entries in a set be \(2^k\). In caching terminology, our cache has \(2^k\) ways. If there are \(2^n\) total entries in our cache, then there are \(\frac{2^n}{2^k} = 2^{n - k}\) sets. You can think of direct-mapped caches and fully associative caches as special cases:

  • Direct-mapped: \(k = 0\), so there is only 1 way. There are \(2^n\) sets with a single block each.
  • Fully associative: \(k = n\), so it’s a \(2^n\)-way cache with only 1 (giant) set.

The usual way to visualize a set-associative cache is with a 2D grid of entries: one row per set, one column per way. Returning to our 4-entry cache with 2-byte blocks, we can make a visualization by copying and pasting two two-entry tables side by side:

      |        way 0        |        way 1
index | valid? | tag | data | valid? | tag | data
------|--------|-----|------|--------|-----|-----
  0   |        |     |      |        |     |
  1   |        |     |      |        |     |

There are still 4 entries in this cache; they are now just grouped into sets of 2. This also means that the number of index bits goes from \(n\) to \(n-k\) (in this case, from 2 to 1) and the tags get correspondingly larger.

Let’s again update the algorithm for accessing the cache. After calculating the index, we now have to look at the entire set at that index. That means searching through all the ways (columns in our grid) associated with the index. And when we fill the cache after a miss, we need to choose which way within the set to evict using a replacement policy, just like in a fully associative cache.
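Here is a C sketch of the set-associative lookup, with a hypothetical constant K standing in for \(k\) (so there are \(2^K\) ways and \(2^{N-K}\) sets), built on the same N and B as before:

#define K 1                          // 2^K ways; K = 1 means a 2-way cache
#define NUM_SETS (1ULL << (N - K))   // 2^(n-k) sets

struct sa_entry {
  bool valid;
  uint64_t tag;
  uint8_t data[1 << B];
};

struct sa_entry sets[NUM_SETS][1 << K];

bool sa_load(uint64_t addr, uint8_t *out) {
  uint64_t off = addr & ((1ULL << B) - 1);
  uint64_t set = (addr >> B) & (NUM_SETS - 1);  // n-k index bits pick the set
  uint64_t t = addr >> (B + (N - K));           // everything else is the tag
  for (int way = 0; way < (1 << K); way++) {    // search every way in the set
    if (sets[set][way].valid && sets[set][way].tag == t) {
      *out = sets[set][way].data[off];          // hit
      return true;
    }
  }
  return false;  // miss: fill one way in this set, chosen by the replacement policy
}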

Example

Once again, let’s simulate the same series of accesses on our machine with 4-bit addresses. This time, we will use a 4-entry, 2-way set associative cache, with a block size of 2. Use an LRU replacement policy. Here’s the sequence of loads again:

  • load 1100
  • load 1101
  • load 0100
  • load 1100

Understanding Cache Performance

With so many choices about how to design a cache, it can be useful to understand how well your cache is performing on average. You can characterize the overall performance by computing the average memory access time for the entire memory system. The average access time is:

\[ t_{\text{avg}} = t_{\text{hit}} + r_{\text{miss}} \times t_{\text{miss}} \]

Where:

  • \(t_{\text{hit}}\) is the time it takes to access the cache. Cache hits take exactly this amount of time; cache misses take this time to check the cache and then more time to go to main memory.
  • \(t_{\text{miss}}\) is the time it takes to access main memory.
  • \(r_{\text{miss}}\) is the miss rate: the fraction of accesses that are misses.

For example, if it takes 1 ns to access the cache and 50 ns to access main memory, and 95% of accesses hit, then the average access time is \(1 + 0.05 \times 50 = 3.5\) ns.

You can also extend this reasoning to multi-level cache hierarchies. Say you have an L1 cache and an L2 cache. From the perspective of the L1 cache, \(t_{\text{miss}}\) is the time it takes to access the rest of the cache hierarchy, i.e., to try accessing at L2. So you can calculate the average access time at the L2 cache and then use this average time as \(t_{\text{miss}}\) in the L1 access time calculation.
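For example (with made-up numbers), suppose the L1 hits in 1 ns and misses 10% of the time, the L2 takes 10 ns to check and misses 20% of the time, and main memory takes 100 ns. Working from the bottom up:

\[ t_{\text{avg,L2}} = 10 + 0.2 \times 100 = 30 \text{ ns} \]

\[ t_{\text{avg,L1}} = 1 + 0.1 \times 30 = 4 \text{ ns} \]

so the average access time seen by the processor is 4 ns.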

Three Categories of Misses

To understand the performance of some code (or of a cache design), you often want to pay attention to the cache misses. They can often be the slowest part of the program. It can also be useful to break down the misses by why they missed.

The 3 classic categories conveniently all start with the letter C:

  • Cold or compulsory misses happen because this is the first access to the given cache line.
  • Conflict misses happen because the associativity is too low, and too many lines competed for the same set and evicted a line that the program needed later on.
  • Capacity misses happen because the entire cache is too small for the working set, and no amount of associativity could have helped.

Here’s an algorithm you can use to decide which category a miss belongs to:

  • Was this cache line ever loaded before?
    • If no: it’s a cold miss.
    • If yes: Would this access have missed in a fully associative cache?
      • If no: it’s a conflict miss.
      • If yes: it’s a capacity miss.
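If you already know those two facts about an access (say, from a cache simulator), the decision tree is easy to encode. Here is a minimal sketch; the two boolean inputs are hypothetical values that a simulator would have to supply.

#include <stdbool.h>

enum miss_kind { COLD_MISS, CONFLICT_MISS, CAPACITY_MISS };

// Classify a miss given (1) whether this line was ever loaded before and
// (2) whether the access would still miss in a fully associative cache.
enum miss_kind classify_miss(bool loaded_before, bool misses_fully_associative) {
  if (!loaded_before) return COLD_MISS;                 // first access to this line
  if (!misses_fully_associative) return CONFLICT_MISS;  // full associativity would have hit
  return CAPACITY_MISS;                                 // nothing would have helped
}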