Load & Store Instructions

The Memory Hierarchy

So far, we have seen a bunch of RISC-V instructions that access the 32 registers, but we haven’t accessed memory yet. Registers are fine as long as your data fits in 31 64-bit values, but real software needs “bulk” storage, and that’s what memory is for.

In general, computer architects think of these different ways of storing data as tiers in an organization called the memory hierarchy. You can imagine an entire spectrum of different ways of storing data, all of which trade off between different goals:

Smaller memories that are closer to the processor and faster to access.
Larger memories that are farther from the processor and slower to access.

Registers are toward the first extreme: in 64-bit RISC-V, there is only a total of \(31 \times 8 = 248\) bytes of mutable storage, and it usually takes around 1 cycle (less than a nanosecond) to access a register.

Modern main memory is at the opposite extreme: even cheap phones have several gigabytes of main memory, and it typically takes hundreds of of cycles to access it.

You might reasonably ask: why not make the whole plane out of registers? There are two big answers to this question.

In real computers, these different memories are made out of different memory technologies. The physical details of how to construct memories are out of scope for CS 3410, but registers are universally made from transistors (like the flip-flops we built in class) and integrated with the processor, main memory is made of DRAM, a memory-specific technology that uses tiny little capacitors to store bits. DRAM requires different manufacturing processes than logic, is much cheaper per bit than integrated-with-logic storage, but it is also much slower.
There is a fundamental trade-off between capacity and latency. In any memory technology you can think of, building a larger memory makes it take longer to access.

Registers and main memory are two points in the memory-hierarchy spectrum. There are other points too: later in the semester, we will learn much more about caches, which fill in the space in between registers and main memory. You can also think of persistent storage (magnetic hard drives or flash memory SSDs) or even the Internet as further tiers beyond main memory.

Extension and Truncation

When we access memory, we will often need to change the size (the number of bits) of various values. For example, we’ll need to take an 8-bit value and treat it as a 64-bit value, and we’ll need to take a 64-bit value and treat it as a 32-bit value. When you increase the number of bits, that’s called extension, and when you decrease the size, that’s called truncation. The goal in both situations is to avoid losing information whenever possible: that is, to keep the same represented integer value when converting between sizes.

Truncation

Truncation from \(m\) bits to \(n\) bits works by extracting the lowest (least significant) \(n\) bits from the value. There is, sadly, no way to avoid losing information in some cases. Here are some examples:

Let’s truncate the 64-bit value 0x00000000000000ab to 32 bits. In decimal, this number has the value 171. Truncating to 32 bits yields 0x000000ab. That’s also 171. Awesome!
Let’s truncate 0xffffffffffffffab to 32 bits. That’s the value -85 in two’s complement. Truncating yields 0xffffffab. That’s still -85. Excellent!
Now let’s truncate the bits 0x80000000000000ab (note the 8 in the most-significant hex digit). That’s a really big negative value, because the leading bit is 1. Truncating yields 0x000000ab, which represents 171. That’s bad—we now have a different value. But losing some information is inevitable when you lose some bits.

Extension

There are two modes for extending from \(m\) bits to \(n\) bits. Both work by putting the value in the \(m\) least-significant bits of the \(n\)-bit output. The difference is in what we do with the extra \(n-m\) bits, which are the most-significant (upper) bits in the output.

Zero extension fills the upper bits with zeroes.
Sign extension fills them with copies of the most-significant bit in the input. (That is, the sign bit.)

Let’s see some examples.

Let’s zero-extend 0xffffffab (remember, that’s -85) to 64 bits. The result is 0x00000000ffffffab a pretty big positive number (4294967211 in decimal). So we didn’t preserve the value.
Now let’s sign-extend the same value. Because the most significant bit in the 32-bit input is 1, we fill in the upper 32 bits with 1s. The output is 0xffffffffffffffab in hex, or -85 in decimal. So we preserved the value!

The moral of the story is: when extending unsigned numbers, use zero extension; when extending signed numbers, use sign extension.

Load and Store Instructions

The 64-bit RISC-V instruction set gives you several instructions for loading from and storing to memory. They are very similar; the only difference is the size of the load or store: the number of bits we’re reading or writing.

Let’s start with ld and sd. The mnemonics use l and s for load and store, and the d means double word, which means they load/store 64 bits at a time.

The format looks like this:

ld rd, offset(rs1)
sd rs2, offset(rs1)

In both cases, the second operand is the address. This operand uses the funky-looking offset(rs1) syntax. This means “get the value from register rs1, and add the constant value offset to it; treat the result as the address.” The reason these instructions have a built-in constant offset is because it is so incredibly common for code to need to add a small constant value to an address before doing the access. If you don’t need this offset, you can always use 0 for the offset.

The ld instruction puts the value into rd. The sd instruction takes the value from rs2 and stores it to memory at the computed address.

Accessing Different Widths

The instruction set gives you several other load and store operations for different widths. Here is a non-exhaustive list:

ld and sd: Load or store a double word (64 bits).
lw, lwu, and sw: Load or store a word (32 bits).
lb, lbu, and sw: Load or store a byte (8 bits).

Recall that our registers are all 64 bits. So what happens when you use a smaller-width load or store?

When storing, you truncate (take the lowest \(n\) bits from the register).
When loading, you extend. The instruction tells you whether you zero-extend or sign-extend:
- The instructions with the u suffix are for unsigned numbers, and they zero-extend.
- The instructions without this suffix are for signed numbers, and they sign-extend.

So, for example, lb loads a single byte and sign-extends it to 64 bits to put it in a register. lbu does the same thing, but it zero-extends instead.

Example: Store Word, Load Byte

Consider this short program:

addi x11, x0, 0x49C
sw x11, 0(x5)
lb x12, 0(x5)

What is the value of x12 at the end?

As always, it helps to translate the assembly to pseudocode to understand it. Here’s one attempt:

x11 = 0x49c;
store_word(x11, x5);
x12 = load_byte(x5);

So we don’t know what address x5 holds, but that’s the memory address. We’re storing the value 0x49c as a word (32 bits) to that address, and then loading the byte at that address. Let’s look at the two steps:

First, we store the 64-bit value 0x49c. Since we use little endian, least-significant byte goes at the smallest address. Let’s say x5 holds the address \(a\). Then address \(a\) will hold the byte 0x9c, \(a+1\) holds the byte 0x04, and addresses \(a+2\) and \(a+3\) both hold zero.
Next, we load the byte at the same address. The load instruction gets the byte 0x9c, and it sign-extends it to 64 bits, so the final value is 0xffffffffffffff9c, or -100 in decimal if we interpret it as a signed number.

Example: Translating from C

How would you translate this C program to assembly?

void mystery(int* x, int* y) {
    *x = *y;
}

Assume (as is the case on our RISC-V target) that int is a 32-bit type. Assume also that the pointers x and y are stored in registers x3 and x5, respectively.

Here’s a reasonable translation:

lw x8, 0(x5)
sw x8, 0(x3)

Here are some salient observations about this code:

It makes sense that this is a load instruction followed by a store instruction, because we need to read the value at y and write it back to address x.
It also makes sense that we are using word-sized accesses (lw and sw) because that’s how you access 32 bits.
We use the signed version of the load (lw instead of lwu) to get sign-extension, not zero-extension. (If we used unsigned int instead, you would want lwu.)
The offset is zero in both instructions, because we want to use the addresses in x5 and x3 unmodified.

CS 3410