RISC-V: Data Memory & Control Flow
The Memory Hierarchy
So far, we have seen a bunch of RISC-V instructions that access the 32 registers, but we haven’t accessed memory yet. Registers are fine as long as your data fits in 31 64-bit values, but real software needs “bulk” storage, and that’s what memory is for.
In general, computer architects think of these different ways of storing data as tiers in an organization called the memory hierarchy. You can imagine an entire spectrum of different ways of storing data, all of which trade off between different goals:
- Smaller memories that are closer to the processor and faster to access.
- Larger memories that are farther from the processor and slower to access.
Registers are toward the first extreme: in 64-bit RISC-V, there are only 31 × 8 = 248 bytes of mutable register storage in total, and it usually takes around 1 cycle (less than a nanosecond) to access a register.
Modern main memory is at the opposite extreme: even cheap phones have several gigabytes of main memory, and it typically takes hundreds of cycles (hundreds of nanoseconds) to access it.
You might reasonably ask: why not make the whole plane out of registers? There are two big answers to this question.
- In real computers, these different memories are made out of different memory technologies. The physical details of how to construct memories are out of scope for CS 3410, but registers are universally made from transistors (like the flip-flops we built in class) and integrated with the processor. Main memory is made of DRAM, a memory-specific technology that uses tiny capacitors to store bits. DRAM requires different manufacturing processes than logic and is much cheaper per bit than integrated-with-logic storage, but it is also much slower.
- There is a fundamental trade-off between capacity and latency. In any memory technology you can think of, building a larger memory makes it take longer to access.
Registers and main memory are two points in the memory-hierarchy spectrum. There are other points too: later in the semester, we will learn much more about caches, which fill in the space in between registers and main memory. You can also think of persistent storage (magnetic hard drives or flash memory SSDs) or even the Internet as further tiers beyond main memory.
Extension and Truncation
When we access memory, we will often need to change the size (the number of bits) of various values. For example, we’ll need to take an 8-bit value and treat it as a 64-bit value, and we’ll need to take a 64-bit value and treat it as a 32-bit value. When you increase the number of bits, that’s called extension, and when you decrease the size, that’s called truncation. The goal in both situations is to avoid losing information whenever possible: that is, to keep the same represented integer value when converting between sizes.
Truncation
Truncation from m bits to n bits works by extracting the lowest (least significant) n bits from the value. There is, sadly, no way to avoid losing information in some cases. Here are some examples:
- Let’s truncate the 64-bit value `0x00000000000000ab` to 32 bits. In decimal, this number has the value 171. Truncating to 32 bits yields `0x000000ab`. That’s also 171. Awesome!
- Let’s truncate `0xffffffffffffffab` to 32 bits. That’s the value -85 in two’s complement. Truncating yields `0xffffffab`. That’s still -85. Excellent!
- Now let’s truncate the bits `0x80000000000000ab` (note the 8 in the most-significant hex digit). That’s a really big negative value, because the leading bit is 1. Truncating yields `0x000000ab`, which represents 171. That’s bad: we now have a different value. But losing some information is inevitable when you lose some bits.
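If you want to check these examples yourself, C’s narrowing casts perform exactly this kind of truncation. Here is a quick sketch (the function name is ours; strictly speaking, out-of-range signed conversions are implementation-defined in C, but mainstream compilers keep the low bits as shown):

```c
#include <stdint.h>

// Truncate a 64-bit value to 32 bits: keep bits [31:0], discard bits [63:32].
// A cast to a narrower integer type does exactly this on typical compilers.
int32_t truncate32(int64_t x) {
    return (int32_t)x;
}
```

Running it on the three examples above reproduces 171, -85, and 171, matching the hand calculations.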
Extension
There are two modes for extending from m bits to n bits. Both work by putting the value in the m least-significant bits of the n-bit output. The difference is in what we do with the extra n−m bits, which are the most-significant (upper) bits in the output.
- Zero extension fills the upper bits with zeroes.
- Sign extension fills them with copies of the most-significant bit in the input. (That is, the sign bit.)
Let’s see some examples.
- Let’s zero-extend `0xffffffab` (remember, that’s -85) to 64 bits. The result is `0x00000000ffffffab`, a pretty big positive number (4294967211 in decimal). So we didn’t preserve the value.
- Now let’s sign-extend the same value. Because the most significant bit in the 32-bit input is 1, we fill in the upper 32 bits with 1s. The output is `0xffffffffffffffab` in hex, or -85 in decimal. So we preserved the value!
The moral of the story is: when extending unsigned numbers, use zero extension; when extending signed numbers, use sign extension.
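You can mimic both extension modes in C, where widening an unsigned value zero-extends and widening a signed value sign-extends. A sketch, assuming a typical two’s-complement compiler (the function names are made up for illustration):

```c
#include <stdint.h>

// Zero extension: treat the 32-bit input as unsigned; the upper 32 bits
// of the result are filled with 0s.
uint64_t zero_extend32(uint32_t x) {
    return (uint64_t)x;
}

// Sign extension: treat the 32-bit input as signed; the upper 32 bits
// of the result are copies of bit 31 (the sign bit).
uint64_t sign_extend32(uint32_t x) {
    return (uint64_t)(int64_t)(int32_t)x;
}
```

On the example above, `zero_extend32(0xffffffab)` gives `0x00000000ffffffab` while `sign_extend32(0xffffffab)` gives `0xffffffffffffffab`, i.e., -85 again.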
Load and Store Instructions
The 64-bit RISC-V instruction set gives you several instructions for loading from and storing to memory. They are very similar; the only difference is the size of the load or store: the number of bits we’re reading or writing.
Let’s start with `ld` and `sd`. The mnemonics use `l` and `s` for load and store, and the `d` means double word, meaning they load/store 64 bits at a time.
The format looks like this:
ld rd, offset(rs1)
sd rs2, offset(rs1)
In both cases, the second operand is the address. This operand uses the funky-looking `offset(rs1)` syntax, which means: get the value from register `rs1`, add the constant value `offset` to it, and treat the result as the address. These instructions have a built-in constant offset because it is so incredibly common for code to need to add a small constant value to an address before doing the access. If you don’t need this, you can always use 0 for the offset.

The `ld` instruction puts the loaded value into `rd`. The `sd` instruction takes the value from `rs2` and stores it to memory at the computed address.
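If it helps, here is a rough C analogue of what `ld` and `sd` compute (the function names are invented for illustration; real hardware obviously doesn’t call `memcpy`):

```c
#include <stdint.h>
#include <string.h>

// Analogue of `ld rd, offset(rs1)`: compute base + offset, read 8 bytes there.
uint64_t load_doubleword(const uint8_t *base, int64_t offset) {
    uint64_t value;
    memcpy(&value, base + offset, sizeof value);
    return value;
}

// Analogue of `sd rs2, offset(rs1)`: compute base + offset, write 8 bytes there.
void store_doubleword(uint8_t *base, int64_t offset, uint64_t value) {
    memcpy(base + offset, &value, sizeof value);
}
```

A store followed by a load at the same `base` and `offset` round-trips the value, just as `sd` followed by `ld` does.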
Accessing Different Widths
The instruction set gives you several other load and store operations for different widths. Here is a non-exhaustive list:

- `ld` and `sd`: Load or store a double word (64 bits).
- `lw`, `lwu`, and `sw`: Load or store a word (32 bits).
- `lb`, `lbu`, and `sb`: Load or store a byte (8 bits).
Recall that our registers are all 64 bits. So what happens when you use a smaller-width load or store?
- When storing, you truncate (take the lowest n bits from the register).
- When loading, you extend. The instruction tells you whether to zero-extend or sign-extend:
  - The instructions with the `u` suffix are for unsigned numbers, and they zero-extend.
  - The instructions without this suffix are for signed numbers, and they sign-extend.

So, for example, `lb` loads a single byte and sign-extends it to 64 bits to put it in a register. `lbu` does the same thing, but it zero-extends instead.
Example: Store Word, Load Byte
Consider this short program:
addi x11, x0, 0x49C
sw x11, 0(x5)
lb x12, 0(x5)
What is the value of `x12` at the end?
As always, it helps to translate the assembly to pseudocode to understand it. Here’s one attempt:
x11 = 0x49c;
store_word(x11, x5);
x12 = load_byte(x5);
So we don’t know what address `x5` holds, but that’s the memory address. We’re storing the value `0x49c` as a word (32 bits) to that address, and then loading the byte at that address.
Let’s look at the two steps:

- First, the `sw` instruction truncates the 64-bit register value `0x49c` to the 32-bit word `0x0000049c` and stores it. Since RISC-V is little endian, the least-significant byte goes at the smallest address. Let’s say `x5` holds the address a. Then address a will hold the byte `0x9c`, a+1 holds the byte `0x04`, and addresses a+2 and a+3 both hold zero.
- Next, we load the byte at the same address. The load instruction gets the byte `0x9c`, and it sign-extends it to 64 bits, so the final value is `0xffffffffffffff9c`, or -100 in decimal if we interpret it as a signed number.
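Here is the same experiment as a C sketch you can run (the helper names are invented; it assumes a little-endian host, matching RISC-V):

```c
#include <stdint.h>
#include <string.h>

// Store a 32-bit word into a little buffer (like sw), then load the byte at
// the lowest address and sign-extend it to 64 bits (like lb).
int64_t store_word_load_byte(uint32_t word) {
    uint8_t mem[4];
    memcpy(mem, &word, sizeof word);   // sw: low byte lands at mem[0] on a
                                       // little-endian host
    return (int64_t)(int8_t)mem[0];    // lb: sign-extend the byte
}

// The lbu variant: same load, but zero-extended.
uint64_t store_word_load_byte_unsigned(uint32_t word) {
    uint8_t mem[4];
    memcpy(mem, &word, sizeof word);
    return (uint64_t)mem[0];           // lbu: zero-extend the byte
}
```

With the input `0x49c`, the signed load produces -100, exactly as in the worked example, while the `lbu`-style load would produce `0x9c` (156).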
Example: Translating from C
How would you translate this C program to assembly?
void mystery(int* x, int* y) {
*x = *y;
}
Assume (as is the case on our RISC-V target) that `int` is a 32-bit type. Assume also that the pointers `x` and `y` are stored in registers `x3` and `x5`, respectively.
Here’s a reasonable translation:
lw x8, 0(x5)
sw x8, 0(x3)
Here are some salient observations about this code:

- It makes sense that this is a load instruction followed by a store instruction, because we need to read the value at address `y` and write it to address `x`.
- It also makes sense that we are using word-sized accesses (`lw` and `sw`), because that’s how you access 32 bits.
- We use the signed version of the load (`lw` instead of `lwu`) to get sign extension, not zero extension. (If the type were `unsigned int` instead, you would want `lwu`.)
- The offset is zero in both instructions, because we want to use the addresses in `x5` and `x3` unmodified.
Control Flow in Assembly
So far, all the assembly programs we’ve written have been straight-line code, in the sense that they always run one instruction after the other.
That’s like writing C without any control flow: no `if`, `for`, `while`, etc.
The remainder of this lecture is about the instructions that exist in RISC-V to
implement control-flow constructs.
Branch If Equal
For most instructions, when the processor is done running that instruction, it proceeds onto the next instruction (incrementing the program counter by 4 on RISC-V, because every instruction is 4 bytes).
A branch instruction is one that can choose whether to do that or to execute some other instruction of your choosing instead.
One example is the `beq` instruction, which means branch if equal:
beq rs1, rs2, label
The first two operands are registers, and `beq` checks whether their values are equal. The third operand is a label, which we’ll look at more closely in a moment; it refers to some other instruction. Then:

- If the two registers hold equal values, go to the instruction at `label`.
- If they’re not equal, just go to the next instruction (add 4 to the PC) as usual.
Labels appear in your assembly code like this:
my_great_label:
That is, just pick a name and put a `:` after it. This labels a specific instruction so that a branch can refer to it.
Here’s an example:
beq x1, x2, some_label
addi x3, x3, 42
some_label:
addi x3, x3, 27
This program checks whether `x1 == x2`. If so, it immediately executes the last instruction, skipping the second instruction. Otherwise, it runs all 3 instructions in this listing in order (it adds 42 and then adds 27 to `x3`).
In other words, you can imagine this assembly code implementing an if
statement in C:
if (x1 != x2) {
x3 += 42;
}
x3 += 27;
Labels in Machine Code
As shown above, in assembly code we can define labels like `my_great_label:` by simply picking a name and putting a `:` after it.
However, these labels are symbolic and appear only in assembly code, not machine code. When assembling the machine code, the assembler converts each label into a signed offset. If the branch is taken, this offset is added to the program counter (PC) of the branch instruction to compute the address of the next instruction to execute.
For example, consider the assembly program from the previous section annotated with the memory address (in instruction memory) of each instruction:
0: beq x1, x2, some_label
4: addi x3, x3, 42
some_label:
8: addi x3, x3, 27
The assembler would remove the label `some_label:` and replace each occurrence with the appropriate offset:
0: beq x1, x2, 8
4: addi x3, x3, 42
8: addi x3, x3, 27
When writing assembly code by hand, use labels! Labels exist largely to make it easier (or possible) for programmers to read and write assembly code by hand. Replacing labels with offsets is a job better left to the assembler.
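The arithmetic the assembler does for each branch is tiny: subtract the branch instruction’s own address from the target’s address. A sketch (the function name is ours, for illustration):

```c
#include <stdint.h>

// Compute the signed byte offset the assembler encodes into a branch:
// the distance from the branch instruction's address to the target's address.
int64_t branch_offset(uint64_t branch_addr, uint64_t target_addr) {
    return (int64_t)(target_addr - branch_addr);
}
```

For the listing above, the `beq` at address 0 targeting `some_label` at address 8 gets offset 8; a backward branch would get a negative offset.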
Other Branches and Jumps
You should read the RISC-V spec for an exhaustive list of the branch instructions it supports. Here are a few, beyond `beq`:

- `bne rs1, rs2, label`: Branch if the registers are not equal.
- `blt rs1, rs2, label`: Branch if `rs1` is less than `rs2`, treated as signed (two’s complement) integers.
- `bge rs1, rs2, label`: Like `blt`, but branch if `rs1` is greater than or equal to `rs2`.
- `bltu` and `bgeu` are similar, but they do unsigned integer comparisons.
You will also encounter unconditional jumps, written `j label`. Unlike branches, `j` doesn’t check a condition; it always immediately transfers control to the label.
Implementing Loops
We have already seen how branches in assembly can implement the `if` control-flow construct. They are also all you need to implement loops, like the `for` and `while` constructs in C. We’ll see a worked example in this section.
Consider this loop that sums the values in an array:
int sum = 0;
for (int i = 0; i < 20; i++) {
sum += A[i];
}
And imagine that `A` is declared as an array of `int`s:
int A[20];
Imagine that the `A` base pointer is in `x8`.
Here’s a complete implementation of this loop in RISC-V assembly:
add x9, x8, x0 # x9 = &A[0]
add x10, x0, x0 # sum = 0
add x11, x0, x0 # i = 0
addi x13, x0, 20 # x13 = 20
Loop:
bge x11, x13, Done
lw x12, 0(x9) # x12 = A[i]
add x10, x10, x12 # sum += x12
addi x9, x9, 4 # &A[i+1]
addi x11, x11, 1 # i++
j Loop
Done:
The important instructions for implementing the loop are `bge` (branch if greater than or equal) and `j` (unconditional jump). The former checks the loop condition `i < 20`, branching out of the loop when it no longer holds, and the latter starts the next iteration of the loop.
We have included comments to indicate how we implemented the various changes to variables. Here are some observations about this implementation:
- We have chosen to put `sum` in register `x10` and `i` in `x11`.
- The `x13` register just holds the number 20. We need it in a register so we can compare `i < 20` with the `bge` instruction.
- The `x9` register is a little funky. It starts out storing the `A` base address, but then the pointer moves by 4 bytes on every loop iteration (with `addi`). The idea is that it always stores the address `&A[i]`, i.e., a pointer to the ith element of the `A` array on the ith iteration. So to load the value `A[i]`, we just need to load from this address with `lw`.
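For comparison, here is a C version of the loop written in the same pointer-bumping style as the assembly (the function name is ours; each comment notes the corresponding instruction):

```c
// Sum an n-element int array using a moving pointer, mirroring how the
// assembly uses x9 as a pointer that advances 4 bytes per iteration.
int sum_array(const int *A, int n) {
    int sum = 0;                    // add x10, x0, x0
    const int *p = A;               // add x9, x8, x0 (p = &A[0])
    for (int i = 0; i < n; i++) {   // addi x11, ... / bge x11, x13, Done
        sum += *p;                  // lw x12, 0(x9); add x10, x10, x12
        p++;                        // addi x9, x9, 4 (p = &A[i+1])
    }
    return sum;
}
```

Note that `p++` advances the pointer by `sizeof(int)` = 4 bytes, which is exactly what `addi x9, x9, 4` does in the assembly.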