RISC-V: Data Memory & Control Flow
The Memory Hierarchy
So far, we have seen a bunch of RISC-V instructions that access the 32 registers, but we haven’t accessed memory yet. Registers are fine as long as your data fits in 31 64-bit values, but real software needs “bulk” storage, and that’s what memory is for.
In general, computer architects think of these different ways of storing data as tiers in an organization called the memory hierarchy. You can imagine an entire spectrum of different ways of storing data, all of which trade off between different goals:
- Smaller memories that are closer to the processor and faster to access.
- Larger memories that are farther from the processor and slower to access.
Registers are toward the first extreme: in 64-bit RISC-V, there are only 31 × 8 = 248 bytes of mutable register storage in total, and it usually takes around 1 cycle (less than a nanosecond) to access a register.
Modern main memory is at the opposite extreme: even cheap phones have several gigabytes of main memory, and it typically takes hundreds of cycles (hundreds of nanoseconds) to access it.
You might reasonably ask: why not make the whole plane out of registers? There are two big answers to this question.
- In real computers, these different memories are made out of different memory technologies. The physical details of how to construct memories are out of scope for CS 3410, but registers are universally made from transistors (like the flip-flops we built in class) and integrated with the processor. Main memory is made of DRAM, a memory-specific technology that uses tiny capacitors to store bits. DRAM requires different manufacturing processes than logic and is much cheaper per bit than integrated-with-logic storage, but it is also much slower.
- There is a fundamental trade-off between capacity and latency. In any memory technology you can think of, building a larger memory makes it take longer to access.
Registers and main memory are two points in the memory-hierarchy spectrum. There are other points too: later in the semester, we will learn much more about caches, which fill in the space in between registers and main memory. You can also think of persistent storage (magnetic hard drives or flash memory SSDs) or even the Internet as further tiers beyond main memory.
Extension and Truncation
When we access memory, we will often need to change the size (the number of bits) of various values. For example, we’ll need to take an 8-bit value and treat it as a 64-bit value, and we’ll need to take a 64-bit value and treat it as a 32-bit value. When you increase the number of bits, that’s called extension, and when you decrease the size, that’s called truncation. The goal in both situations is to avoid losing information whenever possible: that is, to keep the same represented integer value when converting between sizes.
Truncation
Truncation from m bits to n bits works by extracting the lowest (least significant) n bits from the value. There is, sadly, no way to avoid losing information in some cases. Here are some examples:
- Let’s truncate the 64-bit value `0x00000000000000ab` to 32 bits. In decimal, this number has the value 171. Truncating to 32 bits yields `0x000000ab`. That’s also 171. Awesome!
- Let’s truncate `0xffffffffffffffab` to 32 bits. That’s the value -85 in two’s complement. Truncating yields `0xffffffab`. That’s still -85. Excellent!
- Now let’s truncate the bits `0x80000000000000ab` (note the 8 in the most-significant hex digit). That’s a really big negative value, because the leading bit is 1. Truncating yields `0x000000ab`, which represents 171. That’s bad: we now have a different value. But losing some information is inevitable when you lose some bits.
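If you want to check these examples yourself, C’s narrowing casts perform exactly this kind of truncation. Here is a quick sketch (the function name is ours; strictly speaking, out-of-range signed conversions are implementation-defined in C, but mainstream compilers keep the low bits as shown):

```c
#include <stdint.h>

// Truncate a 64-bit value to 32 bits: keep bits [31:0], discard bits [63:32].
// A cast to a narrower integer type does exactly this on typical compilers.
int32_t truncate32(int64_t x) {
    return (int32_t)x;
}
```

Running it on the three examples above reproduces 171, -85, and 171, matching the hand calculations.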
Extension
There are two modes for extending from m bits to n bits. Both work by putting the value in the m least-significant bits of the n-bit output. The difference is in what we do with the extra n−m bits, which are the most-significant (upper) bits in the output.
- Zero extension fills the upper bits with zeroes.
- Sign extension fills them with copies of the most-significant bit in the input. (That is, the sign bit.)
Let’s see some examples.
- Let’s zero-extend `0xffffffab` (remember, that’s -85) to 64 bits. The result is `0x00000000ffffffab`, a pretty big positive number (4294967211 in decimal). So we didn’t preserve the value.
- Now let’s sign-extend the same value. Because the most significant bit in the 32-bit input is 1, we fill in the upper 32 bits with 1s. The output is `0xffffffffffffffab` in hex, or -85 in decimal. So we preserved the value!
The moral of the story is: when extending unsigned numbers, use zero extension; when extending signed numbers, use sign extension.
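You can mimic both extension modes in C, where widening an unsigned value zero-extends and widening a signed value sign-extends. A sketch, assuming a typical two’s-complement compiler (the function names are made up for illustration):

```c
#include <stdint.h>

// Zero extension: treat the 32-bit input as unsigned; the upper 32 bits
// of the result are filled with 0s.
uint64_t zero_extend32(uint32_t x) {
    return (uint64_t)x;
}

// Sign extension: treat the 32-bit input as signed; the upper 32 bits
// of the result are copies of bit 31 (the sign bit).
uint64_t sign_extend32(uint32_t x) {
    return (uint64_t)(int64_t)(int32_t)x;
}
```

On the example above, `zero_extend32(0xffffffab)` gives `0x00000000ffffffab` while `sign_extend32(0xffffffab)` gives `0xffffffffffffffab`, i.e., -85 again.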
Load and Store Instructions
The 64-bit RISC-V instruction set gives you several instructions for loading from and storing to memory. They are very similar; the only difference is the size of the load or store: the number of bits we’re reading or writing.
Let’s start with `ld` and `sd`. The mnemonics use `l` and `s` for load and store, and the `d` means double word, meaning they load/store 64 bits at a time.
The format looks like this:
ld rd, offset(rs1)
sd rs2, offset(rs1)
In both cases, the second operand is the address. This operand uses the funky-looking `offset(rs1)` syntax, which means: get the value from register `rs1`, add the constant value `offset` to it, and treat the result as the address. These instructions have a built-in constant offset because it is so incredibly common for code to need to add a small constant value to an address before doing the access. If you don’t need this, you can always use 0 for the offset.

The `ld` instruction puts the loaded value into `rd`. The `sd` instruction takes the value from `rs2` and stores it to memory at the computed address.
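If it helps, here is a rough C analogue of what `ld` and `sd` compute (the function names are invented for illustration; real hardware obviously doesn’t call `memcpy`):

```c
#include <stdint.h>
#include <string.h>

// Analogue of `ld rd, offset(rs1)`: compute base + offset, read 8 bytes there.
uint64_t load_doubleword(const uint8_t *base, int64_t offset) {
    uint64_t value;
    memcpy(&value, base + offset, sizeof value);
    return value;
}

// Analogue of `sd rs2, offset(rs1)`: compute base + offset, write 8 bytes there.
void store_doubleword(uint8_t *base, int64_t offset, uint64_t value) {
    memcpy(base + offset, &value, sizeof value);
}
```

A store followed by a load at the same `base` and `offset` round-trips the value, just as `sd` followed by `ld` does.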
Accessing Different Widths
The instruction set gives you several other load and store operations for different widths. Here is a non-exhaustive list:

- `ld` and `sd`: Load or store a double word (64 bits).
- `lw`, `lwu`, and `sw`: Load or store a word (32 bits).
- `lb`, `lbu`, and `sb`: Load or store a byte (8 bits).
Recall that our registers are all 64 bits. So what happens when you use a smaller-width load or store?
- When storing, you truncate (take the lowest n bits from the register).
- When loading, you extend. The instruction tells you whether to zero-extend or sign-extend:
  - The instructions with the `u` suffix are for unsigned numbers, and they zero-extend.
  - The instructions without this suffix are for signed numbers, and they sign-extend.

So, for example, `lb` loads a single byte and sign-extends it to 64 bits to put it in a register. `lbu` does the same thing, but it zero-extends instead.
Example: Store Word, Load Byte
Consider this short program:
addi x11, x0, 0x49C
sw x11, 0(x5)
lb x12, 0(x5)
What is the value of `x12` at the end?
As always, it helps to translate the assembly to pseudocode to understand it. Here’s one attempt:
x11 = 0x49c;
store_word(x11, x5);
x12 = load_byte(x5);
So we don’t know what address `x5` holds, but that’s the memory address. We’re storing the value `0x49c` as a word (32 bits) to that address, and then loading the byte at that address.
Let’s look at the two steps:

- First, the `sw` instruction truncates the 64-bit register value `0x49c` to the 32-bit word `0x0000049c` and stores it. Since RISC-V is little endian, the least-significant byte goes at the smallest address. Let’s say `x5` holds the address a. Then address a will hold the byte `0x9c`, a+1 holds the byte `0x04`, and addresses a+2 and a+3 both hold zero.
- Next, we load the byte at the same address. The load instruction gets the byte `0x9c`, and it sign-extends it to 64 bits, so the final value is `0xffffffffffffff9c`, or -100 in decimal if we interpret it as a signed number.
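Here is the same experiment as a C sketch you can run (the helper names are invented; it assumes a little-endian host, matching RISC-V):

```c
#include <stdint.h>
#include <string.h>

// Store a 32-bit word into a little buffer (like sw), then load the byte at
// the lowest address and sign-extend it to 64 bits (like lb).
int64_t store_word_load_byte(uint32_t word) {
    uint8_t mem[4];
    memcpy(mem, &word, sizeof word);   // sw: low byte lands at mem[0] on a
                                       // little-endian host
    return (int64_t)(int8_t)mem[0];    // lb: sign-extend the byte
}

// The lbu variant: same load, but zero-extended.
uint64_t store_word_load_byte_unsigned(uint32_t word) {
    uint8_t mem[4];
    memcpy(mem, &word, sizeof word);
    return (uint64_t)mem[0];           // lbu: zero-extend the byte
}
```

With the input `0x49c`, the signed load produces -100, exactly as in the worked example, while the `lbu`-style load would produce `0x9c` (156).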
Example: Translating from C
How would you translate this C program to assembly?
void mystery(int* x, int* y) {
*x = *y;
}
Assume (as is the case on our RISC-V target) that `int` is a 32-bit type. Assume also that the pointers `x` and `y` are stored in registers `x3` and `x5`, respectively.
Here’s a reasonable translation:
lw x8, 0(x5)
sw x8, 0(x3)
Here are some salient observations about this code:

- It makes sense that this is a load instruction followed by a store instruction, because we need to read the value at address `y` and write it to address `x`.
- It also makes sense that we are using word-sized accesses (`lw` and `sw`), because that’s how you access 32 bits.
- We use the signed version of the load (`lw` instead of `lwu`) to get sign extension, not zero extension. (If the type were `unsigned int` instead, you would want `lwu`.)
- The offset is zero in both instructions, because we want to use the addresses in `x5` and `x3` unmodified.
Control Flow in Assembly
So far, all the assembly programs we’ve written have been straight-line code, in the sense that they always run one instruction after the other.
That’s like writing C without any control flow: no `if`, `for`, `while`, etc.
The remainder of this lecture is about the instructions that exist in RISC-V to
implement control-flow constructs.
Branch If Equal
For most instructions, when the processor is done running that instruction, it proceeds onto the next instruction (incrementing the program counter by 4 on RISC-V, because every instruction is 4 bytes).
A branch instruction is one that can choose whether to do that or to execute some other instruction of your choosing instead.
One example is the `beq` instruction, which means branch if equal:
beq rs1, rs2, label
The first two operands are registers, and `beq` checks whether their values are equal. The third operand is a label, which we’ll look at more closely in a moment; it refers to some other instruction. Then:

- If the two registers hold equal values, go to the instruction at `label`.
- If they’re not equal, just go to the next instruction (add 4 to the PC) as usual.
Labels appear in your assembly code like this:
my_great_label:
That is, just pick a name and put a `:` after it. This labels a specific instruction so that a branch can refer to it.
Here’s an example:
beq x1, x2, some_label
addi x3, x3, 42
some_label:
addi x3, x3, 27
This program checks whether `x1 == x2`. If so, it immediately executes the last instruction, skipping the second instruction. Otherwise, it runs all 3 instructions in this listing in order (it adds 42 and then adds 27 to `x3`).
In other words, you can imagine this assembly code implementing an if
statement in C:
if (x1 != x2) {
x3 += 42;
}
x3 += 27;
Labels in Machine Code
As shown above, in assembly code we can define labels like `my_great_label:` by simply picking a name and putting a `:` after it.
However, these labels are symbolic and appear only in assembly code, not machine code. When assembling the machine code, the assembler converts each label into a signed offset. If the branch is taken, this offset is added to the program counter (PC) of the branch instruction to compute the address of the next instruction to execute.
For example, consider the assembly program from the previous section annotated with the memory address (in instruction memory) of each instruction:
0: beq x1, x2, some_label
4: addi x3, x3, 42
some_label:
8: addi x3, x3, 27
The assembler would remove the label `some_label:` and replace each occurrence with the appropriate offset:
0: beq x1, x2, 8
4: addi x3, x3, 42
8: addi x3, x3, 27
When writing assembly code by hand, use labels! Labels exist largely to make it easier (or possible) for programmers to read and write assembly code by hand. Replacing labels with offsets is a job better left to the assembler.
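The arithmetic the assembler does for each branch is tiny: subtract the branch instruction’s own address from the target’s address. A sketch (the function name is ours, for illustration):

```c
#include <stdint.h>

// Compute the signed byte offset the assembler encodes into a branch:
// the distance from the branch instruction's address to the target's address.
int64_t branch_offset(uint64_t branch_addr, uint64_t target_addr) {
    return (int64_t)(target_addr - branch_addr);
}
```

For the listing above, the `beq` at address 0 targeting `some_label` at address 8 gets offset 8; a backward branch would get a negative offset.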
Other Branches and Jumps
You should read the RISC-V spec for an exhaustive list of the branch instructions it supports. Here are a few, beyond `beq`:

- `bne rs1, rs2, label`: Branch if the registers are not equal.
- `blt rs1, rs2, label`: Branch if `rs1` is less than `rs2`, treated as signed (two’s complement) integers.
- `bge rs1, rs2, label`: Like `blt`, but branch if `rs1` is greater than or equal to `rs2`.
- `bltu` and `bgeu` are similar, but they do unsigned integer comparisons.
You will also encounter unconditional jumps, written `j label`. Unlike branches, `j` doesn’t check a condition; it always immediately transfers control to the label.
Implementing Loops
We have already seen how branches in assembly can implement the `if` control-flow construct. They are also all you need to implement loops, like the `for` and `while` constructs in C. We’ll see a worked example in this section.
Consider this loop that sums the values in an array:
int sum = 0;
for (int i = 0; i < 20; i++) {
sum += A[i];
}
And imagine that `A` is declared as an array of `int`s:
int A[20];
Imagine that the `A` base pointer is in `x8`.
Here’s a complete implementation of this loop in RISC-V assembly:
add x9, x8, x0 # x9 = &A[0]
add x10, x0, x0 # sum = 0
add x11, x0, x0 # i = 0
addi x13, x0, 20 # x13 = 20
Loop:
bge x11, x13, Done
lw x12, 0(x9) # x12 = A[i]
add x10, x10, x12 # sum += x12
addi x9, x9, 4 # &A[i+1]
addi x11, x11, 1 # i++
j Loop
Done:
The important instructions for implementing the loop are `bge` (branch if greater than or equal) and `j` (unconditional jump). The former checks the loop condition `i < 20`, branching out of the loop when it no longer holds, and the latter starts the next iteration of the loop.
We have included comments to indicate how we implemented the various changes to variables. Here are some observations about this implementation:
- We have chosen to put `sum` in register `x10` and `i` in `x11`.
- The `x13` register just holds the number 20. We need it in a register so we can compare `i < 20` with the `bge` instruction.
- The `x9` register is a little funky. It starts out storing the `A` base address, but then the pointer moves by 4 bytes on every loop iteration (with `addi`). The idea is that it always stores the address `&A[i]`, i.e., a pointer to the ith element of the `A` array on the ith iteration. So to load the value `A[i]`, we just need to load from this address with `lw`.
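For comparison, here is a C version of the loop written in the same pointer-bumping style as the assembly (the function name is ours; each comment notes the corresponding instruction):

```c
// Sum an n-element int array using a moving pointer, mirroring how the
// assembly uses x9 as a pointer that advances 4 bytes per iteration.
int sum_array(const int *A, int n) {
    int sum = 0;                    // add x10, x0, x0
    const int *p = A;               // add x9, x8, x0 (p = &A[0])
    for (int i = 0; i < n; i++) {   // addi x11, ... / bge x11, x13, Done
        sum += *p;                  // lw x12, 0(x9); add x10, x10, x12
        p++;                        // addi x9, x9, 4 (p = &A[i+1])
    }
    return sum;
}
```

Note that `p++` advances the pointer by `sizeof(int)` = 4 bytes, which is exactly what `addi x9, x9, 4` does in the assembly.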