Calling Functions in Assembly

Pseudo-Instructions

While assembly languages mostly have a 1-1 correspondence to some processor’s machine code, sometimes it’s helpful for the assembly language to have a few convenient features that just make it easier for humans to read and write. The primary such feature in RISC-V assembly is its pseudo-instructions. A pseudo-instruction is an assembly-language instruction that does not actually correspond to any distinct machine-code instruction (with its own opcode and such).

Here are some common pseudo-instructions:

mv rd, rs1: Copy the value of register rs1 into register rd.
li rd, imm: Put the immediate value imm into register rd.
nop: A no-op: do nothing at all.

All three of these pseudo-instructions are equivalent to special cases of the addi instruction:

mv rd, rs1 does the same thing as addi rd, rs1, 0
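li rd, imm is addi rd, x0, imm (as long as imm fits in addi’s 12-bit signed immediate field; for larger values, the assembler emits more than one instruction)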
nop is addi x0, x0, 0

Try to convince yourself that these addi instructions do in fact work to implement these pseudo-instructions’ semantics.

The RISC-V assembler translates pseudo-instructions into their equivalent real instructions for you. So you can write li x11, 42 and that will translate to exactly the same machine-code bits as addi x11, x0, 42.
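To make the "same machine-code bits" claim concrete, here is a minimal C sketch that encodes addi using the standard I-type layout and builds the three pseudo-instructions on top of it. The helper names (encode_addi, encode_mv, encode_li, encode_nop) are made up for this illustration and are not part of any real assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Encode an I-type ADDI: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011. */
uint32_t encode_addi(unsigned rd, unsigned rs1, int32_t imm) {
    assert(rd < 32 && rs1 < 32);          /* 5-bit register fields */
    assert(imm >= -2048 && imm <= 2047);  /* 12-bit signed immediate */
    return (((uint32_t)imm & 0xFFFu) << 20) | (rs1 << 15) | (rd << 7) | 0x13u;
}

/* Each pseudo-instruction is just ADDI with particular operands.
   (Hypothetical helpers; encode_li only covers 12-bit immediates.) */
uint32_t encode_mv(unsigned rd, unsigned rs1) { return encode_addi(rd, rs1, 0); }
uint32_t encode_li(unsigned rd, int32_t imm)  { return encode_addi(rd, 0, imm); }
uint32_t encode_nop(void)                     { return encode_addi(0, 0, 0); }
```

With these helpers, encode_li(11, 42) and encode_addi(11, 0, 42) both produce 0x02A00593, and encode_nop() produces 0x00000013.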

Why doesn’t RISC-V implement these pseudo-instructions as real, distinct instructions? Keeping the number of real instructions small simplifies the hardware, especially the decode stage, which helps make it smaller, faster, and more efficient.

Functions in Assembly

With branching control flow, we can accomplish a lot in RISC-V assembly. We can “fake” if statements, for loops, and so on. But one thing we can’t do yet is call functions. That’s what this lecture is about.

Here’s an example C program we can work with:


#include <stdio.h>

int addfn(int a, int b) {
    return a + b;
}

int main() {
    int sum1, sum2;
    sum1 = addfn(1, 2);
    sum2 = addfn(3, 4);
    printf("sum1=%d and sum2=%d\n", sum1, sum2);
}

You already know how to implement the body of the addfn function in RISC-V. But nothing we’ve done so far will let us call that code multiple times with different arguments, as main does in this example.

Calling a function is a multi-step process, and it requires collaboration between both the caller code and the callee code (the function being called). At a high level, every function call needs to follow these steps:

The caller puts arguments in a place where the callee function can access them.
The caller transfers control to the callee (i.e., it jumps to the first instruction in the function).
The function creates a stack frame to hold its own local variables.
The function actually does stuff: i.e., the function body.
The function puts the return value in a place where the caller can access it. It also restores any registers it used to the state the caller expects. And finally, it releases the stack frame that holds its local variables.
The callee returns control to the caller (i.e., jumps to the next instruction in the caller right after the function call).

The caller and callee need to agree on all the details for how this multi-step process works. For example, they must agree on which registers hold the arguments and which registers hold the return value. A standardized protocol for how to implement all these details is called a calling convention. The RISC-V ISA itself defines a particular calling convention, which we will learn about in this lecture. C compilers that generate RISC-V code also use the same calling convention to implement function definitions and function calls—and because it’s standardized, even functions compiled by different C compilers can call each other.

The RISC-V Calling Convention

We’ll break down the components next, but here are the most important parts of the RISC-V calling convention:

Arguments go in registers a0 through a7 (a.k.a. x10 through x17). (In fact, that is why these registers have an alternative name starting with an “a”! It’s for argument.)
Return values also go in registers a0 and a1. (Yes, this means that functions overwrite their arguments with their return values before they return.)
Register ra (a.k.a. x1) holds the return address: the address of the next instruction to run after the function call finishes.
Registers s1 through s11 (a.k.a. x9, and x18 through x27) are callee-saved registers. This means that callers can safely expect that, after they make a call and the call returns, the registers will be carefully restored to the value they had before the call.
Registers t0 through t6 (a.k.a. x5 to x7, and x28 through x31) are temporary registers. This means that callee functions can use these registers without saving them. If the caller needs the contents of these temporary registers after the callee returns, then the caller has to save them before making a function call to the callee. As a result, these temporary registers are called caller-saved registers.
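As a quick reference for the register names above, the following C sketch maps between x-register numbers and their ABI names and classifies which registers a callee must preserve. The function names (abi_name, reg_number, is_callee_saved) are invented for this example; the table itself follows the standard RISC-V register naming.

```c
#include <assert.h>
#include <string.h>

/* ABI names for x0..x31, in order. */
static const char *const ABI_NAMES[32] = {
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6", "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8", "s9", "s10", "s11", "t3", "t4", "t5", "t6",
};

/* ABI name of register x<n>. */
const char *abi_name(unsigned x) {
    assert(x < 32);
    return ABI_NAMES[x];
}

/* Register number for an ABI name, or -1 if the name is unknown. */
int reg_number(const char *name) {
    if (strcmp(name, "fp") == 0) return 8;   /* fp is another name for s0 */
    for (int x = 0; x < 32; x++)
        if (strcmp(name, ABI_NAMES[x]) == 0) return x;
    return -1;
}

/* Returns 1 if the callee must preserve x<n> across a call (sp and s0-s11). */
int is_callee_saved(unsigned x) {
    assert(x < 32);
    return x == 2 || x == 8 || x == 9 || (x >= 18 && x <= 27);
}
```

For example, reg_number("a0") is 10 and reg_number("t3") is 28, matching the "a.k.a." numbers in the list above.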

Control Flow for Call and Return

Let’s start with the basic mechanism for transferring control: jumping from the caller to the callee and then back. The interesting thing is that the branch instructions we’ve seen so far, such as beq, won’t suffice. The problem is that functions, by their very nature, can be called from multiple locations. Like in our example above:


sum1 = addfn(1, 2);
sum2 = addfn(3, 4);

Imagine that we implemented both of these calls with a plain unconditional jump, j. Then the calls might look like this:


li a0, 1
li a1, 2
j addfn
mv <register containing sum1>, a0

li a0, 3
li a1, 4
j addfn
mv <register containing sum2>, a0

All those li instructions would take care of setting up the argument registers, and each mv would consume the return-value register. We imagine here that addfn is an assembly-language label that points to the start of the addfn function’s instructions.

There’s a problem. In the implementation of the addfn function, how do we know where to jump back to? After each call is done, we need to transfer control to the next instruction after the jump. Even if we inserted labels on those instructions, if there is only a single block of instructions to implement addfn, those instructions would need to contain j <label> to return. But somehow it would need to pick a different label for each call, which is impossible!

The solution is to designate a register to hold the return address for the call. Instead of just using j to call a function, we’ll do two things:

Record the next instruction’s address as the return address, in register ra.
Jump to the first instruction of the called function.

Then, to return, the function just needs to jump to the instruction address in register ra. Regardless of who called the function, doing this will suffice to transfer control to the point right after the call.

RISC-V has instructions to support these strategies: both the call and the return. For the call, you use the jal instruction (the mnemonic stands for jump and link):


jal rd, label

The jal instruction does the two things we need for a call:

Put the address of the next instruction after the jal into register rd.
Unconditionally jump to label.

So our function calls will generally look like jal ra, <function label>. Then, to return from a function, we’ll use the jr instruction (the mnemonic means jump register):


jr rs1

The jr unconditionally jumps to the address stored in the register rs1. So function returns generally look like jr ra.

In fact, this pattern is so common that RISC-V has pseudo-instructions for function calls and returns:

jal label: short for jal ra, label
call label: like the above, but with an extra auipc instruction so it supports larger PC offsets
ret: short for jr ra

(Going one level deeper, it turns out that jr rs1 is itself a pseudo-instruction that is short for jalr x0, 0(rs1). But that’s not really important for learning about function calls.)
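The call/return mechanism can be sketched as a tiny C simulation. The Cpu struct and the call_and_return helper are hypothetical and exist only for this illustration, and the sketch assumes every instruction is 4 bytes long (no compressed instructions). It shows that a single function body can return to whichever call site invoked it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal machine state: a program counter and 32 registers. */
typedef struct {
    uint64_t pc;
    uint64_t x[32];
} Cpu;

enum { RA = 1 };  /* ra is x1 */

/* jal rd, target: link (rd = address of the next instruction), then jump. */
void jal(Cpu *cpu, unsigned rd, uint64_t target) {
    assert(rd < 32);
    if (rd != 0) cpu->x[rd] = cpu->pc + 4;  /* x0 is hard-wired to zero */
    cpu->pc = target;
}

/* jr rs1: jump to the address held in rs1. */
void jr(Cpu *cpu, unsigned rs1) {
    assert(rs1 < 32);
    cpu->pc = cpu->x[rs1];
}

/* Call the function at `target` from a jal located at `call_site`, have the
   function return immediately, and report where execution resumes. */
uint64_t call_and_return(uint64_t call_site, uint64_t target) {
    Cpu cpu = { .pc = call_site };
    jal(&cpu, RA, target);   /* caller: jal ra, target */
    jr(&cpu, RA);            /* callee: ret, i.e. jr ra */
    return cpu.pc;
}
```

Calling the same target from address 0x1000 resumes at 0x1004, while calling it from 0x1010 resumes at 0x1014: the callee never needs to know who called it.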

Managing the Stack

Beyond just jumping around, functions also have another important responsibility: they need to keep track of their local variables. As you already know, local variables go in stack frames on the call stack. You also know that the stack is a region in memory that grows downward (from higher memory addresses to lower ones) when we call functions, and it shrinks when function calls return. This section is about the bookkeeping that functions must do to create and use their stack frames.

The central idea is that we must use a register to keep track of the address of our current stack frame. According to the RISC-V calling convention, register sp (a.k.a. x2) contains the address of the top (the smallest address since the stack grows down) of the current stack frame. Further, the RISC-V calling convention has a frame pointer register, fp, that contains the address of the bottom of the stack frame (the fp has a higher address than the sp since the stack grows down). Code interacts with sp and fp in three main ways:

At the beginning of the function, it will “push a stack frame onto the call stack” by moving sp downward to make space for its own stack frame. Remember, this stack frame will contain the function’s local variables.
During the execution of the function, it will use (positive) offsets on sp to locate each of its local variables. So you’ll see stuff like ld a7, 16(sp) and sd a6, 40(sp) to load and store local variables using offsets from sp. Equivalently, negative offsets can be used with the fp to access any local variable within a stack frame. The advantage of using the fp versus the sp is that the offsets to values on the stack are constant relative to the fp, whereas the offsets may change relative to the sp. Note that according to the RISC-V calling convention, fp is optional, but in CS 3410 (2025sp) it is required.
At the end of the function, before it returns, it will “pop the stack frame off the call stack” by moving sp back up to wherever it used to be, “destroying” its stack frame. No memory literally gets destroyed, of course, but adjusting sp back to its pre-call value indicates that we’re done using all our local variables, and it lets the caller locate its own stack frame.

This means that functions usually look like this:


func_label:
  addi sp, sp, -16
  sd ra, 8(sp)
  sd fp, 0(sp)
  addi fp, sp, 8
  
  ...

  ld fp, 0(sp)
  ld ra, 8(sp)
  addi sp, sp, 16
  ret

or, equivalently:


func_label:
  addi sp, sp, -16
  sd fp, 0(sp)      # save the old fp before overwriting it
  addi fp, sp, 8
  sd ra, 0(fp)
  
  ...

  ld ra, 0(fp)
  ld fp, -8(fp)     # restore the old fp last, since fp is the base register
  addi sp, sp, 16
  ret

The addi at the top and bottom of the function “creates” and “destroys” (a.k.a. “pushes” and “pops”) the stack frame. The function’s code must know how big its stack frame needs to be: in this case, it’s 16 bytes, so we move the stack pointer down by 16 bytes at the beginning and back up by the same 16 bytes at the end. The stack frame needs to be big enough to contain everything the function keeps there: here, that is just space for the return address and frame pointer (ra and fp), but in general it also includes the function’s local variables. C compilers compute this stack-frame size for you by adding up the size of all the local variables you declare.

Further, when the stack frame is “created” (“pushed”), the return address, ra, and frame pointer, fp, are stored on the stack, then the ra and fp are restored before the stack frame is “destroyed” (“popped”).

Why is ra stored on the stack? Storing ra on the stack allows a function to call other functions (or itself, recursively) and still return to its own caller. For instance, assume we did not store ra on the stack, and main calls addfn and addfn calls printf. What would happen to ra? When main calls jal addfn (or call addfn), ra will contain the return address in main. Then, when addfn calls printf, jal printf (or call printf) will overwrite ra. Next, when printf returns to addfn and addfn wants to return to main, the contents of ra will have been “clobbered” and there will be no way for addfn to return to main. Fortunately, however, by storing ra on the stack, addfn can restore ra from the stack, which will contain the address back to main.
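The clobbering scenario can be traced with a small C sketch. The addresses and the addfn_returns_to helper are invented for illustration, and the stack is modeled as a tiny array of 8-byte words rather than real memory. The function reports where addfn's final ret would land, with and without saving ra.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical instruction addresses of the two jal instructions. */
enum { MAIN_CALL = 0x1000, ADDFN_CALL = 0x2008 };

/* Simulate main -> addfn -> printf and report where addfn's final `ret` lands.
   If save_ra is nonzero, addfn keeps a copy of ra in its stack frame. */
uint64_t addfn_returns_to(int save_ra) {
    uint64_t stack[8] = {0};   /* stand-in for stack memory, in 8-byte words */
    int sp = 8;                /* word index; the stack grows toward index 0 */
    uint64_t ra;

    ra = MAIN_CALL + 4;                 /* main:  jal addfn */
    if (save_ra) {
        sp -= 2;                        /* addfn: addi sp, sp, -16 */
        assert(sp >= 0);
        stack[sp + 1] = ra;             /* addfn: sd ra, 8(sp) */
    }
    ra = ADDFN_CALL + 4;                /* addfn: jal printf (overwrites ra) */
    /* printf: ret jumps to ra, which is back inside addfn */
    if (save_ra) {
        ra = stack[sp + 1];             /* addfn: ld ra, 8(sp) */
        sp += 2;                        /* addfn: addi sp, sp, 16 */
    }
    return ra;                          /* addfn: ret jumps to ra */
}
```

With the save, addfn returns to 0x1004 in main. Without it, addfn "returns" to 0x200C, an address inside addfn itself, which is exactly the bug described above.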

Passing Arguments

RISC-V provides a consistent way of passing arguments and receiving the result of a subroutine invocation.

In particular, registers a0 to a7 are used for arguments, and a0 and a1 are used for return values. Note that a0 and a1 are both argument and return-value registers; as a result, the contents of argument registers in general are “clobbered” and not preserved.

If a function has more than eight arguments, then the arguments are “spilled” to the stack. The calling convention allocates space for all arguments on the child stack frame, placing the first eight args in registers a0 to a7 and “spilling” any remaining args to the child stack frame. This means that space is allocated on the stack for the first eight args, even though that space is not initially used since the arg registers are used instead. Allocating space on the stack for all args is particularly useful for functions with variable-length inputs, such as printf("Scores: %d %d %d\n", 1, 2, 3);, and for treating the arguments as an array in memory.

Let’s see an example for passing ten arguments:


#include <stdio.h>

int addfn(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j) {
    return a + b + c + d + e + f + g + h + i + j;
}

int main() {
    int sum = addfn(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
    printf("%d\n", sum);
}

Assembly for main calling addfn:


main:
  li a0, 0
  li a1, 1
  ...
  li a7, 7
  li t0, 8
  sd t0, -16(sp)
  li t0, 9
  sd t0, -8(sp)
  jal addfn

The stack with respect to the caller will look like:


-8(sp):  9
-16(sp): 8
-24(sp): space for a7
-32(sp): space for a6
-40(sp): space for a5
-48(sp): space for a4
-56(sp): space for a3
-64(sp): space for a2
-72(sp): space for a1
-80(sp): space for a0

In particular, the caller passes the first eight args in registers a0-a7 and “spills” the ninth and tenth args to the stack and makes room for all ten args on the stack. Further, note that args are passed on the callee (child) stack frame.
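The layout above follows a simple pattern: the arguments form an array that ends just below the caller's sp. A minimal C sketch of that arithmetic follows. The function names are hypothetical, and the formula only describes the layout used in these notes.

```c
#include <assert.h>

enum { NUM_ARG_REGS = 8, WORD_BYTES = 8 };

/* Offset from the caller's sp of the stack slot reserved for argument i
   (0-based) in a call with nargs arguments, under the layout shown above. */
int arg_slot_offset(int nargs, int i) {
    assert(0 <= i && i < nargs);
    return -WORD_BYTES * (nargs - i);
}

/* Returns 1 if argument i travels in register a<i>, 0 if it is spilled. */
int arg_in_register(int i) {
    assert(i >= 0);
    return i < NUM_ARG_REGS;
}
```

For the ten-argument call, arg_slot_offset(10, 8) is -16 and arg_slot_offset(10, 9) is -8, matching the two sd instructions in main, while arg_slot_offset(10, 0) is -80, the unused slot reserved for a0.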

Leaf Functions

Note that if a function does not call another function, then it is a leaf function. The addfn functions above are all leaf functions. It is possible for leaf functions not to push or pop a stack frame: that is, not to adjust sp or save ra, fp, or any args on the stack. A leaf function can use temporary caller-save (t) registers since they do not need to be saved before using them. But a leaf function that does not have a stack frame cannot use callee-save (s) registers, since callee-save registers must be saved on the stack before they are used.

Calling Convention Example

Let’s go through a couple of calling convention examples. First, assume that we have the code below:


int test(int a, int b) {
    int tmp = (a&b)+(a|b);
    int s = sum(tmp,1,2,3,4,5,6,7,8);
    int u = sum(s,tmp,b,a,b,a);
    return u + a + b;
}

Next, let’s pretend that we are the RISC-V C compiler and write the assembly for the above test function.

To proceed, we will complete the following steps:

Write the assembly for the Body of the function
Determine stack frame size
Complete Prologue/Epilogue that performs the stack frame push/pop

Calling Convention Body Example

In this first step, we will write the Body for test:


# Prologue:
#     stack frame size = sizeof(register) bytes x (2 args + 2 (ra/fp) + 0 callee-save registers + 1 temporary caller-save register stored on the stack)
#                      = 8 bytes x 5 = 40 bytes
#
#     stack frame layout
#        32(sp): a1 (b)
#        24(sp): a0 (a)
#        16(sp): ra
#         8(sp): fp
#         0(sp): t0

# Body

  # store args a and b
  SD a0, 24(sp) # a
  SD a1, 32(sp) # b
  
  # int tmp = (a&b)+(a|b);
  AND t0, a0, a1
  OR  t1, a0, a1
  ADD t0, t0, t1

  # store tmp
  SD t0, 0(sp)
  
  # int s = sum(tmp,1,2,3,4,5,6,7,8);
  MV a0, t0
  LI a1, 1
  LI a2, 2
  ...
  LI a7, 7
  LI t1, 8
  SD t1, -8(sp) # spill ninth arg to the child stack frame
  JAL sum

  # restore tmp, a, b
  LD t0, 0(sp)  # tmp
  LD t1, 24(sp) # a
  LD t2, 32(sp) # b

  # int u = sum(s,tmp,b,a,b,a);
  MV a0, a0 # s
  MV a1, t0 # tmp
  MV a2, t2 # b
  MV a3, t1 # a
  MV a4, t2 # b
  MV a5, t1 # a
  JAL sum

  # restore a and b
  LD t1, 24(sp) # a
  LD t2, 32(sp) # b
  
  # add u (a0), a (t1), b (t2)
  ADD a0, a0, t1 # u + a
  ADD a0, a0, t2 # u + a + b
  # a0 = u + a + b

# Epilogue

Several notes for the above assembly of test.

a and b were stored in the space allocated for them on the stack.
a and b had to be restored several times because a0 and a1 are caller-saved, so the callee may overwrite them. That is, after each of the two calls to sum, a and b had to be reloaded from the stack.
tmp, stored in t0, needed to be saved in the test stack frame since t0 is a temporary caller-save register and t0 (tmp) is needed after the first call to sum returns.
The ninth argument (value 8) had to be spilled to the child stack frame. Instructions LI t1, 8 and SD t1, -8(sp) store the value 8 on the child stack frame.

Calling Convention Prologue/Epilogue Example

Next, let’s take a look at how to create and destroy (push and pop) the stack frame for test in the prologue and epilogue, respectively.


#     stack frame layout
#        32(sp): b (a1)
#        24(sp): a (a0)
#        16(sp): ra
#         8(sp): fp
#         0(sp): t0

test: 	
    # Prologue
    ADDI sp, sp, -40 # allocate stack frame
    SD ra, 16(sp)    # save ra
    SD fp,  8(sp)    # save old fp
    ADDI fp, sp, 32  # set new frame pointer

    # Body
    ...

    #Epilogue
    LD fp,  8(sp)   # restore fp
    LD ra, 16(sp)   # restore ra
    ADDI sp, sp, 40 # dealloc frame
    ret		# JR ra

The test stack frame size is 40 bytes, which is space to store the two args, a and b, ra/fp, and the tmp variable. Further, in the prologue and epilogue, only ra and fp are stored. The arguments for test, a and b, and tmp (t0) are stored on the stack in the # Body.

Another consideration is the total number of stores and loads for this implementation of test. Specifically, there are two stores and two loads in the prologue/epilogue and four stores (including the spill of the ninth argument) and five loads in the body, for a total of six stores (SD) and seven loads (LD).

Calling Convention Example 2

Now let’s look at a different implementation for test. It is the same C code for test, but a different assembly implementation. In this assembly, we will use callee-save registers (s) to save on memory accesses and, hopefully, reduce the number of stores/loads (SD/LD). The stack size may increase because we need to save the callee-save registers before we use them, but there may be fewer stores/loads overall.


# Prologue
#     stack frame size = sizeof(register) bytes x (2 args + 2 (ra/fp) + 3 callee-save registers + 0 temporary caller-save registers stored on the stack)
#                      = 8 bytes x 7 = 56 bytes
#
#     stack frame layout
#        48(sp): b
#        40(sp): a
#        32(sp): ra
#        24(sp): fp
#        16(sp): s3
#         8(sp): s2
#         0(sp): s1


# Body

  # store args in callee-save registers s1 and s2
  MV s1, a0 # a
  MV s2, a1 # b
  
  # int tmp = (a&b)+(a|b);
  AND s3, a0, a1
  OR  t1, a0, a1
  ADD s3, s3, t1 # store tmp in a callee-save register s3
 
  # int s = sum(tmp,1,2,3,4,5,6,7,8);
  MV a0, s3
  LI a1, 1
  LI a2, 2
  ...
  LI a7, 7
  LI t1, 8
  SD t1, -8(sp) # spill ninth arg to the child stack frame
  JAL sum

  # int u = sum(s,tmp,b,a,b,a);
  MV a0, a0 # s
  MV a1, s3 # tmp
  MV a2, s2 # b
  MV a3, s1 # a
  MV a4, s2 # b
  MV a5, s1 # a
  JAL sum

  # add u (a0), a (s1), b (s2)
  ADD a0, a0, s1 # u + a
  ADD a0, a0, s2 # u + a + b
  # a0 = u + a + b

# Epilogue

In this assembly, there is space allocated for args a and b; however, we use callee-save registers s1 and s2 for a and b instead. As a result, the body of test has one store (SD) and zero loads (LD). Note that test still needs to spill the ninth argument on the stack before calling sum.

Calling Convention Prologue/Epilogue Example 2

Now, let’s take a look at the prologue and epilogue to push and pop the test stack frame for this second implementation.


#     stack frame layout
#        48(sp): b
#        40(sp): a
#        32(sp): ra
#        24(sp): fp
#        16(sp): s3
#         8(sp): s2
#         0(sp): s1

test: 	
    # Prologue
    ADDI sp, sp, -56 # allocate stack frame
    SD ra, 32(sp)    # save ra
    SD fp, 24(sp)    # save old fp
    SD s3, 16(sp)    # store callee-save reg s3
    SD s2, 8(sp)     # store callee-save reg s2
    SD s1, 0(sp)     # store callee-save reg s1
    ADDI fp, sp, 48  # set new frame pointer

    # Body
    ...

    #Epilogue
    LD s1, 0(sp)    # restore s1
    LD s2, 8(sp)    # restore s2
    LD s3, 16(sp)   # restore s3
    LD fp, 24(sp)   # restore fp
    LD ra, 32(sp)   # restore ra
    ADDI sp, sp, 56 # dealloc frame
    ret		# JR ra

In this assembly, the test stack frame size is 56 bytes, which is space to store the two args, a and b, ra/fp, and space for three callee-save (s) registers. We save s1-s3 so that we can use them for a, b, and tmp.

In terms of the total number of stores and loads, there are five stores and five loads in the prologue/epilogue and one store and zero loads in the body for a total of six stores (SD) and five loads (LD), reducing the total number of loads by two compared to the prior assembly.
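The frame-size arithmetic used in both examples can be written down as a short C sketch. The function name frame_size and its parameters are made up for this illustration; it simply mirrors the accounting in the comments above (8 bytes per saved item) and does not round the result up to any alignment boundary.

```c
#include <assert.h>

enum { REG_BYTES = 8 };  /* 64-bit registers */

/* Frame size under the accounting used in these notes: one 8-byte slot per
   incoming argument, two for ra and fp, one per callee-saved register the
   function uses, and one per temporary it must keep across a call. */
int frame_size(int num_args, int num_callee_saved, int num_saved_temps) {
    assert(num_args >= 0 && num_callee_saved >= 0 && num_saved_temps >= 0);
    return REG_BYTES * (num_args + 2 + num_callee_saved + num_saved_temps);
}
```

The first implementation of test corresponds to frame_size(2, 0, 1), which is 40 bytes, and the second corresponds to frame_size(2, 3, 0), which is 56 bytes.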

Summary and Cheat Sheet for the RISC-V Calling Convention

first eight args passed in registers a0, a1, … , a7
Space for args passed in child’s stack frame
return value (if any) in a0, a1
stack frame at sp
- contains ra (clobbered on JAL to sub-functions)
- contains fp
- contains local vars (possibly clobbered by sub-functions)
- contains space for incoming args
Saved registers (callee save regs) are preserved
Temporary registers (caller save regs) are not
Global data accessed via gp

Diagram of stack frame

RISC-V Registers

Return address: x1 (ra)
Stack pointer: x2 (sp)
Frame pointer: x8 (fp/s0)
First eight arguments: x10-x17 (a0-a7)
Return result: x10-x11 (a0-a1)
Callee-save free regs: x9, x18-x27 (s1-s11)
Caller-save free regs: x5-x7,x28-x31 (t0-t6)
Global pointer: x3 (gp)
Thread pointer: x4 (tp)

CS 3410