A7: Functions in Assembly

Instructions: Remember, all assignments in CS 3410 are individual. You must submit work that is 100% your own. Remember to ask for help from the CS 3410 staff in office hours or on Ed! If you discuss the assignment with anyone else, be careful not to share your actual work, and include an acknowledgment of the discussion in a collaboration.txt file along with your submission.

The assignment is due via Gradescope at 11:59pm on the due date indicated on the schedule.

Submission Requirements
Provided Files
Getting Started
Overview
Part 0: Lab Section
Part 1: Optimizing Fibonacci
Submission
Rubric

Submission Requirements

Please submit the following files.

From the in-lab work:

addone.s: your first assembly function, which increments the integer it’s given
recsum.s: a recursive summation function

From Part 1:

recursive.s: implementation of a recursive Fibonacci function
memoization.s: implementation of a memoized Fibonacci function
tail_recursive.s: implementation of a tail recursive Fibonacci function
opt_tail.s: implementation of a tail-call optimized Fibonacci function

Provided Files

We have provided you the following:

recursive.c, memoization.c, tail_recursive.c, and opt_tail.c, as the starter code for Part 1
compare.c, a program that compares the performance of the different versions of the Fibonacci function
Makefile, which you can use to build executables for the above programs

Getting Started

To get started, obtain the release code by cloning your assignment repository from GitHub:


$ git clone git@github.coecis.cornell.edu:cs3410-2025sp-student/<NETID>_asmfunc.git

Replace <NETID> with your NetID. All the letters in your NetID should be in lowercase.

Overview

This assignment will expand your understanding of RISC-V assembly programming with a focus on managing the call stack. You will get direct experience with defining and calling functions and adhering to RISC-V calling conventions. You will learn about optimizing recursive functions for performance.

Part 0: Lab Section

View the lab slides here.

During this lab section, you’ll get some initial experience with writing functions in assembly. The key challenge is in following the RISC-V standard calling conventions. The calling conventions are a set of rules that both the caller code and the callee code (the function being called) must follow.

All RISC-V functions can be broken up into three parts (in-order):

Prologue. Located at the beginning of the function, the prologue constructs the stack frame.
Body. Located after the prologue, the body contains the instructions of what the function actually does.
Epilogue. Located at the end of the function, the epilogue releases the stack frame before returning control to the caller.

We recommend starting writing the body of the function, noting which callee- and caller-saved registers you used. Afterwards, you can write the prologue and epilogue to properly save and restore the registers that you used.

Warm Up: `add_one`

Let’s start simple by implementing a function that adds 1 to its argument. You can imagine this C function:


int add_one(int i) {
  return i + 1;
}

Let’s start by compiling the body of the function. If you refer to the RISC-V calling conventions, you’ll notice that the first argument and the return value go in register a0. This makes the body pretty simple — we just have to add 1 to a0!


addi a0, a0, 1

Next, the prologue. First, we need to determine how large the stack frame must be; let’s call this number SIZE. SIZE must be big enough to hold the return address, any callee-saved registers, and any local variables that don’t fit in registers. Here’s a compact “to-do” list for what the prologue must do:

Move the stack pointer down by the size of the stack frame.
Push the return address onto the stack.
Push any callee-saved registers that our body uses onto the stack.
If needed, push any local variables that don’t fit into registers onto the stack.

The epilogue does the opposite; it must release the stack frame (i.e., clean up!):

Restore any callee-saved registers by popping them from the stack.
Pop the return address from the stack.
Move the stack pointer back to its original position.
Use ret (a.k.a., jr ra) to return to jump to the next instruction after the function call.

As we’ve written it, our function body doesn’t use any callee-saved registers (s0–s11), nor does it require any stack space for local variables. That means we just have to store the return address on the stack. In RISC-V 64-bit (what we’re using), memory addresses are 8 bytes (64 bits).

Putting it all together, here’s an implementation of add_one:


add_one:
  # Prologue.
  addi sp, sp, -8  # Push the stack frame.
  sd   ra, 0(sp)   # Save return address.

  # Body.
  addi a0, a0, 1

  # Epilogue.
  ld   ra, 0(sp)   # Restore return address.
  addi sp, sp, 8   # Pop the stack frame.
  ret

The key difficulty of writing the prologue and epilogue is deciding where in the stack frame to store what. In other words, what is stored at which offset from sp? add_one only needs to store one value, the return address, so that just goes at 0(sp). However, in general you must determine the layout of the stack frame.

Copy the RISC-V assembly of the add_one function above into a file called addone.s.

Trying It Out: Calling Your Function From C

We can’t run this assembly program yet as it lacks a main function. It also doesn’t print anything out, which makes it hard to tell what it’s doing (if anything). One way to test your assembly functions is to write a C program that calls your assembly function.

Make sure that your addone.s implementation has an add_one: label. Now, at the top of the file add the following line:


.global add_one

This directive tells the assembler that the add_one label is a global symbol, meaning it’s accessible to other code.

Then, in a separate file (e.g., addone_test.c) copy the C program below:


#include <stdio.h>

int add_one(int i);

int main() {
  int res = add_one(42);
  printf("%d\n", res);
}

That add_one declaration is called a prototype, which means it doesn’t have a function body. It just tells the C compiler that the function is implemented elsewhere—in your case, in an assembly file.

Now, let’s compile and link these two files together.


$ rv gcc addone.s addone_test.c -o addone_test

Then use rv qemu addone_test, to run the program.

This works thanks to the magic of calling conventions! You and GCC are both “assembly programmers”, and because you agree on the standard way to invoke functions, the assembly code you both write can interoperate.

Recursive Sum

Next, we’ll write a recursive function that sums the integers from $1$ through $n$ . The function we want to implement would look something like this in C:


int sum(int n) {
  if (n == 0)
    return n;
  return n + sum(n - 1);
}

In assembly, recursive function calls work exactly the same way as any other function call—the caller and callee just happen to be the same function. We’ll follow the RISC-V calling conventions in both roles.

Start by writing the function body. The interesting part is implementing the function call. Take note of which caller-saved registers you need to save before the jal instruction and restore after the function returns.

Next, write the prologue and epilogue. You’ll want to start by making a list of all the values this function will ever need to store in its stack frame. Determine the stack frame layout, or the offsets you’ll store each value at. Lastly, follow the recipe from the add_one step above to write the prologue and epilogue.

Once you’re done, you can test your sum function by writing a main wrapper in C, as we did for add_one. You’ll want to try calling sum on several different inputs.

To finish, put your assembly implementation of the sum in a file called recsum.s.

Part 1: Optimizing Fibonacci

In this assignment, you will implement several different versions of a function the $n$ th number of the Fibonacci sequence. We’ll start with a straightforward recursive implementation and then explore various performance optimizations.

Version A: Recursive Fibonacci

Here’s a straightforward recursive implementation of a Fibonacci function in C:


unsigned long r_fibonacci(int n) {
  if (n == 0)
    return 0;
  else if (n == 1)
    return 1;
  else
    return r_fibonacci(n - 2) + r_fibonacci(n - 1);
}

Your task is to translate this code into RISC-V assembly.

Put your implementation in a file called recursive.s. We have provided a main function you can use to test your code in recursive.c. To test your code, type:


$ rv make recursive     # Build the `recursive` executable.
$ rv qemu recursive 10  # Run it.

The recursive executable takes a command-line argument: the index of the Fibonacci to calculate. So qemu recursive 10 should print the 10th Fibonacci number, which is 55.

Version B: Memoized Fibonacci

The recursive implementation works, but it is very slow. Try timing the execution of a few Fibonacci calculations:


$ time rv qemu recursive 35
$ time rv qemu recursive 40
$ time rv qemu recursive 42

On my machine, calculating the 40th Fibonacci number took 4 seconds, and calculating the 42nd took 11 seconds. That suggests that the asymptotic complexity is pretty bad.

Part of the problem is that the recursive version recomputes the same answer many times. For example, if you call r_fibonacci(4), it will eventually call r_fibonacci(2) twice: once directly, and once indirectly via the recursive call to r_fibonacci(3). This redundancy can waste a lot of work.

A popular way to avoid wasteful recomputation is memoization. The idea is to maintain a memo table of previously-computed answers and to reuse them whenever possible. For our function, the memo table can just be an array, where the $i$ th index holds the $i$ th Fibonacci number. Here’s some Python code that illustrates the idea:


def m_fibonacci(n, memo_table, size):
    # Check the memo table. A nonzero value means we've already computed this.
    if n < size and memo_table[n] != 0:
        return memo_table[n]

    # We haven't computed this, so do the actual recursive computation.
    if n == 0:
        return 0
    elif n == 1:
        return 1
    answer = (m_fibonacci(n - 2, memo_table, size) + 
        m_fibonacci(n - 1, memo_table, size))

    # Save the answer in the memo table before returning.
    if n < size:
        memo_table[n] = answer

    return answer

In C, the type of memo_table will be unsigned long*, i.e., an array of positive numbers. size is the length of that array. Here’s the function signature for our new function:


unsigned long m_fibonacci(int n, unsigned long* memo_table, int size);

Implement this m_fibonacci function in RISC-V assembly. Put your code in memoization.s.

We have provided a memoization.c wrapper that you can use to test your code. You can use the same procedure as above to try your implementation: rv make memoization followed by rv qemu memoization <number>.

Notice how much faster the new implementation is! Take some number that was especially slow in the recursive implementation and time it using your memoized version:


$ time rv qemu memoization 42

On my machine, that takes just 0.5 seconds. That’s 22× faster!

Version C: Tail Recursive Fibonacci

While the new version is a lot faster, it still makes a lot of function calls. Some of those function calls turn out to be fast, because they just look up the answer in the memo table. But we can do better by changing the algorithm to need only one recursive call.

Again using Python syntax, here’s the algorithm for a faster recursive version:


def tail_r_fibonacci(n, a, b):
    if n == 0:
        return a
    if n == 1:
        return b
    return tail_r_fibonacci(n - 1, b, a + b)

This version is called tail-recursive because the recursive call is the very last thing the function does before returning. Marvel at the fact that this version makes only $n$ recursive calls to calculate the $n$ th Fibonacci number!

Here’s the function signature for this version:


unsigned long tail_r_fibonacci(int n, unsigned long a, unsigned long b);

Implement this tail_r_fibonacci function in tail_recursive.s. As usual, we have provided a C wrapper so you can test your implementation: rv make tail_recursive followed by rv qemu tail_recursive <number>.

Version D: Tail-Call Optimized Fibonacci

Making $n$ recursive calls is pretty good, but is it possible to optimize this code to do no recursion at all? That would mean that the algorithm uses $O(1)$ stack space instead of $O(n)$ .

That’s the idea in tail-call optimization. The plan is to exploit that, once the recursive call to tail_r_fibonacci is done, the caller has nothing more to do. The callee puts its return value in a0, and that is exactly what the caller wants to return too. Because there is no more work to do after the tail call, we don’t need to waste time maintaining the stack frame for the caller. We can just reuse the same stack frame for the recursive call!

Implement an optimized version of the tail-recursive Fibonacci algorithm in opt_tail.s. Instead of using a jal (or call) instruction for the recursive call, you can just use a plain unconditional jump (j in RISC-V). Be sure to carefully think through when and where you need to save and restore the return address to make this work.

Your function should be named opt_tail_fibonacci, and it should have the same function signature as the previous version. As usual, opt_tail.c can help you test your implementation: rv make opt_tail followed by rv qemu opt_tail <number>.

Compare Performance

We have provided a program, in compare.c, that can compare the performance of these various optimizations more precisely than the time command. (That was also measuring the time it takes to start the executable up, which can be slow, especially when it entails launching a Docker container.) Build the tool and invoke it like this:


$ rv make compare
$ rv qemu compare <method> <n>

You can give it the name of a method (recursive, memoization, tail_recursive, or opt_tail) and a number $n$ to measure the time taken to compute the $n$ th Fibonacci number. Or use the all method to compare all the implementations.

When I ran this once on my machine with $n=20$ , it reported that the recursive implementation took about 2.6 seconds, memoization brought this down to just 7 milliseconds, tail recursion was even faster at 3 ms, and the optimized tail call version was blazingly fast at only half a millisecond. Every computer is different, so your numbers will vary, but see if you observe the same overall performance trend.

There is nothing to turn in for this part—it’s just cool!

Submission

Submit all the files listed in Submission Requirements to Gradescope. Upon submission, we will provide a smoke test to ensure your code compiles and passes the public test cases.

Rubric

Part 0:
- addone.s: 5 points
- recsum.s: 5 points
Part 1:
- recursive.s: 10 points
- memoization.s: 15 points
- tail_recursive.s: 10 points
- opt_tail.s: 15 points