Calling Conventions

One aspect of instruction selection we haven't gotten to is instruction selection for functions and function calls. Recall that calls show up in the lowered IR as call statements \(\textit{CALL}(f, e_1, \dots, e_n)\).

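On the x86-64 ISA, we want to implement these IR nodes using the instruction call. It pushes the return address (the value of the instruction pointer rip, which at that point refers to the instruction following the call) onto the stack and then jumps to the specified destination. Thus, the instruction call f behaves like the sequence sub rsp, 8; mov [rsp], rip; jmp f. This sequence is written informally, since rip cannot actually be used as the source of a mov. The call instruction is also more compact and faster.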

Calling convention

A calling convention is a standardized contract about how to invoke functions. Having a calling convention allows code generated by different compilers and languages to interoperate. Since we want to give you Xi libraries to compile against, it's important that we follow the same calling conventions.

Unfortunately, there are multiple calling conventions for the x86-64. We will describe here the System V style calling convention used by Linux. The Microsoft calling convention is similar but differs in a few details. Note that the full calling convention is more complex than described here, in order to support struct arguments that are larger than one word. But Xi does not have these language features, simplifying matters.

Unlike in the 32-bit architecture, arguments are usually passed to functions entirely in registers. The first six word-size arguments are passed in the registers rdi, rsi, rdx, rcx, r8, and r9, in that order. For functions with more than 6 arguments, the remaining arguments are pushed onto the stack, in reverse order. Using reverse order supports functions with a variable number of arguments, though this is not a consideration in Xi. The figure above shows what the stack looks like as a procedure/function call proceeds. Note that stacks grow downward in this picture, so the top of the stack is at the lowest used address, which is at the bottom!

The code for doing a function call with \(n\) arguments looks something like the following, assuming that the arguments have been placed in temporaries \(t_1, \dots, t_n\):

push tn
push tn-1
...
push t7
mov r9, t6
...
mov rdi, t1
call f
mov dest, rax
add rsp, 8*(n-6) // if n > 6

Note that the stack pointer register rsp always points to the top entry of the stack. So the instruction push x is equivalent to sub rsp, 8; mov [rsp], x, and there is a corresponding instruction pop x that does the opposite: mov x, [rsp]; add rsp, 8.

This code sequence also shows how function results are handled. The result of a function, if any, comes back in the rax register. Once the function returns, the caller should clean up the stack so the stack pointer is back where it started. Thus, the final instruction adds the appropriate offset to reverse the effect on rsp of all the push instructions.
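The call sequence above can be sketched as a small code generator. This is a minimal sketch, not the course's implementation: Python is used only as notation for the compiler side, the name emit_call is hypothetical, operands are plain strings, and stack alignment is ignored.

```python
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def emit_call(f, args, dest=None):
    """Return abstract assembly (a list of strings) for dest = f(args...)."""
    code = []
    stack_args = args[6:]
    # Arguments 7..n are pushed onto the stack, last argument first.
    for a in reversed(stack_args):
        code.append(f"push {a}")
    # Arguments 1..6 go in registers (emitted from r9 down to rdi, as in the listing).
    for reg, a in reversed(list(zip(ARG_REGS, args))):
        code.append(f"mov {reg}, {a}")
    code.append(f"call {f}")
    if dest is not None:
        code.append(f"mov {dest}, rax")   # the result comes back in rax
    if stack_args:
        # The caller pops its own stack arguments: 8 bytes per pushed word.
        code.append(f"add rsp, {8 * len(stack_args)}")
    return code
```

For example, emit_call("f", ["t1", "t2"], "d") yields the two register moves, the call, and mov d, rax, with no stack adjustment because nothing was pushed.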

For functions that return multiple results, the calling convention specifies that the second result comes back in rdx. The System V calling convention does not give a way to return more than two results; how this is done for Xi is described elsewhere.

Inside the function

Stack frame of a running function

Once the function is entered, it sets up its stack frame to look like the figure above. The region labeled “temporary storage” is used to store local variables and other temporaries that don't fit into registers. Because the stack pointer can move around, it is common to address locations in the stack frame relative to a different register, the frame pointer or base pointer register rbp. For example, a variable located immediately below the base pointer would be accessed with the memory operand [rbp-8]. But if rbp is used, it is necessary to save the caller's rbp before overwriting it. So rbp is saved on the stack immediately after the program counter rip of the caller.

Recall that we generate code for the body of a function defined as f(x):τ{s} as simply the statement translation \({\mathcal S}[\![s]\!]\). And the translation of return e is \(\textit{MOVE}(\textit{RV}, {\mathcal E}[\![e]\!]); \textit{RETURN}\), where \(\textit{RV}\) is really a name for rax. Therefore the IR code generated for the declaration f(x:int,y:int):int { return x+y } is something like \(\textit{MOVE}(\textit{RV},\textit{ADD}(\textit{TEMP}(x),\textit{TEMP}(y))); \textit{RETURN}\). With appropriate instruction selection, we should get something like:

mov rax, x
add rax, y
...
ret

But this code does nothing to set up the stack frame or to tear it down, or even to move the function arguments into "x" and "y". These can be achieved by adding a function prologue and epilogue:

f: push rbp
   mov rbp, rsp
   sub rsp, 8*l
   mov x, rdi
   mov y, rsi
   mov rax, x
   add rax, y
   mov rsp, rbp
   pop rbp
   ret

Here we are assuming that the stack frame needs to contain \(\ell\) temporary words.

In fact, the ISA has an instruction to accomplish the first three instructions directly: enter 8*l, 0 saves the frame pointer and adjusts the stack pointer. And the two instructions preceding ret can be accomplished using leave:

f: enter 8*l, 0
   mov x, rdi
   mov y, rsi
   mov rax, x
   add rax, y
   leave
   ret

In this case, the function doesn't really need a frame pointer at all, since it doesn't use it. Then we don't need enter and leave. And a smart register allocator can choose x=rdi, y=rsi, making two mov instructions superfluous:

f: mov rax, rdi
   add rax, rsi
   ret
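The prologue and epilogue described above can be added mechanically around a translated function body. The following is a minimal sketch under stated assumptions: wrap_function is a hypothetical helper, the body is a list of instruction strings without the final ret, and only register-passed parameters are handled.

```python
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def wrap_function(name, params, body, num_words):
    """Surround body with a prologue and epilogue; num_words is the frame size in words."""
    if len(params) > len(ARG_REGS):
        raise ValueError("stack-passed parameters are not handled in this sketch")
    # Prologue: save the caller's rbp, establish the new frame, reserve temporary storage.
    code = [f"{name}:", "push rbp", "mov rbp, rsp", f"sub rsp, {8 * num_words}"]
    # Move incoming arguments from their registers into the parameter temporaries.
    code += [f"mov {p}, {r}" for p, r in zip(params, ARG_REGS)]
    code += body
    # Epilogue: discard the frame, restore the caller's rbp, and return.
    code += ["mov rsp, rbp", "pop rbp", "ret"]
    return code
```

Calling wrap_function("f", ["x", "y"], ["mov rax, x", "add rax, y"], 2) reproduces the push rbp / mov rbp, rsp / sub rsp form of the example with a two-word frame.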

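A couple of details to watch out for: first, the stack pointer rsp is required to be 16-byte aligned whenever a call instruction is performed. (This is a requirement of various system libraries.) Since the call instruction itself pushes an 8-byte word onto the stack (rip), the stack pointer is misaligned by 8 bytes on entry to a function. A function that makes calls of its own must therefore adjust rsp by an odd multiple of 8 bytes in total before each call; for instance, push rbp restores alignment provided the rest of the frame is a multiple of 16 bytes. In the example function above, this is not a problem, because it is a leaf procedure that doesn't call anything else.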

Second, it is generally inadvisable to access memory below the current stack pointer, because interrupt routines are likely to overwrite anything there. However, the System V (Linux) calling conventions introduce a “red zone” of 16 words (128 bytes) below the stack pointer that interrupt routines will not overwrite (at least when compiling non-kernel code). Consequently, leaf procedures can use a small amount of local storage without the cost of explicitly setting up a stack frame.
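The alignment requirement comes down to counting 8-byte words pushed since the last aligned point. The sketch below shows one way a compiler might pad for it; the helper names are hypothetical, and it assumes rsp was 16-byte aligned at the call that entered the function.

```python
def padded_frame_words(num_words, saves_rbp=True):
    """Smallest frame size (in words) >= num_words that leaves rsp 16-byte aligned
    once the prologue has finished."""
    # Words already pushed since the aligned point: the return address, plus rbp if saved.
    pushed = 1 + (1 if saves_rbp else 0)
    # rsp is aligned exactly when an even number of 8-byte words has been pushed.
    if (pushed + num_words) % 2 != 0:
        num_words += 1
    return num_words

def call_padding_words(num_args):
    """Extra words to reserve before pushing stack arguments, so that rsp is aligned
    at the call instruction (assuming it was aligned after the prologue)."""
    stack_args = max(0, num_args - 6)
    return stack_args % 2
```

With a saved rbp, a three-word frame is padded to four words, and a call with seven arguments needs one word of padding before its single stack argument is pushed.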

Caller-save vs. callee-save

In the previous example, we needed to save rbp before changing it, in order to restore it to its old value when returning to the caller. The contract between the caller and callee was that callee would not change this register. In general, the calling conventions define certain registers as callee-save registers, meaning that a called procedure must restore those registers to their original values before returning. In the current calling conventions, these registers are rbp, rsp, rbx, and r12–r15. Using these registers is not worth it if the overhead of saving them to memory is more than the speedup achieved by using them.

Other registers are designated as caller-save registers, meaning that a given function is permitted to change them arbitrarily. These registers include rax, rcx, rdx, rsi, rdi, and r8–r11. Since called functions may overwrite these registers, it is the responsibility of the caller to save these registers (if necessary) and to restore them after the call. Clearly, performance is better if these registers are used for values that do not need to be saved and restored.
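The caller's side of this contract can be sketched as follows. This is an illustration under assumptions, not a prescribed method: save_around_call is a hypothetical helper, live_after is assumed to come from some liveness information and to exclude the register receiving the call's result, and the extra pushes would also have to be counted when aligning the stack.

```python
CALLEE_SAVE = {"rbp", "rsp", "rbx", "r12", "r13", "r14", "r15"}
CALLER_SAVE = {"rax", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11"}

def save_around_call(call_seq, live_after):
    """Wrap a complete call sequence with saves and restores of the caller-save
    registers whose values are still needed after the call."""
    # Callee-save registers need no action here: the callee must preserve them.
    to_save = sorted(CALLER_SAVE & set(live_after))
    saves = [f"push {r}" for r in to_save]
    # Restore in the opposite order, since the stack is last-in, first-out.
    restores = [f"pop {r}" for r in reversed(to_save)]
    return saves + list(call_seq) + restores
```

The saves are placed around the whole sequence, before any stack arguments are pushed, so that the callee still finds its arguments where it expects them.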

Eliminating the base pointer

The previous example shows that the frame pointer doesn't need to be used in leaf procedures. We can also avoid using a register to keep track of the frame pointer when the offset between the stack pointer and the frame pointer is known at compile time. Suppose that the compiler knows the offset is \(8\ell\). Then any memory reference of the form [rbp + k] can be equally well written as [rsp + k'] where the constant \(k'\) is equal to \(k + 8\ell\). Once all such references to rbp are removed from the code, there is no need to use rbp as a frame pointer, and therefore no need to save it and restore it in the prologue and epilogue. Instead, rbp can be used as a general-purpose register!

Although this trick is rather nice, it does mean that the distance between the stack pointer and the frame pointer must be statically known. This rules out using dynamic allocation on the stack, including on-stack arrays with dynamically determined length or the alloca() function familiar to experienced C programmers.
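The rewriting of rbp-relative operands into rsp-relative ones is a textual substitution once \(\ell\) is known. Here is a minimal sketch using the formula \(k' = k + 8\ell\) from above; eliminate_rbp is a hypothetical name, and the sketch assumes rsp stays fixed throughout the function body (no push or pop between the prologue and epilogue).

```python
import re

# Matches [rbp], [rbp+k], or [rbp-k] with a decimal constant k.
RBP_OPERAND = re.compile(r"\[rbp(?:\s*([+-])\s*(\d+))?\]")

def eliminate_rbp(instr, num_words):
    """Rewrite each [rbp + k] in instr as [rsp + k'] where k' = k + 8*num_words."""
    def rewrite(m):
        k = int(m.group(2)) if m.group(2) else 0
        if m.group(1) == "-":
            k = -k
        k2 = k + 8 * num_words
        return "[rsp]" if k2 == 0 else f"[rsp{k2:+d}]"
    return RBP_OPERAND.sub(rewrite, instr)
```

With a two-word frame, [rbp-8] becomes [rsp+8] and [rbp-16] becomes [rsp].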

Trivial register allocation

At this point, the translation has produced abstract assembly code, in which temporaries have been mapped to an unbounded set of registers. The set of registers used includes not only the real registers supported by the ISA, but also pseudo-registers named like the corresponding temporaries. The job of register allocation is to replace these abstract registers with real registers, rewriting the code as necessary.

Doing a good job of register allocation is challenging. However, it is not hard to generate code if all temporaries (abstract registers) are assigned to stack locations. For example, gcc -O0 uses this approach.

Temporaries are assigned to distinct stack locations (e.g., [rbp-8], [rbp-16], etc.). Any given instruction uses at most 3 register operands, so 3 registers are reserved for the purpose of getting values onto and off of the stack. It makes sense to select three caller-save registers, such as rax, rcx, and rdx. In general, instructions are inserted before each abstract assembly instruction, to read operand registers from the stack, and further instructions are inserted afterward, to write results back to the stack. Essentially, each abstract register is allocated to one of these three reserved registers for the duration of a single assembly instruction.

Suppose that we have allocated temporary t1 to location [rbp-8]. Then the abstract assembly instruction push t1 is converted to concrete assembly that first loads the operand into one of the three reserved registers, and then performs the original instruction with that reserved register serving the role of the temporary:

mov rax, [rbp-8]
push rax

For instructions that update temporaries, instructions are added afterward to move the new values of the temporaries into the appropriate stack locations. So the transformation of the instruction add t2, [t1+8] might be the following, assuming that t1 is located at [rbp-8] and t2 is located at [rbp-16]:

mov rcx, [rbp-8]
mov rax, [rbp-16]
add rax, [rcx+8]
mov [rbp-16], rax

A couple of simple optimizations can help. On CISC instruction sets like x86-64, many instructions can read operands directly from memory or write results directly to memory, avoiding the need to add some instructions. The first conversion above could have been done just as push [rbp-8], for example. Also, some temporaries can be allocated to registers, as long as three registers are reserved for the job of shuttling temporaries on and off the stack. For example, the callee-save registers rbx and r12–r15 are not serving any special purpose yet; nor are the caller-save registers r10–r11.
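The load, execute, store pattern described in this section can be sketched as a rewrite of one instruction at a time. This is a simplified illustration under several assumptions: spill_all and WRITES_FIRST are hypothetical names, instructions are plain strings without labels, implicit register operands (as in idiv) are ignored, and none of the optimizations just mentioned are applied.

```python
import re

RESERVED = ["rax", "rcx", "rdx"]
# Assumption: opcodes whose first operand is overwritten (not a complete list).
WRITES_FIRST = {"mov", "add", "sub", "imul", "and", "or", "xor", "lea", "pop"}
NAME = re.compile(r"\b[A-Za-z_]\w*")

def spill_all(instr, slots):
    """Rewrite one abstract instruction so that every temporary lives on the stack.

    slots maps temporary names to rbp offsets, e.g. {"t1": -8}.
    """
    opcode, _, rest = instr.partition(" ")
    names = NAME.findall(rest)
    temps = list(dict.fromkeys(n for n in names if n in slots))
    # Use only reserved registers that the instruction does not already name.
    free = [r for r in RESERVED if r not in names]
    if len(temps) > len(free):
        raise ValueError(f"not enough reserved registers for: {instr}")
    reg_of = dict(zip(temps, free))

    def slot(t):
        return f"[rbp{slots[t]:+d}]"

    # Load every temporary first; for a pure destination the load is redundant but harmless.
    out = [f"mov {reg_of[t]}, {slot(t)}" for t in temps]
    new_rest = NAME.sub(lambda m: reg_of.get(m.group(0), m.group(0)), rest)
    out.append(f"{opcode} {new_rest}" if rest else opcode)
    # Write back the first operand if it is a bare temporary that the opcode overwrites.
    first = rest.split(",")[0].strip()
    if opcode in WRITES_FIRST and first in slots:
        out.append(f"mov {slot(first)}, {reg_of[first]}")
    return out
```

Applied to push t1 with t1 at [rbp-8], this produces the two-instruction sequence shown earlier. For add t2, [t1+8] it produces the same four instructions as the example, except that the two loads appear in the opposite order.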

Linear-Scan Register Allocation

Really effective register allocation requires more program analysis than we have had a chance to learn at this point, but the linear scan allocation of Poletto and Sarkar is simple to implement and improves performance significantly above the 3-register technique just described. The algorithm also has the benefit that it is much faster than more sophisticated methods, making it attractive for JIT compilers where compilation speed is important.

The algorithm works by coarsely approximating the instructions at which each temporary needs to be assigned to some register. Given a sequence of instructions on which register allocation needs to be done (usually, the body of a function), the algorithm approximates the live range of each temporary as a single contiguous sequence of instructions in which the value of the temporary might be needed. For each temporary, the beginning of its live range can be approximated by scanning from the beginning of the code to find the first instruction that might change its value; to find the end of the live range, the compiler can scan backward from the end of the code to find the first instruction it encounters (that is, the last one in program order) that might use the value of the temporary.
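The live-range approximation can be computed in a single pass. The sketch below is slightly more conservative than the description above: it takes the first and last instruction that mention a temporary at all, whether as a definition or a use. The representation of each instruction as a (defs, uses) pair of sets is an assumption made for illustration.

```python
def live_intervals(instrs):
    """Map each temporary to an inclusive (start, end) pair of instruction indices.

    instrs is a list of (defs, uses) pairs, each a set of temporary names.
    """
    intervals = {}
    for i, (defs, uses) in enumerate(instrs):
        for t in defs | uses:
            # Keep the first index seen as the start; extend the end to this index.
            start, _ = intervals.get(t, (i, i))
            intervals[t] = (start, i)
    return intervals
```

For a four-instruction sequence that defines a, then b from a, then c from a and b, and finally uses c, the intervals are a: (0, 2), b: (1, 2), and c: (2, 3).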

Given the live ranges for each temporary, the compiler then scans from the beginning of the code to the end, assigning a register (or stack location, if no registers are available) to each temporary. At each instruction where a temporary live range starts, an available register is assigned to it; at each instruction where a live range ends, that register is made available again.
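The scan itself might look like the following sketch. It is a simplification rather than Poletto and Sarkar's full algorithm: when no register is free, it simply places the new temporary on the stack instead of choosing which live range to spill, and a register is reused only for a range that starts strictly after the previous one ends. The name linear_scan and the ("stack", n) encoding are assumptions.

```python
def linear_scan(intervals, registers):
    """Assign each temporary a register name or a ("stack", n) slot.

    intervals maps temporaries to inclusive (start, end) instruction indices.
    """
    free = list(registers)
    active = []        # (end, temp) pairs that currently hold a register
    location = {}
    next_slot = 0
    # Visit temporaries in order of increasing start point.
    for temp, (start, end) in sorted(intervals.items(), key=lambda kv: (kv[1][0], kv[0])):
        # Release registers whose live ranges ended before this one starts.
        for old_end, old in list(active):
            if old_end < start:
                active.remove((old_end, old))
                free.append(location[old])
        if free:
            location[temp] = free.pop(0)
            active.append((end, temp))
        else:
            location[temp] = ("stack", next_slot)
            next_slot += 1
    return location
```

With two registers and intervals a: (0, 2), b: (1, 2), c: (2, 3), d: (3, 4), the temporaries a and b get the registers, c goes to the stack because both registers are still occupied at instruction 2, and d reuses the register freed by a.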

Note that instructions of the form mov t1, t2 offer the opportunity to do move coalescing. If the same register is assigned to both t1 and t2, the mov instruction can be deleted from the code. To assign the same register to both temporaries, we should check that the live range of temporary t2 ends at the mov instruction and that the live range of t1 begins at it.
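The coalescing condition is a direct check on the two live ranges. A minimal sketch, assuming intervals in the inclusive (start, end) form and a hypothetical helper name:

```python
def can_coalesce(index, dst, src, intervals):
    """True if `mov dst, src` at instruction `index` can be deleted by giving
    dst and src the same register."""
    src_ends_here = intervals[src][1] == index
    dst_starts_here = intervals[dst][0] == index
    return src_ends_here and dst_starts_here
```

For mov t1, t2 at instruction 3, with t2 live over (0, 3) and t1 live over (3, 7), the check succeeds. An allocator that wants to exploit this must let t1 take over the register of t2 at that instruction rather than waiting until after it.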