At this point we have seen how programs can be checked by lexical analysis, syntactic analysis, and semantic analysis, and how these compiler phases can extract a representation of the program as an abstract syntax tree. Assuming none of these compiler phases has detected an error, the program is known at this point in compilation to be a legal program in the programming language. Barring bugs in the compiler or resource limitations, the compiler should be able to translate it to the target language.
However, it is difficult to do a good job of translating directly from a typical high-level AST to low-level code such as assembly code. Both of these code representations are inconvenient for the kinds of program analysis and optimization that are needed to achieve good performance. For one thing, they are complex. A typical high-level AST has dozens of different node types, and assembly code is much worse, with hundreds of different instructions. Hence, optimizing compilers usually translate the AST to one or more intermediate representations (IRs) that support analysis and optimization better.
Some of the desirable properties of a good IR are:
- **Simplicity.** If the IR has a small number of constructs, it is easier to analyze code represented in it. The IR that we will use has about a dozen constructs, so it is simpler than the source language. Simplification is possible because the IR does not need to be a user-friendly language and because the program is already known to be valid.
- **Machine independence.** Ideally, an IR does not encode aspects of the target language such as the underlying calling conventions used at the assembly level. If the IR is machine-independent, the same compiler front end can be used to target multiple architectures in the back end.
- **Language independence.** Conversely, an IR that is not tied closely to the source language means the back end of the compiler can be reused for multiple source languages. For example, LLVM IR and JVM bytecode have both been used successfully as IRs for a number of source languages.
- **Support for code transformation.** Optimization involves rearranging and rewriting program code. A good IR makes it easy to perform these code transformations.
- **Support for code generation.** The IR code is generated from source code, and it is used to generate target code. A good IR is a compromise between high-level and low-level representations, so that both code generation tasks are made easier.
Meeting all these goals simultaneously may not be feasible! As a result, many compilers use multiple intermediate representations. A high-level intermediate representation (HIR) is close to the language AST and supports high-level optimizations such as inlining and constant folding. Mid-level intermediate representations (MIRs), like the one we will use, support program analysis and reorganization of program control flow. Low-level intermediate representations are closer to assembly code but retain some unrealistic features, such as higher-level pseudo-instructions or an unbounded number of registers. The compilation process successively lowers the code through a series of such representations, taking advantage of the useful properties of each representation.
For mid-level IRs, there are multiple popular approaches.
The approach we focus on in this course is a low-level AST representation. It is low-level in that it has explicit memory operations and low-level control flow (jumps and conditional branches). The IR we will use is based on the one in Appel's “Tiger book”. This kind of IR strikes a nice balance between the different uses of the IR. It is easy to generate from a source-level AST, but it is sufficiently low-level that it also makes assembly code generation easy. Because it allows complex expressions, high-quality instruction selection is made easier, especially on CISC instruction sets such as those in the x86 family. It can also be analyzed and optimized fairly conveniently.
A classic IR approach is code for an abstract stack machine. This idea goes back to the p-code compilers of the 1970s, but more recent bytecode representations like Smalltalk's virtual machine and the Java Virtual Machine (JVM) are also examples of stack-machine IRs. For example, the JVM evaluates an expression like x + y*z by pushing the three operands onto a stack and then executing multiply and add instructions that pop their operands and push their result. Stack-machine code is easy to generate recursively from a high-level AST, and it is also conducive to building an interpreter. However, stack-machine code is more awkward for program analysis and transformation, and for code generation.
A lower-level representation that is customized for easy analysis is three-address code, also known as quadruples. The code consists mainly of low-level instructions of the form "x1 ← x2 OP x3", where the operands are either variables or constants. For example, the assignment x = a*b + a*c might become the sequence t1 ← a * b; t2 ← a * c; x ← t1 + t2, where t1 and t2 are compiler-introduced temporaries. The lack of complex subexpressions makes this kind of representation particularly congenial to program analysis and transformation. Typically, these instructions are organized into basic blocks: maximal straight-line sequences of instructions that can be entered only at the top and exited only at the bottom. The program is then represented as a graph of basic blocks.
Converting high-level code to this representation is more work than with the tree representation described above, especially when the code is required to be in static single assignment (SSA) form, in which no variable is assigned more than once. The highly popular LLVM representation is closest to this kind of IR.
Compilers for functional programming languages often use a continuation-passing style (CPS) representation, in which code points are treated as first-class values. This representation simplifies the analyses and transformations used in functional languages. A-normal form (ANF) is a version of this approach in which complex subexpressions are ruled out, so it ends up looking much like three-address code.
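To make this concrete, here is a small illustration in OCaml itself of the idea behind A-normal form; the functions `f` and `g` are hypothetical placeholders, not part of any particular compiler.

```ocaml
(* Direct style: the argument of f contains nested subexpressions. *)
let direct f g x = f (g x + 1)

(* A-normal form: every intermediate result is bound to a name, so the
   code reads much like three-address code. *)
let anf f g x =
  let t1 = g x in
  let t2 = t1 + 1 in
  f t2
```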
We will use a tree-structured intermediate representation based on the one in the Tiger book. It has two kinds of terms: expressions, which compute a word-sized value, and statements, which have some side effect but do not compute a value.
| Syntax | | Explanation | Abbreviation |
|---|---|---|---|
| (Expr) e ::= | CONST(n) | Literal (bool, int, ...) | \(n\) |
| \| | TEMP(t) | Variable or temporary value | \(t\) |
| \| | OP(e1, e2) | Arithmetic/logical operation | \(e_1~\mathit{OP}~e_2\) |
| \| | MEM(e) | Contents of memory at location e | |
| \| | CALL(ef, e1, ..., en) | Function or procedure call | |
| \| | NAME(l) | Address of a labeled memory location (code or data) | |
| \| | ESEQ(s, e) | Execute statement s, then evaluate e | |
A number of different operators OP are allowed, reflecting the capabilities of the typical processor. We might even want to add more!
Statements are defined as follows.
| Syntax | | Explanation | Abbreviation |
|---|---|---|---|
| (Stmt) s ::= | MOVE(edest, e) | Move the value of e to the location edest, which must have either the form TEMP(t) or MEM(e) | |
| \| | SEQ(s1, ..., sn) | Sequential composition of IR statements | \(s_1; \dots ; s_n\) |
| \| | JUMP(e) | Jump to code address e (usually computed using NAME) | |
| \| | CJUMP(e, l1, l2) | Jump to label l1 if e evaluates to non-zero, and to l2 otherwise | |
| \| | LABEL(l) | This statement has no effect when executed. It merely gives a name to the next statement. The name must be unique in the current compilation unit | |
| \| | RETURN(e1, ..., en) | Return 0 or more values from the current function. The values are placed into temporaries named RV0, RV1, etc. | |
For simplicity we require that an expression used as the destination of a MOVE has one of the forms TEMP(t) or MEM(e), since only these denote assignable locations.
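To make the shape of this IR concrete, here is one way the two syntactic classes might be declared as mutually recursive OCaml datatypes. This is only a sketch based on the tables above: the constructor names, the factored-out `binop` type, and the choice of `int64` for word-sized values are our own illustrative choices, not a prescribed interface.

```ocaml
(* A sketch of the IR as OCaml datatypes, following the tables above. *)
type label = string
type temp = string

(* An illustrative set of operators; a real IR might include more. *)
type binop = ADD | SUB | MUL | DIV | AND | OR | XOR | LSHIFT | RSHIFT

type expr =
  | Const of int64                 (* CONST(n): literal word-sized value *)
  | Temp of temp                   (* TEMP(t): variable or temporary *)
  | Op of binop * expr * expr      (* OP(e1, e2): arithmetic/logical op *)
  | Mem of expr                    (* MEM(e): contents of memory at e *)
  | Call of expr * expr list       (* CALL(ef, e1, ..., en) *)
  | Name of label                  (* NAME(l): address of label l *)
  | Eseq of stmt * expr            (* ESEQ(s, e): run s, then evaluate e *)

and stmt =
  | Move of expr * expr            (* MOVE(edest, e); edest is Temp or Mem *)
  | Seq of stmt list               (* SEQ(s1, ..., sn) *)
  | Jump of expr                   (* JUMP(e) *)
  | CJump of expr * label * label  (* CJUMP(e, l1, l2) *)
  | Label of label                 (* LABEL(l): names the next statement *)
  | Return of expr list            (* RETURN(e1, ..., en) *)
```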
It is not difficult to implement a recursive interpreter for this code representation, and we have in fact supplied one that you can use. The trickiest part is implementing jumps, because they stop the current recursive evaluation and start a new one that needs to begin from the location of the label jumped to.
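To illustrate the point about jumps, here is a sketch of how the core of such an interpreter might look, using the datatypes above. This is not the supplied interpreter: it assumes the code has been flattened into an array in which every LABEL appears at the top level, it omits memory, calls, and returns, and it lets uninitialized temporaries read as 0. A jump raises an exception that unwinds the current recursive evaluation; the driver catches it and resumes just after the target label.

```ocaml
exception Jumped of label

(* Evaluate an IR expression to a word-sized value. *)
let rec eval (env : (temp, int64) Hashtbl.t) (e : expr) : int64 =
  match e with
  | Const n -> n
  | Temp t -> (try Hashtbl.find env t with Not_found -> 0L)
  | Op (op, e1, e2) ->
      let a = eval env e1 and b = eval env e2 in
      (match op with
       | ADD -> Int64.add a b
       | SUB -> Int64.sub a b
       | MUL -> Int64.mul a b
       | _ -> failwith "operator not handled in this sketch")
  | Eseq (s, e') -> exec env s; eval env e'
  | Mem _ | Call _ | Name _ -> failwith "not handled in this sketch"

(* Execute an IR statement for its side effects; jumps unwind the
   current recursive evaluation by raising Jumped. *)
and exec (env : (temp, int64) Hashtbl.t) (s : stmt) : unit =
  match s with
  | Move (Temp t, e) -> Hashtbl.replace env t (eval env e)
  | Seq ss -> List.iter (exec env) ss
  | Label _ -> ()                  (* labels do nothing when executed *)
  | Jump (Name l) -> raise (Jumped l)
  | CJump (e, l1, l2) -> raise (Jumped (if eval env e <> 0L then l1 else l2))
  | _ -> failwith "statement not handled in this sketch"

(* Driver: run a flat array of statements in order. When a jump escapes,
   find the target LABEL and restart execution just after it. *)
let run (code : stmt array) : (temp, int64) Hashtbl.t =
  let env = Hashtbl.create 16 in
  let find_label l =
    let rec scan i =
      if i >= Array.length code then failwith ("undefined label " ^ l)
      else match code.(i) with
        | Label l' when l' = l -> i
        | _ -> scan (i + 1)
    in
    scan 0
  in
  let rec step i =
    if i < Array.length code then
      match exec env code.(i) with
      | () -> step (i + 1)
      | exception Jumped l -> step (find_label l + 1)
  in
  step 0;
  env
```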
Now we are ready to introduce one of the key ideas of compiler construction. We express translation as a recursive traversal of the source language (the AST) into the target language. The key idea is that for a given source-language construct, there is just one possible translation into the target language. We can define the translation as a set of rules, where the rule to use is always clear from the syntax of the AST node under consideration. Thus, the translation is syntax-directed in the same way that the typing rules were previously.
Following the approach we have taken in this course, we develop a crisp, formal definition of the translation process. We define two translation functions: \({\mathcal E}[\![e]\!]\), which maps a source expression \(e\) to an IR expression that computes its value, and \({\mathcal S}[\![s]\!]\), which maps a source statement \(s\) to an IR statement with the same effect.
Once we understand the specification of these two functions, it becomes relatively easy to develop the translations of each of the syntactic forms.
Let's start by developing rules for \({\mathcal E}[\![e]\!]\). We simply need to consider each syntactic form in turn. The translations of boolean and integer literals are simply constants: \begin{align*} {\mathcal E}[\![\texttt{false}]\!] & = \textit{CONST}(0) \\ {\mathcal E}[\![\texttt{true}]\!] & = \textit{CONST}(1) \\ {\mathcal E}[\![\texttt{n}]\!] & = \textit{CONST}(n) \end{align*}
Here we have made the standard choice to represent \(\texttt{true}\) as 1 and \(\texttt{false}\) as 0. We could have made a different choice! But this choice is convenient for compatibility with external code.
Most mathematical operators can be translated directly into the underlying IR operation:
\begin{align*} {\mathcal E}[\![e_1 + e_2]\!] & = \textit{ADD}({\mathcal E}[\![e_1]\!], {\mathcal E}[\![e_2]\!]) \\ {\mathcal E}[\![e_1 - e_2]\!] & = \textit{SUB}({\mathcal E}[\![e_1]\!], {\mathcal E}[\![e_2]\!]) \\ \dots \end{align*}Notice that we must recursively apply the translation function to the subexpressions \(e_1\) and \(e_2\), because these subexpressions are source-language terms that certainly can't appear directly in an IR expression.
The translation of a local variable or function parameter will simply be a temporary:
\begin{align*} {\mathcal E}[\![x]\!] & = \textit{TEMP}(x) \end{align*}The translation of a function call uses the \(\textit{CALL}\) node:
\begin{align*} {\mathcal E}[\![f(e_1,\dots,e_n)]\!] & = \textit{CALL}({\textit NAME}(f), {\mathcal E}[\![e_1]\!],\dots,{\mathcal E}[\![e_n]\!]) \end{align*}At this point we have enough translation rules that we can use them together to translate an expression that we might expect to see in the implementation of GCD: \begin{align*} {\mathcal E}[\![\texttt{gcd(x, y-x)}]\!] & = \textit{CALL}(\textit{NAME}(\texttt{gcd}), \textit{TEMP}(x), \textit{SUB}(\textit{TEMP}(y), \textit{TEMP}(x))) \end{align*}
Of course, keep in mind that both the translated source expression and the result of translation are really trees.
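Because the translation is syntax-directed, the rules so far transcribe almost directly into code. Here is a sketch using the IR datatypes from above; the `src_expr` type and its constructors are hypothetical stand-ins for the real source AST.

```ocaml
(* A hypothetical miniature source AST, standing in for the real one. *)
type src_expr =
  | SBool of bool
  | SInt of int64
  | SVar of string
  | SAdd of src_expr * src_expr
  | SSub of src_expr * src_expr
  | SCall of string * src_expr list

(* A sketch of E[[.]]: one case per syntactic form, recursing on
   subexpressions exactly as the rules do. *)
let rec translate_expr (e : src_expr) : expr =
  match e with
  | SBool false -> Const 0L
  | SBool true -> Const 1L
  | SInt n -> Const n
  | SVar x -> Temp x               (* local variable -> temporary *)
  | SAdd (e1, e2) -> Op (ADD, translate_expr e1, translate_expr e2)
  | SSub (e1, e2) -> Op (SUB, translate_expr e1, translate_expr e2)
  | SCall (f, args) -> Call (Name f, List.map translate_expr args)

(* The gcd example from the text: *)
let example =
  translate_expr (SCall ("gcd", [SVar "x"; SSub (SVar "y", SVar "x")]))
(* = Call (Name "gcd", [Temp "x"; Op (SUB, Temp "y", Temp "x")]) *)
```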
The translation function \({\mathcal S}[\![\cdot]\!]\) can be defined similarly, using \({\mathcal E}[\![\cdot]\!]\) to handle subexpressions.
\begin{align*} {\mathcal S}[\![x = e]\!] &= \textit{MOVE}(\textit{TEMP}(x), {\mathcal E}[\![e]\!]) \\ {\mathcal S}[\![s_1; \dots; s_n]\!] &= \textit{SEQ}({\mathcal S}[\![s_1]\!], \dots,{\mathcal S}[\![s_n]\!]) \end{align*}To translate source-level control structures, we need IR statements that transfer control. In the following translation, there is a conditional jump around the code for \(s\):
\begin{align*} {\mathcal S}[\![\texttt{if (}e{)}~s]\!] &= \textit{SEQ}( \\ & \textit{CJUMP}({\mathcal E}[\![e]\!], l_t, l_f), \\ & \textit{LABEL}(l_t), \\ & {\mathcal S}[\![s]\!], \\ & \textit{LABEL}(l_f)) \end{align*}You might wonder where the labels \(l_t\) and \(l_f\) come from. These are intended to denote fresh labels that appear nowhere else in the translated code. In general, translation rules will introduce fresh labels and temporaries like these.
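In an implementation, fresh names are easy to generate: a global counter suffices to guarantee uniqueness within the compilation unit. A minimal sketch (the names are our own):

```ocaml
(* Sketch of fresh-label generation: a counter guarantees that each
   generated name is unique in the current compilation unit. *)
let label_counter = ref 0

let fresh_label (prefix : string) : label =
  incr label_counter;
  Printf.sprintf "%s%d" prefix !label_counter
```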
We translate `while`-statements by introducing a label to serve as a loop header: \begin{align*} {\mathcal S}[\![\texttt{while (}e{)}~s]\!] &= \textit{SEQ}( \\ & \textit{LABEL}(l_h), \textit{CJUMP}({\mathcal E}[\![e]\!], l_t, l_f), \\ & \textit{LABEL}(l_t), {\mathcal S}[\![s]\!], \\ & \textit{JUMP}(\textit{NAME}(l_h)), \\ & \textit{LABEL}(l_f)) \end{align*}
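The statement rules transcribe just as directly as the expression rules did. The sketch below uses a hypothetical `src_stmt` type as a stand-in for the real source AST, along with `translate_expr` and `fresh_label` from the sketches above.

```ocaml
(* A hypothetical miniature source statement AST. *)
type src_stmt =
  | SAssign of string * src_expr
  | SSeq of src_stmt list
  | SIf of src_expr * src_stmt
  | SWhile of src_expr * src_stmt

(* A sketch of S[[.]], generating fresh labels as the rules require. *)
let rec translate_stmt (s : src_stmt) : stmt =
  match s with
  | SAssign (x, e) -> Move (Temp x, translate_expr e)
  | SSeq ss -> Seq (List.map translate_stmt ss)
  | SIf (e, body) ->
      let lt = fresh_label "l_t" and lf = fresh_label "l_f" in
      Seq [ CJump (translate_expr e, lt, lf);
            Label lt;
            translate_stmt body;
            Label lf ]
  | SWhile (e, body) ->
      let lh = fresh_label "l_h" in
      let lt = fresh_label "l_t" and lf = fresh_label "l_f" in
      Seq [ Label lh;
            CJump (translate_expr e, lt, lf);
            Label lt;
            translate_stmt body;
            Jump (Name lh);
            Label lf ]
```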
As we see next, the translations of `if` and `while` are actually less efficient than we would like. And we also need to see how to translate functions and arrays...