The payoff of all this parsing theory is that it underpins tools that can generate parser code for us automatically. Some parser generators (yacc, bison, CUP, ...) are based on LALR; others (ANTLR) are based on LL. Parser generators allow programmers to specify not only a grammar, but also actions to be performed when each production is completed.
What should these actions do? Back when memory was expensive, some compilers would emit low-level output code directly as they parsed, often writing it out to an external file for later compiler passes to work on. That implementation strategy saves memory, but it has the weakness that the code generated for a given source-level expression can depend only on the prefix of the source code seen up to the point where the expression occurs. This limitation is why languages like C require that all functions be declared before they are used. With such a compiler, implementing recursion and forward declarations requires "backpatching" generated code with information learned later in the compilation. Generating optimized code is made more difficult, and the compiler tends to end up as monolithic code that is hard to tease apart into maintainable modules.
To support more declarative languages like Java, in which method and class definitions can appear in any order, modern compilers do not emit code directly; instead, they build an abstract syntax tree during parsing.
An abstract syntax tree (AST) is a high-level intermediate representation of the program, in which the constructs of the programming language are nodes in the tree. In fact, a programming-language theorist is likely to consider the AST as the true program, with its original textual representation merely an unfortunate necessity for obtaining it.
It's important to distinguish between the abstract syntax tree for a program and the parse tree. The parse tree describes how to use the grammar productions to derive the input, but it includes nodes that are unnecessary for later stages of the compiler. For example, with a simple arithmetic grammar, the AST for the expression "(1+2)*3" would not include nodes representing parentheses or nodes for nonterminals.
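As a rough illustration, that AST can be written directly as constructor calls (the node classes Times, Plus, and Number follow the naming used in the examples below):

// Hypothetical AST for "(1+2)*3": only operators and operands appear;
// the parentheses and nonterminals leave no trace in the tree.
Expr ast = new Times(new Plus(new Number(1), new Number(2)),
                     new Number(3));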
How an AST is best represented depends on the programming language. In an object-oriented language like Java, it is convenient to design the AST as an object-oriented class hierarchy, using method dispatch to execute code specific to different AST node types. For example, the AST for a Java-like language might be defined by a class hierarchy with classes like those shown here.
Object-Oriented AST design
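A minimal sketch of such a hierarchy, assuming only the node types used in the examples in this section (field and constructor details are illustrative):

/** Base class for all AST expression nodes. */
abstract class Expr { }

/** A numeric literal such as 3. */
class Number extends Expr {
    final int value;
    Number(int value) { this.value = value; }
}

/** A use of a variable or other named entity. */
class Identifier extends Expr {
    final String name;
    Identifier(String name) { this.name = name; }
}

/** An addition node: left + right. */
class Plus extends Expr {
    final Expr left, right;
    Plus(Expr left, Expr right) { this.left = left; this.right = right; }
}

/** A multiplication node: left * right. */
class Times extends Expr {
    final Expr left, right;
    Times(Expr left, Expr right) { this.left = left; this.right = right; }
}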
An alternative approach that designers sometimes use is to have a smaller number of classes to represent nodes, perhaps even just one class. However, this approach is not recommended. It requires that the node class be able to store all of the information that any kind of node might have as an attribute, and it prevents the type system from providing useful guidance.
If implementing a compiler in a functional language like OCaml, it will be more convenient to express the AST as a variant type (aka algebraic datatype), with a declaration like the following:
type expr = Plus of expr * expr | Times of expr * expr | Not of expr | ...
In a top-down parser, it is natural to have each parsing function return an AST node as its result. For example, the function for parsing arithmetic expressions might look something like the following. Notice that this code will construct exactly the same AST for the expression "(3)" as for "3".
Expr parseExpr() {
    switch (peek()) {
    case ID:
        return new Identifier(consume(ID));
    case LPAREN:
        consume(LPAREN);
        Expr e = parseExpr();
        consume(RPAREN);
        return e;
    ...
    }
}
In a bottom-up parser, a result can be associated with every symbol on the parser stack. These results can then be used in the action associated with each production to construct a result when reducing the production. For example, if using the CUP parser generator, we might have a rule for addition like the following:
expr ::= expr:e1 PLUS expr:e2    {: RESULT = new Plus(e1, e2); :}
       | NUMBER:n                {: RESULT = new Number(n); :}
       | LPAREN expr:e RPAREN    {: RESULT = e; :}
       ;
CUP allows variable names (e1, e2) to be associated with symbols on the right-hand side of each production. These variables are bound to the results computed for those symbols, and can be used as suggested to construct an AST node for the current production.
For example, consider parsing the expression "(1+2)*3" with the kind of arithmetic grammar we've been considering:
E → n | ( E ) | E + E | E * E
Consider what happens at the point where the expression "1+2" is about to be reduced using the production E → E + E, as shown below. The parser stack σ contains four symbols; the two E symbols carry with them the results of two previous reductions. The action taken builds a new Plus node that refers to the previously created Number nodes:
Building an AST node during a bottom-up reduction
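To make the mechanism concrete, here is a rough sketch, not CUP's actual generated code, of how an LR parser might keep semantic values on a stack parallel to its symbol stack and combine them when reducing by E → E + E (the class and method names are illustrative):

import java.util.ArrayDeque;
import java.util.Deque;

class ValueStackDemo {
    // Semantic values carried alongside the parser's symbol/state stack.
    private final Deque<Expr> values = new ArrayDeque<>();

    // Called when a NUMBER token is shifted: push its semantic value.
    void shiftNumber(int n) {
        values.push(new Number(n));
    }

    // Called when reducing by E -> E + E: pop the values for the
    // right-hand-side symbols (PLUS carries no value) and push the
    // value associated with the newly reduced E.
    void reducePlus() {
        Expr e2 = values.pop();   // right operand
        Expr e1 = values.pop();   // left operand
        values.push(new Plus(e1, e2));
    }
}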
It is helpful if a compiler can recover from errors in the program and continue to give useful feedback to the programmer. Modern interactive development environments (IDEs) typically report multiple errors. Some can even generate partial code for programs containing errors, allowing developers to test the other parts of the program.
How can a compiler recover from a syntax error in the source program? The key idea is to "wall off" syntax errors in the parse tree. At some point in the input, a syntax error is detected; the parser then abandons part of the parse tree, scanning forward heuristically to a synchronization point where there are tokens that follow the "bad part". It then resumes the parse. The following figure depicts this process. The parse tree is shown as a white triangle; the abandoned part of the parse is shown in gray.
Recovering from a syntax error
LALR parser generators like CUP typically support this with a special nonterminal symbol named error that indicates where in the grammar a syntax error can be tolerated. We can think of the error nonterminal as deriving an arbitrary, but probably short, sequence of tokens. The parser uses this nonterminal when it must skip past ill-formed syntax.
For example, in CUP we might want to catch errors at the granularity of statements for a Java-like source language. We could then add an error production to the corresponding nonterminal:
stmt ::= error SEMI {: reportSyntaxError(); RESULT = new ErrorStmt(); :}
When a syntax error is encountered, the parser uses productions that include this symbol to find a good synchronization point. It first pops the stack until the top of the stack is a state in which error can be shifted. It then scans forward heuristically until it finds a good synchronization point. How this is done depends on the parser. CUP scans forward until a certain number of tokens (by default, 3) can be parsed successfully without generating another error. The yacc parser generator uses a less robust technique: if the right-hand side of the production has symbols \(\beta\) following error, it scans forward until it finds a token in \(\textit{FIRST}(\beta)\). With the example production above, the yacc parser would scan forward to the next semicolon. More sophisticated techniques try to find the optimal synchronization point.
In recursive-descent parsers, error recovery can be implemented manually along similar lines. Errors can cause exceptions, and exception handlers can implement error recovery. For example, we might implement statement-level error recovery as follows:
Stmt parseStmt() {
    try {
        switch (peek()) {
        // handle different statement types
        case IF: ...
        case WHILE: ...
        ...
        }
    } catch (SyntaxError e) {
        // skip forward to the next semicolon, the synchronization point
        while (peek() != SEMI) consume();
        consume(SEMI);
        return new ErrorStmt();
    }
}
Doing more sophisticated synchronization is difficult in recursive-descent parsers because the parser state is implicit in the call stack rather than exposed as an explicit machine.
As the examples have shown, error recovery does not mean that the compiler has to abandon AST construction. Instead, it can generate special error nodes in the AST that are treated specially by later compiler stages. For example, it would make sense to skip error nodes during type checking and, when generating code, to emit code that halts the program with an error message. This strategy enables early testing, even before the code fully compiles!
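A minimal sketch of how later stages might treat these nodes (the Stmt base class, its methods, and the emitted halt instruction are all illustrative assumptions, not part of any particular compiler):

/** Minimal base class for statement nodes (illustrative). */
abstract class Stmt {
    abstract void typeCheck();
    abstract void emitCode(StringBuilder out);
}

/** Placeholder node produced by error recovery. */
class ErrorStmt extends Stmt {
    @Override
    void typeCheck() {
        // The syntax error was already reported during parsing,
        // so type checking simply skips this node.
    }

    @Override
    void emitCode(StringBuilder out) {
        // Emit code that halts with a message if this statement is
        // ever reached at run time (target syntax is illustrative).
        out.append("halt \"executed code that did not compile\"\n");
    }
}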