The payoff of all this parsing theory is that it underpins tools that can generate parser code for us automatically. Some parser generators (yacc, bison, CUP, ...) are based on LALR; others (ANTLR) are based on LL. Parser generators allow programmers to specify not only a grammar, but also actions to be performed when each production is completed.
What should these actions do? Back when memory was expensive, some compilers would emit low-level output code directly as they parsed, often writing it out to an external file for later compiler passes to work on. That implementation strategy saves memory, but it has the weakness that the code generated for a given source-level expression can depend only on the prefix of the source code seen up to the point where the expression occurs. This limitation is why languages like C require that all functions be declared before they are used. With such a compiler, implementing recursion and forward declarations requires "backpatching" generated code with information learned later in the compilation. Generating optimized code is made more difficult, and the compiler tends to end up as monolithic code that is hard to tease apart into maintainable modules.
To support more declarative languages like Java, in which method and class definitions can appear in any order, modern compilers do not emit code directly; instead, they build an abstract syntax tree during parsing.
An abstract syntax tree (AST) is a high-level intermediate representation of the program, in which the constructs of the programming language are nodes in the tree. In fact, a programming-language theorist is likely to consider the AST as the true program, with its original textual representation merely an unfortunate necessity for obtaining it.
It's important to distinguish between the abstract syntax tree for a program and the parse tree. The parse tree describes how to use the grammar productions to derive the input, but it includes nodes that are unnecessary for later stages of the compiler. For example, with a simple arithmetic grammar, the AST for the expression "(1+2)*3" would not include nodes representing parentheses or nodes for nonterminals.
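As a rough illustration, that AST can be written directly as constructor calls (the node classes Times, Plus, and Number follow the naming used in the examples below):

// Hypothetical AST for "(1+2)*3": only operators and operands appear;
// the parentheses and nonterminals leave no trace in the tree.
Expr ast = new Times(new Plus(new Number(1), new Number(2)),
                     new Number(3));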
How an AST is best represented depends on the programming language. In an object-oriented language like Java, it is convenient to design the AST as an object-oriented class hierarchy, using method dispatch to execute code specific to different AST node types. For example, the AST for a Java-like language might be defined by a class hierarchy with classes like those shown here.
Object-Oriented AST design
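A minimal sketch of such a hierarchy, assuming only the node types used in the examples in this section (field and constructor details are illustrative):

/** Base class for all AST expression nodes. */
abstract class Expr { }

/** A numeric literal such as 3. */
class Number extends Expr {
    final int value;
    Number(int value) { this.value = value; }
}

/** A use of a variable or other named entity. */
class Identifier extends Expr {
    final String name;
    Identifier(String name) { this.name = name; }
}

/** An addition node: left + right. */
class Plus extends Expr {
    final Expr left, right;
    Plus(Expr left, Expr right) { this.left = left; this.right = right; }
}

/** A multiplication node: left * right. */
class Times extends Expr {
    final Expr left, right;
    Times(Expr left, Expr right) { this.left = left; this.right = right; }
}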
An alternative approach that designers sometimes use is to have a smaller number of classes to represent nodes, perhaps even just one class. However, this approach is not recommended. It requires that the node class be able to store all of the information that any kind of node might have as an attribute, and it prevents the type system from providing useful guidance.
If implementing a compiler in a functional language like OCaml, it will be more convenient to express the AST as a variant type (aka algebraic datatype), with a declaration like the following:
type expr = Plus of expr * expr | Times of expr * expr | Not of expr | ...
In a top-down parser, it is natural to have each parsing function return an AST node as its result. For example, the function for parsing arithmetic expressions might look something like the following. Notice that this code will construct exactly the same AST for the expression "(3)" as for "3".
Expr parseExpr() {
    switch (peek()) {
    case ID:
        return new Identifier(consume(ID));
    case LPAREN:
        consume(LPAREN);
        Expr e = parseExpr();
        consume(RPAREN);
        return e;
    ...
    }
}
In a bottom-up parser, a result can be associated with every symbol on the parser stack. These results can then be used in the action associated with each production to construct a result when reducing the production. For example, if using the CUP parser generator, we might have a rule for addition like the following:
expr ::= expr:e1 PLUS expr:e2    {: RESULT = new Plus(e1, e2); :}
       | NUMBER:n                {: RESULT = new Number(n); :}
       | LPAREN expr:e RPAREN    {: RESULT = e; :}
       ;
CUP allows variable names (e1, e2) to be associated with symbols on the right-hand side of each production. These variables are bound to the results computed for those symbols, and can be used as suggested to construct an AST node for the current production.
For example, consider parsing the expression "(1+2)*3" with the kind of arithmetic grammar we've been considering:
E → n | ( E ) | E + E | E * E
Consider what happens at the point where the expression "1+2" is about to be reduced using the production E → E + E, as shown below. The parser stack σ contains four symbols; the two E symbols carry with them the results of two previous reductions. The action taken builds a new Plus node that refers to the previously created Number nodes:
Building an AST node during a bottom-up reduction
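To make the mechanism concrete, here is a rough sketch, not CUP's actual generated code, of how an LR parser might keep semantic values on a stack parallel to its symbol stack and combine them when reducing by E → E + E (the class and method names are illustrative):

import java.util.ArrayDeque;
import java.util.Deque;

class ValueStackDemo {
    // Semantic values carried alongside the parser's symbol/state stack.
    private final Deque<Expr> values = new ArrayDeque<>();

    // Called when a NUMBER token is shifted: push its semantic value.
    void shiftNumber(int n) {
        values.push(new Number(n));
    }

    // Called when reducing by E -> E + E: pop the values for the
    // right-hand-side symbols (PLUS carries no value) and push the
    // value associated with the newly reduced E.
    void reducePlus() {
        Expr e2 = values.pop();   // right operand
        Expr e1 = values.pop();   // left operand
        values.push(new Plus(e1, e2));
    }
}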
It is helpful if a compiler can recover from errors in the program and continue to give useful feedback to the programmer. Modern interactive development environments (IDEs) typically report multiple errors. Some can even generate partial code for programs containing errors, allowing developers to test the other parts of the program.
How can a compiler recover from a syntax error in the source program? The key idea is to "wall off" syntax errors in the parse tree. At some point in the input, a syntax error is detected; the parser then abandons part of the parse tree, scanning forward heuristically to a synchronization point where there are tokens that follow the "bad part". It then resumes the parse. The following figure depicts this process. The parse tree is shown as a white triangle; the abandoned part of the parse is shown in gray.
Recovering from a syntax error
LALR parser generators like CUP typically support this with a special nonterminal symbol named error that indicates where in the grammar a syntax error can be tolerated. We can think of the error nonterminal as deriving an arbitrary, but probably short, sequence of tokens. The parser uses this nonterminal when it must skip past ill-formed syntax.
For example, in CUP we might want to catch errors at the granularity of statements for a Java-like source language. We could then add an error production to the corresponding nonterminal:
stmt ::= error SEMI {: reportSyntaxError(); RESULT = new ErrorStmt(); :}
When a syntax error is encountered, the parser uses productions that include this symbol to find a good synchronization point. It first pops the stack until the top of the stack is a state in which error can be shifted. It then scans forward heuristically until it finds a good synchronization point. How this is done depends on the parser. CUP scans forward until a certain number of tokens (by default, 3) can be parsed successfully without generating another error. The yacc parser generator uses a less robust technique: if the right-hand side of the production has symbols \(\beta\) following error, it scans forward until it finds a token in \(\textit{FIRST}(\beta)\). With the example production above, the yacc parser would scan forward to the next semicolon. More sophisticated techniques try to find the optimal synchronization point.
In recursive-descent parsers, error recovery can be implemented manually along similar lines. Errors can cause exceptions, and exception handlers can implement error recovery. For example, we might implement statement-level error recovery as follows:
Stmt parseStmt() {
    try {
        switch (peek()) {
        // handle different statement types
        case IF: ...
        case WHILE: ...
        ...
        }
    } catch (SyntaxError e) {
        // skip forward to the next semicolon, the synchronization point
        while (peek() != SEMI) consume();
        consume(SEMI);
        return new ErrorStmt();
    }
}
Doing more sophisticated synchronization is difficult in recursive-descent parsers because the parser state is implicit in the call stack rather than exposed as an explicit machine.
As the examples have shown, error recovery does not mean that the compiler has to abandon AST construction. Instead, it can generate special error nodes in the AST that are treated specially by later compiler stages. For example, it would make sense to skip error nodes during type checking and, when generating code, to emit code that halts the program with an error message. This strategy enables early testing, even before the code fully compiles!
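A minimal sketch of how later stages might treat these nodes (the Stmt base class, its methods, and the emitted halt instruction are all illustrative assumptions, not part of any particular compiler):

/** Minimal base class for statement nodes (illustrative). */
abstract class Stmt {
    abstract void typeCheck();
    abstract void emitCode(StringBuilder out);
}

/** Placeholder node produced by error recovery. */
class ErrorStmt extends Stmt {
    @Override
    void typeCheck() {
        // The syntax error was already reported during parsing,
        // so type checking simply skips this node.
    }

    @Override
    void emitCode(StringBuilder out) {
        // Emit code that halts with a message if this statement is
        // ever reached at run time (target syntax is illustrative).
        out.append("halt \"executed code that did not compile\"\n");
    }
}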