Grammars are defined by a set of productions. Productions use two kinds of symbols, terminals and nonterminals. Terminals are also called tokens: they are the symbols that the scanner recognizes and provides to the parser. A nonterminal represents some sequence of tokens in the string that is being parsed. A production has a left-hand side, which is always a nonterminal, and a right-hand side, which is a sequence of terminals and nonterminals. We will use upper-case letters to represent nonterminals and other characters to represent terminals. For example, the following are productions:
E → T
E → ( E + E )
T → n

A production explains one way to put together tokens to make up the nonterminal on the left-hand side. In this case, the productions mean that an E can be either anything generated by the nonterminal T, or else a left parenthesis, followed by anything generated by the nonterminal E itself, followed by the token +, followed by anything generated by E, and then a closing parenthesis.
Here the nonterminal T can only represent an integer, but the nonterminal E can represent “(n+n)”, “((n+n)+n)”, “(n+(n+n))”, and so on. (Since n actually stands for a number, real examples of inputs would have numbers in them, e.g. 2, (2+3), ((2+3)+7), etc.)
A grammar has a start symbol, usually a nonterminal. A legal string as defined by the grammar is anything that can be generated from the start symbol using productions. One way to think about this is that we start with a string containing just the start symbol, and use productions to change nonterminals into whatever appears on the right-hand side of the production, until there are no nonterminals left. For example, we can generate the string (1 + (2 + 3)) by using productions as follows:
E → (E + E) → (T + E) → (1 + E) → (1 + (E + E)) → (1 + (T + E)) → (1 + (T + T)) → (1 + (2 + T)) → (1 + (2 + 3)).
We can write a recursive-descent parser for this grammar by writing a parsing method for each nonterminal. For example, here is a method that parses an E:

void parseE() {
    if (lookaheadToken().isNumber()) {
        parseT();
        return;              // E → T
    } else {
        checkToken("(");
        parseE();
        checkToken("+");
        parseE();            // E → ( E + E )
        checkToken(")");
    }
}

Here we've assumed a method lookaheadToken that returns the next token without consuming it, and a method checkToken that reads the next token and makes sure that it is the given expected token. If we have a method nextToken that reads (consumes) the next token from the scanner, we can implement checkToken as follows:
void checkToken(String expected) throws Exception {
    String s = nextToken();
    if (s.equals(expected)) return;
    throw new Exception("Got " + s + ", expected " + expected);
}

This method simply returns normally if the expected token was seen. If some other token was seen, there must have been a syntax error in the input, and the code throws an exception, which will propagate up the stack of method calls until it hits an exception handler. If there is no exception handler, the program will stop and print an error message. Notice that we had to add the declaration throws Exception to the top of the method to indicate that it could throw an exception. If we take this approach to handling errors, we'll have to add similar declarations to all the parsing methods.
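The methods lookaheadToken and nextToken are assumed but not shown. Here is one minimal sketch of how they might be implemented, assuming the scanner's output has been collected into a list of strings; the class name TokenStream and its fields are inventions for this sketch, not part of the original code. With string tokens, the isNumber() test used in parseE above would become a small helper like the one at the bottom.

import java.util.List;

class TokenStream {
    private final List<String> tokens;  // all tokens produced by the scanner (assumed already split)
    private int pos = 0;                // index of the next unconsumed token

    TokenStream(List<String> tokens) {
        this.tokens = tokens;
    }

    // Returns the next token without consuming it.
    String lookaheadToken() {
        return tokens.get(pos);
    }

    // Reads (consumes) the next token.
    String nextToken() {
        return tokens.get(pos++);
    }

    // One possible way to test whether a string token is a number.
    static boolean isNumber(String token) {
        return token.matches("\\d+");
    }
}

A real implementation would also need to decide what lookaheadToken and nextToken should do when the input runs out; this sketch simply lets the list access fail.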
We can handle exceptions using a try...catch statement. For example, we could parse an expression as follows:

try {
    parseE();
} catch (Exception exc) {
    System.out.println("Syntax error: " + exc.getMessage());
}

If the input has an unexpected token, the exception will stop all the recursively nested calls to parseE, etc., and run the exception handler starting with catch. Notice that the exception handler declares a variable exc of type Exception. This variable refers to the new exception object that was created in the throw statement, and exc.getMessage() returns the string message that was given when the exception was created.
So far, the return type of all the parsing methods has been void or (in class) boolean. The parsers have just checked whether the input has the right form. Suppose we wanted to compute the value of the input expression, instead of just checking it. Then we could declare the type of the parse methods to be int and write code like this:
int parseE() {
    if (lookaheadToken().isNumber()) {
        return parseT();     // E → T
    } else {
        checkToken("(");
        int a = parseE();
        checkToken("+");
        int b = parseE();
        checkToken(")");
        return a + b;        // E → ( E + E )
    }
}
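To make this concrete, here is one way to assemble the whole evaluator into a runnable program. This is only a sketch under the same assumptions as the TokenStream sketch above: tokens are whitespace-separated strings, and the class name ExprEvaluator, its constructor, and the isNumber helper are invented for illustration rather than taken from the original code.

import java.util.Arrays;
import java.util.List;

public class ExprEvaluator {
    private final List<String> tokens;  // tokens from the "scanner": here, whitespace-separated pieces
    private int pos = 0;                // index of the next unconsumed token

    ExprEvaluator(String input) {
        this.tokens = Arrays.asList(input.trim().split("\\s+"));
    }

    private String lookaheadToken() { return tokens.get(pos); }
    private String nextToken()      { return tokens.get(pos++); }

    private void checkToken(String expected) throws Exception {
        String s = nextToken();
        if (!s.equals(expected))
            throw new Exception("Got " + s + ", expected " + expected);
    }

    private static boolean isNumber(String s) { return s.matches("\\d+"); }

    int parseE() throws Exception {
        if (isNumber(lookaheadToken())) {
            return parseT();             // E → T
        } else {
            checkToken("(");
            int a = parseE();
            checkToken("+");
            int b = parseE();
            checkToken(")");
            return a + b;                // E → ( E + E )
        }
    }

    int parseT() throws Exception {      // T → n
        String s = nextToken();
        if (!isNumber(s)) throw new Exception("Got " + s + ", expected a number");
        return Integer.parseInt(s);
    }

    public static void main(String[] args) {
        try {
            System.out.println(new ExprEvaluator("( 2 + ( 3 + 7 ) )").parseE());  // prints 12
        } catch (Exception exc) {
            System.out.println("Syntax error: " + exc.getMessage());
        }
    }
}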
Not every grammar works with recursive descent parsing, but usually it is possible to rewrite your grammar so it works. One problem that can arise is a grammar that is left-recursive: it has a nonterminal that can be expanded through productions into a string of symbols in which the left-most symbol is the same nonterminal. For example, in this grammar:
E → E + E
E → T
The right-hand side of the first production starts with E, the very same nonterminal being expanded by the production. When we write a recursive descent parser for this production, it will go into infinite recursion:

void parseE() {
    parseE();          // calls itself before consuming any input
    checkToken("+");
    parseE();
}
To handle this kind of grammar with a recursive descent parser, the grammar needs to be rewritten to remove the left recursion.
Actually, there is a second problem with this grammar: it is ambiguous, meaning that there is more than one way to parse some strings. If we see E + E + E, we don't know whether the first or the second + was expanded last. This can be important if (x+y)+z means something different from x+(y+z). The former assumes + is left-associative, the latter right-associative. Since addition is an associative operator, it doesn't make too much difference in this case, but if we were talking about subtraction (-), the two parses would have very different meanings.
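As a quick illustration (this example is not in the original notes), grouping the same subtraction in the two different ways gives different results:

public class AssociativityDemo {
    public static void main(String[] args) {
        System.out.println((7 - 3) - 2);  // left-associative grouping: prints 2
        System.out.println(7 - (3 - 2));  // right-associative grouping: prints 6
    }
}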
With recursive descent parsers, it's tricky to parse left-associative operators because of the left recursion problem. Here is a grammar that is not ambiguous, works with recursive descent, and generates the same strings. It does make + right-associative, however.
E → T P
P → ε
P → + E
Note that ε represents the empty string. We can then write parseE and parseP as follows (a little more verbosely than necessary, for clarity):

void parseE() {
    parseT();                // E → T P
    parseP();
}

void parseP() {
    if (!lookaheadToken().equals("+")) {
        return;              // P → ε (don't consume any input)
    } else {
        checkToken("+");     // always succeeds, since we just saw "+" in the lookahead
        parseE();
        return;              // P → + E
    }
}
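If we also want this version to compute the value of the expression, as the int-returning parser did earlier, one approach (a sketch, not from the original notes) is to pass the value parsed so far into parseP. The parameter name left is invented here, and the helper methods are the same ones used in the evaluator sketch above.

int parseE() throws Exception {
    int left = parseT();        // E → T P
    return parseP(left);
}

// 'left' is the value of the T parsed just before this P.
int parseP(int left) throws Exception {
    if (!lookaheadToken().equals("+")) {
        return left;            // P → ε: nothing follows, so the value is just T's value
    } else {
        checkToken("+");
        return left + parseE(); // P → + E: add the value of the rest of the expression
    }
}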
To get to this grammar, we first eliminate the left recursion (and, along with it, the ambiguity) by requiring the left operand of + to be a T:

E → T + E
E → T
This still won't work because both productions have right-hand sides starting with T, so we won't know which one to pick. The next step is to left-factor the grammar by splitting out the common prefix of the right-hand sides, in this case T, and generating everything that can follow that common prefix (+ E and ε) with a new nonterminal, called P above. This lets us write the E production as E → T P, so we don't have to figure out which P production to choose until we get to a place in the input where a sensible choice can be made.
For serious parsing tasks, it's usually a good idea to use a parser generator to produce your parser code. A parser generator takes a grammar as input and produces Java code that parses the input, and it can handle more grammars than a recursive descent parser can.
There is much more to building parsers than we can cover in this course. Take the compilers course, CS412, to learn more about building parsers and how to write and parse different grammars (and much else besides).