Last Update Time-stamp: "97/06/30 13:50:05 umrigar"

This is a brief intuitive introduction to recursive descent parsing. Any compiler text should provide more details. A elementary introduction to grammars and language analysis is also available.

Given a grammar, consider how one could write a parser for it. One possible approach would proceed as follows:

We know that all strings in the language must be derived from the start
nonterminal `S`. Hence if we can write a procedure `S()`

which matches all strings derived from `S`, then we are done.
But how do we do this?

Let's generalize the above question somewhat. Instead of merely posing
the problem of writing a procedure which matches strings derived from the
specific start nonterminal `S`, let's look at the more general
problem of writing a procedure which matches strings derived from any
nonterminal `N`; i.e., for each nonterminal `N`, we would
like to construct a parsing procedure `N()`

which matches all
strings derived from `N`.

If a nonterminal `N` has only a single grammar rule, then its
parsing procedure is easy to write. To match the rule we know that we need
to match each grammar symbol on its right-hand side (RHS). There are two
possibilities:

- If the RHS symbol is a terminal symbol, check that the current lookahead matches that terminal symbol. If it does, then advance the input, setting the current lookahead to the next input symbol. If it does not, signal an error.
- For each occurrence of a nonterminal symbol on the RHS, call the corresponding parsing procedure. If the parsing procedure works as posited above, then it will match a prefix of the input with a rule for the nonterminal.

Hence a parsing function `N()`

for nonterminal `N`
will simply contain logic to pick a rule based on the current lookahead,
followed by code to match each individual rule. Its structure will look
something like the following:

N() { if (lookahead can start first rule forN) { match rule 1 forN} else if (lookahead can start second rule forN) { match rule 2 forN} ... ... else if (lookahead can startn'th rule forN) { match rulenforNelse { error(); } }

Unfortunately, there are some problems with this simple scheme. A
grammar rule is said to be *directly left recursive* if the first
symbol on the RHS is identical to the LHS nonterminal. Hence, after a
directly left recursive rule has been selected, the first action of the
corresponding parsing procedure, will be to call itself immediately, without
consuming any of the input. It should be clear that such a recursive call
will never terminate. Hence a recursive descent parser cannot be written
for a grammar which contains such directly (or indirectly) left recursive
rules; in fact, the grammar cannot be LL(1) in the presence of such rules.

Fortunately, it is possible to transform the grammar to remove left
recursive rules. Consider the left recursive nonterminal `A`
defined by the following rules:

whereA:AalphaA:beta

The above rules are not left recursive.A:betaA'A':/* empty */A':alphaA'

Consider the following simple grammar for assignment statements with expressions involving addition or subtraction:

1) stmt: ID ':=' expr 2) expr: expr '+' NUM 3) expr: expr '-' NUM 4) expr: NUMThis grammar allows statements like

`a:= 1 + 2 - 3`

.
Since the rules for `expr`

are left recursive, we need to
transform the grammar to the following:
1) stmt: ID ':=' expr 2) expr: NUM exprRest 3) exprRest: '+' NUM exprRest 4) exprRest: '-' NUM exprRest 5) exprRest: /* empty */Assuming that the current lookahead is in a global variable

`tok`

, we can write the parsing procedures in pseudo-C as
follows:
stmt() { match(ID); match(':='); expr(); } expr() { match(NUM); exprRest(); } exprRest() { if (tok == '+') { /* rule 3 */ match('+'); match(NUM); exprRest(); } else if (tok == '-') { /* rule 4 */ match('-'); match(NUM); exprRest(); } else { /* rule 5 */ } }Note that

`exprRest()`

chooses rule 5 if the lookahead cannot
match any of the other rules for `exprRest`

. It would have also
been correct to choose rule 5 only if the current lookahead was a token
which can `exprRest`

, but the method used above
is simpler with the sole disadvantage that errors are detected later.
The `match()`

procedure merely needs to check if the current
lookahead in `tok`

matches its argument and advance the input:

match(Token t) { if (tok == t) { tok= nextToken(); } else { error(): } }where we assume that

`nextToken()`

is the interface to the
scanner and returns the next token from the input stream.
All that remains is to make sure we startup and terminate the parser correctly:

parse() { tok= nextToken(); stmt(); match('<EOF>'); }The first line merely primes the global variable used for the lookahead token. The second line calls the parsing procedure for

`stmt`

.
Finally, assuming that `'<EOF>'`

is the token used to
represent the end of the input stream, the last line ensures that no garbage
follows a legal `stmt`

.
Copyright (C) 1997 Zerksis D. Umrigar

**Feedback**: Please email any feedback to zdu@acm.org.

Up to Parsing Demos Main Page