CSC 173  Thurs. Oct 31, 2002
Read AU, Ch 11

---------------------------------------

An Unambiguous Grammar for Expressions

It is possible to write a grammar for arithmetic expressions that

    * is unambiguous
    * naturally reflects the precedence of * and / over + and -
    * naturally reflects left associativity

Here is one such grammar:

    E --> E + T | E - T | T
    T --> T * F | T / F | F
    F --> ( E ) | number

If we attempt to build a parse tree for number - number + number, we
see there is only one such tree:

    E
    |-- E
    |   |-- E --> T --> F --> number
    |   |-- -
    |   `-- T --> F --> number
    |-- +
    `-- T --> F --> number

This parse tree correctly represents left associativity by using
recursion on the left.  If we rewrote the grammar to use recursion on
the right, we would represent right associativity:

    E --> T + E | T - E | T
    T --> F * T | F / T | F
    F --> ( E ) | number

Right associativity isn't usually what we want for addition,
subtraction, multiplication, and division, but it may be appropriate
for exponentiation.

Our grammar also correctly represents precedence levels by introducing
a new non-terminal symbol for each precedence level.  According to our
grammar, expressions consist of the sum or difference of terms (or a
single term), where a term consists of the product or quotient of
factors (or a single factor), and a factor is a parenthesized
expression or a number.

---------------------------------------

Parsing

A parser is an algorithm that determines whether a given input string
is in a language and, as a side effect, usually produces a parse tree
for the input.

There is a procedure for generating a parser from a given context-free
grammar.  In fact, it is possible to parse any CFG in time cubic in
the length of the input.  There are two well-known algorithms for
this, one due to Earley, the other to Cocke, Younger, and Kasami.

In practice, cubic time is too slow for most purposes.  Fortunately,
many (but not all!) grammars can be parsed in linear time.  There are
two major families of parsing algorithms that run in linear time.
One family constructs the parse tree from the root downward; the other
builds it from the leaves upward.  We will study one form of top-down
parser: recursive descent.

------------------------------------------------------------------------

Recursive-Descent Parsing

Recursive-descent parsing is one of the simplest parsing techniques
used in practice.  The basic idea is to associate a procedure with
each non-terminal.  The goal of each such procedure is to read a
sequence of input characters that can be generated by the
corresponding non-terminal, and return a pointer to the root of the
parse tree for that non-terminal.

The structure of each procedure is dictated by the productions for the
corresponding non-terminal.  The procedure attempts to "match" the
right-hand side of some production for the non-terminal.

    * To match a terminal symbol, the procedure compares the terminal
      symbol to the input; if they agree, the match is successful, and
      the procedure consumes the terminal symbol in the input (that
      is, moves the input cursor over one symbol).

    * To match a non-terminal symbol, the procedure simply calls the
      corresponding procedure for that non-terminal (which may be a
      recursive call, hence the name of the technique).

---------------------------------------

Recursive-Descent Parser for Expressions

As it turns out, the expression grammar we were using earlier can't be
parsed top-down (more on why later).  Here's one that can:

    E     --> T Etail
    Etail --> + T Etail | - T Etail | epsilon
    T     --> F Ttail
    Ttail --> * F Ttail | / F Ttail | epsilon
    F     --> ( E ) | num

We create a procedure for each of the non-terminals.  According to
production 1, the procedure to match expressions (E) must match a term
(by calling the procedure for T) and then the rest of the expression
(by calling the procedure Etail):

    procedure E
        T()
        Etail()

Some procedures, such as Etail, must examine the input to determine
which production to choose.
    procedure Etail
        switch next_token
            case +
                match(+)
                T()
                Etail()
            case -
                match(-)
                T()
                Etail()
            default
                return

We've assumed here a global variable next_token and a utility routine
named match:

    procedure match(expected)
        if next_token != expected
            error()
        else
            next_token = scan()   // read next terminal symbol
                                  // into global variable

The error routine in a pure parser simply halts without accepting.  In
a compiler it prints a nice diagnostic message and then does something
potentially quite complicated (which I won't cover here) to patch up
the parse tree and/or the input and continue looking for further
errors.

Here are the rest of the recursive-descent routines:

    procedure T      // very similar to E
        F()
        Ttail()

    procedure Ttail  // very similar to Etail
        switch next_token
            case *
                match(*)
                F()
                Ttail()
            case /
                match(/)
                F()
                Ttail()
            default
                return

    procedure F
        switch next_token
            case (
                match(()
                E()
                match())
            case num
                match(num)
            default
                error()

Notice that the default case in F is an error, whereas the default
case in Etail and Ttail was to return without doing anything.  The
reason for the difference is that Etail and Ttail have epsilon
productions: they are allowed to have an empty subtree under them in
the parse tree.  F does not have an epsilon production: it *has* to be
either a number or a parenthesized expression.

Look carefully also at the second call to match() within F.  There is
no guarantee that we will actually have a right parenthesis coming up
in the input.  That's why match() has a check inside.  In larger,
programming-language-size grammars, there are lots of similar cases in
which the check inside match is non-redundant.

Finally, we need a main program:

    procedure main
        E()
        match(eof)

Here we adopt the convention that end-of-file is represented by a
pseudo-token, so we can use whatever standard error-detection/recovery
mechanism we've built into the match routine.

---------------------------------------

Tracing the Parser

As an example, consider the following input:

    1 + (2 * 3) / 4
We just call the procedure corresponding to the start symbol.

    next_token = "1"
    Call E
        Call T
            Call F
                next_token = "+"             /* Match 1 with F */
            Call Ttail                       /* Match epsilon */
        Call Etail
            next_token = "("                 /* Match + */
            Call T
                Call F                       /* Match (, looking for E ) */
                    next_token = "2"
                    Call E
                        Call T
                            Call F           /* Match 2 with F */
                                next_token = "*"
                            Call Ttail       /* Match * */
                                next_token = "3"
                                Call F       /* Match 3 with F */
                                    next_token = ")"
                                Call Ttail   /* Match epsilon */
                        Call Etail           /* Match epsilon */
                    next_token = "/"         /* Match ")" */
                Call Ttail
                    next_token = "4"         /* Match "/" */
                    Call F                   /* Match 4 with F */
                        next_token = eof
                    Call Ttail               /* Match epsilon */
            Call Etail                       /* Match epsilon */
    /* Match eof */

---------------------------------------

Observations about Recursive-Descent Parsing

    * In procedures Etail and Ttail, we match one of the productions
      with an arithmetic operator if we see such an operator in the
      input; otherwise we simply return.  A procedure that returns
      without matching any symbols is, in effect, choosing the
      epsilon production.

    * In our expression parser, we choose the epsilon production only
      if next_token doesn't match the first terminal on the right-hand
      side of some other production.

    * We never attempt to read beyond the end marker (eof), which is
      matched only at the end of an outermost expression.  In all
      other circumstances, the presence of the end marker signals a
      syntax error.

    * As written, our recursive-descent parser only determines whether
      or not the input string is in the language of the grammar; it
      does not give the structure of the string according to the
      grammar.  We could easily build a parse tree incrementally
      during parsing.  The book shows how in section 11.6.
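The routines above translate almost line for line into a runnable
sketch.  The following Python version (the names and the token
representation are mine, not from the notes; input is assumed to be
pre-tokenized, with numbers as ("num", value) pairs and operators and
parentheses as one-character tokens) also builds the structure that
the last observation mentions, by having Etail and Ttail fold each new
operand onto the left of the tree built so far:

```python
class ParseError(Exception):
    pass

def parse(tokens):
    tokens = list(tokens) + [("eof", None)]  # eof pseudo-token
    pos = 0

    def next_token():
        return tokens[pos][0]

    def match(expected):
        nonlocal pos
        if next_token() != expected:
            raise ParseError(f"expected {expected}, saw {next_token()}")
        value = tokens[pos][1]
        pos += 1                    # consume the token (scan())
        return value

    def E():
        return Etail(T())

    def Etail(left):                # left: the tree built so far
        if next_token() in ("+", "-"):
            op = next_token()
            match(op)
            return Etail((op, left, T()))   # fold onto the left
        return left                 # epsilon production

    def T():
        return Ttail(F())

    def Ttail(left):
        if next_token() in ("*", "/"):
            op = next_token()
            match(op)
            return Ttail((op, left, F()))
        return left                 # epsilon production

    def F():
        if next_token() == "(":
            match("(")
            tree = E()
            match(")")              # the non-redundant check
            return tree
        elif next_token() == "num":
            return match("num")
        raise ParseError(f"unexpected {next_token()}")

    tree = E()
    match("eof")                    # procedure main
    return tree

toks = [("num", 1), ("+", None), ("(", None), ("num", 2), ("*", None),
        ("num", 3), (")", None), ("/", None), ("num", 4)]
print(parse(toks))                  # ('+', 1, ('/', ('*', 2, 3), 4))
```

Running it on the traced input 1 + (2 * 3) / 4 produces the nested
tuple ('+', 1, ('/', ('*', 2, 3), 4)), with * and / binding tighter
than + just as the grammar dictates.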
---------------------------------------

Lookahead in Recursive-Descent Parsing

To implement a recursive-descent parser for a grammar, it must be
possible, for each non-terminal, to determine which production to
apply by looking at only one upcoming input symbol.  (We want to avoid
having the compiler or other text-processing program scan ahead in the
input to determine what action to take next.)  The lookahead symbol is
simply the next terminal that we will try to match in the input.

We use a single lookahead symbol to decide which production to match.
Consider a production A --> X1...Xm.  We need to know the set of
possible lookahead symbols that indicate this production is to be
chosen.  This set is clearly the set of terminal symbols that can
begin a string produced by the symbols X1...Xm (which may be either
terminals or non-terminals).  We denote the set of symbols that could
be produced first by X1...Xm as First(X1...Xm).

------------------------------------------------------------------------

First Sets

To distinguish two productions with the same non-terminal on the
left-hand side, we examine the First sets of their corresponding
right-hand sides.  We do this in three steps:

    (1) figure out which non-terminals can generate epsilon
    (2) figure out FIRST sets for all non-terminals
    (3) figure out FIRST sets for right-hand sides

Steps (1) and (2) start with "obvious" facts from the grammar and
iterate until they can't learn any more.

Consider step (1).  If we have

    A --> epsilon
    B --> epsilon

then clearly A and B are symbols that can generate epsilon.  These are
the "obvious" facts.  Then in a second pass over the grammar, if we
have

    C --> A B

we can deduce that C is a symbol that can generate epsilon.  If we
have

    D --> C A B

then in a third pass we can deduce that D is a symbol that can
generate epsilon.  We continue this process until we make a complete
pass over the grammar without learning anything new.

Now consider step (2).
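Step (1)'s pass-until-nothing-changes loop is a fixed-point
computation, and can be sketched in a few lines of Python (the
representation is my own: a grammar is a list of (lhs, rhs) pairs,
with epsilon as the empty tuple):

```python
def nullable_symbols(grammar):
    """Return the set of non-terminals that can generate epsilon."""
    nullable = set()
    changed = True
    while changed:              # keep passing until we learn nothing new
        changed = False
        for lhs, rhs in grammar:
            # lhs is nullable if every symbol on the RHS is nullable
            # (trivially true for an epsilon production, rhs == ()).
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

# The A, B, C, D example from the notes:
g = [("A", ()), ("B", ()), ("C", ("A", "B")), ("D", ("C", "A", "B"))]
print(sorted(nullable_symbols(g)))  # ['A', 'B', 'C', 'D']
```

A and B are learned from the "obvious" epsilon productions, and C and
D follow in later passes, exactly as in the hand derivation above.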
If we have

    A --> b C D
    B --> c D e

then clearly b is an element of FIRST(A) and c is an element of
FIRST(B).  These are the obvious facts.  Then in a second pass, if we
have

    C --> B A d

clearly c is an element of FIRST(C), because it's an element of
FIRST(B) and a C can start with a B.  But if B can generate epsilon,
then b is also an element of FIRST(C), because we can erase the B and
generate the b from A.

In each pass over the grammar we work our way through each RHS, adding
elements to the FIRST set of the LHS, until we find a symbol in the
RHS that cannot generate epsilon, at which point we move on to the
next production.  As in step (1), we keep making passes until we don't
learn anything new.

Finally, in step (3) we use our knowledge of FIRST sets for individual
symbols to calculate FIRST sets for RHSs.

Given the production A --> X1...Xm, we must determine First(X1...Xm).
We first consider the leftmost symbol, X1.

    * If X1 is a terminal symbol, then First(X1...Xm) = {X1}.

    * If X1 is a non-terminal, then we compute the First sets for each
      right-hand side corresponding to X1.  In our expression grammar
      above:

        First(E) = First(T Etail)
        First(T Etail) = First(T)
        First(T) = First(F Ttail)
        First(F Ttail) = First(F) = {(, num}

If X1 can generate epsilon, then X1 can (in effect) be erased, and
First(X1...Xm) depends on X2.

    * If X2 is a terminal, it is included in First(X1...Xm).

    * If X2 is a non-terminal, we compute the First sets for each of
      its corresponding right-hand sides.

Similarly, if both X1 and X2 can produce epsilon, we consider X3, then
X4, etc.  It is possible that X1, X2, ..., Xm can *all* produce
epsilon.  What then?  The informal answer is that we should predict
A --> X1...Xm if the lookahead symbol can come *after* an A in some
line of the derivation.  A formal treatment of this subject requires
the notion of so-called Follow sets for symbols.  In practice, we
don't generally have to know about Follow sets when building a
recursive-descent parser.
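Steps (2) and (3) can be computed by the same kind of iteration.  Here
is one possible Python sketch (the list-of-pairs grammar
representation and helper names are mine; `terminals` and `nullable`,
the step (1) result, are supplied as sets), applied to the top-down
expression grammar from earlier:

```python
def first_sets(grammar, terminals, nullable):
    """Step (2): FIRST sets for non-terminals, by iterating to a fixed point."""
    first = {lhs: set() for lhs, _ in grammar}
    changed = True
    while changed:                   # pass until we learn nothing new
        changed = False
        for lhs, rhs in grammar:
            for sym in rhs:
                new = {sym} if sym in terminals else first[sym]
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
                if sym not in nullable:  # this symbol can't be erased;
                    break                # move on to the next production
    return first

def first_of_rhs(rhs, terminals, first, nullable):
    """Step (3): FIRST of a right-hand side, walking past erasable symbols."""
    result = set()
    for sym in rhs:
        result |= {sym} if sym in terminals else first[sym]
        if sym not in nullable:
            break
    return result

g = [("E", ("T", "Etail")),
     ("Etail", ("+", "T", "Etail")), ("Etail", ("-", "T", "Etail")),
     ("Etail", ()),
     ("T", ("F", "Ttail")),
     ("Ttail", ("*", "F", "Ttail")), ("Ttail", ("/", "F", "Ttail")),
     ("Ttail", ()),
     ("F", ("(", "E", ")")), ("F", ("num",))]
terms = {"+", "-", "*", "/", "(", ")", "num"}
nullable = {"Etail", "Ttail"}

first = first_sets(g, terms, nullable)
print(sorted(first["E"]))      # ['(', 'num']
print(sorted(first["Etail"]))  # ['+', '-']
```

The results agree with the hand derivation: First(E) = First(T) =
First(F) = {(, num}, while First(Etail) = {+, -} and First(Ttail) =
{*, /}.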
Suppose we have three productions for A:

    A --> B c D
    A --> e f
    A --> G H

where G and H can both generate epsilon.  Our parsing routine then
says:

    A() {
        switch (next_token) {
            case First(BcD):
                B()
                match(c)
                D()
            case e:
                match(e)
                match(f)
            default:
                G()
                H()
        }
    }

If next_token is not in First(BcD) U {e}, we assume we can use the
third production.  If it turns out that next_token is not in
First(GH) U Follow(A) either, then this was a bad decision, but
nothing catastrophic happens: the calls to G and H will go ahead and
generate epsilon, we'll return, and our caller will announce a syntax
error -- just a bit later than we could have.
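To make the "nothing catastrophic happens" point concrete, here is a
runnable miniature (the grammar details are invented for illustration:
A --> B c D | e f | G H, with B --> b, D --> d, G --> g | epsilon,
H --> h | epsilon).  On bad input, G and H choose epsilon, A returns
normally, and the caller's match(eof) announces the error:

```python
class ParseError(Exception):
    pass

def parse_A(tokens):
    pos = 0

    def look():
        return tokens[pos] if pos < len(tokens) else "eof"

    def match(expected):
        nonlocal pos
        if look() != expected:
            raise ParseError(f"expected {expected}, saw {look()}")
        pos += 1

    def A():
        if look() == "b":          # First(B c D) = {b}
            match("b"); match("c"); match("d")
        elif look() == "e":
            match("e"); match("f")
        else:                      # assume A --> G H
            G(); H()               # may both generate epsilon

    def G():
        if look() == "g":
            match("g")             # else epsilon

    def H():
        if look() == "h":
            match("h")             # else epsilon

    A()
    match("eof")                   # the caller detects leftover input
    return True

print(parse_A(["g", "h"]))         # True
# parse_A(["x"]): A() falls through to G H, both generate epsilon,
# and the subsequent match("eof") raises ParseError.
```

The error on input "x" is reported by match("eof") rather than inside
A() -- a bit later than we could have caught it, exactly as described
above.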