CSC 173 Thurs. Nov 7, 2002

---------------------------------------

Example Parse

Let's trace the parse for the input 1 + (2 * 3) eof

     Stack Contents             Current input        Action

  1: S                          1 + (2 * 3) eof      1
  2: E eof                      1 + (2 * 3) eof      2
  3: T Et eof                   1 + (2 * 3) eof      6
  4: F Tt Et eof                1 + (2 * 3) eof      11
  5: N Tt Et eof                1 + (2 * 3) eof      match
  6: Tt Et eof                  + (2 * 3) eof        9
  7: Et eof                     + (2 * 3) eof        3
  8: + T Et eof                 + (2 * 3) eof        match
  9: T Et eof                   (2 * 3) eof          6
 10: F Tt Et eof                (2 * 3) eof          10
 11: ( E ) Tt Et eof            (2 * 3) eof          match
 12: E ) Tt Et eof              2 * 3) eof           2
 13: T Et ) Tt Et eof           2 * 3) eof           6
 14: F Tt Et ) Tt Et eof        2 * 3) eof           11
 15: N Tt Et ) Tt Et eof        2 * 3) eof           match
 16: Tt Et ) Tt Et eof          * 3) eof             7
 17: * F Tt Et ) Tt Et eof      * 3) eof             match
 18: F Tt Et ) Tt Et eof        3) eof               11
 19: N Tt Et ) Tt Et eof        3) eof               match
 20: Tt Et ) Tt Et eof          ) eof                9
 21: Et ) Tt Et eof             ) eof                5
 22: ) Tt Et eof                ) eof                match
 23: Tt Et eof                  eof                  9
 24: Et eof                     eof                  5
 25: eof                        eof                  match
 26: Done!

---------------------------------------

The complexity of LL(1) parsing

What is the Big-O complexity of our table-driven parser?

The work inside the main loop is bounded by a constant: even when we push
all the symbols of a right-hand side, the length of that RHS is bounded by
a constant.  So the real question is: how many times does the main loop
execute?

Things would be easy if we called scan() on every iteration: then we'd
know the number of iterations was the same as the number of tokens in the
input.  But we don't scan on every iteration.

The trick is to think about the parse tree.  In every iteration of the
main loop we either predict and expand a production (possibly an epsilon
production) or we match a token.  That means we have precisely as many
iterations as there are nodes in the parse tree.  The same observation
holds for recursive descent parsing: we make exactly one subroutine call
-- either to match() or to one of the non-terminal routines -- for every
node in the parse tree.

So how many nodes can there be in the parse tree?

First, suppose we have no epsilon productions in our grammar.  Then the
number of leaves is equal to the number of tokens in the input.  How about
internal nodes?  Because we know the grammar is unambiguous (this is
crucial), we never have a node that derives only itself (if we did, we
could repeat that derivation an arbitrary number of times, generating
different trees).  This means that, starting from any node in the tree and
working downward, we have to get some fan-out after at most P predictions,
where P is the number of productions in the grammar.  Informally: after
generating some constant number of internal nodes, we have to double the
number of leaves.

You might be tempted, then, to think there could be N log N nodes in the
tree, where N is the number of leaves, but fortunately that isn't so.
Again speaking informally, note that N + N/2 + N/4 + N/8 + N/16 + ...
doesn't sum to N log N: it sums to 2N.  In the same way, the number of
nodes in the parse tree turns out to be O(NP), and since P is a constant,
that's O(N).

========================================================================

CFGs vs Regular Expressions

Context-free grammars are strictly more powerful than regular expressions.

   * Any language that can be generated using regular expressions can be
     generated by a context-free grammar.

   * There are languages that can be generated by a context-free grammar
     that cannot be generated by any regular expression.

As a corollary, CFGs are strictly more powerful than DFAs and NDFAs.
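A standard example of the second bullet (not worked out in these notes) is
the language { a^n b^n : n >= 0 }, some number of a's followed by the same
number of b's.  It is generated by the grammar

     <S> --> a <S> b | epsilon

but no finite automaton accepts it, because doing so would require counting
an unbounded number of a's with only a finite number of states.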
The proof is in two parts:

   * Given a regular expression R, we can generate a CFG G such that
     L(R) == L(G).

   * We can define a grammar G for which there is no FA F such that
     L(F) == L(G).

---------------------------------------

Simulating a Regular Expression with a CFG

To show that CFGs are at least as powerful as regular expressions, we show
how to simulate a RE using a CFG.

The construction is similar to the one used to simulate a regular
expression with a FA; we build the CFG G in pieces, where each piece
corresponds to the operands and operators in the regular expression.  (A
code sketch of the whole construction follows the cases below.)

   * Assume the RE is a single operand.  Then if RE is epsilon or a
     character in the alphabet, add to G the production

        <S> --> RE

     If RE is the null language, don't add a production.

   * Assume the RE is R1 R2.  Add to G the production

        <S> --> <S1> <S2>

     and create productions for the regular expressions R1 and R2 (with
     start symbols <S1> and <S2>).

   * Assume the RE is R1 | R2.  Add to G the production

        <S> --> <S1> | <S2>

     and create productions for the regular expressions R1 and R2.

   * Assume the RE is R1*.  Add to G the production

        <S> --> <S1> <S> | epsilon

     and create productions for the regular expression R1.
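The construction is mechanical enough that a short program can carry it
out.  Below is a minimal sketch in Python (not part of the original notes);
the tuple encoding of regular expressions and the names re_to_cfg, S1, S2,
... are assumptions made for this illustration.  It emits each alternative
as a separate production, so <S> --> <S1> | <S2> appears as two lines.

    from itertools import count

    _fresh = count(1)   # source of fresh nonterminal names S1, S2, ...

    def re_to_cfg(regex, productions):
        """Return the start nonterminal for `regex`, appending its
        productions (lhs, [rhs symbols]) to the list `productions`.

        A regular expression is a nested tuple:
           ('sym', 'a')       a single character, or ('sym', 'epsilon')
           ('null',)          the null (empty) language
           ('cat', R1, R2)    concatenation R1 R2
           ('alt', R1, R2)    alternation   R1 | R2
           ('star', R1)       Kleene star   R1*
        """
        s = "S%d" % next(_fresh)
        kind = regex[0]
        if kind == 'sym':                  # epsilon or a character: S --> RE
            productions.append((s, [regex[1]]))
        elif kind == 'null':               # null language: no production at all
            pass
        elif kind == 'cat':                # R1 R2:   S --> S1 S2
            s1 = re_to_cfg(regex[1], productions)
            s2 = re_to_cfg(regex[2], productions)
            productions.append((s, [s1, s2]))
        elif kind == 'alt':                # R1 | R2: S --> S1  and  S --> S2
            s1 = re_to_cfg(regex[1], productions)
            s2 = re_to_cfg(regex[2], productions)
            productions.append((s, [s1]))
            productions.append((s, [s2]))
        elif kind == 'star':               # R1*:     S --> S1 S | epsilon
            s1 = re_to_cfg(regex[1], productions)
            productions.append((s, [s1, s]))
            productions.append((s, ['epsilon']))
        return s

    # Example: the RE (a|b)* a
    prods = []
    start = re_to_cfg(('cat',
                       ('star', ('alt', ('sym', 'a'), ('sym', 'b'))),
                       ('sym', 'a')),
                      prods)
    for lhs, rhs in prods:
        print(lhs, "-->", " ".join(rhs))

Each call to re_to_cfg does a constant amount of work per RE operator and
invents one fresh nonterminal, mirroring the piece-by-piece construction
described above.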