CSC 173 Thurs. Nov 7, 2002

---------------------------------------

Example Parse

Let's trace the parse for the input 1 + (2 * 3) eof

     Stack Contents             Current input        Action

  1: S                          1 + (2 * 3) eof      1
  2: E eof                      1 + (2 * 3) eof      2
  3: T Et eof                   1 + (2 * 3) eof      6
  4: F Tt Et eof                1 + (2 * 3) eof      11
  5: N Tt Et eof                1 + (2 * 3) eof      match
  6: Tt Et eof                  + (2 * 3) eof        9
  7: Et eof                     + (2 * 3) eof        3
  8: + T Et eof                 + (2 * 3) eof        match
  9: T Et eof                   (2 * 3) eof          6
 10: F Tt Et eof                (2 * 3) eof          10
 11: ( E ) Tt Et eof            (2 * 3) eof          match
 12: E ) Tt Et eof              2 * 3) eof           2
 13: T Et ) Tt Et eof           2 * 3) eof           6
 14: F Tt Et ) Tt Et eof        2 * 3) eof           11
 15: N Tt Et ) Tt Et eof        2 * 3) eof           match
 16: Tt Et ) Tt Et eof          * 3) eof             7
 17: * F Tt Et ) Tt Et eof      * 3) eof             match
 18: F Tt Et ) Tt Et eof        3) eof               11
 19: N Tt Et ) Tt Et eof        3) eof               match
 20: Tt Et ) Tt Et eof          ) eof                9
 21: Et ) Tt Et eof             ) eof                5
 22: ) Tt Et eof                ) eof                match
 23: Tt Et eof                  eof                  9
 24: Et eof                     eof                  5
 25: eof                        eof                  match
 26: Done!

---------------------------------------

The complexity of LL(1) parsing

What is the Big-O complexity of our table-driven parser?

The work inside the main loop is bounded by a constant: even when we push
all the symbols of a right-hand side, the length of that RHS is bounded by
a constant.  So the real question is: how many times does the main loop
execute?

Things would be easy if we called scan() on every iteration: then we'd
know the number of iterations was the same as the number of tokens in the
input.  But we don't scan on every iteration.

The trick is to think about the parse tree.  In every iteration of the
main loop we either predict and expand a production (possibly an epsilon
production) or we match a token.  That means we have precisely as many
iterations as there are nodes in the parse tree.  The same observation
holds for recursive descent parsing: we make exactly one subroutine call
-- either to match() or to one of the non-terminal routines -- for every
node in the parse tree.

So how many nodes can there be in the parse tree?

First, suppose we have no epsilon productions in our grammar.  Then the
number of leaves is equal to the number of tokens in the input.  How about
internal nodes?  Because we know the grammar is unambiguous (this is
crucial), we never have a node that derives only itself (if we did, we
could repeat that derivation an arbitrary number of times, generating
different trees).  This means that, starting from any node in the tree and
working downward, we have to get some fan-out after at most P predictions,
where P is the number of productions in the grammar.  Informally: after
generating some constant number of internal nodes, we have to double the
number of leaves.

You might be tempted, then, to think there could be N log N nodes in the
tree, where N is the number of leaves, but fortunately that isn't so.
Again speaking informally, note that N + N/2 + N/4 + N/8 + N/16 + ...
doesn't sum to N log N: it sums to 2N.  In the same way, the number of
nodes in the parse tree turns out to be O(NP), and since P is a constant,
that's O(N).

========================================================================

CFGs vs Regular Expressions

Context-free grammars are strictly more powerful than regular expressions.

   * Any language that can be generated using regular expressions can be
     generated by a context-free grammar.

   * There are languages that can be generated by a context-free grammar
     that cannot be generated by any regular expression.

As a corollary, CFGs are strictly more powerful than DFAs and NDFAs.
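A standard example of the second bullet (not worked out in these notes) is
the language { a^n b^n : n >= 0 }, some number of a's followed by the same
number of b's.  It is generated by the grammar

     <S> --> a <S> b | epsilon

but no finite automaton accepts it, because doing so would require counting
an unbounded number of a's with only a finite number of states.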
The proof is in two parts:

   * Given a regular expression R, we can generate a CFG G such that
     L(R) == L(G).

   * We can define a grammar G for which there is no FA F such that
     L(F) == L(G).

---------------------------------------

Simulating a Regular Expression with a CFG

To show that CFGs are at least as powerful as regular expressions, we show
how to simulate a RE using a CFG.

The construction is similar to the one used to simulate a regular
expression with a FA; we build the CFG G in pieces, where each piece
corresponds to the operands and operators in the regular expression.  (A
code sketch of the whole construction follows the cases below.)

   * Assume the RE is a single operand.  Then if RE is epsilon or a
     character in the alphabet, add to G the production

        <S> --> RE

     If RE is the null language, don't add a production.

   * Assume the RE is R1 R2.  Add to G the production

        <S> --> <S1> <S2>

     and create productions for the regular expressions R1 and R2 (with
     start symbols <S1> and <S2>).

   * Assume the RE is R1 | R2.  Add to G the production

        <S> --> <S1> | <S2>

     and create productions for the regular expressions R1 and R2.

   * Assume the RE is R1*.  Add to G the production

        <S> --> <S1> <S> | epsilon

     and create productions for the regular expression R1.
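The construction is mechanical enough that a short program can carry it
out.  Below is a minimal sketch in Python (not part of the original notes);
the tuple encoding of regular expressions and the names re_to_cfg, S1, S2,
... are assumptions made for this illustration.  It emits each alternative
as a separate production, so <S> --> <S1> | <S2> appears as two lines.

    from itertools import count

    _fresh = count(1)   # source of fresh nonterminal names S1, S2, ...

    def re_to_cfg(regex, productions):
        """Return the start nonterminal for `regex`, appending its
        productions (lhs, [rhs symbols]) to the list `productions`.

        A regular expression is a nested tuple:
           ('sym', 'a')       a single character, or ('sym', 'epsilon')
           ('null',)          the null (empty) language
           ('cat', R1, R2)    concatenation R1 R2
           ('alt', R1, R2)    alternation   R1 | R2
           ('star', R1)       Kleene star   R1*
        """
        s = "S%d" % next(_fresh)
        kind = regex[0]
        if kind == 'sym':                  # epsilon or a character: S --> RE
            productions.append((s, [regex[1]]))
        elif kind == 'null':               # null language: no production at all
            pass
        elif kind == 'cat':                # R1 R2:   S --> S1 S2
            s1 = re_to_cfg(regex[1], productions)
            s2 = re_to_cfg(regex[2], productions)
            productions.append((s, [s1, s2]))
        elif kind == 'alt':                # R1 | R2: S --> S1  and  S --> S2
            s1 = re_to_cfg(regex[1], productions)
            s2 = re_to_cfg(regex[2], productions)
            productions.append((s, [s1]))
            productions.append((s, [s2]))
        elif kind == 'star':               # R1*:     S --> S1 S | epsilon
            s1 = re_to_cfg(regex[1], productions)
            productions.append((s, [s1, s]))
            productions.append((s, ['epsilon']))
        return s

    # Example: the RE (a|b)* a
    prods = []
    start = re_to_cfg(('cat',
                       ('star', ('alt', ('sym', 'a'), ('sym', 'b'))),
                       ('sym', 'a')),
                      prods)
    for lhs, rhs in prods:
        print(lhs, "-->", " ".join(rhs))

Each call to re_to_cfg does a constant amount of work per RE operator and
invents one fresh nonterminal, mirroring the piece-by-piece construction
described above.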