CSC 173  Thurs. Oct 31, 2002
Read AU, Ch 11

---------------------------------------

An Unambiguous Grammar for Expressions

It is possible to write a grammar for arithmetic expressions that

    * is unambiguous
    * naturally reflects the precedence of * and / over + and -
    * naturally reflects left associativity

Here is one such grammar:

    E --> E + T | E - T | T
    T --> T * F | T / F | F
    F --> ( E ) | number

If we attempt to build a parse tree for number - number + number, we
see there is only one such tree:

    E
    |-- E
    |   |-- E --> T --> F --> number
    |   |-- -
    |   `-- T --> F --> number
    |-- +
    `-- T --> F --> number

This parse tree correctly represents left associativity by using
recursion on the left.  If we rewrote the grammar to use recursion on
the right, we would represent right associativity:

    E --> T + E | T - E | T
    T --> F * T | F / T | F
    F --> ( E ) | number

Right associativity isn't usually what we want for addition,
subtraction, multiplication, and division, but it may be appropriate
for exponentiation.

Our grammar also correctly represents precedence levels by introducing
a new non-terminal symbol for each precedence level.  According to our
grammar, expressions consist of the sum or difference of terms (or a
single term), where a term consists of the product or quotient of
factors (or a single factor), and a factor is a parenthesized
expression or a number.

---------------------------------------

Parsing

A parser is an algorithm that determines whether a given input string
is in a language and, as a side effect, usually produces a parse tree
for the input.

There is a procedure for generating a parser from a given context-free
grammar.  In fact, it is possible to parse any CFG in time cubic in
the length of the input.  There are two well-known algorithms for
this, one due to Earley, the other to Cocke, Younger, and Kasami.

In practice, cubic time is too slow for most purposes.  Fortunately,
many (but not all!) grammars can be parsed in linear time.  There are
two major families of parsing algorithms that run in linear time.
One family constructs the parse tree from the root downward; the other
builds it from the leaves upward.  We will study one form of top-down
parser: recursive descent.

------------------------------------------------------------------------

Recursive-Descent Parsing

Recursive-descent parsing is one of the simplest parsing techniques
used in practice.  The basic idea is to associate a procedure with
each non-terminal.  The goal of each such procedure is to read a
sequence of input characters that can be generated by the
corresponding non-terminal, and return a pointer to the root of the
parse tree for that non-terminal.

The structure of each procedure is dictated by the productions for the
corresponding non-terminal.  The procedure attempts to "match" the
right-hand side of some production for the non-terminal.

    * To match a terminal symbol, the procedure compares the terminal
      symbol to the input; if they agree, the match is successful, and
      the procedure consumes the terminal symbol in the input (that
      is, moves the input cursor over one symbol).

    * To match a non-terminal symbol, the procedure simply calls the
      corresponding procedure for that non-terminal (which may be a
      recursive call, hence the name of the technique).

---------------------------------------

Recursive-Descent Parser for Expressions

As it turns out, the expression grammar we were using earlier can't be
parsed top-down (more on why later).  Here's one that can:

    E     --> T Etail
    Etail --> + T Etail | - T Etail | epsilon
    T     --> F Ttail
    Ttail --> * F Ttail | / F Ttail | epsilon
    F     --> ( E ) | num

We create a procedure for each of the non-terminals.  According to
production 1, the procedure to match expressions (E) must match a term
(by calling the procedure for T) and then the rest of the expression
(by calling the procedure Etail):

    procedure E
        T()
        Etail()

Some procedures, such as Etail, must examine the input to determine
which production to choose.
    procedure Etail
        switch next_token
            case +
                match(+)
                T()
                Etail()
            case -
                match(-)
                T()
                Etail()
            default
                return

We've assumed here a global variable next_token and a utility routine
named match:

    procedure match(expected)
        if next_token != expected
            error()
        else
            next_token = scan()   // read next terminal symbol
                                  // into global variable

The error routine in a pure parser simply halts without accepting.  In
a compiler it prints a nice diagnostic message and then does something
potentially quite complicated (which I won't cover here) to patch up
the parse tree and/or the input and continue looking for further
errors.

Here are the rest of the recursive-descent routines:

    procedure T      // very similar to E
        F()
        Ttail()

    procedure Ttail  // very similar to Etail
        switch next_token
            case *
                match(*)
                F()
                Ttail()
            case /
                match(/)
                F()
                Ttail()
            default
                return

    procedure F
        switch next_token
            case (
                match(()
                E()
                match())
            case num
                match(num)
            default
                error()

Notice that the default case in F is an error, whereas the default
case in Etail and Ttail was to return without doing anything.  The
reason for the difference is that Etail and Ttail have epsilon
productions: they are allowed to have an empty subtree under them in
the parse tree.  F does not have an epsilon production: it *has* to be
either a number or a parenthesized expression.

Look carefully also at the second call to match() within F.  There is
no guarantee that we will actually have a right parenthesis coming up
in the input.  That's why match() has a check inside.  In larger,
programming-language-size grammars, there are lots of similar cases in
which the check inside match is non-redundant.

Finally, we need a main program:

    procedure main
        E()
        match(eof)

Here we adopt the convention that end-of-file is represented by a
pseudo-token, so we can use whatever standard error-detection/recovery
mechanism we've built into the match routine.

---------------------------------------

Tracing the Parser

As an example, consider the following input:

    1 + (2 * 3) / 4
We just call the procedure corresponding to the start symbol.

    next_token = "1"
    Call E
        Call T
            Call F
                next_token = "+"             /* Match 1 with F */
            Call Ttail                       /* Match epsilon */
        Call Etail
            next_token = "("                 /* Match + */
            Call T
                Call F                       /* Match (, looking for E ) */
                    next_token = "2"
                    Call E
                        Call T
                            Call F           /* Match 2 with F */
                                next_token = "*"
                            Call Ttail       /* Match * */
                                next_token = "3"
                                Call F       /* Match 3 with F */
                                    next_token = ")"
                                Call Ttail   /* Match epsilon */
                        Call Etail           /* Match epsilon */
                    next_token = "/"         /* Match ")" */
                Call Ttail
                    next_token = "4"         /* Match "/" */
                    Call F                   /* Match 4 with F */
                        next_token = eof
                    Call Ttail               /* Match epsilon */
            Call Etail                       /* Match epsilon */
    /* Match eof */

---------------------------------------

Observations about Recursive-Descent Parsing

    * In procedures Etail and Ttail, we match one of the productions
      with an arithmetic operator if we see such an operator in the
      input; otherwise we simply return.  A procedure that returns
      without matching any symbols is, in effect, choosing the
      epsilon production.

    * In our expression parser, we choose the epsilon production only
      if next_token doesn't match the first terminal on the right-hand
      side of some other production.

    * We never attempt to read beyond the end marker (eof), which is
      matched only at the end of an outermost expression.  In all
      other circumstances, the presence of the end marker signals a
      syntax error.

    * As written, our recursive-descent parser only determines whether
      or not the input string is in the language of the grammar; it
      does not give the structure of the string according to the
      grammar.  We could easily build a parse tree incrementally
      during parsing.  The book shows how in section 11.6.
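The routines above translate almost line for line into a runnable
sketch.  The following Python version (the names and the token
representation are mine, not from the notes; input is assumed to be
pre-tokenized, with numbers as ("num", value) pairs and operators and
parentheses as one-character tokens) also builds the structure that
the last observation mentions, by having Etail and Ttail fold each new
operand onto the left of the tree built so far:

```python
class ParseError(Exception):
    pass

def parse(tokens):
    tokens = list(tokens) + [("eof", None)]  # eof pseudo-token
    pos = 0

    def next_token():
        return tokens[pos][0]

    def match(expected):
        nonlocal pos
        if next_token() != expected:
            raise ParseError(f"expected {expected}, saw {next_token()}")
        value = tokens[pos][1]
        pos += 1                    # consume the token (scan())
        return value

    def E():
        return Etail(T())

    def Etail(left):                # left: the tree built so far
        if next_token() in ("+", "-"):
            op = next_token()
            match(op)
            return Etail((op, left, T()))   # fold onto the left
        return left                 # epsilon production

    def T():
        return Ttail(F())

    def Ttail(left):
        if next_token() in ("*", "/"):
            op = next_token()
            match(op)
            return Ttail((op, left, F()))
        return left                 # epsilon production

    def F():
        if next_token() == "(":
            match("(")
            tree = E()
            match(")")              # the non-redundant check
            return tree
        elif next_token() == "num":
            return match("num")
        raise ParseError(f"unexpected {next_token()}")

    tree = E()
    match("eof")                    # procedure main
    return tree

toks = [("num", 1), ("+", None), ("(", None), ("num", 2), ("*", None),
        ("num", 3), (")", None), ("/", None), ("num", 4)]
print(parse(toks))                  # ('+', 1, ('/', ('*', 2, 3), 4))
```

Running it on the traced input 1 + (2 * 3) / 4 produces the nested
tuple ('+', 1, ('/', ('*', 2, 3), 4)), with * and / binding tighter
than + just as the grammar dictates.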
---------------------------------------

Lookahead in Recursive-Descent Parsing

To implement a recursive-descent parser for a grammar, it must be
possible, for each non-terminal, to determine which production to
apply by looking at only one upcoming input symbol.  (We want to avoid
having the compiler or other text-processing program scan ahead in the
input to determine what action to take next.)  The lookahead symbol is
simply the next terminal that we will try to match in the input.

We use a single lookahead symbol to decide which production to match.
Consider a production A --> X1...Xm.  We need to know the set of
possible lookahead symbols that indicate this production is to be
chosen.  This set is clearly the set of terminal symbols that can
begin a string produced by the symbols X1...Xm (which may be either
terminals or non-terminals).  We denote the set of symbols that could
be produced first by X1...Xm as First(X1...Xm).

------------------------------------------------------------------------

First Sets

To distinguish two productions with the same non-terminal on the
left-hand side, we examine the First sets of their corresponding
right-hand sides.  We do this in three steps:

    (1) figure out which non-terminals can generate epsilon
    (2) figure out FIRST sets for all non-terminals
    (3) figure out FIRST sets for right-hand sides

Steps (1) and (2) start with "obvious" facts from the grammar and
iterate until they can't learn any more.

Consider step (1).  If we have

    A --> epsilon
    B --> epsilon

then clearly A and B are symbols that can generate epsilon.  These are
the "obvious" facts.  Then in a second pass over the grammar, if we
have

    C --> A B

we can deduce that C is a symbol that can generate epsilon.  If we
have

    D --> C A B

then in a third pass we can deduce that D is a symbol that can
generate epsilon.  We continue this process until we make a complete
pass over the grammar without learning anything new.

Now consider step (2).
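Step (1)'s pass-until-nothing-changes loop is a fixed-point
computation, and can be sketched in a few lines of Python (the
representation is my own: a grammar is a list of (lhs, rhs) pairs,
with epsilon as the empty tuple):

```python
def nullable_symbols(grammar):
    """Return the set of non-terminals that can generate epsilon."""
    nullable = set()
    changed = True
    while changed:              # keep passing until we learn nothing new
        changed = False
        for lhs, rhs in grammar:
            # lhs is nullable if every symbol on the RHS is nullable
            # (trivially true for an epsilon production, rhs == ()).
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

# The A, B, C, D example from the notes:
g = [("A", ()), ("B", ()), ("C", ("A", "B")), ("D", ("C", "A", "B"))]
print(sorted(nullable_symbols(g)))  # ['A', 'B', 'C', 'D']
```

A and B are learned from the "obvious" epsilon productions, and C and
D follow in later passes, exactly as in the hand derivation above.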
If we have

    A --> b C D
    B --> c D e

then clearly b is an element of FIRST(A) and c is an element of
FIRST(B).  These are the obvious facts.  Then in a second pass, if we
have

    C --> B A d

clearly c is an element of FIRST(C), because it's an element of
FIRST(B) and a C can start with a B.  But if B can generate epsilon,
then b is also an element of FIRST(C), because we can erase the B and
generate the b from A.

In each pass over the grammar we work our way through each RHS, adding
elements to the FIRST set of the LHS, until we find a symbol in the
RHS that cannot generate epsilon, at which point we move on to the
next production.  As in step (1), we keep making passes until we don't
learn anything new.

Finally, in step (3) we use our knowledge of FIRST sets for individual
symbols to calculate FIRST sets for RHSs.

Given the production A --> X1...Xm, we must determine First(X1...Xm).
We first consider the leftmost symbol, X1.

    * If X1 is a terminal symbol, then First(X1...Xm) = {X1}.

    * If X1 is a non-terminal, then we compute the First sets for each
      right-hand side corresponding to X1.  In our expression grammar
      above:

        First(E) = First(T Etail)
        First(T Etail) = First(T)
        First(T) = First(F Ttail)
        First(F Ttail) = First(F) = {(, num}

If X1 can generate epsilon, then X1 can (in effect) be erased, and
First(X1...Xm) depends on X2.

    * If X2 is a terminal, it is included in First(X1...Xm).

    * If X2 is a non-terminal, we compute the First sets for each of
      its corresponding right-hand sides.

Similarly, if both X1 and X2 can produce epsilon, we consider X3, then
X4, etc.  It is possible that X1, X2, ..., Xm can *all* produce
epsilon.  What then?  The informal answer is that we should predict
A --> X1...Xm if the lookahead symbol can come *after* an A in some
line of the derivation.  A formal treatment of this subject requires
the notion of so-called Follow sets for symbols.  In practice, we
don't generally have to know about Follow sets when building a
recursive-descent parser.
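Steps (2) and (3) can be computed by the same kind of iteration.  Here
is one possible Python sketch (the list-of-pairs grammar
representation and helper names are mine; `terminals` and `nullable`,
the step (1) result, are supplied as sets), applied to the top-down
expression grammar from earlier:

```python
def first_sets(grammar, terminals, nullable):
    """Step (2): FIRST sets for non-terminals, by iterating to a fixed point."""
    first = {lhs: set() for lhs, _ in grammar}
    changed = True
    while changed:                   # pass until we learn nothing new
        changed = False
        for lhs, rhs in grammar:
            for sym in rhs:
                new = {sym} if sym in terminals else first[sym]
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
                if sym not in nullable:  # this symbol can't be erased;
                    break                # move on to the next production
    return first

def first_of_rhs(rhs, terminals, first, nullable):
    """Step (3): FIRST of a right-hand side, walking past erasable symbols."""
    result = set()
    for sym in rhs:
        result |= {sym} if sym in terminals else first[sym]
        if sym not in nullable:
            break
    return result

g = [("E", ("T", "Etail")),
     ("Etail", ("+", "T", "Etail")), ("Etail", ("-", "T", "Etail")),
     ("Etail", ()),
     ("T", ("F", "Ttail")),
     ("Ttail", ("*", "F", "Ttail")), ("Ttail", ("/", "F", "Ttail")),
     ("Ttail", ()),
     ("F", ("(", "E", ")")), ("F", ("num",))]
terms = {"+", "-", "*", "/", "(", ")", "num"}
nullable = {"Etail", "Ttail"}

first = first_sets(g, terms, nullable)
print(sorted(first["E"]))      # ['(', 'num']
print(sorted(first["Etail"]))  # ['+', '-']
```

The results agree with the hand derivation: First(E) = First(T) =
First(F) = {(, num}, while First(Etail) = {+, -} and First(Ttail) =
{*, /}.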
Suppose we have three productions for A:

    A --> B c D
    A --> e f
    A --> G H

where G and H can both generate epsilon.  Our parsing routine then
says:

    A() {
        switch (next_token) {
            case First(BcD):
                B()
                match(c)
                D()
            case e:
                match(e)
                match(f)
            default:
                G()
                H()
        }
    }

If next_token is not in First(BcD) U {e}, we assume we can use the
third production.  If it turns out that next_token is not in
First(GH) U Follow(A) either, then this was a bad decision, but
nothing catastrophic happens: the calls to G and H will go ahead and
generate epsilon, we'll return, and our caller will announce a syntax
error -- just a bit later than we could have.
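To make the "nothing catastrophic happens" point concrete, here is a
runnable miniature (the grammar details are invented for illustration:
A --> B c D | e f | G H, with B --> b, D --> d, G --> g | epsilon,
H --> h | epsilon).  On bad input, G and H choose epsilon, A returns
normally, and the caller's match(eof) announces the error:

```python
class ParseError(Exception):
    pass

def parse_A(tokens):
    pos = 0

    def look():
        return tokens[pos] if pos < len(tokens) else "eof"

    def match(expected):
        nonlocal pos
        if look() != expected:
            raise ParseError(f"expected {expected}, saw {look()}")
        pos += 1

    def A():
        if look() == "b":          # First(B c D) = {b}
            match("b"); match("c"); match("d")
        elif look() == "e":
            match("e"); match("f")
        else:                      # assume A --> G H
            G(); H()               # may both generate epsilon

    def G():
        if look() == "g":
            match("g")             # else epsilon

    def H():
        if look() == "h":
            match("h")             # else epsilon

    A()
    match("eof")                   # the caller detects leftover input
    return True

print(parse_A(["g", "h"]))         # True
# parse_A(["x"]): A() falls through to G H, both generate epsilon,
# and the subsequent match("eof") raises ParseError.
```

The error on input "x" is reported by match("eof") rather than inside
A() -- a bit later than we could have caught it, exactly as described
above.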