CSC 173   Tues. Oct 29, 2002

Read Aho & Ullman chapter 11
Next project posted today or tomorrow
Exams graded as soon as possible - Thurs?

--------------------------------------------------
Recursive Patterns and Context Free Grammars

Context-free Grammar - set of recursive rewriting rules (productions)
    used to generate patterns of strings
    used to define the syntax of programming languages.

Parse tree - display structure used by grammar to generate input string
    used within compiler to describe structure of input program in
    terms of the syntactic rules used to define valid progs

Parser - alg that determines whether a given input string is in a lang
    a mechanical procedure exists for generating a parser from a CFG

---------------------------------------
A CFG consists of the following components:

  * a set of terminal symbols, which are the characters of the
    alphabet that appear in the strings generated by the grammar.

  * a set of nonterminal symbols, which are placeholders for patterns
    of terminal symbols that can be generated by the nonterminal
    symbols.

  * a set of productions, which are rules for replacing (or rewriting)
    nonterminal symbols (on the left side of the production) in a
    string with other nonterminal or terminal symbols (on the right
    side of the production).

  * a start symbol, which is a special nonterminal symbol that appears
    in the initial string generated by the grammar.  By convention the
    start symbol is usually the LHS of the first production.

To generate a string of terminal symbols from a CFG, we:

  * Begin with a string consisting of the start symbol;

  * Apply one of the productions with the start symbol on the left
    hand side, replacing the start symbol with the right hand side of
    the production;

  * Repeat the process of selecting nonterminal symbols in the string,
    and replacing them with the right hand side of some corresponding
    production, until all nonterminals have been replaced by terminal
    symbols.

The resulting sequence of strings is called a *derivation*.

---------------------------------------
A CFG for Arithmetic Expressions

An example grammar that generates strings representing arithmetic
expressions with the four operators +, -, *, /, and numbers as
operands is:

   1. expr --> number
   2. expr --> ( expr )
   3. expr --> expr + expr
   4. expr --> expr - expr
   5. expr --> expr * expr
   6. expr --> expr / expr

The only nonterminal symbol in this grammar is expr, which is also the
start symbol.  The terminal symbols are {+,-,*,/,(,),number}.  (We will
interpret "number" to represent any valid number.)

The first rule (or production) states that an expr can be rewritten as
(or replaced by) a number.  In other words, a number is a valid
expression.

The second rule says that an expr enclosed in parentheses is also an
expr.  Note that this rule defines an expression in terms of
expressions, an example of the use of recursion in the definition of
context-free grammars.  Recursion is the ONE SINGLE thing that gives
CFGs power that REs and FAs lack.

The remaining rules say that the sum, difference, product, or quotient
of two exprs is also an expr.

---------------------------------------
Generating Strings from a CFG

In our grammar for arithmetic expressions, the start symbol is expr, so
our initial string is:

    expr

Using rule 5 we can choose to replace this nonterminal, producing the
string:

    expr * expr

We now have two nonterminals to replace.  We can apply rule 3 to the
first nonterminal, producing the string:

    expr + expr * expr

We can apply rule 2 to the first nonterminal in this string to produce:

    (expr) + expr * expr

If we apply rule 1 to the remaining nonterminals (the recursion must
end somewhere!), we get:

    (number) + number * number

This is a valid arithmetic expression, as generated by the grammar.

When applying the rules above, we often face a choice as to which
production to choose.  Different choices will typically result in
different strings being generated.

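The derivation just described can be replayed mechanically.  What follows is only a sketch, not anything from the course materials: the dictionary encoding of the grammar and the names PRODUCTIONS, apply_rule, derive, and STEPS are all made up for illustration.  Each step names a rule number and the position of the nonterminal it rewrites.

```python
# Hypothetical encoding of the six expression productions:
# rule number -> (left-hand side, right-hand side as a list of symbols).
PRODUCTIONS = {
    1: ("expr", ["number"]),
    2: ("expr", ["(", "expr", ")"]),
    3: ("expr", ["expr", "+", "expr"]),
    4: ("expr", ["expr", "-", "expr"]),
    5: ("expr", ["expr", "*", "expr"]),
    6: ("expr", ["expr", "/", "expr"]),
}


def apply_rule(symbols, rule, position):
    """Rewrite the nonterminal at index `position` using production `rule`."""
    lhs, rhs = PRODUCTIONS[rule]
    if symbols[position] != lhs:
        raise ValueError(f"rule {rule} does not apply to {symbols[position]!r}")
    return symbols[:position] + rhs + symbols[position + 1:]


def derive(steps, start="expr"):
    """Return every string of the derivation, beginning with the start symbol."""
    current = [start]
    derivation = [current]
    for rule, position in steps:
        current = apply_rule(current, rule, position)
        derivation.append(current)
    return derivation


# Rule 5, then rule 3, then rule 2, then rule 1 on each remaining nonterminal.
STEPS = [(5, 0), (3, 0), (2, 0), (1, 1), (1, 4), (1, 6)]

for string in derive(STEPS):
    print(" ".join(string))
```

The last line printed is ( number ) + number * number.  Choosing different rules or positions in STEPS gives a different derivation, and usually a different string.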
Given a grammar G with start symbol S, if there is some sequence of
productions that, when applied to the initial string S, results in the
string s, then s is in L(G), the language of the grammar.

---------------------------------------
CFGs with Epsilon Productions

A CFG may have a production for a nonterminal in which the right hand
side is the empty string (which we denote by epsilon).  The effect of
this production is to remove the nonterminal from the string being
generated.

Here is a grammar for balanced parentheses that uses epsilon
productions.

    P --> ( P )
    P --> P P
    P --> epsilon

Epsilon productions are commonly written with just an empty RHS:

    P -->

We begin with the string P.  We can replace P with epsilon, in which
case we have generated the empty string (which does have balanced
parentheses).  Alternatively, we can generate a string of balanced
parentheses within a pair of balanced parentheses, which must result in
a string of balanced parentheses.  Alternatively, we can concatenate
two strings of balanced parentheses, which again must result in a
string of balanced parentheses.

This grammar is equivalent to:

    P --> ( P ) | P P | epsilon

We use the notational shorthand '|', which can be read as "or", to
represent multiple rewriting rules within a single line.  Here the
epsilon really is needed for clarity.

---------------------------------------
Notational conventions

Some authors (including A&U) put non-terminals in angle brackets.  A&U
also distinguish between "abstract" and "concrete" terminals, putting
the former in bold and the latter in italics.

Others (including me) put non-terminals in italics and terminals in
typewriter (monospace) font.

Since fonts don't work in plain ascii, the parsing project uses
uppercase for terminals and lowercase for non-terminals.

Strictly speaking you don't need such conventions, since non-terminals
are the symbols that appear on LHSs, and terminals are the ones that
don't.

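That last point can be checked directly.  A small sketch, assuming a grammar is stored as a list of (LHS, RHS) pairs; the names GRAMMAR and classify_symbols are invented here, and the epsilon production is written as an empty list:

```python
# Balanced-parentheses grammar; the empty list is the epsilon production.
GRAMMAR = [
    ("P", ["(", "P", ")"]),
    ("P", ["P", "P"]),
    ("P", []),
]


def classify_symbols(productions):
    """Split the symbols of a grammar into (nonterminals, terminals)."""
    # Nonterminals are exactly the symbols that appear on some left-hand side.
    nonterminals = {lhs for lhs, _ in productions}
    # Every other symbol that is used on a right-hand side is a terminal.
    used = {symbol for _, rhs in productions for symbol in rhs}
    return nonterminals, used - nonterminals


nonterminals, terminals = classify_symbols(GRAMMAR)
print(sorted(nonterminals))  # ['P']
print(sorted(terminals))     # ['(', ')']
```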
---------------------------------------
CFG Examples

A CFG describing strings of letters with the word "main" somewhere in
the string:

     --> m a i n
     --> | epsilon
     --> A | B | ... | Z | a | b | ... | z

A CFG for the set of identifiers in Pascal:

     -->
     --> | | epsilon
     --> A | B | ... | Z | a | b | ... | z
     --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

A CFG describing real numbers in Pascal:

     -->
     --> | epsilon
     --> '.' | epsilon
     --> 'E' | epsilon
     --> + | - | epsilon
     --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Note that all three of the above examples are regular sets; recursion
is not required to define them.

A CFG for C compound statements (more or less):

     --> { }
     --> | epsilon
     -->
     --> id :
     --> if ( )
     --> if ( ) else
     --> while ( )
     --> do while ( ) ;
     --> for ( ; )
     --> switch ( )
     --> case : | default:
     --> break ; | continue ; | ;
     --> return ; | goto ;

Note that this *does* require recursion (stmt's have stmt's inside)

------------------------------------------------------------------------
The quick story on CFGs and Regular Expressions
(more later if we have time)

CFGs are strictly more powerful: anything you can do with a RE you can
do with a CFG, but not vice versa.  The intuition is that CFGs give you
concatenation and alternation, and can easily emulate Kleene closure,
but they also let you define things recursively *in terms of
themselves*, which REs don't.

------------------------------------------------------------------------
Parse Trees

A parse tree for a grammar G is a tree where

  * the root is the start symbol for G
  * the interior nodes are nonterminals of G
  * the leaf nodes are terminal symbols of G.
  * the children of a node T (from left to right) correspond to the
    symbols on the right hand side of some production for T in G.

Every terminal string generated by a grammar has at least one
corresponding parse tree; every valid parse tree represents a string
generated by the grammar (called the yield of the parse tree).

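The four conditions above translate almost line for line into a checker.  This is a sketch under assumptions of my own: a leaf is a plain string, an interior node is a (nonterminal, children) pair, and the names label, is_valid, is_parse_tree, and tree_yield are made up.  It uses the arithmetic-expression grammar from earlier in these notes.

```python
# The expression grammar: nonterminal -> list of right-hand sides.
PRODUCTIONS = {
    "expr": [
        ["number"],
        ["(", "expr", ")"],
        ["expr", "+", "expr"],
        ["expr", "-", "expr"],
        ["expr", "*", "expr"],
        ["expr", "/", "expr"],
    ],
}


def label(node):
    """Return the grammar symbol at a node (leaf string or interior pair)."""
    return node if isinstance(node, str) else node[0]


def is_valid(node):
    """True if every interior node's children spell out one of its productions."""
    if isinstance(node, str):
        return node not in PRODUCTIONS      # leaves must be terminals
    symbol, children = node
    rhs = [label(child) for child in children]
    return rhs in PRODUCTIONS.get(symbol, []) and all(map(is_valid, children))


def is_parse_tree(node, start="expr"):
    """The root must also be the start symbol."""
    return label(node) == start and is_valid(node)


def tree_yield(node):
    """Read the leaves from left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in tree_yield(child)]


NUM = ("expr", ["number"])
TREE = ("expr", [NUM, "+", ("expr", [NUM, "*", NUM])])

print(is_parse_tree(TREE))          # True
print(" ".join(tree_yield(TREE)))   # number + number * number
```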
Example: Given the following grammar, find a parse tree for the string
1 + 2 * 3:

   1. E --> number
   2. E --> ( E )
   3. E --> E + E
   4. E --> E - E
   5. E --> E * E
   6. E --> E / E

One parse tree is:

    E --> E --> N --> 1
          +
          E --> E --> N --> 2
                *
                E --> N --> 3

========================================================================
Ambiguous Grammars

A grammar for which there are two different parse trees for the same
terminal string is said to be ambiguous.

The grammar for balanced parentheses given earlier is an example of an
ambiguous grammar:

    P --> ( P ) | P P | epsilon

We can prove this grammar is ambiguous by demonstrating two parse trees
for the same terminal string.

Here are two parse trees for the empty string:

    P --> P --> epsilon
          P --> epsilon

    P --> epsilon

Here are two parse trees for ():

    P --> P --> (
                P --> epsilon
                )
          P --> epsilon

    P --> P --> epsilon
          P --> (
                P --> epsilon
                )

While in general it may be difficult to prove an arbitrary grammar is
ambiguous, the demonstration of two distinct parse trees for the same
terminal string is sufficient proof that some particular grammar is
ambiguous.

An unambiguous grammar for the set of strings consisting of balanced
parentheses is:

    P --> ( P ) P | epsilon

---------------------------------------
The Problem of Ambiguous Grammars

A parse tree is supposed to display the structure used by a grammar to
generate an input string.  This structure is not unique if the grammar
is ambiguous.  A problem arises if we attempt to impart meaning to an
input string using a parse tree; if the parse tree is not unique, then
the string has multiple meanings.

We typically use a grammar to define the syntax of a programming
language.  The structure of the parse tree produced by the grammar
imparts some meaning on the strings of the language.

If the grammar is ambiguous, the compiler has no way to determine which
of two meanings to use.  Thus, the code produced by the compiler is not
fully determined by the program input to the compiler.

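To see "multiple meanings" in numbers, take 1 + 2 * 3 again.  The sketch below assumes the meaning of a tree is computed bottom-up, and it abbreviates each parse tree to a nested (left, op, right) triple; that encoding and the names OPS, evaluate, leaves, TREE_A, and TREE_B are mine, not part of the grammar.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def evaluate(tree):
    """Evaluate a tree that is either a number or a (left, op, right) triple."""
    if isinstance(tree, (int, float)):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))


def leaves(tree):
    """Return the yield of the tree as a list of tokens."""
    if isinstance(tree, (int, float)):
        return [tree]
    left, op, right = tree
    return leaves(left) + [op] + leaves(right)


# Two trees the grammar allows for the same string 1 + 2 * 3.
TREE_A = (1, "+", (2, "*", 3))   # root uses E --> E + E
TREE_B = ((1, "+", 2), "*", 3)   # root uses E --> E * E

print(leaves(TREE_A) == leaves(TREE_B))     # True: same input string
print(evaluate(TREE_A), evaluate(TREE_B))   # 7 9
```

Same string, two trees, two answers.  A compiler working from the ambiguous grammar alone has nothing to tell it which one the programmer meant.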
---------------------------------------
Ambiguous Precedence

Recall the grammar for expressions given earlier:

    E --> number
    E --> ( E )
    E --> E + E
    E --> E - E
    E --> E * E
    E --> E / E

This grammar is ambiguous as shown by the two parse trees for the input
string number + number * number:

    E --> E --> number
          +
          E --> E --> number
                *
                E --> number

    E --> E --> E --> number
                +
                E --> number
          *
          E --> number

The first parse tree gives precedence to multiplication over addition;
the second parse tree gives precedence to addition over multiplication.
In most programming languages, only the former meaning is correct.  As
written, this grammar is ambiguous with respect to the precedence of
the arithmetic operators.

Note (THIS IS IMPORTANT): precedence is NOT a property of the
context-free language consisting of syntactically valid expressions.
It's a property of the *meaning* (semantics) we *choose* to apply to
those strings.  Using a grammar that "naturally" reflects precedence
makes it easier for a compiler to implement the chosen semantics.

---------------------------------------
Ambiguous Associativity

Consider again the same grammar for expressions:

    E --> number
    E --> ( E )
    E --> E + E
    E --> E - E
    E --> E * E
    E --> E / E

This grammar is ambiguous even if we only consider operators at the
same precedence level, as in the input string number - number + number:

    E --> E --> number
          -
          E --> E --> number
                +
                E --> number

    E --> E --> E --> number
                -
                E --> number
          +
          E --> number

The first parse tree (incorrectly) gives precedence to the addition
operator; the second parse tree gives precedence to the subtraction
operator.  Since we normally group operators left to right within a
precedence level, only the latter interpretation is correct.

As with precedence, associativity is NOT a property of the context-free
expression language; it's a property of the semantics we choose to
associate with that language.

Second important note: computer arithmetic is not associative!
Because of overflow, it may not always be the case that (a+b)+c gives the same result as a+(b+c).
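
A quick way to watch this happen, as a sketch only.  Python integers never overflow, so this example assumes IEEE double-precision floats (Python's float type on ordinary hardware) instead of machine integers; the function names and the constant BIG are made up.

```python
def left_grouped(a, b, c):
    """Compute (a + b) + c."""
    return (a + b) + c


def right_grouped(a, b, c):
    """Compute a + (b + c)."""
    return a + (b + c)


BIG = 1e308   # close to the largest representable double

# Grouping left first overflows to infinity before the subtraction can help.
print(left_grouped(BIG, BIG, -BIG))    # inf
# Grouping right first cancels BIG and -BIG, so nothing overflows.
print(right_grouped(BIG, BIG, -BIG))   # 1e+308

# Rounding alone, with no overflow, is enough to break associativity too.
print(left_grouped(0.1, 0.2, 0.3) == right_grouped(0.1, 0.2, 0.3))   # False
```

So the grouping a grammar imposes on a + b + c can change the computed result even though the two groupings agree mathematically.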