CSC 173
Tues. Oct 29, 2002

Read Aho & Ullman chapter 11
Next project posted today or tomorrow
Exams graded as soon as possible - Thurs?

--------------------------------------------------

Recursive Patterns and Context Free Grammars

Context-free Grammar - set of recursive rewriting rules (productions)
   used to generate patterns of strings
   used to define the syntax of programming languages.

Parse tree - displays the structure used by a grammar to generate an
             input string
      used within a compiler to describe the structure of the input
        program in terms of the syntactic rules that define valid progs

Parser - alg that determines whether a given input string is in a lang
       a mechanical procedure exists for generating a parser from a CFG

---------------------------------------

A CFG consists of the following components:

   * a set of terminal symbols, which are the characters of the alphabet
     that appear in the strings generated by the grammar.

   * a set of nonterminal symbols, which are placeholders for patterns
     of terminal symbols that can be generated by the nonterminal symbols.

   * a set of productions, which are rules for replacing (or rewriting)
     nonterminal symbols (on the left side of the production) in a
     string with other nonterminal or terminal symbols (on the right
     side of the production).

   * a start symbol, which is a special nonterminal symbol that appears
     in the initial string generated by the grammar.  By convention the
     start symbol is usually the LHS of the first production.

To generate a string of terminal symbols from a CFG, we:

   * Begin with a string consisting of the start symbol;

   * Apply one of the productions with the start symbol on the left hand
     side, replacing the start symbol with the right hand side of the
     production;

   * Repeat the process of selecting nonterminal symbols in the string,
     and replacing them with the right hand side of some corresponding
     production, until all nonterminals have been replaced by terminal
     symbols.  The resulting sequence of strings is called a
     *derivation*.

---------------------------------------

A CFG for Arithmetic Expressions

An example grammar that generates strings representing arithmetic
expressions with the four operators +, -, *, /, and numbers as operands
is:

  1. expr --> number
  2. expr --> ( expr )
  3. expr --> expr + expr
  4. expr --> expr - expr
  5. expr --> expr * expr
  6. expr --> expr / expr

The only nonterminal symbol in this grammar is expr, which is
also the start symbol. The terminal symbols are {+,-,*,/,(,),number}.
(We will interpret "number" to represent any valid number.)

The first rule (or production) states that an expr can be
rewritten as (or replaced by) a number.  In other words, a number is a
valid expression.

The second rule says that an expr enclosed in parentheses is also an expr.
Note that this rule defines an expression in terms of expressions, an
example of the use of recursion in the definition of context-free grammars.
Recursion is the ONE SINGLE thing that gives CFGs power that REs and FAs
lack.

The remaining rules say that the sum, difference, product, or quotient
of two exprs is also an expr.

---------------------------------------

Generating Strings from a CFG

In our grammar for arithmetic expressions, the start symbol is
expr, so our initial string is:

    expr

Using rule 5 we can choose to replace this nonterminal, producing the
string:

    expr * expr

We now have two nonterminals to replace. We can apply rule 3 to the
first nonterminal, producing the string:

    expr + expr * expr

We can apply rule 2 to the first nonterminal in this string to
produce:

    (expr) + expr * expr

If we apply rule 1 to the remaining nonterminals (the recursion must end
somewhere!), we get:

    (number) + number * number

This is a valid arithmetic expression, as generated by the grammar.

When applying the rules above, we often face a choice as to which
production to choose. Different choices will typically result in
different strings being generated.

Given a grammar G with start symbol S, if there is some sequence of
productions that, when applied to the initial string S, results in the
terminal string s, then s is in L(G), the language of the grammar.
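
The derivation walked through above can be replayed mechanically.  The
following is only a sketch in Python: the names RULES, apply_rule, and
derive are made up for illustration, and each step is given as a pair
(rule number, which occurrence of the nonterminal to rewrite, 0 being the
leftmost).

```python
# Productions of the expression grammar, numbered as in the notes.
RULES = {
    1: ("expr", ["number"]),
    2: ("expr", ["(", "expr", ")"]),
    3: ("expr", ["expr", "+", "expr"]),
    4: ("expr", ["expr", "-", "expr"]),
    5: ("expr", ["expr", "*", "expr"]),
    6: ("expr", ["expr", "/", "expr"]),
}
NONTERMINALS = {"expr"}


def apply_rule(form, rule_no, which=0):
    """Rewrite the `which`-th occurrence (0 = leftmost) of the rule's LHS."""
    lhs, rhs = RULES[rule_no]
    positions = [i for i, sym in enumerate(form) if sym == lhs]
    i = positions[which]
    return form[:i] + rhs + form[i + 1:]


def derive(start, steps):
    """Return the whole derivation: the list of strings produced step by step."""
    forms = [[start]]
    for rule_no, which in steps:
        forms.append(apply_rule(forms[-1], rule_no, which))
    return forms


# Rule 5, then rule 3 and rule 2 on the first expr, then rule 1 three times.
steps = [(5, 0), (3, 0), (2, 0), (1, 0), (1, 0), (1, 0)]
derivation = derive("expr", steps)
for form in derivation:
    print(" ".join(form))
```

The last line printed is ( number ) + number * number, a string with no
nonterminals left, so it is in L(G).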

---------------------------------------

CFGs with Epsilon Productions

A CFG may have a production for a nonterminal in which the right hand
side is the empty string (which we denote by epsilon). The effect of
this production is to remove the nonterminal from the string being
generated.

Here is a grammar for balanced parentheses that uses epsilon
productions.

    P --> ( P )
    P --> P P
    P --> epsilon

Epsilon productions are commonly written with just an empty RHS:

    P -->

We begin with the string P. We can replace P with epsilon, in which case we
have generated the empty string (which does have balanced parentheses).
Alternatively, we can generate a string of balanced parentheses within a
pair of balanced parentheses, which must result in a string of balanced
parentheses.  Alternatively, we can concatenate two strings of balanced
parentheses, which again must result in a string of balanced parentheses.

This grammar is equivalent to:

    P --> ( P ) | P P | epsilon

We use the notational shorthand '|', which can be read as "or", to
represent multiple rewriting rules within a single line.  Here the
epsilon really is needed for clarity.
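
To see which strings this grammar generates, we can search over leftmost
derivations.  This is a rough Python sketch, not a general-purpose tool:
the function name generate and the pruning bound (strings under
construction longer than max_len + 2 symbols are discarded) are
assumptions chosen to keep the search finite for this small grammar.

```python
from collections import deque

# P --> ( P ) | P P | epsilon   (epsilon is the empty right-hand side)
PRODUCTIONS = {"P": [["(", "P", ")"], ["P", "P"], []]}


def generate(start, max_len):
    """Return the terminal strings of length <= max_len found by the search."""
    limit = max_len + 2          # pruning bound (an assumption, see lead-in)
    seen = {(start,)}
    queue = deque(seen)
    found = set()
    while queue:
        form = queue.popleft()
        # Index of the leftmost nonterminal, or None if the string is done.
        idx = next((i for i, s in enumerate(form) if s in PRODUCTIONS), None)
        if idx is None:
            found.add("".join(form))
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            terminals = sum(1 for s in new if s not in PRODUCTIONS)
            if terminals <= max_len and len(new) <= limit and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=lambda s: (len(s), s))


print(generate("P", 4))
```

With max_len = 4 the search finds the empty string, (), (()), and ()(),
which are exactly the balanced strings of at most four parentheses.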

---------------------------------------

Notational conventions

Some authors (including A&U) put non-terminals in angle brackets.
A&U also distinguish between "abstract" and "concrete" terminals,
putting the former in bold and the latter in italics.

Others (including me) put non-terminals in italics and terminals in
typewriter (monospace) font.

Since fonts don't work in plain ASCII, the parsing project uses
uppercase for terminals and lowercase for non-terminals.

Strictly speaking you don't need such conventions, since non-terminals
are the symbols that appear on LHSs, and terminals are the ones that
don't.
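
That observation is easy to mechanize.  A small Python sketch (the
representation of productions as (LHS, RHS list) pairs and the name
classify are assumptions for illustration), applied to the expression
grammar from earlier:

```python
productions = [
    ("expr", ["number"]),
    ("expr", ["(", "expr", ")"]),
    ("expr", ["expr", "+", "expr"]),
    ("expr", ["expr", "-", "expr"]),
    ("expr", ["expr", "*", "expr"]),
    ("expr", ["expr", "/", "expr"]),
]


def classify(productions):
    """Split the grammar's symbols into (nonterminals, terminals)."""
    # Nonterminals are exactly the symbols that appear on some LHS.
    nonterminals = {lhs for lhs, _ in productions}
    # Every other symbol that appears on a RHS is a terminal.
    symbols = {sym for _, rhs in productions for sym in rhs}
    return nonterminals, symbols - nonterminals


nonterminals, terminals = classify(productions)
print(nonterminals, terminals)
```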

---------------------------------------

CFG Examples

A CFG describing strings of letters with the word "main" somewhere in
the string:

<program> --> <letter*> m a i n <letter*>
<letter*> --> <letter> <letter*> | epsilon
<letter> --> A | B | ... | Z | a | b | ... | z

A CFG for the set of identifiers in Pascal:

<id> --> <L> <LorD*>
<LorD*> --> <L> <LorD*> | <D> <LorD*> | epsilon
<L> --> A | B | ... | Z | a | b | ... | z
<D> --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

A CFG describing real numbers in Pascal:

<real> --> <digit> <digit*> <decimal part> <exp>
<digit*> --> <digit> <digit*> | epsilon
<decimal part> --> '.' <digit> <digit*> | epsilon
<exp> --> 'E' <sign> <digit> <digit*> | epsilon
<sign> --> + | - | epsilon
<digit> --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Note that all three of the above examples are regular sets; each could
also be described by a regular expression, so recursion is not required
to define them.
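
As a check on that claim, here is one way the three languages might be
written as regular expressions, using Python's re module.  The patterns
are my transcription of the grammars above (the names HAS_MAIN,
PASCAL_ID, and PASCAL_REAL are made up); they follow the grammars as
written, not the full Pascal standard.

```python
import re

# <program>: letters, then "main", then letters
HAS_MAIN = re.compile(r"[A-Za-z]*main[A-Za-z]*")

# <id>: a letter followed by any number of letters or digits
PASCAL_ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# <real>: digits, optional decimal part, optional exponent with optional sign
PASCAL_REAL = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def matches(pattern, text):
    """True if the whole string (not just a prefix) is in the language."""
    return pattern.fullmatch(text) is not None


print(matches(HAS_MAIN, "xmainy"), matches(PASCAL_ID, "x1"),
      matches(PASCAL_REAL, "3.14E-2"))
```

Note that, as in the grammar, both the decimal part and the exponent are
optional, so a plain digit string such as 42 is accepted by PASCAL_REAL.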

A CFG for C compound statements (more or less):

<compound stmt> --> { <stmt list> }
<stmt list> --> <stmt> <stmt list> | epsilon
<stmt> --> <compound stmt>
<stmt> --> id : <stmt>
<stmt> --> if ( <expr> ) <stmt>
<stmt> --> if ( <expr> ) <stmt> else <stmt>
<stmt> --> while ( <expr> ) <stmt>
<stmt> --> do <stmt> while ( <expr> ) ;
<stmt> --> for ( <stmt> <expr> ; <expr> ) <stmt>
<stmt> --> switch ( <expr> ) <compound stmt>
<stmt> --> case <expr> : <stmt> | default : <stmt>
<stmt> --> break ; | continue ; | ;
<stmt> --> return <expr> ; | goto id ;

Note that this *does* require recursion (stmt's have stmt's inside)

------------------------------------------------------------------------

The quick story on CFGs and Regular Expressions
(more later if we have time)

CFGs are strictly more powerful: anything you can do with a RE you can
do with a CFG, but not vice versa.

The intuition is that CFGs give you concatenation and alternation, and
can easily emulate Kleene closure, but they also let you define things
recursively *in terms of themselves*, which REs don't.

------------------------------------------------------------------------

Parse Trees

A parse tree for a grammar G is a tree where

   * the root is the start symbol for G

   * the interior nodes are nonterminals of G

   * the leaf nodes are terminal symbols of G (or epsilon).

   * the children of a node T (from left to right) correspond to the
     symbols on the right hand side of some production for T in G.

Every terminal string generated by a grammar has at least one corresponding
parse tree; every valid parse tree represents a string generated by the
grammar (called the yield of the parse tree).

Example: Given the following grammar, find a parse tree for the string
1 + 2 * 3:

  1. E --> number
  2. E --> ( E )
  3. E --> E + E
  4. E --> E - E
  5. E --> E * E
  6. E --> E / E

One parse tree (writing N for the terminal number, followed by its
value) is:

    E --> E --> N --> 1
          +
          E --> E --> N --> 2
                *
                E --> N --> 3
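
The same tree can be written down as a data structure and checked against
the definition.  This is a sketch only: interior nodes are (symbol,
children) tuples, leaves are plain strings, and a string of digits is
taken to be an instance of the terminal number.  The names E_RHS, label,
is_valid, and tree_yield are assumptions for illustration.

```python
# Right-hand sides of the productions for E, as in the grammar above.
E_RHS = {
    ("number",),
    ("(", "E", ")"),
    ("E", "+", "E"),
    ("E", "-", "E"),
    ("E", "*", "E"),
    ("E", "/", "E"),
}


def label(node):
    """Grammar symbol a node stands for; digit strings count as number."""
    if isinstance(node, tuple):
        return node[0]
    return "number" if node.isdigit() else node


def is_valid(node):
    """Check that each interior node's children match a production for E."""
    if isinstance(node, str):
        return True
    symbol, children = node
    return (symbol == "E"
            and tuple(label(c) for c in children) in E_RHS
            and all(is_valid(c) for c in children))


def tree_yield(node):
    """Collect the leaves from left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in tree_yield(child)]


tree = ("E", [
    ("E", ["1"]),
    "+",
    ("E", [("E", ["2"]), "*", ("E", ["3"])]),
])
print(is_valid(tree), " ".join(tree_yield(tree)))
```

The yield of the tree is 1 + 2 * 3, the string we set out to parse.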

========================================================================

Ambiguous Grammars

A grammar for which there are two (or more) different parse trees for
the same terminal string is said to be ambiguous.

The grammar for balanced parentheses given earlier is an example of an
ambiguous grammar:

    P --> ( P ) | P P | epsilon

We can prove this grammar is ambiguous by demonstrating two parse trees
for the same terminal string.

Here are two parse trees for the empty string:

     P --> P --> epsilon
           P --> epsilon

     P --> epsilon

Here are two parse trees for ():

     P --> P --> (
                 P --> epsilon
                 )
           P --> epsilon

     P --> P --> epsilon
           P --> (
                 P --> epsilon
                 )

While in general it may be difficult to prove an arbitrary grammar is
ambiguous, the demonstration of two distinct parse trees for the same
terminal string is sufficient proof that some particular grammar is
ambiguous.

An unambiguous grammar for the set of strings consisting of balanced
parentheses is:

    P --> ( P ) P | epsilon
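
With this grammar the next input character tells us which production to
use: a "(" means P --> ( P ) P, and anything else means P --> epsilon.
That makes a recognizer easy to write by hand.  The following Python
sketch is one way to do it (the names parse_p and is_balanced are made
up; this is not the parser-generation procedure mentioned earlier).

```python
def parse_p(s, i):
    """Match one P starting at index i; return the index just past it."""
    if i < len(s) and s[i] == "(":
        # P --> ( P ) P
        i = parse_p(s, i + 1)
        if i >= len(s) or s[i] != ")":
            raise ValueError("expected ')' at position %d" % i)
        return parse_p(s, i + 1)
    # P --> epsilon: consume nothing
    return i


def is_balanced(s):
    """True if the whole string can be derived from P."""
    try:
        return parse_p(s, 0) == len(s)
    except ValueError:
        return False


for text in ["", "()", "(()())()", "(()", ")(", "())"]:
    print(repr(text), is_balanced(text))
```

Because only one production ever applies at each step, each string in the
language gets exactly one parse tree.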

---------------------------------------

The Problem of Ambiguous Grammars

A parse tree is supposed to display the structure used by a grammar to
generate an input string. This structure is not unique if the grammar is
ambiguous. A problem arises if we attempt to impart meaning to an input
string using a parse tree; if the parse tree is not unique, then the
string has multiple meanings.

We typically use a grammar to define the syntax of a programming
language.  The structure of the parse tree produced by the grammar
imparts some meaning on the strings of the language.

If the grammar is ambiguous, the compiler has no way to determine which
of two meanings to use. Thus, the code produced by the compiler is not
fully determined by the program input to the compiler.

---------------------------------------

Ambiguous Precedence

Recall the grammar for expressions given earlier:

  E --> number
  E --> ( E )
  E --> E + E
  E --> E - E
  E --> E * E
  E --> E / E

This grammar is ambiguous as shown by the two parse trees for the input
string number + number * number:

    E --> E --> number
          +
          E --> E --> number
                *
                E --> number

    E --> E --> E --> number
                +
                E --> number
          *
          E --> number

The first parse tree gives precedence to multiplication over addition; the
second parse tree gives precedence to addition over multiplication.  In
most programming languages, only the former meaning is correct.  As
written, this grammar is ambiguous with respect to the precedence of the
arithmetic operators.

Note (THIS IS IMPORTANT): precedence is NOT a property of the context-free
language consisting of syntactically valid expressions.  It's a property of
the *meaning* (semantics) we *choose* to apply to those strings.  Using a
grammar that "naturally" reflects precedence makes it easier for a compiler
to implement the chosen semantics.
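
To make the "multiple meanings" concrete, the Python sketch below
enumerates every parse tree of a parenthesis-free token list under the
rules E --> number and E --> E op E, and evaluates each one.  It ignores
the rule E --> ( E ); the parentheses in its output only display how each
tree groups the operands.  The name all_parses is an assumption for
illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def all_parses(tokens):
    """Yield (grouping, value) for every parse tree of the token list."""
    if len(tokens) == 1:
        # E --> number
        yield str(tokens[0]), tokens[0]
        return
    # E --> E op E: any operator (odd index) may sit at the root.
    for i in range(1, len(tokens), 2):
        for ltext, lval in all_parses(tokens[:i]):
            for rtext, rval in all_parses(tokens[i + 1:]):
                text = "(%s %s %s)" % (ltext, tokens[i], rtext)
                yield text, OPS[tokens[i]](lval, rval)


for text, value in all_parses([1, "+", 2, "*", 3]):
    print(text, "=", value)
```

For 1 + 2 * 3 it prints (1 + (2 * 3)) = 7 and ((1 + 2) * 3) = 9: the
same string, two trees, two values.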

---------------------------------------

Ambiguous Associativity

Consider again the same grammar for expressions:

  E --> number
  E --> ( E )
  E --> E + E
  E --> E - E
  E --> E * E
  E --> E / E

This grammar is ambiguous even if we only consider operators at the same
precedence level, as in the input string number - number + number:

    E --> E --> number
          -
          E --> E --> number
                +
                E --> number

    E --> E --> E --> number
                -
                E --> number
          +
          E --> number

The first parse tree (incorrectly) gives precedence to the addition
operator; the second parse tree gives precedence to the subtraction
operator. Since we normally group operators left to right within a
precedence level, only the latter interpretation is correct.

As with precedence, associativity is NOT a property of the context-free
expression language; it's a property of the semantics we choose to associate
with that language.
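
The two groupings really do give different answers.  A short Python
sketch (eval_left and eval_right are made-up names) that evaluates the
same token list grouping left to right, as in the second tree, and right
to left, as in the first:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub}


def eval_left(tokens):
    """Group left to right: ((a op b) op c) ..."""
    value = tokens[0]
    for i in range(1, len(tokens), 2):
        value = OPS[tokens[i]](value, tokens[i + 1])
    return value


def eval_right(tokens):
    """Group right to left: (a op (b op c)) ..."""
    if len(tokens) == 1:
        return tokens[0]
    return OPS[tokens[1]](tokens[0], eval_right(tokens[2:]))


tokens = [5, "-", 2, "+", 1]
print(eval_left(tokens), eval_right(tokens))
```

For 5 - 2 + 1 the left-to-right grouping gives (5 - 2) + 1 = 4, while the
right-to-left grouping gives 5 - (2 + 1) = 2.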

Second important note: computer arithmetic is not associative!  Because of
overflow (and, in floating point, rounding), it may not always be the case
that (a+b)+c gives the same result as a+(b+c).
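
A quick illustration in Python.  Python integers do not overflow, so this
sketch uses floating-point values instead, assuming the usual IEEE 754
double-precision floats, where a sum that is too large becomes infinity.
The particular values are chosen just to trigger the effect.

```python
import math

a, b, c = 1e308, 1e308, -1e308

left = (a + b) + c    # a + b is too large for a double and becomes inf
right = a + (b + c)   # b + c is exactly 0.0, so the result is just a

print(left, right)
```

The two groupings of the same three operands give inf and 1e+308.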