CS173
Tuesday Oct 8, 2002

Project 3 is out.  

========================================================================

NDFAs
  epsilon transitions and/or multiple transitions on same input
  allow non-unique transitions - the machine can be in a set of possible states
  Why?
    some deterministic algorithms require a choice at some stage
      e.g., game playing, choose move, follow to decide good/bad
            NDFA makes it easy to model without backtracking
    solving problems easily
      accept AAA or an even number of A's
      easy to write as NDFA.  relatively easy as DFA, but more
        difficult to understand
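
As a rough sketch of the "set of possible states" idea, the NDFA for
"AAA or an even number of A's" can be simulated by tracking every state the
machine might be in.  The state names and the DELTA table below are
assumptions made for illustration; they are not from the lecture.

```python
# Hypothetical NDFA for "exactly AAA, or an even number of A's".
# Each key maps (state, symbol) to the SET of possible next states.
DELTA = {
    ("start", "A"): {"t1", "odd"},  # nondeterministic choice on the first A
    ("t1", "A"): {"t2"},
    ("t2", "A"): {"t3"},            # t3: exactly three A's have been read
    ("odd", "A"): {"even"},
    ("even", "A"): {"odd"},
}
# "start" accepts the empty string, since zero A's is an even number.
ACCEPTING = {"start", "t3", "even"}


def accepts(text):
    """Follow all possible transitions at once, so no backtracking is needed."""
    current = {"start"}
    for symbol in text:
        # Union of the next-state sets of every state we might be in.
        current = set().union(*(DELTA.get((s, symbol), set()) for s in current))
    return bool(current & ACCEPTING)
```

The string is accepted if at least one of the possible states after the last
symbol is accepting; a missing table entry simply kills that branch.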

--------------------------------  
Regular Expressions

RE: an algebraic way to define (generate) patterns
  value of an RE is a pattern describing a set of strings - LANGUAGE
    L(E), language of expression E

Note: If you can write an RE, you can build an automaton for it

Operands in a regular expression can be:

   * symbols from the alphabet over which the regular expression is
     defined.
   * variables, whose values are any pattern defined by a regular expression.
   * epsilon, which denotes the empty string containing no symbols.
   * NULL, which denotes the empty set of strings.

If R is a regular expression, we use L(R) to indicate the *language*
(set of strings) described (generated) by R.

Operators used in regular expressions include:

   * Union: If R1 and R2 are regular expressions, then R1 | R2 (also written
     as R1 U R2 or R1 + R2) is also a regular expression.

     L(R1|R2) = L(R1) U L(R2).

     Union is also sometimes called alternation.

   * Concatenation: If R1 and R2 are regular expressions, then R1R2 (also
     written as R1.R2) is also a regular expression.

     L(R1R2) = L(R1) concatenated with L(R2)
             = { xy | x is in L(R1) and y is in L(R2) }.

   * Kleene closure: If R1 is a regular expression, then R1* (the Kleene
     closure of R1) is also a regular expression.

     L(R1*) = {epsilon} U L(R1) U L(R1R1) U L(R1R1R1) U ...

By convention, in the absence of parentheses, closure has the highest
precedence, followed by concatenation, followed by union.
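
To make the three operators concrete, here is a minimal sketch that treats a
language as a Python set of strings.  The function names are hypothetical,
and the closure is cut off after max_reps repetitions because the real
Kleene closure is usually an infinite set.

```python
def union(l1, l2):
    """L(R1|R2): strings in either language."""
    return l1 | l2


def concat(l1, l2):
    """L(R1R2): every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


def closure(lang, max_reps):
    """Kleene closure, truncated to at most max_reps repetitions."""
    result = {""}   # epsilon (zero repetitions) is always included
    power = {""}    # strings made of exactly k repetitions, starting at k = 0
    for _ in range(max_reps):
        power = concat(power, lang)
        result |= power
    return result
```

In this model epsilon is the set {""} and NULL is the empty set, so
concatenating anything with NULL gives NULL, while the closure of NULL is
still {""}.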

  ------------------------------------------------------------------------

Examples

The set of strings over {0,1} that end in 3 consecutive 1's.

    (0 | 1)* 111

The set of strings over {0,1} that have at least one 1.

    0* 1 (0 | 1)*

The set of strings over {0,1} that have at most one 1.

    0* | 0* 1 0*
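
One way to sanity-check the three patterns above is to compare each against
a direct test on every binary string up to some length.  This sketch uses
Python's re module as a stand-in for the notation in these notes (the
spaces are dropped, since re would treat them as literal characters); the
helper names and the length bound are assumptions.

```python
import re
from itertools import product

# (pattern, plain-Python description of the same language)
EXAMPLES = [
    ("(0|1)*111", lambda s: s.endswith("111")),   # ends in three 1's
    ("0*1(0|1)*", lambda s: s.count("1") >= 1),   # at least one 1
    ("0*|0*10*", lambda s: s.count("1") <= 1),    # at most one 1
]


def all_strings(max_len):
    """Yield every string over {0,1} of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            yield "".join(chars)


def mismatches(pattern, predicate, max_len=8):
    """Strings on which the pattern and the predicate disagree."""
    return [s for s in all_strings(max_len)
            if bool(re.fullmatch(pattern, s)) != predicate(s)]
```

An empty result from mismatches is evidence, not a proof, that the pattern
describes the intended language.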

The set of strings over {A..Z,a..z} that contain the word "main".

    Let <letter> = A | B | ... | Z | a | b | ... | z

    <letter>* main <letter>*

The set of strings over {A..Z,a..z} that contain at least 3 x's.

    <letter>* x <letter>* x <letter>* x <letter>*

The set of identifiers in Pascal.

    Let <letter> = A | B | ... | Z | a | b | ... | z
    Let <digit> = 0 | 1 | 2 | 3 | ... | 9

    <letter> (<letter> | <digit>)*

The set of real numbers in Pascal.
Rules:
    must have a fractional part, an exponent, or both (otherwise it's an
        integer)
    if it has a fractional part, there must be at least one digit on
        each side of the decimal point.
    in an exponent the sign is optional, but there must be at least one
        digit

    Let <digit> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    Let <digit_string> = <digit> <digit>*
    Let <sign> = '+' | '-' | epsilon
    Let <exponent> = 'E' <sign> <digit_string>
    Let <fraction> = '.' <digit_string>

    <digit_string> ( <fraction> | <exponent> | <fraction> <exponent> )

Remember: abbreviations like <digit> and <fraction> are for convenience
ONLY; they do not change the power of the notation.  To see this, just
expand them all out in-line (they aren't allowed to be recursive).
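
The in-line expansion can be sketched with ordinary string substitution.
Below, each abbreviation from the Pascal real-number example is a Python
string, and f-strings paste the definitions into one flat pattern.  Python's
re module stands in for the notation of these notes, and the variable names
are assumptions.

```python
import re

# Each abbreviation is just text; none refers to itself, so the
# substitution below always terminates.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
DIGIT_STRING = f"({DIGIT}{DIGIT}*)"
SIGN = r"(\+|-|)"                       # '+' | '-' | epsilon
EXPONENT = f"(E{SIGN}{DIGIT_STRING})"
FRACTION = rf"(\.{DIGIT_STRING})"

# <digit_string> ( <fraction> | <exponent> | <fraction> <exponent> )
REAL = f"{DIGIT_STRING}({FRACTION}|{EXPONENT}|{FRACTION}{EXPONENT})"


def is_pascal_real(text):
    """True if the whole string matches the fully expanded pattern."""
    return re.fullmatch(REAL, text) is not None
```

Printing REAL shows a single long expression that uses only symbols, union,
concatenation, and closure, which is the point of the remark above.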

  ------------------------------------------------------------------------
Unix Operator Extensions

Regular expressions are used frequently in Unix:

   * In shell command lines
   * Within text editors
   * In the context of pattern matching programs such as grep and egrep

To facilitate construction of regular expressions, Unix recognizes
additional operators. These operators can be defined in terms of the
operators given above; they represent a notational convenience only.

   * character classes: '[' <list of chars> ']'
   * start of a line: '^'
   * end of a line: '$'
   * wildcard matching any character except newline: '.'
   * optional instance: R? = epsilon | R
   * one or more instances: R+ = RR*

NB: notation is NOT the same in all tools.  For example, in most shells
'.' is just a dot, and '?' means "any one (non-newline) character".
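
The claim that these operators are only a notational convenience can be
spot-checked by comparing each extended form with its expansion on all short
strings.  This is a bounded check using Python's re module, which is one
more tool with its own dialect; the pairs, alphabet, and length bound are
assumptions chosen for illustration.

```python
import re
from itertools import product

# (extended form, equivalent form using only the basic operators)
EQUIVALENT_PAIRS = [
    ("(ab)?", "(|ab)"),       # R? = epsilon | R
    ("(ab)+", "(ab)(ab)*"),   # R+ = RR*
    ("[abc]", "(a|b|c)"),     # character class as a union
]


def same_language(p1, p2, alphabet="abc", max_len=6):
    """Compare two patterns on every string up to max_len (not a proof)."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p1, s)) != bool(re.fullmatch(p2, s)):
                return False
    return True
```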

  ------------------------------------------------------------------------

Equivalence of Regular Expressions and Finite Automata

Regular expressions and finite automata have equivalent expressive power:

   * For every regular expression R, there is a corresponding FA that
     accepts the set of strings described by R.

   * For every FA A there is a corresponding regular expression that
     describes the set of strings accepted by A.

The proof is in two parts:

  1. an algorithm that, given a regular expression R, produces an FA A such
     that L(A) == L(R).

  2. an algorithm that, given an FA A, produces a regular expression R such
     that L(R) == L(A).

The first part (construction of an FA from an RE) is what tools like
lex, emacs, and grep do.  The construction relies on epsilon
transitions, but these are just a notational convenience: for every FA
with epsilon transitions there is a corresponding FA without them.
In practice we can deal with epsilon transitions directly in the NDFA
to DFA construction.
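
A minimal sketch of handling epsilon transitions directly in the NDFA to
DFA construction: each DFA state is the epsilon closure of a set of NDFA
states.  The table layout, function names, and the small a*b machine are
assumptions made for this example.

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using only epsilon transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(start, delta, eps, alphabet):
    """Return (dfa_start, dfa_delta); DFA states are frozensets of NDFA states."""
    dfa_start = epsilon_closure({start}, eps)
    dfa_delta = {}
    seen = {dfa_start}
    todo = [dfa_start]
    while todo:
        current = todo.pop()
        for c in alphabet:
            moved = set()
            for s in current:
                moved |= delta.get((s, c), set())
            target = epsilon_closure(moved, eps)  # may be the empty (dead) state
            dfa_delta[(current, c)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return dfa_start, dfa_delta


def dfa_accepts(dfa_start, dfa_delta, ndfa_accepting, text):
    """Run the DFA; text must use only symbols from the alphabet given above."""
    state = dfa_start
    for c in text:
        state = dfa_delta[(state, c)]
    return bool(state & ndfa_accepting)


# Hypothetical NDFA for a*b: state 0 --epsilon--> 1, 1 --a--> 1, 1 --b--> 2.
EPS = {0: {1}}
DELTA = {(1, "a"): {1}, (1, "b"): {2}}
START, ACCEPTING = 0, {2}
```

The resulting table has exactly one entry per (state, symbol) pair and no
epsilon entries, so it is a DFA; a DFA state accepts if it contains any
accepting NDFA state.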

  ------------------------------------------------------------------------

Constructing an FA from an RE

We begin by showing how to construct an FA for the operands in a regular
expression.

   * If the operand is a symbol c, then our FA has two states, s0 (the
     start state) and sF (the final, accepting state), and a transition from
     s0 to sF with label c.

   * If the operand is epsilon, then our FA has two states, s0 (the start
     state) and sF (the final, accepting state), and an epsilon transition
     from s0 to sF.

   * If the operand is NULL, then our FA has two states, s0 (the start
     state) and sF (the final, accepting state), and no transitions.

Given FAs for R1 and R2, we now show how to build an FA for R1R2, R1|R2, and
R1*. Let A (with start state a0 and final state aF) be the machine accepting
L(R1) and B (with start state b0 and final state bF) be the machine
accepting L(R2).

   * The machine C accepting L(R1R2) includes A and B, with start state
     a0, final state bF, and an epsilon transition from aF to b0.  If we
     note that there is no transition out of aF and no transition into
     b0, we can eliminate the epsilon transition and simply merge aF and b0.

   * The machine C accepting L(R1|R2) includes A and B, with a new start
     state c0, a new final state cF, and epsilon transitions from c0 to a0
     and b0, and from aF and bF to cF.

   * The machine C accepting L(R1*) includes A, with a new start state c0, a
     new final state cF, and epsilon transitions from c0 to a0 and cF, and
     from aF to a0, and from aF to cF.
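
The operand and operator constructions above can be sketched as a handful of
small functions.  This is an illustrative sketch under stated assumptions,
not the code used by lex or grep: the Fragment class, the function names,
and the edge-list representation are all hypothetical.  A label of None
stands for an epsilon transition, and the optional merge of aF and b0 is not
performed.

```python
from itertools import count

_ids = count()  # source of fresh state numbers


class Fragment:
    """An FA with one start state, one final state, and (src, label, dst) edges."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges


def symbol(c):
    s0, sF = next(_ids), next(_ids)
    return Fragment(s0, sF, [(s0, c, sF)])


def epsilon():
    s0, sF = next(_ids), next(_ids)
    return Fragment(s0, sF, [(s0, None, sF)])


def null():
    s0, sF = next(_ids), next(_ids)
    return Fragment(s0, sF, [])  # no transitions, so sF is unreachable


def concat(a, b):
    # Start a0, final bF, epsilon transition from aF to b0.
    return Fragment(a.start, b.final, a.edges + b.edges + [(a.final, None, b.start)])


def union(a, b):
    c0, cF = next(_ids), next(_ids)
    new = [(c0, None, a.start), (c0, None, b.start),
           (a.final, None, cF), (b.final, None, cF)]
    return Fragment(c0, cF, a.edges + b.edges + new)


def star(a):
    c0, cF = next(_ids), next(_ids)
    new = [(c0, None, a.start), (c0, None, cF),
           (a.final, None, a.start), (a.final, None, cF)]
    return Fragment(c0, cF, a.edges + new)


def closure(states, edges):
    """Add everything reachable through epsilon (None-labelled) edges."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is None and dst not in result:
                result.add(dst)
                stack.append(dst)
    return result


def accepts(frag, text):
    current = closure({frag.start}, frag.edges)
    for ch in text:
        moved = {dst for src, label, dst in frag.edges
                 if src in current and label == ch}
        current = closure(moved, frag.edges)
    return frag.final in current


def ends_in_111():
    """Machine for (0 | 1)* 111, built bottom-up from fresh operand machines."""
    any_bits = star(union(symbol("0"), symbol("1")))
    return concat(concat(concat(any_bits, symbol("1")), symbol("1")), symbol("1"))
```

Each operand machine should be used only once: passing the same Fragment to
an operator twice would share its states, so a fresh call to symbol is made
for every occurrence of a symbol in the expression.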