CS173 Tuesday Oct 8, 2002

Project 3 is out.
========================================================================
NDFAs

epsilon transitions and/or multiple transitions on the same input
allow non-unique transitions - the machine is in a set of possible states

Why?
  some deterministic algorithms require a choice at some stage
    e.g., game playing: choose a move, follow it to decide good/bad
  an NDFA makes this easy to model without backtracking
  some problems are easier to solve
    e.g., accept AAA or an even number of A's
    easy to write as an NDFA.
    relatively easy as a DFA, but more difficult to understand
--------------------------------
Regular Expressions

An RE is an algebraic way to define (generate) patterns.
The value of an RE is a pattern describing a set of strings - a LANGUAGE.
L(E) is the language of expression E.

Note: If you can write an RE, you can make an automaton.

Operands in a regular expression can be:
  * symbols from the alphabet over which the regular expression is defined.
  * variables, whose values are any pattern defined by a regular expression.
  * epsilon, which denotes the empty string containing no symbols.
  * NULL, which denotes the empty set of strings.

If R is a regular expression, we use L(R) to indicate the *language*
(set of strings) described (generated) by R.

Operators used in regular expressions include:
  * Union: If R1 and R2 are regular expressions, then R1 | R2 (also
    written as R1 U R2 or R1 + R2) is also a regular expression.
    L(R1|R2) = L(R1) U L(R2). Union is also sometimes called alternation.
  * Concatenation: If R1 and R2 are regular expressions, then R1R2 (also
    written as R1.R2) is also a regular expression.
    L(R1R2) = L(R1) concatenated with L(R2), i.e., every string xy with
    x in L(R1) and y in L(R2).
  * Kleene closure: If R1 is a regular expression, then R1* (the Kleene
    closure of R1) is also a regular expression.
    L(R1*) = {epsilon} U L(R1) U L(R1R1) U L(R1R1R1) U ...

By convention, in the absence of parentheses, closure has the highest
precedence, followed by concatenation, followed by union.
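The three operators can be tried out directly on small sets of strings. The following is a minimal Python sketch, not part of the course material: the function names union, concat and closure are made up for illustration, and because L(R1*) is usually infinite, closure takes a hypothetical max_len bound and returns only the strings up to that length.

```python
def union(l1, l2):
    """L(R1|R2) = L(R1) U L(R2)."""
    return l1 | l2


def concat(l1, l2):
    """L(R1R2): every string xy with x in L(R1) and y in L(R2)."""
    return {x + y for x in l1 for y in l2}


def closure(lang, max_len):
    """The strings of L(R1*) whose length is at most max_len.

    The full closure is usually infinite, so the bound is an
    assumption made only to keep the result finite.
    """
    result = {""}          # epsilon is always in the closure
    frontier = {""}
    while frontier:
        # Append one more string from lang to everything found last round.
        frontier = {
            x + y for x in frontier for y in lang if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


# (0 | 1)* 111, restricted to prefixes of length at most 1
ends_in_111 = concat(closure(union({"0"}, {"1"}), 1), {"111"})

# Precedence: a | b c*  is read as  a | (b (c*))
a_or_bc_star = union({"a"}, concat({"b"}, closure({"c"}, 2)))
```

With these bounds, ends_in_111 contains 111, 0111 and 1111, and a_or_bc_star contains a, b, bc and bcc. Note that closure(set(), n) is {""}: the closure of NULL is the language containing only epsilon.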
------------------------------------------------------------------------
Examples

The set of strings over {0,1} that end in 3 consecutive 1's.
    (0 | 1)* 111

The set of strings over {0,1} that have at least one 1.
    0* 1 (0 | 1)*

The set of strings over {0,1} that have at most one 1.
    0* | 0* 1 0*

The set of strings over {A..Z,a..z} that contain the word "main".
    Let <letter> = A | B | ... | Z | a | b | ... | z
    <letter>* main <letter>*

The set of strings over {A..Z,a..z} that contain at least 3 x's.
    <letter>* x <letter>* x <letter>* x <letter>*

The set of identifiers in Pascal.
    Let <letter> = A | B | ... | Z | a | b | ... | z
    Let <digit> = 0 | 1 | 2 | 3 | ... | 9
    <letter> ( <letter> | <digit> )*

The set of real numbers in Pascal.
    Rules:
      must have a fractional part, an exponent, or both
        (otherwise it's an integer)
      if it has a fractional part, there must be at least one digit on
        each side of the decimal point.
      in an exponent the sign is optional, but there must be at least
        one digit
    Let <digit> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    <digit_string> = <digit> <digit>*
    Let <exp> = 'E' <sign> <digit_string>
    Let <sign> = '+' | '-' | epsilon
    Let <frac> = '.' <digit_string>
    <digit_string> ( <frac> | <exp> | <frac> <exp> )

Remember: abbreviations like <letter> and <digit> are for convenience
ONLY; they do not change the power of the notation. To see this, just
expand them all out in-line (they aren't allowed to be recursive).
------------------------------------------------------------------------
Unix Operator Extensions

Regular expressions are used frequently in Unix:
  * In shell command lines
  * Within text editors
  * In the context of pattern matching programs such as grep and egrep

To facilitate construction of regular expressions, Unix recognizes
additional operators. These operators can be defined in terms of the
operators given above; they represent a notational convenience only.
  * character classes: '[' ']'
  * start of a line: '^'
  * end of a line: '$'
  * wildcard matching any character except newline: '.'
  * optional instance: R? = epsilon | R
  * one or more instances: R+ = RR*

NB: notation is NOT the same in all tools. For example, in most shells
'.' is just a dot, and '?' means "any one (non-newline) character".
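The examples above can be checked mechanically. The sketch below uses Python's re module as a stand-in for the Unix tools; that choice is an assumption for illustration, and Python's syntax is close to egrep's for these operators but not identical. The pattern names and the helpers matches and same_language are hypothetical. The real-number pattern is one reading of the three rules stated above (fractional part, exponent, or both).

```python
import re

ENDS_IN_111 = r"(0|1)*111"
AT_LEAST_ONE_1 = r"0*1(0|1)*"
AT_MOST_ONE_1 = r"0*|0*10*"
PASCAL_IDENT = r"[A-Za-z]([A-Za-z]|[0-9])*"

# Pascal reals, built from abbreviations that are expanded in-line.
DIGITS = r"[0-9][0-9]*"               # at least one digit
EXP = r"E(\+|-|)" + DIGITS            # sign is optional (empty alternative)
FRAC = r"\." + DIGITS                 # digits required after the point
PASCAL_REAL = DIGITS + "(" + FRAC + "|" + EXP + "|" + FRAC + EXP + ")"


def matches(pattern, s):
    """True if the WHOLE string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


def same_language(p1, p2, samples):
    """Compare two patterns on a finite sample (a spot check, not a proof)."""
    return all(matches(p1, s) == matches(p2, s) for s in samples)


SAMPLES = ["", "a", "b", "ab", "aa", "abb", "aab"]

# R? = epsilon | R   and   R+ = RR*
optional_ok = same_language(r"ab?", r"a(|b)", SAMPLES)
plus_ok = same_language(r"a+", r"aa*", SAMPLES)
```

For instance, matches(PASCAL_REAL, "1.5E-3") and matches(PASCAL_REAL, "1E10") hold, while "15", "1." and ".5" are rejected. re.fullmatch is used rather than re.search because the examples describe entire strings, not substrings.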
------------------------------------------------------------------------
Equivalence of Regular Expressions and Finite Automata

Regular expressions and finite automata have equivalent expressive power:
  * For every regular expression R, there is a corresponding FA that
    accepts the set of strings described by R.
  * For every FA A there is a corresponding regular expression that
    describes the set of strings accepted by A.

The proof is in two parts:
  1. an algorithm that, given a regular expression R, produces an FA A
     such that L(A) = L(R).
  2. an algorithm that, given an FA A, produces a regular expression R
     such that L(R) = L(A).

The first part (construction of an FA from an RE) is what tools like
lex, emacs, and grep do. The construction relies on epsilon transitions,
but these are just a notational convenience: for every FA with epsilon
transitions there is a corresponding FA without them. In practice we can
deal with epsilon transitions directly in the NDFA to DFA construction.
------------------------------------------------------------------------
Constructing an FA from an RE

We begin by showing how to construct an FA for the operands in a regular
expression.
  * If the operand is a symbol c, then our FA has two states, s0 (the
    start state) and sF (the final, accepting state), and a transition
    from s0 to sF with label c.
  * If the operand is epsilon, then our FA has two states, s0 (the start
    state) and sF (the final, accepting state), and an epsilon
    transition from s0 to sF.
  * If the operand is NULL, then our FA has two states, s0 (the start
    state) and sF (the final, accepting state), and no transitions.

Given FAs for R1 and R2, we now show how to build an FA for R1R2, R1|R2,
and R1*. Let A (with start state a0 and final state aF) be the machine
accepting L(R1) and B (with start state b0 and final state bF) be the
machine accepting L(R2).
  * The machine C accepting L(R1R2) includes A and B, with start state
    a0, final state bF, and an epsilon transition from aF to b0.
    If we note that there is no transition out of aF and no transition
    into b0, we can eliminate the epsilon transition and simply merge aF
    and b0.
  * The machine C accepting L(R1|R2) includes A and B, with a new start
    state c0, a new final state cF, and epsilon transitions from c0 to
    a0 and b0, and from aF and bF to cF.
  * The machine C accepting L(R1*) includes A, with a new start state
    c0, a new final state cF, and epsilon transitions from c0 to a0 and
    cF, and from aF to a0, and from aF to cF.
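The construction rules above translate almost line for line into code. The following Python sketch is one possible rendering, not the implementation used by lex or grep. The tuple encoding of regular expressions ("sym", "eps", "null", "cat", "alt", "star") and the names NFA, build, eps_closure and accepts are assumptions made for illustration. For concatenation it keeps the epsilon transition from aF to b0 rather than merging the two states. The machine is run by tracking the set of states it could be in, so no backtracking is needed.

```python
from itertools import count


class NFA:
    def __init__(self):
        self.edges = {}        # state -> list of (label, target); None = epsilon
        self._ids = count()

    def new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))


def build(nfa, r):
    """Add a machine for r to nfa; return its (start, final) states."""
    kind = r[0]
    if kind in ("sym", "eps", "null"):
        s0, sF = nfa.new_state(), nfa.new_state()
        if kind == "sym":
            nfa.add(s0, r[1], sF)      # one transition labelled c
        elif kind == "eps":
            nfa.add(s0, None, sF)      # one epsilon transition
        return s0, sF                  # NULL: no transitions at all
    if kind == "cat":
        a0, aF = build(nfa, r[1])
        b0, bF = build(nfa, r[2])
        nfa.add(aF, None, b0)
        return a0, bF
    if kind == "alt":
        a0, aF = build(nfa, r[1])
        b0, bF = build(nfa, r[2])
        c0, cF = nfa.new_state(), nfa.new_state()
        for src, dst in ((c0, a0), (c0, b0), (aF, cF), (bF, cF)):
            nfa.add(src, None, dst)
        return c0, cF
    if kind == "star":
        a0, aF = build(nfa, r[1])
        c0, cF = nfa.new_state(), nfa.new_state()
        for src, dst in ((c0, a0), (c0, cF), (aF, a0), (aF, cF)):
            nfa.add(src, None, dst)
        return c0, cF
    raise ValueError(f"unknown node kind: {kind!r}")


def eps_closure(nfa, states):
    """All states reachable from states using only epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in nfa.edges[s]:
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(r, text):
    """Build the machine for r and run it on text, tracking a set of states."""
    nfa = NFA()
    start, final = build(nfa, r)
    current = eps_closure(nfa, {start})
    for ch in text:
        moved = {t for s in current for label, t in nfa.edges[s] if label == ch}
        current = eps_closure(nfa, moved)
    return final in current


# (0 | 1)* 111
BIT = ("alt", ("sym", "0"), ("sym", "1"))
ONE = ("sym", "1")
ENDS_IN_111 = ("cat", ("star", BIT), ("cat", ONE, ("cat", ONE, ONE)))
```

For example, accepts(ENDS_IN_111, "0111") is true and accepts(ENDS_IN_111, "110") is false. The machine for ("null",) accepts nothing, while the machine for ("star", ("null",)) accepts exactly the empty string, matching L(R1*) containing epsilon.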