
Lecture 26: Regular expressions

Regular expressions

Regular expressions are patterns that match certain strings. They give a way to define a language: the language of a regular expression is the set of all strings that match the pattern.

There are six ways to construct regular expressions. Formally, the set of regular expressions is formed by the following grammar:

r ∈ RE ::= ∅ | ε | a | r_1r_2 | (r_1+r_2) | r^*

∅ matches no strings. L(∅) = ∅.

ε matches only the empty string. L(ε) = \{ε\}.

a matches the string "a". L(a) = \{a\}.

r_1r_2 (the concatenation of r_1 and r_2) matches any string that can be broken into two parts x and y, with x matching r_1 and y matching r_2. L(r_1r_2) = \{xy \mid x \in L(r_1), y \in L(r_2)\}.

r_1+r_2 (the alternation of r_1 and r_2) matches any string that matches either r_1 or r_2. Formally, L(r_1+r_2) = L(r_1) \cup L(r_2). Note that some sources write the alternation as r_1|r_2 or r_1 \cup r_2.

r^* (the Kleene star or Kleene closure of r) matches the concatenation of any number (including 0) of strings, each of which matches r. Formally, L(r^*) = \{x_1x_2\cdots x_n \mid n \geq 0, x_i \in L(r)\}.
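The six forms and their languages can be sketched directly in code. The following is a minimal illustration, not part of the lecture: the class names (Empty, Eps, Char, Cat, Alt, Star) and the helper functions ends and matches are names chosen here for the example. The function ends(r, s, i) computes every position j such that s[i:j] is in L(r), following the six definitions above case by case.

```python
from dataclasses import dataclass


class RE:
    """Base class for the six regular expression forms."""


@dataclass(frozen=True)
class Empty(RE):   # ∅
    pass


@dataclass(frozen=True)
class Eps(RE):     # ε
    pass


@dataclass(frozen=True)
class Char(RE):    # a single character a
    c: str


@dataclass(frozen=True)
class Cat(RE):     # r_1 r_2
    r1: RE
    r2: RE


@dataclass(frozen=True)
class Alt(RE):     # r_1 + r_2
    r1: RE
    r2: RE


@dataclass(frozen=True)
class Star(RE):    # r^*
    r: RE


def ends(r, s, i):
    """Return the set of positions j such that s[i:j] is in L(r)."""
    if isinstance(r, Empty):
        return set()                      # matches no strings
    if isinstance(r, Eps):
        return {i}                        # matches only the empty string
    if isinstance(r, Char):
        return {i + 1} if s[i:i + 1] == r.c else set()
    if isinstance(r, Cat):
        # split into x matching r1, then y matching r2
        return {k for j in ends(r.r1, s, i) for k in ends(r.r2, s, j)}
    if isinstance(r, Alt):
        return ends(r.r1, s, i) | ends(r.r2, s, i)
    if isinstance(r, Star):
        reached = {i}                     # zero repetitions
        frontier = {i}
        while frontier:                   # add one more repetition each round
            new = {k for j in frontier for k in ends(r.r, s, j)} - reached
            reached |= new
            frontier = new
        return reached
    raise TypeError(f"not a regular expression: {r!r}")


def matches(r, s):
    """True if the whole string s is in L(r)."""
    return len(s) in ends(r, s, 0)
```

For example, matches(Star(Cat(Char("a"), Char("b"))), "abab") is true, while the same expression does not match "aba".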

Important note: Many programming languages add other kinds of regular expressions, such as r^+ to denote one or more r's, or r? to denote 0 or 1 r's. While convenient for programming, these additional forms make the theory more complicated without adding any expressive power. For this class, the six forms above are the only forms of regular expressions. You can achieve the same effect as most of these extensions by translating them to our basic regular expressions. For example, you can use rr^* to denote one or more repetitions of r, and (r + ε) to denote 0 or 1 repetitions.
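These translations can be spot-checked with Python's re module. Two notational differences from these notes: Python writes alternation as | rather than +, and an empty alternative (as in `(?:ab|)`) plays the role of ε. The helper name language and the choice r = ab are assumptions made for this sketch. It compares languages only up to a bounded length, so it is evidence for the equivalence, not a proof.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=6):
    """All strings over `alphabet` of length <= max_len that fully match `pattern`."""
    compiled = re.compile(pattern)
    found = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if compiled.fullmatch(s):
                found.add(s)
    return found


# Take r = ab. (?:...) is Python's grouping without capture.
plus_sugar = language(r"(?:ab)+")          # r^+
plus_basic = language(r"(?:ab)(?:ab)*")    # r r^*
opt_sugar = language(r"(?:ab)?")           # r?
opt_basic = language(r"(?:ab|)")           # (r + ε)

print(plus_sugar == plus_basic)
print(opt_sugar == opt_basic)
```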

Examples

Numbers: Let Σ be the ASCII character set (including upper- and lower-case letters, digits, and punctuation). Let D ::= 0 + 1 + 2 + \cdots + 9. Then D matches any single digit (e.g. D matches 0, D matches 1, etc.). D^* matches digit strings of any length, including the empty string. DD^* matches any natural number with length \geq 1. ('-'+ε)DD^*(ε+'.'DD^*) matches numbers that contain at least one digit, optionally start with a '-' character, and are optionally followed by a decimal point and one or more digits.
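As a rough check, the last expression can be transcribed into Python's re syntax and tried on a few strings. In this sketch `[0-9]` stands for D, | stands for the notes' +, and an empty alternative stands for ε; the names NUMBER and is_number are chosen here for illustration.

```python
import re

D = "[0-9]"  # shorthand for 0 + 1 + ... + 9

# ('-' + ε) D D^* (ε + '.' D D^*)
NUMBER = re.compile(rf"(?:-|){D}{D}*(?:|\.{D}{D}*)")


def is_number(s):
    """True if the whole string s matches the number expression."""
    return NUMBER.fullmatch(s) is not None


for s in ["0", "42", "-7", "3.14", "-0.5", "", "-", ".5", "5.", "1.2.3"]:
    print(repr(s), is_number(s))
```

Note that the expression rejects ".5" and "5." because a decimal point must have at least one digit on each side.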

Dates: With D defined as above, DD/DD represents dates of the form mm/dd. We can be more specific: Let N ::= 1 + 2 + \cdots + 9 be a regular expression matching non-zero digits. We can represent numbers between one and 12 with the expression Mo ::= (1(0+1+2) + N). Then we can build regular expressions for dates that start with valid month numbers using the regular expression Mo/DD. We could similarly restrict days to be numbers between 1 and 31. We could even match the number of days and the month, for example by writing

\begin{aligned} Date &::= 2/(N + 1D + 2D) && \text{29 day months} \\ &+ (4+6+9+11)/(N + 1D + 2D + 30) && \text{30 day months}\\ &+ (1+3+5+7+8+10+12)/(N+1D+2D+30+31) && \text{31 day months} \\ \end{aligned}

We could optionally add years:

DateWithOptionalYear ::= Date(ε + /DD + /19DD + /20DD)
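The date expressions can be transcribed the same way. This sketch assumes that every year alternative begins with a slash (so the last one is /20DD), and it uses `[1-9]` for N and `[0-9]` for D; the Python names below are chosen for this example only.

```python
import re

N = "[1-9]"  # non-zero digit
D = "[0-9]"  # any digit

DAY29 = f"(?:{N}|1{D}|2{D})"
DAY30 = f"(?:{N}|1{D}|2{D}|30)"
DAY31 = f"(?:{N}|1{D}|2{D}|30|31)"

DATE = (
    f"(?:2/{DAY29}"                       # 29 day months
    f"|(?:4|6|9|11)/{DAY30}"              # 30 day months
    f"|(?:1|3|5|7|8|10|12)/{DAY31})"      # 31 day months
)

# Date(ε + /DD + /19DD + /20DD), assuming a slash before each year form
DATE_WITH_OPTIONAL_YEAR = re.compile(
    DATE + f"(?:|/{D}{D}|/19{D}{D}|/20{D}{D})"
)


def is_date(s):
    """True if the whole string s matches DateWithOptionalYear."""
    return DATE_WITH_OPTIONAL_YEAR.fullmatch(s) is not None


for s in ["2/29", "2/30", "4/30", "4/31", "12/31", "13/1", "2/29/2024"]:
    print(s, is_date(s))
```

The expression accepts 2/29 in every year, since a regular expression of this shape does not relate the day to the year.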

Even number of 0's: We could write a RE for binary strings with an even number of 0's and any number of 1's by noticing that every 0 must be paired with another 0. We could start with a pattern matching an even number of 0's with no 1's: (00)^*. We could then allow ourselves to add any number of 1's between the 0's: (01^*0)^*. We can also add 1's at the beginning of the string, or between repetitions of the 0's: 1^*(01^*01^*)^*. Some experimentation shows that this is sufficient, although you could explicitly allow 1's everywhere: 1^*(1^*01^*01^*)^*1^*.
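The "experimentation" can be automated by comparing each candidate against the specification on every short binary string. This is a sketch using Python's re module (with `(?:...)` for grouping); the helper names binary_strings and disagreements are assumptions of this example, and the bound max_len=10 is arbitrary. An exhaustive check up to a bounded length can find counterexamples but does not prove correctness.

```python
import re
from itertools import product


def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def disagreements(pattern, max_len=10):
    """Binary strings where `pattern` and 'even number of 0s' disagree."""
    compiled = re.compile(pattern)
    return [
        s
        for s in binary_strings(max_len)
        if (compiled.fullmatch(s) is not None) != (s.count("0") % 2 == 0)
    ]


# The intermediate attempts miss strings with 1's outside the pairs of 0's.
print(disagreements(r"(?:00)*")[:3])
print(disagreements(r"(?:01*0)*")[:3])

# The final expression agrees with the specification on every string tested.
print(disagreements(r"1*(?:01*01*)*"))
```

The last call prints an empty list, while the first two report counterexamples such as "1", which has zero 0's but is not matched.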

Checking your work: When writing regular expressions, it's always good to test your answer by coming up with a variety of strings that should match and a variety of strings that shouldn't, and testing them out.