Problem Set 1: Lexical Analysis
due: Monday, September 7
When writing regular expressions, you may use the extended RE syntax presented
in class, such as [^chars] to indicate the possibility of any character except the characters
listed in chars.
You may find the Dot and Graphviz packages helpful for generating drawings
of DFAs and other graphs. You can get these
packages for multiple OSes from the
Graphviz downloads page.
An example of a DFA drawn using Graphviz may be useful.
- Write a regular expression for HTTP and FTP URLs.
An HTTP URL consists of four parts: the protocol (http:// or
ftp://), the DNS name or the IP address of a host, an optional port
number, and the pathname for a file. For simplicity, let's assume:
- A DNS name is a list of non-empty alphabetical strings separated by periods.
- An IP address consists of four non-negative integers of at most three digits each,
separated by periods.
- A port number is a positive integer following a colon. (e.g., :8080)
- The pathname part is a Unix-style absolute pathname. The allowed symbols
are letters, digits, period and slash. The sequence "//" is forbidden: no empty directory names. A pathname may end with a slash.
- A comment in the C language begins with the two-character sequence "/*",
followed by the body of the comment, and then the sequence "*/". The body of
the comment may not contain the sequence "*/", although it may contain the "*"
and "/" characters.
-
Show that the following regular expression does not correctly describe C comments. (Quoted
strings indicate that the characters inside are to be taken literally rather than as
RE operators, and whitespace is not significant.)
"/*" (/)* ([^*/] | [^*]/ | "*"[^/])* ("*")* "*/"
- Draw the DFA that accepts C comments
- Use your DFA to write a regular expression that correctly describes C comments.
- Convert the following NFA to a DFA. Make sure you show the initial and final states.
Label each DFA state with the set of NFA states to which it corresponds.
- Consider this lexer specification:
aa(b+) { return new Tok1(); }
aa { return new Tok2(); }
ab*ab* { return new Tok3(); }
- Show the NFA that accepts strings matching one of the above three patterns.
- Transform the NFA to a DFA. Label each DFA state with the set of NFA states to
which it corresponds.
- What is the sequence of states that the lexer goes through when given the input
aabaaabbba?