CS 4120: Introduction to Compilers
Fall 2009

Problem Set 1: Lexical Analysis

due: Monday, September 7

When writing regular expressions, you may use the extended RE syntax presented in class, such as [^chars] to indicate the possibility of any character except the characters listed in chars.

You may find the Dot and Graphviz packages helpful for generating drawings of DFAs and other graphs. You can get these packages for multiple OSes from the Graphviz downloads page. An example of a DFA drawn using Graphviz may be useful.

  1. Write a regular expression for HTTP and FTP URLs. An HTTP URL consists of four parts: the protocol (http:// or ftp://), the DNS name or the IP address of a host, an optional port number, and the pathname for a file. For simplicity, let's assume:
  2. A comment in the C language begins with the two-character sequence "/*", followed by the body of the comment, and then the sequence "*/". The body of the comment may not contain the sequence "*/", although it may contain the "*" and "/" characters.
    1. Show that the following regular expression does not correctly describe C comments. (Quoted strings indicate that the characters inside are to be taken literally rather than as RE operators, and whitespace is not significant.)
      "/*" (/)* ([^*/] | [^*]/ | "*"[^/])* ("*")* "*/"
    2. Draw the DFA that accepts C comments
    3. Use your DFA to write a regular expression that correctly describes C comments.
  3. Convert the following NFA to a DFA. Make sure you show the initial and final states. Label each DFA state with the set of NFA states to which it corresponds.

  4. Consider this lexer specification:
      aa(b+)  { return new Tok1(); }
      aa      { return new Tok2(); }
      ab*ab*  { return new Tok3(); }
      
    1. Show the NFA that accepts strings matching one of the above three patterns.
    2. Transform the NFA to a DFA. Label each DFA state with the set of NFA states to which it corresponds.
    3. What is the sequence of states that the lexer goes through when given the input aabaaabbba?