Introduction to Natural Language Processing
COM S/LING/COGST 474
Spring 2002
Mats Rooth
Morrill 203A (enter through Linguistics main office)
Office hour: Wednesday 3-4
Books
Foundations of Statistical Natural Language Processing, by Christopher D. Manning and Hinrich Schütze, MIT Press.
Syntactic Theory: A Formal Introduction, by Ivan A. Sag and Thomas Wasow, CSLI Publications.
Introduction to Formal Languages, by György E. Révész, Dover Publications.
If you haven't taken the data structures and algorithms prerequisite, the following is useful.
Data Structures and Problem Solving Using Java, by Mark Allen Weiss, Addison Wesley Longman, Inc.
All four books are available in the bookstore.
Class requirements
Problem sets consist of conventional problems, programming, grammar hacking, or experiments. About five problem sets will be assigned. They are due two weeks after you get them. Problem sets contribute 50% of the final grade.
In-class labs introduce some NLP application, and include a small problem or experiment. There will be three or four in-class labs. Turn in a short lab report (a few pages) describing your results. These reports are distinct from problem sets, and contribute 10% of the final grade.
Term projects involve programming, NLP experiments, or a combination of the two. Turn in a paper (about 10 pages is ideal), and if relevant arrange to give a demo. I will distribute a list of suggested projects. A one-page project proposal will be due about five weeks before the end of classes. There won't be any problem sets during the last four weeks of class, so you can concentrate on the projects. Term projects contribute 35% of the final grade.
Utilities are simple sanity checks or utility programs such as format conversions which are relevant to lectures or labs. Utilities should be callable from the Unix shell. They will be assigned irregularly. Do one or more of these, for 5% of the final grade.
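As an illustration of the kind of program meant by a utility, here is a minimal sketch of a format conversion that can be called from the Unix shell. It is not an assigned utility: the script name tok_per_line.py, the function to_token_per_line, and the input and output formats (one sentence per line in, one token per line out, with a blank line after each sentence) are all assumptions made for the example. Python is used only for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical utility: convert one-sentence-per-line text to one token per line."""
import sys


def to_token_per_line(lines):
    """Yield each whitespace-separated token as its own output line.

    An empty string is yielded after every non-empty sentence, so the
    printed output has a blank line marking each sentence boundary.
    """
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank input lines
        for token in tokens:
            yield token
        yield ""  # sentence boundary


def main(paths):
    """Convert each named file and print the result to standard output."""
    if not paths:
        print("usage: tok_per_line.py FILE...", file=sys.stderr)
        return
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for out_line in to_token_per_line(handle):
                print(out_line)


if __name__ == "__main__":
    main(sys.argv[1:])
```

From the shell, such a script would be run as `python3 tok_per_line.py sentences.txt > tokens.txt`, where both file names are placeholders.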
Lab and software
The computational linguistics lab is in Morrill 203A, in the basement of Morrill Hall. This is the location for in-class labs, and you may work there on your own when there is no class or meeting in the lab.
The Sun Ultra 10 machines in the lab run Solaris 8, which is a Unix operating system. Minimally, you will need to be able to work with the Unix shell and edit text files (usually with emacs or vi). In addition, it is useful to know some scripting language (perl, awk, and/or Unix shell scripts) and make.
You will receive a login and home directory. Most course software is installed in /usr/local/bin. Material for labs is under /fsys/blue/a/Lab. Manual pages are in /usr/local/man.
lopar, from Helmut Schmid at the University of Stuttgart, is a parser and parameter estimator for probabilistic context-free grammars. In addition to Solaris, it is available for Linux. You can get your own copy from his web page.
The system xfst from Xerox is an implementation of the calculus of regular relations. They license it to universities for educational use. Reportedly, Windows, Linux, and Solaris versions will be included with the textbook Finite State Morphology by Lauri Karttunen and Kenneth Beesley (Cambridge University Press).
The parser YAP is also from Helmut Schmid at the University of Stuttgart. It is a feature constraint parser which is compatible in certain ways with lopar. You can get your own copy from his page, but only for Solaris.
The North American News Text Corpus, licensed from the Linguistic Data Consortium (LDC), includes material from AP and the New York Times.
The Penn Treebank is a database of syntactic trees, and is also licensed from the LDC.
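To make the term probabilistic context free grammar concrete, the following is a toy sketch of how such a grammar assigns a probability to a tree: each rule carries a probability, and the probability of a tree is the product of the probabilities of the rules used in it. The grammar, the rule probabilities, the tuple encoding of trees, and the names RULES and tree_probability are all invented for this example. This is not lopar's grammar format or interface.

```python
# Toy grammar (an assumption for illustration): each (lhs, rhs) pair maps to
# a rule probability, and the probabilities for each lhs sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("dogs",)): 0.5,
    ("NP", ("cats",)): 0.5,
    ("VP", ("V", "NP")): 0.75,
    ("VP", ("V",)): 0.25,
    ("V", ("chase",)): 1.0,
}


def tree_probability(tree):
    """Return the product of the rule probabilities used in a tree.

    A tree is a tuple (label, child, ...), where each child is either
    another tree or a word given as a plain string.
    """
    label, *children = tree
    # The right-hand side of the rule is the sequence of child labels or words.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    probability = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            probability *= tree_probability(child)
    return probability


TREE = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))
print(tree_probability(TREE))  # 1.0 * 0.5 * 0.75 * 1.0 * 0.5 = 0.1875
```

A parameter estimator addresses the inverse problem of choosing the rule probabilities from data, which this sketch does not attempt.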
Topics
Tree syntax of natural language
Context free grammars and parse forest algorithms
Weighted grammars
Markup and testing methodology
Estimation of weighted grammars
Feature constraint grammars
Lexicalized probabilistic grammars
Feature formalisms and parsing
Filler-gap dependencies
Computational morphology
Beyond the context free
The items correspond roughly to weeks, but the schedule is stretched out because of in-class labs.
Academic Integrity
You will work in groups during in-class labs. Each student turns in a separate lab report, which lists the members of the working group.
Problem sets (including problems, programs, and experiments) should be a result of your individual efforts. This does not exclude discussing problems in a general way, discussing strategies for algorithms, or giving or receiving help in experimental or programming technique. You may exchange utility programs, such as scripts which reformat data or tabulate results, but be cautious about assuming that someone else's utility is correct. Specify and credit any help you receive. See http://www.cs.cornell.edu/ugrad/AcadInteg.html for general and CS-specific guidelines.
You may not copy licensed software or corpora from the lab machines.