CS 674: Natural Language Processing
Spring 2002

Instructor: Lillian Lee, 4152 Upson, . Office hours: Wednesdays 2-3:30, or by appointment
Time and location: MW 11:15-12:05, Upson 207

Announcements: See the cornell.class.cs674 newsgroup.
Reaction essay info and READING LIST: click here
Some online resources: click here
Handouts and lecture references:

Lecture index:

Intro: Jan 21

FB-CFGs: Jan 23

TAGs: Jan 28, Jan 30, Feb 4

Stats & Zipf's law: Feb 6

HMMs: Feb 11, Feb 13 and 18

Good-Turing: Feb 20

Discourse: Feb 25, Feb 27

WSD: Mar 4, Mar 6, Mar 11
Segmentation: Mar 13

Semantics: Mar 25, Mar 27

Info. theory: Apr 1

Clustering: Apr 3, Apr 8

MT: Apr 10, Apr 15
EM: Apr 17

Summarization: Apr 22

"Ling. vs. Stats": Apr 24

1/21: Introduction

Four handouts: course information sheet (includes syllabus), Student information sheet, Reaction essay readings, Introductory paper
Introductory paper: ``I'm sorry Dave, I'm afraid I can't do that'': Linguistics, Statistics, and Natural Language Processing circa 2001
Bill Gates quote, Gartner symposium, 1997. (Here is an alternate link.)
About the word "undertoad" (see John Irving, The World According to Garp, 1976.)
Mary Klages, 2001. Structuralism and Sausurre
Roni Rosenfeld, the Universal Speech Interface Manifesto
Alan M. Turing, 1950. Computing machinery and intelligence. Mind LIX(236), pp. 433-460.

1/23: Feature-based context-free grammars

Handout: lecture examples
See also ch. 4 of Allen or ch. 9 and 11 of Jurafsky and Martin.

1/28: Tree adjoining grammars

Handout: lecture examples
Christopher Culy, 1985. The complexity of the vocabulary of Bambara. Linguistics and Philosophy, 8:345-351.
Aravind K. Joshi, Leon S. Levy, Masako Takahashi, 1975. Tree adjunct grammars. Journal of Computer and System Sciences 10(1), pp. 136-163.
Aravind K. Joshi and Yves Schabes, Tree-adjoining grammars. I don't know of a publication source for this.
Aravind K. Joshi, K. Vijay-Shankar, and David Weir, 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational Issues in Natural Language Processing, pp. 31--81. MIT Press.
Aravind K. Joshi, 1985. Tree adjoining grammars: how much context-sensitivity is required to provide reasonable structural description. In David R. Dowty, Lauri Karttunen, and Arnold M. Zwicky, eds, Natural Language Processing: psychological, computational, and theoretical perspectives, Cambridge. Note -- there may be an error in the parsing-time claim.
Geoff Pullum, 1986. Footloose and context-free. Natural Language and Linguistic Theory 4, pp. 283--289. Reprinted in The Great Eskimo Vocabulary Hoax, U. of Chicago Press, 1991.
Stuart Shieber, 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333-343.
Stuart Shieber and Yves Schabes, 1990. Synchronous Tree-Adjoining Grammars. In Proceedings of the 13th International Conference on Computational Linguistics, volume 3, pp. 1--6.
K. Vijay-Shankar and Aravind K. Joshi, 1985. Some computational properties of Tree Adjoining Grammars. Proc. of the 23rd ACL, pp. 82--93.
K. Vijay-Shankar and David Weir, 1994. The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27:511-545.
The XTAG home page.
TAG+6: The 6th International Workshop on Tree Adjoining Grammars and Related Frameworks .

[back to lecture index | back to top]

1/30: TAGs with adjunction constraints

Handout: Lecture examples
Anne Abeillé and Yves Schabes, 1989. Parsing idioms in lexicalized TAGs. Fourth Conference of the European Chapter of the Association for Computational Linguistics (EACL '89).
Tilman Becker, Aravind K. Joshi, and Owen Rambow, 1991. Long distance scrambling and tree adjoining grammars. EACL, pp. 21--26.
Yves Schabes and Stuart Shieber, 1994. An alternative conception of Tree-adjoining derivation. Computational Linguistics 20(1), pp. 91--124.
K. Vijay-Shankar and Aravind K. Joshi, 1988. Feature structure based tree adjoining grammars. Proceedings of the 12th International Conference on Computational Linguistics (COLING '88).

[back to lecture index | back to top]

2/04: Feature-based TAGs

Handout: Lecture examples
Steven Abney, 1996. Statistical methods and linguistics. The Balancing Act, Judith Klavans and Philip Resnik, eds, MIT Press. Paper available online as ps or pdf
Sanguthevar Rajasekaran and Shibu Yooseph, 1995. TAL recognition in O(M(n^2)) time. Proc. of the 33rd ACL, pp. 166--173. (Link is to the journal version, in Journal of Computer and System Sciences, 56(1), pp. 83--89, 1998.)
Giorgio Satta, 1994. Tree Adjoining Grammar parsing and Boolean matrix multiplication. Computational Linguistics, 20(2), pp. 173--191.

[back to lecture index | back to top]

2/06: (Some) statistics of language

Handout: Lecture handout
J.B. Estoup, 1916. Gammes Stenographiques. Institut Stenographique de France.
W. Nelson Francis and Henry Kucera, 1982. Frequency Analysis of English Usage. Houghton Mifflin. Also of potential interest: the Brown corpus manual, 1979; and a search interface.
Tommi Jaakkola and David Haussler, 1998. Exploiting generative models in discriminative classifiers. NIPS 11.
Mark Johnson, 2001. Joint and conditional estimation of tagging and parsing models. Proc. of ACL, pp. 314--321.
John Lafferty, Andrew McCallum, and Fernando Pereira, 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. of ICML.
Benoit Mandelbrot, 1957. Th�orie math�matique de la d'Estoup-Zipf. Inst. de Statistique de l'Univ�rsit�.
George A. Miller, 1957. Some effects of intermittent silence. American J. Psychology 70, pp. 311-313.
Andrew Y. Ng and Michael Jordan, 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes. Proc. of NIPS.
Y. Dan Rubenstein and Trevor Hastie, 1997. Discriminative vs Informative Learning. Proc. of KDD.
George Zipf, 1949. Human Behavior and the principle of least effort. Addison-Wesley Press.
Zipf's Law (webpage by Wentian Li). Many Zipf's Law refs in many fields.
See also section 1.4 of the Manning/Schütze text or chapter 4 of Timothy C. Bell, John G. Cleary, and Ian H. Witten, Text Compression, 1990.

[back to lecture index | back to top]

2/11: Hidden Markov Models

Handout: Lecture handout
Lawrence R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257--286.

Errata by Ali Rahimi

Lawrence R. Rabiner and B. H. Juang, 1986. An introduction to hidden Markov models. IEEE ASSP Magazine, pp. 4-15.
See also Ch. 9 of Manning/Schütze, Ch. 2 of Jelinek, or Ch. 7 of Jurafsky/Martin. Ch. 3 of the Durbin/Eddy/Krogh/Mitchison introduces the same material from a computational-biology perspective.

2/13 and 2/18: Hidden Markov Models (cont.)

Handout: Lecture handout
Bernard Merialdo, 1994. Tagging English text with a probabilistic model. Computational Linguistics 20 (2), pp. 155-172.

[back to lecture index | back to top]

2/20: Good-Turing smoothing

Handout: Lecture handout
Peter F. Brown and Vincent J. DellaPietra and Peter V. deSouza and Jennifer C. Lai and Robert L. Mercer, 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4), pp. 467--479.
Stanley F. Chen and Joshua Goodman, 1996. An empirical study of smoothing techniques for language modeling Proceedings of the 34th Meeting of the Association for Computational Linguistics, pp 310--318. TR version: TR-10-98, Computer Science Group, Harvard University, 1998.
Kenneth W. Church and William A. Gale, 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language 5, pp. 19--54.
Michael Collins and James Brooks, 1995. Prepositional Attachment through a Backed-off Model. Third Workshop on Very Large Corpora, pp. 27--38.
I. J. Good, 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40, pp. 237--264.
Pierre Simon Laplace. Essai Philosophique sur les probabilities. There appear to be several editions; Ristad's A natural law of succession dates it to 1775.
Arthur Nadas, 1985. On Turing's formula for word probabilities. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-33 (6), pp. 1414--1416.
Geoffrey Sampson's Good-Turing page.
See also section 6.2 of the Manning/Schütze text or chapter 15 of the Jelinek text.

[back to lecture index | back to top]

2/25: Discourse phenomena

Handout: Lecture handout
Ralph Grishman, 1986. Computational Linguistics: An Introduction. Cambridge.
Jerry R. Hobbs, 1978. Resolving Pronoun References. Reprinted in Grosz, Sparck Jones, and Webber, Readings in Natural Language Processing.
Jerry R. Hobbs, 1979. Coherence and co-reference. Cognitive Science 3(1), pages 67--82.
Ray Jackendoff, 1972. Semantic Interpretation in Generative Grammar. MIT Press.
William C. Mann and Sandra A. Thompson, 1986. Relational propositions in discourse. Discourse Processes 9(1), pp. 57--90.
Vladimir Nabokov, Lolita. 1955.
Candace Sidner, 1979, Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse. PhD Thesis, MIT.
Remko Scha and Livia Polyani, 1988. An augmented context free grammar for discourse. Proceedings of COLING.
Yorick Wilks, 1975. An intelligent analyzer and understander of English. Communications of the ACM 18(5), 264--274. Reprinted in Grosz et al, Readings in Natural Language Processing
See also ch. 14 of Allen. or ch. 18 of Jurafsky/Martin.

[back to lecture index | back to top]

2/27: The Grosz and Sidner discourse theory

Handouts: Example discourse, Project guidelines
Barbara Grosz and Candace Sidner, 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3), pp. 175--204.
See also ch. 14 and 16 of Allen or ch. 19 of Jurafsky/Martin.

[back to lecture index | back to top]

3/4: Word sense disambiguation

Handouts: Lecture handout
Kathleen G. Dahlgren, 1988. Naive Semantics for Natural Language Understanding. Kluwer.
William Gale, Kenneth Church, and David Yarowsky, 1992. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. Proceedings of the ACL, pp. 249--256.
Nancy Ide and Jean Veronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics 24(1), pp. 1--40.
Abraham Kaplan, 1950. An experimental study of ambiguity and context. Mechanical Translation 2(2): 39-46 (issue appeared in 1955).
Jerrold J. Katz and Jerry A. Fodor, 1963, The structure of semantic theory. Language (39), pp. 170--210.
Alpha Luk, 1995. Statistical sense disambiguation with relatively small corpora using dictionary definitions. Proceedings of the 33rd ACL.
Ray Mooney, 1996. Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of the 1996 Conference on Empirical Methods in Natural Language Processing, pp. 82-91.
Erwin Reifler, 1955. The mechanical determination of meaning. In William N. Locke and A. Donald Booth, eds., Machine Translation of Languages. John Wiley and Sons. Also included in the upcoming Readings in Machine Translation, MIT Press.
Philip Resnik, 1995. Disambiguating noun groupings with respect to WordNet senses. Proc. of the 3rd Workshop on Very Large Corpora.
Hinrich Sch�tze, 1992. Dimensions of meaning. Proceedings of Supercomputing, pp 787-796. (postscript)
Yorick Wilks and Mark Stevenson, 1998. The Grammar of Sense: Using part-of-speech tags as a first step in semantic disambiguation. Journal of Natural Language Engineering 4(2), pp. 135-144. (See also cmp-lg/9607028)
David Yarowsky, 1992. Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proc. of COLING, pp. 454--460.
See also ch 17.1-17.2 of Jurafsky and Martin

[back to lecture index | back to top]

3/6: Word sense disambiguation methods

Handouts: lecture handout
Christiane Fellbaum, ed., 1998. WordNet: An Electronic Lexical Database
See also ch. 7 of Manning and Schütze.

[back to lecture index | back to top]

3/11: Supervised and bootstrapped WSD methods

Handouts: lecture handout, sample project proposal (available by hardcopy only)
David Yarowsky, 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French.Proceedings, ACL-94, pp. 88--95.
David Yarowsky, 1995. Unsupervised word sense disambiguation rivaling supervised methods, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp 189--196.
David Yarowsky, 1999.A comparison of corpus-based techniques for restoring accents in Spanish and French text. Natural Language Processing Using Very Large Corpora, pp. 99--120.
SemCor (see also SemCor 1.7a)

[back to lecture index | back to top]

3/13: Mostly-unsupervised Japanese segmentation

Rie Kubota Ando and Lillian Lee, 2000. Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji. NAACL, pp. 241--248.

[back to lecture index | back to top]

3/25: Representing semantics

Handout: lecture handout
Richard Montague, 1974. Formal Philosophy.
Fernando C.N. Pereira and Stuart Shieber, Prolog and Natural-Language Analysis, 1987. See Section 4.1.
Mark Steedman, 2000. The Syntactic Process.
See also Allen ch. 11.2, Jurafsky/Martin ch 14. 15.5

[back to lecture index | back to top]

3/27: Quantifier scope ambiguity

Handout: lecture handout
Jerry Hobbs and Stuart Shieber, 1987. An Algorithm for Generating Quantifier Scopings. Computational Linguistics, 13(1-2), pages 47-63.
Fernando C.N. Pereira and Stuart Shieber, Prolog and Natural-Language Analysis, 1987. See Section 4.1.

[back to lecture index | back to top]

4/1: Introduction to information theory

Handouts: lecture handout, sample literature survey (hardcopy only)
Thomas M. Cover and Joy A. Thomas, 1991. Elements of Information Theory. Wiley.
Ming Li and Paul Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, 1997 (2nd ed).
Claude Shannon, 1948. A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656. Republished as "The mathematical theory of communication" in Warren Weaver and Claude E. Shannon, eds., The Mathematical Theory of Communication, U. Illinois Press, 1949.
See also ch 7 and 8 of the Jelinek text, or Ch 2.2 of the Manning/ Schütze text. Jelinek recommends Abramson's Information Theory and Coding, 1963.

[back to lecture index | back to top]

4/3: Class-based language modeling with hard clusters

Handout: Table 2 and 3 of the Brown paper (see link)
Peter F. Brown and Vincent J. DellaPietra and Peter V. deSouza and Jennifer C. Lai and Robert L. Mercer, 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4), pages 467--479.
Frederick Jelinek, Robert L. Mercer and Salim Roukos, 1992. Principles of Lexical Language Modeling for Speech Recognition. In Sadaoki Furui and M. Mohan Sondhi, eds., Advances in Speech Signal Processing, Mercer Dekker.

[back to lecture index | back to top]

4/3: Class-based language modeling with soft clusters

Handout: Lecture handout
Slava M. Katz, 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35 (3), pp 400--401.
Lillian Lee and Fernando Pereira, 1999. Distributional similarity models: Clustering vs. nearest neighbors.
Fernando Pereira, Naftali Tishby, and Lillian Lee, 1993. Proceedings of the 37th ACL, pp 33--40. Distributional Clustering of English Words. Proceedings of the 31st ACL, pp 183--190.
See also Chapter 4 of Similarity-Based Approaches to Natural Language Processing (has proofs)

[back to lecture index | back to top]

4/10: Introduction to machine translation

Yehoshua Bar-Hillel, 1960. The present status of automatic translation of languages. In Franz Alt, A. Donald Booth, and R. E. Meagher, eds., Advances in Computers. Academic Press.
John Hutchins' history of machine translation in a nutshell.
Eduard H. Hovy and Kevin Knight, 1996. Machine Translation. Tutorial at ACL '96.
Kishore Papineni and Salim Roukos and Todd Ward and Wei-Jing Zhu, 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. IBM research report RC22176 (W0109-022).
See also ch. 21 of Jurafsky/Martin

[back to lecture index | back to top]

4/15: Statistical machine translation

Handout: lecture handout
Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on Human Language Technology
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, 1993. The Mathematics of Statistical Machine Translation". Computational Linguistics 19(2), pp. 263--311.
Kevin Knight, 1999. A Statistical MT Tutorial Workbook. (html)
Voltaire, 1759. Candide
Warren Weaver. Translation. Memorandum. See MT News International discussion, July 1999.

[back to lecture index | back to top]

4/17: The EM algorithm

Michael Collins, 1997. The EM Algorithm. Written Preliminary Exam (WPE) paper.
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, 1977. Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B, 39(1), pp. 1-38.
Geoffrey J. McLachlan and Thriyambakam Krishnan, 1997. The EM Algorithm and Extensions. Wiley.
Ted Pedersen's introductory talk and paper
See also chapter 9 of the Jelinek text, or chapter 11 of the Durbin, Eddy et al text, or the "When does EM work?" panel at EMNLP 2001

[back to lecture index | back to top]

4/22: Statistical Summarization

Handouts: lecture handout, Presentation and Final Report Instructions
Kevin Knight and Daniel Marcu, 2000. Statistics-Based Summarization -- Step One: Sentence Compression.

Inderjeet Mani and Mark Maybury, eds., 1999. Advances in Automatic Text Summarization.
Michael J. Witbrock and Vibhu Mittal, 1999. Ultra-summarization: a statistical approach to generating highly condensed non-extractive summaries Poster abstract, SIGIR 1999. (long version

[back to lecture index | back to top]

4/2: Linguistics and statistics: the case of POS tagging

Handouts: lecture handout
Eric Brill and Grace Ngai, 1999. Man [and Woman] vs. Machine: A Case Study in Base Noun Phrase Learning

Jean-Pierre Chanod and Pasi Tapanainen, 1995. Tagging French -- comparing a statistical and a constraint-based method. Proceedings of the EACL.
Kenneth W. Church, 1992. Current practice in part of speech tagging and suggestions for the future. In Simmons, ed., Sbornik praci: In Honor of Henri Kucera.
Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun, 1992. A Practical Part-of-Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133-140.
Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, pp. 313-330.
Christer Samuelsson and Atro Voutilainen, 1997. Comparing a linguistic and a stochastic tagger, Proceedings of the 35th ACL/8th EACL. Also cmp-lg/97060
Pasi Tapanainen and Atro Voutilainen, 1994. Tagging accurately - Don't guess if you know, Proceedings of the Fourth ACL Conference on Applied Natural Language Processing. Also cmp-lg/9408009

[back to lecture index | back to top]

Other online resources:

Cornell Natural Language Processing Page
Statistical Natural Language Processing: Models and Methods (a related class I taught previously)
Cognitive Studies at Cornell. The colloquia are often relevant.
Citeseer (ResearchIndex)
The Computation and Language section of CoRR. The Computation and Language E-Print Archive is a handy repository for NLP papers.
Hutchins' bibliography contains online links to some historical MT papers.
Search engines: Google | Ask Jeeves (uses NLP)
WordNet. The famed lexical database.
Ulrich Germann's version of the Hansards
Speech and Language Processing, by Daniel Jurafksy and James Martin.
Foundations of Statistical Natural Language Processing, companion website to Chris Manning and Hinrich Sch�tze's text.
ACL. The Association for Computational Linguistics is the main organization for people interested in NLP. A lot of links are available off the "universe".
Manning's list of resources
Kenji Kita's pages. A collection of empirical/statistical links. Also has an (idiosyncractic?) archive of papers.
Corpora list. Archive of postings.
NLP people (includes (slightly outdated) organization by region)
Toolkits I didn't see on other lists:
- The Lemur Toolkit for Language Modeling and Information Retrieval. (Chengxiang Zhai, Thi Nhu Truong, John Lafferty, and Jamie Callan, et al)
- The rainbow toolkit for text classification. (Andrew McCallum)
- The bigram statistics package (Ted Pedersen)
- The Perl WordNet interface (Ted Pedersen)
- Carmel, a finite-state toolkit
- EGYPT, a statistical machine translation toolkit built at a JHU '99 summer workshop.

Other CS course websites

. .

CS 674: Natural Language Processing Spring 2002

CS 674: Natural Language Processing
Spring 2002