- 1/21: Introduction
- Four handouts: course information sheet (includes syllabus), Student
information sheet, Reaction essay readings, Introductory paper
- Introductory paper: ``I'm sorry Dave, I'm
afraid I can't do that'': Linguistics, Statistics, and Natural
Language Processing circa 2001
- Bill
Gates quote, Gartner symposium, 1997. (Here
is an alternate link.)
- About the word
"undertoad" (see John Irving, The World According to Garp,
1976.)
- Mary Klages, 2001. Structuralism
and Sausurre
- Roni
Rosenfeld, the Universal Speech
Interface Manifesto
- Alan M. Turing, 1950. Computing
machinery and intelligence. Mind LIX(236), pp. 433-460.
- 1/23: Feature-based context-free grammars
- Handout: lecture examples
- See also ch. 4 of Allen or ch. 9 and 11 of Jurafsky and Martin.
- 1/28: Tree adjoining grammars
- Handout: lecture examples
- Christopher Culy, 1985. The complexity of the vocabulary of
Bambara. Linguistics and Philosophy, 8:345-351.
- Aravind K. Joshi, Leon S. Levy, Masako Takahashi, 1975. Tree
adjunct grammars. Journal of Computer and System Sciences
10(1), pp. 136-163.
- Aravind K. Joshi and Yves Schabes, Tree-adjoining
grammars. I don't know of a publication source for this.
- Aravind K. Joshi,
K. Vijay-Shankar, and David Weir,
1991. The convergence of mildly context-sensitive grammar formalisms.
In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational
Issues in Natural Language Processing, pp. 31--81. MIT Press.
- Aravind K. Joshi, 1985. Tree adjoining grammars: how much
context-sensitivity is required to provide reasonable structural
description. In David R. Dowty, Lauri Karttunen, and Arnold M. Zwicky,
eds, Natural Language Processing: psychological, computational,
and theoretical perspectives, Cambridge. Note -- there may be an
error in the parsing-time claim.
- Geoff Pullum, 1986. Footloose
and context-free. Natural Language and
Linguistic Theory 4, pp. 283--289. Reprinted in The Great Eskimo Vocabulary Hoax, U. of
Chicago Press, 1991.
- Stuart
Shieber, 1985. Evidence against the context-freeness of natural
language. Linguistics and Philosophy, 8:333-343.
- Stuart
Shieber and Yves Schabes, 1990. Synchronous
Tree-Adjoining Grammars. In Proceedings of the 13th International Conference on Computational Linguistics, volume 3, pp. 1--6.
- K. Vijay-Shankar and Aravind K. Joshi, 1985. Some computational
properties of Tree Adjoining Grammars. Proc. of the 23rd ACL,
pp. 82--93.
- K. Vijay-Shankar and
David
Weir, 1994. The
equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27:511-545.
- The XTAG home page.
- TAG+6: The 6th
International Workshop on Tree Adjoining Grammars and Related Frameworks .
[back to
lecture index | back to top]
- 1/30: TAGs with adjunction
constraints
[back to
lecture index | back to top]
- 2/04: Feature-based TAGs
- Handout: Lecture examples
- Steven Abney, 1996. Statistical
methods and linguistics. The Balancing
Act, Judith
Klavans and Philip Resnik, eds, MIT Press. Paper available online as ps
or
pdf
- Sanguthevar
Rajasekaran and Shibu Yooseph, 1995. TAL recognition in O(M(n^2)) time. Proc. of the 33rd
ACL, pp. 166--173. (Link is to the journal version, in
Journal of Computer and System Sciences, 56(1), pp. 83--89, 1998.)
- Giorgio Satta, 1994. Tree Adjoining Grammar parsing
and Boolean matrix multiplication. Computational Linguistics, 20(2),
pp. 173--191.
[back to
lecture index | back to top]
- 2/06: (Some) statistics of
language
- Handout: Lecture handout
- J.B. Estoup, 1916. Gammes Stenographiques. Institut
Stenographique de France.
- W. Nelson Francis and Henry Kucera, 1982. Frequency Analysis of
English Usage. Houghton Mifflin.
Also of potential interest: the Brown corpus
manual, 1979; and a search interface.
- Tommi Jaakkola
and David Haussler, 1998. Exploiting
generative models in discriminative classifiers. NIPS 11.
- Mark Johnson, 2001.
Joint
and conditional estimation of tagging and parsing
models. Proc. of ACL, pp. 314--321.
- John Lafferty, Andrew McCallum, and Fernando Pereira,
2001. Conditional
random fields: Probabilistic models for segmenting and labeling
sequence data. Proc. of ICML.
- Benoit Mandelbrot, 1957. Théorie mathématique de la
d'Estoup-Zipf. Inst. de Statistique de l'Univèrsité.
- George A. Miller, 1957. Some
effects of intermittent silence. American J. Psychology 70,
pp. 311-313.
- Andrew Y. Ng and
Michael Jordan, 2002. On
discriminative vs. generative classifiers: A comparison of logistic
regression and Naive Bayes. Proc. of NIPS.
- Y. Dan Rubenstein and Trevor Hastie, 1997.
Discriminative
vs Informative Learning. Proc. of KDD.
- George Zipf, 1949. Human Behavior and the principle of least
effort. Addison-Wesley Press.
- Zipf's
Law (webpage by Wentian Li). Many Zipf's Law refs in
many fields.
- See also section 1.4 of the Manning/Schütze text or chapter
4 of Timothy C. Bell, John G. Cleary, and Ian H. Witten, Text
Compression, 1990.
[back to
lecture index | back to top]
- 2/11: Hidden Markov Models
- Handout: Lecture handout
- Lawrence R. Rabiner, 1989. A tutorial on hidden Markov models and
selected applications in speech recognition. Proceedings of the
IEEE 77 (2), pp. 257--286.
- Lawrence R. Rabiner and B. H. Juang, 1986. An introduction to
hidden Markov models. IEEE ASSP Magazine, pp. 4-15.
- See also Ch. 9 of Manning/Schütze, Ch. 2 of Jelinek, or
Ch. 7 of Jurafsky/Martin. Ch. 3 of the Durbin/Eddy/Krogh/Mitchison
introduces the same material from a computational-biology perspective.
- 2/13 and 2/18: Hidden Markov Models (cont.)
[back to
lecture index | back to top]
- 2/20: Good-Turing smoothing
- Handout: Lecture handout
- Peter F. Brown and Vincent J. DellaPietra and Peter V. deSouza
and Jennifer C. Lai and Robert L. Mercer, 1992. Class-based n-gram
models of natural language. Computational Linguistics 18(4),
pp. 467--479.
-
Stanley F. Chen and Joshua Goodman, 1996. An
empirical study of smoothing techniques for language modeling
Proceedings of the 34th Meeting of the Association for Computational
Linguistics, pp 310--318.
TR version: TR-10-98,
Computer Science Group, Harvard University, 1998.
- Kenneth W. Church and William A. Gale, 1991. A comparison of the
enhanced Good-Turing and deleted estimation methods for estimating
probabilities of English bigrams. Computer Speech and Language
5, pp. 19--54.
- Michael
Collins and James Brooks, 1995. Prepositional Attachment
through a Backed-off Model. Third Workshop on Very Large
Corpora, pp. 27--38.
- I. J. Good, 1953. The population frequencies of species and the
estimation of population parameters. Biometrika 40, pp.
237--264.
- Pierre Simon Laplace. Essai Philosophique sur les
probabilities. There appear to be several editions; Ristad's A
natural law of succession dates it to 1775.
- Arthur Nadas, 1985. On Turing's formula for word probabilities.
IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-33 (6), pp. 1414--1416.
- Geoffrey Sampson's Good-Turing
page.
- See also section 6.2 of the Manning/Schütze text or chapter
15 of the Jelinek text.
[back to
lecture index | back to top]
- 2/25: Discourse phenomena
- Handout: Lecture handout
- Ralph Grishman, 1986. Computational Linguistics: An
Introduction. Cambridge.
- Jerry R. Hobbs, 1978. Resolving Pronoun References. Reprinted
in Grosz, Sparck Jones, and Webber, Readings in Natural Language
Processing.
- Jerry R. Hobbs, 1979. Coherence and co-reference. Cognitive
Science 3(1), pages 67--82.
- Ray Jackendoff, 1972. Semantic Interpretation in Generative
Grammar. MIT Press.
- William
C. Mann and Sandra A. Thompson, 1986. Relational propositions in
discourse. Discourse Processes 9(1), pp. 57--90.
- Vladimir Nabokov, Lolita. 1955.
- Candace Sidner, 1979,
Towards a Computational Theory of Definite Anaphora Comprehension
in English Discourse. PhD Thesis, MIT.
- Remko Scha and Livia Polyani, 1988. An augmented context free grammar
for discourse. Proceedings of COLING.
- Yorick Wilks, 1975. An intelligent analyzer and understander of
English. Communications of the ACM 18(5), 264--274. Reprinted in Grosz et al,
Readings in Natural Language Processing
- See also ch. 14 of Allen. or ch. 18 of Jurafsky/Martin.
[back to
lecture index | back to top]
- 2/27: The Grosz and Sidner
discourse theory
[back to
lecture index | back to top]
- 3/4: Word sense disambiguation
- Handouts: Lecture handout
- Kathleen G. Dahlgren, 1988. Naive Semantics for Natural
Language Understanding. Kluwer.
- William Gale, Kenneth Church, and David Yarowsky, 1992. Estimating upper and lower bounds on the performance of word-sense disambiguation programs.
Proceedings of the ACL, pp. 249--256.
- Nancy Ide and Jean Veronis. Introduction to the special issue on
word sense disambiguation: The state of the art. Computational
Linguistics 24(1), pp. 1--40.
- Abraham Kaplan, 1950. An experimental study of ambiguity and
context. Mechanical Translation 2(2): 39-46 (issue appeared
in 1955).
- Jerrold
J. Katz and Jerry A. Fodor, 1963, The structure of semantic
theory. Language (39), pp. 170--210.
- Alpha Luk, 1995. Statistical sense disambiguation with
relatively small corpora using dictionary definitions.
Proceedings of the 33rd ACL.
- Ray Mooney, 1996. Comparative experiments on disambiguating word
senses: An illustration of the role of bias in machine learning. Proceedings of the 1996 Conference on Empirical Methods in Natural Language Processing, pp. 82-91.
- Erwin Reifler, 1955. The mechanical determination of meaning.
In William N. Locke and A. Donald Booth, eds., Machine
Translation of Languages. John Wiley and Sons. Also included in
the upcoming Readings in Machine Translation, MIT Press.
- Philip
Resnik, 1995. Disambiguating noun
groupings with respect to WordNet senses. Proc. of the 3rd
Workshop on Very Large Corpora.
- Hinrich
Schütze, 1992. Dimensions
of meaning. Proceedings of Supercomputing, pp 787-796. (postscript)
- Yorick Wilks and
Mark Stevenson, 1998. The
Grammar of Sense: Using part-of-speech tags as a first step in
semantic disambiguation. Journal of Natural Language
Engineering 4(2), pp. 135-144. (See also cmp-lg/9607028)
- David Yarowsky,
1992. Word-Sense
Disambiguation Using Statistical Models of Roget's Categories Trained
on Large Corpora. Proc. of COLING, pp. 454--460.
- See also ch 17.1-17.2 of Jurafsky and Martin
[back to
lecture index | back to top]
- 3/6: Word sense
disambiguation methods
[back to
lecture index | back to top]
- 3/11: Supervised and
bootstrapped WSD
methods
[back to
lecture index | back to top]
- 3/13: Mostly-unsupervised
Japanese segmentation
[back to
lecture index | back to top]
- 3/25: Representing semantics
[back to
lecture index | back to top]
- 3/27: Quantifier scope ambiguity
[back to
lecture index | back to top]
- 4/1: Introduction to information theory
- Handouts: lecture handout, sample literature
survey (hardcopy only)
- Thomas
M. Cover and Joy
A. Thomas, 1991. Elements of
Information Theory. Wiley.
- Ming Li and Paul Vitanyi, An Introduction to
Kolmogorov Complexity and Its Applications, 1997 (2nd ed).
- Claude Shannon, 1948. A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and
623-656. Republished as "The mathematical theory of communication" in Warren Weaver and Claude E. Shannon, eds., The
Mathematical Theory of Communication, U. Illinois Press,
1949.
- See also ch 7 and 8 of the Jelinek text, or Ch 2.2 of the Manning/
Schütze text. Jelinek recommends
Abramson's Information Theory and Coding, 1963.
[back to
lecture index | back to top]
- 4/3: Class-based language
modeling with hard clusters
- Handout: Table 2 and 3 of the Brown paper (see link)
- Peter F. Brown and Vincent J. DellaPietra and Peter V. deSouza
and Jennifer C. Lai and Robert L. Mercer, 1992. Class-based n-gram
models of natural language. Computational Linguistics 18(4),
pages 467--479.
- Frederick Jelinek, Robert L. Mercer and Salim Roukos, 1992.
Principles of Lexical Language Modeling for Speech Recognition. In
Sadaoki Furui and M. Mohan Sondhi, eds., Advances in Speech Signal
Processing, Mercer Dekker.
[back to
lecture index | back to top]
- 4/3: Class-based language
modeling with soft clusters
- Handout: Lecture handout
- Slava M. Katz, 1987. Estimation of Probabilities from Sparse
Data for the Language Model Component of a Speech Recognizer.
IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-35 (3), pp 400--401.
- Lillian Lee and Fernando Pereira, 1999. Distributional
similarity models: Clustering vs. nearest neighbors.
- Fernando Pereira,
Naftali Tishby, and Lillian Lee, 1993. Proceedings of the 37th ACL, pp 33--40. Distributional
Clustering of English Words. Proceedings of the 31st ACL,
pp 183--190.
- See also Chapter 4 of Similarity-Based
Approaches to Natural Language Processing (has proofs)
[back to
lecture index | back to top]
- 4/10: Introduction to
machine translation
[back to
lecture index | back to top]
- 4/15: Statistical
machine translation
- Handout: lecture handout
- Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent
J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer,
Harry Printz, and Lubos Ures, 1994. The
Candide System for Machine Translation. Proceedings of the
1994 ARPA Workshop on Human Language Technology
- Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra,
and Robert L. Mercer, 1993.
The
Mathematics of Statistical Machine Translation". Computational
Linguistics 19(2), pp. 263--311.
- Kevin
Knight, 1999. A Statistical
MT Tutorial Workbook. (html)
- Voltaire, 1759. Candide
- Warren Weaver. Translation. Memorandum. See MT
News International discussion, July 1999.
[back to
lecture index | back to top]
- 4/17: The EM algorithm
- Michael
Collins, 1997. The EM
Algorithm. Written Preliminary Exam (WPE) paper.
- Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, 1977. Maximum
Likelihood From Incomplete Data via the EM Algorithm. Journal of
the Royal Statistical Society Series B, 39(1), pp. 1-38.
- Geoffrey J. McLachlan and Thriyambakam Krishnan, 1997. The
EM Algorithm and Extensions. Wiley.
- Ted Pedersen's
introductory talk
and paper
- See also chapter 9 of the Jelinek text, or chapter 11 of the
Durbin, Eddy et al text, or the "When does EM work?" panel at EMNLP 2001
[back to
lecture index | back to top]
- 4/22: Statistical Summarization
[back to
lecture index | back to top]
- 4/2: Linguistics and
statistics: the case of POS tagging
- Handouts: lecture handout
-
Eric Brill and Grace
Ngai, 1999. Man
[and Woman] vs. Machine: A Case Study in Base Noun Phrase Learning
- Jean-Pierre Chanod and Pasi Tapanainen, 1995. Tagging French -- comparing a statistical and a constraint-based method. Proceedings of the EACL.
- Kenneth W. Church,
1992. Current practice in part of speech tagging and suggestions for
the future. In Simmons, ed., Sbornik praci: In Honor of Henri Kucera.
- Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun,
1992. A Practical Part-of-Speech Tagger. Proceedings of the
Third Conference on Applied Natural Language Processing,
pp. 133-140.
- Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building
a large annotated corpus of English: the Penn Treebank. Computational
Linguistics 19, pp. 313-330.
- Christer Samuelsson and Atro Voutilainen, 1997. Comparing
a linguistic and a stochastic tagger, Proceedings of the 35th ACL/8th
EACL. Also cmp-lg/97060
- Pasi Tapanainen and Atro Voutilainen, 1994. Tagging
accurately - Don't guess if you know, Proceedings of the Fourth
ACL Conference on Applied Natural Language Processing.
Also cmp-lg/9408009
[back to
lecture index | back to top]