CS 775: Seminar in Natural Language Understanding, Spring 2001
"Statistical Natural Language Processing: Models and Methods"
Wednesdays, 1:25-2:15, Hollister 362
Natural language processing (NLP) has been considered one of the "holy
grails" for artificial intelligence ever since Turing proposed his
famed "imitation game" (the Turing Test).
Statistical approaches have revolutionized the way NLP is done.
Furthermore, some of these approaches can be employed in other
applications, such as computational biology and data mining.
This course will explore important classes of probabilistic models of
language and survey some of the common general techniques.
Syllabus
- I. Models
- Zipf's law: introduction to the statistics of language
- Hidden Markov Models and related formalisms: the Viterbi and
Baum-Welch algorithms, integration with semantic hierarchies,
language modeling, smoothing, and Maximum Entropy Markov Models
- The noisy channel model (basics of information theory): entropy,
the Kullback-Leibler divergence, and mutual information
- Parametric clustering
- II. Methods
- The Expectation-Maximization (EM) algorithm
- The Maximum Entropy (ME) principle
- The singular value decomposition (SVD) and latent semantic
indexing (LSI)
Reference Texts
- Eugene Charniak, Statistical Language Learning, 1993.
- Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 1991.
- Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison,
Biological Sequence Analysis: Probabilistic Models of Proteins
and Nucleic Acids, 1998.
- Frederick Jelinek, Statistical Methods for Speech Recognition, 1997.
- Christopher D. Manning and Hinrich Schuetze, Foundations of
Statistical Natural Language Processing, 1999.
Notes and References from Lecture
Some references may be available only internally to Cornell.
- Jan 24: Models and (vs.) methods. What is NLP - AI-completeness -
Statistical NLP - generative models - benefits and tradeoffs.
References:
- Jan 31: Zipf's law and the model moral. Statistics of words -
Zipf-like behavior - why log-log - least effort - Miller's monkeys.
(A small illustrative sketch follows the references below.)
References:
- Timothy C. Bell, John G. Cleary, and Ian H. Witten, 1990. Text
Compression, Prentice-Hall.
- J.B. Estoup, 1916. Gammes Stenographiques. Institut
Stenographique de France.
- Wentian Li, 1992. Random texts exhibit Zipf's-law-like word
frequency distribution. IEEE Transactions on Information Theory
38(6), pp. 1842-1845.
- Zipf's Law: a webpage by Wentian Li collecting Zipf's Law
references from many fields.
- Benoit Mandelbrot, 1957. Théorie mathématique de la loi
d'Estoup-Zipf. Institut de Statistique de l'Université.
- The Manning/Schuetze text, section 1.4.3
- George A. Miller, 1957. Some effects of intermittent silence. American Journal of Psychology, 70:311-314.
- George Kingsley Zipf, 1949. Human Behavior and the Principle
of Least Effort. Addison-Wesley.
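A small illustrative Python sketch of Miller's monkeys-at-typewriters
point (the alphabet size and corpus length are arbitrary choices, not
taken from the readings): even uniformly random typing, with the space
bar as a word separator, produces an approximately power-law
rank-frequency curve, i.e. a roughly straight line on log-log axes.

    import random
    from collections import Counter

    random.seed(0)
    alphabet = "abcde "                 # five letters plus the word-separating space
    text = "".join(random.choices(alphabet, k=200_000))
    counts = Counter(text.split())

    ranked = counts.most_common()
    for rank in (1, 10, 100, 1000):
        if rank <= len(ranked):
            word, freq = ranked[rank - 1]
            # Zipf-like behavior: log(freq) falls roughly linearly in log(rank).
            print(f"rank {rank:4d}  freq {freq:6d}  word {word!r}")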
- Feb 7: Hidden Markov Models (I). Markov chains - decoupling the
output from the mechanism - HMMs - calculating the probability of an
observed sequence the hard way - the forward-backward algorithm to
the rescue. (A small forward-algorithm sketch follows below.)
References:
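A small Python sketch of the forward algorithm (the two-state model
and all of its probabilities are invented for illustration): summing
over state sequences incrementally costs O(T N^2), instead of the
O(N^T) cost of enumerating every path.

    def forward(obs, states, start_p, trans_p, emit_p):
        """Return P(obs) under the HMM (start_p, trans_p, emit_p)."""
        # alpha[s] = P(o_1 .. o_t, state at time t = s)
        alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
        for o in obs[1:]:
            alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                     for s in states}
        return sum(alpha.values())

    # Toy two-state "tagger" with made-up numbers.
    states = ("N", "V")
    start_p = {"N": 0.6, "V": 0.4}
    trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
    emit_p = {"N": {"dog": 0.7, "runs": 0.3}, "V": {"dog": 0.1, "runs": 0.9}}
    print(forward(("dog", "runs"), states, start_p, trans_p, emit_p))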
- Feb 14: Hidden Markov Models (II). Finding likely state sequences -
finding likely paths with the Viterbi algorithm - revealing the
hidden states - Baum-Welch when nobody in the lab knows the language.
(A small Viterbi sketch follows the references below.)
Further references:
- Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun,
1992. A Practical Part-of-Speech Tagger. Proceedings of the
Third Conference on Applied Natural Language Processing,
pp. 133-140.
- Stephen J. DeRose, 1988. Grammatical category disambiguation by
statistical optimization. Computational Linguistics 14(2),
pp. 31-39.
- David Elworthy, 1994. Does Baum-Welch Re-estimation help taggers?
Proceedings of the 4th Conference on Applied Natural Language
Processing.
- Julian Kupiec, 1992. Robust Part-of-Speech Tagging Using a Hidden
Markov Model. Computer Speech and Language 6, pp. 225-242.
- Bernard Merialdo, 1994. Tagging English Text with a
Probabilistic Model. Computational Linguistics 20(2), pp. 155--171.
- Ralph Weischedel, Marie Meteer, Richard Schwartz, Lance
A. Ramshaw, and Jeff Palmucci, 1993. Coping with ambiguity and unknown
words through probabilistic models. Computational
Linguistics 19(2), pp. 359--382.
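A small Python sketch of the Viterbi algorithm, reusing the same
invented two-state model as the forward-algorithm sketch above:
replacing the forward recurrence's sum with a max, and remembering
which predecessor won, recovers the single most likely hidden state
sequence.

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """Return (most likely state sequence, its probability)."""
        # delta[s] = probability of the best path ending in state s
        delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
        paths = {s: [s] for s in states}
        for o in obs[1:]:
            new_delta, new_paths = {}, {}
            for s in states:
                prev = max(states, key=lambda r: delta[r] * trans_p[r][s])
                new_delta[s] = delta[prev] * trans_p[prev][s] * emit_p[s][o]
                new_paths[s] = paths[prev] + [s]
            delta, paths = new_delta, new_paths
        best = max(states, key=lambda s: delta[s])
        return paths[best], delta[best]

    # Same toy two-state tagger as in the forward-algorithm sketch.
    states = ("N", "V")
    start_p = {"N": 0.6, "V": 0.4}
    trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
    emit_p = {"N": {"dog": 0.7, "runs": 0.3}, "V": {"dog": 0.1, "runs": 0.9}}
    print(viterbi(("dog", "runs"), states, start_p, trans_p, emit_p))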
- Feb 21: Abney and Light: a WordNet-based HMM. Selectional
preferences - word sense disambiguation via HMMs - WordNet-induced
HMM topology - fun with hand-simulation - alterations to Baum-Welch -
problems with Baum-Welch, WordNet, or the approach?
References:
- Steven Abney and Marc Light, 1999. Hiding a semantic hierarchy in
a Markov model. Proceedings of the ACL'99 Workshop on Unsupervised
Learning in Natural Language Processing, pp. 1-8. (A longer version
is also available.)
- Massimiliano Ciaramita and Mark Johnson, 2000. Explaining away ambiguity:
Learning verb selectional preference with Bayesian networks.
Proceedings of the 18th International Conference on Computational
Linguistics, Volume 1.
- Stephen Clark and David Weir, An iterative approach to
estimating frequencies over a semantic hierarchy, Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 258-265.
- Hang Li and Naoki Abe, 1998. Generalizing
Case Frames Using a Thesaurus and the MDL Principle. Computational Linguistics, Vol. 24(2).
- Philip Resnik, 1997. Selectional
preference and sense disambiguation. ACL SIGLEX Workshop on
Tagging Text with Lexical Semantics: Why, What, and How?
- WordNet. See
also WordNet: An Electronic Lexical Database, Christiane
Fellbaum, ed., 1998.
- Feb 28: Smoothing, the Good old-fashioned Turing way. The
zero-frequency problem - the return of Zipf's law - the Good-Turing
estimate - counting bunnies in Bletchley Park and cracking the
Enigma - a Bayesian derivation - singleton rates and novel-item
rates - Jelinek and Mercer's empirical derivation - leave-one-out -
leave-high-counts-alone. (A small Good-Turing sketch follows the
references below.)
References:
- Stanley F. Chen and Joshua Goodman, 1996. An empirical study of
smoothing techniques for language modeling. Proceedings of the 34th
Annual Meeting of the Association for Computational Linguistics,
pp. 310-318. TR version: TR-10-98, Computer Science Group, Harvard
University, 1998.
- Kenneth W. Church
and William A. Gale, 1991. A comparison of the enhanced Good-Turing
and deleted estimation methods for estimating probabilities of English
bigrams. Computer Speech and Language 5, 19-54.
- Irving J. Good, 1953. The population frequencies of species and
the estimation of population parameters. Biometrika 40(3-4),
pp. 237-264.
- Jelinek text: see last chapter.
- Slava M. Katz, 1987. Estimation of Probabilities from Sparse
Data for the Language Model Component of a Speech Recognizer.
IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-35 (3), pp 400-401.
- Reinhard Kneser and Hermann Ney, 1995. Improved backing-off for m-gram
language modeling. Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, 181-184.
- Arthur Nadas, 1985. On Turing's formula for word probabilities.
IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol. ASSP-33(6), 1414-1416.
- Geoffrey Sampson's Good-Turing
Frequency Estimation page
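A small Python sketch of the basic Good-Turing estimate on a toy
unigram sample (this is the bare formula, not the Church-Gale or
Katz procedures cited above): an item seen r times gets the adjusted
count r* = (r+1) N_{r+1} / N_r, and the singletons (N_1) determine
how much probability mass is reserved for unseen items. Smoothing of
the N_r values themselves, needed in practice because N_{r+1} is
often zero for large r, is omitted here, as is renormalization.

    from collections import Counter

    def good_turing(counts):
        """counts maps item -> frequency; returns (probs, unseen mass)."""
        N = sum(counts.values())
        n = Counter(counts.values())   # n[r] = number of distinct items seen r times
        p_unseen = n[1] / N            # total mass reserved for unseen items
        probs = {}
        for item, r in counts.items():
            # Adjusted count r* = (r+1) n_{r+1} / n_r; fall back to r if n_{r+1} = 0.
            r_star = (r + 1) * n[r + 1] / n[r] if n[r + 1] > 0 else r
            probs[item] = r_star / N
        return probs, p_unseen

    probs, p0 = good_turing(Counter("the the the cat cat sat mat".split()))
    print(p0)      # 2/7: two singletons out of seven tokens held out for unseen words
    print(probs)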
- Mar 7: EM and Baum-Welch. Using hidden structure variables -
maximizing the expected log joint likelihood increases the
likelihood - deriving Baum-Welch as a special case. (A small EM
sketch follows below.)
References:
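A small Python sketch of EM on a toy problem (invented coin-flip
data; equal mixing weights are assumed and not re-estimated): each
sequence was produced by one of two coins of unknown bias, and the
coin identities are the hidden variables. The E-step computes the
expected assignment of each sequence to each coin, the M-step
re-estimates the biases from those expected counts, and no iteration
can decrease the likelihood. Baum-Welch is the same recipe with
hidden state paths in place of hidden coin labels.

    def em_two_coins(sequences, p_a=0.6, p_b=0.5, iters=25):
        for _ in range(iters):
            exp_counts = {"A": [0.0, 0.0], "B": [0.0, 0.0]}   # expected [heads, tails]
            for seq in sequences:
                h, t = seq.count("H"), seq.count("T")
                like_a = p_a ** h * (1 - p_a) ** t
                like_b = p_b ** h * (1 - p_b) ** t
                w_a = like_a / (like_a + like_b)              # E-step: P(coin A | seq)
                exp_counts["A"][0] += w_a * h
                exp_counts["A"][1] += w_a * t
                exp_counts["B"][0] += (1 - w_a) * h
                exp_counts["B"][1] += (1 - w_a) * t
            p_a = exp_counts["A"][0] / sum(exp_counts["A"])   # M-step: new biases
            p_b = exp_counts["B"][0] / sum(exp_counts["B"])
        return p_a, p_b

    print(em_two_coins(["HHHHT", "HTTTT", "HHHTH", "TTTTH", "THHHH"]))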
- Mar 14: Statistical Generation. Presented by Regina Barzilay. Note
the use of generative models for ranking proposed utterances.
References:
- Mar 28: Introduction to Information Theory. Relating information to
communication, randomness, and surprise - semantics vs. engineering? -
definitions and axioms for entropy - cross entropy when an unknown
source is involved - relative entropy/Kullback-Leibler divergence to
measure distributional similarity - mutual information as a special
case - can the mutual information be negative? It depends on whom you
ask. (A small numeric sketch follows the references below.)
- Norman Abramson, 1963. Information Theory and Coding
- Claude Shannon, 1948. A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and
623-656. Republished as "The mathematical theory of communication" in Warren Weaver and Claude E. Shannon, eds., The
Mathematical Theory of Communication, U. Illinois Press,
1949.
- See also the Cover and Thomas text or chapters 7 and 8 of the
Jelinek text.
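A small numeric Python sketch of these quantities (the distributions
are invented): entropy, KL divergence, and mutual information, the
last computed as the KL divergence between a joint distribution and
the product of its marginals. The averaged mutual information is
always nonnegative; individual pointwise terms in the sum can be
negative, which is one way to reconcile the differing answers.

    from math import log2

    def entropy(p):
        return -sum(pi * log2(pi) for pi in p if pi > 0)

    def kl(p, q):
        # Assumes q_i > 0 wherever p_i > 0; otherwise the divergence is infinite.
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def mutual_information(joint):
        # joint[i][j] = P(X=i, Y=j);  I(X;Y) = D( P(X,Y) || P(X) P(Y) ) >= 0
        px = [sum(row) for row in joint]
        py = [sum(col) for col in zip(*joint)]
        return sum(pxy * log2(pxy / (px[i] * py[j]))
                   for i, row in enumerate(joint)
                   for j, pxy in enumerate(row) if pxy > 0)

    print(entropy([0.5, 0.25, 0.25]))              # 1.5 bits
    print(kl([0.5, 0.5], [0.9, 0.1]))              # positive, and not symmetric in p, q
    print(mutual_information([[0.4, 0.1],
                              [0.1, 0.4]]))        # positive; 0 iff X and Y independent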
- April 11: Maximum Entropy Models. The missing key, the principle of
insufficient reason, and the probability the sun will rise tomorrow -
constraints on the empirical and estimated expectations - deriving
the maximum-entropy distribution (the return of Lagrange) - Candide:
"appears that the Bank of Boston has almost completed its review of
Shawmut" (or, improving statistical machine translation)
- Adam Berger's
MaxEnt reading list
- Adam Berger, Stephen Della Pietra, and Vincent Della Pietra, 1996.
A maximum entropy approach to natural language processing.
Computational Linguistics 22(1).
- P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek,
J. Lafferty, R. Mercer, and P. Roossin, 1990. A statistical approach
to machine translation. Computational Linguistics 16, pp. 79-85.
- P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer, 1993.
The mathematics of statistical machine translation: parameter
estimation. Computational Linguistics 19(2), pp. 263-311.
- Albert Einstein. "Everything should be made as simple as
possible, but not simpler".
- John Maynard Keynes, 1921. A Treatise on Probability
- Pierre Simon Laplace, 1775. Essai philosophique sur les
probabilités.
- See also chapters 13-14 of the Jelinek text.
- April 18: Maximum Entropy Models: iterative scaling (references as
above). Deriving the update equations - sanity checks - comparisons
to EM. (A small iterative-scaling sketch follows below.)
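A small Python sketch of Generalized Iterative Scaling on a toy
unconditional model (the events, features, and empirical distribution
are made up; this is not the Candide or Berger et al. setup): a slack
feature is appended so every event's feature values sum to the same
constant C, as GIS requires, and each update nudges the model's
feature expectations toward the empirical ones.

    from math import exp, log

    events = ["a", "b", "c", "d"]
    feats = {"a": [1, 1], "b": [1, 0], "c": [0, 1], "d": [0, 0]}  # two binary features
    empirical = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}          # observed frequencies

    C = max(sum(f) for f in feats.values())
    feats = {x: f + [C - sum(f)] for x, f in feats.items()}       # append slack feature
    n_feats = len(feats["a"])
    target = [sum(empirical[x] * feats[x][i] for x in events) for i in range(n_feats)]

    lambdas = [0.0] * n_feats
    for _ in range(200):
        weights = {x: exp(sum(l * f for l, f in zip(lambdas, feats[x]))) for x in events}
        Z = sum(weights.values())
        p = {x: w / Z for x, w in weights.items()}
        expected = [sum(p[x] * feats[x][i] for x in events) for i in range(n_feats)]
        # GIS update: lambda_i += (1/C) log(empirical expectation / model expectation)
        lambdas = [l + log(t / e) / C for l, t, e in zip(lambdas, target, expected)]

    # p now (approximately) satisfies the feature constraints, but it is the
    # maximum-entropy such distribution, not the empirical one.
    print({x: round(p[x], 3) for x in events})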
- April 25: Latent Semantic Indexing and Iterative Residual
Rescaling. Subspace projection, or, why hopefully no one will make a
model of car named the Chomsky - a geometric view of the singular
value decomposition - an analysis not based on a generative model
(for once) - compensating for non-uniformity via the IRR algorithm.
(A small SVD/LSI sketch follows the references below.)
- Rie Kubota Ando, 2000. Latent
semantic space: Iterative scaling improves precision of inter-document
similarity measurement. SIGIR '00.
- Rie Kubota Ando and Lillian Lee, 2001. Semantic space: On the
effectiveness of iterative residual rescaling. SIGIR 2001,
pp. 154-162.
- Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer,
and Richard Harshman, 1990. Indexing by Latent Semantic Analysis.
Journal of the American Society for Information Science 41(6),
pp. 391-407.
- Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki and
Santosh Vempala, 2000. Latent Semantic Indexing: A
Probabilistic Analysis. Journal of Computer and System
Sciences 61(2), pp. 217--235.
- Telcordia (formerly
BellCore) LSI page
- Colorado LSA homepage
- Tennessee LSI web page
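A small Python/NumPy sketch of the LSI step itself (the tiny
term-document matrix is invented, and only the SVD projection is
shown, not the IRR modification from the Ando and Ando/Lee papers):
project the documents onto the top k singular directions and compare
them there by cosine similarity.

    import numpy as np

    A = np.array([[2, 0, 1, 0],    # "car"
                  [1, 0, 2, 0],    # "auto"
                  [0, 3, 0, 1],    # "grammar"
                  [0, 2, 0, 2]],   # "syntax"
                 dtype=float)      # rows are terms, columns are documents

    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    docs_k = np.diag(s[:k]) @ Vt[:k, :]      # k-dimensional document representations

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Documents 0 and 2 share the car/auto vocabulary and come out nearly
    # identical in the reduced space; documents 0 and 1 come out nearly orthogonal.
    print(cosine(docs_k[:, 0], docs_k[:, 2]), cosine(docs_k[:, 0], docs_k[:, 1]))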
Administrative Information
2 credits, S/U only. Grade based on attendance/participation.
Prereqs: elementary probability, mathematical maturity. COMS674 not
required.
URL: http://www.cs.cornell.edu/courses/cs775/2001sp/
CS775, Spring '01
Lillian Lee
Related Links: NLP at CUCS | COMS674 | Other COMS course web pages |
CS775, Spring 2000