Department of Computer Science Colloquium
Tuesday March 2nd, 2002 at 4:15pm
Upson Hall B17
Statistical Methods for Natural Language Processing
Mike Collins
AT&T Research Labs
http://www.research.att.com/~mcollins
Sequential data is everywhere: obvious examples include text (the web, or digital
libraries), speech, and biological sequences. Algorithms that recover the structure
underlying this data are becoming increasingly important. This leads to an
interesting class of learning problems: how can we learn functions that map strings
to other discrete structures such as trees, segmentations, or underlying state
sequences?
In the first part of the talk I will review recent work on probabilistic approaches
to natural language parsing. Many of these methods fall into the general category of
history-based models, where a tree is represented as a derivation (a sequence of
decisions) and the probability of the tree is then calculated as a product of
decision probabilities. While these approaches have many advantages, it can be
awkward to encode some constraints within this framework: it is often easy to think
of features which might be useful in discriminating between candidate trees for a
sentence, but much more difficult to alter the derivation to take these features
into account.
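(As a concrete illustration, not notation from the talk itself: if a tree T is built
by decisions d_1, ..., d_n, a history-based model scores it as
P(T) = P(d_1) x P(d_2 | d_1) x ... x P(d_n | d_1, ..., d_{n-1}),
so any new feature must be folded into the conditioning history of some individual
decision.)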
As an alternative to these methods, I will present new algorithms for these
problems, derived from Freund and Schapire's voted perceptron algorithm for
classification tasks. A first motivation for the new algorithms concerns
*representation*: in comparison to hidden Markov models or probabilistic
context-free grammars, the methods are considerably more flexible in the features
that can be used to discriminate between competing structures. A second theme is
*discriminative training*: the parameter estimation methods make relatively weak
assumptions about the distribution underlying the examples. During the talk I will
present experiments with these methods, showing their utility on a number of
natural language problems.
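For readers unfamiliar with the general idea, the following is a minimal Python
sketch of a structured averaged perceptron in the spirit of this line of work; the
feature extractor and candidate generator named here are hypothetical placeholders,
not the algorithms or code presented in the talk.

    from collections import defaultdict

    def score(weights, features):
        # Dot product between a sparse feature dict and the weight vector.
        return sum(weights.get(f, 0.0) * v for f, v in features.items())

    def train(examples, feature_fn, candidates_fn, epochs=5):
        # examples: list of (x, gold_structure) pairs.
        # feature_fn(x, y): sparse dict of feature counts for structure y.
        # candidates_fn(x): candidate structures for x (e.g. trees, tag sequences).
        weights = defaultdict(float)
        totals = defaultdict(float)   # running sums for weight averaging
        steps = 0
        for _ in range(epochs):
            for x, gold in examples:
                steps += 1
                # Highest-scoring candidate under the current weights.
                pred = max(candidates_fn(x),
                           key=lambda y: score(weights, feature_fn(x, y)))
                if pred != gold:
                    # Additive update: promote gold features, demote predicted ones.
                    for f, v in feature_fn(x, gold).items():
                        weights[f] += v
                    for f, v in feature_fn(x, pred).items():
                        weights[f] -= v
                for f, w in weights.items():
                    totals[f] += w
        # Averaging the weight vectors approximates the voted perceptron.
        return {f: total / steps for f, total in totals.items()}

With candidates_fn supplying, say, the n-best parses from a baseline parser,
feature_fn is free to score arbitrary properties of an entire tree, which is the
kind of representational flexibility contrasted above with hidden Markov models and
probabilistic context-free grammars.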