Research

Research Area: Natural Language Processing
Advisor: Claire Cardie
Interests: Syntax and semantics of natural and artificial languages; corpus-based natural language processing; information extraction; question answering.
Our research uses machine learning techniques to build components for
understanding natural language. This methodology typically requires a
corpus of examples describing the task one wishes to accomplish. For
example, the corpus might consist of sentences with their
corresponding parse trees. The component -- in this case a parser --
is built by learning from the example parses.
At this time our chief goal is to lower the cost of corpus-based NLP
by reducing the amount of training data required. This should allow
language learning to be more widely applied, especially by non-experts
in computational linguistics.
Partial Parsing Framework
Partial parsing is a simplified version of the parsing task in which
the goal is to identify major constituents and relationships, such as
NPs and predicate-argument structure, while disregarding difficult
ambiguities, such as prepositional phrase attachment.
We designed a simple corpus-based framework for partial parsing that
uses sequences of part-of-speech (syntactic category) tags as rules.
We instantiated this framework for NPs and subject-verb-object
relationships. The framework provides a testbed for experimenting
with some techniques for increasing the performance of a parser.
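A toy version of this framework can be sketched in Python. The rule set below is invented for illustration and is not the actual treebank-derived grammar; a real grammar would contain many more tag sequences.

```python
# Sketch of partial parsing with POS-tag-sequence rules.
# The NP rules below are illustrative, not the treebank-derived grammar.
NP_RULES = [
    ("DT", "JJ", "NN"),
    ("DT", "NN"),
    ("NNP", "NNP"),
    ("PRP",),
    ("NN",),
]

def bracket_nps(tags):
    """Greedily match the longest NP rule starting at each position."""
    spans = []
    i = 0
    while i < len(tags):
        for rule in sorted(NP_RULES, key=len, reverse=True):
            if tuple(tags[i:i + len(rule)]) == rule:
                spans.append((i, i + len(rule)))
                i += len(rule)
                break
        else:
            i += 1
    return spans

# "The quick fox ate it" tagged DT JJ NN VBD PRP
print(bracket_nps(["DT", "JJ", "NN", "VBD", "PRP"]))  # -> [(0, 3), (4, 5)]
```

The spans are (start, end) token indices; greedy longest-match is one simple way to resolve overlapping rules.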
- Grammar Pruning
- We tracked the mistakes made by the grammar when we used it to reparse
the training data. The rules that made the most mistakes were discarded
to improve the grammar.
- Lexical Information
- Lexical information was added as an additional source of information
by piggy-backing a "probabilistic" model on top of the grammar to
score constituents and choose between alternate combinations of
constituents.
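The two techniques above might be sketched as follows. The pruning criterion and the head-word probabilities are simplifying assumptions for illustration, not the exact models from the papers.

```python
import math

def prune_grammar(rules, correct, errors):
    """Error-driven pruning: after reparsing the training data, keep only
    the rules whose correct matches outnumber their mistakes."""
    return [r for r in rules if correct.get(r, 0) > errors.get(r, 0)]

# Hypothetical lexical model: P(word heads an NP). The lexicalization in
# the papers is richer; this table is a stand-in for illustration.
HEAD_PROB = {"fox": 0.9, "dog": 0.85, "the": 0.05, "ate": 0.01}

def score(bracketing, words):
    """Log-score a set of (start, end) constituents, treating the last
    word of each span as its head (an assumption)."""
    return sum(math.log(HEAD_PROB.get(words[end - 1], 1e-6))
               for start, end in bracketing)

def choose(candidates, words):
    """Pick the combination of constituents with the best lexical score."""
    return max(candidates, key=lambda b: score(b, words))
```

For example, given the words "the fox ate the dog", a bracketing that ends a constituent on "ate" scores far below one whose constituents end on "fox" and "dog".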
Error-driven pruning of treebank grammars for
base noun phrase identification. Claire Cardie and David Pierce.
In Proceedings of the 36th Annual Meeting of the ACL and
COLING-98, pages 218-224, 1998. Available as cmp-lg/9808015.
The role of lexicalization and pruning for base
noun phrase grammars. Claire Cardie and David Pierce. In
Proceedings of the Sixteenth National Conference on Artificial
Intelligence (AAAI-99), pages 423-430, 1999.
Combining error-driven pruning and
classification for partial parsing. Claire Cardie, Scott Mardis,
and David Pierce. In Proceedings of the Sixteenth International
Conference on Machine Learning, pages 87-96, 1999.
Information Extraction
Information extraction is the identification of domain-specific
structured information in natural language text. We are in the
process of implementing an IE system. We hope to explore an
interesting new paradigm of IE in which an IE component learner
interacts tightly with a human user so that the learner can help the
human quickly identify appropriate training events.
Proposal for an interactive environment for
information extraction. Claire Cardie and David Pierce. Technical
Report TR98-1702, Cornell University Computer Science, September 1998.
Available as ncstrl.cornell/TR98-1702.
Question Answering
Question answering is a more fine-grained form of information
retrieval (IR). Standard IR retrieves documents based on natural
language queries. But users typically want shorter responses. Our
question answering system uses the Smart retrieval engine and then
attempts to locate chunks of text in the top-ranked documents that
specifically answer the query. Currently we can retrieve NPs to
answer who, what, when, where, and which questions.
Features of the system include some simple semantic type checking
between the question and answer and use of text summarization to
narrow the search.
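A highly simplified view of this answer-selection step is sketched below. The question-word-to-type table is invented for the sketch, and the candidate list stands in for NPs extracted from the top-ranked documents; the system's actual semantic type checking is more involved.

```python
# Hypothetical mapping from question words to expected answer types.
EXPECTED_TYPE = {"who": "person", "when": "date", "where": "location",
                 "which": "entity", "what": "entity"}

def type_matches(question, candidate_type):
    """Check a candidate NP's semantic type against the question word."""
    qword = question.split()[0].lower()
    expected = EXPECTED_TYPE.get(qword)
    return expected is None or expected == candidate_type

def select_answers(question, candidates):
    """Filter candidate NPs, given as (text, semantic type) pairs drawn
    from the top-ranked documents, keeping those that pass the check."""
    return [text for text, ctype in candidates
            if type_matches(question, ctype)]

print(select_answers("Who wrote Hamlet?",
                     [("Shakespeare", "person"), ("1603", "date")]))
# -> ['Shakespeare']
```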
Examining the role of statistical and
linguistic knowledge sources in a general-knowledge question-answering
system. Claire Cardie, Vincent Ng, David Pierce, and Chris
Buckley. In Proceedings of the Sixth Applied Natural Language
Processing Conference (ANLP-2000), pages 180-187, 2000.
Unsupervised and Weakly-Supervised Language Learning
We are currently applying some new machine learning algorithms within
our partial parsing framework to experiment with reducing the training
data required by bracketer learning. Co-training is one such
algorithm; it leverages the inherent redundancy in language. For
example, NPs can often be detected either by their context
(e.g. John ate ____) or by their content (e.g. the
____). If learners are built to use separate features of the
problem, they can bootstrap each other starting with very little data.
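A bare-bones illustration of this bootstrapping loop, using trivial count-based learners over two views (context vs. content). The view functions, confidence threshold, and toy data are all invented for the sketch.

```python
from collections import defaultdict

class ViewLearner:
    """Trivial one-view learner: counts (feature, label) pairs and
    predicts the majority label, with its fraction as confidence."""
    def __init__(self, view):
        self.view = view  # function mapping an example to its view's feature
        self.counts = {}

    def train(self, examples):
        self.counts = defaultdict(lambda: defaultdict(int))
        for x, y in examples:
            self.counts[self.view(x)][y] += 1

    def predict(self, x):
        dist = self.counts.get(self.view(x))
        if not dist:
            return None, 0.0
        label = max(dist, key=dist.get)
        return label, dist[label] / sum(dist.values())

def cotrain(labeled, unlabeled, learner_a, learner_b,
            threshold=0.9, rounds=3):
    """Each round, both learners retrain on the pool; each then labels
    the unlabeled examples it is confident about, growing the pool."""
    pool = list(labeled)
    for _ in range(rounds):
        learner_a.train(pool)
        learner_b.train(pool)
        remaining = []
        for x in unlabeled:
            for learner in (learner_a, learner_b):
                label, conf = learner.predict(x)
                if label is not None and conf >= threshold:
                    pool.append((x, label))
                    break
            else:
                remaining.append(x)
        unlabeled = remaining
    return pool

# Examples are (left-context word, first word of candidate NP) pairs.
labeled = [(("ate", "the"), "NP"), (("ran", "quickly"), "not-NP")]
unlabeled = [("ate", "a"), ("saw", "the")]
pool = cotrain(labeled, unlabeled,
               ViewLearner(lambda x: x[0]),   # context view
               ViewLearner(lambda x: x[1]))   # content view
print(pool)
```

Here ("ate", "a") is labeled by the context learner (it has seen "ate" before) and ("saw", "the") by the content learner (it has seen "the"), even though neither learner alone covers both examples.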