This semester's CS775 (two credits, S/U only) will consist of work-in-progress presentations (recent work counts too!) by the participants.
Previous offerings:
F98, S00, S01
(as Statistical Natural Language Processing: Models and Methods)
See also the
Cornell NLP page.
Schedule (quick links: Holland-Minkley; Rooth; Pang; Lapata; Andrews; Breck; Wagstaff; Blitzer; Moody; Barzilay; Pucella):
Abstract: I will be talking about a corpus-based methodology for building a natural language generation system for proof texts, and about the extensibility of that methodology to new proof techniques. Our approach connects a formal proof with its natural language version via an intermediate representation that capitalizes on the high-level structure of the formal proof. The intermediate representation is based on observations of regularities in a corpus collected by a study asking subjects to write English texts for formal proofs. We evaluate the generality and extensibility of our methodology to new domains of mathematics, using the results of a second study which employed a different formal proof format from that used in collecting our initial corpus, and drew the formal proofs from a different, more complex domain. By examining the new, larger corpus we observe that the methodology derived from the first corpus remains applicable to the second.
Abstract: In a headed tree, each terminal word can be uniquely labeled with a governing word and grammatical relation. This labeling is a summary of a syntactic analysis which eliminates detail, reflects aspects of semantics, and for some grammatical relations (such as subject of finite verb) is nearly uncontroversial. We define a notion of expected governor markup, which sums vectors indexed by governors and scaled by probabilistic tree weights. The quantity is computed in a parse forest representation of the set of tree analyses for a given sentence, using vector sums and scaling by inside probability and flow.
Today's presentation includes a generalization of the algorithm that works for any function of a certain linear form on the rules and symbols of a non-recursive stochastic context-free grammar.
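As a rough illustration of that generalization, the sketch below (a hypothetical forest encoding with toy numbers, not the presenter's implementation) computes the expectation of an additive rule score over the derivations of a small non-recursive stochastic grammar, using inside probabilities as described above.

```python
from functools import lru_cache
from math import prod

# node -> list of derivation edges; an edge is (rule_prob, children, rule_score).
# Terminals have no edges. rule_score is the linear contribution whose
# expectation we want (e.g. 1.0 when the rule marks a particular governor).
FOREST = {
    "S":   [(1.0, ("NP", "VP"), 0.0)],
    "NP":  [(0.7, ("det", "n"), 1.0),
            (0.3, ("n",),       0.0)],
    "VP":  [(1.0, ("v",),       0.0)],
    "det": [], "n": [], "v": [],
}

@lru_cache(maxsize=None)
def inside(node):
    """Probability mass of all subderivations rooted at node."""
    edges = FOREST[node]
    if not edges:                                     # terminal
        return 1.0
    return sum(p * prod(inside(c) for c in children)
               for p, children, _ in edges)

@lru_cache(maxsize=None)
def expected_score(node):
    """Expected sum of rule scores over derivations rooted at node."""
    edges = FOREST[node]
    if not edges:
        return 0.0
    total = 0.0
    for p, children, score in edges:
        # probability that this edge is used, given that we derive from node
        weight = p * prod(inside(c) for c in children) / inside(node)
        total += weight * (score + sum(expected_score(c) for c in children))
    return total

print(expected_score("S"))   # 0.7 in this toy forest
```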
Abstract: Most previous work on text classification has focused on topical classification - i.e., documents exist in different topical categories, and the task is to learn these categories so that new documents can be appropriately classified. The recent explosion of available text information has increased interest in the analysis of these documents for purposes other than simply topical classification. A particularly important area, for Business Intelligence, is the analysis of documents in discussion forums, problem reports, etc., to gauge customer feedback. In the present project we restrict ourselves to the problem of document sentiment classification - i.e., whether a document expresses a positive, negative, or neutral opinion.
In this talk we will describe our preliminary results and compare the performance of Maximum Entropy, Naive Bayes, and SVMs on datasets collected from the Epinions website. Our initial experiments using different sets of features (obtained using a part-of-speech tagger) and balanced vs. unbalanced training sets do not indicate any clear winner. In some favorable settings we are able to obtain about 80% accuracy.
Mentor: Shivakumar Vaithyanathan
Data and many helpful discussions: Jussi Myllymaki,
Vikas Krishna
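To make the kind of classifier comparison described in the abstract above concrete, here is a hedged sketch using scikit-learn stand-ins (MultinomialNB for Naive Bayes, LogisticRegression as a maximum-entropy model, LinearSVC for the SVM) over bag-of-words features; the review texts, labels, and library choice are illustrative assumptions, not the project's actual data or code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder reviews and sentiment labels (not the epinions data).
reviews = ["great camera, sharp pictures", "battery died after a week",
           "works as advertised", "terrible support, would not buy again"]
labels  = ["pos", "neg", "pos", "neg"]

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    # Bag-of-words features followed by the classifier under comparison.
    pipeline = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipeline, reviews, labels, cv=2)
    print(type(clf).__name__, scores.mean())
```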
This talk discusses the interpretation of nominalisations, a particular class of compound nouns whose head noun is derived from a verb and whose modifier is interpreted as the argument of this verb (e.g., satellite observation means observation of satellites or observation by satellite). Any attempt to automatically interpret nominalisations needs to take into account: (a) the selectional constraints imposed by the nominalised compound head, (b) the fact that the relation between the modifier and the head noun can be ambiguous, and (c) the fact that these constraints can be easily overridden by contextual or pragmatic factors. The interpretation of nominalisations poses a further challenge for probabilistic approaches since the argument relations between a head and its modifier are not readily available in the corpus. Even an approximation which maps the compound head to its underlying verb provides insufficient evidence. We present an approach which treats the interpretation task as a disambiguation problem and show how we can "recreate" the missing distributional evidence by exploiting partial parsing, smoothing techniques, and contextual information. We combine these distinct information sources using Ripper, a system that learns sets of rules from data, and achieve a precision of 86.1% (over a baseline of 61.5%) on the British National Corpus.
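As a toy illustration of the disambiguation framing only (the counts, relation labels, and function name below are made up, and the real system combines much richer evidence with Ripper), one could back off from the compound to its underlying verb and compare smoothed corpus counts for the competing argument relations:

```python
# Hypothetical verb-argument counts gathered from a parsed corpus.
COUNTS = {
    ("observe", "satellite", "obj"):  12,   # "observe the satellite"
    ("observe", "satellite", "subj"):  3,   # "the satellite observes ..."
}

def interpret(verb, modifier, relations=("subj", "obj"), alpha=1.0):
    """Pick the relation with the higher add-alpha smoothed count."""
    smoothed = {r: COUNTS.get((verb, modifier, r), 0) + alpha for r in relations}
    return max(smoothed, key=smoothed.get)

print(interpret("observe", "satellite"))    # 'obj': observation of satellites
```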
Abstract: In this talk, I want to address the general problem of language learning. I choose to do this from a purely formal or mathematical perspective. I'll make no effort to describe the details of human language learning as it actually occurs in children. I will not even use real human languages in examples; all the languages I describe will be unapologetically artificial. I want to take this more abstract perspective to address some fundamental questions that are probably of general interest: what a language is, what it means to learn a language, and how a language can be learned in principle. The original results that I will present arise as a consequence of extending the analysis of language to incorporate continuous-variable mathematics, particularly the theory of nonlinear dynamical systems. For example, I'll show how even simple, low-dimensional dynamical systems can learn arbitrarily complex languages. These results have implications for how a neural computational system (as manifested in either a biological organism or an artificial neural network) can in principle learn and process a language.
Abstract: End-to-end performance statistics are important for evaluating the performance of a technology, but they can't tell the whole story. In this work, we consider a simplified, but representative, question-answering engine and analyze the performance of several of its modules. We carry out three types of analysis: inherent properties of the data, feature analysis, and performance bounds. This approach yields useful techniques, important insights into this task, and a demonstration of the importance of component-wise performance evaluation.
This talk is based on Light, Mann, Riloff, and Breck, "Analyses for Elucidating Current Question Answering Technology," to appear in the Journal of Natural Language Engineering.
Abstract: It is commonly observed that a human speaker or author will avoid repeating the same noun phrase by using a variety of noun phrases to refer to the same entity. For example, a single document may alternately refer to "George W. Bush" as "a man", "the man", "he", "Bush", "the president of the United States", "the commander-in-chief", "the president", or even "Dubya". These noun phrases are all considered 'coreferent'. Human audiences generally have no trouble linking the different noun phrases together, but they can present quite a challenge to an automated natural language processing (NLP) system. This task of determining which noun phrases refer to the same entity and which do not is commonly referred to as noun phrase coreference resolution.
Most coreference systems approach this problem by developing a set of hand-crafted filters. We present a different approach: We view the problem as one of partitioning, or clustering, the noun phrases into groups that represent each entity that appears in a document. This is achieved by using a constrained clustering algorithm that takes as input, in addition to the set of noun phrases themselves, a set of constraints that encode linguistic information about coreference. In this way, the general clustering algorithm becomes a temporary "expert" on coreference clustering.
This talk will present the details of this algorithm and demonstrate its performance on documents from the MUC-6 coreference evaluation. We will also spend some time on the issue of evaluation, which turns out to be more challenging (and more interesting) than expected.
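The sketch below is only a minimal illustration of the idea of clustering under linguistic constraints, not the algorithm to be presented: a hypothetical compatible() check stands in for the constraint set, and each noun phrase is greedily attached to the most recent compatible cluster.

```python
def compatible(np, cluster):
    """Stand-in for the linguistic constraints (gender, number, ...);
    here we only require agreement on a 'number' attribute."""
    return all(np["number"] == other["number"] for other in cluster)

def cluster_noun_phrases(noun_phrases):
    """Greedy constrained clustering: each cluster is one candidate entity."""
    clusters = []
    for np in noun_phrases:                     # in document order
        for cluster in reversed(clusters):      # prefer the most recent cluster
            if compatible(np, cluster):
                cluster.append(np)
                break
        else:
            clusters.append([np])               # start a new entity
    return clusters

doc_nps = [
    {"text": "George W. Bush", "number": "sg"},
    {"text": "reporters",      "number": "pl"},
    {"text": "the president",  "number": "sg"},
]
for entity in cluster_noun_phrases(doc_nps):
    print([np["text"] for np in entity])
```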
Abstract: This talk will address two important issues in automatic summarization. Evaluation of summarization systems is often cumbersome, inaccurate, and human-intensive. Relevance Preservation (RP) is a metric which attempts to circumvent these problems by using an information retrieval system to evaluate the performance of an automatic summarizer. In the first part of this talk, I will discuss work done this summer at the Johns Hopkins Center for Language and Speech Processing in the development and evaluation of RP (Radev, Teufel, Lam, Saggion, Blitzer, Celebi, Drabek, Liu, Qi, "Automatic Summarization of Multiple Documents").
The second part of the talk will center on research that grew out of the JHU workshop and will address the problem of redundancy in multidocument extractive summaries. Using data annotated for an "information subsumption" relationship, I am attempting to develop techniques for predicting and avoiding redundant information in extractive summaries. I will discuss the progress of this research and give some initial results (joint work with Professor Mats Rooth).
Abstract: Increasing amounts of non-textual information, such as stock-market data and transaction logs, are available online. This data is not easily understandable by people; moreover, it is not accessible using existing search engines. Textual descriptions of this data can alleviate both of these problems. Existing generation systems are fully hand-crafted, which significantly increases their development time and limits their portability to other domains. We propose an approach for learning English verbalizations of symbolic input, using a parallel corpus of semantic representations of the input and their translations into English. The main challenge in aligning this corpus is the granularity of the alignment -- some semantic concepts are verbalized as multiword phrases and others as single words. However, the boundaries of these units are not known a priori, which presents a problem for extracting them. To solve this problem, the method uses multi-sequence alignment between all the sentences that contain a given concept and finds a consensus sequence of the alignment. We implemented this approach in the domain of mathematical proofs, translating the output of the automatic theorem prover Nuprl into natural language proofs. Our initial results indicate that our technique produces proofs that rival in quality those produced by a hand-crafted generation system.
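The following is a much-simplified stand-in for the multi-sequence alignment step (the consensus_phrase helper and the toy sentences are my own assumptions): each sentence that mentions a concept is aligned against a pivot sentence with difflib, and words that a majority of the other sentences agree on are kept as the consensus verbalization.

```python
from collections import Counter
from difflib import SequenceMatcher

def consensus_phrase(sentences, pivot=0):
    """Align every sentence against a pivot and keep the words, in pivot
    order, that most of the other sentences agree on."""
    pivot_words = sentences[pivot].split()
    votes = Counter()
    for i, sent in enumerate(sentences):
        if i == pivot:
            continue
        matcher = SequenceMatcher(None, pivot_words, sent.split())
        for block in matcher.get_matching_blocks():
            for k in range(block.size):         # vote for each aligned word
                votes[block.a + k] += 1
    threshold = (len(sentences) - 1) / 2        # majority of the other sentences
    keep = [w for j, w in enumerate(pivot_words) if votes[j] >= threshold]
    return " ".join(keep)

# Toy usage: sentences that all verbalize the same proof-step concept.
print(consensus_phrase([
    "by the induction hypothesis we know that n is even",
    "we know by the induction hypothesis that n is even",
    "by the induction hypothesis n is even",
]))
```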
Abstract: There are essentially two schools of thought regarding how to give a semantics (or meaning) to sentences in a natural language. The first, with Chomsky as a well-known representative, advocates the autonomy of syntax from semantics. The second school uses the syntax as a map to drive the derivation of the semantics. In this talk, I will present an important representative of the second approach, namely categorial grammars. In such a framework, the meaning of sentences is given by a truth-functional semantics based on higher-order logic, while the syntax is analyzed using a calculus of syntactic types known as the Lambek calculus. The key feature of this approach is that a derivation of the well-formedness of a sentence can be used to derive a meaning for the sentence.
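As a small worked example (a toy encoding of my own, not anything from the talk), forward and backward application over Lambek-style syntactic types is enough to reduce "John loves Mary" to the sentence type s:

```python
# Toy category encoding:
#   an atomic string ("np", "s"), or
#   ("/",  result, arg)  -- looks for arg on its RIGHT, yields result
#   ("\\", result, arg)  -- looks for arg on its LEFT,  yields result
# In Lambek notation the transitive verb "loves" gets the type (np\s)/np.

def forward(left, right):
    """Combine X-over-Y with a following Y to give X."""
    if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
        return left[1]
    return None

def backward(left, right):
    """Combine a preceding Y with a category wanting Y on its left."""
    if isinstance(right, tuple) and right[0] == "\\" and right[2] == left:
        return right[1]
    return None

NP = "np"
LOVES = ("/", ("\\", "s", NP), NP)    # (np\s)/np

vp = forward(LOVES, NP)               # "loves" + "Mary"   ->  np\s
print(backward(NP, vp))               # "John" + vp        ->  prints 's'
```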