Thursday, September 8, 2005
4:15 pm
B17 Upson Hall

Computer Science
Colloquium
Fall 2005


Lillian Lee
Cornell University

Sense and Sensibility: Automatically Analyzing Subject and Sentiment in Human-Authored Texts


This talk addresses issues in document classification, which we construe broadly to mean the grouping together of texts that have similar content. While this task is presumably easier than explicitly determining document content, it has great utility in practice and is still plenty hard.

One problem currently attracting a great deal of attention is that of classifying documents by their overall *sentiment*: for example, one might want to determine whether a movie review is "thumbs up" or "thumbs down". Sentiment analysis has empirically been shown to be resistant to traditional text-categorization approaches, and involves more subtlety than one might at first imagine. We demonstrate that new learning techniques based on finding minimum cuts in graphs yield state-of-the-art results even when no explicit linguistic information is used.

We also discuss the long-standing problem of representing topical content. In particular, we present an analysis of the widely used SVD-based Latent Semantic Indexing (LSI) algorithm; this analysis motivates an intuitive generalization that provides striking empirical improvements over LSI.

Portions of this talk describe joint work with Rie Kubota Ando, Carmel Domshlak, Oren Kurland, and Bo Pang.