Movie Review Data
This page is a distribution site for movie-review data for use in
sentiment-analysis experiments. Available are collections of
movie-review documents labeled with respect to their overall sentiment polarity
(positive or negative) or subjective rating (e.g., "two and a
half stars") and sentences labeled with respect to their
subjectivity status (subjective or objective) or polarity. These data sets were
introduced in the following papers:
Until April 2012 (but no longer), we maintained a list of other papers using our data for the purpose of facilitating comparison of results.
Please cite the version number of the
dataset you used in any publications, in order to facilitate
comparison of results. Thank you.
Sentiment polarity datasets
- polarity dataset v2.0 (3.0Mb) (includes README v2.0): 1000 positive and 1000 negative processed reviews.
Introduced in Pang/Lee ACL 2004. Released June 2004.
- Pool of 27886 unprocessed html files (81.1Mb) from which the polarity dataset v2.0 was derived.
(This file is identical to movie.zip from data release v1.0.)
- sentence polarity dataset v1.0 (includes sentence polarity dataset README v1.0): 5331 positive and 5331 negative processed sentences / snippets.
Introduced in Pang/Lee ACL 2005. Released July 2005.
- archive:
- polarity dataset v1.0 (2.8Mb) (includes README): 700 positive and 700 negative processed reviews. Released July 2002.
- polarity dataset v1.1 (2.2Mb) (includes README.1.1): approximately 700 positive and 700 negative processed reviews. Released November 2002. This alternative version was created by Nathan Treloar, who removed a few non-English/incomplete reviews and changed some of the labels (judging some polarities to be different from the original author's rating). The complete list of changes made to v1.1 can be found in diff.txt.
- polarity dataset v0.9 (2.8Mb) (includes a README): 700 positive and 700 negative processed reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002. Please read the "Rating Information - WARNING" section of the README.
- movie.zip (81.1Mb): all html files we collected from the IMDb archive.
Sentiment scale datasets
- scale dataset v1.0 (includes scale data README v1.0):
a collection of documents whose labels come from a
rating scale. Introduced in Pang/Lee ACL 2005. Released July 2005.
- Sep 30, 2009: Yanir Seroussi points
out that due to some misformatting in the raw html files, six reviews
are misattributed to Dennis Schwartz (29411 should be Max Messier,
29412 should be Norm Schrager, 29418 should be Steve Rhodes, 29419
should be Blake French,
29420 should be Pete Croatto, 29422 should be Rachel Gordon) and one (23982) is blank.
- original reviews for scale dataset v1.0 (includes scale data README v1.0): original reviews from which the subjective extracts in scale dataset v1.0 were extracted.
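The attribution corrections reported by Yanir Seroussi above can be applied programmatically when loading the scale dataset. The following is a minimal sketch, not part of the dataset distribution. It assumes the reviews have already been read into (review_id, author, text) tuples with integer IDs; that record layout and the names CORRECTED_AUTHORS, BLANK_REVIEW_IDS, and apply_corrections are hypothetical and chosen only for illustration.

```python
# Corrections for scale dataset v1.0 reported by Yanir Seroussi (Sep 30, 2009).
# These six reviews are misattributed to Dennis Schwartz in the release.
CORRECTED_AUTHORS = {
    29411: "Max Messier",
    29412: "Norm Schrager",
    29418: "Steve Rhodes",
    29419: "Blake French",
    29420: "Pete Croatto",
    29422: "Rachel Gordon",
}

# Review reported as blank in the same note.
BLANK_REVIEW_IDS = {23982}


def apply_corrections(records):
    """Return a corrected copy of records.

    records is an iterable of (review_id, author, text) tuples.
    This layout is an assumption, not the format of the released files.
    """
    fixed = []
    for review_id, author, text in records:
        if review_id in BLANK_REVIEW_IDS:
            continue  # drop the blank review
        # Use the corrected author if one is listed, else keep the original.
        author = CORRECTED_AUTHORS.get(review_id, author)
        fixed.append((review_id, author, text))
    return fixed
```

Whether to drop the blank review or keep it with empty text depends on the experiment; the sketch drops it. If you apply corrections like these, say so alongside the dataset version number when reporting results.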
Subjectivity datasets
- subjectivity dataset v1.0 (508K) (includes subjectivity README v1.0): 5000 subjective and 5000 objective processed sentences. Introduced in Pang/Lee ACL 2004. Released June 2004.
- Pool of unprocessed source documents (9.3Mb) from which the sentences in the subjectivity dataset v1.0 were extracted. Note: On April 2, 2012, we replaced the original gzipped tarball with one in which the subjective files are in the correct directory, so that the subjectivity directory is no longer empty. In the original tarball the subjective files were mistakenly placed in the wrong directory, although they were distinguishable by their different naming scheme.
The creation of this website is based upon work supported in part by
the National Science Foundation (NSF) under grant nos. ITR/IM
IIS-0081334, IIS-0329064, CCR-0122581, and BES-0329549; SRI
International under subcontract no. 03-000211 on their project funded
by the Department of the Interior, National Business Center; a Cornell
Graduate Fellowship in Cognitive Studies; and an Alfred P. Sloan
Research Fellowship. Any opinions, findings, and conclusions or
recommendations expressed above are those of the authors and do not
necessarily reflect the views of the National Science Foundation or
Sloan Foundation and should
not be interpreted as representing the official policies, either expressed
or implied, of any sponsoring institution, the U.S. government or any other
entity.
If you have any questions or comments regarding this site, please send
email to Bo Pang or Lillian Lee.
NLP at Cornell