Datasets from Some Distributional Similarity Experiments
This page, http://www.cs.cornell.edu/home/llee/data/sim.html, contains the data first introduced in Dagan, Lee, and Pereira (ACL '97) and subsequently used in Dagan, Lee, and Pereira (MLJ '99), Lee (ACL '99), and Lee (AISTATS '01). Please be sure to read this page, especially the notes, before using this data.
If you have any questions or comments regarding this site, want to be notified of updates, or have downloaded and used this data, please contact Lillian Lee.
Data description
The data was derived from verb-object cooccurrence pairs in the 1988
Associated Press newswire involving the 1000 most frequent nouns,
extracted via Church's (1988) and Yarowsky's processing tools. We
split this corpus into an 80% portion and a 20% portion. The 80%
portion (587,833 pairs) served as a training set from which base
probability distributions (and hence similarities) were computed. We
then prepared test sets for the pseudoword disambiguation task, as
follows.
- We first needed to determine which verb pairs would serve as confusion
sets. We simply sorted the verbs by frequency, and created
confusion sets by going down the list two words at a
time. Hence, the two words in each confusion set would generally be
close to the same frequency.
- Next came the creation of the test sets. From the 20% portion of the original data, we discarded the noun-verb pairs that also appeared in the 80% training portion (our work focused on estimation for unseen events). Then, we split the remaining pairs into five partitions, and replaced each noun-verb pair (N,V) with a noun-verb-verb triple (N,V,V'), where {V,V'} was one of our confusion sets. The task for the language model under evaluation was to determine which of (N,V) or (N,V') was the original cooccurrence. Observe that by construction the first verb was always the correct answer with respect to this task (see note below), and the two alternatives would generally have similar frequencies. (A sketch of this construction appears after this list.)
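
Below is a minimal Python sketch of this construction, reconstructed from the description above rather than taken from the original extraction scripts; the variable names, and details such as how ties in verb frequency are broken, are our own assumptions.

from collections import Counter

def build_confusion_sets(training_pairs):
    # training_pairs: iterable of (noun, verb, count) cooccurrences.
    # Sort verbs by frequency and walk down the list two at a time, so the
    # two members of each confusion set have roughly equal frequency.
    verb_freq = Counter()
    for _, verb, count in training_pairs:
        verb_freq[verb] += count
    ranked = [v for v, _ in verb_freq.most_common()]
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]

def make_test_triples(held_out_pairs, training_pairs, confusion_sets):
    # Discard held-out (N,V) pairs that were seen in training, then attach
    # the confusable alternative V' to form (N, V, V') triples; the original
    # verb V always comes first.
    seen = {(n, v) for n, v, _ in training_pairs}
    partner = {}
    for v1, v2 in confusion_sets:
        partner[v1], partner[v2] = v2, v1
    triples = []
    for noun, verb, count in held_out_pairs:
        if (noun, verb) not in seen and verb in partner:
            triples.append((noun, verb, partner[verb], count))
    return triples
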
Important notes
- In the test sets, there is no guarantee that (N,V') did not
occur in the data (either in the 80% or the 20%), and in fact it is
possible that (N,V') occurred more times than (N,V). That
is, the evaluation task measured whether the method could recover the
original verb, not whether the method could select the more likely
verb to take the noun as object.
- The results of the cited papers are averages over the five test sets, where for each set the other four served as parameter-tuning data; the evaluation loop is sketched after these notes. But the 80% training set was the same for all five test runs, so this was not standard cross-validation. The reason is that we view the computation of similarity as potentially divorced from the task of estimating probabilities based on the similarities (see Lee ACL '99 for further discussion). So the similarity training data, consisting of noun-verb pairs, was separated from the task-specific parameter-tuning data, consisting of noun-verb-alt_verb triples.
- The datasets for the experiments of Lee and Pereira (ACL '99), which were stored at AT&T, are not currently available. For the record, we recommend considering the experimental setup described in that paper. The differences include a dataset creation technique more closely mirroring standard cross-validation, no restriction to the most frequent nouns, and guarantees that in the test data the alternative verb was indeed (empirically) less likely to co-occur with the noun than the original verb.
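
To make the evaluation protocol in the second note concrete, here is a minimal Python sketch of the five runs; it is ours, not the original evaluation code, and fit_similarities, tune, and evaluate are placeholders for whatever model is being tested.

def run_protocol(train_pairs, partitions, fit_similarities, tune, evaluate):
    # partitions: the five test sets of (noun, verb, alt_verb, count) triples.
    sims = fit_similarities(train_pairs)    # the same 80% training set for every run
    scores = []
    for i, test_set in enumerate(partitions):
        # the other four partitions serve as parameter-tuning data for this run
        tuning = [t for j, part in enumerate(partitions) if j != i for t in part]
        params = tune(sims, tuning)
        scores.append(evaluate(sims, params, test_set))
    return sum(scores) / len(scores)        # reported results average the five runs
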
File format and download
All files are ASCII.
The training set train consists of lines of the form "count noun verb". The tuning/test sets test1, ..., test5 consist of lines of the form "noun verb alt-verb count", where count is the number of times (noun,verb) occurred in that partition.
V1.1, August 5, 2002: gzipped tarball simdata.tar.gz (740K), a directory containing 7 files; the largest file is train.gz (3M unzipped), and the smallest is README (a text version of this webpage).
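
The following is a minimal Python sketch for reading the released files under the format above. The file names come from the tarball, but the maximum-likelihood estimates and the count-weighted accuracy are illustrative choices of ours, not the models or error measures of the cited papers.

from collections import defaultdict

def read_train(path="train"):
    # Lines are "count noun verb"; return maximum-likelihood estimates P(verb | noun).
    counts = defaultdict(lambda: defaultdict(int))
    with open(path) as f:
        for line in f:
            count, noun, verb = line.split()
            counts[noun][verb] += int(count)
    probs = {}
    for noun, verbs in counts.items():
        total = sum(verbs.values())
        probs[noun] = {v: c / total for v, c in verbs.items()}
    return probs

def read_test(path):
    # Lines are "noun verb alt-verb count"; the first verb is always the original one.
    with open(path) as f:
        return [(noun, verb, alt, int(count))
                for noun, verb, alt, count in (line.split() for line in f)]

def accuracy(triples, prefers_original):
    # prefers_original(noun, verb, alt_verb) -> True if the model picks the original verb.
    # Weighting each triple by its count is one plausible convention.
    right = total = 0
    for noun, verb, alt, count in triples:
        total += count
        if prefers_original(noun, verb, alt):
            right += count
    return right / total

For example, read_test("test1") through read_test("test5") load the five partitions.
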
References
@inproceedings{Church:88a,
  author = {Kenneth Church},
  title = {A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text},
  booktitle = {Proceedings of the Second Conference on Applied Natural Language Processing},
  pages = {136--143},
  year = 1988
}
@inproceedings{Dagan+Lee+Pereira:97a,
  author = {Ido Dagan and Lillian Lee and Fernando Pereira},
  title = {Similarity-Based Methods for Word Sense Disambiguation},
  booktitle = {35th Annual Meeting of the ACL},
  pages = {56--63},
  year = 1997
}
@inproceedings{Lee:99a,
  author = {Lillian Lee},
  title = {Measures of Distributional Similarity},
  booktitle = {37th Annual Meeting of the Association for Computational Linguistics},
  pages = {25--32},
  year = 1999
}
@inproceedings{Lee+Pereira:99a,
  author = {Lillian Lee and Fernando Pereira},
  title = {Distributional similarity models: Clustering vs. nearest neighbors},
  booktitle = {37th Annual Meeting of the Association for Computational Linguistics},
  pages = {33--40},
  year = 1999
}
@article{Dagan+Lee+Pereira:99a,
  author = {Ido Dagan and Lillian Lee and Fernando Pereira},
  title = {Similarity-Based Models of Cooccurrence Probabilities},
  journal = {Machine Learning},
  volume = {34},
  number = {1--3},
  pages = {43--69},
  year = 1999
}
@inproceedings{Lee:01a,
  author = {Lillian Lee},
  title = {On the Effectiveness of the Skew Divergence for Statistical Language Analysis},
  booktitle = {Artificial Intelligence and Statistics 2001},
  pages = {65--72},
  year = 2001
}