Datasets from Some Distributional Similarity Experiments
This page, http://www.cs.cornell.edu/home/llee/data/sim.html, contains the data first introduced in Dagan, Lee, and Pereira (ACL '97) and subsequently used in Dagan, Lee, and Pereira (MLJ '99), Lee (ACL '99), and Lee (AISTATS '01). Please be sure to read this page, especially the notes, before using this data.
If you have any questions or comments regarding this site, want to be notified of updates, or have downloaded and used this data, please contact Lillian Lee.
Data description
The data was derived from verb-object cooccurrence pairs in the 1988
Associated Press newswire involving the 1000 most frequent nouns,
extracted via Church's (1988) and Yarowsky's processing tools. We
split this corpus into an 80% portion and a 20% portion. The 80%
portion (587,833 pairs) served as a training set from which base
probability distributions (and hence similarities) were computed. We
then prepared test sets for the pseudoword disambiguation task, as
follows.
- We first needed to determine which verb pairs would serve as confusion
sets. We simply sorted the verbs by frequency, and created
confusion sets by going down the list two words at a
time. Hence, the two words in each confusion set would generally be
close to the same frequency.
- Next came the creation of the test sets. From the 20% portion of the original data, we discarded the noun-verb pairs that also appeared in the 80% training portion (our work focused on estimation for unseen events). Then, we split the remaining pairs into five partitions, and replaced each noun-verb pair (N,V) with a noun-verb-verb triple (N,V,V'), where {V,V'} was one of our confusion sets. The task for the language model under evaluation was to determine which of (N,V) or (N,V') was the original cooccurrence. Observe that by construction the first verb was always the correct answer with respect to this task (see note below), and the two alternatives would generally have similar frequencies. (A sketch of this construction appears after this list.)
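
Below is a minimal Python sketch of this construction, reconstructed from the description above rather than taken from the original extraction scripts; the variable names, and details such as how ties in verb frequency are broken, are our own assumptions.

from collections import Counter

def build_confusion_sets(training_pairs):
    # training_pairs: iterable of (noun, verb, count) cooccurrences.
    # Sort verbs by frequency and walk down the list two at a time, so the
    # two members of each confusion set have roughly equal frequency.
    verb_freq = Counter()
    for _, verb, count in training_pairs:
        verb_freq[verb] += count
    ranked = [v for v, _ in verb_freq.most_common()]
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]

def make_test_triples(held_out_pairs, training_pairs, confusion_sets):
    # Discard held-out (N,V) pairs that were seen in training, then attach
    # the confusable alternative V' to form (N, V, V') triples; the original
    # verb V always comes first.
    seen = {(n, v) for n, v, _ in training_pairs}
    partner = {}
    for v1, v2 in confusion_sets:
        partner[v1], partner[v2] = v2, v1
    triples = []
    for noun, verb, count in held_out_pairs:
        if (noun, verb) not in seen and verb in partner:
            triples.append((noun, verb, partner[verb], count))
    return triples
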
Important notes
- In the test sets, there is no guarantee that (N,V') did not
occur in the data (either in the 80% or the 20%), and in fact it is
possible that (N,V') occurred more times than (N,V). That
is, the evaluation task measured whether the method could recover the
original verb, not whether the method could select the more likely
verb to take the noun as object.
- The results of the cited papers are averages over the five test sets, where for each set the other four served as parameter-tuning data; the evaluation loop is sketched after these notes. But the 80% training set was the same for all five test runs, so this was not standard cross-validation. The reason is that we view the computation of similarity as potentially divorced from the task of estimating probabilities based on the similarities (see Lee ACL '99 for further discussion). So the similarity training data, consisting of noun-verb pairs, was separated from the task-specific parameter-tuning data, consisting of noun-verb-alt_verb triples.
- The datasets for the experiments of Lee and Pereira (ACL '99), which were stored at AT&T, are not currently available. For the record, we recommend considering the experimental setup described in that paper. The differences include a dataset creation technique more closely mirroring standard cross-validation, no restriction to the most frequent nouns, and guarantees that in the test data the alternative verb was indeed (empirically) less likely to co-occur with the noun than the original verb.
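
To make the evaluation protocol in the second note concrete, here is a minimal Python sketch of the five runs; it is ours, not the original evaluation code, and fit_similarities, tune, and evaluate are placeholders for whatever model is being tested.

def run_protocol(train_pairs, partitions, fit_similarities, tune, evaluate):
    # partitions: the five test sets of (noun, verb, alt_verb, count) triples.
    sims = fit_similarities(train_pairs)    # the same 80% training set for every run
    scores = []
    for i, test_set in enumerate(partitions):
        # the other four partitions serve as parameter-tuning data for this run
        tuning = [t for j, part in enumerate(partitions) if j != i for t in part]
        params = tune(sims, tuning)
        scores.append(evaluate(sims, params, test_set))
    return sum(scores) / len(scores)        # reported results average the five runs
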
File format and download
All files are ASCII.
The training set train consists of lines of the form "count noun verb". The tuning/test sets test1, ..., test5 consist of lines of the form "noun verb alt-verb count", where count is the number of times (noun,verb) occurred in that partition.
V1.1, August 5, 2002: gzipped tarball simdata.tar.gz (740K), a directory containing 7 files; the largest file is train.gz (3M unzipped), and the smallest is README (a text version of this webpage).
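
The following is a minimal Python sketch for reading the released files under the format above. The file names come from the tarball, but the maximum-likelihood estimates and the count-weighted accuracy are illustrative choices of ours, not the models or error measures of the cited papers.

from collections import defaultdict

def read_train(path="train"):
    # Lines are "count noun verb"; return maximum-likelihood estimates P(verb | noun).
    counts = defaultdict(lambda: defaultdict(int))
    with open(path) as f:
        for line in f:
            count, noun, verb = line.split()
            counts[noun][verb] += int(count)
    probs = {}
    for noun, verbs in counts.items():
        total = sum(verbs.values())
        probs[noun] = {v: c / total for v, c in verbs.items()}
    return probs

def read_test(path):
    # Lines are "noun verb alt-verb count"; the first verb is always the original one.
    with open(path) as f:
        return [(noun, verb, alt, int(count))
                for noun, verb, alt, count in (line.split() for line in f)]

def accuracy(triples, prefers_original):
    # prefers_original(noun, verb, alt_verb) -> True if the model picks the original verb.
    # Weighting each triple by its count is one plausible convention.
    right = total = 0
    for noun, verb, alt, count in triples:
        total += count
        if prefers_original(noun, verb, alt):
            right += count
    return right / total

For example, read_test("test1") through read_test("test5") load the five partitions.
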
References
@inproceedings{Church:88a,
  author = {Kenneth Church},
  title = {A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text},
  booktitle = {Proceedings of the Second Conference on Applied Natural Language Processing},
  pages = {136--143},
  year = 1988
}
@inproceedings{Dagan+Lee+Pereira:97a,
  author = {Ido Dagan and Lillian Lee and Fernando Pereira},
  title = {Similarity-Based Methods for Word Sense Disambiguation},
  booktitle = {35th Annual Meeting of the ACL},
  pages = {56--63},
  year = 1997
}
@inproceedings{Lee:99a,
  author = {Lillian Lee},
  title = {Measures of Distributional Similarity},
  booktitle = {37th Annual Meeting of the Association for Computational Linguistics},
  pages = {25--32},
  year = 1999
}
@inproceedings{Lee+Pereira:99a,
  author = {Lillian Lee and Fernando Pereira},
  title = {Distributional similarity models: Clustering vs. nearest neighbors},
  booktitle = {37th Annual Meeting of the Association for Computational Linguistics},
  pages = {33--40},
  year = 1999
}
@article{Dagan+Lee+Pereira:99a,
  author = {Ido Dagan and Lillian Lee and Fernando Pereira},
  title = {Similarity-Based Models of Cooccurrence Probabilities},
  journal = {Machine Learning},
  volume = {34},
  number = {1--3},
  pages = {43--69},
  year = 1999
}
@inproceedings{Lee:01a,
  author = {Lillian Lee},
  title = {On the Effectiveness of the Skew Divergence for Statistical Language Analysis},
  booktitle = {Artificial Intelligence and Statistics 2001},
  pages = {65--72},
  year = 2001
}