Sept 4, 2003: The datasets available for public download have been
finalized.
I. Citation Prediction Task
Available for contestants:
-
The LaTeX sources of all papers in the hep-th portion of the arXiv until May 1,
2003 are available for download. Each paper is identified by a unique
arXiv id.
There are approximately 29,000 hep-th papers totaling 1.7 GB of data. The papers
have been compressed to about 500 MB and divided into separate years for
downloading.
-
The abstracts for all the hep-th papers are available as a hep-th
abstracts tarball.
-
The SLAC dates for each hep-th paper are available as a
hep-th slacdates tarball.
-
The SLAC dates are stored as a sorted two-column list, where the left column
is the paper's arXiv id and the right column is the SLAC date:
[arxiv id] [date in YYYY-MM-DD format]
-
The citation graph of the hep-th portion of the arXiv is available as a
hep-th citations tarball.
-
The citations are stored as a sorted two-column list, where the left column is
the arXiv id of the citing paper and the right column is the arXiv id of the
cited paper:
[paper cited from] [paper cited to]
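The two-column formats described above can be read with a few lines of Python. The following is only a minimal sketch: the function names and the sample rows are made up for illustration and are not taken from the actual files. It assumes the columns are separated by whitespace, and it keeps arXiv ids as strings so that any leading zeros are preserved.

```python
from collections import Counter
from datetime import date


def parse_slac_dates(lines):
    """Map each arXiv id to its SLAC date from '[arxiv id] [YYYY-MM-DD]' lines."""
    dates = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        arxiv_id, raw_date = parts
        dates[arxiv_id] = date.fromisoformat(raw_date)
    return dates


def parse_citations(lines):
    """Return (citing id, cited id) pairs from '[paper cited from] [paper cited to]' lines."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            edges.append((parts[0], parts[1]))
    return edges


def citation_counts(edges):
    """Count how many times each paper is cited (its in-degree in the graph)."""
    return Counter(cited for _citing, cited in edges)


# Made-up sample rows, for illustration only; not taken from the real files.
sample_dates = ["0001001 2000-01-04", "0001002 2000-01-05"]
sample_edges = ["0001002 0001001", "0001003 0001001"]

dates = parse_slac_dates(sample_dates)
counts = citation_counts(parse_citations(sample_edges))
```

With real data, the same functions could be passed an open file object in place of the sample lists, since both iterate line by line.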
II. Data Cleaning Task
For this task the LaTeX sources of the hep-ph papers as of March 1, 2003 are
available for download. A random paper id between 1 and 100,000 has been
assigned to each paper. Also, a small subset of papers was converted from
PDF/PS and appears only as plain text.
There are over 35,000 hep-ph papers totaling 1.8 GB of data, so the download has
been broken into 10 separate gzipped tarballs of 50 MB each, plus 1 extra tarball
with the plain text papers.
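Since the download is split across several archives, it may be convenient to unpack them all in one pass. The sketch below assumes the archives have been saved into a single local directory and end in `.tar.gz`; the function name, directory names, and the demo file `part01.tar.gz` are hypothetical, because the real file names are not listed here. It builds a throwaway archive first so that it can run without the actual data.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_all(archive_dir, dest_dir):
    """Extract every .tar.gz archive found in archive_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        # Only unpack archives from a source you trust.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        extracted.append(archive.name)
    return extracted


# Demo on a throwaway archive (hypothetical names, not the real downloads).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "src"
    src.mkdir()
    (src / "paper.tex").write_text("\\documentclass{article}\n")

    downloads = tmp / "downloads"
    downloads.mkdir()
    with tarfile.open(downloads / "part01.tar.gz", "w:gz") as tar:
        tar.add(src / "paper.tex", arcname="paper.tex")

    done = extract_all(downloads, tmp / "papers")
    extracted_ok = (tmp / "papers" / "paper.tex").exists()
```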
Sept 4, 2003: The corresponding citation graph for hep-ph used as the
evaluation criterion is now available here.
III. Download Estimation Task
Available for this task are the same datasets as for Task 1, plus:
-
For each paper that was published in one of the six listed months (2/2000,
3/2000, 2/2001, 4/2001, 3/2002, 4/2002), the download logs from its first 60
days on the arXiv are provided.
Update Sept 4, 2003: The download data is no longer publicly available.
IV. Open Task
Contestants can use any of the hep-th data from Tasks 1 or 3.