General Questions
Question (June 19, 2003): Is the use of external data allowed for Tasks II
and III?
Revised Answer: For Tasks II and III, our initial policy was to prohibit
the use of external data. The intent of this was to prevent KDD Cup
participants from designing solutions in which they explicitly make use of
on-line resources that might be construed as containing partial solutions to
the specified tasks.
However, after considerable communication with Cup participants, we feel it is
necessary to resolve the policy more finely, in a way that still preserves its
initial intent:
(1) The use of any bibliographic data, task-specific external data, or any other
external resources specific to the task of indexing scientific literature, is
prohibited.
(2) However, the use of generic lexical resources -- that is, general resources
about the English language, such as WordNet, general-purpose dictionaries and
thesauri, and lists of stop-words -- is permitted.
(3) If you are making use of external resources other than the examples
specifically mentioned in (2), you must verify their eligibility with the KDD
Cup chairs as soon as possible, and in no case later than July 1.
Task III (Download Estimation)
Question (June 9, 2003): Are we allowed to use all the data available in
Task 1 or just the LaTeX sources?
Answer: All of the datasets from Task 1 are available for use for Task 3.
The task description for Task 3 has been updated to clarify this.
Question: Is it acceptable to produce a vector of floating point numbers
(i.e., representing the number of downloads for each paper), or are we required
to output a vector of integers?
Answer: A vector of floating point numbers is acceptable. This has also
been updated in the statement of the task.
Question: The naming convention on the files that seems to correspond
with the month and year of most submissions is the "hep-th arxiv
number." It has no meaning for the purpose of this problem, aside from
being a unique ID. Let me know if this represents a publication date or
something key like that.
Answer: The hep-th/yymmnnn number is simply a sequential accession number
nnn within the year and month yymm in which the paper was submitted to the
hep-th arXiv. It is assigned at the time of hep-th submission.
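
As an illustration of the identifier layout described above, here is a minimal
Python sketch that splits a hep-th/yymmnnn number into year, month, and
sequence number. The function name parse_hep_th_id, the optional "hep-th/"
prefix handling, and the rule that two-digit years of 90 or more map to the
1900s are assumptions made for this example, not part of the task statement.

```python
import re

# Hypothetical helper: matches "hep-th/yymmnnn" or a bare "yymmnnn".
_ID_PATTERN = re.compile(r"^(?:hep-th/)?(\d{2})(\d{2})(\d{3})$")


def parse_hep_th_id(identifier):
    """Return (year, month, sequence_number) for a hep-th accession number."""
    match = _ID_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError("not a yymmnnn identifier: %r" % identifier)
    yy, mm, nnn = (int(group) for group in match.groups())
    if not 1 <= mm <= 12:
        raise ValueError("month out of range in %r" % identifier)
    # Assumption: two-digit years >= 90 belong to the 1900s, others to the 2000s.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return year, mm, nnn
```

The result reflects only when the paper was deposited in hep-th; as the answer
above notes, the number carries no other meaning.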
Question: The date that appears in all the abstracts after "Date:" is the
"arXiv submission date." Is this the date that the paper was submitted for
publication?
Answer: No, this is the date it was deposited in the arXiv. It may have
been submitted for journal publication before or after that date, though
typically articles are submitted for publication shortly afterwards.
Question: Revised dates. These only appear in some articles. I didn't see
any connection to how we need to use them.
Answer: Some articles are later replaced with (a series of) revised
versions, some are not. Revisions sometimes involve added references. The
relevant date is usually the earliest date associated with any submission.
Question: SLAC/SPIRES date. Our best estimate for when a paper was
published. I am guessing it is included because it is the date we should use
for determining the date of any given citation. The file containing these dates
should list the arxiv numbers of every article we downloaded.
Answer: The SLAC/SPIRES date is sometimes a mysterious notion. Most
often, it is a date shortly after the above arXiv received date, corresponding
to when SLAC/SPIRES has downloaded the metadata. Sometimes it is a date long
before the arXiv received date, which means that it is a pre-existing record
corresponding to a back submission, e.g. a paper published in the 1980s that
someone has chosen to submit to hep-th during the 1990s for historical or
other purposes. In general, the earliest date associated with any given
submission is typically the relevant one.
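
The "earliest date" rule of thumb above can be sketched in Python as follows.
The function name earliest_date and the convention of passing None for a date
that is missing from a record are assumptions made for this example; the rule
itself is only described as typical, so treat the result as a default rather
than a guarantee.

```python
from datetime import date


def earliest_date(*candidates):
    """Return the earliest of the known dates for one submission.

    Each argument is a datetime.date (e.g. arXiv received date, a revision
    date, the SLAC/SPIRES date) or None when that date is unavailable.
    """
    known = [d for d in candidates if d is not None]
    if not known:
        raise ValueError("no dates available for this submission")
    return min(known)
```

For a back submission, where the SLAC/SPIRES date precedes the arXiv received
date, this picks the SLAC/SPIRES date; in the more common case it picks the
arXiv received date.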
Question: What is the L_1 difference between two vectors X and Y?
Answer: By the L_1 difference we mean the L_1 norm of X - Y, and hence
the sum of the absolute values of the componentwise differences.
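
For concreteness, a minimal Python sketch of the definition above. The
function name l1_difference and the length check are choices made for this
example; the numbers in the usage note are made up and are not task data.

```python
def l1_difference(x, y):
    """Return the L_1 norm of x - y: the sum of |x[i] - y[i]| over all i."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))
```

For example, with X = [1.0, 2.5, 3.0] and Y = [2.0, 2.0, 5.0], the
componentwise absolute differences are 1.0, 0.5, and 2.0, so the L_1
difference is 3.5.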
Task I (Citation Prediction)
Question (May 6, 2003): Is it acceptable to produce a vector of floating
point numbers (i.e., representing partial differences in the number of
citations between time periods), or are we required to output a vector of
integers?
Answer: A vector of floating point numbers is acceptable. This has also
been updated in the statement of the task.