Gloss for Slide 1: OpCit and the Cornell Digital Library Research Group
-----------------------------------------------------------------------

Part I has 4 bullets under Achievements and Shared Activity:

1. Defines the role we chose to play in the OpCit project. You were doing the arXiv analysis; we chose to take a higher-level architectural view of reference linking, with the hope, of course, that the goals would be compatible.

2. We designed the API from the top down, but we always knew that parsing the reference strings at the end of an online paper would have to be done in order to implement the API. For this, Carr's deciter software was the linchpin; our API was implemented on top of it. One of the methods in the API is "getRefList()", which returns an XML file that essentially encapsulates what the deciter parses out of the reference strings. Of course, we added other methods as well, such as "getLinkedText()".

3. Status of the Reference API: it is sufficiently coded to be usable at this point, though buggy -- I would call it alpha status, but usable. There is a lot of promise here. Next, we need to do something about collecting citation information (Southampton is way ahead of us here), continue to evaluate and improve quality, write applications and tools, and handle more online journals.

4. Performance as of the end of 2000 is based on 89 papers analyzed during the second half of 2000. The extracted data was graded (by hand, groan) against objective, quantitative accuracy criteria.

   Reference analysis is done per reference string and averaged over all items (item = an analyzed online paper). The accuracy metric for reference data is the number of elements correctly extracted divided by the total number of elements in the reference string. Elements include: title, each author's name, year of publication, and any URLs included right in the reference string. Note that I am currently ignoring journal name and page number: we are interested in getting our hands on the "work", not the "item". I believe Southampton's work is aimed at recognizing the "item". ASK IF YOU HAVE A QUESTION ABOUT THIS.

   Item analysis is done per item and averaged over all items (of which there were 89 as of this writing). The accuracy metric is again the number of elements correctly extracted divided by the total number of elements. The elements include: title of the item, each author's name, year of publication, the context of each reference, and the average reference accuracy for the references in the item. For this metric, we are running at 82.42%. (A small worked sketch of the calculation appears at the end of these notes.)

Part II has 3 bullets for Implementability:

1. Applications and tools are meant to be built on top of the Reference Linking API. The basic call shown here analyzes the item located at the specified URL. Once the constructor returns, the URL has been fetched and partially analyzed, and the Surrogate object is returned to the caller. If there are problems, the Surrogate is null and error messages can be found on syserr (System.err in Java). With the surrogate in hand, the caller can then invoke methods such as "s.getLinkedText()". (A usage sketch appears at the end of these notes.)

2. The software seems to be portable, as it should be, since it is written in Java. On Microsoft machines, one must use Sun's JDK 1.2. One Java file contains the configuration (i.e., directories and filenames) and must be edited in order to set up the software on a new machine. (A sketch of such a configuration file also appears at the end of these notes.)

3. The jar files include the Soton Harvester (Carr's deciter software), an XML parser, the JTidy conversion tool, and an XSLT processor.
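To make the item-level accuracy metric in Part I, bullet 4 concrete, here is a minimal Java sketch. None of these names (AccuracySketch, ElementCounts, itemAccuracy, averageAccuracy) come from the reference linking code; they are hypothetical, and the numbers are toy data, not the 89-paper evaluation.

    // Hypothetical illustration of "correct elements / total elements,
    // averaged over all items"; names and numbers are made up for the sketch.
    public class AccuracySketch {

        // Per-item tallies: elements extracted correctly vs. elements present.
        static class ElementCounts {
            int correct;
            int total;
            ElementCounts(int correct, int total) {
                this.correct = correct;
                this.total = total;
            }
        }

        // Accuracy for one item: correct elements divided by total elements.
        static double itemAccuracy(ElementCounts c) {
            return (double) c.correct / c.total;
        }

        // Overall figure: per-item accuracies averaged over all items
        // (89 items in the evaluation described in the notes).
        static double averageAccuracy(ElementCounts[] items) {
            double sum = 0.0;
            for (int i = 0; i < items.length; i++) {
                sum += itemAccuracy(items[i]);
            }
            return sum / items.length;
        }

        public static void main(String[] args) {
            ElementCounts[] items = {
                new ElementCounts(8, 10),   // 80% for this item
                new ElementCounts(9, 10),   // 90%
                new ElementCounts(7, 10)    // 70%
            };
            System.out.println(averageAccuracy(items));  // about 0.8
        }
    }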
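A hedged sketch of the "basic call" from Part II, bullet 1. The notes name only Surrogate, getLinkedText() and getRefList(); the package, constructor signature, return types and the stub class below are assumptions added so the sketch compiles on its own. The real class ships in the project's jar files.

    // Usage sketch only; the stub Surrogate stands in for the real class
    // from the reference linking jars, and its details are assumptions.
    public class ReferenceLinkingSketch {

        // Stand-in for the real Surrogate class.
        static class Surrogate {
            private final String url;

            // Per the notes: the constructor fetches the URL and partially
            // analyzes the item before returning.
            Surrogate(String url) {
                this.url = url;
                // ... fetching and partial analysis would happen here ...
            }

            // Item text with its references turned into live links
            // (assumed to be returned as a String).
            String getLinkedText() { return "<linked text for " + url + ">"; }

            // XML encapsulating what the deciter parsed out of the
            // reference strings (again assumed to be a String here).
            String getRefList() { return "<reflist/>"; }
        }

        public static void main(String[] args) {
            String url = "http://example.org/some-online-paper";  // hypothetical item URL

            // The basic call: analyze the item located at the specified URL.
            Surrogate s = new Surrogate(url);

            // The notes say the Surrogate is null on failure, with messages on
            // System.err; a plain Java constructor cannot itself return null,
            // so the real code presumably wraps construction somewhere.  The
            // check is kept here to match the notes.
            if (s == null) {
                System.err.println("Analysis failed; see earlier messages.");
                return;
            }

            System.out.println(s.getLinkedText());
            System.out.println(s.getRefList());
        }
    }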
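Part II, bullet 2 mentions one Java file that holds the configuration (directories and filenames) and must be edited per machine. The file's real name, fields and values are not given in these notes; the sketch below is purely hypothetical and only illustrates the kind of constants that would be edited for a new installation.

    // Hypothetical configuration sketch; the actual file in the software
    // has its own name and fields that are not shown in these notes.
    public class ConfigurationSketch {

        // Directory where fetched items and intermediate files are written.
        public static final String WORK_DIR = "/usr/local/reflink/work";

        // Location of the jar files (Soton Harvester, XML parser, JTidy,
        // XSLT processor).
        public static final String LIB_DIR = "/usr/local/reflink/lib";

        // Example of a filename the software would need at run time.
        public static final String LOG_FILE = WORK_DIR + "/reflink.log";
    }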