Donna Bergmark
Talk Presented at ``Interoperable Data for Scholarly Communication'' at U. of Southampton, July 13, 2001
A year or so ago,
I was charged with developing an API for Reference Linking.
One world view would put this API in the context of Figure 1.
The API provides an interface upon which Reference Linking applications can be built. In turn, the implementation of the API may use one or more toolkits. Our API, for example, uses Les Carr's deciter, as well as a number of XML toolkits. You can think of the API as being a library to make the building of Reference Linking applications easier.
Today's talk is organized into three sections, as follows:
In addition, we built a sample reference linking application based on the API in general and the getLinkedText() method in particular. For a demonstration of this application, go to http://cs-tr.cs.cornell.edu/RefLinkingDemo.
To construct a Surrogate for an online item, you simply provide its URL to the Surrogate constructor. The Surrogate then scans the item for reference linking information and stores this as instance data. Essentially, the Surrogate is a container for data useful for reference linking that item.
The information collected by the Surrogate includes:
Once construction of the Surrogate, along with its instance data, is complete, the Surrogate stands ready for its methods to be invoked. These methods yield XML documents containing reference linking data of various sorts. The save() method can be used to save the Surrogate in persistent storage by writing its instance data out to XML files.
The quality of the data returned by the Surrogate method calls is directly determined by the quality of the data collected by the Surrogate during its construction. How well the data is parsed by our API is the subject of the next section.
We devised a performance metric that could be objectively applied to stored Surrogates, to determine how accurate the instance data in the Surrogate was. This in turn would determine the accuracy of the data supplied to applications using the API.
The primary metric is Item Accuracy. This is the number of elements correctly extracted divided by the total number of reference linking elements present in the item, expressed as a percentage. These elements include the title, year of publication, each author, each context, and each reference. For example, if the paper contains eight contexts, then the analysis should find all eight. Figure 2 shows an example calculation of Item Accuracy for a hypothetical item.
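As a minimal sketch of the arithmetic just described, the following Java computes Item Accuracy from two counts. The class name, method name, and the counts in main are illustrative assumptions; they are not part of the API and are not the values from Figure 2.

```java
/** Illustrative sketch only; this class is not part of the reference linking API. */
final class ItemAccuracySketch {

    /**
     * Item Accuracy as a percentage: elements correctly extracted divided by
     * the elements actually present in the item. The correct count is a double
     * so that a fractional count of correct references can be passed in.
     */
    static double itemAccuracy(double correct, int totalPresent) {
        if (totalPresent <= 0) {
            throw new IllegalArgumentException("totalPresent must be positive");
        }
        if (correct < 0 || correct > totalPresent) {
            throw new IllegalArgumentException("correct must lie between 0 and totalPresent");
        }
        return 100.0 * correct / totalPresent;
    }

    public static void main(String[] args) {
        // Made-up counts: 1 title + 1 year + 2 authors + 8 contexts + 8 references.
        int totalPresent = 1 + 1 + 2 + 8 + 8;
        // Suppose the title, year, both authors, 7 contexts and 6 references were correct.
        double correct = 1 + 1 + 2 + 7 + 6;
        System.out.printf("Item Accuracy: %.1f%%%n", itemAccuracy(correct, totalPresent));
    }
}
```

With these made-up counts, 17 of 20 elements are correct, so the metric is 85 percent.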
In fact, the number of references correctly determined itself involves another metric, the Average Reference Accuracy. This is computed similarly to Item Accuracy: we first determine the total number of elements in the references for this item and the number that were correctly determined. Dividing the correct count by the total gives the Average Reference Accuracy. Multiplying this metric by the number of references yields the number of correct references used in calculating the Item Accuracy.
Figure 3 shows the Average Reference Accuracy calculation for the hypothetical example introduced in Figure 2.
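The Average Reference Accuracy can be sketched in the same way. The Java below pools the element counts over all references of an item and then derives the number of "correct references", on the assumption that this is the average accuracy multiplied by the number of references. All names and the sample counts are hypothetical and do not reproduce Figure 3.

```java
/** Illustrative sketch only; names and sample data are assumptions. */
final class ReferenceAccuracySketch {

    /**
     * Pooled accuracy over all references of one item, as a fraction in [0, 1]:
     * total correctly extracted elements divided by total elements present.
     */
    static double averageReferenceAccuracy(int[] correctPerRef, int[] totalPerRef) {
        if (correctPerRef.length != totalPerRef.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int correct = 0;
        int total = 0;
        for (int i = 0; i < totalPerRef.length; i++) {
            correct += correctPerRef[i];
            total += totalPerRef[i];
        }
        if (total == 0) {
            throw new IllegalArgumentException("references contain no elements");
        }
        return (double) correct / total;
    }

    /** Number of correct references credited toward Item Accuracy (may be fractional). */
    static double correctReferences(int[] correctPerRef, int[] totalPerRef) {
        return averageReferenceAccuracy(correctPerRef, totalPerRef) * totalPerRef.length;
    }

    public static void main(String[] args) {
        // Three made-up references with 4, 5 and 3 elements each.
        int[] total = {4, 5, 3};
        int[] correct = {4, 4, 1};
        // 9 of 12 elements are correct (0.75), so 0.75 * 3 = 2.25 correct references.
        System.out.println(averageReferenceAccuracy(correct, total));
        System.out.println(correctReferences(correct, total));
    }
}
```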
Using this metric, I calculated item accuracies by hand for 66 D-Lib papers. The result follows.
A few items were disastrous, with practically nothing detected. Many were analyzed 100% correctly. The average accuracy was 83.1%, which should be enough for many applications.
Since we also have individual reference accuracies, apart from item accuracy,
one can plot these as well. Below are the individual reference
accuracies for all 504 references found in the 66 D-Lib items.
A more meaningful snapshot of how well Les Carr's routine (slightly modified by Cornell) does on references can be obtained by computing a quartile histogram:
More than half of all the references are parsed with greater than 90% accuracy.
Finally, I will present one of the Surrogate methods, getLinkedText().
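One way such a quartile histogram might be computed is sketched below in Java. The bin boundaries (four equal-width bins, with 100% counted in the top bin) are an assumption, as are the class and method names; the sample values are made up and are not the D-Lib measurements.

```java
import java.util.Arrays;

/** Illustrative sketch only; bin boundaries and names are assumptions. */
final class QuartileHistogramSketch {

    /**
     * Counts accuracies (percentages in 0..100) in four bins:
     * [0,25), [25,50), [50,75) and [75,100].
     */
    static int[] quartileHistogram(double[] accuracies) {
        int[] bins = new int[4];
        for (double a : accuracies) {
            if (a < 0 || a > 100) {
                throw new IllegalArgumentException("accuracy out of range: " + a);
            }
            // Integer division by the bin width; 100% is folded into the top bin.
            int bin = Math.min((int) (a / 25.0), 3);
            bins[bin]++;
        }
        return bins;
    }

    public static void main(String[] args) {
        // Made-up reference accuracies for illustration.
        double[] sample = {0, 24.9, 25, 50, 74, 75, 100, 100};
        System.out.println(Arrays.toString(quartileHistogram(sample)));
    }
}
```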
Like the other calls to the Surrogate API, getLinkedText() returns an XML document. This document contains all the original text of the HTML, converted to XHTML by JTidy (a tool from the World Wide Web Consortium). In addition, each reference within the text is embedded within a <reflink> element, which carries bibliographic data about the referenced work. For example, one of the references in the following context is ``For95'':
Data gathering at the University of Sheffield will also focus on extending previous studies exploring users cognitive styles [For95] [FF93].
If you were to view source in your browser for this sentence, you would see that the reference is contained in a <reflink> element:
Data gathering at the University of Sheffield will also focus on extending previous studies exploring users cognitive styles [<a href="#For95"><reflink ord="8" author="Ford" year= "1995" title="Levels and Types of Mediation in Instructional Systems: An Individual Differences Approach." literal="For95 N. Ford. Levels and Types of Mediation in Instructional Systems: An Individual Differences Approach. International Journal of Human-Computer Studies, 43:241-259, 1995."><url>http://www.idealibrary.com/links/doi/10.1006/ijhc.1995.1043/pdf</url> <url> http://www.idealibrary.com/links/doi/10.1006/ijhc.1995.1043 </url> For95</reflink> </a>] [<a href="#FF93"> <reflink ord="9" author="Ford" year="1993" title="Toward a Cognitive Theory of Information Accessing: An Empirical Study." literal="FF93 N. Ford and R. Ford. Toward a Cognitive Theory of Information Accessing: An Empirical Study. Information Processing and Management, 29(5):569-585, 1993."> FF93</reflink> </a>].
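An application receiving this XML could pull the bibliographic data back out of the <reflink> elements with any XML parser. The following Java is a minimal sketch using the standard DOM parser; the attribute names (author, year, title) are those visible in the example above, while the class name, the method name, and the shortened sample input are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Illustrative sketch only; not part of the reference linking API. */
final class ReflinkReaderSketch {

    /** Returns "author (year): title" for every reflink element in well-formed XML. */
    static List<String> summarizeReflinks(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("could not parse XML", ex);
        }
        NodeList links = doc.getElementsByTagName("reflink");
        List<String> summaries = new ArrayList<>();
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            summaries.add(link.getAttribute("author")
                    + " (" + link.getAttribute("year") + "): "
                    + link.getAttribute("title"));
        }
        return summaries;
    }

    public static void main(String[] args) {
        // Shortened, made-up fragment in the shape of the example above.
        String xml = "<p>[<a href=\"#For95\"><reflink ord=\"8\" author=\"Ford\" "
                + "year=\"1995\" title=\"Levels and Types of Mediation\">"
                + "<url>http://example.org/paper</url>For95</reflink></a>]</p>";
        for (String summary : summarizeReflinks(xml)) {
            System.out.println(summary);
        }
    }
}
```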
All of the bibliographic information was pulled out of the reference as it appears at the end of the paper. The URLs are currently obtained in two ways. First, if the referenced work appears in a repository that has been previously analyzed, then we know one of its URLs. Second, the author of this paper may have supplied a URL along with the reference. A future research project would be to locate URLs and add that information to stored Surrogates. Alternatively, the <reflink> element can be translated into an OpenURL and then used to locate the proper copy of the work.
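To illustrate the last point, the sketch below shows how the attributes of a <reflink> element might be turned into an OpenURL-style query string. It is an assumption-laden example, not the API's behavior: the query keys (genre, aulast, date, atitle) follow the style of the early OpenURL drafts, the genre value is a guess since <reflink> carries no such field, and the resolver address is a placeholder.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; resolver address and genre value are placeholders. */
final class OpenUrlSketch {

    /** Builds a resolver URL from the author, year and title carried by a reflink. */
    static String toOpenUrl(String resolverBase, String author, String year, String title) {
        return resolverBase
                + "?genre=article"            // assumed; reflink has no genre attribute
                + "&aulast=" + encode(author)
                + "&date=" + encode(year)
                + "&atitle=" + encode(title);
    }

    /** Form-encodes one query value (spaces become '+'). */
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical resolver; author, year and title come from the For95 example.
        System.out.println(toOpenUrl(
                "http://resolver.example.org/menu",
                "Ford",
                "1995",
                "Levels and Types of Mediation in Instructional Systems"));
    }
}
```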
One of the more challenging aspects of implementing getLinkedText() was to match up the references with the contexts so that the <reflink> could be inserted in the appropriate place in the text.
To achieve this, references in the text are converted into a canonical format (for example, [1-5] is converted to [1][2] ...). Then the tags on the reference literals are matched up with these references.
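The range expansion mentioned above can be sketched with a regular expression. This Java example only covers the numeric range case ([1-5] becoming [1][2][3][4][5]); it is not the API's actual canonicalization routine, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; handles numeric ranges such as [1-5] and nothing else. */
final class CitationCanonicalizerSketch {

    // A bracketed range of two short numbers, e.g. [1-5] or [12 - 14].
    private static final Pattern RANGE =
            Pattern.compile("\\[(\\d{1,4})\\s*-\\s*(\\d{1,4})\\]");

    /** Rewrites every [lo-hi] in the text as [lo][lo+1]...[hi]. */
    static String expandRanges(String text) {
        return RANGE.matcher(text).replaceAll(match -> {
            int lo = Integer.parseInt(match.group(1));
            int hi = Integer.parseInt(match.group(2));
            if (lo > hi) {
                // Not a sensible range; leave the original text untouched.
                return Matcher.quoteReplacement(match.group());
            }
            StringBuilder expanded = new StringBuilder();
            for (int n = lo; n <= hi; n++) {
                expanded.append('[').append(n).append(']');
            }
            return Matcher.quoteReplacement(expanded.toString());
        });
    }

    public static void main(String[] args) {
        System.out.println(expandRanges("as shown in earlier studies [1-5] and in [7]"));
    }
}
```

Once every citation is in the single-tag form, each bracketed tag can be compared directly with the tag that begins a reference literal.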
We have built one demonstration application, meant for use in a repository search system. If the user is reading the full text of a document, it is possible to click on references in the text. This causes a dialogue window to pop up, with choices for retrieving the full text of the reference.
This Java application works by (1) building a Surrogate for the text that the user has fetched; (2) replacing the <reflink> elements with JavaScript function calls to a JavaScript routine that brings up the user dialogue in a separate window (the JavaScript routine is embedded in the viewed document); and (3) displaying the edited text to the user. The edited text looks like a normal HTML document, but the references are linked.
You may see the effect by going to the demo mentioned at the beginning and selecting the third version of the paper, the JavaScript version, for viewing. Click on the ``For95'' reference to see the effect of the JavaScript code.