Progress with Reference Linking API

Donna Bergmark

Talk presented at "Interoperable Data for Scholarly Communication" at U. of Southampton, July 13, 2001

Introduction

The process of turning references within online documents into links that take you directly to the full text of the reference is called "Reference Linking".

A year or so ago, I was charged with developing an API for Reference Linking. One world view would put this API in the context of Figure 1.

**Figure 1:** Toolkits, APIs, and Applications

The API provides an interface upon which Reference Linking applications can be built. In turn, the implementation of the API may use one or more toolkits. Our API, for example, uses Les Carr's deciter, as well as a number of XML toolkits. You can think of the API as being a library to make the building of Reference Linking applications easier.

Today's talk is organized into three sections, as follows:

1. First, I will review the nature of our API (which has not changed much since I last presented it here in Southampton).
2. Then I will present a metric for evaluating the API's performance, developed in the past year.
3. Finally, I will present the implementation of the API's getLinkedText() method, also developed in the past year.

In addition we built a sample reference linking application based on the API in general and the getLinkedText() method in particular. For a demonstration of this application, go to http://cs-tr.cs.cornell.edu/RefLinkingDemo.

The API

The API, you will recall, is written in Java and is based on an object-oriented approach whereby each item in an archive or collection is represented by a Surrogate object, which provides various kinds of reference linking information about that item.

To construct a Surrogate for an online item, you simply provide its URL to the Surrogate constructor. The Surrogate then scans the item for reference linking information and stores this as instance data. Essentially, the Surrogate is a container for data useful for reference linking that item.

The information collected by the Surrogate includes:

1. What work is this? An attempt is made to determine the year of publication, the title, and the authors. A handle is constructed from the year, last name of first author, and first 20 characters of the title. This handle is a string which uniquely (we hope) identifies this work.
2. Next, the body is scanned for contexts. A context is a sentence that contains one or more references.
3. Finally, the Reference section is encountered. The Surrogate scans this section, using the deciter to analyze each reference into a Reference object. Thus the Surrogate contains an array of Reference objects that represent its item's references.

Once construction of the Surrogate, along with its instance data, is complete, the Surrogate stands ready for its methods to be invoked. These methods yield XML documents containing reference linking data of various sorts. The save() method can be used to save the Surrogate in persistent storage by writing its instance data out to XML files.

The quality of the data returned by the Surrogate method calls is directly determined by the quality of the data collected by the Surrogate during its construction. How well the data is parsed by our API is the subject of the next section.
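
The handle construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the API's actual code: the class and method names, the ":" separator between the three parts, and the treatment of titles shorter than 20 characters are all hypothetical.

```java
final class HandleSketch {

    /**
     * Builds a handle from the year, the first author's last name, and the
     * first 20 characters of the title. The ":" separator is an assumption;
     * the talk does not say how the three parts are joined.
     */
    static String makeHandle(int year, String firstAuthorLastName, String title) {
        String trimmed = title.trim();
        // Titles shorter than 20 characters are used whole (assumed behaviour).
        String prefix = trimmed.substring(0, Math.min(20, trimmed.length()));
        return year + ":" + firstAuthorLastName + ":" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(makeHandle(1995, "Ford",
                "Levels and Types of Mediation in Instructional Systems"));
    }
}
```

Two works by the same first author in the same year whose titles share their first 20 characters would receive the same handle, which is why uniqueness is only hoped for.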

Evaluation of the API

Since my colleagues at Southampton have done such a nice job of statistically evaluating their reference linking work with arXiv, I thought it would be good to have some idea of how well or poorly the API could extract reference linking data from archive items. I thought this extraction would be quite difficult to do, and many assured us that it would be impossible.

We devised a performance metric that could be objectively applied to stored Surrogates, to determine how accurate the instance data in the Surrogate was. This in turn would determine the accuracy of the data supplied to applications using the API.

The primary metric is Item Accuracy. This is the number of elements correctly extracted divided by the total number of reference linking elements present in the item, expressed as a percentage. These elements include the title, year of publication, each author, each context, and each reference. For example, if the paper contains eight contexts, then the analysis should find all eight. Figure 2 shows an example calculation of Item Accuracy for a hypothetical item.

**Figure 2:** Example of Item Accuracy for a hypothetical item with 2 authors, 8 contexts, and 16 references. Assuming that neither author and an average of 65% of the references were correctly detected, the Item Accuracy is 20 divided by 28, or 71%.
[Table not reproduced. Totals: 28 elements present, 20 correctly extracted, Item Accuracy 71%.]
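
The arithmetic behind Figure 2 can be reproduced with a few lines of Java. This is a sketch, not part of the API: the class and method names are hypothetical, and the per-element breakdown (title, year, and all eight contexts correct, with 65% of 16 references rounded to a whole number) is inferred from the totals in the caption rather than stated in the talk.

```java
final class ItemAccuracySketch {

    /** Item Accuracy as a fraction: elements correctly extracted / elements present. */
    static double itemAccuracy(int correct, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("item has no elements");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Elements present: title, year, 2 authors, 8 contexts, 16 references.
        int total = 1 + 1 + 2 + 8 + 16;
        // Assumed: 65% of the 16 references, rounded to a whole reference.
        int correctRefs = (int) Math.round(0.65 * 16);
        // Assumed breakdown: title, year, no authors, all 8 contexts, plus references.
        int correct = 1 + 1 + 0 + 8 + correctRefs;
        System.out.printf("%d / %d = %.0f%%%n",
                correct, total, 100 * itemAccuracy(correct, total));
    }
}
```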

In fact, the number of references correctly determined itself involves another metric, the Average Reference Accuracy. This is computed similarly to Item Accuracy: we first determine the total number of elements in the references for this item and the number that were correctly determined. Dividing the number correct by the total gives the Average Reference Accuracy. This metric multiplied by the number of references yields the number of correct references used in calculating the Item Accuracy.

Figure 3 shows the Average Reference Accuracy calculation for the hypothetical example introduced in Figure 2.

**Figure 3:** Example of reference accuracy calculation. The elements of interest are: year of publication if present, title if present, each author, contexts for the reference, and each URL, if present. In this example, the total number of elements is 97, 63 of which were correctly determined. This yields an overall average reference accuracy of 65%, the figure used in Figure 2.
[Table not reproduced. Total elements = 97; correct = 63. Average Reference Accuracy, for this item: 65%.]
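
As a worked check of the numbers in Figure 3 and their use in Figure 2 (rounding to a whole number of references is an assumption made here to match the total of 20 correct elements):

```latex
\[
\text{Average Reference Accuracy} = \frac{63}{97} \approx 0.65,
\qquad
0.65 \times 16 = 10.4 \approx 10 \ \text{correct references}.
\]
```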

Using this metric, I calculated item accuracies by hand for 66 D-Lib papers. The results are summarized below.

A few items were disastrous, where practically nothing was detected. Many were analyzed 100% correctly. The average accuracy was 83.1%, which should be enough for many applications.

Since we also have individual reference accuracies, apart from item accuracy, one can plot these as well. Below are the individual reference accuracies for all 504 references found in the 66 D-Lib items.

A more meaningful snapshot of how well Les Carr's routine (slightly modified at Cornell) does on references can be obtained by computing a quartile histogram:

More than half of all the references are parsed with greater than 90% accuracy.

getLinkedText()

Finally I will present one of the Surrogate methods, getLinkedText().

Like the other calls to the Surrogate API, getLinkedText() returns an XML document. This document contains all the original text of the HTML, converted to XHTML by JTidy (a Java port of the World Wide Web Consortium's HTML Tidy tool). In addition, each reference within the text is embedded within a <reflink> element, which carries bibliographic data about the referenced work. For example, one of the references in the following context is "For95":

Data gathering at the University of Sheffield will also focus on extending previous studies exploring users cognitive styles [For95] [FF93].

If you were to view source in your browser for this sentence, you would see that the reference is contained in a <reflink> element:

Data gathering at the University of Sheffield will also focus on extending previous studies exploring users cognitive styles [<a href="#For95"><reflink ord="8" author="Ford" year= "1995" title="Levels and Types of Mediation in Instructional Systems: An Individual Differences Approach." literal="For95 N. Ford. Levels and Types of Mediation in Instructional Systems: An Individual Differences Approach. International Journal of Human-Computer Studies, 43:241-259, 1995."><url>http://www.idealibrary.com/links/doi/10.1006/ijhc.1995.1043/pdf</url> <url> http://www.idealibrary.com/links/doi/10.1006/ijhc.1995.1043 </url> For95</reflink> </a>] [<a href="#FF93"> <reflink ord="9" author="Ford" year="1993" title="Toward a Cognitive Theory of Information Accessing: An Empirical Study." literal="FF93 N. Ford and R. Ford. Toward a Cognitive Theory of Information Accessing: An Empirical Study. Information Processing and Management, 29(5):569-585, 1993."> FF93</reflink> </a>].
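
An application consuming this output could pull the bibliographic data out of the <reflink> elements with any XML parser. The following sketch uses the standard Java DOM API (javax.xml.parsers). It is an illustration only: the class and method names are hypothetical, and the sample string is a shortened version of the fragment above, wrapped in a <p> root element so that it parses on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ReflinkReaderSketch {

    /** Returns one "author (year): title [n URLs]" line per reflink element. */
    static List<String> describeReflinks(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList links = doc.getElementsByTagName("reflink");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                // Count the <url> children carried by this reference.
                int urls = link.getElementsByTagName("url").getLength();
                out.add(link.getAttribute("author") + " (" + link.getAttribute("year")
                        + "): " + link.getAttribute("title") + " [" + urls + " URLs]");
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String sample =
              "<p>[<a href='#For95'><reflink ord='8' author='Ford' year='1995' "
            + "title='Levels and Types of Mediation in Instructional Systems'>"
            + "<url>http://www.idealibrary.com/links/doi/10.1006/ijhc.1995.1043</url>"
            + "For95</reflink></a>]</p>";
        for (String line : describeReflinks(sample)) {
            System.out.println(line);
        }
    }
}
```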

All of the bibliographic information was pulled out of the reference as it appears at the end of the paper. The URLs are currently obtained in two ways. The first is when the referenced work appears in a repository that has been previously analyzed, in which case we know one of its URLs. The second is when the author of this paper supplied a URL along with the reference. A future research project would be to locate URLs and add that information to stored Surrogates. Alternatively, the <reflink> element can be translated into an OpenURL and then used to locate the proper copy of the work.
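
The translation of a <reflink> element into an OpenURL might look like the sketch below. The talk does not describe how this would be done, so everything here is an assumption: the class and method names are hypothetical, the resolver address is a placeholder, and the query keys (genre, aulast, date, atitle) follow the OpenURL 0.1 draft syntax.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class OpenUrlSketch {

    /** URL-encodes one query value. */
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Builds an OpenURL 0.1 style query from reflink attribute values. */
    static String toOpenUrl(String resolverBase, String authorLastName,
                            String year, String title) {
        return resolverBase + "?genre=article"
                + "&aulast=" + enc(authorLastName)
                + "&date=" + enc(year)
                + "&atitle=" + enc(title);
    }

    public static void main(String[] args) {
        // resolver.example.org is a placeholder, not a real resolver.
        System.out.println(toOpenUrl("http://resolver.example.org/openurl",
                "Ford", "1995",
                "Levels and Types of Mediation in Instructional Systems"));
    }
}
```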

One of the more challenging aspects of implementing getLinkedText() was to match up the references with the contexts so that the <reflink> could be inserted in the appropriate place in the text.

To achieve this, references in the text are converted into a canonical format (for example, [1-5] is converted to [1][2] ...). Then the tags on the reference literals are matched up with these references.
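
A minimal sketch of these two steps is given below. It is not the API's implementation: the class and method names are hypothetical, only the bracketed numeric range form mentioned above is handled, and the tag is assumed to be the first whitespace-delimited token of the reference literal (as in "For95 N. Ford. ...").

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CanonicalRefsSketch {

    // Matches a bracketed numeric range such as [1-5].
    private static final Pattern RANGE =
            Pattern.compile("\\[(\\d+)\\s*-\\s*(\\d+)\\]");

    /** Expands each range into individual references: [1-3] becomes [1][2][3]. */
    static String expandRanges(String text) {
        Matcher m = RANGE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int first = Integer.parseInt(m.group(1));
            int last = Integer.parseInt(m.group(2));
            StringBuilder expanded = new StringBuilder();
            for (int n = first; n <= last; n++) {
                expanded.append('[').append(n).append(']');
            }
            // A reversed range such as [5-1] expands to nothing, so keep it as is.
            String replacement =
                    expanded.length() > 0 ? expanded.toString() : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Assumes the tag is the first whitespace-delimited token of the literal. */
    static String tagOf(String referenceLiteral) {
        return referenceLiteral.trim().split("\\s+", 2)[0];
    }

    public static void main(String[] args) {
        System.out.println(expandRanges("as shown in [1-5] and [7]"));
        System.out.println(tagOf("For95 N. Ford. Levels and Types of Mediation."));
    }
}
```

Once both sides are in this form, matching reduces to comparing the bracketed string in the text with the tag taken from each reference literal.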

An Application That Uses getLinkedText()

We have made one demonstration application that is meant for user engines in a repository search system. If the user is reading the full text of a document, it is possible to click on references in the text. This will cause a dialogue window to pop up, with choices for retrieving full text for the reference.

This Java application works by (1) building a Surrogate for the text that the user has fetched; (2) replacing the <reflink> elements with JavaScript function calls to a JavaScript routine that brings up the user dialogue in a separate window (the JavaScript routine is embedded in the viewed document); and (3) displaying the edited text to the user. The edited text looks like a normal HTML document, but the references are linked.
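
Step (2) could be sketched as below. This is a guess at the general shape, not the demonstration's actual code: the class name, the JavaScript function name showRefDialog, and the choice of a <span> with an onclick handler are all hypothetical, and a regular expression is used in place of a full XML parser purely to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReflinkToScriptSketch {

    // Captures the ord attribute and the element content of each reflink.
    private static final Pattern REFLINK = Pattern.compile(
            "<reflink\\b[^>]*\\bord=\"(\\d+)\"[^>]*>(.*?)</reflink>",
            Pattern.DOTALL);

    /** Replaces every reflink element with a clickable span that calls JavaScript. */
    static String linkReferences(String xhtml) {
        Matcher m = REFLINK.matcher(xhtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop the <url> children so that only the visible label remains.
            String label = m.group(2).replaceAll("(?s)<url>.*?</url>", "").trim();
            // showRefDialog is a hypothetical routine embedded in the viewed page.
            String call = "<span class=\"reflink\" onclick=\"showRefDialog("
                    + m.group(1) + ")\">" + label + "</span>";
            m.appendReplacement(out, Matcher.quoteReplacement(call));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "[<a href=\"#For95\"><reflink ord=\"8\" author=\"Ford\" year=\"1995\">"
                + "<url>http://www.idealibrary.com/links/doi/10.1006/ijhc.1995.1043</url>"
                + " For95</reflink></a>]";
        System.out.println(linkReferences(in));
    }
}
```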

You may see the effect by going to the demo mentioned at the beginning and selecting the third version, the JavaScript version, of the paper for viewing. Click on the "For95" reference to see the effect of the JavaScript code.