FEDORA and Reference Linking

BiblioData Object Design 04/10/2000

FEDORA Digital Objects that Support Reference Linking

We have a collection of digital objects, managed within the FEDORA architecture. We wish to add linkages between these objects, where the linkage represents one object making a reference to the other. Such a reference is called a cite-ref.

To extract reference data, we need to add a new behavior to digital objects: a new disseminator type and a servlet that implements it. I propose, at least initially, adding a single disseminator, getRefSection(), which returns a stream of bytes representing a document's reference section, if present.
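As a rough illustration, such a disseminator might look like the sketch below. It is only a sketch under stated assumptions: the names RefSectionDisseminator and PlainTextObject are hypothetical and not part of any FEDORA API, the full text is assumed to be available as plain text, and the reference section is assumed to begin at a line holding only "References" or "Bibliography".

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical disseminator type; not taken from any FEDORA interface.
interface RefSectionDisseminator {
    // Returns the bytes of the reference section, or empty if none is found.
    Optional<byte[]> getRefSection();
}

// Illustrative digital object backed by plain text.
class PlainTextObject implements RefSectionDisseminator {
    // Assumption: the section starts at a line holding only this heading.
    private static final Pattern HEADING =
        Pattern.compile("(?im)^[ \\t]*(references|bibliography)[ \\t]*$");

    private final String fullText;

    PlainTextObject(String fullText) {
        this.fullText = fullText;
    }

    @Override
    public Optional<byte[]> getRefSection() {
        Matcher m = HEADING.matcher(fullText);
        int start = -1;
        while (m.find()) {
            start = m.end(); // keep the last heading found in the text
        }
        if (start < 0) {
            return Optional.empty();
        }
        String section = fullText.substring(start).trim();
        return Optional.of(section.getBytes(StandardCharsets.UTF_8));
    }
}
```

Returning an empty Optional rather than an empty byte array keeps "no reference section" distinct from "an empty reference section", which matches the "if present" wording above.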

Since this is a limited view of what digital objects can represent, I suggest implementing an alternative disseminator at a later point, getReferences(), which would return an octet stream of all the references in the digital object, including hrefs to other objects and even informally worded references, such as "Carl Lagoze's last DLIB paper".

A key problem is building a citation database. While not strictly needed for reference linking, once you have extracted reference data from digital objects, it is tempting to move forward and also try to develop citation lists for each digital object.

Note that we can easily (well, pretty easily) retrieve a list of references, but we would not want to build the list of citations on the fly. Building the citation list on the fly would mean examining every digital object in existence to see if it contains a reference to the object in question. It is far more feasible to process each object once (or occasionally, if necessary) and during that processing, add citations to other digital objects. Such processing is normally done when an object is added to a collection, but for existing collections this can also be done in a one-time pass over the collection.

For new digital objects, there will be a one-time need to look at all the other objects in existence to see if they reference this object. The fact that this could be cumbersome is the main reason why efforts like reference linking involve building a citation database, in which the current object can be used as a hash key to return cite-refs to this object. That is, given a new object, we use the database to find all the cite-refs for which this object is represented in the to field of the cite-ref.

Example

We add a new digital object, d0, which contains only three references, r1, r2, and r3.

If we can find the object corresponding to r1, then we add the cite-ref <d0,r1> to its list of citations. If we cannot find the object, we add the cite-ref <d0,r1> to our database of pending citations. At this point, or at some other time, we analyze this database to resolve many variations of cite-refs to canonical forms of <from,to> for better dissemination and pattern matching.

Similarly, process r2 and r3.

Finally, initialize d0's citation list by seeing if d0 already exists in our pending citation database (i.e. there is a cite-ref that corresponds to <x,d0>). If so, add x to d0's citation list and remove that cite-ref from the pending citation database. Repeat for all such cite-refs in the database. (This is usually implemented by inverted lists on document identifiers, so that we just look up the object and see which other objects have cited it.)
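The steps of this example can be sketched in Java as follows. This is an in-memory illustration only, not a proposed implementation: the class CitationDatabase and its method names are hypothetical, and it assumes that each reference has already been resolved to a canonical object identifier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory citation database; all names are hypothetical.
class CitationDatabase {
    // Citation lists of known objects: object id -> ids of objects citing it.
    private final Map<String, List<String>> citations = new HashMap<>();
    // Pending cite-refs, inverted on the "to" field: target id -> citing ids.
    private final Map<String, List<String>> pending = new HashMap<>();

    // Adds a new object and the identifiers of the objects it references.
    void addObject(String id, List<String> references) {
        for (String ref : references) {
            List<String> known = citations.get(ref);
            if (known != null) {
                known.add(id); // target exists: record cite-ref <id,ref> on it
            } else {
                // target unknown: park cite-ref <id,ref> as pending
                pending.computeIfAbsent(ref, k -> new ArrayList<String>()).add(id);
            }
        }
        // Initialize the citation list from pending cite-refs <x,id>,
        // removing them from the pending database in the same step.
        List<String> waiting = pending.remove(id);
        citations.put(id, waiting != null ? waiting : new ArrayList<String>());
    }

    // Returns a copy of the citation list for a known object (empty if unknown).
    List<String> citationsOf(String id) {
        List<String> list = citations.get(id);
        return list == null ? new ArrayList<String>() : new ArrayList<String>(list);
    }

    // Returns a copy of the citing ids still waiting for the given target.
    List<String> pendingFor(String id) {
        List<String> list = pending.get(id);
        return list == null ? new ArrayList<String>() : new ArrayList<String>(list);
    }
}
```

Keying the pending map on the "to" field is what makes the final step cheap: adding d0 needs one lookup rather than a scan of every object in the collection.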

Refinements

For object-oriented reasons, and for performance reasons, we may want to abstract the bibliographic data away from the digital object, keeping it in a BiblioData object.

For example, rather than ask all digital objects to implement getRefSection(), we could put that method into the BiblioData object and let it work on whatever stream of bytes comes from an object when you ask it for its full text in ASCII.

The contents of a BiblioData object could include the following:

The URN of the digital object for which this is the reference and citation data.
The list of references (which can be a data stream from the base digital object, since references can be extracted on the fly), possibly in a canonical form (e.g. Dublin Core).
The current list of citations from other objects in FEDORA repositories.
The list of cite-refs, which is just a combination of the above two.

The list can go on. For example, we could ask a BiblioData object to give us a list of co-cited objects. This list would be all x such that cite-refs <y,x> and <y,this-object> exist, for some object y.
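The co-citation definition above can be illustrated with a short Java sketch. The types CiteRef and CoCitation are hypothetical helpers, the cite-refs are assumed to be available as a simple list, and the sketch assumes that an object should not appear in its own co-citation list (the definition above does not say either way).

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// A cite-ref <from,to>: "from" makes a reference to "to". Hypothetical type.
class CiteRef {
    final String from;
    final String to;

    CiteRef(String from, String to) {
        this.from = from;
        this.to = to;
    }
}

class CoCitation {
    // Returns every x such that <y,x> and <y,target> both exist for some y.
    // Assumption: the target itself is left out of the result.
    static Set<String> coCited(String target, List<CiteRef> citeRefs) {
        // First pass: collect every y that cites the target.
        Set<String> citers = new TreeSet<String>();
        for (CiteRef c : citeRefs) {
            if (c.to.equals(target)) {
                citers.add(c.from);
            }
        }
        // Second pass: collect everything else those same objects cite.
        Set<String> result = new TreeSet<String>();
        for (CiteRef c : citeRefs) {
            if (citers.contains(c.from) && !c.to.equals(target)) {
                result.add(c.to);
            }
        }
        return result;
    }
}
```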

The collection of BiblioData objects is our citation database. A given BiblioData object is used to disseminate bibliographic information about its digital object. One potential use of this database is for Reference Linking.

Reference Linking

In most general terms, reference linking is the process of displaying a digital object in general or a document in particular with links to the other digital objects to which it refers. It is assumed that digital objects are on-line somewhere, and that they have a URN.

The advantage of the reference-linked display is that the user of the digital object can immediately get a viewable copy of the object being referenced. (The last statement is modulo, of course, authorization firewalls built around third-party collections, but we assume that there will be other FEDORA objects/methods that can get us through that authorization directly to the referenced object.)

How might this work in terms of FEDORA? First, we ask for the reference-linked full-text view of an object, d0. (The disseminator for reference-linked full-text is used from the BiblioData object corresponding to d0, which in turn invokes the full-text disseminator on the FEDORA object accessing d0.)

The Postscript stream, or whatever, that comes in from object d0 is decorated on the fly with reference links. When rendered to the user, these reference links are highlighted. The BiblioData object is able to insert these links because it has a list of all the References in d0, and if there is also a URN for a Reference, it can insert a link into the byte stream being passed on to the user's viewing mechanism.
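To make the decoration step concrete, here is a minimal Java sketch. It makes several assumptions that go beyond this note: the rendition is text or HTML rather than Postscript, in-text reference markers look like "[1]", the URN can be used directly as a link target, and the URN contains no characters that need HTML escaping. The class ReferenceLinker and the example identifiers are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReferenceLinker {
    // Assumption: in-text reference markers are bracketed numbers such as "[12]".
    private static final Pattern MARKER = Pattern.compile("\\[(\\d{1,4})\\]");

    // Wraps each marker that has a known URN in an HTML anchor.
    // Markers without a URN are passed through unchanged.
    static String decorate(String text, Map<Integer, String> urnByNumber) {
        Matcher m = MARKER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String urn = urnByNumber.get(Integer.valueOf(m.group(1)));
            String replacement = (urn == null)
                ? m.group()
                : "<a href=\"" + urn + "\">" + m.group() + "</a>";
            // quoteReplacement keeps '$' and '\' in the URN from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, decorating "See [1] and [2]." with a map that holds a URN only for reference 1 links the first marker and leaves "[2]" as plain text. The text is scanned in a single pass, so inserted anchors are never rescanned.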

But there is more. Since we are also storing citation lists, the user can also ask a digital object (via its BiblioData object) for its list of citations. The BiblioData object can return this list directly, with some citations highlighted if the citing object is in a FEDORA repository somewhere.

Game Plan

The FEDORA'izing of reference linking can proceed in steps. The citation data can be left until later. The immediate goal is to extract the list of references from an object and display them as links. Initially this boils down to extracting the Reference section from the Postscript, DVI, PDF, or HTML version of an online article. This reference section will be parsed and used to build the article's BiblioData object. Each reference will be looked up in the Handle system to see if a URN exists. If so, this is stored along with the reference.
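The parse-and-look-up step described above might be sketched as follows. This is an illustration under assumptions, not a parser for real reference sections: it assumes each entry starts on a new line with a bracketed number such as "[1]", and the Handle system lookup is represented by a hypothetical UrnResolver interface rather than by any actual Handle client call. ReferenceEntry and ReferenceParser are likewise hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a Handle system lookup; hypothetical, returns null if no URN exists.
interface UrnResolver {
    String lookup(String referenceText);
}

// One parsed reference, with its URN when one was found.
class ReferenceEntry {
    final String text;
    final String urn; // null when the lookup found nothing

    ReferenceEntry(String text, String urn) {
        this.text = text;
        this.urn = urn;
    }
}

class ReferenceParser {
    // Splits a reference section into entries and looks up a URN for each.
    // Assumption: every entry begins a line with a marker such as "[12]".
    static List<ReferenceEntry> parse(String refSection, UrnResolver resolver) {
        List<ReferenceEntry> entries = new ArrayList<ReferenceEntry>();
        // Split at each line start that is followed by a bracketed number.
        for (String part : refSection.split("(?m)^[ \\t]*(?=\\[\\d+\\])")) {
            // Join continuation lines and collapse runs of whitespace.
            String text = part.trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                entries.add(new ReferenceEntry(text, resolver.lookup(text)));
            }
        }
        return entries;
    }
}
```

Keeping the resolver behind an interface means the same parsing code can be exercised with a stub during development and pointed at a real lookup service later.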

Standard FEDORA disseminators will already exist for the BiblioData object, since it is a digital object. We would next try to extend one of these disseminators in two ways: (1) be able to ask a BiblioData object for its list of references; (2) be able to ask a BiblioData object for a reference-linked full-text rendition of its object.

Once this much has been done, it would be a good time to step back and evaluate the use of FEDORA in this process. If we feel that the advantages are great, then we proceed to build FEDORA digital objects for all the papers in DLIB and in a number of NCSTRL collections, and for the online ACM collection and the online LANL collections. We should also implement the citation part of the project. Finally, this is incorporated as a reference linking service within the Dienst model for processing online literature.

Incorporating Other Tools

Where do the other neat reference linking tools come into play, such as the SFX button on each NCSTRL portal page? This could boil down to a variety of reference-linked display mechanisms. If you want SFX, ask for that mechanism. If you want direct links, ask for that.

Clearly some of the CiteSeer modules will be useful in analyzing and managing reference data. They will be used "under the covers". This will require invoking Perl scripts from a Java program. The latest version of Perl comes with some Java and Perl integration tools, which we plan to use. We will try Jperl, a Java package that allows running Perl code in Java (most of the CiteSeer tools are written in Perl). [http://www.perl.com/CPAN-local/authors/id/S/SB/SBALA] Also [http://www.ddj.com/ftp/1999/1999_02/jperl.zip/]

Handles and Document Identifiers

How many handles already exist out there? How many are for FEDORA objects? This semester we will be working to integrate handles into FEDORA, so we should use handles here. Note that we can still reference link to objects that are not FEDORA objects, because as long as we can find a URL, we can build a link. In principle, we should be able to link to objects in a wide variety of collections.

One reference target may have a number of URLs, because a reference target is a somewhat virtual object. It could have many instantiations, all equally good. Part of the job of the BiblioData object is to decide which URL to use.