BiblioData Object Design 04/10/2000
In order to extract reference data, we will need to give digital objects a new behavior: a new disseminator type, plus a servlet that implements it. I propose, at least initially, to add a single disseminator, getRefSection(), which returns a stream of bytes representing a document's reference section, if present.
Since this is a limited view of what digital objects can represent, at a later point I suggest implementing an alternative disseminator, getReferences(), which would return an octet stream of all the references in the digital object, including hrefs to other objects, and even informally worded references, such as "Carl Lagoze's last DLIB paper".
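To make the initial getRefSection() behavior concrete, here is a minimal sketch of the extraction step. It assumes the object's full text is available as ASCII bytes and that the reference section starts at a heading line reading "References" or "Bibliography"; both the heading pattern and the function name get_ref_section are illustrative assumptions, not part of FEDORA. Python is used only for brevity.

```python
import re

# Assumed heading convention: a line consisting only of "References" or
# "Bibliography", optionally numbered ("5. References") or followed by a colon.
_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:references|bibliography)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def get_ref_section(full_text: bytes) -> bytes:
    """Return the bytes following the last reference heading, or b"" if none."""
    text = full_text.decode("ascii", errors="replace")
    headings = list(_HEADING.finditer(text))
    if not headings:
        return b""  # the document has no recognizable reference section
    start = headings[-1].end()  # use the last heading; earlier ones may be a table of contents
    return text[start:].strip().encode("ascii", errors="replace")
```

A real implementation would need to cope with formats other than plain text and with many more heading conventions; the sketch only shows the shape of the "bytes in, reference section out" contract.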
A key problem is building a citation database. While not strictly needed for reference linking, once you have extracted reference data from digital objects, it is tempting to move forward and also try to develop citation lists for each digital object.
Note that we can easily (well, pretty easily) retrieve a list of references, but we would not want to build the list of citations on the fly. Building the citation list on the fly would mean examining every digital object in existence to see if it contains a reference to the object in question. It is far more feasible to process each object once (or occasionally, if necessary) and during that processing, add citations to other digital objects. Such processing is normally done when an object is added to a collection, but for existing collections this can also be done in a one-time pass over the collection.
For new digital objects, there will be a one-time need to look at all the other objects in existence to see if they reference this object. The fact that this could be cumbersome is the main reason why efforts like reference linking involve building a citation database, in which the current object can be used as a hash key to return cite-refs to this object. That is, given a new object, we use the database to find all the cite-refs for which this object is represented in the to field of the cite-ref.
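The lookup described above can be sketched as an inverted index keyed on the "to" field of each cite-ref. The names CiteRef and CitationIndex, and the use of plain strings as object identifiers, are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class CiteRef(NamedTuple):
    from_id: str  # identifier of the citing object
    to_id: str    # identifier of the cited object


class CitationIndex:
    """Hypothetical citation database: maps a cited object to its cite-refs."""

    def __init__(self):
        self._by_target = defaultdict(set)

    def add(self, cite_ref):
        # File the cite-ref under the object it points to.
        self._by_target[cite_ref.to_id].add(cite_ref)

    def citations_of(self, object_id):
        # One hash lookup instead of a scan over every object in existence.
        return sorted(self._by_target.get(object_id, ()))
```

For example, after adding CiteRef("d1", "d0") and CiteRef("d2", "d0"), calling citations_of("d0") returns both cite-refs without touching any other object.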
If we can find the object corresponding to r1, then we add the cite-ref <d0,r1> to its list of citations. If we cannot find the object, we add the cite-ref <d0,r1> to our database of pending citations. At this point, or at some other time, we analyze this database to resolve many variations of cite-refs to canonical forms of <from,to> for better dissemination and pattern matching.
Similarly, process r1 and r2.
Finally, initialize d0's citation list by seeing if d0 already exists in our pending citation database (i.e., there is a cite-ref that corresponds to <x,d0>). Add x to d0's citation list. Remove that cite-ref from the pending citation database. Repeat for all cite-refs in the database. (This is usually implemented by inverted lists on document identifiers, so that we just look up the object and see which other objects have cited it.)
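The processing of a new object d0 can be sketched as follows. This is a simplified illustration, not a proposed implementation: it assumes every reference has already been resolved to a canonical identifier (the canonicalization step mentioned above is left out), and the class name CitationStore and its fields are hypothetical.

```python
from collections import defaultdict


class CitationStore:
    """Hypothetical store holding citation lists and pending cite-refs."""

    def __init__(self):
        self.known = set()                 # identifiers of objects processed so far
        self.citations = defaultdict(set)  # object id -> ids of objects that cite it
        self.pending = defaultdict(set)    # unresolved target id -> ids of citing objects

    def ingest(self, obj_id, reference_ids):
        """Process one new object and return its initial citation list."""
        self.known.add(obj_id)
        for ref_id in reference_ids:
            if ref_id in self.known:
                # Target found: add the cite-ref <obj_id, ref_id> to its citations.
                self.citations[ref_id].add(obj_id)
            else:
                # Target not found: park the cite-ref in the pending database.
                self.pending[ref_id].add(obj_id)
        # Initialize the new object's citation list from cite-refs that were
        # waiting for it, removing them from the pending database.
        waiting = self.pending.pop(obj_id, set())
        self.citations[obj_id] |= waiting
        return sorted(self.citations[obj_id])
```

If d1 (which references d0) is ingested before d0, the cite-ref <d1,d0> waits in the pending database and is moved to d0's citation list when d0 arrives.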
For example, rather than ask all digital objects to implement a getRefSection(), we could put that method into the BiblioData object and let it work on whatever stream of bytes comes from an object when you ask it for its full text in ASCII.
The contents of a BiblioData object could include the following:
The collection of BiblioData objects is our citation database. A given BiblioData object is used to disseminate bibliographic information about digital objects. One potential use of this database is for Reference Linking.
The advantage of the reference-linked display is that the user of the digital object can immediately get a viewable copy of the object being referenced. (The last statement is modulo, of course, authorization firewalls built around third-party collections, but we assume that there will be other FEDORA objects/methods that can get us through that authorization directly to the referenced object.)
How might this work in terms of FEDORA? First, we ask for the reference-linked full-text view of an object, d0. (The disseminator for reference-linked full-text is used from the BiblioData object corresponding to d0, which in turn invokes the full-text disseminator on the FEDORA object accessing d0.)
The PostScript stream, or whatever format arrives from object d0, is decorated on the fly with reference links. When rendered to the user, these reference links are highlighted. The BiblioData object is able to insert these links because it has a list of all the References in d0; if there is also a URN for a Reference, it can insert a link into the byte stream being passed on to the user's viewing mechanism.
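As a rough illustration of decorating a stream with reference links, the sketch below works on plain text and emits HTML anchors; doing the same for PostScript would be considerably more involved. It assumes references appear in the text as markers such as "[1]" and that the BiblioData object can supply a marker-to-URN mapping; the function name link_references and the URN "urn:example:d1" are hypothetical.

```python
import html
import re


def link_references(text, urn_by_marker):
    """Wrap each reference marker that has a URN in an HTML anchor."""
    # Only references with a known URN can become links.
    linked = {m: u for m, u in urn_by_marker.items() if u}
    if not linked:
        return html.escape(text)
    # Longest markers first, so "[10]" is not shadowed by a shorter marker.
    pattern = re.compile(
        "|".join(re.escape(m) for m in sorted(linked, key=len, reverse=True))
    )
    parts, pos = [], 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[pos:match.start()]))
        marker = match.group(0)
        parts.append('<a href="{}">{}</a>'.format(
            html.escape(linked[marker], quote=True), html.escape(marker)))
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


# Marker "[2]" has no URN, so it is passed through without a link.
print(link_references("See [1] and [2].", {"[1]": "urn:example:d1", "[2]": None}))
```

The single pass over the text means inserted anchors are never re-scanned, so a URN that happens to contain a marker string cannot cause nested links.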
But there is more. Since we are also storing citation lists, the user can also ask a digital object (via its BiblioData object) for its list of citations. The BiblioData object can return this list directly, with some citations highlighted if the citing object is in a FEDORA repository somewhere.
Standard FEDORA disseminators will already exist for the BiblioData object, since it is a digital object. We would next try to extend one of these disseminators in two ways: (1) be able to ask a BiblioData object for its list of references; (2) be able to ask a BiblioData object for a reference-linked full-text rendition of its object.
Once this much has been done, it would be a good time to step back and evaluate the use of FEDORA in this process. If we feel that the advantages are great, then we proceed to build FEDORA digital objects for all the papers in DLIB and in a number of NCSTRL collections, and for the online ACM collection and the online LANL collections. We should also implement the citation part of the project. Finally, this is incorporated as a reference linking service within the Dienst model for processing online literature.
Where do the other neat reference linking tools come into play, such as the SFX button on each NCSTRL portal page? This could boil down to a variety of reference-linked display mechanisms. If you want SFX, ask for that mechanism. If you want direct links, ask for that.
Clearly some of the CiteSeer modules will be useful in analyzing and managing reference data. They will be used "under the covers". This will require invoking Perl scripts from a Java program. The latest version of Perl comes with some Java and Perl integration tools, which we plan to use. We will try JPerl, a Java package that allows running Perl code in Java (most of the CiteSeer tools are written in Perl). [http://www.perl.com/CPAN-local/authors/id/S/SB/SBALA] Also [http://www.ddj.com/ftp/1999/1999_02/jperl.zip/]
How many handles already exist out there? How many are for FEDORA objects? This semester we will be working to integrate handles into FEDORA, so we should use handles here. Note that we can still reference link to objects that are not FEDORA objects, because as long as we can find a URL, we can build a link. In principle, we should be able to link to objects in a wide variety of collections.
One reference target may have a number of URLs, because a reference target is a somewhat virtual object. It could have many instantiations, all equally good. Part of the job of the BiblioData object is to decide which URL to use.
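How the BiblioData object should choose among several URLs is left open here. Purely as an illustration, the sketch below assumes one possible policy: prefer URLs on an ordered list of favored hosts, and otherwise fall back to the first URL known. The function name choose_url, the policy itself, and the example hosts are all assumptions.

```python
from urllib.parse import urlparse


def choose_url(urls, preferred_hosts):
    """Pick one URL for a reference target; None if no URL is known."""
    if not urls:
        return None
    # Lower rank means more preferred; hosts not on the list rank last.
    rank = {host.lower(): i for i, host in enumerate(preferred_hosts)}
    # min() keeps the earliest URL among equally ranked candidates.
    return min(urls, key=lambda u: rank.get(urlparse(u).hostname, len(rank)))
```

Other policies (for example, preferring a particular document format, or a copy the user is authorized to read) would fit the same interface.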