March 2006
Table of Contents
II. CMDA - A New Way to Do Disseminators with Content Models
III. Open Issues and Design Questions
Individual digital objects can conform to an informal "content model" (a descriptive property of a Fedora digital object). This informal descriptor is used to define the nature of a set of digital objects (as in the number and types of datastreams and disseminators). For example, at the University of Virginia, a set of content models has been defined for images, TEI texts, EAD finding aids, and more (see: http://www.lib.virginia.edu/digital/resndev/fedora_imp/content_models.htm). As of Fedora 2.1, institutions develop their own sets of rules for content models, and enforce these rules either through "best practices" or through custom validation code within workflow applications or middleware on top of Fedora. Thus, as of Fedora 2.1 there is no built-in support for content models in the repository, except that the content model descriptor can be stored in an object, and thus searched upon. This gives institutions a group identifier with which they can locate objects that have the same datastream and disseminator patterns.
As of Fedora 2.1, digital objects can have one or more disseminators attached to them. Each disseminator is pre-bound to an individual digital object (and stored as a component of the object in FOXML). It has been observed that modification of pre-bound disseminators is awkward. If a disseminator must change, every object that carries it must be modified individually. Also, the current release of Fedora does not allow modification of the BMech and BDef objects that lie behind disseminators. This means that (1) you can't add a new method to a BDef/BMech, (2) you can't modify existing method definitions, and (3) you can't modify BMech binding maps. The current reasons for this restriction are: (1) details related to the current implementation of the "replication" module that keeps the dissemination database tables up to date, and (2) over-optimization of the dissemination database tables.
While the current situation does not interfere with how disseminations work at runtime, and it does not prevent people from modifying their disseminators, the current design does not make it easy to do certain kinds of modifications. If you want to modify an existing BDef or BMech object, you must purge it, make the changes, then re-ingest it. This sounds easy enough, but the system currently will not allow you to purge BDef/BMech objects that are referenced in digital object disseminators (since the system enforces referential integrity between an object's disseminators and the BDef/BMech objects in the repository). Thus, you are forced to take the following steps: (1) purge disseminators on every digital object that uses the BDef/BMech object that you want to change, (2) purge the BDef/BMech, (3) re-ingest the modified BDef/BMech, (4) add new disseminators back on the digital objects.
Clearly this situation is not what was intended in the original Fedora design! Too much enforcement of referential integrity, but not enough flexibility in terms of disseminator management.
There are three basic goals that we put forth for Fedora 3.0 (development during 2006-2007):
1. Formalize Content Models in the core Fedora repository service
2. Allow easier management of Disseminators (and related BDef and BMech objects)
3. Simplify and streamline the Fedora system by eliminating the existing dissemination database and refactoring the replication module
The proposed Content Model Dissemination Architecture (CMDA) is intended to provide a looser binding of disseminators to digital objects by building disseminators around the notion of "content models." A pre-requisite for this strategy is the formalization of content models, and the registration of such content models as special Fedora digital objects known as Content Model (CModel) objects (similar to how Fedora now registers BDef and BMech objects). A CModel object will hold the specifications about the number and types of datastreams that must exist in any "conforming" digital object.
Up through Fedora 2.1, digital objects could have "disseminators" directly linked to them. In the newly proposed CMDA, digital objects will not carry their own disseminators. Instead, each object will be associated with a CModel object from which it will acquire compatible services (i.e., "disseminations"). The CMDA will also enable both "contractual" and "opportunistic" disseminations. Contractual disseminations are behaviors that a digital object attains because it conforms to a particular CModel. The CModel object has relationships to particular services (via relationships to BMech objects). This is very similar to the current notion of a Disseminator in Fedora; it achieves the same result in a more indirect manner. In contrast, "opportunistic" disseminations are behaviors that an object can attain in a totally dynamic manner at runtime (via simple service matching algorithms). This new way of defining behaviors for digital objects is described, step-by-step, below.
In the proposed CMDA, regular digital objects in a repository can have a relationship to one or more CModel objects (assume conformance is validated). CModel objects, in turn, have relationships to one or more BMech objects (which store the service description and service binding metadata that are the building blocks for a set of run-time behaviors for "conforming" digital objects). As always, a BMech object is related to a BDef object (a BDef defines a set of methods in the abstract; a BMech contains WSDL bindings to a concrete service that runs the abstract behaviors).
If a regular digital object has a relationship with a CModel object, then that digital object has a transitive relationship to the BMech objects related to the CModel. The new CMDA will exploit these relationships to enable digital objects to attain behaviors at runtime. Conceptually, it is as though regular digital objects inherit disseminators from their content models.
Definition:
Content models (CModels) are stored as a special type of digital object in a Fedora repository. These special objects are "control" objects in the way Fedora BDef and BMech objects are. A CModel object is used to establish a set of constraints that other digital objects must obey if they are said to be "conforming" to a content model. Below is a simple design for CModel objects, dealing with both the modeling of datastreams, and the association of services (behaviors) with the model:
The CModel object contains:
Datastream Composite Model: an XML specification of datastream constraints for objects that conform to the content model, stored in the CModel object.
- the model has an entry for each type of datastream (DS-TYPE) that must/may be present in conforming digital objects
- it prescribes the Datastream IDs that a conforming digital object's datastreams must use
- for each such datastream, it prescribes the MIME type and/or format URI(s)
- for each such datastream, it prescribes the minimum and maximum number of instances of such a datastream the conforming digital object can have
- for each such datastream, it prescribes whether multiple instances of such a datastream must be ordered within the conforming digital object
<dsCompositeModel>
  <dsTypeModel ID="ARTICLE" ORDERED="false" MIN="1" MAX="1">
    <!-- The article can be in the form of PDF OR Postscript -->
    <form MIME="application/pdf" FORMAT_URIS="info:pathways/fmt/pronom/20"/>
    <form MIME="application/ps" FORMAT_URIS="info:pathways/fmt/pronom/??"/>
  </dsTypeModel>
  <dsTypeModel ID="ABSTRACT" ORDERED="false" MIN="1" MAX="1">
    <!-- The abstract must be in the form of HTML -->
    <form MIME="text/html"/>
  </dsTypeModel>
</dsCompositeModel>
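As a rough illustration of how such a model could be checked, here is a minimal Python sketch that validates a digital object's datastreams against a dsCompositeModel. The function name check_conformance and the dictionary representation of an object's datastreams are assumptions made for this example, not part of the proposal. The FORMAT_URIS attribute is left out of the check for brevity, and the bare-ID-or-ID_N matching follows the suffix convention for repeated datastreams.

```python
import xml.etree.ElementTree as ET

# The example model from this section, with FORMAT_URIS omitted for brevity.
DS_COMPOSITE_MODEL = """
<dsCompositeModel>
  <dsTypeModel ID="ARTICLE" ORDERED="false" MIN="1" MAX="1">
    <form MIME="application/pdf"/>
    <form MIME="application/ps"/>
  </dsTypeModel>
  <dsTypeModel ID="ABSTRACT" ORDERED="false" MIN="1" MAX="1">
    <form MIME="text/html"/>
  </dsTypeModel>
</dsCompositeModel>
"""


def check_conformance(model_xml, datastreams):
    """Return a list of violations (empty if the object conforms).

    `datastreams` is a hypothetical stand-in for a parsed digital object:
    a dict mapping datastream ID -> MIME type.
    """
    problems = []
    for ds_type in ET.fromstring(model_xml).findall("dsTypeModel"):
        type_id = ds_type.get("ID")
        allowed = {form.get("MIME") for form in ds_type.findall("form")}
        prefix = type_id + "_"
        # An instance is either the bare ID or ID_N (N numeric).
        instances = {
            ds_id: mime
            for ds_id, mime in datastreams.items()
            if ds_id == type_id
            or (ds_id.startswith(prefix) and ds_id[len(prefix):].isdigit())
        }
        count = len(instances)
        if count < int(ds_type.get("MIN")) or count > int(ds_type.get("MAX")):
            problems.append(f"{type_id}: found {count} instance(s)")
        for ds_id, mime in instances.items():
            if mime not in allowed:
                problems.append(f"{ds_id}: MIME {mime} not allowed")
    return problems


if __name__ == "__main__":
    obj = {"ARTICLE": "application/pdf", "ABSTRACT": "text/html"}
    print(check_conformance(DS_COMPOSITE_MODEL, obj))
```

An empty result means the object satisfies every dsTypeModel entry; a real validator would also need to consider ORDERED and format URIs.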
It should be noted that if a dsTypeModel element specifies that a particular type of datastream can appear more than once in a conforming digital object, then the datastream ID for each instance must be unique within the conforming object. In cases such as this, the conforming digital object will append (_N) to the prescribed datastream ID and increment N for each instance (e.g., ID_1, ID_2, and so on).
Option A - Disseminator Template (rejected): we put actual disseminators on the CModel object, just like we put disseminators on digital objects now. However, only the CModel object would store the disseminator (as a template for conforming objects). The conforming digital objects would not have disseminators pre-bound or stored on them. In this design, a disseminator gets "inherited" by a conforming digital object at runtime. The disseminator on the CModel object would contain a datastream binding map, but this binding map would refer to the IDs of the dsTypeModel elements in the dsCompositeModel (it would map these "abstract" datastream types to the BMech's binding keys, specified in the DSBINDMAP of the BMech object). So, this is really quite similar to how disseminators are bound to regular objects in the current Fedora, but instead, the disseminator is bound to the CModel object in a more abstract way. A new dissemination algorithm would dynamically associate the disseminator, at runtime, with a digital object given that the object conforms to the content model.
OR
Option B - RELS-EXT (preferred): we do the association in a simpler way by just asserting relationships in RELS-EXT to one or more BMechs that will work with this content model. We can use RELS-EXT to assert relationships even if the Resource Index is OFF. Asserting a CModel object relationship to a BMech object implies a disseminator.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
         xmlns:myns="http://www.nsdl.org/ontologies/relationships#">
  <rdf:Description rdf:about="info:fedora/demo:99">
    <!-- The CModel object is contracting with two different BMech objects -->
    <!-- This implies two different "disseminators" at runtime -->
    <fedora:hasContractualBMech rdf:resource="info:fedora/bmech:2"/>
    <fedora:hasContractualBMech rdf:resource="info:fedora/bmech:3"/>
  </rdf:Description>
</rdf:RDF>
Multiple BMech objects can be compatible with a given content model. However, not all of these BMech objects are necessarily in a contractual relationship from the CModel's perspective. A CModel will assert which BMech objects it considers "preferred" and establish a loosely bound contract via the "hasContractualBMech" relationship assertion. In the example below, the CModel object (cmodel:1) is contractually related to two BMech objects (bmech:2 and bmech:3). Each of these two BMechs implements a unique set of service methods (since each has a different BDef object associated with it). Of the three compatible BMech objects, two implement the same service methods (behaviors), as indicated by their relationship to the same BDef object. It should be noted that a CModel is not permitted to assert contractual relationships with two or more BMech objects that share the same BDef. The reason is that such a situation essentially associates the same set of methods (i.e., behaviors) with the CModel multiple times. This is not meaningful since, at the end of the day, objects that conform to the CModel would wind up having duplicate "disseminations" at runtime (e.g., "getFoo" "getFoo"). In the diagram, notice that of the two BMech objects that share the same BDef, only one is contractually related to the CModel object (thus not breaking the no-duplicate-behavior rule). From the BMech perspective, overall there are three BMech objects that are compatible with this CModel. However, the CModel has only contractually accepted two of the three.
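The no-duplicate-behavior rule lends itself to a simple check. The following Python sketch is one way to express it, assuming the BMech-to-BDef relationships are already available as a dictionary; the function name and the PIDs bmech:1, bdef:1 and bdef:2 are hypothetical and chosen only for illustration.

```python
def find_duplicate_bdefs(contractual_bmechs, bmech_to_bdef):
    """Return {bdef: [bmechs]} for each BDef that more than one
    contractual BMech implements. An empty dict means the rule holds."""
    by_bdef = {}
    for bmech in contractual_bmechs:
        by_bdef.setdefault(bmech_to_bdef[bmech], []).append(bmech)
    return {bdef: bmechs for bdef, bmechs in by_bdef.items() if len(bmechs) > 1}


if __name__ == "__main__":
    # Three compatible BMechs; bmech:1 and bmech:2 share a BDef (hypothetical PIDs).
    bmech_to_bdef = {"bmech:1": "bdef:1", "bmech:2": "bdef:1", "bmech:3": "bdef:2"}
    # Contracting with bmech:2 and bmech:3 only keeps the rule intact.
    print(find_duplicate_bdefs(["bmech:2", "bmech:3"], bmech_to_bdef))
    # Adding bmech:1 to the contract would violate it.
    print(find_duplicate_bdefs(["bmech:1", "bmech:2", "bmech:3"], bmech_to_bdef))
```

A check like this could run whenever a CModel's RELS-EXT is ingested or modified, though where validation belongs is one of the open design questions.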
The next thing that is required in the RELS-EXT option is to make associations between the datastream types described in the dsCompositeModel and the datastream input requirements of the BMech (defined in "DSINPUTSPEC"). This can be done by tagging the appropriate dsTypeModel elements of the dsCompositeModel with one or more serviceBindingKey elements (i.e., datastream binding keys for BMechs), for example:
<dsCompositeModel>
  <dsTypeModel ID="ARTICLE" ORDERED="false" MIN="1" MAX="1">
    <!-- The article can be either a pdf OR postscript -->
    <form MIME="application/pdf" FORMAT_URIS="info:pathways/fmt/pronom/20"/>
    <form MIME="application/ps" FORMAT_URIS="info:pathways/fmt/pronom/??"/>
    <!-- The article is associated with a service binding key -->
    <serviceBindingKey>info:fedora/semantic/SINGLE-DOC-AS-ARTICLE</serviceBindingKey>
  </dsTypeModel>
</dsCompositeModel>
According to the above CModel example, the dsCompositeModel specifies that a conforming digital object must have one datastream whose ID is "ARTICLE." This datastream must have a MIME type of either "application/pdf" or "application/ps." The ARTICLE datastream will also be associated with the serviceBindingKey "info:fedora/semantic/SINGLE-DOC-AS-ARTICLE." The serviceBindingKey is the basis for connecting particular types of datastreams with service methods (behaviors) defined in BMechs. It should be noted that there can be multiple serviceBindingKeys associated with a single dsTypeModel. The question then arises as to which BMechs the serviceBindingKeys are for. It may be the case that many BMechs use a given serviceBindingKey. The CModel asserts the BMechs it has "contracted with" in RELS-EXT. The CModel does not need to specify which of these BMechs each serviceBindingKey belongs to; this is all resolved at runtime by the CMDA service matching algorithm. Ideally, serviceBindingKey values are URIs, not simple strings.
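To make the runtime matching a little more concrete, here is a minimal Python sketch of how serviceBindingKeys might connect dsTypeModel entries to the binding keys a BMech requires. The proposal does not specify the matching algorithm, so the function name, the dictionary inputs, and the first-match policy are all assumptions for illustration.

```python
def bind_datastreams(ds_type_keys, bmech_required_keys):
    """Map each binding key a BMech needs to a dsTypeModel ID that carries it.

    `ds_type_keys` maps dsTypeModel ID -> list of serviceBindingKey values.
    Returns None when some required key has no matching dsTypeModel,
    meaning the BMech cannot be bound under this content model.
    """
    bindings = {}
    for key in bmech_required_keys:
        matches = [ds_id for ds_id, keys in ds_type_keys.items() if key in keys]
        if not matches:
            return None
        bindings[key] = matches[0]  # assumed policy: take the first match
    return bindings


if __name__ == "__main__":
    cmodel_keys = {"ARTICLE": ["info:fedora/semantic/SINGLE-DOC-AS-ARTICLE"]}
    print(bind_datastreams(cmodel_keys, ["info:fedora/semantic/SINGLE-DOC-AS-ARTICLE"]))
```

Because the keys are compared as opaque values, using URIs rather than short strings reduces the chance of two unrelated BMechs accidentally sharing a key.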
In the CMDA, regular digital objects can assert that they meet the constraints of one or more content models. It should be noted that in Fedora 2.1, a digital object had a core property for content model. This property was treated as a simple string and was not controlled in any way by the system. It has been used to identify an informal notion of a content model, meaning that it was an informal group identity for an object. The Fedora system had no way to enforce conformance or to do anything but index the property as a general descriptor for the object. This property was not repeatable, thus in Fedora 2.1 an object could have only one informal content model.
In the new CMDA, an object can conform to multiple content models (polymorphism of content model conformance). The content model will become a repeatable object-to-object relationship property of the object. The subject will be a digital object URI, the predicate is the "hasFormalContentModel" relationship, and the object is the URI of a CModel object. This can be expressed in RELS-EXT as follows:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
         xmlns:myns="http://www.nsdl.org/ontologies/relationships#">
  <rdf:Description rdf:about="info:fedora/demo:99">
    <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:article-1"/>
    <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:generic-document"/>
  </rdf:Description>
</rdf:RDF>
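For readers who want to see how a client might read these assertions, the following Python sketch extracts the hasFormalContentModel targets from a RELS-EXT datastream using the standard library's ElementTree. The function name content_models is an assumption for this example; the namespaces and predicate name are taken from the RDF above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FEDORA_NS = "info:fedora/fedora-system:def/relations-external#"

RELS_EXT = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:fedora="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/demo:99">
    <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:article-1"/>
    <fedora:hasFormalContentModel rdf:resource="info:fedora/cmodel:generic-document"/>
  </rdf:Description>
</rdf:RDF>
"""


def content_models(rels_ext_xml):
    """Return the CModel URIs an object asserts conformance to, in document order."""
    root = ET.fromstring(rels_ext_xml)
    predicate = f"{{{FEDORA_NS}}}hasFormalContentModel"
    resource = f"{{{RDF_NS}}}resource"
    return [element.get(resource) for element in root.iter(predicate)]


if __name__ == "__main__":
    for uri in content_models(RELS_EXT):
        print(uri)
```

Since the property is repeatable, the function returns a list rather than a single value.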
Thus far, we have discussed how CModel objects can assert contractual relationships to BMechs. The net result is that, at runtime, any digital objects that are related to that content model ("conforming objects") will attain the behaviors of those BMech(s) that the CModel has asserted a contractual relationship to. The CModel-to-BMech relationship "hasContractualBMech" controls what behaviors will get associated with digital objects that conform to the CModel.
It should be noted that in figure X, above, BMech objects can also assert relationships to CModels (orange arrows labeled "isCompatibleWith" pointing from BMech to CModel). We also noted that a CModel object may not have asserted a contractual relationship with all compatible BMechs. BMechs that are compatible with a CModel, but that are not named by the CModel, have the potential to endow the CModel with behaviors; it's just that the CModel has not explicitly modeled such a relationship. (The BMech-to-CModel relationships are "incoming" arrows asserted outside the context of the CModel.) These incoming relationships mean that a BMech declares that it is compatible with a CModel. The BMech has the potential to provide run-time behaviors for objects that conform to the CModel, but the CModel has not explicitly "authorized" this relationship.
It is possible to allow a CModel to declare that it will endorse these "non-contractual" relationships. The CModel can do so by asserting a special property ("endorseNonContractualBMechs") and setting it to true or false. This is done in RELS-EXT as follows:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
         xmlns:myns="http://www.nsdl.org/ontologies/relationships#">
  <rdf:Description rdf:about="info:fedora/demo:99">
    <fedora:hasContractualBMech rdf:resource="info:fedora/bmech:2"/>
    <fedora:hasContractualBMech rdf:resource="info:fedora/bmech:3"/>
    <!-- Allow other compatible BMechs to bind with objects conforming to this CModel -->
    <fedora:endorseNonContractualBMechs>true</fedora:endorseNonContractualBMechs>
  </rdf:Description>
</rdf:RDF>
The CMDA will look for this property at runtime. If the CModel says that it endorses non-contractual BMechs ("true"), the behaviors of such BMechs will be made available on conforming digital objects at runtime. If the CModel says it will not endorse non-contractual BMechs ("false"), then the behaviors of such BMechs will not be bound to conforming digital objects.
A question arises as to whether the endorsement of non-contractual BMechs can be controlled at either the repository configuration level, or at the level of any single digital object. We can have a repository configuration option that globally enables/disables the capability of binding such BMechs at run time. Also, from the perspective of an individual digital object, it is always possible to have an XACML policy to permit only disseminations from certain BMechs. Thus, in terms of controlling whether "non-contractual" BMechs can bind at runtime there are three options: the CModel-level "endorseNonContractualBMechs" property, a repository-wide configuration option, and XACML policies on individual digital objects.
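One way these controls could combine is sketched below in Python. The proposal leaves the interaction between the levels open, so the rule used here (a non-contractual BMech binds only if the CModel endorses it and the repository allows it, and any object-level policy can still veto) is purely an assumption, as are the function and parameter names. The XACML policy is modeled as a plain callable rather than a real policy engine.

```python
def may_bind(bmech, contractual, compatible, cmodel_endorses,
             repo_allows, policy_permits=None):
    """Decide whether `bmech` may provide behaviors for a conforming object.

    contractual     -- BMechs named via hasContractualBMech
    compatible      -- BMechs that declare isCompatibleWith the CModel
    cmodel_endorses -- value of endorseNonContractualBMechs on the CModel
    repo_allows     -- hypothetical repository-wide configuration switch
    policy_permits  -- optional callable standing in for an XACML decision
    """
    if policy_permits is not None and not policy_permits(bmech):
        return False  # an object-level policy can veto any BMech
    if bmech in contractual:
        return True
    return bmech in compatible and cmodel_endorses and repo_allows


if __name__ == "__main__":
    contractual = {"bmech:2", "bmech:3"}
    compatible = {"bmech:1", "bmech:2", "bmech:3"}
    print(may_bind("bmech:1", contractual, compatible, True, True))
    print(may_bind("bmech:1", contractual, compatible, False, True))
```

Other combinations are equally plausible, for example letting the repository switch override the CModel property; the sketch only shows that the decision reduces to a small boolean function once the policy is chosen.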
It may be desirable to allow completely dynamic and opportunistic behaviors to be bound with digital objects at run time, outside of the context of CModels. This functionality could be enabled by having a repository configuration option to allow MIME (or FormatURI) matching directly between digital objects and BMechs. An example might be where a BMech was defined to do a generic service operation like converting PDF to HTML. It would deal with any PDF, irrespective of the role of that PDF in a digital object. This would be the dynamic association of a general utility behavior with a digital object. This may be out of scope for CMDA, but it may be another sort of dynamic dissemination feature we may want to explore. <Elaborate on this more, and discuss whether this is a desirable feature to support in Fedora.>
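A minimal Python sketch of such direct MIME matching follows. It assumes each BMech advertises the set of MIME types it accepts; the function name, the input dictionaries, and the PIDs bmech:pdf-to-html and bmech:thumbnail are hypothetical.

```python
def opportunistic_matches(datastreams, bmech_input_mimes):
    """Return {bmech: [datastream IDs]} for BMechs that accept at least
    one of the object's datastreams based on MIME type alone.

    `datastreams` maps datastream ID -> MIME type;
    `bmech_input_mimes` maps BMech PID -> set of accepted MIME types.
    """
    matches = {}
    for bmech, mimes in bmech_input_mimes.items():
        ds_ids = sorted(ds_id for ds_id, mime in datastreams.items() if mime in mimes)
        if ds_ids:
            matches[bmech] = ds_ids
    return matches


if __name__ == "__main__":
    obj = {"DS1": "application/pdf", "DS2": "text/html"}
    services = {
        "bmech:pdf-to-html": {"application/pdf"},
        "bmech:thumbnail": {"image/jpeg"},
    }
    print(opportunistic_matches(obj, services))
```

Note that no CModel is consulted at all, which is what distinguishes this from the contractual and endorsed cases.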
One of the most problematic aspects of the current dissemination architecture in Fedora is the dissemination database, particularly keeping it up-to-date as digital objects are added/modified/purged. Our attempts to make these tables efficient have resulted in an over-normalization that has proven inefficient in terms of purge (e.g., there are cases where entries in certain tables are shared by multiple digital objects, making for expensive manual queries to maintain referential integrity among the tables). Most concerning, however, is the replication module, which is the code that keeps these tables up to date by replicating changes after every API-M transaction. The combination of table design, referential integrity enforcement, and the replication module has made it difficult for us to open up functionality to let people modify existing BDef and BMech objects. The main issues pertain to code in place to prevent breaking disseminators on existing objects that use these BDef/BMech objects. (More details later.) In the meantime, we are undertaking an analysis of the database and the replication module to see if we can make immediate improvements to: (1) improve performance, and (2) enable modification of BDef/BMech objects.
However, if we pursue the new content model dissemination architecture proposed in this document, we may be able to work around the existing problems (and possibly obsolete the existing dissemination database). Below are links to diagrams of the existing database schema, and two possible new ones:
The basic idea is that a target digital object asserts that it conforms to a CModel. A CModel object has a relationship to a BMech object. The target digital object has datastreams with MIME types (and possibly format URIs). The CModel has a dsCompositeModel that describes a set of dsTypeModel elements for "abstract" datastreams, including the prescribed datastream IDs. Each dsTypeModel also has one or more serviceBindingKey elements defined within it. These are ultimately used to link the dsTypeModel elements (abstract datastreams) to the semantic keys in a BMech's datastream binding map. Minimal information is kept in the relational database. To enable the disseminator matching, the database records relationships of CModel objects to BMech objects. Everything else is done dynamically.
1. ListMethods (API-A)
2. GetDissemination (API-A)
3. GetObjectHistory (API-A)
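The object-to-CModel-to-BMech-to-BDef traversal that underlies a ListMethods-style request can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the relationships are held in in-memory dictionaries rather than in FOXML or a database, and the function and parameter names are invented for the example.

```python
def list_methods(pid, has_cmodel, contractual_bmechs, implements_bdef, defines_methods):
    """Resolve the (BDef, method) pairs available on an object by walking
    object -> CModel -> BMech -> BDef relationships."""
    methods = set()
    for cmodel in has_cmodel.get(pid, []):
        for bmech in contractual_bmechs.get(cmodel, []):
            bdef = implements_bdef[bmech]
            methods.update((bdef, name) for name in defines_methods.get(bdef, []))
    return sorted(methods)


if __name__ == "__main__":
    # Hypothetical repository contents.
    has_cmodel = {"demo:11": ["cmodel:1"]}
    contractual_bmechs = {"cmodel:1": ["bmech:2", "bmech:3"]}
    implements_bdef = {"bmech:2": "bdef:1", "bmech:3": "bdef:2"}
    defines_methods = {"bdef:1": ["getDC"], "bdef:2": ["getFoo"]}
    for bdef, name in list_methods("demo:11", has_cmodel, contractual_bmechs,
                                   implements_bdef, defines_methods):
        print(f"{bdef}/{name}")
```

Nothing about the individual object is stored beyond its CModel relationship, which is why modifying a BMech or CModel would not require touching each conforming object.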
If we decide that we either want to obsolete existing disseminators, or we want to give people the option to migrate existing objects to the new CMDA, then we need an easy migration utility that will not require an entire export/ingest of objects. It might be possible to create a utility that works like the repository rebuilder and crawls the FOXML sources. Here is one way it might work:
1. STEP 0: seed the utility with the PIDs of a set of existing objects that are representative of all existing digital objects that have disseminators on them
2. STEP 1: auto-create CModel objects by reading representative digital objects. For each unique disseminator, build a CModel object. The dsCompositeModel will be driven off the disseminator, back referencing to datastreams in the representative digital object. (The related datastreams become dsTypeModel elements in the dsCompositeModel and we can pick up MIME and formatURI from the representative object; we can pick up serviceBindingKey values from the datastream binding map on the disseminator.) Assert the "hasContractualBMech" relationship in RELS-EXT of the CModel object by grabbing the BMech PID off of the disseminator. Record the CModel-BMech relationships in the database (varies depending on which implementation approach we go with).
3. STEP 2: crawl all FOXML files. For each Data Object, look at disseminators and check the BMech PID. If we have the correlation somewhere of which CModel objects are associated with which BMech objects, then we can: (1) remove the disseminator from the Data Object, (2) assert the relationship to the CModel object (presumably in RELS-EXT), and (3) record the DataObject-CModel relationships in the database (varies depending on which implementation approach we go with).
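The steps above can be sketched roughly as follows in Python. This is a simplified illustration, not the proposed utility: objects are plain dictionaries instead of FOXML, one CModel is created per distinct BMech PID (a simplification of "each unique disseminator"), the generated CModel PIDs are placeholders, and database recording is omitted.

```python
def build_cmodels(representatives):
    """STEP 1: derive one CModel description per BMech found on the
    representative objects. Returns {BMech PID: CModel description}."""
    cmodels = {}
    for obj in representatives:
        for diss in obj["disseminators"]:
            if diss["bmech"] in cmodels:
                continue
            ds_types = {}
            # The binding map (key -> datastream ID) yields the serviceBindingKeys;
            # MIME types come from the representative object's datastreams.
            for key, ds_id in diss["bindings"].items():
                entry = ds_types.setdefault(
                    ds_id, {"mime": obj["datastreams"][ds_id], "keys": []})
                entry["keys"].append(key)
            cmodels[diss["bmech"]] = {
                "pid": f"cmodel:{len(cmodels) + 1}",  # placeholder PID scheme
                "hasContractualBMech": diss["bmech"],
                "dsTypeModels": ds_types,
            }
    return cmodels


def migrate(obj, cmodels):
    """STEP 2: drop disseminators covered by a CModel and assert the
    CModel relationship instead. Returns a new object dictionary."""
    kept = [d for d in obj["disseminators"] if d["bmech"] not in cmodels]
    related = [cmodels[d["bmech"]]["pid"]
               for d in obj["disseminators"] if d["bmech"] in cmodels]
    return {**obj, "disseminators": kept, "hasFormalContentModel": related}
```

Disseminators whose BMech was not seen in STEP 1 are left in place, so an incomplete seed set would leave some objects only partially migrated.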
In Fedora 2.1, the RDF-based Resource Index contains a triple for every "stable" dissemination on digital objects. A stable dissemination is defined by a behavior method that either (1) does not have any parameters, or (2) has parameters, but the parameter values are from a fixed domain. The Resource Index is kept up to date, incrementally, as API-M add/modify/purge operations are committed. In Fedora 2.1, each object has its own dissemination, so each specific dissemination for each object can be easily figured out directly from the disseminator on the object. In the CMDA, we propose that disseminators are not put directly on each object, but instead, disseminations are figured out via the relationship the object has with one or more CModel objects.
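The definition of a "stable" dissemination can be stated as a small predicate. The Python sketch below assumes a hypothetical representation in which each method's parameters are given as a dictionary mapping parameter name to either a list of allowed values (a fixed domain) or None (free-form input); neither the function name nor this representation comes from Fedora itself.

```python
def is_stable(method_params):
    """A method is stable if it has no parameters, or if every
    parameter draws its value from a fixed domain."""
    return all(domain is not None for domain in method_params.values())


if __name__ == "__main__":
    print(is_stable({}))                                # no parameters
    print(is_stable({"size": ["small", "large"]}))      # fixed domain
    print(is_stable({"query": None}))                   # free-form parameter
```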
To support the CMDA, we propose some modifications to the Fedora model in the Resource Index. These modifications will be as follows:
<insert diagram of new RDF model for Fedora objects using CMDA>
This modification will be an accurate reflection of the new CMDA in the Resource Index. It will cut down on the number of triples in the Resource Index. Most importantly, it will also simplify the incremental updating algorithm for the Resource Index, because dissemination triples will no longer be pre-calculated for all digital objects. If they were pre-calculated, costly updates would occur whenever a CModel or BMech object is modified (since this would involve expensive queries to see which digital objects were affected, and then lots of triple deletes/inserts to make sure that all dissemination triples on individual digital objects reflect changes to the CModel or BMech nodes). It should be noted that this update scenario would eventually hit us in Fedora 2.1. We just have not had to deal with it since Fedora 2.1 does not allow modification of BMech objects. In any event, the CMDA will provide the opportunity to simplify the incremental updating of the Resource Index.
In terms of evaluating the impact of dissemination triples not being "pre-calculated" in the Resource Index, we evaluated the changes to RI queries that would be necessary to get information about disseminations on objects. Below are sample queries for discovering disseminations on digital objects using the new model.
Query: Determine which methods are supported by an object (demo:11)
-------------------------------------------------------------------
select $dissType
from <#ri>
where <info:fedora/demo:11> <hasCModel> $cModel
and $cModel <usesBMech> $bMech
and $bMech <implementsBDef> $bDef
and $bDef <definesMethod> $dissType

Query: Which objects' bdef:1/getDC dissems changed between time 1 and 3?
-------------------------------------------------------------------------
select $object
from <#ri>
where $bMech <implementsBDef> <info:fedora/bdef:1>
and $cModel <usesBMech> $bMech
and $cModel <datastreamType> $datastreamType
and $datastreamType <ID> $datastreamTypeID
and $object <hasDatastream> $datastream
and $datastream <ID> $datastreamTypeID
and $datastream <lastModifiedDate> $dsModDate
and $dsModDate <after> '1'
and $dsModDate <before> '3'

Query: More accurate version of above query
--------------------------------------------
select $object
from <#ri>
where $bMech <implementsBDef> <info:fedora/bdef:1>
and $bMech <hasMethodImpl> $methodImpl
and $methodImpl <semanticType> $semanticType
and $cModel <usesBMech> $bMech
and $cModel <datastreamType> $datastreamType
and $datastreamType <semanticType> $semanticType
and $datastreamType <ID> $datastreamTypeID
and $object <hasDatastream> $datastream
and $datastream <ID> $datastreamTypeID
and $datastream <lastModifiedDate> $dsModDate
and $dsModDate <after> '1'
and $dsModDate <before> '3'
1. Open Issues for Database Schema:
What is the minimal set of database tables necessary to support CMDA (with good performance)? Most notably, the proposed CMDA database (see db schema options PROPOSED A and PROPOSED B) does not record binding information for individual digital objects (i.e., the relationships between datastreams of specific digital objects and BMech binding keys). We expect the run time dissemination algorithm to be fast enough that it won't be a problem. CMDA requires that whenever dissemination-oriented requests are made upon a digital object (e.g., listMethods, getDissemination), that object's FOXML must be parsed to get a list of datastreams in the object. In general, the new algorithm for fulfilling disseminations may be I/O intensive if we go with minimal db tables and lean towards parsing FOXML for the target object, CModel object, and BMech object. However, earlier JMeter tests and actual experience have shown that the SAX parsing approach performs well. We need to test the new CMDA under load again to be sure. If necessary, we can add database tables, but we must be careful not to recreate something that looks like what we have now with the dissemination database.
2. Open Issues for Traditional Disseminators: Should we obsolete traditional disseminators?
3. Open Issues for Validation (Referential Integrity of Object-to-CModel-to-BMech-to-BDef):
How rigorous the CMDA's referential integrity validation should be is something we want to discuss more. There are pros and cons to both a loose approach and a very tight approach.
4. Open Issues for Content Model Objects:
5. Open Issues for Historical Disseminations (via versioning)
6. Open Issues for Export with Behaviors:
7. New Design Possibility: Content Model Inheritance
<add here: details on how to achieve CModel inheritance based on discussions in meeting>
8. Evaluation of Typical Process Flows in Creating Objects, CModels, and BDef/BMechs