NCSTRL Documentation
Dienst protocols, Release 3.5
DRAFT
Introduction
This document describes the Dienst protocol, which provides an open,
distributed digital library. The Dienst protocol is currently
implemented by the Dienst system, and is the basis for the
CSTR digital library.
Overview of Dienst architecture
In the Dienst architecture, there are four classes of services:
A Repository Service stores digital documents, each of which has a
unique name called a docid and may exist in several different
formats. An Index Service searches a collection and
returns a list of docids. Each site will typically run a repository
and index service for documents issued by that site. A single,
centralized Meta Service (also called a Contact Service)
provides a directory of locations of all other services. Finally, a
User Interface service mediates human access to this library.
All these services communicate via the Dienst protocol.
Note that the protocol has evolved over time and not all Dienst
servers are running the most recent release.
A docid is a string that uniquely specifies a technical report.
It consists of a publisher and a string, separated
by a colon.
The tokens may not contain whitespace or the colon character.
Publishers are defined just as in
RFC 1357, e.g.
CORNELLCS, STAN, UCB etc.
The string is assigned by the publisher, and must be unique
within that publisher.
An example is CORNELLCS:TR92-1321.
The syntax of a docid differs from that of an ID in
RFC 1357
only in that the separator between publisher and string is a colon
rather than a pair of slashes. The colon is used because a pair of
slashes would be awkward inside a URL, where the slash separates components.
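As an illustration of the docid syntax, the following Python sketch splits a docid into its two tokens and converts an RFC 1357 ID into a docid. The names DocId, parse_docid, and docid_from_rfc1357_id are invented for this example and are not part of Dienst.

```python
from typing import NamedTuple


class DocId(NamedTuple):
    publisher: str
    string: str


def parse_docid(docid: str) -> DocId:
    """Split a docid into its publisher and publisher-assigned string."""
    publisher, sep, string = docid.partition(":")
    if not sep or not publisher or not string:
        raise ValueError(f"docid needs the form publisher:string: {docid!r}")
    # Neither token may contain whitespace or a further colon.
    for token in (publisher, string):
        if ":" in token or any(ch.isspace() for ch in token):
            raise ValueError(f"illegal character in docid token: {token!r}")
    return DocId(publisher, string)


def docid_from_rfc1357_id(rfc_id: str) -> str:
    """Swap the RFC 1357 '//' separator for the colon used in a docid."""
    publisher, sep, string = rfc_id.partition("//")
    if not sep:
        raise ValueError(f"expected publisher//string: {rfc_id!r}")
    docid = f"{publisher}:{string}"
    parse_docid(docid)  # raises ValueError if the result is not a legal docid
    return docid
```

For example, parse_docid("CORNELLCS:TR92-1321") yields the publisher CORNELLCS and the string TR92-1321.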
Dienst HTTP embedding
The Dienst protocol is (currently) embedded in HTTP, which thus
imposes some restrictions on the protocol that are specific to HTTP,
not to Dienst.
HTTP request methods
All Dienst requests must be expressed with either the GET or
HEAD HTTP methods. In general, GET returns full information,
and HEAD returns only meta information. Not all Dienst requests
support HEAD.
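A client might choose between the two methods as in the following Python sketch. It only constructs the request object and sends nothing. The function name dienst_request and the host dienst.example.org are assumptions made for illustration.

```python
from urllib.request import Request


def dienst_request(base_url: str, path: str, meta_only: bool = False) -> Request:
    """Build a GET request, or a HEAD request when only meta information is wanted."""
    method = "HEAD" if meta_only else "GET"
    return Request(base_url.rstrip("/") + path, method=method)


# Hypothetical server address; substitute a real Dienst host.
request = dienst_request("http://dienst.example.org", "/Server/Info/Version")
```

Since not every Dienst request supports HEAD, a caller should be prepared for a HEAD request to be refused.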
Special characters
The syntax rules for URLs restrict a few characters to special roles
and require that, if these characters are used in any other way,
they be written as an escape sequence: a percent sign followed by the
character code in hexadecimal. The reserved characters are:
- / - separates components in the URL.
- ? - separates optional arguments from the rest of the URL
- # - indicates reference to a named anchor within a document
- = - separates name from value in an argument list
- & - separates multiple arguments after a ?
Finally, the space character may not appear anywhere in a URL. It must be written
as a "+" (or as a percent sign escape sequence).
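The escaping rules above correspond closely to what the Python standard library function urllib.parse.quote_plus does. The sketch below wraps it in a helper; the name escape_value is invented for this example.

```python
from urllib.parse import quote_plus


def escape_value(value: str) -> str:
    """Escape reserved characters as %XX and write each space as '+'."""
    # safe="" means no reserved character, not even "/", is left unescaped.
    return quote_plus(value, safe="")


print(escape_value("a/b?c#d=e&f"))      # a%2Fb%3Fc%23d%3De%26f
print(escape_value("computer vision"))  # computer+vision
```

A literal plus sign in a value is itself escaped (as %2B), so it cannot be confused with an encoded space.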
Optional arguments
Many of the Dienst protocol messages take optional arguments.
These arguments consist of a parameter name and value, separated
by an equal sign. All the arguments are joined, separated by an
ampersand, and attached to the end of the URL, separated from it
by a question mark. So for example, to pass the parameter
timeout with value 259 to the Shred
method, the URL would be Shred?timeout=259, and if a
second argument weight were added, the URL would then be
Shred?timeout=259&weight=7.4.
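The same construction can be sketched in Python with urllib.parse.urlencode, which joins name=value pairs with ampersands and escapes the values. The helper name with_arguments is an assumption made for this example; Shred is the illustrative method name used above.

```python
from urllib.parse import urlencode


def with_arguments(message: str, **params) -> str:
    """Attach optional arguments to a message URL after a question mark."""
    if not params:
        return message
    return f"{message}?{urlencode(params)}"


print(with_arguments("Shred", timeout=259))              # Shred?timeout=259
print(with_arguments("Shred", timeout=259, weight=7.4))  # Shred?timeout=259&weight=7.4
```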
Standard record list header
Many Dienst messages return lists of results. Many, but not
all, of these return lists of records. (Those that do not
are older and retained for compatibility.) Such lists are always
prefaced with a standard header consisting of two lines:
- Version: version
- Where version is a version number. At present, all
messages use version 1.0. The version number allows for future changes in the format
of the record list header or the record format.
- Count: N message
- Where N is the number of records that follow and
message is an optional error message string.
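A client could read the two header lines as in the following Python sketch. The exact spelling and spacing of the lines ("Version: 1.0", "Count: 3") is assumed from the description above, and the names RecordListHeader and parse_record_list_header are invented for this example.

```python
from typing import NamedTuple, Sequence


class RecordListHeader(NamedTuple):
    version: str
    count: int
    message: str  # empty when the server sent no message


def parse_record_list_header(lines: Sequence[str]) -> RecordListHeader:
    """Parse the first two lines of a record list response."""
    if len(lines) < 2:
        raise ValueError("a record list header has two lines")
    key, _, version = lines[0].partition(":")
    if key.strip() != "Version":
        raise ValueError(f"expected a Version line, got {lines[0]!r}")
    key, _, rest = lines[1].partition(":")
    if key.strip() != "Count":
        raise ValueError(f"expected a Count line, got {lines[1]!r}")
    # The count comes first; anything after it is the optional message.
    parts = rest.strip().split(None, 1)
    if not parts:
        raise ValueError("Count line carries no number")
    message = parts[1] if len(parts) > 1 else ""
    return RecordListHeader(version.strip(), int(parts[0]), message)
```

Checking the version before reading the records lets a client reject a format it does not understand.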
Protocol messages
For each class of Dienst service we list the messages it
implements. Note that in the current implementation,
conceptually distinct services (Repository, Index) are accessed
through a common Web server and share the same host and port, and
thus a message is seen by all of them, though only one will
reply. The messages are listed by name, followed by the syntax
of the URL that encodes the message.
Generic Messages
Version
/Server/Info/Version
returns the version of the service, e.g.
Dienst v3-6-0. Note that older or customized servers may
return a different string.
Time
/Server/Info/Time
returns the local time in
RFC 1036
format. The time zone is omitted. An example is:
Thu, 22 Jun 95 09:16:43
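A reply in this form can be parsed with datetime.strptime, as in the Python sketch below. The format string is inferred from the single example above, the function name parse_server_time is invented, and the day and month names are matched according to the current locale (the default C locale expects English abbreviations).

```python
from datetime import datetime

# Inferred from the example "Thu, 22 Jun 95 09:16:43"; two-digit year, no time zone.
TIME_FORMAT = "%a, %d %b %y %H:%M:%S"


def parse_server_time(text: str) -> datetime:
    """Parse the reply to /Server/Info/Time into a naive datetime."""
    return datetime.strptime(text.strip(), TIME_FORMAT)


print(parse_server_time("Thu, 22 Jun 95 09:16:43"))  # 1995-06-22 09:16:43
```

Because the time zone is omitted, the result is a naive datetime and cannot be compared safely across sites in different zones.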
Repository Service
The repository allows a given document to be stored in many different
formats, and provides messages to obtain the document or pieces of the
document in any of the stored formats. In Dienst releases prior to 3.5, formats
are named with MIME types; in Dienst 3.5 and later, formats are named with
reserved keywords (e.g. "ocr", "postscript", "scanned").
Format names
A format name describes the intended purpose of the data rather than its representation, which is better described by a MIME type.
- bib
- Bibliographic information in RFC 1357 format.
- postscript
- The entire body of the document, sent as application/postscript
- text
- plain ASCII text, sent as text/plain
- ocr
- ASCII text produced by OCR, sent as text/plain
- scanned
- scanned page image, usually TIFF, at a resolution of at least 300 spots per inch.
- inline
- a page image, suitable for screen display. Usually a GIF, at about 72 dots per inch, four bits per pixel.
- structure
- a document structure file
In addition, there are a number of internal formats, not documented as part of the protocol.
List Contents
/Server/List-Contents
A list of the docids available from this service, one per line.
Get Document Body
/Server/TR/docid/Body[?format=format]
Return the body of the document, in the selected format.
Get Page
/Server/TR/docid/Page/NNNN[?format=format]
Return a single page, where the document is available in discrete
pages, in the selected format. Reasonable values for format for
Dienst 3.5 are scanned or inline.
Get Page Count
/Server/TR/docid/NPages[?format=format]
Return the number of pages for this document, when it is available
in discrete pages.
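The three document messages above share one URL pattern, which the following Python sketch builds. The function names are invented for this example. Whether the page number must be zero-padded is not stated here, so the sketch passes it through unchanged.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def _document_url(docid: str, tail: str, fmt: Optional[str]) -> str:
    # Keep the docid's colon readable, but escape "/" and other reserved characters.
    path = f"/Server/TR/{quote(docid, safe=':')}/{tail}"
    return f"{path}?{urlencode({'format': fmt})}" if fmt else path


def body_url(docid: str, fmt: Optional[str] = None) -> str:
    return _document_url(docid, "Body", fmt)


def page_url(docid: str, page, fmt: Optional[str] = None) -> str:
    return _document_url(docid, f"Page/{page}", fmt)


def npages_url(docid: str, fmt: Optional[str] = None) -> str:
    return _document_url(docid, "NPages", fmt)


print(body_url("CORNELLCS:TR92-1321", "postscript"))
# /Server/TR/CORNELLCS:TR92-1321/Body?format=postscript
```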
List MIME types
/Server/TR/docid/Formats
This is an older message retained for compatibility. Its use is not encouraged. It returns a list of the available MIME content types for
the document, rather than a list of the Dienst 3.5 formats.
The returned list consists of lines of the form:
content-type size
where content-type is
the MIME content type, and size is in bytes, if it
can be determined. (In general, the size can only be
determined if the data is stored in a single file.) There
is no guarantee that this is the number of bytes that will actually be
transmitted when the data is retrieved in this form, since the file might be stored
compressed but transmitted uncompressed, or vice versa.
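A reply of this shape could be read as in the Python sketch below. It assumes the two fields are separated by whitespace and that the size is simply absent when it cannot be determined; the function name parse_formats is invented for this example.

```python
from typing import List, Optional, Tuple


def parse_formats(text: str) -> List[Tuple[str, Optional[int]]]:
    """Parse 'content-type size' lines; size is None when it is missing."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        size = int(fields[1]) if len(fields) > 1 and fields[1].isdigit() else None
        result.append((fields[0], size))
    return result
```

Given the caveat about compression, the size is best treated as a hint (for a progress display, say) rather than as an exact transfer length.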
Index Service
The index service searches a set of descriptions of
documents and returns docids for those that match. Document
descriptions (bibliographic information) are stored in the
RFC 1357 format.
Get Bibliographic Records
/Server/Bibliography
/Server/Bibliography?docid=docid
/Server/Bibliography?file-after=time
Returns the bibliographic information for documents on the service.
The first form returns all bibliographic records, the second returns the record for
a single document, and the third returns the records for all documents added or
modified since time, a universal time expressed in RFC 1036
format. Note that this is distinct from any dates encoded internal to
the bibliographic record, e.g. the date the document itself was
written.
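The three forms can be built as in the following Python sketch. It assumes that the time argument uses the same layout as the reply to /Server/Info/Time, which is not confirmed here, and the function name bibliography_url is invented for this example.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode

# Assumed layout, taken from the /Server/Info/Time example.
TIME_FORMAT = "%a, %d %b %y %H:%M:%S"


def bibliography_url(docid: Optional[str] = None,
                     file_after: Optional[datetime] = None) -> str:
    """Build one of the three Bibliography URL forms."""
    if docid is not None and file_after is not None:
        raise ValueError("pass docid or file_after, not both")
    params = {}
    if docid is not None:
        params["docid"] = docid
    if file_after is not None:
        params["file-after"] = file_after.strftime(TIME_FORMAT)
    if not params:
        return "/Server/Bibliography"
    # safe=":" leaves colons readable; spaces become "+" and commas %2C.
    return "/Server/Bibliography?" + urlencode(params, safe=":")


print(bibliography_url(file_after=datetime(1995, 6, 22, 9, 16, 43)))
# /Server/Bibliography?file-after=Thu%2C+22+Jun+95+09:16:43
```

The third form suits incremental harvesting: a client records the time of its last request and asks only for records filed after it.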
Search
/Server/IndexBoolean/?kwds
Searches the collection. kwds is a set of keywords
and values specifying the search criteria. Returns a record list
where each record begins with a blank line, then has
docid, title, author, date each on a separate line.
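A search reply could be split into records as in the Python sketch below. It assumes the reply starts with the two-line standard record list header and that every record has exactly the four lines listed above; the function name parse_search_results is invented for this example.

```python
from typing import Dict, List

FIELDS = ("docid", "title", "author", "date")


def parse_search_results(text: str) -> List[Dict[str, str]]:
    """Split a search reply into one dict per record."""
    lines = text.splitlines()
    records: List[List[str]] = []
    current = None
    # Skip the two header lines (Version and Count); records follow.
    for line in lines[2:]:
        if not line.strip():
            # A blank line opens a new record.
            current = []
            records.append(current)
        elif current is not None:
            current.append(line)
    # Pair each line with its field name; empty records are dropped.
    return [dict(zip(FIELDS, record)) for record in records if record]
```

Comparing the number of parsed records with the Count value from the header is a cheap check that the reply was not truncated.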
Allowable keywords
- title
- words from the title.
- author
- author's last or first name.
- abstract
- words from the abstract.
- any
- search for words in any of the title, author, or abstract fields,
e.g. any=smith will find documents written by Smith or with
Smith in the title.
- publisher
- symbolic name of publisher. Defaults to "any".
- number
- The number of the document, e.g. 259.
- boolean
- The connective between keyword fields, either and (the
default) or or.
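A search URL can be assembled from these keywords as in the Python sketch below. The function name search_url and the validation it performs are assumptions made for this example; a server may accept or reject other input.

```python
from urllib.parse import urlencode

ALLOWED_KEYWORDS = ("title", "author", "abstract", "any",
                    "publisher", "number", "boolean")


def search_url(**criteria: str) -> str:
    """Build an IndexBoolean search URL from keyword=value criteria."""
    unknown = sorted(set(criteria) - set(ALLOWED_KEYWORDS))
    if unknown:
        raise ValueError(f"unknown search keywords: {unknown}")
    if criteria.get("boolean", "and") not in ("and", "or"):
        raise ValueError("boolean must be 'and' or 'or'")
    # urlencode writes spaces as "+" and joins the pairs with "&".
    return "/Server/IndexBoolean/?" + urlencode(criteria)


print(search_url(author="davis or fox"))
# /Server/IndexBoolean/?author=davis+or+fox
```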
Rules for bibliographic keyword matching
Words in the three bibliographic keyword fields
(author, title, abstract) are matched to
bibliographic entries according to the following rules:
- Each word matches any word in the respective field
that begins with that word. For example, the word "comp"
matches "computer", "computation", "comprehensive", etc.
- The value for a keyword field may contain the logical connectors
"and" and "or". For example, "robotics or vision" in the
abstract field will return documents that have the word
"robotics" or "vision" in their abstracts, while "robotics and vision" in
the abstract field will return documents that have both the
word "robotics" and "vision" in their abstracts. Multiple words that
are not separated by a connector are assumed to be "and" separated. For
example, "computer vision" in the abstract field will
return documents that have both the words "computer" and "vision" in
their abstracts. Finally, parentheses may be used to group words.
For example, "Gries or (Teitelbaum and Field)" in the author
field will return documents authored by "Gries" or by both "Teitelbaum"
and "Field".
- Finally, the boolean field may specify either the logical
connector and or or between the bibliographic
keyword fields (the default is and).
For example, combining "robot" in the title field and
"robotics" in the abstract field with or will return documents that
have either "robot" in their titles or "robotics" in their abstracts.
Combining these fields with and will return only those documents that have
"robot" in their titles and "robotics" in their abstracts.
Examples
- reports written by either "Davis" or "Fox"
- /Server/IndexBoolean/?author=davis+or+fox
- reports written by "donald" and with "robot" in the title.
- /Server/IndexBoolean/?author=donald&title=robot
- reports written by "donald" or with "robot" in the title.
- /Server/IndexBoolean/?author=donald&title=robot&boolean=or
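The matching rules for a single keyword field (prefix matching, "and" and "or", implied "and", and parentheses) can be sketched in Python as below. This is not the server's implementation. It assumes that "and" binds more tightly than "or" where no parentheses are given, which the rules do not state, and the function name matches is invented for this example.

```python
import re


def matches(query: str, field_text: str) -> bool:
    """Evaluate a keyword-field query against the text of one field."""
    words = re.findall(r"[a-z0-9]+", field_text.lower())
    tokens = re.findall(r"\(|\)|[^\s()]+", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() == "or":
            advance()
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        result = parse_atom()
        # Adjacent words with no connector are treated as "and".
        while peek() not in (None, "or", ")"):
            if peek() == "and":
                advance()
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        if peek() is None or peek() in (")", "and", "or"):
            raise ValueError(f"malformed query: {query!r}")
        token = advance()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        # A query word matches any field word that begins with it.
        return any(word.startswith(token) for word in words)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"malformed query: {query!r}")
    return result
```

For instance, matches("comp", "Computer vision systems") is true because "computer" begins with "comp", while matches("robotics and vision", "Machine vision") is false because no word begins with "robotics".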
Search (old format)
/Server/Index/?kwds
This is an older form of search. It will be supported until all
servers are running at least version 3.5, and then will cease. It
differs from IndexBoolean in that the boolean
keyword is not supported, nor are boolean operators allowed in fields.
Meta Service
Get Publishers
/MetaServer/Publishers
Returns a record list of the publishers in the collection.
Each record consists of the publisher's symbolic name and "pretty
name", separated by the ASCII FS character (octal 034).
Get Index Servers
/MetaServer/Indeces
Returns a record list of the Index services. Each record
consists of four fields separated by the ASCII FS character (octal
034):
- host
-
- port
-
- publishers
- List of symbolic names of publishers, separated by colons.
- protocol
- The protocol running at the server. The only supported protocol
is DIENST_handler.
Get Repositories
/MetaServer/Repositories
Returns a record list of
the Repository services. Each record consists of four fields,
separated by the ASCII FS character (octal 034):
- host
-
- port
-
- obsolete
- Do not use this field.
- publishers
- The symbolic names of publishers in this repository, separated by
colons.
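The Meta Service records described above could be split into fields as in the Python sketch below. It assumes each record arrives as one string without a trailing newline and that the port is a decimal number; the function names are invented for this example.

```python
FS = "\x1c"  # ASCII FS, octal 034


def parse_publisher_record(record: str) -> dict:
    name, pretty_name = record.split(FS)
    return {"name": name, "pretty_name": pretty_name}


def parse_index_server_record(record: str) -> dict:
    host, port, publishers, protocol = record.split(FS)
    return {"host": host, "port": int(port),
            "publishers": publishers.split(":"), "protocol": protocol}


def parse_repository_record(record: str) -> dict:
    # The third field is obsolete and is deliberately discarded.
    host, port, _obsolete, publishers = record.split(FS)
    return {"host": host, "port": int(port),
            "publishers": publishers.split(":")}
```

Unpacking into a fixed number of names makes a record with the wrong number of fields fail with a ValueError instead of being silently misread.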
UI Service
There is no "UI Service" protocol. Each Dienst UI service is free to
implement any user interface that the local site finds helpful. These
URLs are documented simply for convenience, and may or may not be
available on any given service.
- /Document/docid
- Return a nicely formatted HTML page summarizing information about
the document. You can send this message to any UI service, and if the
document is not stored on that service it will relay the message
to the relevant UI service.
- /TR/Search
- Return an HTML form for searching the collection.
- /TR/List/Authors
- Return a list of all authors in the index service at the site
where this UI service is running. This is not a list of all
authors in the entire collection, only those at the local site.
Typically this list will include hyperlinks to search for all papers
by those authors.
- /TR/List/Numbers
- Return a list of all documents in the repository service
at the site where this UI service is running. Note that "numbers" is a
misnomer; a better name would be "docids", but the old name is retained
for compatibility.
Up to Main Information Menu
NCSTRL Documentation
Any comments or questions?
Contact us at help@ncstrl.org.
Acknowledgements
This work was supported in part by the Advanced Research Projects
Agency under Grant No. MDA972-92-J-1029 with the Corporation for
National Research Initiatives (CNRI). Its content does not
necessarily reflect the position or the policy of the Government or
CNRI, and no official endorsement should be inferred. This work was
done at the Design Research Institute, a collaboration of Xerox
Corporation and Cornell University, and at the Computer Science
Department at Cornell University.