CS 501
Software Engineering
Spring 2003

Project Concepts

Legal Information Institute


Client

Thomas R. Bruce, Co-Director of the Legal Information Institute (trb2@cornell.edu)

Project outlines

Cornell's Legal Information Institute is the premier open-access legal service on the Internet (http://www.law.cornell.edu/). Two projects have been suggested:

Word Wheel: interactive searching assistance applet.

Most Internet search-engine interfaces offer little interactivity and little ability to "preview" results. Users cannot easily monitor how their choice of search terms affects the result set: there is no quick way to see, for example, how a particular term changes the number of hits returned, or whether adding it narrows or widens the results. This lack of "feel" or "touch" makes it harder for users to learn which term-selection strategies are effective, or to see the overall trend across a series of search modifications. Similarly, it is hard for users to know which terms occur in a database and how frequently.
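To make the idea of "previewing" concrete, here is a minimal sketch in Java of how hit counts could be recomputed as terms are added to a query. It is only an illustration under stated assumptions: the class HitPreview, its method names, and the toy in-memory inverted index are hypothetical and are not part of SWISH-E, SINO, or any existing LII code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index: maps each term to the ids of the documents containing it.
public class HitPreview {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split the text on non-word characters and record each token for this document.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Number of documents in which the term occurs.
    public int documentFrequency(String term) {
        return lookup(term).size();
    }

    // Hits when every term must match (AND); adding a term can only narrow the set.
    public int countAll(List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            if (result == null) {
                result = new HashSet<>(lookup(term));
            } else {
                result.retainAll(lookup(term));
            }
        }
        return result == null ? 0 : result.size();
    }

    // Hits when any term may match (OR); adding a term can only widen the set.
    public int countAny(List<String> terms) {
        Set<Integer> result = new HashSet<>();
        for (String term : terms) {
            result.addAll(lookup(term));
        }
        return result.size();
    }

    private Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

An interface built on counts like these could show the user, before the search is submitted, that adding a second required term shrinks the result set while adding an alternative term enlarges it.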

Examples of such tools exist for local, commercial hypertext/knowledgebase products such as Folio Views, but so far as we know none exists for commonly used Internet-based full-text systems such as SWISH-E or SINO. The goal of this project is to develop such a tool as a Java applet that works with SWISH-E indexes at a minimum and, preferably, has a tiered architecture that would let it work with many of the more common open-source engines.

Performance requirements are likely to make this a technical challenge, as the applet will have to remain responsive for term vocabularies that are rather large. For instance, there are approximately 132,000 discrete terms in the database of judicial opinions from the Ninth US Circuit Court of Appeals. This may require some interesting and difficult design decisions (translation: several well-chosen levels of cheating). 
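One possible way to keep lookups responsive on a vocabulary of this size is sketched below: the terms are held in a sorted array, a binary search finds the first term matching what the user has typed, and only a small window of matches is returned. This is a sketch under assumptions, not a proposed design; the class WordWheel, the window limit, and the sample counts used with it are all hypothetical, and the caller is assumed to normalize case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical word wheel: a sorted vocabulary with occurrence counts.
public class WordWheel {
    private final String[] terms;  // sorted in natural String order
    private final int[] counts;    // counts[i] is the frequency of terms[i]

    public WordWheel(Map<String, Integer> termCounts) {
        TreeMap<String, Integer> sorted = new TreeMap<>(termCounts);
        terms = new String[sorted.size()];
        counts = new int[sorted.size()];
        int i = 0;
        for (Map.Entry<String, Integer> entry : sorted.entrySet()) {
            terms[i] = entry.getKey();
            counts[i] = entry.getValue();
            i++;
        }
    }

    // Index of the first term that is >= key, found by binary search.
    private int lowerBound(String key) {
        int lo = 0;
        int hi = terms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (terms[mid].compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Up to 'limit' terms starting with the prefix, each shown with its count.
    // Terms sharing a prefix are contiguous in sorted order, so the scan stops
    // at the first non-matching term.
    public List<String> window(String prefix, int limit) {
        List<String> visible = new ArrayList<>();
        for (int i = lowerBound(prefix);
                i < terms.length && visible.size() < limit && terms[i].startsWith(prefix);
                i++) {
            visible.add(terms[i] + " (" + counts[i] + ")");
        }
        return visible;
    }
}
```

Each keystroke then costs one binary search (about 17 comparisons for 132,000 terms) plus a scan of at most the window size, which is one example of the kind of "cheating" available: the applet never needs to display, or even hold on screen, more than a few entries at a time.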

A successful solution to this problem would likely be widely adopted across the Web.

Improved Swiss-Army Knife for web-crawling

The Legal Information Institute runs a number of full-text collections that integrate the efforts of discrete but related information providers into one searchable full-text database. Of these, the most problematic is the database of decisions of the US Circuit Courts of Appeals, which integrates the output of the 13 federal circuit courts. The individual providers share little by way of technical approach, collection scope or any other characteristic. Integration therefore demands great flexibility and configurability from the web-crawling software used to pull discrete Web pages into the collection, as well as the ability to work around the peculiarities of the back-end database solutions in use at the various sites. The challenge is not a single insurmountable technical problem but a series of nagging, low-level difficulties that well-designed software can overcome.

Currently we use a modified version of the LWP::Robot CPAN module "bolted on" to the SWISH-E indexing engine. It has whetted our appetite for a laundry list of discrete improvements and features too numerous to list here in detail. Generally these consist of better and more convenient ways of specifying starting-point URLs algorithmically (usually based on dates), more expressive URL inclusion and exclusion mechanisms, more efficient parallel crawling of multiple sites to reduce the ill effects of server latency (many government systems are quite slow), better ability to deal with the strange world of Domino servers, and so on.
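Two of the items above, date-based starting points and URL inclusion/exclusion rules, can be illustrated with a short Java sketch. Everything in it is an assumption made for illustration: the class CrawlPlan, the {date} placeholder, the yyyy/MM/dd layout, and the example.org host are hypothetical and do not describe the current LWP-based crawler or any court's actual site.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for planning a crawl; not part of any existing crawler.
public class CrawlPlan {
    // Assumed date layout inside URLs; a real site may use a different one.
    private static final DateTimeFormatter DATE_FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Expand a template containing "{date}" into one URL per day, inclusive.
    public static List<String> seedUrls(String template, LocalDate from, LocalDate to) {
        List<String> urls = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            urls.add(template.replace("{date}", day.format(DATE_FORMAT)));
        }
        return urls;
    }

    // A URL is crawled if no exclude pattern matches it and, when include
    // patterns are given, at least one of them matches.
    public static boolean accepts(String url, List<Pattern> include, List<Pattern> exclude) {
        for (Pattern pattern : exclude) {
            if (pattern.matcher(url).find()) {
                return false;
            }
        }
        if (include.isEmpty()) {
            return true;
        }
        for (Pattern pattern : include) {
            if (pattern.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}
```

Giving exclusion rules priority over inclusion rules is one design choice among several; a per-site configuration could equally apply the rules in listed order.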



William Y. Arms
(wya@cs.cornell.edu)

Last changed: January 22, 2003