CS 501 Software Engineering: Projects -- Legal Information Institute

CS 501
Software Engineering
Spring 2002

Project Concepts

Legal Information Institute

Client

Thomas R. Bruce, Co-Director of the Legal Information Institute (trb2@cornell.edu)

Project outlines

Cornell's Legal Information Institute is the premier open access legal service on the Internet (http://www.law.cornell.edu/). Two projects have been suggested:

PDF versions of the United States Code

The United States Code is released to the general public by the US House of Representatives on its Web site (at http://uscode.house.gov/download.htm). This is a fairly plain-vanilla ASCII version to which the Legal Information Institute adds value (visible at http://www4.law.cornell.edu/uscode/). An earlier CS 501 project team and a later student project developed programs for the Legal Information Institute that convert the raw ASCII output of the House of Representatives to XML, for subsequent reuse in various settings. The US Code collection is a flagship of the Legal Information Institute and currently gets about half a million hits daily.

The new project is to create PDF versions from the XML. Creating PDF is not hard. The hard part has three pieces: building a user interface and delivery system that allows PDF to be produced at arbitrary levels of structure (up to an entire title); finding a way either to pre-build and cache the PDF versions or to build them on demand quickly enough for on-the-fly ordering; and tying the whole thing in with some kind of catalog or shopping cart system. In other words, the user should be able to walk up to the system via a browser, say "gimme that chunk of the Code", possibly pay for the service, and then get the chunk in reasonable time, without the system having to pre-build the whole Code at every level.
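
The build-on-demand-and-cache option can be sketched as follows. This is only an illustration under stated assumptions: the XML element names (title, chapter, section), the num attribute, and the ChunkCache class are hypothetical, since the actual schema produced by the earlier projects is not described here, and render() is a stand-in that returns plain text bytes rather than real PDF. Python is used purely for illustration; the choice of language is open.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of one title; the real XML schema is an assumption.
SAMPLE_XML = """
<title num="17">
  <chapter num="1">
    <section num="101">Definitions.</section>
    <section num="102">Subject matter of copyright.</section>
  </chapter>
  <chapter num="2">
    <section num="201">Ownership of copyright.</section>
  </chapter>
</title>
"""

def select_chunk(root, path):
    """Walk a path such as [("chapter", "1"), ("section", "102")] below the title."""
    node = root
    for tag, num in path:
        node = next((c for c in node if c.tag == tag and c.get("num") == num), None)
        if node is None:
            raise KeyError(f"no {tag} {num}")
    return node

def render(node):
    """Stand-in for a PDF renderer: returns the chunk's text as bytes."""
    text = "\n".join(t.strip() for t in node.itertext() if t.strip())
    return text.encode("utf-8")

class ChunkCache:
    """Build a chunk the first time it is requested, then serve it from memory."""

    def __init__(self, root):
        self.root = root
        self.store = {}
        self.builds = 0  # counts how many times render() actually ran

    def get(self, path):
        # An empty path means the whole title.
        key = "/".join(f"{tag}-{num}" for tag, num in path) or "title"
        if key not in self.store:
            self.builds += 1
            self.store[key] = render(select_chunk(self.root, path))
        return self.store[key]

root = ET.fromstring(SAMPLE_XML)
cache = ChunkCache(root)
```

Requesting the same path twice renders it only once, and nothing is built for levels that nobody asks for. A real system would also need cache invalidation when the House of Representatives releases a new version of the Code, and probably a disk-backed store rather than a dictionary.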

Search engine for legal information

Users of legal information, including the Legal Information Institute, need an open-source search engine that is fast, good, and law-specific. The concept is to take an existing open-source engine like SWISH-E or SINO, modify it to run on a Beowulf cluster, and then add some refinements that are useful in law, such as better treatment of punctuation characters and possibly granting of extra weight to citation matches and the like.
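
To make the two refinements concrete, here is a minimal sketch of punctuation-aware tokenizing and citation weighting. It does not reflect how SWISH-E or SINO work internally. The citation pattern, the usc: token format, and the weight of 3.0 are all assumptions chosen for illustration, and the pattern covers only simple US Code citations.

```python
import re

# Assumed pattern for citations such as "17 U.S.C. § 107" or "17 USC 107".
CITATION = re.compile(
    r"\b(\d+)\s+U\.?\s?S\.?\s?C\.?\s*(?:§+\s*)?(\d+[a-z]?)", re.IGNORECASE
)
WORD = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Return lowercase word tokens, emitting each citation as one normalized token."""
    tokens = []
    pos = 0
    for m in CITATION.finditer(text):
        # Ordinary words before the citation are split on punctuation as usual.
        tokens += WORD.findall(text[pos:m.start()].lower())
        # The citation itself is kept whole instead of becoming "17", "u", "s", "c", "107".
        tokens.append(f"usc:{m.group(1)}:{m.group(2).lower()}")
        pos = m.end()
    tokens += WORD.findall(text[pos:].lower())
    return tokens

def score(query, document, citation_weight=3.0):
    """Count query-term occurrences in the document, weighting citation matches extra."""
    doc_tokens = tokenize(document)
    total = 0.0
    for tok in set(tokenize(query)):
        weight = citation_weight if tok.startswith("usc:") else 1.0
        total += weight * doc_tokens.count(tok)
    return total
```

With this sketch, a query written as "17 USC 107" matches a document that says "17 U.S.C. § 107", because both normalize to the same token, and that match counts for more than an ordinary word match.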

Such engines are typically used to remotely index multiple sites that deliver information through a variety of mechanisms, including database-driven sites that hide content behind CGI scripts or content-delivery systems such as Lotus Notes. Information may also be in a variety of formats other than HTML; PDF is especially popular with legal sites, because it is hard for others to modify. Given these characteristics of the legal-information environment, you will need to pay particular attention to the crawling and harvesting components of the engine.
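
One way to organize the harvesting component is to dispatch on the content type of each fetched document. The sketch below is an assumption about structure, not a description of any existing engine: the function names and the EXTRACTORS table are hypothetical, only HTML and plain text are handled, and a PDF extractor would have to be registered separately using whatever PDF library the team selects.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_html(body):
    parser = _TextCollector()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.parts)

def extract_plain(body):
    # Collapse runs of whitespace so the indexer sees clean text.
    return " ".join(body.decode("utf-8", errors="replace").split())

# Hypothetical registry; other formats (for example PDF) would be added here.
EXTRACTORS = {"text/html": extract_html, "text/plain": extract_plain}

def harvest(content_type, body):
    """Return indexable text, or None when no extractor is registered for the type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    extractor = EXTRACTORS.get(media_type)
    return extractor(body) if extractor else None
```

Returning None for unknown types lets the crawler log which formats it is missing rather than indexing raw bytes as if they were text.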

Technical

You can select the technical environment for this project. Work in this area is typically carried out in Perl.


William Y. Arms

(wya@cs.cornell.edu)
Last changed: January 22, 2002