The SEQ Project: Querying Sequence Data

(Document under construction)

Time to put Order in the Database!

Order Time put in the Database!

Time to put the Database in Order!


Document Contents:


Project Objectives

A number of important database applications require the processing of large amounts of ordered sequence data. The domains of these applications include financial management, historical analysis, economic and social sciences, metereology, medical sciences and biological sciences. Existing relational databases are inadequate in this regard; data collections are treated as sets, not sequences. Consequently, expressing sequence queries is tedious, and evaluating them is inefficient.

Databases should These requirements serve as the goals of the SEQ project. Various kinds of sequences need to be supported, temporal sequences being the most important kind. Queries should be expressible using notions like "next" and "previous" which are natural when considering sequences. These queries should be optimized so that they can be evaluated efficiently. These issues need to be studied in theory, and then a database system needs to be built that demonstrates the feasibility of the theoretical ideas.


Project Status

The current status of the project is:


Motivating Example of a Sequence Query

A weather monitoring system records information about various meteorological phenomena. There is a sequentiality in the occurrence of these phenomena; the various meteorological events are sequenced by the time at which they are recorded. A scientist asks the query:

"For which volcano eruptions did the most recent earthquake have a strength greater than 7.0 on the Richter scale?".

If this query is to be expressed in a relational query language like SQL, complex features like groupby clauses, correlated subqueries and aggregate functions are required. Further, a conventional relational query optimizer would not find an efficient query execution plan, even given the knowledge that the Earthquakes and Volcano relations are sorted by time.

However a very efficient plan exists, if one models the data as sequences ordered by time. The two sequences can be scanned in lock step (similar to a sort merge join). The most recent earthquake record scanned can be stored in a temporary buffer. Whenever a volcano record is processed, the value of the most recent earthquake record stored in the buffer is checked to see if its strength was greater than 7.0, possibly generating an answer. This query can therefore be processed with a single scan of the two sequences, and using very little memory. The key to such optimization is the sequentiality of the data and the query.


Data Model

The details of the SEQ data model are described in a published paper (click here for postscript version). Here we present the gist of it. The basic model of a sequence is a set of records mapped to an ordered domain of ``positions''. This many-to-many relationship between records and positions can be viewed in two dual but distinct ways: as a set of records mapped to each position, or as a set of positions mapped to each record. These two views are called ``Positional'' and ``Record-Oriented'' respectively, and each gives rise to a set of query operators based on that view. Queries on sequences could require operators of either or both flavors. The Record-Oriented operators are similar to relational operators and include various kinds of joins (overlap, containment, etc) and aggregates. Such operators have been extensively explored by researchers in the temporal database community.

(Picture of Sequence Mapping)

The Positional operators include Next, Previous, Offset, Moving Aggregates, etc. Further operators allow ``zooming'' operations on sequences by means of collapsing and expanding the ordering domains associated with the sequence. For instance, a daily sequence could be ``zoomed out'' (i.e.collapsed) to a weekly sequence, or ``zoomed in'' (i.e. expanded) to an hourly sequence. The last part of the model deals with operations on groups (i.e. sets) of sequences. The advantage is that this makes it easy to model queries involving sequence collections (which is the case in many real-world situations). All the sequence operators are extended to work with groups of similar sequences, instead of with single sequences. This extension of the SEQ model indicates that a practical implementation of SEQ would probably involve a nested complex object system.


Sequin Query Language

We have devised a query language called Sequin using which declarative sequence queries can be specified. The language is similar in flavor to SQL, except that the inputs to queries as well as the results of queries are sequences. Click here for a description of the Sequin language with examples.


Optimization Techniques

We have proposed new optimization techniques for sequence queries involving Positional operators. There are existing techniques that have been proposed for queries with Record-Oriented operators. Our optimizations use query transformations, meta--data, and caching of intermediate results to efficiently evaluate a query. An optimal query evaluation plan can be generated using an algorithm that relies on cost estimates. One of the important observations is that accessing sequence data in a single stream is probably very efficient, and evaluation strategies should take this into account.

The details of the optimization techniques are described in a published paper (click here for postscript version).


System Development

The SEQ database system has a client-server architecture, supporting multiple clients via a multi-threaded server. The server is built on top of the SHORE storage manager. Both Sequin and a subset of SQL are supported as query languages which can be embedded inside each other. The data model is a nested complex object model that allows arbitrary levels of nesting of relations inside sequences and vice versa. The system is also extensible, providing support for new data types, new ordering domains, user-defined functions, new storage implementations and new query languages. For more details on the SEQ system, click here.


Publications

Sequence Query Processing Praveen Seshadri, Miron Livny and Raghu Ramakrishnan. Proceedings of the ACM SIGMOD Conference on Data Management, May 1994.

SEQ: A Framework for Sequence Data Praveen Seshadri, Miron Livny and Raghu Ramakrishnan. Proceedings of the IEEE Conference on Data Engineering, March 1995.

The Design and Implementation of a Sequence Database System Praveen Seshadri, Miron Livny and Raghu Ramakrishnan. Submitted to VLDB 96.

What's Next? Sequence Queries Raghu Ramakrishnan, Michael Cheng, Miron Livny, and Praveen Seshadri. In Proceedings of the International Conference on the Management of Data (COMAD), December, 1994.


Related Work

The DEVise project is complementary to SEQ. It provides a visualization environment that can be used to explore sequence data. DEVise can act as a front-end through which queries can be posed against a SEQ database server, and the answers can be examined graphically.

Also see:


Contact Information

For more information, contact

Praveen Seshadri, praveen@cs.wisc.edu

Raghu Ramakrishnan, raghu@cs.wisc.edu

Miron Livny, miron@cs.wisc.edu

Computer Sciences Department,
University of Wisconsin,
1210, W.Dayton Street,
Madison, WI 53706.


Last modified: Fri Sep 15 1995 by Praveen Seshadri
Praveen Seshadri / praveen@cs.wisc.edu