Traditional databases are
deterministic: every item is either in the database or is not, 0 or 1.
But modern enterprise applications need to deal with lots of imprecise
data: different representation of the same same object in multiple data
sources, data that has been extracted automatically from unstructured
text, imprecise schema alignments, and many others. Imprecise data is
best modeled as probabilistic data, and managed by a probabilistic
database system: here each tuple has probability between 0 and 1,
and each SQL query returns answers ranked by their probabilities. But
there is a big problem: the data complexity of even some simple SQL
queries is #P-complete. Thus, probabilistic database are a major
paradigm shift, unlike previous extensions of the relational data model
(say, to object-relational, semistructured, or XML data), where all
queries had PTIME complexity.
I will present in this talk results of our study of the SQL query
complexity on probabilistic databases, and describe a new algorithm for
evaluating top-k SQL queries based on a Monte Carlo simulation algorithm
due to Luby and Karp's. The idea in our algorithm is concentrate the
simulation steps on the tuples most likely to be in the top k, thus
minimizing the number of steps spent on the rest. Using this technique
we were able to evaluate SQL queries on a large probabilistic database
(over 1M probabilistic tuples) in about 5 to 50 seconds, using IBM's DB2
database system.
Joint work with Nilesh Dalvi and Chris Re
Bio:
Dan Suciu is an associate
professor in Computer Science at the University of Washington. He
received his Ph.D. from the University of Pennsylvania, was principal
member of the technical staff at Bell Labs, then at AT&T Labs, and in
2000 joined the University of Washington in Seattle. Suciu is conducting
research in data management, with an emphasis on topics that arise from
sharing data on the Internet, such as management of semistructured and
heterogeneous data, data security, and managing imprecisions in data. He
is a co-author of the book Data on the Web: from Relations to
Semistructured Data and XML, holds six US patents, received the 2000 ACM
SIGMOD Best Paper Award, is a recipient of the NSF Career Award and of
an Alfred P. Sloan Fellowship. He likes to work on problems that require
nontrivial theoretical solutions, but he also likes to contribute
(through his students) to practical tools in the public domain, such as
XMill (the XML compressor), and SilkRoute (an XML publishing system with
a comprehensive translator from XQuery to SQL).