SAND

Project Overview:	This project investigates data management issues associated with performing widely, and densely, distributed query processing. Topics include data and operator distribution, parallelization and placement problems. These optimization challenges are subject to network and application oriented tradeoffs, while overcoming the distribution of data sources and clients, to achieve the desired levels of scalability. Our focus is on enabling wide-area query processing, by striving to overcome *structural mismatches* in application deployment between the logical layer and the network layer. This principle of scalability through network-awareness, pervades our mechanisms, leading us towards localized algorithms executing in a distributed manner. We highlight the following key challenges, brought to the forefront by large-scale networked environments, and presently ignored in traditional distributed query processors: Scalability: in terms of the large number of data sources, and the large number of participants. Combined with the wide-dispersion of source and participants, our algorithms cannot depend on any form of global knowledge, and must intelligently incorporate data sampling and network probing mechanisms to maintain any necessary localized metadata. Furthermore the scale of both the network and the queryable data set leads to a combinatorial explosion in potential placements requiring efficient, effective heuristics to optimize over multiple, potentially conflicting, application level and resource utilization objectives. Understanding the tradeoff between the optimality of our system configuration, and the overhead in determining this configuration will lead to a flexible, autonomous system. Adaptivity: in the face of a highly dynamic set of participants, and continuously changing workloads issued by clients. By mixing pull and push-based data flow with event-driven control flow, our system will provide tunable algorithms to provide the necessary evolutionary behaviour in long-running systems. Introspective monitoring will guide our maintenance of a probabilistic model of system behaviour, which in turn, may be used to predict future trends in application and network characteristics, and enable exploitive optimization mechanisms. This project is presently deployed on the PlanetLab network testbed, and leverages the Borealis stream processing engine to execute queries in a distributed manner.

Keywords:	In-network processing, distributed query processing, overlay networks, replica placement, distributed databases.

People:	Ugur Cetintemel	(Faculty)
	John Jannotti	(Faculty)
	Yanif Ahmad	(PhD)
	Alex Zgolinski	(Masters)

Publications:	Y. Ahmad, U. Cetintemel, J. Jannotti, A. Zgolinski, S. Zdonik. Network Awareness in Internet Scale Stream Processing. IEEE Bulletin of the Technical Committee on Data Engineering, March 2005, Vol 28 No. 1. (Invited paper)
	Y. Ahmad, U.Cetintemel, J. Jannotti, A. Zgolinski. Locality Aware Networked Join Evaluation. Proceedings of the First IEEE International Workshop on Networking Meets Databases (NetDB '05).
	Y. Ahmad, U. Cetintemel. Network-Aware Query Processing for Stream-Based Applications. Proceedings of the 30th International Conference on Very Large Data Bases (VLDB '04).

Posters:	Network-Aware Query Processing for Stream-Based Application, VLDB '04, Toronto. (JPG, PDF)