CS 501 Project Concepts: Biozon
Golan Yona, Assistant Professor of Computer Science (golan@cs.cornell.edu)
Aaron Birkland, Graduate Student in Computer Science (birkland@cs.cornell.edu)
The Biozon project
The massive flow of biological data expected over the next decade requires new models for organizing and managing biological databases. The Biozon project addresses this need by creating an exhaustive, scalable, and highly informative system for protein classification, retrieval, and mapping, in which entities range from individual genes to families, complexes, and cell-wide processes, all the way up to organisms.
The main goals of this project are (1) to address the growing need for a unified resource of protein and DNA data that can be searched, queried, and manipulated efficiently, that stores various types of data, that scales easily, and that can compute and verify various biological entities and theories; (2) to give users access to optimal search results and alignments, and to sophisticated tools that are otherwise not available to the whole community; (3) to integrate different novel similarity measures between proteins based on different attributes; (4) to create an infrastructure for global analysis of the protein universe; (5) to integrate the results of the global analysis into the knowledge resource; and (6) to develop novel web tools that facilitate data presentation and manipulation.
There are two main software projects that we suggest for students who would like to be part of the Biozon project:
The scientific community is expected to make heavy use of this platform and will access the data primarily through the World Wide Web. The platform will offer many means of interacting with the complex, intrinsically high-dimensional space created by the data and its relations to other data. The first project is therefore to design a web interface that makes such interaction possible. Users will be presented with several types of searches and will be able to navigate through data as it is drawn together by a variety of explicit and derived relations. Because the set of relations, searches, and data is expected to grow, the interface infrastructure must be modular and extensible.
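One way to picture the modularity requirement is a registry in which each search type is a self-contained plug-in, so that a new search can be added without touching the code that dispatches queries. The sketch below is only an illustration under assumed names: the record layout (a dictionary with "id" and "description" keys), the functions register_search and run_search, and the two example searches are all hypothetical and are not part of Biozon.

```python
from typing import Callable, Dict, List

# Hypothetical record layout: {"id": ..., "description": ...}
Record = Dict[str, str]
SearchFn = Callable[[List[Record], str], List[Record]]

# Registry mapping a search-type name to the function that implements it.
_SEARCHES: Dict[str, SearchFn] = {}


def register_search(name: str) -> Callable[[SearchFn], SearchFn]:
    """Decorator that adds a search type to the registry under `name`."""
    def decorator(fn: SearchFn) -> SearchFn:
        if name in _SEARCHES:
            raise ValueError(f"search type already registered: {name}")
        _SEARCHES[name] = fn
        return fn
    return decorator


@register_search("keyword")
def keyword_search(records: List[Record], query: str) -> List[Record]:
    """Case-insensitive substring match on the description field."""
    q = query.lower()
    return [r for r in records if q in r["description"].lower()]


@register_search("id")
def id_search(records: List[Record], query: str) -> List[Record]:
    """Exact match on the record identifier."""
    return [r for r in records if r["id"] == query]


def available_searches() -> List[str]:
    """Names of all registered search types, e.g. for building a menu."""
    return sorted(_SEARCHES)


def run_search(kind: str, records: List[Record], query: str) -> List[Record]:
    """Dispatch a query to the registered search of the given kind."""
    try:
        fn = _SEARCHES[kind]
    except KeyError:
        raise ValueError(f"unknown search type: {kind}") from None
    return fn(records, query)
```

Under this design, adding a new kind of search (for example, one based on a derived relation) would mean writing one more decorated function; the dispatching code and any menu built from available_searches() would pick it up unchanged.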
Because the body of biological data is growing rapidly and is constantly being refined, our database must incorporate additions and changes quickly and accurately. On one hand, then, the second project is a project in maintainability. On the other hand, as new proteins and new types of data and entities are added, the definitions of protein families change, boundaries shift and fade, and previously untouched regions of the protein space are revealed. It is therefore difficult to maintain a consistent and stable classification system over dynamic content. This project will focus on creating an automated or semi-automated framework that handles the complexities of database updates and supports consistent reads on changing data.
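To make "consistent reads on changing data" concrete, one common approach (not necessarily the one this project would adopt) is to publish every batch of updates as a new immutable version, so that a reader holding an older version never sees a half-applied update. The sketch below assumes a deliberately simplified model in which the classification is just a mapping from protein identifier to family name; the class VersionedClassification and its methods are hypothetical names chosen for illustration.

```python
from types import MappingProxyType
from typing import List, Mapping, Optional


class VersionedClassification:
    """Keeps immutable snapshots of a protein -> family mapping.

    Assumption for illustration: the whole classification fits in memory
    and is copied on every commit; a real system would share structure
    or rely on the database's own versioning.
    """

    def __init__(self) -> None:
        # Version 0 is the empty classification.
        self._snapshots: List[Mapping[str, str]] = [MappingProxyType({})]

    @property
    def latest(self) -> int:
        """Number of the most recently committed version."""
        return len(self._snapshots) - 1

    def commit(self, changes: Mapping[str, Optional[str]]) -> int:
        """Apply a batch of changes atomically and return the new version.

        A family of None removes the protein from the classification.
        """
        new = dict(self._snapshots[-1])
        for protein, family in changes.items():
            if family is None:
                new.pop(protein, None)
            else:
                new[protein] = family
        # Readers only ever see the finished mapping, wrapped read-only.
        self._snapshots.append(MappingProxyType(new))
        return self.latest

    def snapshot(self, version: Optional[int] = None) -> Mapping[str, str]:
        """Return a read-only view of the given version (default: latest)."""
        if version is None:
            version = self.latest
        if not 0 <= version <= self.latest:
            raise IndexError(f"no such version: {version}")
        return self._snapshots[version]
```

A reader that obtained snapshot(1) before a later commit keeps seeing version 1 unchanged, while new readers see the latest version. Copying the full mapping on each commit keeps the sketch short but would not scale to a large database.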
William Y. Arms
(wya@cs.cornell.edu)
Last changed: January 22, 2003