Introduction


This site offers an exhaustive classification of all the proteins in the SWISSPROT and TrEMBL databases into groups of related proteins. The analysis uses transitivity to identify homologous proteins, so that within each group every two members are related either directly or transitively. Transitivity is applied restrictively to prevent unrelated proteins from clustering together. The classification is carried out at different levels of confidence and results in a hierarchical organization of all the proteins. For more details, see 'Methods' below. (Note that this box describes the old procedure. Details on the new procedure will be posted here soon.)

The resulting classification splits the protein space into well-defined groups of proteins, most of which correlate closely with natural biological families and superfamilies (see the references (Proteins, 1999) for comprehensive evaluation results). The hierarchical organization may help to detect finer subfamilies within known protein families, as well as interesting relations between protein families.

For more details see the references.

We will be grateful for any comments, remarks, or suggestions, sent either through the Feedback button at the bottom of each page or by mail directly to golan@gimmel.stanford.edu.

Methods (original procedure): The common measures of similarity between protein sequences (SW, FASTA, BLAST) are combined with two different scoring matrices (BLOSUM 50 and BLOSUM 62) to create an exhaustive list of neighboring sequences for each sequence in the SWISSPROT and TrEMBL databases. These lists induce a representation of the protein space as a weighted directed graph whose vertices are the sequences. The weight of an edge connecting two sequences represents their degree of similarity (the weights are the expectation values of the similarities between the sequences).
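As a minimal sketch of this representation, the neighbor lists can be folded into a weighted directed graph stored as a dictionary of dictionaries. The protein identifiers, the e-values, and the function name build_graph below are hypothetical. Keeping the lowest expectation value when several methods report the same pair is an assumption made for illustration; the text does not state how the measures are combined.

```python
from collections import defaultdict

# Hypothetical neighbor lists: (query, hit, expectation value).
hits = [
    ("P1", "P2", 1e-120),
    ("P2", "P1", 1e-118),
    ("P2", "P3", 1e-40),
    ("P3", "P2", 1e-35),
    ("P1", "P2", 1e-110),  # same pair reported by a second method
]

def build_graph(hits):
    """Return {query: {hit: e_value}} as a weighted directed graph.

    Assumption: when a pair is reported more than once, the most
    significant (lowest) expectation value is kept as the edge weight.
    """
    graph = defaultdict(dict)
    for query, hit, e_value in hits:
        if query == hit:
            continue  # ignore self-hits
        best = graph[query].get(hit)
        if best is None or e_value < best:
            graph[query][hit] = e_value
    return {query: dict(neighbors) for query, neighbors in graph.items()}

graph = build_graph(hits)
```

The graph is directed because a hit found when searching with one sequence need not be reported, or need not score the same, when searching with the other.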

Clusters of related proteins correspond to strongly connected components of this digraph. The analysis aims to detect these sets automatically, and thus to obtain a classification of all protein sequences as well as a better view of the geometry of the protein space.
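To illustrate the idea of strongly connected components, the sketch below runs Kosaraju's algorithm on the edges whose expectation value is at or below a chosen threshold. It is a simplification: the procedure described here applies transitivity restrictively, which plain component detection does not capture. The function name and the toy edges are hypothetical.

```python
def strongly_connected_components(edges, threshold):
    """Kosaraju's algorithm over edges with e-value <= threshold.

    edges: iterable of (source, target, e_value) tuples.
    Returns a list of sets, one per strongly connected component.
    """
    forward, backward, nodes = {}, {}, set()
    for src, dst, e_value in edges:
        nodes.update((src, dst))
        if e_value <= threshold:
            forward.setdefault(src, []).append(dst)
            backward.setdefault(dst, []).append(src)

    # Pass 1: record vertices in order of depth-first finishing time.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward.get(start, [])))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward.get(nxt, []))))
                    break
            else:
                # All outgoing edges explored: the vertex is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in backward.get(node, []):
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components

# Hypothetical edges: C reaches B only through a weak similarity.
edges = [
    ("A", "B", 1e-120), ("B", "A", 1e-110),
    ("B", "C", 1e-105), ("C", "B", 1e-50),
    ("C", "D", 1e-130), ("D", "C", 1e-101),
]
strict = strongly_connected_components(edges, 1e-100)
relaxed = strongly_connected_components(edges, 1e-50)
```

In this toy example the strict threshold separates {A, B} from {C, D}, because the edge from C back to B is too weak to count; at the relaxed threshold all four sequences fall into one component.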

The analysis starts from a very conservative classification, based on highly significant similarities (with expectation value below 1e-100), which consists of many classes. Subsequently, classes are merged to account for less significant similarities. Merging is performed by a two-phase algorithm. First, the algorithm identifies groups of possibly related clusters (based on transitivity and strong connectivity) using local considerations. Then, a global test is applied to identify nuclei of strong relationships within these groups of clusters, and clusters are merged accordingly.

This process takes place at varying thresholds of statistical significance (confidence levels). At each step the algorithm is applied to the classes of the previous classification to obtain the next one at a more permissive threshold. The analysis starts at the 1e-100 threshold. Subsequent runs are carried out at levels 1e-95, 1e-90, 1e-85, ..., 1e-0 (= 1). Consequently, a hierarchical organization of all proteins is obtained.
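The level-by-level construction can be sketched as follows. The threshold schedule follows the text (1e-100 relaxed in steps of five orders of magnitude down to 1e-0), but the merge rule is only a placeholder: two classes are joined when some pair of their members is similar in both directions at the current threshold. The actual two-phase local and global tests are not reproduced here, and all names and e-values are hypothetical.

```python
def threshold_schedule(start_exp=100, step=5):
    """E-value thresholds 1e-100, 1e-95, ..., 1e-0, most stringent first."""
    return [float(f"1e-{k}") for k in range(start_exp, -1, -step)]

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def hierarchical_levels(edges, thresholds):
    """Yield (threshold, clusters); each level builds on the previous one.

    Placeholder merge rule (an assumption, not the ProtoMap test): two
    classes merge when some pair of members is similar in both
    directions at or below the current threshold.
    """
    e_values = {(s, t): e for s, t, e in edges}
    parent = {v: v for pair in e_values for v in pair}
    for threshold in thresholds:
        for (s, t), e in e_values.items():
            back = e_values.get((t, s))
            if e <= threshold and back is not None and back <= threshold:
                parent[find(parent, s)] = find(parent, t)
        clusters = {}
        for v in parent:
            clusters.setdefault(find(parent, v), set()).add(v)
        yield threshold, sorted(sorted(c) for c in clusters.values())

# Hypothetical similarities: A and B are close, C is a distant relative of B.
edges = [
    ("A", "B", 1e-120), ("B", "A", 1e-110),
    ("B", "C", 1e-62), ("C", "B", 1e-52),
]
thresholds = threshold_schedule()
levels = list(hierarchical_levels(edges, thresholds))
```

Because the union-find structure is carried over from one threshold to the next, classes can only grow as the threshold is relaxed, which is what makes the sequence of classifications a hierarchy.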



Copyright © 2000 Golan Yona and the ProtoMap authors.