The resulting classification splits the protein space into well-defined groups of proteins, most of which correlate closely with natural biological families and superfamilies (see the references (Proteins, 1999) for comprehensive evaluation results). The hierarchical organization may help to detect finer subfamilies within known protein families, as well as interesting relations between protein families.
For more details see the references.
We will be grateful for any comments, remarks, or suggestions, sent either via the Feedback button at the bottom of each page or by mail directly to golan@gimmel.stanford.edu.
Methods (original procedure): The common measures of similarity between protein sequences (Smith-Waterman, FASTA, BLAST) are combined with two different scoring matrices (BLOSUM50 and BLOSUM62) to create an exhaustive list of neighboring sequences for each sequence in the SWISSPROT and TrEMBL databases. These lists induce a representation of the protein space as a weighted directed graph whose vertices are the sequences. The weight of an edge connecting two sequences represents their degree of similarity (the weights are the expectation values of the similarities between the sequences).
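As a rough illustration of the graph representation described above, the following Python sketch builds a weighted directed graph from per-sequence neighbor lists. The function name `build_similarity_graph`, the input layout, and the rule of keeping the smallest expectation value when a pair is reported more than once are all assumptions made for this example; the text does not specify how the scores from the different methods and matrices are combined.

```python
from collections import defaultdict


def build_similarity_graph(neighbor_lists):
    """Build a weighted directed graph from neighbor lists.

    neighbor_lists: dict mapping a query sequence id to a list of
    (hit id, expectation value) pairs, pooled over all search methods.
    Returns a dict: sequence id -> {neighbor id: expectation value}.
    """
    graph = defaultdict(dict)
    for query, hits in neighbor_lists.items():
        graph[query]  # make sure every query is a vertex, even with no hits
        for hit, evalue in hits:
            if hit == query:
                continue  # ignore self-hits
            graph[hit]  # hits are vertices too
            # Assumption: if a pair is reported several times, keep the
            # most significant (smallest) expectation value.
            best = graph[query].get(hit)
            if best is None or evalue < best:
                graph[query][hit] = evalue
    return {seq: dict(nbrs) for seq, nbrs in graph.items()}


example = build_similarity_graph({
    "A": [("B", 1e-120), ("B", 1e-80)],
    "B": [("A", 1e-110)],
    "C": [],
})
print(example)
```

The edge A -> B and the edge B -> A may carry different weights, which is why the graph is directed.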
Clusters of related proteins correspond to strongly connected components of this digraph. The analysis aims to detect these sets automatically, and thus to obtain a classification of all protein sequences as well as a better view of the geometry of the protein space. The analysis starts from a very conservative classification, based on highly significant similarities (with expectation value below 1e-100), that consists of many classes. Subsequently, classes are merged to account for less significant similarities. Merging is performed via a two-phase algorithm. First, the algorithm identifies groups of possibly related clusters (based on transitivity and strong connectivity) using local considerations. Then, a global test is applied to identify nuclei of strong relationships within these groups of clusters, and clusters are merged accordingly. This process takes place at varying thresholds of statistical significance (confidence levels): at each step the algorithm is applied to the classes of the previous classification to obtain the next one at a more permissive threshold. The analysis starts at the 1e-100 threshold. Subsequent runs are carried out at levels 1e-95, 1e-90, 1e-85, ..., 1e-0 (= 1). Consequently, a hierarchical organization of all proteins is obtained.
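The threshold schedule and the way each level builds on the previous one can be sketched in Python as follows. This is not the two-phase algorithm described above: as a deliberately simplified stand-in, the sketch merges two classes whenever some pair of their members is connected by edges in both directions with expectation values below the current threshold. The names `thresholds`, `find`, and `hierarchical_classes` are hypothetical, and the input is assumed to be a dict of the form `sequence id -> {neighbor id: expectation value}`.

```python
def thresholds():
    """Confidence levels 1e-100, 1e-95, ..., 1e-5, 1e0 (21 levels)."""
    return [10.0 ** exponent for exponent in range(-100, 1, 5)]


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def hierarchical_classes(graph):
    """Return a list of (threshold, classes), one entry per level.

    Simplification (an assumption of this sketch): classes are merged on
    reciprocal edges whose expectation values are both below the threshold,
    instead of the local and global tests of the original procedure.
    """
    parent = {seq: seq for seq in graph}
    levels = []
    for threshold in thresholds():
        for u, nbrs in graph.items():
            for v, evalue in nbrs.items():
                back = graph.get(v, {}).get(u)
                if evalue < threshold and back is not None and back < threshold:
                    root_u, root_v = find(parent, u), find(parent, v)
                    if root_u != root_v:
                        parent[root_v] = root_u
        # The union-find state is kept between levels, so each level
        # starts from the classes of the previous classification.
        classes = {}
        for seq in graph:
            classes.setdefault(find(parent, seq), set()).add(seq)
        levels.append((threshold, sorted(sorted(c) for c in classes.values())))
    return levels


toy_graph = {
    "A": {"B": 1e-120},
    "B": {"A": 1e-110, "C": 1e-52},
    "C": {"B": 1e-43},
    "D": {},
}
for threshold, classes in hierarchical_classes(toy_graph):
    print(threshold, classes)
```

Because classes are only ever merged and never split as the threshold becomes more permissive, the sequence of classifications is nested, which is what yields a hierarchy.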
Copyright © 2000 Golan Yona and the ProtoMap authors.
Contact ProtoMap