Proteins have evolved for more than a billion years and in today's organisms
we can clearly group them into families with common sequence, structure, or
function. Arranging the letters of each string into columns of similarities,
dissimilarities, and gaps is perhaps the oldest and most basic problem in
bioinformatics. In the first part of the talk I will talk about a method we
have developed for multiple sequence alignment, called ProbCons, which
combines a hidden Markov model (HMM) and novel inference techniques.
ProbCons outperforms existing methods in benchmark data sets of real
proteins, even though it does not incorporate domain-specific knowledge.
Then, I will describe a conditional random field (CRF) for sequence
alignment, which can be used to derive superior alignment parameters from a
rich set of biological features such as hydrophobicity, using training sets
of very limited size.
Given multiple sources of functional genomics data, graphs of interactions
between all proteins of an organism can be constructed. We are constructing
such networks for the 230 sequenced bacteria, and we search for modules
(subgraphs) that are statistically significantly conserved across several of
the species. Past approaches to network alignment restrict the types of
alignments that can be performed to simple pathways or cliques, and moreover
cannot perform alignment of more than 3 networks. I will present a new
formulation of alignment that is sufficiently general to remove arbitrary
assumptions on the nature of alignment. Our algorithm, implemented in a tool
called Nuke, scales efficiently to large databases and to multiple alignment
of an arbitrary number of networks. Test results on bacterial networks show
that our algorithm can uncover core conserved modules and hint to novel
biological hypotheses.
|