Uri's list of papers

Optimal Spaced Seeds for Hidden Markov Models, with Application to Homologous Coding Regions

Bro ncaron a Brejová, Daniel G. Brown and Tomá scaron Vina rcaron

Lecture Notes in Computer Science, Volume 2676 / 2003, CPM 2003

We study the problem of computing optimal spaced seeds for detecting sequences generated by a Hidden Markov model. Inspired by recent work in DNA sequence alignment, we have developed such a model for representing the conservation between related DNA coding sequences. Our model includes positional dependencies and periodic rates of conservation, as well as regional deviations in overall conservation rate. We show that, for hidden Markov models in general, the probability that a seed is matched in a region can be computed efficiently, and use these methods to compute the optimal seed for our models. Our experiments on real data show that the optimal seeds are substantially more sensitive than the seeds used in the standard alignment program BLAST, and also substantially better than those of PatternHunter or WABA, both of which use spaced seeds. Our results offer the hope of improved gene finding due to fewer missed exons in DNA/DNA comparison, and more effective homology search in general, and may have applications outside of bioinformatics.

Vector Seeds An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity

We present improved techniques for finding homologous regions in DNA and protein sequences. Our approach focuses on the core region of a local pairwise alignment; we suggest new ways to characterize these regions that allow marked improvements in both specificity and sensitivity over existing techniques for sequence alignment. For any such characterization, which we call a vector seed, we give an efficient algorithm that estimates the specificity and sensitivity of that seed under reasonable probabilistic models of sequence. We also characterize the probability of a match when an alignment is required to have multiple hits before it is detected. Our extensions fit well with existing approaches to sequence alignment, while still offering substantial improvement in runtime and sensitivity, particularly for the important problem of identifying matches between homologous coding DNA sequences.

Abstract: RNA interference (RNAi) is an evolutionarily conserved mechanism in plant and animal cells that directs the degradation of messenger RNAs homologous to short double-stranded RNAs termed small interfering RNA (siRNA). The ability of siRNA to direct gene silencing in mammalian cells has raised the possibility that siRNA might be used to investigate gene function in a high throughput fashion or to modulate gene expression in human diseases. The specificity of siRNA-mediated silencing, a critical consideration in these applications, has not been addressed on a genomewide scale. Here we show that siRNA-induced gene silencing of transient or stably expressed mRNA is highly gene-specific and does not produce secondary effects detectable by genomewide expression profiling. A test for transitive RNAi, extension of the RNAi effect to sequences 5' of the target region that has been observed in Caenorhabditis elegans, was unable to detect this phenomenon in human cells.

Journal of the American Statistical Association, Volume: 98 Number: 461 Page: 55 -- 66

Abstract: Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missingdata approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency.

Abstract: Epigenetic modifications of chromatin serve an important role in regulating the expression and accessibility of genomic DNA. We report here a genomewide approach for fractionating yeast chromatin into two functionally distinct parts, one containing RNA polymerase II transcribed sequences, and the other comprising noncoding sequences and genes transcribed by RNA polymerases I and III. Noncoding regions could be further fractionated into promoters and segments lacking promoters. The observed separations were apparently based on differential crosslinking efficiency of chromatin in different genomic regions. The results reveal a genomewide molecular mechanism for marking promoters and genomic regions that have a license to be transcribed by RNA polymerase II, a previously unrecognized level of genomic complexity that may exist in all eukaryotes. Our approach has broad potential use as a tool for genome annotation and for the characterization of global changes in chromatin structure that accompany different genetic, environmental, and disease states.

Authors: Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED

Abstract: The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.

Abstract: With the availability of whole-genome sequences for an increasing number of species, we are now faced with the challenge of decoding the information contained within these DNA sequences. Comparative analysis of DNA sequences from multiple species at varying evolutionary distances is a powerful approach for identifying coding and functional noncoding sequences, as well as sequences that are unique for a given organism. In this review, we outline the strategy for choosing DNA sequences from different species for comparative analyses and describe the methods used and the resources publicly available for these studies.

Authors: Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, Schwartz S, Furey TS, Whelan S, Goldman N, Smit A, Miller W, Chiaromonte F, Haussler D

Abstract: Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human-mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of "neutral" sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for, by GC content in human and the difference between GC content in human and mouse.

Scoring two-species local alignments to try to statistically separate neutrally evolving from selected DNA segments

Proceedings of the seventh annual international conference on Computational molecular biology (RECOMB), Berlin, Pages: 257 – 266.

We construct several score functions for use in locating unusually conserved regions in a genome-wide search of aligned DNA from two species. We test these functions on regions of the human genome aligned to the mouse genome. These score functions are derived from properties of neutrally evolving sites on the mouse and human genome, and can be adjusted to the local background rate of conservation. The aim of these functions is to try to identify regions of the human genome that are conserved by evolutionary selection, because they have an important function, rather than by chance. We use them to get a very rough estimate of the amount of DNA in the human genome that is under selection.

Detecting protein sequence conservation via metric embeddings

Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.

Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.

Motivation: The discovery of cis-regulatory modules in metazoan genomes is crucial for understanding the connection between genes and organism diversity.

Results: We develop a computational method that uses Hidden Markov Models and an Expectation Maximization algorithm to detect such modules, given the weight matrices of a set of transcription factors known to work together. Two novel features of our probabilistic model are: (i) correlations between binding sites, known to be required for module activity, are exploited, and (ii) phylogenetic comparisons among sequences from multiple species are made to highlight a regulatory module. The novel features are shown to improve detection of modules, in experiments on synthetic as well as biological data.

Motivation: The binding of transcription factors to specific regulatory sequence elements is a primary mechanism for controlling gene transcription. Recent findings suggest a modular organization of binding sites for transcription factors that cooperate in the regulation of genes. In this work we establish a framework for finding recurrent cis-regulatory modules in the promoters of a selected set of genes and scoring their statistical significance.

Results: Proceeding from a database of identified binding site motifs and their genomic locations we seek motifs whose frequency in the selected promoters is different than in a background promoter set. We present several statistical tests designed for this purpose. We provide a hashing algorithm for detecting combinations of these motifs that co-occur in clusters within the selected promoters. The significance of such co-occurrences is evaluated using novel statistical scores. Our methods are combined in CREME, a suite of software which includes a browser for viewing the pattern of occurrence of selected cis-regulatory modules. We applied our methodology to find modules within human-mouse conserved promoter segments, focusing on cell cycle regulated genes and stress response related genes. To validate the biological significance of the identified modules we tested whether the associated genes tended to be co-expressed or share similar function. In the cell cycle set five of the seven identified sets of genes were coherently expressed. On the stress response data four of the six detected sets fell predominantly into well-defined functional sub-categories.