Title: The microRNAs of Caenorhabditis elegans

Authors: Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP

Ref: Genes Dev 2003 Apr 15;17(8):991-1008

Abstract: MicroRNAs (miRNAs) are an abundant class of tiny RNAs thought to regulate the expression of protein-coding genes in plants and animals. In the present study, we describe a computational procedure to identify miRNA genes conserved in more than one genome. Applying this program, known as MiRscan, together with molecular identification and validation methods, we have identified most of the miRNA genes in the nematode Caenorhabditis elegans. The total number of validated miRNA genes stands at 88, with no more than 35 genes remaining to be detected or validated. These 88 miRNA genes represent 48 gene families; 46 of these families (comprising 86 of the 88 genes) are conserved in Caenorhabditis briggsae, and 22 families are conserved in humans. More than a third of the worm miRNAs, including newly identified members of the lin-4 and let-7 gene families, are differentially expressed during larval development, suggesting a role for these miRNAs in mediating larval developmental transitions. Most are present at very high steady-state levels (more than 1000 molecules per cell, with some exceeding 50,000 molecules per cell). Our census of the worm miRNAs and their expression patterns helps define this class of noncoding RNAs, lays the groundwork for functional studies, and provides the tools for more comprehensive analyses of miRNA genes in other species.



Title: Genomewide view of gene silencing by small interfering RNAs

Authors: Chi JT, Chang HY, Wang NN, Chang DS, Dunphy N, Brown PO

Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6343-6

Abstract: RNA interference (RNAi) is an evolutionarily conserved mechanism in plant and animal cells that directs the degradation of messenger RNAs homologous to short double-stranded RNAs termed small interfering RNA (siRNA). The ability of siRNA to direct gene silencing in mammalian cells has raised the possibility that siRNA might be used to investigate gene function in a high throughput fashion or to modulate gene expression in human diseases. The specificity of siRNA-mediated silencing, a critical consideration in these applications, has not been addressed on a genomewide scale. Here we show that siRNA-induced gene silencing of transient or stably expressed mRNA is highly gene-specific and does not produce secondary effects detectable by genomewide expression profiling. A test for transitive RNAi, extension of the RNAi effect to sequences 5' of the target region that has been observed in Caenorhabditis elegans, was unable to detect this phenomenon in human cells.



Significance analysis of microarrays applied to the ionizing radiation response

Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu

Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.

http://www.pnas.org/cgi/content/abstract/98/9/5116

PNAS | April 24, 2001 | vol. 98 | no. 9 | 5116-5121
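
The scoring and permutation steps described in the SAM abstract above can be sketched in a few lines. This is a simplified illustration, not the published procedure: the pooled-variance score, the small constant `s0`, the fixed cutoff on the absolute score, and all function names are assumptions made here.

```python
import random
import statistics


def relative_difference(a, b, s0=0.1):
    """Change in mean expression relative to the standard error of the
    repeated measurements. s0 is an assumed small constant that keeps
    low-variance genes from dominating."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / (std_err + s0)


def estimate_fdr(genes, n_a, threshold, n_perm=100, seed=0):
    """genes: one list of measurements per gene; the first n_a values
    belong to state A, the rest to state B.
    Returns (number of genes called, estimated FDR)."""
    rng = random.Random(seed)

    def count_called(order):
        calls = 0
        for g in genes:
            values = [g[i] for i in order]
            score = relative_difference(values[:n_a], values[n_a:])
            if abs(score) > threshold:
                calls += 1
        return calls

    order = list(range(len(genes[0])))
    called = count_called(order)
    if called == 0:
        return 0, 0.0
    # Shuffling the sample labels breaks any real A/B difference, so genes
    # that still pass the threshold are counted as called by chance.
    chance_calls = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        chance_calls += count_called(order)
    return called, min(1.0, chance_calls / n_perm / called)
```

Raising `threshold` trades fewer called genes for a lower estimated FDR, which is the adjustable trade-off the abstract refers to.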


Statistical significance for genomewide studies

John D. Storey and Robert Tibshirani

With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.


http://www.pnas.org/cgi/content/full/100/16/9440

PNAS, August 5, 2003; 100(16): 9440 - 9445
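
As a rough illustration of the q value in the Storey and Tibshirani abstract above, the sketch below turns a list of p values into q values. It assumes a fixed tuning parameter `lam` for estimating the proportion of true null features (`pi0`); that choice and the function name are illustrative assumptions, and the abstract itself does not specify how the estimate is made.

```python
def q_values(pvals, lam=0.5):
    """Return one q value per p value (same order as the input)."""
    m = len(pvals)
    # Assumed estimator: p values above lam are taken to come mostly from
    # true nulls, which are uniform on [0, 1].
    pi0 = min(1.0, sum(p > lam for p in pvals) / (m * (1 - lam)))
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that each q value is the
    # smallest estimated FDR among all cutoffs at or above that p value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvals[i] / rank)
        q[i] = running_min
    return q
```

Calling a feature significant when its q value is at most 0.05 then targets a false discovery rate of about 5% among the features called, rather than a 5% false positive rate among all features tested.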


Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model

Mayetri Gupta; Jun S. Liu

Journal of the American Statistical Association, Volume: 98 Number: 461 Page: 55 -- 66

Abstract: Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missing-data approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency.

http://www.people.fas.harvard.edu/~gupta6/papers/sdict.pdf
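
The generative side of the dictionary model in the abstract above (sequences built by concatenating stochastic words) can be sketched as follows. Only sampling is shown, not the data augmentation used for inference. The single-motif dictionary, the probabilities, and all names are hypothetical choices for illustration.

```python
import random

ALPHABET = "ACGT"


def sample_word(rng, motif_pwm, motif_prob, background):
    """Draw one word: a motif instance from the weight matrix with
    probability motif_prob, otherwise a single background nucleotide."""
    if rng.random() < motif_prob:
        word = "".join(rng.choices(ALPHABET, weights=column)[0]
                       for column in motif_pwm)
        return word, True
    return rng.choices(ALPHABET, weights=background)[0], False


def generate_sequence(n_words, motif_pwm, motif_prob=0.05,
                      background=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Concatenate n_words sampled words.
    Returns (sequence, list of motif start positions)."""
    rng = random.Random(seed)
    pieces, motif_starts, position = [], [], 0
    for _ in range(n_words):
        word, is_motif = sample_word(rng, motif_pwm, motif_prob, background)
        if is_motif:
            motif_starts.append(position)
        pieces.append(word)
        position += len(word)
    return "".join(pieces), motif_starts
```

In the inference problem the motif start positions are the missing data: only the concatenated sequence is observed.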



Transcriptional regulatory cascades in development: Initial rates, not steady state, determine network kinetics

Hamid Bolouri and Eric H. Davidson

A model was built to examine the kinetics of regulatory cascades such as occur in developmental gene networks. The model relates occupancy of cis-regulatory target sites to transcriptional initiation rate, and thence to RNA and protein output. The model was used to simulate regulatory cascades in which genes encoding transcription factors are successively activated. Using realistic parameter ranges based on extensive earlier measurements in sea urchin embryos, we find that transitions of regulatory states occur sharply in these simulations, with respect to time or changing transcription factor concentrations. As is often observed in developing systems, the simulated regulatory cascades display a succession of gene activations separated by delays of some hours. The most important causes of this behavior are cooperativity in the assembly of cis-regulatory complexes and the high specificity of transcription factors for their target sites. Successive transitions in state occur long in advance of the approach to steady-state levels of the molecules that drive the process. The kinetics of such developmental systems thus depend mainly on the initial output rates of genes activated in response to the advent of new transcription factors.

http://www.pnas.org/cgi/content/full/100/16/9371
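
The kind of cascade described in the abstract above can be illustrated with a toy simulation. This is not the authors' model: it assumes a generic Hill function for cooperative site occupancy, collapses RNA and protein into one level per gene, and uses arbitrary units and hypothetical parameter values throughout.

```python
def occupancy(level, half_max, cooperativity):
    """Assumed Hill-type occupancy of a target site by its regulator."""
    return level ** cooperativity / (half_max ** cooperativity
                                     + level ** cooperativity)


def simulate_cascade(n_genes=3, t_end=40.0, dt=0.01, synthesis=1.0,
                     decay=0.1, half_max=2.0, cooperativity=4,
                     input_level=10.0):
    """Euler integration of a linear cascade in which each gene product
    activates the next gene. Returns, per gene, the time at which its
    level first reaches half_max (None if it never does)."""
    levels = [0.0] * n_genes
    crossing_times = [None] * n_genes
    for step in range(int(t_end / dt)):
        upstream = [input_level] + levels[:-1]
        levels = [
            x + dt * (synthesis * occupancy(u, half_max, cooperativity)
                      - decay * x)
            for x, u in zip(levels, upstream)
        ]
        for i, x in enumerate(levels):
            if crossing_times[i] is None and x >= half_max:
                crossing_times[i] = (step + 1) * dt
    return crossing_times
```

With these defaults each gene's steady-state level would be about `synthesis / decay = 10`, yet the next gene switches on once the level reaches `half_max = 2`, so in this toy setting each transition is governed by the early accumulation rate rather than by the approach to steady state.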


Title: Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution

Authors: Pevzner P, Tesler G

Ref: Proc Natl Acad Sci USA 2003 Jun 24;100(13):7672-7

Abstract: The human and mouse genomic sequences provide evidence for a larger number of rearrangements than previously thought and reveal extensive reuse of breakpoints from the same short fragile regions. Breakpoint clustering in regions implicated in cancer and infertility has been reported in previous studies; we report here on breakpoint clustering in chromosome evolution. This clustering reveals limitations of the widely accepted random breakage theory that has remained unchallenged since the mid-1980s. The genome rearrangement analysis of the human and mouse genomes implies the existence of a large number of very short "hidden" synteny blocks that were invisible in the comparative mapping data and ignored in the random breakage model. These blocks are defined by closely located breakpoints and are often hard to detect. Our results suggest a model of chromosome evolution that postulates that mammalian genomes are mosaics of fragile regions with high propensity for rearrangements and solid regions with low propensity for rearrangements.


Title: Genomewide demarcation of RNA polymerase II transcription units revealed by physical fractionation of chromatin

Authors: Nagy PL, Cleary ML, Brown PO, Lieb JD

Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6364-9

Abstract: Epigenetic modifications of chromatin serve an important role in regulating the expression and accessibility of genomic DNA. We report here a genomewide approach for fractionating yeast chromatin into two functionally distinct parts, one containing RNA polymerase II transcribed sequences, and the other comprising noncoding sequences and genes transcribed by RNA polymerases I and III. Noncoding regions could be further fractionated into promoters and segments lacking promoters. The observed separations were apparently based on differential crosslinking efficiency of chromatin in different genomic regions. The results reveal a genomewide molecular mechanism for marking promoters and genomic regions that have a license to be transcribed by RNA polymerase II, a previously unrecognized level of genomic complexity that may exist in all eukaryotes. Our approach has broad potential use as a tool for genome annotation and for the characterization of global changes in chromatin structure that accompany different genetic, environmental, and disease states.


Title: Comparative analyses of multi-species sequences from targeted genomic regions

Authors: Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED

Ref: Nature 2003 Aug 14;424(6950):788-93

Abstract: The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.


Title: Cross-species sequence comparisons: a review of methods and available resources

Authors: Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC

Ref: Genome Res 2003 Jan;13(1):1-12

Abstract: With the availability of whole-genome sequences for an increasing number of species, we are now faced with the challenge of decoding the information contained within these DNA sequences. Comparative analysis of DNA sequences from multiple species at varying evolutionary distances is a powerful approach for identifying coding and functional noncoding sequences, as well as sequences that are unique for a given organism. In this review, we outline the strategy for choosing DNA sequences from different species for comparative analyses and describe the methods used and the resources publicly available for these studies.


Title: Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution.

Authors: Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, Schwartz S, Furey TS, Whelan S, Goldman N, Smit A, Miller W, Chiaromonte F, Haussler D

Ref: Genome Res 2003 Jan;13(1):13-26

Abstract: Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human-mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of "neutral" sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for by, GC content in human and the difference between GC content in human and mouse.


Scoring two-species local alignments to try to statistically separate neutrally evolving from selected DNA segments

Krishna M. Roskin, Mark Diekhans, David Haussler

Proceedings of the seventh annual international conference on Computational molecular biology (RECOMB), Berlin, Pages: 257 – 266.


We construct several score functions for use in locating unusually conserved regions in a genome-wide search of aligned DNA from two species. We test these functions on regions of the human genome aligned to the mouse genome. These score functions are derived from properties of neutrally evolving sites on the mouse and human genome, and can be adjusted to the local background rate of conservation. The aim of these functions is to try to identify regions of the human genome that are conserved by evolutionary selection, because they have an important function, rather than by chance. We use them to get a very rough estimate of the amount of DNA in the human genome that is under selection.

http://portal.acm.org/citation.cfm?id=640109&jmp=cit&dl=GUIDE&dl=ACM&CFID=11891234&CFTOKEN=11681079#CIT


Title: Sequencing and comparison of yeast species to identify genes and regulatory elements

Authors: Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES

Ref: Nature 423, 241 - 254 (2003)

Abstract: Identifying the functional elements encoded in a genome is one of the principal challenges in modern biology. Comparative genomics should offer a powerful, general approach. Here, we present a comparative analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related species (S. paradoxus, S. mikatae and S. bayanus). We first aligned the genomes and characterized their evolution, defining the regions and mechanisms of change. We then developed methods for direct identification of genes and regulatory motifs. The gene analysis yielded a major revision to the yeast gene catalogue, affecting approximately 15% of all genes and reducing the total count by about 500 genes. The motif analysis automatically identified 72 genome-wide elements, including most known regulatory motifs and numerous new motifs. We inferred a putative function for most of these motifs, and provided insights into their combinatorial interactions. The results have implications for genome analysis of diverse organisms, including the human.



Detecting protein sequence conservation via metric embeddings

E. Halperin, J. Buhler, R. Karp, R. Krauthgamer and B. Westover

Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.

Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.

http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i122
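
The idea in the Results above of representing a distance by the Hamming distance between bit vectors can be illustrated with a generic unary encoding of small integer coordinates, under which Hamming distance equals the L1 distance exactly. This is a textbook construction chosen for illustration; the abstract does not say which encoding the authors use, and the coordinates below are hypothetical rather than derived from a BLOSUM matrix.

```python
def unary(value, width):
    """Encode an integer 0..width as `value` ones followed by zeros."""
    if not 0 <= value <= width:
        raise ValueError("value must lie between 0 and width")
    return [1] * value + [0] * (width - value)


def embed(coords, width):
    """Concatenate the unary codes of all coordinates into one bit vector."""
    bits = []
    for value in coords:
        bits.extend(unary(value, width))
    return bits


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))
```

For example, the hypothetical points (3, 0, 5) and (1, 2, 5) are at L1 distance 4, and their embedded bit vectors with `width=5` differ in exactly 4 positions.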


Title: Transcriptional Regulatory Networks in Saccharomyces cerevisiae

Authors: Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA

Ref: Science 2002 Oct 25;298(5594):799-804

Abstract: We have determined how most of the transcriptional regulators encoded in the eukaryote Saccharomyces cerevisiae associate with genes across the genome in living cells. Just as maps of metabolic networks describe the potential pathways that may be used by a cell to accomplish metabolic processes, this network of regulator-gene interactions describes potential pathways yeast cells can use to regulate global gene expression programs. We use this information to identify network motifs, the simplest units of network architecture, and demonstrate that an automated process can use motifs to assemble a transcriptional regulatory network structure. Our results reveal that eukaryotic cellular functions are highly connected through networks of transcriptional regulators that regulate other transcriptional regulators.


Title: Program-specific distribution of a transcription factor dependent on partner transcription factor and MAPK signaling

Authors: Zeitlinger J, Simon I, Harbison CT, Hannett NM, Volkert TL, Fink GR, Young RA

Ref: Cell 2003 May 2;113(3):395-404

Abstract: Specialized gene expression programs are induced by signaling pathways that act on transcription factors. Whether these transcription factors can function in multiple developmental programs through a global switch in promoter selection is not known. We have used genome-wide location analysis to show that the yeast Ste12 transcription factor, which regulates mating and filamentous growth, is bound to distinct program-specific target genes dependent on the developmental condition. This condition-dependent distribution of Ste12 requires concurrent binding of the transcription factor Tec1 during filamentation and is differentially regulated by the MAP kinases Fus3 and Kss1. Program-specific distribution across the genome may be a general mechanism by which transcription factors regulate distinct gene expression programs in response to signaling.


Title: Untangling the wires: A strategy to trace functional interactions in signaling and gene networks

Authors: Kholodenko BN, Kiyatkin A, Bruggeman FJ, Sontag E, Westerhoff HV, Hoek JB

Ref: Proc Natl Acad Sci USA 2002 Oct 1;99(20):12841-6

Abstract: Emerging technologies have enabled the acquisition of large genomics and proteomics data sets. However, current methodologies for analysis do not permit interpretation of the data in ways that unravel cellular networking. We propose a quantitative method for determining functional interactions in cellular signaling and gene networks. It can be used to explore cell systems at a mechanistic level or applied within a "modular" framework, which dramatically decreases the number of variables to be assayed. This method is based on a mathematical derivation that demonstrates how the topology and strength of network connections can be retrieved from experimentally measured network responses to successive perturbations of all modules. Importantly, our analysis can reveal functional interactions even when the components of the system are not all known. Under these circumstances, some connections retrieved by the analysis will not be direct but correspond to the interaction routes through unidentified elements. The method is tested and illustrated by using computer-generated responses of a modeled mitogen-activated protein kinase cascade and gene network.


Trends Genet. 2002 Aug;18(8):395-8.

Linking the genes: inferring quantitative gene networks from microarray data.

de la Fuente A, Brazhnik P, Mendes P.

Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, 1880 Pratt Drive, Blacksburg, VA 24061, USA.

Modern microarray technology is capable of providing data about the expression of thousands of genes, and even of whole genomes. An important question is how this technology can be used most effectively to unravel the workings of cellular machinery. Here, we propose a method to infer genetic networks on the basis of data from appropriately designed microarray experiments. In addition to identifying the genes that affect a specific other gene directly, this method also estimates the strength of such effects. We will discuss both the experimental setup and the theoretical background.

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6TCY-46C6G3X-9&_user=492137&_coverDate=08%2F01%2F2002&_fmt=full&_orig=browse&_cdi=5183&view=c&_acct=C000022719&_version=1&_urlVersion=0&_userid=492137&md5=35c1683e1ba637566f99a20a64b700f4&ref=full



Title: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data

Authors: Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N

Ref: Nat Genet 2003 Jun;34(2):166-76

Abstract: Much of a cell's activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. We present a probabilistic method for identifying regulatory modules from gene expression data. Our procedure identifies modules of coregulated genes, their regulators and the conditions under which regulation occurs, generating testable hypotheses in the form 'regulator X regulates module Y under conditions W'. We applied the method to a Saccharomyces cerevisiae expression data set, showing its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.



Genome Res. 2002 Mar;12(3):470-81.

Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers.
Papatsenko DA, Makeev VJ, Lifanov AP, Regnier M, Nazina AG, Desplan C.

The early developmental enhancers of Drosophila melanogaster comprise one of the most sophisticated regulatory systems in higher eukaryotes. An elaborate code in their DNA sequence translates both maternal and early embryonic regulatory signals into spatial distribution of transcription factors. One of the most striking features of this code is the redundancy of binding sites for these transcription factors (BSTF). Using this redundancy, we explored the possibility of predicting functional binding sites in a single enhancer region without any prior consensus/matrix description or evolutionary sequence comparisons. We developed a conceptually simple algorithm, Scanseq, that employs an original statistical evaluation for identifying the most redundant motifs and locates the position of potential BSTF in a given regulatory region. To estimate the biological relevance of our predictions, we built thorough literature-based annotations for the best-known Drosophila developmental enhancers and we generated detailed distribution maps for the most robust binding sites. The high statistical correlation between the location of BSTF in these experiment-based maps and the location predicted in silico by Scanseq confirmed the relevance of our approach. We also discuss the definition of true binding sites and the possible biological principles that govern patterning of regulatory regions and the distribution of transcriptional signals.
http://www.ncbi.nlm.nih.gov/entrez/utils/fref.fcgi?http://www.genome.org/cgi/pmidlookup?view=full&pmid=11875036

Genome Res. 2003 Apr;13(4):579-88.

Homotypic regulatory clusters in Drosophila.
Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA.

Cis-regulatory modules (CRMs) are transcription regulatory DNA segments (approximately 1 Kb range) that control the expression of developmental genes in higher eukaryotes. We analyzed clustering of known binding motifs for transcription factors (TFs) in over 60 known CRMs from 20 Drosophila developmental genes, and we present evidence that each type of recognition motif forms significant clusters within the regulatory regions regulated by the corresponding TF. We demonstrate how a search with a single binding motif can be applied to explore gene regulatory networks and to discover coregulated genes in the genome. We also discuss the potential of the clustering method in interpreting the differential response of genes to various levels of transcriptional regulators.
http://www.ncbi.nlm.nih.gov/entrez/utils/fref.fcgi?http://www.genome.org/cgi/pmidlookup?view=full&pmid=12670999
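
A search for clusters of a single binding motif, as described in the abstract above, might be sketched like this. It is a deliberately simplified illustration: it uses exact string matches instead of a weight matrix, applies a fixed count cutoff instead of a significance test, and the motif string, window size, and function names are all assumptions.

```python
def motif_positions(sequence, motif):
    """Start positions of exact, possibly overlapping, motif matches."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def dense_windows(positions, window=1000, min_sites=3):
    """For each site, count the sites falling within `window` bases
    downstream of it (itself included) and keep the windows holding at
    least min_sites. Expects positions in ascending order.
    Returns a list of (window start, number of sites)."""
    hits = []
    j = 0
    for i, start in enumerate(positions):
        while j < len(positions) and positions[j] < start + window:
            j += 1
        if j - i >= min_sites:
            hits.append((start, j - i))
    return hits
```

The default window of 1000 bases mirrors the approximate CRM size quoted in the abstract; everything else would need tuning against real data.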

Bioinformatics Vol. 19 Suppl. 1 2003, Pages i292-i301

A probabilistic method to detect regulatory modules

Saurabh Sinha, Erik van Nimwegen and Eric D. Siggia

Motivation: The discovery of cis-regulatory modules in metazoan genomes is crucial for understanding the connection between genes and organism diversity.

Results: We develop a computational method that uses Hidden Markov Models and an Expectation Maximization algorithm to detect such modules, given the weight matrices of a set of transcription factors known to work together. Two novel features of our probabilistic model are: (i) correlations between binding sites, known to be required for module activity, are exploited, and (ii) phylogenetic comparisons among sequences from multiple species are made to highlight a regulatory module. The novel features are shown to improve detection of modules, in experiments on synthetic as well as biological data.

http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i292

CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments

Roded Sharan, Ivan Ovcharenko, Asa Ben-Hur and Richard M. Karp

Motivation: The binding of transcription factors to specific regulatory sequence elements is a primary mechanism for controlling gene transcription. Recent findings suggest a modular organization of binding sites for transcription factors that cooperate in the regulation of genes. In this work we establish a framework for finding recurrent cis-regulatory modules in the promoters of a selected set of genes and scoring their statistical significance.

Results: Proceeding from a database of identified binding site motifs and their genomic locations we seek motifs whose frequency in the selected promoters is different than in a background promoter set. We present several statistical tests designed for this purpose. We provide a hashing algorithm for detecting combinations of these motifs that co-occur in clusters within the selected promoters. The significance of such co-occurrences is evaluated using novel statistical scores. Our methods are combined in CREME, a suite of software which includes a browser for viewing the pattern of occurrence of selected cis-regulatory modules. We applied our methodology to find modules within human-mouse conserved promoter segments, focusing on cell cycle regulated genes and stress response related genes. To validate the biological significance of the identified modules we tested whether the associated genes tended to be co-expressed or share similar function. In the cell cycle set five of the seven identified sets of genes were coherently expressed. On the stress response data four of the six detected sets fell predominantly into well-defined functional sub-categories.

http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i283


Anal Chem. 2003 Feb 1;75(3):435-44.


Intensity-based statistical scorer for tandem mass spectrometry.

Havilio M, Haddad Y, Smilansky Z.

We describe a new statistical scorer for tandem mass spectrometry. The scorer is based on the probability that fragments with given chemical properties create measured intensity levels in the experimental spectrum. The scorer's parameters are computed using a fully automated procedure. Benchmarking the new scorer on a large set of experimental spectra, we show that it performs significantly better than the widely used cross-correlation scoring algorithm of Eng et al. (Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.).
http://pubs.acs.org/cgi-bin/article.cgi/ancham/2003/75/i03/html/ac0258913.html

An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database

Eng, Jimmy K.; McCormack, Ashley L.; Yates, John R., III

Journal of the American Society for Mass Spectrometry (1994), 5(11), 976-89 CODEN: JAMSEF; ISSN: 1044-0305. English.

A method to correlate the uninterpreted tandem mass spectra of peptides produced under low energy (10-50 eV) collision conditions with amino acid sequences in the Genpept database has been developed. In this method the protein database is searched to identify linear amino acid sequences within a mass tolerance of ±1 u of the precursor ion molecular weight. A cross-correlation function is then used to provide a measurement of similarity between the mass-to-charge ratios for the fragment ions predicted from amino acid sequences obtained from the database and the fragment ions observed in the tandem mass spectrum. In general, a difference >0.1 between the normalized cross-correlation functions of the first- and second-ranked search results indicates a successful match between sequence and spectrum. Searches of species-specific protein databases with tandem mass spectra acquired from peptides obtained from the enzymically digested total proteins of E. coli and S. cerevisiae cells allowed matching of the spectra to amino acid sequences within proteins of these organisms. The approach described in this manuscript provides a convenient method to interpret tandem mass spectra with known sequences in a protein database.

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6TH2-44FNFDS-4&_coverDate=11%2F30%2F1994&_alid=110724300&_rdoc=1&_fmt=&_orig=search&_qd=1&_cdi=5270&_sort=d&view=c&_acct=C000022719&_version=1&_urlVersion=0&_userid=492137&md5=61b895ccfb21cf41911b52f91a97ad8d
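
The two steps in the Eng et al. abstract above (filter candidates by precursor mass, then score them by cross-correlation) can be sketched roughly as follows. This is a simplified illustration, not the published implementation. The unit-width binning, the restriction to singly charged b and y ions, the background correction over offsets of ±75 bins, and all function names are assumptions made for the example.

```python
# Monoisotopic residue masses in u (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_mzs(peptide):
    """Predicted m/z of singly charged b and y ions (a simplification)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    mzs = []
    for i in range(1, len(masses)):
        mzs.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal part
        mzs.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal part
    return mzs


def binned_spectrum(mzs, intensities=None, n_bins=2000):
    """Place peaks into unit-width m/z bins, keeping the largest intensity."""
    if intensities is None:
        intensities = [1.0] * len(mzs)
    vec = [0.0] * n_bins
    for mz, inten in zip(mzs, intensities):
        b = int(round(mz))
        if 0 <= b < n_bins:
            vec[b] = max(vec[b], inten)
    return vec


def correlation(x, y, offset):
    """Sum of x[i] * y[i + offset] over all valid i."""
    n = len(x)
    return sum(x[i] * y[i + offset]
               for i in range(max(0, -offset), min(n, n - offset)))


def xcorr_score(observed, theoretical, max_offset=75):
    """Correlation at zero offset minus the mean over nonzero offsets."""
    background = [correlation(observed, theoretical, t)
                  for t in range(-max_offset, max_offset + 1) if t != 0]
    return correlation(observed, theoretical, 0) - sum(background) / len(background)


def best_match(observed, precursor_mass, candidates, tol=1.0):
    """Keep candidates within +/- tol u of the precursor, return the top scorer."""
    kept = [p for p in candidates if abs(peptide_mass(p) - precursor_mass) <= tol]
    if not kept:
        return None
    return max(kept, key=lambda p: xcorr_score(observed, binned_spectrum(fragment_mzs(p))))
```

Here `observed` is a binned spectrum produced by `binned_spectrum` from measured peaks, and `candidates` stands in for the sequences retrieved from the protein database.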


Bioinformatics. 2001;17 Suppl 1:S13-21

SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database.
Bafna V, Edwards N.
Proteomics, or the direct analysis of the expressed protein components of a cell, is critical to our understanding of cellular biological processes in normal and diseased tissue. A key requirement for its success is the ability to identify proteins in complex mixtures. Recent technological advances in tandem mass spectrometry have made it the method of choice for high-throughput identification of proteins. Unfortunately, the software for unambiguously identifying peptide sequences has not kept pace with the recent hardware improvements in mass spectrometry instruments. Critical for reliable high-throughput protein identification, scoring functions evaluate the quality of a match between experimental spectra and a database peptide. Current scoring function technology relies heavily on ad-hoc parameterization and manual curation by experienced mass spectrometrists. In this work, we propose a two-stage stochastic model for the observed MS/MS spectrum, given a peptide. Our model explicitly incorporates fragment ion probabilities, noisy spectra, and instrument measurement error. We describe how to compute this probability-based score efficiently, using a dynamic programming technique. A prototype implementation demonstrates the effectiveness of the model.

http://bioinformatics.oupjournals.org/cgi/content/abstract/17/suppl_1/S13



On de novo interpretation of tandem mass spectra for peptide identification

Vineet Bafna, Nathan Edwards

ABSTRACT

The correct interpretation of tandem mass spectra is a difficult problem, even when it is limited to scoring peptides against a database. De novo sequencing is considerably harder, but critical when sequence databases are incomplete or not available. In this paper we build upon earlier work due to Dancik et al. and Chen et al. to provide a dynamic programming algorithm for interpreting de novo spectra. Our method can handle most of the commonly occurring ions, including a, b, and y ions and their neutral losses. Additionally, we shift the emphasis away from sequencing to assigning ion types to peaks. In particular, we introduce the notion of core interpretations, which allow us to give confidence values to individual peak assignments, even in the absence of a strong interpretation. Finally, we introduce a systematic approach to evaluating de novo algorithms as a function of spectral quality. We show that our algorithm, in particular the core interpretation, is robust in the presence of measurement error and low fragmentation probability.

http://portal.acm.org/citation.cfm?id=640075.640077&dl=GUIDE&dl=ACM&type=series&idx=640075&part=Proceedings&WantType=Proceedings&title=Annual%20Conference%20on%20Research%20in%20Computational%20Molecular%20Biology&CFID=11891234&CFTOKEN=11681079
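
To give a rough feel for the dynamic programming idea behind de novo interpretation mentioned in the Bafna and Edwards abstract above, the sketch below solves a much simpler problem than the paper does. It assumes the peaks have already been converted to prefix residue masses (b ions only, no noise model, no neutral losses, no core interpretations). It then finds the chain of peaks from mass 0 to the total residue mass in which consecutive peaks differ by one amino acid residue and as many peaks as possible are explained. The function name, the tolerance, and the scoring by peak count are assumptions for illustration. Isoleucine is reported as L because the two residues have the same mass.

```python
# Monoisotopic residue masses in u; I is omitted because it equals L.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def de_novo_sequence(prefix_masses, total_mass, tol=0.01):
    """Longest residue-spaced chain of peaks from 0 to total_mass.

    prefix_masses: candidate prefix residue masses (may contain noise).
    Returns the inferred sequence, or None if no complete chain exists.
    """
    inner = [m for m in prefix_masses if 0.0 < m < total_mass]
    nodes = sorted(set([0.0, total_mass] + inner))
    unreachable = float("-inf")
    best = [unreachable] * len(nodes)  # best[j]: most edges on a path 0 -> j
    back = [None] * len(nodes)         # back[j]: (previous node, residue)
    best[0] = 0
    for j in range(1, len(nodes)):
        for i in range(j):
            if best[i] == unreachable:
                continue
            gap = nodes[j] - nodes[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tol and best[i] + 1 > best[j]:
                    best[j] = best[i] + 1
                    back[j] = (i, aa)
    j = len(nodes) - 1  # total_mass is the largest node
    if back[j] is None:
        return None
    residues = []
    while j != 0:
        j, aa = back[j]
        residues.append(aa)
    return "".join(reversed(residues))
```

Real spectra contain both b and y ions, missing peaks, and measurement error, which is where the ion-type assignment and confidence values described in the abstract come in. None of that is modeled here.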