Title: The microRNAs of Caenorhabditis elegans

Authors: Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP

Ref: Genes Dev 2003 Apr 15;17(8):991-1008

Abstract: MicroRNAs (miRNAs) are an abundant class of tiny RNAs thought to regulate the expression of protein-coding genes in plants and animals. In the present study, we describe a computational procedure to identify miRNA genes conserved in more than one genome. Applying this program, known as MiRscan, together with molecular identification and validation methods, we have identified most of the miRNA genes in the nematode Caenorhabditis elegans. The total number of validated miRNA genes stands at 88, with no more than 35 genes remaining to be detected or validated. These 88 miRNA genes represent 48 gene families; 46 of these families (comprising 86 of the 88 genes) are conserved in Caenorhabditis briggsae, and 22 families are conserved in humans. More than a third of the worm miRNAs, including newly identified members of the lin-4 and let-7 gene families, are differentially expressed during larval development, suggesting a role for these miRNAs in mediating larval developmental transitions. Most are present at very high steady-state levels (more than 1000 molecules per cell), with some exceeding 50,000 molecules per cell. Our census of the worm miRNAs and their expression patterns helps define this class of noncoding RNAs, lays the groundwork for functional studies, and provides the tools for more comprehensive analyses of miRNA genes in other species.

Title: Genomewide view of gene silencing by small interfering RNAs

Authors: Chi JT, Chang HY, Wang NN, Chang DS, Dunphy N, Brown PO

Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6343-6

Abstract: RNA interference (RNAi) is an evolutionarily conserved mechanism in plant and animal cells that directs the degradation of messenger RNAs homologous to short double-stranded RNAs termed small interfering RNA (siRNA). The ability of siRNA to direct gene silencing in mammalian cells has raised the possibility that siRNA might be used to investigate gene function in a high throughput fashion or to modulate gene expression in human diseases. The specificity of siRNA-mediated silencing, a critical consideration in these applications, has not been addressed on a genomewide scale. Here we show that siRNA-induced gene silencing of transient or stably expressed mRNA is highly gene-specific and does not produce secondary effects detectable by genomewide expression profiling. A test for transitive RNAi, extension of the RNAi effect to sequences 5' of the target region that has been observed in Caenorhabditis elegans, was unable to detect this phenomenon in human cells.

Significance analysis of microarrays applied to the ionizing radiation response

Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu

Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.

http://www.pnas.org/cgi/content/abstract/98/9/5116

PNAS | April 24, 2001 | vol. 98 | no. 9 | 5116-5121
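
The scoring and permutation steps described in this abstract can be illustrated with a small sketch. It is a simplified stand-in, not the published SAM procedure: the function names, the fixed constant s0, the pooled-variance form of the denominator, and the plain cutoff on |d| are all assumptions made for illustration. The paper should be consulted for the exact statistic and for how the threshold is chosen.

```python
import random
from statistics import mean

def d_scores(data, labels, s0=0.1):
    """Score each gene: difference in group means over (spread + s0).

    data   -- list of rows, one row of measurements per gene
    labels -- list of 0/1 group labels, one per column
    s0     -- small constant (assumed value) that damps scores of
              genes with very low variance
    """
    n1, n0 = labels.count(1), labels.count(0)
    scores = []
    for row in data:
        g1 = [x for x, lab in zip(row, labels) if lab == 1]
        g0 = [x for x, lab in zip(row, labels) if lab == 0]
        m1, m0 = mean(g1), mean(g0)
        ss = sum((x - m1) ** 2 for x in g1) + sum((x - m0) ** 2 for x in g0)
        s = ((1 / n1 + 1 / n0) * ss / (n1 + n0 - 2)) ** 0.5
        scores.append((m1 - m0) / (s + s0))
    return scores

def estimate_fdr(data, labels, threshold, n_perm=200, s0=0.1, seed=0):
    """Return (genes called, estimated FDR) for a cutoff on |d|.

    Labels are shuffled n_perm times; the mean number of genes that pass
    the cutoff under shuffled labels estimates the number called by chance.
    """
    rng = random.Random(seed)
    called = sum(abs(d) > threshold for d in d_scores(data, labels, s0))
    if called == 0:
        return 0, 0.0
    false_counts = []
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        false_counts.append(
            sum(abs(d) > threshold for d in d_scores(data, perm, s0))
        )
    return called, min(1.0, mean(false_counts) / called)
```

Calling `estimate_fdr(data, labels, threshold)` for several thresholds gives a rough picture of the trade-off between the number of genes called and the fraction expected to be false.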

Statistical significance for genomewide studies

John D. Storey and Robert Tibshirani

With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.

http://www.pnas.org/cgi/content/full/100/16/9440

PNAS, August 5, 2003; 100(16): 9440 - 9445
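
A rough sketch of how a q value might be attached to each p value in a list. The function name and the fixed tuning parameter lam = 0.5, used here to estimate the proportion of true null features, are assumptions made for brevity; the paper describes its own way of obtaining that estimate.

```python
def q_values(pvals, lam=0.5):
    """Return one q value per p value, in the input order.

    pi0 (estimated fraction of true nulls) is taken from the share of
    p values above lam; this fixed-lam shortcut is an assumption.
    The q value of a feature is the smallest estimated FDR over all
    cutoffs that would still call that feature significant.
    """
    m = len(pvals)
    pi0 = min(1.0, sum(p > lam for p in pvals) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each q value is the minimum
    # of the FDR estimates at its own cutoff and at all looser cutoffs.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvals[i] / rank)
        q[i] = running_min
    return q
```

Note that with this shortcut pi0 is zero, and so every q value is zero, whenever no p value exceeds lam; a careful implementation would guard against that case.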

Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model

Mayetri Gupta ; Jun S. Liu

Journal of the American Statistical Association, Volume: 98 Number: 461 Page: 55 -- 66

Abstract: Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missing-data approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency.

http://www.people.fas.harvard.edu/~gupta6/papers/sdict.pdf
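
The generative picture in this abstract, a sequence built by concatenating stochastic words that are either single nucleotides or motif instances drawn from a probability weight matrix, can be sketched as follows. This only illustrates the forward (sampling) direction under assumed names and parameters (sample_sequence, motif_prob, background); it is not the authors' inference procedure, which works in the opposite direction and treats the word boundaries as missing data.

```python
import random

BASES = "ACGT"

def sample_sequence(n_words, pwm, motif_prob, background, rng):
    """Concatenate n_words stochastic words into one DNA sequence.

    pwm        -- list of columns, each a list of 4 weights over A, C, G, T
    motif_prob -- probability that a word is a motif instance (assumed
                  parameter); otherwise the word is a single nucleotide
    background -- 4 weights over A, C, G, T for single-nucleotide words
    rng        -- random.Random instance, for reproducibility
    Returns the sequence and the list of words that produced it.
    """
    words = []
    for _ in range(n_words):
        if rng.random() < motif_prob:
            # Motif word: one base drawn independently per PWM column.
            word = "".join(rng.choices(BASES, weights=col)[0] for col in pwm)
        else:
            word = rng.choices(BASES, weights=background)[0]
        words.append(word)
    return "".join(words), words
```

In real data only the concatenated sequence is observed; the list of words returned here is exactly the hidden segmentation that a missing-data method would have to recover.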

Transcriptional regulatory cascades in development: Initial rates, not steady state, determine network kinetics

Hamid Bolouri and Eric H. Davidson

A model was built to examine the kinetics of regulatory cascades such as occur in developmental gene networks. The model relates occupancy of cis-regulatory target sites to transcriptional initiation rate, and thence to RNA and protein output. The model was used to simulate regulatory cascades in which genes encoding transcription factors are successively activated. Using realistic parameter ranges based on extensive earlier measurements in sea urchin embryos, we find that transitions of regulatory states occur sharply in these simulations, with respect to time or changing transcription factor concentrations. As is often observed in developing systems, the simulated regulatory cascades display a succession of gene activations separated by delays of some hours. The most important causes of this behavior are cooperativity in the assembly of cis-regulatory complexes and the high specificity of transcription factors for their target sites. Successive transitions in state occur long in advance of the approach to steady-state levels of the molecules that drive the process. The kinetics of such developmental systems thus depend mainly on the initial output rates of genes activated in response to the advent of new transcription factors.

http://www.pnas.org/cgi/content/full/100/16/9371

Title: Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution

Authors: Pevzner P, Tesler G

Ref: Proc Natl Acad Sci USA 2003 Jun 24;100(13):7672-7

Abstract: The human and mouse genomic sequences provide evidence for a larger number of rearrangements than previously thought and reveal extensive reuse of breakpoints from the same short fragile regions. Breakpoint clustering in regions implicated in cancer and infertility has been reported in previous studies; we report here on breakpoint clustering in chromosome evolution. This clustering reveals limitations of the widely accepted random breakage theory that has remained unchallenged since the mid-1980s. The genome rearrangement analysis of the human and mouse genomes implies the existence of a large number of very short "hidden" synteny blocks that were invisible in the comparative mapping data and ignored in the random breakage model. These blocks are defined by closely located breakpoints and are often hard to detect. Our results suggest a model of chromosome evolution that postulates that mammalian genomes are mosaics of fragile regions with high propensity for rearrangements and solid regions with low propensity for rearrangements.

Title: Genomewide demarcation of RNA polymerase II transcription units revealed by physical fractionation of chromatin

Authors: Nagy PL, Cleary ML, Brown PO, Lieb JD

Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6364-9

Abstract: Epigenetic modifications of chromatin serve an important role in regulating the expression and accessibility of genomic DNA. We report here a genomewide approach for fractionating yeast chromatin into two functionally distinct parts, one containing RNA polymerase II transcribed sequences, and the other comprising noncoding sequences and genes transcribed by RNA polymerases I and III. Noncoding regions could be further fractionated into promoters and segments lacking promoters. The observed separations were apparently based on differential crosslinking efficiency of chromatin in different genomic regions. The results reveal a genomewide molecular mechanism for marking promoters and genomic regions that have a license to be transcribed by RNA polymerase II, a previously unrecognized level of genomic complexity that may exist in all eukaryotes. Our approach has broad potential use as a tool for genome annotation and for the characterization of global changes in chromatin structure that accompany different genetic, environmental, and disease states.

Title: Comparative analyses of multi-species sequences from targeted genomic regions

Authors: Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED

Ref: Nature 2003 Aug 14;424(6950):788-93

Abstract: The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.

Title: Cross-species sequence comparisons: a review of methods and available resources

Authors: Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC

Ref: Genome Res 2003 Jan;13(1):1-12

Abstract: With the availability of whole-genome sequences for an increasing number of species, we are now faced with the challenge of decoding the information contained within these DNA sequences. Comparative analysis of DNA sequences from multiple species at varying evolutionary distances is a powerful approach for identifying coding and functional noncoding sequences, as well as sequences that are unique for a given organism. In this review, we outline the strategy for choosing DNA sequences from different species for comparative analyses and describe the methods used and the resources publicly available for these studies.

Title: Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution.

Authors: Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, Schwartz S, Furey TS, Whelan S, Goldman N, Smit A, Miller W, Chiaromonte F, Haussler D

Ref: Genome Res 2003 Jan;13(1):13-26

Abstract: Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human-mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of "neutral" sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for by, GC content in human and the difference between GC content in human and mouse.

Scoring two-species local alignments to try to statistically separate neutrally evolving from selected DNA segments

Krishna M. Roskin, Mark Diekhans, David Haussler

Proceedings of the seventh annual international conference on Computational molecular biology (RECOMB), Berlin, Pages: 257 – 266.

We construct several score functions for use in locating unusually conserved regions in a genome-wide search of aligned DNA from two species. We test these functions on regions of the human genome aligned to the mouse genome. These score functions are derived from properties of neutrally evolving sites on the mouse and human genome, and can be adjusted to the local background rate of conservation. The aim of these functions is to try to identify regions of the human genome that are conserved by evolutionary selection, because they have an important function, rather than by chance. We use them to get a very rough estimate of the amount of DNA in the human genome that is under selection.

http://portal.acm.org/citation.cfm?id=640109&jmp=cit&dl=GUIDE&dl=ACM&CFID=11891234&CFTOKEN=11681079#CIT

Title: Sequencing and comparison of yeast species to identify genes and regulatory elements

Authors: Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES

Ref: Nature 423, 241 - 254 (2003)

Abstract: Identifying the functional elements encoded in a genome is one of the principal challenges in modern biology. Comparative genomics should offer a powerful, general approach. Here, we present a comparative analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related species (S. paradoxus, S. mikatae and S. bayanus). We first aligned the genomes and characterized their evolution, defining the regions and mechanisms of change. We then developed methods for direct identification of genes and regulatory motifs. The gene analysis yielded a major revision to the yeast gene catalogue, affecting approximately 15% of all genes and reducing the total count by about 500 genes. The motif analysis automatically identified 72 genome-wide elements, including most known regulatory motifs and numerous new motifs. We inferred a putative function for most of these motifs, and provided insights into their combinatorial interactions. The results have implications for genome analysis of diverse organisms, including the human.

Detecting protein sequence conservation via metric embeddings

E. Halperin, J. Buhler, R. Karp, R. Krauthgamer and B. Westover

Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.

Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.

http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i122

Title: Transcriptional Regulatory Networks in Saccharomyces cerevisiae

Authors: Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA

Ref: Science 2002 Oct 25;298(5594):799-804

Abstract: We have determined how most of the transcriptional regulators encoded in the eukaryote Saccharomyces cerevisiae associate with genes across the genome in living cells. Just as maps of metabolic networks describe the potential pathways that may be used by a cell to accomplish metabolic processes, this network of regulator-gene interactions describes potential pathways yeast cells can use to regulate global gene expression programs. We use this information to identify network motifs, the simplest units of network architecture, and demonstrate that an automated process can use motifs to assemble a transcriptional regulatory network structure. Our results reveal that eukaryotic cellular functions are highly connected through networks of transcriptional regulators that regulate other transcriptional regulators.

Title: Program-specific distribution of a transcription factor dependent on partner transcription factor and MAPK signaling

Authors: Zeitlinger J, Simon I, Harbison CT, Hannett NM, Volkert TL, Fink GR, Young RA

Ref: Cell 2003 May 2;113(3):395-404

Abstract: Specialized gene expression programs are induced by signaling pathways that act on transcription factors. Whether these transcription factors can function in multiple developmental programs through a global switch in promoter selection is not known. We have used genome-wide location analysis to show that the yeast Ste12 transcription factor, which regulates mating and filamentous growth, is bound to distinct program-specific target genes dependent on the developmental condition. This condition-dependent distribution of Ste12 requires concurrent binding of the transcription factor Tec1 during filamentation and is differentially regulated by the MAP kinases Fus3 and Kss1. Program-specific distribution across the genome may be a general mechanism by which transcription factors regulate distinct gene expression programs in response to signaling.

Title: Untangling the wires: A strategy to trace functional interactions in signaling and gene networks

Authors: Kholodenko BN, Kiyatkin A, Bruggeman FJ, Sontag E, Westerhoff HV, Hoek JB

Ref: Proc Natl Acad Sci USA 2002 Oct 1;99(20):12841-6

Abstract: Emerging technologies have enabled the acquisition of large genomics and proteomics data sets. However, current methodologies for analysis do not permit interpretation of the data in ways that unravel cellular networking. We propose a quantitative method for determining functional interactions in cellular signaling and gene networks. It can be used to explore cell systems at a mechanistic level or applied within a "modular" framework, which dramatically decreases the number of variables to be assayed. This method is based on a mathematical derivation that demonstrates how the topology and strength of network connections can be retrieved from experimentally measured network responses to successive perturbations of all modules. Importantly, our analysis can reveal functional interactions even when the components of the system are not all known. Under these circumstances, some connections retrieved by the analysis will not be direct but correspond to the interaction routes through unidentified elements. The method is tested and illustrated by using computer-generated responses of a modeled mitogen-activated protein kinase cascade and gene network.

Trends Genet. 2002 Aug;18(8):395-8.

Linking the genes: inferring quantitative gene networks from microarray data.

de la Fuente A, Brazhnik P, Mendes P.

Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, 1880 Pratt Drive, Blacksburg, VA 24061, USA.

Modern microarray technology is capable of providing data about the expression of thousands of genes, and even of whole genomes. An important question is how this technology can be used most effectively to unravel the workings of cellular machinery. Here, we propose a method to infer genetic networks on the basis of data from appropriately designed microarray experiments. In addition to identifying the genes that affect a specific other gene directly, this method also estimates the strength of such effects. We will discuss both the experimental setup and the theoretical background.

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6TCY-46C6G3X-9&_user=492137&_coverDate=08%2F01%2F2002&_fmt=full&_orig=browse&_cdi=5183&view=c&_acct=C000022719&_version=1&_urlVersion=0&_userid=492137&md5=35c1683e1ba637566f99a20a64b700f4&ref=full

Title: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data

Authors: Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N

Ref: Nat Genet 2003 Jun;34(2):166-76

Abstract: Much of a cell's activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. We present a probabilistic method for identifying regulatory modules from gene expression data. Our procedure identifies modules of coregulated genes, their regulators and the conditions under which regulation occurs, generating testable hypotheses in the form 'regulator X regulates module Y under conditions W'. We applied the method to a Saccharomyces cerevisiae expression data set, showing its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.

Genome Res. 2002 Mar;12(3):470-81.

CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments

Roded Sharan, Ivan Ovcharenko, Asa Ben-Hur and Richard M. Karp

Motivation: The binding of transcription factors to specific regulatory sequence elements is a primary mechanism for controlling gene transcription. Recent findings suggest a modular organization of binding sites for transcription factors that cooperate in the regulation of genes. In this work we establish a framework for finding recurrent cis-regulatory modules in the promoters of a selected set of genes and scoring their statistical significance.

Results: Proceeding from a database of identified binding site motifs and their genomic locations we seek motifs whose frequency in the selected promoters is different than in a background promoter set. We present several statistical tests designed for this purpose. We provide a hashing algorithm for detecting combinations of these motifs that co-occur in clusters within the selected promoters. The significance of such co-occurrences is evaluated using novel statistical scores. Our methods are combined in CREME, a suite of software which includes a browser for viewing the pattern of occurrence of selected cis-regulatory modules. We applied our methodology to find modules within human-mouse conserved promoter segments, focusing on cell cycle regulated genes and stress response related genes. To validate the biological significance of the identified modules we tested whether the associated genes tended to be co-expressed or share similar function. In the cell cycle set five of the seven identified sets of genes were coherently expressed. On the stress response data four of the six detected sets fell predominantly into well-defined functional sub-categories.

http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i283
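
One common way to ask whether a motif is more frequent in a selected promoter set than in a background set is a hypergeometric tail probability. The abstract does not say which tests CREME uses, so the choice of test and the function name below are assumptions; this is only a sketch of the general idea.

```python
from math import comb

def hypergeom_tail(n_background, n_with_motif, n_selected, n_hits):
    """P(X >= n_hits) under sampling without replacement.

    n_background -- total promoters in the background set
    n_with_motif -- background promoters that contain the motif
    n_selected   -- size of the selected promoter set (a subset of
                    the background)
    n_hits       -- selected promoters that contain the motif
    """
    total = comb(n_background, n_selected)
    upper = min(n_with_motif, n_selected)
    favourable = sum(
        comb(n_with_motif, i)
        * comb(n_background - n_with_motif, n_selected - i)
        for i in range(n_hits, upper + 1)
    )
    return favourable / total
```

A small tail probability suggests the motif is over-represented in the selected set; when many motifs are screened at once, the resulting p values would still need a multiple-testing correction.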

Anal Chem. 2003 Feb 1;75(3):435-44.