    A practical guide to methods controlling false discoveries in computational biology
    298 citations, 78 references, 10 related papers
    Abstract:
    In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
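As a minimal illustration of what a classic, p-value-only FDR procedure does, the sketch below implements the Benjamini-Hochberg step-up rule in plain Python. The abstract does not name the methods that were benchmarked, so the choice of Benjamini-Hochberg is an assumption made for illustration, and the function name `benjamini_hochberg` and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up rule; it uses
    only p values and no covariates.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p values, even if
    # some of them individually missed their own threshold (step-up).
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    example = [0.02, 0.9, 0.035, 0.03]  # hypothetical p values
    print(benjamini_hochberg(example, alpha=0.05))
```

The covariate-aware methods described in the abstract would also take one covariate value per hypothesis and use it to prioritize, weight, or group the tests; that extension is not shown here.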
    Keywords:
    Genome Biology
    Human genetics
    Computational genomics
    Personal genomics
    Sequence (biology)
    Comparative Genomics
    Serial analysis of gene expression (SAGE) has been used to analyze the complete human 'transcriptome' - the number, identity and level of expression of genes in humans.
    Human genetics
    Genome Biology
    Computational genomics
    Deep exome resequencing is a powerful approach for delineating patterns of protein-coding variation among genes, pathways, individuals and populations. We analyzed exome data from 2,440 individuals of European and African ancestry as part of the National Heart, Lung, and Blood Institute's Exome Project, the aim of which is to discover novel genes and mechanisms that contribute to heart, lung and blood disorders. Each exome was sequenced to a mean coverage of 116×, allowing detailed inferences about the population genomic patterns of both common variation and rare coding variation. We identified more than 500,000 single nucleotide variations, the majority of which were novel and rare (76% of variants had a minor allele frequency of less than 0.1%), reflecting the recent dramatic increase in the size of the human population. The unprecedented magnitude of this dataset allowed us to rigorously characterize the large variation in nucleotide diversity among genes (ranging from 0 to 1.32%), as well as the role of positive and purifying selection in shaping patterns of protein-coding variation and the differential signatures of population structure from rare and common variation. This dataset provides a framework for personal genomics and is an important resource that will allow inferences of broad importance to human evolution and health.
    Human genetics
    Genome Biology
    Computational genomics
    Background: The core promoter region plays a critical role in the regulation of eukaryotic gene expression. We have determined the non-random distribution of DNA sequences relative to the transcriptional start site in Drosophila melanogaster promoters to identify sequences that may be biologically significant. We compare these results with those obtained for human promoters. Results: We determined the distribution of all 65,536 octamer (8-mers) DNA sequences in 10,914 Drosophila promoters and two sets of human promoters aligned relative to the transcriptional start site. In Drosophila, 298 8-mers have highly significant (p ≤ 1 × 10^-16) non-random distributions peaking within 100 base-pairs of the transcriptional start site. These sequences were grouped into 15 DNA motifs. Ten motifs, termed directional motifs, occur only on the positive strand while the remaining five motifs, termed non-directional motifs, occur on both strands. The only directional motifs to localize in human promoters are TATA, INR, and DPE. The directional motifs were further subdivided into those precisely positioned relative to the transcriptional start site and those that are positioned more loosely relative to the transcriptional start site. Similar numbers of non-directional motifs were identified in both species and most are different. The genes associated with all 15 DNA motifs, when they occur in the peak, are enriched in specific Gene Ontology categories and show a distinct mRNA expression pattern, suggesting that there is a core promoter code in Drosophila. Conclusion: Drosophila and human promoters use different DNA sequences to regulate gene expression, supporting the idea that evolution occurs by the modulation of gene regulation.
    Human genetics
    Genome Biology
    Comparative Genomics
    Computational genomics
    Functional Genomics
    Citations (155)
    It has recently been shown that the detection of gene fusion events across genomes can be used for predicting functional associations of proteins, including physical interaction or complex formation. To obtain such predictions we have made an exhaustive search for gene fusion events within 24 available completely sequenced genomes. Each genome was used as a query against the remaining 23 complete genomes to detect gene fusion events. Using an improved, fully automatic protocol, a total of 7,224 single-domain proteins that are components of gene fusions in other genomes were detected, many of which were identified for the first time. The total number of predicted pairwise functional associations is 39,730 for all genomes. Component pairs were identified by virtue of their similarity to 2,365 multidomain composite proteins. We also show for the first time that gene fusion is a complex evolutionary process with a number of contributory factors, including paralogy, genome size and phylogenetic distance. On average, 9% of genes in a given genome appear to code for single-domain, component proteins predicted to be functionally associated. These proteins are detected by an additional 4% of genes that code for fused, composite proteins. These results provide an exhaustive set of functionally associated genes and also delineate the power of fusion analysis for the prediction of protein interactions.
    Human genetics
    Genome Biology
    Computational genomics
    Functional Genomics
    Genome Biology
    Human genetics
    Computational genomics
    Citations (1)
    The Forkhead (FKH) transcription factor FOXM1 is a key regulator of the cell cycle and is overexpressed in most types of cancer. FOXM1, similar to other FKH factors, binds to a canonical FKH motif in vitro. However, genome-wide mapping studies in different cell lines have shown a lack of enrichment of the FKH motif, suggesting an alternative mode of chromatin recruitment. We have investigated the role of direct versus indirect DNA binding in FOXM1 recruitment by performing ChIP-seq with wild-type and DNA binding deficient FOXM1. An in vitro fluorescence polarization assay identified point mutations in the DNA binding domain of FOXM1 that inhibit binding to a FKH consensus sequence. Cell lines expressing either wild-type or DNA binding deficient GFP-tagged FOXM1 were used for genome-wide mapping studies comparing the distribution of the DNA binding deficient protein to the wild-type. This shows that interaction of the FOXM1 DNA binding domain with target DNA is essential for recruitment. Moreover, analysis of the protein interactome of wild-type versus DNA binding deficient FOXM1 shows that the reduced recruitment is not due to inhibition of protein-protein interactions. A functional DNA binding domain is essential for FOXM1 chromatin recruitment. Even in FOXM1 mutants with almost complete loss of binding, the protein-protein interactions and pattern of phosphorylation are largely unaffected. These results strongly support a model whereby FOXM1 is specifically recruited to chromatin through co-factor interactions by binding directly to non-canonical DNA sequences.
    Genome Biology
    Human genetics
    Computational genomics
    Personal genomics
    Citations (56)
    Nanodroplets of active, solvated protein can be printed onto treated glass slides for protein microarray experiments.
    Human genetics
    Genome Biology
    Computational genomics
    Gene regulation is considered one of the driving forces of evolution. Although protein-coding DNA sequences and RNA genes have been subject to recent evolutionary events in the human lineage, it has been hypothesized that the large phenotypic divergence between humans and chimpanzees has been driven mainly by changes in gene regulation rather than altered protein-coding gene sequences. Comparative analysis of vertebrate genomes has revealed an abundance of evolutionarily conserved but noncoding sequences. These conserved noncoding (CNC) sequences may well harbor critical regulatory variants that have driven recent human evolution. Here we identify 1,356 CNC sequences that appear to have undergone dramatic human-specific changes in selective pressures, at least 15% of which have substitution rates significantly above that expected under neutrality. The 1,356 'accelerated CNC' (ANC) sequences are enriched in recent segmental duplications, suggesting a recent change in selective constraint following duplication. In addition, single nucleotide polymorphisms within ANC sequences have a significant excess of high frequency derived alleles and high F(ST) values relative to controls, indicating that acceleration and positive selection are recent in human populations. Finally, a significant number of single nucleotide polymorphisms within ANC sequences are associated with changes in gene expression. The probability of variation in an ANC sequence being associated with a gene expression phenotype is fivefold higher than variation in a control CNC sequence. Our analysis suggests that ANC sequences have until very recently played a role in human evolution, potentially through lineage-specific changes in gene regulation.
    Human genetics
    Genome Biology
    Computational genomics
    Personal genomics
    Citations (184)
    Background: Next-generation sequencing (NGS) can identify mutations in the human genome that cause disease and has been widely adopted in clinical diagnosis. However, the human genome contains many polymorphic, low-complexity, and repetitive regions that are difficult to sequence and analyze. Despite their difficulty, these regions include many clinically important sequences that can inform the treatment of human diseases and improve the diagnostic yield of NGS. Results: To evaluate the accuracy by which these difficult regions are analyzed with NGS, we built an in silico decoy chromosome, along with corresponding synthetic DNA reference controls, that encode difficult and clinically important human genome regions, including repeats, microsatellites, HLA genes, and immune receptors. These controls provide a known ground-truth reference against which to measure the performance of diverse sequencing technologies, reagents, and bioinformatic tools. Using this approach, we provide a comprehensive evaluation of short- and long-read sequencing instruments, library preparation methods, and software tools and identify the errors and systematic bias that confound our resolution of these remaining difficult regions. Conclusions: This study provides an analytical validation of diagnosis using NGS in difficult regions of the human genome and highlights the challenges that remain to resolve these difficult regions.
    Human genetics
    Genome Biology
    Personal genomics
    Computational genomics