Regardless of how creative, innovative, and elegant our computational methods are, the ultimate proof of an algorithm's worth is the experimentally validated quality of its predictions. Unfortunately, this truism is hard to reduce to practice. Modelers usually produce hundreds to hundreds of thousands of predictions, most (if not all) of which go untested. In a best-case scenario, a small subsample of predictions (usually three to ten) is experimentally validated, as a quality-control step to attest to the global soundness of the full set of predictions. However, whether this small set is even representative of the algorithm's global performance is a question usually left unaddressed. Thus, a clear understanding of the strengths and weaknesses of an algorithm most often remains elusive, especially to the experimental biologists who must decide which tool to use for a specific problem. In this chapter, we describe the first systematic set of challenges posed to the systems biology community in the framework of the DREAM (Dialogue for Reverse Engineering Assessments and Methods) project. These tests, which came to be known as the DREAM2 challenges, consist of data generously donated by participants to the DREAM project and curated so as to become problems of network reconstruction whose solutions, the actual networks behind the data, were withheld from the participants. The main topics discussed are the explanation of the resulting five challenges, a global comparison of the submissions, and a discussion of the best-performing strategies.
DREAM challenges are community competitions designed to advance computational methods and address fundamental questions in systems biology and translational medicine. Each challenge asks participants to develop and apply computational methods to either predict unobserved outcomes or to identify unknown model parameters given a set of training data. Computational methods are evaluated using an automated scoring metric, scores are posted to a public leaderboard, and methods are published to facilitate community discussions on how to build improved methods. By engaging participants from a wide range of science and engineering backgrounds, DREAM challenges can comparatively evaluate a wide range of statistical, machine learning, and biophysical methods. Here, we describe DREAMTools, a Python package for evaluating DREAM challenge scoring metrics. DREAMTools provides a command line interface that enables researchers to test new methods on past challenges, as well as a framework for scoring new challenges. As of March 2016, DREAMTools includes more than 80% of completed DREAM challenges. DREAMTools complements the data, metadata, and software tools available at the DREAM website http://dreamchallenges.org and on the Synapse platform at https://www.synapse.org. Availability: DREAMTools is a Python package. Releases and documentation are available at http://pypi.python.org/pypi/dreamtools. The source code is available at http://github.com/dreamtools/dreamtools.
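To make the idea of an automated scoring metric concrete, the sketch below computes one metric commonly used on DREAM leaderboards, the area under the ROC curve, via the rank-sum identity. This is an illustrative stand-in, not DREAMTools code: real challenges use challenge-specific metrics and gold standards, and the predictions and labels here are invented.

```python
# Minimal sketch of a leaderboard-style scoring metric (assumed example,
# not the DREAMTools API): AUROC via the Mann-Whitney rank-sum identity.

def auroc(scores, labels):
    """Area under the ROC curve; labels are 1 (true edge) or 0 (absent)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative labels")
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A submission that ranks most true edges above absent ones scores near 1.0.
predictions = [0.9, 0.8, 0.4, 0.3, 0.1]
gold_standard = [1, 1, 0, 1, 0]
print(round(auroc(predictions, gold_standard), 3))  # 0.833
```

A leaderboard would apply such a function to every submission against the withheld gold standard and rank the results.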
Abstract Background One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole genome shotgun sequencing data, comprehensive comparisons of these methods are limited. In this study, we use the largest set to date (n = 35) of laboratory-generated and simulated controls across 846 species to evaluate the performance of eleven metagenomics classifiers. We also assess the effects of filtering and combining tools to reduce the number of false positives. Results Tools were characterized on the basis of their ability to (1) identify taxa at the genus, species, and strain levels, (2) quantify relative abundance measures of taxa, and (3) classify individual reads to the species level. Strikingly, the number of species identified by the eleven tools can differ by over three orders of magnitude on the same datasets. However, various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Indeed, leveraging tools with different heuristics is beneficial for improved precision. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which is especially important where medically relevant species are concerned and customized tools may be required. Conclusions The results of this study provide positive controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision and recall. We show that proper experimental design and analysis parameters, including depth of sequencing, choice of classifier or classifiers, database size, and filtering, can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
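Two of the false-positive reduction strategies named above, abundance filtering and tool intersection, can be sketched in a few lines. The taxon names, abundances, and threshold values below are invented for illustration; a real analysis would tune them against titrated standards.

```python
# Hedged sketch of two false-positive reduction strategies: abundance
# filtering and tool intersection. All data are hypothetical examples.
from collections import Counter

def abundance_filter(profile, min_abundance=0.001):
    """Drop taxa whose relative abundance falls below a threshold."""
    return {taxon: ab for taxon, ab in profile.items() if ab >= min_abundance}

def tool_intersection(calls_per_tool, min_tools=2):
    """Keep only taxa reported by at least `min_tools` classifiers."""
    votes = Counter(taxon for calls in calls_per_tool for taxon in set(calls))
    return {taxon for taxon, n in votes.items() if n >= min_tools}

# Hypothetical species calls from three classifiers with different heuristics.
tool_a = {"Salmonella enterica", "Escherichia coli", "Bacillus subtilis"}
tool_b = {"Salmonella enterica", "Escherichia coli"}
tool_c = {"Salmonella enterica", "Listeria monocytogenes"}
print(sorted(tool_intersection([tool_a, tool_b, tool_c], min_tools=2)))
# ['Escherichia coli', 'Salmonella enterica']
```

Requiring agreement between tools with different heuristics trades some recall for the improved precision the study reports.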
In this work, we hypothesized that shifts in the food microbiome can be used as an indicator of unexpected contaminants or environmental changes. To test this hypothesis, we sequenced the total RNA of 31 high protein powder (HPP) samples of poultry meal pet food ingredients. We developed a microbiome analysis pipeline employing a key eukaryotic matrix filtering step that improved microbe detection specificity to >99.96% during in silico validation. The pipeline identified 119 microbial genera per HPP sample on average with 65 genera present in all samples. The most abundant of these were Bacteroides, Clostridium, Lactococcus, Aeromonas, and Citrobacter. We also observed shifts in the microbial community corresponding to ingredient composition differences. When comparing culture-based results for Salmonella with total RNA sequencing, we found that Salmonella growth did not correlate with multiple sequence analyses. We conclude that microbiome sequencing is useful to characterize complex food microbial communities, while additional work is required for predicting specific species' viability from total RNA sequencing.
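The matrix-filtering idea, removing reads attributable to the dominant eukaryotic ingredient (e.g., poultry) before microbial classification, can be illustrated with a toy k-mer screen. This is only a conceptual sketch: the pipeline in the study would use a real aligner and reference genomes, and the sequences, k size, and threshold here are invented.

```python
# Toy sketch of a eukaryotic matrix-filtering step (hypothetical example,
# not the study's pipeline): discard reads sharing k-mers with the matrix
# reference so that downstream classification sees mostly microbial reads.

def kmers(seq, k=5):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_matrix_reads(reads, matrix_ref, k=5, max_shared=0):
    """Keep reads sharing at most `max_shared` k-mers with the matrix."""
    ref_kmers = kmers(matrix_ref, k)
    return [r for r in reads if len(kmers(r, k) & ref_kmers) <= max_shared]

matrix = "ACGTACGTGGCCTT"        # stand-in for the host/matrix reference
reads = ["ACGTACGT",             # overlaps the matrix -> filtered out
         "TTTTGGGGAAAA"]         # no shared 5-mers -> kept as candidate microbe
print(filter_matrix_reads(reads, matrix))  # ['TTTTGGGGAAAA']
```

The specificity figure quoted above (>99.96%) comes from the study's in silico validation of its full pipeline, not from a screen this simple.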
Abstract Summary: Computational workloads for genome-wide association studies (GWAS) are growing in scale and complexity, outpacing the capabilities of single-threaded software designed for personal computers. The BlueSNP R package implements GWAS statistical tests in the R programming language and executes the calculations across computer clusters configured with Apache Hadoop, a de facto standard framework for distributed data processing using the MapReduce formalism. BlueSNP makes computationally intensive analyses, such as estimating empirical p-values via data permutation and searching for expression quantitative trait loci over thousands of genes, feasible for large genotype–phenotype datasets. Availability and implementation: http://github.com/ibm-bioinformatics/bluesnp Contact: rjprill@us.ibm.com Supplementary information: Supplementary data are available at Bioinformatics online.
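The permutation procedure that makes empirical p-values so computationally intensive (and hence worth distributing across a cluster) is simple at its core: shuffle phenotype labels many times and count how often the shuffled statistic matches or exceeds the observed one. The sketch below is a serial illustration in Python, not BlueSNP's R/Hadoop implementation; the test statistic and data are invented.

```python
# Hedged sketch of an empirical p-value by phenotype permutation
# (illustrative only; BlueSNP distributes this work via MapReduce).
import random

def perm_pvalue(genotypes, phenotypes, n_perm=999, seed=0):
    rng = random.Random(seed)

    def stat(phen):
        # Toy statistic: absolute mean phenotype difference between genotypes.
        g1 = [p for g, p in zip(genotypes, phen) if g == 1]
        g0 = [p for g, p in zip(genotypes, phen) if g == 0]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = stat(phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break the genotype-phenotype link
        if stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0

genos = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
phenos = [3.0, 2.9, 3.1, 3.2, 2.8, 1.0, 1.1, 0.9, 1.2, 0.8]
p = perm_pvalue(genos, phenos)
print(p < 0.05)  # True: the association survives permutation
```

Repeating this for millions of SNPs, each with thousands of permutations, is the embarrassingly parallel workload that MapReduce handles well.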
Biological networks, such as those describing gene regulation, signal transduction, and neural synapses, are representations of large-scale dynamic systems. Discovery of organizing principles of biological networks can be enhanced by embracing the notion that there is a deep interplay between network structure and system dynamics. Recently, many structural characteristics of these non-random networks have been identified, but the dynamical implications of these features have not been explored comprehensively. We demonstrate by exhaustive computational analysis that a dynamical property, stability (robustness to small perturbations), is highly correlated with the relative abundance of small subnetworks (network motifs) in several previously determined biological networks. We propose that robust dynamical stability is an influential property that can determine the non-random structure of biological networks.
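The stability property in question is standard linear stability: a network's dynamics, linearized around a steady state, are stable if every eigenvalue of the Jacobian has negative real part. The sketch below illustrates this criterion on two invented 3-node networks, one carrying a negative feedback loop (a motif often associated with stability) and one a runaway positive loop; it is a conceptual example, not the exhaustive analysis described in the abstract.

```python
# Hedged sketch of the linear stability criterion on toy 3-node networks.
# The weights and topologies are invented for illustration.
import numpy as np

def is_stable(jacobian):
    """Stable iff every eigenvalue has negative real part."""
    return bool(np.all(np.linalg.eigvals(jacobian).real < 0))

decay = -1.0 * np.eye(3)             # self-degradation on each node

feedback = np.array([[0.0, 0.0, -0.5],   # node 3 represses node 1
                     [0.5, 0.0,  0.0],   # node 1 activates node 2
                     [0.0, 0.5,  0.0]])  # node 2 activates node 3

runaway = np.array([[0.0, 0.0, 2.0],     # same cycle, strong activation
                    [2.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0]])

print(is_stable(decay + feedback))   # True: negative feedback loop is stable
print(is_stable(decay + runaway))    # False: positive loop overwhelms decay
```

Scoring many networks this way under sampled weights is one route to correlating motif abundance with the probability of dynamical stability.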