The solution of high-dimensional inference and prediction problems in computational biology is almost always a compromise between mathematical theory and practical constraints such as limited computational resources. As time progresses, computational power increases but well-established inference methods often remain locked in their initial suboptimal solution. We revisit the approach of Segal et al. (2003) to infer regulatory modules and their condition-specific regulators from gene expression data. In contrast to their direct optimization-based solution we use a more representative centroid-like solution extracted from an ensemble of possible statistical models to explain the data. The ensemble method automatically selects a subset of most informative genes and builds a quantitatively better model for them. Genes which cluster together in the majority of models produce functionally more coherent modules. Regulators which are consistently assigned to a module are more often supported by literature, but a single model always contains many regulator assignments not supported by the ensemble. Reliably detecting condition-specific or combinatorial regulation is particularly hard in a single optimum but can be achieved using ensemble averaging.
The importance of gene gain through duplication has long been appreciated. In contrast, the importance of gene loss has only recently attracted attention. Indeed, studies in organisms ranging from plants to worms and humans suggest that duplication of some genes might be better tolerated than that of others. Here we have undertaken a large-scale study to investigate the existence of duplication-resistant genes in the sequenced genomes of 20 flowering plants. We demonstrate that there is a large set of genes that is convergently restored to single-copy status following multiple genome-wide and smaller scale duplication events. We rule out the possibility that such a pattern could be explained by random gene loss only and therefore propose that there is selection pressure to preserve such genes as singletons. This is further substantiated by the observation that angiosperm single-copy genes do not comprise a random fraction of the genome, but instead are often involved in essential housekeeping functions that are highly conserved across all eukaryotes. Furthermore, single-copy genes are generally expressed more highly and in more tissues than non–single-copy genes, and they exhibit higher sequence conservation. Finally, we propose different hypotheses to explain their resistance against duplication.
“Module networks” are a framework to learn gene regulatory networks from expression data using a probabilistic model in which coregulated genes share the same parameters and conditional distributions. We present a method to infer ensembles of such networks and an averaging procedure to extract the statistically most significant modules and their regulators. We show that the inferred probabilistic models extend beyond the dataset used to learn the models.
Recognition of genomic binding sites by transcription factors can occur through base-specific recognition, or by recognition of variations within the structure of the DNA macromolecule. In this article, we investigate what information can be retrieved from local DNA structural properties that is relevant to transcription factor binding and that cannot be captured by the nucleotide sequence alone. More specifically, we explore the benefit of employing the structural characteristics of DNA to create binding-site models that encompass indirect recognition for the Escherichia coli model organism. We developed a novel methodology [Conditional Random fields of Smoothed Structural Data (CRoSSeD)], based on structural scales and conditional random fields to model and predict regulator binding sites. The value of relying on local structural-DNA properties is demonstrated by improved classifier performance on a large number of biological datasets, and by the detection of novel binding sites which could be validated by independent data sources, and which could not be identified using sequence data alone. We further show that the CRoSSeD-binding-site models can be related to the actual molecular mechanisms of the transcription factor DNA binding, and thus cannot only be used for prediction of novel sites, but might also give valuable insights into unknown binding mechanisms of transcription factors.
Targeted sequencing using next-generation sequencing (NGS) platforms offers enormous potential for plant systematics by enabling economical acquisition of multilocus data sets that can resolve difficult phylogenetic problems. However, because discovery of single-copy nuclear (SCN) loci from NGS data requires both bioinformatics skills and access to high-performance computing resources, the application of NGS data has been limited.We developed MarkerMiner 1.0, a fully automated, open-access bioinformatic workflow and application for discovery of SCN loci in angiosperms. Our new tool identified as many as 1993 SCN loci from transcriptomic data sampled as part of four independent test cases representing marker development projects at different phylogenetic scales.MarkerMiner is an easy-to-use and effective tool for discovery of putative SCN loci. It can be run locally or via the Web, and its tabular and alignment outputs facilitate efficient downstream assessments of phylogenetic utility, locus selection, intron-exon boundary prediction, and primer or probe development.