Motivation: We present a sequence-based framework and algorithm, PHYLOCLUS, for predicting co-regulated genes. In our approach, de novo discovery methods are used to find motifs conserved by evolution and then a Bayesian hierarchical clustering model is used to cluster these motifs, thereby grouping together genes that are putatively co-regulated. Our clustering procedure allows both the number of clusters and the motif width within each cluster to be unknown.
The recent arrival of large-scale cap analysis of gene expression (CAGE) data sets in mammals provides a wealth of quantitative information on coding and noncoding RNA polymerase II transcription start sites (TSS). Genome-wide CAGE studies reveal that a large fraction of TSS exhibit peaks where the vast majority of associated tags map to a particular location (∼45%), whereas other active regions contain a broader distribution of initiation events. The presence of a strong single peak suggests that transcription at these locations may be mediated by position-specific sequence features. We therefore propose a new model for single-peaked TSS based solely on known transcription factors (TFs) and their respective regions of positional enrichment. This probabilistic model leads to near-perfect classification results in cross-validation (auROC = 0.98), and performance in genomic scans demonstrates that TSS prediction with both high accuracy and spatial resolution is achievable for a specific but large subgroup of mammalian promoters. The interpretable model structure suggests a DNA code in which canonical sequence features such as TATA-box, Initiator, and GC content do play a significant role, but many additional TFs show distinct spatial biases with respect to TSS location and are important contributors to the accurate prediction of single-peak transcription initiation sites. The model structure also reveals that CAGE tag clusters distal from annotated gene starts have distinct characteristics compared to those close to gene 5′-ends. Using this high-resolution single-peak model, we predict TSS for ∼70% of mammalian microRNAs based on currently available data.
In a stark departure from conventional wisdom, Nicolò Bertani, Shane T. Jensen, and Ville A. Satopää’s recently published research article titled “Joint Bottom-up Method for Probabilistic Forecasting of Hierarchical Time Series” dismantles a long-held belief in hierarchical forecasting: that the hierarchical structure itself contains vital information. The joint bottom-up (JBU) method proves otherwise. The authors demonstrate that the sums within a hierarchy—often seen as critical—add no additional information beyond what is contained in the most granular, bottom-level series. By modeling these bottom-level series jointly, JBU leverages their dependencies to deliver probabilistic forecasts that are both coherent and highly accurate across all levels of aggregation. This groundbreaking insight challenges decades of hierarchical forecasting practices. It underscores that upper-level series can be entirely reconstructed from the bottom-level series, rendering the hierarchy redundant. This finding, validated through real-world applications, sets a new standard for forecasting in fields like retail, energy, and tourism. Explore the full study to understand its far-reaching implications.
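The claim that upper-level series can be reconstructed from the bottom-level series can be illustrated with a minimal sketch. It is not the JBU model itself: the hierarchy, the series names, and the shared-shock sampler below are hypothetical stand-ins for a fitted joint model of the bottom-level series. The point it shows is only that summing each joint bottom-level sample yields forecasts that are coherent at every level by construction.

```python
import random

# Hypothetical hierarchy: each upper-level node lists the bottom-level
# series it aggregates. Names are illustrative, not from the paper.
HIERARCHY = {
    "total": ["s1", "s2", "s3", "s4"],
    "region_a": ["s1", "s2"],
    "region_b": ["s3", "s4"],
}
BOTTOM = ["s1", "s2", "s3", "s4"]


def sample_bottom(rng):
    """Draw one joint sample of the bottom-level series.

    Assumption: a shared shock stands in for whatever dependence a real
    joint model of the bottom-level series would capture.
    """
    shared = rng.gauss(0.0, 1.0)
    return {name: 10.0 + shared + rng.gauss(0.0, 0.5) for name in BOTTOM}


def aggregate(bottom_sample):
    """Build every upper-level value by summing bottom-level values."""
    out = dict(bottom_sample)
    for node, children in HIERARCHY.items():
        out[node] = sum(bottom_sample[child] for child in children)
    return out


def forecast_samples(n_samples, seed=0):
    """Return a list of coherent samples covering all levels."""
    rng = random.Random(seed)
    return [aggregate(sample_bottom(rng)) for _ in range(n_samples)]


samples = forecast_samples(1000)
```

Because every upper-level value is computed from the same joint draw, each sample satisfies the aggregation constraints (for example, `total` equals `region_a` plus `region_b` up to floating-point rounding), and empirical quantiles at any level can be read directly from `samples`.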
Understanding neuronal activity in the human brain is an extremely difficult problem in terms of both measurement and statistical modeling. We address a particular research question in this area: the analysis of human intracranial electroencephalogram (iEEG) recordings of epileptic seizures from a collection of patients. In these data, each seizure of each patient is defined by the activities of many individual recording channels. The modeling of epileptic seizures is challenging due to the large amount of heterogeneity in iEEG signal between channels within a particular seizure, between seizures within an individual, and across individuals. We develop a new nonparametric hierarchical Bayesian model that simultaneously addresses these multiple levels of heterogeneity in our epilepsy data. Our approach, which we call a multi-level clustering hierarchical Dirichlet process (MLC-HDP), clusters over channel activities within a seizure, over seizures of a patient and over patients. We demonstrate the advantages of our methodology over alternative approaches in human EEG seizure data and show that its seizure clustering is close to manual clustering by a physician expert. We also address important clinical questions like “to which seizures of other patients is this seizure similar?”
Summary: The master regulator for entry into sporulation in Bacillus subtilis is the DNA‐binding protein Spo0A, which has been found to influence, directly or indirectly, the expression of over 500 genes during the early stages of development. To search on a genome‐wide basis for genes under the direct control of Spo0A, we used chromatin immunoprecipitation in combination with gene microarray analysis to identify regions of the chromosome at which an activated form of Spo0A binds in vivo. This information in combination with transcriptional profiling using gene microarrays, gel electrophoretic mobility shift assays, using the DNA‐binding domain of Spo0A, and bioinformatics enabled us to assign 103 genes to the Spo0A regulon in addition to 18 previously known members. Thus, in total, 121 genes, which are organized as 30 single‐gene units and 24 operons, are likely to be under the direct control of Spo0A. Forty of these genes are under the positive control of Spo0A, and 81 are under its negative control. Among newly identified members of the regulon with transcription that was stimulated by Spo0A are genes for metabolic enzymes and genes for efflux pumps. Among members with transcription that was inhibited by Spo0A are genes encoding components of the DNA replication machinery and genes that govern flagellum biosynthesis and chemotaxis. Also included in the regulon are many (25) genes with products that are direct or indirect regulators of gene transcription. Spo0A is a master regulator for sporulation, but many of its effects on the global pattern of gene transcription are likely to be mediated indirectly by regulatory genes under its control.
Locating recombination hotspots in genomic data is an important but difficult task. Current methods frequently rely on estimating complicated models at high computational cost. In this paper we develop an extremely fast, scalable method for inferring recombination hotspots in a population of genomic sequences that is based on the singular value decomposition. Our method performs well in several synthetic data scenarios. We also apply our technique to a real data investigation of the evolution of drug therapy resistance in a population of HIV genomic sequences. Finally, we compare our method on both real and simulated data to a state-of-the-art algorithm.
Statistical evolutionary models provide an important mechanism for describing and understanding the escape response of a viral population under a particular therapy. We present a new hierarchical model that incorporates spatially varying mutation and recombination rates at the nucleotide level. It also maintains separate parameters for treatment and control groups, which allows us to estimate treatment effects explicitly. We use the model to investigate the sequence evolution of HIV populations exposed to a recently developed antisense gene therapy, as well as a more conventional drug therapy. The detection of biologically relevant and plausible signals in both therapy studies demonstrates the effectiveness of the method.
Germline mutation is the mechanism by which genetic variation in a population is created. Inferences derived from mutation rate models are fundamental to many population genetics methods. Previous models have demonstrated that nucleotides flanking polymorphic sites (the local sequence context) explain variation in the probability that a site is polymorphic. However, limitations to these models exist as the size of the local sequence context window expands. These include a lack of robustness to data sparsity at typical sample sizes, lack of regularization to generate parsimonious models and lack of quantified uncertainty in estimated rates to facilitate comparison between models. To address these limitations, we developed Baymer, a regularized Bayesian hierarchical tree model that captures the heterogeneous effect of sequence contexts on polymorphism probabilities. Baymer implements an adaptive Metropolis-within-Gibbs Markov chain Monte Carlo sampling scheme to estimate the posterior distributions of sequence-context-based probabilities that a site is polymorphic. We show that Baymer accurately infers polymorphism probabilities and well-calibrated posterior distributions, robustly handles data sparsity, appropriately regularizes to return parsimonious models, and scales computationally at least up to 9-mer context windows. We demonstrate application of Baymer in three ways: first, identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, in a sparse data setting, examining the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, comparing model concordance between different great ape species. We find a shared context-dependent mutation rate architecture underlying our models, enabling a transfer-learning inspired strategy for modeling germline mutations.
In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of the available data.
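The adaptive Metropolis-within-Gibbs idea mentioned above can be sketched on a toy problem. This is not Baymer's model or code: the two "contexts", their counts, the normal prior on the logit scale, and the batch-based step-size rule (batches of 50, target acceptance 0.44) are all assumptions chosen for illustration. Each parameter is updated in turn with a random-walk Metropolis step, and each parameter's proposal scale is nudged up or down after every batch depending on its recent acceptance rate.

```python
import math
import random


def log_post(theta, k, n, prior_sd=10.0):
    """Log posterior (up to a constant) of a logit-scale probability.

    Binomial likelihood with k polymorphic sites out of n, and an
    assumed N(0, prior_sd^2) prior on theta = logit(p).
    """
    return k * theta - n * math.log1p(math.exp(theta)) - 0.5 * (theta / prior_sd) ** 2


def adaptive_mwg(counts, n_iter=6000, batch=50, target=0.44, seed=1):
    """Adaptive Metropolis-within-Gibbs over one parameter per context.

    counts: list of (k, n) pairs, one per hypothetical sequence context.
    Returns a list of draws, each a list of probabilities p = sigmoid(theta).
    """
    rng = random.Random(seed)
    d = len(counts)
    theta = [0.0] * d
    log_step = [math.log(0.5)] * d  # per-parameter log proposal sd
    accepted = [0] * d
    draws = []
    for it in range(1, n_iter + 1):
        # One sweep: update each coordinate with its own Metropolis step.
        for j, (k, n) in enumerate(counts):
            prop = theta[j] + rng.gauss(0.0, math.exp(log_step[j]))
            log_ratio = log_post(prop, k, n) - log_post(theta[j], k, n)
            # 1 - random() lies in (0, 1], so the log is always defined.
            if math.log(1.0 - rng.random()) < log_ratio:
                theta[j] = prop
                accepted[j] += 1
        # After each batch, move each step size toward the target rate,
        # with an adjustment that never exceeds 0.01 and shrinks over time.
        if it % batch == 0:
            delta = min(0.01, (it // batch) ** -0.5)
            for j in range(d):
                rate = accepted[j] / batch
                log_step[j] += delta if rate > target else -delta
                accepted[j] = 0
        draws.append([1.0 / (1.0 + math.exp(-t)) for t in theta])
    return draws


# Toy data: 30 of 1000 and 200 of 1000 sites polymorphic (made-up counts).
draws = adaptive_mwg([(30, 1000), (200, 1000)])
```

In this toy the two parameters are independent given the data, so the coordinate-wise sweep is trivially valid; in a hierarchical model each conditional would also depend on the other parameters. Discarding an initial burn-in (say the first 1000 draws) before summarizing is advisable, since the chain starts far from the posterior mode.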