Small non-coding RNAs (ncRNAs) are key regulators of plant development through modulation of the processing, stability, and translation of larger RNAs. We present small RNA data sets comprising more than 200 million aligned Illumina sequence reads covering all major cell types of the root as well as four distinct developmental zones. MicroRNAs (miRNAs) constitute a class of small ncRNAs that are particularly important for development. Of the 243 known miRNAs, 133 were found to be expressed in the root, and most showed tissue- or zone-specific expression patterns. We identified 66 new high-confidence miRNAs using a computational pipeline, PIPmiR, specifically developed for the identification of plant miRNAs. PIPmiR uses a probabilistic model that combines RNA structure and expression information to identify miRNAs with high precision. Knockdown of three of the newly identified miRNAs results in altered root growth phenotypes, confirming that novel miRNAs predicted by PIPmiR have functional relevance.
High-throughput immunoprecipitation methods to analyze RNA binding protein–RNA interactions and modifications have great potential to further the understanding of post-transcriptional gene regulation. Due to the differences between individual approaches, each of a diverse number of computational methods can typically be applied to only one specific sequencing protocol. Here, we present a Bayesian model called omniCLIP that can be applied to data from all protocols to detect regulatory elements in RNAs. omniCLIP greatly simplifies the data analysis, increases the reliability of results and paves the way for integrative studies based on data from different sources.
Table showing McEnhancer validation of long-range interactions reported in REDfly. This table shows all long-range interactions reported in REDfly that overlap our DHSs, selecting only those assigned to genes that are included in our analysis but are not the closest genes. Columns of this table are: CRM coordinates from REDfly, their assigned genes, whether each assigned gene has a unique expression pattern or multiple expression patterns, closest genes, coordinates of McEnhancer-predicted DHSs that overlap each CRM, whether the McEnhancer prediction is correct, and a comment in case the assigned genes are different. (XLSX 59 kb)
In recent years, numerous applications have demonstrated the potential of deep learning for an improved understanding of biological processes. However, most deep learning tools developed so far are designed to address a specific question on a fixed dataset and/or by a fixed model architecture. Here we present Janggu, a Python library that facilitates deep learning for genomics applications, aiming to ease data acquisition and model evaluation. Among its key features are special dataset objects, which form a unified and flexible data acquisition and pre-processing framework for genomics data that enables streamlining of future research applications through reusable components. Through a NumPy-like interface, these dataset objects are directly compatible with popular deep learning libraries, including Keras or PyTorch. Janggu offers the possibility to visualize predictions as genomic tracks or by exporting them to the bigWig format, as well as utilities for Keras-based models. We illustrate the functionality of Janggu on several deep learning genomics applications. First, we evaluate different model topologies for the task of predicting binding sites for the transcription factor JunD. Second, we demonstrate the framework on published models for predicting chromatin effects. Third, we show that promoter usage measured by CAGE can be predicted using DNase hypersensitivity, histone modifications and DNA sequence features. We improve the performance of these models using a novel feature in Janggu that allows us to include high-order sequence features. We believe that Janggu will help to significantly reduce repetitive programming overhead for deep learning applications in genomics, and will enable computational biologists to rapidly assess biological hypotheses.
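The notion of high-order sequence features mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that "high-order" refers to one-hot encoding overlapping k-mers (for example dinucleotides) rather than single nucleotides; the abstract does not spell this out. The function name `one_hot_kmers` and its parameters are hypothetical and are not part of the Janggu API.

```python
import itertools

import numpy as np


def one_hot_kmers(seq, order=1, alphabet="ACGT"):
    """One-hot encode overlapping k-mers of length `order` (hypothetical helper).

    Returns an array of shape (len(seq) - order + 1, len(alphabet) ** order).
    Windows containing a character outside `alphabet` (e.g. N) stay all-zero.
    """
    # Map every possible k-mer to a column index, in lexicographic order.
    kmers = itertools.product(alphabet, repeat=order)
    index = {"".join(kmer): col for col, kmer in enumerate(kmers)}

    n_windows = max(len(seq) - order + 1, 0)
    encoded = np.zeros((n_windows, len(index)), dtype=np.int8)

    seq = seq.upper()
    for pos in range(n_windows):
        col = index.get(seq[pos:pos + order])
        if col is not None:
            encoded[pos, col] = 1
    return encoded


# order=1 gives the usual 4-column encoding; order=2 gives 16 columns.
mono = one_hot_kmers("ACGT", order=1)
di = one_hot_kmers("ACGT", order=2)
```

With `order=1` each position activates one of four columns, whereas with `order=2` each position activates one of sixteen dinucleotide columns, so a first-layer filter can respond to nucleotide pairs directly.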
The recent arrival of large-scale cap analysis of gene expression (CAGE) data sets in mammals provides a wealth of quantitative information on coding and noncoding RNA polymerase II transcription start sites (TSS). Genome-wide CAGE studies reveal that a large fraction of TSS exhibit peaks where the vast majority of associated tags map to a particular location (∼45%), whereas other active regions contain a broader distribution of initiation events. The presence of a strong single peak suggests that transcription at these locations may be mediated by position-specific sequence features. We therefore propose a new model for single-peaked TSS based solely on known transcription factors (TFs) and their respective regions of positional enrichment. This probabilistic model leads to near-perfect classification results in cross-validation (auROC = 0.98), and performance in genomic scans demonstrates that TSS prediction with both high accuracy and spatial resolution is achievable for a specific but large subgroup of mammalian promoters. The interpretable model structure suggests a DNA code in which canonical sequence features such as TATA-box, Initiator, and GC content do play a significant role, but many additional TFs show distinct spatial biases with respect to TSS location and are important contributors to the accurate prediction of single-peak transcription initiation sites. The model structure also reveals that CAGE tag clusters distal from annotated gene starts have distinct characteristics compared to those close to gene 5′-ends. Using this high-resolution single-peak model, we predict TSS for ∼70% of mammalian microRNAs based on currently available data.
In pluripotent cells, a delicate activation-repression balance maintains pro-differentiation genes ready for rapid activation. The identity of transcription factors (TFs) that specifically repress pro-differentiation genes remains obscure. By targeting ~1,700 TFs with a CRISPR loss-of-function screen, we found that ZBTB11 and ZFP131 are required for embryonic stem cell (ESC) pluripotency. ESCs without ZBTB11 or ZFP131 lose colony morphology, reduce their proliferation rate and upregulate transcription of genes associated with the three germ layers. ZBTB11 and ZFP131 bind proximally to pro-differentiation genes. ZBTB11 or ZFP131 loss leads to an increase in H3K4me3, NELF complex release, and concomitant transcription at associated genes. Together, our results suggest that ZBTB11 and ZFP131 maintain pluripotency by preventing premature expression of pro-differentiation genes and present a generalizable framework to maintain cellular potency.
N6-methyladenosine (m6A) regulates a variety of physiological processes through modulation of RNA metabolism. The modification is particularly enriched in the nervous system of several species, and its dysregulation has been associated with neurodevelopmental defects and neural dysfunctions. In Drosophila, loss of m6A alters fly behavior, although the underlying mechanism and the role of m6A during nervous system development have remained elusive. Here we find that impairment of the m6A pathway leads to axonal overgrowth and misguidance at larval neuromuscular junctions as well as in the adult mushroom bodies. We identify Ythdf as the main m6A reader in the nervous system, being required for limiting axonal growth. Mechanistically, we show that Ythdf directly interacts with Fragile X mental retardation protein to inhibit the translation of key transcripts involved in axonal growth regulation. Altogether, this study demonstrates that the m6A pathway controls development of the nervous system by modulating Fmr1 target selection.