ABSTRACT Cell type specification during early nervous system development in Drosophila melanogaster requires precise regulation of gene expression in time and space. Resolving the programs driving neurogenesis has been a major challenge owing to the complexity and rapidity with which distinct cell populations arise. To resolve the cell type-specific gene expression dynamics in early nervous system development, we have sequenced the transcriptomes of purified neurogenic cell types across consecutive time points covering crucial events in neurogenesis. The resulting gene expression atlas comprises a detailed resource of global transcriptome dynamics that permits systematic analysis of how cells in the nervous system acquire distinct fates. We resolve known gene expression dynamics and uncover novel expression signatures for hundreds of genes among diverse neurogenic cell types, most of which remain unstudied. We also identified a set of conserved long noncoding RNAs (lncRNAs) that are regulated in a tissue-specific manner and exhibit spatiotemporal expression during neurogenesis with exquisite specificity. lncRNA expression is highly dynamic and demarcates specific subpopulations within neurogenic cell types. Our spatiotemporal transcriptome atlas provides a comprehensive resource for investigating the function of coding genes and noncoding RNAs during crucial stages of early neurogenesis.
Summary statement We present a spatiotemporal transcriptome during early Drosophila embryonic nervous system development, revealing a complex cell type-specific network of mRNAs and lncRNAs. Abstract Cell type specification during early nervous system development in Drosophila melanogaster requires precise regulation of gene expression in time and space. Resolving the programs driving neurogenesis has been a major challenge owing to the complexity and rapidity with which distinct cell populations arise. To resolve the cell type-specific gene expression dynamics in early nervous system development, we have sequenced the transcriptomes of purified neurogenic cell types across consecutive time points covering critical events in neurogenesis. The resulting gene expression atlas comprises a detailed resource of global transcriptome dynamics that permits systematic analysis of how cells in the nervous system acquire distinct fates. We resolve known gene expression dynamics and uncover novel expression signatures for hundreds of genes among diverse neurogenic cell types, most of which remain unstudied. We also identified a set of conserved and tissue-specifically regulated long-noncoding RNAs (lncRNAs) that exhibit spatiotemporal expression during neurogenesis with exquisite specificity. LncRNA expression is highly dynamic and demarcates specific subpopulations within neurogenic cell types. Our spatiotemporal transcriptome atlas provides a comprehensive resource to investigate the function of coding genes and noncoding RNAs during critical stages of early neurogenesis.
Abstract More than half the human and mouse genomes are comprised of repetitive sequences, such as transposable elements (TEs), which have been implicated in many biological processes. In contrast, much less is known about other repeats, such as local repeats that occur in multiple instances within a given locus in the genome but not elsewhere. Here, we systematically characterize local repeats in the genomic locus of the Firre long noncoding RNA (lncRNA). We find a conserved function for the RRD repeat as a ribonucleic nuclear retention signal that is sufficient to retain an otherwise cytoplasmic mRNA in the nucleus. We also identified a repeat, termed R0, that can function as a DNA enhancer element within the intronic sequences of Firre. Collectively, our data suggest that local repeats can have diverse functionalities and molecular modalities in the Firre locus and perhaps more globally in other lncRNAs.
Abstract Background In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p -values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as “informative covariates” to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigated the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology Results Methods that incorporate informative covariates were modestly more powerful than classic approaches, and did not underperform classic approaches, even when the covariate was completely uninformative. The majority of methods were successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we found the improvement of the modern FDR methods over the classic methods increased with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. Conclusions Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
Background and Aims Manual histological assessment is currently the accepted standard for diagnosing and monitoring disease progression in NASH, but is limited by variability in interpretation and insensitivity to change. Thus, there is a critical need for improved tools to assess liver pathology in order to risk stratify NASH patients and monitor treatment response. Approach and Results Here, we describe a machine learning (ML)‐based approach to liver histology assessment, which accurately characterizes disease severity and heterogeneity, and sensitively quantifies treatment response in NASH. We use samples from three randomized controlled trials to build and then validate deep convolutional neural networks to measure key histological features in NASH, including steatosis, inflammation, hepatocellular ballooning, and fibrosis. The ML‐based predictions showed strong correlations with expert pathologists and were prognostic of progression to cirrhosis and liver‐related clinical events. We developed a heterogeneity‐sensitive metric of fibrosis response, the Deep Learning Treatment Assessment Liver Fibrosis score, which measured antifibrotic treatment effects that went undetected by manual pathological staging and was concordant with histological disease progression. Conclusions Our ML method has shown reproducibility and sensitivity and was prognostic for disease progression, demonstrating the power of ML to advance our understanding of disease heterogeneity in NASH, risk stratify affected patients, and facilitate the development of therapies.
Resource15 January 2018Open Access Transparent process High-throughput identification of RNA nuclear enrichment sequences Chinmay J Shukla Chinmay J Shukla Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA Search for more papers by this author Alexandra L McCorkindale Alexandra L McCorkindale Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Berlin, Germany Search for more papers by this author Chiara Gerhardinger Chiara Gerhardinger Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Search for more papers by this author Keegan D Korthauer Keegan D Korthauer orcid.org/0000-0002-4565-1654 Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA Search for more papers by this author Moran N Cabili Moran N Cabili Broad Institute of MIT and Harvard, Cambridge, MA, USA Search for more papers by this author David M Shechner David M Shechner Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Search for more papers by this author Rafael A Irizarry Rafael A Irizarry Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA Search for more papers by this author Philipp G Maass Philipp G Maass orcid.org/0000-0002-2742-8301 Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Search for more papers by this author John L Rinn Corresponding Author John L Rinn [email protected] orcid.org/0000-0002-7231-7539 Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Department of Pathology, Beth Israel Deaconess Medical Center, Boston, MA, USA Search for more papers by this author Chinmay J Shukla Chinmay J Shukla Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA Search for more papers by this author Alexandra L McCorkindale Alexandra L McCorkindale Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Berlin, Germany Search for more papers by this author Chiara Gerhardinger Chiara Gerhardinger Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Search for more papers by this author Keegan D Korthauer Keegan D Korthauer orcid.org/0000-0002-4565-1654 Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA Search for more papers by this author Moran N Cabili Moran N Cabili Broad Institute of MIT and Harvard, Cambridge, MA, USA Search for more papers by this author David M Shechner David M Shechner Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Search for more papers by this author Rafael A Irizarry Rafael A Irizarry Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA Search for more papers by this author Philipp G Maass Philipp G Maass orcid.org/0000-0002-2742-8301 Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Search for more papers by this author John L Rinn Corresponding Author John L Rinn [email protected] orcid.org/0000-0002-7231-7539 Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA Broad Institute of MIT and Harvard, Cambridge, MA, USA Department of Pathology, Beth Israel Deaconess Medical Center, Boston, MA, USA Search for more papers by this author Author Information Chinmay J Shukla1,2,3,4, Alexandra L McCorkindale1,5, Chiara Gerhardinger1,2, Keegan D Korthauer3,6, Moran N Cabili2, David M Shechner1,2, Rafael A Irizarry3,6, Philipp G Maass1,‡ and John L Rinn *,1,2,7,8,‡ 1Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA 2Broad Institute of MIT and Harvard, Cambridge, MA, USA 3Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA 4Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA 5Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, Berlin, Germany 6Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA 7Department of Pathology, Beth Israel Deaconess Medical Center, Boston, MA, USA 8Present address: Department of Biochemistry, University of Colorado BioFrontiers, Boulder, CO, USA ‡These authors contributed equally to this work *Corresponding author. Tel: +1 303 735 7218; E-mail: [email protected] The EMBO Journal (2018)37:e98452https://doi.org/10.15252/embj.201798452 See also: F Agostini et al (March 2018) PDFDownload PDF of article text and main figures. Peer ReviewDownload a summary of the editorial decision process including editorial decision letters, reviewer comments and author responses to feedback. ToolsAdd to favoritesDownload CitationsTrack CitationsPermissions ShareFacebookTwitterLinked InMendeleyWechatReddit Figures & Info Abstract In the post-genomic era, thousands of putative noncoding regulatory regions have been identified, such as enhancers, promoters, long noncoding RNAs (lncRNAs), and a cadre of small peptides. These ever-growing catalogs require high-throughput assays to test their functionality at scale. Massively parallel reporter assays have greatly enhanced the understanding of noncoding DNA elements en masse. Here, we present a massively parallel RNA assay (MPRNA) that can assay 10,000 or more RNA segments for RNA-based functionality. We applied MPRNA to identify RNA-based nuclear localization domains harbored in lncRNAs. We examined a pool of 11,969 oligos densely tiling 38 human lncRNAs that were fused to a cytosolic transcript. After cell fractionation and barcode sequencing, we identified 109 unique RNA regions that significantly enriched this cytosolic transcript in the nucleus including a cytosine-rich motif. These nuclear enrichment sequences are highly conserved and over-represented in global nuclear fractionation sequencing. Importantly, many of these regions were independently validated by single-molecule RNA fluorescence in situ hybridization. Overall, we demonstrate the utility of MPRNA for future investigation of RNA-based functionalities. Synopsis A new tiling-based method maps functional domains in RNAs in a high-throughput manner, allowing the identification of sequence motifs that contribute to the nuclear enrichment of long non-coding RNAs. Massively Parallel RNA Assay (MPRNA) is a universally applicable method to survey RNA-based functionalities in a high-throughput manner. MPRNA identifies 109 nuclear enrichment sequences across 29 of 38 lncRNAs tested. A C-rich motif is generally enriched in nuclear versus cytoplasmic transcripts. RNA-FISH Fish reveals that large sequence domains are sufficient to localize otherwise cytoplasmic RNA to the nucleus. MPRNA shows that nuclear lncRNAs have unique and large (˜500 bp) nuclear localization domains Introduction One of the biggest surprises since the sequencing of the human genome has been the discovery of thousands of functional noncoding regions (Rinn & Chang, 2012; Hon et al, 2017). This includes new DNA enhancer elements, promoters, small RNAs, long noncoding RNAs (lncRNAs), and small peptides (20–70 amino acids) that are encoded in regions previously annotated as lncRNAs. Underscoring the importance of these elements are their disease associations and functional roles in the regulation of transcription (Jin et al, 2011; Morris & Mattick, 2014; Anderson et al, 2015; Gong et al, 2015; Hon et al, 2017). The ever-growing collection of noncoding annotations has motivated technological advances to characterize these elements and assay for their functional roles in a high-throughput manner. For example, the capacity to synthesize pools comprised of more than 100,000 individual DNA oligos has led to massively parallel reporter assays (MPRA) that have been applied to identify noncoding DNA elements, such as enhancers and promoters, on a genome-wide scale (Patwardhan et al, 2009; Melnikov et al, 2012; Oikonomou et al, 2014; Rosenberg et al, 2015; Ernst et al, 2016). These studies have turbo-boosted our understanding of functional DNA elements and their upstream regulatory factors. Addressing RNA functionalities in a similar manner has many challenges, remains limited, and is poorly scalable. Yet, such an assay would hold great promise to understand fundamental aspects of lncRNA biology through the identification of functional sequences and structures. Central to RNA-based functionality is subcellular localization, which influences the biogenesis and function of mRNAs and lncRNAs alike. RNA localization provides a fundamental mechanism through which cells modulate and utilize the functions encoded in their transcriptomes (Davis & Ish-Horowicz, 1991; Bullock & Ish-Horowicz, 2001; Johnstone & Lasko, 2001; Lin & Holt, 2007; Paquin & Chartrand, 2008; Martin & Ephrussi, 2009; Zhang et al, 2014; Hacisuleyman et al, 2016). This spatial layer of post-transcriptional gene regulation is known to be critical in a variety of contexts, including asymmetric cell divisions (Paquin & Chartrand, 2008), embryonic development (Davis & Ish-Horowicz, 1991; Bullock & Ish-Horowicz, 2001; Johnstone & Lasko, 2001), and signal transduction (Lin & Holt, 2007). Previous work has identified a small number of cis-acting mRNA localization elements, using genetic approaches or hybrid reporter constructs to decipher sequences required for localization to specific parts of the cell (Bullock & Ish-Horowicz, 2001; Martin & Ephrussi, 2009). These elements are often located in 3′ untranslated regions and range from five to several hundred nucleotides in length (Bullock & Ish-Horowicz, 2001; Miyagawa et al, 2012; Zhang et al, 2014; Hacisuleyman et al, 2016). Yet, the sequences and structures responsible for RNA localization remain inchoate. In contrast to mRNAs—which are exclusively cytosolic—most lncRNAs are predominantly enriched in the nucleus (Derrien et al, 2012). Consistent with their localization patterns, several examples of lncRNAs (XIST (Brown et al, 1992; Lee & Bartolomei, 2013), FIRRE (Hacisuleyman et al, 2016), MALAT1 (Gutschner et al, 2013), NEAT1 (Clemson et al, 2009), PVT1 (Tseng et al, 2014), GAS5 (Kino et al, 2010), PINT (Marín-Béjar et al, 2013), and many others) perform key nuclear roles during development and are believed to be crucial in nuclear organization (Rinn & Guttman, 2014). This is surprising since mRNAs and lncRNAs share similar biogenesis and post-transcriptional features (Cabili et al, 2011; Ni et al, 2013; Guttman & Rinn, 2012; e.g., m7-G cap and polyA tail), which usually trigger RNA export to the cytosol. This raises a more general question: Is there a universal nuclear localization motif harbored within lncRNAs (Zhang et al, 2014), or is nuclear localization imparted by larger RNA domains specific to individual transcripts (Hacisuleyman et al, 2016)? Addressing this question requires a high-throughput assay that can screen for RNA-based functionalities. Toward this goal, we have developed and optimized such a massively parallel RNA assay (MPRNA). Briefly, we developed a construct that expressed and appends thousands of 110mer RNA sequences—each uniquely barcoded—to a cytosolic-localized reporter transcript: a noncoding, frame-shifted variant of Sox2, which we hereafter refer to as fsSox2 (see Materials and Methods). By sequencing barcodes in nuclear fractions versus the total barcode population, we can simultaneously assess each 110mer that was sufficient to retain fsSox2 in the nucleus. To control for the possibility that sequences larger than 110 nucleotides (nt) might be required for nuclear retention, we designed a densely overlapping pool of oligos so that, on average, every unique 10 nt are independently assayed. This was optimized to develop a robust statistical method that leverages the interdependencies and variances of each 110mer to identify larger RNA domains enriched in the nucleus. As a first application, we performed MPRNA across 38 lncRNAs with varying degrees of subcellular localization patterns as previously determined by single-molecule RNA fluorescence in situ hybridization (smFISH; Cabili et al, 2015). We identified 109 unique nuclear enrichment sequences derived from 29 of the 38 lncRNAs tested, including the known RNA localization regions for MALAT1 (Miyagawa et al, 2012). Interestingly, a global motif analysis of these regions uncovered a cytosine-rich (C-rich) motif that is over-represented in many of the nuclear enrichment regions. Consistent with a possible global role of the C-rich motif for localization, these regions tend to be more conserved and are generally nuclear-enriched in global nuclear versus cytoplasmic RNA sequencing (RNA-seq) experiments from the ENCODE consortium. Notably, a very similar motif was also identified in an independent study (Lubelsky & Ulitsky, 2018). Finally, we independently validated the capability of these domains to impart nuclear localization by smFISH of fsSox2 appended with the putative nuclear enrichment sequences identified by our MPRNA. Collectively, we demonstrate that the MPRNA methodology could be universally applicable to identify active RNA elements sufficient for any cellular process that can be physically and functionally separated. Results Design and optimization of a massively parallel RNA assay (MPRNA) In order to identify RNA sequences that drive lncRNA nuclear enrichment, we developed a high-throughput approach for identifying nuclear enrichment elements. First, we designed a pool of 11,969 153-nt oligos representing 38 lncRNAs with diverse subcellular localization patterns: from single nuclear foci (e.g., XIST, ANRIL, ANCR, PVT1, KCNQ1OT1, FIRRE) to broadly diffuse cytosolic patterns (e.g., NR_024412, NR 033770; Cabili et al, 2015). We designed the oligo-pool to tile each of the 38 lncRNAs with a 10-nt shift between sequential oligonucleotides. This densely overlapping tiling approach offers us a unique advantage of allowing the computational "stitching" of sequential oligos (Jaffe et al, 2012; preprint: Korthauer et al, 2017), thus enabling identification of longer regions required for nuclear enrichment. Second, we cloned the pool of oligonucleotides to the 3′ end of a cytosolic-localized Sox2 construct (fsSox2). As previously shown (Haciuselyman et al, 2016), we used fsSox2 instead of regular Sox2 to avoid any unwanted translation artifacts. The oligo-pool was expressed in HeLa cells, followed by subcellular fractionation, and targeted RNA-seq of unique barcodes to determine the enrichment of each fsSox2 variant in the nucleus relative to the total barcode representation in total RNA (Fig 1A, Table EV1, Materials and Methods). The assay was performed as six biological replicates to ensure sufficient statistical power for our analytical model, and accurately estimate in-group variance (see below, Materials and Methods). Figure 1. A Massively Parallel RNA Assay (MPRNA) to identify RNA nuclear enrichment signals Experimental overview. Far left: oligonucleotide pool design. Double-stranded DNA (dsDNA) oligonucleotides were designed by computationally scanning 38 parental lncRNA transcripts (Table EV1) in 110-nt windows, with 10-nt spacing between sequential oligos. These lncRNA-derived sequences (gray) were appended with unique barcodes and universal primer binding sites, resulting in a pool of 11,969 oligos of 153 bp (Table EV1). The vertical lines in the lncRNA denote splice junctions. Second from left: schematic summarizing the design of each oligonucleotide. Second from right: reporter design. The oligonucleotide pool was cloned into a reporter plasmid as fusion transcript 3′ of fsSox2 (minCMV, minimal CMV promoter; pA, polyadenylation sequence). Far right: MPRA workflow. The fsSox2˜oligo reporter pool was transiently transfected into HeLa cells. Following 48 h of expression, cells were harvested and fractionated to isolate nuclei, and the nuclear enrichment of each oligo was quantified by targeted RNA sequencing. Matched whole-cell lysates from unfractionated cells served as controls. Read mapping and normalization. A perfect match between the first 10 nt of the read and the barcode sequence was used to "map" the read. To guarantee robustness of the mapping procedure, we allowed for no more than two mismatches within the 90 basepairs upstream of the barcode (see "mapReads" function in our analysis package—please refer to the Code availability section). Counts were normalized for library size using the "normCounts" table (see analysis package and GEO data—please refer to the Code and Data availability sections). Counts for each nucleotide were modeled based on the normalized counts for each oligo. When nucleotide "A" overlapped with oligos i1, i2, i3, and i4, counts for this nucleotide were modeled by the median of counts for each of the individual oligos (i1–i4; see "modelNucCounts" function in our analysis pipeline). The nucleotide counts were then used to infer differential regions by (1): finding candidate regions and assigning a summary statistic to each one of them and next (2): generating null candidates by permuting sample labels and using them to assign an empirical P-value to our candidate regions from step 1 to identify significant regions. Differential region-calling correctly identifies nuclear retention elements in MALAT1. Solid lines: per-nucleotide abundances in the nuclear (red) and whole-cell (gray) fractions, modeled for each nucleotide position along the MALAT1 transcript, based on the aggregate behavior of all oligos containing that nucleotide (shaded regions: ±SD, medians of six biological replicates). Download figure Download PowerPoint We ranked candidate localization regions using a newly defined summary statistic that generates a null distribution by permuting sample labels, which is used to assign P-values (Fig 1B–D; Materials and Methods). Our approach overcomes the inter-replicate variability inherent in high-throughput reporter assays and allows us to sensitively and accurately discover nuclear-enriched RNA segments spanning up to hundreds of base pairs, which we term "differential regions" (DRs). At each stage of the MPRNA, we used quality controls (Fig EV1), and to prove the principle of our assay and analytical method, we first focused on a well-established nuclear lncRNA, MALAT1. Previous work demonstrated that two elements within MALAT1 ("Region E" and "Region M") act as potent nuclear localization signals (Miyagawa et al, 2012). We examined the nuclear enrichment of all fsSox2 pool variants bearing oligos derived from MALAT1 (Materials and Methods). Consistent with the previous study, nucleotides derived from Region E and Region M were highly enriched in the nucleus, compared to those from elsewhere in MALAT1. This finding demonstrates that our assay can recapitulate known RNA localization signals and that our analytical approach can identify localization domains longer than 110 nt (Fig 1E). Click here to expand this figure. Figure EV1. Quality assessment of every MPRNA step The distribution of oligos in the cloned plasmid pool. (i) the single peak showing uniform counts for several different oligos indicates very little jackpotting, and (ii) entire oligo representation (small bump at zero counts). Nuclear enrichment of NEAT1, GAPDH, and SNHG5 as determined by qRT–PCR in each biological replicate (left) (error bars: ±SD). Nuclear enrichment of median ± SD across the six replicates. Transfection efficiency of HeLa cells co-transfected with a GFP plasmid using the protocol outlined in Materials and Methods. A recovery of > 70% of our initial oligo-pool was obtained in each sample. On average, only 0.2% of the oligos (i.e., ˜25 oligos) was not detected in the nuclear fraction samples, and ˜0.4% (i.e., ˜50 oligos) in the Total (whole-cell lysate) samples. Bar plots showing the mapping percentage for all reads of different samples from nuclear (N) and total fractions (T), separated for technical replicates (TR) and biological replicates (BR). Boxplots showing inter-replicate differences between counts of the same oligo. Low technical variance was detected as indicated by low differences in counts among technical replicates. The solid horizontal line is the median while the lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). The upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge. The lower whisker extends from the hinge to the smallest value at most 1.5 × IQR of the hinge. Data beyond the end of the whiskers are outliers and are plotted individually. CDF plot of the nucleotides overlapping human RRD, mouse RRD, and other nucleotides in the human and mouse FIRRE loci. Similar to the MALAT1 Region M and Region E, the MPRNA recapitulated the function of the known RNA nuclear retention element RRD of the FIRRE locus. Since the experiments were performed in human HeLa cells, human FIRRE RRD was nuclear enriched, while the mouse FIRRE RRD did not influence nuclear enrichment of fsSox2 (P-value: Mann–Whitney test comparison between RRD nucleotides of the human FIRRE transcript versus other non RRD nucleotides in the human and mouse FIRRE transcripts). Differential region-calling correctly identified nuclear retention elements in FIRRE. Solid lines: per-nucleotide abundances in the nuclear (red) and whole-cell (gray) fractions, modeled for each position along the FIRRE transcript, based on the aggregate behavior of all oligos containing that nucleotide (shaded regions ±SD, medians for six biological replicates). Download figure Download PowerPoint MPRNA-based identification of RNA nuclear enrichment regions Next, we sought to agnostically and systematically investigate nuclear enrichment regions harbored within the selected 38 lncRNAs. Our analysis identified 109 DRs (FDR < 0.1) originating from 29 distinct lncRNAs that were significantly enriched in nuclear fractions relative to whole-cell lysates (Table EV2). Two of these DRs overlap and subsume the MALAT1 Region M while another overlap with Region E (Fig 1E). To confirm that our approach was robust, we compared the significant DRs to all other regions represented in the pool and found them significantly more nuclear-enriched (P < 1/106, Mann–Whitney test; Materials and Methods). The localization patterns of the selected 38 lncRNAs have been previously parsed into five smFISH classes (Cabili et al, 2015). These included lncRNAs ranging from strictly nuclear (Class I) to cytoplasmic (Class V), with three intermediate classes (classes II–IV). The MPRNA discovered DRs derived from lncRNAs in all five FISH classes (Fig 2A–E). To compare DRs found across different FISH classes, we normalized for the length of transcripts tiled across each FISH class. After normalization, we found that the number of DRs per kb was broadly similar within each FISH class (Fig 2F). Interestingly, many Class I lncRNAs harbor multiple DRs, possibly indicating the presence of a redundant nuclear localization motif. For example, we discovered 18 DRs in XIST and 10 DRs in MALAT1 and some of the DRs we discovered in XIST overlap with the previously described XIST repeat elements RepC and RepD (Appendix Fig S1A; Brown et al, 1992). By contrast, we only discovered 1 DR in predominantly cytosolic lncRNAs such as NR_023915 and NR_040001. Interestingly, while 60% of the lncRNAs in the pool were nuclear, 66% of the lncRNAs lacking DRs were predominantly cytosolic. Figure 2. Novel lncRNA nuclear enrichment signals A–E. Identification of differential regions (DRs) within lncRNAs with different subcellular localization patterns. Data are depicted as in Fig 1E. Established subcellular localization patterns range from (A), occupying a single, prominent nuclear focus (ANRIL, FISH Class 1), to (E), exhibiting a diffuse, mostly cytosolic pattern (NR_024412, FISH Class 5; Cabili et al, 2015). F. The number of DRs discovered per 10 kb of lncRNA sequence tiled is similar for each FISH Class. G. DRs are more conserved than most lncRNA sequences. Boxplot of phastCons scores comparing nucleotides within DRs (red), to all other nucleotides within the oligo-pool (gray). P-value: Mann–Whitney Test. The solid horizontal line is the median while the lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). The upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range). The lower whisker extends from the hinge to the smallest value at most 1.5 × IQR of the hinge. Data beyond the end of the whiskers are outliers and are plotted individually. Download figure Download PowerPoint We further analyzed the evolutionary conservation, length distribution, and sequence content of DRs for putative nuclear localization sequences. We used phastCons (Siepel et al, 2005, 2006) scores to assess evolutionary conservation, and we observed significantly higher scores among our DRs than in other lncRNA regions tiled by our MPRNA (Fig 2G; P < 1/106, Mann–Whitney test; Materials and Methods). The lengths of the DRs ranged from 80 to 740 nt, with an average of 300 nt (Appendix Fig S1B). While we detected a weak correlation between the length of a given lncRNA and number of DRs within (Appendix Fig S1C), this analysis is confounded by the different length of lncRNAs across the five FISH classes. Finally, we did not observe a difference in GC content between the DRs and other sequences within the tiled lncRNAs (Appendix Fig S1D). We hypothesized that the identified DRs might harbor common sequence motifs or preferences. To test this, we searched for motifs that were more prevalent among the DRs than in other regions of the lncRNAs, using the MEME software package (Machanick & Bailey, 2011). We identified a 57-nt motif (E-value = 3.7e-10) occurring 18 times exclusively in XIST but not elsewhere in the human genome (Fig 3A–C). Another 15-nt C-rich motif (E-value = 9.0e-10) was found in 52 DRs of 21 different lncRNAs (Fig 3D–F), and we discovered four additional motifs closely related to the ones described here (Appendix Fig S2A–D). Similarly, k-mer analysis (Le Cessie & Van Houwelingen, 1992) revealed several C-rich 4-mers that were mildly predictive of a DR (Appendix Fig S2E). In total, we discovered six motifs and confirmed that the nucleotides overlapping these motifs were significantly enriched in the nucleus (P < 1/106, Mann–Whitney test, Materials and Methods), compared to all other regions tiled in our MPRNA (Fig 3G). Since the C-rich motif occurred in many distinct DRs of diverse lncRNAs, we postulated that this motif could function as a global RNA nuclear localization element. To test this, we examined the nuclear versus cytoplasmic localization of both human lncRNA and mRNA transcripts containing this motif, from the ENCODE consortium fractionation RNA-seq data (ENCODE Project Consortium, 2012). For both lncRNAs and mRNAs, we observed a modest-yet-significant increase (P < 1/106, Mann–Whitney test) in nuclear localization of transcripts containing the C-rich motif across all 11 ENCODE TIER 2 cell lines (Fig 3H and I, Appendix Fig S3). Interestingly, we note that while the effect of the motifs was significant for both lncRNAs and mRNAs, the effect size was larger in motif-harboring lncRNAs. Collectively, these results demonstrate the power of our MPRNA to discover potential functional elements that may be missed by classic RNA localization studies. Figure 3. Motifs enriched in lncRNA nuclear enrichment signals A. Position weight matrix (PWM) for a novel 57-nt motif enriched within the DRs of XIST (E-value < 0.05). B. Occurrences of this motif throughout the XIST locus. C. Multiple sequence alignments of the incidences of the XIST motif (colored nucleotides) within the XIST DRs. Adjoining sequences are colored in gray. D. PWM for a novel C-rich 15-nt motif enriched within the DRs of 21 different lncRNAs (E-value < 0.05). E. The occurrences of this motif throughout the MALAT1 locus. F. Multiple sequence alignments of different instances of this motif (colored nucleotides), as they appear in the DRs of the indicated lncRNAs. G. Oligos bearing the novel motifs described in Fig 2A–F and Appendix Fig S2 are significantly e
Summary One of the biggest surprises since the sequencing of the human genome has been the discovery of thousands of long noncoding RNAs (lncRNAs) 1–6 . Although lncRNAs and mRNAs are similar in many ways, they differ with lncRNAs being more nuclear-enriched and in several cases exclusively nuclear 7,8 . Yet, the RNA-based sequences that determine nuclear localization remain poorly understood 9–11 . Towards the goal of systematically dissecting the lncRNA sequences that impart nuclear localization, we developed a massively parallel reporter assay (MPRA). Unlike previous MPRAs 12–15 that determine motifs important for transcriptional regulation, we have modified this approach to identify sequences sufficient for RNA nuclear enrichment for 38 human lncRNAs. Using this approach, we identified 109 unique, conserved nuclear enrichment regions, originating from 29 distinct lncRNAs. We also discovered two shorter motifs within our nuclear enrichment regions. We further validated the sufficiency of several regions to impart nuclear localization by single molecule RNA fluorescence in situ hybridization (smRNA-FISH). Taken together, these results provide a first systematic insight into the sequence elements responsible for the nuclear enrichment of lncRNA molecules.