Computational methods for the identification and quantification of transcript isoforms from next generation sequencing data

Foivos Gypas

2018

Most mammalian genes have multiple isoforms, which are generated through the use of alternative transcription initiation sites, termination sites and internal exons. High-throughput sequencing technologies have enabled the discovery and quantification of many novel RNA species, including protein-coding RNAs, microRNAs, long non-coding RNAs and others. Currently, sequencing mostly yields short reads rather than full-length transcripts. Computational methods are therefore needed to reconstruct transcripts and infer their expression levels from RNA-seq data, which is challenging because of the many biases introduced during sample preparation. The main aim of my thesis was to improve approaches to isoform reconstruction and quantification. I started by evaluating the performance of isoform quantification methods using two complementary test data sets. The first was generated by simulating short-read sampling from in silico transcriptomes with known transcript abundances, and the second by preparing and sequencing in parallel both RNA-seq and 3’ end sequencing reads from the same population of cells. Many of the benchmarked methods performed comparably well, while a few were outstanding. All methods, however, produced more accurate gene-level estimates than commonly used count-based methods. I have set up a complementary web service that developers of isoform quantification methods can use to compare the accuracy of their approach with that of the methods we have already surveyed. Transcript quantification methods generally start from annotated transcripts, whose abundance is then estimated. However, many isoforms are still to be identified. Currently available RNA-seq-based transcript reconstruction methods are insufficiently accurate, especially in the identification of transcript 5’ or 3’ ends.
A catalog of poly(A) sites in the human and mouse genomes that our group constructed contains thousands of poly(A) sites located in regions that are currently annotated as intergenic or intronic. These sites indicate that many transcripts are yet to be annotated. Towards this goal, we developed the Terminal Exon Characterization tool (TECtool), which uses annotated intronic poly(A) sites together with RNA-seq data to reconstruct terminal exons and the associated transcript isoforms. Applying TECtool to various datasets, we identified many novel tissue-specific transcripts, particularly from testis and bone marrow. Single-cell data indicate that the relatively low expression of these transcripts is not due to low expression levels in individual cells, but rather to expression in smaller subpopulations of cells. Ribosome profiling data suggest that novel transcript isoforms lead to the production of new proteins. TECtool can enrich the existing transcript annotation and support improved estimation of transcript isoform abundance. These in turn are relevant for the identification of binding sites for various regulators (miRNAs, RBPs) and for the annotation of protein domains. Besides developing novel tools, I have put much effort into their automation, in line with current efforts towards reproducibility of data analysis.
