Micropeptides, the next best thing after micro-RNA?: combining in silico prediction and ribosome profiling in a genome-wide search for novel micropeptides

2013 
Introduction: It was long assumed that proteins are at least 100 amino acids (AAs) long. Moreover, short translation products (e.g. those encoded by small Open Reading Frames, sORFs) are very difficult to detect, because their short length makes it hard to distinguish true coding ORFs from ORFs occurring by chance. Nevertheless, over the past few years many such non-canonical genes (with ORFs < 100 AAs) have been discovered in different organisms such as Arabidopsis thaliana, Saccharomyces cerevisiae, and Drosophila melanogaster. Thanks to advances in sequencing, bioinformatics and computing power, it is now possible to scan the genome with unprecedented scrutiny, for example in search of this type of small ORF.

Methods: Using bioinformatics methods, we performed a systematic search for putatively functional sORFs in the Mus musculus genome. A genome-wide scan detected all sORFs, which were subsequently analyzed for their coding potential, based on evolutionary conservation at the AA level using UCSC multiple species alignments, and ranked using a Support Vector Machine (SVM) learning model. The ranked sORFs were finally overlapped with ribosome profiling data, which provide experimental evidence of sORF translation. All candidates were visually inspected using an in-house developed genome browser.

Preliminary Data: The genome-wide search for sORFs with sORFfinder resulted in the prediction of 2,414,589 single-exon sORFs with high coding potential, out of a total pool of 40,704,347 sORFs. To assess their peptide-coding potential, all sORFs were analyzed using a UCSC multi-species alignment of 8 vertebrate species. For each sORF a number of basic peptide conservation characteristics were derived. We used an SVM approach to classify the sORFs into a coding and a non-coding group based on these characteristics.
After training the SVM on four fifths of the data and testing it on the remainder, we reached a correct classification for up to 93% of the test cases, with a false positive rate not exceeding 4%. Even with very stringent parameters, this genome-wide in silico prediction approach yields hundreds, even thousands, of possibly interesting sequences. Therefore we reanalyzed ribosome profiling data obtained from a mouse Embryonic Stem Cell (mESC) sample, uniquely mapping the reads to sORFs located in intergenic or ncRNA regions. Retaining only those sORFs that overlap with ribosome profiles at their start position in the harringtonine-treated sample data and that have a sequence coverage of at least 75% relative to the untreated sample data led to a set of 221 intergenic sORFs and 489 sORFs located in ncRNA regions. Restricting the set to lincRNA sORFs, since data point to expression in these regions, further decreases the sample size to 33 sORFs. All sORFs are made accessible through an in-house developed H2G2 genome browser. In addition to the sORF information, static visualization tracks depict genomic annotation from Ensembl, phastCons conservation scores and other relevant information. Experimental ribosome profiling data are incorporated using individual tracks for every analysis on the different samples (with or without harringtonine treatment).
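The two ribosome profiling criteria described above (a read overlapping the sORF start position in the harringtonine-treated sample, and at least 75% sequence coverage in the untreated sample) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes half-open genomic intervals on a single chromosome and strand, and it reads "coverage" as the fraction of sORF bases covered by at least one read. The function names and the example coordinates are hypothetical.

```python
from typing import List, Tuple

# Assumption: half-open [start, end) intervals on one chromosome and strand.
Interval = Tuple[int, int]


def covered_fraction(orf: Interval, reads: List[Interval]) -> float:
    """Fraction of the sORF's bases covered by at least one read."""
    start, end = orf
    if end <= start:
        raise ValueError("sORF interval must have positive length")
    covered = [False] * (end - start)
    for r_start, r_end in reads:
        # Mark only the part of the read that falls inside the sORF.
        for pos in range(max(start, r_start), min(end, r_end)):
            covered[pos - start] = True
    return sum(covered) / len(covered)


def start_is_hit(orf: Interval, reads: List[Interval]) -> bool:
    """True if any read overlaps the first base of the sORF."""
    start = orf[0]
    return any(r_start <= start < r_end for r_start, r_end in reads)


def keep_sorf(
    orf: Interval,
    harringtonine_reads: List[Interval],
    untreated_reads: List[Interval],
    min_coverage: float = 0.75,  # the 75% threshold stated in the text
) -> bool:
    """Apply both criteria: start-site signal and minimum coverage."""
    return (
        start_is_hit(orf, harringtonine_reads)
        and covered_fraction(orf, untreated_reads) >= min_coverage
    )


# Hypothetical example: a 60-base sORF with made-up read positions.
orf = (100, 160)
harr = [(95, 125)]                     # overlaps the start position
untreated = [(100, 130), (135, 160)]   # covers 55 of 60 bases
print(keep_sorf(orf, harr, untreated))
```

A real analysis would work from aligned reads (for example BAM files) and would also have to handle strand, multi-mapping reads and read-length offsets, none of which is modeled here.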