An Information Integration Approach for Classifying Coding and Non-Coding Genomic Data

2014 
Reliable methods for classifying coding and non-coding transcripts in large-scale genomic data will help researchers annotate novel RNA transcripts. In this manuscript we explored several properties that distinguish these two classes of transcripts: features of their secondary structures, differential expression scores obtained from typical RNA-seq experiments, and G+C content scores. We trained two classifiers, Conditional Random Forest (CRF) and Support Vector Machines (SVMs), on the features extracted from the genomic data. We then applied the trained models to a test set comprising both classes of transcripts drawn from three well-known annotation sources, and identified important characteristics of the extracted features with respect to the classification problem. A comparative analysis shows that our method outperforms two existing state-of-the-art methods, CPC (Coding Potential Calculator) and PORTRAIT, in classifying transcripts from the test dataset.
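As an illustration of one of the features named above, the following is a minimal sketch of a G+C content score for a transcript. The abstract does not state how the score is computed, so the definition used here (the fraction of G and C among unambiguous bases) is an assumption, and the function name `gc_content` and the demo sequences are hypothetical.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T/U).

    Assumed definition for illustration; ambiguous symbols such as N
    are ignored, and an empty or fully ambiguous sequence scores 0.0.
    """
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = gc + sum(seq.count(base) for base in "ATU")
    return gc / total if total else 0.0


# Hypothetical transcripts, one written as DNA and one as RNA.
transcripts = {
    "tx_demo_1": "ATGGCGCGTTAA",
    "tx_demo_2": "AUUUAAAUUAGC",
}

# One G+C score per transcript, usable as a single column of a feature matrix.
features = {name: gc_content(seq) for name, seq in transcripts.items()}
```

A score of this kind would be only one column alongside the secondary-structure and differential-expression features before a classifier is trained.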