TERL: Classification of Transposable Elements by Convolutional Neural Networks

2020 
Transposable elements (TEs) are the most represented sequences occurring in eukaryotic genomes. They are capable of transposing and generating multiple copies of themselves throughout genomes. These sequences can produce a variety of effects on organisms, such as regulation of gene expression. There are several types of these elements, which are classified hierarchically into classes, subclasses, orders and superfamilies. Few methods classify these sequences at deeper levels, such as the superfamily level, which could provide useful and detailed information about them. Most methods that classify TE sequences use handcrafted features, such as k-mers, and homology-based search, which can be inefficient for classifying non-homologous sequences. Here we propose a pipeline, transposable elements representation learner (TERL), that uses four preprocessing steps and a transformation of one-dimensional nucleic acid sequences into two-dimensional space data (i.e., image-like data of the sequences), and applies the result to deep convolutional neural networks (CNNs). A CNN is used to classify TE sequences because it is a very flexible classification method: it can be easily retrained to classify different categories and any other DNA sequences. This classification method tries to learn the best representation of the input data in order to classify it correctly. CNNs can also be accelerated via GPUs to provide fast results. We conducted six experiments to test the performance of TERL against other methods. Our approach obtained macro mean accuracy and F1-score of 96.4% and 85.8% for the superfamily sequences from RepBase, and 95.7% and 91.5% for the order sequences from RepBase, respectively. We also obtained macro mean accuracy and F1-score of 95.0% and 70.6% for classifying sequences from seven databases at the superfamily level, and 89.3% and 73.9% at the order level, respectively.
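The macro mean accuracy and F1-score figures above can be read with the following minimal sketch. It assumes, since the abstract does not spell out the definition, that accuracy is computed per class in a one-vs-rest fashion and then averaged with equal weight per class; the function name `macro_metrics` and the toy labels are hypothetical and are not taken from TERL.

```python
def macro_metrics(y_true, y_pred, classes):
    """Return (macro accuracy, macro F1) under an assumed one-vs-rest definition."""
    n = len(y_true)
    accuracies, f1_scores = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        tn = n - tp - fp - fn
        # One-vs-rest accuracy counts true negatives as correct decisions.
        accuracies.append((tp + tn) / n)
        # F1 ignores true negatives; define it as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro averaging: every class contributes equally, regardless of its size.
    return sum(accuracies) / len(classes), sum(f1_scores) / len(classes)


# Toy example with made-up labels (not data from the paper).
y_true = ["LTR", "LTR", "LINE", "TIR"]
y_pred = ["LTR", "LINE", "LINE", "TIR"]
macro_acc, macro_f1 = macro_metrics(y_true, y_pred, ["LTR", "LINE", "TIR"])
```

Under this assumed definition, true negatives raise per-class accuracy but do not enter the F1-score, which is one way an accuracy figure can sit well above the corresponding F1-score.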
We surpassed the accuracy, recall and specificity obtained by other methods in the experiment classifying order-level sequences from seven databases, and were by far faster than any other method in all experiments. We also show a way to preprocess sequences and prepare train and test sets. Therefore, TERL can learn to predict any hierarchical level of the TE classification system, is on average 162 times and four orders of magnitude faster than TEclass and PASTEC, respectively, and in a real-world scenario obtained better accuracy, recall, and specificity than the other methods.
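The idea of turning a one-dimensional nucleic acid sequence into image-like two-dimensional data can be illustrated with a one-hot encoding. This is only a sketch under stated assumptions: the abstract does not give the alphabet, the row order, the padding scheme, or the four preprocessing steps, so the `ALPHABET` constant, the zero-column padding, and the function name `one_hot_2d` are all hypothetical choices rather than TERL's actual implementation.

```python
ALPHABET = "ACGTN"  # assumed symbol set and row order; N stands for any unknown base


def one_hot_2d(seq, length):
    """Encode a DNA string as a len(ALPHABET) x length matrix of 0s and 1s.

    Sequences longer than `length` are truncated; shorter ones are padded
    on the right with all-zero columns (an assumed padding scheme).
    """
    seq = seq.upper()[:length]
    matrix = [[0] * length for _ in ALPHABET]
    for col, base in enumerate(seq):
        row = ALPHABET.find(base)
        if row == -1:
            # Map any symbol outside the alphabet to the N row.
            row = ALPHABET.index("N")
        matrix[row][col] = 1
    return matrix


# Each row is one symbol, each column one sequence position.
encoded = one_hot_2d("acgx", 6)
for symbol, row in zip(ALPHABET, encoded):
    print(symbol, row)
```

A matrix of this shape can be fed to a convolutional layer in the same way as a single-channel image, and a fixed `length` gives every sequence the same input dimensions.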