Reordering Genomic Sequences for Enhanced Classification via Compression Analytics

2019 
The full implications of sharing genomic information are still largely unknown. Understanding which attributes can be inferred from available information is therefore a critical part of genomic privacy and security. We show that compression analytics can classify, or infer, unknown attributes of genomic sequences without a predefined feature set and with very little training data. Compression analytics perform best when the predictable elements within a sequence are local; however, long-range dependencies are ubiquitous in the human genome. We therefore consider a variety of schemes for reordering genomic sequences so as to localize predictable elements and improve the performance of compression analytics. Compression analytics on both native and reordered sequences are shown to outperform more traditional, feature-based machine learning approaches.
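To make the idea of feature-free classification by compression concrete, here is a minimal sketch. It is not the method from this paper. It assumes the normalized compression distance (NCD) as the similarity measure, Python's `zlib` as the compressor, and a nearest-neighbour rule for assigning labels. The abstract specifies none of these, and the function names `compressed_size`, `ncd`, and `classify` are hypothetical.

```python
import zlib
from typing import Dict, List


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: bytes, labeled: Dict[str, List[bytes]]) -> str:
    """Return the label of the training sequence closest to the query under NCD."""
    best_label, best_distance = "", float("inf")
    for label, sequences in labeled.items():
        for sequence in sequences:
            distance = ncd(query, sequence)
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label


# Toy, synthetic sequences (not real genomic data), one training example per class.
training = {
    "A": [b"ACGTTGCA" * 40],
    "B": [b"GGGCCCATAT" * 40],
}
query = b"ACGTTGCA" * 30
predicted = classify(query, training)
```

In this toy setup the query shares its repeating motif with the class "A" example, so compressing the two together costs little more than compressing either alone, and the NCD to "A" is the smaller of the two distances. A window-based compressor such as `zlib` only exploits repeats that fall within its limited look-back window, which illustrates why locality of predictable elements matters for this family of methods.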