Biological Sequence Embedding Based Classification for MERS and SARS

2021 
Biological sequence comparison is one of the key tasks in finding similarities between different species. A primary step in comparing such sequences computationally is to produce vector-space embeddings that capture the most meaningful information from the original sequences. Several methods, such as one-hot encoding and Word2Vec models, have been explored for sequence embeddings. However, these methods either fail to capture similarity information between k-mers or face the challenge of handling Out-of-Vocabulary (OOV) k-mers. In this paper, we conduct an in-depth analysis of sequence embeddings using the Global Vectors (GloVe) model and the FastText n-gram representation. We evaluate their performance using classical Machine Learning algorithms and Deep Learning methods, and compare our results with an existing Word2Vec approach. Results show that FastText n-gram based sequence embeddings provide the most meaningful representations, based on classification accuracy and visualization plots.
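As a minimal illustration of the OOV issue mentioned above, the sketch below splits a sequence into overlapping k-mers and then breaks each k-mer into character n-grams with boundary markers, in the style FastText uses for subwords. The function names, the toy sequence, and the choices k=4 and n-gram lengths 3 to 4 are assumptions made for this example, not settings taken from the paper.

```python
def kmers(sequence, k=4):
    """Split a sequence into overlapping k-mers with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def char_ngrams(token, n_min=3, n_max=4):
    """Return the set of character n-grams of a token.

    The token is wrapped in '<' and '>' boundary markers first,
    as FastText does for its subword units.
    """
    wrapped = f"<{token}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }


# Hypothetical toy sequence, used only for illustration.
tokens = kmers("ATGCGA", k=4)

# Suppose "ATGC" was seen in training and "ATGG" was not.
# The two k-mers still share subword n-grams, so a subword-based
# model can build a vector for the unseen k-mer from shared pieces.
shared = char_ngrams("ATGC") & char_ngrams("ATGG")

print(tokens)
print(sorted(shared))
```

A vocabulary-level model such as Word2Vec has no vector for a k-mer absent from training, whereas a subword model can compose one from n-grams like those in `shared`.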