Protein Sequence Classification Based on N-Gram and K-Nearest Neighbor Algorithm

2016 
The paper proposes classifying protein sequences with the K-Nearest Neighbor (KNN) algorithm. The N-gram motif extraction method is used to encode biological sequences into feature vectors, and the generated N-grams are represented using a Boolean data representation. The experiments are conducted on a dataset of 717 sequences unequally distributed across seven classes, with a sequence identity of 25%. The number of neighbors in the KNN classifier is varied over 3, 5, 7, 9, 11, 13, and 15. Euclidean distance and the cosine coefficient are used as similarity measures for determining the nearest neighbors. The experimental results show that the cosine measure with 15 neighbors gave the highest accuracy, 84%. The effectiveness of the proposed method is also shown by comparing the experimental results with those of other related methods on the same dataset.
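
The pipeline summarized in the abstract (Boolean N-gram encoding followed by KNN voting under Euclidean or cosine distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the N-gram length, the alphabet, or how ties are broken, so the choice of n = 2, the 20-letter amino acid alphabet, the nearest-first tie-breaking, and all function names and toy sequences below are assumptions made purely for illustration.

```python
import math
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acid letters form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_vocabulary(n=2, alphabet=AMINO_ACIDS):
    """Enumerate every possible n-gram over the alphabet (20**n entries)."""
    return ["".join(p) for p in product(alphabet, repeat=n)]


def boolean_ngram_vector(sequence, vocabulary, n=2):
    """Boolean encoding: 1 if the n-gram occurs in the sequence, else 0."""
    present = {sequence[i:i + n] for i in range(len(sequence) - n + 1)}
    return [1 if gram in present else 0 for gram in vocabulary]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def knn_predict(query, train_vectors, train_labels, k=3,
                distance=cosine_distance):
    """Majority vote among the k training vectors closest to the query.

    Ties in the vote go to the label whose member is nearest to the query.
    """
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical sequences and class labels, not from the paper).
vocab = ngram_vocabulary(n=2)
train = [("ACDACD", "A"), ("ACDACA", "A"), ("WYWYWY", "B"), ("WYWWYY", "B")]
train_vectors = [boolean_ngram_vector(seq, vocab) for seq, _ in train]
train_labels = [label for _, label in train]

query_vector = boolean_ngram_vector("ACDAC", vocab)
pred_cosine = knn_predict(query_vector, train_vectors, train_labels,
                          k=3, distance=cosine_distance)
pred_euclid = knn_predict(query_vector, train_vectors, train_labels,
                          k=3, distance=euclidean_distance)
print(pred_cosine, pred_euclid)
```

To mirror the experiment described above, one would repeat the prediction step for each k in 3, 5, ..., 15 and for both distance functions on held-out sequences, then compare accuracies. The sketch omits that evaluation loop because the abstract does not describe the evaluation protocol.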