Ensemble K-means for semi-supervised learning in enzymatic activity classification of GH-70 enzymes

2020 
The enzymatic activity classification of GH-70 enzymes is a challenge in bioinformatics because of the high diversity of these sequences. Of the 501 sequences reported when we accessed Cazy.org, only 58 were labeled, falling into 6 EC number classes. In this paper we propose a semi-supervised classification algorithm based on k-mer frequency descriptors, with k equal to 2, 3, 4, 5 and 6, as alignment-free measures extracted from the sequences. The high dimensionality of the k-mer vectors and the increasing number of sequences motivate the use of big data Spark classifiers such as those in Apache MLlib. Specifically, K-means clustering applied iteratively yields multiple results that can be combined into an ensemble in a semi-supervised second-round clustering step capable of detecting groups of similar sequences, including both labeled and unlabeled ones. Finally, external measures validate the ensemble clustering for the labeled sequences. Further improvements in the clustering and ensemble steps could raise the quality of classification.
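
To make the k-mer frequency descriptor concrete, the following is a minimal sketch, not the authors' implementation. It assumes protein sequences over the standard 20-letter amino-acid alphabet and skips any window that contains another symbol; the function name `kmer_frequencies` and the sparse dictionary output are illustrative choices. Under that assumption a dense vector would have 20^k entries (64 million for k = 6), which is why only the observed k-mers are stored here.

```python
from collections import Counter

# Assumption: standard 20 amino-acid alphabet; other symbols (e.g. X) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k, alphabet=AMINO_ACIDS):
    """Return the relative frequency of each k-mer observed in `sequence`.

    The result is a sparse dict {k-mer: frequency}; k-mers that never occur
    are simply absent, which avoids building a dense len(alphabet)**k vector.
    """
    sequence = sequence.upper()
    allowed = set(alphabet)
    # Slide a window of width k over the sequence, one position at a time.
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Keep only windows made entirely of alphabet symbols.
    counts = Counter(w for w in windows if set(w) <= allowed)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def multi_k_descriptor(sequence, ks=(2, 3, 4, 5, 6)):
    """Collect the frequency profiles for several k values in one dict keyed by k."""
    return {k: kmer_frequencies(sequence, k) for k in ks}
```

For example, `kmer_frequencies("ACAC", 2)` sees the windows AC, CA and AC, so AC gets frequency 2/3 and CA gets 1/3. Profiles of this kind, once mapped to a common feature index, are the sort of input a K-means clusterer would consume.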