An Automatic Classification of Genetic Mutations by Exploring Different Classifiers

2021 
This work focuses on automating a task that pathologists currently perform by hand: classifying a given genetic mutation. We use a dataset from a Kaggle competition in which each record has three features (Gene, Variation and Text) and is assigned to one of nine classes. The class labels, provided by genomic researchers, indicate whether a particular mutation is a driver (a cancer-causing mutation) or a passenger (a neutral mutation). Our model performs this labour-intensive classification automatically, saving time and reducing the risk of human error and incorrect analysis. Because the accuracy is moderate, the model outputs a probability for each class rather than a single label, which makes its predictions interpretable: researchers need to examine only the two or so classes with the highest probability and can ignore the rest. We convert the text entries to numerical form using several encoding and embedding techniques and feed the resulting vectors to different classifiers. A Logistic Regression classifier with Term Frequency-Inverse Document Frequency (TF-IDF) encoding achieved the highest accuracy, 67.27%. We attempted to improve on this with Word2Vec and Doc2Vec embeddings, but accuracy decreased because of issues with the dataset, discussed later. With a better dataset, we expect the model to achieve higher accuracy.
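The pipeline described in the abstract (TF-IDF encoding, a logistic-regression classifier with probabilistic output, and review of the two most probable classes) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a dependency-free stand-in that uses a tiny invented corpus with three hypothetical classes instead of the nine Kaggle classes, and the function names (`fit_idf`, `tfidf_vector`, `train`, `top_k`), learning rate and epoch count are all assumptions made for illustration. In practice a library such as scikit-learn would normally be used instead.

```python
import math
from collections import Counter

# Toy corpus invented for illustration only; it is NOT the Kaggle data.
# Three hypothetical classes stand in for the nine classes of the paper.
TRAIN_TEXTS = [
    "truncating mutation loss of function",
    "loss of function truncating deletion",
    "activating mutation gain of function",
    "gain of function amplification activating",
    "neutral variant no functional effect",
    "no effect neutral polymorphism",
]
TRAIN_LABELS = [0, 0, 1, 1, 2, 2]
N_CLASSES = 3


def fit_idf(texts):
    """Learn smoothed inverse document frequencies from whitespace tokens."""
    n_docs = len(texts)
    doc_freq = Counter(tok for text in texts for tok in set(text.lower().split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_vector(text, vocab, idf):
    """Encode one text as an L2-normalised TF-IDF vector over `vocab`."""
    counts = Counter(text.lower().split())  # tokens outside vocab are ignored
    vec = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Return one probability per class for feature vector x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train(X, y, n_classes, lr=0.5, epochs=200):
    """Fit multinomial logistic regression with plain stochastic gradient descent."""
    weights = [[0.0] * len(X[0]) for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = predict_proba(weights, biases, x)
            for c in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the class score.
                grad = probs[c] - (1.0 if c == label else 0.0)
                biases[c] -= lr * grad
                weights[c] = [w - lr * grad * xi for w, xi in zip(weights[c], x)]
    return weights, biases


def top_k(probs, k=2):
    """Indices of the k most probable classes, most probable first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]


idf = fit_idf(TRAIN_TEXTS)
vocab = sorted(idf)
X_train = [tfidf_vector(t, vocab, idf) for t in TRAIN_TEXTS]
weights, biases = train(X_train, TRAIN_LABELS, N_CLASSES)

# Probabilistic output for an unseen (again invented) text.
probs = predict_proba(weights, biases,
                      tfidf_vector("activating gain of function", vocab, idf))
top_two = top_k(probs, k=2)
print("class probabilities:", [round(p, 3) for p in probs])
print("classes to review first:", top_two)
```

Returning the full probability vector rather than a single label is what lets a reviewer restrict attention to the top-ranked classes, as the abstract suggests.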