CLADES: A classification‐based machine learning method for species delimitation from population genetic data

2018 
Species are considered to be the basic unit of ecological and evolutionary studies. As multi-locus genomic data become increasingly available, there has been considerable interest in using DNA sequence data to delimit species. In this paper, we show that machine learning can be used for species delimitation. No existing species delimitation methods are based on machine learning. Our method treats species delimitation as a classification problem, that is, the problem of identifying the category of a new observation on the basis of training data. Extensive simulation is first conducted over a broad range of evolutionary parameters to generate training data. Each pair of known populations is combined to form a training sample labeled "same species" or "different species". We use a support vector machine (SVM) to train a classifier, using a set of summary statistics computed from the training samples as features. The trained classifier assigns a test sample to one of two outcomes: "same species" or "different species". Given multi-locus genomic data from multiple related organisms or populations, our method (called CLADES) performs species delimitation by first classifying pairs of populations. CLADES then delimits species by maximizing the likelihood of the species assignment for multiple populations. CLADES is evaluated through extensive simulation and is also tested on real genetic data. We show that CLADES is both accurate and efficient for species delimitation when compared with existing methods. CLADES can be especially useful when existing methods have difficulty in delimitation, e.g., with short species divergence times and gene flow.
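The second step described above, combining pairwise classifications into a species assignment for several populations, can be illustrated with a minimal sketch. It is not the CLADES implementation. It assumes that the pairwise classifier supplies a probability of "same species" for every pair of populations, and that the likelihood of an assignment is the product of those probabilities over pairs placed in the same species and of their complements over pairs placed in different species. The abstract does not state the exact form of the likelihood, so that form, the function names (`set_partitions`, `log_likelihood`, `delimit`), and the example probabilities are all hypothetical.

```python
from itertools import combinations
from math import log


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def log_likelihood(partition, p_same, eps=1e-12):
    """Assumed likelihood: independent pairwise 'same species' calls."""
    label = {pop: i for i, block in enumerate(partition) for pop in block}
    total = 0.0
    for a, b in combinations(sorted(label), 2):
        # Clamp so that probabilities of exactly 0 or 1 do not break log().
        p = min(max(p_same[frozenset((a, b))], eps), 1.0 - eps)
        total += log(p) if label[a] == label[b] else log(1.0 - p)
    return total


def delimit(populations, p_same):
    """Return the partition of populations with the highest log-likelihood."""
    return max(set_partitions(list(populations)),
               key=lambda part: log_likelihood(part, p_same))


# Hypothetical pairwise probabilities of "same species" for three populations.
p_same = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.2,
}
best = delimit(["A", "B", "C"], p_same)
species = sorted(sorted(block) for block in best)
```

With these made-up probabilities, `species` groups A and B into one species and leaves C on its own. Enumerating every partition grows very quickly with the number of populations, so this exhaustive search is only practical for a handful of populations; a real implementation would likely need a more efficient search.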