A cautionary note on the use of machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices

2020 
Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. about the spread of antimalarial resistance) and employ methods that characterise parasite population structure. Many of the methods used to characterise structure are algorithms developed in machine learning (ML) and depend on a genetic distance matrix, e.g. principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). However, PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm generates an inferred malaria parasite ancestry. As such, PCoA and HAC can support (e.g. via exploratory visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 P. falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) clear justification of methods used along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of the ML algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models speculative reasoning should feature only as discussion but not as results.
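
The sensitivity of HAC to algorithmic specification can be illustrated with a minimal sketch. The example below is not the authors' pipeline: the function `hac`, the five-sample matrix `toy_dist` and the one-dimensional `positions` it is built from are all hypothetical and chosen only for illustration. It applies a naive agglomerative procedure to the same distance matrix under two common linkage rules and shows that the resulting two-cluster partitions differ.

```python
from itertools import combinations


def hac(dist, n_clusters, linkage):
    """Naive agglomerative clustering on a square distance matrix.

    linkage: "single" (minimum pairwise distance between clusters)
             or "complete" (maximum pairwise distance between clusters).
    Returns a sorted list of clusters, each a sorted tuple of sample indices.
    """
    agg = {"single": min, "complete": max}[linkage]
    clusters = [(i,) for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: agg(dist[i][j] for i in pair[0] for j in pair[1]),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
    return sorted(clusters)


# Hypothetical toy data: five samples placed along a line with slowly
# widening gaps. This is NOT real parasite data.
positions = [0, 10, 21, 33, 46]
toy_dist = [[abs(p - q) for q in positions] for p in positions]

single = hac(toy_dist, 2, "single")
complete = hac(toy_dist, 2, "complete")

print("single linkage:  ", single)    # [(0, 1, 2, 3), (4,)]
print("complete linkage:", complete)  # [(0, 1), (2, 3, 4)]
```

Single linkage chains the samples together and isolates only the last one, whereas complete linkage splits the same samples into groups of two and three. Neither partition is an inferred ancestry; the difference arises solely from the linkage rule, which is the kind of dependence a sensitivity analysis is meant to expose.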