EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

Zhongyang Zhang,Haoxiang Cheng,Xiumei Hong,Antonio Fabio Di Narzo,Oscar Franzén,Shouneng Peng,Arno Ruusalepp,Jason C. Kovacic,Johan Björkegren,Xiaobin Wang,Ke Hao

EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

2019

The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at raw data level; (b) assembles individual CNV calls into CNV regions (CNVRs) from multiple existing callers with complementary strengths by a heuristic algorithm; (c) re-genotypes each CNVR with local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries by local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied with confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations