EDAR: an efficient error detection and removal algorithm for next generation sequencing data.

Xiaohong Zhao, Lance E. Palmer, Randall Bolanos, Cristian Mircean, Daniel Fasulo, Gayle M. Wittenberg

2010

Abstract: Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is calculated. The read is evaluated based on the frequency with which each of its k-mers occurs in the complete data set (the k-count). For each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm. Based on the k-count of the cluster center, clusters are classified as error regions or non-error regions. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to detect the location of errors in each putative error region. A read is corrected by removing the errors, thereby creating two or ...
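
The k-mer and k-count steps described in the abstract can be illustrated with a minimal Python sketch. All function names here are hypothetical, and the fixed `threshold` is only a simple stand-in for the variable-bandwidth mean-shift clustering that the paper uses to separate error regions from non-error regions; it is not the authors' implementation.

```python
from collections import Counter


def kmers(read, k):
    """Return all length-k substrings of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across the whole data set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def kcount_profile(read, counts, k):
    """Return the k-count of every k-mer in the read, by start position."""
    return [counts[km] for km in kmers(read, k)]


def low_count_regions(profile, threshold):
    """Return half-open (start, end) ranges of consecutive low k-counts.

    Assumption: a fixed threshold replaces the paper's mean-shift
    clustering, purely to keep the illustration short.
    """
    regions = []
    start = None
    for i, count in enumerate(profile):
        if count < threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    return regions


# Toy data: five identical reads plus one read with a substitution at index 5.
reads = ["ACGTACGTAC"] * 5 + ["ACGTAGGTAC"]
k = 4
counts = count_kmers(reads, k)
profile = kcount_profile(reads[-1], counts, k)
regions = low_count_regions(profile, threshold=3)
print(profile)
print(regions)
```

In this toy data set, every k-mer that overlaps the substituted base is rare, so the low-count run of k-mer start positions brackets the error. Real data would need a data-driven way to separate low from high k-counts, which is the role the clustering step plays in the abstract.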
