Self-Tuning Spectral Clustering for Full-length Viral Quasispecies Reconstruction with PacBio Long Reads

2017 
Many infectious diseases that have threatened, and still threaten, public health are caused by RNA viruses, including HIV, HCV, influenza virus, Ebola virus and Zika virus. Because of high rates of mutation and recombination, rapidly evolving RNA viruses persist within a host as a collection of closely related variants, referred to as viral quasispecies. Uncovering the genetic diversity of an RNA virus (i.e., inferring viral haplotypes and their proportions in the population) can significantly benefit the study of disease progression, antiviral drug design, vaccine design and viral pathogenesis. Recent advances in PacBio single-molecule sequencing offer sufficient throughput and contiguous long reads (>10 kb) that cover the full length of most genes and RNA viral genomes, making it possible to profile viral populations reliably. However, the relatively high error rate (2-15%) of long-read data requires novel analysis methods to deconvolute sequences derived from complex viral mixtures. We examined samples containing complex mixtures of near-full-length HIV-1 genomes, sequenced as single molecules from near-full-length (9 kb) amplicons taken directly from PCR products, and developed a novel signature-based self-tuning spectral clustering method called SigClust to accurately determine the identity (above 99.5%) and relative abundances of the viral genomes in the mixtures. Results on real HIV-1 and influenza benchmark data sets demonstrate the efficacy and superior performance of SigClust.
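The abstract does not describe the internals of SigClust, so the following is not the authors' implementation. It is a minimal sketch of the general idea behind self-tuning spectral clustering, using the local-scaling affinity of Zelnik-Manor and Perona, applied to toy reads. Everything here is an assumption made for illustration: the integer-coded reads, the Hamming distance, the helper names (`hamming_matrix`, `local_scaling_affinity`, `cluster_reads`), the neighbor rank `k=7`, the simulated 5% substitution rate, and the fixed cluster count (the full self-tuning method also estimates the number of clusters, which is skipped here).

```python
import numpy as np

def hamming_matrix(reads):
    """Pairwise Hamming distances between equal-length integer-coded reads."""
    return (reads[:, None, :] != reads[None, :, :]).sum(axis=2).astype(float)

def local_scaling_affinity(dist, k=7):
    """A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), sigma_i = distance to k-th neighbor."""
    k = min(k, dist.shape[0] - 1)
    sigma = np.sort(dist, axis=1)[:, k]      # column 0 is the self-distance
    sigma = np.maximum(sigma, 1e-12)         # guard against duplicate reads
    affinity = np.exp(-(dist ** 2) / np.outer(sigma, sigma))
    np.fill_diagonal(affinity, 0.0)
    return affinity

def kmeans(points, n_clusters, n_iter=50):
    """Small deterministic k-means with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, n_clusters):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

def cluster_reads(reads, n_clusters, k=7):
    """Spectral clustering of reads with a locally scaled affinity matrix."""
    affinity = local_scaling_affinity(hamming_matrix(reads), k=k)
    degree = np.maximum(affinity.sum(axis=1), 1e-12)
    scale = 1.0 / np.sqrt(degree)
    normalized = affinity * scale[:, None] * scale[None, :]
    _, vecs = np.linalg.eigh(normalized)     # eigenvalues in ascending order
    embedding = vecs[:, -n_clusters:]        # eigenvectors of the largest eigenvalues
    norms = np.maximum(np.linalg.norm(embedding, axis=1, keepdims=True), 1e-12)
    return kmeans(embedding / norms, n_clusters)

# Toy data (hypothetical): two haplotypes differing at 20 of 60 positions,
# sampled 30:10 with a 5% per-base substitution error rate.
rng = np.random.default_rng(0)
genome_len, n_major, n_minor, error_rate = 60, 30, 10, 0.05
hap_a = rng.integers(0, 4, size=genome_len)
hap_b = hap_a.copy()
hap_b[:20] = (hap_b[:20] + 1) % 4

def noisy_reads(hap, n):
    reads = np.tile(hap, (n, 1))
    errors = rng.random(reads.shape) < error_rate
    shift = rng.integers(1, 4, size=reads.shape)   # always a different base
    return np.where(errors, (reads + shift) % 4, reads)

reads = np.vstack([noisy_reads(hap_a, n_major), noisy_reads(hap_b, n_minor)])
labels = cluster_reads(reads, n_clusters=2)
abundances = np.bincount(labels, minlength=2) / len(labels)
print(labels, abundances)
```

In this sketch the relative abundance of each variant is simply the fraction of reads assigned to its cluster. A per-cluster consensus sequence would be the natural next step toward a haplotype estimate, but that step is not shown here.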