Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms

2020 
Third-generation sequencing, also called long-read sequencing, has revolutionized genome assembly: as PacBio and Nanopore technologies have become more accessible, both technically and financially (with decreasing error rates and increasing read lengths), long-read assemblers have flourished and are starting to deliver chromosome-level assemblies. However, an independent, comparative assessment of the performance of these programs on a common, real-life dataset is still lacking. To fill this gap, we tested the efficiency of long-read assemblers on the genome of the rotifer Adineta vaga, a non-model organism for which both PacBio and Nanopore reads were available. Although all the assemblers included in our benchmark aim to produce a haploid genome assembly with collapsed haplotypes, we observed strikingly different behaviors of these assemblers on highly heterozygous regions: the most divergent allelic regions were sometimes not merged, resulting in variable amounts of duplicated regions. We identified three strategies to alleviate this problem: setting a read-length threshold to filter out shorter reads; choosing an assembler less prone to retaining uncollapsed haplotypes; and post-processing the assembled set of contigs using a downstream tool to remove uncollapsed haplotypes. These three strategies are not mutually exclusive and, when combined, generate haploid assemblies with genome sizes, coverage distributions, and k-mer completeness matching expectations.
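The first of these strategies, read-length filtering, can be illustrated with a minimal sketch. It assumes uncompressed FASTQ input made of well-formed four-line records; the function names `parse_fastq` and `filter_by_length` and the threshold used in the usage lines are hypothetical, chosen only for illustration, and are not the tools or values used in the study.

```python
import itertools
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality); a hypothetical record type for this sketch
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming four lines per record."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = list(itertools.islice(stripped, 4))
        if len(record) < 4:
            return  # end of input, or a truncated trailing record
        header, sequence, _separator, quality = record
        yield header, sequence, quality


def filter_by_length(
    records: Iterable[FastqRecord], min_length: int
) -> Iterator[FastqRecord]:
    """Keep only reads whose sequence is at least min_length bases long."""
    for record in records:
        if len(record[1]) >= min_length:
            yield record


# Usage with a tiny in-memory example; the threshold of 10 is arbitrary.
reads = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n",
]
kept = list(filter_by_length(parse_fastq(reads), min_length=10))
```

In practice the threshold would be chosen from the read-length distribution of the dataset, trading sequencing depth against the ability of longer reads to span divergent allelic regions.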