Benchmarking long-read genome sequence alignment tools for human genomics applications

2021 
ABSTRACT

Background: The utility of long-read genome sequencing platforms has been shown in many fields, including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability for producing a genome representation.

Results: For this study, we leveraged publicly available data from sample NA12878 generated on Oxford Nanopore and Pacific Biosciences platforms. Each benchmarked tool, including GraphMap, minimap2, and NGMLR, produced an identical alignment file on repeated runs. However, the tools disagreed widely on which reads to leave unaligned, which affected the final genome coverage and the number and locations of discoverable breakpoints. Only minimap2 was computationally lightweight enough for use at scale. No single tool's alignment resolved all large structural variants (10,000-100,000 base pairs) present in the Database of Genomic Variants (DGV) for sample NA12878. For variants larger than 1,000,000 base pairs, nanopore sequence aligned with minimap2 and with NGMLR, and single-molecule real-time sequence aligned with NGMLR, contained more breakpoints than are present in DGV.

Conclusions: When computational resources are not a limiting factor, best practice should be an analysis pipeline that generates alignments with both minimap2 and NGMLR, because neither alone yields a comprehensive genome representation. When computational resources are limited, minimap2 alone produces human genome alignments sufficient to address hypotheses and generate new questions.
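The disagreement over which reads each tool leaves unaligned can be examined directly from the alignment files. The following is a minimal sketch, not the authors' pipeline: it assumes each aligner's output is available as uncompressed SAM text, and the function names (`unaligned_read_names`, `unique_unaligned`) are hypothetical. It relies only on the SAM convention that FLAG bit 0x4 marks an unmapped record, and it treats a read as unaligned only if none of its records is mapped.

```python
from typing import Dict, Iterable, Set

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unaligned_read_names(sam_lines: Iterable[str]) -> Set[str]:
    """Return names of reads that have no mapped record in the SAM text."""
    seen: Set[str] = set()
    mapped: Set[str] = set()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if not flag & UNMAPPED_FLAG:
            mapped.add(name)
    return seen - mapped


def unique_unaligned(by_tool: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each tool, the reads that only this tool left unaligned."""
    result: Dict[str, Set[str]] = {}
    for tool, own in by_tool.items():
        others: Set[str] = set()
        for other_tool, names in by_tool.items():
            if other_tool != tool:
                others |= names
        result[tool] = own - others
    return result


if __name__ == "__main__":
    # Tiny in-memory stand-ins for two aligners' SAM output (illustrative only).
    tool_a = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\t*",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    ]
    tool_b = [
        "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
        "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\t*",
    ]
    sets = {
        "tool_a": unaligned_read_names(tool_a),
        "tool_b": unaligned_read_names(tool_b),
    }
    print(unique_unaligned(sets))
```

In practice the SAM lines would be read from each tool's output file. Reads that appear in only one tool's unaligned set are the ones that drive the coverage and breakpoint differences described in the Results.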