Rapid hybrid de novo assembly of a microbial genome using only short reads: Corynebacterium pseudotuberculosis I19 as a case study
Louise CerdeiraAdriana Ribeiro CarneiroRommel Thiago Jucá RamosS. AlmeidaVivian D ́AfonsecaMaria Paula Cruz SchneiderJan BaumbachAndreas TauchJohn Anthony McCullochVasco AzevedoArtur Silva
55
Citation
20
Reference
10
Related Paper
Citation Trend
Abstract:
Due to the advent of the so-called Next-Generation Sequencing (NGS) technologies the amount of monetary and temporal resources for whole-genome sequencing has been reduced by several orders of magnitude. Sequence reads can be assembled either by anchoring them directly onto an available reference genome (classical reference assembly), or can be concatenated by overlap (de novo assembly). The latter strategy is preferable because it tends to maintain the architecture of the genome sequence the however, depending on the NGS platform used, the shortness of read lengths cause tremendous problems the in the subsequent genome assembly phase, impeding closing of the entire genome sequence. To address the problem, we developed a multi-pronged hybrid de novo strategy combining De Bruijn graph and Overlap-Layout-Consensus methods, which was used to assemble from short reads the entire genome of Corynebacterium pseudotuberculosis strain I19, a bacterium with immense importance in veterinary medicine that causes Caseous Lymphadenitis in ruminants, principally ovines and caprines. Briefly, contigs were assembled de novo from the short reads and were only oriented using a reference genome by anchoring. Remaining gaps were closed using iterative anchoring of short reads by craning to gap flanks. Finally, we compare the genome sequence assembled using our hybrid strategy to a classical reference assembly using the same data as input and show that with the availability of a reference genome, it pays off to use the hybrid de novo strategy, rather than a classical reference assembly, because more genome sequences are preserved using the former.Keywords:
Sequence assembly
Hybrid genome assembly
Corynebacterium pseudotuberculosis
De Bruijn graph
Comparative Genomics
In silico studying a genome requires two steps: sequencing it with cloning and cutting the genome in several reads, and then, assembling the reads. It is well known that the number of sequencing errors is proportional to the reads' size. However, the use of long reads can be an advantage against genome repeated regions issues. De novo is an assembly method which does not use a reference. The purpose of the described here tool, named LOREAS, is to be a de novo assembler in two tasks: first, ordering the long reads, and then, obtaining a consensus sequence of the ordered reads. Currently, only the first task was realised. While other de novo long reads assemblers use heuristics and De Bruijn graphs, LOREAS is based on overlaps similarity between all the long reads. It uses integer linear programing, to find the heaviest path in a graph $G= (V,E,λ)$, where V is the vertices set corresponding to the long reads set, E the set of edges associated with the overlaps between long reads – weighted by λ: the overlap length. When this graph is too huge, the set of reads V is partitioned in several parts. Then, all the parts are solved sequentially. Here we present the solution concerning the first task related to ten bacteria genomes. Seven of them have been succefully solved for less than 12 minutes on a laptop.
De Bruijn graph
Sequence assembly
Hybrid genome assembly
k-mer
Heuristics
Cite
Citations (0)
Technologies for next-generation sequencing (NGS) have stimulated an exponential rise in high-throughput sequencing projects and resulted in the development of new read-assembly algorithms. A drastic reduction in the costs of generating short reads on the genomes of new organisms is attributable to recent advances in NGS technologies such as Ion Torrent, Illumina, and PacBio. Genome research has led to the creation of high-quality reference genomes for several organisms, and de novo assembly is a key initiative that has facilitated gene discovery and other studies. More powerful analytical algorithms are needed to work on the increasing amount of sequence data. We make a thorough comparison of the de novo assembly algorithms to allow new users to clearly understand the assembly algorithms: overlap-layout-consensus and de-Bruijn-graph, string-graph based assembly, and hybrid approach. We also address the computational efficacy of each algorithm’s performance, challenges faced by the assem- bly tools used, and the impact of repeats. Our results compare the relative performance of the different assemblers and other related assembly differences with and without the reference genome. We hope that this analysis will contribute to further the application of de novo sequences and help the future growth of assembly algorithms.
De Bruijn graph
Sequence assembly
Hybrid genome assembly
Ion semiconductor sequencing
k-mer
Cite
Citations (24)
We describe a new assembly algorithm, where a genome assembly with low sequence coverage, either throughout the genome or locally, due to cloning bias, is considerably improved through an assisting process via a related genome. We show that the information provided by aligning the whole-genome shotgun reads of the target against a reference genome can be used to substantially improve the quality of the resulting assembly.
Sequence assembly
Hybrid genome assembly
Shotgun
Cite
Citations (71)
Sequence assembly
Hybrid genome assembly
Nanopore
Illumina dye sequencing
k-mer
Cite
Citations (0)
Due to the advent of the so-called Next-Generation Sequencing (NGS) technologies the amount of monetary and temporal resources for whole-genome sequencing has been reduced by several orders of magnitude. Sequence reads can be assembled either by anchoring them directly onto an available reference genome (classical reference assembly), or can be concatenated by overlap (de novo assembly). The latter strategy is preferable because it tends to maintain the architecture of the genome sequence the however, depending on the NGS platform used, the shortness of read lengths cause tremendous problems the in the subsequent genome assembly phase, impeding closing of the entire genome sequence. To address the problem, we developed a multi-pronged hybrid de novo strategy combining De Bruijn graph and Overlap-Layout-Consensus methods, which was used to assemble from short reads the entire genome of Corynebacterium pseudotuberculosis strain I19, a bacterium with immense importance in veterinary medicine that causes Caseous Lymphadenitis in ruminants, principally ovines and caprines. Briefly, contigs were assembled de novo from the short reads and were only oriented using a reference genome by anchoring. Remaining gaps were closed using iterative anchoring of short reads by craning to gap flanks. Finally, we compare the genome sequence assembled using our hybrid strategy to a classical reference assembly using the same data as input and show that with the availability of a reference genome, it pays off to use the hybrid de novo strategy, rather than a classical reference assembly, because more genome sequences are preserved using the former.
Sequence assembly
Hybrid genome assembly
Corynebacterium pseudotuberculosis
De Bruijn graph
Comparative Genomics
Cite
Citations (55)
Sequence assembly
Hybrid genome assembly
De Bruijn graph
Sequence (biology)
Cite
Citations (0)
We describe a new assembly algorithm, where a genome assembly with low sequence coverage, either throughout the genome or locally, due to cloning bias, is considerably improved through an assisting process via a related genome. We show that the information provided by aligning the whole-genome shotgun reads of the target against a reference genome can be used to substantially improve the quality of the resulting assembly.
Sequence assembly
Hybrid genome assembly
Shotgun
Cite
Citations (0)
The development of next-generation sequencing has made it possible to sequence whole genomes at a relatively low cost. However, de novo genome assemblies remain challenging due to short read length, missing data, repetitive regions, polymorphisms and sequencing errors. As more and more genomes are sequenced, reference-guided assembly approaches can be used to assist the assembly process. However, previous methods mostly focused on the assembly of other genotypes within the same species. We adapted and extended a reference-guided de novo assembly approach, which enables the usage of a related reference sequence to guide the genome assembly. In order to compare and evaluate de novo and our reference-guided de novo assembly approaches, we used a simulated data set of a repetitive and heterozygotic plant genome. The extended reference-guided de novo assembly approach almost always outperforms the corresponding de novo assembly program even when a reference of a different species is used. Similar improvements can be observed in high and low coverage situations. In addition, we show that a single evaluation metric, like the widely used N50 length, is not enough to properly rate assemblies as it not always points to the best assembly evaluated with other criteria. Therefore, we used the summed z-scores of 36 different statistics to evaluate the assemblies. The combination of reference mapping and de novo assembly provides a powerful tool to improve genome reconstruction by integrating information of a related genome. Our extension of the reference-guided de novo assembly approach enables the application of this strategy not only within but also between related species. Finally, the evaluation of genome assemblies is often not straight forward, as the truth is not known. Thus one should always use a combination of evaluation metrics, which not only try to assess the continuity but also the accuracy of an assembly.
Sequence assembly
Hybrid genome assembly
k-mer
Cite
Citations (128)
The enormous amount of short reads produced by next generation sequencing (NGS) techniques such as Roche/454, Illumina/Solexa and SOLiD sequencing opened the possibility of de novo genome assembly. Some of the de novo genome assemblers (e.g., Edena, SGA) use an overlap graph approach to assemble a genome, while others (e.g., ABySS and SOAPdenovo) use a de Bruijn graph approach. Currently, the approaches based on the de Bruijn graph are the most successful, yet their performance is far from being able to assemble entire genomic sequences. We developed a new overlap graph based genome assembler called Paired-End Genome ASsembly Using Short-sequences (PEGASUS) for paired-end short reads produced by NGS techniques. PEGASUS uses a minimum cost network flow approach to predict the copy count of the input reads more precisely than other algorithms. With the help of accurate copy count and mate pair support, PEGASUS can accurately unscramble the paths in the overlap graph that correspond to DNA sequences. PEGASUS exhibits comparable and in many cases better performance than the leading genome assemblers.
De Bruijn graph
Sequence assembly
Hybrid genome assembly
k-mer
Cite
Citations (1)
Sequence assembly
Hybrid genome assembly
De Bruijn graph
Identification
Sequence (biology)
k-mer
Cite
Citations (8)