A High-Quality Melon Genome Assembly Provides Insights into Genetic Basis of Fruit Trait Improvement
Hong Zhang, Xuming Li, Haiyan Yu, Yongbing Zhang, Meihua Li, Haojie Wang, Dengming Wang, Huaisong Wang, Qiushi Fu, Min Liu, Changmian Ji, Liming Ma, Juan Tang, Li Song, Jianshun Miao, Hongkun Zheng, Hongping Yi
Abstract:
Accurate reference genomes have become indispensable tools for characterization of genetic and functional variations. Here we generated a high-quality assembly of the melon Payzawat using a combination of short-read sequencing, single-molecule real-time sequencing, Hi-C, and a high-density genetic map. The final 12 chromosome-level scaffolds cover ∼94.13% of the estimated genome (398.57 Mb). Compared with the published DHL92 genome, our assembly exhibits a 157-fold increase in contig length and remarkable improvements in the assembly of centromeres and telomeres. Six genes within STHQF12.4 on pseudochromosome 12, identified from whole-genome comparison between Payzawat and DHL92, may explain a considerable proportion of the skin thickness. In addition, our population study showed, from a whole-genome perspective, that melon was domesticated multiple times and that melons in China were introduced via different routes. Selective sweeps underlying the genes related to desirable traits, haplotypes of alleles associated with agronomic traits, and the variants from resequencing data enable efficient breeding.
Keywords:
Sequence assembly
Melon
Ignatzschineria larvae is studied for its role in decomposition and disease ecology; however, the type strain reference genome remains fragmented. The current reference genome consists of 61 contigs and is estimated to be 82.18% complete with 10.98% contamination. Here, we announce a hybrid genome assembly that improves on it as a single contig.
Sequence assembly
We present LT1, the first high-quality human reference genome from the Baltic States. LT1 is a female de novo human reference genome assembly, constructed using 57× nanopore long reads and polished using 47× short paired-end reads. We utilized 72 GB of Hi-C chromosomal mapping data for scaffolding, to maximize assembly contiguity and accuracy. The contig assembly of LT1 was 2.73 Gbp in length, comprising 4490 contigs with an NG50 value of 12.0 Mbp. After scaffolding with Hi-C data and manual curation, the final assembly has an NG50 value of 137 Mbp and 4699 scaffolds. Assessment of gene prediction quality using Benchmarking Universal Single-Copy Orthologs (BUSCO) identified 89.3% of the single-copy orthologous genes included in the benchmark. Detailed characterization of LT1 suggests it has 73,744 predicted transcripts, 4.2 million autosomal SNPs, 974,616 short indels, and 12,079 large structural variants. These data may be used as a benchmark for further in-depth genomic analyses of Baltic populations.
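The NG50 values quoted above can be illustrated with a short sketch. It assumes the usual definitions: N50 is the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length, and NG50 uses half of an estimated genome size instead. The function name `nx50` and the toy lengths are hypothetical and are not taken from the LT1 data.

```python
def nx50(lengths, total=None):
    """Return N50 (total=None) or NG50 (total=estimated genome size).

    The result is the length L of the contig at which the cumulative sum
    of lengths, taken from longest to shortest, first reaches total / 2.
    Returns 0 if the contigs never reach half of `total`.
    """
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


# Toy contig lengths (hypothetical, arbitrary units).
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = nx50(toy_lengths)               # half of 290 is 145, reached at 70
toy_ng50 = nx50(toy_lengths, total=400)   # half of 400 is 200, reached at 50
```

Because NG50 is tied to the genome size rather than to the assembly size, it stays comparable between assemblies of the same genome that differ in total length.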
Sequence assembly
Indel
Benchmark (surveying)
Metagenomic sequencing is a promising way to reconstruct genomes of a large diversity of bacterial species in their environment. To reconstruct the genome of a single species of interest, current approaches require the metagenomic assembly of the whole community. This method appears to be unnecessarily computationally intensive and error prone, in particular when closely related species are present, as highlighted by the results of the CAMI challenge (Sczyrba2017). A solution to enable targeted genome assembly from metagenomic samples is to use the information of a reference genome as a backbone for the assembly. However, none of the existing reference-guided assembly tools takes into account the specificities of metagenomic data, including high volume and heterogeneous genotypes.
In this work, we propose a two-step reference-guided assembly method tailored for metagenomic data. First, a subset of the reads belonging to the species of interest is recruited by mapping and assembled into backbone contigs. The gapfiller MindTheGap is then used to perform an all-versus-all contig gapfilling and assemble the missing regions between the backbone contigs, which are regions that differ from the reference genome. The MindTheGap algorithm makes no assumption on the synteny of backbone contigs, the potential structural variations within the sample, or the length of the missing regions. The result of the method is a genome assembly graph in GFA format, accounting for the structure of the assembled genome, including the potential structural variations identified within the sample. This hybrid approach does not require a closely related reference and yet enables the targeted assembly of a species of interest from potentially large metagenomic read sets.
This approach was applied in the context of the pea aphid microbiome and outperformed several alternative strategies. Starting from a remote reference genome, we were able to assemble the full circular sequence of Buchnera aphidicola, symbiont of the pea aphid. MindTheGap was also able to assemble full circular sequences of APSE bacteriophages, including coexisting strains within the same read sample differing by large structural variants, such as novel virulence cassettes of several kilobases.
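Since the method's output is an assembly graph in GFA format, a minimal reader may help show what such a file carries. The sketch below assumes GFA version 1 conventions (tab-separated `S` segment lines and `L` link lines) and ignores every other record type and all optional tags. The function name `parse_gfa` and the toy graph are hypothetical and are not MindTheGap output.

```python
def parse_gfa(text):
    """Parse the S (segment) and L (link) lines of a GFA1 string.

    Returns (segments, links): segments maps segment name -> sequence,
    links is a list of (from, from_orient, to, to_orient, overlap) tuples.
    All other record types and optional tags are ignored.
    """
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            links.append(tuple(fields[1:6]))
    return segments, links


# Toy graph (hypothetical): two backbone contigs joined by one gap-fill.
toy_gfa = (
    "H\tVN:Z:1.0\n"
    "S\tc1\tACGT\n"
    "S\tc2\tGGTA\n"
    "S\tgap1\tTTT\n"
    "L\tc1\t+\tgap1\t+\t0M\n"
    "L\tgap1\t+\tc2\t+\t0M\n"
)
toy_segments, toy_links = parse_gfa(toy_gfa)
```

In such a graph, alternative gap-fills between the same pair of contigs would appear as several segments linked to both, which is one way coexisting structural variants can be represented.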
Sequence assembly
Synteny
Background: Continued advances in next-generation sequencing (NGS) technologies have been accompanied by the development of many whole-genome assembly approaches that convert short sequences (reads) into larger regions (contigs/scaffolds). However, none of these is perfect. Up to now, genome assembly data have been compared using standard statistics (N50, coverage, contig sizes, number of bases, etc.), and there is no commonly accepted, standardized method for comparing and assessing assembly data. Methods & Materials: The raw data for S. aureus SA957 (paired-end sequencing, SRR497751), produced on the Illumina platform, were downloaded from the European Nucleotide Archive. Tadpole, Velvet, CLC Genomics Workbench, and SeqMan NGen for de novo assembly, and Bowtie2, BWA, and CLC Genomics Workbench for mapping to a reference, were used with default options to produce contigs/consensus sequences. The quality of the genome assemblies was assessed using wgMLST as implemented in SeqSphere software (Ridom). Concordance of genome assembly data was estimated with the Rand index. Results: Seven genomes were assembled using de novo and reference mapping methods. The basic statistics of the de novo assemblies showed that CLC Genomics Workbench gave the best assembly. The statistics of the Tadpole assembly were poorer than those of the others (Table 1). wgMLST analysis based on 2787 genes was performed on the seven assembled genomes. The CLC, Velvet, and SeqMan NGen assemblies allowed more than 2500 genes to be determined, while the Tadpole assembly, with an average contig length of 1378 bp, identified 1560 of the 2787 genes (Table 2). Reference mapping assemblies revealed high concordance (98-99%) between results. A minimum spanning tree clustered the reference mapping results (picture 1).
Conclusion: A standardized gene-by-gene wgMLST approach allows not only the quality of genome assembly data to be assessed but also its quantitative estimation. Based on this approach, genomes assembled with different software can be compared and clustered to find approaches that give similar results. wgMLST also makes it possible to find discrepancies at the gene level.
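The Rand index used above for concordance can be sketched as follows. The abstract does not say which partitions were compared, so this is only the textbook definition: given two labelings of the same items, the index is the fraction of item pairs on which both labelings agree (both group the pair together, or both keep it apart). The function name `rand_index` and the toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree.

    A pair agrees when both labelings put the two items in the same
    group, or both put them in different groups.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Toy example (hypothetical): group labels assigned to four items
# by two different assemblies.
toy_a = [0, 0, 1, 1]
toy_b = [0, 0, 1, 2]
toy_score = rand_index(toy_a, toy_b)  # 5 of the 6 pairs agree
```

The index depends only on which items are grouped together, so renaming the labels of either partition leaves the score unchanged.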
Sequence assembly
Workbench
De novo DNA sequence assembly is very important in genome sequence analysis. In this paper, we investigated two of the major approaches for de novo DNA sequence assembly of very short reads: overlap-layout-consensus (OLC) and Eulerian path. From that investigation, we developed a new assembly technique by combining the OLC and the Eulerian path methods in a hierarchical process. The contigs yielded by these two approaches were treated as reads and were assembled again to yield longer contigs. We tested our approach using three real very-short-read datasets generated by an Illumina Genome Analyzer and four simulated very-short-read datasets that contained sequencing errors. The sequencing errors were modeled based on Illumina's sequencing technology. As a result, our combined approach yielded longer contigs than those of Edena (OLC) and Velvet (Eulerian path) in various coverage depths and was comparable to SOAPdenovo, in terms of N50 size and maximum contig lengths. The assembly results were also validated by comparing contigs that were produced by assemblers with their reference sequence from an NCBI database. The results show that our approach produces more accurate results than Velvet, Edena, and SOAPdenovo alone. This comparison indicates that our approach is a viable way to assemble very short reads from next generation sequencers.
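To make the Eulerian path half of this comparison concrete, here is a minimal sketch of de Bruijn graph assembly on error-free toy reads. It is not the combined hierarchical method described above, nor the algorithm of Velvet or any other named assembler. The function names, the reads, and the choice k = 4 are hypothetical, and real data would need error correction and repeat handling.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def build_graph(reads, k):
    """De Bruijn graph: each distinct k-mer is an edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in sorted({km for read in reads for km in kmers(read, k)}):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    indegree = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            indegree[node] += 1
    nodes = set(graph) | set(indegree)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - indegree[n] == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the k-mer graph."""
    path = eulerian_path(build_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy error-free reads (hypothetical) drawn from ATGGCGTGCA.
toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
toy_contig = assemble(toy_reads, k=4)
```

In this toy case every 3-mer of the source sequence is unique, so the graph is a single chain and the path spells the sequence back exactly. Repeats longer than k - 1 create branching nodes, which is where Eulerian path assemblers and OLC assemblers start to produce different contigs.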
Sequence assembly
Hybrid genome assembly
Sequence (biology)
Velvet
Path length
k-mer
Although genetic sequencing technology has made great progress, read errors and large repetitive regions still complicate the genome assembly process. Many current assembly methods typically yield only sets of contigs whose relative positions and orientations along the sequenced genome are unknown. To recover the correct and complete sequence, this paper proposes BRS, a new sequence assembly method based on a single reference genome of a similar species. For a given species whose own reference genome is unknown, the method exploits the high genetic similarity between similar species: the reference genome of a similar species is used as an aid, and an alignment tool is used to compare the contig collection against it. We analyze the alignment information, determine the direction and position of the contigs according to the final alignment result, and complete the sequence assembly of the gene. The BRS method is compared with two other common methods, RaGOO and Ragout2, on two bacterial datasets. The experimental results show that this method can indeed achieve good results.
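The general idea of placing contigs by their alignment to a related reference can be sketched as follows. This is not the BRS algorithm, whose details the abstract does not give. It assumes each contig has already been aligned by some external tool, keeps the longest alignment per contig, sorts contigs by reference start coordinate, and takes the orientation from the alignment strand. The `Hit` record and all values are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One contig-to-reference alignment (hypothetical record layout)."""
    contig: str
    ref_start: int  # start coordinate on the reference
    length: int     # aligned length
    strand: str     # "+" or "-" relative to the reference


def order_and_orient(hits):
    """Order contigs along the reference and report their orientation.

    Keeps only the longest alignment per contig, then sorts contigs by
    the reference start of that alignment.
    """
    best = {}
    for hit in hits:
        if hit.contig not in best or hit.length > best[hit.contig].length:
            best[hit.contig] = hit
    ordered = sorted(best.values(), key=lambda hit: hit.ref_start)
    return [(hit.contig, hit.strand) for hit in ordered]


# Toy alignments (hypothetical values).
toy_hits = [
    Hit("c2", 5000, 900, "-"),
    Hit("c1", 100, 1200, "+"),
    Hit("c2", 200, 50, "+"),   # short spurious hit, outvoted by the 900 bp one
    Hit("c3", 9000, 700, "+"),
]
toy_layout = order_and_orient(toy_hits)
```

A contig reported with strand "-" would be reverse-complemented before the ordered contigs are joined into a scaffold.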
Sequence assembly
Hybrid genome assembly
Sequence (biology)
Background: The domestic sheep (Ovis aries) is an important agricultural species raised for meat, wool, and milk across the world. A high-quality reference genome for this species enhances the ability to discover genetic mechanisms influencing biological traits. Furthermore, a high-quality reference genome allows for precise functional annotation of gene regulatory elements. The rapid advances in genome assembly algorithms and emergence of sequencing technologies with increasingly long reads provide the opportunity for an improved de novo assembly of the sheep reference genome. Findings: Short-read Illumina (55× coverage), long-read Pacific Biosciences (75× coverage), and Hi-C data from this ewe retrieved from public databases were combined with an additional 50× coverage of Oxford Nanopore data and assembled with Canu v1.9. The assembled contigs were scaffolded using Hi-C data with Salsa v2.2, gaps were filled with PBsuite v15.8.24, and the assembly was polished with Nanopolish v0.12.5. After duplicate contig removal with PurgeDups v1.0.1, chromosomes were oriented and polished with 2 rounds of a pipeline that consisted of freebayes v1.3.1 to call variants, Merfin to validate them, and BCFtools to generate the consensus FASTA. The ARS-UI_Ramb_v2.0 assembly is 2.63 Gb in length and has improved continuity (contig NG50 of 43.18 Mb), with a 19- and 38-fold decrease in the number of scaffolds compared with Oar_rambouillet_v1.0 and Oar_v4.0. ARS-UI_Ramb_v2.0 has greater per-base accuracy and fewer insertions and deletions identified from mapped RNA sequence than previous assemblies. Conclusions: The ARS-UI_Ramb_v2.0 assembly is a substantial improvement in contiguity that will optimize the functional annotation of the sheep genome and facilitate improved mapping accuracy of genetic variant and expression data for traits in sheep.
Sequence assembly
The buffalo is an integral part of agriculture, particularly within the continent of Asia, providing a source of milk, meat, skin, hides, fertilizer, fuel, and draft power. The efficiency of this animal, compared to that of cattle, is higher in this region, though little is known about the genome sequence of buffalo. The first version of the assembly of a single female Murrah buffalo was constructed with Illumina paired-end and mate-pair short-read sequencing using the cattle genome (Btau 4.0 assembly) as a reference. The assembly has a read depth of 17-19X. The buffalo assembly represents ~91%-95% coverage in comparison to the cattle assembly Btau 4.0. The assembly has 185,150 contigs with a median contig length of 2.3 Kb and a largest contig length of 663 Kb. The mitochondrial genome is fully covered by a single contig. Whole-genome comparison between this assembly and that of cattle revealed 52 million mismatches/indels. The present analysis also unveils about 300 structural variants in the buffalo genome. The buffalo assembly has been integrated into a publicly available genome browser with tracks for read pair insert distances, read depth, nucleotide variations, and coverage, and with custom tracks available to the scientific community. This assembly of the water buffalo is the first deep sequencing project that provides the resources to better understand the genomic basis of adaptable traits and genetic variation that distinguishes buffalo from cattle.
Sequence assembly
Bubalus
Bovine genome
Until recently, read lengths on the Solexa/Illumina system were too short to reliably assemble transcriptomes without a reference sequence, especially for non-model organisms. However, with read lengths up to 100 nucleotides available in the current version, an assembly without reference genome should be possible. For this study we created an EST data set for the common pond snail Radix balthica by Illumina sequencing of a normalized transcriptome. Performance of three different short read assemblers was compared with respect to: the number of contigs, their length, depth of coverage, their quality in various BLAST searches and the alignment to mitochondrial genes. A single sequencing run of a normalized RNA pool resulted in 16,923,850 paired end reads with median read length of 61 bases. The assemblies generated by VELVET, OASES, and SeqMan NGEN differed in the total number of contigs, contig length, the number and quality of gene hits obtained by BLAST searches against various databases, and contig performance in the mt genome comparison. While VELVET produced the highest overall number of contigs, a large fraction of these were of small size (< 200bp), and gave redundant hits in BLAST searches and the mt genome alignment. The best overall contig performance resulted from the NGEN assembly. It produced the second largest number of contigs, which on average were comparable to the OASES contigs but gave the highest number of gene hits in two out of four BLAST searches against different reference databases. A subsequent meta-assembly of the four contig sets resulted in larger contigs, less redundancy and a higher number of BLAST hits. Our results document the first de novo transcriptome assembly of a non-model species using Illumina sequencing data. We show that de novo transcriptome assembly using this approach yields results useful for downstream applications, in particular if a meta-assembly of contig sets is used to increase contig quality. 
These results highlight the ongoing need for improvements in assembly methodology.
Sequence assembly
Illumina dye sequencing