The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution
Hélène Badouin, Jérôme Gouzy, Christopher J. Grassa, Florent Murat, S. Evan Staton, Ludovic Cottret, Christine Lelandais‐Brière, Gregory L. Owens, Sébastien Carrère, Baptiste Mayjonade, Ludovic Legrand, Navdeep Gill, Nolan C. Kane, John E. Bowers, Sariel Hübner, Arnaud Bellec, Aurélie Bérard, Hélène Bergès, Nicolas Blanchet, Marie‐Claude Boniface, Dominique Brunel, Olivier Catrice, Nadia Chaidir, Clotilde Claudel, Cécile Donnadieu, Thomas Faraut, Ghislain Fievet, Nicolas Helmstetter, Matthew King, Steven J. Knapp, Zhao Lai, Marie‐Christine Le Paslier, Yannick Lippi, Lolita Lorenzon, Jennifer R. Mandel, Gwenola Marage, Gwenaëlle Marchand, Elodie Marquand, Emmanuelle Bret-Mestries, Evan Morien, Savithri U. Nambeesan, Thuy Nguyen, Prune Pegot-Espagnet, Nicolas Pouilly, Frances Raftis, Erika Sallet, Thomas Schiex, Justine Thomas, Céline Vandecasteele, Didier Varès, Félicity Vear, Sonia Vautrin, Martín Crespi, Brigitte Mangin, John M. Burke, Jérôme Salse, Stéphane Muños, Patrick Vincourt, Loren H. Rieseberg, Nicolas Langlade
Citations (672)
References (56)
Related papers (10)
Abstract:
A high-quality reference for the sunflower genome (Helianthus annuus L.) and analysis of gene networks involved in flowering time and oil metabolism provide a basis for nutritional exploitation and analyses of adaptation to climate change. Nicolas Langlade and colleagues report the genome sequence of the domesticated sunflower, Helianthus annuus L., a global oil crop that can maintain stable yields across a wide range of environmental conditions. Their comparative analyses provide insights into the evolutionary history of Asterids. They also analysed transcriptomic data from vegetative and floral organs, re-sequenced 80 domesticated lines and performed genome-wide association studies identifying 35 loci associated with flowering time. These resources will be useful in breeding programs as well as ecological and evolutionary studies.
The domesticated sunflower, Helianthus annuus L., is a global oil crop that has promise for climate change adaptation, because it can maintain stable yields across a wide variety of environmental conditions, including drought [1]. Even greater resilience is achievable through the mining of resistance alleles from compatible wild sunflower relatives [2,3], including numerous extremophile species [4]. Here we report a high-quality reference for the sunflower genome (3.6 gigabases), together with extensive transcriptomic data from vegetative and floral organs. The genome mostly consists of highly similar, related sequences [5] and required single-molecule real-time sequencing technologies for successful assembly. Genome analyses enabled the reconstruction of the evolutionary history of the Asterids, further establishing the existence of a whole-genome triplication at the base of the Asterids II clade [6] and a sunflower-specific whole-genome duplication around 29 million years ago [7]. An integrative approach combining quantitative genetics, expression and diversity data permitted development of comprehensive gene networks for two major breeding traits, flowering time and oil metabolism, and revealed new candidate genes in these networks. We found that the genomic architecture of flowering time has been shaped by the most recent whole-genome duplication, which suggests that ancient paralogues can remain in the same regulatory networks for dozens of millions of years. This genome represents a cornerstone for future research programs aiming to exploit genetic diversity to improve biotic and abiotic stress resistance and oil production, while also considering agricultural constraints and human nutritional needs [8,9].
Keywords:
Helianthus annuus
plant evolution
Sequence assembly
De novo heterozygous assembly is an ongoing challenge requiring improved assembly approaches. In this study, three strategies were used to develop de novo Vitis vinifera 'Sultanina' genome assemblies for comparison with the inbred V. vinifera (PN40024 12X.v2) reference genome and a published Sultanina ALLPATHS-LG assembly (AP). The strategies were: 1) a default PLATANUS assembly (PLAT_d) for direct comparison with the AP assembly, 2) an iterative merging strategy using METASSEMBLER to combine the PLAT_d and AP assemblies (MERGE) and 3) PLATANUS parameter modifications plus GapCloser (PLAT*_GC). The three new assemblies were greater in size than the AP assembly. PLAT*_GC had the greatest number of scaffolds aligning with a minimum of 95% identity and ≥1000 bp alignment length to the V. vinifera (PN40024 12X.v2) reference genome. SNP analysis also identified additional high-quality SNPs. A greater number of sequence reads mapped back with zero mismatches to PLAT_d, MERGE, and PLAT*_GC (>94%) than to the AP assembly (87%), indicating greater fidelity to the original sequence data in the new assemblies than in the AP assembly. A de novo gene prediction conducted using seedless RNA-seq data predicted >30,000 coding sequences for each of the three new de novo assemblies, with the greatest number (30,544) in PLAT*_GC, and only 26,515 for the AP assembly. Transcription factor analysis indicated good family coverage, but some genes found in the VCOST.v3 annotation were not identified in any of the de novo assemblies, particularly some from the MYB and ERF families. PLAT_d and PLAT*_GC had a greater number of synteny blocks with the V. vinifera (PN40024 12X.v2) reference genome than AP or MERGE. PLAT*_GC provided the most contiguous assembly with only 1.2% scaffold N, in contrast to AP (10.7% N), PLAT_d (6.6% N) and MERGE (6.4% N). A PLAT*_GC pseudo-chromosome assembly with chromosome alignment to the V. vinifera (PN40024 12X.v2) reference genome provides new information for use in seedless grape genetic mapping studies. An annotated de novo gene prediction for the PLAT*_GC assembly, aligned with VitisNet pathways, provides a new seedless-grapevine-specific transcriptomic resource that has excellent fidelity with the seedless short-read sequence data.
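The "scaffold N" percentages quoted above measure how much of an assembly is unresolved gap filler. A minimal sketch of that calculation follows, assuming scaffolds are already loaded as plain strings; the function name `percent_n` is hypothetical and is not taken from the study's tooling.

```python
def percent_n(scaffolds):
    """Return the percentage of bases that are 'N' across all scaffolds.

    Assumption: scaffolds is an iterable of sequence strings, and gap
    bases are written as 'N' or 'n'.
    """
    scaffolds = list(scaffolds)
    total = sum(len(s) for s in scaffolds)
    if total == 0:
        raise ValueError("no sequence data")
    n_count = sum(s.upper().count("N") for s in scaffolds)
    return 100.0 * n_count / total


# Two toy scaffolds: 4 gap bases out of 20 total.
print(percent_n(["ACGTNNNNAC", "ACGTACGTAC"]))  # 20.0
```

A lower value means fewer gaps between joined contigs, which is the sense in which the abstract calls PLAT*_GC the most contiguous assembly.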
Sequence assembly
MYB
Merge (version control)
Citations (16)
Due to the advent of the so-called Next-Generation Sequencing (NGS) technologies, the monetary and temporal resources required for whole-genome sequencing have been reduced by several orders of magnitude. Sequence reads can be assembled either by anchoring them directly onto an available reference genome (classical reference assembly), or by concatenating them by overlap (de novo assembly). The latter strategy is preferable because it tends to maintain the architecture of the genome sequence; however, depending on the NGS platform used, short read lengths cause tremendous problems in the subsequent genome assembly phase, impeding closure of the entire genome sequence. To address the problem, we developed a multi-pronged hybrid de novo strategy combining De Bruijn graph and Overlap-Layout-Consensus methods, which was used to assemble from short reads the entire genome of Corynebacterium pseudotuberculosis strain I19, a bacterium of immense importance in veterinary medicine that causes Caseous Lymphadenitis in ruminants, principally ovines and caprines. Briefly, contigs were assembled de novo from the short reads and were only oriented using a reference genome by anchoring. Remaining gaps were closed using iterative anchoring of short reads by craning to gap flanks. Finally, we compare the genome sequence assembled using our hybrid strategy to a classical reference assembly using the same data as input and show that, even when a reference genome is available, it pays off to use the hybrid de novo strategy rather than a classical reference assembly, because more genome sequences are preserved using the former.
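The De Bruijn graph idea mentioned above can be illustrated with a toy sketch. It is not the authors' pipeline: it assumes error-free reads on a single strand, and the helper names `de_bruijn_graph` and `walk_unbranched` are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Extend a contig from start while there is exactly one successor.

    Simplification: only out-degree is checked, so this is a toy walk
    rather than a full unitig construction.
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
print(walk_unbranched(graph, "AC"))  # ACGTA
```

Repeats longer than k - 1 create branching nodes where this walk stops, which is one reason short reads leave gaps that the hybrid strategy then closes by anchoring.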
Sequence assembly
Hybrid genome assembly
Corynebacterium pseudotuberculosis
De Bruijn graph
Comparative Genomics
Citations (55)
RNA-Seq is a technology to sequence transcriptomes using next-generation sequencing technologies. It has been widely used for analyses such as gene expression profiling and identification of differentially expressed genes (DEG). This chapter focuses on the design of RNA-Seq experiments and on the bioinformatics issues related to the assembly of RNA-Seq short reads into reference transcriptomes. It presents procedures and command lines for both de novo assembly approaches and reference-sequence-guided assembly approaches. Despite rapid progress in genome sequencing of aquaculture species, reference genome sequences or reference transcriptomes are not yet available for most aquaculture species. If a reference genome sequence is available, reference-guided assembly methods can be used. In contrast, de novo RNA-Seq assembly methods must be used in the absence of a reference genome sequence. TopHat-Cufflinks is the most popular reference-guided assembly method, while Trinity is the most popular de novo assembly method.
Sequence assembly
RNA-Seq
Citations (0)
The recent technological advances in genome sequencing techniques have resulted in an exponential increase in the number of sequenced human and non-human genomes. The ever-increasing number of assemblies generated by novel de novo pipelines and strategies demands the development of new software to evaluate assembly quality and completeness. One way to determine the completeness of an assembly is by detecting its Presence-Absence variations (PAV) with respect to a reference, where PAVs between two assemblies are defined as the sequences present in one assembly but entirely missing in the other one. Beyond assembly error or technology bias, PAVs can also reveal real genome polymorphism, a consequence of species or individual evolution, or horizontal transfer from viruses and bacteria. We present scanPAV, a pipeline for pairwise assembly comparison to identify and extract sequences present in one assembly but not the other. In this note, we use the GRCh38 reference assembly to assess the completeness of six human genome assemblies from various assembly strategies and sequencing technologies including Illumina short reads, 10× Genomics linked-reads, PacBio and Oxford Nanopore long reads, and Bionano optical maps. We also discuss the PAV polymorphism of seven Tasmanian devil whole genome assemblies of normal animal tissues and devil facial tumour 1 (DFT1) and 2 (DFT2) samples, and the identification of bacterial sequences as contamination in some of the tumorous assemblies. The pipeline is available under the MIT License at https://github.com/wtsi-hpag/scanPAV. Supplementary data are available at Bioinformatics online.
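The PAV definition above (sequences present in one assembly but entirely missing from the other) can be illustrated with a deliberately simplified toy. This is not how scanPAV works internally; exact k-mer sharing stands in for alignment here, reverse complements are ignored, and the names `kmers` and `absent_sequences` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def absent_sequences(assembly_a, assembly_b, k=5):
    """Names of sequences in assembly_a sharing no k-mer with assembly_b.

    Assumptions: both assemblies are dicts of name -> sequence string;
    sequences shorter than k are skipped rather than reported.
    """
    b_kmers = set()
    for seq in assembly_b.values():
        b_kmers |= kmers(seq, k)
    absent = []
    for name, seq in assembly_a.items():
        own = kmers(seq, k)
        if own and own.isdisjoint(b_kmers):
            absent.append(name)
    return absent


a = {"s1": "ACGTACGTAC", "s2": "TTTTTGGGGG"}
b = {"c1": "GGACGTACGTACGG"}
print(absent_sequences(a, b))  # ['s2']
```

Swapping the two arguments gives the other direction of the comparison, which matters because presence and absence are not symmetric between assemblies.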
Sequence assembly
De Bruijn graph
Citations (17)
The rapid improvement of next-generation sequencing (NGS) technologies has enabled unprecedented production of huge amounts of DNA sequence data at low cost. However, NGS technologies are still limited to generating short DNA sequences, which has led to the development of many assembly algorithms to recover whole genome sequences from those short sequences. Unfortunately, the assembly algorithms alone can only construct scaffold sequences, which are generally much shorter than chromosome sequences. To generate chromosome sequences, additional expensive experimental data is required. To overcome this problem, many studies have developed new computational algorithms that further merge the scaffold sequences and produce chromosome-level sequences by utilizing an existing genome assembly of a related species, called a reference. However, even though the quality of the chosen reference assembly is critical for generating a good final assembly, its effect is not yet well understood. In this study, we measured the effect of the reference genome assembly on the quality of the final assembly generated by reference-guided assembly algorithms. Using the genome assemblies of a total of eleven reference species (eight primates and three rodents), the human genome sequences were assembled from scaffold sequences by one of the reference-guided assembly algorithms, called RACA, and they were compared with known genome sequences to measure their quality in terms of the number of misassemblies. The effect of the quality of the reference assemblies was investigated in terms of divergence time from human, alignment coverage between the reference and human, and the proportion of core eukaryotic genes included. We found that the divergence time is a good indicator of the quality of the final assembly when high-quality reference assemblies are used. We believe this study will contribute to broadening our understanding of the effect and importance of a reference assembly on the reference-guided assembly task.
Sequence assembly
Merge (version control)
Hybrid genome assembly
Citations (0)
Domestication is an evolutionary process of species divergence in which morphological and physiological changes result from the cultivation/tending of plant or animal species by a mutualistic partner, most prominently humans. Darwin used domestication as an analogy to evolution by natural selection, although there is strong debate on whether this process of species evolution by human association is an appropriate model for evolutionary study. There is a presumption that selection under domestication is strong, and most models assume rapid evolution of cultivated species. Using archaeological data for 11 species from 60 archaeological sites, we measure rates of evolution in two plant domestication traits: nonshattering and grain/seed size increase. Contrary to previous assumptions, we find the rates of phenotypic evolution during domestication are slow, and significantly lower than or comparable to those observed among wild species subjected to natural selection. Our study indicates that the magnitudes of the rates of evolution during the domestication process, including the strength of selection, may be similar to those measured for wild species. This suggests that domestication may be driven by unconscious selection pressures similar to those observed for natural selection, and the study of the domestication process may indeed prove to be a valid model for the study of evolutionary change.
plant evolution
Rate of evolution
Citations (182)
Background: The domestic sheep (Ovis aries) is an important agricultural species raised for meat, wool, and milk across the world. A high-quality reference genome for this species enhances the ability to discover genetic mechanisms influencing biological traits. Furthermore, a high-quality reference genome allows for precise functional annotation of gene regulatory elements. The rapid advances in genome assembly algorithms and emergence of sequencing technologies with increasingly long reads provide the opportunity for an improved de novo assembly of the sheep reference genome.
Findings: Short-read Illumina (55× coverage), long-read Pacific Biosciences (75× coverage), and Hi-C data from this ewe retrieved from public databases were combined with an additional 50× coverage of Oxford Nanopore data and assembled with canu v1.9. The assembled contigs were scaffolded using Hi-C data with Salsa v2.2, gaps filled with PBsuite v15.8.24, and polished with Nanopolish v0.12.5. After duplicate contig removal with PurgeDups v1.0.1, chromosomes were oriented and polished with 2 rounds of a pipeline that consisted of freebayes v1.3.1 to call variants, Merfin to validate them, and BCFtools to generate the consensus fasta. The ARS-UI_Ramb_v2.0 assembly is 2.63 Gb in length and has improved continuity (contig NG50 of 43.18 Mb), with a 19- and 38-fold decrease in the number of scaffolds compared with Oar_rambouillet_v1.0 and Oar_v4.0. ARS-UI_Ramb_v2.0 has greater per-base accuracy and fewer insertions and deletions identified from mapped RNA sequence than previous assemblies.
Conclusions: The ARS-UI_Ramb_v2.0 assembly is a substantial improvement in contiguity that will optimize the functional annotation of the sheep genome and facilitate improved mapping accuracy of genetic variant and expression data for traits in sheep.
Sequence assembly
Citations (52)
The development of next-generation sequencing has made it possible to sequence whole genomes at a relatively low cost. However, de novo genome assemblies remain challenging due to short read length, missing data, repetitive regions, polymorphisms and sequencing errors. As more and more genomes are sequenced, reference-guided assembly approaches can be used to assist the assembly process. However, previous methods mostly focused on the assembly of other genotypes within the same species. We adapted and extended a reference-guided de novo assembly approach, which enables the use of a related reference sequence to guide the genome assembly. In order to compare and evaluate de novo and our reference-guided de novo assembly approaches, we used a simulated data set of a repetitive and heterozygotic plant genome. The extended reference-guided de novo assembly approach almost always outperforms the corresponding de novo assembly program, even when a reference of a different species is used. Similar improvements can be observed in high and low coverage situations. In addition, we show that a single evaluation metric, like the widely used N50 length, is not enough to properly rate assemblies, as it does not always point to the best assembly when evaluated with other criteria. Therefore, we used the summed z-scores of 36 different statistics to evaluate the assemblies. The combination of reference mapping and de novo assembly provides a powerful tool to improve genome reconstruction by integrating information from a related genome. Our extension of the reference-guided de novo assembly approach enables the application of this strategy not only within but also between related species. Finally, the evaluation of genome assemblies is often not straightforward, as the truth is not known. Thus one should always use a combination of evaluation metrics that assess not only the continuity but also the accuracy of an assembly.
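The summed z-score ranking described above can be sketched as follows. The abstract does not say how the 36 statistics were oriented or standardized, so the `higher_is_better` flags, the use of the population standard deviation, and the function name `summed_z_scores` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def summed_z_scores(stats, higher_is_better):
    """Sum per-metric z-scores for each assembly.

    stats: dict of assembly name -> dict of metric name -> value.
    higher_is_better: dict of metric name -> bool (assumed orientation);
    metrics where lower is better have their z-scores negated.
    """
    names = list(stats)
    totals = {name: 0.0 for name in names}
    for metric, better_high in higher_is_better.items():
        values = [stats[name][metric] for name in names]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # metric does not discriminate between assemblies
        sign = 1.0 if better_high else -1.0
        for name, value in zip(names, values):
            totals[name] += sign * (value - mu) / sigma
    return totals


stats = {
    "A": {"n50": 100, "errors": 5},
    "B": {"n50": 300, "errors": 15},
}
print(summed_z_scores(stats, {"n50": True, "errors": False}))
```

In this toy input, assembly B has the larger N50 but also more errors, so the two assemblies tie on the summed score. That mirrors the abstract's point that N50 alone does not always identify the best assembly.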
Sequence assembly
Hybrid genome assembly
k-mer
Citations (128)
Major advances in selection progress for cattle have been made following the introduction of genomic tools over the past 10-12 years. These tools depend upon the Bos taurus reference genome (UMD3.1.1), which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies. We present the new reference genome for cattle, ARS-UCD1.2, based on the same animal as the original to facilitate transfer and interpretation of results obtained from the earlier version, but applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly includes 2.7 Gb and is >250× more continuous than the original assembly, with contig N50 >25 Mb and L50 of 32. We also greatly expanded supporting RNA-based data for annotation that identifies 30,396 total genes (21,039 protein coding). The new reference assembly is accessible in annotated form for public use. We demonstrate that improved continuity of assembled sequence warrants the adoption of ARS-UCD1.2 as the new cattle reference genome and that increased assembly accuracy will benefit future research on this species.
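The contig N50 and L50 figures quoted above follow standard definitions: N50 is the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half the assembly size, and L50 is the number of contigs needed to get there. A minimal sketch, with the hypothetical helper name `n50_l50` and made-up contig lengths:

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contigs")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Total length 290, half is 145: 80 + 70 = 150 reaches it at the 2nd contig.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```

A high N50 together with a low L50, as reported for ARS-UCD1.2, means that a few long contigs account for half of the assembly.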
Sequence assembly
Citations (517)
De novo genome assembly tool comparison for highly heterozygous species Vitis vinifera cv. Sultanina
Vitis vinifera cultivars are widely used for wine, table and raisin production throughout the world. A reference genome for an inbred line is available; however, standard cultivars are highly heterozygous. The heterozygosity makes it difficult to select an optimal assembler for de novo genome assembly. Here we compare de novo genome assemblies of V. vinifera cv. Sultanina, a pivotal table grape genotype, produced by the ALLPATHS-LG and PLATANUS tools. Sequence reads were downloaded from NCBI using study accession SRP026420 and assembled using the PLATANUS and ALLPATHS-LG assemblers. PLATANUS can manage high-throughput data from highly heterozygous samples, while ALLPATHS-LG is used for different types of genomes, such as homozygous prokaryotes and eukaryotes. Comparison of assembly quality using QUAST plots (cumulative length, GC content and NGx) indicated that the PLATANUS results were closer to the V. vinifera reference genome than the ALLPATHS-LG results. The PLATANUS assembly had a greater number of large contigs and scaffolds. The PLATANUS NG50 was two times that of the ALLPATHS-LG NG50. PLATANUS is a suitable tool for de novo genome assembly for V. vinifera cv. Sultanina and other highly heterozygous species.
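NGx differs from Nx in that the target is a fraction of the reference (or estimated) genome size instead of the assembly's own total length, which makes assemblies of different sizes comparable. The sketch below follows that general definition; the function name `ngx`, the toy lengths, and the choice to return `None` when the assembly is too short are assumptions and do not reproduce QUAST's code.

```python
def ngx(lengths, genome_size, x=50):
    """Return the NGx contig length, or None if the assembly never
    covers x percent of genome_size."""
    target = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


contigs = [80, 70, 50, 40, 30, 20]
print(ngx(contigs, genome_size=400))        # 50
print(ngx(contigs, genome_size=400, x=25))  # 70
```

Because both assemblers are scored against the same genome size, a twofold NG50 difference like the one reported above reflects contiguity and not merely a difference in total assembled length.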
Sequence assembly
Citations (2)