Detecting genetic variants is a major challenge in the diagnosis of human genetic diseases. Some types of variants are detected in routine analyses. Others, such as structural variants of the insertion type, are far more difficult to identify. The development of new sequencing technologies, known as long reads, makes these insertions easier to detect. In particular, they have enabled the generation of reference variant sets of unprecedented quality. Nevertheless, this technology still has weaknesses that prevent its use for variant detection in a clinical setting. It is therefore essential to improve detection tools based on the short-read sequencing technologies used in a medical context. This thesis presents a characterization of the different insertions and of the factors limiting their detection, based on these high-quality reference datasets. Simulating insertions made it possible to quantify the impact of these factors and highlighted the weakness of current tools in detecting insertions and assembling their sequences. These results led to proposals for improving insertion detection tools. Several improvements were thus implemented in the existing tool MindTheGap and overcame some of its limitations.
Abstract Most metazoans are associated with symbionts. Characterizing the effect of a particular symbiont often requires getting access to its genome, which is usually done by sequencing the whole community. We present MinYS, a targeted assembly approach to assemble a particular genome of interest from such metagenomic data. First, taking advantage of a reference genome, a subset of the reads is assembled into a set of backbone contigs. Then, this draft assembly is completed using the whole metagenomic readset in a de novo manner. The resulting assembly is output as a genome graph, enabling different strains with potential structural variants coexisting in the sample to be distinguished. MinYS was applied to 50 pea aphid resequencing samples, with variable diversity in symbiont communities, in order to recover the genome sequence of its obligatory bacterial symbiont, Buchnera aphidicola. It was able to return high-quality assemblies (one contig assembly in 90% of the samples), even when using increasingly distant reference genomes, and to retrieve large structural variations in the samples. Because of its targeted essence, it outperformed standard metagenomic assemblers in terms of both time and assembly quality.
Metagenomic sequencing is a promising way to reconstruct the genomes of a large diversity of bacterial species in their environment. To reconstruct the genome of a single species of interest, current approaches require the metagenomic assembly of the whole community. This strategy is unnecessarily computationally intensive and error prone, in particular when closely related species are present, as highlighted by the results of the CAMI challenge (Sczyrba2017). One way to enable targeted genome assembly from metagenomic samples is to use a reference genome as a backbone for the assembly. However, none of the existing reference-guided assembly tools takes into account the specificities of metagenomic data, including high volume and heterogeneous genotypes.
In this work, we propose a two-step reference-guided assembly method tailored for metagenomic data. First, a subset of the reads belonging to the species of interest is recruited by mapping and assembled into backbone contigs. The gapfiller MindTheGap is then used to perform an all-versus-all contig gapfilling and assemble the missing regions between the backbone contigs, which are the regions that differ from the reference genome. The MindTheGap algorithm makes no assumption about the synteny of the backbone contigs, the potential structural variations within the sample, or the length of the missing regions. The result of the method is a genome assembly graph in GFA format, which represents the structure of the assembled genome, including the potential structural variations identified within the sample. This hybrid approach does not require a closely related reference and yet enables the targeted assembly of a species of interest from potentially large metagenomic read sets.
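To illustrate how a genome graph in GFA format can represent coexisting variants, here is a minimal sketch in Python. It is not the MindTheGap output itself: the segment names (c1, c2, g1, g2), the sequences, and the helper functions are all hypothetical, and the parser only handles forward-strand links with zero overlap. It shows two backbone contigs joined by two alternative gap-filled sequences, and enumerates the resulting paths.

```python
from collections import defaultdict

# Hypothetical GFA 1.0 content: two backbone contigs (c1, c2) joined by two
# alternative gap-fill sequences (g1, g2), as could happen when two strains
# coexist in a sample. All names and sequences are invented for illustration.
GFA_TEXT = """\
H\tVN:Z:1.0
S\tc1\tACGTACGT
S\tg1\tTTT
S\tg2\tTTGGCCATT
S\tc2\tGGATCCAA
L\tc1\t+\tg1\t+\t0M
L\tc1\t+\tg2\t+\t0M
L\tg1\t+\tc2\t+\t0M
L\tg2\t+\tc2\t+\t0M
"""

def parse_gfa(text):
    """Read segments (S lines) and links (L lines) from GFA text.

    Simplification: orientations and overlaps are ignored, so this only
    suits toy graphs with forward-strand, zero-overlap links.
    """
    segments = {}
    links = defaultdict(list)
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links[fields[1]].append(fields[3])
    return segments, links

def enumerate_paths(links, start, end, path=None):
    """Yield every simple path from start to end (depth-first search)."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in links.get(start, []):
        if nxt not in path:  # avoid revisiting a segment
            yield from enumerate_paths(links, nxt, end, path)

segments, links = parse_gfa(GFA_TEXT)
paths = list(enumerate_paths(links, "c1", "c2"))
sequences = ["".join(segments[s] for s in p) for p in paths]

for p, seq in zip(paths, sequences):
    print(" -> ".join(p), seq)
```

Each path through the graph spells one candidate sequence, so the two alternative gap-fills between the same pair of contigs appear as two distinct paths rather than being collapsed into a single consensus.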
This approach was applied in the context of the pea aphid microbiome and outperformed several alternative strategies. Starting from a remote reference genome, we were able to assemble the full circular sequence of Buchnera aphidicola, symbiont of the pea aphid. MindTheGap was also able to assemble full circular sequences of APSE bacteriophages, including coexisting strains within the same read sample differing by large structural variants, such as novel virulence cassettes of several kilobases.
Abstract Since 2009, numerous tools have been developed to detect structural variants (SVs) using short-read technologies. Insertions >50 bp are one of the hardest types to discover and are drastically underrepresented in gold-standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 37% could be discovered with short-read based tools. In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several SV callers. Simulations showed that the factor with the greatest impact was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested SV callers, and they highlighted the lack of sequence resolution for most insertion calls. Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations. Contact wesley.delage@irisa.fr
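The recall figures discussed above can be made concrete with a small sketch that stratifies recall by insertion type. This is an illustration under stated assumptions, not the evaluation procedure of the study: the insertion identifiers, the type labels, and the set of called insertions are invented, and the matching of calls to the truth set is assumed to have been done already.

```python
from collections import Counter

# Hypothetical truth set: (insertion id, insertion type). Invented values.
truth = [
    ("ins1", "tandem_repeat"),
    ("ins2", "tandem_repeat"),
    ("ins3", "tandem_repeat"),
    ("ins4", "mobile_element"),
    ("ins5", "mobile_element"),
    ("ins6", "novel_sequence"),
]

# Hypothetical ids of truth insertions that a caller recovered.
called = {"ins2", "ins4", "ins5"}

def recall_by_type(truth, called):
    """Return recall (found / total) for each insertion type."""
    total = Counter(kind for _, kind in truth)
    found = Counter(kind for ins_id, kind in truth if ins_id in called)
    return {kind: found[kind] / total[kind] for kind in total}

recall = recall_by_type(truth, called)
overall = len({ins_id for ins_id, _ in truth} & called) / len(truth)

for kind, value in recall.items():
    print(f"{kind}: {value:.2f}")
print(f"overall: {overall:.2f}")
```

Stratifying in this way shows how a moderate overall recall can hide a category that is almost never recovered, which is the kind of effect the classification is meant to expose.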
Most metazoans are associated with symbionts. Characterizing the effect of a particular symbiont often requires access to its genome, which is usually obtained by sequencing the whole community. We present MinYS, a targeted assembly approach to assemble one particular genome of interest from such metagenomic data. First, taking advantage of a reference genome, a subset of the reads is assembled into a set of backbone contigs. Then, this draft assembly is completed using the whole metagenomic readset in a de novo manner. The resulting assembly is output as a genome graph, making it possible to distinguish different strains with potential structural variants coexisting in the sample. MinYS was applied to 50 pea aphid re-sequencing samples, with low and high diversity, in order to recover the genome sequence of its obligatory bacterial symbiont, Buchnera aphidicola. It was able to return high-quality assemblies (one contig assembly in 90% of the samples), even when using increasingly distant reference genomes, and to retrieve large structural variations in the samples. Because of its targeted nature, it outperformed standard metagenomic assemblers in terms of both time and assembly quality.
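The first step described above, selecting the subset of reads that relates to the reference genome, can be sketched with a toy example. Note the substitution: this sketch uses exact shared k-mers as a simple stand-in for read mapping, and it is not how MinYS recruits reads. The function names, the parameters `k` and `min_shared`, and all sequences are hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def recruit_reads(reads, reference, k=5, min_shared=2):
    """Keep reads sharing at least min_shared k-mers with the reference.

    Simplification: forward strand only, exact k-mer matches only.
    """
    ref_kmers = kmers(reference, k)
    return [r for r in reads if len(kmers(r, k) & ref_kmers) >= min_shared]

# Invented toy data: two reads come from the reference, two do not.
reference = "ACGTTGCATGCCGTA"
reads = [
    "GTTGCATG",  # substring of the reference
    "TTTTTTTT",  # unrelated
    "CATGCCGT",  # substring of the reference
    "GGGGACGT",  # unrelated
]

recruited = recruit_reads(reads, reference)
print(recruited)
```

Only the recruited reads would then be assembled into backbone contigs, while the full read set remains available for the later de novo completion step.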
Cases of emergence of novel plant-pathogenic strains that reduce the yields of crops and trees are regularly reported. However, the molecular mechanisms underlying such emergence are still poorly understood. The acquisition by environmental non-pathogenic strains of novel virulence genes by horizontal gene transfer has been suggested as a driver for the emergence of novel pathogenic strains. In this study, we tested this hypothesis by transferring a plasmid encoding the type 3 secretion system (T3SS) and four associated type 3 secreted proteins (T3SPs) to the non-pathogenic strains of Xanthomonas CFBP 7698 and CFBP 7700, which lack genes encoding T3SS and any previously known T3SPs. The resulting strains were phenotyped on Nicotiana benthamiana using chlorophyll fluorescence imaging and image analysis. Wild-type, non-pathogenic strains induced a hypersensitive response (HR)-like necrosis, whereas strains complemented with T3SS and T3SPs suppressed this response. Such suppression depends on a functional T3SS. Amongst the T3SPs encoded on the plasmid, Hpa2, Hpa1 and, to a lesser extent, XopF1 collectively participate in suppression. Monitoring of the population sizes in planta showed that the sole acquisition of a functional T3SS by non-pathogenic strains impairs growth inside leaf tissues. These results provide functional evidence that the acquisition via horizontal gene transfer of a T3SS and four T3SPs by environmental non-pathogenic strains is not sufficient to make strains pathogenic. In the absence of a canonical effector, the sole acquisition of a T3SS seems to be counter-selective, and further acquisition of type 3 effectors is probably needed to allow the emergence of novel pathogenic strains.
Abstract Background Since 2009, numerous tools have been developed to detect structural variants using short-read technologies. Insertions >50 bp are one of the hardest types to discover and are drastically underrepresented in gold-standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read based tools.
Results In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions and 70% were located in simple repeats. Consequently, the recall of short-read based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the factor with the greatest impact was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested structural variant callers, and they highlighted the lack of sequence resolution for most insertion calls.
Conclusions Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations.
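The notion of homology within breakpoint junctions mentioned in the Results can be illustrated with a short sketch. The definition used here is a simplifying assumption rather than the one applied in the study: homology is measured as the number of bases shared between the end of the inserted sequence and the end of the left flank, and between the start of the inserted sequence and the start of the right flank. All function names, coordinates, and sequences are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def junction_homology(reference, pos, inserted):
    """Return (left, right) homology lengths for an insertion.

    The insertion is placed between reference[pos - 1] and reference[pos].
    left  = bases shared by the end of the insert and the end of the left flank
    right = bases shared by the start of the insert and the start of the right flank
    """
    left_flank = reference[:pos]
    right_flank = reference[pos:]
    right = common_prefix_len(inserted, right_flank)
    # A common suffix is a common prefix of the reversed strings.
    left = common_prefix_len(inserted[::-1], left_flank[::-1])
    return left, right

# Invented toy data.
reference = "AAACCGTTTGGA"
pos = 6  # insertion between "AAACCG" and "TTTGGA"
inserted = "TTTGACCG"

left, right = junction_homology(reference, pos, inserted)
print(left, right)
```

When such homology is present, the insertion can be shifted along the reference without changing the resulting sequence, so its breakpoint position is ambiguous, which is one plausible reason why such insertions are harder to report consistently.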