Beyond Nanopore Sequencing in Space: Identifying the Unknown
Sarah E. Stahl, Aaron S. Burton, Kristen K. John, Sissel Juul, Daniel J. Turner, Eoghan Harrington, Miten Jain, Benedict Paten, Mark Akeson, Sarah L. Castro-Wallace
Abstract:
Astronaut Kate Rubins sequenced DNA on the International Space Station (ISS) for the first time in August 2016 (Figure 1A). A 2D sequencing library containing an equal mixture of lambda bacteriophage, Escherichia coli, and Mus musculus DNA was prepared on the ground with a SQK_MAP006 kit, sent to the ISS frozen, and loaded into R7.3 flow cells. After a total of 9 on-orbit sequencing runs over 6 months, it was determined that there was no decrease in sequencing performance on-orbit compared to ground controls (1). Of the ~280,000 reads generated on-orbit and the ~130,000 generated on the ground, 90% could be attributed to a source organism: 30% to lambda bacteriophage, 30% to Escherichia coli, and 30% to M. musculus (Figure 1B). Extensive bioinformatics analysis determined comparable 2D and 1D read accuracies between flight and ground runs (Figure 1C), and data collected on the ISS were used to construct directed assemblies of the E. coli and lambda genomes at 100% and of the M. musculus mitochondrial genome at 96.7%. These findings validate sequencing as a viable option for potential on-orbit applications such as environmental microbial monitoring and disease diagnosis. Current microbial monitoring of the ISS applies culture-based techniques that provide colony forming unit (CFU) data for air, water, and surface samples. The identity of the cultured microorganisms is unknown until sample return and ground-based analysis, a process that can take up to 60 days. For sequencing to benefit ISS applications, spaceflight-compatible sample preparation techniques are required. Subsequent to the testing of the MinION on-orbit, a sample-to-sequence method was developed using miniPCR™ and basic pipetting, which was only recently proven to be effective in microgravity. The work presented here details the in-flight sample preparation process and the first application of DNA sequencing on the ISS to identify unknown ISS-derived microorganisms.
Keywords:
International Space Station
Environmental metagenomic analysis is typically accomplished by assigning taxonomy and/or function from whole genome sequencing (WGS) or 16S amplicon sequences. Both of these approaches are limited by read length and other technical and biological factors. A nanopore-based sequencing platform, MinION™, produces reads that are ≥10000 bp in length, potentially providing for more precise assignment, thereby alleviating some of the limitations inherent in determining metagenome composition from short reads. We tested the ability of sequence data produced by MinION (R7.3 flow cells) to correctly assign taxonomy in single bacterial species runs and in three types of low complexity synthetic communities: a mixture of DNA using equal mass from four species, a community with one relatively rare (1%) and three abundant (33% each) components, and a mixture of genomic DNA from 20 bacterial strains of staggered representation. Taxonomic composition of the low-complexity communities was assessed by analyzing the MinION sequence data with three different bioinformatic approaches: Kraken, MG-RAST, and One Codex. Long read sequences generated from libraries prepared from single strains using the SQK–MAP005 kit and chemistry, run on the original MinION device, yielded as few as 224 to as many as 3,497 bidirectional high-quality (2D) reads with an average overall study length of 6,000 bp. For the single-strain analyses, assignment of reads to the correct genus by different methods ranged from 53.1% to 99.5%, assignment to the correct species ranged from 23.9% to 99.5%, and the majority of mis-assigned reads were to closely related organisms. A synthetic metagenome sequenced with the same setup yielded 714 high quality 2D reads of approximately 5,500 bp that were up to 98% correctly assigned to the species level. 
Synthetic metagenomes from MinION libraries generated using the SQK–MAP006 kit and chemistry yielded 899 to 3,497 2D reads averaging 5,700 bp in length, with up to 98% assignment accuracy at the species level. The observed community proportions for the “equal” and “rare” synthetic libraries were close to the known proportions, deviating by 0.1% to 10% across all tests. For a 20-species mock community with staggered contributions, a sequencing run detected all but 3 species (each included at [...] 99% of reads were assigned to the correct family.
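The comparison of observed and known community proportions described above can be sketched as follows. This is a minimal illustration and not the study's pipeline: the taxon labels, the read calls, and the choice to report deviation in absolute percentage points are all assumptions made for the example.

```python
from collections import Counter


def observed_proportions(assignments):
    """Return the fraction of classified reads assigned to each taxon."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def deviation_from_expected(observed, expected):
    """Absolute difference, in percentage points, for each expected taxon."""
    return {
        taxon: abs(observed.get(taxon, 0.0) - fraction) * 100
        for taxon, fraction in expected.items()
    }


# Hypothetical per-read species calls for an "equal" four-species mixture.
calls = ["A"] * 27 + ["B"] * 25 + ["C"] * 23 + ["D"] * 25
expected = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
deviations = deviation_from_expected(observed_proportions(calls), expected)
```

A taxon that is expected but never observed gets an observed fraction of zero, so a missed species shows up as a deviation equal to its full expected share.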
Minion
Amplicon
Nanopore
Amplicon sequencing
Citations (0)
Enhanced sequencing capacity may be of particular value in studies of pathogens with extreme levels of intra- and inter-sample polymorphism, such as HIV and HCV. Current bulk sequencing methods rely on iterative primer selection and primer-walking to combat polymorphism efficiently; however, misalignment due to complex polymorphisms with insertions and/or deletions in quasi-species often cannot be resolved easily without more laborious cloning of multiple sequences. We therefore conducted a pilot study to develop and test methods for sequencing HIV using the GS FLX system that are high-throughput and cost-effective and that resolve the complexity associated with diversity. A proof-of-principle run was undertaken in collaboration with Roche at the 454 centre in Branford, Connecticut.
Nebulized PCR amplicons spanning the entire HIV genome from 4, 8 or 12 plasma samples were barcoded, pooled into a single sequencing library and sequenced on a region of the 8-lane 454 picotitre plate. In the same run, a replicate experiment was carried out using the same samples to assess reproducibility.
The number of reads per lane on the picotitre plate ranged from 41,000 to 54,000, with an average read length of 230 bases. High-quality reads from the samples were de-multiplexed and mapped to the reference sequence HXB2 using 454 mapper software. On average, 75-95% of the reference sequence was covered by the mapped reads. The depth of coverage across the genome ranged from 50- to 400-fold, with the most polymorphic region (env) showing the lowest depth of coverage. More detailed analysis of the data and the comparison of the sequences generated using the current sequencing protocol will be presented.
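The two coverage figures reported above (fraction of the reference covered and depth of coverage) can be derived from mapped read positions roughly as in the sketch below. It assumes each mapped read has already been reduced to a hypothetical (start, end) interval on the reference, 0-based with an exclusive end; it is not the 454 mapper software's own calculation.

```python
def depth_per_position(read_intervals, reference_length):
    """Count how many mapped reads cover each reference position.

    read_intervals holds (start, end) pairs, 0-based, end exclusive.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (reference_length + 1)
    for start, end in read_intervals:
        start = max(0, start)
        end = min(reference_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth = []
    running = 0
    for delta in diff[:reference_length]:
        running += delta
        depth.append(running)
    return depth


def coverage_summary(depth):
    """Summarise breadth (fraction covered) and depth of coverage."""
    covered = sum(1 for d in depth if d > 0)
    return {
        "fraction_covered": covered / len(depth),
        "mean_depth": sum(depth) / len(depth),
        "min_depth": min(depth),
        "max_depth": max(depth),
    }


# Toy example: three reads mapped to a 10-base reference.
reads = [(0, 6), (2, 8), (4, 8)]
depth = depth_per_position(reads, 10)
summary = coverage_summary(depth)
```

Restricting `depth` to the slice of positions belonging to one gene (for example env) before calling `coverage_summary` would give a per-region figure.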
Amplicon
Sequence assembly
Replicate
Amplicon sequencing
Citations (0)
Gene-corrected cells in Gene Therapy (GT) treated patients can be tracked in vivo by means of vector integration site (IS) analysis, since each engineered clone becomes uniquely and stably marked by an individual IS. Because proper IS identification and quantification is crucial for accurate clonal tracking studies, we designed a customizable, tailored pipeline to analyze LAM-PCR amplicons sequenced by Illumina MiSeq/HiSeq technology. The sequencing data are initially processed through a series of quality filters and cleaned of vector and Linker Cassette (LC) sequences with customizable settings. Demultiplexing is then performed by recognizing the specific barcode combinations used during library preparation, and the sequences are aligned to the reference genome. Importantly, the human genome assembly Hg19 is composed of 93 contigs, including the mitochondrial genome, unlocalized and unplaced contigs, and some alternative haplotypes of chr6. While previous approaches aligned IS sequences only to the standard 24 human chromosomes, using the whole assembled genome improved alignment accuracy and concomitantly increased the number of detectable ISs. To date, we have processed 28 independent human sample sets, retrieving 260,994 ISs from 189,270,566 sequencing reads. Although sequencing read counts at each IS have been widely used to estimate the relative IS abundance, this method carries inherent accuracy constraints because the rounds of exponential amplification required by LAM-PCR might generate imbalances in the original clonal representation. More recently, a method based on genomic sonication has been proposed that exploits shear site counts to tag the number of original fragments belonging to each IS before PCR amplification. However, the number of cells composing a given clone could far exceed the number of fragments of different lengths that can be generated upon fragmentation in proximity of that given IS. 
This would rapidly saturate the available diversity of shear sites and progressively generate more and more same-site shearing on independent genomes. In order to overcome the described biases and reliably quantify ISs, we designed and tested a new LC encoding random barcodes. The new LC is composed of a known sequence of 29 nt used as the binding site for the primers during the amplification steps, a 6-nt random barcode, a fixed-anchor sequence of 6 nt, a second 6-nt random barcode and a final known sequence of 22 nt containing sticky ends for the three main restriction enzymes in use (MluI, HpyCH4IV and AciI). This particular design increased the accuracy of clonal diversity estimation, since the fixed-anchor sequence acts as a control for sequencing reliability in the barcode area. The theoretical number of different available barcodes per clone (4^12 = 16,777,216) far exceeds what is required to avoid saturating the original diversity of the analyzed sample (composed on average of around 50,000 cells). We validated this novel approach by performing assays on serial dilutions of individual clones carrying known ISs. The precision rate obtained averaged around 99.3%, while the worst error rate was at most 1.86%, confirming the reliability of IS quantification. We successfully applied the barcoded-LC system to the analysis of clinical samples from a Wiskott Aldrich Syndrome GT patient, collecting to date 50,215 barcoded ISs from 94,052,785 sequencing reads.
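The linker cassette layout described above (29 nt primer site, 6 nt random barcode, 6 nt fixed anchor, 6 nt random barcode, 22 nt tail) lends itself to a short parsing sketch. The segment lengths come from the abstract; the anchor sequence `ACGTAC`, the function names, and the assumption that a read begins exactly at the first base of the cassette are hypothetical choices for illustration, not the authors' pipeline.

```python
PRIMER_SITE_LEN = 29   # known primer-binding sequence (length from the abstract)
BARCODE_LEN = 6        # each of the two random barcode blocks
ANCHOR = "ACGTAC"      # hypothetical 6-nt anchor; the real sequence is not given


def barcode_space(random_positions=2 * BARCODE_LEN):
    """Number of distinct barcodes available with four possible bases."""
    return 4 ** random_positions


def extract_barcode(linker_read, anchor=ANCHOR):
    """Return the 12-nt combined barcode, or None if the read fails checks.

    Assumes the read starts at the first base of the linker cassette.
    """
    first_start = PRIMER_SITE_LEN
    anchor_start = first_start + BARCODE_LEN
    second_start = anchor_start + len(anchor)
    second_end = second_start + BARCODE_LEN
    if len(linker_read) < second_end:
        return None
    # The fixed anchor acts as a control for sequencing reliability.
    if linker_read[anchor_start:second_start] != anchor:
        return None
    barcode = (linker_read[first_start:anchor_start]
               + linker_read[second_start:second_end])
    if set(barcode) - set("ACGT"):
        return None
    return barcode


# Toy cassette read built from placeholder sequence.
read = "G" * 29 + "AAACCC" + ANCHOR + "GGGTTT" + "T" * 22
barcode = extract_barcode(read)
```

Counting the distinct barcodes observed at one integration site would then give an estimate of the number of original fragments that is independent of PCR read counts, which is the motivation stated in the abstract.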
Amplicon
Citations (7)
The MinION sequencer has made in situ sequencing feasible in remote locations. Following our initial demonstration of its high performance off planet with Earth-prepared samples, we developed and tested an end-to-end, sample-to-sequencer process that could be conducted entirely aboard the International Space Station (ISS). Initial experiments demonstrated the process with a microbial mock community standard. The DNA was successfully amplified, primers were degraded, and libraries prepared and sequenced. The median percent identities for both datasets were 84%, as assessed from alignment of the mock community. The ability to correctly identify the organisms in the mock community standard was comparable for the sequencing data obtained in flight and on the ground. To validate the process on microbes collected from and cultured aboard the ISS, bacterial cells were selected from a NASA Environmental Health Systems Surface Sample Kit contact slide. The locations of bacterial colonies chosen for identification were labeled, and a small number of cells were directly added as input into the sequencing workflow. Prepared DNA was sequenced, and the data were downlinked to Earth. Return of the contact slide to the ground allowed for standard laboratory processing for bacterial identification. The identifications obtained aboard the ISS, Staphylococcus hominis and Staphylococcus capitis, matched those determined on the ground down to the species level. This marks the first ever identification of microbes entirely off Earth, and this validated process could be used for in-flight microbial identification, diagnosis of infectious disease in a crewmember, and as a research platform for investigators around the world.
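For readers unfamiliar with the percent identity metric quoted above, the sketch below computes it from a pairwise alignment and takes the median across reads. Several definitions of percent identity are in use; this one (matching columns divided by all alignment columns, gaps included) is an assumption, as are the toy alignments, and it may differ from the definition used in the study.

```python
from statistics import median


def percent_identity(aligned_read, aligned_ref):
    """Percent of alignment columns where read and reference bases match.

    Both inputs are gapped strings of equal length, with '-' marking a gap.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    columns = 0
    matches = 0
    for read_base, ref_base in zip(aligned_read, aligned_ref):
        if read_base == "-" and ref_base == "-":
            continue  # ignore columns that are gaps in both sequences
        columns += 1
        if read_base == ref_base:
            matches += 1
    if columns == 0:
        raise ValueError("alignment has no informative columns")
    return 100.0 * matches / columns


# Toy alignments: one with a deletion and a mismatch, one perfect.
identities = [
    percent_identity("ACGT-ACGTA", "ACGTTACCTA"),
    percent_identity("ACGTACGTAC", "ACGTACGTAC"),
]
median_identity = median(identities)
```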
Minion
Identification
International Space Station
Citations (48)
United States public health agencies are focusing on next-generation sequencing (NGS) to quickly identify and characterize foodborne pathogens. Here, the MinION nanopore long-read sequencer was used to simultaneously sequence the entire chromosome and plasmids of Salmonella enterica subsp. enterica serovar Bareilly and Escherichia coli O157:H7. A rapid, random sequencing approach was developed, coupled with de novo genome assembly within a customized data analysis workflow that can resolve highly repetitive genomic regions. In sequencing runs as short as four hours, using nanopore data alone, full-length genomes were obtained with an average identity of 99.87% for Salmonella Bareilly and 99.89% for E. coli in comparison to the respective MiSeq references. These long-read assemblies provided information on serotype, virulence factors, and antimicrobial resistance genes. Using a custom-developed SNP-selection workflow, the potential of the nanopore-only assemblies (after only 30 minutes of sequencing) for rapid phylogenetic inference, with identical topology compared to the published dataset, was demonstrated. To achieve maximum-quality assemblies, the developed bioinformatics workflow employed additional polishing steps to correct the systematic errors produced by the nanopore-only assemblies. Nanopore sequencing provided a shorter turnaround time (10 hours for library preparation and sequencing) compared to other NGS technologies.
Minion
Salmonella enterica
Nanopore
Citations (4)
The miniaturized and portable DNA sequencer MinION™ has demonstrated great potential in different analyses such as genome-wide sequencing, pathogen outbreak detection and surveillance, human genome variability, and microbial diversity. In this study, we tested the ability of the MinION™ platform to perform long amplicon sequencing in order to design new approaches to study microbial diversity using a multi-locus approach. After compiling a robust database by parsing and extracting the rrn bacterial region from more than 67,000 complete or draft bacterial genomes, we demonstrated that the data obtained during sequencing of the long amplicon in the MinION™ device using R9 and R9.4 chemistries were sufficient to study 2 mock microbial communities in a multiplex manner and to almost completely reconstruct the microbial diversity contained in the HM782D and D6305 mock communities. Although nanopore-based sequencing produces reads with lower per-base accuracy compared with other platforms, we presented a novel approach consisting of multi-locus and long amplicon sequencing using the MinION™ MkIb DNA sequencer and R9 and R9.4 chemistries that helps to overcome the main disadvantage of this portable sequencing platform. Furthermore, the nanopore sequencing library, constructed with the latest releases of pore chemistry (R9.4) and sequencing kit (SQK-LSK108), yielded a higher level of 1D read accuracy, sufficient to characterize the microbial species present in each mock community analysed. Improvements in nanopore chemistry, such as minimizing base-calling errors and new library protocols able to produce rapid 1D libraries, will provide more reliable information in the near future. Such data will be useful for more comprehensive and faster specific detection of microbial species and strains in complex ecosystems.
Minion
Amplicon
Nanopore
Amplicon sequencing
Bacterial genome size
Multiplex
Citations (94)
We evaluated the performance of the MinION DNA sequencer in-flight on the International Space Station (ISS), and benchmarked its performance off-Earth against the MinION, Illumina MiSeq, and PacBio RS II sequencing platforms in terrestrial laboratories. Samples contained equimolar mixtures of genomic DNA from lambda bacteriophage, Escherichia coli (strain K12, MG1655) and Mus musculus (female BALB/c mouse). Nine sequencing runs were performed aboard the ISS over a 6-month period, yielding a total of 276,882 reads with no apparent decrease in performance over time. From sequence data collected aboard the ISS, we constructed directed assemblies of the ~4.6 Mb E. coli genome, ~48.5 kb lambda genome, and a representative M. musculus sequence (the ~16.3 kb mitochondrial genome), at 100%, 100%, and 96.7% consensus pairwise identity, respectively; de novo assembly of the E. coli genome from raw reads yielded a single contig comprising 99.9% of the genome at 98.6% consensus pairwise identity. Simulated real-time analyses of in-flight sequence data using an automated bioinformatic pipeline and laptop-based genomic assembly demonstrated the feasibility of sequencing analysis and microbial identification aboard the ISS. These findings illustrate the potential for sequencing applications including disease diagnosis, environmental monitoring, and elucidating the molecular basis for how organisms respond to spaceflight.
Minion
Sequence assembly
Hybrid genome assembly
DNA nanoball sequencing
Citations (298)
Ultra-accurate Microbial Amplicon Sequencing Directly from Complex Samples with Synthetic Long Reads
Out of the many pathogenic bacterial species that are known, only a fraction are readily identifiable directly from a complex microbial community using standard next generation DNA sequencing technology. Long-read sequencing offers the potential to identify a wider range of species and to differentiate between strains within a species, but attaining sufficient accuracy in complex metagenomes remains a challenge. Here, we describe and analytically validate LoopSeq, a commercially available synthetic long-read (SLR) sequencing technology that generates highly accurate long reads from standard short reads. LoopSeq reads are sufficiently long and accurate to identify microbial genes and species directly from complex samples. LoopSeq applied to full-length 16S rRNA genes from known strains perfectly recovered the full diversity of full-length exact sequence variants in a known microbial community. Full-length LoopSeq reads had a per-base error rate of 0.005%, which exceeds the accuracy reported for other long-read sequencing technologies. 18S-ITS and genomic sequencing of fungal and bacterial isolates confirmed that LoopSeq sequencing maintains that accuracy for reads up to 6 kilobases in length. Analysis of rinsate from retail meat samples demonstrated that LoopSeq full-length 16S rRNA synthetic long reads could accurately classify organisms down to the species level, and could differentiate between different strains within species identified by the CDC as potential foodborne pathogens. The order-of-magnitude improvement in both length and accuracy over standard Illumina amplicon sequencing achieved with LoopSeq enables accurate species-level and strain identification from complex and low-biomass microbiome samples. 
The ability to generate accurate and long microbiome sequencing reads using standard short-read sequencers will accelerate the building of quality microbial sequence databases and remove a significant hurdle on the path to precision microbial genomics.
Amplicon
Amplicon sequencing
Illumina dye sequencing
Identification
Citations (17)
The relatively short read lengths produced by the major Next-Gen sequencing platforms (up to 300 nt for Illumina, up to 200 nt for Ion Proton) are poorly suited for analyzing large combinatorial libraries of longer nucleotide sequences, such as those involving AAV capsid genes. Despite its much lower throughput and accuracy, the PacBio single-molecule real-time technology is currently the only option for long templates, with average read lengths of 10 to 15 kb. Using the Circular Consensus Sequencing (CCS) mode, in which template DNA fragments are circularized, allows a significant increase in accuracy due to the fact that each template is being sequenced multiple times. To interpret PacBio CCS data, we have previously reported a first version of the CapLib code, which was developed to identify variable regions in AAV combinatorial capsid libraries. DNA fragments, derived from purified DNA-containing AAV particles, 869 bp in length and including 27 variable nucleotide positions, were sequenced in CCS mode using the P6-C4 chemistry. A total of 26,897 reads were obtained, with a mean read length of 814 nt, a mean read quality of 0.9956 and a mean number of passes of 21.34. Only 5,456 reads had the correct size of 869 nt, and of these, only 1,638 had a sequence that matched the reference sequence, indicating that only 6% of reads were potentially error-free and that the vast majority had multiple insertions and deletions. In order to extract more useful information from the sequencing data, a new version of the CapLib software was developed. It is designed to correct sequencing reads in silico by assuming that constant nucleotide positions are wild-type and focusing on the detection of the variable positions. The premise was validated by Sanger sequencing of multiple clones, confirming that mutations were present only in the intended positions. Depending on the parameter values used, up to 14,000 reads could be recovered by CapLib 2. 
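The read counts quoted above can be checked with a little arithmetic, and the idea of comparing a read against only the constant positions of a template can be sketched in a few lines. This is not CapLib or CapLib 2: the function names, the use of `N` to mark variable positions, and the toy sequences are assumptions made for illustration only.

```python
def matches_constant_positions(read, template):
    """True if the read has the template's length and agrees with it at
    every constant position; 'N' in the template marks a variable position."""
    if len(read) != len(template):
        return False
    return all(t == "N" or r == t for r, t in zip(read, template))


def read_summary(reads, template):
    """Fractions of reads that are full length and that match the template."""
    total = len(reads)
    if total == 0:
        raise ValueError("no reads supplied")
    full_length = sum(1 for r in reads if len(r) == len(template))
    matching = sum(1 for r in reads if matches_constant_positions(r, template))
    return {
        "full_length_fraction": full_length / total,
        "matching_fraction": matching / total,
    }


# Toy data: a 6-nt template with two variable positions.
template = "ACNNGT"
reads = ["ACTTGT", "ACGGGT", "ACTTG", "TCTTGT"]
summary = read_summary(reads, template)

# Fractions implied by the counts reported in the abstract.
reported_full_length = 5456 / 26897   # about 0.203
reported_matching = 1638 / 26897      # about 0.061, the "6%" quoted
```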
In addition to recovering PacBio CCS reads, CapLib 2 can also assemble Sanger sequencing data, translate recovered reads into protein sequences and perform detailed analyses of the dataset. It can also analyze clones resulting from directed evolution experiments and compare them with the original library.
Template
Sequence (biology)
Citations (0)
Next-generation sequencing technologies enable the rapid, cost-effective production of sequence data. To evaluate the performance of these sequencing technologies, it is important to investigate the quality of the sequence reads they produce. In this study, we analyzed the quality of sequence reads and SNP detection performance using three commercially available next-generation sequencers, i.e., Roche Genome Sequencer FLX System (FLX), Illumina Genome Analyzer (GA), and Applied Biosystems SOLiD system (SOLiD). A common genomic DNA sample obtained from Escherichia coli strain DH1 was applied to these sequencers. The obtained sequence reads were aligned to the complete genome sequence of E. coli DH1 to evaluate the accuracy and sequence bias of these sequencing methods. We found that the fraction of "junk" data, which could not be aligned to the reference genome, was largest in the SOLiD data set, in which about half of the reads could not be aligned. Among data sets after alignment to the reference, sequence accuracy was poorest in the GA data sets, suggesting relatively low fidelity of the elongation reaction in the GA method. Furthermore, by aligning the sequence reads to E. coli strain W3110, we screened sequence differences between the two E. coli strains using data sets from the three next-generation platforms. The results revealed that the detected sequence differences were similar among the three methods, while the sequence coverage required for detection was significantly smaller in the FLX data set. These results provide valuable information on the quality of short sequence reads and the performance of SNP detection in three next-generation sequencing platforms.
Sequence (biology)
Citations (91)