Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH's most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.
IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).
Popular de novo amplicon clustering methods suffer from two fundamental flaws: arbitrary global clustering thresholds, and input-order dependency induced by centroid selection. Swarm was developed to address these issues by first clustering nearly identical amplicons iteratively using a local threshold, and then by using clusters' internal structure and amplicon abundances to refine its results. This fast, scalable, and input-order independent approach reduces the influence of clustering parameters and produces robust operational taxonomic units, improving the amount of meaningful biological information that can be extracted from amplicon-based studies.
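The first phase described above, iterative growth of a cluster under a local threshold, can be illustrated with a minimal Python sketch. It is a simplified illustration and not Swarm's actual implementation: it assumes a local threshold of one difference (substitution, insertion, or deletion), uses a quadratic all-against-all comparison, and omits the second phase that refines clusters using their internal structure and amplicon abundances. The function names `within_one_edit` and `swarm_like` are hypothetical.

```python
from collections import deque


def within_one_edit(a, b):
    """Return True if a and b differ by at most one substitution, insertion, or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) sequence
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substitution
    return a[i:] == b[i + 1:]  # exactly one insertion in b


def swarm_like(abundances):
    """Grow clusters by chaining amplicons that are one difference apart.

    abundances maps each unique amplicon sequence to its read count.
    Seeds are taken by decreasing abundance (ties broken alphabetically),
    so the result does not depend on the order of the input.
    """
    order = sorted(abundances, key=lambda s: (-abundances[s], s))
    assigned = set()
    clusters = []
    for seed in order:
        if seed in assigned:
            continue
        assigned.add(seed)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for candidate in order:
                if candidate not in assigned and within_one_edit(current, candidate):
                    assigned.add(candidate)
                    cluster.append(candidate)
                    queue.append(candidate)  # the new member can recruit its own neighbours
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    toy = {"ACGT": 10, "ACGA": 3, "ACAA": 1, "TTTT": 5}
    print(swarm_like(toy))
```

In the toy input, `ACAA` is two differences away from the seed `ACGT` but joins its cluster through the intermediate `ACGA`, which is the behaviour a local (rather than global) threshold allows.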
Reads were grouped into OTUs using the following swarm-based pipeline: paired-end reads were merged with vsearch’s --fastq_mergepairs command (version 2.15.1, allowing for staggered reads; Rognes et al., 2016), and trimmed with cutadapt (version 3.0; Martin, 2011), keeping only reads containing both forward and reverse primers. After trimming, the expected error per read was estimated with vsearch’s command --fastq_filter and the option --eeout. Each sample was then de-replicated, i.e. strictly identical reads were merged, using vsearch’s command --derep_fulllength, and converted into fasta format. Clustering was performed at the sample level with swarm 3.0 using default parameters (Mahé et al., 2015). Prior to global clustering, individual fasta files (one per sample) were pooled and further dereplicated with vsearch. Files containing per-read expected error values were also dereplicated to retain only the lowest expected error for each unique sequence. Global clustering was performed with swarm (using the fastidious option). Cluster representative sequences were then searched for chimeras with vsearch’s command --uchime_denovo using default parameters (Edgar et al., 2011). Clustering results, expected error values, taxonomic assignments, and chimera detection results were used to build a “raw” occurrence table. Reads without primers, reads shorter than 32 nucleotides and reads with uncalled bases (“N”) were discarded. For a “filtered” occurrence table, non-chimeric sequences, sequences with an expected error per nucleotide below 0.0002, and clusters containing at least 2 reads were retained. Since primer trimming is not perfect, some sequences can still contain primer fragments or be excessively trimmed. These sub- or super-sequences were identified using vsearch and merged with their closest, most abundant perfectly trimmed sequence. Finally, occurrence patterns throughout our sample collection were used to further refine the occurrence table. 
Clusters that contained sub-clusters with only a single-nucleotide difference but with different ecological patterns (defined here as uncorrelated abundance values in at least 5% of the samples) were split into distinct clusters (https://github.com/frederic-mahe/fred-metabarcoding-pipeline). Conversely, clusters with similar sequences that had correlated abundance values in at least 95% of the samples were merged using a re-implementation of lulu's method (Frøslev et al., 2017; https://github.com/frederic-mahe/mumu).
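Two of the bookkeeping steps in this pipeline can be sketched in Python: keeping the lowest expected error for each unique sequence during dereplication, and applying the criteria for the “filtered” occurrence table. This is a minimal sketch and not the pipeline's actual code. It assumes that the three criteria are combined (all must hold) and that the expected error per nucleotide is the expected error divided by the sequence length; the `Cluster` record and the function names are hypothetical.

```python
from dataclasses import dataclass

MAX_EE_PER_NUCLEOTIDE = 0.0002  # threshold stated in the text
MIN_READS = 2                   # clusters must contain at least 2 reads


@dataclass
class Cluster:
    name: str
    length: int            # length of the representative sequence (nucleotides)
    expected_error: float  # lowest expected error seen for that sequence
    reads: int             # total number of reads in the cluster
    is_chimera: bool       # outcome of chimera detection


def lowest_expected_error(records):
    """Keep the lowest expected error observed for each unique sequence.

    records is an iterable of (sequence, expected_error) pairs.
    """
    best = {}
    for sequence, ee in records:
        if sequence not in best or ee < best[sequence]:
            best[sequence] = ee
    return best


def passes_filter(cluster):
    """Apply the three criteria (assumed here to be combined with AND)."""
    ee_per_nucleotide = cluster.expected_error / cluster.length
    return (
        not cluster.is_chimera
        and ee_per_nucleotide < MAX_EE_PER_NUCLEOTIDE
        and cluster.reads >= MIN_READS
    )


if __name__ == "__main__":
    raw = [
        Cluster("c1", 400, 0.04, 5, False),  # 0.0001 per nucleotide: kept
        Cluster("c2", 400, 0.10, 5, False),  # 0.00025 per nucleotide: dropped
        Cluster("c3", 400, 0.04, 1, False),  # single read: dropped
        Cluster("c4", 400, 0.04, 5, True),   # chimera: dropped
    ]
    print([c.name for c in raw if passes_filter(c)])
```

The example values are invented for illustration only.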
Metabarcoding of microbial eukaryotes (collectively known as protists) has developed tremendously in the last decade, almost solely relying on the 18S rRNA gene. As microbial eukaryotes are extremely diverse, many primers and primer pairs have been developed. To cover a relevant and representative fraction of the protist community in a given study system, an informed primer choice is necessary, as no primer pair can target all protists equally well. As such, a smart primer choice is very difficult even for experts and there are very few online resources available to list existing primers. We built a database listing 285 primers and 83 unique primer pairs that have been used for eukaryotic 18S rRNA gene metabarcoding. In silico performance of primer pairs was tested against two sequence databases: PR2 version 4.12.0 for eukaryotes and a subset of silva version 132 for bacteria and archaea. We developed an R-based web application enabling browsing of the database, visualization of the taxonomic distribution of the amplified sequences with the number of mismatches, and testing any user-defined primer or primer set (https://app.pr2-primers.org). Taxonomic specificity of primer pairs, amplicon size and location of mismatches can also be determined. We identified universal primer sets that matched the largest number of sequences and analysed the specificity of some primer sets designed to target certain groups. This tool enables guided primer choices that will help a wide range of researchers to include protists as part of their investigations.
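Counting mismatches between a primer and a candidate binding site, the quantity such an in silico test reports, can be illustrated with a short Python sketch. It is not the code behind the web application: it assumes an ungapped comparison, primers written with IUPAC degenerate codes, and target sequences written with A, C, G, and T only (any other character in the target counts as a mismatch). The primer and sequence in the example are made up, and the function names are hypothetical.

```python
# IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(
        base not in IUPAC[code]
        for code, base in zip(primer.upper(), site.upper())
    )


def best_match(primer, sequence):
    """Slide the primer along the sequence without gaps.

    Returns (position, mismatches) for the window with the fewest
    mismatches (leftmost on ties), or None if the sequence is too short.
    """
    width = len(primer)
    if len(sequence) < width:
        return None
    hits = (
        (start, count_mismatches(primer, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    )
    return min(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    print(best_match("ACGY", "TTACGTAA"))
```

A reverse primer would be matched the same way after taking its reverse complement, which this sketch leaves out.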
Environmental DNA and culture-based analyses have suggested that fungi are present in low diversity and in low abundance in many marine environments, especially in the upper water column. Here, we use a dual approach involving high-throughput diversity tag sequencing from both DNA and RNA templates and fluorescent cell counts to evaluate the diversity and relative abundance of fungi across marine samples taken from six European near-shore sites. We removed very rare fungal operational taxonomic units (OTUs) selecting only OTUs recovered from multiple samples for a detailed analysis. This approach identified a set of 71 fungal ‘OTU clusters' that account for 66% of all the sequences assigned to the Fungi. Phylogenetic analyses demonstrated that this diversity includes a significant number of chytrid-like lineages that had not been previously described, indicating that the marine environment encompasses a number of zoosporic fungi that are new to taxonomic inventories. Using the sequence datasets, we identified cases where fungal OTUs were sampled across multiple geographical sites and between different sampling depths. This was especially clear in one relatively abundant and diverse phylogroup tentatively named Novel Chytrid-Like-Clade 1 (NCLC1). For comparison, a subset of the water column samples was also investigated using fluorescent microscopy to examine the abundance of eukaryotes with chitin cell walls. Comparisons of relative abundance of RNA-derived fungal tag sequences and chitin cell-wall counts demonstrate that fungi constitute a low fraction of the eukaryotic community in these water column samples. Taken together, these results demonstrate the phylogenetic position and environmental distribution of 71 lineages, improving our understanding of the diversity and abundance of fungi in marine environments.
Summary Dinoflagellates (Alveolata) are one of the ecologically most important groups of modern phytoplankton. Their biological complexity makes assessment of their global diversity and community structure difficult. We used massive V9 18S rDNA sequencing from 106 size‐fractionated plankton communities collected across the world's surface oceans during the Tara Oceans expedition (2009–2012) to assess patterns of pelagic dinoflagellate diversity and community structuring over global taxonomic and ecological scales. Our data and analyses suggest that dinoflagellate diversity has been largely underestimated, representing overall ∼1/2 of protistan rDNA metabarcode richness assigned at ≥ 90% to a reference sequence in the world's surface oceans. Dinoflagellate metabarcode diversity and abundance display regular patterns across the global scale, with different order‐level taxonomic compositions across organismal size fractions. While the pico to nano‐planktonic communities are composed of an extreme diversity of metabarcodes assigned to Gymnodiniales or are simply undetermined, most micro‐dinoflagellate metabarcodes relate to the well‐referenced Gonyaulacales and Peridiniales orders, and a lower abundance and diversity of essentially symbiotic Peridiniales is unveiled in the meso‐plankton. Our analyses could help future development of biogeochemical models of pelagic systems integrating the separation of dinoflagellates into functional groups according to plankton size classes.
Abstract Taxonomic assignment of operational taxonomic units (OTUs) is an important bioinformatics step in analyzing environmental sequencing data. Pairwise alignment and phylogenetic‐placement methods represent two alternative approaches to taxonomic assignments, but their results can differ. Here we used available colpodean ciliate OTUs from forest soils to compare the taxonomic assignments of VSEARCH (which performs pairwise alignments) and EPA‐ng (which performs phylogenetic placements). We showed that when there are differences in taxonomic assignments between pairwise alignments and phylogenetic placements at the subtaxon level, there is a low pairwise similarity of the OTUs to the reference database. We then showcase how the output of EPA‐ng can be further evaluated using GAPPA to assess the taxonomic assignments when there exist multiple equally likely placements of an OTU, by taking into account the sum over the likelihood weights of the OTU placements within a subtaxon, and the branch distances between equally likely placement locations. We also inferred the evolutionary and ecological characteristics of the colpodean OTUs using their placements within subtaxa. This study demonstrates how to fully analyze the output of EPA‐ng, by using GAPPA in conjunction with knowledge of the taxonomic diversity of the clade of interest.
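Summing likelihood weights over the placements of one OTU within each subtaxon can be illustrated with a small Python sketch. It does not reproduce GAPPA's implementation or output format: it assumes the placements have already been reduced to (subtaxon, likelihood weight) pairs, it ignores the branch distances between placement locations, and the 0.5 acceptance threshold is an arbitrary illustrative choice. The clade names, weights, and function names are hypothetical.

```python
from collections import defaultdict


def subtaxon_support(placements):
    """Sum the likelihood weights of an OTU's placements per subtaxon.

    placements is an iterable of (subtaxon, likelihood_weight) pairs,
    one pair per placement location of the OTU.
    """
    totals = defaultdict(float)
    for subtaxon, weight in placements:
        totals[subtaxon] += weight
    return dict(totals)


def assign_subtaxon(placements, threshold=0.5):
    """Return the best-supported subtaxon, or None if support is too low.

    The threshold is an illustrative assumption, not a recommended value.
    """
    totals = subtaxon_support(placements)
    if not totals:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] >= threshold else None


if __name__ == "__main__":
    # Three placements of one OTU; weights are made up for illustration.
    otu = [("clade_A", 0.40), ("clade_A", 0.35), ("clade_B", 0.25)]
    print(subtaxon_support(otu))
    print(assign_subtaxon(otu))
```

In this toy case no single placement holds a majority of the weight, yet two of them fall in the same subtaxon, so the summed support still points to one assignment.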