Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets

2019 
Background: The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can parse 16S rRNA gene sequences to high-resolution Amplicon Sequence Variants (ASVs), which represent ecologically coherent entities. Assigning species-level taxonomy to these ASVs is the critical remaining barrier to drawing ecologically/clinically relevant inferences from and comparing data across 16S rRNA gene-based microbiota studies.

Results: To overcome this barrier, we developed a broadly applicable method for constructing a phylogeny-based, high-resolution, habitat-specific training set. When used with the naive Bayesian Ribosomal Database Project (RDP) Classifier, this training set achieved species/supraspecies-level taxonomic assignment to 16S rRNA gene-derived ASVs. The key steps for generating such a training set are 1) constructing an accurate and comprehensive phylogeny-based, habitat-specific database; 2) compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon in the database; 3) trimming the training set to match the sequenced regions if necessary; and 4) placing species sharing closely related sequences into a supraspecies taxonomic level to maintain subgenus resolution. As proof of principle, we developed a V1-V3 region training set for the bacterial microbiota of the human aerodigestive tract using our expanded Human Oral Microbiome Database (eHOMD). In addition, we overcame technical limitations to successfully use Illumina sequences for the 16S rRNA gene V1-V3 region, the most informative segment for classifying bacteria native to the human aerodigestive tract. We also generated a full-length eHOMD 16S rRNA gene training set, which we used in conjunction with an independent PacBio Single Molecule, Real-Time (SMRT)-sequenced sinonasal dataset to validate the representation of species in our training set. The latter also established the effectiveness of a full-length training set for assigning taxonomy of long-read 16S rRNA gene datasets.

Conclusion: Here, we present a systematic approach for constructing a phylogeny-based, high-resolution, habitat-specific training set that permits species/supraspecies-level taxonomic assignment to short- and long-read 16S rRNA gene-derived ASVs. This advancement enhances the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies.
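Steps 3 and 4 of the abstract (trimming references to the sequenced region, then grouping species that can no longer be told apart) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function name `build_supraspecies`, the toy sequences, the fixed trim coordinates, and the `/`-joined label format are all hypothetical. The sketch also merges species only when their trimmed sequences are exactly identical, which is a simplification of the "closely related sequences" criterion described above.

```python
from collections import defaultdict


def build_supraspecies(refs, start, end):
    """Map each species to a species or supraspecies label.

    refs: dict of species name -> list of reference sequences (hypothetical input).
    start, end: slice coordinates of the sequenced region (assumed fixed here).
    Species sharing an identical trimmed sequence are merged (a simplification).
    """
    # Step 3 (sketch): trim every reference and record which species carry it.
    owners = defaultdict(set)
    for species, seqs in refs.items():
        for seq in seqs:
            owners[seq[start:end]].add(species)

    # Step 4 (sketch): union-find so that merging is transitive across species.
    parent = {species: species for species in refs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for group in owners.values():
        members = sorted(group)
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    # Collect clusters and build one label per cluster.
    clusters = defaultdict(list)
    for species in refs:
        clusters[find(species)].append(species)

    labels = {}
    for members in clusters.values():
        label = "/".join(sorted(members))  # hypothetical label format
        for species in members:
            labels[species] = label
    return labels


# Toy data: sp_A and sp_B differ only outside the trimmed window [3:9].
toy_refs = {
    "sp_A": ["AAACCCGGGTTT", "AAACCCGGGTTA"],
    "sp_B": ["TTACCCGGGTAA"],
    "sp_C": ["AAATTTGGGTTT"],
}
labels = build_supraspecies(toy_refs, 3, 9)
print(labels)
```

In this toy case `sp_A` and `sp_B` become indistinguishable after trimming and receive a shared label, while `sp_C` keeps its species-level label. A realistic version would use alignment-based region extraction and a similarity threshold instead of fixed coordinates and exact matching.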