Additional file 7: Table S5. Differentially abundant phyla and families between Indian and Danish samples. Table S5A: Differentially abundant phyla between Indian (IN) and Danish (DK) gut microbiomes identified using a negative binomial Wald test. A positive log2 fold change value indicates higher relative abundance of the OTU in DK subjects and vice-versa. P-values were adjusted for multiple testing using Benjamini-Hochberg correction (padj
Background: Next-generation sequencing (NGS) technologies have enabled probing of microbial diversity in different environmental niches with unprecedented sequencing depth. However, due to read-length limitations of popular NGS technologies, 16S amplicon sequencing-based microbiome studies rely on targeting short stretches of the 16S rRNA gene encompassing a selection of variable (V) regions. In most cases, such a short stretch constitutes a single V-region or a couple of V-regions placed adjacent to each other on the 16S rRNA gene. Given that different V-regions have different resolving ability with respect to various taxonomic groups, selecting the optimal V-region (or a combination thereof) remains a challenge.
Methods: The accuracy of taxonomic profiles generated from sequences encompassing 1) individual V-regions, 2) adjacent V-regions, and 3) pairs of non-contiguous V-regions was assessed and compared. Subsequently, the discriminating capability of different V-regions with respect to different taxonomic lineages was assessed. Finally, the possibility of using paired-end sequencing protocols to target combinations of non-adjacent V-regions was evaluated with respect to the utility of such an experimental design in providing improved taxonomic resolution.
Results: Extensive validation with simulated microbiome datasets mimicking different environmental and host-associated microbiome samples suggests that targeting certain combinations of non-contiguously placed V-regions might yield better taxonomic classification accuracy compared to conventional 16S amplicon sequencing targets. This work also puts forward a novel in silico combinatorial strategy that enables creation of consensus taxonomic profiles from experiments targeting multiple pair-wise combinations of V-regions to improve accuracy in taxonomic classification.
Conclusion: The study suggests that targeting non-contiguous V-regions with paired-end sequencing can improve 16S rRNA-based taxonomic resolution of microbiomes. Furthermore, employing the novel in silico combinatorial strategy can improve taxonomic classification without any significant additional experimental cost or effort. The empirical observations obtained can potentially serve as a guideline for future 16S microbiome studies and help researchers choose the optimal combination of V-regions for a specific experiment or sampled environment.
Abstract: Genomes have an inherent context dictated by the order in which the nucleotides and higher-order genomic elements are arranged in the DNA/RNA. Learning this context is a daunting task, governed by the combinatorial complexity of interactions possible between ordered elements of genomes. Can natural language processing be employed on these orderly, complex, and evolving datatypes (genomic sequences) to reveal the latent patterns or context of genomic elements (e.g., mutations)? Here we present an approach to understand the mutational landscape of Covid-19 by treating the temporally changing (continuously mutating) SARS-CoV-2 genomes as documents. We demonstrate how interpreting evolving genomes as analogous to temporal literature corpora provides an opportunity to use dynamic topic modeling (DTM) and temporal Word2Vec models to delineate mutation signatures corresponding to different Variants of Concern and to track the semantic drift of Mutations of Concern (MoCs). We identified and studied characteristic mutations associated with Covid-infection severity and tracked their relationship with MoCs. Our groundwork on the utility of such temporal NLP models in genomics could not only supplement ongoing efforts to understand the Covid pandemic but also provide alternative strategies for studying dynamic phenomena in the biological sciences through data science (especially NLP and AI/ML).
Additional file 12: Table S10. Differentially abundant phyla, families, and genera between NG and PD subjects. Table S10A: Differentially abundant phyla between NG and PD subjects, belonging to the Indian and Danish cohorts (pooled together), identified using a negative binomial Wald test (corrected for geography-specific cohort effect). A positive log2 fold change value indicates higher relative abundance of the OTU in PD subjects and vice-versa. P-values were adjusted for multiple testing using Benjamini-Hochberg correction. Table S10B: Differentially abundant families between NG and PD subjects, belonging to the Indian and Danish cohorts (pooled together), identified using a negative binomial Wald test (corrected for geography-specific cohort effect). A positive log2 fold change value indicates higher relative abundance of the OTU in PD subjects and vice-versa. P-values were adjusted for multiple testing using Benjamini-Hochberg correction. Significantly (padj
Additional file 9: Table S7. Differentially abundant OTUs between NG and PD subjects. Differentially abundant OTUs between NG and PD subjects, belonging to the Indian and Danish cohorts (pooled together) identified using a negative binomial Wald test (corrected for geography specific cohort effect). A positive log2 fold change value indicates higher relative abundance of the OTU in PD subjects and vice-versa. P-values were adjusted for multiple testing using Benjamini-Hochberg method. Significantly (padj
The human microbiota, which comprises an ensemble of taxonomically and functionally diverse but often mutually cooperating microorganisms, benefits its host by shaping the host immunity, energy harvesting, and digestion of complex carbohydrates as well as production of essential nutrients. Dysbiosis in the human microbiota, especially the gut microbiota, has been reported to be linked to several diseases and metabolic disorders. Recent studies have further indicated that these dysbiotic variations could potentially be tracked and exploited as biomarkers of disease states. However, the human microbiota is not geography agnostic, and hence a taxonomy-based (microbiome) biomarker for disease diagnostics has certain limitations. In comparison, (microbiome) function-based biomarkers are expected to have a wider applicability. Given that (i) the host physiology undergoes certain changes in the course of a disease and (ii) host-associated microbial communities need to adapt to this changing microenvironment of their host, we hypothesized that signatures emanating from the abundance of bacterial proteins associated with the signal transduction system (herein referred to as sensory proteins [SPs]) might be able to distinguish between healthy and diseased states. To test this hypothesis, publicly available metagenomic data sets corresponding to three diverse health conditions, namely, colorectal cancer, type 2 diabetes mellitus, and schizophrenia, were analyzed. Results demonstrated that SP signatures (derived from host-associated metagenomic samples) indeed differentiated between healthy individuals and patients suffering from diseases of various severities. Our finding was suggestive of the prospect of using SP signatures as early biomarkers for diagnosing the onset and progression of multiple diseases and metabolic disorders.
IMPORTANCE The composition of the human microbiota, a collection of host-associated microbes, has been shown to differ between healthy and diseased individuals. Recent studies have investigated whether tracking these variations could be exploited for disease diagnostics. It has been noted that, compared to microbial taxonomies, the ensemble of functional proteins encoded by microbial genes is less likely to be affected by changes in ethnicity and dietary preferences. These functions are expected to help the microbe adapt to changing environmental conditions. Thus, healthy individuals might harbor a different set of genes than diseased individuals. To test this hypothesis, we analyzed metagenomes from healthy and diseased individuals for signatures of a particular group of proteins called sensory proteins (SPs), which enable the bacteria to sense and react to changes in their microenvironment. Results demonstrated that SP signatures indeed differentiate between healthy individuals and those suffering from diseases of various severities.
Natural language processing (NLP) algorithms process linguistic data in order to discover the associated word semantics and develop models that can describe or even predict the latent meanings of the data. The applications of NLP multiply when dealing with dynamic or temporally evolving datasets (e.g., historical literature). Biological datasets of genome sequences are interesting since they are sequential as well as dynamic. Here we describe how SARS-CoV-2 genomes and mutations thereof can be processed using fundamental algorithms in NLP to reveal the characteristics and evolution of the virus. We demonstrate the applicability of NLP not only in probing temporal mutational signatures through dynamic topic modelling, but also in tracing mutation associations through the semantic drift observed in genomic mutation records. Our approach also yields promising results in relating mutations to patient health status, thereby identifying putative signatures linked to known or highly speculated mutations of concern.
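The idea of treating genomes as documents can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy reference and sample sequences, the token format (reference base, 1-based position, new base), the monthly time slices, and the helper names `call_substitutions`, `documents_by_slice`, and `token_frequencies` are all assumptions made for illustration. Each genome becomes a "document" whose "words" are its substitutions relative to a reference, and documents are grouped by time slice, which is the kind of input a dynamic topic model or temporal embedding model would consume. The sequences are assumed to be pre-aligned and of equal length.

```python
from collections import Counter, defaultdict


def call_substitutions(reference, genome):
    """Return substitution tokens such as 'G3T' (ref base, 1-based position, new base).

    Assumes both sequences are already aligned and of equal length.
    """
    return [
        f"{ref}{pos}{alt}"
        for pos, (ref, alt) in enumerate(zip(reference, genome), start=1)
        if ref != alt
    ]


def documents_by_slice(reference, samples):
    """Group mutation 'documents' (token lists) by time slice, e.g. collection month."""
    slices = defaultdict(list)
    for time_slice, genome in samples:
        slices[time_slice].append(call_substitutions(reference, genome))
    return dict(slices)


def token_frequencies(docs):
    """Fraction of documents in one slice that contain each mutation token."""
    counts = Counter(token for doc in docs for token in set(doc))
    return {token: n / len(docs) for token, n in counts.items()}


# Hypothetical toy data: (time slice, aligned genome) pairs.
REFERENCE = "ACGTACGT"
SAMPLES = [
    ("2020-03", "ACGTACGT"),
    ("2020-03", "ACTTACGT"),
    ("2020-04", "ACTTACGA"),
    ("2020-04", "ACTTACGT"),
]

corpus = documents_by_slice(REFERENCE, SAMPLES)
for time_slice in sorted(corpus):
    print(time_slice, token_frequencies(corpus[time_slice]))
```

Comparing the per-slice frequencies shows how a token's prevalence shifts over time; a temporal model would go further and examine how the tokens that co-occur with it change between slices.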
Additional file 13: Table S11. Differentially abundant KEGG pathways (level 3) between NG and PD subjects. Differentially abundant KEGG pathways (level 3) between NG and PD subjects, belonging to the Indian and Danish cohorts (pooled together), identified using a negative binomial Wald test (corrected for geography-specific cohort effect). A positive log2 fold change value indicates higher relative abundance of the pathway in PD subjects and vice-versa. P-values were adjusted for multiple testing using Benjamini-Hochberg correction (list of pathways sorted according to padj values). Pathways that are differentially abundant at a significance level of padj