Metagenomics, or sequencing of the genetic material from a complete microbial community, is a promising tool to discover novel microbes and viruses. Viral metagenomes typically contain many unknown sequences. Here we describe the discovery of a previously unidentified bacteriophage present in the majority of published human faecal metagenomes, which we refer to as crAssphage. Its ~97 kbp genome is six times more abundant in publicly available metagenomes than all other known phages together; it comprises up to 90% and 22% of all reads in virus-like particle (VLP)-derived metagenomes and total community metagenomes, respectively; and it totals 1.68% of all human faecal metagenomic sequencing reads in the public databases. The majority of crAssphage-encoded proteins match no known sequences in the database, which is why it was not detected before. Using a new co-occurrence profiling approach, we predict a Bacteroides host for this phage, consistent with Bacteroides-related protein homologues and a unique carbohydrate-binding domain encoded in the phage genome. Metagenomic studies of microbial communities often report DNA sequences from unidentified viruses. Here, Dutilh et al. analyse metagenomic data to reveal the complete genome of an abundant, ubiquitous virus from human faeces, and predict that the virus infects bacteria of the Bacteroides group.
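The idea behind host prediction by co-occurrence can be illustrated with a small sketch: if a phage depends on a host, its abundance across metagenomes should track the abundance of that host. The abstract does not give the details of the profiling method, so the snippet below is only an assumption-laden illustration that uses Pearson correlation as a stand-in similarity measure; the function names and the example profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-occurrence signal
    return cov / (sd_x * sd_y)


def rank_hosts(phage_profile, host_profiles):
    """Rank candidate hosts by how closely their abundance across
    samples follows the phage abundance (highest correlation first)."""
    scores = {host: pearson(phage_profile, profile)
              for host, profile in host_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical read depths across four metagenomes (illustrative only).
phage = [0, 5, 10, 2]
hosts = {
    "Bacteroides": [1, 6, 11, 3],
    "Escherichia": [10, 2, 0, 7],
}
ranking = rank_hosts(phage, hosts)
```

In this toy example the candidate whose profile rises and falls with the phage ranks first; a real analysis would need many more samples and a correction for compositional effects.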
Genomics-based metabolic models of microorganisms currently have no easy way of corroborating predicted biomass with the actual metabolites being produced. This study uses untargeted mass spectrometry-based metabolomics data to generate a list of accurate metabolite masses produced from the human commensal bacterium Citrobacter sedlakii grown in the presence of a simple glucose carbon source. A genomics-based flux balance metabolic model of this bacterium was previously generated using the bioinformatics tool PyFBA and phenotypic growth curve data. The high-resolution mass spectrometry data obtained through timed metabolic extractions were integrated with the predicted metabolic model through a program called MS_FBA. This program correlated untargeted metabolomics features from C. sedlakii with 218 of the 699 metabolites in the model using an exact mass match, with 51 metabolites further confirmed using predicted isotope ratios. Over 1400 metabolites were matched with additional metabolites in the ModelSEED database, indicating the need to incorporate more specific gene annotations into the predictive model through metabolomics-guided gap filling.
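An exact mass match of the kind described above can be sketched as a comparison of each observed feature mass against the theoretical masses of the model metabolites, within a tolerance expressed in parts per million. This is not the MS_FBA code: the function names, the 5 ppm default, and the neglect of adducts and charge states are all assumptions made for illustration.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_features(feature_masses, model_masses, tol_ppm=5.0):
    """Map each observed feature mass to the model metabolites whose
    theoretical mass lies within tol_ppm of it.

    Assumes neutral monoisotopic masses on both sides; a real workflow
    would first correct the observed m/z values for the adduct formed.
    """
    matches = {}
    for observed in feature_masses:
        hits = [name for name, mass in model_masses.items()
                if abs(ppm_error(observed, mass)) <= tol_ppm]
        if hits:
            matches[observed] = hits
    return matches


# Illustrative monoisotopic masses (Da); the feature list is made up.
model = {"D-glucose": 180.06339, "pyruvate": 88.01604}
features = [180.0634, 150.0]
matched = match_features(features, model)
```

Features that match nothing in the model are the interesting ones for gap filling, since they can then be searched against a larger compound database.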
Metagenomic studies characterize both the composition and diversity of uncultured viral and microbial communities. BLAST-based comparisons have typically been used for such analyses; however, sampling biases, high percentages of unknown sequences, and the use of arbitrary thresholds to find significant similarities can decrease the accuracy and validity of estimates. Here, we present Genome relative Abundance and Average Size (GAAS), a complete software package that provides improved estimates of community composition and average genome length for metagenomes in both textual and graphical formats. GAAS implements a novel methodology to control for sampling bias via length normalization, to adjust for multiple BLAST similarities by similarity weighting, and to select significant similarities using relative alignment lengths. In benchmark tests, the GAAS method was robust to both high percentages of unknown sequences and to variations in metagenomic sequence read lengths. Re-analysis of the Sargasso Sea virome using GAAS indicated that standard methodologies for metagenomic analysis may dramatically underestimate the abundance and importance of organisms with small genomes in environmental systems. Using GAAS, we conducted a meta-analysis of microbial and viral average genome lengths in over 150 metagenomes from four biomes to determine whether genome lengths vary consistently between and within biomes, and between microbial and viral communities from the same environment. Significant differences between biomes and within aquatic sub-biomes (oceans, hypersaline systems, freshwater, and microbialites) suggested that average genome length is a fundamental property of environments driven by factors at the sub-biome level. The behavior of paired viral and microbial metagenomes from the same environment indicated that microbial and viral average genome sizes are independent of each other, but indicative of community responses to stressors and environmental conditions.
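The two corrections named above, similarity weighting and length normalization, can be sketched as follows. This is a simplified illustration and not the GAAS implementation: it assumes each read comes with a list of (genome, weight) pairs already filtered for significance, and the way the weights are derived from BLAST statistics is left open.

```python
from collections import defaultdict


def relative_abundance(read_hits, genome_lengths):
    """Estimate relative genome abundance from per-read similarity hits.

    read_hits: one list per read of (genome, weight) pairs.
    genome_lengths: genome name -> length in bp.
    """
    totals = defaultdict(float)
    for hits in read_hits:
        weight_sum = sum(weight for _, weight in hits)
        if weight_sum <= 0:
            continue  # read with no usable similarity
        # Similarity weighting: each read contributes one unit in total,
        # shared among its hits in proportion to their weights.
        for genome, weight in hits:
            totals[genome] += weight / weight_sum
    # Length normalization: longer genomes attract more reads per copy.
    per_bp = {g: t / genome_lengths[g] for g, t in totals.items()}
    scale = sum(per_bp.values())
    if scale == 0:
        return {}
    return {g: v / scale for g, v in per_bp.items()}


def average_genome_length(abundance, genome_lengths):
    """Abundance-weighted mean genome length of the community."""
    return sum(abundance[g] * genome_lengths[g] for g in abundance)


lengths = {"small": 1000, "large": 2000}
hits = [[("small", 1.0)], [("large", 1.0)], [("large", 1.0)]]
abundance = relative_abundance(hits, lengths)
mean_length = average_genome_length(abundance, lengths)
```

In the toy data the large genome receives twice as many reads but is also twice as long, so after normalization the two genomes are estimated to be equally abundant. Without the normalization step the small genome would be underestimated, which is the bias the abstract describes.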
Abstract Paired end DNA sequencing provides additional information about the sequence data that is used in sequence assembly, mapping, and other downstream bioinformatics analysis. Paired end reads are usually provided as two fastq-format files, with each file representing one end of the read. Many commonly used downstream tools require that the sequence reads appear in each file in the same order, and reads that do not have a pair in the corresponding file are placed in a separate file of singletons. Although most sequencing instruments capable of generating paired end reads produce files where each read has a corresponding mate, many downstream bioinformatics manipulations break the one-to-one correspondence between reads, so that paired-end sequence files lose synchronicity and contain either unordered sequences or sequences in one file or the other without a mate. Trivial solutions to this problem require reading one or both of the DNA sequence files into memory, but they quickly become limited by computational resources for the moderate to large sequence files that are common nowadays. Here, we introduce a fast and memory efficient solution, written in C for portability, that synchronizes paired-end fastq files for subsequent analysis and places unmatched reads into singleton files. Fastq-pair is freely available from https://github.com/linsalrob/fastq-pair and is released under the MIT license.
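One memory-lean way to synchronize two fastq files is to index only the read identifiers and byte offsets of the first file, then stream the second file and look each mate up by seeking. The Python sketch below illustrates that general idea; it is not the fastq-pair C implementation, and the output file names, the four-lines-per-record assumption, and the `/1` and `/2` suffix handling are choices made for this example.

```python
def read_id(header):
    """Read name from a fastq header line, without any /1 or /2 suffix."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def read_record(fh):
    """Read one four-line fastq record (no wrapped sequence lines)."""
    return [fh.readline() for _ in range(4)]


def index_fastq(path):
    """Map read id -> byte offset of its record; sequences stay on disk."""
    index = {}
    with open(path, "rb") as fh:
        while True:
            pos = fh.tell()
            record = read_record(fh)
            if not record[0]:
                break
            index[read_id(record[0].decode())] = pos
    return index


def pair_fastq(left, right, prefix):
    """Write paired reads in the order of the right-hand file, and put
    reads without a mate into singleton files. Returns the pair count."""
    index = index_fastq(left)
    pairs = 0
    with open(left, "rb") as lfh, open(right, "rb") as rfh, \
            open(prefix + ".left.paired.fq", "wb") as lp, \
            open(prefix + ".right.paired.fq", "wb") as rp, \
            open(prefix + ".left.single.fq", "wb") as ls, \
            open(prefix + ".right.single.fq", "wb") as rs:
        while True:
            record = read_record(rfh)
            if not record[0]:
                break
            pos = index.pop(read_id(record[0].decode()), None)
            if pos is None:
                rs.writelines(record)
            else:
                lfh.seek(pos)
                lp.writelines(read_record(lfh))
                rp.writelines(record)
                pairs += 1
        # Whatever is left in the index never found a mate.
        for pos in index.values():
            lfh.seek(pos)
            ls.writelines(read_record(lfh))
    return pairs
```

Memory use scales with the number of reads in the first file rather than with the total sequence length, at the cost of random seeks into that file.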
<p>PDF file - 83K, Table 1-list of miRs tested for Figure 1G. Table 2 shows scores for distribution of the permuted test statistic of miR-23a and miR-27a in dataset GSE14333.</p>
Environmental DNA (eDNA), elemental and mineralogical analyses of soil have been shown to be specific to their source material, prompting consideration of using the airborne fraction of soil (dust) for forensic intelligence work. Dust is ubiquitous in the environment and is easily transferred to items belonging to a person of interest, making dust analysis an ideal tool in forensic casework. The advent of Massively Parallel Sequencing technologies means metabarcoding of eDNA can uncover bacterial, fungal, and even plant genetic fingerprints in dust particles. Combining this with elemental and mineralogical compositions offers multiple, complementary lines of evidence for tracing the origin of an unknown dust sample. This is particularly pertinent when recovering dust from a person of interest to ascertain where they may have travelled. Prior to proposing dust as a forensic trace material, however, the optimum sampling protocols and detection limits need to be established to place parameters around its utility in this context. We tested several approaches to collecting dust from different materials and determined the lowest quantity of dust that could be analysed for eDNA, elemental composition and mineralogy, whilst still yielding results capable of distinguishing between sites. We found that fungal eDNA profiles could be obtained from multiple sample types and that tape lifts were the optimum collection method for discriminating between sites. We successfully recovered both fungal and bacterial eDNA profiles down to 3 mg of dust (the lowest tested quantity) and recovered elemental and mineralogical compositions for all tested sample quantities. We show that dust can be reliably recovered from different sample types, using different sampling techniques, and that fungi and bacteria, as well as elemental and mineralogical profiles, can be generated from small sample quantities, highlighting the utility of dust for forensic intelligence.
The bacterial diseases black leg and soft rot cause heavy losses of potatoes worldwide. Bacteria within the family Pectobacteriaceae are the causative agents of black leg and soft rot. The use of antibiotics in agriculture is heavily regulated and no other effective treatment currently exists, but bacteriophages (phages) have shown promise as potential biocontrol agents. In this study we isolated soft rot bacteria from potato tubers and plant tissue displaying soft rot or black leg symptoms collected in Danish fields. We then used the isolated bacterial strains as hosts for phage isolation. Using organic waste, we isolated phages targeting different species within Pectobacterium. Here we focus on seven of these phages representing a new genus primarily targeting P. brasiliense: phages Ymer, Amona, Sabo, Abuela, Koroua, Taid and Pappous. A TEM image of phage Ymer showed Siphovirus morphology, and the Ymer genus belongs to the class Caudoviricetes, with double-stranded DNA genomes varying from 39 kb to 43 kb. In silico host range prediction using a CRISPR-Cas spacer database suggested P. brasiliense, P. polaris and P. versatile as natural hosts for phages within the Ymer genus. A subsequent host range experiment, using 47 bacterial isolates from Danish tubers and plants symptomatic with soft rot or black leg disease, verified the in silico host range prediction, as the Ymer genus as a group was able to infect all three Pectobacterium species. The phages did, however, primarily target P. brasiliense isolates and displayed differences in host range even at the species level. Two of the phages were able to infect two or more Pectobacterium species. Despite showing no nucleotide similarity to any phages in the NCBI database, the Ymer genus did share some similarity at the protein level, as well as gene synteny, with currently known phages. None of the phages encoded integrases or other genes typically associated with lysogeny.
Similarly, no virulence factors or antimicrobial resistance genes were found. Combined with the phages' ability to infect several soft rot-causing Pectobacterium species from Danish fields, this demonstrates their potential as biocontrol agents against soft rot and black leg diseases in potatoes.
The remarkable advance in sequencing technology and the rising interest in medical and environmental microbiology, biotechnology, and synthetic biology resulted in a deluge of published microbial genomes. Yet, genome annotation, comparison, and modeling remain a major bottleneck to the translation of sequence information into biological knowledge, hence computational analysis tools are continuously being developed for rapid genome annotation and interpretation. Among the earliest, most comprehensive resources for prokaryotic genome analysis, the SEED project, initiated in 2003 as an integration of genomic data and analysis tools, now contains >5,000 complete genomes, a constantly updated set of curated annotations embodied in a large and growing collection of encoded subsystems, a derived set of protein families, and hundreds of genome-scale metabolic models. Until recently, however, maintaining current copies of the SEED code and data at remote locations has been a pressing issue. To allow high-performance remote access to the SEED database, we developed the SEED Servers (http://www.theseed.org/servers): four network-based servers intended to expose the data in the underlying relational database, support basic annotation services, offer programmatic access to the capabilities of the RAST annotation server, and provide access to a growing collection of metabolic models that support flux balance analysis. The SEED servers offer open access to regularly updated data, the ability to annotate prokaryotic genomes, the ability to create metabolic reconstructions and detailed models of metabolism, and access to hundreds of existing metabolic models. This work offers and supports a framework upon which other groups can build independent research efforts. 
Large integrations of genomic data represent one of the major intellectual resources driving research in biology, and programmatic access to the SEED data will provide significant utility to a broad collection of potential users.
Supplementary Table 1 from <i>NOTCH</i> Signaling Is Required for Formation and Self-Renewal of Tumor-Initiating Cells and for Repression of Secretory Cell Differentiation in Colon Cancer
Abstract For any given bacteriophage genome or phage sequences in metagenomic data sets, we are unable to assign a function to 50-90% of genes. Structural protein-encoding genes constitute a large fraction of the average phage genome and are among the most divergent and difficult-to-identify genes using homology-based methods. To understand the functions encoded by phages, their contributions to their environments, and to help gauge their utility as potential phage therapy agents, we have developed a new approach to classify phage ORFs into ten major classes of structural proteins or into an “other” category. The resulting tool is named PhANNs (Phage Artificial Neural Networks). We built a database of 538,213 manually curated phage protein sequences that we split into eleven subsets (ten for cross-validation, one for testing) using a novel clustering method that ensures there are no homologous proteins between sets yet maintains the maximum sequence diversity for training. An Artificial Neural Network ensemble trained on features extracted from those sets reached a test F1-score of 0.875 and test accuracy of 86.2%. PhANNs can rapidly classify proteins into one of the ten classes, and non-phage proteins are classified as “other”, providing a new approach for functional annotation of phage proteins. PhANNs is open source and can be run from our web server or installed locally. Author Summary Bacteriophages (phages, viruses that infect bacteria) are the most abundant biological entity on Earth. They outnumber bacteria by a factor of ten. As phages differ greatly from one another and from bacteria, and we have comparatively few phage genes in our database, we are unable to assign function to 50-90% of phage genes. In this work, we developed PhANNs, a machine learning tool that can classify a phage gene as one of ten structural roles, or “other”. This approach does not require a similar gene to be known.
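The requirement that no homologous proteins are shared between subsets can be met by assigning whole homology clusters, rather than individual sequences, to subsets. The sketch below is a generic greedy version of that idea and not the clustering method developed for PhANNs; it assumes the cluster labels have already been computed by some other tool, and all names are hypothetical.

```python
def split_by_cluster(cluster_of, n_subsets=11):
    """Distribute sequences over n_subsets so that all members of a
    homology cluster end up in the same subset.

    cluster_of: sequence id -> precomputed cluster label.
    Returns a list of n_subsets lists of sequence ids.
    """
    clusters = {}
    for seq_id, label in cluster_of.items():
        clusters.setdefault(label, []).append(seq_id)

    subsets = [[] for _ in range(n_subsets)]
    # Place the largest clusters first, each into the currently smallest
    # subset, which keeps the subset sizes roughly balanced.
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(subsets, key=len).extend(members)
    return subsets


# Toy input: six sequences in three clusters, split into two subsets.
labels = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B", "p6": "C"}
parts = split_by_cluster(labels, n_subsets=2)
```

Keeping clusters intact means a model cannot score well on the held-out subset simply by recognizing near-duplicates of its training sequences.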