Metabolic models have the potential to impact genome annotation and the interpretation of gene expression and other high-throughput genome data. The genome of Streptomyces coelicolor has been sequenced, and some 30% of the open reading frames (ORFs) lack any functional annotation. A recently constructed metabolic network model for S. coelicolor highlights biochemical functions that should exist to make the metabolic model complete and consistent. These include 205 reactions with which no ORF is associated. Here we combine protein functional predictions for the unannotated ORFs in the genome with 'missing but expected' functions inferred from the metabolic model. The approach allows function predictions to be evaluated in the context of the biochemical pathway reconstruction and to feed back iteratively into the metabolic model. We describe the approach and discuss a few illustrative examples.
Fold recognition methods aim to use the information in the known protein structures (the targets) to identify that the sequence of a protein of unknown structure (the probe) will adopt a known fold. This paper highlights that the structural similarities sought by these methods can be divided into two types: remote homologues and analogues. Homologues are the result of divergent evolution and often share a common function. We define remote homologues as those that are not easily detectable by sequence comparison methods alone. Analogues do not have a common ancestor and generally do not have a common function. Several sets of empirical matrices for residue substitution, secondary structure conservation and residue accessibility conservation have previously been derived from aligned pairs of remote homologues and analogues (Russell et al., J. Mol. Biol., 1997, 269, 423-439). Here a method for fold recognition, FOLDFIT, is introduced that uses these matrices to match the sequences, secondary structures and residue accessibilities of the probe and target. The approach is evaluated on distinct datasets of analogous and remotely homologous folds. The accuracy of FOLDFIT with the different matrices on the two datasets is compared with results from another fold recognition method (THREADER) and with searches using mutation matrices in the absence of any structural information. FOLDFIT identifies at top rank 12 out of 18 remotely homologous folds and five out of nine analogous folds. The average alignment accuracies for residue and secondary structure equivalencing are much higher for homologous folds (residue approximately 42%, secondary structure approximately 78%) than for analogous folds (approximately 12% and approximately 47%, respectively). Sequence searches alone can be successful for several homologues in the testing sets but nearly always fail for the analogues. These results suggest that the recognition of analogous and remotely homologous folds should be assessed separately.
This study has implications for the development and comparative evaluation of fold recognition algorithms.
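The matching step described in this abstract can be illustrated with a minimal Python sketch. It assumes that each probe and target position is a (residue, secondary structure, accessibility class) triple, that the three lookup tables are simply summed for each aligned pair, and that a plain global alignment with a linear gap penalty is used. The function names, the gap penalty and any matrix values supplied by the caller are hypothetical; this is not the FOLDFIT implementation, and the published matrices of Russell et al. are not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

# One position: (residue, secondary-structure state, accessibility class).
# For the probe, the structural states would have to be predicted (assumption).
Position = Tuple[str, str, str]
Matrix = Dict[Tuple[str, str], float]


def position_score(p: Position, t: Position,
                   res_m: Matrix, ss_m: Matrix, acc_m: Matrix) -> float:
    """Sum the three matrix scores for one probe/target pair (missing pairs score 0)."""
    return (res_m.get((p[0], t[0]), 0.0)
            + ss_m.get((p[1], t[1]), 0.0)
            + acc_m.get((p[2], t[2]), 0.0))


def alignment_score(probe: Sequence[Position], target: Sequence[Position],
                    res_m: Matrix, ss_m: Matrix, acc_m: Matrix,
                    gap: float = -1.0) -> float:
    """Best global alignment score (Needleman-Wunsch, linear gap penalty)."""
    n, m = len(probe), len(target)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + position_score(
                probe[i - 1], target[j - 1], res_m, ss_m, acc_m)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


def rank_targets(probe: Sequence[Position],
                 targets: Dict[str, Sequence[Position]],
                 res_m: Matrix, ss_m: Matrix, acc_m: Matrix,
                 gap: float = -1.0) -> List[str]:
    """Return target names ordered from best to worst raw alignment score."""
    scored = [(alignment_score(probe, positions, res_m, ss_m, acc_m, gap), name)
              for name, positions in targets.items()]
    return [name for _, name in sorted(scored, key=lambda x: (-x[0], x[1]))]
```

Raw scores are not normalised for target length in this sketch, so a real comparison across folds of different sizes would need some form of score normalisation.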
Combining multiple evidence types from different information sources has the potential to reveal new relationships in biological systems. The integrated information can be represented as a relationship network, and clustering the network can suggest possible functional modules. The value of such modules for gaining insight into the underlying biological processes depends on their functional coherence. The challenge that we wish to address is to define and quantify the functional coherence of modules in relationship networks, so that they can be used to infer the function of as yet unannotated proteins, to discover previously unknown roles of proteins in diseases, and to better understand the regulation of, and interrelationships between, different elements of complex biological systems. We have defined the functional coherence of modules with respect to the Gene Ontology (GO) by considering two complementary aspects: (i) the fragmentation of the GO functional categories into the different modules and (ii) the most representative functions of the modules. We have proposed a set of metrics to evaluate these two aspects and demonstrated their utility in Arabidopsis thaliana. We selected 2355 proteins for which experimentally established protein-protein interaction (PPI) data were available. From these we have constructed five relationship networks: four based on single types of data (PPI, co-expression, co-occurrence of protein names in scientific literature abstracts, and sequence similarity) and a fifth combining these four evidence types. The ability of these networks to suggest biologically meaningful groupings of proteins was explored by applying Markov clustering and then measuring the functional coherence of the clusters. Relationship networks integrating multiple evidence types are biologically informative and allow more proteins to be assigned to a putative functional module.
Using additional evidence types concentrates the functional annotations in a smaller number of modules without unduly compromising their consistency. These results indicate that integrating more data sources improves the ability to uncover functional associations between proteins, both by allowing more proteins to be linked and by producing a network whose modular structure more closely reflects the hierarchy in the Gene Ontology.
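The two aspects of functional coherence named in this abstract can be made concrete with a minimal Python sketch. The abstract does not give the formulas of the proposed metrics, so the two measures below are illustrative stand-ins rather than the authors' definitions: for each GO term, the number of modules it is spread over and the share of its annotated proteins that fall in a single module; and for each module, its most frequent GO term with the fraction of members carrying it. The function names and the input layout (module id to protein list, protein to set of GO terms) are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def term_fragmentation(modules: Dict[str, Iterable[str]],
                       annotations: Dict[str, Set[str]]
                       ) -> Dict[str, Tuple[int, float]]:
    """For each GO term: (modules containing it, largest single-module share)."""
    per_term: Dict[str, Counter] = defaultdict(Counter)
    for module_id, proteins in modules.items():
        for protein in proteins:
            for term in annotations.get(protein, set()):
                per_term[term][module_id] += 1
    result = {}
    for term, counts in per_term.items():
        total = sum(counts.values())
        result[term] = (len(counts), max(counts.values()) / total)
    return result


def representative_term(proteins: Iterable[str],
                        annotations: Dict[str, Set[str]]
                        ) -> Optional[Tuple[str, float]]:
    """Most frequent GO term in a module and the fraction of members carrying it."""
    members = list(proteins)
    counts = Counter(term for p in members
                     for term in annotations.get(p, set()))
    if not counts:
        return None  # no annotated member in this module
    # Ties are broken alphabetically so the result is deterministic.
    term, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return term, n / len(members)
```

A term that scores (1, 1.0) is fully concentrated in one module, while a high module count with a low share indicates a fragmented category. This flat count ignores the GO hierarchy, which a fuller treatment would need to take into account.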
Pseudomonas aeruginosa is a genetically complex bacterium that can adopt and switch between a free-living and a biofilm lifestyle, a versatility that enables it to thrive in many different environments and contributes to its success as a human pathogen. Transcriptomes derived from growth states relevant to the lifestyle of P. aeruginosa were clustered using three different methods (K-means, K-means spectral and hierarchical clustering). The culture conditions used for this study were biofilms incubated for 8, 14, 24 and 48 h, and planktonic culture (logarithmic and stationary phase). This cluster analysis revealed distinct expression profiles in the dataset and illustrated them clearly. Moreover, it gave an insight into which genes are up-regulated in planktonic, developing biofilm and confluent biofilm states. In addition, this analysis confirmed the contribution of quorum sensing (QS) and RpoS-regulated genes to the biofilm mode of growth, and enabled the identification of a 60.69 kbp region of the genome associated with stationary phase growth (stationary phase planktonic culture and confluent biofilms). This is the first study to use clustering to separate a large P. aeruginosa microarray dataset, consisting of transcriptomes obtained from diverse conditions relevant to its growth, into different expression profiles. These distinct expression profiles not only reveal novel aspects of P. aeruginosa gene expression but also provide a growth-specific transcriptomic reference dataset for the research community.
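As an illustration of one of the three clustering methods named in this abstract, the following is a minimal K-means sketch in plain Python. It assumes each gene is represented by a list of expression values, one per culture condition, and it seeds the centroids with the first k profiles so that the result is reproducible. The function names and the seeding rule are assumptions; the abstract does not state which software or settings were used, and spectral and hierarchical clustering are not covered here.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(profiles: Sequence[Sequence[float]], k: int,
            iterations: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Cluster expression profiles; returns (label per profile, centroids)."""
    if not 1 <= k <= len(profiles):
        raise ValueError("k must be between 1 and the number of profiles")
    # Deterministic seeding (assumption): the first k profiles are the start centroids.
    centroids = [list(p) for p in profiles[:k]]
    labels: List[int] = []
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centroids
```

In practice, expression values are usually normalised per gene before clustering, and K-means is typically run from several random starts because the result depends on the initial centroids.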
The human network of protein-protein interactions (PPIs), the interactome, provides information on biological systems that can be used to aid prediction of protein function and disease association. Because some classes of protein are the focus of much study, data sets may contain bias, which may affect the results of network analyses. Proteins implicated in cancer, and proteins that include significant known mediators of cardiovascular disease (CVD), display a tendency to play a central role in a previously constructed interactome. However, removing possible bias in the interactome by only considering interactions obtained from non-targeted approaches affects the significance of the findings.
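The bias check described in this abstract can be sketched in Python using degree, the simplest centrality measure. The sketch assumes each interaction is recorded as a (protein, protein, detection approach) triple and recomputes each protein's number of distinct partners after restricting the network to chosen approaches. The function name and the approach labels in the usage note are hypothetical, and the abstract does not state which centrality measures or statistical tests were used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Interaction = Tuple[str, str, str]  # (protein A, protein B, detection approach)


def degree_by_protein(interactions: Iterable[Interaction],
                      allowed: Optional[Set[str]] = None) -> Dict[str, int]:
    """Count distinct interaction partners per protein.

    If `allowed` is given, only interactions whose detection approach is in
    that set are kept, e.g. to exclude targeted, hypothesis-driven studies.
    """
    partners: Dict[str, Set[str]] = defaultdict(set)
    for a, b, approach in interactions:
        if allowed is not None and approach not in allowed:
            continue
        if a == b:
            continue  # self-interactions do not add a partner
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}
```

Comparing `degree_by_protein(data)` with `degree_by_protein(data, {"screen"})`, where "screen" is a hypothetical label for non-targeted approaches, shows how much of a protein's apparent centrality comes from targeted study.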
Biomedical informatics has traditionally adopted a linear view of the informatics process (collect, store and analyse) in translational medicine (TM) studies, focusing primarily on the challenges in data integration and analysis. However, a data management challenge presents itself with the new lifecycle view of data emphasized by the recent calls for data re-use, long-term data preservation, and data sharing. There is currently a lack of dedicated infrastructure focused on the ‘manageability’ of the data lifecycle in TM research between data collection and analysis. Current community efforts towards establishing a culture of open science prompt the creation of a data custodianship environment for the management of TM data assets to support data re-use and reproducibility of research results. Here we present the development of a lifecycle-based methodology to create a metadata management framework based on community-driven standards for standardisation, consolidation and integration of TM research data. Based on this framework, we also present the development of a new platform (PlatformTM) focused on managing the lifecycle of translational research data assets.
The cervicovaginal environment in pregnancy is proposed to influence the risk of spontaneous preterm birth. The environment is shaped both by the resident microbiota and by local inflammation driven by the host response (epithelia, immune cells and mucus). The contributions of the microbiota, metabolome and host defence peptides have been investigated, but less is known about the immune cell populations and how they may respond to the vaginal environment. Here we investigated the maternal immune cell populations at the cervicovaginal interface in early to mid-pregnancy (10-24 weeks of gestation, samples from N = 46 women). We confirmed neutrophils as the predominant cell type and characterised associations between the cervical neutrophil transcriptome and the cervicovaginal metagenome (N = 9 women). In this exploratory study, the neutrophil cell proportion was affected by gestation at sampling but not by birth outcome or ethnicity. Following RNA sequencing (RNA-seq) of a subset of neutrophil-enriched cells, principal component analysis of the transcriptome profiles indicated that cells from seven women clustered closely together; these women had a less diverse cervicovaginal microbiota than the remaining three women. Expression of genes involved in neutrophil-mediated immunity, activation, degranulation and other immune functions correlated negatively with Gardnerella vaginalis abundance and positively with Lactobacillus iners abundance, both microbes previously associated with birth outcome. The finding that neutrophils are the dominant immune cell type in the cervix during pregnancy, and that the cervical neutrophil transcriptome of pregnant women may be modified in response to the microbial cervicovaginal environment, or vice versa, establishes the rationale for investigating associations between the innate immune response, cervical shortening and spontaneous preterm birth, and the underlying mechanisms.
Ratko Djukanović has consulted and presented at symposia organised by TEVA, Novartis, GlaxoSmithKline and AstraZeneca and has shares in and consults for Synairgen; Dr Asa Wheelock reports remuneration from AstraZeneca and Harvard Medical School for speaking engagements on SNF-clustering in COPD; Charles Auffray reports grants from the Innovative Medicines Initiative (IMI); Kian Fan Chung has received honoraria for participating in Advisory Board meetings of the pharmaceutical industry regarding treatments for asthma and chronic obstructive pulmonary disease and has also been remunerated for speaking engagements; Ian Adcock has received grants from Advisory Board meetings with pharmaceutical companies GSK, A-Z, Novartis, Boehringer Ingelheim and Vectura, and grants on asthma and COPD from Pfizer, GSK, MRC, EU, BI and IMI; Peter Sterk reports grants from the IMI during the conduct of the study; Matthew Loza and Frederic Baribaud are employees and shareholders of Janssen Research and Development, a Johnson and Johnson company; John Riley and Ana R Sousa are employees of GSK; the rest of the authors have nothing to disclose.