Exploring the human genome with functional maps

Curtis Huttenhower, Erin M. Haley, Matthew A. Hibbs, Vanessa Dumeaux, Daniel R. Barrett, Hilary A. Coller, Olga G. Troyanskaya

2009

The completion of the Human Genome Project and the subsequent flood of genomic data and analyses have provided a wealth of information regarding the entire catalog of human genes. Comprehensive assays of gene expression, protein binding, genetic interactions, and regulatory relationships all provide snapshots of molecular activity in specific cell types and environments, but turning these biomolecular parts lists into an understanding of pathways, processes, and systems biology has proven to be a challenging task. This abundance of data can sometimes obscure biological truths: the size of the human genome, the complexity of human tissue types and regulatory mechanisms, and the sheer amount of available data all contribute to the analytical complexity of understanding human functional genomics. Large collections of genomic data can be exploited only if they are integrated, summarized, and presented in a biologically informative manner.

We provide a means of mining tens of thousands of whole-genome experiments by way of functional maps. Each map represents a body of data, probabilistically weighted and integrated, focused on a particular biological question. These questions can include, for example, the function of a gene, the relationship between two pathways, or the processes disrupted in a genetic disorder. Functional integrations investigating individual genes' relationships have been successful with smaller data collections in less complex organisms (Lee et al. 2004; Date and Stoeckert Jr. 2006; Myers and Troyanskaya 2007), although (as discussed below) it is particularly challenging to scale these techniques up to the size and complexity of the human genome. Each functional map, based on an underlying predicted interaction network, summarizes an entire collection of genomic experimental results in a biologically meaningful way.

While functional maps can readily predict functions for uncharacterized genes (Murali et al. 2006), it is important to take advantage of the scale of available data to understand entire pathways and processes. Cross-talk and coregulation among pathways, processes, and genetic disorders can be mapped by analyzing the structure of underlying functional relationship networks. This includes the association of disease genes with (potentially causative) pathways; for example, many known breast cancer genes are involved in aspects of the cell cycle and DNA repair, and novel associations of this type can be mined from high-throughput data. Similarly, associations between distinct but interacting biological processes (e.g., mitosis and DNA replication) can be quantified by examining functional relationships between groups of genes, allowing the identification of proteins key to interprocess regulation.

The functional maps we provide for the human genome include information on protein function, associations between diseases, genes, and pathways, and cross-talk between biological processes. These are all based on probabilistic data integration using regularized naive Bayesian classifiers. Naive Bayesian systems have been used successfully to analyze protein–protein interaction (PPI) data (Rhodes et al. 2005; von Mering et al. 2007), whereas our focus is on functional relationships and the biological roles of gene products. Prior work performing functional integration in simpler organisms with smaller data collections (Date and Stoeckert Jr. 2006; Myers and Troyanskaya 2007) has been similarly successful; see Supplemental Text 1 for a complete discussion. Such integrations have not previously been scaled biologically (i.e., to complex metazoans) or computationally (over very large genomic data collections) to provide a functional view of the human genome driven purely by experimental results.
In addition to challenges of computational efficiency in the presence of hundreds of genome-scale data sets, naive classifiers assume that all input data sets are independent; this becomes increasingly untrue and problematic as more data sets are analyzed, resulting in a paradox of decreasing performance with increasing training data. To address this, we use Bayesian regularization (Steck and Jaakkola 2002), a process by which an observed distribution of data can be combined with a prior belief in a principled manner. Intuitively, this causes groups of data sets containing similar information to make a more modest contribution to the integration process, up-weights unique data sets, and prevents overconfident predictions. Our regularization of the naive classifier parameters using a score based on mutual information up- and down-weighted appropriate subsets of data, maintaining both efficiency and accuracy.

We applied our functional maps to a specific biological question in the area of autophagy, the process by which a cell can recycle its own biomass under conditions of starvation or stress (Klionsky 2007). Among many proteins predicted to participate in this biological process by our maps, we chose to investigate AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A in the laboratory. We demonstrated through multiple lines of experimental evidence that these proteins are indeed involved in macroautophagy, a specific type of autophagy in which bulk cytoplasm is lysosomally degraded, in amino acid-starved human fibroblasts.

The results of our integration are available through a web-based interface, HEFalMp (Human Experimental/Functional Mapper), at http://function.princeton.edu/hefalmp. This tool allows a user to interactively explore functional maps integrating evidence from thousands of genomic experiments, focusing as desired on specific genes, processes, or diseases of interest.
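
The regularized naive Bayesian integration described above can be illustrated with a small sketch. This is not the HEFalMp implementation: the text states only that the classifier parameters are regularized with a score based on mutual information, so the specific shrinkage rule here is an assumption. Each data set's conditional probability table is blended with a uniform prior using a weight that grows with its summed mutual information with the other data sets. The function names, the `scale` parameter, and the add-one smoothing are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def fit_regularized_nb(datasets, labels, n_bins, scale=1.0):
    """Fit P(bin | class) for each data set, shrunk toward a uniform prior.

    datasets: list of sequences of bin indices, one value per gene pair.
    labels:   0 (unrelated) or 1 (functionally related) per gene pair;
              both classes are assumed to be present.
    ASSUMPTION: the shrinkage weight is s / (1 + s), where s is `scale`
    times the data set's summed mutual information with all others.
    """
    n = len(labels)
    class_counts = [sum(1 for y in labels if y == c) for c in (0, 1)]
    prior = [k / n for k in class_counts]
    uniform = 1.0 / n_bins
    tables = []
    for i, data in enumerate(datasets):
        shared = sum(mutual_information(data, other)
                     for j, other in enumerate(datasets) if j != i)
        lam = scale * shared / (1.0 + scale * shared)
        table = []
        for c in (0, 1):
            counts = Counter(b for b, y in zip(data, labels) if y == c)
            # Add-one smoothing keeps every probability strictly positive.
            observed = [(counts[b] + 1) / (class_counts[c] + n_bins)
                        for b in range(n_bins)]
            # Redundant data sets (large lam) are pulled toward uniform.
            table.append([(1 - lam) * p + lam * uniform for p in observed])
        tables.append(table)
    return prior, tables


def posterior_related(prior, tables, observation):
    """Posterior probability of a functional relationship for one gene pair."""
    log_p = [math.log(prior[c]) for c in (0, 1)]
    for table, b in zip(tables, observation):
        for c in (0, 1):
            log_p[c] += math.log(table[c][b])
    top = max(log_p)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in log_p]
    return weights[1] / (weights[0] + weights[1])
```

In this sketch, two identical data sets share one bit of mutual information and are each shrunk halfway toward uniform at `scale=1.0`, while a data set that shares no information with the others keeps its observed distribution. Setting `scale=0.0` recovers an unregularized naive Bayesian classifier, which makes the overconfidence caused by redundant inputs easy to compare.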


