Mining the Structural Genomics Pipeline: Identification of Protein Properties that Affect High-throughput Experimental Analysis

2004 
Abstract
Structural genomics projects represent major undertakings that will change our understanding of proteins. They generate unique datasets that, for the first time, present a standardized view of proteins in terms of their physical and chemical properties. By analyzing these datasets, we are able to discover correlations between a protein's characteristics and its progress through each stage of the structural genomics pipeline, from cloning through expression and purification to, ultimately, structural determination. First, we use tree-based analyses (decision trees and random forest algorithms) to discover the protein features that most significantly influence a protein's amenability to high-throughput experimentation. Based on these analyses, we identify potential bottlenecks in various stages of the structural genomics process through specialized “pipeline schematics”. We find that the most significant properties of a protein are: (i) whether it is conserved across many organisms; (ii) the percentage composition of charged residues; (iii) the occurrence of hydrophobic patches; (iv) the number of binding partners it has; and (v) its length. Conversely, a number of other properties that might have been thought to be important, such as nuclear localization signals, are not significant. Thus, using our tree-based analyses, we are able to identify combinations of features that best differentiate the small group of proteins for which a structure has been determined from all the currently selected targets. This information may prove useful in optimizing high-throughput experimentation. Further information is available from http://mining.nesg.org/.
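As a rough illustration of how a tree-based analysis can rank protein features, the sketch below scores each feature by the largest drop in Gini impurity that a single threshold split on that feature achieves. This is the split criterion commonly used when growing decision trees; random forests typically aggregate such impurity decreases over many randomized trees. It is not the authors' implementation: the function names, the feature names (`length`, `pct_charged`, `has_nls`), and all values are hypothetical toy data invented for the example, not results from the study.

```python
from typing import Dict, List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, impurity decrease) for the best single split on one feature."""
    parent = gini(labels)
    n = len(labels)
    best_thr, best_gain = float("nan"), 0.0
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= thr]
        right = [y for v, y in zip(values, labels) if v > thr]
        # Size-weighted impurity of the two child nodes.
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain


def rank_features(
    table: Dict[str, List[float]], labels: List[int]
) -> List[Tuple[str, float, float]]:
    """Sort features by the impurity decrease of their best single split."""
    scored = [(name, *best_split(col, labels)) for name, col in table.items()]
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Hypothetical toy data: 8 made-up targets, label 1 = structure determined.
solved = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "length": [120, 150, 180, 200, 450, 520, 610, 700],
    "pct_charged": [30, 22, 28, 18, 20, 26, 16, 24],
    "has_nls": [0, 1, 0, 1, 0, 1, 0, 1],
}

ranking = rank_features(features, solved)
for name, threshold, gain in ranking:
    print(f"{name}: threshold={threshold}, impurity decrease={gain:.3f}")
```

In this toy table, `length` separates the two groups cleanly and so ranks first, while `has_nls` offers no useful split. A full decision tree would apply the same scoring recursively within each branch, which is how combinations of features, rather than single features, come to define the groups.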