Studies on Distributed Approaches for Large Scale Multi-Criteria Protein Structure Comparison and Analysis
2011
Protein Structure Comparison (PSC) is at the core of many important structural biology problems. PSC is used to infer the evolutionary history of distantly related proteins; it can also help in the identification of the biological function of a new protein by comparing it with other proteins whose function has already been annotated; PSC is also a key step in protein structure prediction, because one needs to reliably and efficiently compare tens or hundreds of thousands of decoys (predicted structures) in evaluation of 'native-like' candidates (e.g. Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment). Each of these applications, as well as many others where molecular comparison plays an important role, requires a different notion of similarity, which naturally lead to the Multi-Criteria Protein Structure Comparison (MC-PSC) problem. ProCKSI (www.procksi.org), was the first publicly available server to provide algorithmic solutions for the MC-PSC problem by means of an enhanced structural comparison that relies on the principled application of information fusion to similarity assessments derived from multiple comparison methods (e.g. USM, FAST, MaxCMO, DaliLite, CE and TMAlign). Current MC-PSC works well for moderately sized data sets and it is time consuming as it provides public service to multiple users. Many of the structural bioinformatics applications mentioned above would benefit from the ability to perform, for a dedicated user, thousands or tens of thousands of comparisons through multiple methods in real-time, a capacity beyond our current technology.
This research is aimed at the investigation of Grid-styled distributed computing strategies for the solution of the enormous computational challenge inherent in MC-PSC. To this aim a novel distributed algorithm has been designed, implemented and evaluated with different load balancing strategies and selection and configuration of a variety of software tools, services and technologies on different levels of infrastructures ranging from local testbeds to production level eScience infrastructures such as the National Grid Service (NGS). Empirical results of different experiments reporting on the scalability, speedup and efficiency of the overall system are presented and discussed along with the software engineering aspects behind the implementation of a distributed solution to the MC-PSC problem based on a local computer cluster as well as with a GRID implementation. The results lead us to conclude that the combination of better and faster parallel and distributed algorithms with more similarity comparison methods provides an unprecedented advance on protein structure comparison and analysis technology. These advances might facilitate both directed and fortuitous discovery of protein similarities, families, super-families, domains, etc, and also help pave the way to faster and better protein function inference, annotation and protein structure prediction and assessment thus empowering the structural biologist to do a science that he/she would not have done otherwise.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
258
References
1
Citations
NaN
KQI