De novo protein structure prediction

In computational biology, de novo protein structure prediction refers to an algorithmic process by which a protein's tertiary structure is predicted from its amino acid primary sequence. The problem has occupied leading scientists for decades and remains unsolved. According to Science, it is one of the top 125 outstanding issues in modern science. At present, some of the most successful methods have a reasonable probability of predicting the folds of small, single-domain proteins to within 1.5 angstroms over the entire structure.
Figure: Primary structure of human artemin (isoform 1).
Figure: Tertiary structure of human artemin (PDB: 2GYR), rendered using PyMOL (DeLano Scientific freeware).
De novo methods tend to require vast computational resources and have thus been carried out only for relatively small proteins. De novo protein structure modeling is distinguished from template-based modeling (TBM) by the fact that no solved homologue of the protein of interest is used, which makes predicting protein structure from amino acid sequence exceedingly difficult.
Predicting protein structure de novo for larger proteins will require better algorithms and larger computational resources, such as those afforded by powerful supercomputers (such as Blue Gene or MDGRAPE-3) or distributed computing projects (such as Folding@home, Rosetta@home, the Human Proteome Folding Project, or Nutritious Rice for the World). Although the computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) to fields such as medicine and drug design make de novo structure prediction an active research field.
The gap between known protein sequences and confirmed protein structures is immense. At the beginning of 2008, only about 1% of the sequences listed in the UniProtKB database corresponded to structures in the Protein Data Bank (PDB), leaving a gap between sequence and structure of approximately five million. Experimental techniques for determining tertiary structure have faced serious bottlenecks for particular classes of proteins. For example, whereas X-ray crystallography has yielded structures for approximately 80,000 cytosolic proteins, it has been far less successful with membrane proteins, yielding approximately 280 structures. In light of these experimental limitations, devising efficient computer programs to close the gap between known sequence and structure is believed to be the only feasible option.
De novo protein structure prediction methods attempt to predict tertiary structures from sequences based on general principles that govern protein folding energetics and/or statistical tendencies of conformational features that native structures acquire, without the use of explicit templates. Research into de novo structure prediction has focused primarily on three areas: alternative lower-resolution representations of proteins, accurate energy functions, and efficient sampling methods.
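The three research areas just listed can be illustrated together with a deliberately tiny toy model. The sketch below assumes the two-dimensional hydrophobic-polar (HP) lattice simplification as the low-resolution representation, a contact count as the energy function, and Metropolis Monte Carlo as the sampling method. All function names (`build_coords`, `energy`, `sample_decoys`, `best_decoy`) and parameter values are hypothetical choices made for illustration; this is not the method of any particular prediction program, and real de novo predictors use far richer representations and scoring terms.

```python
import math
import random

# Lattice directions for a chain laid out on a 2D square grid.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def build_coords(moves):
    """Convert a move string into lattice coordinates.

    Returns None if the chain would visit the same site twice.
    """
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = coords[-1]
        nxt = (x + dx, y + dy)
        if nxt in coords:
            return None
        coords.append(nxt)
    return coords


def energy(sequence, coords):
    """Toy score: -1 for every H-H lattice contact between non-bonded residues."""
    index_at = {c: i for i, c in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                total -= 1
    return total


def sample_decoys(sequence, steps=5000, temperature=1.0, seed=0):
    """Metropolis sampling over move strings; returns every accepted (energy, moves) decoy."""
    rng = random.Random(seed)
    moves = ["R"] * (len(sequence) - 1)  # start from a fully extended chain
    current = energy(sequence, build_coords(moves))
    decoys = [(current, "".join(moves))]
    for _ in range(steps):
        trial = moves[:]
        trial[rng.randrange(len(trial))] = rng.choice("UDLR")
        trial_coords = build_coords(trial)
        if trial_coords is None:
            continue  # reject self-intersecting chains outright
        trial_energy = energy(sequence, trial_coords)
        accept = trial_energy <= current or rng.random() < math.exp(
            (current - trial_energy) / temperature
        )
        if accept:
            moves, current = trial, trial_energy
            decoys.append((current, "".join(moves)))
    return decoys


def best_decoy(decoys):
    """Select the lowest-energy decoy (ties broken by move string)."""
    return min(decoys)


if __name__ == "__main__":
    seq = "HPHPPHHPHH"  # arbitrary example sequence, not a real protein
    score, fold = best_decoy(sample_decoys(seq))
    print(score, fold)
```

Even in this toy setting the design trade-off is visible: the coarse lattice makes each energy evaluation cheap, but the number of self-avoiding chains still grows rapidly with sequence length, which is why sampling efficiency is a research area in its own right.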
A general paradigm for de novo prediction involves sampling conformation space, guided by scoring functions and other sequence-dependent biases, so that a large set of candidate ("decoy") structures is generated. Native-like conformations are then selected from these decoys using scoring functions as well as conformer clustering. High-resolution refinement is sometimes used as a final step to fine-tune native-like structures. There are two major classes of scoring functions. Physics-based functions are based on mathematical models describing aspects of the known physics of molecular interaction. Knowledge-based functions are formed with statistical models capturing aspects of the properties of native protein conformations.
Several lines of evidence support the notion that the primary protein sequence contains all the information required for the overall three-dimensional protein structure, which is what makes de novo prediction possible. First, proteins with different functions usually have different amino acid sequences. Second, several human diseases can be linked to altered or lost protein function resulting from a change in just a single amino acid in the primary sequence; sickle-cell disease, caused by a single substitution in the β-globin chain of hemoglobin, is a well-known example. Third, proteins with similar functions across many different species often have similar amino acid sequences. Ubiquitin, for example, is a protein involved in regulating the degradation of other proteins; its amino acid sequence is nearly identical in species as far separated as Drosophila melanogaster and Homo sapiens. Fourth, a thought experiment shows that protein folding cannot be a completely random process and that the information necessary for folding must be encoded within the primary structure. For example, if each of 100 amino acid residues within a small polypeptide could take up 10 different conformations on average, the polypeptide would have 10^100 different conformations.
If one possible conformation were tested every 10^-13 second, it would take 10^87 seconds, or about 3 × 10^79 years, to sample all possible conformations. However, proteins fold properly within the body on short timescales all the time, meaning that the process cannot be a random search and, thus, can potentially be modeled.
One of the strongest lines of evidence that all the information needed to specify protein tertiary structure is found in the primary sequence was provided in the 1950s by Christian Anfinsen. In a classic experiment, he showed that ribonuclease A could be entirely denatured by being placed in a solution of urea (to disrupt stabilizing noncovalent interactions) in the presence of a reducing agent (to cleave stabilizing disulfide bonds). Upon removal of the protein from this environment, the denatured and functionless ribonuclease spontaneously refolded and regained function, demonstrating that protein tertiary structure is encoded in the primary amino acid sequence. Had the disulfide bonds reformed randomly, over one hundred different combinations of four disulfide bonds could have formed. However, many proteins require the assistance of molecular chaperones within the cell for proper folding. The overall shape of a protein may be encoded in its amino acid sequence, but its folding may depend on chaperones.
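The two back-of-the-envelope figures used in this argument, the exhaustive-search time and the number of possible disulfide arrangements, can be checked with a few lines of arithmetic. The sketch below is only a worked calculation under the stated assumptions (100 residues, 10 conformations each, one conformation tested per 10^-13 second, a 365.25-day year, and eight cysteines forming the four disulfide bonds of ribonuclease A); the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumes a 365.25-day year


def exhaustive_search_years(residues=100, states_per_residue=10, seconds_per_test=1e-13):
    """Years needed to test every conformation once, one after another."""
    conformations = states_per_residue ** residues
    return conformations * seconds_per_test / SECONDS_PER_YEAR


def disulfide_pairings(cysteines):
    """Number of ways to pair all cysteines into disulfide bonds.

    For an even number n of cysteines this is (n - 1) * (n - 3) * ... * 1.
    """
    if cysteines % 2 != 0:
        raise ValueError("an even number of cysteines is required")
    count = 1
    for k in range(cysteines - 1, 0, -2):
        count *= k
    return count


if __name__ == "__main__":
    print(f"{exhaustive_search_years():.2e} years")
    print(disulfide_pairings(8), "possible pairings of 8 cysteines")
```

With these inputs the search time comes out at roughly 3 × 10^79 years, and eight cysteines can be paired in 7 × 5 × 3 × 1 = 105 ways, which matches the "over one hundred" combinations mentioned for ribonuclease A.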
