Structural alignment

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for comparing proteins with low sequence similarity, where evolutionary relationships cannot be easily detected by standard sequence alignment techniques. It can therefore be used to infer evolutionary relationships between proteins that share very little common sequence. However, the results should be treated with caution as evidence of shared evolutionary ancestry, because convergent evolution can lead multiple unrelated amino acid sequences to converge on a common tertiary structure.
Structural alignments can compare two sequences or multiple sequences. Because these alignments rely on information about the three-dimensional conformations of all the query sequences, the method can only be used on sequences whose structures are known. These structures are usually determined by X-ray crystallography or NMR spectroscopy. A structural alignment can also be performed on structures produced by structure prediction methods. Indeed, evaluating such predictions often requires a structural alignment between the model and the true known structure to assess the model's quality. Structural alignments are especially useful in analyzing data from structural genomics and proteomics efforts, and they can be used as comparison points to evaluate alignments produced by purely sequence-based bioinformatics methods.
The outputs of a structural alignment are a superposition of the atomic coordinate sets and a minimal root mean square deviation (RMSD) between the structures. The RMSD of two aligned structures indicates their divergence from one another. Structural alignment can be complicated by the existence of multiple protein domains within one or more of the input structures, because changes in the relative orientation of the domains between the two structures to be aligned can artificially inflate the RMSD.
The minimum information produced by a successful structural alignment is a set of residues that are considered equivalent between the structures. This set of equivalences is then typically used to superpose the three-dimensional coordinates of each input structure. (Note that one input structure may be fixed as a reference, in which case its superposed coordinates do not change.)
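Given a set of equivalent residues, the superposition and RMSD described above can be sketched as follows. This is a minimal pure-Python illustration, not the method of any particular program: it assumes the two coordinate lists are already paired by index (one point per equivalent residue, for example alpha carbons), and it uses a quaternion-based least-squares fit (Horn's closed-form formulation) with the dominant eigenvector found by simple power iteration. The function names `centroid`, `rmsd`, and `superpose`, the iteration count, and the toy coordinates are all hypothetical choices made for the example.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def rmsd(a, b):
    """Root mean square deviation between two index-paired coordinate lists."""
    total = sum((p[k] - q[k]) ** 2 for p, q in zip(a, b) for k in range(3))
    return math.sqrt(total / len(a))

def superpose(mobile, reference, iterations=500):
    """Least-squares fit of `mobile` onto the fixed `reference` (paired by index)."""
    cm, cr = centroid(mobile), centroid(reference)
    m = [[p[k] - cm[k] for k in range(3)] for p in mobile]
    r = [[p[k] - cr[k] for k in range(3)] for p in reference]
    # S[a][b] = sum over pairs of (mobile coordinate a) * (reference coordinate b)
    S = [[sum(mi[a] * ri[b] for mi, ri in zip(m, r)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    # Symmetric 4x4 matrix whose top eigenvector is the optimal unit quaternion
    N = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shift by the Frobenius norm so all eigenvalues are non-negative; power
    # iteration then converges to the algebraically largest eigenvalue.
    shift = math.sqrt(sum(v * v for row in N for v in row))
    q = [1.0, 0.5, 0.25, 0.125]  # arbitrary start vector (assumption)
    for _ in range(iterations):
        q = [sum(N[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    w, x, y, z = q
    # Rotation matrix corresponding to the unit quaternion (w, x, y, z)
    R = [
        [w*w + x*x - y*y - z*z, 2 * (x*y - w*z), 2 * (x*z + w*y)],
        [2 * (y*x + w*z), w*w - x*x + y*y - z*z, 2 * (y*z - w*x)],
        [2 * (z*x - w*y), 2 * (z*y + w*x), w*w - x*x - y*y + z*z],
    ]
    # Rotate the centred mobile points, then move them onto the reference centroid
    return [tuple(sum(R[a][b] * p[b] for b in range(3)) + cr[a] for a in range(3))
            for p in m]

# Toy data: the reference is the mobile set rotated 90 degrees about z and shifted.
mobile = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0),
          (0.0, 1.5, 1.0), (3.0, 0.5, 2.0)]
reference = [(-y + 5.0, x - 3.0, z + 2.0) for x, y, z in mobile]
fitted = superpose(mobile, reference)
rmsd_before = rmsd(mobile, reference)
rmsd_after = rmsd(fitted, reference)
```

Here `reference` is held fixed, so only the `mobile` coordinates change. Power iteration is used only to keep the sketch dependency-free; it can converge slowly for nearly collinear point sets, and a production implementation would more likely call a linear algebra library for the eigenvector or use a singular value decomposition.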
The fitted structures can be used to calculate mutual RMSD values, as well as more sophisticated measures of structural similarity such as the global distance test (GDT, the metric used in CASP). The structural alignment also implies a corresponding one-dimensional sequence alignment from which a sequence identity, the percentage of aligned residues that are identical between the input structures, can be calculated as a measure of how closely the two sequences are related.
Because protein structures are composed of amino acids whose side chains are linked by a common protein backbone, a number of different subsets of the atoms that make up a protein macromolecule can be used in producing a structural alignment and calculating the corresponding RMSD values. When aligning structures with very different sequences, the side-chain atoms are generally not taken into account because their identities differ between many aligned residues. For this reason it is common for structural alignment methods to use by default only the backbone atoms included in the peptide bond. For simplicity and efficiency, often only the alpha carbon positions are considered, since the peptide bond has a minimally variant planar conformation. Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions, in which case the RMSD reflects not only the conformation of the protein backbone but also the rotameric states of the side chains. Other comparison criteria that reduce noise and bolster positive matches include secondary structure assignment, native contact maps or residue interaction patterns, measures of side-chain packing, and measures of hydrogen bond retention.
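The two derived measures mentioned above, a GDT-style score and sequence identity, can be illustrated with a short sketch. It rests on simplifying assumptions: the GDT-like function scores a single fixed superposition using the 1, 2, 4, and 8 Å cutoffs of GDT_TS, whereas the actual GDT procedure searches over many superpositions to maximize the count at each cutoff; and sequence identity is computed over aligned (non-gap) columns only, which is one of several conventions in use. The function names and example inputs are hypothetical.

```python
def gdt_ts_fixed_superposition(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-style score (0-100) from per-residue distances in angstroms.

    `distances` holds the distance between each pair of equivalent atoms
    after ONE superposition; no search over superpositions is performed.
    """
    if not distances:
        raise ValueError("need at least one distance")
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

# Five equivalent residues: 1, 2, 3 and 4 of them fall under the four cutoffs.
score = gdt_ts_fixed_superposition([0.5, 1.5, 3.0, 6.0, 10.0])

# Six aligned columns (one gapped column is skipped), five of them identical.
identity = sequence_identity("ACDE-FG", "ACNEHFG")
```

Unlike RMSD, a cutoff-based score of this kind is not dominated by a few badly fitting residues, which is one reason such measures are preferred when only part of a model matches the reference.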
The most basic possible comparison between protein structures, structural superposition, makes no attempt to align the input structures; it requires a precalculated alignment as input to determine which residues are to be considered in the RMSD calculation. Structural superposition is commonly used to compare multiple conformations of the same protein (in which case no alignment is necessary, since the sequences are the same) and to evaluate the quality of alignments produced using only sequence information between two or more sequences whose structures are known. This method traditionally uses a simple least-squares fitting algorithm, in which the optimal rotations and translations are found by minimizing the sum of the squared distances among all structures in the superposition. More recently, maximum likelihood and Bayesian methods have greatly increased the accuracy of the estimated rotations, translations, and covariance matrices for the superposition.
Algorithms based on multidimensional rotations and modified quaternions have been developed to identify topological relationships between protein structures without the need for a predetermined alignment. Such algorithms have successfully identified canonical folds such as the four-helix bundle. The SuperPose method is sufficiently extensible to correct for relative domain rotations and other structural pitfalls.
The optimal 'threading' of a protein sequence onto a known structure and the production of an optimal multiple sequence alignment have been shown to be NP-complete. However, this does not imply that the structural alignment problem is NP-complete. Strictly speaking, an optimal solution to the protein structure alignment problem is only known for certain protein structure similarity measures, such as the measures used in protein structure prediction experiments, GDT_TS and MaxSub.
These measures can be rigorously optimized using an algorithm capable of maximizing the number of atoms in two proteins that can be superimposed under a predefined distance cutoff. Unfortunately, the algorithm for the optimal solution is not practical, since its running time depends not only on the lengths but also on the intrinsic geometry of the input proteins. Approximate polynomial-time algorithms for structural alignment have been developed that produce a family of 'optimal' solutions within an approximation parameter for a given scoring function. Although these algorithms theoretically classify the approximate protein structure alignment problem as 'tractable', they are still computationally too expensive for large-scale protein structure analysis. As a consequence, practical algorithms that converge to the global solutions of the alignment, given a scoring function, do not exist. Most algorithms are therefore heuristic, although practical algorithms that guarantee convergence to at least local maximizers of the scoring function have been developed.
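To make the idea of a heuristic, locally optimal method more concrete, the sketch below shows one step that such heuristics can use: with the superposition held fixed, dynamic programming finds the order-preserving set of equivalences that maximizes a distance-based score. Alternating this step with a least-squares refit on the new equivalences yields an iterative scheme that can settle into a local optimum. This is an illustration under stated assumptions, not a reconstruction of any specific published algorithm: the scoring function, the values `d0=3.0` and `cutoff=5.0` (in Å), the zero gap penalty, and the function name are all hypothetical.

```python
def align_fixed_superposition(a, b, d0=3.0, cutoff=5.0, gap=0.0):
    """Order-preserving equivalences between already-superposed point lists.

    Returns (pairs, total_score), where pairs is a list of (i, j) indices
    into `a` and `b`. Pairs further apart than `cutoff` score negatively,
    so they are never selected.
    """
    n, m = len(a), len(b)

    def pair_score(i, j):
        d2 = sum((x - y) ** 2 for x, y in zip(a[i], b[j]))
        if d2 > cutoff * cutoff:
            return -1.0
        return 1.0 / (1.0 + d2 / (d0 * d0))

    # best[i][j]: best score using the first i points of a and first j of b
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + pair_score(i - 1, j - 1),
                             best[i - 1][j] - gap,
                             best[i][j - 1] - gap)

    # Trace back from the corner to recover which pairs were matched
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + pair_score(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, best[n][m]

# Toy example: chain_b matches chain_a but has one inserted residue (index 2).
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 0.5), (3.8, 0.0, 0.5), (5.7, 3.0, 0.5),
           (7.6, 0.0, 0.5), (11.4, 0.0, 0.5)]
pairs, total = align_fixed_superposition(chain_a, chain_b)
```

The dynamic programming step is exact only for the superposition it is given; the overall alternating scheme depends on the starting superposition, which is why methods of this kind are heuristic and offer at most local guarantees.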
