Pseudo amino acid composition

Pseudo amino acid composition, or PseAAC, was originally introduced by Kuo-Chen Chou (周国城) in 2001 to represent protein samples for improving protein subcellular localization prediction and membrane protein type prediction. Like the vanilla amino acid composition (AAC) method, it characterizes the protein mainly using a matrix of amino-acid frequencies, which helps when dealing with proteins that lack significant sequence homology to other proteins. Compared to AAC, additional information is also included in the matrix to represent some local features, such as the correlation between residues a certain distance apart.

To predict the subcellular localization of proteins and other attributes based on their sequence, two kinds of models are generally used to represent protein samples: (1) the sequential model, and (2) the non-sequential or discrete model. The most typical sequential representation of a protein sample is its entire amino acid (AA) sequence, which contains the most complete information about the protein. This is an obvious advantage of the sequential model. To get the desired results, sequence-similarity-search-based tools are usually used to conduct the prediction. Consider a protein sequence P with $L$ amino acid residues, i.e.,

$$\mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \cdots \mathrm{R}_L \qquad (1)$$

where $\mathrm{R}_1$ represents the 1st residue of the protein P, $\mathrm{R}_2$ the 2nd residue, and so forth.
This is the representation of the protein under the sequential model. However, this kind of approach fails when a query protein does not have significant homology to any known protein. Thus, various discrete models that do not rely on sequence order were proposed. The simplest discrete model uses the amino acid composition (AAC) to represent protein samples. Under the AAC model, the protein P of Eq. 1 can also be expressed as

$$\mathbf{P} = [f_1, f_2, \cdots, f_{20}]^{\mathbf{T}} \qquad (2)$$

where $f_u$ ($u = 1, 2, \cdots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in P, and $\mathbf{T}$ is the transposing operator. The AAC of a protein is trivially derived when its primary structure is known, as in Eq. 1; it can also be obtained by hydrolysis without knowing the exact sequence, and such a step is in fact often a prerequisite for protein sequencing.
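
As a rough illustration of the AAC model described above, the following Python sketch computes the 20 normalized occurrence frequencies from a one-letter amino acid sequence. The function name `amino_acid_composition`, the alphabetical ordering of the residues, and the choice to ignore non-standard symbols when normalizing are all assumptions made for this example, not part of the original formulation.

```python
from collections import Counter

# One-letter codes of the 20 native amino acids.
# The alphabetical ordering here is an assumption; any fixed order works.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the normalized occurrence frequencies f_1..f_20 as a list.

    Symbols outside the 20 native amino acids (e.g. 'X') are ignored,
    which is an illustrative choice rather than a fixed convention.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no native amino acid residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Two sequences with the same residues in a different order
    # give the same AAC vector, since sequence order is discarded.
    print(amino_acid_composition("MKVLA") == amino_acid_composition("ALVKM"))
```

Because the vector depends only on residue counts, any two proteins with the same composition map to the same point regardless of residue order; this loss of sequence-order information is what the extra terms in PseAAC are meant to partly recover.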
