The size distribution of protein families within different types of folds.

2011 
Abstract Structures are currently available for only a small fraction of known protein sequences, so there is a pressing need to discover the important features of known protein sequences from the protein structures available at present. Here, we report a study of the size distribution of protein families within different types of folds. The fold of a protein is the global arrangement of its main secondary structures, in terms of both their relative orientations and their topological connections, which specifies certain biochemical and biophysical aspects. We first search the protein families in the structural database SCOP against the sequence-based database Pfam and obtain a pool of corresponding Pfam families whose structures can be regarded as known. We call this pool of Pfam families the sample space. We then obtain the size distributions of protein families for the sample space, the Pfam database and the SCOP database. The results indicate that the size distributions of protein families within different kinds of folds follow similar power laws. In particular, the largest families are spread evenly across the different kinds of folds. This may help to better understand the relationship between protein sequence, structure and function. We also show that the set of proteins with known structures can be considered a random sample from the whole space of protein sequences, an essential but unsettled assumption for related predictions such as estimating the number of protein folds in nature. Finally, using a simple method, we conclude that about 2957 folds are needed to cover all Pfam families.
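To make the power-law claim concrete, the sketch below shows one way an exponent could be estimated from a list of family sizes and compared between folds. It is an illustration only, not the method used in the study: the abstract does not say how the distributions were fitted, so the continuous maximum-likelihood estimator is an assumption here. The function names `power_law_alpha` and `sample_power_law`, the fold labels and the synthetic sizes are hypothetical, and no SCOP or Pfam data are used.

```python
import math
import random


def power_law_alpha(sizes, x_min=1.0):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x**(-alpha), x >= x_min."""
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("need at least one size strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_power_law(alpha, n, x_min=1.0, seed=0):
    """Draw n synthetic sizes by inverse-transform sampling (illustration only)."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    # Synthetic stand-ins for family sizes under two hypothetical folds;
    # these are not real SCOP or Pfam data.
    for fold, seed in (("fold_A", 1), ("fold_B", 2)):
        sizes = sample_power_law(alpha=2.5, n=5000, seed=seed)
        print(fold, round(power_law_alpha(sizes), 2))
```

If the estimated exponents for different folds agree within their sampling error, that would be consistent with the "similar power laws" statement. Real family sizes are integers, so a discrete estimator may be more appropriate in practice.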