Segmentation of Heteropolymer Sequences Specifying Subsequences with Different Composition and Statistical Properties

2003 
We have studied the segmentation of two-letter AB heterosequences composed of subsequences that differ in composition and in the distribution of A and B monomer units along the chain. Our approach is based on the segmentation function S(k) introduced in the present work and on the Jensen-Shannon divergence measure determined with respect to the probabilities of the lengths of uniform blocks of A and B monomer units. It is shown that the function S(k) is extremely sensitive to the sequence statistics. Even visual analysis of S(k) reveals some features of the sequence statistics. In particular, S(k) is constant for random copolymers, oscillates for random block copolymers, and grows monotonically up to some constant value for proteinlike copolymers. However, due to significant fluctuations observed for short sequences, the function S(k) can be used effectively only for segmentation of a heterosequence composed of very long subsequences. On the other hand, we find that the Jensen-Shannon divergence measure does not allow one to judge the type of statistics, but is extremely efficient for segmentation of a heterosequence. Therefore, the two introduced functions, being mutually complementary, provide an effective approach for recognizing and segmenting heterosequences. As an example, the methods developed are applied to concatenated sequences of different proteins.
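The block-length idea can be illustrated with a minimal sketch. The abstract does not state the exact form of the divergence or the search procedure, so the code below assumes the standard Jensen-Shannon divergence with equal weights, computed between the empirical distributions of uniform-block lengths on either side of a candidate cut. All names (`block_length_distribution`, `jensen_shannon`, `best_cut`) and the `min_len` parameter are hypothetical and do not come from the paper; the function S(k) is not defined in the abstract and is not attempted here.

```python
import math
from collections import Counter
from itertools import groupby


def block_length_distribution(seq):
    """Empirical probabilities of (monomer, block length) pairs in seq.

    A block is a maximal run of identical letters, e.g. "AAB" has the
    blocks ("A", 2) and ("B", 1).
    """
    counts = Counter((unit, len(list(run))) for unit, run in groupby(seq))
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def shannon_entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def jensen_shannon(p, q):
    """Equal-weight Jensen-Shannon divergence (assumed form, in bits)."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * p.get(k, 0.0) + 0.5 * q.get(k, 0.0) for k in keys}
    return shannon_entropy(mixture) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))


def best_cut(seq, min_len=20):
    """Scan all cut positions and return (position, divergence) at the maximum.

    min_len is a hypothetical guard that keeps both segments long enough
    for their block-length statistics to be meaningful.
    """
    best_pos, best_div = None, -1.0
    for cut in range(min_len, len(seq) - min_len + 1):
        left = block_length_distribution(seq[:cut])
        right = block_length_distribution(seq[cut:])
        div = jensen_shannon(left, right)
        if div > best_div:
            best_pos, best_div = cut, div
    return best_pos, best_div
```

For instance, calling `best_cut` on an alternating sequence joined to a sequence of length-5 blocks (`"AB" * 50 + "AAAAABBBBB" * 10`) should place the cut at or very near the junction, because the two sides share no block lengths there. This brute-force scan recomputes both distributions at every position and is meant only to show the principle, not to be efficient.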