Protein Sequences as Literature Text

2006 
We analyzed protein sequences by treating them as texts written in a "protein" language. Repeating patterns (words) of various lengths can be identified in these sequences. The maximum word length differs among proteins belonging to different classes, so this value can be used to characterize the protein type. The technique was first applied to decompose into words ordinary (literary) texts written as a gapless symbolic sequence, without spaces or punctuation marks. Tests on fiction, scientific, and popular-science English texts demonstrated the relative efficiency of the technique.
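The abstract does not spell out how words are extracted, so the following is only a minimal sketch of the general idea, under the assumption that a "word" is any substring occurring at least twice in a gapless sequence. The function names (`to_gapless`, `repeated_words`, `max_repeat_length`) and the brute-force counting are illustrative choices, not the authors' method.

```python
import re
from collections import Counter


def to_gapless(text):
    """Lowercase the text and drop everything except the letters a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def repeated_words(seq, length, min_count=2):
    """Return substrings of the given length occurring at least min_count times."""
    counts = Counter(seq[i:i + length] for i in range(len(seq) - length + 1))
    return {word: n for word, n in counts.items() if n >= min_count}


def max_repeat_length(seq, min_count=2):
    """Length of the longest substring occurring at least min_count times (0 if none)."""
    longest = 0
    for length in range(1, len(seq)):
        if repeated_words(seq, length, min_count):
            longest = length
        else:
            # Every prefix of a repeated word is itself repeated, so once a
            # length has no repeats, no greater length can have any either.
            break
    return longest


# Literary text written without spaces or punctuation (assumed toy input).
gapless = to_gapless("The cat sat. The cat ran!")
print(gapless, max_repeat_length(gapless))

# Toy amino-acid string (hypothetical, not a real protein).
print(max_repeat_length("MKVLAAGMKVLA"))
```

In this sketch the value returned by `max_repeat_length` plays the role of the maximum word length that the abstract proposes as a class-dependent characteristic. The brute-force scan is quadratic in the worst case; a suffix array or suffix automaton would be the usual choice for long sequences.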