Discovering discriminative and class-specific sequence and structural motifs in proteins

2013 
Finding recurring motifs is an important problem in bioinformatics. Such motifs can be used for a wide range of problems, including sequence classification, label prediction, knowledge discovery and biological engineering of proteins fit for a specific purpose. Our motivation is to create a better foundation for the research and development of novel motif mining and machine learning methods that can extract class-specific and discriminative motifs using both sequence and structural features. We propose the building blocks of a general machine learning framework to act on a biological input. This thesis presents a combination of elements that are intended to be applicable to a variety of biological problems. Ideally, the learner should only require as input a number of biological data instances that are classified into a number of different classes as defined by the researchers. The output should be the factors and motifs that discriminate between those classes (for reasonable, non-random class definitions). This ideal workflow requires two main steps. The first step is the representation of the biological input with features that contain the significant information the researcher is looking for. Due to the complexity of the macromolecules, abstract representations are required to convert the real-world representation into quantifiable descriptors that are suitable for motif mining and machine learning. The second step of the proposed workflow is the motif mining and knowledge discovery step. Using these informative representations, an algorithm should be able to find discriminative, class-specific motifs that are over-represented in one class and under-represented in the other. This thesis presents novel procedures for representation of the proteins to be used in a variety of machine learning algorithms, and two separate motif mining algorithms, one based on temporal motif mining and the other on deep learning, that can work with the given biological data.
The descriptors and the learners are applied to a wide range of computational problems encountered in life sciences.