Mining distinctive DNA patterns from the upstream of human coding&non-coding genes via class frequency distribution

2016 
The upstream regions of genes are expected to contain many still-unknown regulatory regions that can increase or decrease the expression of specific genes. Mining distinctive patterns (regions) proceeds in two steps: maximal repeats (patterns) are extracted from the upstream DNA sequences of human genes, and the patterns are then filtered to keep those whose class frequency distribution fits the one specified by domain experts. The class frequency distribution of a pattern is the set of frequencies with which that pattern appears in each class. Maximal repeats can be extracted and their class frequency distributions computed at the same time by a scalable approach, based on previous work, that uses the MapReduce programming model. The experimental data are the 5,000 bp upstream DNA sequences of 49,267 human coding and non-coding genes. The genes are divided into four classes: “non-cancer related protein-coding gene”, “oncogene”, “tumor suppressor gene” and “non-coding genes” (RNA). Experimental results show that 17 distinctive patterns are selected as core patterns; each is longer than 36 bp and appears in more than 3,000 genes and in all four classes. For a more specific observation, 22 distinctive patterns are selected that appear in at least 10 genes, are longer than 15 bp and, most importantly, occur in only two classes, “oncogene” and “tumor suppressor gene”. This approach, based on the class frequency distribution of selected patterns, could be extended to mine other types of distinctive patterns, e.g. biomarkers, provided that the target genomic sequences, containing “genotypes”, become available and each sequence is labeled precisely according to features, e.g. “phenotypes”, specified by domain experts.
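The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' MapReduce implementation: it assumes candidate patterns have already been extracted, it takes "frequency" to mean the number of genes in a class whose upstream sequence contains the pattern at least once, and the names `class_frequency_distribution`, `is_distinctive`, `min_length`, `min_genes` and `required_classes` are hypothetical, as is the toy data.

```python
# Class labels taken from the abstract; the short names are an assumption.
CLASSES = (
    "non-cancer protein-coding",
    "oncogene",
    "tumor suppressor",
    "non-coding",
)


def class_frequency_distribution(pattern, labeled_sequences):
    """Count, per class, how many sequences contain `pattern` at least once."""
    counts = {c: 0 for c in CLASSES}
    for sequence, label in labeled_sequences:
        if pattern in sequence:
            counts[label] += 1
    return counts


def is_distinctive(pattern, dist, min_length, min_genes, required_classes):
    """Keep a pattern that is longer than `min_length`, occurs in at least
    `min_genes` genes, and occurs in exactly the classes in `required_classes`."""
    present = {c for c, n in dist.items() if n > 0}
    return (
        len(pattern) > min_length
        and sum(dist.values()) >= min_genes
        and present == set(required_classes)
    )


# Toy upstream sequences (invented for illustration, far shorter than 5,000 bp).
toy = [
    ("ACGTACGTTTGACA", "oncogene"),
    ("GGACGTACGTCC", "tumor suppressor"),
    ("TTTTGGGGCCCC", "non-coding"),
    ("AAAACCCCGGGG", "non-cancer protein-coding"),
]

dist = class_frequency_distribution("ACGTACGT", toy)
keep = is_distinctive(
    "ACGTACGT",
    dist,
    min_length=4,
    min_genes=2,
    required_classes={"oncogene", "tumor suppressor"},
)
```

Under these assumptions, the cancer-specific selection reported in the abstract would correspond to `min_length=15`, `min_genes=10` and `required_classes={"oncogene", "tumor suppressor"}`; the core-pattern selection would instead require all four classes.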