Algorithms for aligning and clustering genomic sequences that contain duplications

2007 
Genomes of advanced organisms contain numerous repeated sequences, including gene clusters, tandem repeats, interspersed repeats, and segmental duplications. Among these, gene clusters are the class most often of functional importance. Algorithmic processing of regions containing these clusters remains challenging in practice, and the lack of clean solutions has been a major obstacle to sequence analysis in bioinformatics. This thesis presents new methodologies for solving two sets of problems in processing the sequences of gene-cluster regions, in particular methods for properly aligning gene-cluster regions of multiple species. Similar sequences that share the same evolutionary origin are homologous. Homologous sequences that diverged through speciation are orthologous. The first set of problems deals with aligning all and only the orthologous sequences in a gene-cluster region between two or more species. A two-way orthologous-sequence identification tool is developed to produce orthologous pairwise alignments. The results are evaluated based on phylogenetic inference from the gene sequences. High specificity is achieved without much loss of sensitivity. Two approaches are designed to create orthologous multi-species alignments. One uses a chosen species to guide the alignment process, and it has been successfully applied genome-wide. The other solves a more difficult formulation of the problem, in which all species are treated equally. Its computational difficulty is discussed, and some initial experiments are reported. The second set of problems deals with the construction of all homologous groups within a single genome. Each homologous group is expected to contain precisely the genomic intervals that are homologous to each other. A mixture of algorithmic and heuristic procedures is designed to balance the completeness and purity of each group. We verify the accuracy and efficiency of these methodologies.