Multiple sequence alignment (MSA) is a useful tool in bioinformatics. Although many MSA algorithms have been developed, there is still room for improvement in accuracy and speed. We have developed an MSA program, PRIME, whose crucial feature is the use of a group-to-group sequence alignment algorithm with a piecewise linear gap cost. We have shown that PRIME is one of the most accurate MSA programs currently available. However, PRIME is slower than other leading MSA programs. To improve computational performance, we have incorporated anchoring and grouping heuristics into PRIME. The anchoring method locates well-conserved regions in a given MSA and uses them as anchor points to reduce the region of the dynamic programming (DP) matrix to be examined, while the grouping method detects conserved subfamily alignments specified by a phylogenetic tree in a given MSA to reduce the number of iterative refinement steps. The results of the BAliBASE 3.0 and PREFAB 4 benchmark tests indicated that these heuristics reduced the computational time of PRIME by more than 60%, while the average alignment accuracy measures decreased by at most 2%. Additionally, we evaluated the effectiveness of an iterative refinement algorithm based on maximal expected accuracy (MEA). Our experiments revealed that when many sequences are aligned, the MEA-based algorithm significantly improves alignment accuracy compared with the standard version of PRIME, at the expense of a considerable increase in computation time.
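To illustrate why anchoring reduces the DP region, the following is a minimal sketch under simplifying assumptions, not PRIME's actual implementation. It treats the alignment as pairwise, represents each anchor as a single hypothetical (i, j) matrix coordinate, and counts only the cells in the rectangles between consecutive anchors. The function names `anchored_blocks` and `cells_examined` are illustrative.

```python
def anchored_blocks(len_a, len_b, anchors):
    """Split a len_a x len_b DP matrix into rectangles between consecutive anchors.

    anchors: list of (i, j) points assumed to lie on the final alignment path.
    Returns a list of ((i0, i1), (j0, j1)) blocks that still need DP.
    """
    points = [(0, 0)] + sorted(anchors) + [(len_a, len_b)]
    blocks = []
    for (i0, j0), (i1, j1) in zip(points, points[1:]):
        # Anchors must be consistent: non-decreasing in both coordinates.
        if i1 < i0 or j1 < j0:
            raise ValueError("anchors must increase in both coordinates")
        blocks.append(((i0, i1), (j0, j1)))
    return blocks


def cells_examined(blocks):
    """Count the DP cells contained in the given blocks."""
    return sum((i1 - i0) * (j1 - j0) for (i0, i1), (j0, j1) in blocks)


if __name__ == "__main__":
    blocks = anchored_blocks(100, 100, [(25, 25), (50, 50), (75, 75)])
    print(cells_examined(blocks), "of", 100 * 100, "cells")
```

With three evenly spaced anchors on a 100 x 100 matrix, the sketch examines 2,500 of 10,000 cells. Real anchors are conserved regions rather than single points, so the actual saving depends on how the anchors are distributed.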
Few databases enable cross-referencing among the various research fields related to bioenergy. Cross-referencing is highly desired among bioinformatics databases related to environment, energy, and agriculture for better mutual cooperation. By uniting Semantic Graphs, we can economically construct a distributed database, regardless of the size of research laboratories and research endeavors. Our purpose is to design and develop a workflow based on RDF (Resource Description Framework) that generates a Semantic Graph for a set of technical terms extracted from documents of various formats, such as PDF, HTML, and plain text. We generate the Semantic Graph as a result of text mining, including morphological analysis and syntax analysis. We have developed a prototype workflow program named "RDF Curator". By using this system, various types of documents can be automatically converted into RDF. "RDF Curator" is composed of general tools and libraries, so no special environment is needed. Hence, "RDF Curator" can be used on many platforms, such as Mac OS X, Linux, and Windows (Cygwin). We expect that our system can assist human curators in constructing Semantic Graphs. Although the present version of "RDF Curator" is fast and offers high throughput, its accuracy is lower than that of human curators. As a future task, we have to improve the accuracy of the workflow. In addition, we plan to apply our system to the analysis of network similarity.
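As a rough illustration of the final step of such a workflow, the sketch below serializes already-extracted term relations as RDF in N-Triples form. It is not the "RDF Curator" implementation: the namespace `http://example.org/term/`, the function names, and the example relation are all hypothetical, and the text-mining steps that would produce the relations are assumed to have run beforehand.

```python
from urllib.parse import quote

# Hypothetical namespace for extracted terms (an assumption, not from the source).
BASE = "http://example.org/term/"


def term_iri(term):
    """Turn a technical term into an IRI reference in the assumed namespace."""
    return "<" + BASE + quote(term.strip().replace(" ", "_")) + ">"


def to_ntriples(relations):
    """Serialize (subject, predicate, object) term relations as N-Triples lines."""
    lines = []
    for subj, pred, obj in relations:
        lines.append(f"{term_iri(subj)} {term_iri(pred)} {term_iri(obj)} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example relation, standing in for text-mining output.
    print(to_ntriples([("cellulase", "degrades", "cellulose")]))
```

Graphs produced this way by different laboratories can be merged by concatenating their triples, which is the property that makes a distributed Semantic Graph inexpensive to assemble.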
We developed an algorithm that classifies all observed units of alternative splicing and transcriptional initiation and termination (UASTs) into an extendable set of distinct elementary patterns, when a collection of alignments between genomic DNA sequences and a set of cDNA/EST sequences is provided. The algorithm first converts aligned exon-intron structures into bit arrays, extracts UASTs, and then encodes each UAST into a pair (or vector) of decimal numbers, which specify the corresponding pattern. This system can uniquely and compactly encode not only typical patterns but also any rare or novel patterns, which have usually been collectively assigned as "others". This system deals with transcriptional variation and alternative splicing in the same framework of classification.
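The idea of turning exon-intron structures into bit arrays and then into decimal numbers can be sketched as follows. This is an illustrative assumption about one possible encoding, not the scheme used by the authors: exon boundaries from all transcripts partition the region into segments, each transcript gets one bit per segment (1 if exonic), and the bits are read as a binary number. The function names and coordinates are hypothetical.

```python
def to_bit_arrays(transcripts):
    """Convert transcripts into per-segment bit arrays.

    transcripts: list of exon lists, each exon a half-open (start, end) interval.
    Returns (segments, arrays), where arrays[k][m] is 1 if transcript k
    covers segment m with an exon, else 0.
    """
    bounds = sorted({p for exons in transcripts for iv in exons for p in iv})
    segments = list(zip(bounds, bounds[1:]))
    arrays = []
    for exons in transcripts:
        bits = [int(any(s <= lo and hi <= e for s, e in exons))
                for lo, hi in segments]
        arrays.append(bits)
    return segments, arrays


def bits_to_decimal(bits):
    """Read a bit array as a binary number, most significant bit first."""
    value = 0
    for b in bits:
        value = value * 2 + b
    return value


if __name__ == "__main__":
    # A cassette exon: the middle exon is present in one transcript only.
    included = [(0, 10), (20, 30), (40, 50)]
    skipped = [(0, 10), (40, 50)]
    _, arrays = to_bit_arrays([included, skipped])
    print([bits_to_decimal(a) for a in arrays])
```

For the cassette-exon example, the two transcripts become `10101` and `10001`, so the pair of decimal numbers is (21, 17). Any unusual pattern gets its own pair in the same way, which is how a numeric code avoids a catch-all "others" class.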
Following the completion of genomic sequencing of S. cerevisiae and C. elegans, complete sequencing of several eukaryotic genomes, including that of human, is expected to be accomplished within a few years. An essential but as yet unresolved problem is to locate genes on a genomic sequence and to precisely predict their internal (exon-intron) structures. Statistical gene-finding methods have attained significant success, but the performance of even the best available methods is still unsatisfactory for many practical purposes [1, 2]. Homology-based gene-identification methods can considerably improve the accuracy of prediction, provided that one or more known protein or mRNA sequences closely related to the target gene are found in databases [5]. However, it is often observed that the closest relative of a gene is another gene in the same genome. In fact, the genomes of higher eukaryotes, such as C. elegans and A. thaliana, possess a number of large gene families whose members are closely related to one another but distant from any genes in other organisms. Here, I propose a method for simultaneously predicting the gene structures of all members of such a species-specific family.
Existing methods for finding the locally best-matched alignments between a pair of biological sequences require O(N²) computational steps and O(N²) storage, where N is the average sequence length. An improved method is presented with which the storage requirement is greatly reduced, while the number of computational steps remains O(N²). Only a small number of additional steps are required to display any common subsequences with similarity scores greater than a given threshold. The alignments found by the algorithm are optimal in the sense that their scores are locally maximal, where each score is a sum of weights given to the individual matches/replacements, insertions, and deletions involved in the alignment. The algorithm was implemented in the C programming language on a personal computer. A data area of 64 kbytes of random access memory and a few hundred kbytes on disk is sufficient for comparing two protein or nucleic acid sequences of 2500 residues. The programs are particularly valuable when used in combination with fast sequence search programs.
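The trade-off between O(N²) time and reduced storage can be illustrated with a minimal sketch that computes the best local alignment score while keeping only two rows of the DP matrix. This is a standard row-by-row local alignment recurrence, not the algorithm of the paper: the scoring weights (`match`, `mismatch`, `gap`) are assumed values, gaps are scored linearly, and only the score and end position are returned, without recovering the alignment itself.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b using O(len(b)) memory.

    Returns (score, end_i, end_j), where end_i and end_j are the 1-based
    positions in a and b at which the best-scoring local alignment ends.
    """
    prev = [0] * (len(b) + 1)
    best = (0, 0, 0)
    for i, x in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            # Scores are clamped at zero so that an alignment can start anywhere.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best[0]:
                best = (curr[j], i, j)
        prev = curr  # Only the previous row is kept.
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTT", "GGACGGG"))
```

Every cell is still visited once, so the time stays quadratic, but memory grows only with the length of one sequence. Recovering the aligned residues, or all alignments above a threshold, needs extra bookkeeping beyond this sketch.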