Computer-aided analyses were made of the complete amino-acid sequences of two P-450 species, the phenobarbital-inducible major P-450 of rat liver microsomes (P-450PB) and the camphor-hydroxylating P-450 of Pseudomonas putida (P-450cam). Statistically significant homology was recognized between the two P-450 sequences, but these sequences were not related to those of other groups of hemoproteins, such as hemoglobins, peroxidases, and cytochrome c's and b's. Two highly homologous regions, HR1 and HR2, and two other weakly homologous regions were found on optimally matched alignment of the P-450 sequences. The secondary structures of the two P-450's predicted by current prediction methods bear strong resemblance at these homologous regions. Both HR1 and HR2 contain a cysteine residue near the center of the homologous region, and they are the only regions that show significant homology among all 48 combinations of local sequences around the cysteine residues (six on P-450PB and eight on P-450cam). HR1 is located in the N-proximal half of the molecule, is rich in hydrophilic residues, and is predicted to be helical. On the other hand, HR2 is close to the C-terminus, has intermediate hydrophobicity, and may take a complex secondary structure of a turn-sheet-helix. The amino-acid sequences around the HR1 and HR2 regions are also well conserved in another P-450 species, rabbit P-450LM2.
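The count of 48 combinations follows from pairing each of the six cysteine-centered segments of one sequence with each of the eight of the other (6 x 8 = 48). A minimal Python sketch of that enumeration is given below. The function names, the flank width, and the toy sequences are illustrative assumptions, not the original program or the real P-450 sequences, and the homology scoring of each pair is left out.

```python
from itertools import product


def cysteine_windows(seq, flank=3):
    """Return (position, local segment) for every cysteine in seq.

    The segment spans `flank` residues on each side of the cysteine and is
    clipped at the sequence ends. `flank` is a hypothetical parameter.
    """
    return [
        (i, seq[max(0, i - flank): i + flank + 1])
        for i, residue in enumerate(seq)
        if residue == "C"
    ]


def all_window_pairs(seq_a, seq_b, flank=3):
    """Pair every cysteine window of seq_a with every cysteine window of seq_b."""
    return list(product(cysteine_windows(seq_a, flank),
                        cysteine_windows(seq_b, flank)))


# Toy sequences (NOT real P-450 sequences): six and eight cysteines.
toy_a = "MCAACGGCLLCKKCDDC"
toy_b = "AGC" * 8
pairs = all_window_pairs(toy_a, toy_b)
print(len(pairs))  # number of local-segment combinations to be scored
```

Each pair produced this way would then be scored for local homology by whatever similarity measure is chosen.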
There exist few databases that enable cross-reference among various research fields related to bioenergy. Cross-reference is highly desired among bioinformatics databases related to environment, energy, and agriculture for better mutual cooperation. By uniting Semantic Graphs, we can economically construct a distributed database, regardless of the size of research laboratories and research endeavors. Our purpose is to design and develop a workflow based on RDF (Resource Description Framework) that generates a Semantic Graph for a set of technical terms extracted from documents of various formats, such as PDF, HTML, and plain text. Our attempt is to generate the Semantic Graph as a result of text mining, including morphological analysis and syntax analysis. We have developed a prototype workflow program named "RDF Curator". By using this system, various types of documents can be automatically converted into RDF. "RDF Curator" is composed of general tools and libraries, so no special environment is needed. Hence, "RDF Curator" can be used on many platforms, such as MacOSX, Linux, and Windows (Cygwin). We expect that our system can assist human curators in constructing Semantic Graphs. Although the present version of "RDF Curator" is fast and high-throughput, its accuracy is lower than that of human curators. As a future task, we have to improve the accuracy of the workflow. In addition, we also plan to apply our system to analysis of network similarity.
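As an illustration of the final conversion step only, the sketch below serializes term relations as RDF in N-Triples syntax using the Python standard library. It is not the "RDF Curator" implementation: the namespace `http://example.org/term/`, the function names, and the sample relation are hypothetical, and the text-mining stage that would produce the relations is assumed to have run already.

```python
from urllib.parse import quote

# Hypothetical placeholder namespace; a real deployment would use its own IRIs.
BASE = "http://example.org/term/"


def term_iri(term):
    """Map a technical term to an IRI under the placeholder namespace."""
    local_name = term.strip().lower().replace(" ", "_")
    return "<" + BASE + quote(local_name, safe="_") + ">"


def to_ntriples(relations):
    """Serialize (subject, predicate, object) term triples as N-Triples lines."""
    return "\n".join(
        f"{term_iri(subj)} {term_iri(pred)} {term_iri(obj)} ."
        for subj, pred, obj in relations
    )


# Assumed output of the morphological and syntax analysis (illustrative only).
relations = [("Cellulose", "converted to", "bioethanol")]
print(to_ntriples(relations))
```

Because each line of N-Triples is a self-contained statement, graphs produced by separate laboratories can be concatenated, which is one way to read the idea of uniting Semantic Graphs into a distributed database.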
The caspases, a family of cysteine proteases, play multiple roles in apoptosis, inflammation, and cellular differentiation. Caspase-8 (Casp8), which was first identified in humans, functions as an initiator caspase in the apoptotic signaling mediated by cell-surface death receptors. To understand the evolution of function in the Casp8 protein family, casp8 orthologs were identified from a comprehensive range of vertebrates and invertebrates, including sponges and cnidarians, and characterized at both the gene and protein levels. Some introns have been conserved from cnidarians to mammals, but both losses and gains have also occurred; a new intron arose during teleost evolution, whereas in the ascidian Ciona intestinalis, the casp8 gene is intronless and is organized in an operon with a neighboring gene. Casp8 activities are near ubiquitous throughout the animal kingdom. Exogenous expression of a representative range of nonmammalian Casp8 proteins in cultured mammalian cells induced cell death, implying that these proteins possess proapoptotic activity. The cnidarian Casp8 proteins differ considerably from their bilaterian counterparts in terms of amino acid residues in the catalytic pocket, but display the same substrate specificity as human CASP8, highlighting the complexity of spatial structural interactions involved in enzymatic activity. Finally, it was confirmed that the interaction with an adaptor molecule, Fas-associated death domain protein, is also evolutionarily ancient. Thus, despite structural diversity and cooption to a variety of new functions, the ancient origins and near ubiquitous distribution of this activity across the animal kingdom emphasize the importance and utility of Casp8 as a central component of the metazoan molecular toolkit.
The basic process of RNA splicing is conserved among eukaryotic species. Three signals (5' and 3' splice sites and branch site) are commonly used to directly conduct splicing, while other features are also related to the recognition of an intron. Although there is experimental evidence pointing to the significant species specificities in the features of intron recognition, a quantitative evaluation of the divergence of these features among a wide variety of eukaryotes has yet to be conducted. To better understand the splicing process from the viewpoints of evolution and information theory, we collected introns from 61 diverse species of eukaryotes and analyzed the properties of the nucleotide sequences relevant to splicing. We found that trees individually constructed from the five features (the three signals, intron length, and nucleotide composition within an intron) roughly reflect the phylogenetic relationships among the species but sometimes extensively deviate from the species classification. The degree of topological deviation of each feature tree from the reference trees indicates the lowest discordance for the 5' splicing signal, followed by that for the 3' splicing signal, and a considerably greater discordance for the other three features. We also estimated the relative contributions of the five features to short intron recognition in each species. Again, moderate correlation was observed between the similarities in pattern of short intron recognition and the genealogical relationships among the species. When mammalian introns were categorized into three subtypes according to their terminal dinucleotide sequences, each subtype segregated into a nearly monophyletic group, regardless of the host species, with respect to the 5' and 3' splicing signals. 
It was also found that GC-AG introns are extraordinarily abundant in some species with high genomic G + C contents, and that the U12-type spliceosome might make a greater contribution than currently estimated in most species. Overall, the present study indicates that both splicing signals themselves and their relative contributions to short intron recognition are rather susceptible to evolutionary changes, while some poorly characterized properties seem to be preserved within the mammalian intron subtypes. Our findings may afford additional clues to understanding of evolution of splicing mechanisms.
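To make the information-theoretic viewpoint concrete, the sketch below computes the per-position information content (in bits) of a set of aligned splice-site sequences. It assumes a uniform background over four nucleotides and applies no small-sample correction; the study's exact measure is not stated here, so this is one common formulation and not necessarily the one used. The example sites are invented.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information (bits) of equal-length aligned sites.

    Uses I = 2 - H, where H is the Shannon entropy of the observed
    nucleotide frequencies (uniform background assumed, no correction).
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        bits.append(2.0 - entropy)
    return bits


# Invented 5' splice-site fragments, for illustration only.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTGAGC"]
print(information_content(toy_sites))
```

Summing the per-position values gives the total information carried by a signal, which is the kind of quantity that can be compared across features and species.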
Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods. We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50 to 60% of the annotated gene structures are likely to contain some defects. Whereas more than half of the defect-containing genes may be intrinsically broken, i.e. they are pseudogenes or gene fragments, located in unfinished sequencing areas, or corresponding to non-productive isoforms, the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method. Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
Intron length distribution (ILD) is a specific feature of a genome that exhibits extensive species-specific variation. Whereas ILD contributes up to 30% of the total information content for intron recognition in some species, rendering it an important component of computational gene prediction, very few studies have been conducted to quantitatively characterize ILDs of various species. We developed a set of computer programs (fitild, compild, etc.) to build statistical models of ILDs and compare them with one another. Each ILD of more than 1000 genomes was fitted with fitild to a statistical model consisting of one, two, or three components of Fréchet distributions. Several measures of distances between ILDs were calculated by compild. A theoretical model was presented to better understand the origin of the observed shape of an ILD. The C++ source code is available at https://github.com/ogotoh/fitild.git/. Supplementary data are available at Bioinformatics online.
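For readers unfamiliar with the model family, the following Python sketch evaluates a mixture of Fréchet distributions and its log-likelihood for a list of intron lengths. It is not the fitild code (which is written in C++): the standard three-parameter form with shape `alpha`, `scale`, and location `loc` is assumed, and the component values below are invented for illustration, not fitted.

```python
import math


def frechet_pdf(x, alpha, scale, loc=0.0):
    """Density of a Frechet distribution (zero for x <= loc)."""
    if x <= loc:
        return 0.0
    z = (x - loc) / scale
    return (alpha / scale) * z ** (-1.0 - alpha) * math.exp(-z ** (-alpha))


def frechet_cdf(x, alpha, scale, loc=0.0):
    """Cumulative distribution function of a Frechet distribution."""
    if x <= loc:
        return 0.0
    z = (x - loc) / scale
    return math.exp(-z ** (-alpha))


def mixture_pdf(x, components):
    """components: list of (weight, alpha, scale, loc); weights sum to 1."""
    return sum(w * frechet_pdf(x, a, s, m) for w, a, s, m in components)


def mixture_cdf(x, components):
    return sum(w * frechet_cdf(x, a, s, m) for w, a, s, m in components)


def log_likelihood(lengths, components):
    """Log-likelihood of observed lengths (each must have nonzero density)."""
    return sum(math.log(mixture_pdf(x, components)) for x in lengths)


# Invented two-component example: a short-intron peak and a long tail.
toy_components = [(0.7, 3.0, 60.0, 0.0), (0.3, 1.5, 500.0, 0.0)]
print(log_likelihood([55, 62, 70, 480, 1500], toy_components))
```

Fitting would consist of choosing the component parameters that maximize this log-likelihood, for instance with a general-purpose optimizer; the number of components (one, two, or three) is a separate model-selection decision.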
In the last 20 years, many multiple-sequence alignment programs based on various principles have been developed. Continuous efforts have been devoted to solving two major problems: (1) how to evaluate the 'goodness' of an alignment, and (2) how to get the alignment with the optimal score. These problems are tightly interrelated, and other criteria are needed to objectively assess the reliability of a certain alignment method. Recently, the number of protein three-dimensional (3D) structures determined by X-ray crystallography and high-resolution NMR methods has been rapidly increasing. Comparison of the 3D structures makes it possible to align distantly related protein sequences based on their structural equivalence. A few collections of such structure-based alignments are now available [4]. Hence we can assess the quality of sequence alignments obtained by a given method by referring to the structural counterparts. McClure et al. [3] recently reported that the most popular 'progressive' metho
Perturbations of gene regulatory networks are essentially responsible for oncogenesis. Therefore, inferring the gene regulatory networks is a key step to overcoming cancer. In this work, we propose a method for inferring directed gene regulatory networks based on soft computing rules, which can identify important cause-effect regulatory relations of gene expression. First, we identify important genes associated with a specific cancer (colon cancer) using a supervised learning approach. Next, we reconstruct the gene regulatory networks by inferring the regulatory relations among the identified genes, and their regulated relations by other genes within the genome. We obtain two meaningful findings. One is that upregulated genes are regulated by more genes than downregulated ones, while downregulated genes regulate more genes than upregulated ones. The other one is that tumor suppressors suppress tumor activators and activate other tumor suppressors strongly, while tumor activators activate other tumor activators and suppress tumor suppressors weakly, indicating the robustness of biological systems. These findings provide valuable insights into the pathogenesis of cancer.