    Coherent Structural Prediction of a Set of Paralogous Genes on a Eukaryotic Genome
    Citations: 0 | References: 6 | Related Papers: 20
    Abstract:
    Following the completion of the genomic sequencing of S. cerevisiae and C. elegans, the complete sequencing of several eukaryotic genomes, including the human genome, is expected within a few years. An essential but as yet unresolved problem is to locate genes on a genomic sequence and to predict their internal (exon-intron) structures precisely. Statistical gene-finding methods have attained significant success, but the performance of even the best available methods is still unsatisfactory for many practical purposes [1, 2]. Homology-based gene-identification methods can considerably improve the accuracy of prediction, provided that one or more known protein or mRNA sequences closely related to the target gene are found in databases [5]. However, it is often observed that the closest relative of a gene is another gene on the same genome. In fact, the genomes of higher eukaryotes, such as C. elegans and A. thaliana, possess a number of large gene families whose members are closely related to one another but only distantly related to any gene in other organisms. Here, I propose a method for simultaneously predicting the gene structures of all members of such a species-specific family.
    Keywords:
    Gene prediction
    Homology
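
The abstract above proposes predicting the structures of all members of a species-specific gene family at once, using the paralogues themselves as the homology evidence. The Python sketch below is not the paper's algorithm; it is a minimal illustration, on invented data, of the underlying idea of mutual support: each gene has a handful of candidate structures (represented here only by their conceptual translations, a deliberate simplification), and the candidate kept for each gene is the one whose protein best matches the current choices for the other family members.

def shared_kmers(p, q, k=3):
    """Number of k-mers two protein strings share; a crude stand-in for an alignment score."""
    a = {p[i:i + k] for i in range(len(p) - k + 1)}
    b = {q[i:i + k] for i in range(len(q) - k + 1)}
    return len(a & b)

def coherent_choice(candidates, n_iter=10):
    """candidates: dict gene -> list of candidate translations (one per exon chain).
    Returns dict gene -> index of the candidate best supported by the rest of the family."""
    chosen = {g: 0 for g in candidates}            # start from, e.g., the ab initio best candidate
    for _ in range(n_iter):
        updated = {}
        for g, cands in candidates.items():
            def support(c):                        # similarity to the other genes' current choices
                return sum(shared_kmers(c, candidates[h][chosen[h]])
                           for h in candidates if h != g)
            updated[g] = max(range(len(cands)), key=lambda i: support(cands[i]))
        if updated == chosen:                      # converged: the choices are mutually consistent
            break
        chosen = updated
    return chosen

# Toy family (hypothetical peptides): for each paralogue, candidate 0 is the
# homology-consistent structure and candidate 1 a truncated or spurious alternative.
family = {
    "geneA": ["MKTLLVAGGESRWQ", "MKTLLV"],
    "geneB": ["MKTLLVAGGDSRWE", "MKQPY"],
}
print(coherent_choice(family))                     # {'geneA': 0, 'geneB': 0}

A real implementation would of course score candidates by spliced alignment of each paralogue's protein against the others' genomic DNA, rather than by k-mer overlap between finished translations.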
    With genomic data (generated by classical, functional, structural, proteo- and other 'omic' approaches) accumulating at a stupendous rate, there is an ever-increasing need for new, more efficient and more sensitive computational methods. To highlight aspects of these computational needs, we present results that emerged from the comparative genome analysis of mitochondria. Having originated from an alpha-proteobacterial endosymbiont, these eukaryotic organelles contain small and extremely variable genomes, and are thus perfect model systems for the much more complex eubacterial and archaeal genomes. We are currently investigating mitochondrial DNAs (mtDNAs) in a lineage of unicellular, primitive protistan eukaryotes, the jakobids, with the aim of understanding the evolution of mitochondrial genomes, genes and their regulation. Because these organisms are difficult to grow, biochemical approaches to gene regulation are laborious, so there is much to be gained from predictions of genome and gene organization and of regulatory elements. In contrast to approaches in which molecular data (gene order, sequence similarities) are used to infer the phylogenetic relationships among a group of organisms, we already know their phylogeny and employ this information, in a phylogenetic-comparative approach, to identify and model more or less conserved genetic elements and structural RNA genes that are difficult to spot by conventional methods.
    Lineage (genetic)
    Organelle
    Citations (0)
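
The abstract above describes spotting conserved genetic elements by comparison across a lineage whose phylogeny is already known. The sketch below is a deliberately minimal illustration of the comparative part only: it ignores the species tree, uses invented sequences, and simply reports windows of a toy alignment in which every column is identical as candidate conserved elements.

aligned = {                       # hypothetical aligned mtDNA fragments of equal length
    "jakobid_1": "ATGCTTAAGGCTTACGT",
    "jakobid_2": "ATGCTTAAGGATTACGT",
    "jakobid_3": "ATGCTTAAGGCTAACGT",
}

def window_identity(alignment, window=5):
    """Yield (start, fraction of fully identical columns) for each window of the alignment."""
    seqs = list(alignment.values())
    for start in range(len(seqs[0]) - window + 1):
        cols = zip(*(s[start:start + window] for s in seqs))
        identical = sum(1 for col in cols if len(set(col)) == 1)
        yield start, identical / window

for start, score in window_identity(aligned):
    if score == 1.0:              # perfectly conserved window -> candidate element
        print(f"conserved window at positions {start}-{start + 4}")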
    The importance of gene gain through duplication has long been appreciated. In contrast, the importance of gene loss has only recently attracted attention. Indeed, studies in organisms ranging from plants to worms and humans suggest that duplication of some genes might be better tolerated than that of others. Here we have undertaken a large-scale study to investigate the existence of duplication-resistant genes in the sequenced genomes of 20 flowering plants. We demonstrate that there is a large set of genes that is convergently restored to single-copy status following multiple genome-wide and smaller scale duplication events. We rule out the possibility that such a pattern could be explained by random gene loss only and therefore propose that there is selection pressure to preserve such genes as singletons. This is further substantiated by the observation that angiosperm single-copy genes do not comprise a random fraction of the genome, but instead are often involved in essential housekeeping functions that are highly conserved across all eukaryotes. Furthermore, single-copy genes are generally expressed more highly and in more tissues than non–single-copy genes, and they exhibit higher sequence conservation. Finally, we propose different hypotheses to explain their resistance against duplication.
    Housekeeping gene
    Segmental duplication
    Citations (366)
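
The paper above argues that random gene loss alone cannot explain why the same genes return to single-copy status in so many lineages. The sketch below illustrates the kind of null model such a claim is tested against; the number of ancestral duplicate pairs and the per-lineage probability of returning to single copy are invented for illustration and are not the paper's estimates.

import random

N_GENES = 10_000      # duplicated gene pairs in the ancestor (illustrative)
N_LINEAGES = 20       # matching the 20 flowering plant genomes surveyed
P_SINGLE = 0.7        # per-lineage chance that a pair collapses back to single copy

random.seed(1)
convergent = sum(
    1 for _ in range(N_GENES)
    if all(random.random() < P_SINGLE for _ in range(N_LINEAGES))
)

expected = N_GENES * P_SINGLE ** N_LINEAGES
print(f"simulated convergently single-copy genes: {convergent}")
print(f"analytic expectation under random loss: {expected:.1f}")   # about 8 in 10,000

Under such a null model only a handful of genes would be single-copy in all twenty genomes by chance, which is why an excess of convergently single-copy genes points to selection.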
    The examination of a sequenced genome produces many fascinating insights into how genes function, as well as tantalizing hints about the importance of gene order and orientation. Indeed, a static view of any single genome leads to a number of hypotheses regarding the history of its organization. Since the genome projects began, it has been clear that many of the questions arising from the examination of any single genome might well be resolved by having other genomes with which to compare it. We now have a small collection of fairly complete eukaryotic genome sequences to examine (two distantly related yeasts, a nematode, an insect, and a higher plant). Although other sequences are near completion, they are not yet of sufficiently high quality to be confidently used in this type of comparison. Existing genome sequences are evolutionarily widely separated and the organisms are morphologically very different. Thus they are not yet very helpful when one wants to consider the forces and mechanisms that have led to the present state. Recognition of this by the genome community has resulted in efforts to sequence genomes that will fill in the phylogenetic gaps and that are evolutionarily close to existing sequenced genomes. A prominent example is the mouse genome as a complement to the human genome. Efforts are also underway to produce genome sequences of close relatives of some of the more tractable model organisms (worm, fly, and yeast). In the case of the worm Caenorhabditis elegans, a sister species with very similar morphology has been selected, Caenorhabditis briggsae. The sequencing effort thus far has produced ∼15 million bases of genome sequence (∼15%–18% of the total). This sequence is available at the Genome Sequencing Center, Washington University School of Medicine (http://genome.wustl.edu/gsc/). There is a concerted effort between the Washington University Genome Center and the Sanger Centre to complete the C. briggsae genome. The availability of these two high-quality data sets has proved irresistible to bioinformatics researchers. Kent and Zahler (2000) and Webb et al. (2002) have used these data to show the usefulness of newly developed tools for teasing information out of genomic sequence data from these closely related species. In this issue of Genome Research, data from these two species have again been used in an extensive analysis of genome rearrangement. Rates of rearrangement are calculated and compared with earlier data from Drosophila species. Coghlan and Wolfe at Trinity College have carried out an extensive and elegant analysis of the C. elegans and C. briggsae genomes and made some surprising discoveries and predictions about the overall rate of rearrangement in Caenorhabditis. They point out that this data set is "the largest available for any pair of congeneric eukaryotes." The extent and quality of the sequence data make this analysis possible. By first using BLASTX, Coghlan and Wolfe (2002) were able to predict the locations of 1784 orthologous genes in nearly 13 million bases of C. briggsae genomic DNA. These were localized to 756 segments that ranged in size from 1 to 19 genes. When rearrangements were considered, these segments could be reduced to 252, some containing as many as 109 genes. Using this set of ordered orthologs, they analyzed the data to deduce the number of chromosomal rearrangements that would be required to give rise to the observed order. They determined that 517 chromosomal rearrangements would be needed.
Transpositions were the most common event, but inversions and translocations each contributed about half as many breaks. Extrapolated to the whole genome, this leads to the conclusion that some 4030 rearrangements have occurred since the separation of the two species. This is a remarkable rate of rearrangement, even when considering the 50–120 million years that the investigators estimate for the divergence of the two species. They point out that this rate is higher than that reported for Drosophila. However, we will have to wait for comparable sequence data from a Drosophila sister species for this to be confirmed. Indeed, they calculate that the breakage rate in C. elegans is 1400–17,000 times higher than has been calculated for mammals; again, we must await comparisons based on similarly high-quality sequence in pairs of mammals. It is worth noting that the estimated length of the conserved regions is increasing: Kent and Zahler (2000) claimed they averaged 8.1 kb, whereas this paper claims they are 53 kb. This difference is largely attributable to differences in the analytic methods and assumptions made in the two papers. It is clear that much is being learned about how genomes may be compared and how information from such comparisons may be used. A whole C. briggsae genome assembly has been completed and is currently being analyzed (R. Waterston and R. Durbin, pers. comm.); this will allow the predictions made in the Coghlan and Wolfe (2002) paper to be confirmed.
    Citations (4)
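
The commentary above summarises how ordered orthologues were reduced to conserved segments and how a minimum number of rearrangements was inferred. The sketch below is not Coghlan and Wolfe's pipeline; it only shows, on a toy signed permutation (orthologues of one genome listed in the other genome's order, with sign giving the strand), how such a list is merged into conserved segments and how the breakpoints between them bound the number of rearrangements.

def segments(perm):
    """Split a signed permutation into maximal conserved segments (q follows p when q == p + 1)."""
    segs, cur = [], [perm[0]]
    for p, q in zip(perm, perm[1:]):
        if q == p + 1:
            cur.append(q)
        else:
            segs.append(cur)
            cur = [q]
    segs.append(cur)
    return segs

def breakpoints(perm):
    """Number of junctions at which the conserved order is broken."""
    return sum(1 for p, q in zip(perm, perm[1:]) if q != p + 1)

# Toy example: 8 shared orthologues; the block 4..6 is inverted (negative = opposite strand).
order = [1, 2, 3, -6, -5, -4, 7, 8]
print(segments(order))      # [[1, 2, 3], [-6, -5, -4], [7, 8]]
print(breakpoints(order))   # 2 breakpoints; a single inversion explains both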
    Our comparisons of complete genome sequences revealed that genome structures have been extensively shuffled among eubacteria, particularly when the orders of orthologous genes were examined. Moreover, archaebacterial and eukaryotic genome structures were found to be unstable too, as was the case for eubacteria. We then turned our attention to operon structures, which were expected to be well conserved during evolution because of their regulatory importance. Surprisingly, however, we found that even within operons gene orders have not been conserved, with the exception of only a few cases such as ribosomal operons. When we reconstructed the ancestral genome structures of eubacteria and archaebacteria and examined the relative instability of genome structures among eubacteria, we found differences in the degree of genome instability among the examined species. The genome instability appears to be correlated with the number of insertion sequences. Interestingly, the intensity of the intrastrand bias of nucleotide composition (G-C skew) was found to be affected by genome instability, implying that the accumulation of strand-specific mutations depends heavily on the stability of a genome. These findings imply that gene orders have not been essential for the survival of microbes in long-term evolution, and that the evolutionary instability of genome structures is an intrinsic property common to eubacteria, archaebacteria and eukaryotes. For eukaryotic genomes, we found that many gene fusion events may have occurred early in eukaryotic evolution, compensating for the loss of bacterial operon structures. The evolutionary instability of genome structures may be one of the most important factors in understanding the processes of genome evolution.
    Bacterial genome size
    Comparative Genomics
    Genome size
    Citations (0)
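
The abstract above relates genome instability to the intensity of the G-C skew. For reference, the sketch below computes that statistic in sliding windows as (G - C)/(G + C); the window size, step and toy sequence are arbitrary illustration values, not the authors' settings.

def gc_skew(seq, window=1000, step=500):
    """Yield (window start, (G - C)/(G + C)) along a DNA string."""
    seq = seq.upper()
    for start in range(0, max(len(seq) - window + 1, 1), step):
        win = seq[start:start + window]
        g, c = win.count("G"), win.count("C")
        yield start, (g - c) / (g + c) if g + c else 0.0

# Toy usage: a made-up sequence that is G-rich in its first half and C-rich in its second,
# so the skew flips sign at the midpoint, as it does near replication origins and termini.
toy = "G" * 600 + "A" * 400 + "C" * 600 + "T" * 400
for pos, skew in gc_skew(toy, window=500, step=500):
    print(pos, round(skew, 2))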
    The mechanism by which protein-coding portions of eukaryotic genes came to be separated by long non-coding stretches of DNA, and the purpose of this perplexing arrangement, have remained unresolved fundamental biological problems for three decades. We report here a plausible solution to this problem based on an analysis of open reading frame (ORF) length constraints in the genomes of nine diverse species. If primordial nucleic acid sequences were random in sequence, functional proteins that are innately long could not be encoded, owing to the frequent occurrence of stop codons. The best possible way that a long protein-coding sequence could have been derived was by evolving a split structure from the random DNA (or RNA) sequence. The results of the systematic analyses of nine complete genome sequences presented here suggest that the major underlying structural features of split genes may have evolved because split protein-coding genes occur intrinsically in primordial random nucleotide sequence. The results also suggest that intron-rich genes containing short exons may have been the original form of genes intrinsically occurring in random DNA, and that intron-poor genes containing long exons were perhaps derived from the original intron-rich genes.
    Coding region
    Stop codon
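
The argument above rests on how rarely long stop-free reading frames occur in random sequence. The back-of-the-envelope sketch below checks that intuition: with 3 stop codons out of 64, a random reading frame averages only about 20 sense codons between stops, and an uninterrupted stretch of 300 sense codons has probability (61/64)^300, roughly five in ten million. The sequence length and random seed are arbitrary.

import random

STOPS = {"TAA", "TAG", "TGA"}

def stop_free_runs(seq, frame=0):
    """Lengths (in codons) of stop-free stretches in one reading frame."""
    lengths, run = [], 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            lengths.append(run)
            run = 0
        else:
            run += 1
    lengths.append(run)
    return lengths

random.seed(0)
dna = "".join(random.choice("ACGT") for _ in range(300_000))
runs = stop_free_runs(dna)
print("mean stop-free stretch:", round(sum(runs) / len(runs), 1), "codons")   # ~20 expected
print("longest stretch:", max(runs), "codons")
print("P(300 sense codons in a row):", (61 / 64) ** 300)                      # ~5e-7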
    The almost complete nucleotide sequence of several eukaryotic genomes is now available, including yeast, worm, fly and man, as well as the plant Arabidopsis thaliana, adding to the knowledge gained from around sixty bacterial genomes, both Archaea and Eubacteria. From this vast amount of raw information, general conclusions can be drawn. First, the complexity of a higher organism is not directly related to the number of genes found in its genome; for instance, human genes are probably less than twice as numerous as those of the nematode, an animal made of fewer than one thousand cells. Second, alternative splicing is common in higher organisms, affecting a good half of the genes and giving rise to several gene products per gene that are tissue specific. Third, roughly half of the genes are of unknown function, even within the limited set of eukaryotic orthologous genes common, for instance, to man, fly and worm, and likely to belong to the cell's basic machinery. Describing the function of all gene products is now the challenge of the post-sequencing era; at present, the function of a protein is not semantically well defined: the function may be molecular (enzymes are often defined this way), or derived from a phenotype (oncogenes cause cancer) or from a subcellular localization (nucleolin resides in the nucleolus), hence the huge efforts currently being made to define gene ontologies. More and more, biologists tend to talk about contextual function, that is to say, all the interactions of the protein of interest with all other cell components, whether the interactions are physical (as in a particle like the spliceosome) or only functional (metabolic pathways). This is of course a huge task, for which in silico methods can be used, such as the Rosetta stone approach, phylogenetic profiles, or chromosomal colocalization, in addition to experimental techniques such as the mRNA coexpression observed, for instance, in microarray experiments. When two of these tools agree, there is good confidence that the two proteins have a functional interaction, resulting in the progressive building of the complex network of interactions within the living cell. The main avenues of research are now clearly visible, and they use techniques that are miniaturized, robotized and parallelized. The first is the rapid expansion of the proteomics field, which takes advantage of remarkable progress in mass spectrometry. The second is the massive structural resolution of recombinant proteins, purified and crystallized on a micro scale, analyzed by synchrotron diffraction, and their structures solved with ever more powerful computation. Finally, common diseases benefit from better knowledge of genome polymorphisms (single nucleotide polymorphisms as well as microsatellites) in large studies similar to the one performed in Iceland by linking databases of the pedigrees of the whole population, the medical records, and the corresponding genotypes. These are the new avenues to drug discovery.
    Pseudogene
    Citations (0)
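
Among the in silico methods listed above is the phylogenetic-profile approach. The sketch below shows the idea in its simplest form: each protein is summarised by a presence/absence vector across a panel of genomes, and proteins with near-identical profiles are flagged as candidate functional partners. The genome panel and the profiles themselves are invented for illustration.

GENOMES = ["yeast", "worm", "fly", "human", "arabidopsis", "ecoli"]

profiles = {                          # 1 = a homologue is present in that genome
    "protA": [1, 1, 1, 1, 0, 0],
    "protB": [1, 1, 1, 1, 0, 0],      # identical profile to protA -> possible partner
    "protC": [1, 0, 1, 0, 1, 1],
}

def profile_distance(p, q):
    """Hamming distance between two presence/absence vectors."""
    return sum(a != b for a, b in zip(p, q))

query = "protA"
ranked = sorted((profile_distance(profiles[query], prof), name)
                for name, prof in profiles.items() if name != query)
for dist, name in ranked:
    print(f"{name}: {dist} mismatches across {len(GENOMES)} genomes")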
    This thesis focuses on tools designed to increase our knowledge of bacterial genetics, and on how this knowledge can help in the process of genetic engineering. It is split into two main areas: the first concerns the development of a methodology that allows for random genome reductions in Mycoplasma pneumoniae, the second an exploration of essential genes across the bacterial domain. In the first part, we document the development and iteration of a novel protocol for the random deletion of genetic material in M. pneumoniae. Traditionally, genome reduction methodologies rely on an a priori justification of what to delete. However, these assumptions may be biased by our incomplete knowledge of gene functions and of their epistatic interactions with the rest of the genome. As such, our determinations of which areas of a genome can be removed successfully may not be accurate or optimised. To address this, we developed a methodology to remove sections of the genome in a random manner, thus bypassing any implicit biases about what to delete. We demonstrate that our methodology is effective, describe the iterations we undertook to improve its efficacy, show that it is self-selective for strains harbouring a genetic reduction and that it can produce a high level of variation in both the size and location of deletions, and outline a modified sequencing protocol capable of detecting and localising deletions in a heterogeneous pool in a high-throughput manner. The second part of the thesis concerns the identification of trends regarding which genes are considered essential across the bacterial domain. Over the last two decades, we have been able to fully sequence the genomes of thousands of bacteria and have found that, despite their great diversity, there are still commonalities among them at the level of shared genes. However, there are no data on how essential to life these near-universal genes are. The number of bacterial species that have had their essential genes identified is far lower, but we compiled as many studies as we could find that shared a common gene disruption and sequencing methodology. A database of genes extracted from a sample of 47 species spanning 8 different phyla was constructed, clustering the genes into groups of homologs and assigning essentiality data from the individual studies and functional data from the Clusters of Orthologous Genes (COG) database. This database was then interrogated to see whether there were trends relating to which genes are conserved and which genes are essential. Our list of highly conserved genes matches those found by previous groups well. However, when essentiality is considered, we find very few genes that can be considered universally essential. Of these, the vast majority pertain to translational machinery. We also found that there is a subset of genes that are very highly conserved but rarely essential to cell survival. With regard to genome size versus essentiality, we found that while there is little correlation between the number of genes a genome contains and the number of essential genes, the composition of a bacterium's essential genome does change with complexity. The essential genes of a minimal genome are dominated by transcription, translation and DNA replication/repair genes, but as complexity increases, the number of essential genes relating to cellular signalling and housekeeping rises, along with a modest increase in metabolism genes. These two parts can work synergistically to improve our knowledge of genome engineering.
Random genome deletions can both help minimise bacterial genomes and provide information on more complex networks of essentiality by deleting multiple genes simultaneously. This knowledge of essentiality can then be queried against a larger database to begin uncovering which networks or individual genes can be deleted, or are at least non-essential, in a large number of species. This in turn can help us build a greater understanding of which systems are more viable deletion targets in the future, and which appear to have functionalities that we should strive to preserve.
    Bacterial genome size
    Epistasis
    Identification
    Citations (0)
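
The second part of the thesis describes interrogating a table of homologue clusters annotated with per-species essentiality. The sketch below is a hypothetical, much-simplified version of that kind of query: the cluster identifiers, species and essentiality calls are invented, and the real database covers 47 species rather than three.

from collections import defaultdict

# One record per gene: (homologue cluster, species, essential in that species?)
records = [
    ("CLU0001", "species_1", True),
    ("CLU0001", "species_2", True),
    ("CLU0001", "species_3", True),
    ("CLU0002", "species_1", False),
    ("CLU0002", "species_2", False),
    ("CLU0002", "species_3", True),
    ("CLU0003", "species_2", True),
]

N_SPECIES = 3    # 47 in the thesis; 3 in this toy example

by_cluster = defaultdict(list)
for cluster, _species, essential in records:
    by_cluster[cluster].append(essential)

for cluster, flags in sorted(by_cluster.items()):
    conserved = len(flags) / N_SPECIES                    # fraction of species with a homologue
    label = ("universally essential" if len(flags) == N_SPECIES and all(flags)
             else "conserved but not always essential" if len(flags) == N_SPECIES
             else "patchily distributed")
    print(f"{cluster}: present in {conserved:.0%} of species, {label}")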