Cross-Referencing Eukaryotic Genomes: TIGR Orthologous Gene Alignments (TOGA)

2002 
The underlying goal of the Human Genome Project is the identification and functional characterization of the entire catalog of human genes. With the available complete sequences or comprehensive drafts of several eukaryotic genomes including Saccharomyces cerevisiae (Goffeau et al. 1996), Caenorhabditis elegans (The C. elegans sequencing consortium 1998), Drosophila melanogaster (Adams et al. 2000), Arabidopsis thaliana (The Arabidopsis genome initiative 2000), and human (International human genome sequencing consortium 2001; Venter et al. 2001), our ability to identify genes and analyze their functions and interactions through cross-species comparisons is improving rapidly. Nevertheless, identification and classification of gene sequences remains a significant challenge because of the lack of experimental evidence and the apparent shortcomings of the available gene prediction programs (Guigo et al. 2000). Of the estimated 35,000–60,000 human genes (Crollius 2000; Ewing and Green 2000; Liang et al. 2000a), fewer than 10,000 are represented by functionally characterized mRNA sequences in GenBank. Although many newly discovered genes might reveal their functions through disease-related studies, classifying the entire collection will require the analysis of related genes in experimentally tractable organisms. For most other eukaryotic species, the number of available gene sequences is more limited, and for many, the generation of complete genomic sequence data is not likely in the near future. However, there exist more than 7,000,000 publicly available expressed sequence tag (EST) sequences in dbEST, representing a wide diversity of eukaryotic species. Using a compact representation of those sequences within the TIGR Gene Index (TGI) databases (Liang et al. 2000b; Quackenbush et al. 2001), we created TOGA, the TIGR orthologous gene alignments, as a tool to explore genes and their relationships across species. 
Cross-referencing the available genomic data has several important applications, including the identification of homologous genes in eukaryotes. Gene homologs can be separated into two classes, orthologs and paralogs (Fitch 1970; Gogarten and Olendzenski 1999; Eisen 1998). Orthologs are genes that are related by direct evolutionary descent, whereas paralogs are homologous genes that are the result of a duplication event within the same lineage. The identification of orthologs is particularly important because these genes should play similar developmental or physiological roles, and consequently, their study in rodent or other models can provide insight into their functions in humans. Although such an analysis has been performed for the completed microbial genomes and yeast (Tatusov et al. 1997, 2000), the lack of a comprehensive set of coding genes in many representative organisms has hampered the development of a similar resource for eukaryotes. For the completed Drosophila and C. elegans genomes, comparisons with the available gene sequence data revealed 2758 human–fly orthologs and 2031 human–worm orthologs, of which 1523 orthologs were common to both groups (Venter et al. 2001). The most extensive survey of orthologs in mammals is a study by Makalowski and Boguski (1998), in which they analyzed 1880 human–rodent ortholog pairs: 1212 rat–human pairs, 1138 mouse–human pairs, and 470 genes shared by all three species. As might be expected, both amino acid sequences and their corresponding DNA coding sequences were found to be highly conserved. More surprising is the high degree of conservation of the untranslated regions (UTRs) flanking the coding sequence: 71.0 ± 12.2% identity for mouse–human orthologs, 70.1 ± 11.4% for rat–human orthologs, and 86.3 ± 8.9% for mouse–rat orthologs. 
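The human–rodent counts quoted above are mutually consistent under inclusion-exclusion. The short Python sketch below checks the arithmetic. It assumes that the 470 genes shared by all three species are exactly the overlap between the rat–human and mouse–human sets; that interpretation, and the function name `union_size`, are illustrative and not taken from the cited study.

```python
def union_size(set_a: int, set_b: int, overlap: int) -> int:
    """Inclusion-exclusion for two sets: |A or B| = |A| + |B| - |A and B|."""
    return set_a + set_b - overlap


# Counts as quoted in the text from Makalowski and Boguski (1998).
rat_human = 1212    # rat-human ortholog pairs
mouse_human = 1138  # mouse-human ortholog pairs
all_three = 470     # assumed to be the overlap of the two sets above

total = union_size(rat_human, mouse_human, all_three)
rat_only = rat_human - all_three      # human genes paired with rat only
mouse_only = mouse_human - all_three  # human genes paired with mouse only

print(total, rat_only, mouse_only)
```

Under that assumption the total matches the 1880 human–rodent pairs reported in the text.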
It is this high degree of sequence conservation in the UTRs, in combination with the wealth of partial gene sequence data available through EST projects, that led us to believe that orthologs could be identified through DNA-based sequence comparisons. Although the more than 8,000,000 available EST sequences make the necessary pair-wise comparisons a computationally and logistically daunting task, the TGI databases (Liang et al. 2000a; Quackenbush et al. 2001), which assemble gene and EST sequences into tentative consensus (TC) sequences, make assembling a database of orthologs spanning many species feasible. There are presently 28 species represented in the TGI (Table 1), including five mammals, 10 plants, seven eukaryotic parasites, and six other model organisms. These databases are updated every 3–6 months, depending on the availability of newly generated EST and gene sequence data, and can be accessed at http://www.tigr.org/tdb/tgi.shtml. In total, there are 328,337 TCs, 1,211,636 singleton ESTs, and 46,511 singleton ETs (expressed transcripts, or gene sequences) represented in the various TGI databases. It is our long-term goal to represent the full set of gene transcripts for an increasing number of organisms; these databases serve as our starting point for ortholog identification.

Table 1. Summary Statistics for Inclusion of TC and sET Sequences in TOGA for Each of the 28 Species-Specific TGI Databases Represented