Overview of gene structure in C. elegans

2014 
In the early stage of the C. elegans sequencing project, the ab initio gene prediction program Genefinder was used to find protein-coding genes. Subsequently, protein-coding gene structures have been actively curated by WormBase using evidence from all available data sources. Most coding loci were identified by the Genefinder program, but the process of gene curation results in a continual refinement of the details of gene structure, involving the correction and confirmation of intron splice sites, the addition of alternate splicing forms, the merging and splitting of incorrect predictions, and the creation and extension of 5’ and 3’ ends. As new technologies make further data sources available, these are incorporated into the evidence used to support the curated structures. Non-coding genes are more difficult to curate using this methodology, and so the structures for most of these have been imported from the literature or from specialist databases of ncRNA data. This article describes the structure and curation of transcribed regions of genes.

1. What is a gene?

Sydney Brenner, the founder of modern worm biology, once said, “Old geneticists knew what they were talking about when they used the term ‘gene’, but it seems to have become corrupted by modern genomics to mean any piece of expressed sequence...” (Brenner, 2000). Dr. Brenner's lament serves to illustrate two points. The first is that the concept of a gene can mean different things to different people in different contexts. The second is that the concept of a gene has been evolving, not only in the modern genomic era, but ever since it first appeared in the early 1900s as a term to conceptualize the particulate basis of heritable physical traits (Snyder and Gerstein, 2003). Therefore, in a review of gene structure in C. elegans it seems prudent to define what we mean by a gene.
Our definition of a gene is essentially: “a union of genomic sequences encoding a coherent set of potentially overlapping functional products” (Gerstein et al., 2007). This encompasses promoters and control regions necessary for the transcription, processing and, if applicable, translation of a gene. Hence, we include not only protein-coding genes (genes that encode polypeptides), but also non-coding RNA genes (ribosomal RNA, transfer RNA, micro RNA, anti-sense RNA, piwi-interacting RNA, and small nuclear RNA genes). One additional type of gene we will briefly discuss is the pseudogene, though these are not usually considered to be functional.

The full extent of most C. elegans genes is not known because promoters remain, for the most part, incompletely defined. Even the full extent of the primary transcript is frequently not known because a majority (70%) of protein-coding genes are rapidly modified by trans-splicing, which involves the addition of a short 22 nt exogenous RNA species to the 5’ end of a transcript (Zorio et al., 1994). Definition of the true 5’ ends of genes is an active area of research (Chen et al. 2013; Kruesi et al. 2013; Saito et al. 2013; Gu et al. 2012). Some non-coding RNA genes are also trans-spliced; a precursor of the microRNA let-7 (C05G5.6) was identified with a trans-splice leader sequence (Bracht et al., 2004). This article is concerned primarily with the properties of the transcribed regions of C. elegans genes.

2. Protein-coding genes

2.1. Prediction and curation

In the initial stage of the C. elegans sequencing project, prior to the publication of the genome in 1998 (The C. elegans Sequencing Consortium, 1998), Genefinder (Green and Hillier, unpublished software) was the gene prediction program of choice. Genefinder is an ab initio predictor and requires only a genomic DNA sequence and parameters based on a training set of confirmed coding sequences.
Note that Genefinder, like most other gene prediction tools, is actually a coding sequence (CDS) predictor and does not attempt to define untranslated regions (UTRs).

In the WormBase database, the structure of a coding gene is held as three different types of data. The first is the “Gene”, which holds information on the span from the start to the end of the transcribed region of that locus. The second is the “Transcript”, which holds the exon structure, including the 5’ and 3’ UTRs, and attempts to faithfully model a mature mRNA sequence. The third is the “CDS”, which is purely a protein-coding set of exons from a START codon to a STOP codon, with no UTR. Only the “CDS” structure is manually curated, often with two or more “CDS” isoforms being made for the same gene, based on evidence for trans-splicing
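
The relationship between the three data types described above can be pictured as a small data model. The following is a minimal Python sketch for illustration only: the class names, field names, example identifiers and coordinates are all hypothetical, and it does not reproduce the actual WormBase schema. It assumes 1-based, inclusive genomic coordinates and shows how the UTR of a Transcript can be derived as whatever lies outside the span of its CDS.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An exon as (start, end); 1-based inclusive coordinates are an assumption.
Exon = Tuple[int, int]


@dataclass
class CDS:
    """Protein-coding exons only, from START codon to STOP codon (no UTR)."""
    name: str
    exons: List[Exon]

    @property
    def span(self) -> Exon:
        return (min(s for s, _ in self.exons), max(e for _, e in self.exons))


@dataclass
class Transcript:
    """Exon structure of a mature mRNA, including 5' and 3' UTRs."""
    name: str
    exons: List[Exon]
    cds: CDS

    def utr_length(self) -> int:
        """Total transcribed bases lying outside the CDS span."""
        cds_start, cds_end = self.cds.span
        total = 0
        for start, end in self.exons:
            if start < cds_start:  # part of this exon precedes the CDS
                total += min(end, cds_start - 1) - start + 1
            if end > cds_end:      # part of this exon follows the CDS
                total += end - max(start, cds_end + 1) + 1
        return total


@dataclass
class Gene:
    """Span from the start to the end of the transcribed region of a locus."""
    name: str
    transcripts: List[Transcript] = field(default_factory=list)

    @property
    def span(self) -> Exon:
        starts = [s for t in self.transcripts for s, _ in t.exons]
        ends = [e for t in self.transcripts for _, e in t.exons]
        return (min(starts), max(ends))


# Hypothetical locus with one isoform; identifiers and coordinates are invented.
cds_a = CDS("example.1a", exons=[(150, 300), (400, 520)])
transcript_a = Transcript("example.1a.1", exons=[(100, 300), (400, 600)], cds=cds_a)
gene = Gene("example-gene", transcripts=[transcript_a])

print(gene.span)                  # span of the transcribed region
print(transcript_a.utr_length())  # bases of UTR flanking the CDS
```

In this sketch a second curated isoform would simply be another CDS with its own Transcript appended to the same Gene, and the Gene span would widen to cover both.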