Gene prediction platform improved with second generation sequencers

2008 
Genomic research is advancing rapidly in sequencing technologies, bioinformatics and functional analysis. As part of the Italian Grape Genome Project, we developed a bioinformatic platform for gene prediction and annotation of the grape genome. The platform integrates heterogeneous data such as ab-initio predictions and EST and protein alignments. In particular, we improved gene prediction by taking advantage of new sequencing technologies such as 454 and Solexa, which offer a higher level of information. Preliminary results show their contribution to the fine tuning of gene structure.

Introduction

Gene prediction and annotation is the process of searching for genes or regulatory elements in the primary DNA sequence and assigning a function to the predicted genes. The increasing number of sequencing projects and the availability of completely sequenced genomes have led to the rapid development of software that identifies intron-exon boundaries and gene structure from different sources of information. Gene prediction draws on several types of evidence and approaches, ranging from ab-initio prediction to comparative genomics. Ab-initio predictors are based on hidden Markov models (HMMs) and predict gene structure from the nucleotide sequence alone. A higher and more accurate level of information comes from the alignment of real biological sequences such as ESTs, proteins and whole genomes. EST alignment is the most reliable source of evidence in gene prediction, because ESTs represent the transcribed portion of a genome. Whole-genome alignment identifies portions conserved between DNA sequences, which may represent conserved genes. Protein alignments, in turn, offer indirect evidence of a coding sequence. Information from splice sites, microRNAs, pseudogenes and any other available resource can further improve the prediction.
As part of the Italian Grape Genome Project, our research group was tasked with developing and managing a bioinformatic platform for gene prediction on the grape genome. The platform is an automated, modular system that processes the input genome sequence and predicts genes using as much evidence as possible. Data from the different information sources are collected, integrated and combined into a consensus that represents the final predictions. At present, the platform includes four ab-initio gene predictors (SNAP, GeneID, Glimmer, TigrScan); two EST aligners (GMAP and est2genome), which align three distinct sets of plant ESTs; the MUMmer pairwise aligner, used to align three plant genomes (Populus trichocarpa, Arabidopsis thaliana, Oryza sativa); and a protein alignment program (GeneWise) that uses the UniProt database. Finally, JIGSAW is used to combine all the data and produce the final predictions.

New generation sequencing technology

454 and Solexa sequencers produce considerable amounts of data in a short time and at low cost. For example, one run of these instruments is sufficient to obtain high coverage of an EST library that can be used to improve gene prediction. The high coverage and quality of these systems bring useful new features: the data potentially contain information on alternative splicing, UTRs and gene expression profiles. However, the sequences produced by these technologies are short (30 bp for Solexa) and difficult to analyze with common alignment software. For this reason, we developed at CRIBI dedicated software that processes and aligns Solexa sequences with high accuracy (see Manasky et al poster). 454 sequences are longer (250 bp) than Solexa reads, and their integration into the platform is simpler.
To determine the contribution of the new generation sequencing technologies to overall gene prediction quality, we set up several tests considering different combinations of the 454 and Solexa sequences (both spliced and non-spliced sequences), for a total of 8 analyses. In particular, we performed the prediction on the Vitis genome using 500 full-length genes as the training set for the platform and 150 as the test set to assess prediction quality.

Results

Preliminary results show that the best performance is obtained using the 454 sequences. At the nucleotide level, the F-measure (2*Sp*Sn/(Sp+Sn), where Sp is specificity and Sn is sensitivity) is 0.97 for the control run, in which neither 454 nor Solexa sequences are considered, 0.986 using only Solexa reads, 0.99 using 454 sequences and 0.985 using both 454 and Solexa. At the exon level, the F-measure is 0.93 for the control run, 0.952 for the Solexa run, 0.96 for the 454 run and 0.955 for Solexa and 454 together.

Conclusion

The results show that the new generation sequences improve annotation quality. The contribution of 454 seems to be stronger than that of Solexa, probably because of the different kind of information provided by the Solexa reads. One of the major problems is that the large amount of information provided by these new generation sequences causes a sort of overfitting that masks the signal from the other sources of evidence. As a consequence, many genes can be lost or incorrectly identified in regions that are not covered by 454 or Solexa sequences but are covered by other strong evidence. To avoid these problems, we clustered similar public Vitis vinifera EST and 454 sequences in order to redistribute the weight that each source of evidence gives to the prediction process. Moreover, the Solexa sequences were split into two groups in order to consider different kinds of information: reads with splice sites and reads without splice sites. For the former, only splice site information is considered; for the latter, the prediction takes into account only coding region information, avoiding overfitting effects.
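The nucleotide-level F-measure quoted in the results can be illustrated with a minimal Python sketch. It assumes the convention common in gene prediction evaluation, where Sn = TP/(TP+FN) and Sp = TP/(TP+FP) are computed over coding nucleotide positions; the abstract gives only the F-measure formula, so these definitions, the function names and the toy coordinates are assumptions for illustration, not the platform's actual code or data.

```python
def nucleotide_counts(annotated, predicted):
    """Return (TP, FP, FN) counted over coding nucleotide positions.

    Both arguments are lists of (start, end) coding intervals,
    1-based and inclusive (an assumed coordinate convention).
    """
    def positions(intervals):
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        return covered

    truth = positions(annotated)
    calls = positions(predicted)
    tp = len(truth & calls)   # coding positions predicted as coding
    fp = len(calls - truth)   # predicted coding, annotated non-coding
    fn = len(truth - calls)   # annotated coding, missed by the prediction
    return tp, fp, fn


def sensitivity(tp, fn):
    """Sn: fraction of annotated coding positions that were predicted."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp, fp):
    """Sp: fraction of predicted coding positions that are annotated."""
    return tp / (tp + fp) if tp + fp else 0.0


def f_measure(sp, sn):
    """F-measure as given in the text: 2*Sp*Sn/(Sp+Sn)."""
    return 2 * sp * sn / (sp + sn) if sp + sn else 0.0


if __name__ == "__main__":
    # Hypothetical two-exon gene: the prediction misses 10 bp at the
    # start of the first exon and overshoots the second exon by 10 bp.
    annotated = [(101, 200), (301, 400)]
    predicted = [(111, 200), (301, 410)]
    tp, fp, fn = nucleotide_counts(annotated, predicted)
    sn, sp = sensitivity(tp, fn), specificity(tp, fp)
    print(tp, fp, fn, sn, sp, f_measure(sp, sn))
```

In this toy case 190 of the 200 annotated positions are recovered and 10 are wrongly predicted, so Sn and Sp are both 0.95 and the F-measure is 0.95 as well. An exon-level score could be computed the same way by counting exactly matching exons instead of positions.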