A protein is a sequence of amino acid residues. The amino acid sequence of a protein can usually be divided into several subsequences, each thought to be an independent component or region; these are called motifs or domains. Zinc fingers are motifs with a characteristic structure that holds a zinc ion in its core through several (usually four) amino acid residues, in most cases cysteines or histidines. Zinc finger proteins act as transcription factors because they bind to specific DNA sequences, and they are therefore called DNA-binding proteins [2]. In this research, protein data from three species are used for the motif search: Oryza sativa and Arabidopsis thaliana as plants, and Drosophila melanogaster as an insect. Protein data for A. thaliana and D. melanogaster were obtained from the GenBank FTP site. Protein data for O. sativa were extracted as open reading frames (ORFs) from cDNA sequences produced by the Rice Full-Length cDNA Sequencing Project at the National Institute of Agrobiological Sciences (NIAS), FAIS, and RIKEN. We selected 13,919 cDNA sequences and extracted 13,554 proteins from them as ORFs [1]. These data are not yet public.
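As an illustration of what a zinc finger motif search over protein sequences can look like, the following minimal sketch scans a sequence with a regular expression. The pattern is an assumption: it is a regex rendering of the widely used PROSITE-style C2H2 signature (two cysteines followed by two histidines at fixed spacings), not necessarily the pattern or search method used in this work. The function name find_c2h2_motifs and the example sequence are hypothetical.

```python
import re

# Assumed pattern: PROSITE-style C2H2 zinc finger signature,
# C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, written as a regex.
# The source does not state which pattern was actually searched.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")


def find_c2h2_motifs(protein):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 0-based; `end` is exclusive, as in Python slicing.
    """
    return [(m.start(), m.end(), m.group())
            for m in C2H2_PATTERN.finditer(protein)]


if __name__ == "__main__":
    # Hypothetical toy sequence built to contain one C2H2-like stretch.
    toy = "MK" + "CAACAAALAAAAAAAAHAAAH" + "GG"
    for start, end, text in find_c2h2_motifs(toy):
        print(start, end, text)
```

A regular expression only captures the spacing of the zinc-coordinating residues; profile-based or structure-aware methods would be needed to rank candidate motifs more reliably.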
We collected and completely sequenced 28,469 full-length complementary DNA clones from Oryza sativa L. ssp. japonica cv. Nipponbare. Through homology searches of publicly available sequence data, we assigned tentative protein functions to 21,596 clones (75.86%). Mapping of the cDNA clones to genomic DNA revealed that there are 19,000 to 20,500 transcription units in the rice genome. Protein informatics analysis against the InterPro database revealed the existence of proteins present in rice but not in Arabidopsis. Sixty-four percent of our cDNAs are homologous to Arabidopsis proteins.
Here we introduce our application of wavelet analysis to DNA sequences. In the signal processing field, the Fourier transform is a popular tool for analyzing wave data. However, although it captures frequency information, it does not retain positional information. In contrast, the wavelet method handles both positional and frequency information, and it is becoming increasingly important in signal processing. The fast Fourier transform has already been applied to biological sequence analysis using correlations. We introduce a new method for biological sequence analysis, called the wavelet profile. Our method is based on the multiresolution analysis of the wavelet transform, which decomposes data at several scales simultaneously. We applied the wavelet profile method to identify gene loci in O. sativa genomic sequences.
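To make the idea of multiresolution analysis concrete, the sketch below decomposes a numeric encoding of a DNA sequence with the Haar wavelet, the simplest wavelet basis. Everything specific here is an assumption for illustration: the source names neither the wavelet basis nor the numeric encoding, so the GC-indicator encoding, the Haar basis, and the function names gc_signal, haar_step, and haar_multiresolution are hypothetical and do not represent the wavelet profile method itself.

```python
import math


def gc_signal(dna):
    """Encode a DNA string as 1.0 for G/C and 0.0 otherwise (assumed encoding)."""
    return [1.0 if base in "GCgc" else 0.0 for base in dna]


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list.

    Returns (approximation, detail): scaled pairwise sums and differences.
    """
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_multiresolution(signal, levels):
    """Apply haar_step repeatedly to the approximation coefficients.

    Returns (final_approximation, details), where details[k] holds the
    detail coefficients at scale k + 1 (finest scale first). Stops early
    if the current approximation has odd length or fewer than two values.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2 != 0:
            break
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


if __name__ == "__main__":
    approx, details = haar_multiresolution(gc_signal("GGCCAATT"), 3)
    print("approximation:", approx)
    for level, coeffs in enumerate(details, start=1):
        print("detail level", level, coeffs)
```

Each detail coefficient is tied to a specific position and scale, so a change in base composition shows up at the scale and location where it occurs. In the toy sequence the GC-rich and AT-rich halves differ only at the coarsest scale, so the finer detail levels are zero.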
This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.