FindGDPs: identification of primers for labeling microbial transcriptomes for DNA microarray analysis.

2003 
DNA microarrays have been used to investigate the differential expression of entire bacterial transcriptomes in response to environmental stimuli (Wilson et al., 1999). Central to the application of microarray technology in this context is the need to label bacterial mRNA with a detectable molecule such as a fluorophore. Frequently, reverse transcription of total RNA in the presence of a fluorescent nucleotide analog is employed to meet this need, but the lack of a conserved nucleotide motif at the 3′ end of bacterial mRNAs prevents the use of a single primer (e.g., oligo-dT) in bacterial labeling reactions (Lakey et al., 2002). Talaat et al. recently described an algorithm for identifying a set of oligonucleotide primers that anneal to all of the ORFs in a microbial genome (Talaat et al., 2000). These authors demonstrated that the use of these genome-directed primers (GDPs) resulted in an improved signal-to-noise ratio over that observed when random hexamers were used. However, this program is only available for Macintosh and Windows NT/2000; it is not available for other versions of Windows or other operating systems.
This paper describes the development of FindGDPs, a program that quickly identifies a set of GDPs that fulfills two criteria. First, the members of this set anneal to all of the ORFs in a genome, and second, they do not exhibit full-length complementarity to members of another set of user-supplied nucleotide sequences. FindGDPs also offers advantages in speed and portability. It requires only seconds to identify a set of GDPs for common microbial genomes (Table 1), and since it is written in C++, FindGDPs will run on any platform for which a C++ compiler is available.
Table 1. Comparison of running FindGDPs and GDPFinder (Talaat et al., 2000) on four different, annotated microbial genomes.
FindGDPs is run from the command line and prompts the user for the required runtime parameters. Two input files are required prior to running the program.
The first input file contains, in FastA format, the nucleotide sequences of all the ORFs for which primers are to be designed. The second input file, also in FastA format, contains any nucleotide sequences to which the GDPs should not exhibit full-length complementarity. The user must also specify the length of the desired GDPs (6, 7, or 8 nucleotides; referred to as n), as well as the percentage of the 3′ end of each ORF (contained in the first file) to scan for potential GDPs.
The program begins by processing the file containing sequences to which GDPs should not exhibit full-length complementarity. Each sequence in the file is read and converted to its reverse complement to obtain the non-coding strand. The non-coding sequence is scanned using a moving window of length n, and each n-mer contained therein is noted in a table. After reading all sequences in this file, the table contains all of the n-mers that cannot be used as GDPs. The program then performs a similar operation on each of the sequences in the file containing the ORFs for which GDPs are to be designed. The specified percentage of the 3′ end of each ORF (given as a runtime parameter) is converted to its reverse complement and scanned to identify potential GDPs. As each potential GDP is identified, the table of invalid GDPs is checked to see if the potential GDP shows full-length complementarity to any of the prohibited sequences (i.e., those in the second file). If the potential GDP is valid, it is noted in a table of potential GDPs for the current ORF. After scanning the specified region at the 3′ end of the current ORF, the table is written to a temporary file. This process is repeated for each ORF in the input file. After all ORFs have been processed, a greedy algorithm is employed to identify a set of primers (selected from the set of valid primers) that anneal to the 3′ end of all of the ORFs.
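The two scanning passes described above can be illustrated with a short Python sketch. This is not the FindGDPs source (which is written in C++); the function names, the use of in-memory sets instead of a temporary file, and the rule that at least n nucleotides of the 3′ end are always scanned are assumptions made for illustration. FastA parsing is omitted, so sequences are passed in as plain strings.

```python
# Illustrative sketch only; names and details are assumptions, not FindGDPs code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, n):
    """Yield every n-mer in seq using a moving window of length n."""
    for i in range(len(seq) - n + 1):
        yield seq[i:i + n]


def prohibited_table(sequences, n):
    """Collect every n-mer that may not be used as a GDP.

    Each prohibited sequence is reverse-complemented, and all n-mers of
    the resulting non-coding strand are stored in a set.
    """
    table = set()
    for seq in sequences:
        table.update(kmers(reverse_complement(seq.upper()), n))
    return table


def candidate_gdps(orf, n, percent, prohibited):
    """Return the valid candidate GDPs for one ORF.

    Only the last `percent` percent of the ORF (its 3' end) is scanned.
    Assumption: at least n nucleotides are always scanned.
    """
    orf = orf.upper()
    tail_len = max(n, len(orf) * percent // 100)
    tail = orf[-tail_len:]
    return {k for k in kmers(reverse_complement(tail), n)
            if k not in prohibited}
```

For example, under these assumptions, `candidate_gdps("ACGTACGTAAAAAAGG", 6, 50, prohibited_table(["AAAAAAA"], 6))` scans the reverse complement of the last eight nucleotides and discards the hexamer `TTTTTT`, which is fully complementary to the prohibited sequence.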
The algorithm operates by choosing an ORF for which a GDP has not yet been found, and then selecting the GDP that binds to this ORF as well as to the largest number of other ORFs that still need a GDP. The algorithm repeats until a set of GDPs has been identified such that every ORF in the first input file can be primed by at least one GDP. Furthermore, the members of this set are not complementary across their entire length to any sequence in the second input file (although partial complementarity may be exhibited). This algorithm runs in O(pq²) time, where p is the number of potential n-mers and q is the number of ORFs. Like all greedy algorithms, the program exhibits very short run times, as illustrated in Table 1. The short run times required by FindGDPs, combined with its ability to run on multiple platforms, should facilitate its use in prokaryotic DNA microarray systems.
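The greedy selection step can be sketched in Python as follows. This is a minimal illustration of the procedure as described, not the authors' C++ implementation. The function name, the input layout (a dictionary mapping each ORF name to its set of valid candidate n-mers), the processing of ORFs in input order, the alphabetical tie-break, and the error raised for an ORF with no valid candidate are all assumptions.

```python
# Illustrative sketch only; names and tie-breaking rules are assumptions.
def greedy_gdp_set(candidates):
    """Pick primers so that every ORF is primed by at least one of them.

    candidates: dict mapping ORF name -> set of valid n-mers for that ORF.
    Returns the chosen primers in the order they were selected.
    """
    uncovered = set(candidates)  # ORFs that still need a GDP
    chosen = []
    for orf in candidates:  # assumption: ORFs are visited in input order
        if orf not in uncovered:
            continue
        if not candidates[orf]:
            # Assumption: the source does not say how this case is handled.
            raise ValueError(f"no valid GDP for {orf}")

        def coverage(primer):
            # Number of still-uncovered ORFs this primer would prime.
            return sum(1 for other in uncovered
                       if primer in candidates[other])

        # Sorting makes ties resolve alphabetically (assumption).
        best = max(sorted(candidates[orf]), key=coverage)
        chosen.append(best)
        uncovered -= {o for o in uncovered if best in candidates[o]}
    return chosen
```

In this sketch, each of up to q ORFs considers up to p candidate n-mers, and each candidate is counted against up to q uncovered ORFs, which is consistent with the O(pq²) bound stated above.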