basically what GeneMark does. On a second pass, they retain only those that
are exactly flanked by good splice-site sequences.
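If you like to see ideas as code, here is a minimal sketch of that second pass. It assumes the simplest possible definition of a “good” splice site: the candidate must be immediately preceded by AG (the end of the upstream intron) and immediately followed by GT (the start of the downstream intron). The function names and the toy sequence are made up for illustration; real exon finders score longer, fuzzier splice-site motifs statistically rather than demanding an exact match.

```python
def has_consensus_splice_sites(seq, start, end):
    """Return True if the interval [start, end) is preceded by AG and followed by GT.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    The AG and GT dinucleotides belong to the flanking introns, not the exon.
    """
    seq = seq.upper()
    if start < 2 or end + 2 > len(seq):
        return False  # not enough flanking sequence to check both sites
    return seq[start - 2:start] == "AG" and seq[end:end + 2] == "GT"


def second_pass(seq, candidates):
    """Keep only the candidate intervals flanked by consensus splice sites."""
    return [(s, e) for (s, e) in candidates if has_consensus_splice_sites(seq, s, e)]


# Toy sequence: intron bases in lowercase, two candidate coding stretches in uppercase.
toy = "ccccag" + "ATGGCCAAA" + "gtaaaa" + "ATGCCC" + "tttt"
first_pass_hits = [(6, 15), (21, 27)]  # hypothetical output of a first pass
print(second_pass(toy, first_pass_hits))
```

Only the first candidate survives, because the second one has neither an AG in front of it nor a GT behind it.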
Vertebrate exons are small (about 150 bp long on average), and the sequences of
their splice sites are variable. For this reason, finding exons is a more challenging
problem than ORFing microbial DNA, so don’t expect an exon-detection
program to work nearly as well as something like ORF Finder or GeneMark.
(After all, those programs have much larger targets to shoot at.) Still, we’d
like to show you what an exon-detection program can do. Check out the following
steps, where we use Michael Zhang’s program MZEF:
1. Point your browser to rulai.cshl.edu/.
This is the Zhang Laboratory home page at Cold Spring Harbor
Laboratory, on beautiful Long Island.
2. On the next page, click the Gene-Finding link in the Software Tools
section.
3. On the Gene Finder page that appears, select Human.
This selects the program version calibrated for human coding-region
statistics. A simple input form is then displayed.
4. Copy your sequence from a .txt file or a Word document.
If you don’t have a sequence handy, you can fetch the sequence AF018429
from GenBank at NCBI (www.ncbi.nlm.nih.gov). This entry contains
exons 1 and 2 of the dUTPase gene.
Make sure you get the sequence in FASTA format.
5. Paste the sequence you copied into the input box.
6. Click the Submit button (below the sequence-input box) to start your
analysis.
The program quickly returns a minimally formatted output, as shown
in Figure 5-10. MZEF identified one exon, at positions 1056–1172.
The rest of the line consists of various quality values associated
with the overall prediction, the various reading frames (FR1, FR2,
FR3), the splice site, and the coding potential. The prediction is only half
correct because it missed one of the two exons. Furthermore, the exon it did
find actually starts at position 1018, not 1056. Such a success rate (about
half the exons found, and not always with exact boundaries) isn’t
unusual for exon-finding programs.
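To put a number on “half correct,” here is a small sketch that measures how much of the real exon the prediction covers. The predicted coordinates (1056–1172) and the true start (1018) come from the text above; the true end of 1172 is an assumption, based on the text flagging only the start as wrong. The function name is ours, not part of MZEF.

```python
def overlap_fraction(predicted, actual):
    """Fraction of the actual exon's bases that the predicted exon covers.

    Both intervals are (start, end) pairs, 1-based and inclusive,
    matching the way positions are quoted in the text.
    """
    p_start, p_end = predicted
    a_start, a_end = actual
    shared = max(0, min(p_end, a_end) - max(p_start, a_start) + 1)
    return shared / (a_end - a_start + 1)


predicted = (1056, 1172)  # exon reported by MZEF
actual = (1018, 1172)     # ASSUMPTION: only the start differs from the prediction

print(round(overlap_fraction(predicted, actual), 2))  # 0.75
```

Under that assumption, the prediction covers 117 of the exon’s 155 bases. The other exon is missed entirely, so its overlap is zero.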
An exon predictor that combines MZEF with other approaches (which
enhances its performance) is available at this Michigan Tech site:
genome.cs.mtu.edu/aat/aat.html
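Going back to step 4: if you would rather fetch AF018429 with a script than copy it from a Web page, the sketch below shows one way to do it. It is not part of the procedure above. It assumes NCBI’s E-utilities `efetch` service with the `nuccore` database, and the helper names (`efetch_url`, `fetch_fasta`, `parse_fasta`) are ours. The parser reads only the first record in the file.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an efetch URL that asks for a nucleotide record as plain-text FASTA."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the FASTA text for an accession (needs an Internet connection)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like FASTA")
    seq_lines = []
    for ln in lines[1:]:
        if ln.startswith(">"):  # stop at the start of a second record
            break
        seq_lines.append(ln)
    return lines[0][1:], "".join(seq_lines)


# Example use (commented out because it contacts NCBI):
# header, sequence = parse_fasta(fetch_fasta("AF018429"))
# print(header, len(sequence))
```

The text returned by `fetch_fasta` is what you would paste into the input box in step 5.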
Part II: A Survival Guide to Bioinformatics