Going from protein sequences to DNA sequences
In databases, the correspondence between protein and DNA sequences is not
one-to-one. Many different — even non-overlapping — DNA sequences can be
linked to the same protein or gene name, as the following list makes clear
(flip ahead to Figures 3-1 and 3-2):
Part I: Getting Started in Bioinformatics
Beware of parasite characters in downloaded sequence files
In the dUTPase example, we didn't use the File ➪ Save As command on the
Internet browser main menu to download the content of the window, because
files saved using the File ➪ Save As command have some problems:
They may contain some hidden parasite characters (Control/Alt/Shift +
something), often displayed as <PRE> at the beginning of the file, which
corrupt the FASTA format.
They're trickier to reopen with your PC word-processing software, because
they don't have the right extension (for instance, .doc) or because the
software asks you to choose an obscure encoding scheme (about which we know
nothing, so we can't advise you) to load the file.
It is our experience that using a browser's File ➪ Save As option produces
unpredictable results, depending on the browser type, version, or
implementation. In general, it's good practice to inspect the sequence data
files that you download from the Internet for unexpected leading or trailing
characters. For most sequence-analysis programs, FASTA-formatted sequence
files must begin with a definition line, such as
>P0343456 My_Sequence_definition
and nothing else! Any leading character (even a blank one) that differs
from > may produce an error. (For instance, the definition line might be
considered part of the protein sequence.) Sequence-data files also have to
end with a final <New Line> character (showing as a blank line).
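If you'd rather not eyeball every downloaded file, the two rules above (start with > and end with a new line) can be checked with a few lines of Python. This is only a minimal sketch: the function name check_fasta_text and the short example sequence are made up for illustration, and it checks nothing beyond these two rules.

```python
def check_fasta_text(text):
    """Return a list of problems found at the start and end of FASTA text."""
    problems = []
    if not text.startswith(">"):
        # repr() makes hidden or blank leading characters visible.
        problems.append(f"file starts with {text[:5]!r} instead of '>'")
    if not text.endswith("\n"):
        problems.append("file does not end with a final new line")
    return problems


# Hypothetical file contents, used only to exercise the function.
good = ">P0343456 My_Sequence_definition\nMKTAYIAK\n"
bad = "<PRE>\n>P0343456 My_Sequence_definition\nMKTAYIAK"

print(check_fasta_text(good))
print(check_fasta_text(bad))
```

An empty list means the text passed both checks. To test a real file, read its content first (for instance with open(path).read(), where path is your own file name) and pass the result to the function.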
The good news is that usually no constraints restrict the length of the
definition and sequence lines (but to be safe, keep lines to no more than
about 60 to 100 characters). Except for the constraint of having > as the
first character of any definition line, you can freely use blank characters,
such as <Space>, <Tab>, and line delimiters, without interfering with the
parsing of the sequences.
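To stay on the safe side of the line-length advice above, you can rewrite a sequence with a fixed number of characters per line. The sketch below is one possible way to do it in Python; the function name format_fasta and the default width of 60 are our own choices, not a requirement of the FASTA format.

```python
def format_fasta(definition, sequence, width=60):
    """Build FASTA text with sequence lines of at most `width` characters."""
    # Spaces, tabs, and line breaks inside the sequence carry no meaning,
    # so remove them before rewrapping.
    residues = "".join(sequence.split())
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    # The definition line starts with '>' and the text ends with a new line.
    return ">" + definition + "\n" + "\n".join(lines) + "\n"


# Made-up sequence with scattered blanks, for illustration only.
messy = "MKT AYI\tAKQR\n" * 10
print(format_fasta("P0343456 My_Sequence_definition", messy, width=20))
```

The output starts with the definition line, followed by the sequence cut into lines of 20 characters, and ends with a final new line.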
Finally, do not use characters, such as <-> or <*>, that are not among the
standard amino-acid codes. Different analysis programs don't treat them in a
consistent way: they may be skipped (deleting a position), replaced by X, or
may simply cause an error. (For more on standard amino-acid codes, see
Chapter 1.)
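Before you feed a sequence to an analysis program, you can hunt for such characters yourself. The following Python sketch assumes that the allowed alphabet is the 20 usual one-letter amino-acid codes; if the program you use also accepts other codes (X, for instance), pass your own set through the allowed parameter. The function name find_nonstandard is hypothetical, and the sequence is expected without blanks or line breaks.

```python
# Assumption: the 20 usual one-letter amino-acid codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_nonstandard(sequence, allowed=STANDARD_AMINO_ACIDS):
    """Return (position, character) pairs for characters outside `allowed`.

    Positions are counted from 1, as in most sequence displays.
    """
    return [
        (position, char)
        for position, char in enumerate(sequence.upper(), start=1)
        if char not in allowed
    ]


# Made-up sequence containing a gap sign and a stop sign.
print(find_nonstandard("MKT-AYIAK*"))  # [(4, '-'), (10, '*')]
```

An empty list means that every character belongs to the allowed set.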