Watching out for lost data
No matter what they tell you, there’s always the danger that you can lose
information when changing the format of your sequences. Each format
supports different kinds of information, and whatever the new format
can’t store gets left behind. The problem is that, in
most cases, you only realize this when it’s too late. Table 10-3 lists the
kinds of features that formatting can destroy.
Along the same lines, do not take for granted that similar online servers do
the same thing (even if they have the same name and the same interface).
Two servers running READSEQ may run different versions of this program,
or the same version with different default parameters. As a consequence, a
problem that doesn’t occur with one server may occur with the next server.
It really pays to keep your eyes peeled and to keep backup copies of your
original files.
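As a minimal sketch of the backup habit, the Python snippet below copies a sequence file before you send it to any converter and records a checksum of the copy. The function name `backup_original` and the `.orig` suffix are illustrative choices, not part of READSEQ or any other tool mentioned here.

```python
import hashlib
import shutil
from pathlib import Path


def backup_original(path):
    """Copy a sequence file to '<name>.orig' and return (backup_path, sha256).

    The '.orig' suffix is an arbitrary convention chosen for this sketch.
    """
    src = Path(path)
    backup = src.with_name(src.name + ".orig")
    if not backup.exists():
        # Never overwrite an earlier backup of the same file.
        shutil.copy2(src, backup)
    # The checksum lets you confirm later that the backup is still intact.
    digest = hashlib.sha256(backup.read_bytes()).hexdigest()
    return backup, digest
```

Calling `backup_original("my_sequences.fasta")` (a hypothetical file name) before each conversion leaves an untouched copy to fall back on if a server mangles your data.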
Table 10-3 Information You Can Lose When Reformatting
Information Type      Nature of the Loss

Sequence name         Long names can be truncated when switching formats.
                      Special characters may also be modified. This happens
                      when converting from FASTA to ALN. The effect of the
                      truncation is unpredictable. Sometimes a portion of
                      the name is added to the sequence!

Upper/lowercase       Case sometimes contains useful information. Strictly
                      speaking, FASTA only supports uppercase. Some programs
                      hate to receive an input with a mixture of cases.

Gap type              Some formats, such as MSF, use different symbols for
                      different types of gaps (., -, ~). These are often
                      turned into (-) symbols after reformatting.

Annotation            MSF can support weight values for the sequences. This
                      information is lost in most conversions. The extra
                      line of annotation in PIR often disappears when
                      changing the format of these alignments. Any
                      annotation that comes after the sequence name in
                      FASTA is bound to disappear after a conversion.

Special amino acid    Only some reformatting programs support the codes for
or nucleotides        ambiguities such as X (for undetermined amino acids)
                      or N (for undetermined nucleotides). If your sequences
                      travel across many programs, these symbols may
                      disappear. This can be a problem if you rely on the
                      offset of some residues within the sequences. If you
                      can, stick to the standard 4-nucleotide alphabet and
                      the 20-amino-acid alphabet.
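One way to catch these losses before it's too late is to compare each sequence before and after reformatting. The Python sketch below does this for a single pair of records. It assumes that both versions are available as FASTA text and that you pair the records up in the same order. The function names `read_fasta`, `strip_gaps`, and `compare_records` are made up for this illustration and don't come from any reformatting program.

```python
GAP_CHARS = ".-~"  # gap symbols named in Table 10-3


def read_fasta(text):
    """Parse FASTA text into a list of (name, annotation, sequence) tuples."""
    records = []
    name, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, desc, "".join(chunks)))
            # The first word is the name; the rest of the line is annotation.
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            desc = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, desc, "".join(chunks)))
    return records


def strip_gaps(seq):
    """Remove every gap symbol so that only residues remain."""
    return "".join(c for c in seq if c not in GAP_CHARS)


def compare_records(before, after):
    """Return a list of warnings describing what changed in one record."""
    (name_b, desc_b, seq_b), (name_a, desc_a, seq_a) = before, after
    warnings = []
    if name_b != name_a:
        warnings.append(f"name changed: {name_b!r} -> {name_a!r}")
    if desc_b and not desc_a:
        warnings.append("annotation after the name was dropped")
    if strip_gaps(seq_b).upper() != strip_gaps(seq_a).upper():
        warnings.append("residues changed or lost")
    elif strip_gaps(seq_b) != strip_gaps(seq_a):
        warnings.append("upper/lowercase changed")
    gaps_b = {c for c in seq_b if c in GAP_CHARS}
    gaps_a = {c for c in seq_a if c in GAP_CHARS}
    if gaps_b != gaps_a:
        warnings.append(
            f"gap symbols changed: {sorted(gaps_b)} -> {sorted(gaps_a)}"
        )
    return warnings


# Invented example data: an original record and its reformatted version.
original = read_fasta(">seq_with_a_long_name human kinase\nACGTnn..AC~~\n")
converted = read_fasta(">seq_with_a\nACGTNN--AC--\n")

for old, new in zip(original, converted):
    for warning in compare_records(old, new):
        print(warning)
```

The check ignores gaps and case when it looks for lost residues, so a sequence that merely changed from lowercase to uppercase is reported as a case change rather than as missing data.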
Part III: Becoming a Pro in Sequence Analysis