Getting Your Multiple Alignment in the Right Format
When using online servers — or even local programs — you do not always
have full control of what goes in and what comes out of the program you use.
For instance, many multiple-sequence-alignment programs commonly output
only one format: MSF (Multiple Sequence Format). What can you do if you
want to analyze this multiple sequence alignment with a program that only
reads FASTA-formatted alignments? The answer is simple: Use a reformatting
program just like the ones we introduce in this chapter.
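
To make the idea concrete, here is a minimal sketch of what such a reformatting step does, written in plain Python. It is not one of the programs covered in this chapter. The function name msf_to_fasta and the tiny example alignment are made up for illustration, and the parser assumes a simple, well-behaved MSF file: "Name:" lines in the header, a "//" separator, then blocks of "name residues" lines, with "." or "~" marking gaps.

```python
def msf_to_fasta(msf_text, width=60):
    """Convert a simple MSF alignment (as text) to aligned FASTA (as text)."""
    names = []          # sequence names, in header order
    chunks = {}         # name -> list of residue chunks, one per block
    in_alignment = False

    for line in msf_text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            if stripped.startswith("Name:"):
                # Header line such as " Name: seq1  Len: 15  Check: 0 ..."
                name = stripped.split()[1]
                names.append(name)
                chunks[name] = []
            elif stripped == "//":
                # Everything after this separator is alignment data
                in_alignment = True
            continue

        parts = stripped.split()
        # Keep only lines that start with a declared sequence name;
        # this skips blank lines and coordinate rulers.
        if parts and parts[0] in chunks:
            chunks[parts[0]].append("".join(parts[1:]))

    lines = []
    for name in names:
        # MSF writes gaps as "." or "~"; FASTA alignments usually use "-"
        residues = "".join(chunks[name]).replace(".", "-").replace("~", "-")
        lines.append(">" + name)
        for start in range(0, len(residues), width):
            lines.append(residues[start:start + width])
    return "\n".join(lines) + "\n"


# A made-up two-sequence alignment, split over two blocks
example = """\
 MSF: 15  Type: P  Check: 0  ..

 Name: seq1  Len: 15  Check: 0  Weight: 1.00
 Name: seq2  Len: 15  Check: 0  Weight: 1.00

//

seq1  MKV..LAAGI
seq2  MKVAALAAGI

seq1  VALLA
seq2  VA~LA
"""

print(msf_to_fasta(example))
```

Real MSF files can be messier than this (checksums, odd names, extra header text), so for everyday work a tested conversion tool remains the safer choice.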
When it comes to multiple sequence alignments, there is no such thing as a
perfect format. Anyone used to dealing with the many services available
online knows that each service tends to have its favorite format. (See the “A
format for everyone: Democracy or anarchy?” sidebar if you want proof of
this fact.) Understanding everything there is to know about formats is an
ability you acquire only with practice, but this chapter gives you some
ideas and prepares you to be on the lookout for potential problems.
Chapter 10: Editing and Publishing Alignments
A format for everyone: Democracy or anarchy?
The variety of formats is a major curse in
bioinformatics. For a long time, the tradition was
that anyone developing a new program would design
his or her own house format. Each new format was
slightly different from the others, and each
appealed mainly to one particular category of
biologists. For instance, people doing phylogeny
became very attached to the Phylip format; users
of GCG (a popular bioinformatics package)
preferred MSF, and so on. Whether we like it or
not, the weight of history has made each of these
formats totally acceptable, and today they all
live in perfect harmony, side by side on our
computer keyboards. Or do they?
Another source of confusion in the field was the
communication war between biologists and computer
scientists. For many years, biologists complained
about the computer scientists’ formats, basically
saying, “We cannot make sense of this gibberish.”
In retaliation (and because they needed to),
biologists created formats they could use, to the
contempt of computer scientists, who dubbed these
formats “amateurish and ambiguous.” Both the
biologists and the computer scientists were
probably right in their respective evaluations.
But that didn’t help.
Things may be about to change with the XML
language coming to the fore. XML (eXtensible
Markup Language) is a close relative of HTML,
the language of the Web. XML makes it possible
to describe your data simply, with keywords
everyone can agree on. Today, biologists and
computer scientists agree, albeit weakly, that
XML could be the solution, and the main
bioinformatics programs are now able to produce
output in XML.
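
To give a feel for what "describing your data with keywords" looks like, here is a small sketch using Python's standard xml.etree.ElementTree module. The keywords "alignment", "sequence", and "name", along with the function name alignment_to_xml, are invented for this illustration. They do not come from any established bioinformatics XML standard.

```python
import xml.etree.ElementTree as ET


def alignment_to_xml(records):
    """Wrap (name, aligned_residues) pairs in a simple, made-up XML layout."""
    root = ET.Element("alignment")
    for name, residues in records:
        # One <sequence> element per aligned sequence, labeled by its name
        seq = ET.SubElement(root, "sequence", name=name)
        seq.text = residues
    return ET.tostring(root, encoding="unicode")


xml_text = alignment_to_xml([("seq1", "MKV--LAAGI"), ("seq2", "MKVAALAAGI")])
print(xml_text)

# Any XML-aware program can read the data back by keyword
for seq in ET.fromstring(xml_text).findall("sequence"):
    print(seq.get("name"), seq.text)
```

The point is that each piece of data carries its own label, so a program does not have to guess which column or line holds the sequence name.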