Chapter 9: Building a Multiple Sequence Alignment
This clustering is called a dendrogram. If we
were to align four sequences A, B, C, and D, the
dendrogram might look like this:
                      |------A
               |------|
               |      |------B
               |
Root ----------|
               |
               |      |------C
               |------|
                      |------D
The topology of this dendrogram tells us a
simple story: It says that A and B are more sim-
ilar to each other than they are to C and D. Thus
if we align A with B, we are less likely to make
a mistake than if we align A with C or D.
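To make the idea concrete, here is a minimal sketch of how a dendrogram like the one above could be stored and read. The nested-pair representation and the function name merge_order are assumptions made for illustration; they are not how ClustalW stores its tree. Reading the tree from the leaves toward the root gives the order in which groups are joined.

```python
# Hypothetical representation: a leaf is a string, an internal node is a pair.
tree = (("A", "B"), ("C", "D"))

def merge_order(node, steps):
    """Return the leaf names under node, appending each join to steps."""
    if isinstance(node, str):
        return node
    left = merge_order(node[0], steps)
    right = merge_order(node[1], steps)
    steps.append((left, right))  # the two groups joined at this node
    return left + right

steps = []
merge_order(tree, steps)
print(steps)  # [('A', 'B'), ('C', 'D'), ('AB', 'CD')]
```

The most similar sequences are joined first, and the two resulting groups are joined last.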
To make the progressive alignment, ClustalW fol-
lows the dendrogram topology: It starts aligning A
with B. After this it aligns C with D. When this is
done, Clustal has two small multiple alignments
(AB and CD). This is where Clustal pulls out its
main trick: It aligns the two alignments as if each
of them were a single sequence! This is not as
complicated as it seems, and there are many ways to
do this. For instance, you could replace each
alignment with a single consensus sequence.
Clustal uses a slightly more sophisticated method,
but the idea is essentially the same: It treats mul-
tiple alignments like single sequences and aligns
them two by two.
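The consensus idea can be sketched in a few lines. This is only the simple version mentioned above, not the more sophisticated method Clustal uses; the function name consensus and the toy sequences are made up for illustration. Each column of a small alignment is replaced by its most common character, so the whole alignment can then be handed to any pairwise aligner as if it were one sequence.

```python
from collections import Counter

def consensus(alignment):
    """Most common character in each column of equal-length aligned rows.

    Ties go to the character seen first, that is, the topmost row.
    """
    columns = zip(*alignment)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

# Toy alignment of A and B (invented data, '-' marks a gap).
ab = ["GATTACA",
      "GA-TACA"]
print(consensus(ab))  # GATTACA
```

A consensus throws away the variation inside each column, which is one reason a real program would prefer a richer summary of each alignment.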
Now you may ask, where do we cheat then?
The answer is simple: we make a multiple align-
ment with ALL the sequences in our set, but we
do not use ALL the information they contain. For
instance, when Clustal aligns A and B, it does
not use C and D. This could be a problem. If A
and B are very different, we may produce an
incorrect pairwise alignment, which would
be a waste if the two other sequences contained
some useful information we did not use.
Imagine, for instance, that A, B, C, and D all con-
tain a certain very short (but important) motif.
This motif does not look so important when you
compare only A and B, and it only shows up
when you look at all the sequences simultane-
ously. Unfortunately, with a progressive strat-
egy, C and D will come too late to rescue the
incorrect alignment of A and B.
The reason not to use all the information is that
it’s too expensive in terms of computation. So we
cheat a little and use the progressive alignment.
This shouldn’t worry you too much, though. Even
if it is a little greedy and approximate, Clustal
often delivers pretty good alignments.
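A rough back-of-the-envelope count shows why the shortcut is tempting. The sketch below assumes that "using all the information" means extending the usual pairwise dynamic-programming table to one dimension per sequence, and that each progressive step costs about one pairwise table. Both are simplifications, and the function names are invented for this example.

```python
def exact_cells(n_seqs, length):
    """Cells in a table with one dimension per sequence."""
    return (length + 1) ** n_seqs

def progressive_cells(n_seqs, length):
    """Cells for n_seqs - 1 pairwise-style steps (guide tree cost ignored)."""
    return (n_seqs - 1) * (length + 1) ** 2

# Four sequences of 100 residues each.
print(exact_cells(4, 100))        # 104060401
print(progressive_cells(4, 100))  # 30603
```

With only four short sequences the gap is already more than three thousandfold, and it grows quickly with every sequence you add.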
When you ask yourself which types of sequences
are best suited for ClustalW, imagine that your
sequences are like big stones spread across a
shallow river. Making a multiple alignment is like
crossing this river by jumping from stone to
stone: It doesn’t matter how wide the river is, as
long as you always find a stone to jump to.
Similarly, it doesn’t matter how many sequences
you have, and how far apart they are from each
other, as long as a chain of correct alignments
exists that can take you across the entire set.
(Of course, if a sequence or a small group of
sequences is very different from the rest, you
fall in the rapids!)
Sometimes your sequence set may contain
many identical or similar sequences. This is usually
a problem because sequences that belong to
minority subgroups become harder to align
properly. If you can, avoid this situation
by removing the redundant sequences yourself, but
if you have no choice, it can be reassuring to
know that ClustalW is equipped to deal with
redundancy.
In case you were wondering, the W in ClustalW
does not stand for Dubya, the U.S. President; it
stands for Weights. ClustalW uses a sophisticated
scheme so that very similar sequences do
not end up dominating the multiple sequence
alignment. In fact, every sequence is supposed
to receive a weight proportional to the amount
of new information it contributes.
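One simple way to get weights with this property is to read them off a tree with branch lengths: every branch is shared equally among the sequences hanging below it. The sketch below is an illustration in that spirit, not ClustalW's actual code; the tree layout, the branch lengths, and the name tree_weights are all assumptions.

```python
# Hypothetical layout: a node is (content, branch_length), where content is
# either a sequence name or a list of child nodes.
def leaves(node):
    content, _ = node
    if isinstance(content, str):
        return [content]
    return [name for child in content for name in leaves(child)]

def tree_weights(node, inherited=0.0, weights=None):
    """Give each leaf its share of every branch between it and the root."""
    if weights is None:
        weights = {}
    content, length = node
    # A branch is split evenly among all sequences below it.
    share = inherited + length / len(leaves(node))
    if isinstance(content, str):
        weights[content] = share
    else:
        for child in content:
            tree_weights(child, share, weights)
    return weights

# A and B are close relatives; C sits alone on its own long branch.
tree = ([([("A", 0.25), ("B", 0.25)], 0.5), ("C", 0.75)], 0.0)
print(tree_weights(tree))  # {'A': 0.5, 'B': 0.5, 'C': 0.75}
```

All three sequences are equally far from the root, yet A and B each receive less weight than C because they share most of their history.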