Using DNA or protein sequences
If the sequences you’re interested in are non-coding sequences, you obviously
have no choice — you must use DNA. However, beware that non-coding DNA
sequences can be tricky to align. If you cannot generate a proper alignment
from sequences that you know are related, you could use a local multiple
alignment method, such as the Gibbs sampler, or a pattern-extraction method,
such as Pratt. (See “Comparing Sequences That You Can’t Align,” later in this
chapter, for a description of these two methods.)
Multiple-sequence-alignment methods are at their best when aligning protein
sequences. The reason is that a protein sequence is one-third the length of
the corresponding coding DNA (three nucleotides per amino acid), and it uses a
more informative alphabet of 20 amino acids.
If you want to persist in carrying out a phylogenetic analysis on a set of
coding DNA sequences, things work better if you do the following:
1. Translate your DNA sequences into proteins.
2. Perform the multiple alignments on the proteins.
3. Thread the DNA back onto the protein multiple sequence alignment
framework using pal2nal (coot.embl.de/pal2nal), or using Protogene
(www.tcoffee.org) if you do not have the original DNA sequence.
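To see what these three steps amount to, here’s a minimal Python sketch. It is only an illustration under stated assumptions: the function names translate and thread_dna are made up for this example, the standard genetic code is assumed, and the protein alignment is written by hand to stand in for the output of a real alignment program. Dedicated tools such as pal2nal deal with complications that this sketch ignores.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# your sequences use the standard code, not a mitochondrial variant).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Step 1: translate an in-frame coding DNA sequence into protein."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def thread_dna(aligned_protein, dna, gap="-"):
    """Step 3: place each codon under its residue in the aligned protein.

    Every gap in the protein alignment becomes a three-character gap.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = len(aligned_protein.replace(gap, ""))
    if n_residues != len(codons):
        raise ValueError("protein and DNA lengths do not match")
    remaining = iter(codons)
    return "".join(
        gap * 3 if symbol == gap else next(remaining)
        for symbol in aligned_protein
    )


# Two toy coding sequences (hypothetical data).
dna_a = "ATGGCTAAAGGT"
dna_b = "ATGAAGGGA"

protein_a = translate(dna_a)  # "MAKG"
protein_b = translate(dna_b)  # "MKG"

# Step 2 would be done by an alignment program; this hand-written
# alignment of the two proteins stands in for its output.
aligned_a = "MAKG"
aligned_b = "M-KG"

print(thread_dna(aligned_a, dna_a))  # ATGGCTAAAGGT
print(thread_dna(aligned_b, dna_b))  # ATG---AAGGGA
```

Because the gaps are placed at the protein level, they always fall between codons, so the reading frame is preserved in the resulting DNA alignment.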
If your proteins are difficult to align because they have few similarities, DNA
information does not help. This has to do with the degeneracy of the genetic
code and the fact that there are 20 amino acids and only 4 nucleotides. If
there is little signal at the protein level, you can be sure that there is NO
useful signal at the DNA level.
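The effect of degeneracy is easy to check for yourself. The short Python sketch below is only an illustration: the two DNA sequences are invented, the helper names are made up, and the standard genetic code is assumed. It counts how many codons encode each amino acid, then compares two DNA sequences that encode exactly the same peptide.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that encode each amino acid ("*" marks a stop codon).
degeneracy = Counter(CODON_TABLE.values())


def translate(dna):
    """Translate an in-frame coding DNA sequence into protein."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(seq_1, seq_2):
    """Fraction of identical positions between two equal-length sequences."""
    matches = sum(a == b for a, b in zip(seq_1, seq_2))
    return matches / len(seq_1)


# Two invented DNA sequences that both encode Leu-Ser-Arg.
dna_1 = "CTTAGCCGT"
dna_2 = "TTGTCAAGA"

print(degeneracy["L"], degeneracy["M"])       # 6 1
print(translate(dna_1), translate(dna_2))     # LSR LSR
print(identity(translate(dna_1), translate(dna_2)))  # 1.0
print(round(identity(dna_1, dna_2), 2))       # 0.22
```

The two peptides are identical, yet the DNA sequences agree at only 2 of 9 positions. When the proteins themselves have already diverged, the DNA has usually drifted even further.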
Choosing the right number of sequences
Of course, the question of how many sequences to use has no absolute answer, such as 42 or 7. A few
years ago, the answer was easy: Get everything you can and go to the lab if
there aren’t enough sequences in the databases! But that isn’t true anymore.
These days, given the sizes of the databases and new complete genome
sequences flowing in twice a month, you may easily find hundreds or thousands
of sequences that would be suitable for inclusion in your multiple
sequence alignment. But that doesn’t mean you have to use them all!
In our opinion, you should start with a relatively small number of sequences:
between 10 and 15 is suitable for most cases. After you get
something interesting happening with this small set, you can always increase
its size. In any case, it’s hard to see any reason for generating a multiple
alignment with more than 50 sequences, unless you’re interested in building an
extensive phylogenetic tree.
If you start with hundreds of sequences, you immediately run into trouble.
There are good reasons why: