Making Sure You Have the Right Sequences and the Right Methods
The most delicate decision you make, when doing a pairwise analysis, is the
choice of two sequences to compare. Choosing two sequences is a bit like
arranging a boxing match between two opponents: The idea is to get the
most exciting fight.
Don’t use pairwise comparison methods to search for sequences homologous to
one you already have. Comparing pairs one at a time takes too much time. If you
want to compare your sequence with every other sequence in a database, use
a database-search program such as BLAST. (See Chapter 7 for more on BLAST.)
If you’ve made your way through Chapter 7, you’re now in a position to argue
that database-search programs merely do pairwise comparisons between a
query sequence and all the sequences within a database — so there’s no real
need to make extra pairwise comparisons, is there? But wait a minute. Although
it’s true that programs like BLAST search databases through pairwise
comparisons, these programs are optimized for speed, not for alignment accuracy.
The programs we describe here are just the opposite: They are optimized for
giving the most accurate possible result, which is why you want to apply them
to carefully selected sequences.
Choosing the right sequences
A good reason to make a pairwise comparison between two sequences is a
strong suspicion that these sequences are homologous, that is, that they share a
common ancestor. (See Chapter 7 for a complete description.) Such sequences
often have similar 3-D structures and related functions.
The best way to find a sequence that’s homologous to a sequence you
already have is to search a database with the BLAST program. Select your
sequences according to the following (conservative) criteria:
DNA sequence: At least 70 percent identity over more than 100 bases
between the hit and the query, or an E-value lower than 10⁻⁴. (For more
on E-values, see Chapter 7.)
Protein sequence: More than 25 percent identity over more than 100
amino acids between the hit and the query, or an E-value lower than 10⁻⁴.
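The criteria above can be written as a small filter. This is only a minimal sketch: the Hit record and its field names (percent_identity, alignment_length, evalue) are hypothetical and are not tied to any particular BLAST output format or parser. It assumes the DNA identity threshold is inclusive ("at least") and the protein threshold is strict ("more than"), as worded above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one database hit (field names are assumptions)."""
    percent_identity: float  # identity between hit and query, in percent
    alignment_length: int    # number of aligned bases or amino acids
    evalue: float            # E-value reported by the search program


EVALUE_CUTOFF = 1e-4  # "an E-value lower than 10^-4"
MIN_LENGTH = 100      # identity must hold "over more than 100" residues


def is_candidate_homolog(hit: Hit, molecule: str) -> bool:
    """Apply the conservative selection criteria for 'dna' or 'protein'."""
    if molecule == "dna":
        identity_ok = hit.percent_identity >= 70.0  # at least 70 percent
    elif molecule == "protein":
        identity_ok = hit.percent_identity > 25.0   # more than 25 percent
    else:
        raise ValueError("molecule must be 'dna' or 'protein'")

    long_enough = hit.alignment_length > MIN_LENGTH
    # Either the identity/length rule or the E-value rule is sufficient.
    return (identity_ok and long_enough) or hit.evalue < EVALUE_CUTOFF
```

For example, is_candidate_homolog(Hit(26.0, 150, 0.5), "protein") passes on the identity rule alone, while a short, weak hit with an E-value of 1e-5 passes on the E-value rule.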
Unfortunately, these criteria are only an indication of a match, not a
guarantee. If your hit has a score very close to the threshold, it may not be
homologous to the query, or it may be so distantly related that aligning it
correctly is difficult. This is where pairwise comparison methods come into the
picture: They help you decide how meaningful a database hit really is.
Part III: Becoming a Pro in Sequence Analysis