Part III: Becoming a Pro in Sequence Analysis
E-values, similarity, and homology
A high level of similarity between two
sequences often indicates that the two have
evolved from a common ancestor and have
the same overall 3-D structure. Biologists call
these sequences homologues. In practice,
homologous sequences often have similar
biochemical functions.
If you are studying a protein, the most desirable
object in the universe is a very well-character-
ized protein sequence that’s clearly homolo-
gous to your protein. People search databases
in hopes of finding this special sequence.
That’s all very fine, but how do you show that
your two sequences are homologous? Think of
homologous sequences as relatives in a family.
We all know that relatives tend to look alike, but
we also know that two people with the same
eye color aren’t necessarily siblings. On the
other hand, if they have the same type of hair,
the same facial features, and so on, we can be
tempted to conclude that they are true relatives.
It works the same way with sequences.
How similar must sequences be in order to be
considered homologous? As a rule of thumb,
more than 25 percent of the aligned amino acids
must be identical for proteins, and more than
70 percent of the aligned nucleotides must be
identical for DNA. Above this limit, you can be
almost sure that two proteins have the same
structure and the same common ancestor. Below
that limit lies the twilight zone: that spooky
identity range where nobody can really be sure
whether the observed similarity means anything.
Warning: Be careful! The 25-percent and
70-percent limits work only for sequences that
contain more than 100 amino acids or
nucleotides. You might frequently get
near-perfect identity between short segments
(10 residues, for instance) of totally unrelated
proteins or DNA sequences.
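The rule of thumb and its length caveat can be written down as a small decision function. This is only a sketch: the function name `identity_verdict`, the labels it returns, and the choice to treat exactly 100 residues as too short are illustrative assumptions, not part of any standard tool.

```python
# Illustrative sketch of the identity rule of thumb described above.
# The function name and the returned labels are hypothetical.

IDENTITY_LIMITS = {"protein": 25.0, "dna": 70.0}  # percent identity limits
MIN_LENGTH = 100  # the limits apply only to sequences longer than this


def identity_verdict(percent_identity: float, length: int, molecule: str) -> str:
    """Classify an alignment with the 25/70 percent rule of thumb."""
    if molecule not in IDENTITY_LIMITS:
        raise ValueError(f"unknown molecule type: {molecule!r}")
    if length <= MIN_LENGTH:
        # Short segments can match almost perfectly by chance.
        return "too short to judge"
    if percent_identity > IDENTITY_LIMITS[molecule]:
        return "probably homologous"
    return "twilight zone"


print(identity_verdict(40.0, 250, "protein"))
print(identity_verdict(90.0, 10, "protein"))
print(identity_verdict(55.0, 400, "dna"))
```

A verdict of "twilight zone" does not mean the sequences are unrelated, only that percent identity alone cannot settle the question.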
Everybody loves percent identities because
they are so easy to spot visually. Unfortunately,
they are not always a good indicator. For
instance, they tell you nothing when many
similar — but not identical — amino acids are
aligned together. Moreover, how do you tell the
difference between 60 matched residues
spread over a 100-residue segment, and 120
matches spread over a 200-residue segment?
The longer alignment is probably more meaningful,
but the percent identity says nothing about this.
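To make the point concrete, here is a minimal sketch of a percent-identity calculation. It assumes two already-aligned strings of equal length, with `-` marking gaps, and it counts only gap-free columns; other conventions for the denominator exist. The function name and the toy sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy data: 60 matches over 100 columns, then 120 matches over 200 columns.
short_a = "A" * 60 + "C" * 40
short_b = "A" * 60 + "G" * 40
long_a = short_a * 2
long_b = short_b * 2

print(percent_identity(short_a, short_b))  # 60.0
print(percent_identity(long_a, long_b))    # 60.0
```

Both alignments score the same 60 percent, even though the second one contains twice as many matched residues.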
Gurus invented E-values so we’d have a criterion
more objective than percent identity. E-values
(short for expectation values) are a powerful
tool for comparing pairwise alignments with
different similarities and different lengths.
They also help you decide how much you can trust
your conclusion on homology. With E-values, you
know when you can get excited or when you should
wait and see.
The E-value has a very concrete meaning: It is
the number of matches at least as good as yours
that you would expect to find just by chance in
a database of that size. We consider a match
that’s very unlikely to occur just by chance to
be a very good match; that’s why results
associated with the lowest E-values are the
best. We say they’re the most significant
because we know we can trust them enough to
infer homology.
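For readers who like to see the arithmetic: in the Karlin-Altschul statistics that BLAST-style E-values are based on, a normalized bit score S relates to the E-value as `E = m * n * 2**(-S)`, where m is the query length and n is the total length of the database. The sketch below applies that relation directly. The function name and the example sizes are hypothetical, and real programs apply corrections (such as effective lengths) that are ignored here, so these numbers will not exactly match what a search program reports.

```python
def expected_chance_hits(bit_score: float, query_length: int,
                         database_length: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S).

    Simplified sketch: ignores the length corrections real tools apply.
    """
    if query_length <= 0 or database_length <= 0:
        raise ValueError("lengths must be positive")
    return query_length * database_length * 2.0 ** (-bit_score)


# Made-up example: a 300-residue query against a database
# holding one billion residues in total.
for score in (30, 50, 80):
    e_value = expected_chance_hits(score, 300, 10**9)
    print(f"bit score {score}: E = {e_value:.2e}")
```

Two things are worth noticing: each extra bit of score halves the E-value, and the same score becomes less significant as the database grows.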
In theory, alignments associated with E-values
lower than 1 should all be trusted. In practice,
this is not true because BLAST uses an
approximate formula for computing the E-values
and strongly underestimates them. In the
sequence world, a similarity with an E-value
above 10⁻⁴ (0.0001) is not necessarily
interesting. If you want to be certain of the
homology, your E-value must be lower than 10⁻⁴.
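Applying the 0.0001 cutoff to a list of hits is a simple filter. The hit names and E-values below are invented for illustration, and `trusted_hits` is a hypothetical helper, not a function from BLAST or any library.

```python
E_VALUE_CUTOFF = 1e-4  # the 0.0001 limit discussed above


def trusted_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Return the (name, e_value) pairs below the cutoff, best first."""
    kept = [hit for hit in hits if hit[1] < cutoff]
    # The lowest E-value is the most significant, so sort ascending.
    return sorted(kept, key=lambda hit: hit[1])


# Invented search results: (hit name, E-value).
hits = [("hit_A", 3e-42), ("hit_B", 0.02), ("hit_C", 5e-7), ("hit_D", 1.3)]

for name, e_value in trusted_hits(hits):
    print(f"{name}\t{e_value:.1e}")
```

Hits above the cutoff are not necessarily meaningless; they simply need other evidence before you infer homology.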
BLAST isn’t the only program that uses E-
values. You may come across them almost any
time you compare two sequences — even if you
use programs that compare sequences and
domains. The principle is always the same.