Understanding the Importance of Similarity
Similar sequences often derive from the same ancestral sequence. This
means that if your sequences are similar, they probably have the same ancestor,
share the same structure, and have a similar biological function. This
principle even works when the sequences come from very different organisms.
For you, this means you can extrapolate something you know about a
particular DNA or protein sequence to all similar DNA and protein sequences.
For example, imagine that your favorite sequence looks very much like
another one that somebody has studied in another lab. Because these two
sequences are similar, you can say, “If something is true for that sequence, it
is probably true for mine as well!” Just imagine how much time you can save:
Studying a gene in the lab takes years; searching a database for similarity
takes seconds. And you’re not even cheating!
When two proteins or gene sequences are very similar, biologists call them
homologues, which is a fancy word for two proteins or gene sequences that
have the same ancestor, similar functions, and similar structures. The snag
comes in deciding how similar is “very” similar. If your sequences are more
than 100 amino acids long (or 100 nucleotides long), the rule says you can
label proteins as “homologous” if at least 25 percent of the amino acids are
identical; for DNA, you need at least 70 percent identity to draw the same
conclusion. If your values fall below those thresholds, your guess is as good
as any. You’ve entered the range of protein identity below 25 percent,
known (fans of Rod Serling take note) as the twilight zone, where
Nothing is sure about the meaning of observed similarities.
For instance, some proteins whose amino acids are less than 15 percent
identical have exactly the same 3-D structure — while some proteins
with residues that are 20 percent identical have different structures.
Homology or non-homology can never be taken for granted.
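If you like to see a rule spelled out, here is a minimal Python sketch of the thresholds described above. It assumes you already have two aligned sequences of equal length, with gaps written as "-". The function names (percent_identity, classify) are made up for this illustration, and counting identity over gap-free columns only is just one of several conventions in use. Treat it as a rough aid to intuition, not as a substitute for a real search program.

```python
# Hypothetical helpers illustrating the 25 percent / 70 percent rule of thumb.
# Assumption: identity is counted over columns where neither sequence has a gap.

def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Return the percentage of gap-free aligned columns that are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned residues to compare")
    return 100.0 * matches / compared


def classify(identity: float, length: int, molecule: str = "protein") -> str:
    """Apply the rule from the text: more than 100 residues, then 25% or 70%."""
    thresholds = {"protein": 25.0, "dna": 70.0}
    if molecule not in thresholds:
        raise ValueError("molecule must be 'protein' or 'dna'")
    if length <= 100:
        return "too short for the rule"
    if identity >= thresholds[molecule]:
        return "likely homologous"
    return "inconclusive (twilight zone)"


if __name__ == "__main__":
    identity = percent_identity("ACGT-ACGT", "ACGTTACGA")
    print(f"{identity:.1f} percent identical")
    print(classify(30.0, 150, "protein"))
```

Even when classify says "likely homologous", remember that the percentage alone is only a common-sense indicator.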
The 25-percent figure that defines the twilight zone is mostly a common-sense
indicator. In reality, things are slightly more complicated. In most cases, to
make sure that two sequences are true homologues, you also need to use
some other information reported by the search. These bits of data include
The Expectation value (E-value), which tells you how likely it is that the
similarity between your sequence and a database sequence is due to
chance
The length of the segments that are similar between the two sequences
Part III: Becoming a Pro in Sequence Analysis