
another. This comparison can be accomplished by simply sliding one sequence past the other, one amino acid at a time,
and counting the number of matched residues (Figure 7.5).
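The sliding comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the short toy sequences are hypothetical, not taken from the figure, and no gaps are considered.

```python
def count_identities(seq_a, seq_b, offset):
    """Count matched residues when seq_b is shifted right by `offset`
    positions relative to seq_a (a negative offset shifts it left)."""
    matches = 0
    for i, residue in enumerate(seq_b):
        j = i + offset  # position in seq_a that lines up with seq_b[i]
        if 0 <= j < len(seq_a) and seq_a[j] == residue:
            matches += 1
    return matches


def best_ungapped_alignment(seq_a, seq_b):
    """Slide seq_b past seq_a one residue at a time and return
    (offset, identities) for the offset with the most identities."""
    offsets = range(-(len(seq_b) - 1), len(seq_a))
    best = max(offsets, key=lambda k: count_identities(seq_a, seq_b, k))
    return best, count_identities(seq_a, seq_b, best)


# Toy sequences, chosen only to exercise the code.
offset, identities = best_ungapped_alignment("GLSDGEWQ", "SDGEW")
print(offset, identities)
```

When several offsets tie, this sketch reports the smallest one; a fuller treatment would report all of them, because nearly equivalent alignments can be informative.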
For hemoglobin α and myoglobin, the best alignment reveals 23 sequence identities, spread throughout the central parts
of the sequences. However, a nearby alignment showing 22 identities is nearly as good. In this alignment, the identities
are concentrated toward the amino-terminal end of the sequences. The sequences can be aligned to capture most of the
identities in both alignments by introducing a gap into one of the sequences (Figure 7.6). Such gaps must often be
inserted to compensate for the insertions or deletions of nucleotides that may have taken place in the gene for one
molecule but not the other in the course of evolution.
The use of gaps substantially increases the complexity of sequence alignment because, in principle, the insertion of gaps
of arbitrary sizes must be considered throughout each sequence. However, methods have been developed for the
insertion of gaps in the automatic alignment of sequences. These methods use scoring systems to compare different
alignments, and they include penalties for gaps to prevent the insertion of an unreasonable number of them. Here is an
example of such a scoring system: each identity between aligned sequences results in +10 points, whereas each gap
introduced, regardless of size, results in -25 points. For the alignment shown in Figure 7.6, there are 38 identities and 1
gap, producing a score of (38 × 10 - 1 × 25 = 355). Overall, there are 38 matched amino acids in an average length of
147 residues, so the sequences are 25.9% identical. The next step is to ask: Is this percentage of identity significant?
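The example scoring system above (+10 per identity, -25 per gap regardless of its size) can be written out as a short Python sketch. The function names and the convention of marking gap positions with "-" are assumptions made for illustration; the toy aligned strings are not real protein sequences.

```python
import re


def alignment_score(identities, gaps, identity_points=10, gap_penalty=25):
    """Score from the example scheme: +10 per identity, -25 per gap."""
    return identities * identity_points - gaps * gap_penalty


def count_identities_and_gaps(aligned_a, aligned_b):
    """Count identities and gaps in two aligned strings of equal length.
    A gap is a run of '-' characters; its length does not matter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    gaps = len(re.findall(r"-+", aligned_a)) + len(re.findall(r"-+", aligned_b))
    return identities, gaps


def percent_identity(identities, average_length):
    """Identities as a percentage of the average sequence length."""
    return 100.0 * identities / average_length


# Numbers quoted in the text: 38 identities, 1 gap, average length 147.
print(alignment_score(38, 1))
print(round(percent_identity(38, 147), 1))
```

Because the penalty is charged per gap rather than per gap position, one long gap costs the same as one short gap under this scheme, which discourages scattering many small gaps through an alignment.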
7.2.1. The Statistical Significance of Alignments Can Be Estimated by Shuffling
The similarities in sequence in Figure 7.5 appear striking, yet there remains the possibility that a grouping of sequence
identities has occurred by chance alone. How can we estimate the probability that a specific series of identities is a
chance occurrence? To make such an estimate, the amino acid sequence in one of the proteins is "shuffled" (that is,
randomly rearranged), and the alignment procedure is repeated (Figure 7.7). This process is repeated to build up a
distribution showing, for each possible score, the number of shuffled sequences that received that score.
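The shuffling procedure can be sketched as follows. As a simplifying assumption, the score used here is the best ungapped identity count rather than a gapped alignment score, and the function names, the number of trials, and the toy sequence are all hypothetical.

```python
import random
from collections import Counter


def best_identity_count(seq_a, seq_b):
    """Best number of identities over all ungapped offsets
    (a simple stand-in for a full gapped alignment score)."""
    best = 0
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        matches = sum(
            1
            for i, residue in enumerate(seq_b)
            if 0 <= i + offset < len(seq_a) and seq_a[i + offset] == residue
        )
        best = max(best, matches)
    return best


def shuffled_score_distribution(seq_a, seq_b, trials=1000, seed=0):
    """Shuffle seq_b repeatedly, realign each shuffled copy to seq_a, and
    tally how many shuffled sequences received each score."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    residues = list(seq_b)
    scores = Counter()
    for _ in range(trials):
        rng.shuffle(residues)  # same composition, random order
        scores[best_identity_count(seq_a, "".join(residues))] += 1
    return scores


# Arbitrary toy string, not a real protein sequence.
toy = "MKTAYIAKQRQISFVKSHFSRQ"
authentic = best_identity_count(toy, toy)
distribution = shuffled_score_distribution(toy, toy, trials=200)
as_good = sum(n for score, n in distribution.items() if score >= authentic)
print(authentic, sorted(distribution.items()), as_good / 200)
```

Shuffling preserves the amino acid composition of the sequence while destroying its order, so the resulting distribution reflects the scores expected from composition alone.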
When this procedure is applied to the sequences of myoglobin and hemoglobin α, the authentic alignment clearly stands
out (Figure 7.8). Its score is far above the mean for the alignment scores based on shuffled sequences. The odds of such a
deviation occurring due to chance alone are approximately 1 in 10^20. Thus, we can comfortably conclude that the
two sequences are genuinely similar; the simplest explanation for this similarity is that these sequences are
homologous (that is, the two molecules have descended by divergence from a common ancestor).
7.2.2. Distant Evolutionary Relationships Can Be Detected Through the Use of Substitution Matrices
The scoring scheme in Section 7.2.1 assigns points only to positions occupied by identical amino acids in the two
sequences being compared. No credit is given for any pairing that is not an identity. However, not all substitutions are
equivalent. Some are structurally conservative substitutions, replacing one amino acid with another that is similar in size
and chemical properties. Such conservative amino acid substitutions may have relatively minor effects on protein
structure and can thus be tolerated without compromising function. In other substitutions, an amino acid replaces one
that is dissimilar. Furthermore, some amino acid substitutions result from the replacement of only a single nucleotide in
the gene sequence, whereas others require two or three replacements. Conservative and single-nucleotide substitutions
are likely to be more common than are substitutions with more radical effects. How can we account for the type of
substitution when comparing sequences? We can approach this problem by first examining the substitutions that have
actually taken place in evolutionarily related proteins.
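The distinction between substitutions that require a single nucleotide replacement and those that require two or three can be checked against the standard genetic code. The sketch below is illustrative only: the function name is hypothetical, codons are written as DNA (T rather than U), and the table is the standard code listed in T, C, A, G order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ...
# with '*' marking stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}  # one-letter amino acid code -> list of its codons
for codon, amino_acid in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(amino_acid, []).append("".join(codon))


def min_nucleotide_changes(aa1, aa2):
    """Smallest number of nucleotide replacements that converts some codon
    for aa1 into some codon for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


print(min_nucleotide_changes("D", "E"))  # aspartate -> glutamate
print(min_nucleotide_changes("W", "D"))  # tryptophan -> aspartate
```

This counts only the minimum number of replacements; it says nothing about how well the new residue is tolerated in the protein, which is the question that the observed substitutions in related proteins help to answer.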
From the examination of appropriately aligned sequences, substitution matrices can be deduced. In these matrices, a
large positive score corresponds to a substitution that occurs relatively frequently, whereas a large negative score
corresponds to a substitution that occurs only rarely. The Blosum-62 substitution matrix illustrated in Figure 7.9 is an
example. The highest scores in this substitution matrix indicate that amino acids such as cysteine (C) and tryptophan (W)
tend to be conserved more than those such as serine (S) and alanine (A). Furthermore, structurally conservative