Berg J.M., Tymoczko J.L., Stryer L. Biochemistry

I. The Molecular Design of Life 7. Exploring Evolution 7.1. Homologs Are Descended from a Common Ancestor

Figure 7.3. Two Classes of Homologs. Homologs that perform identical or very similar functions in different organisms

are called orthologs, whereas homologs that perform different functions within one organism are called paralogs.

7.2. Statistical Analysis of Sequence Alignments Can Detect Homology

Conceptual Insights, Sequence Analysis, provides opportunities to

interactively explore issues involved in sequence alignment.

Conceptual Insights, appearing throughout the book, are interactive

animations that help you build your understanding of key biochemical

principles and concepts. To access, go to the Web site: www.whfreeman.com/

biochem5, and select the chapter, Conceptual Insights, and the title.

A significant sequence similarity between two molecules implies that they are likely to have the same evolutionary

origin and, therefore, the same three-dimensional structure, function, and mechanism. Although both nucleic acid and

protein sequences can be compared to detect homology, a comparison of protein sequences is much more effective for

several reasons, most notably that proteins are built from 20 different building blocks, whereas RNA and DNA are

synthesized from only 4 building blocks.

To illustrate sequence-comparison methods, let us consider a class of proteins called the globins. Myoglobin is a protein

that binds oxygen in muscle, whereas hemoglobin is the oxygen-carrying protein in blood (Section 10.2). Both proteins

cradle a heme group, an iron-containing organic molecule that binds the oxygen. Each human hemoglobin molecule is

composed of four heme-containing polypeptide chains, two identical α chains and two identical β chains. Here, we

consider only the α chain. We wish to examine the similarity between the amino acid sequence of the human α chain

and that of human myoglobin (Figure 7.4). To detect such similarity, methods have been developed for sequence

alignment.

How can we tell where to align the two sequences? The simplest approach is to compare all possible juxtapositions of

one protein sequence with another, in each case recording the number of identical residues that are aligned with one

another. This comparison can be accomplished by simply sliding one sequence past the other, one amino acid at a time,

and counting the number of matched residues (Figure 7.5).
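
The sliding comparison just described can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the figures: the function names and the two short sequences are hypothetical, and the sketch counts identities only, without gaps.

```python
def count_identities(a, b, offset):
    """Count identical residues when b is shifted by `offset` relative to a.

    A positive offset slides b to the right; a negative one slides it left.
    """
    matches = 0
    for i, residue in enumerate(b):
        j = i + offset  # position in a that b[i] lines up with
        if 0 <= j < len(a) and a[j] == residue:
            matches += 1
    return matches


def best_ungapped_offset(a, b):
    """Try every juxtaposition and return (best_offset, identities)."""
    offsets = range(-(len(b) - 1), len(a))
    return max(((off, count_identities(a, b, off)) for off in offsets),
               key=lambda pair: pair[1])


# Toy sequences (hypothetical, not the globin sequences from the figures).
seq_a = "GLSDGEWQLV"
seq_b = "SDGEWQ"
print(best_ungapped_offset(seq_a, seq_b))  # (2, 6)
```

Each offset corresponds to one juxtaposition of the two sequences; the offset with the most identities is the best ungapped alignment.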

For hemoglobin α and myoglobin, the best alignment reveals 23 sequence identities, spread throughout the central parts

of the sequences. However, a nearby alignment showing 22 identities is nearly as good. In this alignment, the identities

are concentrated toward the amino-terminal end of the sequences. The sequences can be aligned to capture most of the

identities in both alignments by introducing a gap into one of the sequences (Figure 7.6). Such gaps must often be

inserted to compensate for the insertions or deletions of nucleotides that may have taken place in the gene for one

molecule but not the other in the course of evolution.

The use of gaps substantially increases the complexity of sequence alignment because, in principle, the insertion of gaps

of arbitrary sizes must be considered throughout each sequence. However, methods have been developed for the

insertion of gaps in the automatic alignment of sequences. These methods use scoring systems to compare different

alignments, and they include penalties for gaps to prevent the insertion of an unreasonable number of them. Here is an

example of such a scoring system: each identity between aligned sequences results in +10 points, whereas each gap

introduced, regardless of size, results in -25 points. For the alignment shown in Figure 7.6, there are 38 identities and 1

gap, producing a score of (38 × 10 - 1 × 25 = 355). Overall, there are 38 matched amino acids in an average length of

147 residues, so the sequences are 25.9% identical. The next step is to ask: Is this percentage of identity significant?
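
The example scoring system can be written out as a short Python sketch. The function name, the use of "-" to mark gap positions, and the toy alignment are assumptions made for illustration; only the point values (+10 per identity, -25 per gap regardless of size) and the figures 38, 1, and 147 come from the text.

```python
def identity_score(aligned_a, aligned_b, identity_points=10, gap_penalty=25):
    """Score two gapped rows of equal length; '-' marks a gap position.

    Returns (score, identities, gaps). Each run of '-' in one row counts
    as a single gap, regardless of its length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    identities = 0
    gaps = 0
    current = None  # row ('a' or 'b') of the gap run we are inside, if any
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != current:
                gaps += 1  # a new gap starts here
            current = row
        else:
            current = None
            if x == y:
                identities += 1
    return identities * identity_points - gaps * gap_penalty, identities, gaps


# Toy alignment (hypothetical): 4 identities and one gap of length 2.
print(identity_score("ACDEFGH", "AC--FGK"))  # (15, 4, 1)

# The numbers quoted in the text: 38 identities, 1 gap, average length 147.
print(38 * 10 - 1 * 25)           # 355
print(round(100 * 38 / 147, 1))   # 25.9
```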

7.2.1. The Statistical Significance of Alignments Can Be Estimated by Shuffling

The similarities in sequence in Figure 7.5 appear striking, yet there remains the possibility that a grouping of sequence

identities has occurred by chance alone. How can we estimate the probability that a specific series of identities is a

chance occurrence? To make such an estimate, the amino acid sequence in one of the proteins is "shuffled" (that is,

randomly rearranged), and the alignment procedure is repeated (Figure 7.7). This process is repeated to build up a

distribution showing, for each possible score, the number of shuffled sequences that received that score.
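
The shuffling procedure can be sketched as follows. To keep the example short, the alignment step is replaced by the best ungapped identity count rather than a gapped alignment; the function names, the number of trials, the random seed, and the two toy sequences are all assumptions made for illustration.

```python
import random
from collections import Counter


def best_identities(a, b):
    """Best identity count over all ungapped juxtapositions of a and b."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1 for i, residue in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == residue
        )
        best = max(best, matches)
    return best


def shuffle_test(a, b, trials=1000, seed=0):
    """Return (real_score, distribution, fraction of shuffles >= real_score)."""
    rng = random.Random(seed)
    real = best_identities(a, b)
    residues = list(b)
    distribution = Counter()  # score -> number of shuffled sequences
    for _ in range(trials):
        rng.shuffle(residues)  # same composition, random order
        distribution[best_identities(a, "".join(residues))] += 1
    at_least = sum(n for score, n in distribution.items() if score >= real)
    return real, distribution, at_least / trials


# Toy sequences (hypothetical); the second is a fragment of the first.
real, dist, fraction = shuffle_test("MKTAYIAKQRQISFVKSHFSRQ",
                                    "KTAYIAKQRQISFVKSH")
print(real)
print(sorted(dist.items()))
print(fraction)
```

Shuffling preserves the amino acid composition of the sequence while destroying its order, so the resulting distribution shows which scores arise from composition alone.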

When this procedure is applied to the sequences of myoglobin and hemoglobin α, the authentic alignment clearly stands

out (Figure 7.8). Its score is far above the mean for the alignment scores based on shuffled sequences. The odds of such a

deviation occurring due to chance alone are approximately 1 in 10^20. Thus, we can comfortably conclude that the

two sequences are genuinely similar; the simplest explanation for this similarity is that these sequences are

homologous; that is, the two molecules have descended by divergence from a common ancestor.

7.2.2. Distant Evolutionary Relationships Can Be Detected Through the Use of Substitution Matrices

The scoring scheme in Section 7.2.1 assigns points only to positions occupied by identical amino acids in the two

sequences being compared. No credit is given for any pairing that is not an identity. However, not all substitutions are

equivalent. Some are structurally conservative substitutions, replacing one amino acid with another that is similar in size

and chemical properties. Such conservative amino acid substitutions may have relatively minor effects on protein

structure and can thus be tolerated without compromising function. In other substitutions, an amino acid replaces one

that is dissimilar. Furthermore, some amino acid substitutions result from the replacement of only a single nucleotide in

the gene sequence, whereas others require two or three replacements. Conservative and single-nucleotide substitutions

are likely to be more common than are substitutions with more radical effects. How can we account for the type of

substitution when comparing sequences? We can approach this problem by first examining the substitutions that have

actually taken place in evolutionarily related proteins.

From the examination of appropriately aligned sequences, substitution matrices can be deduced. In these matrices, a

large positive score corresponds to a substitution that occurs relatively frequently, whereas a large negative score

corresponds to a substitution that occurs only rarely. The Blosum-62 substitution matrix illustrated in Figure 7.9 is an

example. The highest scores in this substitution matrix indicate that amino acids such as cysteine (C) and tryptophan (W)

tend to be conserved more than those such as serine (S) and alanine (A). Furthermore, structurally conservative

substitutions such as lysine (K) for arginine (R) and isoleucine (I) for valine (V) have relatively high scores. When two

sequences are compared, each substitution is assigned a score based on the matrix. In addition, a gap penalty is often

assigned according to the size of the gap. For example, the introduction of a gap lowers the alignment score by 12 points

and the extension of an existing gap costs 2 points per residue. Using this scoring system, the alignment shown in Figure

7.6 receives a score of 115. In many regions, most substitutions are conservative (defined as those substitutions with

scores greater than 0) and relatively few are strongly disfavored types (Figure 7.10).
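
A matrix-based score with a size-dependent gap penalty can be sketched as follows. The dictionary holds only a handful of entries from the standard BLOSUM62 table, chosen to cover the residues mentioned above; the function names and toy alignments are hypothetical. One point is an assumption: the text does not say whether the first residue of a gap also incurs the extension cost, so the sketch charges 12 points for the first gapped residue and 2 points for each further residue.

```python
# A few entries of the BLOSUM62 matrix (symmetric); the full matrix covers
# all 20 amino acids.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("S", "S"): 4, ("C", "C"): 9, ("W", "W"): 11,
    ("K", "K"): 5, ("R", "R"): 5, ("I", "I"): 4, ("V", "V"): 4,
    ("K", "R"): 2, ("I", "V"): 3, ("A", "S"): 1, ("A", "W"): -3,
}


def pair_score(x, y):
    """Look up a residue pair in either order."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]  # KeyError if the pair is not in the subset


def matrix_score(aligned_a, aligned_b, gap_open=12, gap_extend=2):
    """Score gapped rows: matrix value per pair, affine penalty per gap run.

    Assumption: a gap of n residues costs gap_open + gap_extend * (n - 1).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    current = None  # row ('a' or 'b') of the gap run we are inside, if any
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            total -= gap_extend if row == current else gap_open
            current = row
        else:
            current = None
            total += pair_score(x, y)
    return total


print(matrix_score("KIAWC", "RVSWC"))    # 2 + 3 + 1 + 11 + 9 = 26
print(matrix_score("KIAAWC", "RV--WC"))  # 2 + 3 - 12 - 2 + 11 + 9 = 11
```

In the first toy alignment, three of the five positions are not identities, yet all three contribute positive scores because the substitutions are conservative.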

This scoring system detects homology between less obviously related sequences with greater sensitivity than would a

comparison of identities only. Consider, for example, the protein leghemoglobin, an oxygen-binding protein found in the

roots of some plants. The amino acid sequence of leghemoglobin from the herb lupine can be aligned with that of human

myoglobin and scored by using either the simple scoring scheme based on identities only or the Blosum-62 scoring

matrix (see Figure 7.9). Repeated shuffling and scoring provides a distribution of alignment scores (Figure 7.11).

Scoring based on identities only indicates that the odds of the alignment between myoglobin and leghemoglobin

occurring by chance alone are 1 in 20. Thus, although the level of similarity suggests a relationship, there is a 5% chance

that the similarity is accidental on the basis of this analysis. In contrast, users of the substitution matrix are able to

incorporate the effects of conservative substitutions. From such an analysis, the odds of the alignment occurring by

chance are calculated to be approximately 1 in 300. Thus, an analysis performed by using the substitution matrix reaches

a much firmer conclusion about the evolutionary relationship between these proteins (Figure 7.12).

Experience with sequence analysis has led to the development of simpler rules of thumb. For sequences longer than 100

amino acids, sequence identities greater than 25% are almost certainly not the result of chance alone; such sequences are

probably homologous. In contrast, if two sequences are less than 15% identical, pairwise comparison alone is unlikely to

indicate statistically significant similarity. For sequences that are between 15% and 25% identical, further analysis is

necessary to determine the statistical significance of the alignment. It must be emphasized that the lack of a statistically

significant degree of sequence similarity does not rule out homology. The sequences of many proteins that have

descended from common ancestors have diverged to such an extent that the relationship between the proteins can no

longer be detected from their sequences alone. As we will see, such homologous proteins can often be detected by

examining three-dimensional structures.
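
These rules of thumb can be restated as a small decision function. The function name and the returned phrases are hypothetical; the thresholds (100 residues, 15%, and 25%) are those given in the text, and the treatment of values exactly at 15% or 25% is an assumption.

```python
def identity_rule_of_thumb(percent_identity, length):
    """Apply the rules of thumb for sequences longer than 100 residues."""
    if length <= 100:
        return "rule of thumb does not apply"
    if percent_identity > 25:
        return "probably homologous"
    if percent_identity < 15:
        return "pairwise comparison alone unlikely to show significance"
    return "further statistical analysis needed"


# The hemoglobin alpha and myoglobin alignment: 25.9% identity over 147 residues.
print(identity_rule_of_thumb(25.9, 147))  # probably homologous
```

A result in the lowest category does not rule out homology; it indicates only that pairwise sequence comparison is unlikely to demonstrate it.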

7.2.3. Databases Can Be Searched to Identify Homologous Sequences

When the sequence of a protein is first determined, comparing it with all previously characterized sequences can be a

source of tremendous insight into its evolutionary relatives and, hence, its structure and function. Indeed, an extensive

sequence comparison is almost always the first analysis performed on a newly elucidated sequence. The sequence

alignment methods heretofore described are used to compare an individual sequence with all members of a database of

known sequences.

In 1995, investigators reported the first complete sequence of the genome of a free-living organism, the bacterium

Haemophilus influenzae. Of 1743 identified open reading frames (Section 6.3.2), 1007 (58%) could be linked by

sequence-comparison methods to some protein of known function that had been previously characterized in another

organism. An additional 347 open reading frames could be linked to sequences in the database for which no function had

yet been assigned ("hypothetical proteins"). The remaining 389 sequences did not match any sequence present in the

database at the time at which the Haemophilus influenzae sequence was completed. Thus, investigators were able to

identify likely functions for more than half the proteins within this organism solely through the use of

sequence-comparison methods.