Bergeron B. Bioinformatics Computing


H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4

Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4

X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1
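Because the table is stored as a lower triangle, a score for any residue pair can be looked up by expanding each row symmetrically. The following is a minimal sketch, assuming the columns follow the standard BLOSUM62 residue order (the header row is not part of this excerpt); the names ORDER, ROWS, and build_lookup are hypothetical.

```python
# Column order is an assumption: the standard BLOSUM62 residue order.
ORDER = "ARNDCQEGHILKMFPSTWYVBZX"

# Three lower-triangular rows copied from the table above.
ROWS = {
    "H": "-2 0 1 -1 -3 0 0 -2 8",
    "I": "-1 -3 -3 -3 -1 -3 -3 -4 -3 4",
    "L": "-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4",
}


def build_lookup(rows, order=ORDER):
    """Expand lower-triangular rows into a symmetric {(a, b): score} dict."""
    table = {}
    for residue, text in rows.items():
        values = [int(v) for v in text.split()]
        # Row k of a lower triangle holds the scores against columns 0..k.
        if len(values) != order.index(residue) + 1:
            raise ValueError(f"unexpected row length for {residue}")
        for other, value in zip(order, values):
            table[(residue, other)] = value
            table[(other, residue)] = value  # substitution scores are symmetric
    return table


SCORES = build_lookup(ROWS)
```

With only three rows loaded, lookups are limited to pairs that involve H, I, or L; loading every row of the table would cover all pairs.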

As noted in the earlier discussion of gap penalties, arbitrarily selecting opening and extension costs

so that the output "looks nice" from a mathematical perspective likely has no relevance to the actual

biology of the protein under study. It's commonly assumed that a better approach is to assign gap

extension and opening costs relative to the substitution matrix used for a given protein. If the gap

penalty figures are too high relative to the matrix scores, the gap penalty figures will override the

matrix scores, and gaps will never appear in the sequence alignment. Conversely, if gap penalty

figures are too low relative to the matrix scores, gaps will be used wherever possible in order to align

the sequences. That is, simply using a substitution matrix doesn't guarantee biologically relevant results. The matrices and related calculations must be used appropriately, and in consideration of the underlying biology.
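One way to compare gap costs against matrix scores is to express the cost of a gap as a multiple of a typical match score. The sketch below assumes an affine model in which a gap of a given length costs an opening charge plus a per-position extension charge; conventions differ between programs, and the default values and function names here are illustrative assumptions, not recommendations from the text.

```python
def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Cost of one gap spanning `length` positions (0 means no gap).

    Assumed convention: cost = gap_open + gap_extend * length.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + gap_extend * length


def gap_to_match_ratio(length, match_score, gap_open=11, gap_extend=1):
    """How many matches of `match_score` are needed to pay for one gap."""
    return affine_gap_penalty(length, gap_open, gap_extend) / match_score
```

A ratio far above the number of matches a gap could plausibly recover suggests that gaps will rarely appear; a ratio near zero suggests they will appear almost everywhere.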

Dynamic Programming

One way to be certain that the solution to a sequence alignment is the best alignment possible is to

try every possible alignment, introducing one or more gaps at every position, and computing an

alignment score based on aligned character pairs and inexact matches. However, the computational

overhead of evaluating all possible alignments of one sequence against another grows exponentially

with the length of the two sequences. For reasonable length sequences of several hundred characters

each, an exhaustive evaluation of potential alignments could take days of computer time without

using specific algorithms developed for sequence alignment, such as dynamic programming.
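The growth in the number of candidate alignments can be made concrete by counting them. This sketch assumes that every alignment column is one of three kinds (a character pair, a gap in the first sequence, or a gap in the second) and that different orderings of adjacent gaps count as different alignments; count_alignments is a hypothetical helper name.

```python
def count_alignments(m, n):
    """Count the alignments of two sequences of lengths m and n.

    Each alignment ends in a pair, a gap in one sequence, or a gap in the
    other, so the count at (i, j) is the sum of the three neighboring counts.
    """
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = (
                table[i - 1][j - 1]  # last column pairs two characters
                + table[i - 1][j]    # last column has a gap in the second sequence
                + table[i][j - 1]    # last column has a gap in the first sequence
            )
    return table[m][n]
```

Under these assumptions, two sequences of three characters each already have 63 possible alignments, and two sequences of 100 characters each have more than 10^70.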

Dynamic programming is a form of recursion in which intermediate results are saved in a matrix

where they can be referred to later by the program. The comparison can be likened to solving a

series of complex mathematical equations, with the results of one equation feeding the input of

another, with and without the benefit of pen and paper or other temporary storage and retrieval

mechanism. With pen and paper (as with dynamic programming), the intermediate results can be

recorded and the next equation can be solved without regard to the previous or following equation.

Without the pen and paper, it may be impossible for some people to solve the series of equations.

Dynamic programming is processor- and RAM-intensive, but the technique of storing intermediate

values in a matrix can transform an otherwise intractable problem requiring immense computational

capabilities into one that is computationally feasible.
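The effect of saving intermediate results can be sketched by scoring a global alignment twice, once with plain recursion and once with cached recursion. The scoring values and function names below are hypothetical and chosen only for illustration; the cache comes from Python's standard functools.lru_cache.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2  # hypothetical scores, for illustration only


def best_score_naive(a, b, calls):
    """Best global alignment score by plain recursion; calls[0] counts calls."""
    calls[0] += 1
    if not a:
        return GAP * len(b)
    if not b:
        return GAP * len(a)
    pair = MATCH if a[-1] == b[-1] else MISMATCH
    return max(
        best_score_naive(a[:-1], b[:-1], calls) + pair,
        best_score_naive(a[:-1], b, calls) + GAP,
        best_score_naive(a, b[:-1], calls) + GAP,
    )


def best_score_memo(a, b, calls):
    """Same recursion, but each (i, j) subproblem is solved only once."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        calls[0] += 1  # runs only on a cache miss
        if i == 0:
            return GAP * j
        if j == 0:
            return GAP * i
        pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        return max(
            solve(i - 1, j - 1) + pair,
            solve(i - 1, j) + GAP,
            solve(i, j - 1) + GAP,
        )

    return solve(len(a), len(b))
```

Both functions return the same score, but the cached version evaluates at most (len(a) + 1) × (len(b) + 1) subproblems, while the plain version revisits the same subproblems many times.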

To illustrate the value of dynamic programming in sequence alignment, consider the function:

MaxValue = f(A_i, B_j)

In this equation, MaxValue is some function of variables A_i and B_j, where i and j are indices to the variables defined in the tree structure illustrated in Figure 8-7. That is, the possible values of A are represented by the indexed values A_i, and the possible values of B are represented by the indexed values B_j. The best solution to MaxValue depends on the equation that defines MaxValue. For example, consider the following possible definition of MaxValue:

MaxValue = (A_i × B_j)

Figure 8-7. Dynamic Programming Problem. Values for A and B are defined

in the tree structure. Maximizing MaxValue requires evaluating the

equation for every combination of i and j.

In this example, the solution is simply the largest value of A and the largest value of B. However,

consider the following definition of MaxValue:

In this example, the solution to MaxValue is less obvious and much more computationally intensive.

The brute-force method of solving for MaxValue is to recursively walk down each of the trees and try

the various combinations of A and B in the MaxValue equation. However, as illustrated in the upper-

right of Figure 8-7, evaluating every value of B in the MaxValue equation entails evaluating every

value of A. For example, assume that the values for A_i and B_j are defined as:

Solving for the first value of A (A_1 = 2) and ignoring the specific equation for MaxValue for clarity:

MaxValue_{1,1} = f(A_1, B_1) = f(2, 9) = 5

MaxValue_{1,2} = f(A_1, B_2) = f(2, 11) = 3

MaxValue_{1,3} = f(A_1, B_3) = f(2, 1) = 0

MaxValue_{1,4} = f(A_1, B_4) = f(2, 0) = 2

MaxValue_{1,5} = f(A_1, B_5) = f(2, 3) = 8

MaxValue_{1,6} = f(A_1, B_6) = f(2, 8) = 0

MaxValue_{1,7} = f(A_1, B_7) = f(2, 1) = -2

MaxValue_{1,8} = f(A_1, B_8) = f(2, 7) = 1

MaxValue_{1,9} = f(A_1, B_9) = f(2, 5) = 2

MaxValue_{1,10} = f(A_1, B_10) = f(2, 3) = 8

MaxValue_{1,11} = f(A_1, B_11) = f(2, 2) = 4

If the branches of A and B have hundreds of sub-branches, representing hundreds of values, then the

problem is likely computationally infeasible. This is especially true if the MaxValue function, which

must be evaluated for each combination of variables, is also computationally intensive.

Dynamic programming can address this computational and time dilemma by creating a matrix to

store the values for A_i, B_j, and MaxValue for each combination of i and j. Instead of solving one complex CPU- and RAM-intensive problem, the task is decomposed into hundreds or even thousands of easily and quickly solved problems. For example, consider the solution matrix for MaxValue in Figure 8-8. The solution set to MaxValue computed earlier for A_1 appears in the first row of the matrix. Examining only this first row, it can be seen that there are two solutions to MaxValue, B_5 and B_10, each of which results in a value of 8.

Figure 8-8. Solution Matrix for MaxValue for A_i and B_j. The solution to MaxValue is the combination of A and B with MaxValue = 12.
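Filling a solution matrix and reading off the best cell can be sketched as follows. The B values are taken from the worked example above and A_1 = 2 comes from the text; the other A values and the function f are hypothetical placeholders, because the original definitions are not reproduced in this excerpt.

```python
# A_1 = 2 is from the text; the remaining A values are hypothetical.
A = [2, 5, 7]
# B_1 through B_11 as used in the worked example above.
B = [9, 11, 1, 0, 3, 8, 1, 7, 5, 3, 2]


def f(a, b):
    """Hypothetical stand-in for MaxValue; NOT the equation from the book."""
    return (a * b) % 7 - abs(a - b) // 2


def solution_matrix(a_values, b_values, func):
    """One row per value of A, one column per value of B."""
    return [[func(a, b) for b in b_values] for a in a_values]


def best_cells(matrix):
    """Return the best value and every 1-based (i, j) cell that attains it."""
    best = max(max(row) for row in matrix)
    cells = [
        (i + 1, j + 1)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value == best
    ]
    return best, cells


MATRIX = solution_matrix(A, B, f)
BEST, CELLS = best_cells(MATRIX)
```

Each cell is computed once and stored, so finding the best, second-best, and later candidates only requires scanning the stored matrix rather than re-evaluating f.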

With the completed solution matrix available for examination, it's a trivial matter to locate the best

values for i and j, second-best, and so on. The same approach can be extended to any number of

dimensions. For example, consider adding a third variable, as in Figure 8-9. The equation for

MaxValue now takes the form:

MaxValue = f(A_i, B_j, C_k)

Figure 8-9. Dynamic Programming Problem with Added Dimensionality.

Values for A_i, B_j, and C_k are contained in the tree structure (left). The exhaustive solution to MaxValue involves evaluating every combination of i, j, and k.

In this new equation, MaxValue is some function of variables A_i, B_j, and C_k, where i, j, and k are

indices to the variables defined in the tree structure illustrated in Figure 8-9. The best solution to

MaxValue is in the form [i = 3, j = 3, k = 2], for example. As in the simpler 2D problem, a matrix of

solutions can be constructed. However, the matrix of solutions is now much larger, and is better

represented as a 3D structure, as in Figure 8-10.

Figure 8-10. Solution Matrix for MaxValue = f(A_i, B_j, C_k). Only one value for k (k = 0) is shown here for clarity.

Even though there are now many more solutions to consider, the process of evaluating MaxValue for

three variables and saving intermediary results in the 3D matrix is the same as in the previous 2D

example. Adding dimensions, although computationally intensive, makes it possible to evaluate all possible ways of aligning three sequences against each other in a reasonable time, even though the number of such possible alignments grows exponentially with the length of the sequences. Similarly, just as adding a dimension to the problem doesn't fundamentally change the evaluation process, the alignment of multiple strings can be evaluated using this process as well.

To bring the power of dynamic programming into the realm of pairwise sequence alignment, consider

MaxValue to be the alignment score for pairwise alignment of two sequences. MaxValue takes into

account gap penalties, correct alignments, and imperfect alignments. After the matrix is filled in

using the alignment score to determine MaxValue, the highest scoring path is followed back to the

beginning of the alignment to define the best alignment of elements in the sequence, including gaps.

Graphically, this approach to the local alignment of two sequences is illustrated in Figure 8-11. The

starting point is the best score in the matrix, the C-C alignment with a value of 11. Working

backwards to the row and column to the upper left, step (1), the best score is for the G-G alignment,

with a score of 10. Because the value is on the diagonal immediately adjacent to the value for the C-

C alignment, there is no gap penalty. Now, moving to step (2), the highest score, 8, is also

immediately adjacent and therefore free of a gap penalty. In step (3), there are three high scores,

each of which has a gap penalty. The minimum gap penalty is associated with the closest alignment

with a score of 5, the A-A alignment. Continuing to step (4), there are two competing high scores.

Because there is no penalty for the C-C alignment that is diagonally adjacent to the A-A alignment, with a value of 6, the process continues to the G-G alignment with a value of 8, completing the local alignment. That is, the local alignment appears as:

Q) ATCGAGCA-GCATG...

R) -----GCATGCT...

Figure 8-11. Matrix Scores and Optimum Local Alignment for Two

Sequences.

In this example, sequence (Q) appears across the top and sequence (R) is listed across the side of

Figure 8-11. The characters involved in the local alignment appear in bold.

Mathematically, the algorithm for this form of local alignment, known as the Smith-Waterman

algorithm, is defined as:

Where A and B are the two sequences to be aligned; H_{ij} is the score at position A_i, B_j; s(A_i, B_j) is the score for aligning the characters at positions i and j; w_x is the penalty for a gap of length x in sequence A, and w_y is the penalty for a gap of length y in sequence B.
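For reference, the Smith-Waterman recurrence is commonly written in the following form, using the symbols defined above. The exact typography of the book's equation is an assumption, and which index shift corresponds to a gap in which sequence depends on the convention adopted.

```latex
H_{ij} = \max\left\{
    H_{i-1,\,j-1} + s(A_i, B_j),\;
    \max_{x \ge 1}\left( H_{i-x,\,j} - w_x \right),\;
    \max_{y \ge 1}\left( H_{i,\,j-y} - w_y \right),\;
    0
\right\}
```

The final term, 0, is what prevents any cell from becoming negative and allows a local alignment to start afresh at any position.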

The three special provisions of this algorithm that favor local alignments are:

● Negative numbers are not allowed in the alignment score matrix; any cell that would be negative is set to zero.

● Inexact matches are penalized.

● The best score is sought anywhere in the matrix, and not simply in the last column or row.
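The three provisions above can be sketched in a compact implementation. This is a minimal illustration, assuming a linear gap penalty (every gap position costs the same) and hypothetical match, mismatch, and gap values rather than a substitution matrix; it is not the implementation behind Figure 8-11.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Provision 1: the 0 keeps every cell non-negative.
            # Provision 2: mismatches and gaps lower the score.
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + pair,
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            # Provision 3: the best score may occur anywhere in the matrix.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

For example, smith_waterman("AAAGCATGTTT", "CCCGCATGCCC") returns (10, "GCATG", "GCATG"): the shared core is found even though the flanking regions have nothing in common.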

Even though dynamic programming is guaranteed to find the best local or global alignment because the

technique considers all possible alignments, the technique is computationally intensive. Short

pairwise comparisons using the Smith-Waterman algorithm can require several hours of workstation

processing. High-end parallel processing hardware, such as the UCSC Kestrel server, which provides

the equivalent of 40 times the processing power of a desktop workstation, requires several minutes

for pairwise alignment using the Smith-Waterman algorithm. Given the computational overhead of

dynamic programming, a variety of first-pass, heuristic-based methods have been developed to

support alignment on the desktop workstation. These techniques, often referred to as word methods,

include the ubiquitous FASTA and BLAST algorithms, as described in the next section.

Word Methods

BLAST and FASTA are called word methods of sequence alignment because these algorithms work at

the level of words (short runs of consecutive amino acids or nucleotides) instead of with individual amino acids or nucleotides. Both methods of sequence alignment are fast enough to support searching for

alignments of query sequences against entire nucleotide or protein databases.

The high-level flow of the FASTA algorithm, which predates BLAST, is shown in Figure 8-12. The first

step in the FASTA algorithm is to create a hash table of words from the query sequence. A hash function maps words to integers, yielding a smaller set of values so that the search space is reduced. A hash table, such as the one in Figure 8-13, maps words to array positions, based on the hash function. For proteins, word length is typically one or two amino acids. For

nucleic acid sequences, the word length is usually from four to six characters. In either case, the

longer the word length, the more rapid and the less thorough the search.

Figure 8-12. FASTA Algorithm Flowchart.

Figure 8-13. Hash Table for FASTA. The possible words are keyed to index

numbers (right), which are used to represent words in the hash table.
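The word-indexing step can be sketched as follows. The base-4 encoding of nucleotide words is an assumption made for illustration (the hash function in Figure 8-13 is not reproduced in this excerpt), and all function names are hypothetical.

```python
from collections import defaultdict

NUCLEOTIDE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_to_index(word):
    """Map a nucleotide word to an integer by reading it as a base-4 number."""
    index = 0
    for base in word:
        index = index * 4 + NUCLEOTIDE_CODE[base]
    return index


def build_word_table(query, word_length):
    """Map each word index to the list of positions where it starts in query."""
    table = defaultdict(list)
    for pos in range(len(query) - word_length + 1):
        table[word_to_index(query[pos:pos + word_length])].append(pos)
    return table


def shared_word_hits(query, subject, word_length):
    """List (query_pos, subject_pos) for every word the two sequences share."""
    table = build_word_table(query, word_length)
    hits = []
    for s_pos in range(len(subject) - word_length + 1):
        index = word_to_index(subject[s_pos:s_pos + word_length])
        for q_pos in table.get(index, []):
            hits.append((q_pos, s_pos))
    return hits
```

Longer words produce fewer table entries and fewer chance hits, which is why a longer word length makes the search faster but less thorough.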