
Optimisation of Sequence Alignment

Sequence alignment algorithms are the most heavily used algorithms in bioinformatics. Their use for database searching was once restricted to parallel computers because of their high computational cost.

The basic algorithm is O(N²). Whilst not reducing this complexity measure, I wrote an intensively optimised version with a more than fifty-fold speed-up over what was previously thought possible. Using this optimised version, complete protein-versus-database searches take minutes rather than hours on an ordinary single-processor PC.

An introduction to the basic unoptimised algorithm is in the related article, Dynamic Programming.

The 'less than or equal to' sign appears like this: '≤' in this document. On the machines I've used it looks fine, but if there's a funny symbol there instead tell me so I can re-post this document with '<=' in its place.

Methods for Fast Serial Type III Comparison


1.1 Index

1. Introduction
2. The Optimisations
3. Conclusion

1.2 Introduction

A number of techniques combine to accelerate the bioinformatics sequence comparison routine which finds the highest scoring similar region of two proteins allowing for insertions and deletions in either sequence.

The algorithm is described for scoring only, i.e. separately from production of an alignment of the two similar regions. In the database searching application the vast majority of proteins compared are of low similarity to the probe sequence. The alignments for the low-scoring similarities are not of interest. It is therefore more economical not to calculate them.

Some presentations of the alignment algorithm use a second matrix to record 'the path taken' for traceback of the alignment. This is not necessary. The score matrix itself provides all the information needed to reconstruct the path. In my own implementation I use an unoptimised routine for path reconstruction, however it calls the optimised routine for score calculation to generate the score matrix.


1.3 Algorithm

The Type III algorithm can be described economically as follows:

A symmetrical table that scores similarities of pairs of characters from an alphabet is used to generate scores d[i,j] for the similarity of the ith character of one string to the jth character of the second. Positive d[i,j] indicate similarity, negative d[i,j] are counterindicative of similarity. A penalty P, the 'indel' penalty, represents the cost of inserting a gap in either sequence. Using these elementary scores, a score S[i,j] for the best local similarity ending on the ith character of the first sequence and jth character of the second can be computed. The score S[i,j] is expressed recursively as:

     S[i,j]   = Max( S[i-1,j-1]+d[i,j],
                     S[i-1,j]-P,
                     S[i,j-1]-P,
                     0 )
This expression is for 0 < i ≤ m and 0 < j ≤ n. S[i,0] and S[0,j] are equal to 0 for 0 ≤ i ≤ m and 0 ≤ j ≤ n respectively. Matching the ith character against the jth corresponds to the first term and increases the score S at [i,j] by d[i,j] over its value at S[i-1,j-1]. The second and third terms correspond to the cost of skipping a character in either sequence. This is -P. The formula for S[i,j] calculates the best similarity for substrings ending at characters i and j as being from matching, skipping one or other character, or, if all three give negative results, starting a run of similarity after this position. The largest value of S[i,j] gives the score for the best local region of similarity between the two strings.

In a recursive implementation to calculate the S[i,j] there would be wasteful recalculation of previously calculated values. Instead the dynamic programming technique is used and all results for 0 < i ≤ m, 0 < j ≤ n are computed in a suitable order. Computation time is thus O(mn).
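For reference, the unoptimised dynamic-programming computation can be sketched in C. The pair-score function `d` below is a toy stand-in (+5 match, -3 mismatch) and `P = 4` is an assumed indel penalty; a real implementation would use a substitution table such as PAM.

```c
#include <string.h>

#define P      4    /* indel penalty (value assumed for illustration) */
#define MAXLEN 64

/* Toy pair score: +5 match, -3 mismatch (stand-in for a real table). */
static int d(char a, char b) { return a == b ? 5 : -3; }

/* Best local similarity score between s (length m) and t (length n),
   computing every S[i,j] of the recurrence in row order. */
int type3_score(const char *s, int m, const char *t, int n)
{
    static int S[MAXLEN][MAXLEN];
    int best = 0;
    memset(S, 0, sizeof S);                           /* S[i,0] = S[0,j] = 0 */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int v = S[i-1][j-1] + d(s[i-1], t[j-1]);  /* match i with j */
            if (S[i-1][j] - P > v) v = S[i-1][j] - P; /* skip a char of s */
            if (S[i][j-1] - P > v) v = S[i][j-1] - P; /* skip a char of t */
            if (v < 0) v = 0;                         /* start a fresh run */
            S[i][j] = v;
            if (v > best) best = v;
        }
    return best;
}
```

With these toy scores, `type3_score("ACGT", 4, "AGT", 3)` finds the gapped local alignment of ACGT against A-GT, scoring 5 - 4 + 5 + 5 = 11.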


2.1 Re-expression

The majority of pairs of strings compared in the Molecular Biology application have low similarity and most of the S[i,j] are less than P. To use this observation to increase efficiency requires some rearrangement of the equations. The intermediate stages of rearrangement result in code which is manifestly less efficient. Taking the original formulation we first increase indices. For 0 ≤ i < m and 0 ≤ j < n we have:
   S[i+1,j+1] = Max( d[i+1,j+1]+S[i,j],
                     S[i+1,j]-P,
                     S[i,j+1]-P,
                     0 )
We can introduce extra tests and break the 'Max' into components.
   S[i+1,j+1] = Max(0,d[i+1,j+1])
   if S[i,j]   > 0 then
         S[i+1,j+1] = Max(0,d[i+1,j+1]+S[i,j])
   if S[i+1,j] > P then    
         S[i+1,j+1] = Max(S[i+1,j+1],S[i+1,j]-P)
   if S[i,j+1] > P then 
         S[i+1,j+1] = Max(S[i+1,j+1],S[i,j+1]-P)
The same positive values can be obtained as follows:
   T[i+1,j+1] = d[i+1,j+1]
   if T[i,j]   > 0 then 
         T[i+1,j+1] = T[i+1,j+1]+T[i,j]
   if T[i+1,j] > P then  
         T[i+1,j+1] = Max(T[i+1,j+1],T[i+1,j]-P)
   if T[i,j+1] > P then
         T[i+1,j+1] = Max(T[i+1,j+1],T[i,j+1]-P)
This change, the removal of a 'Max' from the first line, modifies the treatment of zero. Unlike S[i,j], T[i,j] may become negative; S would hold a zero in such positions. S and T, however, agree on their positive values. A negative T[i,j] fails each of the comparison tests and behaves just as if it were zero so far as its influence on other T[i,j] is concerned. That S and T agree on positive values can also be checked by considering the cases leading to T[i+1,j+1] ≤ 0 and T[i+1,j+1] > 0 separately. Because S and T agree on positive values they find the same similar substrings.
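The agreement between S and T on positive values can also be checked mechanically. A small C harness (toy match/mismatch scores and an assumed penalty `P = 4`, as these are not specified in the text) fills both matrices from their respective recurrences and verifies that S[i,j] equals Max(T[i,j], 0) at every position:

```c
#include <string.h>

#define P      4    /* assumed indel penalty for the check */
#define MAXLEN 32

static int d(char a, char b) { return a == b ? 5 : -3; }

/* Returns 1 if S[i,j] == Max(T[i,j], 0) everywhere, 0 otherwise. */
int s_t_agree(const char *s, int m, const char *t, int n)
{
    static int S[MAXLEN][MAXLEN], T[MAXLEN][MAXLEN];
    memset(S, 0, sizeof S);
    memset(T, 0, sizeof T);
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int dij = d(s[i-1], t[j-1]);
            /* original formulation: S, clamped at zero */
            int v = S[i-1][j-1] + dij;
            if (S[i-1][j] - P > v) v = S[i-1][j] - P;
            if (S[i][j-1] - P > v) v = S[i][j-1] - P;
            S[i][j] = v > 0 ? v : 0;
            /* rearranged formulation: T, no Max with zero */
            int w = dij;
            if (T[i-1][j-1] > 0) w += T[i-1][j-1];
            if (T[i-1][j] > P && T[i-1][j] - P > w) w = T[i-1][j] - P;
            if (T[i][j-1] > P && T[i][j-1] - P > w) w = T[i][j-1] - P;
            T[i][j] = w;
            if (S[i][j] != (w > 0 ? w : 0)) return 0;
        }
    return 1;
}
```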


2.2 Rearrangement

This new form permits a rearrangement. Before the main loop starts all the T[i,j] are initialised to d[i,j]. In the rearranged loop instead of collecting values for one array element, values are distributed from an element that has positive score. The operations performed at each location in T are now as follows:
Start:    if T[i,j]>0 then begin
             if T[i,j]>P then begin
                T[i+1,j] = Max(T[i+1,j],T[i,j]-P)
                T[i,j+1] = Max(T[i,j+1],T[i,j]-P)
             T[i+1,j+1] = T[i+1,j+1]+T[i,j]
Testing for T[i,j]>P is needed only if the test for T[i,j]>0 has already succeeded. The testing substantially increases the speed. In a typical comparison of a pair of sequences around three quarters of the T[i,j] are negative. These are skipped. This is where the main saving from the rearrangement arises. Of the remaining values around one half score P or less and are dealt with rapidly. The second comparison requires subtraction of P from T[i,j]. The result of this subtraction is then used twice if the test succeeds. The original formulation in terms of S required two subtractions of P per calculation of S[i,j], so there is an overall saving in calculation even when the second test succeeds.
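The distributed ('scatter') form can be sketched in full in C: the matrix T is pre-initialised to the d values, then each positive cell pushes its score forward to its three successors. Toy scores and `P = 4` are illustrative assumptions, as before:

```c
#include <string.h>

#define P      4
#define MAXLEN 32

static int d(char a, char b) { return a == b ? 5 : -3; }

int scatter_score(const char *s, int m, const char *t, int n)
{
    /* one spare row and column so cells on the edge can scatter safely */
    static int T[MAXLEN + 2][MAXLEN + 2];
    int best = 0;
    memset(T, 0, sizeof T);
    for (int i = 1; i <= m; i++)            /* pre-initialise T to d[i,j] */
        for (int j = 1; j <= n; j++)
            T[i][j] = d(s[i-1], t[j-1]);
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int v = T[i][j];
            if (v > 0) {                    /* ~3/4 of cells fail this test */
                if (v > P) {                /* worth propagating across a gap */
                    if (v - P > T[i+1][j]) T[i+1][j] = v - P;
                    if (v - P > T[i][j+1]) T[i][j+1] = v - P;
                }
                T[i+1][j+1] += v;           /* diagonal propagation */
                if (v > best) best = v;
            }
        }
    return best;
}
```

Row-major order guarantees that a cell's three contributors have all been processed before the cell itself is read, so the result matches the gathering form.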


2.3 Use of registers

The next version of the pseudocode shows minor changes to take account of the possibility of using registers. To achieve greater speed the algorithm was implemented at assembly level. A word-length register AX was used to hold the current T[i,j] and the register CX to hold T[i,j] - P. At this point the influence of the target processor, the 80286, may be clear. The choice of processor is primarily a reflection of the ready availability and affordability of IBM PCs and compatibles.

A standard technique to reduce storage and indexing calculations for the matrix T was also used. Only two of T's columns are needed during processing. T0[j] takes the place of T[i,j] and T1[j] the place of T[i+1,j]. After processing one column, the values in T0 are no longer required. T1 is used in place of T0 and the old T0 is initialised with the appropriate d[i,j] values and takes the place of T1. The new inner loop is shown below.

Start:    AX = T0[j]
          if AX>0 then begin
             CX = AX-P
             if CX>0 then begin
                T1[j]   = Max(T1[j],CX)
                T0[j+1] = Max(T0[j+1],CX)            
             T1[j+1] = T1[j+1]+AX
          j = j+1
          if j<= n then goto Start
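A C rendering of the two-column scheme might look like the sketch below (toy scores and an assumed `P = 4`; the pointer swap stands in for the column reuse described above, and the local variables `ax` and `cx` play the roles of the registers):

```c
#include <string.h>

#define P      4
#define MAXLEN 32

static int d(char a, char b) { return a == b ? 5 : -3; }

/* Same result as the full-matrix version, but using only two columns:
   t0[] plays T[i, .] and t1[] plays T[i+1, .]. */
int two_column_score(const char *s, int m, const char *t, int n)
{
    int colA[MAXLEN + 2], colB[MAXLEN + 2];
    int *t0 = colA, *t1 = colB;
    int best = 0;
    for (int j = 1; j <= n; j++)            /* column 1 starts as d values */
        t0[j] = d(s[0], t[j-1]);
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++)        /* next column starts as d values */
            t1[j] = (i < m) ? d(s[i], t[j-1]) : 0;
        t0[n+1] = 0;                        /* spare cells for edge scatters */
        t1[n+1] = 0;
        for (int j = 1; j <= n; j++) {
            int ax = t0[j];                 /* AX in the assembly version */
            if (ax > 0) {
                int cx = ax - P;            /* CX in the assembly version */
                if (cx > 0) {
                    if (cx > t1[j])   t1[j]   = cx;   /* horizontal */
                    if (cx > t0[j+1]) t0[j+1] = cx;   /* vertical   */
                }
                t1[j+1] += ax;              /* diagonal */
                if (ax > best) best = ax;
            }
        }
        int *tmp = t0; t0 = t1; t1 = tmp;   /* roll the columns */
    }
    return best;
}
```

Memory drops from O(mn) to O(n), and one level of indexing disappears from the inner loop.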

2.4 Deferred assignment

A technique compilers use to improve efficiency in code they produce is to defer assignments. A compiler may defer the writing of a value held in a register back out to memory. This can lead to a saving when the same memory location is written to several times in succession. This situation frequently occurs in executing instructions in a loop. A memory location written to in one cycle of the loop may be written to again in the subsequent cycle.

This algorithm illustrates potential for a more complex form of deferred assignment: conditional deferred assignment. In the unmodified code, writing to memory locations is conditional on values calculated within the loop. Depending on the condition, the body of the loop may or may not write a new value out to memory. To cope with conditional deferred assignment, different versions of the loop were required. These handle different states of deferred assignment from the previous execution of the loop. In this code, code expansion was used to allow deferral of the assignments to T0[j+1] and T1[j+1]. If CX>0 then both deferrals take place, if only AX>0 then only the deferral of assignment to T1[j+1] takes place, and if AX ≤ 0 then no deferred assignment is required. This required three versions of the loop body. As each version completes a cycle, it enters the appropriate version for the next cycle. Which version it enters will depend on which assignments are being deferred.

In fact the value being held to write into T0 need not be written at all. The value is needed only in the next cycle round the loop, where it can be taken directly from the register holding the deferred value. The only other potential use for values written into T0 is in determining the maximum score; since the deferred value is derived by subtracting P from a value already in the matrix T, it cannot itself be the maximum score. The result therefore never needs to be written out to memory. We now have:

Start1:    AX = T0[j]
           if AX>0 then begin
              CX = AX-P
              if CX>0 then begin
                 T1[j] = Max(T1[j],CX)
                 goto DeferTwo
              goto DeferOne
NoDefer:   j = j+1       ;AX<=0
           if j<= n then goto Start1

Start2:    DX = AX+T1[j]
           AX = T0[j]
           if AX>0 then goto Morecalc
           T1[j] = DX
           goto NoDefer
DeferOne:  j = j+1       ;AX>0, CX<=0
           if j<= n then goto Start2

Start3:    DX = AX+T1[j]
           AX = T0[j]
           AX = Max(AX,CX)
;No need to test for AX>0 here as AX>=CX>0
Morecalc:  CX = AX-P
           if CX>0 then begin
Maxtest:      T1[j] = Max(DX,CX)
              goto DeferTwo
           end else begin
              T1[j] = DX
           goto NoDefer
DeferTwo:  j = j+1       ;AX and CX>0
           if j<= n then goto Start3
The number of external memory references has been decreased. The plethora of gotos is not a concern. The unconditional gotos can be removed by rearranging the code and by code duplication. Other unconditional gotos originating from expanding 'if then else' and 'Max' can be removed in a similar fashion.

Further code expansion is possible. In the line AX = Max(AX,CX) the conditional assignment of CX into AX could be replaced by a conditional branch to equivalent code with the roles of CX and AX reversed. This however was not done. It would nearly double the length of the program for a marginal performance gain.
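The key saving of this section, never writing the vertically propagated value to memory, can be shown in a simplified C sketch. Rather than duplicating the loop body three times as the assembly does, the sketch keeps the pending T0[j+1] value in a local variable (`carry`), which a compiler or assembly programmer would hold in a register. Toy scores and `P = 4` remain illustrative assumptions:

```c
#include <string.h>

#define P      4
#define MAXLEN 32

static int d(char a, char b) { return a == b ? 5 : -3; }

int register_carry_score(const char *s, int m, const char *t, int n)
{
    int colA[MAXLEN + 2], colB[MAXLEN + 2];
    int *t0 = colA, *t1 = colB;
    int best = 0;
    for (int j = 1; j <= n; j++)
        t0[j] = d(s[0], t[j-1]);
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++)
            t1[j] = (i < m) ? d(s[i], t[j-1]) : 0;
        t1[n+1] = 0;
        int carry = 0;              /* pending T0[j+1] value, "in a register" */
        for (int j = 1; j <= n; j++) {
            int ax = t0[j];
            if (carry > ax) ax = carry;  /* fold in the deferred vertical value */
            carry = 0;
            if (ax > 0) {
                int cx = ax - P;
                if (cx > 0) {
                    if (cx > t1[j]) t1[j] = cx;
                    carry = cx;     /* would have been a write to T0[j+1] */
                }
                t1[j+1] += ax;
                if (ax > best) best = ax;
            }
        }
        int *tmp = t0; t0 = t1; t1 = tmp;
    }
    return best;
}
```

The write to `t0[j+1]` has vanished entirely: the vertical value lives only from one cycle of the loop to the next.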


2.5 Maximum value and rogue value

Determination of the overall maximum value in the array and its position has so far been left out of this discussion. The maximum value in a column could be determined by a second loop that scans every entry in the column.

We save on loop overheads and on the cost of retrieval from T0 if, instead of having a second loop, we place the test in the existing loop. Moreover we only need the test in the version of the code which performs both deferrals, since deferrals correspond to values in T0 greater than P. This reduces the number of times we test for the current maximum being exceeded by a factor of six.

In making the test more rarely we have lost the ability to correctly score low-scoring pairs. The code will only detect those scores greater than P. For the database searching application this is no loss: substantially higher scores are needed for evidence of relatedness between compared proteins. An even better position for the maximum test is at the line labelled Maxtest, where the value DX has been found to be greater than CX. Any matching of two residues where the score for the substrings before matching the two residues is greater than 2P will reach this point. This gives a slightly higher threshold for guaranteed reporting of scores and an even less frequent execution of the maximum test.


2.6 Loop overheads

The loop overheads, the test for j ≤ n, are now a significant fraction of the algorithm's cost. An improvement is the use of a rogue value or 'sentinel' for determining whether the loop has completed. T0[n+1] is initialised with an impossibly high value. This causes entry into the code which checks whether the current maximum score has been exceeded. If it has, this code additionally checks for the rogue value. Using a rogue value in this way virtually eliminates the time spent checking for the end of the loop.
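As a standalone illustration (not the original assembly), the sentinel idea looks like this in C: the terminator is an impossibly high value planted one past the end of the column, so the common path needs no index comparison at all.

```c
#include <limits.h>

#define ROGUE (INT_MAX / 2)   /* impossibly high score, assumed terminator */

/* Largest value exceeding `best` in col[1..n]; col[n+1] must be writable.
   The loop has no j <= n test: the rogue value ends it instead. */
int column_max(int *col, int n, int best)
{
    col[n + 1] = ROGUE;                /* plant the sentinel */
    for (int j = 1; ; j++) {
        if (col[j] > best) {           /* rare path: maximum exceeded */
            if (col[j] >= ROGUE)       /* rogue value: end of column */
                return best;
            best = col[j];
        }
    }
}
```

The end-of-loop check rides for free on the rarely taken maximum-exceeded branch, which is exactly the placement described above.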

The tests for j ≤ n can be replaced with unconditional gotos. The testing of AX>0 in the most commonly executed case, no deferral, is a natural candidate for 'loop unrolling'. This unrolling makes a saving since it reverses the test so that the conditional branches most commonly do not branch.

At the start of each cycle it is necessary to initialise the vector T1. This initialisation is dependent on the ith character of the second sequence. This is most rapidly performed if vectors for each of the characters in the alphabet are precomputed. Initialisation is then performed using a rapid memory copy operation.
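The precomputation can be sketched as follows (a toy 4-letter alphabet and match/mismatch scores are assumptions for illustration; a protein implementation would use the 20-letter amino-acid alphabet and a substitution table):

```c
#include <string.h>

#define ALPHA  4     /* alphabet size (toy value) */
#define MAXLEN 64

static int score(int a, int b) { return a == b ? 5 : -3; }

/* One d column per alphabet letter, computed once per second sequence. */
void build_profile(int profile[ALPHA][MAXLEN], const int *t, int n)
{
    for (int c = 0; c < ALPHA; c++)
        for (int j = 1; j <= n; j++)
            profile[c][j] = score(c, t[j - 1]);
}

/* Initialising a column of T for first-sequence letter c is then one
   fast block copy rather than n table lookups. */
void init_column(int *col, int profile[ALPHA][MAXLEN], int c, int n)
{
    memcpy(col + 1, &profile[c][1], (size_t)n * sizeof(int));
}
```

On the 80286 the block copy would be a `rep movsw`; `memcpy` is the portable equivalent.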


3.1 Performance

Speed of comparison is measured in PME/s, Path Matrix Elements per second. Comparison of two sequences of length 300 in one second requires a speed of 300 × 300 = 90,000 PME/s or better.

The overall result of these optimisations is an algorithm with a speed of 300,000 PME/s (measured in 1990 on a 16 MHz 286). This compares to speeds implied by Pearson (1990) of 4,000 PME/s for comparison; by Mount (1988) of 700 PME/s for comparison and alignment on a mainframe; and a speed of 6,000 PME/s for Pascal code on the same 16 MHz PC (own data). Thus the optimisations described here give an approximately 50-fold speed improvement.


3.2 Summary

A summary of the techniques used to optimise the routine is given below:
  • Use assembler.
  • Use registers.
  • Alignment production should be a separate phase from scoring.
  • Use two columns of the matrix, alternating them, reducing memory from O(N²) to O(N) and saving one indexing operation per cell accessed.
  • Pre-compute a value column for each letter allowing initialisation for each cycle using a fast block copy operation.
  • Rearrange the calculation so that score values propagate rather than being requested. This gives a saving because two or three propagations of value can be skipped in the common case where the value is low or zero.
  • Defer writing values out to memory. Use different copies of the same code to keep track of which deferrals are currently live.
  • Do not write results back out to memory for 'vertically' propagating values! Only horizontally and diagonally propagating values will be needed again.
  • Test for the maximum in the same loop that does the scoring, thereby sharing loop overhead and memory access; also make the test only in cases where the score has increased and is still propagating, as all other scores will be below a threshold.
  • Terminate the loop with a sentinel or rogue high value. This should be picked up in the 'maximum test' case.
  • Unroll the tightest part of the loop.

© James Crook, June 1998.

"Optimisation to 'diff'" page last updated 5-July-2003