Edit distance

Biological applications often need to compare the DNA of two (or more) different organisms. A strand of DNA consists of a string of molecules called bases, where the possible bases are adenine, guanine, cytosine, and thymine. Representing each of these bases by its initial letter, we can express a strand of DNA as a string over the finite set $\{A, C, G, T\}$. For example, the DNA of one organism may be $S1 = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA$, and the DNA of another organism may be $S2 = GTCGTTCGGAATGCCGTTGCTCTGTAAA$. One reason to compare two strands of DNA is to determine how “similar” the two strands are, as some measure of how closely related the two organisms are. We can, and do, define similarity in many different ways. For example, we can say that two strands are similar if the number of changes needed to turn one into the other is small. [1] We formalize this notion of similarity in our Edit distance problem.

15-5 Edit distance - Problem Statement

In order to transform one source string of text $x[1 \ldots m]$ to a target string $y[1 \ldots n]$, we can perform various transformation operations. Our goal is, given $x$ and $y$, to produce a series of transformations that change $x$ to $y$. We use an array  $z-$ assumed to be large enough to hold all the characters it will need — to hold the intermediate results. Initially, $z$ is empty, and at termination, we should have $z_j = y_j$ for $j = 1, 2 \ldots n$. We maintain current indices $i$ into $x$ and $j$ into $z$ and the operations are allowed to alter $z$ and these indices. Initially, $i = j = 1$. We are required to examine every character in $x$ during the transformation, which means that at the end of the sequence of transformation operations, we must have $i = m + 1$. [1]

We may choose from among six transformation operations:

Copy a character from $x$ to $z$ by setting $z[j]  = x[i]$ and then incrementing both $i$ and $j$ . This operation examines $x[i]$.

Replace a character from $x$ by another character $c$, by setting $z[j]  = c$, and then incrementing both $i$ and $j$ . This operation examines $x[i]$.

Delete a character from $x$ by incrementing $i$ but leaving $j$ alone. This operation examines $x[i]$.

Insert the character $c$ into $z$ by setting $z[j]  = c$ and then incrementing $j$ , but leaving $i$ alone. This operation examines no characters of $x$.

Twiddle (i.e., exchange) the next two characters by copying them from $x$ to $z$ but in the opposite order; we do so by setting $z[j] = x[i+1]$ and $z[j+1]  = x[i]$ and then setting $i = i + 2$ and $j = j + 2$. This operation examines $x[i]$ and $x[i + 1]$.

Kill the remainder of $x$ by setting $i = m + 1$. This operation examines all characters in $x$ that have not yet been examined. This operation, if performed, must be the final operation. [1]

As an example, one way to transform the source string algorithm to the target string altruistic is to use the following sequence of operations, where the underlined characters are $x[i]$ and $z[j]$  after the operation: [1]



Note that there are several other sequences of transformation operations that transform algorithm to altruistic.

Each of the transformation operations has an associated cost. The cost of an operation depends on the specific application, but we assume that each operation’s cost is a constant that is known to us. We also assume that the individual costs of the copy and replace operations are less than the combined costs of the delete and insert operations; otherwise, the copy and replace operations would not be used. The cost of a given sequence of transformation operations is the sum of the costs of the individual operations in the sequence. For the sequence above, the cost of transforming algorithm to altruistic is



a. Given two sequences $x[1 \ldots m]$ and $y[1  \ldots n]$ and a set of transformation-operation costs, the edit distance from $x$ to $y$ is the cost of the least expensive operation sequence that transforms $x$ to $y$. Describe a dynamic-programming algorithm that finds the edit distance from $x[1  \ldots m]$ to $y[1  \ldots n]$ and prints an optimal operation sequence. Analyze the running time and space requirements of your algorithm. [1]

The edit-distance problem generalizes the problem of aligning two DNA sequences (see, for example, Setubal and Meidanis [310, Section 3.2]). There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences $x$ and $y$ consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences $x'$ and $y'$ have the same length but do not have a space in the same position (i.e., for no position $j$ are both $x'[j]$  and $y'[j]$  a space). Then we assign a “score” to each position. Position j receives a score as follows:
  • +1 if $x'[j] = y'[j]$ and neither is a space,
  • -1 if $x'[j] \neq y'[j]$ and neither is a space,
  • -2 if either $x'[j]$  or $y'[j]$  is a space.
The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences $x = GATCGGCAT$ and $y = CAATGTGAATC$, one alignment is



A + under a position indicates a score of +1 for that position, a - indicates a score of -1, and a * indicates a score of -2, so that this alignment has a total score of $6 ⋅ 1 −2 ⋅ 1 − 4 ⋅ 2 = −4$.

b. Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace, delete, insert, twiddle, and kill.

For simplicity's sake, let’s assume $cost(twiddle) = cost(kill) = \infty$. Since twiddles and kills have infinite costs, we will have neither of them in a minimal cost solution.

Edit Distance - Definition

For two input sequences $x$ and $y$, each with length $m$ and $n$ characters respectively, let $d[i][j]$ be the edit distance between the subsequences $x_1 \ldots x_i$ and $y_1 \ldots y_j$ such that $0 \le i \le m$ and $0 \le j \le n$.
Here’s our final recursive formula for finding the minimum edit distance for two given sequences.



$d[i][j] =
\begin{cases}
\max(i \cdot cost(delete),\; j \cdot cost(insert)) ,  & \text{if $i = 0$ or $j = 0$} \\[2ex]
d[i-1][j-1] + cost(copy), & \text{if $i, \; j > 0 \;$ and $\; X_i = Y_j$} \\[2ex]
\min(d[i - 1][j - 1] + cost(replace), \\ d[i - 1][j] + cost(delete), d[i][j - 1] + cost(insert)),  & \text{if $i, \; j > 0 \;$ and $\; X_i \neq Y_j$}
\end{cases}$


Edit Distance - Approach 

If the sequence $y$ is empty, then we need to perform $m$ delete operations. Otherwise, if the $x$ sequence is empty, then we need to perform $n$ insert operations. This is the trivial case of the algorithm. This is covered for all the sub sequences of $x$ and $y$ in the first two for loops of our algorithm. Otherwise, if the characters $x_i = y_j$, then we can perform a copy operation and that is guaranteed to give us an optimal solution. If the characters $x_i \neq y_j$, then we get the minimum cost of the insert, delete and replace operations as our optimal solution.  Here’s the algorithm for finding edit distance.

**Input:** Sequences $x$, $y$ and a set of transformation operation costs  
**Output:** Table $d$ which contains minimum edit distance  
$EDIT-DISTANCE(x, y, cost)$              
$\phantom{{}++{}}$ $m \gets x.length$  
$\phantom{{}++{}}$ $n \gets y.length$  
$\phantom{{}++{}}$ Let $d[0\ldots m][0 \ldots n]$ be a new table  
$\phantom{{}++{}}$ `for ` $i=0$ ` to ` $m$  
$\phantom{{}++{}} \phantom{{}++{}}$ $d[i][0] \gets i \cdot cost(delete)$  
$\phantom{{}++{}}$ `for ` $j = 1$ ` to ` $n$   
$\phantom{{}++{}} \phantom{{}++{}}$ $d[0][j] \gets j \cdot cost(insert)$  
$\phantom{{}++{}}$ `for ` $i = 1$ ` to ` $m$   
$\phantom{{}++{}}$ $\phantom{{}++{}}$ `for ` $j = 1$ ` to ` $n$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ `if ` $x_i = y_j$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $d[i][j] \gets d[i-1][j-1] + cost(copy)$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ `else `        
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$  $d[i][j] \gets min(d[i - 1][j - 1] + cost(replace), \\ \; \; \; \; \; \; \; d[i - 1][j] + cost(delete),\; d[i][j - 1] + cost(insert))$  
$\phantom{{}++{}}$ `return ` $d$


Figure 1 shows the table $d$ computed by the procedure $EDIT-DISTANCE$ on the two input sequences $x = $ algorithm and $y = $ altruistic.

$$\begin{array}{|c|c|c|}  \hline
 & \text{}  & \text{A} & \text{L} & \text{T} & \text{R} & \text{U}& \text{I}& \text{S} & \text{T} & \text{I} & \text{C} \\ \hline
\text{} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline
\text{A} &1 & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline
\text{L} & 2 & 1 & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
\text{G} & 3 & 2 & 1 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
\text{O} & 4 & 3 & 2 & 2 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
\text{R} & 5 & 4 & 3 & 3 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
\text{I} & 6 & 5 & 4 & 4 & 3 & 3 & 3 & 4 & 5 & 6 & 7 \\ \hline
\text{T} & 7 & 6 & 5 & 4 & 4 & 4 & 4 & 4 & 4 & 5 & 6 \\ \hline
\text{H} & 8 & 7 & 6 & 5 & 5 & 5 & 5 & 5 & 5 & 5 & 6 \\ \hline
\text{M} & 9 & 8 & 7 & 6 & 6 & 6 & 6 & 6 & 6 & 6 & 6
 \\ \hline
\end{array}$$

Constructing an optimal alignment

Just finding edit distance is not sufficient, we often need to align each character of the two input sequences to each other. The following algorithm is responsible for finding one such optimal alignment.

Let $a_1$, $a_2$ and $s$ be empty strings representing two alignments and sequence of transformation operations that we need to perform respectively, let $x$ and $y$ be two input sequences, let $d$ be the table computed at step one.

**Input:** Sequences $x$, $y$ and $d$ table computed at step one  
**Output:** Sequence of transformation operations and corresponding alignments  
$CONSTRUCT-ALIGNMENT(d, x, y, a_1, a_2, s, i, j)$  
$\phantom{{}++{}}$`if ` $i = 0$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ `for ` $k = 1$ to $j$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_1 = a_1 + \_$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_2 = a_2 + y_k$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $s = s + INSERT$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ return   
$\phantom{{}++{}}$`if ` $j = 0$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ `for ` $k = 1$ to $i$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_1 = a_1 + x_k$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_2 = a_2 + \_$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $s = s + DELETE$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ return  
$\phantom{{}++{}}$ `if ` $d[i - 1][j - 1] \le  d[i - 1][j] \;and\; d[i - 1][j - 1] \le d[i][j - 1]$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $CONSTRUCT-ALIGNMENT(d, x, y, a_1, a_2, s, i - 1, j - 1)$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_1 = a_1 + x_i$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_2 = a_2 + y_j$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ `if ` $x_i = y_j$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $s = s + COPY$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ `else `  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $\phantom{{}++{}}$ $s = s + REPLACE$  
$\phantom{{}++{}}$ `else if ` $d[i - 1][j] < d[i][j - 1]$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $CONSTRUCT-ALIGNMENT(d, x, y, a_1, a_2, s, i - 1, j)$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_1 = a_1 + x_i$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_2 = a_2 + \_$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $s = s + DELETE$  
$\phantom{{}++{}}$ `else`  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $CONSTRUCT-ALIGNMENT(d, x, y, a_1, a_2, s, i, j - 1)$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_1 = a_1 + \_$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $a_2 = a_2 + y_j$  
$\phantom{{}++{}}$ $\phantom{{}++{}}$ $s = s + INSERT$
 

We start from the bottom right position of the table $d$. If one sequence is exhausted, then it is the trivial case and we know the answer, all we need to do is just a set of insert or delete operations. This is covered in the first two if statements. At this point our recursion bottoms out. Otherwise, we need to examine the three cells that immediately precedes the current cell, figuring out the one that incurs us the least cost, doing alignments and transformation accordingly. Also note that we can determine the transformation operation based on $i$, $j$ indices and $x_i$, $y_j$ values. 

Our approach for constructing an optimal alignment involves $O(m + n)$ steps in the worst case. Thus it incurs $O(n)$ space and time complexity.

Uses of Edit Distance


1. In the field of Computational biology, to align two sequences of nucleotides. Given a sequence of bases,



one such alignment would be:



This helps comparing genes or regions from different species to find important regions, uncover evolutionary forces and to compare individuals to looking for mutations.

2. For Spell correction. Say for instance, the user typed “graffe”, which is closest? Graf, Graft, Grail, Giraffe

Sample Implementation

Here’s the sample implementation of algorithms described above along with a driver program.

Let’s run it using our two sequences of nucleotide as input.

$X = AGGCTATCACCTGACCTCCAGGCCGATGCCC$
$Y = TAGCTATCACGACCGCGGTCGATTTGCCCGAC$

Upon execution, this program produces the output:

Min Edit Distance: 13

_AGGCTATCACCTGACCTCCAGGCCGAT__GCCC___
TAG_CTATCAC__GACCGC__GGTCGATTTGCCCGAC

INSERT, COPY, COPY, DELETE, COPY, COPY, COPY, COPY, COPY, COPY, COPY, DELETE, DELETE, COPY, COPY, COPY, COPY, REPLACE, COPY, DELETE, DELETE, COPY, COPY, REPLACE, COPY, COPY, COPY, COPY, INSERT, INSERT, COPY, COPY, COPY, COPY, INSERT, INSERT, INSERT


Notice the sequence of transformation operations and the alignments of our two nucleotide sequences which yields the minimum edit distance. There’s one more detail that bears noting. The alignment our program produces is different from the one we have shown you in section - Uses of Edit Distance.  That is because there are multiple optimal alignments and our program deduces only one of them.

Where did the name, dynamic programming, come from?

... The 1950s were not good years for mathematical research. [the] Secretary of Defense ...had a pathological fear and hatred of the word, research…

I decided therefore to use the word, “programming”.

I wanted to get across the idea that this was dynamic, this was multistage... I thought, let’s ... take a word that has an absolutely precise meaning, namely dynamic... it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible.

Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to.”

    Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.

Analysis

Our approach of finding the minimum edit distance involves $Θ(n^2)$ space and time complexity.

References

[1] https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844

Comments

Popular posts from this blog

Introducing Java Reactive Extentions in to a SpringBoot Micro Service

Optimal binary search trees

Combining the emissions of multiple Observables together using RxJava Zip operator in a Spring Boot Micro service