Longest common subsequence

Biological applications often need to compare the DNA of two (or more) different organisms. A strand of DNA consists of a string of molecules called bases, where the possible bases are adenine, guanine, cytosine, and thymine. Representing each of these bases by its initial letter, we can express a strand of DNA as a string over the finite set $\{A, C, G, T\}$. For example, the DNA of one organism may be $S1 = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA$, and the DNA of another organism may be $S2 = GTCGTTCGGAATGCCGTTGCTCTGTAAA$. One reason to compare two strands of DNA is to determine how similar the two strands are, as some measure of how closely related the two organisms are. However we can, and do, define similarity in many different ways. The way which we use to measure the similarity of strands $S1$ and $S2$ is by finding a third strand $S3$ in which the bases in $S3$ appear in each of $S1$ and $S2$, these bases must appear in the same order, but not necessarily consecutively. The longer the strand $S3$ we can find, the more similar $S1$ and $S2$ are. In our example, the longest strand $S3$ is $GTCGTCGGAAGCCGGCCGAA$. We formalize this last notion of similarity as the longest-common-subsequence problem.[1]

The LCS problem has an optimal-substructure property and the overlapping-subproblems property. Our recursive solution to the LCS problem involves establishing a recurrence for the value of an optimal solution. Let us define $c[i][j]$  to be the length of an LCS of the sequences $X_i$ and $Y_j$. The optimal substructure of the LCS problem gives the recursive formula [1]

$\large c[i][j] =
\begin{cases}
0 ,  & \text{if $i = 0$ or $j = 0$} \\[2ex]
c[i-1][j-1] + 1, & \text{if $i,j > 0\;$ and $X_i = Y_j$} \\[2ex]
max(c[i][j-1], c[i-1][j]), & \text{if $i,j > 0\;$ and $X_i \neq Y_j$}
\end{cases}$


Based on the equation, we could construct a solution using dynamic programming.

Procedure LCS-LENGTH takes two sequences $X$ and $Y$, as inputs. It stores the $c[i][j]$  values in a table $c[0 \cdots m, 0 \cdots n]$, and it computes the entries in row-major order.


In the LCS algorithm, we can eliminate the $b$ table altogether. Each $c[i][j]$ entry depends on only three other $c$ table entries as shown in the equation above. We can determine in $Θ(1)$ time which of these three values was used to compute $c[i][j]$ , without inspecting table $b$. And here’s one such attempt.

For the two DNA strands, it computes the length of the LCS and prints the LCS in the proper order. In our example the length of an LCS is $20$ and the LCS consists of the bases $GTCGTCGGAAGCCGGCCGAA$.

References
[1] https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844

Comments

Popular posts from this blog

Introducing Java Reactive Extentions in to a SpringBoot Micro Service

Optimal binary search trees

Edit distance