Longest common subsequence
Biological applications often need to compare the DNA of two (or more) different organisms. A strand of DNA consists of a string of molecules called bases, where the possible bases are adenine, guanine, cytosine, and thymine. Representing each of these bases by its initial letter, we can express a strand of DNA as a string over the finite set {A,C,G,T}. For example, the DNA of one organism may be S1=ACCGGTCGAGTGCGCGGAAGCCGGCCGAA, and the DNA of another organism may be S2=GTCGTTCGGAATGCCGTTGCTCTGTAAA. One reason to compare two strands of DNA is to determine how similar the two strands are, as some measure of how closely related the two organisms are. However we can, and do, define similarity in many different ways. The way which we use to measure the similarity of strands S1 and S2 is by finding a third strand S3 in which the bases in S3 appear in each of S1 and S2, these bases must appear in the same order, but not necessarily consecutively. The longer the strand S3 we can find, the more similar S1 and S2 are. In our example, the longest strand S3 is GTCGTCGGAAGCCGGCCGAA. We formalize this last notion of similarity as the longest-common-subsequence problem.[1]
The LCS problem has an optimal-substructure property and the overlapping-subproblems property. Our recursive solution to the LCS problem involves establishing a recurrence for the value of an optimal solution. Let us define c[i][j] to be the length of an LCS of the sequences Xi and Yj. The optimal substructure of the LCS problem gives the recursive formula [1]
c[i][j]={0,if i=0 or j=0c[i−1][j−1]+1,if i,j>0 and Xi=Yjmax(c[i][j−1],c[i−1][j]),if i,j>0 and Xi≠Yj
Based on the equation, we could construct a solution using dynamic programming.
Procedure LCS-LENGTH takes two sequences X and Y, as inputs. It stores the c[i][j] values in a table c[0⋯m,0⋯n], and it computes the entries in row-major order.
In the LCS algorithm, we can eliminate the b table altogether. Each c[i][j] entry depends on only three other c table entries as shown in the equation above. We can determine in Θ(1) time which of these three values was used to compute c[i][j] , without inspecting table b. And here’s one such attempt.
For the two DNA strands, it computes the length of the LCS and prints the LCS in the proper order. In our example the length of an LCS is 20 and the LCS consists of the bases GTCGTCGGAAGCCGGCCGAA.
References
[1] https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844
The LCS problem has an optimal-substructure property and the overlapping-subproblems property. Our recursive solution to the LCS problem involves establishing a recurrence for the value of an optimal solution. Let us define c[i][j] to be the length of an LCS of the sequences Xi and Yj. The optimal substructure of the LCS problem gives the recursive formula [1]
c[i][j]={0,if i=0 or j=0c[i−1][j−1]+1,if i,j>0 and Xi=Yjmax(c[i][j−1],c[i−1][j]),if i,j>0 and Xi≠Yj
Based on the equation, we could construct a solution using dynamic programming.
Procedure LCS-LENGTH takes two sequences X and Y, as inputs. It stores the c[i][j] values in a table c[0⋯m,0⋯n], and it computes the entries in row-major order.
In the LCS algorithm, we can eliminate the b table altogether. Each c[i][j] entry depends on only three other c table entries as shown in the equation above. We can determine in Θ(1) time which of these three values was used to compute c[i][j] , without inspecting table b. And here’s one such attempt.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
public final class LCS { | |
public static void main(String[] args) { | |
final String strandOne = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; | |
final String strandTwo = "GTCGTTCGGAATGCCGTTGCTCTGTAAA"; | |
int[][] c = lcsLength(strandOne, strandTwo); | |
System.out.println(String.format("The length of an LCS of two sequences is: %d", | |
c[strandOne.length()][strandTwo.length()])); | |
printLcs(c, strandOne, strandTwo, strandOne.length(), strandTwo.length()); | |
} | |
private static void printLcs(int[][] c, String x, String y, int i, int j) { | |
if (i == 0 || j == 0) | |
System.out.print(""); | |
else if (x.charAt(i - 1) == y.charAt(j - 1)) { | |
printLcs(c, x, y, i - 1, j - 1); | |
System.out.print(x.charAt(i - 1)); | |
} else if (c[i - 1][j] >= c[i][j - 1]) | |
printLcs(c, x, y, i - 1, j); | |
else | |
printLcs(c, x, y, i, j - 1); | |
} | |
private static int[][] lcsLength(String x, String y) { | |
final int m = x.length(); | |
final int n = y.length(); | |
final int[][] c = new int[m + 1][n + 1]; | |
for (int i = 1; i <= m; i++) | |
c[i][0] = 0; | |
for (int j = 0; j <= n; j++) | |
c[0][j] = 0; | |
for (int i = 1; i <= m; i++) { | |
for (int j = 1; j <= n; j++) { | |
if (x.charAt(i - 1) == y.charAt(j - 1)) | |
c[i][j] = c[i - 1][j - 1] + 1; | |
else if (c[i - 1][j] >= c[i][j - 1]) | |
c[i][j] = c[i - 1][j]; | |
else | |
c[i][j] = c[i][j - 1]; | |
} | |
} | |
return c; | |
} | |
} |
For the two DNA strands, it computes the length of the LCS and prints the LCS in the proper order. In our example the length of an LCS is 20 and the LCS consists of the bases GTCGTCGGAAGCCGGCCGAA.
References
[1] https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844
Comments
Post a Comment