Algorithm Implementation/Strings/Longest common subsequence
From Wikibooks, the open-content textbooks collection
Contents |
[edit] C#
[edit] Finding the LCS
The following C# program calculates the longest common subsequence (note the singular) of 2 strings. For example, for the strings "computer" and "houseboat" this algorithm returns a value of 3, specifically the string "out".
public int LongestCommonSubsequence(string s1, string s2) { //if either string is empty, the length must be 0 if (String.IsNullOrEmpty(s1) || String.IsNullOrEmpty(s2)) return 0; int[ , ] num = new int[s1.Length, s2.Length]; //2D array char letter1; char letter2; //Actual algorithm for(int i = 0; i < s1.Length; i++) { letter1 = s1[i]; for(int j = 0; j < s2.Length; j++) { letter2 = s2[j]; if(letter1 == letter2) { if((i == 0) || (j == 0)) num[i, j] = 1; else num[i, j] = 1 + num[i-1, j-1]; } else { if ((i == 0) && (j == 0)) num[i, j] = 0; else if ((i == 0) && !(j == 0)) //First ith element num[i, j] = Math.Max(0, num[i, j - 1]); else if (!(i == 0) && (j == 0)) //First jth element num[i, j] = Math.Max(num[i - 1, j], 0); else // if (!(i == 0) && !(j == 0)) num[i, j] = Math.Max(num[i - 1, j], num[i, j - 1]); } }//end j }//end i return num[s1.Length-1, s2.Length-1]; } //end LongestCommonSubsequence Usage: LongestCommonSubsequence("computer", "boathouse")
[edit] Reading the LCS
This method read the longest (and only the longest) subsequence out of the array that the method above generates.
public StringBuilder ReadLCSFromBacktrack(int[,] backtrack, string string1, string string2, int s1position, int s2posision) { if ((s1position < 0) || (s2posision < 0)) { return new StringBuilder(); } else if (string1[s1position] == string2[s2posision]) { return ReadLCSFromBacktrack(backtrack, string1, string2, s1position - 1, s2posision - 1).Append(string1[s1position]); } else { if (backtrack[s1position, s2posision - 1] >= backtrack[s1position - 1, s2posision]) { return ReadLCSFromBacktrack(backtrack, string1, string2, s1position, s2posision - 1); } else { return ReadLCSFromBacktrack(backtrack, string1, string2, s1position - 1, s2posision); } } } Usage: ReadLCSFromBacktrack(intArray,string1,string2,string1.Length-1,string2.Length-1) Use the ToString() method of the returned StringBuilder to get a string: string someString = ReadLCSFromBacktrack(intArray,string1,string2,string1.Length-1,string2.Length-1).ToString()
[edit] Common Lisp
This version only works for lists but can be generalized for all sequences.
(defun lcs-list (list-1 list-2 &key (test #'eql)) "Find the longest common subsequence of LIST-1 and LIST-2 using TEST." (cond ((null list-1) nil) ((null list-2) nil) ((funcall test (first list-1) (first list-2)) (cons (first list-1) (lcs-list (rest list-1) (rest list-2) :test test))) (t (let ((lcs-1 (lcs-list list-1 (rest list-2) :test test)) (lcs-2 (lcs-list (rest list-1) list-2 :test test))) (if (> (length lcs-1) (length lcs-2)) lcs-1 lcs-2))))) (defun diff (list1 list2 &key (test #'eql)) "Find the differences between LIST1 and LIST2 using TEST." (let ((lcs (lcs-list list1 list2 :test test)) result) (dolist (c lcs) (let* ((sync-list1 (position c list1 :test test)) (sync-list2 (position c list2 :test test)) (removed (subseq list1 0 sync-list1)) (added (subseq list2 0 sync-list2))) (setf list1 (subseq list1 (1+ sync-list1))) (setf list2 (subseq list2 (1+ sync-list2))) (when removed (push (cons :removed removed) result)) (when added (push (cons :added added) result)) (push c result))) (when list1 (push (cons :removed list1) result)) (when list2 (push (cons :added list2) result)) (nreverse result)))
[edit] Python
[edit] Computing the length of the LCS
def LCS(X, Y): m = len(X) n = len(Y) # An (m+1) times (n+1) matrix C = [[0] * (n+1) for i in range(m+1)] for i in range(1, m+1): for j in range(1, n+1): if X[i-1] == Y[j-1]: C[i][j] = C[i-1][j-1] + 1 else: C[i][j] = max(C[i][j-1], C[i-1][j]) return C
[edit] Reading out an LCS
def backTrack(C, X, Y, i, j): if i == 0 or j == 0: return "" elif X[i-1] == Y[j-1]: return backTrack(C, X, Y, i-1, j-1) + X[i-1] else: if C[i][j-1] > C[i-1][j]: return backTrack(C, X, Y, i, j-1) else: return backTrack(C, X, Y, i-1, j)
[edit] Reading out all LCSs
def backTrackAll(C, X, Y, i, j): if i == 0 or j == 0: return set([""]) elif X[i-1] == Y[j-1]: return set([Z + X[i-1] for Z in backTrackAll(C, X, Y, i-1, j-1)]) else: R = set() if C[i][j-1] >= C[i-1][j]: R.update(backTrackAll(C, X, Y, i, j-1)) if C[i-1][j] >= C[i][j-1]: R.update(backTrackAll(C, X, Y, i-1, j)) return R
[edit] Usage example
X = "AATCC" Y = "ACACG" m = len(X) n = len(Y) C = LCS(X, Y) print "Some LCS: '%s'" % backTrack(C, X, Y, m, n) print "All LCSs: %s" % backTrackAll(C, X, Y, m, n)
It prints the following:
Some LCS: 'AAC' All LCSs: set(['ACC', 'AAC'])
[edit] Print the diff
def printDiff(C, X, Y, i, j): if i > 0 and j > 0 and X[i-1] == Y[j-1]: printDiff(C, X, Y, i-1, j-1) print " " + X[i-1] else: if j > 0 and (i == 0 or C[i][j-1] >= C[i-1][j]): printDiff(C, X, Y, i, j-1) print "+ " + Y[j-1] elif i > 0 and (j == 0 or C[i][j-1] < C[i-1][j]): printDiff(C, X, Y, i-1, j) print "- " + X[i-1]
[edit] Usage example
X = [ "This part of the document has stayed", "the same from version to version.", "", "This paragraph contains text that is", "outdated - it will be deprecated '''and'''", "deleted '''in''' the near future.", "", "It is important to spell check this", "dokument. On the other hand, a misspelled", "word isn't the end of the world.", ] Y = [ "This is an important notice! It should", "therefore be located at the beginning of", "this document!", "", "This part of the document has stayed", "the same from version to version.", "", "It is important to spell check this", "document. On the other hand, a misspelled", "word isn't the end of the world. This", "paragraph contains important new", "additions to this document.", ] C = LCS(X, Y) printDiff(C, X, Y, len(X), len(Y))
It prints the following:
+ This is an important notice! It should + therefore be located at the beginning of + this document! + This part of the document has stayed the same from version to version. - - This paragraph contains text that is - outdated - it will be deprecated and - deleted in the near future. It is important to spell check this - dokument. On the other hand, a misspelled - word isn't the end of the world. + document. On the other hand, a misspelled + word isn't the end of the world. This + paragraph contains important new + additions to this document.
[edit] VB.NET
The following VB.NET program calculates the longest common subsequence (note the singular) of 2 strings. For example, for the strings "computer" and "houseboat" this algorithm returns a value of 3, specifically the string "out".
Public Function LongestCommonSubsequence(ByVal s1 As String, ByVal s2 As String) As Integer 'Bulletproofing - 1 or both inputs contains nothing If s1.Length.Equals(0) Or s2.Length.Equals(0) Then Return "0" End If '*** Actual Algorithm From Here *** Dim num(s1.Length - 1, s2.Length - 1) As Long '2D Array Dim letter1 As Char = Nothing Dim letter2 As Char = Nothing For i As Integer = 0 To s1.Length - 1 For j As Integer = 0 To s2.Length - 1 letter1 = s1.Chars(i) letter2 = s2.Chars(j) If letter1.Equals(letter2) Then If i.Equals(0) Or j.Equals(0) Then 'The first elements respectively num(i, j) = 1 Else num(i, j) = 1 + num(i - 1, j - 1) End If Else If i.Equals(0) And j.Equals(0) Then num(i, j) = 0 ElseIf i.Equals(0) And Not j.Equals(0) Then 'First ith element num(i, j) = Math.Max(0, num(i, j - 1)) ElseIf j.Equals(0) And Not i.Equals(0) Then 'First jth element num(i, j) = Math.Max(num(i - 1, j), 0) ElseIf i <> 0 And j <> 0 Then num(i, j) = Math.Max(num(i - 1, j), num(i, j - 1)) End If End If Next j Next i Return num(s1.Length - 1, s2.Length - 1) End Function Usage: LongestCommonSubsequence("computer", "boathouse")
[edit] Java
public static <E> List<E> LongestCommonSubsequence(E[] s1, E[] s2) { int[][] num = new int[s1.length+1][s2.length+1]; //2D array, initialized to 0 //Actual algorithm for (int i = 1; i <= s1.length; i++) for (int j = 1; j <= s2.length; j++) if (s1[i-1].equals(s2[j-1])) num[i][j] = 1 + num[i-1][j-1]; else num[i][j] = Math.max(num[i-1][j], num[i][j-1]); System.out.println("length of LCS = " + num[s1.length][s2.length]); int s1position = s1.length, s2position = s2.length; List<E> result = new LinkedList<E>(); while (s1position != 0 && s2position != 0) { if (s1[s1position - 1].equals(s2[s2position - 1])) { result.add(s1[s1position - 1]); s1position--; s2position--; } else if (num[s1position][s2position - 1] >= num[s1position - 1][s2position]) { s2position--; } else { s1position--; } } Collections.reverse(result); return result; }
The Java implementation uses two classes. The first, an abstract class that implements the algorithm. The second, an concrete class implementing this for a string. Obviously, you could use this so that instead of comparing characters in a string, it could compare lines in a file, blocks of code, nodes in an XML document, or whatever you choose.
import static java.lang.Math.*; import java.util.ArrayList; import java.util.List; /** * A class to compute the longest common subsequence in two strings. * Algorithms from Wikipedia: * http://en.wikipedia.org/wiki/Longest_common_subsequence_problem * * @author jhess * */ public abstract class LongestCommonSubsequence<VALUE> { private int[][] c; private ArrayList<DiffEntry<VALUE>> diff; private ArrayList<VALUE> backtrack; /** * A constructor for classes inheriting this one, allowing them to * do some initialization before setting the values of X and Y. Once * the initialization is complete, the inheriting class must call * initValues(VALUE[] x, VALUE[] y) */ protected LongestCommonSubsequence() { } protected abstract int lengthOfY() ; protected abstract int lengthOfX() ; protected abstract VALUE valueOfX(int index) ; protected abstract VALUE valueOfY(int index) ; protected boolean equals(VALUE x1, VALUE y1) { return (null == x1 && null == y1) || x1.equals(y1); } private boolean isXYEqual(int i, int j) { return equals(valueOfXInternal(i),valueOfYInternal(j)); } private VALUE valueOfXInternal(int i) { return valueOfX(i-1); } private VALUE valueOfYInternal(int j) { return valueOfY(j-1); } public void calculateLcs() { if(c != null) { return; } c = new int[lengthOfX()+1][]; for(int i = 0; i < c.length; i++) { c[i] = new int[lengthOfY()+1]; } for(int i = 1; i < c.length; i++) { for(int j = 1; j < c[i].length; j++) { if(isXYEqual(i, j)) { c[i][j] = c[i-1][j-1] + 1; } else { c[i][j] = max(c[i][j-1], c[i-1][j]); } } } } public int getLcsLength() { calculateLcs(); return c[lengthOfX()][lengthOfY()]; } public int getMinEditDistance() { calculateLcs(); return lengthOfX()+lengthOfY()-2*abs(getLcsLength()); } public List<VALUE> backtrack() { calculateLcs(); if(this.backtrack == null) { this.backtrack = new ArrayList<VALUE>(); backtrack(lengthOfX(),lengthOfY()); } return this.backtrack; } public void backtrack(int i, int j) { calculateLcs(); if (i == 0 || j == 0) { return; } else if (isXYEqual(i, j)) { backtrack(i-1,j-1); backtrack.add(valueOfXInternal(i)); } else { if(c[i][j-1] > c[i-1][j]) { backtrack(i,j-1); } else { backtrack(i-1,j); } } } public List<DiffEntry<VALUE>> diff() { calculateLcs(); if(this.diff == null) { this.diff = new ArrayList<DiffEntry<VALUE>>(); diff(lengthOfX(),lengthOfY()); } return this.diff; } private void diff(int i, int j) { calculateLcs(); while(!(i==0 && j==0)) { if (i > 0 && j > 0 && isXYEqual(i, j)) { this.diff.add(new DiffEntry<VALUE>(DiffType.NONE, valueOfXInternal(i))); i--;j--; } else { if (j > 0 && (i == 0 || c[i][j - 1] >= c[i - 1][j])) { this.diff.add(new DiffEntry<VALUE>(DiffType.ADD, valueOfYInternal(j))); j--; } else if (i > 0 && (j == 0 || c[i][j - 1] < c[i - 1][j])) { this.diff.add(new DiffEntry<VALUE>(DiffType.REMOVE, valueOfXInternal(i))); i--; } } } Collections.reverse(this.diff); } @Override public String toString() { calculateLcs(); StringBuffer buf = new StringBuffer(); buf.append(" "); for(int j = 1; j <= lengthOfY(); j++) { buf.append(valueOfYInternal(j)); } buf.append("\n"); buf.append(" "); for(int j = 0; j < c[0].length; j++) { buf.append(Integer.toString(c[0][j])); } buf.append("\n"); for(int i = 1; i < c.length; i++) { buf.append(valueOfXInternal(i)); for(int j = 0; j < c[i].length; j++) { buf.append(Integer.toString(c[i][j])); } buf.append("\n"); } return buf.toString(); } public static enum DiffType { ADD("+","add"),REMOVE("-","remove"),NONE(" ","none"); private String val; private String name; DiffType(String val, String name) { this.val = val; this.name = name; } @Override public String toString() { return val; } public String getName() { return name; } public String getVal() { return val; } } public static class DiffEntry<VALUE> { private DiffType type; private VALUE value; public DiffEntry(DiffType type, VALUE value) { super(); this.type = type; this.value = value; } public DiffType getType() { return type; } public void setType(DiffType type) { this.type = type; } public VALUE getValue() { return value; } public void setValue(VALUE value) { this.value = value; } @Override public String toString() { return type.toString()+value.toString(); } } }
import java.util.List; public class LcsString extends LongestCommonSubsequence<Character> { private String x; private String y; public LcsString(String from, String to) { this.x = from; this.y = to; } protected int lengthOfY() { return y.length(); } protected int lengthOfX() { return x.length(); } protected Character valueOfX(int index) { return x.charAt(index); } protected Character valueOfY(int index) { return y.charAt(index); } public String getHtmlDiff() { DiffType type = null; List<DiffEntry<Character>> diffs = diff(); StringBuffer buf = new StringBuffer(); for(DiffEntry<Character> entry : diffs) { if(type != entry.getType()) { if(type != null) { buf.append("</span>"); } buf.append("<span class=\""+entry.getType().getName()+"\">"); type = entry.getType(); } buf.append(escapeHtml(entry.getValue())); } buf.append("</span>"); return buf.toString(); } private String escapeHtml(Character ch) { switch(ch) { case '<' : return "<"; case '>' : return ">"; case '"' : return "\""; default : return ch.toString(); } } // EXAMPLE. Here's how you use it. public static void main(String[] args) { LcsString seq = new LcsString("<p>the quick brown fox</p>","<p>the <b>Fast</b> brown dog</p>"); System.out.println("LCS: "+seq.getLcsLength()); System.out.println("Edit Dist: "+seq.getMinEditDistance()); System.out.println("Backtrack: "+seq.backtrack()); System.out.println("HTML Diff: "+seq.getHtmlDiff()); } }
[edit] C++
// Compute the longest common subsequence between X and Y // On return, C will contain the LCS table. // C[m][n] will contain the length of the longest common subsequence. template<typename RanIt> size_t ** LCSLength(RanIt X, RanIt Xend, RanIt Y, RanIt Yend) { size_t m = std::distance(X, Xend); size_t n = std::distance(Y, Yend); size_t **C = Allocate2DArray<size_t>(m+1, n+1); for (size_t i=0; i<=m; ++i) C[i][0] = 0; for (size_t j=0; j<=n; ++j) C[0][j] = 0; for (size_t i=0; i<m; ++i) for (size_t j=0; j<n; ++j) if (X[i] == Y[j]) C[i+1][j+1] = C[i][j] + 1; else C[i+1][j+1] = std::max(C[i+1][j], C[i][j+1]); return C; }
Usage:
char X[] = "XMJYAUZ";
char Y[] = "MZJAWXU";
size_t **C = LCSLength(X, X+strlen(X), Y, Y+strlen(Y));
[edit] Ruby
def subsequence(s1, s2) return 0 if s1.length==0 || s2==0 num=Array.new(s1.size){Array.new(s2.size)} s1.scan(/./).each_with_index{|letter1,i| s2.scan(/./).each_with_index{|letter2,j| if s1[i]==s2[j] if i==0||j==0 num[i][j] = 1 else num[i][j] = 1 + num[i - 1][ j - 1] end else if i==0 && j==0 num[i][j] = 0 elsif i==0 && j!=0 #First ith element num[i][j] = [0, num[i][j - 1]].max elsif j==0 && i!=0 #First jth element num[i][j] = [0, num[i - 1][j]].max elsif i != 0 && j!= 0 num[i][j] = [num[i - 1][j], num[i][j - 1]].max end end } } num[s1.length - 1][s2.length - 1] end puts subsequence("houseboat","computer")