Tutorial for The LCS Problem Revisited |
The LCS Problem Revisited was one of the problems in the November contest.
The Longest Common Subsequence problem is a very well known problem, and has a classic example of a dynamic programming solution. This problem seems to be more complex - while we can easily compute the length of the LCS in the standard way, now we have to count them - and only the distinct ones, which is the crucial part. A subsequence may occur several times in different places, but we are only meant to count it once.
The standard LCS problem
Let's start by looking at how to solve the standard LCS problem - ie to find out the length of the longest common subsequence of two strings.
Given two strings, we split things into three cases:
Case 1: We don't use the first character of the first string.
Case 2: We don't use the first character of the second string.
Case 3: We match up the first character of each string.
The third case is obviously only possible if the first characters match.
Once we have decided which of these options we are going to use, we have reduced the problem to one of a smaller size: we want to find the LCS of the remaining strings, which are suffixes of the original strings.
The terminating condition for the DP occurs when one of the strings is empty - in that case, the length of the LCS is 0.
This results in the following code:
[code] // lcs is an array of dimensions [M+1][N+1] // s is first string of length M // t is second string of length N for (int i=M; i>=0; i--) // using the ith character onwards of first string for (int j=N; j>=0; j--) { // using the jth character onwards of second string if (i==M || j==N) lcs[i][j]=0; // empty string else { lcs[i][j] = max(lcs[i+1][j],lcs[i][j+1]); // first two cases if (s[i]==t[j]) lcs[i][j] = max(lcs[i][j], lcs[i+1][j+1] + 1); // third case } } [/code]
This code runs in O(MN) time, which is very fast with the lengths of the strings restricted to just 1000.
Now lcs[0][0] will hold the length of the longest common subsequence of both strings. This is the first number we are required to print out.
Counting the LCS
Counting the number of distinct LCS looks a little more difficult. If we tried to construct the algorithm in a similar way to the above, by adding together the number of solutions of all subproblems, we run into a few issues. Firstly, we are going to count some things twice (eg removing the first character of string one, then removing the second character of string two, and vice versa, will both be counted separately). Also, this isn't going to detect duplicates - it will count a subsequence more than once if it occurs several times in the string.
In order to get around this, we use the one constraint we haven't looked at yet - there are only 26 possible characters that could occur. Rather than looping through the strings to choose which character we should use first (which may include the same letter twice), we loop through all possible letters. We thus split things into 26 possible cases:
Case 1: How many subsequences of the right length (lcs[0][0]) begin with 'a'?
Case 2: How many subsequences of the right length (lcs[0][0]) begin with 'b'?
..
Case 26: How many subsequences of the right length (lcs[0][0]) begin with 'z'?
Let's suppose we are considering Case 1. Since we know that we are going to start with the letter 'a', we may as well use the first occurrence of 'a' that occurs in both strings. Now we consider the remaining parts of the strings - and want to count the number of subsequences of length one smaller than we started with.
Again, the terminating condition is when one string is empty, where the answer is 0.
This turns into the following code:
[code] // lcscount is an array of dimensions [M+1][N+1] // s is first string of length M // t is second string of length N for (int i=M; i>=0; i--) // using the ith character onwards of first string for (int j=N; j>=0; j--) { // using the jth character onwards of second string if (i==M || j==N) lcscount[i][j]=0; // empty string else { lcscount[i][j]=0; for (char k='a'; k<='z'; k++) { // use k as the first letter int x = /* first position in first string (from i onwards) that is a k */ int y = /* first position in second string (from j onwards) that is a k */ if (/* x and y both exist*/) { if (lcs[x][y] == lcs[i][j] - 1) { // make sure we can get a solution of the right length lcscount[i][j] = (lcscount[i][j] + lcscount[x][y])%23102009; // add on solutions } } } } } [/code]
Precomputing the next positions
The only problem we have remaining are the bits commented out in the above code - finding the first position of a certain character in each suffix of the input string. Simply looping through the string at each point isn't going to work - we might have to loop through up to 1000 characters, and 1000*1000*1000 is going to be too slow.
However, we can get around this by precomputing all of this information, using a third dynamic programming approach (but a very simple one).
We do so for each string by considering each character separately - if the string starts with this character, we have our answer. Otherwise, we skip this character.
Example code for the first string only:
[code] // next is an array of dimensions [M+1][26] // s is first string of length for (int i=0; i<26; i++) { next[M][i] = M; // set the 'next' occurrence of each character in an empty string to something impossible to signify it doesn't exist } for (int i=M-1; i>=0; i--) for (int j=0; j<26; j++) { if (s[i]=='a'+j) { // first character matches, so position = i next[i][j] = i; } else { // first character doesn't match, so we skip this character next[i][j] = next[i+1][j]; } } [/code]
By doing the same thing for the second string, and inserting these results into the previous DP code, we end up with a full solution.
Summary
While dynamic programming often looks scary to beginners, as long as you approach it in a step-by-step manner, the resulting code tends to be very short and sweet. Just make sure you know exactly what you want your subproblems to be, and you'll have a solution before you know it.