Minimum Edit Distance with Dynamic Programming

Posted by chunngai

1. Question / Practice Problem

[Image: problem statement]

2. Analysis / Problem Description

Our task is to transform one string into the other using three operations: (1) deleting one character, (2) inserting one character, and (3) substituting one character with another, using as few operations as possible. This minimum number of operations is the minimum edit distance. The problem can be solved with dynamic programming. To apply this strategy, we first need to identify its optimal substructure and its overlapping subproblems.

3. Algorithm / Algorithm Description

Assume that the lengths of string A and string B are \(m\) and \(n\). Let's first try to make the last characters of the two strings identical. If the last characters differ, we have to perform some operation. If they are the same, we only need to consider the remaining substrings: the \(1^{st}\) character to the \((m - 1)^{th}\) character of string A and the \(1^{st}\) character to the \((n - 1)^{th}\) character of string B.

3.1. Substitution

We can either substitute the last character of string A with the last character of string B, or vice versa. Let's consider the first case. The original strings in the question are:

fxpimu
  xwrs

We substitute "u" in string A with "s" in string B. Thus the strings now are:

fxpims
  xwrs

Since the last characters of the two strings are now identical, we only need the minimum edit distance of the remaining substrings: the \(1^{st}\) character to the \((m - 1)^{th}\) character of string A and the \(1^{st}\) character to the \((n - 1)^{th}\) character of string B. Substituting the last character of string B with the last character of string A is handled the same way.

3.2. Insertion

We can either insert the last character of string A at the end of string B, or vice versa. Let's first consider the case where the last character of string B is inserted at the end of string A. After the insertion the strings are:

fxpimus
   xwrs

The lengths of the two strings are now \(m + 1\) and \(n\). Their last characters are identical, so we only need the minimum edit distance of the remaining substrings: the \(1^{st}\) character to the \(m^{th}\) character of string A and the \(1^{st}\) character to the \((n - 1)^{th}\) character of string B. (The insertion increased the length of string A to \(m + 1\); omitting its last character leaves a substring that ends at the \(m^{th}\) character.) The other case, where the last character of string A is inserted at the end of string B, is similar.

3.3. Deletion

Both insertion and substitution make the last characters of the two strings identical, but that is not the case for deletion. After deleting a character from one string, the last character of the modified string may still differ from the last character of the other string. Concretely, if we delete the last character "u" of string A, the strings become:

fxpim
 xwrs

The last characters are still not identical. Therefore, we still have to consider both strings in full: the \(1^{st}\) character to the \((m - 1)^{th}\) character of string A (deleting its last character decreased its length by one) and the \(1^{st}\) character to the \(n^{th}\) character of string B. Deleting the last character of string B is similar.
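The three operations on the last character can be sketched as small string helpers. The names substituteLast, insertLast and deleteLast are hypothetical (they do not appear in the original solution), and all three assume non-empty input strings.

```cpp
#include <string>

// Hypothetical helpers for illustration only; inputs are assumed non-empty.

// Substitution: replace the last character of a with the last character of b.
// Remaining subproblem: first m - 1 characters of A, first n - 1 of B.
std::string substituteLast(std::string a, const std::string& b) {
    a.back() = b.back();
    return a;
}

// Insertion: append the last character of b to the end of a.
// Remaining subproblem: first m characters of A, first n - 1 of B.
std::string insertLast(std::string a, const std::string& b) {
    a.push_back(b.back());
    return a;
}

// Deletion: drop the last character of a.
// Remaining subproblem: first m - 1 characters of A, all n of B.
std::string deleteLast(std::string a) {
    a.pop_back();
    return a;
}
```

With the strings from the question, substituteLast("fxpimu", "xwrs") gives "fxpims", insertLast("fxpimu", "xwrs") gives "fxpimus", and deleteLast("fxpimu") gives "fxpim", matching the three pictures above.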

3.4. Special Cases

When one of the strings is empty, the minimum edit distance is the length of the non-empty string. The empty string can be turned into the non-empty string by inserting the characters of the non-empty string one by one.

When both strings are empty, nothing needs to be done, so the distance is 0.

From the analysis above we can identify the optimal substructure of the task: the optimal solution of the original task is built from the optimal solutions of its subtasks.

The task also has many overlapping subproblems. To find the minimum edit distance of two strings, we need the minimum edit distances of their substrings, which in turn depend on the minimum edit distances of their sub-substrings. The distance for the pair of sub-substrings is needed directly by the original problem (through substitution) and again by each of the substring problems, so a plain recursion computes it more than once.
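To make the overlap visible, here is a minimal sketch of the plain recursion with a call counter. The names naiveEdit and callCount are assumptions introduced for this illustration, and the function works on prefix lengths i and j rather than on copies of the strings.

```cpp
#include <algorithm>
#include <string>

// Hypothetical counter: number of recursive calls made so far.
long long callCount = 0;

// Plain recursion on the prefix lengths i (of a) and j (of b).
// No results are stored, so the same (i, j) pair is solved repeatedly.
int naiveEdit(const std::string& a, const std::string& b, int i, int j) {
    ++callCount;
    if (i == 0) return j;  // A's prefix is empty: insert j characters
    if (j == 0) return i;  // B's prefix is empty: delete i characters
    int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
    return std::min({naiveEdit(a, b, i - 1, j) + 1,
                     naiveEdit(a, b, i, j - 1) + 1,
                     naiveEdit(a, b, i - 1, j - 1) + cost});
}
```

For "fxpimu" and "xwrs" there are only 7 * 5 = 35 distinct (i, j) pairs, yet callCount ends up far larger than 35 after naiveEdit(a, b, 6, 4), which is exactly the repeated work that a table removes.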

3.5. Equation

Let \(edit[i][j]\) be the minimum edit distance between the first \(i\) characters of string A and the first \(j\) characters of string B, so that the answer for strings of lengths \(m\) and \(n\) is \(edit[m][n]\). With the optimal substructure, we can work out the equation for the task:
\[
edit[i][j] =
\left\{
\begin{aligned}
& 0 & & i = 0,\ j = 0 \\
& j & & i = 0,\ j > 0 \\
& i & & i > 0,\ j = 0 \\
& \min\{edit[i - 1][j] + 1,\ edit[i][j - 1] + 1,\ edit[i - 1][j - 1] + notIdentical\} & & i > 0,\ j > 0
\end{aligned}
\right.
\]

where notIdentical is like:
\[
notIdentical =
\left\{
\begin{aligned}
& 0 & & A[i] = B[j] \\
& 1 & & A[i] \neq B[j]
\end{aligned}
\right.
\]

Note that \(edit[i - 1][j] + 1\) covers (1) deleting the last character of string A and (2) inserting the last character of string A at the end of string B. \(edit[i][j - 1] + 1\) covers (1) deleting the last character of string B and (2) inserting the last character of string B at the end of string A. \(edit[i - 1][j - 1] + notIdentical\) covers (1) the case where the last characters of the two strings are identical and (2) the substitution operation.

The equation can be simplified as:
\[
edit[i][j] =
\left\{
\begin{aligned}
& i == 0\;?\;j\;:\;i & & i == 0\;||\;j == 0 \\
& \min\{edit[i - 1][j] + 1,\ edit[i][j - 1] + 1,\ edit[i - 1][j - 1] + int(!(A[i] == B[j]))\} & & i > 0,\ j > 0
\end{aligned}
\right.
\]
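The equation can be transcribed almost literally into a recursive function that stores each result the first time it is computed (top-down memoization). This is only a sketch for checking the recurrence, not the table-filling solution developed in the following sections; editMemo and editDistanceTopDown are hypothetical names. Since C++ strings are 0-indexed, \(A[i]\) in the equation becomes a[i - 1] in the code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// memo[i][j] caches edit[i][j]; -1 means "not computed yet".
int editMemo(const std::string& a, const std::string& b, int i, int j,
             std::vector<std::vector<int>>& memo) {
    if (i == 0 || j == 0) return i == 0 ? j : i;  // special cases
    if (memo[i][j] != -1) return memo[i][j];      // reuse a stored result
    int notIdentical = (a[i - 1] != b[j - 1]) ? 1 : 0;
    memo[i][j] = std::min({editMemo(a, b, i - 1, j, memo) + 1,
                           editMemo(a, b, i, j - 1, memo) + 1,
                           editMemo(a, b, i - 1, j - 1, memo) + notIdentical});
    return memo[i][j];
}

// Convenience wrapper: allocates the (m + 1) x (n + 1) cache.
int editDistanceTopDown(const std::string& a, const std::string& b) {
    int m = static_cast<int>(a.size());
    int n = static_cast<int>(b.size());
    std::vector<std::vector<int>> memo(m + 1, std::vector<int>(n + 1, -1));
    return editMemo(a, b, m, n, memo);
}
```

For the strings in the question, editDistanceTopDown("fxpimu", "xwrs") evaluates to 5.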

4. Fill the Table

Solving a dynamic programming task comes down to filling a table. To do so, we need to know:

  1. the dimension of the table
  2. the range to fill
  3. the filling order

4.1. Dimension

Since we use a 2-dimensional matrix edit[i][j] to store the solutions, the table is 2D.

4.2. Range

The range of the indices of the solution is \(0 \leq i \leq m\), \(0 \leq j \leq n\). Thus the whole table should be filled.

4.3. Order

To calculate edit[i][j], we must first calculate edit[i - 1][j], edit[i][j - 1] and edit[i - 1][j - 1]. Let's look at their relative positions in the table:
\[
\begin{matrix}
edit[i - 1][j - 1] & edit[i - 1][j] \\
edit[i][j - 1] & *edit[i][j]*
\end{matrix}
\]

Thus the order is from left to right and from top to bottom.

After considering how the table should be filled, we can start writing code with the equation.

for (int i = 0; i <= stringA.length(); i++) {  // from the top to the bottom
  for (int j = 0; j <= stringB.length(); j++) {  // from left to right
    if (i && j) {  // i > 0 and j > 0
      // (1) delete A[m - 1]
      // (2) insert A[m - 1] to B[n]
      int tmpEditTimes1 = editTimes[i - 1][j] + 1;

      // (1) delete B[n - 1]
      // (2) insert B[n - 1] to A[m]
      int tmpEditTimes2 = editTimes[i][j - 1] + 1;

      // (1) A[m - 1] == B[n - 1]
      // (2) substitution
      int tmpEditTimes3 = editTimes[i - 1][j - 1] + 
        int(!(stringA[i - 1] == stringB[j - 1]));

      // find out the smallest edit distance
      editTimes[i][j] = min(
        tmpEditTimes1, 
        tmpEditTimes2, 
        tmpEditTimes3);
    }
    else {  // i = 0 or j = 0 or both equal 0
      editTimes[i][j] = i == 0 ? j : i;
    }
  }
}
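To inspect the filled table rather than only its last cell, the same loops can be wrapped in a function that returns the whole table. This is a sketch under the assumption that a std::vector replaces the global array; buildEditTable is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Fills the (m + 1) x (n + 1) table top to bottom, left to right,
// and returns it so that every cell can be inspected.
std::vector<std::vector<int>> buildEditTable(const std::string& a,
                                             const std::string& b) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    std::vector<std::vector<int>> table(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) {
        for (std::size_t j = 0; j <= n; ++j) {
            if (i == 0 || j == 0) {
                // first row and first column: length of the non-empty prefix
                table[i][j] = static_cast<int>(i == 0 ? j : i);
            } else {
                int notIdentical = (a[i - 1] != b[j - 1]) ? 1 : 0;
                table[i][j] = std::min({table[i - 1][j] + 1,
                                        table[i][j - 1] + 1,
                                        table[i - 1][j - 1] + notIdentical});
            }
        }
    }
    return table;
}
```

For "fxpimu" and "xwrs" the table has 7 rows and 5 columns, its first row is 0 1 2 3 4, and its bottom-right cell holds the answer, 5.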

5. Show Me the Code

#include <iostream>
#include <string>
using namespace std;

int editTimes[2001][2001];

int min(int a, int b);
int min(int a, int b, int c);

int main(void) {
    // receive string A
    string stringA;
    getline(cin, stringA);

    // receive string B
    string stringB;
    getline(cin, stringB);

    // fill the table
    for (int i = 0; i <= stringA.length(); i++) {  // from the top to the bottom
        for (int j = 0; j <= stringB.length(); j++) {  // from left to right
            if (i && j) {  // i > 0 and j > 0
                // (1) delete A[m - 1]
                // (2) insert A[m - 1] to B[n]
                int tmpEditTimes1 = editTimes[i - 1][j] + 1;

                // (1) delete B[n - 1]
                // (2) insert B[n - 1] to A[m]
                int tmpEditTimes2 = editTimes[i][j - 1] + 1;

                // (1) A[m - 1] == B[n - 1]
                // (2) substitution
                int tmpEditTimes3 = editTimes[i - 1][j - 1] +
                    int(!(stringA[i - 1] == stringB[j - 1]));

                // find out the smallest edit distance
                editTimes[i][j] = min(
                    tmpEditTimes1,
                    tmpEditTimes2,
                    tmpEditTimes3);
            }
            else {  // i = 0 or j = 0 or both equal 0
                editTimes[i][j] = i == 0 ? j : i;
            }
        }
    }

    // display the minimum edit distance
    cout << editTimes[stringA.length()][stringB.length()];

    return 0;
}

int min(int a, int b) {
    return a < b ? a : b;
}

int min(int a, int b, int c) {
    return min(min(a, b), c);
}

6. T(n) and S(n) / Time and Space Complexity Analysis (with the analysis process)

The code that fills the table consists of two nested loops, and every statement inside the loops takes \(O(1)\) time. Thus the time complexity is
\[T(m, n) = O(m) * O(n) = O(mn)\]
where \(m\) and \(n\) are the lengths of the two strings, respectively.

We used a 2-dimensional array whose size is \((m + 1) * (n + 1)\). Thus the space complexity is
\[S(m, n) = O(m) * O(n) = O(mn)\]
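As a side note that is not part of the original solution: each row of the table depends only on the row directly above it, so the space can be reduced to two rows if only the final distance is needed. The sketch below assumes that setting; editDistanceTwoRows is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Keeps only the previous row and the current row of the table,
// which uses O(n) extra space instead of O(mn).
int editDistanceTwoRows(const std::string& a, const std::string& b) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    std::vector<int> prev(n + 1), curr(n + 1);
    for (std::size_t j = 0; j <= n; ++j) {
        prev[j] = static_cast<int>(j);  // row 0: edit[0][j] = j
    }
    for (std::size_t i = 1; i <= m; ++i) {
        curr[0] = static_cast<int>(i);  // column 0: edit[i][0] = i
        for (std::size_t j = 1; j <= n; ++j) {
            int notIdentical = (a[i - 1] != b[j - 1]) ? 1 : 0;
            curr[j] = std::min({prev[j] + 1,       // edit[i - 1][j] + 1
                                curr[j - 1] + 1,   // edit[i][j - 1] + 1
                                prev[j - 1] + notIdentical});
        }
        std::swap(prev, curr);  // the current row becomes the previous row
    }
    return prev[n];  // after the last swap, prev holds row m
}
```

The time complexity stays \(O(mn)\); only the storage changes.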

7. Experience / Reflections (a summary of what was learned and what remains unclear)

To solve the problems of this chapter, we should:

  1. find the optimal substructure
  2. find the overlapping subproblems

Once we have found these two, we can say with confidence that the problem can be solved with dynamic programming.

The three things to determine when solving a dynamic programming problem are:

  1. the dimension of the table
  2. the range to be filled
  3. the filling order

Everything gets simple if we finish the steps above.


reference: The Edit Distance Problem with Dynamic Programming (动态规划之编辑距离问题)
