Damerau-Levenshtein distance implementation

[Title]: Damerau-Levenshtein distance implementation [Posted]: 2014-04-14 00:27:59 [Question]:

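I'm trying to create a Damerau-Levenshtein distance function in JS.

I found a description of the algorithm on Wikipedia, but I could not find an implementation there. It says:

To devise a proper algorithm to calculate unrestricted Damerau-Levenshtein distance, note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: $O(M \cdot N \cdot \max(M, N))$, where M and N are string lengths. Using the ideas of Lowrance and Wagner,[7] this naive algorithm can be improved to be $O(M \cdot N)$ in the worst case. It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of [1] for an example of such an adaptation.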

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

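The section of [1] points to http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf, which is even more complicated for me.

If I understand this correctly, creating an implementation is not easy.

Here is the Levenshtein implementation I currently use: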

var levenshtein = function (s1, s2) {
  //       discuss at: http://phpjs.org/functions/levenshtein/
  //      original by: Carlos R. L. Rodrigues (http://www.jsfromhell.com)
  //      bugfixed by: Onno Marsman
  //       revised by: Andrea Giammarchi (http://webreflection.blogspot.com)
  // reimplemented by: Brett Zamir (http://brett-zamir.me)
  // reimplemented by: Alexander M Beedie
  //        example 1: levenshtein('Kevin van Zonneveld', 'Kevin van Sommeveld');
  //        returns 1: 3

  if (s1 == s2) {
    return 0;
  }

  var s1_len = s1.length;
  var s2_len = s2.length;
  if (s1_len === 0) {
    return s2_len;
  }
  if (s2_len === 0) {
    return s1_len;
  }

  // BEGIN STATIC
  var split = false;
  try {
    split = !('0')[0];
  } catch (e) {
    // Earlier IE may not support access by string index
    split = true;
  }
  // END STATIC
  if (split) {
    s1 = s1.split('');
    s2 = s2.split('');
  }

  // v0 is the previous row of the distance matrix, v1 the row being filled
  var v0 = new Array(s1_len + 1);
  var v1 = new Array(s1_len + 1);

  var s1_idx = 0,
    s2_idx = 0,
    cost = 0;
  for (s1_idx = 0; s1_idx < s1_len + 1; s1_idx++) {
    v0[s1_idx] = s1_idx;
  }
  var char_s1 = '',
    char_s2 = '';
  for (s2_idx = 1; s2_idx <= s2_len; s2_idx++) {
    v1[0] = s2_idx;
    char_s2 = s2[s2_idx - 1];

    for (s1_idx = 0; s1_idx < s1_len; s1_idx++) {
      char_s1 = s1[s1_idx];
      cost = (char_s1 == char_s2) ? 0 : 1;
      var m_min = v0[s1_idx + 1] + 1;
      var b = v1[s1_idx] + 1;
      var c = v0[s1_idx] + cost;
      if (b < m_min) {
        m_min = b;
      }
      if (c < m_min) {
        m_min = c;
      }
      v1[s1_idx + 1] = m_min;
    }
    // swap the two rows
    var v_tmp = v0;
    v0 = v1;
    v1 = v_tmp;
  }
  return v0[s1_len];
};

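Do you have any ideas for building such an algorithm? And if you think it is too complicated, what could I do so that 'l' (lowercase L) and 'I' (uppercase i) are treated as the same character?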
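One possible way to approach the 'l' versus 'I' part of the question is to map confusable characters to a single representative before measuring the distance. The sketch below is only an illustration under assumptions: the `CONFUSABLE` table and the names `normalizeConfusables`, `plainLevenshtein` and `confusableDistance` are hypothetical and not taken from the thread, and the table entries are guesses rather than measured OCR errors.

```javascript
// Hypothetical table of characters an OCR engine is assumed to confuse.
// Each key is mapped to one canonical representative.
var CONFUSABLE = { 'I': 'l', '1': 'l', '|': 'l', 'O': '0' };

// Replace every confusable character with its canonical representative.
function normalizeConfusables(s) {
  var out = '';
  for (var i = 0; i < s.length; i++) {
    var ch = s.charAt(i);
    out += CONFUSABLE.hasOwnProperty(ch) ? CONFUSABLE[ch] : ch;
  }
  return out;
}

// Plain Levenshtein distance (two-row version), included only so that
// this sketch is self-contained.
function plainLevenshtein(a, b) {
  var prev = [], cur, i, j;
  for (j = 0; j <= b.length; j++) {
    prev[j] = j;
  }
  for (i = 1; i <= a.length; i++) {
    cur = [i];
    for (j = 1; j <= b.length; j++) {
      var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Distance after both strings have been normalized.
function confusableDistance(a, b) {
  return plainLevenshtein(normalizeConfusables(a), normalizeConfusables(b));
}

console.log(plainLevenshtein('bill', 'biIl'));   // 'I' counts as a substitution
console.log(confusableDistance('bill', 'biIl')); // 'I' and 'l' compare equal
```

Normalizing first is all-or-nothing: confusable characters become identical. If a small but non-zero penalty is preferred, a weighted substitution cost is the more flexible option.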

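[Comments]:

What exactly are you trying to do? You refer to the (so-called) Damerau-Levenshtein distance, but your code contains the classic Levenshtein algorithm. If you only need transposition support, only a few lines of code have to change. As for having "no difference" between given characters, you have to assign a given penalty to specific edit operations. This can be handled with table lookups and insertions, but it may be too slow in JavaScript to be useful for anything.

Yes, my code only uses the simple Levenshtein algorithm. I have a database of screenNames (300 screenNames) and an OCR scanner that scans a list of screenNames (300 screenNames). But the OCR scanner gives bad results, so I want to find similarities (that's what I currently do in JS). For example, "mikejew_e" is read as "mikeiew e". I'm currently using the Levenshtein algorithm with a maximum distance of 3, but that is a bit too permissive. (With a distance of 2, I may miss some matching screen names.)

OK, I see. The basic step to get better results is to assign a specific weight to each edit operation. Make the default penalty 1.0, and lower the penalty for characters the OCR program is likely to misread. For 300 names this will be fast enough, even in JavaScript. I just posted a gist of my C implementation: gist.github.com/doukremt/9473228. That may give you an idea of how this is done. I also remember there being a pure JavaScript implementation on GitHub, but I can't find the link any more, sorry!

Great, I'll convert it to JS. Exactly what I was looking for.

[Answer 1]: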

The gist given by @doukremt: https://gist.github.com/doukremt/9473228

gives the following in JavaScript.

You can change the weights of the operations in the weighter object.

var levenshteinWeighted = function (seq1, seq2) {

    var len1 = seq1.length;
    var len2 = seq2.length;
    var i, j;
    var dist;
    var ic, dc, rc;
    var last, old, column;

    // cost of each edit operation; change these to tune the distance
    var weighter = {
        insert: function (c) { return 1.0; },
        delete: function (c) { return 0.5; },
        replace: function (c, d) { return 0.3; }
    };

    /* don't swap the sequences, or this is gonna be painful */
    if (len1 == 0 || len2 == 0) {
        dist = 0;
        while (len1)
            dist += weighter.delete(seq1[--len1]);
        while (len2)
            dist += weighter.insert(seq2[--len2]);
        return dist;
    }

    column = []; // malloc((len2 + 1) * sizeof(double));
    //if (!column) return -1;

    column[0] = 0;
    for (j = 1; j <= len2; ++j)
        column[j] = column[j - 1] + weighter.insert(seq2[j - 1]);

    for (i = 1; i <= len1; ++i) {
        last = column[0]; /* m[i-1][0] */
        column[0] += weighter.delete(seq1[i - 1]); /* m[i][0] */
        for (j = 1; j <= len2; ++j) {
            old = column[j];
            if (seq1[i - 1] == seq2[j - 1]) {
                column[j] = last; /* m[i-1][j-1] */
            } else {
                ic = column[j - 1] + weighter.insert(seq2[j - 1]);      /* m[i][j-1] */
                dc = column[j] + weighter.delete(seq1[i - 1]);          /* m[i-1][j] */
                rc = last + weighter.replace(seq1[i - 1], seq2[j - 1]); /* m[i-1][j-1] */
                /* take the cheapest of the three; the nested ternary used before
                   returned ic whenever ic < dc, even when rc was smaller */
                column[j] = Math.min(ic, dc, rc);
            }
            last = old;
        }
    }

    dist = column[len2];
    return dist;
};
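
The comments above suggest keeping a default penalty of 1.0 and lowering it for characters the OCR program is likely to misread. A minimal sketch of such a substitution cost follows. The name `ocrReplaceCost`, the `OCR_GROUPS` table and the value 0.1 are illustrative assumptions, not part of the gist; the groups are guessed from the "mikejew_e" / "mikeiew e" example and the 'l' / 'I' case in the question.

```javascript
// Hypothetical groups of characters assumed to be confused by the OCR engine.
// Two characters from the same group are cheap to substitute for each other.
var OCR_GROUPS = ['lI1|', 'ij', '_ '];

// Substitution cost with the same (c, d) signature as weighter.replace.
function ocrReplaceCost(c, d) {
  for (var g = 0; g < OCR_GROUPS.length; g++) {
    var group = OCR_GROUPS[g];
    if (group.indexOf(c) !== -1 && group.indexOf(d) !== -1) {
      return 0.1; // assumed low penalty for a likely OCR misread
    }
  }
  return 1.0; // default penalty for any other substitution
}

console.log(ocrReplaceCost('l', 'I')); // low penalty
console.log(ocrReplaceCost('a', 'b')); // default penalty
```

A function like this could be assigned to the `replace` entry of the weighter object. In that case it probably makes sense to raise the insert and delete weights to 1.0 as well, so that a cheap substitution is not undercut by a delete followed by an insert.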

[Comments]:

This looks weighted, but it does not handle transpositions, so it is not Damerau; correct me if I'm wrong.

[Answer 2]:

Stolen from here, with formatting and a few usage examples:

function DamerauLevenshtein(prices, damerau) {
  //'prices' customisation of the edit costs by passing an object with optional 'insert', 'remove', 'substitute', and
  //'transpose' keys, corresponding to either a constant number, or a function that returns the cost. The default cost
  //for each operation is 1. The price functions take relevant character(s) as arguments, should return numbers, and
  //have the following form:
  //
  //insert: function (inserted) { return NUMBER; }
  //
  //remove: function (removed) { return NUMBER; }
  //
  //substitute: function (from, to) { return NUMBER; }
  //
  //transpose: function (backward, forward) { return NUMBER; }
  //
  //The damerau flag allows us to turn off transposition and only do plain Levenshtein distance.

  if (damerau !== false) {
    damerau = true;
  }
  if (!prices) {
    prices = {};
  }
  let insert, remove, substitute, transpose;

  switch (typeof prices.insert) {
    case 'function':
      insert = prices.insert;
      break;
    case 'number':
      insert = function (c) {
        return prices.insert;
      };
      break;
    default:
      insert = function (c) {
        return 1;
      };
      break;
  }

  switch (typeof prices.remove) {
    case 'function':
      remove = prices.remove;
      break;
    case 'number':
      remove = function (c) {
        return prices.remove;
      };
      break;
    default:
      remove = function (c) {
        return 1;
      };
      break;
  }

  switch (typeof prices.substitute) {
    case 'function':
      substitute = prices.substitute;
      break;
    case 'number':
      substitute = function (from, to) {
        return prices.substitute;
      };
      break;
    default:
      substitute = function (from, to) {
        return 1;
      };
      break;
  }

  switch (typeof prices.transpose) {
    case 'function':
      transpose = prices.transpose;
      break;
    case 'number':
      transpose = function (backward, forward) {
        return prices.transpose;
      };
      break;
    default:
      transpose = function (backward, forward) {
        return 1;
      };
      break;
  }

  function distance(down, across) {
    //http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
    //Note: only adjacent transpositions are considered below, which is the restricted
    //("optimal string alignment") variant described on that page.
    let ds = [];
    if (down === across) {
      return 0;
    } else {
      down = down.split('');
      down.unshift(null);
      across = across.split('');
      across.unshift(null);
      down.forEach(function (d, i) {
        if (!ds[i]) {
          ds[i] = [];
        }
        across.forEach(function (a, j) {
          if (i === 0 && j === 0) {
            ds[i][j] = 0;
          } else if (i === 0) {
            //Empty down (i == 0) -> across[1..j] by inserting
            ds[i][j] = ds[i][j - 1] + insert(a);
          } else if (j === 0) {
            //Down -> empty across (j == 0) by deleting
            ds[i][j] = ds[i - 1][j] + remove(d);
          } else {
            //Find the least costly operation that turns the prefix down[1..i] into the prefix across[1..j] using
            //already calculated costs for getting to shorter matches.
            ds[i][j] = Math.min(
              //Cost of editing down[1..i-1] to across[1..j] plus cost of deleting
              //down[i] to get to down[1..i-1].
              ds[i - 1][j] + remove(d),
              //Cost of editing down[1..i] to across[1..j-1] plus cost of inserting across[j] to get to across[1..j].
              ds[i][j - 1] + insert(a),
              //Cost of editing down[1..i-1] to across[1..j-1] plus cost of substituting down[i] (d) with across[j]
              //(a) to get to across[1..j].
              ds[i - 1][j - 1] + (d === a ? 0 : substitute(d, a))
            );
            //Can we match the last two letters of down with across by transposing them? Cost of getting from
            //down[i-2] to across[j-2] plus cost of moving down[i-1] forward and down[i] backward to match
            //across[j-1..j].
            if (damerau && i > 1 && j > 1 && down[i - 1] === a && d === across[j - 1]) {
              ds[i][j] = Math.min(ds[i][j], ds[i - 2][j - 2] + (d === a ? 0 : transpose(d, down[i - 1])));
            }
          }
        });
      });
      return ds[down.length - 1][across.length - 1];
    }
  }
  return distance;
}

//Returns a distance function to call.
let dl = DamerauLevenshtein();
console.log(dl('12345', '23451'));
console.log(dl('this is a test', 'this is not a test'));
console.log(dl('testing testing 123', 'test'));
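
The transposition check in the answer above only looks at the two most recent characters, so it never edits a transposed pair again. For the unrestricted distance that the Wikipedia passage in the question describes, a sketch along the lines of the pseudocode on that page might look as follows. It assumes unit costs for every operation, and the names `damerauLevenshteinUnrestricted`, `lastRow` and `lastCol` are my own, not from the thread.

```javascript
// Unrestricted Damerau-Levenshtein distance with unit costs, O(M * N).
// The matrix d is shifted by one in both directions:
// d[i + 1][j + 1] holds the distance between a[0..i) and b[0..j).
function damerauLevenshteinUnrestricted(a, b) {
  var maxDist = a.length + b.length;
  var lastRow = {}; // last 1-based position in a at which each character was seen
  var d = [];
  var i, j;

  for (i = 0; i <= a.length + 1; i++) {
    d[i] = [];
  }
  // Border row and column act as sentinels that are never the minimum.
  d[0][0] = maxDist;
  for (i = 0; i <= a.length; i++) {
    d[i + 1][0] = maxDist;
    d[i + 1][1] = i;
  }
  for (j = 0; j <= b.length; j++) {
    d[0][j + 1] = maxDist;
    d[1][j + 1] = j;
  }

  for (i = 1; i <= a.length; i++) {
    var lastCol = 0; // last column in this row where the characters matched
    for (j = 1; j <= b.length; j++) {
      var k = lastRow[b.charAt(j - 1)] || 0;
      var l = lastCol;
      var cost = 1;
      if (a.charAt(i - 1) === b.charAt(j - 1)) {
        cost = 0;
        lastCol = j;
      }
      d[i + 1][j + 1] = Math.min(
        d[i][j] + cost,                          // substitution or match
        d[i + 1][j] + 1,                         // insertion
        d[i][j + 1] + 1,                         // deletion
        d[k][l] + (i - k - 1) + 1 + (j - l - 1)  // transposition across a gap
      );
    }
    lastRow[a.charAt(i - 1)] = i;
  }
  return d[a.length + 1][b.length + 1];
}

console.log(damerauLevenshteinUnrestricted('ca', 'abc')); // 2
console.log(damerauLevenshteinUnrestricted('ab', 'ba'));  // 1
```

The pair 'ca' and 'abc' is the usual example where the two variants differ: the unrestricted distance is 2 (transpose, then insert 'b' between the letters), while the restricted variant cannot edit the transposed pair again and needs 3 edits.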
