Similarity String Comparison in Java

Posted 2023-02-16


【Title】: Similarity String Comparison in Java 【Posted】: 2010-10-31 14:31:09 【Question】:

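I want to compare several strings with each other and find the ones that are most similar. I was wondering whether there is any library, method, or best practice that would tell me which strings are more similar to other strings. For example: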

"The quick fox jumped" -> "The fox jumped"
"The quick fox jumped" -> "The fox"

This comparison would report that the first pair is more similar than the second.

I guess I need some method such as:

double similarityIndex(String s1, String s2)

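Is there something like this somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when the values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries on the system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.

【Comments】:

【Answer 1】: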

You can also use the Z algorithm to find similarity within a string. See https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/
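As a rough illustration of what the Z algorithm computes, here is a minimal sketch. The class and method names (`ZSimilarity`, `zArray`, `suffixSimilaritySum`) are made up for this example and are not taken from the linked article. Note that this measures how much a string resembles its own suffixes, which is a different notion of similarity from comparing two separate strings as the question asks.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class ZSimilarity {

  /** z[i] = length of the longest common prefix of s and s.substring(i). */
  static int[] zArray(String s) {
    int n = s.length();
    int[] z = new int[n];
    if (n == 0) return z;
    z[0] = n;
    // [left, right) is the rightmost segment found so far that matches a prefix of s
    int left = 0, right = 0;
    for (int i = 1; i < n; i++) {
      if (i < right) {
        // Reuse the value already computed for the mirrored position
        z[i] = Math.min(right - i, z[i - left]);
      }
      while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
        z[i]++;
      }
      if (i + z[i] > right) {
        left = i;
        right = i + z[i];
      }
    }
    return z;
  }

  /** Sum of the common-prefix lengths between s and each of its suffixes. */
  static long suffixSimilaritySum(String s) {
    long sum = 0;
    for (int v : zArray(s)) sum += v;
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(zArray("ababaa"))); // [6, 0, 3, 0, 1, 1]
    System.out.println(suffixSimilaritySum("ababaa"));     // 11
  }
}
```

The Z array is built in linear time because each position reuses the matches already discovered inside the `[left, right)` window.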

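【Comments】:

【Answer 2】:

Yes, there are many well-documented algorithms, such as:

- Cosine similarity
- Jaccard similarity
- Dice's coefficient
- Matching similarity
- Overlap similarity
- etc.

A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it links to the Internet Archive).

Also check these projects:

- Simmetrics
- jtmt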
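To make two of the listed measures concrete, here is a minimal sketch of Jaccard similarity and Dice's coefficient computed over sets of character bigrams. The class name `SetSimilarity`, the choice of bigrams as tokens, and the convention that two strings without any bigrams score 1.0 are assumptions for this example; libraries such as Simmetrics may tokenize and normalize differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class SetSimilarity {

  /** Splits a string into its set of overlapping two-character substrings. */
  static Set<String> bigrams(String s) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + 2 <= s.length(); i++) {
      grams.add(s.substring(i, i + 2));
    }
    return grams;
  }

  static int intersectionSize(Set<String> a, Set<String> b) {
    Set<String> common = new HashSet<>(a);
    common.retainAll(b);
    return common.size();
  }

  /** Jaccard similarity: |A intersect B| / |A union B|. */
  static double jaccard(String s1, String s2) {
    Set<String> a = bigrams(s1), b = bigrams(s2);
    if (a.isEmpty() && b.isEmpty()) return 1.0; // assumed convention
    int common = intersectionSize(a, b);
    return common / (double) (a.size() + b.size() - common);
  }

  /** Dice's coefficient: 2 * |A intersect B| / (|A| + |B|). */
  static double dice(String s1, String s2) {
    Set<String> a = bigrams(s1), b = bigrams(s2);
    if (a.isEmpty() && b.isEmpty()) return 1.0; // assumed convention
    return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
  }

  public static void main(String[] args) {
    // "night" and "nacht" share only the bigram "ht" out of 4 bigrams each
    System.out.println(jaccard("night", "nacht")); // 1/7
    System.out.println(dice("night", "nacht"));    // 0.25
  }
}
```

Set-based measures like these ignore character order beyond the bigram level, so they tend to be more forgiving of reordered words than an edit distance is.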

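【Comments】:

+1 The simmetrics website no longer seems to be active. However, I found the code on sourceforge: sourceforge.net/projects/simmetrics Thanks for the pointer.

The "you can check this" link is broken. That is why Michael Merchant posted the correct link above.

The simmetrics jar on sourceforge is a bit outdated; github.com/mpkorstanje/simmetrics is the updated GitHub page, with Maven artifacts.

To add to @MichaelMerchant's comment, the project is also available on GitHub. It is not very active there either, but it is somewhat more recent than the sourceforge version.

【Answer 3】: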

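A common way of calculating the similarity between two strings as a value from 0% to 100%, as used in many libraries, is to measure how much (in %) you would have to change the longer string to turn it into the shorter one: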

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below

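Computing `editDistance()`:

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a particular scenario better. The most common is the Levenshtein distance algorithm, which we will use in the example below (for very large strings, other algorithms are likely to perform better).

Here are two options for calculating the edit distance:

- Use Apache Commons Text's implementation of the Levenshtein distance: apply(CharSequence left, CharSequence right)
- Implement it yourself. A sample implementation is shown below.

Working example: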

See online demo here.

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  // Note: the comparison is case-insensitive because both inputs are lowercased first.
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    // costs[j] holds the distance between the current prefix of s1 and the first j chars of s2
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0) {
          costs[j] = j;
        } else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            }
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0) {
        costs[s2.length()] = lastValue;
      }
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }
}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
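To connect this score back to the original question (picking the most similar string from several candidates), here is a minimal sketch. The class `BestMatch` and the method `mostSimilar` are hypothetical names introduced for this example. It uses its own compact two-row Levenshtein implementation so that it is self-contained and, unlike the code above, it compares case-sensitively.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class BestMatch {

  /** Plain Levenshtein distance using two rolling rows (case-sensitive). */
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev; prev = curr; curr = tmp; // the finished row becomes "prev"
    }
    return prev[b.length()];
  }

  /** Same normalization as above: 1.0 means identical, 0.0 means nothing in common. */
  static double similarity(String s1, String s2) {
    int longest = Math.max(s1.length(), s2.length());
    if (longest == 0) return 1.0;
    return (longest - levenshtein(s1, s2)) / (double) longest;
  }

  /** Returns the candidate most similar to query, or null if there are no candidates. */
  static String mostSimilar(String query, List<String> candidates) {
    String best = null;
    double bestScore = -1.0;
    for (String candidate : candidates) {
      double score = similarity(query, candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("The fox", "The fox jumped", "A slow dog");
    System.out.println(mostSimilar("The quick fox jumped", candidates)); // The fox jumped
  }
}
```

For the semi-automated matching described in the question, it would probably make sense to also return the score, so that low-confidence matches can be flagged for manual review.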

【Comments】:

The Levenshtein distance method is available in org.apache.commons.lang3.StringUtils.

@Cleakod Now it is part of commons-text: commons.apache.org/proper/commons-text/javadocs/api-release/org/…

【Answer 4】:

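You can use the Apache Commons Java library to achieve this. Take a look at these two functions in it:

- getLevenshteinDistance
- getFuzzyDistance

【Comments】:

As of October 2017, the linked methods are deprecated. Use the LevenshteinDistance and FuzzyScore classes from the commons text library instead.

【Answer 5】:

There are indeed a lot of string similarity measures out there: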

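- Levenshtein edit distance;
- Damerau-Levenshtein distance;
- Jaro-Winkler similarity;
- Longest Common Subsequence edit distance;
- Q-Gram (Ukkonen);
- n-Gram distance (Kondrak);
- Jaccard index;
- Sorensen-Dice coefficient;
- Cosine similarity;
- ...

You can find explanations and Java implementations of these here: https://github.com/tdebatty/java-string-similarity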
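To show how one of these measures differs from plain Levenshtein, here is a minimal sketch of the optimal string alignment distance, a restricted variant of Damerau-Levenshtein that also counts a swap of two adjacent characters as a single edit. The class name `OsaDistance` is made up for this example, and the code is written from the textbook recurrence rather than copied from the linked library.

```java
// Hypothetical helper class, for illustration only.
class OsaDistance {

  /**
   * Optimal string alignment distance: insertions, deletions, substitutions,
   * and transpositions of two adjacent characters each cost 1.
   */
  static int distance(String a, String b) {
    int n = a.length(), m = b.length();
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) d[i][0] = i;
    for (int j = 0; j <= m; j++) d[0][j] = j;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
            d[i - 1][j - 1] + cost);
        // Adjacent transposition: "ab" <-> "ba"
        if (i > 1 && j > 1
            && a.charAt(i - 1) == b.charAt(j - 2)
            && a.charAt(i - 2) == b.charAt(j - 1)) {
          d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
        }
      }
    }
    return d[n][m];
  }

  public static void main(String[] args) {
    // Plain Levenshtein gives 2 here; counting the "89" -> "98" swap as one edit gives 1
    System.out.println(distance("1234567890", "1234567980")); // 1
    System.out.println(distance("kitten", "sitting"));        // 3
  }
}
```

This variant can be a better fit for typing errors, where swapped neighbouring characters are common.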

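【Comments】:

【Answer 6】:

Thanks to the first answerer. I think computeEditDistance(s1, s2) gets calculated twice there. Because it is expensive in time, I decided to improve the performance of the code. So: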

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    // Any extra command-line arguments are compared pairwise
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }
}
}

【Comments】:

【Answer 7】:

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    // array[i][j] = distance between the first i chars of this and the first j chars of s2
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

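【Comments】:

【Answer 8】:

If your strings turn into documents, this sounds like a plagiarism finder to me. Maybe searching with that term will turn up something good.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is written in Python, but it is clean and easy to port.

【Comments】:

【Answer 9】:

This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.

【Comments】:

【Answer 10】:

In theory, you can compare edit distances.

【Comments】:

【Answer 11】:

You could use the Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance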

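【Comments】:

Levenshtein is great for a small number of strings, but it does not scale to comparisons between a large number of strings.

I have used Levenshtein in Java with some success. I have not run comparisons over huge lists, so there may be a performance hit. It is also a bit simplistic and could use some tweaking to raise the threshold for shorter words (like 3 or 4 characters), which tend to be seen as more similar than they should be (it is only 3 edits from cat to dog). Note that the edit distance suggested in the other answers is pretty much the same thing: Levenshtein is a particular implementation of edit distance.

This article shows how to combine Levenshtein with an efficient SQL query: literatejava.com/sql/fuzzy-string-search-sql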
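One cheap way to reduce the number of full Levenshtein computations when comparing against many strings is to filter by length first: the edit distance between two strings can never be smaller than the difference of their lengths. The sketch below illustrates that idea; the class `LengthFilter` and its method are hypothetical names, and this is not taken from the linked article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class LengthFilter {

  /**
   * Keeps only the candidates that could still be within maxDistance edits of
   * the query. Anything whose length differs by more than maxDistance is
   * dropped without computing the actual edit distance.
   */
  static List<String> withinLengthBound(String query, List<String> candidates, int maxDistance) {
    List<String> kept = new ArrayList<>();
    for (String candidate : candidates) {
      if (Math.abs(candidate.length() - query.length()) <= maxDistance) {
        kept.add(candidate);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("dog", "elephant", "ca");
    // "elephant" is 5 characters longer than "cat", so it cannot be within 1 edit
    System.out.println(withinLengthBound("cat", candidates, 1)); // [dog, ca]
  }
}
```

The survivors still need a real distance computation ("dog" passes the filter but is 3 edits away from "cat"), so this only trims the candidate list.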
