Find a needle in the haystack (A First Look at ShingleCloud)

Posted by 平原上的维克多

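Chinese speakers use the phrase "fishing a needle out of the sea" (大海捞针) for something that is almost impossible to find; by coincidence, English speakers describe the same frustration as looking for "a needle in a haystack". In natural language processing, string matching over large-scale data is exactly that kind of search.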

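In 1993, Michael J. Wise proposed the Greedy String Tiling (GST) algorithm [1]. Finding an optimal tiling of two strings is computationally hard, so GST takes a greedy approach that gives up some result quality in exchange for efficiency. Even so, its worst-case complexity is still O(n³).

ShingleCloud was developed later at the Multimedia Communications Lab (KOM) of TU Darmstadt as an algorithm with linear time complexity. It performs string matching on top of the N-gram [2] model: it builds a hash table over the n-grams of the haystack so that each n-gram of the needle can be looked up quickly, even on very large inputs. This addresses the high time complexity of the Greedy String Tiling algorithm [1].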
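The hash-table idea can be illustrated with a few lines of plain Java. This is only a minimal sketch of n-gram lookup, not ShingleCloud's actual implementation: the class `NGramIndexSketch` and its methods are hypothetical names, and it assumes word-level n-grams split on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the ShingleCloud library.
class NGramIndexSketch {

    /** Splits text on whitespace and joins every n consecutive words into one n-gram. */
    static List<String> wordNGrams(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    /** Maps each haystack n-gram to the n-gram positions at which it occurs. */
    static Map<String, List<Integer>> buildIndex(String haystack, int n) {
        Map<String, List<Integer>> index = new HashMap<>();
        List<String> grams = wordNGrams(haystack, n);
        for (int i = 0; i < grams.size(); i++) {
            index.computeIfAbsent(grams.get(i), k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    /** Returns the needle n-grams that also occur in the haystack, in needle order. */
    static List<String> matchedNGrams(String haystack, String needle, int n) {
        Map<String, List<Integer>> index = buildIndex(haystack, n);
        List<String> matched = new ArrayList<>();
        for (String gram : wordNGrams(needle, n)) {
            if (index.containsKey(gram)) {   // expected O(1) hash lookup per n-gram
                matched.add(gram);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String haystack = "the quick brown fox jumps over the lazy dog";
        String needle = "a quick brown fox saw the lazy cat";
        // prints [quick brown, brown fox, the lazy]
        System.out.println(matchedNGrams(haystack, needle, 2));
    }
}
```

Building the index takes one pass over the haystack and matching takes one pass over the needle, which is where the roughly linear running time comes from.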

Using ShingleCloud from Java:

String haystack = "This is some text that you want to search through";
String needle = "some text you want to find";

/* preparing the match object */
ShingleCloud sc = new ShingleCloud(haystack);
sc.setNGramSize(2);
sc.setMinimumNumberOfOnesInMatch(1);

sc.setSortMatchesByRating(false);

/* searching for the needle */
sc.match(needle);

/* displaying results */
System.out.println("ShingleCloud found " + sc.getMatches().size() + " match(es).");

ShingleCloudMatch firstMatch = sc.getMatches().get(0);
System.out.println("The first match consists of " + firstMatch.getNumberOfMatchedShingles() + " shingle(s).");
System.out.println("Its indirect rating is " + firstMatch.getIndirectRating());
System.out.println("The matching shingles were: " + firstMatch.getMatchedShingles());

ShingleCloudMatch secondMatch = sc.getMatches().get(1);
System.out.println("The second match consists of " + secondMatch.getNumberOfMatchedShingles() + " shingle(s).");
System.out.println("Its indirect rating is " + secondMatch.getIndirectRating());
System.out.println("The matching shingles were: " + secondMatch.getMatchedShingles());

This would produce the following output:

ShingleCloud found 2 match(es).
The first match consists of 1 shingle(s).
Its indirect rating is 0.2
The matching shingles were: some text
The second match consists of 2 shingle(s).
Its indirect rating is 0.4
The matching shingles were: you want to
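The ratings 0.2 and 0.4 are consistent with a simple reading: the needle contains five word bigrams, the first match covers one of them and the second covers two. The sketch below reproduces that arithmetic in plain Java. It is an assumption inferred from the numbers, not the library's documented formula; `RatingSketch` is a hypothetical class, and for simplicity it groups n-grams that are adjacent in the needle without checking that they are also adjacent in the haystack.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of the ShingleCloud library.
class RatingSketch {

    /** Splits text on whitespace and joins every n consecutive words into one n-gram. */
    static List<String> wordNGrams(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    /** Lengths of the maximal runs of consecutive needle n-grams that occur in the haystack. */
    static List<Integer> runLengths(String haystack, String needle, int n) {
        Set<String> index = new HashSet<>(wordNGrams(haystack, n));
        List<Integer> runs = new ArrayList<>();
        int current = 0;
        for (String gram : wordNGrams(needle, n)) {
            if (index.contains(gram)) {
                current++;               // extend the current run
            } else if (current > 0) {
                runs.add(current);       // a miss closes the run
                current = 0;
            }
        }
        if (current > 0) {
            runs.add(current);           // run that reaches the end of the needle
        }
        return runs;
    }

    public static void main(String[] args) {
        String haystack = "This is some text that you want to search through";
        String needle = "some text you want to find";
        int total = wordNGrams(needle, 2).size();   // 5 bigrams in the needle
        for (int run : runLengths(haystack, needle, 2)) {
            // prints "1 shingle(s), rating 0.2" and then "2 shingle(s), rating 0.4"
            System.out.println(run + " shingle(s), rating " + (double) run / total);
        }
    }
}
```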

Using ShingleCloud from the command line

To run ShingleCloud from the command line, use the following command:

java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.ShingleCloudCli -h

This displays a help message describing all available parameters. For example, to compare two files with an n-gram size of 2, use the following:

java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.ShingleCloudCli --needle=needle.txt --haystack=haystack.txt --ngram=2

You can also directly pass in the search strings and constrain the output to only one containment measure:

java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.ShingleCloudCli --needledirect="This is the needle that i want to find in my haystack" --haystackdirect="My haystack hopefully contains the needle that i am looking for" --ngram=2 --output=containmentneedle

This simply prints 0.333333. If you add the option minimum1=1, meaning that a match is counted even if it consists of a single matching n-gram, the containment becomes 0.5.
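One reading that reproduces both numbers is sketched below in plain Java. It rests on three assumptions inferred only from the output, none of them confirmed by the ShingleCloud documentation: comparison ignores case, containment is the share of needle words covered by accepted matches, and a match needs at least two matching n-grams unless minimum1=1 is given. `ContainmentSketch` and its `minShingles` parameter are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of the ShingleCloud library.
class ContainmentSketch {

    /** Share of needle words covered by runs of at least minShingles matching n-grams. */
    static double containment(String haystack, String needle, int n, int minShingles) {
        String[] h = haystack.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String[] w = needle.toLowerCase(Locale.ROOT).trim().split("\\s+");

        // Hash set of all haystack n-grams.
        Set<String> index = new HashSet<>();
        for (int i = 0; i + n <= h.length; i++) {
            index.add(String.join(" ", Arrays.copyOfRange(h, i, i + n)));
        }

        boolean[] covered = new boolean[w.length];
        int shingles = w.length - n + 1;   // number of needle n-grams
        int runStart = -1;
        // The extra iteration i == shingles acts as a sentinel miss that closes the last run.
        for (int i = 0; i <= shingles; i++) {
            boolean hit = i < shingles
                    && index.contains(String.join(" ", Arrays.copyOfRange(w, i, i + n)));
            if (hit && runStart < 0) {
                runStart = i;
            } else if (!hit && runStart >= 0) {
                if (i - runStart >= minShingles) {
                    // n-grams runStart..i-1 span words runStart..i+n-2 (end index is exclusive)
                    Arrays.fill(covered, runStart, i - 1 + n, true);
                }
                runStart = -1;
            }
        }

        int count = 0;
        for (boolean c : covered) {
            if (c) count++;
        }
        return (double) count / w.length;
    }

    public static void main(String[] args) {
        String needle = "This is the needle that i want to find in my haystack";
        String haystack = "My haystack hopefully contains the needle that i am looking for";
        // prints 0.333333 (only the run "the needle that i" is accepted: 4 of 12 words)
        System.out.println(String.format(Locale.ROOT, "%.6f", containment(haystack, needle, 2, 2)));
        // prints 0.500000 (the single bigram "my haystack" now counts too: 6 of 12 words)
        System.out.println(String.format(Locale.ROOT, "%.6f", containment(haystack, needle, 2, 1)));
    }
}
```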

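An approach to document duplicate detection with ShingleCloud

  1. Use Tika to read the text content of files in different formats (txt, doc, docx, pdf, html, etc.) and encodings, and convert it into text that can be processed uniformly;
  2. Use HanLP to preprocess and segment the text;
  3. Use the ShingleCloud algorithm to compute the similarity between texts;
  4. Sort by similarity and output the comparison results.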
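Steps 3 and 4 can be sketched in plain Java as follows. The sketch assumes that text extraction and word segmentation (steps 1 and 2) have already produced token lists, and it stands in a simple n-gram overlap ratio for the ShingleCloud similarity. The class `DuplicateRankingSketch`, its methods and the sample documents are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; a stand-in for steps 3 and 4, not ShingleCloud itself.
class DuplicateRankingSketch {

    /** Set of n-grams over an already segmented token list. */
    static Set<String> shingles(List<String> tokens, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    /** Fraction of the query's distinct n-grams that also occur in the candidate. */
    static double similarity(List<String> query, List<String> candidate, int n) {
        Set<String> q = shingles(query, n);
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> c = shingles(candidate, n);
        long shared = q.stream().filter(c::contains).count();
        return (double) shared / q.size();
    }

    /** Document names ordered from most to least similar to the query. */
    static List<String> rank(List<String> query, Map<String, List<String>> corpus, int n) {
        List<String> names = new ArrayList<>(corpus.keySet());
        names.sort(Comparator
                .comparingDouble((String name) -> similarity(query, corpus.get(name), n))
                .reversed());
        return names;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        Map<String, List<String>> corpus = new LinkedHashMap<>();
        corpus.put("docC", Arrays.asList("completely", "different", "text"));
        corpus.put("docB", Arrays.asList("a", "dog", "sat", "on", "the", "mat"));
        corpus.put("docA", Arrays.asList("the", "cat", "sat", "on", "the", "sofa"));
        for (String name : rank(query, corpus, 2)) {
            // prints docA 0.8, then docB 0.6, then docC 0.0
            System.out.println(name + " " + similarity(query, corpus.get(name), 2));
        }
    }
}
```

In a real pipeline the token lists would come from the segmenter, and the overlap ratio would be replaced by the rating or containment value reported by ShingleCloud.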

[1] Michael J. Wise. "String Similarity via Greedy String Tiling and Running Karp-Rabin Matching." (1993)
[2] Daniel Jurafsky, James H. Martin. "N-gram language models." Speech and Language Processing.
[3] A comparison of similar-data detection algorithms (shingle, SimHash, Bloom filter). CSDN
