[SimHash] find the percentage of similarity between two given data
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了[SimHash] find the percentage of similarity between two given data相关的知识,希望对你有一定的参考价值。
SimHash algorithm, introduced by Charikar and is patented by Google.
Simhash 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction
-
tokenize
-
tokenize your data, assign weights to each token, weights and tokenize function are depend on your business
-
-
hash (md5, SHA1)
-
calculate token‘s hash value and convert it to binary (101011 )
-
-
weigh values
-
for each hash value, do hash*w, in this way: (101011 ) -> (w,-w,w,-w,w,w)
-
-
merge
-
add up tokens‘ values, to merge to 1 hash, for example, merge (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) , results to (4+5 -4+-5 -4+5 4+-5 -4+5 4+5),which is (9 -9 1 -1 1)
-
-
Dimensionality Reduction
-
Finally, signs of elements of
V
corresponds to the bits of the final fingerprint, for example (9 -9 1 -1 1) -> (1 0 1 0 1), we get 10101 as the fingerprint.
-
How to use SimHash fingerprints?
Hamming distance can be used to find the similarity between two given data, calculate the Hamming distance between 2 fingerprints.
Based on my experience, for 64 bit SimHash values, with elaborate weight values, distance of similar data
often differ appreciably in magnitude from those unsimilar data.
how to calculate:XOR, 只有两个位不同时结果是1 ,否则为0,两个二进制value“异或”后得到1的个数 为海明距离 。
simhash 0.1.0 : Python Package Index
以上是关于[SimHash] find the percentage of similarity between two given data的主要内容,如果未能解决你的问题,请参考以下文章