[SimHash] find the percentage of similarity between two given data

Posted 2020-07-16


The SimHash algorithm was introduced by Charikar and is patented by Google.

tokenize
- tokenize your data and assign a weight to each token; the tokenize function and the weights depend on your business
hash (md5, SHA1)
- calculate each token's hash value and convert it to binary, e.g. (101011)
weigh values
- for each hash value, replace every 1 bit with w and every 0 bit with -w, where w is the token's weight: (101011) -> (w, -w, w, -w, w, w)
merge
- add up the tokens' weighted vectors element by element to merge them into one vector V; for example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4-5 -4+5 4-5 -4+5 4+5), which is (9 -9 1 -1 1 9)
Dimensionality Reduction
- finally, the signs of the elements of V give the bits of the final fingerprint (positive -> 1, negative -> 0); for example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so the fingerprint is 101011
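
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the helper names (`token_bits`, `weigh`, `merge`, `to_fingerprint`, `simhash`) are made up for this example, MD5 truncated to its low 64 bits is just one possible hash choice, and mapping a sum of exactly 0 to bit 0 is an assumed convention.

```python
import hashlib


def token_bits(token, nbits=64):
    """Hash a token with MD5 and keep the low `nbits` bits as a bit string."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    h = int.from_bytes(digest, "big") & ((1 << nbits) - 1)
    return format(h, "0{}b".format(nbits))


def weigh(bits, w):
    """Turn a bit string into a weighted vector: 1 -> +w, 0 -> -w."""
    return [w if b == "1" else -w for b in bits]


def merge(vectors):
    """Add the weighted vectors element by element."""
    return [sum(column) for column in zip(*vectors)]


def to_fingerprint(v):
    """Keep only the signs: positive -> 1, otherwise 0 (assumed convention for 0)."""
    return "".join("1" if x > 0 else "0" for x in v)


def simhash(weighted_tokens, nbits=64):
    """weighted_tokens: iterable of (token, weight) pairs chosen by your tokenizer."""
    vectors = [weigh(token_bits(t, nbits), w) for t, w in weighted_tokens]
    return to_fingerprint(merge(vectors))


# The worked example from the steps above: hashes 100101 (w=4) and 101011 (w=5).
v = merge([weigh("100101", 4), weigh("101011", 5)])
print(v)                  # [9, -9, 1, -1, 1, 9]
print(to_fingerprint(v))  # 101011
```

With real data you would call something like `simhash([("hello", 2), ("world", 1)])`; how the tokens and weights are produced is left to your own business logic.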

The Hamming distance between two fingerprints can be used to measure how similar the underlying data are: the smaller the distance, the more similar the data.

Based on my experience, for 64-bit SimHash values with carefully chosen weights, the distances between similar data often differ appreciably in magnitude from the distances between dissimilar data.

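How to calculate: use XOR. A result bit is 1 only when the two input bits differ, and 0 otherwise, so the number of 1 bits after XOR-ing the two binary values is the Hamming distance.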
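
As a small sketch of that calculation, assuming the fingerprints are held as Python integers (a bit string can be converted with `int(s, 2)`). The function names are hypothetical, and the percentage formula is only one common way to turn a distance into a similarity score; it is an assumption, not something prescribed by SimHash itself.

```python
def hamming_distance(a, b):
    """Count the bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity_percent(a, b, nbits=64):
    """Assumed convention: share of matching bits out of nbits, as a percentage."""
    return 100.0 * (nbits - hamming_distance(a, b)) / nbits


# 101011 XOR 100101 = 001110, which has three 1 bits.
print(hamming_distance(0b101011, 0b100101))             # 3
print(similarity_percent(0b101011, 0b100101, nbits=6))  # 50.0
```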
