[SimHash] find the percentage of similarity between two given data

The SimHash algorithm was introduced by Charikar and is patented by Google.

 

SimHash has 5 steps: tokenize, hash, weigh values, merge, dimensionality reduction

  • tokenize

    • tokenize your data and assign a weight to each token; the tokenize function and the weights depend on your business

  • hash (e.g. MD5, SHA-1)

    • calculate each token's hash value and convert it to binary (e.g. 101011)

  • weigh values

    • for each hash value, apply the token's weight w bit by bit: a 1 bit becomes +w and a 0 bit becomes -w, for example (101011) -> (w, -w, w, -w, w, w)

  • merge

    • add up the weighted values of all tokens, position by position, to merge them into a single vector V; for example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4-5 -4+5 4-5 -4+5 4+5), which is (9 -9 1 -1 1 9)

  • dimensionality reduction

    • finally, the signs of the elements of V give the bits of the final fingerprint: a positive element becomes 1 and a negative element becomes 0. For example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so we get 101011 as the fingerprint.
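
The five steps above can be sketched in Python roughly as follows. This is only a minimal illustration under several assumptions that are not from the original post: tokens are produced by a plain whitespace split, each token's weight is its term frequency, and the hash is MD5 truncated to the low 64 bits. The function name `simhash` is hypothetical and is not the API of any particular package.

```python
import hashlib
from collections import Counter


def simhash(text, bits=64):
    """Return a `bits`-wide SimHash fingerprint of `text` as an int."""
    # 1. Tokenize: whitespace split, weight = term frequency (assumption).
    weights = Counter(text.lower().split())

    v = [0] * bits
    for token, w in weights.items():
        # 2. Hash: keep the low `bits` bits of the token's MD5 digest.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        # 3 + 4. Weigh and merge: add w for a 1 bit, subtract w for a 0 bit.
        for i in range(bits):
            v[i] += w if (h >> i) & 1 else -w

    # 5. Dimensionality reduction: keep only the sign of each component.
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Since the fingerprint depends only on the tokens and their weights, the order of tokens in the input does not matter in this sketch. A real tokenizer and weighting scheme would be chosen to fit the data.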

 

How to use SimHash fingerprints?

The Hamming distance between two fingerprints can be used to measure the similarity between two given pieces of data: the smaller the distance, the more similar the data.

Based on my experience, for 64-bit SimHash values with carefully chosen weights, the distance between similar data usually differs appreciably from the distance between dissimilar data.

 

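How to calculate: XOR the two fingerprints. A result bit is 1 only where the two input bits differ, and 0 otherwise; the number of 1 bits in the XOR result is the Hamming distance.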
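
As a small sketch of that calculation, assuming the fingerprints are held as Python integers: the helper names `hamming_distance` and `similarity_percent` are hypothetical, and expressing similarity as the share of matching bits is just one possible way to turn the distance into a percentage, not a method stated in the original post.

```python
def hamming_distance(a, b):
    """Number of bit positions in which fingerprints a and b differ."""
    # XOR leaves a 1 exactly where the two bits differ; count those 1s.
    return bin(a ^ b).count("1")


def similarity_percent(a, b, bits=64):
    """Share of matching bits, as a percentage (illustrative measure)."""
    return 100.0 * (bits - hamming_distance(a, b)) / bits
```

For example, 101011 XOR 100101 is 001110, so `hamming_distance(0b101011, 0b100101)` is 3, and with `bits=6` the two fingerprints match in half of their positions.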

 

simhash 0.1.0 : Python Package Index
