r SpamSumCompare - 原生R中SpamSum Hash比较算法的简化实现。
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了r SpamSumCompare - 原生R中SpamSum Hash比较算法的简化实现。相关的知识,希望对你有一定的参考价值。
# SpamSumCompare - A simplified implementation of SpamSum Hash
# comparison algorithm in native R.
#
# Copyright (C) 2002 Andrew Tridgell <tridge@samba.org>
# Copyright (C) 2006 ManTech International Corporation
# Copyright (C) 2013 Helmut Grohne <helmut@subdivi.de>
# Copyright (C) 2015 Mahmood S. Zargar <mahmood@gmail.com>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
#
# Earlier versions of this code were named fuzzy.c and can be found at:
# http://www.samba.org/ftp/unpacked/junkcode/spamsum/
# http://ssdeep.sf.net/
# http://ssdeep.sourceforge.net/
#
# Sample Signatures:
# sig1 <- "3:KfIt+ARbgiyLqrJsQhVLgn:KHARfyLQJLQ"
# sig2 <- "3:KfIt0nAwikyLqrJsQhVLiM:KNnAwikyLQJLyM"
#
library(stringdist)
library(stringr)
SPAMSUM_LENGTH <- 64
ROLLING_WINDOW <- 7
MIN_BLOCKSIZE <- 3
# We only accept a match if we have at least one common substring in
# the signature of length ROLLING_WINDOW. This dramatically drops the
# false positive rate for low score thresholds while having
# negligable affect on match rejection.
hasCommonSubstring <- function(h1, h2, l1, l2) {
c1 <- l1 - ROLLING_WINDOW + 1
c2 <- l2 - ROLLING_WINDOW + 1
# Extract all possible 7-grams from the two hashes
s1 <-
lapply(1:c1, function(x)
substr(h1, x, x + ROLLING_WINDOW - 1))
s2 <-
lapply(1:c2, function(x)
substr(h2, x, x + ROLLING_WINDOW - 1))
# Any exactmatches?
return(any(s1 %in% s2))
}
# Compares two bare SpamSum hashes
compareHashes <- function(h1, h2, blk) {
l1 <- str_length(h1)
l2 <- str_length(h1)
# There is very little information content is sequences of
# the same character like 'LLLLL'. Reduces any sequences
# longer than 3 to a sequence of 3. This is especially
# important when combined with the hasCommonSubstring() test.
h1 <- str_replace_all(h1, '(.)\\1{2,}', '\\1\\1\\1')
h2 <- str_replace_all(h2, '(.)\\1{2,}', '\\1\\1\\1')
# Check the basic anomalies (too short or too long hashes, etc.)
if (max(l1, l2) > SPAMSUM_LENGTH |
min(l1, l2) < ROLLING_WINDOW |
blk < MIN_BLOCKSIZE)
return (0)
# The two hashes must have at least a common substring
# of length ROLLING_WINDOW to be candidates */
if ((h1 == h2) | hasCommonSubstring(h1, h2, l1, l2))
scoreStrings(h1, h2, l1, l2, blk)
else
return (0)
}
# This is the low level string scoring algorithm. It takes two strings
# and scores them on a scale of 0-100 where 0 is a terrible match and
# 100 is a great match. The block_size is used to cope with very short
# texts.
scoreStrings <- function(h1, h2, l1, l2, blk) {
# Compute the edit distance between the two strings. The edit distance gives
# us a pretty good idea of how closely related the two strings are.
if (h1 == h2)
score <- 0
else
score <-
stringdist(h1, h2, method = "lv", weight = c(d = 0.5, i = 0.5, s = 1)) * 2
# scale the edit distance by the lengths of the two
# strings. This changes the score to be a measure of the
# proportion of the message that has changed rather than an
# absolute quantity. It also copes with the variability of
# the string lengths.
score <- (score * SPAMSUM_LENGTH) / (l1 + l2)
# At this stage the score occurs roughly on a 0-64 scale,
# with 0 being a good match and 64 being a complete
# mismatch.
score <- (100 * score) / SPAMSUM_LENGTH
# It is possible to get a score above 100 here, but it is a
# really terrible match.
if (score >= 100) return (0)
# Now re-scale on a 0-100 scale with 0 being a poor match and
# 100 being a excellent match (friendlier to humans).
score <- 100 - score
# When the blocksize is small we don't want to exaggerate the match size.
if (blk >= (99 + ROLLING_WINDOW) / ROLLING_WINDOW * MIN_BLOCKSIZE)
return(score)
if (score > (blk / MIN_BLOCKSIZE) * min(l1, l2))
score = (blk / MIN_BLOCKSIZE) * min(l1, l2)
return(score)
}
# Decomposes the signature into its components and saves it
# into a list containing block size (blk), first hash (h1)
# and second hash (h2).
decompSig <- function(sig) {
ss <- as.list(str_split(sig, ":")[[1]])
ss[[1]] <- as.integer(ss[[1]])
names(ss) <- c("blk","h1","h2")
return(ss)
}
# Compares two decomposed SpamSum signatures
compareSS <- function(ss1, ss2) {
# If the blocksizes don't match then we are comparing
# apples to oranges. This isn't an 'error' per se. We could
# have two valid signatures, but they can't be compared.
if (
!(ss1[["blk"]] %in%
c(ss2[["blk"]] / 2, ss2[["blk"]], ss2[["blk"]] * 2))
) return(0)
# Each signature has a string for two block sizes. We now
# choose how to combine the two block sizes. We checked above
# that they have at least one block size in common.
if (ss1$blk == ss2$blk) {
max(compareHashes(ss1$h1, ss2$h1, ss1$blk),
compareHashes(ss1$h2, ss2$h2, ss1$blk * 2))
} else if (ss1$blk == ss2$blk * 2) {
compareHashes(ss1$h1, ss2$h2, ss1$blk)
} else {
compareHashes(ss1$h2, ss2$h1, ss2$blk)
}
}
# Compares two SpamSum signatures (strings)
compareSig <- function(sig1, sig2) {
compareSS(decompSig(sig1), decompSig(sig2))
}
以上是关于r SpamSumCompare - 原生R中SpamSum Hash比较算法的简化实现。的主要内容,如果未能解决你的问题,请参考以下文章
R语言进行数据聚合统计(Aggregating transforms)实战:使用R原生方法data.tabledplyr等方案计算分组均值并添加到可视化结果中
R语言应用calibrate包的textxy函数向R原生绘图结果中添加文本标签:添加多个文本标签改变文本标签的字体改变文本标签的字体颜色
R语言数据热力图绘制实战(基于原生R函数ggplot2包plotly包)
R语言text函数在R原生可视化结果中添加文本标签内容:指定文本显示的位置颜色字体大小等(PositionChange Color Size of Text)