r SpamSumCompare - 原生R中SpamSum Hash比较算法的简化实现。

Posted 2021-05-24

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了r SpamSumCompare - 原生R中SpamSum Hash比较算法的简化实现。相关的知识，希望对你有一定的参考价值。

# SpamSumCompare - A simplified implementation of SpamSum Hash 
# comparison algorithm in native R.
#
# Copyright (C) 2002 Andrew Tridgell <tridge@samba.org>
# Copyright (C) 2006 ManTech International Corporation
# Copyright (C) 2013 Helmut Grohne <helmut@subdivi.de>
# Copyright (C) 2015 Mahmood S. Zargar <mahmood@gmail.com>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
#
# Earlier versions of this code were named fuzzy.c and can be found at:
#     http://www.samba.org/ftp/unpacked/junkcode/spamsum/
#     http://ssdeep.sf.net/
#     http://ssdeep.sourceforge.net/
#
# Sample Signatures: 
# sig1 <- "3:KfIt+ARbgiyLqrJsQhVLgn:KHARfyLQJLQ"
# sig2 <- "3:KfIt0nAwikyLqrJsQhVLiM:KNnAwikyLQJLyM"
#

library(stringdist)
library(stringr)

SPAMSUM_LENGTH <- 64
ROLLING_WINDOW <- 7
MIN_BLOCKSIZE <- 3

# We only accept a match if we have at least one common substring in
# the signature of length ROLLING_WINDOW. This dramatically drops the
# false positive rate for low score thresholds while having
# negligable affect on match rejection.

hasCommonSubstring <- function(h1, h2, l1, l2) {
  c1 <- l1 - ROLLING_WINDOW + 1
  c2 <- l2 - ROLLING_WINDOW + 1

  # Extract all possible 7-grams from the two hashes

  s1 <-
    lapply(1:c1, function(x)
      substr(h1, x, x + ROLLING_WINDOW - 1))
  s2 <-
    lapply(1:c2, function(x)
      substr(h2, x, x + ROLLING_WINDOW - 1))

  # Any exactmatches?

  return(any(s1 %in% s2))
}

# Compares two bare SpamSum hashes

compareHashes <- function(h1, h2, blk) {
  l1 <- str_length(h1)
  l2 <- str_length(h1)

  # There is very little information content is sequences of
  # the same character like 'LLLLL'. Reduces any sequences
  # longer than 3 to a sequence of 3. This is especially
  # important when combined with the hasCommonSubstring() test.

  h1 <- str_replace_all(h1, '(.)\\1{2,}', '\\1\\1\\1')
  h2 <- str_replace_all(h2, '(.)\\1{2,}', '\\1\\1\\1')

  # Check the basic anomalies (too short or too long hashes, etc.)

  if (max(l1, l2) > SPAMSUM_LENGTH | 
      min(l1, l2) < ROLLING_WINDOW |
      blk < MIN_BLOCKSIZE)
    return (0)

  # The two hashes must have at least a common substring
  # of length ROLLING_WINDOW to be candidates */

  if ((h1 == h2) | hasCommonSubstring(h1, h2, l1, l2))
    scoreStrings(h1, h2, l1, l2, blk)
  else
    return (0)
}

# This is the low level string scoring algorithm. It takes two strings
# and scores them on a scale of 0-100 where 0 is a terrible match and
# 100 is a great match. The block_size is used to cope with very short
# texts.

scoreStrings <- function(h1, h2, l1, l2, blk) {

  # Compute the edit distance between the two strings. The edit distance gives
  # us a pretty good idea of how closely related the two strings are.

  if (h1 == h2)
    score <- 0
  else
    score <-
        stringdist(h1, h2, method = "lv", weight = c(d = 0.5, i = 0.5, s = 1)) * 2

  # scale the edit distance by the lengths of the two
  # strings. This changes the score to be a measure of the
  # proportion of the message that has changed rather than an
  # absolute quantity. It also copes with the variability of
  # the string lengths.

  score <- (score * SPAMSUM_LENGTH) / (l1 + l2)

  # At this stage the score occurs roughly on a 0-64 scale,
  # with 0 being a good match and 64 being a complete
  # mismatch.

  score <- (100 * score) / SPAMSUM_LENGTH

  # It is possible to get a score above 100 here, but it is a
  # really terrible match.

  if (score >= 100) return (0)

  # Now re-scale on a 0-100 scale with 0 being a poor match and
  # 100 being a excellent match (friendlier to humans).

  score <- 100 - score

  # When the blocksize is small we don't want to exaggerate the match size.

  if (blk >= (99 + ROLLING_WINDOW) / ROLLING_WINDOW * MIN_BLOCKSIZE)
    return(score)
  if (score > (blk / MIN_BLOCKSIZE) * min(l1, l2))
    score = (blk / MIN_BLOCKSIZE) * min(l1, l2)
  return(score)
}

# Decomposes the signature into its components and saves it
# into a list containing block size (blk), first hash (h1)
# and second hash (h2).

decompSig <- function(sig) {
  ss <- as.list(str_split(sig, ":")[[1]])
  ss[[1]] <- as.integer(ss[[1]])
  names(ss) <- c("blk","h1","h2")
  return(ss)
}

# Compares two decomposed SpamSum signatures

compareSS <- function(ss1, ss2) {

  # If the blocksizes don't match then we are comparing
  # apples to oranges. This isn't an 'error' per se. We could
  # have two valid signatures, but they can't be compared.

  if (
    !(ss1[["blk"]] %in%
      c(ss2[["blk"]] / 2, ss2[["blk"]], ss2[["blk"]] * 2))
    ) return(0)

  # Each signature has a string for two block sizes. We now
  # choose how to combine the two block sizes. We checked above
  # that they have at least one block size in common.

  if (ss1$blk == ss2$blk) {
    max(compareHashes(ss1$h1, ss2$h1, ss1$blk),
        compareHashes(ss1$h2, ss2$h2, ss1$blk * 2))
  } else if (ss1$blk == ss2$blk * 2) {
    compareHashes(ss1$h1, ss2$h2, ss1$blk)
  } else {
    compareHashes(ss1$h2, ss2$h1, ss2$blk)
  }
}

# Compares two SpamSum signatures (strings)

compareSig <- function(sig1, sig2) {
  compareSS(decompSig(sig1), decompSig(sig2))
}

以上是关于r SpamSumCompare - 原生R中SpamSum Hash比较算法的简化实现。的主要内容，如果未能解决你的问题，请参考以下文章

R语言进行数据聚合统计（Aggregating transforms）实战：使用R原生方法data.tabledplyr等方案计算分组均值并添加到可视化结果中

R语言应用calibrate包的textxy函数向R原生绘图结果中添加文本标签：添加多个文本标签改变文本标签的字体改变文本标签的字体颜色

R语言数据热力图绘制实战（基于原生R函数ggplot2包plotly包）

R语言text函数在R原生可视化结果中添加文本标签内容：指定文本显示的位置颜色字体大小等（PositionChange Color Size of Text）

R语言进行dataframe数据左连接（Left join）：使用R原生方法data.tabledplyr等方案

R语言进行dataframe数据内连接（Inner join）：使用R原生方法data.tabledplyr等方案