Longest common substring in R finding non-contiguous matches between the two strings

Posted 2023-02-14

【Title】: longest common substring in R finding non-contiguous matches between the two strings 【Posted】: 2022-01-17 06:27:26 【Question】:

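I have a question about finding the longest common substring in R. While searching through some posts on ***, I came across the qualV package. However, I see that the LCS function in this package actually finds all characters of string1 that are present in string2, even if they are not contiguous.

To explain, if the strings are string1: "hello" and string2: "hel12345lo", I expect the output to be hel, but the output I get is hello. I must be doing something wrong. Please see my code below.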

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

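I also tried the Rlibstree approach, but I still get non-contiguous substrings. In addition, the length of the substring is not what I expected. Please see below.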

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21
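The behaviour described above comes down to the difference between a contiguous substring and a subsequence. A minimal base R sketch of that difference follows; is_subsequence is a hypothetical helper written for illustration only and is not part of qualV or Rlibstree.

```r
# Hypothetical helper: TRUE if the characters of x appear in y in the same
# order, with gaps allowed (i.e. x is a subsequence of y).
is_subsequence <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  i <- 1
  for (ch in ys) {
    # advance through x each time its next character is found in y
    if (i <= length(xs) && ch == xs[i]) i <- i + 1
  }
  i > length(xs)
}

a <- "hello"
b <- "hel123l5678o"

is_subsequence(a, b)            # "hello" with gaps allowed
grepl(a, b, fixed = TRUE)       # "hello" as one contiguous run
grepl("hel", b, fixed = TRUE)   # "hel" as one contiguous run
```

With these inputs the first check should succeed and the second should fail, which matches a subsequence-based function reporting "hello" where "hel" was expected.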

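【Comments】:

Related question: ***.com/q/16196327/602276

@Andrie, I tried the Rlibstree approach from the link. However, I still get non-contiguous substrings. The length of the matching substring is also off. I have added the information as an edit to my original post above. Please take a look.

To clarify: qualV's LCS function does not find the longest common substring, it finds the longest common subsequence, hence the result you get. That is the definition of a subsequence. The problems are related but have entirely different solutions, and the longest common subsequence problem is the more classical problem in computer science, and therefore the one more often implemented.

【Answer 1】: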

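I'm not sure what you did to get the "hello" output. Based on the trial-and-error experiments below, the LCS function appears to (a) not treat a string as an LCS if a character follows what would otherwise be the LCS; (b) find multiple LCSs of equal length (unlike sub(), which finds only the first); (c) not care about the order of the elements within the strings (not illustrated below); (d) not care about the order of the strings in the LCS call (also not shown).

So your a of "hello" has no LCS in b, because the "hel" of b is followed by a character. Well, that is my current hypothesis.

Point A above: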

a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...

a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it

a= c("hello", "hel", "abcd") 
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character

Point B above:

a= c("hello", "hel", "abcd") 
b= c("hello", "abcd") 
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all

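【Comments】:

Sorry lawyeR, I did not fully understand. I am looking for a function that takes two strings as arguments and returns the longest substring common to both. After reading the post above, I am a bit confused.

I was explaining what LCS can and cannot do.

@lawyeR, oh okay! But just to clarify, is there a better way to find the longest common substring between the two?

【Answer 2】:

Here are three possible solutions.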

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"
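Note that the snippet above only considers substrings of b that start at its first character. As a hedged base R sketch of the same idea extended to every start position, the hypothetical function lcs_bruteforce below (not from stringi or stringdist) tries candidate substrings of a from longest to shortest and returns the first length at which any of them occurs in b. It is a brute-force approach and only sensible for short strings.

```r
# Hypothetical brute-force helper: all longest substrings of `a` that also
# occur contiguously in `b`. Returns character(0) if nothing is shared.
lcs_bruteforce <- function(a, b) {
  n <- nchar(a)
  for (len in rev(seq_len(n))) {            # try the longest candidates first
    starts <- seq_len(n - len + 1)
    cand <- substring(a, starts, starts + len - 1)
    found <- vapply(cand, grepl, logical(1), x = b, fixed = TRUE)
    if (any(found)) return(unique(cand[found]))
  }
  character(0)
}

lcs_bruteforce("hello", "hel123l5678o")
# [1] "hel"
lcs_bruteforce("hello", "123hell5678o")
# [1] "hell"
```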

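In base R there are also adist() and agrep(), and the stringdist package has some functions that run the LCS method. Take a look at stringdist(). It returns the number of unpaired characters.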

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0

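Now that I have explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE, we get a sequence of matches, insertions, etc. So if we feed that to stri_locate(), we can use the resulting matrix to get the matches from a to b.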

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos"))
# [1] "MMMIIIMIIIIM"

So the M values denote straight matches. We can go ahead and get the substrings with stri_sub():

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"

Sorry, I have not explained this very well, as I am not well versed in string distance algorithms.
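For readers without stringi, the same extraction of "M" runs can be sketched in base R with gregexpr() and substring(). The trafos string is copied by hand from the output shown above rather than recomputed, so the result does not depend on how adist() breaks ties between equally cheap alignments. The positions index into b here only because this alignment contains no deletions; that is an assumption of this sketch, not a general property.

```r
b <- "hel123l5678o"
ta <- "MMMIIIMIIIIM"   # copied from the adist() output above

m <- gregexpr("M+", ta)[[1]]            # start position of each run of M
starts <- as.integer(m)
lens <- attr(m, "match.length")         # length of each run

runs <- substring(b, starts, starts + lens - 1)
runs
# [1] "hel" "l"   "o"

runs[which.max(nchar(runs))]            # longest contiguous match
# [1] "hel"
```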

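【Comments】:

While this works for short strings, it is rather inefficient (I don't even know the asymptotic performance ... O(n^3) maybe?), and there are more efficient solutions to this problem.

Well, I am not sure about the performance. I received a comment from the OP on one of my other answers asking for help here, so I thought I would try to help.

@KonradRudolph - I played around with adist(). It looks like that may be the way to go here.

For reference, identical(stri_sub(a, 1, 1:nchar(a)), substring(a,1,1:nchar(a)))

@Vaibhav en.wikipedia.org/wiki/Longest_common_substring_problem describes an efficient solution; unfortunately, I don't think an R implementation exists.

【Answer 3】: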

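Leveraging @RichardScriven's insight that adist could be used (it computes the "approximate string distance"), I made a more comprehensive function. Note that "trafos" stands for the "transformations" used to determine the "distance" between the two strings (example at the bottom).

EDIT: This answer may produce incorrect/unexpected results; as @wdkrnls pointed out:

I ran your function against "apple" and "big apple bagels" and it returned "appl". I would have expected "apple".

See the explanation below regarding the incorrect result. We start with a function that gets the longest_string in a list: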

longest_string <- function(s)return(s[which.max(nchar(s))])

Then we can use @RichardScriven's work and the stringi library:

library(stringi)
lcsbstr <- function(a, b) {
  # locate every run of "M" (match) in the alignment string
  sbstr_locations <- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
  cmn_sbstr <- stri_sub(longest_string(c(a, b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
  return(longest_cmn_sbstr)
}

Or we can rewrite our code to avoid using any external libraries (still using R's native adist function):

lcsbstr_no_lib <- function(a, b) {
    # start positions of every run of "M" (match) in the alignment string
    matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]]
    lengths <- attr(matches, 'match.length')
    which_longest <- which.max(lengths)
    index_longest <- matches[which_longest]
    length_longest <- lengths[which_longest]
    longest_cmn_sbstr <- substring(longest_string(c(a, b)), index_longest, index_longest + length_longest - 1)
    return(longest_cmn_sbstr)
}

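Both of the above functions identify only 'hello ' as the longest common substring, rather than 'hello r' (regardless of which argument is the longer of the two):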

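# identical() compares only two objects, so collect the results and compare each one
results <- c(lcsbstr_no_lib('hello', 'hello there'),
             lcsbstr(       'hello', 'hello there'),
             lcsbstr_no_lib('hello there', 'hello'),
             lcsbstr(       'hello there', 'hello'))
all(results == 'hello')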

FINAL EDIT: Note some odd behavior with this result:

lcsbstr('hello world', 'hello')
#[1] 'hell'

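I expected 'hello', but because the transformation effectively moves (via deletion) the "o" in world to become the "o" in hello, only the hell part is considered a match according to the M's: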

drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1]  vvvv   v
#[1] "hello world"

This behavior can be observed using this Levenshtein tool; it gives two possible solutions, equivalent to these two transformations:

#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"

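I don't know whether we can configure adist to prefer one solution over the other. (The transformations have the same "weight", that is, the same number of "M" and "D"; I don't know how to select the transformation with the greater number of consecutive M.)

Finally, don't forget that adist lets you pass in ignore.case = TRUE (FALSE is the default).

Key to the "trafos" attribute of adist, the "transformations" to get from one string to another:

the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements M, I, D and S indicating a match, insertion, deletion and substitution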
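Because the edit-distance alignment is not guaranteed to keep the longest run of matches together, an approach that does not go through adist at all may be more reliable. The following is a minimal base R sketch of the standard dynamic-programming method for the longest common substring (the one described in the Wikipedia article mentioned in the comments). The function name longest_common_substring is hypothetical, and the code has not been tuned for long strings.

```r
# Hypothetical helper: dynamic-programming longest common substring.
# cur[j + 1] holds the length of the common run ending at x[i] and y[j].
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  prev <- integer(length(y) + 1)          # run lengths for the previous row
  best_len <- 0L
  best_end <- 0L                          # end position of the best run in a
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        cur[j + 1] <- prev[j] + 1L        # extend the run ending at (i-1, j-1)
        if (cur[j + 1] > best_len) {
          best_len <- cur[j + 1]
          best_end <- i
        }
      }
    }
    prev <- cur
  }
  if (best_len == 0L) return("")
  substr(a, best_end - best_len + 1L, best_end)
}

longest_common_substring("hello world", "hello")
# [1] "hello"
longest_common_substring("apple", "big apple bagels")
# [1] "apple"
```

When several substrings tie for the longest length, this sketch returns the one that ends first in a. It runs in time proportional to nchar(a) * nchar(b).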

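【Comments】:

To add to your solution: if you know which string, a or b, you want to pick the LCS from, you can add grep to the function and use 'longest_cmn_sbstr' as the argument to return the full string.

I ran your function against "apple" and "big apple bagels" and it returned "appl". I would have expected "apple".

Yes @wdkrnls, I agree that my solution is not correct for "longest": it relies on Levenshtein, which may identify a different solution involving "deletions" (see the edit to my answer). That is why you get "appl"; it is the same reason I get this result: lcsbstr('hello world', 'hello') #[1] 'hell'. Maybe I could modify my regex so that I look not only for consecutive "M" but also for "M" (matches) that span a "D" (deletion).

【Answer 4】: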

df <- data.frame(A. = c("Australia", "Network"),
                 B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)

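 auxFun <- function(x) {
   # split both strings into single characters
   a <- strsplit(x[[1]], "")[[1]]
   b <- strsplit(x[[2]], "")[[1]]
   # index of the last character before the first mismatch
   lastchar <- suppressWarnings(which(!(a == b)))[1] - 1

   if (lastchar > 0) {
     out <- paste0(a[1:lastchar], collapse = "")
   } else {
     out <- ""
   }

   return(out)
 }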

 df$C. <- apply(df, 1, auxFun)

 df
 A.         B.    C.
 1 Australia    Austria Austr
 2   Network Netconnect   Net
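One caveat with auxFun: when no mismatch is found (for example when both strings are identical), which() returns an empty vector, lastchar becomes NA, and the if() condition fails. A hedged base R sketch of a common-prefix helper that handles this case is shown below; common_prefix is a hypothetical name, and like the answer above it only looks at the start of both strings.

```r
# Hypothetical helper: longest prefix shared by a and b.
common_prefix <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- min(length(x), length(y))          # compare only the overlapping part
  if (n == 0) return("")
  same <- x[seq_len(n)] == y[seq_len(n)]
  # number of leading characters that agree
  k <- if (all(same)) n else which(!same)[1] - 1
  substr(a, 1, k)
}

common_prefix("Australia", "Austria")
# [1] "Austr"
common_prefix("Network", "Netconnect")
# [1] "Net"
common_prefix("hello", "hello")
# [1] "hello"
```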

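【Comments】:

This works when the substring starts at the beginning of both strings, but it fails if the substring occurs somewhere in the middle of a string.

Yes, you are right. However, if you consider substrings that occur in the middle of a string, you can get multiple outputs per pair. And the code can be modified to get the first string that matches somewhere in the string.

【Answer 5】:

Using Biostrings: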

library(Biostrings)
a= "hello"
b="hel123l5678o"
astr= BString(a)
bstr=BString(b)

pmatchPattern(astr, bstr)

  Views on a 12-letter BString subject
Subject: hel123l5678o
views:
      start end width
  [1]     1   3     3 [hel]
  Views on a 5-letter BString pattern
Pattern: hello
views:
      start end width
  [1]     1   3     3 [hel]

So I ran a benchmark, and while my answer does do the job and actually gives you more information, it is about 500 times slower than @Rich Scriven's, lol.

system.time({
  a <- "hello"
  b <- "123hell5678o"
  rounds <- 100
  for (i in 1:rounds) {
    astr <- BString(a)
    bstr <- BString(b)
    pmatchPattern(astr, bstr)
  }
})

system.time({
  c <- "hello"
  d <- "123hell5678o"
  rounds <- 100
  for (i in 1:rounds) {
    ta <- drop(attr(adist(c, d, counts=TRUE), "trafos"))
    stri_sub(d, stri_locate_all_regex(ta, "M+")[[1]])
  }
})

   user  system elapsed 
  2.476   0.027   2.510 

   user  system elapsed 
  0.006   0.000   0.005
