Find best match for multiple substrings across multiple candidates

Posted 2023-02-22

【Posted】: 2020-04-12 10:36:38 【Question】:

I have the following example data:

targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")

Desired output:

I want to get sdassder as the output, because it contains the most matches of targets (as substrings).

What I tried:

x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))

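Goal:

As you can see, I found some dirty code that technically yields the result, but I don't think it is best practice. I hope this question fits here; otherwise I will move it to Code Review.

I tried mapply, do.call and outer, but did not arrive at better code.

Edit:

Adding another option myself after seeing the current answers.

Using pipes:

library(magrittr)  # provides %>%
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]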

【Answer 1】:

I think you can simplify this a bit.

matches <- sapply(targets, grepl, candidates)
matches
#        der   das
# [1,]  TRUE  TRUE
# [2,]  TRUE FALSE
# [3,] FALSE FALSE

Then use rowSums to find the number of matches per candidate:

rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"

(Note that the last step does not account for ties: which.max returns only the first maximum.)
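To make the tie behaviour concrete, here is a minimal base-R sketch. The `candidates_tie` vector is made-up data for illustration and is not part of the question. It contrasts `which.max()`, which keeps only the first maximum, with a comparison against `max()`, which keeps every tied candidate.

```r
targets <- c("der", "das")
# Hypothetical data: the first two candidates each contain both targets.
candidates_tie <- c("sdassder", "xdasder", "fongs")

# One column per target, one row per candidate; rowSums counts hits per candidate.
counts <- rowSums(sapply(targets, grepl, candidates_tie))

# which.max() reports only the first maximum, so the tie is hidden.
first_best <- candidates_tie[which.max(counts)]

# Comparing against max() keeps every candidate that shares the top count.
all_best <- candidates_tie[counts == max(counts)]
```

Whether a tie should return all candidates or just the first one depends on the use case; the second form simply makes the choice explicit.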

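If you want to see the individual matches for each candidate, you can always apply the names manually, though this is purely cosmetic and adds little to the work itself.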

rownames(matches) <- candidates
matches
#            der   das
# sdassder  TRUE  TRUE
# sderf     TRUE FALSE
# fongs    FALSE FALSE
rowSums(matches)
# sdassder    sderf    fongs 
#        2        1        0 
which.max(rowSums(matches))
# sdassder 
#        1        <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"
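If this lookup is needed repeatedly, the steps above could be wrapped in a small function. The sketch below is one possible shape, not something from the answer: `count_targets` is a hypothetical name, and it assumes the targets are literal substrings, so it passes `fixed = TRUE` to `grepl()` to stop characters such as `.` from being read as regular-expression syntax. It sums the per-target logical vectors directly, which also works when there is only one candidate (a case where `sapply()` would return a plain vector and `rowSums()` would fail).

```r
# Hypothetical helper: number of distinct targets found in each candidate.
count_targets <- function(targets, candidates) {
  # One logical vector per target; fixed = TRUE matches literal substrings.
  hits <- lapply(targets, grepl, x = candidates, fixed = TRUE)
  # Element-wise sum of the logical vectors gives the per-candidate count.
  as.integer(Reduce(`+`, hits))
}

targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")

counts <- count_targets(targets, candidates)
best <- candidates[which.max(counts)]

# A single candidate still works, and "." is matched literally.
single <- count_targets(targets, "sdassder")
literal <- count_targets("a.c", c("abc", "a.c"))
```

Drop `fixed = TRUE` if the targets are meant to be regular expressions.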

【Answer 2】:

One stringr option could be:

library(stringr)
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
# [1] "sdassder"

【Answer 3】:

We can paste the targets together to create a single pattern to match.

library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"

Use it in str_count to count the number of times the pattern matches in each candidate.

str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0

Get the index of the maximum value and use it to subset the original candidates:

candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"
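One caveat with a combined pattern is that it counts every occurrence, so a candidate that repeats a single target can outrank one that contains all of them. The sketch below illustrates the difference on made-up data (`candidates_rep` is hypothetical). To stay dependency-free it uses base R's `gregexpr()` and `regmatches()` as a stand-in for the occurrence count, on the assumption that this matches what `str_count` reports for such a pattern.

```r
targets <- c("der", "das")
# Hypothetical data: "der" repeated three times versus both targets once each.
candidates_rep <- c("derderder", "dasder")

pattern <- paste(targets, collapse = "|")

# Occurrence count: number of non-overlapping matches of "der|das".
occurrences <- lengths(regmatches(candidates_rep, gregexpr(pattern, candidates_rep)))

# Distinct count: how many of the targets appear at least once.
distinct <- rowSums(sapply(targets, grepl, candidates_rep))

by_occurrences <- candidates_rep[which.max(occurrences)]
by_distinct <- candidates_rep[which.max(distinct)]
```

For the data in the question the two counts happen to coincide, so either approach returns "sdassder" there.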
