Fast partial string matching in R
[Posted]: 2014-08-07 02:44:03
[Question]:
Given a vector of strings texts and a vector of patterns patterns, I want to find any matching pattern for each text.
For small datasets, this is easy to do in R with grepl:
patterns = c("some","pattern","a","horse")
texts = c("this is a text with some pattern", "this is another text with a pattern")
# for each x in patterns
lapply( patterns, function(x) {
  # match all texts against pattern x
  res = grepl( x, texts, fixed=TRUE )
  print(res)
  # do something with the matches
  # ...
})
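This solution is correct, but it does not scale up. Even with moderately larger datasets (~500 texts and patterns), this code is embarrassingly slow, solving only about 100 cases per second on a modern machine. That is ridiculous considering that this is crude partial string matching without regular expressions (set with fixed=TRUE). Even making the lapply parallel does not solve the issue.
Is there a way to rewrite this code efficiently?
Thanks, Mulone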
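[Comments]:
Are your patterns always single words? Are you only interested in whether each element of patterns occurs in one or more elements of texts (or do you need to know which elements of texts they occur in)?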
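The two readings raised in this comment can be sketched in base R on the toy vectors from the question. The names hits, pattern_seen and matched_patterns are illustrative only and do not come from the original post.

```r
patterns <- c("some", "pattern", "a", "horse")
texts <- c("this is a text with some pattern",
           "this is another text with a pattern")

# Logical matrix: one row per text, one column per pattern.
# hits[i, j] is TRUE when patterns[j] occurs somewhere in texts[i].
hits <- vapply(patterns, grepl, logical(length(texts)),
               x = texts, fixed = TRUE)

# Reading 1: does each pattern occur in at least one text?
pattern_seen <- colSums(hits) > 0

# Reading 2: which patterns occur in each individual text?
matched_patterns <- lapply(seq_along(texts),
                           function(i) patterns[hits[i, ]])

print(pattern_seen)
print(matched_patterns)
```

Keeping the full matrix answers both questions from one pass over the data, at the cost of length(texts) x length(patterns) logical values in memory.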
[Answer 1]:
Use the stringi package; it is even faster than grepl. Check the benchmark!
I used the text from @Martin-Morgan's post.
library(stringi)
library(microbenchmark)
# same Shakespeare text file as in @Martin-Morgan's post
text = readLines("~/Desktop/pg100.txt")
pattern <- strsplit("all the world's a stage and all the people players", " ")[[1]]
# base R: fixed (non-regex) matching of each pattern against every line
grepl_fun <- function()
  lapply(pattern, grepl, text, fixed=TRUE)
# stringi: the same fixed matching with stri_detect_fixed
stri_fixed_fun <- function()
  lapply(pattern, function(x) stri_detect_fixed(text, x))
# microbenchmark(grepl_fun(), stri_fixed_fun())
# Unit: milliseconds
#              expr      min       lq   median       uq      max neval
#       grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509   100
#  stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913   100
# if you don't believe me that the results are equal, you can check :)
xx <- grepl_fun()
stri <- stri_fixed_fun()
for(i in seq_along(xx))
  print(all(xx[[i]] == stri[[i]]))
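For readers who do not have the pg100.txt file, the same equivalence check can be sketched on the toy vectors from the question. This is only an illustration: base_res and stri_res are made-up names, and the stringi branch runs only if that package happens to be installed.

```r
patterns <- c("some", "pattern", "a", "horse")
texts <- c("this is a text with some pattern",
           "this is another text with a pattern")

# Base R: one logical vector per pattern, fixed (non-regex) matching.
base_res <- lapply(patterns, grepl, texts, fixed = TRUE)

# stringi equivalent, guarded so the sketch still runs without the package.
if (requireNamespace("stringi", quietly = TRUE)) {
  stri_res <- lapply(patterns,
                     function(p) stringi::stri_detect_fixed(texts, p))
  # Report whether both approaches agree on every pattern/text pair.
  print(identical(base_res, stri_res))
}
```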
[Answer 2]:
Have you accurately characterized your problem and the performance you are seeing? Here are the Complete Works of William Shakespeare and a query against them
text = readLines("~/Downloads/pg100.txt")
pattern <-
strsplit("all the world's a stage and all the people players", " ")[[1]]
This seems to perform much better than you imply?
> length(text)
[1] 124787
> system.time(xx <- lapply(pattern, grepl, text, fixed=TRUE))
   user  system elapsed
  0.444   0.001   0.444
## avoid retaining memory; 500 x 500 case; no blank lines
> text = text[nzchar(text)]
> system.time( for (p in rep(pattern, 50)) grepl(p, text[1:500], fixed=TRUE) )
   user  system elapsed
  0.096   0.000   0.095
We expect linear scaling with the lengths (number of elements) of pattern and text. It seems I misremembered my Shakespeare
> idx = Reduce("+", lapply(pattern, grepl, text, fixed=TRUE))
> range(idx)
[1] 0 7
> sum(idx == 7)
[1] 8
> text[idx == 7]
[1] " And all the men and women merely players;"
[2] " cicatrices to show the people when he shall stand for his place."
[3] " Scandal'd the suppliants for the people, call'd them"
[4] " all power from the people, and to pluck from them their tribunes"
[5] " the fashion, and so berattle the common stages (so they call"
[6] " Which God shall guard; and put the world's whole strength"
[7] " Of all his people and freeze up their zeal,"
[8] " the world's end after my name-call them all Pandars; let all"
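The linear-scaling expectation stated above can be probed with a rough timing sketch on synthetic data. Everything here is an assumption made for illustration: the random strings, the helper names random_string, make_strings and time_matching, and the chosen sizes. The timings will be small and noisy, so treat them as a sanity check rather than a benchmark.

```r
set.seed(1)

# Hypothetical helpers: build `count` random lowercase strings of length `n`.
random_string <- function(n) {
  paste(sample(letters, n, replace = TRUE), collapse = "")
}
make_strings <- function(count, n) {
  vapply(seq_len(count), function(i) random_string(n), character(1))
}

# Time fixed matching of n_patterns patterns against n_texts texts.
time_matching <- function(n_texts, n_patterns) {
  texts <- make_strings(n_texts, 50)
  patterns <- make_strings(n_patterns, 3)
  timing <- system.time(lapply(patterns, grepl, texts, fixed = TRUE))
  c(n_texts = n_texts, n_patterns = n_patterns,
    elapsed = timing[["elapsed"]])
}

# Grow both dimensions together, including the 500 x 500 case.
sizes <- c(250, 500, 1000)
results <- do.call(rbind, lapply(sizes, function(n) time_matching(n, n)))
print(results)
```

If the cost is linear in each dimension separately, doubling both the number of texts and the number of patterns should roughly quadruple the elapsed time once the runs are long enough to measure reliably.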