Efficiently counting occurrences of a word list in a column using R

Posted 2023-02-19


【Title】: Efficiently counting occurrences of a word list in a column using R 【Posted】: 2020-04-30 05:27:18 【Question】:

If I have a list of words, how can I efficiently count how many times those words occur in a dataset?

An example:

set.seed(123) 
df_data <- data.frame(   
   data_strings = sample(c("tom smith", "smith jim", "sam sam", "ted", "xxx"), 10, replace = TRUE)
)

df_names <- data.frame(
   names = c("tom", "jim", "sam", "ted", "yyy")
)

That is:

> df_data
   data_strings
1       sam sam
2       sam sam
3     smith jim
4     smith jim
5       sam sam
6           xxx
7           ted
8     tom smith
9     smith jim
10      sam sam

and

> df_names
  names
1   tom
2   jim
3   sam
4   ted
5   yyy

I can do this with str_count from the stringr package:

library(stringr)
library(tictoc)
tic()
df_data$counts <- as.vector(sapply(
  paste(df_names[,"names"], collapse='|'), 
  str_count, 
  string=df_data$data_strings
))
toc()

This produces the expected result:

> df_data
   data_strings counts
1       sam sam      2
2       sam sam      2
3     smith jim      1
4     smith jim      1
5       sam sam      2
6           xxx      0
7           ted      1
8     tom smith      1
9     smith jim      1
10      sam sam      2

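However, my real data has millions of rows, and my word list also has millions of entries, so this turns out to be a very inefficient way to get the result. How can I speed it up? I tried using more cores with the parallel package, but it finishes in the same time (it only uses one core even though I told it to use several). I'm on Windows, so I can't test mclapply(). parallel itself seems to work correctly, since I can get it to use more cores in other examples.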

library(stringr)
library(parallel)
library(tictoc)

cl <- makeCluster(4, type = "PSOCK")
tic()
df_data$counts <- as.vector(parSapply(
  cl = cl,
  paste(df_names[,"names"], collapse='|'),
  FUN=str_count, 
  string=df_data$data_strings
))
toc()
stopCluster(cl)

What other approaches could I try? Something with data.table? Could the paste inside the apply be done differently?
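On the parallel attempt above: parSapply() distributes the elements of its data argument across the workers, and here that argument is a single collapsed pattern string, so there is only one task to hand out. A minimal sketch of splitting the rows instead, assuming a base-R counter (the helper `count_matches` and the toy vectors are illustrative, not from the question):

```r
library(parallel)

# Toy stand-ins for df_data$data_strings and df_names$names (assumed values)
data_strings <- c("sam sam", "smith jim", "xxx", "ted", "tom smith")
pattern <- paste(c("tom", "jim", "sam", "ted", "yyy"), collapse = "|")

# Hypothetical helper: counts regex matches per string using base R only,
# so the workers do not need any package loaded
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

cl <- makeCluster(2, type = "PSOCK")

# One chunk of rows per worker; each worker now has real work to do
chunks <- lapply(
  splitIndices(length(data_strings), length(cl)),
  function(i) data_strings[i]
)
counts <- unlist(parLapply(cl, chunks, count_matches, pattern = pattern))

stopCluster(cl)
counts
```

Whether this pays off depends on the data size, since shipping the chunks to PSOCK workers has its own cost.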

【Comments】:

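I don't understand what you are counting?

@F.Privé The number of names (as listed in df_names) in each row.

Why is sam 2? Because of "sam sam"?

@F.Privé Yes, exactly.

You may want to add word boundaries to your regex, depending on the desired output. As it stands, "sam" from df_names will match "sam", "samuel", "samual", "sammy", etc. Unless you are fine with that. Just something to keep in mind.

【Answer 1】: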

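I'm not sure whether it is faster on a realistically sized dataset, but you could use quanteda, which has built-in multicore support and should be quite efficient in this case: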

library(dplyr)
library(quanteda)
quanteda_options("threads" = 4) # choose how many threads are used

df_data$counts <- df_data %>%
  pull(data_strings) %>% 
  dfm() %>%                               # construct document-feature-matrix
  dfm_keep(pattern = df_names$names) %>%  # keep features that are names
  convert(to = "data.frame") %>%          # convert to data.frame
  select(-document) %>%                   # remove non-numeric columns
  rowSums()                               # only keep sums

df_data
#>    data_strings counts
#> 1       sam sam      2
#> 2       sam sam      2
#> 3     smith jim      1
#> 4     smith jim      1
#> 5       sam sam      2
#> 6           xxx      0
#> 7           ted      1
#> 8     tom smith      1
#> 9     smith jim      1
#> 10      sam sam      2

Created on 2020-01-13 by the reprex package (v0.3.0)

Note that I set the option stringsAsFactors = FALSE when constructing the data.frames. Otherwise you will run into problems with factors.
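For context, `stringsAsFactors` defaulted to TRUE before R 4.0.0 and to FALSE from 4.0.0 onward, so setting it explicitly keeps the column type the same on either version. A small sketch with made-up rows (the names `df_chr` and `df_fct` are illustrative):

```r
# Character column: what the text-processing functions expect
df_chr <- data.frame(
  data_strings = c("tom smith", "sam sam"),
  stringsAsFactors = FALSE
)

# Factor column: what older R versions produced by default
df_fct <- data.frame(
  data_strings = c("tom smith", "sam sam"),
  stringsAsFactors = TRUE
)

class(df_chr$data_strings)
class(df_fct$data_strings)

# as.numeric() on a factor returns the internal level codes, not the text,
# which is one way factor columns cause surprises
as.numeric(df_fct$data_strings)
```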

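I can imagine this being faster if you have many names in your collection. In my benchmarks, however, stringr::str_count and stringi::stri_count_regex were faster with the small set of names you provided.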

【Comments】:

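This looks like the kind of solution I was looking for; it is about 100 times faster on the larger dataset.

I had a hunch that might be the case, because I tried something similar before, and the time needed to run str_count seemed to grow a lot with every extra pattern you add. Eventually that exceeds the time needed to convert the text to a dfm. But it is hard to benchmark with an example.

One thing: I did not state this clearly in my question, but after some checking I found that this method does not handle strings with spaces the way I had hoped. With these two inputs: df_data <- data.frame( data_strings = c("tom", "sam", "sam tom", "xxx yyy", "aaa xxx yyy bbb") ) ... and ... df_names <- data.frame( names = c("tom", "jim", "sam", "xxx yyy") ) ... I would expect "aaa xxx yyy bbb" to register as 1, and "xxx yyy" should also be 1. Currently both are zero.

If the longest name consists of 2 words, you can replace the dfm() command with dfm(ngrams = 1:2, concatenator = " "). Check with max(stringi::stri_count_fixed(df_names$names, " ")) + 1. If the value is greater than 2, replace the 2 in the dfm() call accordingly.

【Answer 2】: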

str_count() is already vectorized, so you don't need sapply(); just use stringr::str_count(df_data$data_strings, paste(df_names$names, collapse='|')).
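A short sketch of the vectorized call on toy input, with an optional word-boundary variant for cases where "sam" should not be counted inside "samuel". The vectors and the `bounded` pattern are illustrative assumptions, not part of the original answer:

```r
library(stringr)

strings   <- c("sam sam", "samuel", "tom smith")
word_list <- c("tom", "sam")

# One alternation pattern, applied to the whole vector in a single call
pattern <- paste(word_list, collapse = "|")
plain_counts <- str_count(strings, pattern)

# Same pattern wrapped in word boundaries: only whole words are counted
bounded <- paste0("\\b(?:", pattern, ")\\b")
bounded_counts <- str_count(strings, bounded)

plain_counts
bounded_counts
```

With these inputs the plain pattern counts "samuel" once, while the bounded pattern does not count it at all.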

【Comments】:

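Oh, I didn't know that, but this was actually my first attempt. It is just as slow as the sapply version I listed. Is this why the parallelization isn't working?

If I remember correctly, all stringr code is written in C++, which should be much faster than using sapply().

Just checked: when I test it on a sample of my real data, it is basically just as slow, maybe slightly faster.

Maybe you want to use the underlying stringi::stri_count_regex. The syntax is the same in this case; stringr is just a convenience package that calls stringi. Not sure whether it will improve speed, but it is worth a try.

【Answer 3】: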

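If you have repeated names in df_data, you can use a join in data.table to speed things up. If you don't have many repeated names, I don't think this will help much. Also, be sure to remove duplicate names from your search pattern. Even things like "sam" and "samuel" make the partial string matching do duplicate work (though that is trickier to sort out).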

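library(data.table)
library(stringr)

df_data2 <- copy(df_data)  # df_data as defined under "Data" below
setDT(df_data2, key = "data_strings")
dt_data2 <- unique(df_data2)  # count each distinct string only once

dt_data2[, counts := str_count(string = data_strings, pattern = str_c(df_names$names, collapse='|'))]
dt_data2[df_data2]  # join the counts back onto every original row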

    data_strings counts
 1:      sam sam      2
 2:      sam sam      2
 3:      sam sam      2
 4:      sam sam      2
 5:    smith jim      1
 6:    smith jim      1
 7:    smith jim      1
 8:          ted      1
 9:    tom smith      1
10:          xxx      0

Data:

set.seed(123) 
df_data <- data.frame(   
  data_strings = sample(c("tom smith", "smith jim", "sam sam", "ted", "xxx"), 10, replace = TRUE)
)

df_names <- data.frame(
  names = c("tom", "jim", "sam", "ted", "yyy")
)
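The same count-once-then-map-back idea can be sketched without data.table, using unique() and match() from base R. This is an illustrative variant with toy input, not the code from the answer above:

```r
# Toy input with repeated strings (assumed values)
data_strings <- c("sam sam", "sam sam", "smith jim", "xxx", "sam sam")
pattern <- paste(c("tom", "jim", "sam", "ted", "yyy"), collapse = "|")

# Run the expensive regex only on the distinct strings
uniq <- unique(data_strings)
uniq_counts <- lengths(regmatches(uniq, gregexpr(pattern, uniq)))

# Map each original row to the count of its distinct string
counts <- uniq_counts[match(data_strings, uniq)]
counts
```

As with the join, the saving is proportional to how many rows are duplicates.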

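【Comments】:

【Answer 4】:

Here are some base R solutions.

Since my approaches are all base R, performance is not as good as the stringr package, but if you find them useful, perhaps you can borrow some ideas.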

# method by ThomasIsCoding
f_ThomasIsCoding1 <- function() sapply(as.vector(df_data$data_strings), function(x) sum(unlist(strsplit(x,split = " "))%in% df_names$names) )

f_ThomasIsCoding2 <- function() sapply(strsplit(as.vector(df_data$data_strings),split = " "), function(x) sum(x %in% df_names$names))

f_ThomasIsCoding3 <- function() {
  bk <- paste0(df_names$names,collapse = "|")
  lengths(regmatches(df_data$data_strings,gregexpr(bk,df_data$data_strings)))
}

f_ThomasIsCoding4 <- function() {
  # group on data_strings itself so this works for character and factor columns
  with(df_data, as.numeric(ave(as.vector(data_strings),data_strings,FUN = function(x) sum(strsplit(unique(as.vector(x)),split = " ")[[1]] %in% as.vector(df_names$names)))))
}

You can see the benchmark in my other post.
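Building on the strsplit() / %in% idea in f_ThomasIsCoding2, here is a sketch of a variant that avoids the per-row sapply() by flattening all tokens once. It is not one of the functions above, and the toy vectors are assumptions. It counts whole space-separated words only, so unlike the regex versions it does not count "sam" inside "samuel", and it does not match multi-word names:

```r
# Toy input (assumed values)
data_strings <- c("sam sam", "smith jim", "xxx", "ted", "tom smith")
word_list    <- c("tom", "jim", "sam", "ted", "yyy")

# Split every string into words once
tokens <- strsplit(data_strings, " ", fixed = TRUE)

# Row index of each token in the flattened token vector
doc_id <- rep(seq_along(tokens), lengths(tokens))

# A single %in% lookup over all tokens, then count hits per row
hit <- unlist(tokens) %in% word_list
counts <- tabulate(doc_id[hit], nbins = length(tokens))
counts
```

Because %in% is a table lookup rather than a regex alternation, its cost should depend far less on the length of the word list.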

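【Comments】:

【Answer 5】:

This is a very interesting question about performance limits, so I built a benchmark template to compare the performance of the different approaches at a glance.

This post is community wiki, so everyone is welcome to add different approaches to the speed challenge.

Benchmark template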

library(microbenchmark)
library(stringr)

set.seed(123) 
df_data <- data.frame(   
  data_strings = sample(c("tom smith", "smith jim", "sam sam", "ted", "xxx"), 10000, replace = TRUE)
)

df_names <- data.frame(
  names = c("tom", "jim", "sam", "ted", "yyy")
)

# method by Joshua
f_Joshua <- function() as.vector(sapply(
  paste(df_names[,"names"], collapse='|'), 
  str_count, 
  string=df_data$data_strings
))
# method by F. Privé
f_F.Prive <- function() str_count(df_data$data_strings, paste(df_names[,"names"], collapse='|'))
# method by ThomasIsCoding
f_ThomasIsCoding1 <- function() sapply(as.vector(df_data$data_strings), function(x) sum(unlist(strsplit(x,split = " "))%in% df_names$names) )
f_ThomasIsCoding2 <- function() sapply(strsplit(as.vector(df_data$data_strings),split = " "), function(x) sum(x %in% df_names$names))
f_ThomasIsCoding3 <- function() {
  bk <- paste0(df_names$names,collapse = "|")
  lengths(regmatches(df_data$data_strings,gregexpr(bk,df_data$data_strings)))
}
f_ThomasIsCoding4 <- function() {
  # group on data_strings itself so this works for character and factor columns
  with(df_data, as.numeric(ave(as.vector(data_strings),data_strings,FUN = function(x) sum(strsplit(unique(as.vector(x)),split = " ")[[1]] %in% as.vector(df_names$names)))))
}



bm <- microbenchmark(
  f_Joshua(),
  f_F.Prive(),
  f_ThomasIsCoding1(),
  f_ThomasIsCoding2(),
  f_ThomasIsCoding3(),
  f_ThomasIsCoding4(),
  times = 10,
  check = "equivalent",
  unit = "relative")

which gives

> bm
Unit: relative
                expr       min        lq       mean    median         uq        max neval
          f_Joshua()  1.126535  1.067945  0.6261978  1.028165  0.9859666  0.2677307    10
         f_F.Prive()  1.000000  1.000000  1.0000000  1.000000  1.0000000  1.0000000    10
 f_ThomasIsCoding1() 57.177203 61.011742 32.5759501 54.980633 53.4825275 12.4735502    10
 f_ThomasIsCoding2() 18.167507 18.053833 11.8592174 17.945895 23.3277056  4.4468403    10
 f_ThomasIsCoding3() 63.448741 72.585445 35.6459037 65.608859 61.8789544  8.8344612    10
 f_ThomasIsCoding4()  4.039085  3.994598  2.1024356  3.545432  3.3914213  0.7529932    10
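The relative timings above use only five names. Commenters on the first answer observed that the cost of the regex approach grows with the number of patterns. A rough way to probe that is to time a regex alternation against a token lookup on synthetic data. Everything in this sketch (sizes, helper names, random five-letter words) is an assumption for illustration, and timings will vary by machine:

```r
set.seed(1)

# Hypothetical helper: n random five-letter lowercase words
rand_words <- function(n) {
  replicate(n, paste(sample(letters, 5, replace = TRUE), collapse = ""))
}

name_list <- unique(rand_words(500))            # the word list to search for
vocab <- c(name_list[1:50], rand_words(50))     # mix of listed and other words
data_strings <- paste(
  sample(vocab, 2000, replace = TRUE),
  sample(vocab, 2000, replace = TRUE)
)

# Approach 1: one large regex alternation
count_regex <- function() {
  pattern <- paste(name_list, collapse = "|")
  lengths(regmatches(data_strings, gregexpr(pattern, data_strings, perl = TRUE)))
}

# Approach 2: split into words and look each word up in the list
count_tokens <- function() {
  tokens <- strsplit(data_strings, " ", fixed = TRUE)
  doc_id <- rep(seq_along(tokens), lengths(tokens))
  tabulate(doc_id[unlist(tokens) %in% name_list], nbins = length(tokens))
}

# All words have the same length here, so both approaches must agree
stopifnot(identical(count_regex(), count_tokens()))

system.time(count_regex())
system.time(count_tokens())
```

Increasing the length of `name_list` and rerunning shows how each approach scales with the number of names.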

【Comments】:
