Expand a data.table so that each pattern match for each ID gets its own row

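I have a lot of text data in a data.table, and a handful of text patterns I am interested in. I have managed to subset the table so that it shows only the texts matching at least two of the patterns (related question here).

I would now like one row per match, with an additional column identifying which pattern matched, so a row with several matches is repeated once per match, differing only in that column.

It feels like this shouldn't be too hard, but I am struggling! My vague idea is to count the number of pattern matches and then replicate each row that many times, but I am not sure how to get a label for each distinct pattern, nor whether that would be very efficient.

Thanks for your help!

Example data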

library(data.table)
library(stringr)
text_table <- data.table(ID = (1:5), 
                         text = c("lucy, sarah and paul live on the same street",
                                  "lucy has only moved here recently",
                                  "lucy and sarah are cousins",
                                  "john is also new to the area",
                                  "paul and john have known each other a long time"))


text_patterns <- as.character(c("lucy", "sarah", "paul|john"))

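# Filter the table to just the IDs with at least two pattern matches.
# The condition goes in i (no leading comma) so that rows are subset;
# placed in j it would only return a logical vector.
text_table_multiples <- text_table[Reduce(`+`, lapply(text_patterns, 
                                    function(x) str_detect(text, x))) > 1]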
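
As a rough sketch of the counting step that this filter relies on, the per-row match counts can be computed on their own and inspected before subsetting. This version swaps in base `grepl` for `stringr::str_detect`, and `match_counts` is a name introduced here purely for illustration.

```r
library(data.table)

text_table <- data.table(ID = 1:5,
                         text = c("lucy, sarah and paul live on the same street",
                                  "lucy has only moved here recently",
                                  "lucy and sarah are cousins",
                                  "john is also new to the area",
                                  "paul and john have known each other a long time"))
text_patterns <- c("lucy", "sarah", "paul|john")

# One logical vector per pattern, summed element-wise: the number of
# patterns (not the number of occurrences) that each text matches.
# match_counts is an illustrative name, not part of the original question.
match_counts <- Reduce(`+`, lapply(text_patterns,
                                   function(p) grepl(p, text_table$text)))

# Keep only the rows that match at least two patterns
text_table_multiples <- text_table[match_counts > 1]
```

Each pattern contributes at most one to a row's count, so the text for ID 5 counts once for "paul|john" even though both names appear in it.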

Desired output

required_table <- data.table(ID = c(1, 1, 1, 2, 3, 3, 4, 5),
                             text = c("lucy, sarah and paul live on the same street",
                                      "lucy, sarah and paul live on the same street",
                                      "lucy, sarah and paul live on the same street",
                                      "lucy has only moved here recently",
                                      "lucy and sarah are cousins",
                                      "lucy and sarah are cousins",
                                      "john is also new to the area",
                                      "paul and john have known each other a long time"), 
                             person = c("lucy", "sarah", "paul or john", "lucy", "lucy", "sarah", "paul or john", "paul or john"))
Answer

One approach is to create an indicator variable for each pattern and then melt:

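library(stringi)
# Add one logical indicator column per pattern
# (:= modifies text_table by reference)
text_table[, lucy := stri_detect_regex(text, 'lucy')][ ,
  sarah := stri_detect_regex(text, 'sarah')
][ ,`paul or john` := stri_detect_regex(text, 'paul|john')
]

# Reshape to long format, keep only the matches, then drop the indicator
melt(text_table, id.vars = c("ID", "text"))[value == TRUE][, -"value"]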
##    ID                                            text     variable
## 1:  1    lucy, sarah and paul live on the same street         lucy
## 2:  2               lucy has only moved here recently         lucy
## 3:  3                      lucy and sarah are cousins         lucy
## 4:  1    lucy, sarah and paul live on the same street        sarah
## 5:  3                      lucy and sarah are cousins        sarah
## 6:  1    lucy, sarah and paul live on the same street paul or john
## 7:  4                    john is also new to the area paul or john
## 8:  5 paul and john have known each other a long time paul or john
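
If the list of patterns grows, writing one `:=` per pattern gets tedious. A minimal sketch of a generalisation, assuming the patterns are kept in a named character vector whose names serve as the labels (that naming scheme and the name `long` are assumptions made here, not part of the answer above), subsets once per pattern and stacks the pieces:

```r
library(data.table)

text_table <- data.table(ID = 1:5,
                         text = c("lucy, sarah and paul live on the same street",
                                  "lucy has only moved here recently",
                                  "lucy and sarah are cousins",
                                  "john is also new to the area",
                                  "paul and john have known each other a long time"))

# Assumed convention: names are the labels, values are the regexes
text_patterns <- c(lucy = "lucy", sarah = "sarah", `paul or john` = "paul|john")

# For each pattern, keep the matching rows and tag them with the label;
# rep(label, .N) keeps the result empty when nothing matches
long <- rbindlist(lapply(names(text_patterns), function(label) {
  text_table[grepl(text_patterns[[label]], text),
             .(ID, text, person = rep(label, .N))]
}))

# Sort by ID; pattern order is kept within each ID
setorder(long, ID)
```

This leaves `text_table` itself untouched, whereas the `:=` version adds the indicator columns to it by reference.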

A tidyverse way of doing the same procedure is:

library(tidyverse)
library(stringi)  # provides stri_detect_regex
text_table %>%
  mutate(lucy = stri_detect_regex(text, 'lucy')) %>%
  mutate(sarah = stri_detect_regex(text, 'sarah')) %>%
  mutate(`paul or john` = stri_detect_regex(text, 'paul|john')) %>%
  gather(value = value, key = person,  - c(ID, text)) %>%
  filter(value) %>%
  select(-value)
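
In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. A hedged sketch of the same pipeline in that style follows; it assumes tidyr 1.0.0 or later, uses `stringr::str_detect` in place of `stri_detect_regex`, and starts from a plain data frame (`text_df`, `matched`, and `long` are illustrative names).

```r
library(dplyr)
library(tidyr)
library(stringr)

text_df <- data.frame(ID = 1:5,
                      text = c("lucy, sarah and paul live on the same street",
                               "lucy has only moved here recently",
                               "lucy and sarah are cousins",
                               "john is also new to the area",
                               "paul and john have known each other a long time"),
                      stringsAsFactors = FALSE)

long <- text_df %>%
  # One logical indicator column per pattern
  mutate(lucy = str_detect(text, "lucy"),
         sarah = str_detect(text, "sarah"),
         `paul or john` = str_detect(text, "paul|john")) %>%
  # Indicator columns become (person, matched) pairs, one row per pair
  pivot_longer(cols = -c(ID, text), names_to = "person", values_to = "matched") %>%
  # Keep only the patterns that matched, then drop the indicator
  filter(matched) %>%
  select(-matched)
```

Because `pivot_longer()` emits the rows for one input row before moving on to the next, the result comes out grouped by ID without an extra sort.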
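Another answer

Disclaimer: this is not an idiomatic data.table solution.

I would build a helper function like the one below, which takes a single row and the patterns as input and returns a new data.table with one row per matching pattern: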

library(data.table)
library(tidyverse)

new_rows <- function(dtRow, patterns = text_patterns){

    # Use the `patterns` argument rather than the global text_patterns
    res <- map(patterns, function(word) {

        rowText <- dtRow[1, text]

        # The row's text if it matches `word`, otherwise NA
        textField <- ifelse(grepl(pattern = word, x = rowText), rowText, NA)

        # The matched word, with "paul" and "john" collapsed into one label
        personField   <- str_extract(string = rowText, pattern = word) %>% 
            ifelse(. %in% c("paul", "john"), "paul or john", .)

        idField <- ifelse(is.na(textField), NA, dtRow[1, ID])

        data.table(ID = idField, text = textField, person = personField) 

        }) %>% 
        rbindlist()

    # Drop the placeholder rows for patterns that did not match
    res[!is.na(text), ]
}

I would run it like this:

split(text_table, f = text_table[['ID']]) %>% 
    map_df(function(r) new_rows(dtRow = r))

The result is:

   ID                                            text       person
1:  1    lucy, sarah and paul live on the same street         lucy
2:  1    lucy, sarah and paul live on the same street        sarah
3:  1    lucy, sarah and paul live on the same street paul or john
4:  2               lucy has only moved here recently         lucy
5:  3                      lucy and sarah are cousins         lucy
6:  3                      lucy and sarah are cousins        sarah
7:  4                    john is also new to the area paul or john
8:  5 paul and john have known each other a long time paul or john

which looks like your required_table (including the repeated IDs):

   ID                                            text       person
1:  1    lucy, sarah and paul live on the same street         lucy
2:  1    lucy, sarah and paul live on the same street        sarah
3:  1    lucy, sarah and paul live on the same street paul or john
4:  2               lucy has only moved here recently         lucy
5:  3                      lucy and sarah are cousins         lucy
6:  3                      lucy and sarah are cousins        sarah
7:  4                    john is also new to the area paul or john
8:  5 paul and john have known each other a long time paul or john
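
For comparison with the row-by-row helper, the idea floated in the question (work out the matches, then repeat the rows) can be sketched without any per-row function calls by building a logical match matrix and reading off the positions of its TRUE cells. This is only an illustration under a few assumptions: the patterns are held in a named vector whose names are the labels, the table has more than one row (so that `sapply` returns a matrix), and `match_matrix`, `hits`, and `long` are names introduced here.

```r
library(data.table)

text_table <- data.table(ID = 1:5,
                         text = c("lucy, sarah and paul live on the same street",
                                  "lucy has only moved here recently",
                                  "lucy and sarah are cousins",
                                  "john is also new to the area",
                                  "paul and john have known each other a long time"))

# Assumed convention: names are the labels, values are the regexes
text_patterns <- c(lucy = "lucy", sarah = "sarah", `paul or john` = "paul|john")

# Logical matrix: one row per text, one column per pattern
match_matrix <- sapply(text_patterns, function(p) grepl(p, text_table$text))

# (row, col) index pairs of every TRUE cell, sorted by row, then pattern
hits <- which(match_matrix, arr.ind = TRUE)
hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

# Repeat each text once per hit and label it with the pattern's name
long <- text_table[hits[, "row"]]
long[, person := names(text_patterns)[hits[, "col"]]]
```

The row counts never have to be computed explicitly: a text that matches three patterns simply appears three times in `hits`.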
