展开data.table,使每个ID的每个模式匹配一 行
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了展开data.table,使每个ID的每个模式匹配一 行相关的知识,希望对你有一定的参考价值。
我在data.table中有很多文本数据。我有几个我感兴趣的文本模式。我已设法对表进行子集化,以便显示与至少两个模式匹配的文本(相关问题here)。
我现在希望每个匹配能够有一行,并且有一个标识匹配的附加列 - 所以有多个匹配的行将与该列重复。
感觉这不应该太难,但我正在努力!我模糊的想法可能是计算模式匹配的数量,然后多次复制行...但是我不完全确定如何为每个不同的模式获取标签...(并且也不确定是非常有效率)。
谢谢你的帮助!
示例数据
library(data.table)
library(stringr)
text_table <- data.table(ID = (1:5),
text = c("lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))
# Filtering the table to just the IDs with at least two pattern matches
text_table_multiples <- text_table[, Reduce(`+`, lapply(text_patterns,
function(x) str_detect(text, x))) >1]
理想的输出
required_table <- data.table(ID = c(1, 1, 1, 2, 3, 3, 4, 5),
text = c("lucy, sarah and paul live on the same street",
"lucy, sarah and paul live on the same street",
"lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"),
person = c("lucy", "sarah", "paul or john", "lucy", "lucy", "sarah", "paul or john", "paul or john"))
答案
一种方法是为每个指标创建一个变量并融化:
library(stringi)
text_table[, lucy := stri_detect_regex(text, 'lucy')][ ,
sarah := stri_detect_regex(text, 'sarah')
][ ,`paul or john` := stri_detect_regex(text, 'paul|john')
]
melt(text_table, id.vars = c("ID", "text"))[value == T][, -"value"]
## ID text variable
## 1: 1 lucy, sarah and paul live on the same street lucy
## 2: 2 lucy has only moved here recently lucy
## 3: 3 lucy and sarah are cousins lucy
## 4: 1 lucy, sarah and paul live on the same street sarah
## 5: 3 lucy and sarah are cousins sarah
## 6: 1 lucy, sarah and paul live on the same street paul or john
## 7: 4 john is also new to the area paul or john
## 8: 5 paul and john have known each other a long time paul or john
做同样程序的一种简洁方法是:
library(tidyverse)
text_table %>%
mutate(lucy = stri_detect_regex(text, 'lucy')) %>%
mutate(sarah = stri_detect_regex(text, 'sarah')) %>%
mutate(`paul or john` = stri_detect_regex(text, 'paul|john')) %>%
gather(value = value, key = person, - c(ID, text)) %>%
filter(value) %>%
select(-value)
另一答案
免责声明:这不是一个惯用的data.table
解决方案
我将构建一个如下所示的辅助函数,它接受一行和一个输入并返回一个带有Nrows的新dt:
library(data.table)
library(tidyverse)
new_rows <- function(dtRow, patterns = text_patterns){
res <- map(text_patterns, function(word) {
textField <- grep(x = dtRow[1, text], pattern = word, value = TRUE) %>%
ifelse(is.character(.), ., NA)
personField <- str_extract(string = dtRow[1, text], pattern = word) %>%
ifelse( . == "paul" | . == "john", "paul or john", .)
idField <- ifelse(is.na(textField), NA, dtRow[1, ID])
data.table(ID = idField, text = textField, person = personField)
}) %>%
rbindlist()
res[!is.na(text), ]
}
我会执行它:
split(text_table, f = text_table[['ID']]) %>%
map_df(function(r) new_rows(dtRow = r))
答案是:
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john
看起来像你的required_table
(包括重复的ID)
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john
以上是关于展开data.table,使每个ID的每个模式匹配一 行的主要内容,如果未能解决你的问题,请参考以下文章
如何为 data.table 中不包括 NA 的更多变量添加滞后并导致每个观察结果?