R - If column contains a string from vector, append flag into another column

【Posted】: 2022-01-19 23:28:15

【Question】:

My data

I have a vector of words like the one below. This is an oversimplification; my real vector has more than 600 words:

myvec <- c("cat", "dog", "bird")

I have a data frame with the following structure:

structure(list(id = c(1, 2, 3), onetext= c("cat furry pink british", 
"dog cat fight", "bird cat issues"), cop= c("Little Grey Cat is the nickname given to a kitten of the British Shorthair breed that rose to viral fame on Tumblr through a variety of musical tributes and photoshopped parodies in late September 2014", 
"Dogs have soft fur and tails so do cats Do cats like to chase their tails", 
"A cat and bird can coexist in a home but you will have to take certain measures to ensure that a cat cannot physically get to the bird at any point"
), text3 = c("On October 4th the first single topic blog devoted to the little grey cat was launched On October 20th Tumblr blogger Torridgristle shared a cutout exploitable image of the cat, which accumulated over 21000 notes in just over three months.", 
"there are many fights going on and this is just an example text", 
"Some cats will not care about a pet bird at all while others will make it its life mission to get at a bird You will need to assess the personalities of your pets and always remain on guard if you allow your bird and cat to interact"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))

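[Image: the data frame shown as a table]

My problem

For each keyword in my vector myvec, I need to go through the dataset and check the columns onetext, cop and text3. If I find the keyword in any of these 3 columns, I need to append it to a new column.

[Image: the expected result, with a new column listing the matched keywords]

My original dataset is very large (the last column is the longest), so multiple nested loops (which is what I tried) are not ideal.

Edit: note that a word only needs to appear once in the row to be listed. All matching keywords should be listed.

How can I do this? I am using the tidyverse, so my dataset is actually a tibble.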

Similar posts (but not quite)

The following posts are somewhat similar, but not quite the same:

If Column Contains String then enter value for that row
R Column Check if Contains Value from Another Column
Add new column if range of columns contains string in R


【Answer 1】:

Update: if a list is preferred, use str_extract_all:

# pattern is the regex "cat|dog|bird" built from myvec (defined further down)
df %>%
  transmute(across(-id, ~case_when(str_detect(., pattern) ~ str_extract_all(., pattern)), .names = "new_col{col}"))

This gives:

  new_colonetext new_colcop new_coltext3
  <list>         <list>     <list>      
1 <chr [1]>      <NULL>     <chr [2]>   
2 <chr [2]>      <chr [2]>  <NULL>      
3 <chr [2]>      <chr [4]>  <chr [5]>  
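
To work with a list result like the one above, individual cells can be pulled out with [[ ]]. The following is a minimal sketch on two made-up strings (the texts vector is an assumption for illustration and is not part of the question's data):

```r
library(stringr)

# Hypothetical inputs: one string with matches, one without
texts <- c("dog cat fight", "no pets here")
pattern <- "cat|dog|bird"

# str_extract_all returns a list with one character vector per input string
hits <- str_extract_all(texts, pattern)

first_hits <- hits[[1]]     # all matches found in the first string
n_hits <- lengths(hits)     # number of matches per string
```

Note that str_extract_all on its own returns an empty character vector when nothing matches. The <NULL> cells in the output above most likely come from case_when leaving the non-matching rows without a value.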

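Here is how to get the result:

1. Build a regex pattern from the vector.
2. Use mutate with across to check the desired columns.
3. If one of the desired strings is detected, extract it into a new column.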
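library(dplyr)
library(tidyr)
library(stringr)  # provides str_detect() and str_extract_all()

myvec <- c("cat", "dog", "bird")

# Combine the keywords into a single regex: "cat|dog|bird"
pattern <- paste(myvec, collapse = "|")

df %>%
  # For every column except id, extract all matches into a new list column
  mutate(across(-id, ~case_when(str_detect(., pattern) ~ str_extract_all(., pattern)), .names = "new_col{col}")) %>%
  # Merge the new columns into a single topic column
  unite(topic, starts_with('new'), na.rm = TRUE, sep = ',')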
    id onetext                cop                                                                        text3                                                                              topic                                     
  <dbl> <chr>                  <chr>                                                                      <chr>                                                                              <chr>                                     
1     1 cat furry pink british Little Grey Cat is the nickname given to a kitten of the British Shorthai~ On October 4th the first single topic blog devoted to the little grey cat was lau~ "cat,NULL,c(\"cat\", \"cat\")"            
2     2 dog cat fight          Dogs have soft fur and tails so do cats Do cats like to chase their tails  there are many fights going on and this is just an example text                    "c(\"dog\", \"cat\"),c(\"cat\", \"cat\"),~
3     3 bird cat issues        A cat and bird can coexist in a home but you will have to take certain me~ Some cats will not care about a pet bird at all while others will make it its lif~ "c(\"bird\", \"cat\"),c(\"cat\", \"bird\"~                                                                                    
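
The topic column above still contains NULL and c(...) fragments, because unite pastes the list columns as text. One possible way to get a plain list of keywords per row is sketched below. It is not the answerer's method: it glues the three text columns together first and then keeps each match once. The data frame here is a shortened stand-in for the question's data, and the names all_text and result are made up for illustration.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Shortened stand-in for the question's data (texts abbreviated; an assumption)
df <- tibble(
  id = c(1, 2, 3),
  onetext = c("cat furry pink british", "dog cat fight", "bird cat issues"),
  cop = c("Little Grey Cat is a nickname",
          "Dogs have soft fur and so do cats",
          "A cat and bird can coexist"),
  text3 = c("a blog devoted to the little grey cat",
            "there are many fights going on",
            "Some cats will not care about a pet bird")
)

myvec <- c("cat", "dog", "bird")
pattern <- paste(myvec, collapse = "|")

result <- df %>%
  # Glue the three text columns into one temporary string per row
  unite("all_text", onetext, cop, text3, sep = " ", remove = FALSE) %>%
  mutate(
    # Extract every match, keep each keyword once, collapse to one string
    topic = vapply(
      str_extract_all(all_text, pattern),
      function(hits) paste(unique(hits), collapse = ","),
      character(1)
    )
  ) %>%
  # Drop the temporary helper column
  select(-all_text)

result$topic
```

Matching here is case-sensitive and works on substrings, so "Cat" is skipped while "cats" counts as "cat". If that is not wanted, wrapping the pattern in regex(pattern, ignore_case = TRUE) or adding \\b word boundaries are options to consider. With 600 keywords a single alternation pattern may become slow, which is worth measuring on the real data.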

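【Comments】:

Thanks. For some reason it doesn't work. It just repeats one word (the last one found) and drops the rest of the words already found. So on row 3, instead of bird,cat I get bird,bird,bird.

Hmm. Look at my output. It should work! Oh, OK, library(stringr) was missing.

OK, what I see is that it requires the word to be in all columns of that row. But the point I did not clarify (will edit) is that it has to be in at least one. I don't know why, but I can't get it to work. It keeps replacing everything with the last thing it found.

Using str_extract_all instead of str_extract seems to work fine.

Yes, I only just realized that and am thinking about how to fix it. Will update.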
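
The difference between str_extract and str_extract_all mentioned in the comments can be seen on a single string. This is a small sketch using the first-column text of row 3 from the question; the variable names are made up for illustration.

```r
library(stringr)

pattern <- paste(c("cat", "dog", "bird"), collapse = "|")
text <- "bird cat issues"

# str_extract stops at the first match in each string
first_only <- str_extract(text, pattern)

# str_extract_all returns every match; [[1]] picks the result for the first string
all_hits <- str_extract_all(text, pattern)[[1]]
```

Here first_only holds only "bird", while all_hits holds both "bird" and "cat", which is why the first variant loses keywords.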
