R: Paste and combine multiple outputs from ifelse after grepl of patterns in list

[Posted]: 2022-01-16 06:39:51

[Question]:

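I searched for posts similar to my problem but could not find any.

My goal is to combine the cells of the name column in df1 (separated by "_" when there is more than one) into a new column pasteHere, by matching the strings in df1$order against df2$ref (I use grepl for this).

The data frame is large, so I included a for loop to go through each row.

I am not sure whether the error comes from the loop, from grepl, or whether combining multiple items is simply not possible in this situation.

First, the dummy data: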

## dummy data
df1 <- data.frame(ggplot2::msleep[c(1:10),c(1:5)]) 
df2 <- data.frame(ref = unique(df1$order), pasteHere = NA)

## what the dfs look like:
> df1
                         name       genus  vore        order conservation
1                     Cheetah    Acinonyx carni    Carnivora           lc
2                  Owl monkey       Aotus  omni     Primates         <NA>
3             Mountain beaver  Aplodontia herbi     Rodentia           nt
4  Greater short-tailed shrew     Blarina  omni Soricomorpha           lc
5                         Cow         Bos herbi Artiodactyla domesticated
6            Three-toed sloth    Bradypus herbi       Pilosa         <NA>
7           Northern fur seal Callorhinus carni    Carnivora           vu
8                Vesper mouse     Calomys  <NA>     Rodentia         <NA>
9                         Dog       Canis carni    Carnivora domesticated
10                   Roe deer   Capreolus herbi Artiodactyla           lc

> df2
           ref pasteHere
1    Carnivora        NA
2     Primates        NA
3     Rodentia        NA
4 Soricomorpha        NA
5 Artiodactyla        NA
6       Pilosa        NA

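You can see that Carnivora, Rodentia and Artiodactyla appear 3, 2 and 2 times respectively in df1$order.

Now, by matching df1$order against df2$ref, I want to paste df1$name into df2$pasteHere, joining the names with "_" when there are multiple matches. I am still inexperienced with for loops in R.

Here is my failed attempt: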

## my failed attempt:
for(i in 1:length(df2$ref)){
  for(j in df2$ref){
    df2$pasteHere[i] <- ifelse(grepl(df2$ref==j, df1$order), paste(df1$name, collapse="_"), "NA")
  }
}

grepl gives the following warnings:

> warnings()[1:5]
Warning messages:
1: In grepl(df2$ref == j, df1$order) :
  argument 'pattern' has length > 1 and only the first element will be used
2: In df2$pasteHere[i] <- ifelse(grepl(df2$ref == j, df1$order),  ... :
  number of items to replace is not a multiple of replacement length
3: In grepl(df2$ref == j, df1$order) :
  argument 'pattern' has length > 1 and only the first element will be used
4: In df2$pasteHere[i] <- ifelse(grepl(df2$ref == j, df1$order),  ... :
  number of items to replace is not a multiple of replacement length
5: In grepl(df2$ref == j, df1$order) :
  argument 'pattern' has length > 1 and only the first element will be used
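
Both warnings can be traced to the loop body. `df2$ref == j` is a logical vector rather than a pattern, so grepl uses only its first element; and ifelse returns one value per row of df1, which cannot fit into the single cell `df2$pasteHere[i]`. The following is a minimal sketch of one possible repair, not the only one. It assumes that literal matching is what is wanted (hence `fixed = TRUE`), and it retypes the dummy data by hand so that ggplot2 is not required:

```r
# Dummy data retyped by hand (assumption: same values as msleep rows 1-10)
df1 <- data.frame(
  name  = c("Cheetah", "Owl monkey", "Mountain beaver",
            "Greater short-tailed shrew", "Cow", "Three-toed sloth",
            "Northern fur seal", "Vesper mouse", "Dog", "Roe deer"),
  order = c("Carnivora", "Primates", "Rodentia", "Soricomorpha",
            "Artiodactyla", "Pilosa", "Carnivora", "Rodentia",
            "Carnivora", "Artiodactyla"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(ref = unique(df1$order), pasteHere = NA_character_,
                  stringsAsFactors = FALSE)

for (i in seq_along(df2$ref)) {
  # Pass one pattern at a time; fixed = TRUE treats it as a literal string
  hit <- grepl(df2$ref[i], df1$order, fixed = TRUE)
  # Subset the names first, then collapse them into a single string
  df2$pasteHere[i] <- paste(df1$name[hit], collapse = "_")
}
df2
```

A single loop over df2$ref is enough here, because grepl is already vectorised over the rows of df1.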

What I would like my final data frame to look like:

> final_df
           ref                     pasteHere
1    Carnivora Cheetah_Northern fur seal_Dog
2     Primates                    Owl monkey
3     Rodentia  Mountain beaver_Vesper mouse
4 Soricomorpha    Greater short-tailed shrew
5 Artiodactyla                  Cow_Roe deer
6       Pilosa              Three-toed sloth

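I am not sure whether the problem comes from pasting multiple items. Please advise. Other solutions are welcome too! :)

--------------- Update: ---------------

Reason for the update:

The dummy data above is too simplified for the problem I actually have. Below is new dummy data that is closer to my current situation: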

df1 <- data.frame(ggplot2::msleep[c(1:10),c(1,4)]) 
order_longString <-  list(c("eeny", "Carnivora", "meeny"),
                         c("Primates", "miny", "moe"),
                         c("catch","a","tiger","Rodentia"),
                         c("by","the","toe","Soricomorpha","If"),
                         c("he","Artiodactyla","hollers"),
                         c("let","Pilosa"),
                         c("him","go","Carnivora"),
                         c("eenie","Rodentia","minie","money","more"),
                         c("Carnivora","catch"),
                         c("a","piggy","Artiodactyla","by","the","snout"))
df1$order_longString <- order_longString
df2 = data.frame(ref = unique(df1$order), pasteHere = NA)


## Updated df looks like this:
> df1
                         name        order                       order_longString
1                     Cheetah    Carnivora                 eeny, Carnivora, meeny
2                  Owl monkey     Primates                    Primates, miny, moe
3             Mountain beaver     Rodentia              catch, a, tiger, Rodentia
4  Greater short-tailed shrew Soricomorpha         by, the, toe, Soricomorpha, If
5                         Cow Artiodactyla              he, Artiodactyla, hollers
6            Three-toed sloth       Pilosa                            let, Pilosa
7           Northern fur seal    Carnivora                     him, go, Carnivora
8                Vesper mouse     Rodentia    eenie, Rodentia, minie, money, more
9                         Dog    Carnivora                       Carnivora, catch
10                   Roe deer Artiodactyla a, piggy, Artiodactyla, by, the, snout

> df2 # remains the same
           ref pasteHere
1    Carnivora        NA
2     Primates        NA
3     Rodentia        NA
4 Soricomorpha        NA
5 Artiodactyla        NA
6       Pilosa        NA

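Now look at df1$order_longString. Each entry is a long string made up of a varying number of strings, separated by ",". I need to match the df2$ref patterns against the strings in df1$order_longString, which is why I use grepl.

Then, as described above, once a pattern matches, the df1$name of that row should be pasted into df2$pasteHere, with the names of multiple matching rows joined by "_".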
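
One way to approach the updated data is sketched below, using base R only. It assumes that order_longString is a list column of character vectors, as in the dummy data above, and that literal matching is intended (`fixed = TRUE`); the data is retyped by hand so that ggplot2 is not required:

```r
df1 <- data.frame(
  name = c("Cheetah", "Owl monkey", "Mountain beaver",
           "Greater short-tailed shrew", "Cow", "Three-toed sloth",
           "Northern fur seal", "Vesper mouse", "Dog", "Roe deer"),
  stringsAsFactors = FALSE
)
# List column: one character vector per row
df1$order_longString <- list(
  c("eeny", "Carnivora", "meeny"),
  c("Primates", "miny", "moe"),
  c("catch", "a", "tiger", "Rodentia"),
  c("by", "the", "toe", "Soricomorpha", "If"),
  c("he", "Artiodactyla", "hollers"),
  c("let", "Pilosa"),
  c("him", "go", "Carnivora"),
  c("eenie", "Rodentia", "minie", "money", "more"),
  c("Carnivora", "catch"),
  c("a", "piggy", "Artiodactyla", "by", "the", "snout")
)
df2 <- data.frame(
  ref = c("Carnivora", "Primates", "Rodentia",
          "Soricomorpha", "Artiodactyla", "Pilosa"),
  pasteHere = NA_character_,
  stringsAsFactors = FALSE
)

df2$pasteHere <- vapply(df2$ref, function(r) {
  # TRUE for every row whose vector contains a string matching r
  hit <- vapply(df1$order_longString,
                function(s) any(grepl(r, s, fixed = TRUE)),
                logical(1))
  # Join the names of all matching rows with "_"
  paste(df1$name[hit], collapse = "_")
}, character(1), USE.NAMES = FALSE)
df2
```

If the entries always have to equal a ref exactly, `r %in% s` could replace the `any(grepl(...))` call; grepl is kept here because it also finds a ref that is only part of a longer string.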

I hope that is clear!

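[Answer 1]:

Do you even need the second df in this case?

I suggest using data.table:

library(data.table)

df1 = data.table(ggplot2::msleep[c(1:10),c(1:5)])

# collapse all names within each order into one "_"-separated string
df_final = df1[, .(pasteHere = paste(name, collapse = "_")), by=order]

Output: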

> df_final
          order                     pasteHere
1:    Carnivora Cheetah_Northern fur seal_Dog
2:     Primates                    Owl monkey
3:     Rodentia  Mountain beaver_Vesper mouse
4: Soricomorpha    Greater short-tailed shrew
5: Artiodactyla                  Cow_Roe deer
6:       Pilosa              Three-toed sloth

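If you need the merged df, you can do this:

library(data.table)

df1 = data.table(ggplot2::msleep[c(1:10),c(1:5)])
df2 = data.table(ref = unique(df1$order), pasteHere = NA)

# collapse all names within each order into one "_"-separated string
df1 = df1[, .(pasteHere = paste(name, collapse = "_")), by=order]

# join on df2$ref = df1$order, keeping only the ref column from df2
df_final = merge(df2[, c("ref")], df1, by.x="ref", by.y="order")

Output: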

            ref                     pasteHere
1: Artiodactyla                  Cow_Roe deer
2:    Carnivora Cheetah_Northern fur seal_Dog
3:       Pilosa              Three-toed sloth
4:     Primates                    Owl monkey
5:     Rodentia  Mountain beaver_Vesper mouse
6: Soricomorpha    Greater short-tailed shrew
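
Note that the merged result comes out sorted by ref, not in the original order of df2. If the original row order matters, or data.table is not available, something along the following lines might work in base R. This is only a sketch: it swaps in aggregate and match for the data.table calls above, and the dummy data is retyped by hand so that ggplot2 is not required:

```r
df1 <- data.frame(
  name  = c("Cheetah", "Owl monkey", "Mountain beaver",
            "Greater short-tailed shrew", "Cow", "Three-toed sloth",
            "Northern fur seal", "Vesper mouse", "Dog", "Roe deer"),
  order = c("Carnivora", "Primates", "Rodentia", "Soricomorpha",
            "Artiodactyla", "Pilosa", "Carnivora", "Rodentia",
            "Carnivora", "Artiodactyla"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(ref = unique(df1$order), stringsAsFactors = FALSE)

# One row per order; collapse = "_" is passed on to paste()
agg <- aggregate(name ~ order, data = df1, FUN = paste, collapse = "_")

# Look each ref up in agg; match() keeps the row order of df2
df2$pasteHere <- agg$name[match(df2$ref, agg$order)]
df2
```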

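[Comments]:

Thank you for the quick reply! I realize I made a big mistake by using oversimplified dummy data. That was not quite what I intended; I need to make the dummy data closer to my actual situation. Would you suggest reposting my question via the "Answer Your Question" button below, or editing the post directly? Sorry, I am very new to this site.

Just create an "Update:" or "Edit:" block in your initial post :-)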
