Grouping by partial string match
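I have a table listing categories, each with a count, and I would like to collapse the categories based on similarity. For example, Mariner-1_AMel and Mariner-10 would become a single Mariner category, and anything with "Jockey" or "hAT" in its name should be collapsed as well.
I am struggling to find a solution that covers all the possibilities. Is there a simple dplyr solution?
Reproducible example: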
> dput(tibs)
structure(list(type = c("(TTAAG)n_1", "AMARI_1", "Copia-4_LH-I",
"DNA", "DNA-1_CQ", "DNA/hAT-Charlie", "DNA/hAT-Tip100", "DNA/MULE-MuDR",
"DNA/P", "DNA/PiggyBac", "DNA/TcMar-Mariner", "DNA/TcMar-Tc1",
"DNA/TcMar-Tigger", "G3_DM", "Gypsy-10_CFl-I", "hAT-1_DAn", "hAT-16_SM",
"hAT-N4_RPr", "HELITRON7_CB", "Jockey-1_DAn", "Jockey-1_DEl",
"Jockey-12_DF", "Jockey-5_DTa", "Jockey-6_DYa", "Jockey-6_Hmel",
"Jockey-7_HMM", "Jockey-8_Hmel", "LINE/Dong-R4", "LINE/I", "LINE/I-Jockey",
"LINE/I-Nimb", "LINE/Jockey", "LINE/L1", "LINE/L2", "LINE/R1",
"LINE/R2", "LINE/R2-NeSL", "LINE/Tad1", "LTR/Gypsy", "Mariner_CA",
"Mariner-1_AMel", "Mariner-10_HSal", "Mariner-13_ACe", "Mariner-15_HSal",
"Mariner-16_DAn", "Mariner-19_RPr", "Mariner-30_SM", "Mariner-39_SM",
"Mariner-42_HSal", "Mariner-46_HSal", "Mariner-49_HSal", "TE-5_EL",
"Unknown", "Utopia-1_Crp"), n = c(1L, 1L, 1L, 2L, 1L, 18L, 3L,
9L, 2L, 8L, 21L, 12L, 18L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L, 2L, 7L, 2L, 7L, 24L, 1L, 1L, 5L, 3L, 1L,
1L, 7L, 1L, 5L, 1L, 1L, 5L, 5L, 1L, 1L, 3L, 5L, 5L, 2L, 1L, 190L,
1L)), row.names = c(NA, -54L), class = c("tbl_df", "tbl", "data.frame"
))
Answer
If there is no common rule that defines the groups, you can use case_when to spell out each condition individually.
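library(dplyr)
library(stringr)

tibs %>%
  mutate(category = case_when(
    # The backslash must be doubled inside an R string literal
    str_detect(type, 'Mariner-\\d+') ~ 'Mariner',
    str_detect(type, 'Jockey|hAT') ~ 'common'
    # Add more conditions here; rows that match none of them get NA
  ))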
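The answer stops at labelling each row, while the question asks for the counts to be collapsed. A minimal sketch of that remaining step, under some assumptions: `demo` is a hypothetical six-row subset of `tibs` so the code runs on its own, the `TRUE ~ type` fallback (keep the original name for unmatched rows) is not part of the answer, and the patterns and labels are simply the ones used above.

```r
library(dplyr)
library(stringr)

# Hypothetical subset of `tibs`, copied from the dput() output in the question
demo <- data.frame(
  type = c("Mariner-1_AMel", "Mariner-10_HSal", "Jockey-1_DAn",
           "LINE/Jockey", "hAT-16_SM", "Unknown"),
  n = c(5L, 1L, 1L, 24L, 2L, 190L),
  stringsAsFactors = FALSE
)

collapsed <- demo %>%
  mutate(category = case_when(
    str_detect(type, "Mariner-\\d+") ~ "Mariner",
    str_detect(type, "Jockey|hAT")   ~ "common",
    TRUE                             ~ type  # assumption: unmatched rows keep their own name
  )) %>%
  group_by(category) %>%                     # one group per collapsed label
  summarise(n = sum(n), .groups = "drop")    # add up the counts within each group

print(collapsed)
```

Note that a pattern such as 'Mariner-\\d+' does not match names like "Mariner_CA" or "DNA/TcMar-Mariner"; whether those belong in the same group depends on the analysis, and the pattern can be loosened to 'Mariner' if they do.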