Split data frame string column and count items (dplyr and R)
Posted: 2022-01-22 08:48:37
【Question】: My data looks like the following. I want to split the entries of the core_enrichment column, which are joined by "/", and count how many IDs (e.g. 101739, 20382, 13006, ...) each row contains.
> dat %>% select(ID, core_enrichment)
# A tibble: 22 x 2
ID core_enrichment
<chr> <chr>
1 HALLMARK_E2F_TARGETS 101739/20382/13006/212377/114714/66622/140917/19139/18813/16647/20492/67241/103573/67054/19385/14852/12567/70699/20842/70472…
2 HALLMARK_G2M_CHECKPOINT 75717/103573/14852/18141/12567/26429/20842/17975/12545/20641/21781/19357/17216/15331/12615/107823/13555/56403/26554/11991/77…
3 HALLMARK_MYC_TARGETS_V1 66942/56200/27041/12729/68981/20810/27050/19934/110639/66235/12237/70316/26965/109801/12785/103136/11757/16211/18673/20462/1…
4 HALLMARK_INTERFERON_GAMMA_R… 14293/12575/246728/12265/12984/16149/14969/17329/17750/626578/14129/21928/99899/231655/17858/66141/57444/14960/100121/80876/…
5 HALLMARK_TNFA_SIGNALING_VIA… 14282/12977/19252/16476/14281/12575/21926/15200/22151/17872/21928/21664/14345/15980/13653/20303/12515/11852/74646/18227/7171…
6 HALLMARK_P53_PATHWAY 71839/12579/12795/27280/12606/16476/14281/12578/12575/15368/15200/11820/19734/17872/19143/16450/56312/71712/22337/64058/1660…
7 HALLMARK_SPERMATOGENESIS 17344/15512/23885/12326/71838/18952/15925/14056/16162/27214/20496/18551/21821/20878/12442/106344/22137/53604/215387/72391/73…
8 HALLMARK_INFLAMMATORY_RESPO… 19222/192187/216799/14293/12977/12986/19204/12575/12267/15200/17329/19734/13733/13136/15980/20288/19217/13058/12515/16402/25…
9 HALLMARK_MITOTIC_SPINDLE 21844/233406/110033/12190/240641/26934/236266/56699/105988/16906/71819/67052/12488/67141/229841/20878/18817/208084/17318/218…
10 HALLMARK_IL6_JAK_STAT3_SIGN… 12977/12986/16476/15368/12768/21926/12984/17329/94185/16161/15980/16994/16169/12702/12982/21938/18712/16416/15945/12491/1618…
Here is what I did. The code below works for me.
dat_tmp_df <- dat %>% mutate(tmp_n_genes = str_split(core_enrichment, "/"))
dat_tmp_df$num_genes <- lapply(dat_tmp_df$tmp_n_genes, length) %>% unlist()
> dat_tmp_df %>% select(ID, core_enrichment, num_genes)
# A tibble: 22 x 3
ID core_enrichment num_genes
<chr> <chr> <int>
1 HALLMARK_E2F_TARGETS 101739/20382/13006/212377/114714/66622/140917/19139/18813/16647/20492/67241/103573/67054/19385/14852/12567/70699/20… 131
2 HALLMARK_G2M_CHECKPOINT 75717/103573/14852/18141/12567/26429/20842/17975/12545/20641/21781/19357/17216/15331/12615/107823/13555/56403/26554… 102
3 HALLMARK_MYC_TARGETS_V1 66942/56200/27041/12729/68981/20810/27050/19934/110639/66235/12237/70316/26965/109801/12785/103136/11757/16211/1867… 122
4 HALLMARK_INTERFERON_GAMMA_… 14293/12575/246728/12265/12984/16149/14969/17329/17750/626578/14129/21928/99899/231655/17858/66141/57444/14960/1001… 84
5 HALLMARK_TNFA_SIGNALING_VI… 14282/12977/19252/16476/14281/12575/21926/15200/22151/17872/21928/21664/14345/15980/13653/20303/12515/11852/74646/1… 55
6 HALLMARK_P53_PATHWAY 71839/12579/12795/27280/12606/16476/14281/12578/12575/15368/15200/11820/19734/17872/19143/16450/56312/71712/22337/6… 39
7 HALLMARK_SPERMATOGENESIS 17344/15512/23885/12326/71838/18952/15925/14056/16162/27214/20496/18551/21821/20878/12442/106344/22137/53604/215387… 28
8 HALLMARK_INFLAMMATORY_RESP… 19222/192187/216799/14293/12977/12986/19204/12575/12267/15200/17329/19734/13733/13136/15980/20288/19217/13058/12515… 51
9 HALLMARK_MITOTIC_SPINDLE 21844/233406/110033/12190/240641/26934/236266/56699/105988/16906/71819/67052/12488/67141/229841/20878/18817/208084/… 38
10 HALLMARK_IL6_JAK_STAT3_SIG… 12977/12986/16476/15368/12768/21926/12984/17329/94185/16161/15980/16994/16169/12702/12982/21938/18712/16416/15945/1… 25
I wonder whether there is a more elegant way to do this with dplyr. My code works, but it looks like spaghetti code.
【Comments】:
【Answer 1】: You can use the following solution:
library(dplyr)
library(stringr)
df %>%
  # each run of digits is one ID, so counting the runs counts the IDs
  mutate(count_str = str_count(core_enrichment, "\\d+"))
# A tibble: 2 x 3
# Rowwise:
ID core_enrichment count_str
<chr> <chr> <int>
1 HALLMARK_E2F_TARGETS 101739/20382/13006/212377/114714/66622/140917 7
2 ALLMARK_G2M_CHECKPOINT 75717/103573/14852/18141/12567/26429/20842/17975/12545 9
Data
df <- structure(list(ID = c("HALLMARK_E2F_TARGETS", "ALLMARK_G2M_CHECKPOINT"),
    core_enrichment = c("101739/20382/13006/212377/114714/66622/140917",
        "75717/103573/14852/18141/12567/26429/20842/17975/12545")),
    class = "data.frame", row.names = c(NA, -2L))
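For comparison, the asker's split-then-count approach can also be collapsed into a single mutate() call. This is only a sketch: it assumes the example data above is stored in a data frame named df, and it relies on base R's lengths(), which returns the length of each element of a list, so the separate lapply()/unlist() step is not needed.

```r
library(dplyr)
library(stringr)

# Example data from the answer above (the object name df is an assumption)
df <- data.frame(
  ID = c("HALLMARK_E2F_TARGETS", "ALLMARK_G2M_CHECKPOINT"),
  core_enrichment = c(
    "101739/20382/13006/212377/114714/66622/140917",
    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
  )
)

# str_split() returns a list with one character vector per row;
# lengths() then gives the number of pieces in each of those vectors
res <- df %>%
  mutate(num_genes = lengths(str_split(core_enrichment, "/")))
```

Unlike str_count(), this variant actually builds the split vectors, so it may be the more natural choice when the individual IDs are needed afterwards.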
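【Comments】:
Thanks, it looks great!
【Answer 2】: Similar to the previous answer, but without assuming that the gene IDs are necessarily numeric. You can do the following: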
library(dplyr)
library(stringr)
Here we count the number of forward slashes and add 1:
dat %>% mutate(num_genes=str_count(core_enrichment,"/")+1)
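To illustrate why the two answers can disagree, here is a small sketch on made-up data. The gene symbols, the column names by_digits, by_slashes and guarded, and the empty-string row are all hypothetical and not taken from the question. It also shows one possible guard for empty or missing entries, where "number of slashes plus 1" would otherwise report one item.

```r
library(dplyr)
library(stringr)

# Hypothetical data: gene symbols instead of numeric IDs, plus an empty entry
toy <- tibble(
  core_enrichment = c("Actb/Gapdh/Tp53", "H2-K1/Cd8a", "")
)

res <- toy %>%
  mutate(
    # counts runs of digits, which only matches the item count for numeric IDs
    by_digits = str_count(core_enrichment, "\\d+"),
    # counts separators, independent of what the IDs look like
    by_slashes = str_count(core_enrichment, "/") + 1L,
    # same as by_slashes, but treats empty or missing strings as zero items
    guarded = if_else(
      is.na(core_enrichment) | core_enrichment == "",
      0L,
      str_count(core_enrichment, "/") + 1L
    )
  )
```

For the purely numeric IDs in the question, both counting rules should give the same result. The guard only matters if some rows can be empty or NA.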
【Comments】:
Awesome. Thanks!
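As a further variant that neither answer uses, the column can be reshaped to one row per ID and then counted per group. The sketch below assumes the tidyr package is available and reuses the two-row example data from Answer 1; the object names df and res are illustrative only.

```r
library(dplyr)
library(tidyr)

# Example data from Answer 1 (the object name df is an assumption)
df <- data.frame(
  ID = c("HALLMARK_E2F_TARGETS", "ALLMARK_G2M_CHECKPOINT"),
  core_enrichment = c(
    "101739/20382/13006/212377/114714/66622/140917",
    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
  )
)

res <- df %>%
  # one row per gene ID, with the gene set ID repeated on each row
  separate_rows(core_enrichment, sep = "/") %>%
  # number of rows (gene IDs) per gene set
  count(ID, name = "num_genes")
```

This is heavier than str_count(), but the intermediate long table can be handy when the individual IDs need to be joined or filtered later.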