Split data frame string column and count items. (dplyr and R)


Posted: 2022-01-22 08:48:37

Question:

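My data looks like the following. What I want to do is split the entries of the core_enrichment column, which are joined by "/", and count how many IDs (e.g. 101739, 20382, 13006, ...) there are in each row.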

> dat %>% select(ID, core_enrichment)
# A tibble: 22 x 2
   ID                           core_enrichment                                                                                                              
   <chr>                        <chr>                                                                                                                        
 1 HALLMARK_E2F_TARGETS         101739/20382/13006/212377/114714/66622/140917/19139/18813/16647/20492/67241/103573/67054/19385/14852/12567/70699/20842/70472…
 2 HALLMARK_G2M_CHECKPOINT      75717/103573/14852/18141/12567/26429/20842/17975/12545/20641/21781/19357/17216/15331/12615/107823/13555/56403/26554/11991/77…
 3 HALLMARK_MYC_TARGETS_V1      66942/56200/27041/12729/68981/20810/27050/19934/110639/66235/12237/70316/26965/109801/12785/103136/11757/16211/18673/20462/1…
 4 HALLMARK_INTERFERON_GAMMA_R… 14293/12575/246728/12265/12984/16149/14969/17329/17750/626578/14129/21928/99899/231655/17858/66141/57444/14960/100121/80876/…
 5 HALLMARK_TNFA_SIGNALING_VIA… 14282/12977/19252/16476/14281/12575/21926/15200/22151/17872/21928/21664/14345/15980/13653/20303/12515/11852/74646/18227/7171…
 6 HALLMARK_P53_PATHWAY         71839/12579/12795/27280/12606/16476/14281/12578/12575/15368/15200/11820/19734/17872/19143/16450/56312/71712/22337/64058/1660…
 7 HALLMARK_SPERMATOGENESIS     17344/15512/23885/12326/71838/18952/15925/14056/16162/27214/20496/18551/21821/20878/12442/106344/22137/53604/215387/72391/73…
 8 HALLMARK_INFLAMMATORY_RESPO… 19222/192187/216799/14293/12977/12986/19204/12575/12267/15200/17329/19734/13733/13136/15980/20288/19217/13058/12515/16402/25…
 9 HALLMARK_MITOTIC_SPINDLE     21844/233406/110033/12190/240641/26934/236266/56699/105988/16906/71819/67052/12488/67141/229841/20878/18817/208084/17318/218…
10 HALLMARK_IL6_JAK_STAT3_SIGN… 12977/12986/16476/15368/12768/21926/12984/17329/94185/16161/15980/16994/16169/12702/12982/21938/18712/16416/15945/12491/1618…

What I did is the code below, and it works for me.

dat_tmp_df <- dat %>% mutate(tmp_n_genes = str_split(core_enrichment, "/"))
dat_tmp_df$num_genes <- lapply(dat_tmp_df$tmp_n_genes, length) %>% unlist()

> dat_tmp_df %>% select(ID, core_enrichment, num_genes)
# A tibble: 22 x 3
   ID                          core_enrichment                                                                                                      num_genes
   <chr>                       <chr>                                                                                                                    <int>
 1 HALLMARK_E2F_TARGETS        101739/20382/13006/212377/114714/66622/140917/19139/18813/16647/20492/67241/103573/67054/19385/14852/12567/70699/20…       131
 2 HALLMARK_G2M_CHECKPOINT     75717/103573/14852/18141/12567/26429/20842/17975/12545/20641/21781/19357/17216/15331/12615/107823/13555/56403/26554…       102
 3 HALLMARK_MYC_TARGETS_V1     66942/56200/27041/12729/68981/20810/27050/19934/110639/66235/12237/70316/26965/109801/12785/103136/11757/16211/1867…       122
 4 HALLMARK_INTERFERON_GAMMA_… 14293/12575/246728/12265/12984/16149/14969/17329/17750/626578/14129/21928/99899/231655/17858/66141/57444/14960/1001…        84
 5 HALLMARK_TNFA_SIGNALING_VI… 14282/12977/19252/16476/14281/12575/21926/15200/22151/17872/21928/21664/14345/15980/13653/20303/12515/11852/74646/1…        55
 6 HALLMARK_P53_PATHWAY        71839/12579/12795/27280/12606/16476/14281/12578/12575/15368/15200/11820/19734/17872/19143/16450/56312/71712/22337/6…        39
 7 HALLMARK_SPERMATOGENESIS    17344/15512/23885/12326/71838/18952/15925/14056/16162/27214/20496/18551/21821/20878/12442/106344/22137/53604/215387…        28
 8 HALLMARK_INFLAMMATORY_RESP… 19222/192187/216799/14293/12977/12986/19204/12575/12267/15200/17329/19734/13733/13136/15980/20288/19217/13058/12515…        51
 9 HALLMARK_MITOTIC_SPINDLE    21844/233406/110033/12190/240641/26934/236266/56699/105988/16906/71819/67052/12488/67141/229841/20878/18817/208084/…        38
10 HALLMARK_IL6_JAK_STAT3_SIG… 12977/12986/16476/15368/12768/21926/12984/17329/94185/16161/15980/16994/16169/12702/12982/21938/18712/16416/15945/1…        25

I wonder whether there is a more elegant way to do this with dplyr. My code works, but it looks like spaghetti code.


Answer 1:

You can use the following solution:

library(dplyr)
library(stringr)

df %>%
  mutate(count_str = str_count(core_enrichment, "\\d+"))

# A tibble: 2 x 3
# Rowwise: 
  ID                     core_enrichment                                        count_str
  <chr>                  <chr>                                                      <int>
1 HALLMARK_E2F_TARGETS   101739/20382/13006/212377/114714/66622/140917                  7
2 ALLMARK_G2M_CHECKPOINT 75717/103573/14852/18141/12567/26429/20842/17975/12545         9

Data

# Assign the sample data to `df`, the name used in the pipeline above
df <- structure(list(ID = c("HALLMARK_E2F_TARGETS", "ALLMARK_G2M_CHECKPOINT"
), core_enrichment = c("101739/20382/13006/212377/114714/66622/140917", 
"75717/103573/14852/18141/12567/26429/20842/17975/12545")), class = "data.frame", row.names = c(NA, 
-2L))
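As a cross-check on the regex count, here is a minimal sketch that rebuilds the same two-row sample and compares `str_count()` with a split-and-count approach using base R's `lengths()`. The column name `count_split` and the object name `res` are illustrative, not part of the original answer.

```r
library(dplyr)
library(stringr)

# Same two-row sample as in the answer above
df <- data.frame(
  ID = c("HALLMARK_E2F_TARGETS", "ALLMARK_G2M_CHECKPOINT"),
  core_enrichment = c(
    "101739/20382/13006/212377/114714/66622/140917",
    "75717/103573/14852/18141/12567/26429/20842/17975/12545"
  )
)

res <- df %>%
  mutate(
    # Count runs of digits (assumes every ID is purely numeric)
    count_str = str_count(core_enrichment, "\\d+"),
    # Split on "/" and take the length of each resulting vector
    count_split = lengths(str_split(core_enrichment, "/"))
  )

res[, c("ID", "count_str", "count_split")]
```

Both columns should agree (7 and 9 for this sample). `lengths()` is vectorised over the list returned by `str_split()`, so it replaces the `lapply(..., length) %>% unlist()` step from the question in a single call.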

Comments:

Thanks, it looks great!

Answer 2:

Similar to the previous answer, but without assuming that the gene IDs are numeric. You can do the following:

library(dplyr)
library(stringr)

Here we count the number of forward slashes and add 1:

dat %>% mutate(num_genes = str_count(core_enrichment, "/") + 1)
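One thing to watch with the "slashes plus 1" approach is that an empty string would be counted as 1 item and a missing value would stay `NA`. The sketch below, on hypothetical toy data (the `toy` tibble, its IDs, and the gene symbols are made up for illustration), guards against both cases and also shows why the digit regex from the first answer is unsuitable when IDs are not purely numeric.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: gene symbols, a single gene, an empty string, and NA
toy <- tibble(
  ID = c("set_a", "set_b", "set_c", "set_d"),
  core_enrichment = c("Trp53/Cdk1/Myc", "Brca2", "", NA)
)

out <- toy %>%
  mutate(
    # Treat empty or missing strings as zero items; otherwise slashes + 1
    num_genes = if_else(
      is.na(core_enrichment) | core_enrichment == "",
      0L,
      str_count(core_enrichment, "/") + 1L
    ),
    # Digit-run count, shown only for comparison: it miscounts symbols
    num_digit_runs = str_count(core_enrichment, "\\d+")
  )

out
```

For `"Trp53/Cdk1/Myc"` the slash-based count gives 3, while the digit-run count gives 2 because `Myc` contains no digits. Whether empty or missing entries should count as 0 is an assumption here; adjust the guard if your data calls for `NA` instead.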

Comments:

Awesome. Thanks!
