有没有办法“合并”两列,其中新列的值是具有特定值的原始列的名称,分组明智?

Posted

技术标签:

【中文标题】有没有办法“合并”两列,其中新列的值是具有特定值的原始列的名称,分组明智?【英文标题】:Is there a way to 'merge' two columns, where the values of new column are the name of the original column that had a specific value, group wise? 【发布时间】:2020-03-26 01:26:05 【问题描述】:

我有一个数据框(将其称为“df”),其中包含相当数量的变量(数字、逻辑和字符),代表不同细胞类型从特定介质移动到另一种介质的实验,以及在特定时间对细胞进行量化。第一列和第二列分别包含“源”介质的名称和细胞移动到的介质的名称;第三列描述了活动被量化的时间,第四列是细胞类型,第五列是测量的活动,这就是有趣的地方。

我有两个主要问题,第一个是要知道是否有“R-esque”方式来做我所做的以获得第六列,其中包含值的增加/减少(百分比) 'Activity' 相对于上一行中的活动,但以组的方式(每个组由 Cell.Type、Pre.Medium 和 Time 的组合组成),这就是为什么每次 Time 的值为零时它的值都是 NA .

假设这是我的数据框(我已对其进行了简化以使我的问题更清楚):

df <- structure(list(Pre.Medium = c("Medium1", "Medium1", "Medium1", 
"Medium2", "Medium2", "Medium2", "Medium1", "Medium1", "Medium1", 
"Medium2", "Medium2", "Medium2"), Pos.Medium = c("Medium2", "Medium2", 
"Medium2", "Medium1", "Medium1", "Medium1", "Medium2", "Medium2", 
"Medium2", "Medium1", "Medium1", "Medium1"), Time = c(0, 2, 4, 
0, 2, 4, 0, 2, 4, 0, 2, 4), Cell.Type = c("Cell_A", "Cell_A", 
"Cell_A", "Cell_A", "Cell_A", "Cell_A", "Cell_B", "Cell_B", "Cell_B", 
"Cell_B", "Cell_B", "Cell_B"), Activity = c(0.5, 1, 2, 2, 1, 
0.5, 0.2, 0.8, 0.2, 0.2, 0.2, 0.4), Percent.Increase = c(NA, 
100, 100, NA, -50, -50, NA, 300, -75, NA, 0, 100), Primary.Increase = c(NA, 
TRUE, FALSE, NA, TRUE, FALSE, NA, TRUE, FALSE, NA, FALSE, FALSE
), Secondary.Increase = c(NA, FALSE, FALSE, NA, FALSE, FALSE, 
NA, FALSE, FALSE, NA, FALSE, TRUE)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L), problems = structure(list(
    row = 1L, col = NA_character_, expected = "8 columns", actual = "9 columns", 
    file = "'new 2'"), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame")), spec = structure(list(cols = list(Pre.Medium = structure(list(), class = c("collector_character", 
"collector")), Pos.Medium = structure(list(), class = c("collector_character", 
"collector")), Time = structure(list(), class = c("collector_double", 
"collector")), Cell.Type = structure(list(), class = c("collector_character", 
"collector")), Activity = structure(list(), class = c("collector_double", 
"collector")), Percent.Increase = structure(list(), class = c("collector_double", 
"collector")), Primary.Increase = structure(list(), class = c("collector_logical", 
"collector")), Secondary.Increase = structure(list(), class = c("collector_logical", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))
### Pre.Med Pos.Med Time  Cell.Type Activity  Percent.Increase  Primary.Increase Secondary.Increase
### Medium1 Medium2   0    Cell_A    0.5           NA           NA                NA 
### Medium1 Medium2   2    Cell_A    1             100          TRUE              FALSE
### Medium1 Medium2   4    Cell_A    2             100          FALSE             FALSE
### Medium2 Medium1   0    Cell_A    2             NA           NA                NA
### Medium2 Medium1   2    Cell_A    1            -50           TRUE              FALSE
### Medium2 Medium1   4    Cell_A    0.5          -50           FALSE             FALSE
### Medium1 Medium2   0    Cell_B    0.2           NA           NA                NA
### Medium1 Medium2   2    Cell_B    0.8           300          TRUE              FALSE
### Medium1 Medium2   4    Cell_B    0.2          -75           FALSE             FALSE
### Medium2 Medium1   0    Cell_B    0.2           NA           NA                NA
### Medium2 Medium1   2    Cell_B    0.2           0            FALSE             FALSE
### Medium2 Medium1   4    Cell_B    0.4           100          FALSE             TRUE

我通过使用 group_by 和 mutate 函数,然后使用 lag 函数来计算上一行和上一行的增加/减少,有没有更好的方法呢?对于我的具体情况,滞后就足够了,但是如果我在每个“组”中有超过三个时间测量并且需要远远落后于计算呢?使用我的方法,在某些时候我不得不使用诸如 lag(lag(lag(lag(lag((Activity / lag(Activity)) - 1) * 100)))) 之类的东西。

另一件事是我无法以任何方式弄清楚的事情,它是将我的“宽”数据集变成一个长数据集,方法是将我的列“Primary.Increase”和“Secondary.Increase”到名为“Increase.Type”的列中,对于每个组(Cell.Type、Pre.Med 和 Time 的组合),其值将包含在列名(Primary.Response 或 Secondary.Response)中,其中其成员之一的值为 TRUE。它应该看起来像这样:

df <- structure(list(Pre.Med = c("Medium1", "Medium1", "Medium1", "Medium2", 
"Medium2", "Medium2", "Medium1", "Medium1", "Medium1", "Medium2", 
"Medium2", "Medium2"), Pos.Med = c("Medium2", "Medium2", "Medium2", 
"Medium1", "Medium1", "Medium1", "Medium2", "Medium2", "Medium2", 
"Medium1", "Medium1", "Medium1"), Time = c(0, 2, 4, 0, 2, 4, 
0, 2, 4, 0, 2, 4), Cell.Type = c("Cell_A", "Cell_A", "Cell_A", 
"Cell_A", "Cell_A", "Cell_A", "Cell_B", "Cell_B", "Cell_B", "Cell_B", 
"Cell_B", "Cell_B"), Activity = c(0.5, 1, 2, 2, 1, 0.5, 0.2, 
0.8, 0.2, 0.2, 0.2, 0.4), Percent.Inc = c(NA, 100, 100, NA, -50, 
-50, NA, 300, -75, NA, 0, 100), Increase.Type = c("Primary.Increase", 
"Primary.Increase", "Primary.Increase", "Primary.Increase", "Primary.Increase", 
"Primary.Increase", "Primary.Increase", "Primary.Increase", "Primary.Increase", 
"Secondary.Increase", "Secondary.Increase", "Secondary.Increase"
)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-12L), spec = structure(list(cols = list(Pre.Med = structure(list(), class = c("collector_character", 
"collector")), Pos.Med = structure(list(), class = c("collector_character", 
"collector")), Time = structure(list(), class = c("collector_double", 
"collector")), Cell.Type = structure(list(), class = c("collector_character", 
"collector")), Activity = structure(list(), class = c("collector_double", 
"collector")), Percent.Inc = structure(list(), class = c("collector_double", 
"collector")), Increase.Type = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))
### Pre.Med Pos.Med Time  Cell.Type Activity    Percent.Inc Increase.Type 
### Medium1 Medium2   0    Cell_A    0.5           NA         Primary.Increase
### Medium1 Medium2   2    Cell_A    1             100        Primary.Increase
### Medium1 Medium2   4    Cell_A    2             100        Primary.Increase
### Medium2 Medium1   0    Cell_A    2             NA         Primary.Increase
### Medium2 Medium1   2    Cell_A    1            -50         Primary.Increase
### Medium2 Medium1   4    Cell_A    0.5          -50         Primary.Increase
### Medium1 Medium2   0    Cell_B    0.2           NA         Primary.Increase
### Medium1 Medium2   2    Cell_B    0.8           300        Primary.Increase
### Medium1 Medium2   4    Cell_B    0.2          -75         Primary.Increase
### Medium2 Medium1   0    Cell_B    0.2           NA         Secondary.Increase
### Medium2 Medium1   2    Cell_B    0.2           0          Secondary.Increase     
### Medium2 Medium1   4    Cell_B    0.4           100        Secondary.Increase             

首先有没有办法做到这一点?我假设是这样,但到目前为止我还不能做到:/ 我是一名生物学本科生,对 R 比较陌生,我很喜欢你可以用它做什么,但我离擅长它还有很长的路要走。

非常感谢任何帮助。

【问题讨论】:

【参考方案1】:

我不确定我是否理解第一个问题。 如果您执行以下操作:

library(dplyr)

df %>%
  group_by(Cell.Type, Pre.Medium, Pos.Medium) %>%
  arrange(Time, .by_group = TRUE) %>% # remove if Time is always ascending
  mutate(Percent.Increase = ((Activity / lag(Activity)) - 1) * 100)

Percent.Increase的计算是向量化的, 所以Activity 有多长并不重要 (另请参阅下面我的最后解释)。

对于第二个问题, 如果我理解正确, 你可以这样做:

df %>%
  group_by(Cell.Type, Pre.Medium, Pos.Medium) %>%
  mutate(Increase.Type = if (any(Secondary.Increase, na.rm = TRUE)) "Secondary.Increase" else "Primary.Increase") %>%
  select(-(Primary.Increase:Secondary.Increase))
# A tibble: 12 x 7
# Groups:   Cell.Type, Pre.Medium, Pos.Medium [4]
   Pre.Medium Pos.Medium  Time Cell.Type Activity Percent.Increase Increase.Type     
   <chr>      <chr>      <dbl> <chr>        <dbl>            <dbl> <chr>             
 1 Medium1    Medium2        0 Cell_A         0.5               NA Primary.Increase  
 2 Medium1    Medium2        2 Cell_A         1                100 Primary.Increase  
 3 Medium1    Medium2        4 Cell_A         2                100 Primary.Increase  
 4 Medium2    Medium1        0 Cell_A         2                 NA Primary.Increase  
 5 Medium2    Medium1        2 Cell_A         1                -50 Primary.Increase  
 6 Medium2    Medium1        4 Cell_A         0.5              -50 Primary.Increase  
 7 Medium1    Medium2        0 Cell_B         0.2               NA Primary.Increase  
 8 Medium1    Medium2        2 Cell_B         0.8              300 Primary.Increase  
 9 Medium1    Medium2        4 Cell_B         0.2              -75 Primary.Increase  
10 Medium2    Medium1        0 Cell_B         0.2               NA Secondary.Increase
11 Medium2    Medium1        2 Cell_B         0.2                0 Secondary.Increase
12 Medium2    Medium1        4 Cell_B         0.4              100 Secondary.Increase

mutate 内部的转换会看到组中的所有值, 所以any(Secondary.Increase, na.rm = TRUE) 一次接收所有元素, 如果我们只返回 1 个值, 它将被复制以适应组大小。

【讨论】:

以上是关于有没有办法“合并”两列,其中新列的值是具有特定值的原始列的名称,分组明智?的主要内容,如果未能解决你的问题,请参考以下文章

如何根据合并的数据框之一的两列的值在熊猫数据框中添加值

如何从结果集中打印值而没有列中的任何重复记录

基于两列中的值合并其他列中的值

在 SQL 中,我可以在另一列中获取与它们没有关联的特定值的列中的值吗?

比较来自两个不同表的两列的逗号分隔值

Excel - 如何组合两列的值(并非在每种情况下)