Execute dplyr operation only if column exists

Posted 2023-03-11

[Title]: Execute dplyr operation only if column exists [Posted]: 2017-12-22 03:15:17 [Question]:

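Following the discussion on conditional dplyr evaluation, I would like to conditionally execute a step in a pipeline depending on whether the referenced column exists in the data frame that is passed in.

Example

The results generated by 1) and 2) should be identical.

Existing column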

# 1)
mtcars %>% 
  filter(am == 1) %>%
  filter(cyl == 4)

# 2)
mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(cyl == 4) else .
  }

Unavailable column

# 1)
mtcars %>% 
  filter(am == 1)

# 2)    
mtcars %>%
  filter(am == 1) %>%
  {
    if("absent_column" %in% names(.)) filter(absent_column == 4) else .
  }

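Problem

For the available column, the object passed to filter does not correspond to the initial data frame. The original code returns the error message:

Error in filter(cyl == 4) : object 'cyl' not found

I have tried an alternative syntax (with no luck):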

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(.$cyl == 4) else .
  }

Error in UseMethod("filter_") : 
  no applicable method for 'filter_' applied to an object of class "logical"

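Follow-up

I would like to extend the question to cover evaluation on the right-hand side of == in the filter call. For instance, the syntax below attempts to filter on the first available value.

mtcars %>%
  filter({
    if ("does_not_ex" %in% names(.))
      does_not_ex
    else
      NULL
  } == {
    if ("does_not_ex" %in% names(.))
      unique(.[['does_not_ex']])
    else
      NULL
  })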

As expected, the call results in an error message:

Error in filter_impl(.data, quo) : Result must have length 32, not 0

When applied to an existing column:

mtcars %>%
  filter({
    if ("mpg" %in% names(.))
      mpg
    else
      NULL
  } == {
    if ("mpg" %in% names(.))
      unique(.[['mpg']])
    else
      NULL
  })

It works, but with a warning message:

  mpg cyl disp  hp drat   wt  qsec vs am gear carb
1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

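Warning message: longer object length is not a multiple of shorter object length

Follow-up question

Is there a neat way of extending the existing syntax so as to get conditional evaluation on the right-hand side of the filter call, ideally staying within the dplyr workflow?

[Comments]:

You just need another ., as in if("cyl" %in% names(.)) filter(., cyl == 4) else . There is a similar Q&A here: ***.com/a/44001834

[Answer 1]: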

Because of the way scoping works here, you cannot access the data frame from within the if statement. Fortunately, you don't need to.

Try:

mtcars %>%
  filter(am == 1) %>%
  filter({if("cyl" %in% names(.)) cyl else NULL} == 4)

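Here you can use the "." object in the condition to check whether the column exists and, if it does, return that column to the filter function.

Edit: per docendo discimus' comment on the question, you can access the data frame, but not implicitly, i.e. you have to refer to it explicitly with .

[Comments]:

Regarding "you cannot access the data frame from within the if statement": I don't think that's quite right; see my comment on the original post.

Yes, you're right. I meant implicitly, but my wording was poor. As you point out, you can use it by accessing .

This solution no longer works (try changing the string "cyl" to something that doesn't exist). Felipe Gerard's answer does.

[Answer 2]: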

Edit: unfortunately, this was too good to be true

I may be a bit late to the party. But is

mtcars %>% 
 filter(am == 1) %>%
 try(filter(absent_column== 4))

a solution?
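A note on why this leaves the data unchanged: the pipe places its left-hand side in the first argument, so the call becomes try(<data>, filter(absent_column == 4)). The data frame is the expression being tried, and the filter() call lands in try's silent argument, which is not evaluated when no error occurs. The sketch below swaps in tryCatch, which is my substitution rather than this answer's code; filter_or_keep is a made-up name, and the approach swallows every error, not only a missing column.

```r
library(dplyr)

# Hypothetical helper: try the filter, fall back to the input on any error.
# Caveat: this hides all errors, not just "column not found".
filter_or_keep <- function(df) {
  tryCatch(
    filter(df, absent_column == 4),
    error = function(e) df
  )
}

res <- mtcars %>%
  filter(am == 1) %>%
  filter_or_keep()
```

Because the condition is never checked explicitly, an explicit test on names() is usually the safer choice.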

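[Comments]:

Running try(filter(cyl == 4)) does not seem to work properly: it returns the unmodified data frame, whereas it should return an object equivalent to applying filter(cyl == 4), since the cyl column exists.

[Answer 3]:

I know I'm late, but here is an answer more in line with your original idea: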

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(., cyl == 4) else .
  }

Basically, you were missing the . in filter. This is needed because the pipe does not add . to filter(expr) when it sits inside an expression surrounded by {}.
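The same check can be wrapped in a small function so it can be reused in a pipeline. This is only a sketch, not part of the answer above: the name filter_if_present and its arguments are made up for illustration, .data[[col]] is the pronoun for looking up a column by its string name, and !!value injects the comparison value so that a column called value cannot shadow it.

```r
library(dplyr)

# Hypothetical helper: keep rows where `col` equals `value`,
# or return `df` untouched when `col` is not a column of `df`.
filter_if_present <- function(df, col, value) {
  if (col %in% names(df)) {
    filter(df, .data[[col]] == !!value)
  } else {
    df
  }
}

with_cyl <- mtcars %>%
  filter(am == 1) %>%
  filter_if_present("cyl", 4)

without_cyl <- mtcars %>%
  select(-cyl) %>%
  filter(am == 1) %>%
  filter_if_present("cyl", 4)
```

Since the helper takes the data frame as its first argument, the pipe supplies it automatically and no braces are needed.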

[Answer 4]:

This code does the trick and is quite flexible. ^ and $ are regex anchors used to force an exact match.

library(dplyr)    # %>%
library(purrr)    # set_names()
library(stringr)  # str_replace()

mtcars %>% 
  set_names(names(.) %>% 
              str_replace("am","1") %>% 
              str_replace("^cyl$","2") %>% 
              str_replace("Doesn't Exist","3")
              )

[Answer 5]:

Avoid this pitfall:

On a busy day, one might write the following:

library(dplyr)
df <- data.frame(A = 1:3, B = letters[1:3], stringsAsFactors = FALSE)
df %>% mutate(C = ifelse("D" %in% colnames(.), D, B))
# Note the values in column "C": no error is thrown, but the logic and the result are wrong
  A B C
1 1 a a
2 2 b a
3 3 c a

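Why? Because "D" %in% colnames(.) returns a single value, TRUE or FALSE, so ifelse runs only once and returns a single value. That value is then broadcast to the whole column!

The correct way:

df %>% mutate(C = if ("D" %in% colnames(.)) D else B)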
  A B C
1 1 a a
2 2 b b
3 3 c c
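The length difference behind this pitfall can be checked outside of mutate. A minimal base R sketch with made-up vectors (B and has_D are illustrative names, not from the answer):

```r
B <- c("a", "b", "c")
has_D <- "D" %in% c("A", "B")  # a single FALSE

# ifelse() returns a result shaped like its test, so only one element comes back
from_ifelse <- ifelse(has_D, NA, B)

# if/else evaluates the chosen branch and returns it whole
from_if <- if (has_D) NA else B

length(from_ifelse)  # 1
length(from_if)      # 3
```

Inside mutate, the length-one result is recycled to every row, which is why column C repeated "a".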

[Answer 6]:

With across() in dplyr > 1.0.0, you can now use any_of when filtering. For comparison, the original with all columns present:

mtcars %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)

After removing cyl, this throws an error:

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)

Using any_of (note that you have to write "cyl" rather than cyl):

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(across(any_of("cyl"), ~.x == 4))
#N.B. this is equivalent to just filtering by `am == 1`.
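In later dplyr releases (1.0.8 onwards, as far as I know), using across() inside filter() triggers a deprecation warning that points to if_any() and if_all(). Below is a hedged sketch of the same filter written with if_all(). It assumes that an empty any_of() selection makes if_all() keep every row, which is worth verifying on your installed version; the variable names are made up for illustration.

```r
library(dplyr)

# cyl removed: the selection is empty, so only `am == 1` should apply
kept_without_cyl <- mtcars %>%
  select(!cyl) %>%
  filter(am == 1) %>%
  filter(if_all(any_of("cyl"), ~ .x == 4))

# cyl present: both conditions apply
kept_with_cyl <- mtcars %>%
  filter(am == 1) %>%
  filter(if_all(any_of("cyl"), ~ .x == 4))
```

if_all() is used rather than if_any() here because "all of zero conditions" holding is the behaviour that leaves the data unfiltered.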
