Odd behavior of summarise across in dplyr

Posted 2023-02-14

[Title]: Odd behavior of summarise across in DPLYR [Posted]: 2021-07-21 07:51:03 [Question]:

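I have two large tables (~12k x 6) from a survey of children and their parents. The tables are identical in dimensions and types/classes, and were read into R in the same way. After some wrangling (again, the same steps for both children and parents), I run the following code: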

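Update: it turns out the root of my problem is variable C, which only takes the values 0 and 1 in the Children dataset. Is there any way to work around this error when using summarise with table?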

Parents %>% 
  summarise(across(A, ~ table(.x)),
            across(B, ~table(.x)),
            across(C, ~ table(.x)),
            across(D, ~ table(.x)),
            across(E, ~ table(.x)))

Children %>%  
  summarise(across(A, ~ table(.x)),
            across(B, ~table(.x)),
            across(C, ~ table(.x)),
            across(D, ~ table(.x)),
            across(E, ~ table(.x)))

For Parents, I get the following output (the frequencies of the unique values, which are (1, 2, 3) for variable D and (0, 1, 2) for the others):

        A          B      C           D      E
1   11840      11835  11409       11363    519
2      35         42    436         473   4912
3       3          1     33          42   6447

For Children, I get the following error:

Error: Problem with `summarise()` input `..5`.
x Input `..5` must be size 4 or 1, not 3.
ℹ An earlier column had size 4.
ℹ Input `..5` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Run `rlang::last_error()` to see where the error occurred.

Running rlang::last_error() returns:

<error/dplyr_error>
Problem with `summarise()` input `..5`.
x Input `..5` must be size 4 or 1, not 3.
ℹ An earlier column had size 4.
ℹ Input `..5` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Backtrace:
Run `rlang::last_trace()` to see the full context.

Running rlang::last_trace() returns:

<error/dplyr_error>
Problem with `summarise()` input `..5`.
x Input `..5` must be size 4 or 1, not 3.
ℹ An earlier column had size 4.
ℹ Input `..5` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Backtrace:
     █
  1. ├─`%>%`(...)
  2. ├─dplyr::summarise(...)
  3. ├─dplyr:::summarise.data.frame(...)
  4. │ └─dplyr:::summarise_cols(.data, ...)
  5. │   └─base::withCallingHandlers(...)
  6. ├─dplyr:::abort_glue(...)
  7. │ ├─rlang::exec(abort, class = class, !!!data)
  8. │ └─(function (message = NULL, class = NULL, ..., trace = NULL, parent = NULL, ...
  9. │   └─rlang:::signal_abort(cnd)
 10. │     └─base::signalCondition(cnd)
 11. └─(function (e) ...

Does anyone know what is going on?

For a sanity check, here is the output of str:

> str(Parents)
'data.frame':   11878 obs. of  6 variables:
 $ ID         : chr  "Parent 1" "Parent 2" "Parent 3" "Parent 4" ...
 $ A          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ B          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ C          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ D          : num  2 2 1 2 3 3 2 3 3 2 ...
 $ E          : num  0 0 0 0 0 0 0 0 0 0 ...
> str(Children)
'data.frame':   11878 obs. of  6 variables:
 $ ID         : chr  "Child 1" "Child 2" "Child 3" "Child 4" ...
 $ A          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ B          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ C          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ D          : num  2 2 1 2 3 3 2 3 3 2 ...
 $ E          : num  0 0 0 0 0 0 0 0 0 0 ...

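[Comments]:

First of all, summarise(across(A:E, ~ table(.x))) or simply summarise(across(A:E, table)) would avoid a lot of the repetition.

I know. I changed the var/df names because this is sensitive data. I usually include the actual var names so that I can quickly tell what I am running in my script. In this case I don't mind the redundancy. Thanks, though.

What are you trying to achieve with table here? Do you want to know which values occur in each column and how frequent they are?

I want to know exactly the counts of the values.

[Answer 1]: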

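table does not always fit well into a tidyverse pipeline, because the number of values it returns can differ from column to column. I think it is better to get the data into long format and use count. You get the same information, just in long format.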

library(dplyr)
library(tidyr)

Parents %>%  pivot_longer(cols = A:E) %>% count(name, value)

The same works for the Children data.
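To make the answer's point concrete, here is a minimal sketch on made-up data. Children_toy and its values are hypothetical and are not the survey data from the question. It first shows that table() returns a different number of entries for a column with fewer distinct values, then applies the pivot_longer/count approach. The final pivot_wider step is an optional addition, not part of the original answer, for readers who want a wide layout with zeros where a value never occurs.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the survey data): C only takes the values 0 and 1,
# while A and D each take three distinct values.
Children_toy <- data.frame(
  ID = paste("Child", 1:6),
  A  = c(0, 0, 1, 2, 0, 1),
  C  = c(0, 0, 1, 0, 1, 0),
  D  = c(1, 2, 3, 2, 3, 3)
)

# table() returns one entry per distinct value, so the lengths differ by column.
table_lengths <- sapply(Children_toy[c("A", "C", "D")],
                        function(x) length(table(x)))

# Long format: one row per (column, value) pair, so unequal lengths are harmless.
counts_long <- Children_toy %>%
  pivot_longer(cols = c(A, C, D)) %>%
  count(name, value)

# Optional: back to a wide layout, filling absent combinations with 0.
counts_wide <- counts_long %>%
  pivot_wider(names_from = name, values_from = n, values_fill = 0L) %>%
  arrange(value)

print(table_lengths)
print(counts_wide)
```

Because every (column, value) pair becomes its own row, a column with only two distinct values simply contributes fewer rows, and nothing has to be recycled to a common length.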
