Getting counts of membership in combination of groups using sparklyr or dplyr

Posted: 2021-10-29 14:32:27

Question:

I have a Spark data frame that I manipulate with sparklyr. It looks like this:

input_data <- data.frame(id = c(10,10,10,20,20,30,30,40,40,40,50,60,70, 80,80,80,100,100,110,110,120,120,120,130,140,150,160,170), 
           date = c("2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-02","2021-01-03","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-02","2021-01-01","2021-01-02","2021-01-01","2021-01-02","2021-01-05","2021-01-01","2021-01-05"), 
           group = c("A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A","B","A", "B", "C", "B", "C", "A", "C", "A", "A", "A", "C", "A", "A", "B","A"), 
           event = c(1,1,1,0,1,0,1,0,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,1,0))

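I would like to summarise the data so that, for each combination of groups, I get the number of "events" (where event == 1) and "non-events" (where event == 0). The final output should look like the following: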

data.frame(group_a = c(1,0,0,1,0,1), 
           group_b = c(0,1,0,1,1,0), 
           group_c = c(0,0,1,0,1,1), 
           event_occured = c(3,1,2,0,2,2), 
           event_not_occured = c(4,2,2,0,2,2))

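So, for example, there is no combination in which A and B are groups with the same ID, hence the event and non_event counts for that combination are 0. There are 4 IDs in which group A was involved, of which 3 resulted in an event and 1 resulted in a non_event, and so on.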

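Which approach using sparklyr (or dplyr, or pyspark) would produce the aggregation described above? I tried the following, but I get exactly the same number for event and event_not_occured, so I must be doing something wrong that I cannot pinpoint: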

combo_path_sdf <- input_data %>%
  group_by(id) %>%
  arrange(date) %>%
  mutate(order_seq = ifelse(event > 0, 1, NA)) %>%
  mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
  mutate(order_seq = ifelse((row_number() == 1) & (event > 0), -1, ifelse(row_number() == 1, 0, order_seq))) %>% 
  ungroup()

combo_path_sdf %>%
  group_by(id, order_seq) %>%
  summarize(group_a = max(ifelse(group_a == "A", 1, 0)),
            group_b = max(ifelse(group_b == "B", 1, 0)),
            group_c = max(ifelse(group_c == "C", 1, 0)),
            events = sum(event)) %>%
  group_by(order_seq, group_a, group_b, group_c) %>% 
  summarize(event = sum(events),
            total_sequences = n()) %>%
  mutate(event_not_occured = total_sequences - event)

A final output in the following format would also be fine:

data.frame(group_a = c("A", "B", "C", "A,B", "B,C", "A,C"), 
           event_occured = c(3,1,2,1,2,2), 
           event_not_occured = c(4,2,2,1,2,2))

(The A,B row shown as 0,0 is incorrect; it should be 1,1.)

Comments:

Does the data you show match the expected output? Why is the A, B row 0 for both event types? In your data, ID 10 has events for both A and B.

Oh, that is a mistake, you are right.

Answer 1:

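The following matches the output format you requested and processes the data the way I understand you want, but (per @Martin Gal's comment) it does not match the example result you provided.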

input_data %>%
  group_by(id) %>%
  summarise(group_a = max(ifelse(group == 'A', 1, 0)),
            group_b = max(ifelse(group == 'B', 1, 0)),
            group_c = max(ifelse(group == 'C', 1, 0)),
            event_occured = sum(ifelse(event == 1, 1, 0)),
            event_not_occured = sum(ifelse(event == 0, 1, 0)),
            .groups = "drop") %>%
  group_by(group_a, group_b, group_c) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")

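The idea is a two-step summary. The first summarise collapses the rows of each id into one row: an indicator per group (1 if the id ever appears in that group) plus the counts of events and non-events. The second summarise then adds those counts up across all ids that share the same combination of indicators.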
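To make the two steps concrete, here is a minimal sketch that keeps the intermediate per-id table so it can be inspected. It assumes plain dplyr on a local data frame; `toy`, `per_id` and `per_combo` are names invented for illustration, and `toy` holds only ids 10, 20, 30 and 100 from the question's data (date column dropped).

```r
library(dplyr)

# Hypothetical subset of the question's data: ids 10, 20, 30 and 100 only
toy <- data.frame(
  id    = c(10, 10, 10, 20, 20, 30, 30, 100, 100),
  group = c("A", "B", "C", "B", "C", "A", "C", "B", "C"),
  event = c(1, 1, 1, 0, 1, 0, 1, 0, 1)
)

# Step 1: one row per id, with a 0/1 indicator per group and event counts
per_id <- toy %>%
  group_by(id) %>%
  summarise(group_a = max(ifelse(group == "A", 1, 0)),
            group_b = max(ifelse(group == "B", 1, 0)),
            group_c = max(ifelse(group == "C", 1, 0)),
            event_occured = sum(ifelse(event == 1, 1, 0)),
            event_not_occured = sum(ifelse(event == 0, 1, 0)),
            .groups = "drop")

# Step 2: one row per combination of indicators, summing the per-id counts
per_combo <- per_id %>%
  group_by(group_a, group_b, group_c) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")

print(per_id)     # 4 rows, one per id
print(per_combo)  # 3 rows, one per observed combination
```

In this subset, ids 20 and 100 both have the indicators (0, 1, 1), so step 2 merges them into a single B,C row with 2 events and 2 non-events. Combinations that no id exhibits simply do not appear in the result.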

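As for your code producing identical counts of events and non-events: take a look at hts_combined. It is not defined in the code you shared, so your script may be reading that variable from somewhere else.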
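The question also accepts an output with a single label column such as "A,B" instead of three indicator columns. The sketch below is one possible way to get that shape with local dplyr, not something taken from the original answer. `toy`, `labelled` and the column name `combo` are hypothetical, and `toy` is again only a subset of the question's data. On a Spark table, `paste()` over a group is unlikely to translate to SQL; Spark SQL's `collect_set`, `sort_array` and `concat_ws` would be the likely substitutes, but that variant is untested here.

```r
library(dplyr)

# Hypothetical subset of the question's data: ids 10, 20, 30 and 100 only
toy <- data.frame(
  id    = c(10, 10, 10, 20, 20, 30, 30, 100, 100),
  group = c("A", "B", "C", "B", "C", "A", "C", "B", "C"),
  event = c(1, 1, 1, 0, 1, 0, 1, 0, 1)
)

labelled <- toy %>%
  group_by(id) %>%
  # Build one label per id from its distinct groups, e.g. "B,C"
  summarise(combo = paste(sort(unique(group)), collapse = ","),
            event_occured = sum(event == 1),
            event_not_occured = sum(event == 0),
            .groups = "drop") %>%
  # Add up the per-id counts for ids that share a label
  group_by(combo) %>%
  summarise(event_occured = sum(event_occured),
            event_not_occured = sum(event_not_occured),
            .groups = "drop")

print(labelled)
```

Sorting the distinct groups before pasting keeps the label independent of row order, so an id seen as C then B gets the same "B,C" label as one seen as B then C.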
