Efficient Multiple Group-bys

Posted: 2021-02-05 11:06:57

Question:

I have the following table:

Year  Week  Day_1  Day_2  Day_3
2020  1     Walk   Jump   Swim
2020  3     Walk   Swim   Walk
2020  1     Jump   Walk   Swim

I want to group by year, week, and event (Walk, Jump, Swim) and count how many times each event occurs in Day_1, Day_2 and Day_3. I.e.

Year  Week  Event  Count_Day_1  Count_Day_2  Count_Day_3
2020  1     Walk   1            1            0
2020  3     Walk   1            0            1
2020  1     Jump   1            1            0
2020  3     Jump   0            0            0
2020  1     Swim   0            0            2
2020  3     Swim   0            1            0

How can I do this efficiently?

Comments:

As a possible solution: this answer on UNPIVOT, combined with conditional aggregation, may help.

Answer 1:

In BigQuery, I would unpivot using an array, then aggregate:

with t as (
       select 2020 as year, 1 as week, 'Walk' as day_1, 'Jump' as day_2, 'Swim' as day_3 union all
       select 2020, 3, 'Walk', 'Swim', 'Walk' union all
       select 2020, 1, 'Jump', 'Walk', 'Swim'
      )
select t.year, t.week, s.event,
       countif(day = 1) as day_1, countif(day = 2) as day_2, countif(day = 3) as day_3
from t cross join
     unnest([struct(t.day_1 as event, 1 as day),
             struct(t.day_2 as event, 2 as day),
             struct(t.day_3 as event, 3 as day)
            ]) s
group by t.year, t.week, s.event;
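The array-of-structs unpivot above is BigQuery-specific. As a rough sketch of the same unpivot-then-aggregate idea in portable SQL (run here through Python's `sqlite3`, with the table and sample rows created inline; `SUM` over a boolean expression stands in for BigQuery's `COUNTIF`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (year INT, week INT, day_1 TEXT, day_2 TEXT, day_3 TEXT);
INSERT INTO t VALUES
  (2020, 1, 'Walk', 'Jump', 'Swim'),
  (2020, 3, 'Walk', 'Swim', 'Walk'),
  (2020, 1, 'Jump', 'Walk', 'Swim');
""")

# Unpivot each day column into (event, day) pairs, then aggregate --
# the portable-SQL equivalent of UNNESTing an array of structs.
rows = conn.execute("""
WITH s AS (
  SELECT year, week, day_1 AS event, 1 AS day FROM t
  UNION ALL SELECT year, week, day_2, 2 FROM t
  UNION ALL SELECT year, week, day_3, 3 FROM t
)
SELECT year, week, event,
       SUM(day = 1) AS day_1,   -- SQLite treats booleans as 0/1
       SUM(day = 2) AS day_2,
       SUM(day = 3) AS day_3
FROM s
GROUP BY year, week, event
ORDER BY year, week, event
""").fetchall()
print(rows)
```

Note that, like the BigQuery answer, this only emits groups for events that actually occur in a given year/week, so no all-zero rows appear.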
              


Answer 2:

Consider this less verbose option:

select year, week, event, 
  countif(offset = 0) as day_1, 
  countif(offset = 1) as day_2, 
  countif(offset = 2) as day_3
from `project.dataset.table`,
unnest([day_1, day_2, day_3]) event with offset
where not event is null
group by year, week, event   

Applied to the sample data in your question, this produces the desired output.
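BigQuery's `with offset` exposes each array element's index. For readers outside BigQuery, the same per-row expansion can be sketched in plain Python, where `enumerate` plays the role of the offset (a toy sketch over the sample rows, not BigQuery code):

```python
from collections import Counter

rows = [
    (2020, 1, 'Walk', 'Jump', 'Swim'),
    (2020, 3, 'Walk', 'Swim', 'Walk'),
    (2020, 1, 'Jump', 'Walk', 'Swim'),
]

counts = Counter()
for year, week, *days in rows:
    # enumerate(days) mimics UNNEST([day_1, day_2, day_3]) WITH OFFSET
    for offset, event in enumerate(days):
        if event is not None:          # mirrors "where not event is null"
            counts[(year, week, event, offset)] += 1

# Fold the (event, offset) counts into one [day_1, day_2, day_3] row per group.
result = {}
for (year, week, event, offset), n in counts.items():
    result.setdefault((year, week, event), [0, 0, 0])[offset] = n

print(result[(2020, 1, 'Walk')])  # counts for day_1..day_3
```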


Answer 3:

The demo code below is for MS SQL Server!

If you want to generate a complete grid for every event per week and year, you need two pre-aggregations: one for the events and one for the year/week pairs.

Like so:

DECLARE
  @OriginalData
  TABLE
  (
    numYear   smallint,
    numWeek   tinyint,
    dscDay1   nvarchar(20),
    dscDay2   nvarchar(20),
    dscDay3   nvarchar(20)
  )
;

INSERT INTO
  @OriginalData
(
  numYear, numWeek, dscDay1, dscDay2, dscDay3
)
VALUES
  ( 2020, 1, N'Walk', N'Jump', N'Swim' ),
  ( 2020, 3, N'Walk', N'Swim', N'Walk' ),
  ( 2020, 1, N'Jump', N'Walk', N'Swim' )
;

SELECT
  numYear, numWeek, dscDay1, dscDay2, dscDay3
FROM
  @OriginalData
;

WITH
  cteNormalise
(
  dscActivity
)
AS
(
  SELECT
    dscDay1
  FROM
    @OriginalData
  GROUP BY
    dscDay1
  UNION
  SELECT
    dscDay2
  FROM
    @OriginalData
  GROUP BY
    dscDay2
  UNION
  SELECT
    dscDay3
  FROM 
    @OriginalData
  GROUP BY
    dscDay3
),
  cteGrid
(
  numYear,
  numWeek
)
AS
(
  SELECT
    numYear,
    numWeek
  FROM
    @OriginalData
  GROUP BY
    numYear,
    numWeek
)
SELECT
  --/* Debug output */ *
  YearWeek.numYear,
  YearWeek.numWeek,
  Normalised.dscActivity,
  Count( Day1.dscDay1 ) AS CountDay1,
  Count( Day2.dscDay2 ) AS CountDay2,
  Count( Day3.dscDay3 ) AS CountDay3
FROM
  cteNormalise AS Normalised
  CROSS JOIN cteGrid AS YearWeek
  LEFT OUTER JOIN @OriginalData AS Day1
    ON  Day1.dscDay1 = Normalised.dscActivity
    AND Day1.numYear = YearWeek.numYear
    AND Day1.numWeek = YearWeek.numWeek
  LEFT OUTER JOIN @OriginalData AS Day2
    ON  Day2.dscDay2 = Normalised.dscActivity
    AND Day2.numYear = YearWeek.numYear
    AND Day2.numWeek = YearWeek.numWeek
  LEFT OUTER JOIN @OriginalData AS Day3
    ON  Day3.dscDay3 = Normalised.dscActivity
    AND Day3.numYear = YearWeek.numYear
    AND Day3.numWeek = YearWeek.numWeek
GROUP BY
  YearWeek.numYear,
  YearWeek.numWeek,
  Normalised.dscActivity
ORDER BY
  YearWeek.numYear,
  Normalised.dscActivity,
  YearWeek.numWeek
;

This works, but its efficiency is questionable because of the step that normalises the data before the actual aggregation happens.

If at all possible, I would suggest converting the table into 3NF first, with only the key columns Year, Week, Event and Day. A fairly efficient summary can then be produced, at the cost of normalising up front; otherwise the conversion cost has to be paid inside the query.
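A minimal sketch of the suggested 3NF layout (hypothetical table and column names, run through Python's `sqlite3` rather than MS SQL Server): once each (year, week, day, event) is one row, the summary collapses to a single conditional aggregation with no pre-aggregation CTEs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalised table: one row per (year, week, day, event)
CREATE TABLE activity (numYear INT, numWeek INT, numDay INT, dscActivity TEXT);
INSERT INTO activity VALUES
  (2020, 1, 1, 'Walk'), (2020, 1, 2, 'Jump'), (2020, 1, 3, 'Swim'),
  (2020, 3, 1, 'Walk'), (2020, 3, 2, 'Swim'), (2020, 3, 3, 'Walk'),
  (2020, 1, 1, 'Jump'), (2020, 1, 2, 'Walk'), (2020, 1, 3, 'Swim');
""")

# One pass over the normalised rows: no unpivoting needed at query time.
rows = conn.execute("""
SELECT numYear, numWeek, dscActivity,
       SUM(numDay = 1) AS CountDay1,
       SUM(numDay = 2) AS CountDay2,
       SUM(numDay = 3) AS CountDay3
FROM activity
GROUP BY numYear, numWeek, dscActivity
ORDER BY numYear, numWeek, dscActivity
""").fetchall()
print(rows)
```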

Comments:

It is hard to port this T-SQL, with its INSERT into a table variable, to other DBMSs. You could use a WITH clause for the sample data to make it portable to almost every SQL dialect, so it would be worth rewriting the query that way to make this answer more or less useful without a rewrite of the code.

Answer 4:

You need to find the distinct events, cross join them with your table, and use conditional aggregation, like this:

select t.year, t.week, e.event, 
       count(case when t.day_1 = e.event then 1 end) as count_day_1,
       count(case when t.day_2 = e.event then 1 end) as count_day_2,
       count(case when t.day_3 = e.event then 1 end) as count_day_3 
  from your_Table t
  cross join (select distinct event
                from (select day_1 as event from your_table
                      union all select day_2 from your_table
                      union all select day_3 from your_table) d) e
group by t.year, t.week, e.event
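This conditional-aggregation approach is close to portable standard SQL. The following sketch runs it against the question's sample data via Python's `sqlite3` (the table name `your_table` is kept from the answer) and reproduces the desired grid, including the all-zero Jump row for week 3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table (year INT, week INT, day_1 TEXT, day_2 TEXT, day_3 TEXT);
INSERT INTO your_table VALUES
  (2020, 1, 'Walk', 'Jump', 'Swim'),
  (2020, 3, 'Walk', 'Swim', 'Walk'),
  (2020, 1, 'Jump', 'Walk', 'Swim');
""")

# Distinct events cross-joined with the table, then conditional counts:
# COUNT(CASE ...) counts only the rows where the CASE yields non-NULL.
rows = conn.execute("""
SELECT t.year, t.week, e.event,
       COUNT(CASE WHEN t.day_1 = e.event THEN 1 END) AS count_day_1,
       COUNT(CASE WHEN t.day_2 = e.event THEN 1 END) AS count_day_2,
       COUNT(CASE WHEN t.day_3 = e.event THEN 1 END) AS count_day_3
FROM your_table t
CROSS JOIN (SELECT DISTINCT event
              FROM (SELECT day_1 AS event FROM your_table
                    UNION ALL SELECT day_2 FROM your_table
                    UNION ALL SELECT day_3 FROM your_table) d) e
GROUP BY t.year, t.week, e.event
ORDER BY t.year, t.week, e.event
""").fetchall()
print(rows)
```

Deduplicating the event list before the cross join matters: if the same event appeared several times in the derived table, every count would be multiplied by the number of duplicates.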

Comments:

There is no event column in the source table, so the cross join should contain unpivoting logic.

Correct, @astentx. I have updated the query; please check.

I have edited it and it should work now: 1) e.day_N should be t.day_N; 2) BigQuery throws an error for a bare UNION and requires select distinct ... union all.
