Efficient Multiple Group-bys
【Title】: Efficient Multiple Group-bys
【Posted】: 2021-02-05 11:06:57
【Question】: I have the following table:
Year | Week | Day_1 | Day_2 | Day_3 |
---|---|---|---|---|
2020 | 1 | Walk | Jump | Swim |
2020 | 3 | Walk | Swim | Walk |
2020 | 1 | Jump | Walk | Swim |
I want to group by year, week and event (Walk, Jump, Swim) and count how many times each event occurs on day 1, day 2 and day 3, i.e.:
Year | Week | Event | Count_Day_1 | Count_Day_2 | Count_Day_3 |
---|---|---|---|---|---|
2020 | 1 | Walk | 1 | 1 | 0 |
2020 | 3 | Walk | 1 | 0 | 1 |
2020 | 1 | Jump | 1 | 1 | 0 |
2020 | 3 | Jump | 0 | 0 | 0 |
2020 | 1 | Swim | 0 | 0 | 2 |
2020 | 3 | Swim | 0 | 1 | 0 |
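How can I do this efficiently?
【Comments】:
As one possible solution: this answer on UNPIVOT, combined with conditional aggregation, may help.
【Answer 1】: In BigQuery, I would unpivot using an array and then aggregate: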
with t as (
select 2020 as year, 1 as week, 'Walk' as day_1, 'Jump' as day_2, 'Swim' as day_3 union all
select 2020, 3, 'Walk', 'Swim', 'Walk' union all
select 2020, 1, 'Jump', 'Walk', 'Swim'
)
select t.year, t.week, s.event,
countif(day = 1) as day_1, countif(day = 2) as day_2, countif(day = 3) as day_3
from t cross join
unnest([struct(t.day_1 as event, 1 as day),
struct(t.day_2 as event, 2 as day),
struct(t.day_3 as event, 3 as day)
]) s
group by t.year, t.week, s.event;
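For engines without array support, the same unpivot-then-aggregate idea can be sketched with UNION ALL in place of UNNEST. The snippet below is only an illustration run in SQLite through Python's sqlite3 module; the table name t and the choice of SQLite are assumptions, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (year int, week int, day_1 text, day_2 text, day_3 text)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [
        (2020, 1, "Walk", "Jump", "Swim"),
        (2020, 3, "Walk", "Swim", "Walk"),
        (2020, 1, "Jump", "Walk", "Swim"),
    ],
)

# Unpivot with UNION ALL (one branch per day column), then aggregate.
# In SQLite a comparison evaluates to 0 or 1, so sum(day = 1) acts like countif.
query = """
with s as (
    select year, week, day_1 as event, 1 as day from t union all
    select year, week, day_2, 2 from t union all
    select year, week, day_3, 3 from t
)
select year, week, event,
       sum(day = 1) as day_1,
       sum(day = 2) as day_2,
       sum(day = 3) as day_3
from s
group by year, week, event
order by year, week, event
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Like the BigQuery version, this returns only the combinations that occur in the data, so there is no row for Jump in week 3.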
【Comments】:
【Answer 2】: Consider this less verbose option:
select year, week, event,
countif(offset = 0) as day_1,
countif(offset = 1) as day_2,
countif(offset = 2) as day_3
from `project.dataset.table`,
unnest([day_1, day_2, day_3]) event with offset
where not event is null
group by year, week, event
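Applied to the sample data in your question, the output has one row for each (year, week, event) combination that actually occurs; a combination that never occurs, such as Jump in week 3, is not returned.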
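To make the result on the sample data concrete, here is a small Python sketch that mimics the query's logic rather than running it in BigQuery. enumerate() stands in for UNNEST ... WITH OFFSET, and the variable names are illustrative assumptions.

```python
from collections import defaultdict

rows = [
    (2020, 1, "Walk", "Jump", "Swim"),
    (2020, 3, "Walk", "Swim", "Walk"),
    (2020, 1, "Jump", "Walk", "Swim"),
]

# counts[(year, week, event)] holds [day_1, day_2, day_3]
counts = defaultdict(lambda: [0, 0, 0])
for year, week, *days in rows:
    # Offsets start at 0, so offset 0 corresponds to day_1
    for offset, event in enumerate(days):
        if event is not None:  # mirrors "where not event is null"
            counts[(year, week, event)][offset] += 1

result = sorted((*key, *value) for key, value in counts.items())
for line in result:
    print(line)
```

Only five combinations appear; producing the all-zero row for Jump in week 3 would need an explicit grid of events and weeks.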
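【Comments】:
【Answer 3】: The demo code is MS SQL (T-SQL)!
If you want to generate a full grid of every event for every year and week, you need two pre-aggregations: one for the events and another for the year/week pairs.
Like this: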
DECLARE
@OriginalData
TABLE
(
numYear smallint,
numWeek tinyint,
dscDay1 nvarchar(20),
dscDay2 nvarchar(20),
dscDay3 nvarchar(20)
)
;
INSERT INTO
@OriginalData
(
numYear, numWeek, dscDay1, dscDay2, dscDay3
)
VALUES
( 2020, 1, N'Walk', N'Jump', N'Swim' ),
( 2020, 3, N'Walk', N'Swim', N'Walk' ),
( 2020, 1, N'Jump', N'Walk', N'Swim' )
;
SELECT
numYear, numWeek, dscDay1, dscDay2, dscDay3
FROM
@OriginalData
;
WITH
cteNormalise
(
dscActivity
)
AS
(
SELECT
dscDay1
FROM
@OriginalData
GROUP BY
dscDay1
UNION
SELECT
dscDay2
FROM
@OriginalData
GROUP BY
dscDay2
UNION
SELECT
dscDay3
FROM
@OriginalData
GROUP BY
dscDay3
),
cteGrid
(
numYear,
numWeek
)
AS
(
SELECT
numYear,
numWeek
FROM
@OriginalData
GROUP BY
numYear,
numWeek
)
SELECT
--/* Debug output */ *
YearWeek.numYear,
YearWeek.numWeek,
Normalised.dscActivity,
Count( CASE WHEN Src.dscDay1 = Normalised.dscActivity THEN 1 END ) AS CountDay1,
Count( CASE WHEN Src.dscDay2 = Normalised.dscActivity THEN 1 END ) AS CountDay2,
Count( CASE WHEN Src.dscDay3 = Normalised.dscActivity THEN 1 END ) AS CountDay3
FROM
cteNormalise AS Normalised
CROSS JOIN cteGrid AS YearWeek
-- Join the source once per year/week and count conditionally;
-- three separate joins multiply rows and inflate the counts when several rows match
LEFT OUTER JOIN @OriginalData AS Src
ON Src.numYear = YearWeek.numYear
AND Src.numWeek = YearWeek.numWeek
GROUP BY
YearWeek.numYear,
YearWeek.numWeek,
Normalised.dscActivity
ORDER BY
YearWeek.numYear,
Normalised.dscActivity,
YearWeek.numWeek
;
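This works, but its efficiency is questionable because the data has to be normalised before the actual aggregation can happen.
If possible, I suggest first converting the table to 3NF with only year, week, event and day as key columns. A fairly efficient summary can then be produced, at the cost of normalising up front. Otherwise the conversion cost has to be paid inside the query.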
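A minimal sketch of what such a normalised layout and its summary could look like, run in SQLite through Python's sqlite3 module. The table name activity, the entry_id column (added so that repeated observations such as two Swim entries on day 3 of week 1 stay distinct) and the index are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical normalised layout: one row per (source row, day) observation.
conn.execute(
    """
    create table activity (
        entry_id integer,
        year integer,
        week integer,
        day integer,
        event text
    )
    """
)

wide_rows = [
    (2020, 1, "Walk", "Jump", "Swim"),
    (2020, 3, "Walk", "Swim", "Walk"),
    (2020, 1, "Jump", "Walk", "Swim"),
]
# One-off conversion from the wide layout to the normalised one.
for entry_id, (year, week, *days) in enumerate(wide_rows, start=1):
    conn.executemany(
        "insert into activity values (?, ?, ?, ?, ?)",
        [(entry_id, year, week, day, event)
         for day, event in enumerate(days, start=1)],
    )

# An index on the grouping columns is one way to support the summary.
conn.execute("create index idx_activity on activity (year, week, event, day)")

summary = conn.execute(
    """
    select year, week, event,
           sum(day = 1) as count_day_1,
           sum(day = 2) as count_day_2,
           sum(day = 3) as count_day_3
    from activity
    group by year, week, event
    order by year, week, event
    """
).fetchall()
for row in summary:
    print(row)
```

With the data already in this shape, the summary is a single grouped scan and needs no unpivoting step inside the query.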
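【Comments】:
It is hard to port this T-SQL, with its inserts into a table variable, to other DBMSs. You could use a with clause instead, which is portable to almost all SQL dialects, so it would be better to rewrite the query so that this answer is useful without the reader having to rewrite the code.
【Answer 4】: You need to find the distinct events, cross join them with your table, and use conditional aggregation, as follows: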
select t.year, t.week, e.event,
count(case when t.day_1 = e.event then 1 end) as count_day_1,
count(case when t.day_2 = e.event then 1 end) as count_day_2,
count(case when t.day_3 = e.event then 1 end) as count_day_3
from your_table t
cross join (select distinct event
from (select day_1 as event from your_table
union all select day_2 from your_table
union all select day_3 from your_table) u) e
group by t.year, t.week, e.event
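The cross join plus conditional aggregation can be checked on the sample data with a short sketch in SQLite via Python's sqlite3 module. It deduplicates the event list across all three day columns before the cross join, because an event listed more than once would be counted more than once. Using SQLite and its plain UNION for deduplication is an assumption made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table your_table "
    "(year int, week int, day_1 text, day_2 text, day_3 text)"
)
conn.executemany(
    "insert into your_table values (?, ?, ?, ?, ?)",
    [
        (2020, 1, "Walk", "Jump", "Swim"),
        (2020, 3, "Walk", "Swim", "Walk"),
        (2020, 1, "Jump", "Walk", "Swim"),
    ],
)

# e holds each event exactly once (UNION removes duplicates in SQLite);
# every source row is paired with every event and counted conditionally.
query = """
select t.year, t.week, e.event,
       count(case when t.day_1 = e.event then 1 end) as count_day_1,
       count(case when t.day_2 = e.event then 1 end) as count_day_2,
       count(case when t.day_3 = e.event then 1 end) as count_day_3
from your_table t
cross join (
    select day_1 as event from your_table
    union select day_2 from your_table
    union select day_3 from your_table
) e
group by t.year, t.week, e.event
order by t.year, t.week, e.event
"""
grid = conn.execute(query).fetchall()
for row in grid:
    print(row)
```

Because every row is paired with every event, the result includes the all-zero row for Jump in week 3 that the question asks for.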
【Comments】:
The source table has no event column, so the cross join needs unpivoting logic inside it.
Correct, @astentx. I have updated the query. Please check.
I have edited it and it should work now: 1) e.day_N should be t.day_N; 2) BigQuery throws an error for a bare UNION and asks for select distinct ... union all instead.