SQL (Hive): using window functions while aggregating with GROUP BY

Posted: 2020-12-16 13:00:54

Question:

I have the following table in Athena (Hive/Presto):

CREATE EXTERNAL TABLE tmp (
    id STRING,
    updated_at TIMESTAMP,
    location STRING,
    direction STRING
)
LOCATION 's3://path'; 

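I need to aggregate on the id field and count the rows per id, while also picking the location and direction that belong to the latest timestamp within each group (the partition is again on id).

So far I have come up with the following query, which applies the window functions first and then groups: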

SELECT
    b.id,
    MAX(b.latest_location) AS "latest_location",  -- It seems it is not possible to use first_value() on GROUP BY
    MAX(b.latest_direction) AS "latest_direction",
    COUNT(*) AS "total"
FROM (
    SELECT
        a.id,
        first_value(a.location) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_location",
        first_value(a.direction) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_direction"
    FROM tmp a
) b
GROUP BY b.id;

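My first attempt was to do the GROUP BY aggregation and the window aggregation in the same query, but the engine does not seem to allow that. Is it possible to write a more efficient query (possibly without a subquery)?

Comments:

You could do select distinct in the inner query and add count(*) over (partition by a.id). It would be shorter, but I am not sure the internal execution efficiency would change much.

Answer 1: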
SELECT DISTINCT
    id,
    first_value(location)  OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_location,
    first_value(direction) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_direction,
    count(*) OVER (PARTITION BY id) as total
FROM tmp

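In your original query, max is essentially a dummy aggregate, because within each id every row already carries the same latest_location and latest_direction. The group by is basically doing what distinct does here.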
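To see why DISTINCT is enough, it can help to look at the window columns before they are deduplicated. The following is a minimal sketch in Presto/Athena syntax; the inline table demo and its three rows are made up for illustration and stand in for tmp.

```sql
-- Hypothetical sample rows standing in for tmp (values invented for illustration).
WITH demo (id, updated_at, location, direction) AS (
    VALUES
        ('a', TIMESTAMP '2020-12-01 10:00:00', 'north', 'in'),
        ('a', TIMESTAMP '2020-12-02 10:00:00', 'south', 'out'),
        ('b', TIMESTAMP '2020-12-01 09:00:00', 'east',  'in')
)
SELECT
    id,
    updated_at,
    -- The first row of each partition (newest updated_at) supplies the value for every row.
    first_value(location) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_location,
    -- No ORDER BY here, so the frame is the whole partition.
    count(*) OVER (PARTITION BY id) AS total
FROM demo
ORDER BY id, updated_at;
```

Both rows for id 'a' come back with latest_location 'south' and total 2, and the single row for 'b' with 'east' and 1. Dropping updated_at from the select list and adding DISTINCT therefore leaves exactly one row per id.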

Answer 2:

Adding to the preferred answer: consider a named window definition, in keeping with the DRY (don't repeat yourself) principle:

    SELECT DISTINCT
    id,
    first_value(location)  OVER w AS latest_location,
    first_value(direction) OVER w AS latest_direction,
    count(*) OVER (PARTITION BY id) as total
    FROM tmp
    WINDOW w AS (PARTITION BY id ORDER BY updated_at DESC)

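This lets a more complex window definition be maintained in exactly one place, and guarantees that both column calculations use the same window logic.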
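Note that count(*) keeps its own window in the query above. When a window has an ORDER BY and no explicit frame, the default frame ends at the current row, so count(*) OVER w would return a running count rather than the group total. If you want all three columns to share w, one option is to spell out a frame that covers the whole partition. This is a sketch against the tmp table from the question; the WINDOW clause is not accepted by every Presto/Athena version, so check it against your engine first.

```sql
SELECT DISTINCT
    id,
    first_value(location)  OVER w AS latest_location,
    first_value(direction) OVER w AS latest_direction,
    count(*)               OVER w AS total
FROM tmp
WINDOW w AS (
    PARTITION BY id
    ORDER BY updated_at DESC
    -- Explicit frame: every row sees the whole partition, so count(*) is the group total.
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
```

first_value is unaffected by the wider frame, because the frame still starts at the first (newest) row of the partition.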

Answer 3:

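You can mix window functions and aggregate functions . . . but only in the other direction: the aggregation happens first, then the window functions run over the aggregated rows.

That said, your query should be faster if you eliminate the aggregation altogether. Just use row_number() and filter: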
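As an illustration of that ordering, here is a small sketch against the tmp table from the question. The output column names grand_total and rank_by_total are made up for this example.

```sql
SELECT
    id,
    count(*)                             AS total,         -- plain aggregate, one value per id
    sum(count(*)) OVER ()                AS grand_total,   -- window function over the grouped rows
    rank() OVER (ORDER BY count(*) DESC) AS rank_by_total  -- ranks the ids by their row count
FROM tmp
GROUP BY id;
```

The window functions here only ever see one row per id, which is why they can reference count(*) but cannot reach back into the individual rows to pick the latest location.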

SELECT a.id, a.location, a.updated_at
FROM (SELECT a.*,
             ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS seqnum
      FROM tmp a
     ) a
WHERE seqnum = 1;
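The query above returns only location and updated_at. If you also need direction and the per-id row count that the question asks for, one possible extension is to carry direction through and add a count(*) window next to the row number. This is a sketch rather than a benchmarked solution; the alias t and the output column names are chosen for this example.

```sql
SELECT t.id,
       t.location  AS latest_location,
       t.direction AS latest_direction,
       t.total
FROM (SELECT a.*,
             -- 1 marks the newest row of each id; ties on updated_at are broken arbitrarily.
             ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS seqnum,
             -- Number of rows for this id, repeated on every row of the partition.
             COUNT(*)     OVER (PARTITION BY a.id)                            AS total
      FROM tmp a
     ) t
WHERE t.seqnum = 1;
```

This still uses a subquery, because the filter on seqnum has to be applied after the window functions have been evaluated.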
