Flink SQL--分组聚合

Posted 宝哥大数据

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Flink SQL--分组聚合相关的知识,希望对你有一定的参考价值。

文章目录

一、DISTINCT Aggregation

二、GROUPING SETS

在一个GROUP BY 查询中,根据不同的维度组合进行聚合。GROUPING SETS就是一种将多个GROUP BY逻辑 UNION 在一起。GROUPING SETS会把在单个GROUP BY逻辑中没有参与GROUP BY的那一列置为NULL值。空分组集意味着所有行都聚合到一个组中

SELECT supplier_id, rating, COUNT(*) AS total
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS ((supplier_id, rating), (supplier_id), ())
Results:

+-------------+--------+-------+
| supplier_id | rating | total |
+-------------+--------+-------+
|   supplier1 |      4 |     1 |
|   supplier1 |      3 |     1 |
|   supplier2 |      3 |     1 |
|   supplier2 |      4 |     1 |

|   supplier1 | (NULL) |     2 |
|   supplier2 | (NULL) |     2 |

|      (NULL) | (NULL) |     4 |
+-------------+--------+-------+

TTL

在持续查询的过程中,由于用于分组的 key 可能会不断增加,因此计算结果所需要维护的状态也会持续增长。为了防止状态无限增长耗尽资源,Flink Table API 和 SQL 可以在表环境中配置状态的生存时间(TTL)table.exec.state.ttl

配置 TTL 有可能会导致统计结果不准确,这其实是以牺牲正确性为代价换取了资源的释放

2.1、ROLLUP

ROLLUP 相当于 GROUPING SETS 的简化

SELECT supplier_id, rating, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP (supplier_id, rating)

等价于
SELECT supplier_id, rating, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
    ( supplier_id, rating ),
    ( supplier_id         ),
    (                     )
)


--- 三个字段分组
SELECT supplier_id, rating,product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP(supplier_id, rating,product_id)
等价于
SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
    ( supplier_id, product_id, rating ),
    ( supplier_id, product_id         ),
    ( supplier_id                     ),
    (                                 )
)

2.2、CUBE


SELECT supplier_id, rating,product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY CUBE(supplier_id, rating,product_id)


等价于
SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
    ( supplier_id, product_id, rating ),
    ( supplier_id, product_id         ),
    ( supplier_id,             rating ),
    ( supplier_id                     ),
    (              product_id, rating ),
    (              product_id         ),
    (                          rating ),
    (                                 )
)

下面是 CUBE 和 ROLLUP 之间的具体区别:

  • CUBE 生成的结果集显示了所选列中值的所有组合的聚合
  • ROLLUP 生成的结果集显示了所选列中值的某一层次结构的聚合

三、HAVING

用于分组后的过滤条件

以上是关于Flink SQL--分组聚合的主要内容,如果未能解决你的问题,请参考以下文章

Flink SQL 分组窗口函数 Group Window 实战

LINQ to Entities - 在分组数据中创建中值

带有子查询的 SQL 多个聚合函数

数据分析—Pandas 中的分组聚合Groupby 高阶操作

95-910-148-源码-FlinkSQL-Flink SQL自定义聚合函数

包含字段中值的连接、分组和计数的 SQL 视图