Flink SQL--分组聚合
Posted 宝哥大数据
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Flink SQL--分组聚合相关的知识,希望对你有一定的参考价值。
文章目录
一、DISTINCT Aggregation
二、GROUPING SETS
在一个GROUP BY 查询中,根据不同的维度组合进行聚合。GROUPING SETS就是一种将多个GROUP BY逻辑 UNION 在一起。GROUPING SETS会把在单个GROUP BY逻辑中没有参与GROUP BY的那一列置为NULL值。空分组集意味着所有行都聚合到一个组中
SELECT supplier_id, rating, COUNT(*) AS total
FROM (VALUES
('supplier1', 'product1', 4),
('supplier1', 'product2', 3),
('supplier2', 'product3', 3),
('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS ((supplier_id, rating), (supplier_id), ())
Results:
+-------------+--------+-------+
| supplier_id | rating | total |
+-------------+--------+-------+
| supplier1 | 4 | 1 |
| supplier1 | 3 | 1 |
| supplier2 | 3 | 1 |
| supplier2 | 4 | 1 |
| supplier1 | (NULL) | 2 |
| supplier2 | (NULL) | 2 |
| (NULL) | (NULL) | 4 |
+-------------+--------+-------+
TTL
在持续查询的过程中,由于用于分组的 key 可能会不断增加,因此计算结果所需要维护的状态也会持续增长。为了防止状态无限增长耗尽资源,Flink Table API 和 SQL 可以在表环境中配置状态的生存时间(TTL)table.exec.state.ttl
配置 TTL 有可能会导致统计结果不准确,这其实是以牺牲正确性为代价换取了资源的释放
2.1、ROLLUP
ROLLUP 相当于 GROUPING SETS 的简化
SELECT supplier_id, rating, COUNT(*)
FROM (VALUES
('supplier1', 'product1', 4),
('supplier1', 'product2', 3),
('supplier2', 'product3', 3),
('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP (supplier_id, rating)
等价于
SELECT supplier_id, rating, COUNT(*)
FROM (VALUES
('supplier1', 'product1', 4),
('supplier1', 'product2', 3),
('supplier2', 'product3', 3),
('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
( supplier_id, rating ),
( supplier_id ),
( )
)
--- 三个字段分组
SELECT supplier_id, rating,product_id, COUNT(*)
FROM (VALUES
('supplier1', 'product1', 4),
('supplier1', 'product2', 3),
('supplier2', 'product3', 3),
('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP(supplier_id, rating,product_id)
等价于
SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
('supplier1', 'product1', 4),
('supplier1', 'product2', 3),
('supplier2', 'product3', 3),
('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
( supplier_id, product_id, rating ),
( supplier_id, product_id ),
( supplier_id ),
( )
)
2.2、CUBE
SELECT supplier_id, rating,product_id, COUNT(*)
FROM (VALUES
('supplier1', 'product1', 4),
('supplier1', 'product2', 3),
('supplier2', 'product3', 3),
('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY CUBE(supplier_id, rating,product_id)
等价于
SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
('supplier1', 'product1', 4),
('supplier1', 'product2', 3),
('supplier2', 'product3', 3),
('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS (
( supplier_id, product_id, rating ),
( supplier_id, product_id ),
( supplier_id, rating ),
( supplier_id ),
( product_id, rating ),
( product_id ),
( rating ),
( )
)
下面是 CUBE 和 ROLLUP 之间的具体区别:
- CUBE 生成的结果集显示了所选列中值的所有组合的聚合。
- ROLLUP 生成的结果集显示了所选列中值的某一层次结构的聚合。
三、HAVING
用于分组后的过滤条件
以上是关于Flink SQL--分组聚合的主要内容,如果未能解决你的问题,请参考以下文章
Flink SQL 分组窗口函数 Group Window 实战
数据分析—Pandas 中的分组聚合Groupby 高阶操作