Union grouped result sets in Hive
【Title】: Union grouped result sets in Hive 【Posted】: 2019-09-19 20:42:27 【Question】: I need to break a Hive query that groups by the ID column into separate quarters of calendar year 2018. Here is how I am handling it at the moment:
-- Query 1: Q1 2018 (three otherwise identical queries cover Q2, Q3 and Q4)
Create TABLE Q12018 stored as ORC as
select
ID,
count(1) as cnt,
sum(revenue) as revenue,
sum( (CASE
WHEN condition1
THEN 1
ELSE 0 END)) as metric1,
sum( (CASE
WHEN condition2
THEN revenue
ELSE 0 END)) as metric2,
sum( (CASE
WHEN condition3
THEN 1
ELSE 0 END)) as metric3,
sum( (CASE
WHEN condition4
THEN revenue
ELSE 0 END)) as metric4
from mainTable
where month between 201801 and 201803
group by
ID;
-- Query 2: stack the four quarterly tables
Create TABLE combined2018 stored as ORC as
select * from Q12018
union all
select * from Q22018
union all
select * from Q32018
union all
select * from Q42018 ;
-- Query 3: re-aggregate the combined rows by ID
Create TABLE Agg2018 stored as ORC as
select
ID,
sum(cnt) as cnt,
sum(revenue) as revenue,
sum(metric1) as metric1,
sum(metric2) as metric2,
sum(metric3) as metric3,
sum(metric4) as metric4
from combined2018
group by ID;
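【Answer 1】: It looks like you end up summing all of the quarterly results grouped by ID. If the final result is just the total of the quarterly results, change the where clause to cover the whole year; count and sum are additive, so a single pass produces the same final result.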
select
ID,
count(1) as cnt,
sum(revenue) as revenue,
sum((CASE WHEN condition1 THEN 1 ELSE 0 END)) as metric1,
sum((CASE WHEN condition2 THEN revenue ELSE 0 END)) as metric2,
sum((CASE WHEN condition3 THEN 1 ELSE 0 END)) as metric3,
sum((CASE WHEN condition4 THEN revenue ELSE 0 END)) as metric4
from mainTable
where month between 201801 and 201812
group by ID;
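If the scan has to stay split into quarterly runs, as the question requires, one possible middle ground is to append each quarter's partial aggregates to a single staging table instead of creating four tables and a union. The sketch below is only an illustration under assumptions: `partial2018` is a hypothetical table name, `condition1` to `condition4` are the placeholders from the question, and it assumes the staging table is a plain (non-partitioned) managed table so that `INSERT INTO` simply appends rows.

```sql
-- Q1 creates the staging table (partial2018 is a hypothetical name).
CREATE TABLE partial2018 STORED AS ORC AS
SELECT
  ID,
  count(1) AS cnt,
  sum(revenue) AS revenue,
  sum(CASE WHEN condition1 THEN 1 ELSE 0 END) AS metric1,
  sum(CASE WHEN condition2 THEN revenue ELSE 0 END) AS metric2,
  sum(CASE WHEN condition3 THEN 1 ELSE 0 END) AS metric3,
  sum(CASE WHEN condition4 THEN revenue ELSE 0 END) AS metric4
FROM mainTable
WHERE month BETWEEN 201801 AND 201803
GROUP BY ID;

-- Q2 appends its partial aggregates; columns are matched by position.
-- Repeat with 201807-201809 for Q3 and 201810-201812 for Q4.
INSERT INTO TABLE partial2018
SELECT
  ID,
  count(1),
  sum(revenue),
  sum(CASE WHEN condition1 THEN 1 ELSE 0 END),
  sum(CASE WHEN condition2 THEN revenue ELSE 0 END),
  sum(CASE WHEN condition3 THEN 1 ELSE 0 END),
  sum(CASE WHEN condition4 THEN revenue ELSE 0 END)
FROM mainTable
WHERE month BETWEEN 201804 AND 201806
GROUP BY ID;

-- Final roll-up: each ID has up to four partial rows, one per quarter.
CREATE TABLE Agg2018 STORED AS ORC AS
SELECT
  ID,
  sum(cnt) AS cnt,
  sum(revenue) AS revenue,
  sum(metric1) AS metric1,
  sum(metric2) AS metric2,
  sum(metric3) AS metric3,
  sum(metric4) AS metric4
FROM partial2018
GROUP BY ID;
```

Each quarterly statement still reads only three months of `mainTable`, so the per-query load is the same as in the original approach; the only change is that the intermediate results land in one table.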
【Comments】:
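I need to split the query into quarterly chunks because of the size of the main table; our cluster becomes unstable when a query spans a long date range. My original query did exactly what you suggest, but it had a lot of performance problems.
@hghghghg Then you just need to tune for proper parallelism: ***.com/a/48296562/2700344 and: ***.com/a/54491316/2700344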
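The linked answers are not reproduced in this thread, so the following is only a rough sketch of the kind of session settings usually meant by tuning parallelism in Hive. It assumes Hive running on Tez (the thread does not say which engine is used), and every value is an illustrative starting point rather than a recommendation.

```sql
-- Illustrative values only; tune against the actual cluster.

-- Hive on Tez: smaller split groups produce more mapper tasks.
-- 16 MB minimum and 64 MB maximum per grouped split.
SET tez.grouping.min-size=16777216;
SET tez.grouping.max-size=67108864;

-- Fewer bytes per reducer means more reducers for the GROUP BY stage.
-- 64 MB of input per reducer.
SET hive.exec.reducers.bytes.per.reducer=67108864;

-- Upper bound on the number of reducers Hive will request.
SET hive.exec.reducers.max=1009;
```

These are set per session before running the aggregation query; if the cluster runs on MapReduce instead of Tez, the `tez.grouping.*` settings do not apply.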