Hive sum over partition preceding following: cumulative sums
Posted 二十六画生的博客
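1 No sum first, sum over applied directly
The output contains duplicate (bank, month) rows because the raw rows are never aggregated; this does not match expectations.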
select
bank,
month,
revenue,
sum(revenue) over(
partition by bank --month
order by
bank,
month
) as acc_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1;
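The duplicates have two causes: the raw rows are never aggregated, and a window with order by but no explicit frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the order by keys (peers) all receive the same running total. A minimal sketch on the same sample data with that frame written out; the CTE form and the alias `range_acc` are illustrative choices, not part of the original query:

```sql
with t1 as (
  select 'b1' as bank, '2022-01' as month, 1 as revenue
  union all
  select 'b1' as bank, '2022-01' as month, 5 as revenue
  union all
  select 'b2' as bank, '2022-01' as month, 2 as revenue
  union all
  select 'b1' as bank, '2022-02' as month, 3 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 4 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 6 as revenue
)
select
  bank,
  month,
  revenue,
  -- the frame the query above gets implicitly
  sum(revenue) over(
    partition by bank
    order by month
    range between unbounded preceding and current row
  ) as range_acc
from t1;
-- b1: both 2022-01 rows get 6 (1 + 5), the 2022-02 row gets 9
-- b2: the 2022-01 row gets 2, both 2022-02 rows get 12 (2 + 4 + 6)
```

Six rows come back, one per input row, which is why the result still shows repeated (bank, month) pairs.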
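2 Sum first, then sum over. Without a rows between ... and ... clause, the window accumulates from the first row of the partition to the current row. Strictly speaking, the default frame with order by is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; because each (bank, month) pair is unique after the group by, it gives the same result here as [ROWS BETWEEN UNBOUNDED preceding and 0 FOLLOWING]
When accumulating across months, the month column must not appear in partition by!
Matches expectations.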
select
bank,
month,
total_revenue,
sum(total_revenue) over(
partition by bank
-- month
order by
bank,
month
) as acc_revenue
from
(
select
bank,
month,
sum(revenue) as total_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1
group by
bank,
month
) t2
3 Sum first, then sum over, with [ROWS BETWEEN UNBOUNDED preceding and 0 FOLLOWING] written explicitly
Matches expectations.
select
bank,
month,
total_revenue,
sum(total_revenue) over(
partition by bank --month
order by
bank,
month ROWS BETWEEN UNBOUNDED preceding
and 0 FOLLOWING
) as acc_revenue
from
(
select
bank,
month,
sum(revenue) as total_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1
group by
bank,
month
) t2
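The bound 0 FOLLOWING above refers to the current row; the more common spelling is CURRENT ROW. A minimal sketch of the same query written that way, assuming both spellings behave identically on your Hive version (the CTEs `t1` and `t2` are only a shorter way to write the sample data):

```sql
with t1 as (
  select 'b1' as bank, '2022-01' as month, 1 as revenue
  union all
  select 'b1' as bank, '2022-01' as month, 5 as revenue
  union all
  select 'b2' as bank, '2022-01' as month, 2 as revenue
  union all
  select 'b1' as bank, '2022-02' as month, 3 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 4 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 6 as revenue
),
t2 as (
  -- one row per (bank, month)
  select bank, month, sum(revenue) as total_revenue
  from t1
  group by bank, month
)
select
  bank,
  month,
  total_revenue,
  sum(total_revenue) over(
    partition by bank
    order by month
    rows between unbounded preceding and current row
  ) as acc_revenue
from t2;
-- b1, 2022-01: total_revenue 6,  acc_revenue 6
-- b1, 2022-02: total_revenue 3,  acc_revenue 9
-- b2, 2022-01: total_revenue 2,  acc_revenue 2
-- b2, 2022-02: total_revenue 10, acc_revenue 12
```

Ordering by bank inside the window is not needed, since every row in a partition already shares the same bank.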
Once month is added to partition by, each partition is a single (bank, month) row, so no accumulation is visible:
select
bank,
month,
total_revenue,
sum(total_revenue) over(
partition by bank,month -- month added
order by
bank,
month ROWS BETWEEN UNBOUNDED preceding
and 0 FOLLOWING
) as acc_revenue
from
(
select
bank,
month,
sum(revenue) as total_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1
group by
bank,
month
) t2
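Differences between partition by and group by:
1 With over(partition by), the select list may contain any columns (related dimension columns or unrelated ones). With [group by dimension columns], every non-aggregated column in the select list must also appear in group by (otherwise the query fails), which is one more restriction than partition by has!
2 partition by does not collapse rows; group by does (fewer rows are returned).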
select
bank,
month,
sum(total_revenue) over( -- error; possible cause: org.apache.calcite.runtime.CalciteContextException: Sql 1: From line 4, column 9 to line 4, column 21: Expression 'total_revenue' is not being grouped
order by
bank,
month
) as acc_revenue
from
(
select
bank,
month,
sum(revenue) as total_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1
group by
bank,
month
) t2
group by
bank,
month;
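To make the two differences concrete, here is a small sketch on the same sample rows: the group by query returns one row per bank, while the partition by query keeps all six input rows and attaches the per-bank total to each of them. The alias `bank_total` is illustrative.

```sql
-- group by: rows are collapsed, one output row per bank
with t1 as (
  select 'b1' as bank, '2022-01' as month, 1 as revenue
  union all
  select 'b1' as bank, '2022-01' as month, 5 as revenue
  union all
  select 'b2' as bank, '2022-01' as month, 2 as revenue
  union all
  select 'b1' as bank, '2022-02' as month, 3 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 4 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 6 as revenue
)
select bank, sum(revenue) as bank_total
from t1
group by bank;
-- 2 rows: (b1, 9) and (b2, 12)

-- partition by: no rows are dropped, and month and revenue can be
-- selected even though they are not part of the partition key
with t1 as (
  select 'b1' as bank, '2022-01' as month, 1 as revenue
  union all
  select 'b1' as bank, '2022-01' as month, 5 as revenue
  union all
  select 'b2' as bank, '2022-01' as month, 2 as revenue
  union all
  select 'b1' as bank, '2022-02' as month, 3 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 4 as revenue
  union all
  select 'b2' as bank, '2022-02' as month, 6 as revenue
)
select bank, month, revenue,
       sum(revenue) over(partition by bank) as bank_total
from t1;
-- 6 rows: every b1 row carries bank_total 9, every b2 row carries 12
```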
4 sum(sum()) over (partition by)
Matches expectations.
select
bank,
month,
sum(revenue) as total_revenue,
sum(sum(revenue)) over(
partition by bank -- required to get the expected result
order by
bank,
month
) as acc_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1
group by
bank,
month
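5 sum(sum()) over (no partition by)
The running total is not computed per bank; it keeps accumulating across banks, which does not match expectations.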
select
bank,
month,
sum(revenue) as total_revenue,
sum(sum(revenue)) over(
order by
bank,
month
) as acc_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1
group by
bank,
month
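6 sum(sum()) over (partition by, without order by)
Every row gets the full total for its bank rather than a running total from the first row to the current row; this does not match expectations.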
select
bank,
month,
sum(revenue) as total_revenue,
sum(sum(revenue)) over(
PARTITION by bank -- no order by
) as acc_revenue
from
(
select
'b1' as bank,
'2022-01' as month,
1 as revenue
union all
select
'b1' as bank,
'2022-01' as month,
5 as revenue
union all
select
'b2' as bank,
'2022-01' as month,
2 as revenue
union all
select
'b1' as bank,
'2022-02' as month,
3 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
4 as revenue
union all
select
'b2' as bank,
'2022-02' as month,
6 as revenue
) t1
group by
bank,
month
1. Data source:
select
*
from
stu_score
order by
score;
2. Function usage:
select
name,
score,
sum(score) over(order by score range between 2 preceding and 2 following) s1, -- all rows whose score is within 2 of the current row's score
sum(score) over(order by score rows between 2 preceding and 2 following) s2, -- the current row plus the 2 rows before and the 2 rows after, at most 5 rows
sum(score) over(order by score range between unbounded preceding and unbounded following) s3, -- all rows, no restriction
sum(score) over(order by score rows between unbounded preceding and unbounded following) s4, -- all rows, no restriction
sum(score) over(order by score) s5, -- first row to the current row (every row with the same score as the current row is included)
sum(score) over(order by score rows between unbounded preceding and current row) s6, -- first row to the current row (rows with the same score that sort after the current row are not included; this is the difference from s5)
sum(score) over(order by score rows between 3 preceding and current row) s7, -- the current row plus the 3 rows before it
sum(score) over(order by score rows between 3 preceding and 1 following) s8, -- the current row plus the 3 rows before it and the 1 row after it
sum(score) over(order by score rows between current row and unbounded following) s9 -- the current row plus all rows after it
from
stu_score
order by
score;
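The contents of stu_score are not shown, so here is a self-contained sketch with made-up rows (the CTE name `sample_scores`, the names and the scores are all hypothetical) that makes the difference between range and rows frames visible when two rows tie on score:

```sql
with sample_scores as (
  select 'a' as name, 80 as score
  union all
  select 'b' as name, 82 as score
  union all
  select 'c' as name, 82 as score
  union all
  select 'd' as name, 90 as score
)
select
  name,
  score,
  -- default frame: range between unbounded preceding and current row
  sum(score) over(order by score) as s_default,
  -- physical rows only, ties are not pulled in
  sum(score) over(order by score rows between unbounded preceding and current row) as s_rows,
  -- value-based window: scores in [score - 2, score + 2]
  sum(score) over(order by score range between 2 preceding and 2 following) as s_range2
from
  sample_scores
order by
  score;
-- s_default: a = 80, b = 244, c = 244, d = 334 (b and c are peers)
-- s_rows:    a = 80, d = 334; one of b/c gets 162 and the other 244,
--            and which one is not guaranteed because they tie on score
-- s_range2:  a = 244, b = 244, c = 244, d = 90
```

When a deterministic running total is needed over data with ties, adding a tie-breaking column to order by is one way to make the rows frame reproducible.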
end