Hive sum over partition preceding following: cumulative sum

Posted by 二十六画生的博客


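1 No prior sum: apply sum over directly to the raw rows

The output contains duplicate (bank, month) rows, which is not the expected result.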

select
    bank,
    month,
    revenue,
    sum(revenue) over(
        partition by bank --month
        order by
            bank,
            month
    ) as acc_revenue
from
    (
        select
            'b1' as bank,
            '2022-01' as month,
            1 as revenue
        union all
        select
            'b1' as bank,
            '2022-01' as month,
            5 as revenue
        union all
        select
            'b2' as bank,
            '2022-01' as month,
            2 as revenue
        union all
        select
            'b1' as bank,
            '2022-02' as month,
            3 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            4 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            6 as revenue
    ) t1;
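To make the duplication concrete, here is a minimal pure-Python sketch that emulates the window in the query above. It is an illustration, not Hive output: the function name `running_sum_default_frame` is hypothetical, and it assumes the frame that applies when order by is given without an explicit frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), under which rows sharing the same order key are summed together.

```python
from collections import defaultdict

# The same six sample rows as the query above: (bank, month, revenue).
ROWS = [
    ("b1", "2022-01", 1),
    ("b1", "2022-01", 5),
    ("b2", "2022-01", 2),
    ("b1", "2022-02", 3),
    ("b2", "2022-02", 4),
    ("b2", "2022-02", 6),
]


def running_sum_default_frame(rows):
    """Emulate sum(revenue) over(partition by bank order by month).

    Assumes the default RANGE frame: every row of the partition whose
    month is <= the current row's month counts, peers included.
    """
    by_bank = defaultdict(list)
    for bank, month, revenue in rows:
        by_bank[bank].append((month, revenue))

    out = []
    for bank in sorted(by_bank):
        part = sorted(by_bank[bank])
        for month, revenue in part:
            acc = sum(r for m, r in part if m <= month)
            out.append((bank, month, revenue, acc))
    return out


for row in running_sum_default_frame(ROWS):
    print(row)
```

No row is collapsed: all six input rows survive, and the two b1 rows for 2022-01 both carry the same accumulated value, which is why the result looks duplicated.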

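2 Sum first, then sum over. When an order by is present and no rows between ... and ... clause is written, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which accumulates from the first record of the partition to the current one. Since the group by leaves exactly one row per (bank, month), this yields the same result here as [ROWS BETWEEN UNBOUNDED preceding and 0 FOLLOWING]; on rows with tied order keys the two can differ.

When accumulating across months, the month field must not appear in partition by!

Matches the expectation.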

select
    bank,
    month,
    total_revenue,
    sum(total_revenue) over(
        partition by bank
        -- month
        order by
            bank,
            month 
    ) as acc_revenue
from
    (
        select
            bank,
            month,
            sum(revenue) as total_revenue
        from
            (
                select
                    'b1' as bank,
                    '2022-01' as month,
                    1 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-01' as month,
                    5 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-01' as month,
                    2 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-02' as month,
                    3 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    4 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    6 as revenue
            ) t1
        group by
            bank,
            month
    ) t2
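The two-step shape of this query (group by first, then a running sum per bank) can be sketched in plain Python as follows. This is only an emulation under the same sample data; the helper names `monthly_totals` and `cumulative_by_bank` are made up for illustration.

```python
from collections import defaultdict
from itertools import accumulate

# The same six sample rows as the query above: (bank, month, revenue).
ROWS = [
    ("b1", "2022-01", 1),
    ("b1", "2022-01", 5),
    ("b2", "2022-01", 2),
    ("b1", "2022-02", 3),
    ("b2", "2022-02", 4),
    ("b2", "2022-02", 6),
]


def monthly_totals(rows):
    """Inner query: sum(revenue) ... group by bank, month."""
    totals = defaultdict(int)
    for bank, month, revenue in rows:
        totals[(bank, month)] += revenue
    return dict(totals)


def cumulative_by_bank(totals):
    """Outer query: sum(total_revenue) over(partition by bank order by month)."""
    by_bank = defaultdict(list)
    for (bank, month), total in sorted(totals.items()):
        by_bank[bank].append((month, total))

    out = []
    for bank, months in by_bank.items():
        running = accumulate(total for _, total in months)
        for (month, total), acc in zip(months, running):
            out.append((bank, month, total, acc))
    return out


for row in cumulative_by_bank(monthly_totals(ROWS)):
    print(row)
```

Because the grouping step leaves one row per (bank, month), there are no tied order keys left inside a partition, so the running sum advances one month at a time.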

3 Sum first, then sum over, with [ROWS BETWEEN UNBOUNDED preceding and 0 FOLLOWING] written explicitly

Matches the expectation.

select
    bank,
    month,
    total_revenue,
    sum(total_revenue) over(
        partition by bank --month
        order by
            bank,
            month ROWS BETWEEN UNBOUNDED preceding
            and 0 FOLLOWING
    ) as acc_revenue
from
    (
        select
            bank,
            month,
            sum(revenue) as total_revenue
        from
            (
                select
                    'b1' as bank,
                    '2022-01' as month,
                    1 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-01' as month,
                    5 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-01' as month,
                    2 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-02' as month,
                    3 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    4 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    6 as revenue
            ) t1
        group by
            bank,
            month
    ) t2

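With month added to partition by, each partition is narrowed to a single (bank, month), so every partition holds exactly one row and no accumulation is visible: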

select
    bank,
    month,
    total_revenue,
    sum(total_revenue) over(
        partition by bank, month -- month added to the partition key
        order by
            bank,
            month ROWS BETWEEN UNBOUNDED preceding
            and 0 FOLLOWING
    ) as acc_revenue
from
    (
        select
            bank,
            month,
            sum(revenue) as total_revenue
        from
            (
                select
                    'b1' as bank,
                    '2022-01' as month,
                    1 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-01' as month,
                    5 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-01' as month,
                    2 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-02' as month,
                    3 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    4 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    6 as revenue
            ) t1
        group by
            bank,
            month
    ) t2

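Differences between partition by and group by:

1 With over(partition by), the select list may contain any columns (related dimension fields or unrelated ones). With [group by dimension fields], every non-aggregated column in the select list must also appear in the group by (otherwise the query fails), a restriction that partition by does not have.

2 partition by does not collapse rows; group by collapses each group into a single row (the row count decreases).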

select
    bank,
    month,
    -- Fails: total_revenue is neither listed in the outer group by nor wrapped in an aggregate.
    -- Likely cause: org.apache.calcite.runtime.CalciteContextException: Sql 1: From line 4, column 9 to line 4, column 21: Expression 'total_revenue' is not being grouped
    sum(total_revenue) over(
        order by
            bank,
            month
    ) as acc_revenue
from
    (
        select
            bank,
            month,
            sum(revenue) as total_revenue
        from
            (
                select
                    'b1' as bank,
                    '2022-01' as month,
                    1 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-01' as month,
                    5 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-01' as month,
                    2 as revenue
                union all
                select
                    'b1' as bank,
                    '2022-02' as month,
                    3 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    4 as revenue
                union all
                select
                    'b2' as bank,
                    '2022-02' as month,
                    6 as revenue
            ) t1
        group by
            bank,
            month
    ) t2
group by
    bank,
    month;
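The row-count difference between group by and partition by can be seen in a small Python emulation. This is a sketch on the sample data, not Hive itself; `group_by_sum` and `partition_by_sum` are hypothetical helper names, and the window version assumes partition by bank, month with no order by, so each row receives its partition's total.

```python
from collections import defaultdict

# The same six sample rows as the queries above: (bank, month, revenue).
ROWS = [
    ("b1", "2022-01", 1),
    ("b1", "2022-01", 5),
    ("b2", "2022-01", 2),
    ("b1", "2022-02", 3),
    ("b2", "2022-02", 4),
    ("b2", "2022-02", 6),
]


def _totals(rows):
    totals = defaultdict(int)
    for bank, month, revenue in rows:
        totals[(bank, month)] += revenue
    return totals


def group_by_sum(rows):
    """select bank, month, sum(revenue) ... group by bank, month:
    one output row per group."""
    return [(b, m, t) for (b, m), t in sorted(_totals(rows).items())]


def partition_by_sum(rows):
    """select bank, month, revenue, sum(revenue) over(partition by bank, month):
    one output row per input row, each carrying its partition total."""
    totals = _totals(rows)
    return [(b, m, r, totals[(b, m)]) for b, m, r in rows]


print(len(group_by_sum(ROWS)), "rows after group by")
print(len(partition_by_sum(ROWS)), "rows after partition by")
```

The grouped version can only expose the grouping columns and aggregates, while the windowed version keeps every original column (here `revenue`) next to the total.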

4 sum(sum()) over (partition by)

Matches the expectation.

select
    bank,
    month,
    sum(revenue) as total_revenue,
    sum(sum(revenue)) over(
        partition by bank -- required to get the expected per-bank result
        order by
            bank,
            month
    ) as acc_revenue
from
    (
        select
            'b1' as bank,
            '2022-01' as month,
            1 as revenue
        union all
        select
            'b1' as bank,
            '2022-01' as month,
            5 as revenue
        union all
        select
            'b2' as bank,
            '2022-01' as month,
            2 as revenue
        union all
        select
            'b1' as bank,
            '2022-02' as month,
            3 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            4 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            6 as revenue
    ) t1
group by
    bank,
    month

5 sum(sum()) over (no partition by)

The sum accumulates across all banks instead of per bank; not the expected result.

select
    bank,
    month,
    sum(revenue) as total_revenue,
    sum(sum(revenue)) over(
        order by
            bank,
            month
    ) as acc_revenue
from
    (
        select
            'b1' as bank,
            '2022-01' as month,
            1 as revenue
        union all
        select
            'b1' as bank,
            '2022-01' as month,
            5 as revenue
        union all
        select
            'b2' as bank,
            '2022-01' as month,
            2 as revenue
        union all
        select
            'b1' as bank,
            '2022-02' as month,
            3 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            4 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            6 as revenue
    ) t1
group by
    bank,
    month

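6 sum(sum()) over (partition by, without order by)

Every row gets the grand total of its bank rather than a running sum from the first row to the current row; not the expected result.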

select
    bank,
    month,
    sum(revenue) as total_revenue,
    sum(sum(revenue)) over(
        PARTITION by bank -- no order by
    ) as acc_revenue
from
    (
        select
            'b1' as bank,
            '2022-01' as month,
            1 as revenue
        union all
        select
            'b1' as bank,
            '2022-01' as month,
            5 as revenue
        union all
        select
            'b2' as bank,
            '2022-01' as month,
            2 as revenue
        union all
        select
            'b1' as bank,
            '2022-02' as month,
            3 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            4 as revenue
        union all
        select
            'b2' as bank,
            '2022-02' as month,
            6 as revenue
    ) t1
group by
    bank,
    month
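Variants 4 to 6 differ only in the window specification, so their effect can be compared side by side with a small Python emulation. This is a sketch, not Hive output: `windowed_sum` and its two flags are hypothetical, and the input is the per-(bank, month) totals that the group by produces from the sample rows.

```python
from itertools import accumulate, groupby

# Per-(bank, month) totals produced by the group by in the queries above.
TOTALS = [
    ("b1", "2022-01", 6),
    ("b1", "2022-02", 3),
    ("b2", "2022-01", 2),
    ("b2", "2022-02", 10),
]


def windowed_sum(rows, partition_by_bank, ordered):
    """Emulate sum(sum(revenue)) over(...) on pre-aggregated rows.

    partition_by_bank: True for `partition by bank`, False for no partition by.
    ordered: True for `order by bank, month` (running sum),
             False for no order by (whole-partition total on every row).
    """
    rows = sorted(rows)
    key = (lambda r: r[0]) if partition_by_bank else (lambda r: None)
    out = []
    for _, part in groupby(rows, key=key):
        part = list(part)
        if ordered:
            sums = list(accumulate(total for _, _, total in part))
        else:
            sums = [sum(total for _, _, total in part)] * len(part)
        out.extend(sums)
    return out


print("4:", windowed_sum(TOTALS, partition_by_bank=True, ordered=True))
print("5:", windowed_sum(TOTALS, partition_by_bank=False, ordered=True))
print("6:", windowed_sum(TOTALS, partition_by_bank=True, ordered=False))
```

Only variant 4 restarts the running sum at each bank; variant 5 keeps accumulating across banks, and variant 6 repeats each bank's total on all of its rows.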


1. Source data:

select
   *
from
    stu_score
order by
    score;


2. Using the functions:

select
    name,
    score,
    sum(score) over(order by score range between 2 preceding and 2 following) s1, -- all rows whose score lies within the current row's score minus 2 to plus 2
    sum(score) over(order by score rows between 2 preceding and 2 following) s2, -- current row plus 2 rows before and 2 rows after, at most 5 rows
    sum(score) over(order by score range between unbounded preceding and unbounded following) s3, -- all rows, no limit
    sum(score) over(order by score rows between unbounded preceding and unbounded following) s4, -- all rows, no limit
    sum(score) over(order by score) s5, -- first row to current row (every row with the same score as the current row is included)
    sum(score) over(order by score rows between unbounded preceding and current row) s6, -- first row to current row (rows with the same score that sort after the current row are excluded; this is the difference from s5)
    sum(score) over(order by score rows between 3 preceding and current row) s7, -- current row plus the 3 rows before it
    sum(score) over(order by score rows between 3 preceding and 1 following) s8, -- current row plus 3 rows before and 1 row after
    sum(score) over(order by score rows between current row and unbounded following) s9 -- current row plus all rows after it
from
    stu_score
order by 
    score;
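The difference between rows and range frames is easiest to see on data with tied values. The Python sketch below emulates both frame types for sum(score) over(order by score ...). It is an illustration only: the contents of `stu_score` are not shown in this post, so the `SCORES` list is made-up sample data, and `rows_frame_sum` / `range_frame_sum` are hypothetical helpers where `None` stands for unbounded.

```python
def rows_frame_sum(values, preceding, following):
    """rows between <preceding> preceding and <following> following:
    the frame is counted in physical rows around the current row."""
    vals = sorted(values)
    n = len(vals)
    out = []
    for i in range(n):
        lo = 0 if preceding is None else max(0, i - preceding)
        hi = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(vals[lo:hi + 1]))
    return out


def range_frame_sum(values, preceding, following):
    """range between <preceding> preceding and <following> following:
    the frame is every row whose value lies in
    [current - preceding, current + following]."""
    vals = sorted(values)
    out = []
    for v in vals:
        lo = float("-inf") if preceding is None else v - preceding
        hi = float("inf") if following is None else v + following
        out.append(sum(x for x in vals if lo <= x <= hi))
    return out


# Made-up scores with one tie (two rows with 70).
SCORES = [60, 70, 70, 72, 80]

print("s1:", range_frame_sum(SCORES, 2, 2))
print("s2:", rows_frame_sum(SCORES, 2, 2))
print("s5:", range_frame_sum(SCORES, None, 0))  # default frame with order by
print("s6:", rows_frame_sum(SCORES, None, 0))
```

On the tied rows, s5 gives both rows the same sum because peers are included, while s6 advances one physical row at a time.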

 end
