HiveSQL Tip of the Day: How to Accurately Compute a Trailing 30-Day Metric

Posted 2023-02-19 莫叫石榴姐

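1 Requirement

The test table has three columns: user_id (user), dt (date), and price (order amount).

For each consumer, find the "first" date in their history on which cumulative spending over the trailing 30-day window reached 10,000.

2 Analysis

(1) Data preparation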

create table test as
select 'a' as user_id, 7000 as price, '2022-07-01' as dt
union all
select 'a' as user_id, 4000 as price, '2022-08-22' as dt
union all
select 'a' as user_id, 8000 as price, '2022-08-23' as dt

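(2) Analysis

Target fields: consumer, date

Condition: the "first" date on which cumulative spending over the trailing 30-day window reached 10,000

Step 1: how to compute cumulative spending over the trailing 30 days

A solution that readily comes to mind for this kind of problem is: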

sum(price) over (partition by user_id order by dt rows between 30 preceding and current row)

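This approach has a problem, though. rows counts physical rows, but a user's order dates are not contiguous: there are gaps and missing days. A frame of 30 preceding rows can therefore reach back well beyond 30 days, and the computed sum comes out too large. Hive's window functions also support range, which frames rows by the logical value of the ordering key rather than by position. With a date ordering key, the frame covers every row whose date falls in [dt - 30, dt], which is what "the trailing 30 days" is meant to express. (That interval spans 31 calendar days inclusive; use 29 preceding if the window must be exactly 30 days counting the current day.) The sum can therefore be written as: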

sum(price) over (partition by user_id order by cast(dt as date) range between 30 preceding and current row)
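To make the difference between the two frames concrete, here is a small Python sketch that emulates both on the sample rows. It only mimics the frame definitions and is not how Hive evaluates windows; the helper names rows_sum and range_sum are made up for this illustration, and the rows are assumed to be sorted by date for a single user.

```python
from datetime import date, timedelta

# Sample orders for user 'a', sorted by date: (dt, price)
orders = [
    (date(2022, 7, 1), 7000),
    (date(2022, 8, 22), 4000),
    (date(2022, 8, 23), 8000),
]


def rows_sum(data, i, n=30):
    """Emulate ROWS BETWEEN n PRECEDING AND CURRENT ROW: count physical rows."""
    start = max(0, i - n)
    return sum(price for _, price in data[start:i + 1])


def range_sum(data, i, days=30):
    """Emulate RANGE BETWEEN days PRECEDING AND CURRENT ROW on a date key."""
    current = data[i][0]
    lower = current - timedelta(days=days)
    # Every row whose date lies in [current - days, current] is in the frame.
    return sum(price for dt, price in data if lower <= dt <= current)


for i, (dt, price) in enumerate(orders):
    print(dt, price, rows_sum(orders, i), range_sum(orders, i))
# 2022-07-01 7000 7000 7000
# 2022-08-22 4000 11000 4000
# 2022-08-23 8000 19000 12000
```

On 2022-08-22 the rows frame still picks up the 2022-07-01 order, 52 days earlier, while the range frame leaves it out.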

Step 2: find the first date

First: min(dt) -- the earliest date

By extension, the most recent, latest, or last date is max(dt).

The complete SQL:

select user_id,min(dt)
from (
         select dt
              , user_id
              , sum(price)
                over (partition by user_id order by cast(dt as date) range between 30 preceding and current row) as order_price
         from (select 'a' as user_id, 7000 as price, '2022-07-01' as dt
               union all
               select 'a' as user_id, 4000 as price, '2022-08-22' as dt
               union all
               select 'a' as user_id, 8000 as price, '2022-08-23' as dt
              ) t
     ) t
where order_price >= 10000
group by user_id
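As a cross-check of the full range-based query, the following Python sketch reproduces its logic on the same three rows: a trailing 30-day sum per order, a threshold filter, then the minimum date per user. The function name first_date_reaching is hypothetical, and the filter uses >= to match the requirement that spending "reached" 10,000.

```python
from collections import defaultdict
from datetime import date, timedelta


def first_date_reaching(orders, threshold=10000, days=30):
    """Return {user_id: first dt whose trailing window sum >= threshold}."""
    # Equivalent of PARTITION BY user_id.
    by_user = defaultdict(list)
    for user_id, price, dt in orders:
        by_user[user_id].append((date.fromisoformat(dt), price))

    result = {}
    for user_id, rows in by_user.items():
        hits = []
        for current, _ in rows:
            lower = current - timedelta(days=days)
            # Equivalent of RANGE BETWEEN 30 PRECEDING AND CURRENT ROW.
            window_total = sum(p for d, p in rows if lower <= d <= current)
            if window_total >= threshold:
                hits.append(current)
        if hits:
            # Equivalent of min(dt) ... GROUP BY user_id.
            result[user_id] = min(hits).isoformat()
    return result


orders = [
    ("a", 7000, "2022-07-01"),
    ("a", 4000, "2022-08-22"),
    ("a", 8000, "2022-08-23"),
]
print(first_date_reaching(orders))  # {'a': '2022-08-23'}
```

Users who never reach the threshold are simply absent from the result, just as they are filtered out by the where clause in the SQL.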

For comparison, the result computed with rows:

select user_id, min(dt)
from (
         select dt
              , user_id
              , sum(price)
                over (partition by user_id order by dt rows between 30 preceding and current row) as order_price
         from (select 'a' as user_id, 7000 as price, '2022-07-01' as dt
               union all
               select 'a' as user_id, 4000 as price, '2022-08-22' as dt
               union all
               select 'a' as user_id, 8000 as price, '2022-08-23' as dt
              ) t
     ) A
where order_price >= 10000
group by user_id

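The rows result is clearly wrong: it returns 2022-08-22, because the frame for that row includes the 2022-07-01 order (7000 + 4000 = 11000), even though 2022-07-01 is not within the 30 days before 2022-08-22. With range, the sum on 2022-08-22 is only 4000, and the threshold is first reached on 2022-08-23 (4000 + 8000 = 12000).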

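Some databases do not support the range frame clause. What then? A common, general-purpose approach is to use a date dimension table to fill in the missing dates. Suppose we have a date dimension table, dim_date, that contains every calendar day.

Its dates are contiguous. Because the window is partitioned by user (user_id), each user also has to be paired with every date in the dimension table. This kind of padding is usually done with a cross join (Cartesian product) of the date dimension and the user dimension. The SQL is: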

with data as
         (select 'a' as user_id, 7000 as price, '2022-07-01' as dt
          union all
          select 'a' as user_id, 4000 as price, '2022-08-22' as dt
          union all
          select 'a' as user_id, 8000 as price, '2022-08-23' as dt
         )
,dim_user AS
    (select 'a' user_id
     UNION ALL
     select 'b' user_id
     UNION ALL
     select 'c' user_id
    )
select *
from (
         select d.date_id, u.user_id
         from (select date_id
               from dim.dim_date
               where date_format(date_id, 'yyyy-MM') >= '2022-06'
              ) d
              cross join dim_user u
     ) d

The result contains one row per (date, user) pair, so every date record carries every user.
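The padding step can be pictured with a short Python sketch. The three-day date list stands in for dim.dim_date and the user list for dim_user; both are illustrative assumptions, not the contents of the real tables.

```python
from datetime import date, timedelta
from itertools import product


def date_range(start, end):
    """Yield every calendar day from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


# Stand-ins for dim.dim_date (restricted to a few days) and dim_user.
dates = list(date_range(date(2022, 8, 21), date(2022, 8, 23)))
users = ["a", "b", "c"]

# Cross join: every date paired with every user.
scaffold = list(product(dates, users))

for date_id, user_id in scaffold:
    print(date_id, user_id)
```

The scaffold has len(dates) * len(users) rows, which is also why this approach grows expensive as either dimension gets large.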

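Finally, use this table as the driving table and left join the data table to it; the join keys (date and user) map each order onto exactly one row.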

with data as
         (select 'a' as user_id, 7000 as price, '2022-07-01' as dt
          union all
          select 'a' as user_id, 4000 as price, '2022-08-22' as dt
          union all
          select 'a' as user_id, 8000 as price, '2022-08-23' as dt
         )
,dim_user AS
    (select 'a' user_id
     UNION ALL
     select 'b' user_id
     UNION ALL
     select 'c' user_id
    )
select *
from (
         select d.date_id, u.user_id
         from (select date_id
               from dim.dim_date
               where date_format(date_id, 'yyyy-MM') >= '2022-06'
              ) d
              cross join dim_user u
     ) d
     left join data
               on d.date_id = data.dt and d.user_id = data.user_id

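In the result, the driving table is the complete scaffold: it carries every date and every user. The order by key should therefore be the date from the dimension table (date_id), and the partition by key should be the user_id from the driving table. With that in place, rows gives the correct answer, provided data holds at most one row per user per day (otherwise aggregate it by day first).

The final SQL: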

with data as
         (select 'a' as user_id, 7000 as price, '2022-07-01' as dt
          union all
          select 'a' as user_id, 4000 as price, '2022-08-22' as dt
          union all
          select 'a' as user_id, 8000 as price, '2022-08-23' as dt
         )
,dim_user AS
    (select 'a' user_id
     UNION ALL
     select 'b' user_id
     UNION ALL
     select 'c' user_id
    )
-- dt is null on days without an order; min() skips nulls
select user_id, min(dt)
from (
         select data.dt
              , d.user_id
              , sum(data.price)
                over (partition by d.user_id order by d.date_id rows between 30 preceding and current row) as order_price
         from (
               -- one row per (date, user): the gap-free scaffold
               select d.date_id, u.user_id
               from (select date_id
                     from dim.dim_date
                     where date_format(date_id, 'yyyy-MM') >= '2022-06'
                    ) d
                    cross join dim_user u
              ) d
              left join data
                        on d.date_id = data.dt and d.user_id = data.user_id
     ) A
where order_price >= 10000
group by user_id

The final result matches the one produced by range.
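The agreement can be checked with a Python sketch of the padded approach: build a gap-free daily series per user, then apply a positional (rows-style) window of 30 preceding rows. The end date 2022-09-30 and the helper name first_hit_dense are assumptions made for the illustration; the start date mirrors the '2022-06' filter in the SQL, and the filter uses >= for "reached 10,000".

```python
from datetime import date, timedelta

# (user_id, dt) -> price, at most one order per user per day
orders = {
    ("a", date(2022, 7, 1)): 7000,
    ("a", date(2022, 8, 22)): 4000,
    ("a", date(2022, 8, 23)): 8000,
}
users = ["a", "b", "c"]

# Stand-in for dim.dim_date: every day in an assumed range.
start, end = date(2022, 6, 1), date(2022, 9, 30)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]


def first_hit_dense(user_id, threshold=10000, n=30):
    """ROWS-style window over a gap-free daily series for one user."""
    # Left join: days without an order yield None, like SQL NULL.
    series = [orders.get((user_id, d)) for d in days]
    for i, d in enumerate(days):
        frame = series[max(0, i - n):i + 1]
        # Skip None values, as SUM skips NULLs.
        total = sum(p for p in frame if p is not None)
        if total >= threshold:
            return d.isoformat()
    return None


print({u: first_hit_dense(u) for u in users})
# {'a': '2022-08-23', 'b': None, 'c': None}
```

Because every calendar day is now a physical row, 30 preceding rows and 30 preceding days coincide, which is why rows and range agree here.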

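Note: whether other dimensions need to be filled in depends on the partition by columns; every column listed there has to be padded onto the date dimension. If the date dimension table alone is used as the driving table, partition by cannot group the rows correctly. This method is considerably more expensive, since the scaffold holds one row per date per user, but it is general, and it is the fallback when the window implementation does not support the range clause.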

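3 Summary

This article covered two ways to compute spending over the trailing 30 days: a range frame, and a rows frame over a date-padded table. Both are broadly applicable and worth knowing.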
