Hive 窗口函数

Posted noyouth

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Hive 窗口函数相关的知识,希望对你有一定的参考价值。

表数据如下

select * from business;

business.name	business.orderdate	business.cost
jack	2017-01-01	10
tony	2017-01-02	15
jack	2017-02-03	23
tony	2017-01-04	29
jack	2017-01-05	46
jack	2017-04-06	42
tony	2017-01-07	50
jack	2017-01-08	55
mart	2017-04-08	62
mart	2017-04-09	68
neil	2017-05-10	12
mart	2017-04-11	75
neil	2017-06-12	80
mart	2017-04-13	94

  

一、聚合函数开窗

  1.全局范围

  查询顾客姓名及总人数(多次购买只算一人)

select name,count(*) over() from business group by name;

name	count_window_0
jack	4
mart	4
neil	4
tony	4

  2.排序后的范围

  对所有人的消费明细,将 cost 按照日期进行累加

select *,sum(cost) over(order by orderdate) from business;

business.name	business.orderdate	business.cost	sum_window_0
jack	2017-01-01	10	10
tony	2017-01-02	15	25
tony	2017-01-04	29	54
jack	2017-01-05	46	100
tony	2017-01-07	50	150
jack	2017-01-08	55	205
jack	2017-02-03	23	228
jack	2017-04-06	42	270
mart	2017-04-08	62	332
mart	2017-04-09	68	400
mart	2017-04-11	75	475
mart	2017-04-13	94	569
neil	2017-05-10	12	581
neil	2017-06-12	80	661

  3.分区加排序后的范围。分区排序搭配:distribute by + sort by或者partition by + order by。

  将不同人的消费明细,按日期累加

select *,sum(cost) over(distribute by name sort by orderdate) from business;

business.name	business.orderdate	business.cost	sum_window_0
jack	2017-01-01	10	10
jack	2017-01-05	46	56
jack	2017-01-08	55	111
jack	2017-02-03	23	134
jack	2017-04-06	42	176
mart	2017-04-08	62	62
mart	2017-04-09	68	130
mart	2017-04-11	75	205
mart	2017-04-13	94	299
neil	2017-05-10	12	12
neil	2017-06-12	80	92
tony	2017-01-02	15	15
tony	2017-01-04	29	44
tony	2017-01-07	50	94

  4.其他范围

  ①开始到当前行

over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )

  ②当前行到最后

over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING)

  ③前一行、当前行和下一行

over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING)

  

二、其他函数开窗

  1.lag()函数

  查询每个顾客的购买明细及上一次购买时间

select *,lag(orderdate,1,‘1970-01-01‘) over(distribute by name sort by orderdate) from business;

business.name	business.orderdate	business.cost	lag_window_0
jack	2017-01-01	10	1970-01-01
jack	2017-01-05	46	2017-01-01
jack	2017-01-08	55	2017-01-05
jack	2017-02-03	23	2017-01-08
jack	2017-04-06	42	2017-02-03
mart	2017-04-08	62	1970-01-01
mart	2017-04-09	68	2017-04-08
mart	2017-04-11	75	2017-04-09
mart	2017-04-13	94	2017-04-11
neil	2017-05-10	12	1970-01-01
neil	2017-06-12	80	2017-05-10
tony	2017-01-02	15	1970-01-01
tony	2017-01-04	29	2017-01-02
tony	2017-01-07	50	2017-01-04

  2.lead()函数

  查询每个顾客的购买明细及下一次购买时间

select *,lead(orderdate,1,‘9999-99-99‘) over(distribute by name sort by orderdate) from business;

jack	2017-01-01	10	2017-01-05
jack	2017-01-05	46	2017-01-08
jack	2017-01-08	55	2017-02-03
jack	2017-02-03	23	2017-04-06
jack	2017-04-06	42	9999-99-99
mart	2017-04-08	62	2017-04-09
mart	2017-04-09	68	2017-04-11
mart	2017-04-11	75	2017-04-13
mart	2017-04-13	94	9999-99-99
neil	2017-05-10	12	2017-06-12
neil	2017-06-12	80	9999-99-99
tony	2017-01-02	15	2017-01-04
tony	2017-01-04	29	2017-01-07
tony	2017-01-07	50	9999-99-99

  3.ntile()函数

  查询前20%时间的订单信息

select * from (select *,ntile(5) over(order by orderdate) nt from business) t1 where t1.nt = 1;

t1.name	t1.orderdate	t1.cost	t1.nt
jack	2017-01-01	10	1
tony	2017-01-02	15	1
tony	2017-01-04	29	1

  

 

以上是关于Hive 窗口函数的主要内容,如果未能解决你的问题,请参考以下文章

hive关于窗口函数的使用

数据仓库工具Hive——窗口函数,DML,事务

Hive sql及窗口函数

hive函数之~窗口函数与分析函数

大数据技术-hive窗口函数详解

Hive分析窗口函数