Hive Functions: Window Functions and Analytic Functions

Posted by lojun


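Hive provides many window functions and analytic functions, mainly used in the following scenarios:

(1) Partitioned sorting
(2) Dynamic GROUP BY
(3) Top N
(4) Cumulative calculations
(5) Hierarchical queries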

1. Create the Hive table and load data

Create the table

hive (hive_explode)> create table order_detail(
                    user_id string,device_id string,user_type string,price double,sales int
                    )row format delimited fields terminated by ',';

Prepare the data file

cd /export/servers/hivedatas
vim order_detail
zhangsan,1,new,67.1,2
lisi,2,old,43.32,1
wagner,3,new,88.88,3
liliu,4,new,66.0,1
qiuba,5,new,54.32,1
wangshi,6,old,77.77,2
liwei,7,old,88.44,3
wutong,8,new,56.55,6
lilisi,9,new,88.88,5
qishili,10,new,66.66,5

Load the data

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/order_detail' into table order_detail;

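2. Window functions

FIRST_VALUE: the first value in the partition after sorting, up to the current row.

LAST_VALUE: the last value in the partition after sorting, up to the current row.

LEAD(col, n, DEFAULT): returns the value of col from the row n rows after the current row within the window. The first argument is the column name, the second is the offset n (optional, defaults to 1), and the third is the default value returned when there is no row n rows after the current one (if omitted, NULL is returned).

LAG(col, n, DEFAULT): the opposite of LEAD; returns the value of col from the row n rows before the current row within the window. The first argument is the column name, the second is the offset n (optional, defaults to 1), and the third is the default value returned when there is no row n rows before the current one (if omitted, NULL is returned).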
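The LEAD and LAG descriptions above can be tried on the order_detail table with a query along the following lines. This is only a sketch: the aliases prev_sales, next_sales and diff_from_prev are made up for illustration, and user_id is added to ORDER BY as an assumed tie-breaker so that rows with equal sales come back in a stable order.

```sql
-- Sketch: compare each user's sales with the neighbouring rows of the same user_type.
select
    user_id,
    user_type,
    sales,
    -- sales of the previous row; 0 (the third argument) when there is no previous row
    lag(sales, 1, 0) over (partition by user_type order by sales, user_id) as prev_sales,
    -- sales of the next row; offset defaults to 1, and NULL is returned when there is no next row
    lead(sales) over (partition by user_type order by sales, user_id) as next_sales,
    -- difference between the current row and the previous row
    sales - lag(sales, 1, 0) over (partition by user_type order by sales, user_id) as diff_from_prev
from
    order_detail;
```

For the first row of each partition prev_sales falls back to the default 0, while next_sales is NULL on the last row because no default was supplied.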

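3. The OVER clause

1. With the standard aggregate functions COUNT, SUM, MIN, MAX, and AVG.

2. With a PARTITION BY clause on one or more partitioning columns of primitive data types.

3. With PARTITION BY and ORDER BY clauses on one or more partitioning and/or ordering columns of any data type.

4. With a window specification. Window specifications support the following formats: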

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

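When ORDER BY is specified but the window clause is missing, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

The OVER clause supports the following functions, but they cannot be combined with a window clause:

Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.

Lead and Lag functions.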
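To make the two default window specifications described above concrete, the sketch below writes each default out explicitly next to its implicit form. The column aliases are made up for illustration, and each pair of columns is expected to return identical values.

```sql
-- Sketch: implicit default frames next to their explicit equivalents.
select
    user_id,
    user_type,
    sales,
    -- ORDER BY without a window clause ...
    sum(sales) over (partition by user_type order by sales) as running_default,
    -- ... should behave like this explicit RANGE frame
    sum(sales) over (partition by user_type order by sales
                     range between unbounded preceding and current row) as running_explicit,
    -- neither ORDER BY nor a window clause ...
    sum(sales) over (partition by user_type) as total_default,
    -- ... should behave like this explicit frame covering the whole partition
    sum(sales) over (partition by user_type
                     rows between unbounded preceding and unbounded following) as total_explicit
from
    order_detail;
```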

Computing sales totals with window functions

Use sum ... over to total the sales

hive (hive_explode)> select
user_id,
user_type,
sales,
--all rows in the partition
sum(sales) over(partition by user_type) AS sales_1 ,
sum(sales) over(order by user_type) AS sales_2 ,
--defaults to start of partition through current row; rows with equal sales get the same running total
sum(sales) over(partition by user_type order by sales asc) AS sales_3,
--start of partition through current row, counted row by row; differs from sales_3 because tied rows are accumulated one at a time, in whatever order they happen to be sorted
sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) AS sales_4,
--current row + 3 preceding rows
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) AS sales_5,
--current row + 3 preceding rows + 1 following row
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) AS sales_6,
--current row + all following rows
sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) AS sales_7
from
order_detail
order by
    user_type,
    sales,
    user_id;

The query returns the following result:

+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
|  user_id  | user_type  | sales  | sales_1  | sales_2  | sales_3  | sales_4  | sales_5  | sales_6  | sales_7  |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
| liliu     | new        | 1      | 23       | 23       | 2        | 2        | 2        | 4        | 22       |
| qiuba     | new        | 1      | 23       | 23       | 2        | 1        | 1        | 2        | 23       |
| zhangsan  | new        | 2      | 23       | 23       | 4        | 4        | 4        | 7        | 21       |
| wagner    | new        | 3      | 23       | 23       | 7        | 7        | 7        | 12       | 19       |
| lilisi    | new        | 5      | 23       | 23       | 17       | 17       | 15       | 21       | 11       |
| qishili   | new        | 5      | 23       | 23       | 17       | 12       | 11       | 16       | 16       |
| wutong    | new        | 6      | 23       | 23       | 23       | 23       | 19       | 19       | 6        |
| lisi      | old        | 1      | 6        | 29       | 1        | 1        | 1        | 3        | 6        |
| wangshi   | old        | 2      | 6        | 29       | 3        | 3        | 3        | 6        | 5        |
| liwei     | old        | 3      | 6        | 29       | 6        | 6        | 6        | 6        | 3        |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+

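Notes:

The result depends on ORDER BY, which defaults to ascending order.

If ROWS BETWEEN is not specified but ORDER BY is, the window defaults to the start of the partition through the current row.

If ORDER BY is not specified, all values in the partition are summed.

The key is to understand ROWS BETWEEN, also called the WINDOW clause:

PRECEDING: rows before the current row

FOLLOWING: rows after the current row

CURRENT ROW: the current row

UNBOUNDED: no boundary (the start or the end of the partition)

UNBOUNDED PRECEDING: from the start of the partition

UNBOUNDED FOLLOWING: to the end of the partition

COUNT, AVG, MIN, and MAX are used in the same way as SUM.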
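Since COUNT, AVG, MIN, and MAX take the same OVER clause as SUM, a query like the one below may help illustrate the window keywords with other aggregates. It is a sketch only: the aliases are hypothetical, and user_id is added to ORDER BY as an assumed tie-breaker so the ROWS frames are deterministic.

```sql
-- Sketch: other aggregates over the same kinds of frames used with SUM.
select
    user_id,
    user_type,
    price,
    sales,
    -- number of rows from the start of the partition through the current row
    count(*) over (partition by user_type order by sales, user_id
                   rows between unbounded preceding and current row) as running_cnt,
    -- average price over the previous, current, and next row
    avg(price) over (partition by user_type order by sales, user_id
                     rows between 1 preceding and 1 following) as moving_avg_price,
    -- no ORDER BY: minimum and maximum over the whole partition
    min(price) over (partition by user_type) as min_price,
    max(price) over (partition by user_type) as max_price
from
    order_detail;
```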

 

Getting the first and last value in a partition with first_value and last_value

Use first_value and last_value to get the first and last value within each partition

select
    user_id,
    user_type,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num, 
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
    order_detail;
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
|  user_id  | user_type  | row_num  | max_sales_user  | min_sales_user  | curr_last_min_user  | curr_last_max_user  |
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
| wutong    | new        | 7        | wutong          | qiuba           | wutong              | wutong              |
| lilisi    | new        | 6        | wutong          | qiuba           | qishili             | lilisi              |
| qishili   | new        | 5        | wutong          | qiuba           | qishili             | lilisi              |
| wagner    | new        | 4        | wutong          | qiuba           | wagner              | wagner              |
| zhangsan  | new        | 3        | wutong          | qiuba           | zhangsan            | zhangsan            |
| liliu     | new        | 2        | wutong          | qiuba           | qiuba               | liliu               |
| qiuba     | new        | 1        | wutong          | qiuba           | qiuba               | liliu               |
| liwei     | old        | 3        | liwei           | lisi            | liwei               | liwei               |
| wangshi   | old        | 2        | liwei           | lisi            | wangshi             | wangshi             |
| lisi      | old        | 1        | liwei           | lisi            | lisi                | lisi                |
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
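In the result above, the last_value columns vary from row to row because the default frame ends at the current row (together with any rows tied with it). If the goal is the last value of the whole partition, one option is to widen the frame explicitly, as sketched below. The aliases are illustrative, and user_id is an assumed tie-breaker so that users with equal sales are ordered consistently.

```sql
-- Sketch: first and last user of the entire partition, independent of the current row.
select
    user_id,
    user_type,
    sales,
    -- first row of the whole partition when sorted by sales ascending
    first_value(user_id) over (
        partition by user_type order by sales asc, user_id asc
        rows between unbounded preceding and unbounded following
    ) as min_sales_user,
    -- last row of the whole partition, not just up to the current row
    last_value(user_id) over (
        partition by user_type order by sales asc, user_id asc
        rows between unbounded preceding and unbounded following
    ) as max_sales_user
from
    order_detail;
```

With this frame every row of a partition should report the same pair of users.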

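4. Analytic functions

1. ROW_NUMBER():

Generates a sequence number for each record in the partition, starting from 1 and following the sort order. For example, sorting by pv in descending order yields the daily pv ranking within each partition. ROW_NUMBER() has many uses, such as fetching the top-ranked record in each partition or the first refer in a session.

2. RANK():

Generates the rank of each row within its partition. Tied rows share a rank and leave gaps in the ranking.

3. DENSE_RANK():

Generates the rank of each row within its partition. Tied rows share a rank and leave no gaps in the ranking.

4. CUME_DIST:

(Number of rows with a value less than or equal to the current value) / (total number of rows in the partition). For example, the proportion of people whose salary is less than or equal to the current salary.

5. PERCENT_RANK:

(RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)

6. NTILE(n):

Splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows cannot be divided evenly, the extra rows go to the lower-numbered buckets first. NTILE does not support ROWS BETWEEN; for example, the following is not allowed: NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW).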
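CUME_DIST and PERCENT_RANK are not used in the queries that follow, so here is a small sketch of how they might be applied to order_detail. The aliases cd and pr are made up for illustration.

```sql
-- Sketch: cumulative distribution and percent rank of sales within each user_type.
select
    user_id,
    user_type,
    sales,
    -- rows with sales <= current sales, divided by the rows in the partition
    cume_dist() over (partition by user_type order by sales) as cd,
    -- (rank - 1) / (rows in the partition - 1)
    percent_rank() over (partition by user_type order by sales) as pr
from
    order_detail;
-- For the three 'old' rows (sales 1, 2, 3) the formulas give
-- cd = 1/3, 2/3, 1 and pr = 0, 0.5, 1.
```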

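Using RANK, ROW_NUMBER, and DENSE_RANK with OVER

These functions can be used to compute the top N per group.

Requirement: partition by user type and find the N records with the highest sales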

select
    user_id,user_type,sales,
    RANK() over (partition by user_type order by sales desc) as r,
    ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
    DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
    order_detail;  
+-----------+------------+--------+----+-----+-----+--+
|  user_id  | user_type  | sales  | r  | rn  | dr  |
+-----------+------------+--------+----+-----+-----+--+
| wutong    | new        | 6      | 1  | 1   | 1   |
| qishili   | new        | 5      | 2  | 2   | 2   |
| lilisi    | new        | 5      | 2  | 3   | 2   |
| wagner    | new        | 3      | 4  | 4   | 3   |
| zhangsan  | new        | 2      | 5  | 5   | 4   |
| qiuba     | new        | 1      | 6  | 6   | 5   |
| liliu     | new        | 1      | 6  | 7   | 5   |
| liwei     | old        | 3      | 1  | 1   | 1   |
| wangshi   | old        | 2      | 2  | 2   | 2   |
| lisi      | old        | 1      | 3  | 3   | 3   |
+-----------+------------+--------+----+-----+-----+--+
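The query above only assigns the ranks. To actually keep the top N rows per user type, the ranking can be wrapped in a subquery and filtered, roughly as follows. This is a sketch with N = 2 chosen arbitrarily; the subquery alias t is illustrative.

```sql
-- Sketch: top 2 rows by sales within each user_type.
select
    user_id,
    user_type,
    sales
from (
    select
        user_id,
        user_type,
        sales,
        ROW_NUMBER() over (partition by user_type order by sales desc) as rn
    from
        order_detail
) t
where rn <= 2;
```

ROW_NUMBER returns exactly N rows per group and breaks ties arbitrarily. Filtering on RANK or DENSE_RANK instead would keep all tied rows, which can return more than N rows.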

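Using NTILE to compute percentiles

NTILE splits the data into a given number of buckets, which can then be used to select a percentage of the rows.

Split the data into buckets with NTILE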

select
    user_type,sales,
    --split each partition into 2 buckets
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    --split each partition into 3 buckets
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    --split each partition into 4 buckets
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    --split all rows into 4 buckets
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
    order_detail
order by
    user_type,
    sales;

The result is as follows:

+------------+--------+------+------+------+----------+--+
| user_type  | sales  | nt2  | nt3  | nt4  | all_nt4  |
+------------+--------+------+------+------+----------+--+
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 2      | 1    | 1    | 2    | 2        |
| new        | 3      | 1    | 2    | 2    | 3        |
| new        | 5      | 2    | 2    | 3    | 4        |
| new        | 5      | 2    | 3    | 3    | 3        |
| new        | 6      | 2    | 3    | 4    | 4        |
| old        | 1      | 1    | 1    | 1    | 1        |
| old        | 2      | 1    | 2    | 2    | 2        |
| old        | 3      | 2    | 3    | 3    | 2        |
+------------+--------+------+------+------+----------+--+

Use NTILE to get the user_id values in the top 20% by sales

select
    user_id
from
(select  user_id, NTILE(5) OVER(ORDER BY sales desc) AS nt
    from  order_detail
)A
where nt=1;
+----------+--+
| user_id  |
+----------+--+
| wutong   |
| qishili  |
+----------+--+

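5. Enhanced aggregation: CUBE, GROUPING SETS, and ROLLUP

These features are typically used in OLAP for metrics that are not additive and have to be computed while rolling up and drilling down across dimensions, for example UV counts by hour, day, and month.

GROUPING SETS

Within a single GROUP BY query, GROUPING SETS aggregates by several combinations of dimensions. It is equivalent to a UNION ALL of the GROUP BY result sets for each combination.

The GROUPING__ID column indicates which grouping set a result row belongs to.

Requirement: compute the statistics grouped by user_type and, separately, grouped by sales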

0: jdbc:hive2://node03:10000>select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from
    order_detail
group by
    user_type,sales
GROUPING SETS(user_type,sales)
ORDER BY
    GROUPING__ID;

The result is as follows:

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
+------------+--------+-----+---------------+--+
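The equivalence with UNION ALL mentioned above can be written out as follows. This is a sketch that reproduces the user_type, sales, and pv columns of the GROUPING SETS(user_type,sales) query. GROUPING__ID is left out because it only exists in grouping-set queries, and the values it takes can differ between Hive versions.

```sql
-- Sketch: the GROUPING SETS(user_type, sales) query as two GROUP BY queries.
select
    user_type,
    cast(null as int) as sales,   -- sales is not a grouping column in this branch
    count(user_id) as pv
from
    order_detail
group by
    user_type
union all
select
    cast(null as string) as user_type,   -- user_type is not a grouping column in this branch
    sales,
    count(user_id) as pv
from
    order_detail
group by
    sales;
```

The GROUPING SETS form scans the table once and is shorter to write, which is the main reason to prefer it over the hand-written union.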

Requirement: compute the statistics grouped by user_type, by sales, and by user_type + sales

0: jdbc:hive2://node03:10000>select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from
    order_detail
group by
    user_type,sales
GROUPING SETS(user_type,sales,(user_type,sales))
ORDER BY
    GROUPING__ID;

The result is as follows:

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 1      | 3   | 2             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 1      | 2   | 3             |
| new        | 2      | 1   | 3             |
+------------+--------+-----+---------------+--+

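6. CUBE and ROLLUP: CUBE aggregates over every combination of the GROUP BY dimensions, and ROLLUP over a hierarchical subset of those combinations.

Aggregating with CUBE

Requirement: compute the statistics with no grouping, grouped by user_type, grouped by sales, and grouped by user_type + sales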

0: jdbc:hive2://node03:10000>select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from
    order_detail
group by
    user_type,sales
WITH CUBE
ORDER BY
    GROUPING__ID;
+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| NULL       | NULL   | 10  | 0             |
| new        | NULL   | 7   | 1             |
| old        | NULL   | 3   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+--+
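WITH CUBE on two columns covers four grouping sets, so the query above can also be spelled out with GROUPING SETS. The sketch below lists them explicitly; the empty set () stands for the grand total over all rows.

```sql
-- Sketch: WITH CUBE on (user_type, sales) written as explicit grouping sets.
select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from
    order_detail
group by
    user_type,sales
GROUPING SETS((user_type,sales),user_type,sales,())
ORDER BY
    GROUPING__ID;
```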
 

Aggregating with ROLLUP

ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension.

select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from
    order_detail
group by
    user_type,sales
WITH ROLLUP
ORDER BY
    GROUPING__ID;
+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| NULL       | NULL   | 10  | 0             |
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+--+
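Compared with CUBE, the result above has no rows grouped by sales alone. Written as explicit grouping sets, WITH ROLLUP on (user_type, sales) corresponds to the following sketch, where each set drops one more column from the right.

```sql
-- Sketch: WITH ROLLUP on (user_type, sales) written as explicit grouping sets.
select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from
    order_detail
group by
    user_type,sales
GROUPING SETS((user_type,sales),user_type,())
ORDER BY
    GROUPING__ID;
```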

 
