Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE

Posted yy

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE相关的知识,希望对你有一定的参考价值。

参考自大数据田地:http://lxw1234.com/archives/2015/04/190.htm

测试数据准备:

create external table test_data (
cookieid string,
createtime string,  --页面访问时间
url string       --被访问页面
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY , 
stored as textfile location /user/jc_rc_ftp/test_data;

 select * from test_data l;
+-------------+----------------------+---------+--+
| l.cookieid  |     l.createtime     |  l.url  |
+-------------+----------------------+---------+--+
| cookie1     | 2015-04-10 10:00:02  | url2    |
| cookie1     | 2015-04-10 10:00:00  | url1    |
| cookie1     | 2015-04-10 10:03:04  | 1url3   |
| cookie1     | 2015-04-10 10:50:05  | url6    |
| cookie1     | 2015-04-10 11:00:00  | url7    |
| cookie1     | 2015-04-10 10:10:00  | url4    |
| cookie1     | 2015-04-10 10:50:01  | url5    |
| cookie2     | 2015-04-10 10:00:02  | url22   |
| cookie2     | 2015-04-10 10:00:00  | url11   |
| cookie2     | 2015-04-10 10:03:04  | 1url33  |
| cookie2     | 2015-04-10 10:50:05  | url66   |
| cookie2     | 2015-04-10 11:00:00  | url77   |
| cookie2     | 2015-04-10 10:10:00  | url44   |
| cookie2     | 2015-04-10 10:50:01  | url55   |
+-------------+----------------------+---------+--+

LAG
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,1970-01-01 00:00:00) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     last_1_time      |     last_2_time      |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  | NULL                 |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | 2015-04-10 10:00:00  | NULL                 |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 2015-04-10 10:00:02  | 2015-04-10 10:00:00  |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | 2015-04-10 10:03:04  | 2015-04-10 10:00:02  |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | 2015-04-10 10:10:00  | 2015-04-10 10:03:04  |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | 2015-04-10 10:50:01  | 2015-04-10 10:10:00  |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | 2015-04-10 10:50:05  | 2015-04-10 10:50:01  |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | 1970-01-01 00:00:00  | NULL                 |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | 2015-04-10 10:00:00  | NULL                 |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 2015-04-10 10:00:02  | 2015-04-10 10:00:00  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | 2015-04-10 10:03:04  | 2015-04-10 10:00:02  |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | 2015-04-10 10:10:00  | 2015-04-10 10:03:04  |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | 2015-04-10 10:50:01  | 2015-04-10 10:10:00  |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | 2015-04-10 10:50:05  | 2015-04-10 10:50:01  |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+

LEAD

与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,1970-01-01 00:00:00) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time 
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid  |      createtime      |   url   | rn  |     next_1_time      |     next_2_time      |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | 2015-04-10 10:00:02  | 2015-04-10 10:03:04  |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | 2015-04-10 10:03:04  | 2015-04-10 10:10:00  |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 2015-04-10 10:10:00  | 2015-04-10 10:50:01  |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | 2015-04-10 10:50:01  | 2015-04-10 10:50:05  |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | 2015-04-10 10:50:05  | 2015-04-10 11:00:00  |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | 2015-04-10 11:00:00  | NULL                 |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | 1970-01-01 00:00:00  | NULL                 |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | 2015-04-10 10:00:02  | 2015-04-10 10:03:04  |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | 2015-04-10 10:03:04  | 2015-04-10 10:10:00  |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 2015-04-10 10:10:00  | 2015-04-10 10:50:01  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | 2015-04-10 10:50:01  | 2015-04-10 10:50:05  |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | 2015-04-10 10:50:05  | 2015-04-10 11:00:00  |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | 2015-04-10 11:00:00  | NULL                 |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | 1970-01-01 00:00:00  | NULL                 |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+

FIRST_VALUE

取分组内排序后,截止到当前行,第一个值

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 
FROM test_data;


+-----------+----------------------+---------+-----+---------+--+
| cookieid  |      createtime      |   url   | rn  | first1  |
+-----------+----------------------+---------+-----+---------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url1    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | url1    |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url1    |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url1    |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url1    |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url1    |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url11   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | url11   |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url11   |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url11   |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url11   |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url11   |
+-----------+----------------------+---------+-----+---------+--+

LAST_VALUE

取分组内排序后,截止到当前行,最后一个值

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1 
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid  |      createtime      |   url   | rn  |  last1  |
+-----------+----------------------+---------+-----+---------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   |
+-----------+----------------------+---------+-----+---------+--+

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1 
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid  |      createtime      |   url   | rn  |  last1  |
+-----------+----------------------+---------+-----+---------+--+
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    |
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
+-----------+----------------------+---------+-----+---------+--+

如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果

SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2  
FROM test_data;
+-----------+----------------------+---------+---------+--+
| cookieid  |      createtime      |   url   | first2  |
+-----------+----------------------+---------+---------+--+
| cookie1   | 2015-04-10 10:00:02  | url2    | url2    |
| cookie1   | 2015-04-10 10:50:01  | url5    | url2    |
| cookie1   | 2015-04-10 10:10:00  | url4    | url2    |
| cookie1   | 2015-04-10 11:00:00  | url7    | url2    |
| cookie1   | 2015-04-10 10:50:05  | url6    | url2    |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | url2    |
| cookie1   | 2015-04-10 10:00:00  | url1    | url2    |
| cookie2   | 2015-04-10 10:50:01  | url55   | url55   |
| cookie2   | 2015-04-10 10:10:00  | url44   | url55   |
| cookie2   | 2015-04-10 11:00:00  | url77   | url55   |
| cookie2   | 2015-04-10 10:50:05  | url66   | url55   |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | url55   |
| cookie2   | 2015-04-10 10:00:00  | url11   | url55   |
| cookie2   | 2015-04-10 10:00:02  | url22   | url55   |
+-----------+----------------------+---------+---------+--+
SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2  
FROM test_data;
+-----------+----------------------+---------+--------+--+
| cookieid  |      createtime      |   url   | last2  |
+-----------+----------------------+---------+--------+--+
| cookie1   | 2015-04-10 10:00:02  | url2    | url1   |
| cookie1   | 2015-04-10 10:50:01  | url5    | url1   |
| cookie1   | 2015-04-10 10:10:00  | url4    | url1   |
| cookie1   | 2015-04-10 11:00:00  | url7    | url1   |
| cookie1   | 2015-04-10 10:50:05  | url6    | url1   |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | url1   |
| cookie1   | 2015-04-10 10:00:00  | url1    | url1   |
| cookie2   | 2015-04-10 10:50:01  | url55   | url22  |
| cookie2   | 2015-04-10 10:10:00  | url44   | url22  |
| cookie2   | 2015-04-10 11:00:00  | url77   | url22  |
| cookie2   | 2015-04-10 10:50:05  | url66   | url22  |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | url22  |
| cookie2   | 2015-04-10 10:00:00  | url11   | url22  |
| cookie2   | 2015-04-10 10:00:02  | url22   | url22  |
+-----------+----------------------+---------+--------+--+
14 rows selected (78.058 seconds)

如果想要取分组内排序后最后一个值,则需要变通一下:

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2 
FROM test_data 
ORDER BY cookieid,createtime;
+-----------+----------------------+---------+-----+---------+--------+--+
| cookieid  |      createtime      |   url   | rn  |  last1  | last2  |
+-----------+----------------------+---------+-----+---------+--------+--+
| cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    | url7   |
| cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    | url7   |
| cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   | url7   |
| cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    | url7   |
| cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    | url7   |
| cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    | url7   |
| cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    | url7   |
| cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   | url77  |
| cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   | url77  |
| cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  | url77  |
| cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   | url77  |
| cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   | url77  |
| cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   | url77  |
| cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   | url77  |
+-----------+----------------------+---------+-----+---------+--------+--+

 

以上是关于Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE的主要内容,如果未能解决你的问题,请参考以下文章

Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE

MYSQL lag() 和lead()函数使用介绍

Hive开窗函数

ORACLE 偏移分析函数 lag()与lead() 用法

ORACLE 偏移分析函数 lag()与lead() 用法

lag与lead函数 oracle_11g