Hive: the consecutive login problem
Posted 浊酒南街
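Problem
Given a login log, compute the maximum number of consecutive login days for each user. (Note: a gap of one day still counts as consecutive.)
The data looks like this: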
id dt
01 2021-02-28
01 2021-03-01
01 2021-03-02
01 2021-03-04
01 2021-03-05
01 2021-03-06
01 2021-03-08
02 2021-03-01
02 2021-03-02
02 2021-03-03
02 2021-03-06
03 2021-03-06
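Method 1
Idea: arithmetic sequence
1. Partition by id, order by dt, and compute the rank rk.
2. Subtract rk from each row's date to get flag. Dates that are consecutive yield the same date after the subtraction, so they share the same flag.
3. Group by id and flag; count(*) gives the number of consecutive days in each run.
4. Partition by id, order by flag, and compute the rank again.
5. Subtract that rank from each row's flag to get new_flag.
6. Group by id and new_flag and compute the number of consecutive days.
7. Group by id and take the maximum.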
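The steps above can be traced in plain Python on the sample data. This is only a sketch for checking the logic, not Hive code: the helper name `max_streak_with_one_gap` is made up for illustration, and it assumes at most one row per (id, dt). As in the SQL, the missing day between two merged runs is counted as part of the streak.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

# Sample rows from the problem statement: (id, dt)
ROWS = [
    ("01", "2021-02-28"), ("01", "2021-03-01"), ("01", "2021-03-02"),
    ("01", "2021-03-04"), ("01", "2021-03-05"), ("01", "2021-03-06"),
    ("01", "2021-03-08"),
    ("02", "2021-03-01"), ("02", "2021-03-02"), ("02", "2021-03-03"),
    ("02", "2021-03-06"),
    ("03", "2021-03-06"),
]


def max_streak_with_one_gap(rows):
    """Hypothetical helper mirroring steps 1-7; assumes one row per (id, dt)."""
    dates_by_id = defaultdict(list)
    for uid, dt in rows:
        dates_by_id[uid].append(date.fromisoformat(dt))

    result = {}
    for uid, dates in dates_by_id.items():
        dates.sort()
        # Steps 1-3: flag = dt - rk, then count the rows per flag.
        days_per_flag = Counter(
            d - timedelta(days=rk) for rk, d in enumerate(dates, start=1)
        )
        # Steps 4-6: new_flag = flag - rank over the distinct flags.
        total_days = defaultdict(int)
        runs = defaultdict(int)
        for rk, flag in enumerate(sorted(days_per_flag), start=1):
            new_flag = flag - timedelta(days=rk)
            total_days[new_flag] += days_per_flag[flag]
            runs[new_flag] += 1
        # A merged group spans its login days plus one gap day between runs.
        # Step 7: keep the longest group per id.
        result[uid] = max(total_days[k] + runs[k] - 1 for k in total_days)
    return result


print(max_streak_with_one_gap(ROWS))  # {'01': 9, '02': 3, '03': 1}
```

For id 01 the three runs (3 days each, then 1 day) merge into one streak from 2021-02-28 to 2021-03-08, which is 9 days including the two skipped days.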
SQL implementation:
select
    id
    ,max(days) as days
from (
    -- merge runs that are separated by a single missing day
    select
        id
        ,new_flag
        ,sum(days) + count(*) - 1 as days  -- login days plus one gap day between merged runs
    from (
        select
            id
            ,flag
            ,days
            ,date_sub(flag, rk2) as new_flag
        from (
            select
                id
                ,flag
                ,days
                ,rank() over(partition by id order by flag) as rk2
            from (
                -- one row per run of strictly consecutive days
                select
                    id
                    ,flag
                    ,count(dt) as days
                from (
                    select
                        id
                        ,dt
                        ,date_sub(dt, rk) as flag
                    from (
                        -- assumes at most one row per (id, dt)
                        select
                            id
                            ,dt
                            ,rank() over(partition by id order by dt) as rk
                        from tablename
                    ) t1
                ) t2
                group by
                    id
                    ,flag
            ) t3
        ) t4
    ) t5
    group by
        id
        ,new_flag
) t6
group by
    id;
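Note: this approach has a drawback. It nests like a Russian doll: when the condition changes, for example when a gap of up to 5 days still counts as consecutive, the flag/rank step has to be repeated 5 times.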
Method 2
Idea: use the lag window function
SQL implementation:
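select
    id
    ,max(days) as days
from (
    -- one row per streak; the outer max() keeps the longest streak per id
    select
        id
        ,flag
        ,datediff(max(dt), min(dt)) + 1 as days
    from (
        select
            id
            ,dt
            -- running count of "breaks": rows of the same streak share one flag
            ,sum(if(dtDiff > 2, 1, 0)) over (partition by id order by dt) as flag
        from (
            select
                id
                ,dt
                ,datediff(dt, lagDt) as dtDiff
            from (
                select
                    id
                    ,dt
                    ,lag(dt, 1, '1970-01-01') over(partition by id order by dt) as lagDt
                from tablename
            ) t1
        ) t2
    ) t3
    group by
        id
        ,flag
) t4
group by
    id;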
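Note: when a gap of one day still counts as consecutive, the threshold in sum(if(dtDiff > 2, 1, 0)) is 2.
When a gap of up to n days still counts as consecutive, use sum(if(dtDiff > n+1, 1, 0)), i.e. the threshold is n+1.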
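To see how the threshold n+1 behaves, here is a small Python sketch of the same lag-and-compare logic. It is an illustration under assumptions, not Hive code: `streak_lengths` is a hypothetical helper, and dates are passed as ISO strings.

```python
from datetime import date


def streak_lengths(dates, n):
    """Hypothetical helper: lengths of login streaks when gaps of up to
    n missing days still count as consecutive. `dates` are ISO strings."""
    days = sorted(date.fromisoformat(d) for d in set(dates))
    lengths = []
    start = prev = None
    for d in days:
        # Same test as if(dtDiff > n+1, 1, 0): a larger jump opens a new streak.
        if prev is None or (d - prev).days > n + 1:
            if start is not None:
                lengths.append((prev - start).days + 1)
            start = d
        prev = d
    if start is not None:
        lengths.append((prev - start).days + 1)  # datediff(max, min) + 1
    return lengths


user_02 = ["2021-03-01", "2021-03-02", "2021-03-03", "2021-03-06"]
print(streak_lengths(user_02, 0))  # [3, 1]
print(streak_lengths(user_02, 1))  # [3, 1]
print(streak_lengths(user_02, 2))  # [6]
```

The jump from 2021-03-03 to 2021-03-06 is 3 days (2 missing days), so it only stops being a break once n reaches 2.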
Hive SQL: maximum login days, and getting users who logged in for a given number of consecutive days
create table test2(
id string,
pday string
);
INSERT INTO test2(id,pday) values ('A','20190701');
INSERT INTO test2(id,pday) values ('A','20190702');
INSERT INTO test2(id,pday) values ('A','20190703');
INSERT INTO test2(id,pday) values ('A','20190704');
INSERT INTO test2(id,pday) values ('A','20190706');
INSERT INTO test2(id,pday) values ('A','20190707');
INSERT INTO test2(id,pday) values ('A','20190708');
INSERT INTO test2(id,pday) values ('A','20190711');
INSERT INTO test2(id,pday) values ('A','20190712');
INSERT INTO test2(id,pday) values ('B','20190629');
INSERT INTO test2(id,pday) values ('B','20190630');
INSERT INTO test2(id,pday) values ('B','20190701');
INSERT INTO test2(id,pday) values ('B','20190704');
INSERT INTO test2(id,pday) values ('B','20190706');
Maximum consecutive login days
select
    t2.id,
    max(t2.num)
from
(
    select
        t.id as id,
        count(t.sub) num
    from
    (
        select
            id,
            pday,
            date_sub(
                from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd'),
                row_number() over(partition by id order by pday)
            ) as sub
        from test2
    ) as t
    group by t.id,t.sub
) t2
group by t2.id;
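The result for the test2 rows can be checked with a short Python sketch of the same date-minus-row-number grouping. The helper name `max_consecutive_days` is hypothetical, and the sketch assumes each (id, pday) pair appears only once, as in the inserts above.

```python
from collections import Counter
from datetime import datetime, timedelta

# Rows inserted into test2 above: (id, pday) with pday as 'yyyyMMdd'
TEST2 = [("A", d) for d in (
    "20190701", "20190702", "20190703", "20190704", "20190706",
    "20190707", "20190708", "20190711", "20190712")]
TEST2 += [("B", d) for d in (
    "20190629", "20190630", "20190701", "20190704", "20190706")]


def max_consecutive_days(rows):
    """Hypothetical helper: longest run of strictly consecutive days per id."""
    by_id = {}
    for uid, pday in rows:
        by_id.setdefault(uid, []).append(datetime.strptime(pday, "%Y%m%d").date())
    result = {}
    for uid, days in by_id.items():
        # date_sub(pday, row_number()): constant inside one consecutive run
        groups = Counter(
            d - timedelta(days=rn) for rn, d in enumerate(sorted(days), start=1)
        )
        result[uid] = max(groups.values())
    return result


print(max_consecutive_days(TEST2))  # {'A': 4, 'B': 3}
```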
Getting users who logged in for a given number of consecutive days:
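select
    t.id as id,
    min(t.pday) as start_day,
    max(t.pday) as end_day,
    count(*) as days
from
(
    select
        id,
        from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd') as pday,
        date_sub(
            from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd'),
            row_number() over(partition by id order by pday)
        ) as sub
    from test2
) t
group by t.id, t.sub
having count(*) >= 3;  -- 3 is the required number of consecutive days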
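One way to find streaks of at least k consecutive days is to group on "date minus row number" and keep the groups with k or more rows. The Python sketch below illustrates that idea on user A's rows; `streaks_at_least` is a hypothetical helper, not part of Hive.

```python
from datetime import datetime, timedelta
from itertools import groupby


def streaks_at_least(pdays, k):
    """Hypothetical helper: (first_day, last_day, length) of every run of
    consecutive days in `pdays` ('yyyyMMdd' strings) lasting at least k days."""
    days = sorted({datetime.strptime(p, "%Y%m%d").date() for p in pdays})
    numbered = list(enumerate(days, start=1))  # (rn, day), like row_number()
    found = []
    # Consecutive days share the same value of day - rn.
    for _, grp in groupby(numbered, key=lambda x: x[1] - timedelta(days=x[0])):
        run = [day for _, day in grp]
        if len(run) >= k:
            found.append((run[0].isoformat(), run[-1].isoformat(), len(run)))
    return found


user_a = ["20190701", "20190702", "20190703", "20190704", "20190706",
          "20190707", "20190708", "20190711", "20190712"]
print(streaks_at_least(user_a, 3))
# [('2019-07-01', '2019-07-04', 4), ('2019-07-06', '2019-07-08', 3)]
```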
===============================================
Using datediff
select *
from
(
    select
        id,
        from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd') as pday,
        date_sub(
            from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd'),
            row_number() over(partition by id order by pday)
        ) date_sub
    from test2
) t2
where datediff(t2.pday,t2.date_sub) > 2;
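Since the second column is pday minus the row number, datediff(pday, date_sub) is simply the row number, so the filter keeps rows from the third login onward for each id. A tiny Python sketch of that identity, where `datediff` is a stand-in written for illustration and not the Hive function itself:

```python
from datetime import date, timedelta


def datediff(end, start):
    """Python stand-in for Hive's datediff(end, start): whole days from start to end."""
    return (end - start).days


pday = date(2019, 7, 3)
for rn in (1, 2, 3):
    sub = pday - timedelta(days=rn)   # date_sub(pday, rn)
    print(rn, datediff(pday, sub))    # the difference always equals rn
```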