Hive: The Consecutive Login Problem

Posted by 浊酒南街


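Problem

Given a login log, compute the maximum number of consecutive login days for each user. (Note: a gap of one day still counts as consecutive.)
The data looks like this: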
id dt
01 2021-02-28
01 2021-03-01
01 2021-03-02
01 2021-03-04
01 2021-03-05
01 2021-03-06
01 2021-03-08
02 2021-03-01
02 2021-03-02
02 2021-03-03
02 2021-03-06
03 2021-03-06

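Method 1

Idea: arithmetic sequence

1. Partition by id, order by dt, and compute rk.
2. Subtract rk (in days) from each row's date to get flag. If the dates are consecutive, the subtraction yields the same date, so they share the same flag.
3. Group by id and flag; count(*) gives the number of strictly consecutive days.
4. Partition by id, order by flag, and compute a second rank.
5. Subtract that rank (in days) from each flag to get new_flag. Runs separated by exactly one missing day have flags one day apart, so they share the same new_flag.
6. Group by id and new_flag and compute the consecutive days, adding one day for each gap that was bridged.
7. Group by id and take the maximum.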
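
The steps above can be traced by hand in plain Python on the sample data. This is only a sketch for checking the logic, not the Hive query itself; the function name max_streak_one_gap is made up for the illustration, and it assumes at most one row per id per day.

```python
from datetime import date, timedelta
from itertools import groupby

# Sample rows from the problem statement: (id, dt)
ROWS = [
    ("01", "2021-02-28"), ("01", "2021-03-01"), ("01", "2021-03-02"),
    ("01", "2021-03-04"), ("01", "2021-03-05"), ("01", "2021-03-06"),
    ("01", "2021-03-08"),
    ("02", "2021-03-01"), ("02", "2021-03-02"), ("02", "2021-03-03"),
    ("02", "2021-03-06"),
    ("03", "2021-03-06"),
]


def max_streak_one_gap(rows):
    """Trace steps 1-7 for each id (hypothetical helper, one row per day assumed)."""
    by_id = {}
    for uid, dt in rows:
        by_id.setdefault(uid, []).append(date.fromisoformat(dt))

    result = {}
    for uid, days in by_id.items():
        days.sort()
        # Steps 1-2: rk is the 1-based position; flag = dt minus rk days.
        flags = [d - timedelta(days=rk) for rk, d in enumerate(days, start=1)]
        # Step 3: group by flag and count rows, giving strictly consecutive runs.
        runs = [(flag, len(list(g))) for flag, g in groupby(flags)]
        # Steps 4-5: rank the runs by flag; new_flag = flag minus rank days.
        merged = {}
        for rank, (flag, n) in enumerate(runs, start=1):
            new_flag = flag - timedelta(days=rank)
            merged.setdefault(new_flag, []).append(n)
        # Step 6: merged runs are joined across one missing day each,
        # so the span is sum(days) + (number of runs - 1).
        spans = [sum(ns) + len(ns) - 1 for ns in merged.values()]
        # Step 7: keep the longest span per id.
        result[uid] = max(spans)
    return result


print(max_streak_one_gap(ROWS))
```

For id 01 the three runs (3, 3 and 1 days) collapse into a single new_flag group, so the span is 7 + 3 - 1 = 9 days, from 2021-02-28 through 2021-03-08.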

SQL implementation:

select
	id
	,max(days) as days                      -- step 7: longest span per id
from (
	select
		id
		,new_flag
		,sum(days) + count(*) - 1 as days   -- step 6: add one day per bridged gap
	from (
		select
			id
			,flag
			,days
			,date_sub(flag, rk2) as new_flag  -- step 5
		from (
			select
				id
				,flag
				,days
				,rank() over(partition by id order by flag) as rk2  -- step 4
			from (
				select
					id
					,flag
					,count(dt) as days        -- step 3
				from (
					select
						id
						,dt
						,date_sub(dt, rk) as flag  -- step 2
					from (
						select
							id
							,dt
							,rank() over(partition by id order by dt) as rk  -- step 1
						from tablename
					) t1
				) t2
				group by
					id
					,flag
			) t3
		) t4
	) t5
	group by
		id
		,new_flag
) t6
group by
	id;

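Note: this method has a drawback. It nests like a Russian doll: when the condition changes, for example when a gap of 5 days should also count as consecutive, the rank-and-subtract step has to be repeated 5 times.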

Method 2

Idea: use the lag window function

SQL implementation:

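-- Returns one row per streak; for the per-user maximum,
-- wrap this query and take max(days) grouped by id.
select
	id
	,flag
	,datediff(max(dt), min(dt)) + 1 as days  -- span of the streak, gap days included
	from
	(
	select
	id
	,dt
	,sum(if(dtDiff > 2, 1, 0)) over (partition by id order by dt) as flag  -- running count of streak breaks
	from (
	select
	id
	,dt
	,datediff(dt, lagDt) as dtDiff  -- days since the previous login
	from (
	select
	id
	,dt
	,lag(dt, 1, '1970-01-01') over(partition by id order by dt) as lagDt
	from tablename
	)t1
	)t2
	)t3
	group by
	id
	,flag;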

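Note: when a gap of one day still counts as consecutive, the threshold in sum(if(dtDiff > 2, 1, 0)) is 2.
When a gap of n days still counts as consecutive, the threshold becomes n+1: sum(if(dtDiff > n+1, 1, 0)).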
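
To make the threshold rule concrete, here is a minimal Python sketch of the same idea for a single user. The function name streak_spans and the parameter max_gap are assumptions introduced for this illustration; max_gap plays the role of n, so a new streak starts when the distance to the previous login exceeds max_gap + 1.

```python
from datetime import date


def streak_spans(days, max_gap=1):
    """Return the span in days of each streak for one user.

    days: ISO date strings; max_gap: missing days still treated as consecutive.
    """
    ordered = sorted(date.fromisoformat(d) for d in days)
    spans = []
    start = prev = None
    for d in ordered:
        # Same test as dtDiff > n + 1 in the query: the first row, or a jump
        # larger than max_gap + 1 days, opens a new streak.
        if prev is None or (d - prev).days > max_gap + 1:
            if start is not None:
                spans.append((prev - start).days + 1)
            start = d
        prev = d
    if start is not None:
        spans.append((prev - start).days + 1)  # datediff(max, min) + 1
    return spans


user_01 = ["2021-02-28", "2021-03-01", "2021-03-02", "2021-03-04",
           "2021-03-05", "2021-03-06", "2021-03-08"]
print(streak_spans(user_01, max_gap=1))  # one-day gaps tolerated
print(streak_spans(user_01, max_gap=0))  # strictly consecutive days only
```

Changing the tolerated gap only changes one number here, which is the advantage over Method 1's repeated nesting.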

Hive SQL: maximum login days, and fetching a specified number of consecutive login days

create table test2(
  id string,
  pday string
);

INSERT INTO test2(id,pday) values ('A','20190701');
INSERT INTO test2(id,pday) values ('A','20190702');
INSERT INTO test2(id,pday) values ('A','20190703');
INSERT INTO test2(id,pday) values ('A','20190704');
INSERT INTO test2(id,pday) values ('A','20190706');
INSERT INTO test2(id,pday) values ('A','20190707');
INSERT INTO test2(id,pday) values ('A','20190708');
INSERT INTO test2(id,pday) values ('A','20190711');
INSERT INTO test2(id,pday) values ('A','20190712');

INSERT INTO test2(id,pday) values ('B','20190629');
INSERT INTO test2(id,pday) values ('B','20190630');
INSERT INTO test2(id,pday) values ('B','20190701');
INSERT INTO test2(id,pday) values ('B','20190704');
INSERT INTO test2(id,pday) values ('B','20190706');

Maximum login days

select 
  t2.id,
  max(t2.num)
from 
(
	select 
	t.id as id,
	count(t.sub) num
	from 
	(
		select 
			id,
			pday,
			date_sub(
				from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd'),
				row_number() over(partition by id order by pday)
			) as sub
		from test2
	) as t
	group by t.id,t.sub
) t2
group by t2.id;
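
As a cross-check, the same grouping can be reproduced in Python on the rows inserted into test2. This is a sketch only; run_lengths and max_login_days are hypothetical helper names. Each date minus its row number (in days) is constant within a strictly consecutive run, so counting rows per (id, sub) gives the run lengths.

```python
from collections import Counter
from datetime import datetime, timedelta

# Rows inserted into test2 above: (id, pday) with pday as 'yyyyMMdd'
TEST2 = [("A", d) for d in ("20190701", "20190702", "20190703", "20190704",
                            "20190706", "20190707", "20190708",
                            "20190711", "20190712")]
TEST2 += [("B", d) for d in ("20190629", "20190630", "20190701",
                             "20190704", "20190706")]


def run_lengths(rows):
    """Count rows per (id, sub), where sub = pday minus row_number days."""
    by_id = {}
    for uid, pday in rows:
        by_id.setdefault(uid, []).append(datetime.strptime(pday, "%Y%m%d").date())
    counts = Counter()
    for uid, days in by_id.items():
        for rn, d in enumerate(sorted(days), start=1):
            counts[(uid, d - timedelta(days=rn))] += 1
    return counts


def max_login_days(rows):
    """Longest strictly consecutive run per id."""
    best = {}
    for (uid, _sub), n in run_lengths(rows).items():
        best[uid] = max(best.get(uid, 0), n)
    return best


print(max_login_days(TEST2))
```

For this data the longest run is 4 days for A (20190701 to 20190704) and 3 days for B (20190629 to 20190701).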

Fetching logins for a specified number of consecutive days:

select
t.id as id,
t.pday as pday,
date_sub(t.pday,rn) as data_sub,
t.rn as rn
from 
(
	select 
	id,
	from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd') as pday,
	row_number() over(partition by id order by pday desc) as rn
	from test2
) t
where t.rn = 3;
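
The heading admits more than one reading. As a sketch under the assumption that the goal is "ids with at least n strictly consecutive login days", the date-minus-row-number grouping can be filtered on the run length. The function name ids_with_streak and the parameter n are made up for this illustration.

```python
from datetime import datetime, timedelta


def ids_with_streak(rows, n):
    """Return ids that have at least n strictly consecutive login days."""
    by_id = {}
    for uid, pday in rows:
        by_id.setdefault(uid, set()).add(datetime.strptime(pday, "%Y%m%d").date())
    hits = set()
    for uid, days in by_id.items():
        counts = {}
        for rn, d in enumerate(sorted(days), start=1):
            sub = d - timedelta(days=rn)  # constant within one consecutive run
            counts[sub] = counts.get(sub, 0) + 1
        # In SQL this filter would correspond to a HAVING count(*) >= n clause.
        if max(counts.values()) >= n:
            hits.add(uid)
    return sorted(hits)


rows = [("A", d) for d in ("20190701", "20190702", "20190703", "20190704",
                           "20190706", "20190707", "20190708",
                           "20190711", "20190712")]
rows += [("B", d) for d in ("20190629", "20190630", "20190701",
                            "20190704", "20190706")]
print(ids_with_streak(rows, 3))
print(ids_with_streak(rows, 4))
```

With the test2 rows, both A and B have a run of at least 3 days, while only A reaches 4.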

===============================================
Using datediff

select *
from 
(
select 
id,
from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd') as pday,
date_sub(
from_unixtime(unix_timestamp(pday,'yyyyMMdd'),'yyyy-MM-dd'),
row_number() over(partition by id order by pday)
) date_sub
from test2
) t2 
where datediff(t2.pday,t2.date_sub) > 2;
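
Hive's datediff(enddate, startdate) returns the number of days from startdate to enddate. The small Python stand-in below (the function is a mimic written for this illustration, not a Hive API) shows why, in the query above, the difference between pday and pday minus row_number days is simply the row number.

```python
from datetime import date, timedelta


def datediff(end, start):
    """Mimic Hive datediff(enddate, startdate): whole days from start to end."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Third row for id A in test2: pday = 2019-07-03, row_number = 3
pday = "2019-07-03"
rn = 3
sub = (date.fromisoformat(pday) - timedelta(days=rn)).isoformat()

print(sub)                  # date_sub(pday, rn)
print(datediff(pday, sub))  # equals rn, so the filter "> 2" keeps rows with rn >= 3
```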
