hive-SQL Study Notes 11

Posted 2022-03-04 小李飞刀李寻欢


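Someone once asked me how to pick out the maximum value within a month together with the rest of its row. Say there are three columns, user_id, item_id, and time, where time is the dwell duration. The question is to find the user with the maximum and what that whole row looks like. I was stumped at the time. I said I would just pull down the whole month of data and take the maximum in Python... game over.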

For Recommendation in Deep learning QQ Group 277356808

For deep learning QQ Second Group 629530787

I'm here waiting for you

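The steps below build up gradually.

1-Sort the click/impression logs from a given time range by time, in ascending order

Since this is sorting, ORDER BY is the obvious tool. Below is a simple sort on a single column (time):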

SELECT user_id,item_id,time  
FROM ClickLogTable 
WHERE concat(datetime, dayhour) between 2022010101 and 2022010102 
and user_id is not NULL and time is not NULL 
ORDER BY time

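Here I want to verify whether the sort treats the column as str(time) (I don't know whether a str function exists in Hive; that is just Python notation) or as float(time) (this one does exist).

[For the WHERE concat usage, refer to this blog post]

The result is below. By default the rows are sorted in string order, not by floating-point value: 100.0 comes right after 10.997 even though, as I checked, the data does contain values between 10 and 100.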

 2038 e0eac9    8C7c 10.995
 2039 b83a31    8C5I 10.997
 2040 c2313e    83XL 10.997
 2041 d3f66a    8CQ7 100.0
 2042 1c79d5    7sxn 100.0
 2043 b7c62f    8CLT 100.0
 2044 37833b    8CPh 100.0

After switching to float, I verified that the order is correct, as in this sample:

#ORDER BY float(time)

12265 de27    8CLV 19.996
12266 73bc    8CQV 20.0
12267 0461    8CQQ 20.0
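
The string-versus-number ordering can be reproduced locally with a small sketch. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for Hive, a hypothetical table named click_log whose time column is stored as text, and CAST(time AS REAL) in place of Hive's float(time). The sample values are copied from the outputs above.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE click_log (user_id TEXT, item_id TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO click_log VALUES (?, ?, ?)",
    [
        ("e0eac9", "8C7c", "10.995"),
        ("d3f66a", "8CQ7", "100.0"),
        ("de27", "8CLV", "19.996"),
        ("73bc", "8CQV", "20.0"),
    ],
)

def times(order_by):
    """Return the time column sorted by the given ORDER BY expression."""
    sql = f"SELECT time FROM click_log ORDER BY {order_by}"
    return [row[0] for row in conn.execute(sql)]

as_text = times("time")                  # compares character by character
as_number = times("CAST(time AS REAL)")  # compares numeric values

print(as_text)    # ['10.995', '100.0', '19.996', '20.0']
print(as_number)  # ['10.995', '19.996', '20.0', '100.0']
```

In the text ordering, 100.0 lands between 10.995 and 19.996, which is the same pattern seen in the Hive output.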

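2-Sort by user_id and then time, which gives each user's rows ordered by time

Here I sort time as a float directly. Without float the default is still string order, which is not what I want.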

SELECT user_id,item_id,time  
FROM ClickLogTable 
WHERE concat(datetime, dayhour) between 2022010101 and 2022010102 
and user_id is not NULL and time is not NULL 
ORDER BY user_id,float(time)

A sample follows. So how do I get, for each user, the row with the longest time? That is the real question.

17b2	T2Uh	2.256
17b2	uLMX	31.76
9695	t7zD	1.206
9695	85w8	1.255
9695	8CCg	117.253
f270	8C3i	6.197
f270	8C10	11.326
f270	8Btv	132.45
f270	8CLe	343.339

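To avoid string ordering, I add float everywhere, which guarantees sorting by numeric value.

A senior colleague only said "row_number", but with that alone I still did not get the row with the longest time for each user: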

SELECT user_id,item_id,time,row_number() over (partition by user_id order by float(time))  
FROM ClickLogTable 
WHERE concat(datetime, dayhour) between 2022010101 and 2022010102 
and user_id is not NULL and time is not NULL

fe87	8CN1	184.385	3
fe87	YQqA	311.246	4
fe87	8BqA	311.246	5
fe87	8wfh	713.201	6
fe8f	8BWi	25.878	1
fe8f	8COA	169.28	2
fe8f	8ARA	191.654	3

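This is nothing more than each user's watch durations in sorted order with an extra numbering column; it is no different from the two-column ORDER BY above.

To sort time in descending order, just add desc, as below. That way the largest value gets number 1: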

SELECT user_id,item_id,time,row_number() over (partition by user_id order by float(time) desc)  
FROM ClickLogTable 
WHERE concat(datetime, dayhour) between 2022010101 and 2022010102 
and user_id is not NULL and time is not NULL

05b0	8CM2	373.614	1
05b0	8AR1	358.413	2
05b0	8ATg	358.392	3
05b0	8AR1	358.359	4
...
0558	8CF9	227.706	1
0558	877A	55.612	2
0558	85iI	48.616	3

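So all that remains is to keep the rows where that column equals 1. But once I wrapped the query in an outer SELECT, it failed. What on earth happened?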

SELECT user_id,item_id,time 
FROM
( SELECT user_id,item_id,time,row_number() over (partition by user_id order by float(time) desc) AS rn
FROM ClickLogTable 
WHERE concat(datetime, dayhour) between 2022010101 and 2022010102 
and user_id is not NULL and time is not NULL 
) -- a table alias is required here; any name works, e.g. new_tab
where rn==1


ParseException line 1:26 cannot recognize input near '(' 'select' 'user_id' in joinSource

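The senior colleague spotted the error immediately: the parenthesized block in the middle is a new (derived) table, so it has to be given a table name. With that added, the result is: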

fed0	8zOY	17.159	1
feee	8Cvc	66.48599999999999	1
fe29	8Vh5	708.173	1
ffee	8CWo	30.55	1
ff65	FQEO	39.327	1
ff8c	GZxa	47.989	1
ffa9	M4aY	17.056	1
ffa9	FQEO	19.407	1
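
The corrected pattern, a named derived table plus an aliased row_number column, can be tried locally with the following sketch. It assumes SQLite (version 3.25 or later, for window functions) through Python's sqlite3 as a stand-in for Hive, a hypothetical table click_log with time stored as text, and CAST(time AS REAL) in place of Hive's float(time). The rows come from the per-user sample earlier in this section.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE click_log (user_id TEXT, item_id TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO click_log VALUES (?, ?, ?)",
    [
        ("17b2", "T2Uh", "2.256"),
        ("17b2", "uLMX", "31.76"),
        ("9695", "t7zD", "1.206"),
        ("9695", "85w8", "1.255"),
        ("9695", "8CCg", "117.253"),
        ("f270", "8C3i", "6.197"),
        ("f270", "8CLe", "343.339"),
    ],
)

sql = """
SELECT user_id, item_id, time
FROM (
    SELECT user_id, item_id, time,
           row_number() OVER (
               PARTITION BY user_id
               ORDER BY CAST(time AS REAL) DESC
           ) AS rn      -- alias so the outer query can refer to the column
    FROM click_log
) new_tab               -- Hive requires this name; SQLite merely allows it
WHERE rn = 1
ORDER BY user_id
"""

top_rows = conn.execute(sql).fetchall()
for row in top_rows:
    print(row)
```

Each user contributes exactly one row, the one whose time is numerically largest within that user's partition.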

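3-The maximum over the whole time range

Sorting in descending order and taking the first row also works here. Method 1: remove the partition from the query above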

SELECT user_id,item_id,time 
From
( SELECT user_id,item_id,time,row_number() over (order by float(time) desc) AS rn
FROM ClickLogTable 
WHERE concat(datetime, dayhour) between 2022010101 and 2022010102 
and user_id is not NULL and time is not NULL 
) new_tab
where rn==1

edaf	hu4W	2682.734	1

Method 2: add limit 1. Ha, this works too. Keep an open mind, follow me, and enjoy the smooth ride!

SELECT user_id,item_id,time  
FROM ClickLogTable 
WHERE concat(datetime, dayhour) between 2022010101 and 2022010102 
and user_id is not NULL and time is not NULL 
ORDER BY float(time) desc limit 1

edaf	hu4W	2682.734
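
As a quick check that the two methods agree, here is a sketch under the same assumptions as before: SQLite 3.25 or later through Python's sqlite3 instead of Hive, a hypothetical table click_log with time stored as text, and CAST(time AS REAL) in place of float(time). It also runs the LIMIT 1 query without the cast to show what string ordering would return.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE click_log (user_id TEXT, item_id TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO click_log VALUES (?, ?, ?)",
    [
        ("fe87", "8wfh", "713.201"),
        ("edaf", "hu4W", "2682.734"),
        ("05b0", "8CM2", "373.614"),
        ("fe29", "8Vh5", "708.173"),
    ],
)

# Method 1: row_number over the whole table, keep the row numbered 1.
by_row_number = conn.execute("""
SELECT user_id, item_id, time
FROM (
    SELECT user_id, item_id, time,
           row_number() OVER (ORDER BY CAST(time AS REAL) DESC) AS rn
    FROM click_log
) new_tab
WHERE rn = 1
""").fetchall()

# Method 2: sort numerically in descending order and keep one row.
by_limit = conn.execute("""
SELECT user_id, item_id, time
FROM click_log
ORDER BY CAST(time AS REAL) DESC
LIMIT 1
""").fetchall()

# Same as method 2 but without the cast, so the text values are compared.
by_text = conn.execute("""
SELECT user_id, item_id, time
FROM click_log
ORDER BY time DESC
LIMIT 1
""").fetchall()

print(by_row_number)  # [('edaf', 'hu4W', '2682.734')]
print(by_limit)       # [('edaf', 'hu4W', '2682.734')]
print(by_text)        # [('fe87', '8wfh', '713.201')]
```

If several rows tie for the maximum, both methods still return only one of them.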

Bye-bye

May we meet again someday,

and may you still remember the topics we once discussed.
