Hive实战企业级项目之谷粒影音统计各种TopN问题
Posted 龙鸣丿
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Hive实战企业级项目之谷粒影音统计各种TopN问题相关的知识,希望对你有一定的参考价值。
1.需求描述
统计硅谷影音视频网站的常规指标,各种 TopN 指标: -- 统计视频观看数 Top10 -- 统计视频类别热度 Top10 -- 统计出视频观看数最高的 20 个视频的所属类别以及类别包含 Top20 视频的个数 -- 统计视频观看数 Top50 所关联视频的所属类别排序 -- 统计每个类别中的视频热度 Top10,以 Music 为例 -- 统计每个类别视频观看数 Top10 -- 统计上传视频最多的用户 Top10 以及他们上传的视频观看次数在前 20 的视频
2.数据结构
1 )视频表字段 | 备注 | 详细描述 |
videoId | 视频唯一 id ( String ) | 11 位字符串 |
uploader | 视频上传者( String ) | 上传视频的用户名 String |
age | 视频年龄( int ) | 视频在平台上的整数天 |
category | 视频类别( Array<String> ) | 上传视频指定的视频分类 |
length | 视频长度( Int ) | 整形数字标识的视频长度 |
views | 观看次数( Int ) | 视频被浏览的次数 |
rate | 视频评分( Double ) | 满分 5 分 |
Ratings | 流量( Int ) | 视频的流量,整型数字 |
conments | 评论数( Int ) | 一个视频的整数评论数 |
relatedId | 相关视频 id ( Array<String> ) | 相关视频的 id ,最多 20 个 |
字段 | 备注 | 字段类型 |
uploader | 上传者用户名 | string |
videos | 上传视频数 | int |
friends | 朋友数量 | int |
3.准备工作
(数据和代码在我的资源里,代码是自己手打的。)
https://download.csdn.net/download/weixin_42759988/19418290?spm=1001.2014.3001.5501
通过观察原始数据形式可以发现,视频可以有多个所属分类,每个所属分类用&符号分割,且分割的两边有空格字符,同时相关视频也是可以有多个元素,多个相关视频又用“\\t”进行分割。为了分析数据时方便对存在多个子元素的数据进行操作,我们首先进行数据重组清洗操作。即:将所有的类别用“&”分割,同时去掉两边空格,多个相关视频id也使用“&”进行分割。
将资源代码打成jar包上传到hdfs或者自己的虚拟机里,相信此步骤大家都会,这里不再赘述。
使用运行jar包:
yarn jar /opt/module/jars/hive-gulivideo-1.0-SNAPSHOT.jar com.atguigu.mr.ETLDriver /user/root/guliVideo/video /guliout
说明:首先记得启动yarn,运行此jar包,后面跟着hdfs的输入输出路径。
4.准备表
1)需要准备的表
创建原始数据表: gulivideo_ori , gulivideo_user_ori , 创建最终表: gulivideo_orc , gulivideo_user_orc
2)创建原始数据表:
(1)gulivideo_oricreate table gulivideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\\t"
collection items terminated by "&"
stored as textfile;
(2)创建原始数据表: gulivideo_user_ori
create table gulivideo_user_ori(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\\t"
stored as textfile;
(3)导入数据:(注意改成自己hdfs或者本地的路径加载到表)
load data inpath '/guliout/part-r-00000' into table gulivideo_ori;
select * from gulivideo_ori limit 2;
load data inpath '/user/root/guliVideo/user/user.txt' into table gulivideo_user_ori;
select * from gulivideo_user_ori limit 2;
2)创建 orc 存储格式带 snappy 压缩的表:
(1)gulivideo_orc
create table gulivideo_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");
(2)gulivideo_user_orc
create table gulivideo_user_orc(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\\t"
stored as orc
tblproperties("orc.compress"="SNAPPY");
(3)向 orc表插入数据
insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;
(4)查看数据:
1 统计视频观看数 Top10
select
videoId,views
from
gulivideo_orc
order by
views desc
limit 10;
结果:
dMH0bHeiRNg 42513417
0XxI-hvPRRA 20282464
1dmVU08zVpA 16087899
RB-wUgnyGv0 15712924
QjA5faZF1A8 15256922
-_CSo1gOd48 13199833
49IDp76kjPw 11970018
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
D2kJZOfq7zk 11184051
2 统计视频类别热度 Top10
思路: ( 1 )即统计每个类别有多少个视频,显示出包含视频最多的前 10 个类别。 ( 2 )我们需要按照类别 group by 聚合,然后 count 组内的 videoId 个数即可。 ( 3 )因为当前表结构为:一个视频对应一个或多个类别。所以如果要 group by 类别, 需要先将类别进行列转行 ( 展开 ) ,然后再进行 count 即可。 ( 4 )最后按照热度排序,显示前 10 条。 某类视频的个数作为视频类别热度1.使用UDTF函数将类别列炸开
select
videoId,category_name
from
gulivideo_orc
lateral view explode(category) tmp_category as category_name; --t1
结果:
2.按照category_name分组,统计每种类别视频的总数,同时按照该总数进行倒叙排序,取前10dMH0bHeiRNg 42513417
0XxI-hvPRRA 20282464
1dmVU08zVpA 16087899
RB-wUgnyGv0 15712924
QjA5faZF1A8 15256922
-_CSo1gOd48 13199833
49IDp76kjPw 11970018
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
D2kJZOfq7zk 11184051
select category_name,count(*) category_count
from t1
group by category_name
order by category_count
desc limit 10;
3.最终SQL
select category_name,count(*) category_count
from (select videoId,category_name from gulivideo_orc
lateral view explode(category) tmp_category as category_name) t1
group by category_name
order by category_count
desc limit 10;
4.结果:
Music 179049
Entertainment 127674
Comedy 87818
Animation 73293
Film 73293
Sports 67329
Games 59817
Gadgets 59817
People 48890
Blogs 48890
3 统计出视频观看数最高的 20 个视频的所属类别以及类别包含
Top20 视频的个数
思路:(1)先找到观看数最高的 20 个视频所属条目的所有信息,降序排列
(2)把这 20 条信息中的 category 分裂出来(列转行)
(3)最后查询视频分类名称和该分类下有多少个 Top20 的视频 1.统计出视频观看数Top20
select
videoId,views,category
from
gulivideo_orc
order by
views desc
limit 20; --t1
结果:
2.对t1表中的category进行炸裂(列转行)dMH0bHeiRNg 42513417 ["Comedy"]
0XxI-hvPRRA 20282464 ["Comedy"]
1dmVU08zVpA 16087899 ["Entertainment"]
RB-wUgnyGv0 15712924 ["Entertainment"]
QjA5faZF1A8 15256922 ["Music"]
-_CSo1gOd48 13199833 ["People","Blogs"]
49IDp76kjPw 11970018 ["Comedy"]
tYnn51C3X_w 11823701 ["Music"]
pv5zWaTEVkI 11672017 ["Music"]
D2kJZOfq7zk 11184051 ["People","Blogs"]
vr3x_RRJdd4 10786529 ["Entertainment"]
lsO6D1rwrKc 10334975 ["Entertainment"]
5P6UU6m3cqk 10107491 ["Comedy"]
8bbTtPL1jRs 9579911 ["Music"]
_BuRwH59oAo 9566609 ["Comedy"]
aRNzWyD7C9o 8825788 ["UNA"]
UMf40daefsI 7533070 ["Music"]
ixsZy2425eY 7456875 ["Entertainment"]
MNxwAU_xAMk 7066676 ["Comedy"]
RUCZJVJ_M8o 6952767 ["Entertainment"]
select
videoId,
category_name
from
t1
lateral view explode(category) tmp_category as category_name;--t2
执行t2,结果:
3.对t2表进行分组(category_name)求和dMH0bHeiRNg Comedy
0XxI-hvPRRA Comedy
1dmVU08zVpA Entertainment
RB-wUgnyGv0 Entertainment
QjA5faZF1A8 Music
-_CSo1gOd48 People
-_CSo1gOd48 Blogs
49IDp76kjPw Comedy
tYnn51C3X_w Music
pv5zWaTEVkI Music
D2kJZOfq7zk People
D2kJZOfq7zk Blogs
vr3x_RRJdd4 Entertainment
lsO6D1rwrKc Entertainment
5P6UU6m3cqk Comedy
8bbTtPL1jRs Music
_BuRwH59oAo Comedy
aRNzWyD7C9o UNA
UMf40daefsI Music
ixsZy2425eY Entertainment
MNxwAU_xAMk Comedy
RUCZJVJ_M8o Entertainment
select category_name,count(*) category_count
from t2
group by category_name
order by category_count desc;
4.最终SQL
select category_name,count(*) category_count
from
(
select
videoId,
category_name
from
(
select
videoId,views,category
from
gulivideo_orc
order by
views desc
limit 20
)t1
lateral view explode(category) tmp_category as category_name)t2
group by category_name
order by category_count desc;
结果:
Entertainment 6
Comedy 6
Music 5
People 2
Blogs 2
UNA 1
4 统计视频观看数 Top50 所关联视频的所属类别排序
1.统计出视频观看数Top50select
relatedId,views
from
gulivideo_orc
order by
views desc
limit 50;--t1
2.对t1表中的relatedId进行炸裂(列转行)并去重
select related_id
from t1
lateral view explode(relatedId) tmp_related as related_id
group by related_id; --t2
3. 取出观看数前50视频关联ID视频的类别
select category
from t2
join gulivideo_orc orc
on t2.related_id=orc.videoId; --t3
4.对t3表中的category进行炸裂
select explode(category) category_name
from t3; --t4
5.分组(类别)求和(总数)
select category_name,count(*) category_count
from t4
group by category_name
order by category_count desc;
6.最终SQL:
select category_name,count(*) category_count
from (select explode(category) category_name
from (select category
from (select related_id
from (select
relatedId,views
from
gulivideo_orc
order by
views desc
limit 50)t1
lateral view explode(relatedId) tmp_related as related_id
group by related_id)t2
join gulivideo_orc orc
on t2.related_id=orc.videoId)t3)t4
group by category_name
order by category_count desc;
方法二:使用rank()函数:
SELECT
t6.category_name,
t6.video_sum,
rank() over(ORDER BY t6.video_sum DESC ) rk
FROM
(
SELECT
t5.category_name,
COUNT(t5.relatedid_id) video_sum
FROM
(
SELECT
t4.relatedid_id,
category_name
FROM
(
SELECT
t2.relatedid_id ,
t3.category
FROM
(
SELECT
relatedid_id
FROM
(
SELECT
videoId,
views,
relatedid
FROM
gulivideo_orc
ORDER BY
views
DESC
LIMIT 50
)t1
lateral VIEW explode(t1.relatedid) t1_tmp AS relatedid_id
)t2
JOIN
gulivideo_orc t3
ON
t2.relatedid_id = t3.videoId
) t4
lateral VIEW explode(t4.category) t4_tmp AS category_name
) t5
GROUP BY
t5.category_name
ORDER BY
video_sum
DESC
) t6;
结果:
Comedy 232
Entertainment 216
Music 195
Blogs 51
People 51
Film 47
Animation 47
News 22
Politics 22
Games 20
Gadgets 20
Sports 19
Howto 14
DIY 14
UNA 13
Places 12
Travel 12
Animals 11
Pets 11
Autos 4
Vehicles 4
5 统计每个类别中的视频热度 Top10,以 Music 为例
思路:(1)要想统计 Music 类别中的视频热度 Top10,需要先找到 Music 类别,那么就需要将
category 展开,所以可以创建一张表用于存放 categoryId 展开的数据。
(2)向 category 展开的表中插入数据。
(3)统计对应类别(Music)中的视频热度。
SELECT
t1.videoId,
t1.views,
t1.category_name
FROM
(
SELECT
videoId,
views,
category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
)t1
WHERE
t1.category_name = "Music"
ORDER BY
t1.views
DESC
LIMIT 10;
上面的方法虽然也可以解决问题,但是不建议使用。可以使用如下方法:
思路:
1)要想统计music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,
所以可以创建一张表用于存放categoryId展开的数据。
2)向category展开的表中插入数据
3)统计对应类别(Music)中的视频热度。
1.创建类别表:
create table guliVideo_category(
videoId String,
uploader string,
age Int,
categoryId string,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\\t"
collection items terminated by "&"
stored as orc;
2.向类别表中插入数据:
insert into table guliVideo_category
select
videoId,
uploader,
age,
categoryId,
length,
views,
rate,
ratings,
comments,
relatedId
from gulivideo_orc
lateral view explode(category) tmp_category as categoryId;
3.统计Music类别的Top10(也可以统计其他)
select videoId,views
from guliVideo_category
where categoryId = "Music"
order by views desc
limit 10;
结果:
QjA5faZF1A8 15256922
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
8bbTtPL1jRs 9579911
UMf40daefsI 7533070
-xEzGIuY7kw 6946033
d6C0bNDqf3Y 6935578
HSoVKUVOnfQ 6193057
3URfWTEPmtE 5581171
thtmaZnxk_0 5142238
6 统计每个类别视频观看数 Top10
1.给每一种类别根据视频观看数添加rank值(倒叙)select
categoryId,
videoId,
views,
rank() over(partition by categoryId order by views desc) rk
from guliVideo_category; --t1
2.过滤前10
select categoryId,videoId,views
from t1
where rk <=10;
3.最终SQL:
select categoryId,videoId,views
from (select
categoryId,
videoId,
views,
rank() over(partition by categoryId order by views desc) rk
from guliVideo_category)t1
where rk <=10;
7 统计上传视频最多的用户 Top10 以及他们上传的视频观看次数在前 20 的视频
1.统计上传视频最多的用户 Top10select uploader,videos
from gulivideo_user_orc
order by videos desc
limit 10; --t1
2.关联 gulivideo_orc 表,求出这 10 个用户上传的所有的视频,按照观看数取前 20
最终SQL:
SELECT
t2.videoId,
t2.views
FROM
(
SELECT
uploader,
videos
FROM gulivideo_user_orc
ORDER BY
videos
DESC
LIMIT 10
) t1
JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY
t2.views DESC
LIMIT 20;
3.结果:
-IxHBW0YpZw 39059
BU-fT5XI_8I 29975
ADOcaBYbMl0 26270
yAqsULIDJFE 25511
vcm-t0TJXNg 25366
0KYGFawp14c 24659
j4DpuPvMLF4 22593
Msu4lZb2oeQ 18822
ZHZVj44rpjE 16304
foATQY3wovI 13576
-UnQ8rcBOQs 13450
crtNd46CDks 11639
D1leA0JKHhE 11553
NJu2oG1Wm98 11452
CapbXdyv4j4 10915
epr5erraEp4 10817
IyQoDgaLM7U 10597
tbZibBnusLQ 10402
_GnCHodc7mk 9422
hvEYlSlRitU 7123
参考资料:《尚硅谷大数据技术之Hive》
以上是关于Hive实战企业级项目之谷粒影音统计各种TopN问题的主要内容,如果未能解决你的问题,请参考以下文章