Hive Ant Forest Case Study
Posted eugene0
Ant Forest case: background
- Sample raw data
user_low_carbon.txt records each user's daily log of low-carbon energy collected in Ant Forest.
Sample data
u_001 2017/1/1 10
u_001 2017/1/2 150
u_001 2017/1/2 110
plant_carbon.txt records the carbon reduction required to claim each eco-friendly plant.
Sample data
p001 梭梭树 17
p002 沙柳 19
p003 樟子树 146
p004 胡杨 215
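- The tables built from the raw data above have the following layout
Table: user_low_carbon
Fields
user_id: user ID
data_dt: date
low_carbon: carbon emissions reduced (g)
Table: plant_carbon
Fields
plant_id: plant ID
plant_name: plant name
low_carbon: carbon required to redeem the plant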
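Create the tables
The field delimiter must match the data files; if the fields are tab-separated, use '\t' instead of ' '.
hive (default)> create table user_low_carbon(user_id String,
    data_dt String,
    low_carbon int
)
row format delimited fields terminated by ' ';
plant_carbon is created the same way from its field list (the column types here are assumed):
hive (default)> create table plant_carbon(plant_id String,
    plant_name String,
    low_carbon int
)
row format delimited fields terminated by ' ';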
Load the data
load data local inpath "/opt/module/data/user_low_carbon.txt" into table user_low_carbon;
load data local inpath "/opt/module/data/plant_carbon.txt" into table plant_carbon;
Enable local mode
hive (default)> set hive.exec.mode.local.auto=true;
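1 Requirement 1: Ant Forest plant-claim statistics
Assume low-carbon data (user_low_carbon) is recorded from January 1, 2017, and that before October 1, 2017 every user who met the claim condition claimed one p004 胡杨 (Euphrates poplar) and spent all remaining energy on p002 沙柳 (sand willow). Find the top 10 users by cumulative number of p002 沙柳 claimed as of October 1, and how many more 沙柳 each of them claimed than the next-ranked user.
1.1 Step 1: total the carbon each user collected before 2017-10-01
hive (default)> select user_id, sum(low_carbon) sum_carbon
from user_low_carbon
where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
group by user_id;
Output: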
user_id sum_carbon
u_001 475
u_002 659
u_003 620
u_004 640
u_005 1100
u_006 830
u_007 1470
u_008 1240
u_009 930
u_010 1080
u_011 960
u_012 250
u_013 1430
u_014 1060
u_015 290
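The filter in step 1 normalizes data_dt before comparing because the raw strings are not zero-padded: compared as text, '2017/2/1' sorts after '2017/10/1'. The following is a minimal Python sketch of the same filter-and-sum logic, not the Hive implementation. The first three rows are the sample rows shown earlier; the fourth row and the helper name parse_dt are hypothetical, added only to show that a record dated on the cutoff is excluded.

```python
from collections import defaultdict
from datetime import date

# First three rows: the sample rows shown for user_low_carbon.txt.
# Last row: hypothetical record on the cutoff date, added for illustration.
rows = [
    ("u_001", "2017/1/1", 10),
    ("u_001", "2017/1/2", 150),
    ("u_001", "2017/1/2", 110),
    ("u_001", "2017/10/1", 999),  # hypothetical
]


def parse_dt(text):
    """Turn a 'yyyy/M/d' string into a date so comparisons are chronological."""
    year, month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)


cutoff = date(2017, 10, 1)
totals = defaultdict(int)
for user_id, data_dt, low_carbon in rows:
    if parse_dt(data_dt) < cutoff:  # strictly before 2017-10-01
        totals[user_id] += low_carbon

print(dict(totals))  # {'u_001': 270}

# Comparing the raw strings would misorder February and October.
print("2017/2/1" < "2017/10/1")  # False
```

The three sample rows add up to 270 for u_001; the 475 in the output above comes from the full data file, which is not reproduced here.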
1.2 Step 2: look up the energy required for 胡杨 and 沙柳
select low_carbon from plant_carbon where plant_id='p004';
select low_carbon from plant_carbon where plant_id='p002';
1.3 Step 3: compute how many 沙柳 each user claimed
Subtract the cost of one 胡杨 from each user's total, divide the remainder by the cost of one 沙柳, and round down.
hive (default)> select user_id,
    floor((t1.sum_carbon - t2.low_carbon) / t3.low_carbon) count_p002
from (
    select user_id, sum(low_carbon) sum_carbon
    from user_low_carbon
    where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
    group by user_id
) t1,
(
    select low_carbon
    from plant_carbon
    where plant_id = 'p004'
) t2,
(
    select low_carbon
    from plant_carbon
    where plant_id = 'p002'
) t3;
Output:
user_id count_p002
u_001 13
u_002 23
u_003 21
u_004 22
u_005 46
u_006 32
u_007 66
u_008 53
u_009 37
u_010 45
u_011 39
u_012 1
u_013 63
u_014 44
u_015 3
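The step 3 arithmetic can be checked by hand. The sketch below, in Python rather than Hive, applies floor((sum_carbon - 215) / 19) to the totals from step 1, using the costs listed for p004 (215) and p002 (19) in plant_carbon.txt. The variable names are illustrative only.

```python
# Totals copied from the step 1 output.
sum_carbon = {
    "u_001": 475, "u_002": 659, "u_003": 620, "u_004": 640, "u_005": 1100,
    "u_006": 830, "u_007": 1470, "u_008": 1240, "u_009": 930, "u_010": 1080,
    "u_011": 960, "u_012": 250, "u_013": 1430, "u_014": 1060, "u_015": 290,
}

P004_COST = 215  # carbon needed for one p004 (from plant_carbon.txt)
P002_COST = 19   # carbon needed for one p002 (from plant_carbon.txt)

# Integer floor division plays the role of floor(... / ...) in the query.
count_p002 = {
    user: (total - P004_COST) // P002_COST
    for user, total in sum_carbon.items()
}

print(count_p002["u_001"])  # 13, since (475 - 215) / 19 is about 13.68
print(count_p002["u_007"])  # 66, since (1470 - 215) / 19 is about 66.05
```

Every user in this data set has at least 215, so the subtraction never goes negative. A user below that threshold would get a negative count under this formula and would need to be filtered out separately.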
1.4 Step 4: sort by the number of 沙柳 claimed in descending order and fetch the count from the next row
This yields the top 10 users by cumulative p002 沙柳 claimed as of October 1. The outer order by makes the rows kept by limit 10 explicit rather than relying on the window's sort order.
hive (default)> select user_id,
    count_p002,
    lead(count_p002, 1) over (order by count_p002 desc) lead_1_p002
from (
    select user_id,
        floor((t1.sum_carbon - t2.low_carbon) / t3.low_carbon) count_p002
    from (
        select user_id, sum(low_carbon) sum_carbon
        from user_low_carbon
        where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
        group by user_id
    ) t1,
    (
        select low_carbon
        from plant_carbon
        where plant_id = 'p004'
    ) t2,
    (
        select low_carbon
        from plant_carbon
        where plant_id = 'p002'
    ) t3
) t4
order by count_p002 desc
limit 10;
Output:
user_id count_p002 lead_1_p002
u_007 66 63
u_013 63 53
u_008 53 46
u_005 46 45
u_010 45 44
u_014 44 39
u_011 39 37
u_009 37 32
u_006 32 23
u_002 23 22
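As a rough illustration of what lead(count_p002, 1) over (order by count_p002 desc) does, the Python sketch below sorts the step 3 counts in descending order and pairs each row with the count on the following row. It is an emulation under the assumption that there are no ties (true for this data), not Hive's implementation.

```python
# Counts copied from the step 3 output.
count_p002 = {
    "u_001": 13, "u_002": 23, "u_003": 21, "u_004": 22, "u_005": 46,
    "u_006": 32, "u_007": 66, "u_008": 53, "u_009": 37, "u_010": 45,
    "u_011": 39, "u_012": 1, "u_013": 63, "u_014": 44, "u_015": 3,
}

# order by count_p002 desc
ranked = sorted(count_p002.items(), key=lambda item: item[1], reverse=True)

# lead(count_p002, 1): the value from the next row, or None on the last row
with_lead = []
for index, (user, count) in enumerate(ranked):
    next_count = ranked[index + 1][1] if index + 1 < len(ranked) else None
    with_lead.append((user, count, next_count))

top10 = with_lead[:10]  # limit 10
for row in top10:
    print(*row)
```

The last printed row is u_002 23 22: its lead value comes from the 11th-ranked user, because the window is evaluated over all 15 rows before the limit is applied.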
1.5 Step 5: compute how many more 沙柳 each user claimed than the next-ranked user
hive (default)> select user_id,
    count_p002,
    (count_p002 - lead_1_p002) diff_count
from (
    select user_id,
        count_p002,
        lead(count_p002, 1) over (order by count_p002 desc) lead_1_p002
    from (
        select user_id,
            floor((t1.sum_carbon - t2.low_carbon) / t3.low_carbon) count_p002
        from (
            select user_id, sum(low_carbon) sum_carbon
            from user_low_carbon
            where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
            group by user_id
        ) t1,
        (
            select low_carbon
            from plant_carbon
            where plant_id = 'p004'
        ) t2,
        (
            select low_carbon
            from plant_carbon
            where plant_id = 'p002'
        ) t3
    ) t4
    order by count_p002 desc
    limit 10
) t5
order by count_p002 desc;
Output:
user_id count_p002 diff_count
u_007 66 3
u_013 63 10
u_008 53 7
u_005 46 1
u_010 45 1
u_014 44 5
u_011 39 2
u_009 37 5
u_006 32 9
u_002 23 1
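The 10th row has a diff_count of 1 rather than NULL because lead is computed over all users before the result is cut to 10 rows. The sketch below, in Python with a hypothetical helper named diffs, contrasts the two orderings using the step 3 counts sorted in descending order.

```python
# Step 3 counts, sorted in descending order.
counts_desc = [66, 63, 53, 46, 45, 44, 39, 37, 32, 23, 22, 21, 13, 3, 1]


def diffs(values):
    """Each value minus the next one; the last row has no successor."""
    return [a - b for a, b in zip(values, values[1:])] + [None]


# Window first, then limit 10 (what the query in step 5 does).
window_then_limit = diffs(counts_desc)[:10]

# Limit 10 first, then window (a different query shape).
limit_then_window = diffs(counts_desc[:10])

print(window_then_limit)  # [3, 10, 7, 1, 1, 5, 2, 5, 9, 1]
print(limit_then_window)  # [3, 10, 7, 1, 1, 5, 2, 5, 9, None]
```

The first list matches the diff_count column above. The second shows what would happen if the top 10 were selected before applying the window function: the 10th user would have no next row to compare against.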