Hive: Aggregate values by attribute into a JSON or MAP field
【Posted】2021-04-02 21:27:19
【Question】I have a table that looks like this:
| user | attribute | value |
|--------|-------------|---------|
| 1 | A | 10 |
| 1 | A | 20 |
| 1 | B | 5 |
| 2 | B | 10 |
| 2 | B | 15 |
| 2 | C | 100 |
| 2 | C | 200 |
I want to group this table by user and collect the sums of the value field into a JSON or a MAP keyed by attribute, for example:
| user | sum_values_by_attribute |
|------|--------------------------|
| 1 | {"A": 30, "B": 5} |
| 2 | {"B": 25, "C": 300} |
Is there a way to do this in Hive?
I found related questions such as this and this, but none of them covers the case where the values are summed.
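【Answer 1】A JSON string corresponding to map<string, int> can be built in Hive using only native functions: aggregate by user and attribute, concatenate each key and its sum into a "key": value pair, collect the pairs into an array, join the array with concat_ws, and add curly braces.
Demo: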
with initial_data as (
select stack(7,
1,'A',40,
1,'A',20,
1,'B',5,
2,'B',10,
2,'B',15,
2,'C',100,
2,'C',200) as (`user`, attribute, value )
)
select `user`, concat('{',concat_ws(',',collect_set(concat('"', attribute, '": ',sum_value))), '}') as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
Result (JSON string):
user sum_values_by_attribute
1 {"A": 60,"B": 5}
2 {"B": 25,"C": 300}
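As a cross-check outside Hive, the two-step logic above (sum per user and attribute, then join the pairs per user) can be modelled in plain Python. This is only an illustrative sketch: the function names and the ROWS constant are hypothetical and not part of the answer. The pairs are sorted here so the output is stable, whereas collect_set makes no ordering guarantee.

```python
import json
from collections import defaultdict

# Same rows as the demo CTE: (user, attribute, value)
ROWS = [
    (1, "A", 40), (1, "A", 20), (1, "B", 5),
    (2, "B", 10), (2, "B", 15), (2, "C", 100), (2, "C", 200),
]


def sums_by_user_attribute(rows):
    """Step 1: mirrors GROUP BY user, attribute with sum(value)."""
    totals = defaultdict(int)
    for user, attribute, value in rows:
        totals[(user, attribute)] += value
    return dict(totals)


def json_by_user(rows):
    """Step 2: mirrors GROUP BY user, joining '"key": value' pairs in braces."""
    pairs = defaultdict(list)
    for (user, attribute), total in sorted(sums_by_user_attribute(rows).items()):
        pairs[user].append(f'"{attribute}": {total}')
    return {user: "{" + ",".join(items) + "}" for user, items in pairs.items()}


result = json_by_user(ROWS)
# Each string parses as JSON, so a client can turn it back into a mapping.
parsed = {user: json.loads(text) for user, text in result.items()}
print(result)
```

Because the braces are added around the joined pairs, each row's string is a valid JSON object that any JSON parser can read.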
Note: if you run this on Spark, you can cast( as map<string, int>), whereas Hive does not support casting complex types.
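A map<string, string> can also be built easily using only native functions: the same array of key-value pairs, but without double quotes (like A:10), is joined into a comma-delimited string with concat_ws and converted to a map with the str_to_map function (the same WITH CTE is omitted):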
select `user`, str_to_map(concat_ws(',',collect_set(concat(attribute, ':',sum_value)))) as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
Result (map<string, string>):
user sum_values_by_attribute
1 {"A":"60","B":"5"}
2 {"B":"25","C":"300"}
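The practical consequence of getting string values is that the consumer has to convert them when reading. A simplified Python stand-in for str_to_map shows the shape of the data; it is an assumption-laden model (literal delimiters, no handling of missing values), not Hive's actual implementation, and the function below is hypothetical despite sharing the name.

```python
def str_to_map(text, delimiter1=",", delimiter2=":"):
    """Rough stand-in for Hive's str_to_map: keys and values both stay strings."""
    result = {}
    for pair in text.split(delimiter1):
        # partition splits on the first delimiter2 only
        key, _, value = pair.partition(delimiter2)
        result[key] = value
    return result


as_strings = str_to_map("A:60,B:5")
# Converting on read recovers the integer sums.
as_ints = {key: int(value) for key, value in as_strings.items()}
print(as_strings, as_ints)
```

In Hive itself the equivalent read-side conversion would be along the lines of cast(sum_values_by_attribute['A'] as int).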
If you need map<string, int>, unfortunately it cannot be done with Hive native functions alone, because str_to_map returns map<string, string>, not map<string, int>. You can try the brickhouse collect function:
add jar '~/brickhouse/target/brickhouse-0.6.0.jar'; --check brickhouse site https://github.com/klout/brickhouse for instructions
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select `user`, collect(attribute, sum_value) as sum_values_by_attribute
from
(--aggregate groupby user, attribute
select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
【Answer 2】You can first compute the sum per user and attribute, then use collect_list. Please let me know whether the output below is what you need.
SQL below -
select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute
from
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by `user`,`attribute`) tmp2
group by `user`;
Test queries -
create table tmp2 ( `user` int, `attribute` string, `value` int);
insert into tmp2 select 1,'A',40;
insert into tmp2 select 1,'A',20;
insert into tmp2 select 1,'B',5;
insert into tmp2 select 2,'C',20;
insert into tmp2 select 1,'B',10;
insert into tmp2 select 2,'B',10;
insert into tmp2 select 2,'C',10;
select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute
from
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by `user`,`attribute`) tmp2
group by `user`;