Hive: Aggregate values by attribute into a JSON or MAP field

Posted 2021-04-02 21:27:19

I have a table that looks like this:

|   user | attribute   |   value |
|--------|-------------|---------|
|      1 | A           |      10 |
|      1 | A           |      20 |
|      1 | B           |       5 |
|      2 | B           |      10 |
|      2 | B           |      15 |
|      2 | C           |     100 |
|      2 | C           |     200 |

I would like to group this table by `user` and collect the sum of the `value` field into a JSON or a MAP keyed by attribute, e.g.:

| user | sum_values_by_attribute  |
|------|--------------------------|
|    1 | {"A": 30, "B": 5}        |
|    2 | {"B": 25, "C": 300}      |

Is there a way to do this in Hive?

I found related questions such as this and this, but none of them considers the case of summing the values.


Answer 1:

A JSON string corresponding to a map&lt;string, int&gt; can be built in Hive using native functions only: aggregate by user and attribute, concatenate `"key": value` pairs, collect the pairs into an array, join the array into one string with concat_ws, and add the curly braces.

Demo:

with initial_data as (
select stack(7,
1,'A',40,
1,'A',20,
1,'B',5,
2,'B',10,
2,'B',15,
2,'C',100,
2,'C',200) as (`user`, attribute, value )
)

select `user`, concat('{',concat_ws(',',collect_set(concat('"', attribute, '": ',sum_value))), '}') as sum_values_by_attribute  
from
(--aggregate groupby user, attribute
  select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;

Result (JSON string):

user    sum_values_by_attribute
1       {"A": 60,"B": 5}
2       {"B": 25,"C": 300}
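Individual sums can then be pulled out of such a JSON string with the native get_json_object function; a small sketch (the table name `aggregated` is assumed, standing for the materialized result of the query above):

```sql
-- Hypothetical follow-up: extract one attribute's sum from the JSON string.
-- '$.A' is a JSONPath expression; the result comes back as a string.
select `user`,
       get_json_object(sum_values_by_attribute, '$.A') as sum_a
from aggregated;  -- 'aggregated' is an assumed name, not defined above
```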

Note: if you run this on Spark, you can `cast(... as map<string, int>)`; Hive does not support casting complex types.
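On Spark SQL, the cast mentioned above could look roughly like this (an untested sketch, relying on Spark's ability to cast a map&lt;string, string&gt; produced by str_to_map to map&lt;string, int&gt;; this does not work in Hive):

```sql
-- Spark SQL sketch: build the map as strings, then cast the whole map type.
select `user`,
       cast(str_to_map(concat_ws(',', collect_set(concat(attribute, ':', sum_value))))
            as map<string, int>) as sum_values_by_attribute
from (select `user`, attribute, sum(value) as sum_value
      from initial_data group by `user`, attribute) s
group by `user`;
```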

A map&lt;string, string&gt; can also be built easily using native functions only: the same array of key-value pairs, this time without double quotes (like A:10), is joined into a comma-separated string using concat_ws and converted to a map with the str_to_map function (the same WITH CTE is skipped here):

select `user`, str_to_map(concat_ws(',',collect_set(concat(attribute, ':',sum_value)))) as sum_values_by_attribute  
from
(--aggregate groupby user, attribute
  select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;

Result (map&lt;string, string&gt;):

user    sum_values_by_attribute
1       "A":"60","B":"5"
2       "B":"25","C":"300"

If you need a map&lt;string, int&gt;, unfortunately it cannot be done using Hive native functions only, because str_to_map returns map&lt;string, string&gt;, not map&lt;string, int&gt;. You can try the brickhouse collect function:

add jar '~/brickhouse/target/brickhouse-0.6.0.jar'; --check brickhouse site https://github.com/klout/brickhouse for instructions

create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select `user`, collect(attribute, sum_value) as sum_values_by_attribute  
from
(--aggregate groupby user, attribute
  select `user`, attribute, sum(value) as sum_value from initial_data group by `user`, attribute
)s
group by `user`;
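Once the column is a real map, per-key lookups use the usual bracket syntax; a minimal sketch (the table name `agg` is an assumption, standing for the materialized result of the query above):

```sql
-- Assuming the aggregated result is stored as 'agg',
-- a single attribute's sum is read with map index syntax.
select `user`, sum_values_by_attribute['B'] as sum_b
from agg;  -- 'agg' is a hypothetical name, not defined above
```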


Answer 2:

You can first calculate the sum by attribute and user_id, and then use collect_list. Please let me know if the output below is OK.

SQL below -

select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute  
from 
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by `user`,att) tmp2
group by `user`;

Test query -

create table tmp2 ( `user` int, `attribute` string, `value` int);

insert into tmp2 select 1,'A',40;
insert into tmp2 select 1,'A',20;
insert into tmp2 select 1,'B',5;
insert into tmp2 select 2,'C',20;
insert into tmp2 select 1,'B',10;
insert into tmp2 select 2,'B',10;
insert into tmp2 select 2,'C',10;

select `user`,
collect_list(concat(att,":",cast(val as string))) sum_values_by_attribute  
from 
(select `user`,`attribute` att, sum(`value`) val from tmp2 group by `user`,att) tmp2
group by `user`;
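Note that collect_list here returns an array&lt;string&gt; of `attribute:sum` pairs rather than a map. If a map&lt;string, string&gt; is wanted, the array can be joined and fed to str_to_map, as in the first answer; a sketch against the same `tmp2` table:

```sql
-- Sketch: same aggregation, but the pairs are joined with concat_ws
-- and converted to a map<string, string> with str_to_map.
select `user`,
       str_to_map(concat_ws(',',
           collect_list(concat(att, ':', cast(val as string))))) as sum_values_by_attribute
from (select `user`, `attribute` att, sum(`value`) val
      from tmp2 group by `user`, att) tmp2
group by `user`;
```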

