How to convert the following dictionary format column into different format in Hive or Presto?
【Posted】: 2021-11-17 07:40:51 【Question】: I have a Hive table as follows:
event_name | attendees_per_countries |
---|---|
a | 'US':5 |
b | 'US':4, 'UK': 3, 'CA': 2 |
c | 'UK':2, 'CA': 1 |
I want to get a new table that looks like this:
country | number_of_people |
---|---|
US | 9 |
UK | 5 |
CA | 3 |
How can I write this query in Hive or Presto?
【Answer 1】: If the attendees_per_countries column is of type string, you can use the following:
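WITH sample_data AS (
    SELECT
        event_name,
        -- strip enclosing braces (and pipes), if any, then split the string into a map
        str_to_map(
            regexp_replace(attendees_per_countries, '[{|}]', ''),
            ',',   -- delimiter between entries
            ':'    -- delimiter between key and value
        ) AS attendees_per_countries
    FROM
        raw_data
)
SELECT
    -- keys still carry the single quotes and stray spaces, so remove them
    regexp_replace(cm.key, "[' ]", "") AS country,
    -- str_to_map yields string values; SUM relies on Hive's implicit numeric cast
    SUM(cm.value) AS no_of_people
FROM sample_data
LATERAL VIEW explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC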
However, if the attendees_per_countries column is already of type map, you can use the following:
-- sample_data here stands for the table that holds the map-typed column
SELECT
    regexp_replace(cm.key, "[' ]", "") AS country,
    SUM(cm.value) AS no_of_people
FROM sample_data
LATERAL VIEW explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC
Full reproducible example below:
WITH raw_data AS (
    -- inline copy of the sample table from the question
    SELECT 'a' AS event_name, "'US':5" AS attendees_per_countries
    UNION ALL
    SELECT 'b', "'US':4, 'UK': 3, 'CA': 2"
    UNION ALL
    SELECT 'c', "'UK':2, 'CA': 1"
),
sample_data AS (
    SELECT
        event_name,
        -- strip enclosing braces, if any, then split the string into a map
        str_to_map(
            regexp_replace(attendees_per_countries, '[{}]', ''),
            ',',
            ':'
        ) AS attendees_per_countries
    FROM
        raw_data
)
SELECT
    regexp_replace(cm.key, "[' ]", "") AS country,
    SUM(cm.value) AS no_of_people
FROM sample_data
LATERAL VIEW explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC
Let me know if this works for you.
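【Answer 2】: If attendees_per_countries is a map, you can use map_values and then sum the values with array_sum/reduce (I had to use the latter, because Athena does not support the former). If it is not, you can treat the data as JSON and cast it to MAP(VARCHAR, INTEGER), then use the functions mentioned above: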
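WITH dataset(event_name, attendees_per_countries) AS (
    VALUES
        -- each value must be a valid JSON object for the cast to MAP to succeed
        ('a', JSON '{"US":5}'),
        ('b', JSON '{"US":4, "UK": 3, "CA": 2}'),
        ('c', JSON '{"UK":2, "CA": 1}')
)
SELECT event_name as country,
    reduce(
        map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
        0,                          -- initial accumulator
        (agg, curr) -> agg + curr,  -- add each map value to the accumulator
        s -> s                      -- return the final sum unchanged
    ) as number_of_people
FROM dataset
order by 2 desc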
Output:
country | number_of_people |
---|---|
b | 9 |
a | 5 |
c | 3 |
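Note that the query above returns one row per event_name (labelled country), not one row per country. A minimal Presto sketch for per-country totals, assuming each value is a valid JSON object with enclosing braces, could unnest the map and group by its key. The alias t and the column names country and people are illustrative, not part of the original answer:

```sql
WITH dataset(event_name, attendees_per_countries) AS (
    VALUES
        ('a', JSON '{"US": 5}'),
        ('b', JSON '{"US": 4, "UK": 3, "CA": 2}'),
        ('c', JSON '{"UK": 2, "CA": 1}')
)
SELECT
    t.country,
    SUM(t.people) AS number_of_people
FROM dataset
-- UNNEST on a map yields one row per entry, with key and value as two columns
CROSS JOIN UNNEST(
    CAST(attendees_per_countries AS MAP(VARCHAR, INTEGER))
) AS t(country, people)
GROUP BY t.country
ORDER BY number_of_people DESC
```

For this sample data the totals are US 9, UK 5 and CA 3 (2 + 1). If the column is already a map, the CAST can be dropped and the column passed to UNNEST directly.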