How to convert the following dictionary-format column into a different format in Hive or Presto?

Posted 2023-03-21


【Title】: How to convert the following dictionary format column into different format in Hive or Presto? 【Posted】: 2021-11-17 07:40:51 【Question】:

I have a Hive table like this:

event_name	attendees_per_countries
a	'US':5
b	'US':4, 'UK': 3, 'CA': 2
c	'UK':2, 'CA': 1

I want to get a new table that looks like this:

country	number_of_people
US	9
UK	5
CA	3

How can I write this query in Hive or Presto?


【Answer 1】:

If the attendees_per_countries column is of type string, you can first parse it into a map with str_to_map and then explode that map:

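WITH sample_data AS (
    SELECT
        event_name,
        -- strip surrounding braces (if any), then parse the 'key':value pairs into a map
        str_to_map(
            regexp_replace(attendees_per_countries, '[{}]', ''),
            ',',
            ':'
        ) AS attendees_per_countries
    FROM
        raw_data
)
SELECT
    -- keys still carry quotes and spaces at this point, so remove them
    regexp_replace(cm.key, "[' ]", "") AS country,
    -- values are strings such as ' 3'; trim and cast them before summing
    SUM(CAST(trim(cm.value) AS INT)) AS no_of_people
FROM sample_data
LATERAL VIEW explode(attendees_per_countries) cm AS key, value
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC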

However, if the attendees_per_countries column is already of type map, you can explode it directly:

SELECT
    regexp_replace(cm.key, "[' ]", "") AS country,
    SUM(cm.value) AS no_of_people
-- raw_data is the source table here; no str_to_map step is needed
FROM raw_data
LATERAL VIEW explode(attendees_per_countries) cm AS key, value
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC

A complete reproducible example:

WITH raw_data AS (
    -- inline copy of the sample rows from the question
    SELECT 'a' AS event_name, "'US':5" AS attendees_per_countries
    UNION ALL
    SELECT 'b', "'US':4, 'UK': 3, 'CA': 2"
    UNION ALL
    SELECT 'c', "'UK':2, 'CA': 1"
),
sample_data AS (
    SELECT
        event_name,
        -- strip surrounding braces (if any), then parse the 'key':value pairs into a map
        str_to_map(
            regexp_replace(attendees_per_countries, '[{}]', ''),
            ',',
            ':'
        ) AS attendees_per_countries
    FROM
        raw_data
)
SELECT
    regexp_replace(cm.key, "[' ]", "") AS country,
    -- values are strings such as ' 3'; trim and cast them before summing
    SUM(CAST(trim(cm.value) AS INT)) AS no_of_people
FROM sample_data
LATERAL VIEW explode(attendees_per_countries) cm AS key, value
GROUP BY regexp_replace(cm.key, "[' ]", "")
ORDER BY no_of_people DESC
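
Presto has neither LATERAL VIEW nor str_to_map, so the example above will not run there unchanged. A rough Presto equivalent, offered only as a sketch, could use split_to_map and CROSS JOIN UNNEST instead. The aliases t, country and n are illustrative, and the sketch assumes each country appears at most once per row, since split_to_map fails on duplicate keys.

```sql
-- Sketch under the assumptions stated above; sample rows copied from the question.
WITH raw_data (event_name, attendees_per_countries) AS (
    VALUES
        ('a', '''US'':5'),
        ('b', '''US'':4, ''UK'': 3, ''CA'': 2'),
        ('c', '''UK'':2, ''CA'': 1')
)
SELECT
    t.country,
    -- split_to_map yields varchar values, so cast before summing
    SUM(CAST(t.n AS INTEGER)) AS number_of_people
FROM raw_data
CROSS JOIN UNNEST(
    split_to_map(
        -- drop braces, single quotes and spaces so only US:4,UK:3,... remains
        regexp_replace(attendees_per_countries, '[{}'' ]', ''),
        ',',
        ':'
    )
) AS t (country, n)
GROUP BY t.country
ORDER BY number_of_people DESC
```

Stripping the quotes and spaces before the split means the keys come out clean, so no second regexp_replace is needed in the SELECT or GROUP BY.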

Let me know if this works for you.


【Answer 2】:

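If attendees_per_countries is already a map, you can use map_values and then sum the values with array_sum or reduce (I had to use the latter, since Athena does not support the former). If it is not, you can treat the data as JSON, cast it to MAP(VARCHAR, INTEGER), and then apply the functions above: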

WITH dataset(event_name, attendees_per_countries) AS (
   VALUES 
('a',   JSON '{"US":5}'),
('b',   JSON '{"US":4, "UK": 3, "CA": 2}'),
('c',   JSON '{"UK":2, "CA": 1}')
 ) 
 
-- sums the map values within each row, so the result is one total per event
SELECT event_name,
       reduce(
               map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
               0,
               (agg, curr) -> agg + curr,
               s -> s
           )      as number_of_people
FROM dataset
order by 2 desc

Output:

event_name	number_of_people
b	9
a	5
c	3
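
The query above produces one total per event. To get the per-country totals the question asks for, one possible approach in Presto is to unnest the map into (country, count) rows and group by country. This is a sketch rather than part of the original answer; the aliases t, country and n are illustrative, and the JSON literals are written with their braces.

```sql
-- Sketch only: unnest each row's map into one (country, n) row per entry.
WITH dataset (event_name, attendees_per_countries) AS (
    VALUES
        ('a', JSON '{"US":5}'),
        ('b', JSON '{"US":4, "UK": 3, "CA": 2}'),
        ('c', JSON '{"UK":2, "CA": 1}')
)
SELECT
    t.country,
    SUM(t.n) AS number_of_people
FROM dataset
CROSS JOIN UNNEST(
    CAST(attendees_per_countries AS MAP(VARCHAR, INTEGER))
) AS t (country, n)
GROUP BY t.country
ORDER BY number_of_people DESC
```

With the sample rows this groups US as 5 + 4, UK as 3 + 2 and CA as 2 + 1.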

