How to convert the following dictionary format column into a different format in Hive or Presto?

【Posted】: 2021-11-17 07:40:51 【Question】:

I have a Hive table like this:

event_name attendees_per_countries
a {'US':5}
b {'US':4, 'UK': 3, 'CA': 2}
c {'UK':2, 'CA': 1}

I would like to get a new table like this:

country number_of_people
US 9
UK 5
CA 3

How can I write this query in Hive or Presto?

【Answer 1】:

If the column type of attendees_per_countries is string, you can use the following:

WITH sample_data AS (
    select 
        event_name, 
        str_to_map(
            -- strip the surrounding braces before splitting into key:value pairs
            regexp_replace(attendees_per_countries,'[{|}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

However, if the column type of attendees_per_countries is already map, you can query the table directly (sample_data below stands for your table):

select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

Full reproducible example below:

with raw_data AS (
    select 'a' as event_name, "{'US':5}" as attendees_per_countries
    UNION ALL 
    select 'b', "{'US':4, 'UK': 3, 'CA': 2}"
    UNION ALL 
    select 'c', "{'UK':2, 'CA': 1}"
),
sample_data AS (
    select 
        event_name, 
        str_to_map(
            -- strip the surrounding braces before splitting into key:value pairs
            regexp_replace(attendees_per_countries,'[{}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC
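
One detail worth making explicit: str_to_map returns string values, so the SUM above relies on implicit conversion. The following is a minimal Hive sketch, not a definitive implementation, that casts the values to INT explicitly. It assumes the column values are wrapped in braces as the dictionary format suggests; the aliases t, m, country_key and people are names introduced here for illustration only.

```sql
-- Sketch under stated assumptions: sample rows are inlined so the query is self-contained.
WITH raw_data AS (
    SELECT 'a' AS event_name, "{'US':5}" AS attendees_per_countries
    UNION ALL
    SELECT 'b', "{'US':4, 'UK': 3, 'CA': 2}"
    UNION ALL
    SELECT 'c', "{'UK':2, 'CA': 1}"
)
SELECT
    -- keys still carry quotes and leading spaces, e.g. " 'UK'"
    regexp_replace(cm.country_key, "[' ]", "") AS country,
    -- values are strings such as " 3": trim, then cast explicitly
    SUM(CAST(trim(cm.people) AS INT)) AS number_of_people
FROM (
    SELECT str_to_map(
               regexp_replace(attendees_per_countries, '[{}]', ''),
               ',',
               ':'
           ) AS m
    FROM raw_data
) t
LATERAL VIEW explode(t.m) cm AS country_key, people
GROUP BY regexp_replace(cm.country_key, "[' ]", "")
ORDER BY number_of_people DESC
```

Naming the exploded columns in the LATERAL VIEW clause is optional, but it keeps the SELECT list readable when the map holds more than one kind of value.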

Let me know if this works for you.

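【Answer 2】:

If attendees_per_countries is already a map, you can use map_values and then sum the values with array_sum or reduce (I had to use the latter, because Athena does not support the former). If it is not, you can treat the data as JSON, cast it to MAP(VARCHAR, INTEGER), and then apply the functions above: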

WITH dataset(event_name, attendees_per_countries) AS (
   VALUES 
('a',   JSON '{"US":5}'),
('b',   JSON '{"US":4, "UK": 3, "CA": 2}'),
('c',   JSON '{"UK":2, "CA": 1}')
 ) 
 
SELECT event_name as country,
       reduce(
               map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
               0,
               (agg, curr) -> agg + curr,
               s -> s
           )      as number_of_people
FROM dataset
order by 2 desc

Output:

country number_of_people
b 9
a 5
c 3
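
Note that the query above totals the attendees per event (the rows are a, b, c), whereas the question asks for totals per country. A possible Presto sketch for the per-country result is shown below; it reuses the same JSON-to-MAP cast and then flattens each map with UNNEST before grouping. The aliases t, country and people are names introduced here for illustration, and the sample rows assume valid JSON objects with braces.

```sql
-- Sketch only: per-country totals in Presto, assuming the column can be cast from JSON to a map.
WITH dataset(event_name, attendees_per_countries) AS (
    VALUES
        ('a', JSON '{"US":5}'),
        ('b', JSON '{"US":4, "UK": 3, "CA": 2}'),
        ('c', JSON '{"UK":2, "CA": 1}')
)
SELECT t.country,
       sum(t.people) AS number_of_people
FROM dataset
-- UNNEST on a map yields one row per entry, with a key column and a value column
CROSS JOIN UNNEST(CAST(attendees_per_countries AS MAP(VARCHAR, INTEGER))) AS t(country, people)
GROUP BY t.country
ORDER BY number_of_people DESC
```

With the three sample rows, the totals work out to US 9 (5 + 4), UK 5 (3 + 2) and CA 3 (2 + 1).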
