How to convert a JSON string datatype column to a map datatype column in Hive?
【Posted】: 2019-02-19 10:55:40 【Question】: I need to get all the unique key-value pairs across all rows. Each row has different keys and values; see the column in the image above.
For example, one row looks like:
"START_TIME":1549002807568,"PARSING.QUERY_FORMED":1549002807586,"CUBES_WITH_PERMISSIONS":1549002807568,"PARSING.CUBE_MATCH_SELECTED":1549002807586,"POTENTIAL_COMPLETIONS_ADDED":1549002807587,"QUERY_PARSED":1549002807586,"SUGGESTIONS_FORMED":1549002807606,"PARSING.SEQUENCES_GENERATED":1549002807568,"PARSING.NGRAM_MATCHES_CACHED":1549002807585
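【Answer 1】: I tested this with two rows of data. All key-value pairs are the same in both rows, except that the second JSON has one additional key, NEW_KEY, and a different value for PARSING.NGRAM_MATCHES_CACHED.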
with data as
(
select stack(2, --data example
'"START_TIME":1549002807568,"PARSING.QUERY_FORMED":1549002807586,"CUBES_WITH_PERMISSIONS":1549002807568,"PARSING.CUBE_MATCH_SELECTED":1549002807586,"POTENTIAL_COMPLETIONS_ADDED":1549002807587,"QUERY_PARSED":1549002807586,"SUGGESTIONS_FORMED":1549002807606,"PARSING.SEQUENCES_GENERATED":1549002807568,"PARSING.NGRAM_MATCHES_CACHED":1549002807585',
'"NEW_KEY":12345,"START_TIME":1549002807568,"PARSING.QUERY_FORMED":1549002807586,"CUBES_WITH_PERMISSIONS":1549002807568,"PARSING.CUBE_MATCH_SELECTED":1549002807586,"POTENTIAL_COMPLETIONS_ADDED":1549002807587,"QUERY_PARSED":1549002807586,"SUGGESTIONS_FORMED":1549002807606,"PARSING.SEQUENCES_GENERATED":1549002807568,"PARSING.NGRAM_MATCHES_CACHED":154900280758'
) as str
)
select str_to_map(concat_ws(',',collect_set(key_value)),',',':') --collect the distinct key:value pairs, join them with commas and convert the string to a map
from
(
select explode(split(regexp_replace (str,'["]',''),',')) key_value from data --remove the double quotes, split on commas and explode into one key:value pair per row
)s;
Result:
OK
"START_TIME":"1549002807568","PARSING.QUERY_FORMED":"1549002807586","CUBES_WITH_PERMISSIONS":"1549002807568","PARSING.CUBE_MATCH_SELECTED":"1549002807586","POTENTIAL_COMPLETIONS_ADDED":"1549002807587","QUERY_PARSED":"1549002807586","SUGGESTIONS_FORMED":"1549002807606","PARSING.SEQUENCES_GENERATED":"1549002807568","PARSING.NGRAM_MATCHES_CACHED":"154900280758","NEW_KEY":"12345"
Time taken: 158.414 seconds, Fetched: 1 row(s)
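Of course, "PARSING.NGRAM_MATCHES_CACHED" appears only once in the result, because a map cannot contain the same key twice. In the result above, the value from the second row was kept. All key-value pairs in the result are unique.
Please read the comments in the code.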
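If what is needed is a map per row rather than one map merged across all rows, a minimal sketch along the same lines might look like the following. It applies str_to_map directly to each cleaned string, then explodes the map to list the distinct keys. The shortened sample rows, the column alias kv_map, and the aliases k and v are assumptions made for illustration. Like the query above, it assumes the strings have no enclosing braces and that neither keys nor values contain commas or colons.

```sql
with data as
(
select stack(2, -- shortened sample rows (illustrative only)
'"START_TIME":1549002807568,"QUERY_PARSED":1549002807586',
'"NEW_KEY":12345,"START_TIME":1549002807568'
) as str
),
maps as
(
-- remove the double quotes, then split pairs on ',' and key from value on ':'
select str_to_map(regexp_replace(str, '["]', ''), ',', ':') as kv_map
from data
)
-- one output row per distinct key found in any row's map
select distinct t.k
from maps
lateral view explode(kv_map) t as k, v;
```

Selecting kv_map from maps instead returns one map<string,string> per input row. The values come back as strings, as the quoted values in the result above show, so they would need a cast (for example to bigint) before numeric use.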