Explode hive table with multiple array columns including null values
[Posted]: 2020-08-03 12:14:38
[Question]: I have a Hive table like the one below. The number of elements in each column is unpredictable. Can anyone tell me how to correctly explode all columns of this table without losing the NULL values?
+-------------------------------+-------------------+----------------------------+--+
| l1.skillcode | l1.duration | l1.numberofpeople |
+-------------------------------+-------------------+----------------------------+--+
| ["ACFC"] | ["00020"] | ["1"] |
| ["ACFC"] | ["00233"] | ["1"] |
| ["AJBS"] | ["00605"] | ["1"] |
| ["ACFC"] | ["00020"] | ["1"] |
| ["TESTING"] | ["123456"] | ["09876"] |
| ["ACFC"] | ["00233","846"] | ["1"] |
| ["AJBS"] | ["00605"] | ["1"] |
| ["ACFC"] | ["00020"] | ["1"] |
| ["TESTING"] | NULL | ["09876"] |
| ["ACFC"] | ["00233"] | NULL |
| ["AJBS"] | ["00605"] | ["1"] |
| ["ACFC"] | ["00020"] | ["1"] |
| ["TESTING"] | NULL | ["09876","09877","09878"] |
| NULL | ["56743"] | ["45678","345"] |
| ["ACFC","BES","SAL","EPD"] | ["00233"] | ["1"] |
| ["AJBS"] | ["00605"] | ["1"] |
| NULL | ["00020"] | ["1"] |
| ["TESTING"] | NULL | ["09876","09877","09878"] |
| NULL | ["56743"] | ["45678","345"] |
| ["ACFC"] | ["00020"] | ["1"] |
| ["TESTING"] | NULL | ["09876","09877","09878"] |
| ["ACFC"] | ["00233"] | ["1"] |
| ["AJBS"] | ["00605"] | ["1"] |
+-------------------------------+-------------------+----------------------------+--+
When I try the query below, I get NULLs for the column I am trying to explode, and the related non-null values of that row are dropped.
select L2.*,t1.duration,t1.numberofpeople from t1
lateral view explode(t1.skillcode) L2;
How can I explode all columns of the table without losing any NULL values, while keeping the relationship between the values of all 3 columns?
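[Answer 1]: Use lateral view outer instead of lateral view. With plain lateral view, a row whose array is NULL or empty produces no output at all; with lateral view outer, that row is kept and the exploded column is NULL.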
hive> select * from L2;
OK
["BES","SAL"] ["00020","846"] ["1","09876"]
["SEAL"] [] []
[] ["0020","0021"] []
Time taken: 0.088 seconds, Fetched: 3 row(s)
hive> select L3.*,L4.*,L5.* from L2 lateral view outer explode(L2.skillcode) L3 lateral view outer explode(L2.duration) L4 lateral view outer explode(L2.numberofpeople) L5;
OK
BES 00020 1
BES 00020 09876
BES 846 1
BES 846 09876
SAL 00020 1
SAL 00020 09876
SAL 846 1
SAL 846 09876
SEAL NULL NULL
NULL 0020 NULL
NULL 0021 NULL
Time taken: 0.119 seconds, Fetched: 11 row(s)
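Note that chaining three lateral views pairs every element of one array with every element of the others: the first input row above produces 2 x 2 x 2 = 8 output rows. If "keeping the relationship" is instead taken to mean aligning elements by their position in each array, one possible approach is to generate positions up to the longest array and index into each column. The following is only a sketch under that assumption, written against the question's table t1; the aliases idx, pos, and dummy are made up for illustration, and it relies on posexplode (Hive 0.13 or later), on size() returning -1 for a NULL array, and on array indexing returning NULL when the array is NULL or the index is past its end.

```sql
-- Sketch only: aligns the three arrays by position instead of cross-joining them.
-- Assumption: elements at the same index belong together.
SELECT
  t1.skillcode[idx.pos]      AS skillcode,       -- NULL if the array is NULL or too short
  t1.duration[idx.pos]       AS duration,
  t1.numberofpeople[idx.pos] AS numberofpeople
FROM t1
LATERAL VIEW posexplode(
  -- space(n - 1) split on ' ' yields an array of n empty strings,
  -- where n is the longest array length, with a minimum of 1
  -- so that rows whose arrays are all NULL or empty still appear once.
  split(
    space(greatest(size(t1.skillcode),
                   size(t1.duration),
                   size(t1.numberofpeople),
                   1) - 1),
    ' ')
) idx AS pos, dummy;
```

With this shape, each input row is intended to yield as many output rows as its longest array has elements, with NULL filling the positions that a shorter or NULL array does not have.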
Note: the array-typed columns were created and populated manually in Hive.
CREATE TABLE `L2`(
`skillcode` array<string>,
`duration` array<string>,
`numberofpeople` array<string>)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:9000/***/data/hive/dwh/l2';
-- to insert the data
INSERT INTO L2 select array() as skillcode,array('0020','0021') as duration,array() as numberofpeople FROM (select '1' ) t;
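The statement above loads only the third of the sample rows shown in the select * output. To reproduce the other two, statements of the same shape should work; this is a sketch that assumes the values printed in that output are the intended ones and that empty arrays are written with array() as in the statement above.

```sql
-- Row 1 of the sample output: three non-empty arrays.
INSERT INTO L2
SELECT array('BES','SAL')   AS skillcode,
       array('00020','846') AS duration,
       array('1','09876')   AS numberofpeople
FROM (SELECT '1') t;

-- Row 2 of the sample output: only skillcode has a value.
INSERT INTO L2
SELECT array('SEAL') AS skillcode,
       array()       AS duration,
       array()       AS numberofpeople
FROM (SELECT '1') t;
```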