Explode a Hive table with multiple array columns including NULL values

Posted: 2020-08-03 12:14:38

Question:

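I have a Hive table like the one shown below. The number of elements in each column is unpredictable. Can anyone tell me how to correctly explode all columns of this table without losing the NULL values?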

+-------------------------------+-------------------+----------------------------+--+
| l1.skillcode                  | l1.duration       |     l1.numberofpeople      |
+-------------------------------+-------------------+----------------------------+--+
| ["ACFC"]                      | ["00020"]         | ["1"]                      |
| ["ACFC"]                      | ["00233"]         | ["1"]                      |
| ["AJBS"]                      | ["00605"]         | ["1"]                      |
| ["ACFC"]                      | ["00020"]         | ["1"]                      |
| ["TESTING"]                   | ["123456"]        | ["09876"]                  |
| ["ACFC"]                      | ["00233","846"]   | ["1"]                      |
| ["AJBS"]                      | ["00605"]         | ["1"]                      |
| ["ACFC"]                      | ["00020"]         | ["1"]                      |
| ["TESTING"]                   | NULL              | ["09876"]                  |
| ["ACFC"]                      | ["00233"]         | NULL                       |
| ["AJBS"]                      | ["00605"]         | ["1"]                      |
| ["ACFC"]                      | ["00020"]         | ["1"]                      |
| ["TESTING"]                   | NULL              | ["09876","09877","09878"]  |
| NULL                          | ["56743"]         | ["45678","345"]            |
| ["ACFC","BES","SAL","EPD"]    | ["00233"]         | ["1"]                      |
| ["AJBS"]                      | ["00605"]         | ["1"]                      |
| NULL                          | ["00020"]         | ["1"]                      |
| ["TESTING"]                   | NULL              | ["09876","09877","09878"]  |
| NULL                          | ["56743"]         | ["45678","345"]            |
| ["ACFC"]                      | ["00020"]         | ["1"]                      |
| ["TESTING"]                   | NULL              | ["09876","09877","09878"]  |
| ["ACFC"]                      | ["00233"]         | ["1"]                      |
| ["AJBS"]                      | ["00605"]         | ["1"]                      |
+-------------------------------+-------------------+----------------------------+--+

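When I try the query below, rows where the column I am exploding is NULL disappear from the result, and the related non-NULL values in those rows are lost with them.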

select L2.*,t1.duration,t1.numberofpeople from t1 
lateral view explode(t1.skillcode) L2;

How can I explode all columns of the table without losing any NULL values, while keeping the relationship between the values of all 3 columns?

Answer 1:

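Use lateral view outer instead of lateral view. With outer, a row whose array is NULL or empty still produces one output row, with NULL in the exploded column, instead of being dropped.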
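Applied to the query from the question, this could look roughly like the sketch below. It assumes the table is named t1 as in the question's query; the view aliases s, d, n and the column aliases skill, dur, people are hypothetical names chosen for illustration.

```sql
-- Sketch only: the question's query with OUTER on every lateral view.
-- s/d/n and skill/dur/people are illustrative aliases, not from the original post.
SELECT s.skill, d.dur, n.people
FROM t1
LATERAL VIEW OUTER explode(t1.skillcode) s AS skill
LATERAL VIEW OUTER explode(t1.duration) d AS dur
LATERAL VIEW OUTER explode(t1.numberofpeople) n AS people;
```

Because each lateral view is applied to the rows produced by the previous one, a row with several multi-element arrays yields every combination of their elements.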

hive> select * from L2;
OK
["BES","SAL"]   ["00020","846"] ["1","09876"]
["SEAL"]    []  []
[]  ["0020","0021"] []
Time taken: 0.088 seconds, Fetched: 3 row(s)
hive> select L3.*,L4.*,L5.* from L2  lateral view outer explode(L2.skillcode) L3 lateral view outer explode(L2.duration) L4 lateral view outer explode(L2.numberofpeople) L5;
OK
BES 00020   1
BES 00020   09876
BES 846 1
BES 846 09876
SAL 00020   1
SAL 00020   09876
SAL 846 1
SAL 846 09876
SEAL    NULL    NULL
NULL    0020    NULL
NULL    0021    NULL
Time taken: 0.119 seconds, Fetched: 11 row(s)
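Note that the output above contains every combination of elements per input row (8 rows for the first input row). If elements at the same array position are meant to belong together, one possible alternative is to generate an index from 0 up to the longest array and look each array up by that index. The following is a sketch under several assumptions, not part of the original answer: it relies on size() returning -1 for a NULL array, on out-of-range or NULL array lookups returning NULL, and on the common split(space(n - 1), ' ') idiom producing n elements. The aliases i, pos and x are hypothetical. Verify the behavior on your Hive version before relying on it.

```sql
-- Sketch only: pair array elements by position instead of cross-joining them.
-- Assumes size(NULL) = -1 and that arr[pos] yields NULL when pos is out of range.
SELECT L2.skillcode[i.pos]      AS skillcode,
       L2.duration[i.pos]       AS duration,
       L2.numberofpeople[i.pos] AS numberofpeople
FROM L2
LATERAL VIEW posexplode(
  split(
    space(greatest(size(L2.skillcode), size(L2.duration), size(L2.numberofpeople)) - 1),
    ' ')
) i AS pos, x;  -- x is an unused placeholder value
```

Under these assumptions, shorter or missing arrays are padded with NULL, and a row whose arrays are all empty or NULL still yields one all-NULL output row.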

Note: I created the table manually and inserted the data into array-typed columns in Hive.

CREATE TABLE `L2`(
  `skillcode` array<string>, 
  `duration` array<string>, 
  `numberofpeople` array<string>)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/***/data/hive/dwh/l2';
-- to insert the data
INSERT INTO L2  select array() as skillcode,array('0020','0021') as duration,array() as numberofpeople FROM (select '1' ) t;
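The statement above inserts only the third sample row. Following the same pattern, the other two rows shown in the select * output could presumably be inserted as sketched below; the values are copied from that output, and the use of array() for the empty arrays mirrors the original statement.

```sql
-- Sketch only: the remaining two sample rows, using the same INSERT pattern as above.
INSERT INTO L2 select array('BES','SAL') as skillcode, array('00020','846') as duration, array('1','09876') as numberofpeople FROM (select '1' ) t;
INSERT INTO L2 select array('SEAL') as skillcode, array() as duration, array() as numberofpeople FROM (select '1' ) t;
```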
