Hive: How do I compare two columns in WHERE clause having complex datatypes?


[Posted]: 2018-09-05 09:43:29

[Question]:

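I have a Hive table as my source and another Hive table as my target. The DDLs of the source and target tables are the same, except that a few audit columns have been added to the target table. Below are the DDLs:

Source: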

CREATE EXTERNAL TABLE source.customer_detail(
   id string,
   name string,
   city string,
   properties_owned array<struct<property_addr:string, location:string>>
)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
  '/user/aiman/customer_detail';

Target:

CREATE EXTERNAL TABLE target.customer_detail(
   id string,
   name string,
   city string,
   properties_owned array<struct<property_addr:string, location:string>>,
   audit_insterted_ts timestamp,
   audit_dml_action char(1)
)
PARTITIONED BY (audit_active_flag char(1))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS ORC
LOCATION
  '/user/aiman/target/customer_detail';

Source data:

+---------------------+--------------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| customer_detail.id  |   customer_detail.name   |  customer_detail.city   | customer_detail.properties_owned                                                                                                  |
+---------------------+--------------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| 1                   | Aiman Sarosh             |      kolkata            | [{"property_addr":"H1 Block Saltlake","location":"kolkata"},{"property_addr":"New Property Added Saltlake","location":"kolkata"}] |
| 2                   | Justin                   |      delhi              | [{"property_addr":"some address in delhi","location":"delhi"}]                                                                    |
+---------------------+--------------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------+

Target data:

+---------------------+--------------------------+-------------------------+----------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| customer_detail.id  |   customer_detail.name   |  customer_detail.city   | customer_detail.properties_owned                               |  customer_detail.audit_insterted_ts  | customer_detail.audit_dml_action  | customer_detail.audit_active_flag  |
+---------------------+--------------------------+-------------------------+----------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| 1                   | Aiman Sarosh             |      kolkata            | [{"property_addr":"H1 Block Saltlake","location":"kolkata"}]   | 2018-09-04 06:55:12.361              | I                                 | A                                  |
| 2                   | Justin                   |      delhi              | [{"property_addr":"some address in delhi","location":"delhi"}] | 2018-09-05 08:36:39.023              | I                                 | A                                  |
+---------------------+--------------------------+-------------------------+----------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+

When I run the query below, it should fetch the 1 record that has been modified, i.e.:

+---------------------+--------------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| customer_detail.id  |   customer_detail.name   |  customer_detail.city   | customer_detail.properties_owned                                                                                                  |  customer_detail.audit_insterted_ts  | customer_detail.audit_dml_action  | customer_detail.audit_active_flag  |
+---------------------+--------------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+
| 1                   | Aiman Sarosh             |      kolkata            | [{"property_addr":"H1 Block Saltlake","location":"kolkata"},{"property_addr":"New Property Added Saltlake","location":"kolkata"}] | 2018-09-05 07:15:10.321              | U                                 | A                                  |
+---------------------+--------------------------+-------------------------+-----------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+-----------------------------------+------------------------------------+

Basically, the element {"property_addr":"New Property Added Saltlake","location":"kolkata"} has been added to the array column properties_owned for record ID 1 in source.

Query:

SELECT  --fetch modified/updated records in source
   source.id AS id,
   source.name AS name,
   source.city AS city,
   source.properties_owned AS properties_owned,
   current_timestamp() AS audit_insterted_ts,
   'U' AS audit_dml_action,
   'A' AS audit_active_flag
FROM source.customer_detail source
INNER JOIN target.customer_detail jrnl
ON source.id=jrnl.id
WHERE source.name!=jrnl.name
OR source.city!=jrnl.city
OR source.properties_owned!=jrnl.properties_owned

But it throws this error:

Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 14:3 Argument type mismatch 'properties_owned': The 1st argument of NOT EQUAL  is expected to a primitive type, but list is found (state=42000,code=10016)

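How do I compare two columns with complex data types in the WHERE clause when I am using joins? I could use .POS and .ITEM, but that won't help because my column is an array of structs and the length of the array can differ.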

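[Comments]:

You can use lateral view explode to explode your arrays and then perform the join.

I tried using it, but I couldn't figure out which JOINs to apply. Any sample query would be really helpful :)

[Answer 1]: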

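One way to handle complex types is to convert them to strings, for example JSON strings. The brickhouse project contains useful third-party Hive UDFs. It has a to_json function that can convert any complex type to a JSON string. First, clone the repository and build the jar: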

git clone https://github.com/klout/brickhouse.git
cd brickhouse
mvn clean package

Then copy the Brickhouse jar to HDFS and add the jar in Hive:

add jar hdfs://<your_path>/brickhouse-0.7.1-SNAPSHOT.jar;

Register the to_json UDF in Hive:

create temporary function to_json as 'brickhouse.udf.json.ToJsonUDF';

Now you can use it, for example:

hive> select to_json(ARRAY(MAP('a',1), MAP('b',2)));
OK
[{"a":1},{"b":2}]

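So, in your case, you need to convert the columns to JSON strings and then compare those strings in the WHERE clause. Keep in mind that to_json converts complex values as they are, without reordering anything. For example, in your case the two arrays

[{"property_addr":"H1 Block Saltlake","location":"kolkata"},{"property_addr":"New Property Added Saltlake","location":"kolkata"}]

[{"property_addr":"New Property Added Saltlake","location":"kolkata"},{"property_addr":"H1 Block Saltlake","location":"kolkata"}]

will be treated as different, because their elements appear in a different order.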
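Applied to the query from the question, the comparison could look like the sketch below. It assumes the to_json temporary function has been registered in the current session as shown above, and that both tables store the array elements in the same order; everything else is taken unchanged from the question.

```sql
-- Sketch: compare the complex column through its JSON string form.
-- Assumes the Brickhouse to_json UDF is registered in this session.
SELECT  --fetch modified/updated records in source
   source.id AS id,
   source.name AS name,
   source.city AS city,
   source.properties_owned AS properties_owned,
   current_timestamp() AS audit_insterted_ts,
   'U' AS audit_dml_action,
   'A' AS audit_active_flag
FROM source.customer_detail source
INNER JOIN target.customer_detail jrnl
ON source.id = jrnl.id
WHERE source.name != jrnl.name
OR source.city != jrnl.city
-- Both sides are plain strings here, so != is allowed.
OR to_json(source.properties_owned) != to_json(jrnl.properties_owned);
```

Because the comparison is done on strings, the NOT EQUAL operator no longer receives a list argument, which is what triggered error 10016.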

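[Comments]:

Thanks serge_k. That's a nice approach. But I implemented it using lateral view explode() with a sub-query to get a column, then used collect_list and concat_ws and compared the results.

[Answer 2]:

I fixed this using LATERAL VIEW explode(). I then applied the collect_list() and concat_ws() functions to the exploded columns (collect_list returns an array<string>), which finally gave me a string that I compared: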

SELECT  --fetch modified/updated records in source
   source.id AS id,
   source.name AS name,
   source.city AS city,
   source.properties_owned AS properties_owned,
   current_timestamp() AS audit_insterted_ts,
   'U' AS audit_dml_action,
   'A' AS audit_active_flag
FROM source.customer_detail source
INNER JOIN target.customer_detail jrnl
ON source.id=jrnl.id
WHERE source.id IN
(
SELECT t1.id
FROM
(
   SELECT src.id,concat_ws(',', collect_list(src.property_addr),collect_list(src.location)) newcol
   FROM
   (
      SELECT id, prop_owned.property_addr AS property_addr, prop_owned.location AS location
      FROM source.customer_detail LATERAL VIEW explode(properties_owned) exploded_tab AS prop_owned
   ) src
   GROUP BY src.id
) t1
INNER JOIN
(
   SELECT trg.id,concat_ws(',', collect_list(trg.property_addr),collect_list(trg.location)) newcol
   FROM
   (
      SELECT id, prop_owned.property_addr AS property_addr, prop_owned.location AS location
      FROM target.customer_detail LATERAL VIEW explode(properties_owned) exploded_tab AS prop_owned
   ) trg
   GROUP BY trg.id
) t2
ON t1.id=t2.id
WHERE t1.newcol!=t2.newcol
)

Hope someone finds this useful and helpful. :-)

[Comments]:

[Answer 3]:
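One caveat with the query above: concat_ws joins all addresses first and then all locations, and collect_list is not guaranteed to return elements in a fixed order, so two arrays holding the same elements may still produce different strings. A possible variation, offered as an assumption rather than as part of the original answer, pairs each address with its location and sorts the pairs before joining them. The '|' separator is arbitrary and is assumed not to occur in the data.

```sql
-- Sketch: build an order-insensitive string per id from an array of structs.
-- The same sub-query shape would be used for target.customer_detail.
SELECT src.id,
       concat_ws(',',
          sort_array(                                             -- sort so element order does not matter
             collect_list(
                concat_ws('|', src.property_addr, src.location)   -- keep each address with its location
             )
          )
       ) AS newcol
FROM
(
   SELECT id, prop_owned.property_addr AS property_addr, prop_owned.location AS location
   FROM source.customer_detail LATERAL VIEW explode(properties_owned) exploded_tab AS prop_owned
) src
GROUP BY src.id;
```

Note that explode produces no rows for an empty array, so ids whose properties_owned is empty on either side drop out of the inner join between the two sub-queries.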

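Problem: you are trying to compare lists rather than primitive types.

Current situation: there is no way to compare lists of complex objects directly with the built-in Hive UDFs (there are some workarounds for lists of strings).

Workaround: you will need some third-party UDFs to help you with this. There are some interesting UDFs here (I have not tested them).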
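For lists of strings, one such workaround is concat_ws, which accepts an array<string> directly. A minimal sketch, assuming two hypothetical tables tags_source and tags_target that each have an id column and an array<string> column named tags (none of these names come from the question):

```sql
-- Sketch: compare two array<string> columns by flattening them to strings.
-- tags_source, tags_target and tags are hypothetical names for illustration.
SELECT s.id
FROM tags_source s
INNER JOIN tags_target t
ON s.id = t.id
-- sort_array makes the comparison independent of element order
WHERE concat_ws(',', sort_array(s.tags)) != concat_ws(',', sort_array(t.tags));
```

The separator should be a character that cannot appear inside the values, otherwise different arrays could flatten to the same string.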

[Comments]:
