Hive: Find unique values if condition met between two tables

Posted 2023-03-23


【Asked】: 2021-05-14 16:09:33 【Question】:

I have two tables. Table 1 contains all the unique places I am interested in (30 rows):

places
japan
china
india
...

Table 2 contains the ID, the place visited, and the date for every visit.

id	places	date
10001	japan	20210204
10001	australia	20210204
10001	china	20210204
10001	argentina	20210205
10002	spain	20210204
10002	india	20210204
10002	china	20210205
10003	argentina	20210204
10003	portugal	20210204

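What I want to get is:

For a specific date (say 20210204), find all unique IDs in Table 2 that visited at least one of the places in Table 1, and save those unique IDs to a temporary table.

This is what I tried: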

create temporary table imp.unique_ids_tmp
as select distinct(final.id) from
(select t2.id
from table2 as t2
where t2.date = '20210204'
and t2.places in 
(select * from table1)) final;

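I am struggling to incorporate the "at least one" logic, so that once a qualifying id is found, the query stops looking at the remaining records for that id.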


【Answer 1】:

Use left semi join (an efficient implementation of uncorrelated EXISTS). It keeps only the table2 rows that have a match in table1; then apply distinct:

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.date = '20210204'
;

This satisfies the "at least one" condition: IDs with no joined records do not appear in the result.

Another way is to use a correlated EXISTS:

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204' 
   --this condition is true as soon as one match is found
   and exists (select 1 from table1 t1 where t2.places = t1.places)
;

IN works as well.
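For completeness, a minimal sketch of the IN variant is shown here. It reuses the table and column names from the question; naming the single column in the subquery explicitly (rather than select *) and quoting `date` with backticks are assumptions made to stay portable across Hive versions, where date may be treated as a reserved keyword.

```sql
-- IN with a subquery: keep table2 rows whose place appears in table1.
-- The subquery returns exactly one column, named explicitly.
create temporary table imp.unique_ids_tmp as
select distinct t2.id
  from table2 t2
 where t2.`date` = '20210204'  -- backticks: assumption, in case date is reserved
   and t2.places in (select t1.places from table1 t1)
;
```

An id that matches several places on that date still appears only once, because distinct collapses the duplicates.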

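Correlated EXISTS looks close to "once a qualifying id is found, it stops looking at that id's records", but in Hive all of these approaches are implemented as a JOIN. Run EXPLAIN and you will see that they produce the same plan, though this depends on the implementation in your Hive version. EXISTS may be faster, because it does not need to check every record in the subquery. Since your 30-row table1 is small enough to fit in memory, a map join (set hive.auto.convert.join=true;) will give you the best performance.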
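One way to check this on your own cluster is sketched below: enable automatic map-join conversion, then ask Hive for the plan instead of running the query. The exact plan text varies by Hive version and execution engine, so treat the operator name mentioned in the comment as something to look for, not a guaranteed output.

```sql
-- Allow Hive to convert the join into a map join when the small side fits in memory.
set hive.auto.convert.join=true;

-- EXPLAIN prints the execution plan without running the query.
-- With the 30-row table1, a "Map Join Operator" would be expected in the plan.
explain
select distinct t2.id
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.`date` = '20210204'  -- backticks: assumption, in case date is reserved
;
```

Running the same EXPLAIN on the EXISTS and IN variants lets you compare the three plans side by side.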

An even faster approach is to use an array or IN (static_list). It is suitable for small, static lists. A sorted array may give you better performance:

select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204'
       and array_contains(array('australia', 'china', 'japan', ... ), t2.places)
       --OR use t2.places IN ('australia', 'china', 'japan', ... )

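Why this approach is faster: there is no need to start mappers and compute splits to read the lookup table (table1) from HDFS; only table2 is read. The drawback is that the list of values is static. On the other hand, you can pass the whole list as a parameter.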
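Passing the list as a parameter could look roughly like the sketch below, which uses Hive's variable substitution. The variable name places, the script name unique_ids.sql, and the three sample values are hypothetical; in practice the variable would carry the full list from table1.

```sql
-- Hypothetical invocation (script and variable names are placeholders):
--   hive --hivevar places="'japan','china','india'" -f unique_ids.sql

-- Hive substitutes ${hivevar:places} textually before parsing the query,
-- so the value must already be a valid, quoted, comma-separated list.
select distinct t2.id
  from table2 t2
 where t2.`date` = '20210204'  -- backticks: assumption, in case date is reserved
   and t2.places in (${hivevar:places})
;
```

Because the substitution is plain text replacement, the list should come from a trusted source.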
