Selecting incremental data from multiple tables in Hive

Posted: 2017-07-26 12:23:48

Question:

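I have five tables (A, B, C, D, E) in a Hive database, and I have to union the data from these tables based on logic applied to the column "id".

The condition is (pseudocode; each later table contributes only the ids not already present in the earlier tables):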

Select * from A
UNION 
select * from B (except  ids not in A)
UNION 
select * from C (except ids not in A and B)
UNION 
select * from D(except ids not in A,B and C)
UNION 
select * from E(except ids not in A,B,C and D)

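This data must be inserted into a final table.

One approach is to create a target table (target), append the data for each UNION stage, and then join this table against the source of the next UNION stage.

This would be part of my .hql file: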

insert into target 
(select * from A
UNION 
select B.* from 
A 
RIGHT OUTER JOIN B
on A.id=B.id
where ISNULL(A.id));

INSERT INTO target
select C.* from 
target 
RIGHT outer JOIN C
ON target.id=C.id
where ISNULL(target.id);

INSERT INTO target
select D.* from 
target 
RIGHT OUTER JOIN D
ON target.id=D.id
where ISNULL(target.id);

INSERT INTO target
select E.* from 
target 
RIGHT OUTER JOIN E
ON target.id=E.id
where ISNULL(target.id);
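The staged approach above can be checked on toy data before running it on a cluster. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for Hive (SQLite is not Hive, so this only validates the join logic, not performance or Hive syntax). The table contents and the `val` column are hypothetical. It uses a LEFT JOIN from the source table, which is equivalent to the RIGHT OUTER JOIN in the .hql with the sides swapped, and it treats table A like every other stage, since the target starts out empty.

```python
import sqlite3


def staged_merge(tables):
    """Append each table to `target`, skipping ids that are already there.

    tables: list of (name, rows) in priority order; rows are (id, val) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")
    for name, rows in tables:
        conn.execute(f"CREATE TABLE {name} (id INTEGER, val TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
        # Anti-join: keep only the source rows whose id is not yet in target.
        conn.execute(f"""
            INSERT INTO target
            SELECT s.id, s.val
            FROM {name} AS s
            LEFT JOIN target AS t ON t.id = s.id
            WHERE t.id IS NULL
        """)
    result = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
    conn.close()
    return result


# Hypothetical sample data: ids 1 and 2 come from a, 3 from b, 4 from c.
rows = staged_merge([
    ("a", [(1, "a1"), (2, "a2")]),
    ("b", [(2, "b2"), (3, "b3")]),
    ("c", [(1, "c1"), (3, "c3"), (4, "c4")]),
])
print(rows)  # [(1, 'a1'), (2, 'a2'), (3, 'b3'), (4, 'c4')]
```

Each stage reads the whole target again, which is the repeated join/lookup cost the question asks about.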

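Is there a better way to achieve this? I assume we have to do multiple joins/lookups either way. I am looking for the best approach to achieve this in:

1) Hive with Tez

2) Spark-sql

Thanks in advance.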


Answer 1:

If id is unique within each table, row_number can be used instead of rank.

select      *

from       (select      *
                       ,rank () over
                        (
                            partition by    id
                            order by        src
                        )                           as rnk

            from        (           
                                    select 1 as src,* from a
                        union all   select 2 as src,* from b
                        union all   select 3 as src,* from c
                        union all   select 4 as src,* from d
                        union all   select 5 as src,* from e
                        ) t
            ) t

where       rnk = 1
;
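To illustrate the remark about rank versus row_number, here is a small sketch on toy data. It assumes Python's built-in sqlite3 module as a stand-in for Hive and needs a SQLite build with window function support (3.25 or later). The table contents and the `val` column are hypothetical, and only two source tables are used to keep it short. The outer query lists its columns explicitly so that the helper columns src and rnk do not end up in the result.

```python
import sqlite3


def pick_first_source(window_fn):
    """Keep, per id, the rows from the highest-priority table.

    window_fn: "RANK" or "ROW_NUMBER" (the only two values accepted).
    """
    if window_fn not in ("RANK", "ROW_NUMBER"):
        raise ValueError("window_fn must be RANK or ROW_NUMBER")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (id INTEGER, val TEXT)")
    conn.execute("CREATE TABLE b (id INTEGER, val TEXT)")
    # Hypothetical data: id 1 appears twice inside table a.
    conn.executemany("INSERT INTO a VALUES (?, ?)",
                     [(1, "a1"), (1, "a1-dup"), (2, "a2")])
    conn.executemany("INSERT INTO b VALUES (?, ?)",
                     [(2, "b2"), (3, "b3")])
    query = f"""
        SELECT id, val
        FROM (
            SELECT id, val,
                   {window_fn}() OVER (PARTITION BY id ORDER BY src) AS rnk
            FROM (
                SELECT 1 AS src, id, val FROM a
                UNION ALL
                SELECT 2 AS src, id, val FROM b
            ) AS u
        ) AS ranked
        WHERE rnk = 1
        ORDER BY id, val
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


with_rank = pick_first_source("RANK")
with_row_number = pick_first_source("ROW_NUMBER")
print(with_rank)             # both rows for id 1 survive (they tie on src)
print(len(with_row_number))  # one row per id
```

With rank, rows that tie on src within the same id are all kept, so duplicates inside the winning table survive. With row_number exactly one row per id is kept, and which of the duplicates wins is arbitrary. When id is unique within each table, the two give the same result.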


Answer 2:

I think I would try doing it like this:

with ids as (
      select id, min(which) as which
      from (select id, 1 as which from a union all
            select id, 2 as which from b union all
            select id, 3 as which from c union all
            select id, 4 as which from d union all
            select id, 5 as which from e
           ) x
      group by id
     )
select a.*
from a join ids on a.id = ids.id and ids.which = 1
union all
select b.*
from b join ids on b.id = ids.id and ids.which = 2
union all
select c.*
from c join ids on c.id = ids.id and ids.which = 3
union all
select d.*
from d join ids on d.id = ids.id and ids.which = 4
union all
select e.*
from e join ids on e.id = ids.id and ids.which = 5;
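The idea in this answer is to first compute, for every id, the lowest-numbered table that contains it, and then let each table contribute only the ids it "owns". A minimal sketch on toy data follows, assuming Python's built-in sqlite3 module as a stand-in for Hive. The table contents and the `val` column are hypothetical, and only three tables are used. Note the GROUP BY id in the CTE, which the min(which) aggregate needs in order to produce one row per id.

```python
import sqlite3


def merge_with_min_source():
    """Pick each id from the first table (a, then b, then c) that has it."""
    conn = sqlite3.connect(":memory:")
    samples = {
        "a": [(1, "a1"), (2, "a2")],
        "b": [(2, "b2"), (3, "b3")],
        "c": [(1, "c1"), (3, "c3"), (4, "c4")],
    }
    for name, rows in samples.items():
        conn.execute(f"CREATE TABLE {name} (id INTEGER, val TEXT)")
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
    query = """
        WITH ids AS (
            -- For every id, the lowest-numbered table that contains it.
            SELECT id, MIN(which) AS which
            FROM (SELECT id, 1 AS which FROM a UNION ALL
                  SELECT id, 2 AS which FROM b UNION ALL
                  SELECT id, 3 AS which FROM c) AS x
            GROUP BY id
        )
        SELECT a.id, a.val FROM a JOIN ids ON a.id = ids.id AND ids.which = 1
        UNION ALL
        SELECT b.id, b.val FROM b JOIN ids ON b.id = ids.id AND ids.which = 2
        UNION ALL
        SELECT c.id, c.val FROM c JOIN ids ON c.id = ids.id AND ids.which = 3
    """
    # Sort in Python so the output order does not depend on the engine.
    result = sorted(conn.execute(query).fetchall())
    conn.close()
    return result


merged = merge_with_min_source()
print(merged)  # [(1, 'a1'), (2, 'a2'), (3, 'b3'), (4, 'c4')]
```

Compared with the window function approach, this reads every table twice (once for the ids, once for the rows), but the first pass only touches the id column.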

Comments:

Too complicated.
