Hive SQL: how do I check multiple previous rows for the same reference?

【Posted】2021-06-08 08:20:27 【Question】:

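I have a large dataset containing a reference, an asset, and start and end dates. For each asset I want to assign a key, starting from 1, by reference, reusing the same key when the reference is the same and the start and end dates are consecutive, so that I end up with: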

Asset   Ref     Start       End         Key
A23BCD  12345   01/01/1900  01/01/2020  1
A23BCD  12345   02/01/2020  17/06/2020  1
A23BCD  67890   01/09/2020  31/10/2020  2
A23BCD  77777   01/11/2020  31/12/9999  3

I am working with the data in Hadoop and assigning the key with HiveQL, but this only checks the previous 5 rows:

create table temp_user.a1
row format delimited fields terminated by '\001'
stored as orc tblproperties("ORC.COMPRESS"="SNAPPY","ORC.COMPRESS.SIZE"="16384") as
select  a.*
        ,LAG(ref)      OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref
        ,LAG(endDt)    OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt
        ,LAG(rowNum)   OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum
        ,LAG(ref,2)    OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_1
        ,LAG(endDt,2)  OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_1
        ,LAG(rowNum,2) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_1
        ,LAG(ref,3)    OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_2
        ,LAG(endDt,3)  OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_2
        ,LAG(rowNum,3) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_2
        ,LAG(ref,4)    OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_3
        ,LAG(endDt,4)  OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_3
        ,LAG(rowNum,4) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_3
        ,LAG(ref,5)    OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_4
        ,LAG(endDt,5)  OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_4
        ,LAG(rowNum,5) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_4
from temp_user.BigDataSet a;

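I can then compare the previous reference and previous end date with the current reference and current start date.

Is there a better way to get at the earlier records than multiple LAGs?

Would it be better to join the big dataset to itself on asset and reference, with end date = (start date - 1)? And how would I then assign the key?

Thanks, D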

【Answer 1】:

You should be able to do this with a single LAG, and have it work for any number of previous records.

I wrote this up for Oracle years ago:

https://betteratoracle.com/posts/35-collapsing-continuous-ranges-into-single-rows

That does most of what you need. The query below gives the idea, but I don't have Hive to test it.

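-- Untested sketch adapted to Hive. Assumes startDt and endDt are DATE columns
-- (START, END and TABLE are reserved words in Hive, so the question's names are used).
select t.*,
       -- running max carries the latest non-null grp forward within each asset/ref
       max(t.grp) over (partition by asset, ref order by startDt) as grp_id
from
(
  select p.*,
         case
           -- starts at most one day after the previous row of the same ref ended: same group
           when datediff(startDt, prev_endDt) <= 1 then null
           -- otherwise (including the first row, where prev_endDt is null) a new group starts
           else rn
         end as grp
  from
  (
    select a.*,
           lag(endDt) over (partition by asset, ref order by startDt) as prev_endDt,
           -- Hive's counterpart to Oracle's rownum
           row_number() over (partition by asset, ref order by startDt) as rn
    from temp_user.BigDataSet a
  ) p
) t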
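
The query above only labels the groups; it does not yet produce a key that starts at 1 for each asset. One way to get there, offered as an untested sketch rather than a definitive solution, is to flag each row that starts a new run, take a running sum of those flags, and then rank the runs within the asset by their first start date. It assumes that startDt and endDt are DATE columns (or 'yyyy-MM-dd' strings) and that runs should be numbered in date order; the CTE names and the asset_key, is_new_run, run_no and run_start columns are made up for illustration.

```sql
WITH lagged AS (
  SELECT a.asset, a.ref, a.startDt, a.endDt,
         -- end date of the previous row for the same asset and ref
         LAG(a.endDt) OVER (PARTITION BY asset, ref ORDER BY startDt) AS prev_endDt
  FROM temp_user.BigDataSet a
),
flagged AS (
  SELECT l.*,
         -- 0 when the row continues the previous one, 1 when it starts a new run
         -- (a NULL prev_endDt falls through to ELSE, so the first row gets 1)
         CASE WHEN datediff(startDt, prev_endDt) <= 1 THEN 0 ELSE 1 END AS is_new_run
  FROM lagged l
),
runs AS (
  SELECT f.*,
         -- running count of run starts = run number within the asset/ref
         SUM(is_new_run) OVER (PARTITION BY asset, ref ORDER BY startDt
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS run_no
  FROM flagged f
),
run_starts AS (
  SELECT r.*,
         -- first start date of the run, used to order runs within the asset
         MIN(startDt) OVER (PARTITION BY asset, ref, run_no) AS run_start
  FROM runs r
)
SELECT s.asset, s.ref, s.startDt, s.endDt,
       DENSE_RANK() OVER (PARTITION BY asset ORDER BY run_start, ref) AS asset_key
FROM run_starts s;
```

On the four sample rows in the question this should give keys 1, 1, 2, 3. Change `<= 1` to `= 1` if overlapping ranges should not be merged into the same key.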
