Hive SQL, How do I check multiple previous rows for the same reference
【Posted】: 2021-06-08 08:20:27
【Question】: I have a large dataset containing reference, asset, start date and end date. For each asset I want to assign a key by reference, starting at 1. Consecutive rows should share a key when the reference is the same and the start and end dates follow on from each other, so that I end up with:
Asset   Ref    Start       End         Key
A23BCD  12345  01/01/1900  01/01/2020  1
A23BCD  12345  02/01/2020  17/06/2020  1
A23BCD  67890  01/09/2020  31/10/2020  2
A23BCD  77777  01/11/2020  31/12/9999  3
I am working with the data in Hadoop and assigning the key with HiveQL, but the query below only looks back over the previous 5 rows:
create table temp_user.a1
row format delimited fields terminated by '\001'
stored as orc tblproperties("ORC.COMPRESS"="SNAPPY","ORC.COMPRESS.SIZE"="16384") as
select a.*
,LAG(ref) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref
,LAG(endDt) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt
,LAG(rowNum) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum
,LAG(ref,2) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_1
,LAG(endDt,2) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_1
,LAG(rowNum,2) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_1
,LAG(ref,3) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_2
,LAG(endDt,3) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_2
,LAG(rowNum,3) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_2
,LAG(ref,4) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_3
,LAG(endDt,4) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_3
,LAG(rowNum,4) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_3
,LAG(ref,5) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_ref_4
,LAG(endDt,5) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_endDt_4
,LAG(rowNum,5) OVER (PARTITION BY asset ORDER BY asset, endDt, startDt) AS prev_rownum_4
from temp_user.BigDataSet a;
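I can then compare the previous reference and previous end date with the current reference and current start date.
Is there a better way to get at the earlier records than multiple LAGs?
Would it be better to join the big dataset to itself on asset and ref, with end date = (start date - 1)? How would I then assign the key?
Thanks, D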
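【Answer 1】: You should be able to do this with a single lag, and have it work for any number of previous records.
I wrote this up for Oracle many years ago:
https://betteratoracle.com/posts/35-collapsing-continuous-ranges-into-single-rows
That does largely what you need. The query below gives the idea, but I don't have Hive to test it on.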
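select t.*,
       -- the running max carries each range's first row number forward over its continuation rows
       max(t.grp) over (partition by asset order by ref, startDt) as grp_id
from
(
  select a.*,
         case
           -- continuation row: starts no more than one day after the previous row
           -- for the same asset and ref ended (assumes date or yyyy-MM-dd values)
           when datediff(startDt, lag(endDt) over (partition by asset, ref order by startDt)) <= 1 then
             null
           else
             -- row_number() stands in for the Oracle rownum used in the original post
             row_number() over (partition by asset order by ref, startDt)
         end as grp
  from temp_user.BigDataSet a
) t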
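
The single-lag idea can be tried on the sample rows from the question without a Hive cluster. The sketch below is not the answer author's query: it flags each row that starts a new range and takes a running sum of the flags, which yields the 1, 2, 3 numbering the question asks for. It runs the SQL through Python's sqlite3 module (SQLite 3.25 or later is needed for window functions). The table name big_data_set and the columns is_new and asset_key are made up for illustration, the dates are assumed to be stored as yyyy-MM-dd strings, and julianday() is SQLite's date arithmetic; in Hive the equivalent test would presumably be written with datediff().

```python
import sqlite3

# Sample rows from the question, with the dates rewritten as ISO yyyy-MM-dd
# (an assumption: the question displays them as dd/MM/yyyy).
rows = [
    ("A23BCD", "12345", "1900-01-01", "2020-01-01"),
    ("A23BCD", "12345", "2020-01-02", "2020-06-17"),
    ("A23BCD", "67890", "2020-09-01", "2020-10-31"),
    ("A23BCD", "77777", "2020-11-01", "9999-12-31"),
]

# Inner query: one LAG per column marks a row with 0 when it continues the
# previous row (same ref, starts exactly one day after the previous end),
# and with 1 otherwise. Outer query: a running sum of the marks per asset.
QUERY = """
SELECT asset, ref, startDt, endDt,
       SUM(is_new) OVER (PARTITION BY asset
                         ORDER BY startDt, endDt
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS asset_key
FROM (
    SELECT a.*,
           CASE
             WHEN LAG(ref) OVER (PARTITION BY asset ORDER BY startDt, endDt) = ref
              AND julianday(startDt)
                  - julianday(LAG(endDt) OVER (PARTITION BY asset ORDER BY startDt, endDt)) = 1
             THEN 0
             ELSE 1
           END AS is_new
    FROM big_data_set a
) t
ORDER BY asset, startDt, endDt
"""


def assign_keys(data):
    """Load the rows into an in-memory SQLite table and return them with a key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE big_data_set (asset TEXT, ref TEXT, startDt TEXT, endDt TEXT)"
    )
    conn.executemany("INSERT INTO big_data_set VALUES (?, ?, ?, ?)", data)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in assign_keys(rows):
        print(row)
```

Because the first row of each asset has no previous row, both LAG calls return NULL there, the WHEN condition is not true, and the row is marked as the start of a new range. A reference that comes back after a gap also gets a fresh key, which matches the rule stated in the question.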