SQL Self Join to compare data different by days
Posted: 2020-08-06 18:58:26
Question:
I want to compare the products present on different days. The goal is to get the differences between day 1 and day 2, between day 2 and day 3, and so on.
Product EventTime
X1 T1
X2 T1
X1 T2
X3 T2
X4 T10
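Notes
Event times are not regular (they might be day 1, day 2, and then day 10).
A product is represented by multiple attributes, but I use a single field here to illustrate the problem.
Expected result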
Product Action EventTime
X1 Added T1
X2 Added T1
X2 Removed T2
X3 Added T2
X1 Removed T10
X3 Removed T10
X4 Added T10
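My idea was to give these records a row number and do a full outer join to find the differences, but I can't get the correct result.
My thought process: let's rank by event time.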
Product EventTime RNK
X1 T1 1
X2 T1 1
X1 T2 2
X3 T2 2
X4 T10 3
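A ranking like the one above could be produced with dense_rank(). This is only a minimal sketch: it assumes the data lives in a table called dataset (the name used in the query in this question) with columns product and event_time, and the event_time column name is an assumption.

```sql
-- Assumed names: table dataset(product, event_time)
select product,
       event_time,
       -- dense_rank: rows with the same event time share a rank,
       -- and ranks stay consecutive (1, 2, 3) even when days are skipped
       dense_rank() over (order by event_time) as rnk
from dataset;
```

dense_rank() is used here rather than row_number() because a join on RNK - 1 only lines up "one event with the next" when every row of an event carries the same rank.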
If we do this:
select
    *
from
    dataset d1
    full join
    dataset d2
        on d1.product = d2.product
       and d1.RNK = d2.RNK - 1
where
    d1.product is null or d2.product is null
it does not give me the correct result. But if I first clean the data so that it looks like this:
Product EventTime RNK
X1 T1 1 (crossed out)
X2 T1 1 (crossed out)
X1 T2 2
X3 T2 2
X4 T10 3
Product EventTime RNK
X1 T1 1
X2 T1 1
X1 T2 2
X3 T2 2
X4 T10 3 (crossed out)
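and do a full join on the two datasets above, I get the correct result, but performance is slow. Basically, I removed the first rank from one copy and the last rank from the other.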
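For illustration, the full outer join idea can be written without cleaning the data first. This is a hedged sketch rather than the asker's actual query: it assumes a table dataset(product, event_time) (the column name event_time is an assumption) and uses a helper CTE, times, that is introduced here only to look up the event time for a given rank.

```sql
with ranked as (
  select product,
         event_time,
         dense_rank() over (order by event_time) as rnk
  from dataset
),
times as (
  -- one row per event, with its rank
  select distinct event_time, rnk from ranked
)
select coalesce(d2.product, d1.product) as product,
       case when d1.product is null then 'Added' else 'Removed' end as action,
       t.event_time
from ranked d1                      -- the earlier event
full join ranked d2                 -- the following event
  on  d1.product = d2.product
  and d1.rnk     = d2.rnk - 1
join times t
  -- an added row reports its own event; a removed row reports the next event
  on t.rnk = coalesce(d2.rnk, d1.rnk + 1)
where d1.product is null
   or d2.product is null
order by 3, 1;
```

The inner join to times also drops the spurious rows for the last event: its products have no match in d2, but rank + 1 does not exist in times, so they are not reported as removed.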
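Any ideas for getting the differences between two sets in day sequence?
Comments:
I don't follow the logic. Can you explain?
Yes. I have product data for each day. I want to compare the products of day 1 with those of day 2, the products of day 2 with those of day 3, and so on.
Answer 1:
Hmm. . . This looks like a gaps-and-islands problem. You can get the periods for each product using: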
select product, min(time), max(time)
from (select t.*,
             row_number() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p
      from t
     ) t
group by product, (seqnum_p - seqnum);
Getting the removal time is a bit trickier. . . You need a lead()-style lookup of the next event time and some fancy aggregation:
select product, min(time), max(time),
       max(next_time) keep (dense_rank first order by time desc) as next_time
from (select t.*,
             row_number() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p,
             -- earliest event time strictly after the current row's time
             min(time) over (order by time
                             range between interval '1' second following
                                       and unbounded following) as next_time
      from t
     ) t
group by product, (seqnum_p - seqnum);
That may be enough for your needs. But you can unpivot it:
with cte as (
      select product, min(time) as min_time,
             max(next_time) keep (dense_rank first order by time desc) as next_time
      from (select t.*,
                   row_number() over (order by time) as seqnum,
                   row_number() over (partition by product order by time) as seqnum_p,
                   -- earliest event time strictly after the current row's time
                   min(time) over (order by time
                                   range between interval '1' second following
                                             and unbounded following) as next_time
            from t
           ) t
      group by product, (seqnum_p - seqnum)
     )
select product, 'Added', min_time
from cte
union all
select product, 'Removed', next_time
from cte;
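As a reference for the aggregate used above, here is a small self-contained sketch of Oracle's FIRST/LAST form, `KEEP (DENSE_RANK FIRST ORDER BY ...)`. The inline rows and the column name event_time are made up for illustration and are not part of the original answer.

```sql
with t (product, event_time, next_time) as (
  select 'X1', date '2020-08-07', date '2020-08-08' from dual union all
  select 'X1', date '2020-08-08', date '2020-08-16' from dual
)
select product,
       -- among the rows of each group, keep only those that sort first by
       -- event_time desc (the latest row), then take max(next_time) over them
       max(next_time) keep (dense_rank first order by event_time desc) as next_time
from t
group by product;
-- for X1 this returns the next_time of the latest row (16 Aug 2020)
```

Note that there is no OVER keyword inside the parentheses; an OVER (partition clause) may follow the closing parenthesis only when the expression is used as an analytic function.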
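Comments:
Haha. Thanks for the solution. I find it a bit hard to follow. Would my full outer join idea work? Basically I am trying to join day 1 with day 2, day 2 with day 3, and so on.
You seem to be missing some syntax in the max() keep function? dense_rank first or last? And over (what)?
@MatthewMcPeak . . . Thank you. Why can't Oracle just call the function first()?
Answer 2: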
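One way to do this is to treat it as a "sparse data" problem. That is, you have time events, but not every product is represented at every event.
A partitioned outer join can fill in the sparse data, producing a data set in which every product is represented at every event time. It is then easier to see what was added and removed at each event.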
with event_table (product, event_time) as
  ( SELECT 'X1', trunc(sysdate)+1 FROM DUAL UNION ALL
    SELECT 'X2', trunc(sysdate)+1 FROM DUAL UNION ALL
    SELECT 'X1', trunc(sysdate)+2 FROM DUAL UNION ALL
    SELECT 'X3', trunc(sysdate)+2 FROM DUAL UNION ALL
    SELECT 'X4', trunc(sysdate)+10 FROM DUAL ),
-- solution begins here
-- start by getting a distinct list of event times
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
-- Next, do a partitioned right join to ensure that every product is represented
-- at every event time. If the row is sparse data that was added by the right
-- join, et.event_time will be null.
-- We use the lag() function to see what the product looked like at the last event and
-- compare with the current event.
-- NULL -> NULL ==> no change
-- NOT NULL -> NOT NULL ==> no change
-- NULL -> NOT NULL ==> added
-- NOT NULL -> NULL ==> removed
sparse_data_filled as (
  select dt.event_time, et.product,
         case when lag(et.event_time) over ( partition by et.product order by dt.event_time ) is null then
                -- product wasn't present during last event
                case when et.event_time is null then
                       -- product still is not present
                       null -- no change
                     else
                       -- product is present now and was not before
                       'Added'
                end
              else
                -- product was present during last event
                case when et.event_time is null then
                       -- product is no longer present
                       'Removed'
                     else
                       -- product is still present
                       null -- no change
                end
         end message
  from event_table et partition by (product)
  right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from sparse_data_filled
-- filter out the non-changes
where message is not null
order by event_time, product
;
+------------+---------+---------+
| EVENT_TIME | PRODUCT | MESSAGE |
+------------+---------+---------+
| 07-AUG-20  | X1      | Added   |
| 07-AUG-20  | X2      | Added   |
| 08-AUG-20  | X2      | Removed |
| 08-AUG-20  | X3      | Added   |
| 16-AUG-20  | X1      | Removed |
| 16-AUG-20  | X3      | Removed |
| 16-AUG-20  | X4      | Added   |
+------------+---------+---------+
A more compact, solution-only version (no test data):
WITH
distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
changes as (
  select dt.event_time, et.product,
         case nvl2(et.event_time,1,0) - nvl2(lag(et.event_time) over ( partition by et.product order by dt.event_time ),1,0)
           when +1 then 'Added'
           when -1 then 'Removed'
         end message
  from event_table et partition by (product)
  right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from changes
where message is not null
order by event_time, product
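The compact version relies on nvl2(expr, a, b), which returns a when expr is not null and b otherwise, so "present now" minus "present before" is +1, -1 or 0. The following sketch spells out the four cases; the CTE states and its columns prev_time and cur_time are hypothetical names used only for this illustration.

```sql
with states (prev_time, cur_time) as (
  select cast(null as date), cast(null as date) from dual union all
  select cast(null as date), sysdate            from dual union all
  select sysdate,            cast(null as date) from dual union all
  select sysdate,            sysdate            from dual
)
select nvl2(prev_time, 1, 0) as was_present,
       nvl2(cur_time, 1, 0)  as is_present,
       -- 1 - 0 = +1 (added), 0 - 1 = -1 (removed), otherwise 0 (no change, null)
       case nvl2(cur_time, 1, 0) - nvl2(prev_time, 1, 0)
         when +1 then 'Added'
         when -1 then 'Removed'
       end as message
from states;
```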