SQL Self Join to compare data different by days


【Title】: SQL Self Join to compare data different by days 【Posted】: 2020-08-06 18:58:26 【Question】:

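I want to compare the set of products across different days. The goal is to get the differences between day 1 and day 2, between day 2 and day 3, and so on.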

Product  EventTime
X1       T1
X2       T1
X1       T2
X3       T2
X4       T10

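Notes

Event times are not evenly spaced (it could be day 1, day 2, and then day 10).
A product is identified by multiple attributes, but I use a single field here to illustrate the problem.

Expected result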

Product  Action   EventTime
X1       Added    T1
X2       Added    T1
X2       Removed  T2
X3       Added    T2
X1       Removed  T10
X3       Removed  T10
X4       Added    T10

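My idea was to give these records a row number and do a full outer join to find the differences, but I cannot get the correct result.

My thought process: let's rank the rows by event time.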

Product  EventTime  RNK
X1       T1         1
X2       T1         1
X1       T2         2
X3       T2         2
X4       T10        3

If we do this

select 
  * 
from 
    dataset d1 
full join 
    dataset d2
        on d1.product = d2.product
        and d1.RNK = d2.RNK - 1
where
    d1.product is null or d2.product is null

it does not give me the correct result. But if I first clean the data so that the two sides of the join look like this

Product  EventTime  RNK
X1       T1         1    (crossed out)
X2       T1         1    (crossed out)
X1       T2         2
X3       T2         2
X4       T10        3

Product  EventTime  RNK
X1       T1         1
X2       T1         1
X1       T2         2
X3       T2         2
X4       T10        3    (crossed out)

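and do a full join on the two datasets above, I get the correct result, but the performance is slow. Basically, I got rid of the first rank in one copy and the last rank in the other.

Any ideas on how to get the differences between two sets in day sequence?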
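A likely reason the plain full join looks wrong is that rows of the last rank on the earlier side never find a partner and show up as removals, and an unmatched earlier row carries its own event time, not the time at which the product disappeared. The sketch below is one possible way to express the rank-based comparison as two outer joins glued with UNION ALL. It is not the asker's exact query: it runs on SQLite through Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions), the table name dataset comes from the question, and T1, T2, T10 are encoded as the integers 1, 2, 10 purely for illustration.

```python
import sqlite3

# Sample rows from the question; T1, T2, T10 are encoded as 1, 2, 10 (an assumption).
ROWS = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10)]

QUERY = """
WITH ranked AS (
    -- DENSE_RANK so that all products of one event time share one rank
    SELECT product, event_time,
           DENSE_RANK() OVER (ORDER BY event_time) AS rnk
    FROM dataset
),
times AS (
    SELECT DISTINCT event_time, rnk FROM ranked
)
-- Added: present at this rank, absent at the previous rank.
SELECT cur.product, 'Added' AS action, cur.event_time
FROM ranked cur
LEFT JOIN ranked prev
       ON prev.product = cur.product AND prev.rnk = cur.rnk - 1
WHERE prev.product IS NULL
UNION ALL
-- Removed: present at this rank, absent at the next rank.
-- The inner join to times supplies the next event time and drops the last rank.
SELECT cur.product, 'Removed', nxt_time.event_time
FROM ranked cur
JOIN times nxt_time ON nxt_time.rnk = cur.rnk + 1
LEFT JOIN ranked nxt
       ON nxt.product = cur.product AND nxt.rnk = cur.rnk + 1
WHERE nxt.product IS NULL
ORDER BY 3, 1
"""


def day_over_day_changes(rows):
    """Return (product, action, event_time) tuples ordered by time, then product."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE dataset (product TEXT, event_time INTEGER)")
        conn.executemany("INSERT INTO dataset VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in day_over_day_changes(ROWS):
        print(row)
```

Each half only probes for a matching row at the neighbouring rank, so an index on (product, event_time) may help on a real table; whether that beats a full join depends on the database and the data.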

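【Comments on the question】:

I don't follow the logic. Can you explain?
Yes. I have product data for every day. I want to compare the products of day 1 with those of day 2, the products of day 2 with those of day 3, and so on.

【Answer 1】:

Hmmm . . . This looks like a gaps-and-islands problem. You can get the periods during which each product is present using: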

select product, min(time), max(time)
from (select t.*,
             -- dense_rank(), not row_number(): products sharing an event time must share one global sequence number
             dense_rank() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p
      from t
     ) t
group by product, (seqnum_p - seqnum);
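To see why the difference of the two sequence numbers identifies an island, here is a small runnable sketch of the same idea on SQLite via Python's sqlite3 module (SQLite 3.25 or later is assumed). It uses dense_rank() for the global sequence so that all products of one event time get the same number. The table name dataset, the integer encoding of T1, T2, T10 as 1, 2, 10, and the extra row ("X2", 10) are illustrative assumptions; the extra row is not in the question and is there only so that one product has two separate islands.

```python
import sqlite3

# Rows from the question (T1, T2, T10 encoded as 1, 2, 10) plus one
# hypothetical row, ("X2", 10), so that X2 disappears and comes back.
ROWS = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10), ("X2", 10)]

QUERY = """
SELECT product,
       MIN(event_time) AS first_seen,
       MAX(event_time) AS last_seen
FROM (
    SELECT product, event_time,
           -- position of the event time among all distinct event times
           DENSE_RANK() OVER (ORDER BY event_time) AS seqnum,
           -- position of the row within its own product
           ROW_NUMBER() OVER (PARTITION BY product ORDER BY event_time) AS seqnum_p
    FROM dataset
)
-- Both counters advance by 1 while a product appears at consecutive event
-- times, so their difference is constant within one island.
GROUP BY product, seqnum - seqnum_p
ORDER BY product, first_seen
"""


def presence_islands(rows):
    """Return (product, first_seen, last_seen), one tuple per unbroken run."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE dataset (product TEXT, event_time INTEGER)")
        conn.executemany("INSERT INTO dataset VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in presence_islands(ROWS):
        print(row)
```

For X2 the difference is 0 at time 1 and 1 at time 10, because the global counter moved past time 2 while the per-product counter did not; that jump is what splits the two islands.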

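Getting the removal time is a bit trickier . . . you need a lead()-style look-ahead (here a windowed min() over later times) and some fancy aggregation: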

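select product, min(time), max(time),
       -- next_time of the latest row in the island, i.e. the first event time after the island ends
       max(next_time) keep (dense_rank first order by time desc) as next_time
from (select t.*,
             -- dense_rank(), not row_number(): products sharing an event time must share one global sequence number
             dense_rank() over (order by time) as seqnum,
             row_number() over (partition by product order by time) as seqnum_p,
             min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
      from t
     ) t
group by product, (seqnum_p - seqnum);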

This might be enough for what you need. But you can unpivot it:

with cte as (
      select product, min(time) as min_time, 
             max(next_time) keep (dense_rank first order by time desc) as next_time
      from (select t.*,
                   -- dense_rank(), not row_number(): products sharing an event time must share one global sequence number
                   dense_rank() over (order by time) as seqnum,
                   row_number() over (partition by product order by time) as seqnum_p,
                   min(time) over (order by time range between interval '1' second following and unbounded following) as next_time
            from t
           ) t
      group by product, (seqnum_p - seqnum)
     )
select product, 'Added' as action, min_time as event_time
from cte
union all
select product, 'Removed', next_time
from cte
-- a product still present at the last event time has no removal
where next_time is not null;

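【Comments】:

Haha. Thanks for the solution. I find it a bit hard to understand. Would my full outer join idea work? Basically I am trying to join day 1 with day 2, day 2 with day 3, and so on.
You seem to be missing some syntax in the max() keep function? dense_rank first or last? And over (what)?
@MatthewMcPeak . . . Thank you. Why can't Oracle just call the function first()?

【Answer 2】:

One way to do this is to treat it as a "sparse data" problem. That is, you have events in time, but not every product is represented at every event.

A partitioned outer join can fill in the sparse data, giving a data set in which every product is represented at every event time. From there it is much easier to see what was added and what was removed at each time.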

with event_table (product, event_time) as 
( SELECT 'X1',  trunc(sysdate)+1 FROM DUAL UNION ALL
  SELECT 'X2',  trunc(sysdate)+1 FROM DUAL UNION ALL 
  SELECT 'X1',  trunc(sysdate)+2 FROM DUAL UNION ALL  
  SELECT 'X3',  trunc(sysdate)+2 FROM DUAL UNION ALL  
  SELECT 'X4',  trunc(sysdate)+10 FROM DUAL ),
  -- solution begins here
  -- start by getting a distinct list of event times
  distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
  -- Next, do a partitioned right join to ensure that every product is represented at every event time.  If the row is sparse data that was added by the right join, et.event_time will be null.
  -- We use the lag() function to see what the product looked like at the last event and
  -- compare with the current event.
  -- NULL -> NULL ==> no change
  -- NOT NULL -> NOT NULL ==> no change
  -- NULL -> NOT NULL ==> added
  -- NOT NULL -> NULL ==> removed
  sparse_data_filled as (
    select dt.event_time, et.product,
           case when lag(et.event_time) over ( partition by et.product order by dt.event_time ) is null then
                  -- product wasn't present during last event
                  case when et.event_time is null then
                         -- product still is not present
                         null  -- no change
                       else
                         -- product is present now and was not before
                         'Added'
                  end
                else
                  -- product was present during last event
                  case when et.event_time is null then
                         -- product is no longer present
                         'Removed'
                       else
                         -- product is still present
                         null   -- no change
                  end
           end message
    from event_table et partition by (product)
    right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from sparse_data_filled
-- filter out the non-changes
where message is not null
order by event_time, product
;
+------------+---------+---------+
| EVENT_TIME | PRODUCT | MESSAGE |
+------------+---------+---------+
| 07-AUG-20  | X1      | Added   |
| 07-AUG-20  | X2      | Added   |
| 08-AUG-20  | X2      | Removed |
| 08-AUG-20  | X3      | Added   |
| 16-AUG-20  | X1      | Removed |
| 16-AUG-20  | X3      | Removed |
| 16-AUG-20  | X4      | Added   |
+------------+---------+---------+

A more compact, solution-only version (without the test data):

WITH 
  distinct_times as ( SELECT DISTINCT event_time FROM event_table ),
  changes as (
select dt.event_time, et.product,
case nvl2(et.event_time,1,0) - nvl2(lag(et.event_time ) over ( partition by et.product order by dt.event_time ),1,0)
       when +1 then 'Added'
       when -1 then 'Removed'
    end message
from event_table et partition by (product) 
right join distinct_times dt on et.event_time = dt.event_time )
SELECT * from changes
where message is not null
order by event_time, product
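
The partitioned outer join is Oracle syntax. As a rough sketch of the same densify-then-compare idea on an engine without it, the query below builds the product-by-time grid with a cross join of the distinct products and distinct event times, left-joins the real rows onto it, and compares each cell with the previous one using lag(). It runs on SQLite via Python's sqlite3 module (SQLite 3.25 or later is assumed); the CTE names distinct_products and filled, the column names present and delta, and the integer encoding of the event times are illustrative assumptions, not part of the answer above.

```python
import sqlite3

# Sample rows from the question; event times encoded as 1, 2, 10 (an assumption).
ROWS = [("X1", 1), ("X2", 1), ("X1", 2), ("X3", 2), ("X4", 10)]

QUERY = """
WITH distinct_times AS (SELECT DISTINCT event_time FROM event_table),
     distinct_products AS (SELECT DISTINCT product FROM event_table),
     -- one row per (product, event time); present is 1 if a real row exists
     filled AS (
         SELECT p.product, dt.event_time,
                et.event_time IS NOT NULL AS present
         FROM distinct_products p
         CROSS JOIN distinct_times dt
         LEFT JOIN event_table et
                ON et.product = p.product AND et.event_time = dt.event_time
     ),
     -- +1 means absent -> present, -1 means present -> absent, 0 means no change
     changes AS (
         SELECT event_time, product,
                present - LAG(present, 1, 0) OVER (
                    PARTITION BY product ORDER BY event_time) AS delta
         FROM filled
     )
SELECT event_time, product,
       CASE delta WHEN 1 THEN 'Added' ELSE 'Removed' END AS message
FROM changes
WHERE delta <> 0
ORDER BY event_time, product
"""


def product_changes(rows):
    """Return (event_time, product, message) tuples for every add or removal."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE event_table (product TEXT, event_time INTEGER)")
        conn.executemany("INSERT INTO event_table VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in product_changes(ROWS):
        print(row)
```

The grid has (number of products) x (number of event times) rows, so this form trades memory for simplicity; with many products and many event times the join-based or islands-based approaches may scale better.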
