Full outer join on a table itself and run some window functions

Posted 2023-03-30

[Asked]: 2016-01-27 22:21:50 [Question]:

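Background

I have some ETL jobs that process real-time log files every hour. Whenever the system generates a new event, it takes a snapshot of all historical event summaries (if any exist) and logs them together with the current event. The data is then loaded into Redshift.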

Example

The table looks like this:

+------------+--------------+---------+-----------+-------+-------+
| current_id | current_time | past_id | past_time | freq1 | freq2 |
+------------+--------------+---------+-----------+-------+-------+
|          2 |        time2 |       1 |     time1 |    13 |     5 |
|          3 |        time3 |       1 |     time1 |    13 |     5 |
|          3 |        time3 |       2 |     time2 |     2 |     1 |
|          4 |        time4 |       1 |     time1 |    13 |     5 |
|          4 |        time4 |       2 |     time2 |     2 |     1 |
|          4 |        time4 |       3 |     time3 |     1 |     1 |
+------------+--------------+---------+-----------+-------+-------+

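Here is what happened in the table above:

time1: Event 1 happened. The system took a snapshot, but there was nothing to log.
time2: Event 2 happened. The system took a snapshot and logged event 1.
time3: Event 3 happened. The system took a snapshot and logged events 1 and 2.
time4: Event 4 happened. The system took a snapshot and logged events 1, 2 and 3.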

Desired result

I need to transform the data into the following format for some analysis:

+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
|  1 |      time1 |     0 |     0 |
|  2 |      time2 |    13 |     5 |  --     13 |     5
|  3 |      time3 |    15 |     6 |  -- 13 + 2 | 5 + 1
|  4 |      time4 |    16 |     7 |  -- 15 + 1 | 6 + 1
+----+------------+-------+-------+

Basically, the new freq1 and freq2 are the cumulative sums of the lagged freq1 and freq2.

My idea

I am thinking of a full outer join of the table with itself on current_id and past_id, to first reach the following result:

+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
|  1 |      time1 |    13 |     5 |
|  2 |      time2 |     2 |     1 |
|  3 |      time3 |     1 |     1 |
|  4 |      time4 |  null |  null |
+----+------------+-------+-------+

Then I can apply the window function lag() over (...) followed by sum() over (...).
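The lag-then-sum step can be sketched against the intermediate table directly. The snippet below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or later, for window functions) as a stand-in for Redshift, and the table name per_event and the helper lagged_running_sum are hypothetical names, not part of the original schema.

```python
import sqlite3

# Intermediate table from the question: one row per event with its own
# frequencies. Event 4 has not been summarised yet, hence None (NULL).
PER_EVENT = [
    (1, "time1", 13, 5),
    (2, "time2", 2, 1),
    (3, "time3", 1, 1),
    (4, "time4", None, None),
]

# Inner query: shift each frequency down by one row (lag), using 0 for the
# first event. Outer query: running total of the shifted values.
QUERY = """
select id, event_time,
       sum(lag1) over (order by event_time rows unbounded preceding) as freq1,
       sum(lag2) over (order by event_time rows unbounded preceding) as freq2
from (select id, event_time,
             coalesce(lag(freq1) over (order by event_time), 0) as lag1,
             coalesce(lag(freq2) over (order by event_time), 0) as lag2
      from per_event) lagged
order by event_time
"""


def lagged_running_sum():
    """Load the sample rows into an in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table per_event (id integer, event_time text, "
        "freq1 integer, freq2 integer)"
    )
    conn.executemany("insert into per_event values (?, ?, ?, ?)", PER_EVENT)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in lagged_running_sum():
        print(row)
```

On the sample data this reproduces the four rows of the desired result. The lag and the running sum sit in separate query levels because a window function cannot be nested inside another window function.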

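Question

Is this the right approach? Is there a more efficient way to do it? This is only a small sample of the actual data, so performance may be a concern. My queries always return a lot of duplicate values, so I am not sure what went wrong.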

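Solution

@GordonLinoff's answer is correct for the use case above. I am adding a few small updates to make it work on my actual table. The only difference is that my event_id is a 36-character Java UUID and event_time is a timestamp.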

-- Seed row: the earliest past event never appears as a current_id, so it starts at 0.
-- UUIDs do not sort chronologically, so pick it by timestamp rather than by id.
select past_id, past_time, 0 as freq1, 0 as freq2
from (
    select past_id, past_time,
           row_number() over (order by past_time) as seqnum
    from t
) a
where a.seqnum = 1
union all
-- For each event keep only its most recent past event, then take a running total.
select current_id, current_time,
       sum(freq1) over (order by current_time rows unbounded preceding) as freq1,
       sum(freq2) over (order by current_time rows unbounded preceding) as freq2
from (
    select current_id, current_time, freq1, freq2,
           row_number() over (partition by current_id order by past_time desc) as seqnum
    from t
) b
where b.seqnum = 1;

[Comments]:

[Answer 1]:

I think you want union all together with window functions. Here is an example:

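-- Seed row: the earliest event never appears as a current_id, so it starts at 0.
select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
-- Keep only the newest past event per current event, then take a running total.
-- Redshift needs an explicit frame when an aggregate window has an ORDER BY.
(select current_id, current_time,
        sum(freq1) over (order by current_time rows unbounded preceding),
        sum(freq2) over (order by current_time rows unbounded preceding)
 from (select current_id, current_time, freq1, freq2,
              row_number() over (partition by current_id order by past_id desc) as seqnum
       from t
      ) t
 where seqnum = 1
);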
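To check the query above against the sample rows from the question, here is a small sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. Several things are assumptions made for the sake of a runnable example: SQLite (3.25 or later) stands in for Redshift, the column current_time is renamed to cur_time because CURRENT_TIME is a keyword in SQLite, the parentheses around the second branch are dropped because SQLite does not accept them in a compound select, and the helper name run_union_all_query is hypothetical.

```python
import sqlite3

# Sample rows from the question:
# (current_id, cur_time, past_id, past_time, freq1, freq2)
ROWS = [
    (2, "time2", 1, "time1", 13, 5),
    (3, "time3", 1, "time1", 13, 5),
    (3, "time3", 2, "time2", 2, 1),
    (4, "time4", 1, "time1", 13, 5),
    (4, "time4", 2, "time2", 2, 1),
    (4, "time4", 3, "time3", 1, 1),
]

# First branch: zero-valued seed row for the earliest event.
# Second branch: newest past event per current event, accumulated over time.
QUERY = """
select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
select current_id, cur_time,
       sum(freq1) over (order by cur_time rows unbounded preceding),
       sum(freq2) over (order by cur_time rows unbounded preceding)
from (select current_id, cur_time, freq1, freq2,
             row_number() over (partition by current_id
                                order by past_id desc) as seqnum
      from t) latest
where seqnum = 1
order by 1
"""


def run_union_all_query():
    """Load the sample rows into an in-memory table t and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (current_id integer, cur_time text, "
        "past_id integer, past_time text, freq1 integer, freq2 integer)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_union_all_query():
        print(row)
```

The row_number() filter is what removes the duplicates mentioned in the question: each current event appears once per logged past event, and only the newest of those rows carries a frequency that has not been counted yet.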

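[Comments]:

[Answer 2]:

Given the way your data is laid out in the snapshot table, I think the SQL below should give you what you are looking for in the desired result you posted.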

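-- Seed row for the first event, which has no snapshot rows of its own.
SELECT 1 AS id
      ,'time1' AS event_time
      ,0 AS freq1
      ,0 AS freq2
 UNION
-- Each snapshot already lists every past event, so summing per current event
-- gives the cumulative totals directly.
SELECT T.current_id AS id
      ,T.current_time AS event_time
      ,SUM(T.freq1) AS freq1
      ,SUM(T.freq2) AS freq2
  FROM snapshot AS T
 GROUP
    BY T.current_id
      ,T.current_time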

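The first SELECT in the UNION above is there so that you get the first record for time1, since your base table does not actually contain a snapshot entry for it. It has no FROM clause because we are only selecting constants; if Redshift does not support that, you may need to look for an equivalent of Oracle's DUAL table.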
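The GROUP BY variant can be tried on the sample data in the same way. This is a sketch under assumptions, not a Redshift test: it uses Python's sqlite3 module with an in-memory SQLite database, renames current_time to cur_time to avoid SQLite's CURRENT_TIME keyword, and the helper name run_group_by_query is hypothetical. The table name snapshot is the one used in this answer.

```python
import sqlite3

# Sample rows from the question:
# (current_id, cur_time, past_id, past_time, freq1, freq2)
ROWS = [
    (2, "time2", 1, "time1", 13, 5),
    (3, "time3", 1, "time1", 13, 5),
    (3, "time3", 2, "time2", 2, 1),
    (4, "time4", 1, "time1", 13, 5),
    (4, "time4", 2, "time2", 2, 1),
    (4, "time4", 3, "time3", 1, 1),
]

# The first SELECT hard-codes the seed row for event 1 (an assumption carried
# over from the answer); the second sums all logged past events per snapshot.
QUERY = """
select 1 as id, 'time1' as event_time, 0 as freq1, 0 as freq2
union
select current_id, cur_time, sum(freq1), sum(freq2)
from snapshot
group by current_id, cur_time
order by 1
"""


def run_group_by_query():
    """Load the sample rows into an in-memory table snapshot and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table snapshot (current_id integer, cur_time text, "
        "past_id integer, past_time text, freq1 integer, freq2 integer)"
    )
    conn.executemany("insert into snapshot values (?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_group_by_query():
        print(row)
```

This version needs no window functions, but it relies on every snapshot containing the complete history; if a snapshot ever omits an older event, the per-snapshot sums will fall short of the true running total.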

Hope this helps.

[Comments]:
