Lag function to get the last different value (Redshift)

【Title】: lag function to get the last different value (Redshift) 【Posted】: 2017-06-20 06:48:15 【Question】:

I have the sample data below and want to get a specific output. Please help me with ideas.

For rows 3 and 4 of the output, I want prev_diff_value to be 2015-01-01 00:00:00 instead of 2015-01-02 00:00:00.

with dat as (
            select 1 as id,'20150101 02:02:50'::timestamp as dt union all
            select 1,'20150101 03:02:50'::timestamp union all
            select 1,'20150101 04:02:50'::timestamp union all
            select 1,'20150102 02:02:50'::timestamp union all
            select 1,'20150102 02:02:50'::timestamp union all
            select 1,'20150102 02:02:51'::timestamp union all
            select 1,'20150103 02:02:50'::timestamp union all
            select 2,'20150101 02:02:50'::timestamp union all
            select 2,'20150101 03:02:50'::timestamp union all
            select 2,'20150101 04:02:50'::timestamp union all
            select 2,'20150102 02:02:50'::timestamp union all
            select 1,'20150104 02:02:50'::timestamp
            )-- select * from dat
   select id , dt , lag(trunc(dt)) over(partition by id order by dt asc) prev_diff_value
   from dat
  order by id,dt desc
Output:
   id   dt                    prev_diff_value
   1    2015-01-04 02:02:50   2015-01-03 00:00:00
   1    2015-01-03 02:02:50   2015-01-02 00:00:00
   1    2015-01-02 02:02:51   2015-01-02 00:00:00
   1    2015-01-02 02:02:50   2015-01-02 00:00:00
   1    2015-01-02 02:02:50   2015-01-01 00:00:00

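【Comments】:

Hi, could you explain more clearly what you expect to see in the prev_diff_value column? The lag function takes the previous row as its reference, so it is working correctly for what you asked of Redshift. Why do you want to go one more step back? Perhaps you could group by year-month-day hh-mm and ignore the seconds, so that 2015-01-02 02:02:50 and 2015-01-02 02:02:51 are treated as the same?

OK, I understand that it returns the correct result, but the value I expect in prev_diff_value is that of the previous day, or possibly the day before that, rather than that of the previous row.

【Answer 1】: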

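As I understand it, you want the previous distinct date for each timestamp within an id partition. In that case, I would apply lag to the unique combinations of id and date, and then join the result back to the original dataset like this: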

with dat as (
    select 1 as id,'20150101 02:02:50'::timestamp as dt union all
    select 1,'20150101 03:02:50'::timestamp union all
    select 1,'20150101 04:02:50'::timestamp union all
    select 1,'20150102 02:02:50'::timestamp union all
    select 1,'20150102 02:02:50'::timestamp union all
    select 1,'20150102 02:02:51'::timestamp union all
    select 1,'20150103 02:02:50'::timestamp union all
    select 2,'20150101 02:02:50'::timestamp union all
    select 2,'20150101 03:02:50'::timestamp union all
    select 2,'20150101 04:02:50'::timestamp union all
    select 2,'20150102 02:02:50'::timestamp union all
    select 1,'20150104 02:02:50'::timestamp
)
,dat_unique_lag as (
    select *, lag(date) over(partition by id order by date asc) prev_diff_value
    from (
        select distinct id,trunc(dt) as date
        from dat
    )
)
select *
from dat
join dat_unique_lag
using (id)
where trunc(dat.dt)=dat_unique_lag.date
order by id,dt desc;
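
To make the three steps of the query above concrete outside the database, here is a minimal Python sketch of the same logic. It is only an illustration under stated assumptions: the `rows` list is a shortened copy of the sample data, and the helper name `prev_distinct_dates` is hypothetical, not part of the original answer.

```python
from datetime import datetime

# Shortened sample data: (id, dt) pairs, including one duplicated timestamp.
rows = [
    (1, datetime(2015, 1, 1, 2, 2, 50)),
    (1, datetime(2015, 1, 1, 3, 2, 50)),
    (1, datetime(2015, 1, 2, 2, 2, 50)),
    (1, datetime(2015, 1, 2, 2, 2, 50)),
    (1, datetime(2015, 1, 2, 2, 2, 51)),
    (1, datetime(2015, 1, 3, 2, 2, 50)),
    (2, datetime(2015, 1, 1, 2, 2, 50)),
    (2, datetime(2015, 1, 2, 2, 2, 50)),
]


def prev_distinct_dates(rows):
    """Map each (id, date) pair to the previous distinct date of that id, or None."""
    # Step 1: select distinct id, trunc(dt) as date
    pairs = sorted({(i, dt.date()) for i, dt in rows})
    # Step 2: lag(date) over (partition by id order by date asc)
    lagged, last_seen = {}, {}
    for i, d in pairs:
        lagged[(i, d)] = last_seen.get(i)  # None for the first date of an id
        last_seen[i] = d
    return lagged


lagged = prev_distinct_dates(rows)
# Step 3: join back to the original rows on id and trunc(dt)
result = [(i, dt, lagged[(i, dt.date())]) for i, dt in rows]

for row in result:
    print(row)
```

Every row dated 2015-01-02 for id 1 gets 2015-01-01 as its previous distinct date, which is the behaviour the question asks for.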

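However, this is not especially performant. If your data is such that the number of timestamps falling on the same day is limited, you could simply extend the lag with nested conditional expressions like this: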

with dat as (
    select 1 as id,'20150101 02:02:50'::timestamp as dt union all
    select 1,'20150101 03:02:50'::timestamp union all
    select 1,'20150101 04:02:50'::timestamp union all
    select 1,'20150102 02:02:50'::timestamp union all
    select 1,'20150102 02:02:50'::timestamp union all
    select 1,'20150102 02:02:51'::timestamp union all
    select 1,'20150103 02:02:50'::timestamp union all
    select 2,'20150101 02:02:50'::timestamp union all
    select 2,'20150101 03:02:50'::timestamp union all
    select 2,'20150101 04:02:50'::timestamp union all
    select 2,'20150102 02:02:50'::timestamp union all
    select 1,'20150104 02:02:50'::timestamp
)
select id, dt,
case 
    when lag(trunc(dt)) over(partition by id order by dt asc)=trunc(dt)
    then case 
        when lag(trunc(dt),2) over(partition by id order by dt asc)=trunc(dt)
        then case
            when lag(trunc(dt),3) over(partition by id order by dt asc)=trunc(dt)
            then lag(trunc(dt),4) over(partition by id order by dt asc)
            else lag(trunc(dt),3) over(partition by id order by dt asc)
            end
        else lag(trunc(dt),2) over(partition by id order by dt asc)
        end
    else lag(trunc(dt)) over(partition by id order by dt asc)
end as prev_diff_value
from dat
order by id,dt desc;

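Basically, you look at the previous record, and if it does not suit you (it falls on the same date), you look back at the record before that, and so on, until you find the right record or run out of nesting depth. Here it goes back as far as the 4th previous record, so the result is only correct as long as no id has more than four timestamps on the same day.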
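
The depth limit of the nested conditional can be seen in a small Python sketch that emulates it. This is an illustration only: the function name `prev_diff_value`, its `max_depth` parameter, and the test data (one id with one row on 2015-01-01 and five rows on 2015-01-02) are assumptions made for the example.

```python
from datetime import date


def prev_diff_value(dates, idx, max_depth=4):
    """Emulate the nested CASE for the row at position idx.

    dates holds the truncated dates of one id, ordered ascending.
    Step back while lag(k) equals the current date; at max_depth the
    value is returned unconditionally, as in the innermost THEN branch.
    """
    for k in range(1, max_depth + 1):
        # lag(trunc(dt), k): None (NULL) when there is no such earlier row
        lag_k = dates[idx - k] if idx - k >= 0 else None
        if k == max_depth or lag_k != dates[idx]:
            return lag_k


jan1, jan2 = date(2015, 1, 1), date(2015, 1, 2)
dates = [jan1] + [jan2] * 5  # five timestamps on the same day

for idx in range(len(dates)):
    print(idx, dates[idx], prev_diff_value(dates, idx))
```

The fourth row of 2015-01-02 still resolves to 2015-01-01, but the fifth row gets 2015-01-02 back, because the fixed nesting depth is exhausted before a different date is reached.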

【Comments】:

【Answer 2】:

Here is another way to look at the problem. It is not very efficient, but it is interesting nonetheless.

with dat as (
    select 1 as id,'20150101 02:02:50'::timestamp as dt union all
    select 1,'20150101 03:02:50'::timestamp union all
    select 1,'20150101 04:02:50'::timestamp union all
    select 1,'20150102 02:02:50'::timestamp union all
    select 1,'20150102 02:02:50'::timestamp union all
    select 1,'20150102 02:02:51'::timestamp union all
    select 1,'20150103 02:02:50'::timestamp union all
    select 2,'20150101 02:02:50'::timestamp union all
    select 2,'20150101 03:02:50'::timestamp union all
    select 2,'20150101 04:02:50'::timestamp union all
    select 2,'20150102 02:02:50'::timestamp union all
    select 1,'20150104 02:02:50'::timestamp
)
select distinct
dat.id
,dat.dt
,last_value(dat2.d) over (partition by dat.id, dat.dt order by dat2.d asc rows between unbounded preceding and unbounded following) as prev_diff_value
from dat
left join (
    select distinct
    id
    ,trunc(dt) as d
    from dat) dat2 on dat.id = dat2.id and trunc(dat.dt) > dat2.d
order by 1,2,3;

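This extracts the distinct id and date pairs and joins them back to the dataset only where the joined date is earlier than the date of the row in question. The last_value function then takes the last (that is, the latest) of those dates for each row, and distinct removes all the redundant rows from the output. I know this question is a few years old, but I stumbled upon it and had fun with it.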
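
The join-and-pick-the-latest idea can also be sketched in plain Python. This is an emulation for illustration, not the query itself: the `rows` list is a shortened copy of the sample data, and a Python set stands in for `select distinct`.

```python
from datetime import datetime

# Shortened sample data: (id, dt) pairs, including one duplicated timestamp.
rows = [
    (1, datetime(2015, 1, 1, 2, 2, 50)),
    (1, datetime(2015, 1, 1, 3, 2, 50)),
    (1, datetime(2015, 1, 2, 2, 2, 50)),
    (1, datetime(2015, 1, 2, 2, 2, 50)),
    (1, datetime(2015, 1, 2, 2, 2, 51)),
    (1, datetime(2015, 1, 3, 2, 2, 50)),
    (2, datetime(2015, 1, 1, 2, 2, 50)),
    (2, datetime(2015, 1, 2, 2, 2, 50)),
]

# Inner query: select distinct id, trunc(dt) as d
pairs = {(i, dt.date()) for i, dt in rows}

output = set()  # a set plays the role of the outer select distinct
for i, dt in rows:
    # Left join condition: same id and a strictly earlier date
    earlier = [d for j, d in pairs if j == i and d < dt.date()]
    # last_value over the whole frame ordered by d asc is the latest
    # earlier date; an unmatched left join row yields NULL (None here)
    output.add((i, dt, max(earlier) if earlier else None))

result = sorted(output, key=lambda r: (r[0], r[1]))
for row in result:
    print(row)
```

Note that, like the `select distinct` in the query, the set collapses the duplicated 2015-01-02 02:02:50 timestamp into a single output row, so eight input rows produce seven result rows here.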

【Comments】:
