Returning median values of time deltas across different groups


【Posted】2019-05-16 14:00:41 【Question】

I am trying to compute the time ranges between different steps in my table and return the median of each calculation, using this SQL:

SELECT median(datediff(seconds,one,two)) as step_one,
       median(datediff(seconds,two,three)) as step_two
FROM Table

This returns the following error message:

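[0A000][500310] Amazon Invalid operation: within group ORDER BY clauses for aggregate functions must be the same; java.lang.RuntimeException: com.amazon.support.exceptions.ErrorException: Amazon Invalid operation: within group ORDER BY clauses for aggregate functions must be the same;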

Note: I can return a single median value without any problem.

Here is a sample of my data frame:

one                                 two                        three    
2015-12-14 19:01:58.014247  2015-12-21 17:36:06.187302  2015-12-14 19:10:00.040057  2015-12-14 19:03:18.153519
2016-01-02 05:18:50.351975  2016-01-02 05:26:10.660299  2016-01-02 05:22:58.353365  2016-01-02 05:19:34.915794
2016-02-08 07:29:23.938046  2016-02-08 07:41:42.016819  2016-02-08 07:31:23.899776  2016-02-08 07:30:03.168844
2016-02-25 18:25:39.223014  2016-02-25 18:31:07.087808  2016-02-25 18:29:02.490969  2016-02-25 18:26:20.188472
2015-11-26 12:02:27.033141  2015-11-26 12:07:52.813699  2015-11-26 12:06:33.106484  2015-11-26 12:03:09.152853

2015-12-18 08:44:13.184319  2015-12-18 13:10:51.707354  2015-12-18 13:09:35.938711  2015-12-18 13:02:22.650966
2016-01-31 06:41:55.165849  2016-01-31 06:44:58.004319  2016-01-31 06:43:25.923505  2016-01-31 06:42:29.955232
2016-02-15 12:22:29.051259  2016-02-22 09:29:15.649721  2016-02-22 08:40:45.221558  2016-02-16 06:52:52.368139

The desired result is the median time delta from one to two and from two to three (the real data has more columns).
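To make the target output concrete, here is a minimal Python sketch of the same calculation outside the database. The rows are hypothetical values chosen for illustration, not taken from the sample above, and the helper names (`seconds_between`, `median_step_deltas`) are made up for this example. It assumes that `datediff(seconds, a, b)` counts whole-second boundaries, so both timestamps are truncated to the second before subtracting.

```python
from datetime import datetime
from statistics import median


def seconds_between(start, end):
    """Whole-second difference; both timestamps are truncated to the second first
    (assumption: this mirrors how datediff(seconds, ...) counts boundaries)."""
    start = start.replace(microsecond=0)
    end = end.replace(microsecond=0)
    return int((end - start).total_seconds())


def median_step_deltas(rows):
    """Return the median delta in seconds for one->two and two->three."""
    step_one = [seconds_between(r["one"], r["two"]) for r in rows]
    step_two = [seconds_between(r["two"], r["three"]) for r in rows]
    return {"step_one": median(step_one), "step_two": median(step_two)}


# Hypothetical rows, for illustration only.
rows = [
    {"one": datetime(2016, 1, 1, 0, 0, 0, 500000),
     "two": datetime(2016, 1, 1, 0, 1, 0, 250000),
     "three": datetime(2016, 1, 1, 0, 3, 0)},
    {"one": datetime(2016, 1, 2, 0, 0, 0),
     "two": datetime(2016, 1, 2, 0, 0, 30),
     "three": datetime(2016, 1, 2, 0, 0, 40)},
    {"one": datetime(2016, 1, 3, 0, 0, 0),
     "two": datetime(2016, 1, 3, 0, 2, 0),
     "three": datetime(2016, 1, 3, 0, 2, 50)},
]

result = median_step_deltas(rows)
print(result)
```

Each step gets its own independent median, which is exactly what the single SELECT with two MEDIAN calls is trying to express.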

【Comments】

median is a window function in amazon-redshift: docs.aws.amazon.com/redshift/latest/dg/r_WF_MEDIAN.html You need to add a partition.

【Answer 1】

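If a statement includes multiple calls to sort-based aggregate functions (LISTAGG, PERCENTILE_CONT, or MEDIAN), they must all use the same ORDER BY values. Note that MEDIAN applies an implicit order by on the expression value.

From https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html

【Comments】

【Answer 2】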

For this query, since there is no GROUP BY, you can simply split it into two subqueries, one per median:

Select step_one,
       step_two
From
      (SELECT median(datediff(seconds,one,two)) as step_one
       FROM Table) as a,
      (SELECT median(datediff(seconds,two,three)) as step_two
       FROM Table) as b

In more complex cases, where the select includes a GROUP BY, I found a workaround for this problem. Consider the following table:


create table test321 (i int, j int, k int, l int);



insert into test321 values(null, null, null, null);
insert into test321 values(null, 13, null, null);
insert into test321 values(17, null, null, null);
insert into test321 values(null, 15, null, 14);
insert into test321 values(15, null, null, 15);


insert into test321 values(null, 14, 10, null);
insert into test321 values(14, null, 11, null);
insert into test321 values(null, 16, 12, 12);
insert into test321 values(16, null, 13, 13);

insert into test321 values(1, 1, 1, 1);
insert into test321 values(2, 2, 1, 2);
insert into test321 values(3, 3, 1, 3);
insert into test321 values(4, 4, 2, 1);
insert into test321 values(5, 5, 2, 2);
insert into test321 values(6, 6, 2, 3);
insert into test321 values(7, 7, 3, 1);
insert into test321 values(8, 8, 3, 2);
insert into test321 values(9, 9, 3, 3);
insert into test321 values(10, 10, 4, 1);
insert into test321 values(11, 11, 4, 2);
insert into test321 values(12, 12, 4, 3);

Suppose we are looking for:

select  k, l, median(i), median(j)
from    test321
group by  k, l

Then the general solution is:


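-- Compute each median in its own grouped subquery, then join the two results on the group keys.
-- NULL keys never compare equal in a join, so each NULL key is replaced by a placeholder
-- (the column's max) and flagged with status -1; the outer CASE turns it back into NULL.
Select  case when a1.kstatus = -1 then null else a1.k end k,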
        case when a1.lstatus = -1 then null else a1.l end l,
        medi,
        medj
From    ( Select  coalesce(k, (select max(k) k from test321 where k is not null)) k,
                  case when a.k is not null then 0 else -1 end kstatus,
                  coalesce(l, (select max(l) l from test321 where l is not null)) l,
                  case when a.l is not null then 0 else -1 end lstatus,
                  median(i) medi
          From    (
                    select  i, j, k, l
                    from    test321
                  ) as a
          group by k, l
        ) as a1
          inner join
        ( Select  coalesce(k, (select max(k) k from test321 where k is not null)) k,
                  case when a.k is not null then 0 else -1 end kstatus,
                  coalesce(l, (select max(l) l from test321 where l is not null)) l,
                  case when a.l is not null then 0 else -1 end lstatus,
                  median(j) medj
          From    (
                    select  i, j, k, l
                    from    test321
                  ) as a
          group by k, l
        ) as a2
          on  (
                a1.k =  a2.k            and
                a1.l =  a2.l            and
                a1.kstatus = a2.kstatus and
                a1.lstatus = a2.lstatus
              )
;
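As a way to sanity-check what the joined query should return, the following Python sketch computes the same per-group medians directly from the rows inserted above. It is a cross-check under stated assumptions, not part of the SQL solution: `None` stands for NULL, NULL group keys are treated as one group (as GROUP BY does), MEDIAN is assumed to ignore NULL inputs and to yield NULL when a group has no non-NULL values, and the helper names are made up for this example.

```python
from collections import defaultdict
from statistics import median

# (i, j, k, l) rows copied from the INSERT statements above; None stands for NULL.
ROWS = [
    (None, None, None, None), (None, 13, None, None), (17, None, None, None),
    (None, 15, None, 14), (15, None, None, 15),
    (None, 14, 10, None), (14, None, 11, None),
    (None, 16, 12, 12), (16, None, 13, 13),
    (1, 1, 1, 1), (2, 2, 1, 2), (3, 3, 1, 3),
    (4, 4, 2, 1), (5, 5, 2, 2), (6, 6, 2, 3),
    (7, 7, 3, 1), (8, 8, 3, 2), (9, 9, 3, 3),
    (10, 10, 4, 1), (11, 11, 4, 2), (12, 12, 4, 3),
]


def median_or_none(values):
    """Median of the non-None values, or None if there are none
    (assumption: this matches how MEDIAN treats NULLs)."""
    present = [v for v in values if v is not None]
    return median(present) if present else None


def grouped_medians(rows):
    """Map each (k, l) group to (median of i, median of j)."""
    groups = defaultdict(lambda: ([], []))
    for i, j, k, l in rows:
        i_vals, j_vals = groups[(k, l)]
        i_vals.append(i)
        j_vals.append(j)
    return {
        key: (median_or_none(i_vals), median_or_none(j_vals))
        for key, (i_vals, j_vals) in groups.items()
    }


result = grouped_medians(ROWS)
for key, (med_i, med_j) in result.items():
    print(key, med_i, med_j)
```

Because a tuple such as `(None, 14)` is an ordinary dictionary key, the NULL groups need no placeholder here; the `kstatus`/`lstatus` flags in the SQL exist only because a join condition cannot match NULL to NULL.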

Hope this helps.

【Comments】
