(REDSHIFT) Vertical Coalesce / FIRST_VALUE() as an Aggregate

【Posted】: 2019-02-14 20:11:08

【Question】:

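(This is specific to Redshift, so its columnar nature, sort order, and so on should be taken into account.)

When ordered by timestamp, I need to take the first non-NULL value from each column, per category.

Essentially the same as FIRST_VALUE(), but as an aggregate.

Or, alternatively, COALESCE() as an aggregate.

Redshift, however, doesn't have the niceties of later versions of PostgreSQL or of Oracle. So I'm looking for options to test against my 100-million-row import :)

(I don't like either of my options, but I'm struggling to find better ones.)

Example input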

 category | row_timestamp | value_a | value_b | value_c
----------+---------------+---------+---------+---------

    01    |      001      |   NULL  |   NULL  |     4
    01    |      010      |      7  |   NULL  |  NULL
    01    |      100      |   NULL  |      1  |     2
    01    |      999      |      6  |      3  |     6

    02    |      001      |      1  |   NULL  |  NULL
    02    |      010      |   NULL  |      2  |  NULL
    02    |      100      |   NULL  |      1  |     9
    02    |      999      |      6  |      3  |     2

Expected result

 category | value_a | value_b | value_c
----------+---------+---------+---------
    01    |      7  |      1  |     4
    02    |      1  |      2  |     9
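
To reproduce the example, the input above could be loaded with something like the following. This is only a sketch: the column types, the integer stand-in for row_timestamp, and the DISTKEY / SORTKEY choices are assumptions for illustration and are not stated in the question.

```sql
-- Hypothetical setup for the example input.
-- Column types and the distribution/sort keys are assumptions.
CREATE TABLE mytable
(
    category      CHAR(2),
    row_timestamp INTEGER,   -- stand-in for the real timestamp column
    value_a       INTEGER,
    value_b       INTEGER,
    value_c       INTEGER
)
DISTKEY (category)
SORTKEY (category, row_timestamp);

INSERT INTO mytable (category, row_timestamp, value_a, value_b, value_c)
VALUES
    ('01',   1, NULL, NULL,    4),
    ('01',  10,    7, NULL, NULL),
    ('01', 100, NULL,    1,    2),
    ('01', 999,    6,    3,    6),
    ('02',   1,    1, NULL, NULL),
    ('02',  10, NULL,    2, NULL),
    ('02', 100, NULL,    1,    9),
    ('02', 999,    6,    3,    2);
```

Distributing on category keeps each partition on one slice, and sorting on (category, row_timestamp) matches the PARTITION BY / ORDER BY used by the window functions in the queries, which is why those keys were picked for this sketch.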

Current solution

SELECT DISTINCT
    category,
    FIRST_VALUE(value_a IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_a,

    FIRST_VALUE(value_b IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_b,

    FIRST_VALUE(value_c IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_c
FROM
    mytable

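It works, but the DISTINCT may have to collapse hundreds or thousands of duplicate rows. Less than ideal.

If it were only for one or two columns, this might work (but it's for a dozen or so columns, so it's horrible)...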

WITH
    sorted_value_a AS
(
    SELECT
        category,
        value_a,
        -- "value_a IS NULL" sorts FALSE first, so non-NULL rows get the lowest ordinals
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_a IS NULL, row_timestamp
                          )
                              AS row_ordinal
    FROM
        mytable
),
    sorted_value_b AS
(
    SELECT
        category,
        value_b,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_b IS NULL, row_timestamp
                          )
                              AS row_ordinal
    FROM
        mytable
),
    sorted_value_c AS
(
    SELECT
        category,
        value_c,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_c IS NULL, row_timestamp
                          )
                              AS row_ordinal
    FROM
        mytable
)
SELECT
    a.category,
    a.value_a,
    b.value_b,
    c.value_c
FROM
    sorted_value_a   AS a
INNER JOIN
    sorted_value_b   AS b
        ON  b.category    = a.category
        AND b.row_ordinal = 1
INNER JOIN
    sorted_value_c   AS c
        ON  c.category    = a.category
        AND c.row_ordinal = 1
WHERE
    a.row_ordinal = 1   -- keep only the top-ranked row from each CTE

【Answer 1】:

Hmm, I don't know whether this is pretty, but you could do:

select category, value_a, value_b, value_c
from (select category,
             -- lead ... ignore nulls looks forward to the next non-NULL value, so the
             -- earliest row (seqnum = 1) ends up holding the first non-NULL per column.
             -- (lag with a descending seqnum would return the last non-NULL value instead.)
             coalesce(value_a, lead(value_a) ignore nulls over (partition by category order by row_timestamp)) as value_a,
             coalesce(value_b, lead(value_b) ignore nulls over (partition by category order by row_timestamp)) as value_b,
             coalesce(value_c, lead(value_c) ignore nulls over (partition by category order by row_timestamp)) as value_c,
             row_number() over (partition by category order by row_timestamp) as seqnum
      from mytable t
     ) t
where seqnum = 1;

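【Comments】:

Hopefully better than using DISTINCT to collapse it. Nice idea, thanks, I'll try it and let you know :)

Replacing DISTINCT with ROW_NUMBER() = 1 saved a lot of cost, including better execution plans for the subsequent joins, etc. Because we happened to need a combination of FIRST_VALUE() and LAST_VALUE() (which I hadn't realised at the time of my OP), I never tested the benefit of FIRST_VALUE(this IGNORE NULLS) versus COALESCE(this, LAG(this IGNORE NULLS)).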
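
The thread does not show the query that was finally used. A minimal sketch of the combination described in the last comment, keeping the question's FIRST_VALUE(... IGNORE NULLS) expressions but collapsing with a ROW_NUMBER() = 1 filter instead of DISTINCT, might look like this. The alias seqnum is borrowed from the answer, and the exact shape is an assumption.

```sql
-- Sketch only: FIRST_VALUE(... IGNORE NULLS) as in the question,
-- with DISTINCT replaced by a ROW_NUMBER() = 1 filter as in the answer.
SELECT
    category,
    value_a,
    value_b,
    value_c
FROM
(
    SELECT
        category,
        FIRST_VALUE(value_a IGNORE NULLS)
            OVER (PARTITION BY category
                      ORDER BY row_timestamp
                  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                 )
                     AS value_a,
        FIRST_VALUE(value_b IGNORE NULLS)
            OVER (PARTITION BY category
                      ORDER BY row_timestamp
                  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                 )
                     AS value_b,
        FIRST_VALUE(value_c IGNORE NULLS)
            OVER (PARTITION BY category
                      ORDER BY row_timestamp
                  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                 )
                     AS value_c,
        -- Every row in a category carries the same three values,
        -- so any single row per category will do.
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY row_timestamp
                          )
                              AS seqnum
    FROM
        mytable
) AS t
WHERE
    seqnum = 1;
```

Because the frame spans the whole partition, swapping FIRST_VALUE for LAST_VALUE on individual columns should give the last non-NULL value for those columns, which is one way to get the mixed first/last behaviour the comment mentions.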
