(REDSHIFT) Vertical Coalesce / FIRST_VALUE() as an Aggregate
【Posted】: 2019-02-14 20:11:08
【Question】: (This is specific to Redshift, so answers should take into account its columnar nature, sort order, etc.)
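I need to get the first non-NULL value from each column, per category, when the rows are ordered by timestamp.
Essentially the same as FIRST_VALUE(), but as an aggregate.
Or, alternatively, COALESCE() as an aggregate.
However, Redshift lacks the niceties of later PostgreSQL versions or of Oracle, so I'm looking for options to test against my 100-million-row import :)
(I don't like either of my options, but I'm struggling to find better ones.)
Example input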
 category | row_timestamp | value_a | value_b | value_c
----------+---------------+---------+---------+---------
 01       | 001           | NULL    | NULL    | 4
 01       | 010           | 7       | NULL    | NULL
 01       | 100           | NULL    | 1       | 2
 01       | 999           | 6       | 3       | 6
 02       | 001           | 1       | NULL    | NULL
 02       | 010           | NULL    | 2       | NULL
 02       | 100           | NULL    | 1       | 9
 02       | 999           | 6       | 3       | 2
Expected result
 category | value_a | value_b | value_c
----------+---------+---------+---------
 01       | 7       | 1       | 4
 02       | 1       | 2       | 9
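To pin down the intended semantics independently of any SQL dialect, here is a minimal Python sketch of "first non-NULL value per column, per category, in timestamp order". It is not Redshift code and says nothing about performance. The helper name `first_non_null_per_category` and the tuple layout are assumptions made for illustration, and the timestamps from the sample are written as plain integers.

```python
from typing import Dict, List, Optional, Tuple

# One row: (category, row_timestamp, value_a, value_b, value_c)
Row = Tuple[str, int, Optional[int], Optional[int], Optional[int]]

# The sample input from the question, with NULL written as None.
SAMPLE_ROWS: List[Row] = [
    ("01", 1, None, None, 4),
    ("01", 10, 7, None, None),
    ("01", 100, None, 1, 2),
    ("01", 999, 6, 3, 6),
    ("02", 1, 1, None, None),
    ("02", 10, None, 2, None),
    ("02", 100, None, 1, 9),
    ("02", 999, 6, 3, 2),
]


def first_non_null_per_category(rows: List[Row]) -> Dict[str, List[Optional[int]]]:
    """Hypothetical helper: first non-None value of each column per category."""
    result: Dict[str, List[Optional[int]]] = {}
    # Visit rows in (category, timestamp) order so the earliest value wins.
    for category, _ts, *values in sorted(rows, key=lambda r: (r[0], r[1])):
        firsts = result.setdefault(category, [None] * len(values))
        for i, value in enumerate(values):
            # Only fill a slot that has not yet seen a non-None value.
            if firsts[i] is None and value is not None:
                firsts[i] = value
    return result


if __name__ == "__main__":
    for category, firsts in first_non_null_per_category(SAMPLE_ROWS).items():
        print(category, firsts)
```

Any SQL formulation can be checked against this reference on a small extract before it is run on the full import.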
Current solution
SELECT DISTINCT
  category,
  FIRST_VALUE(value_a IGNORE NULLS)
    OVER (PARTITION BY category
              ORDER BY row_timestamp
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         )
           AS value_a,
  FIRST_VALUE(value_b IGNORE NULLS)
    OVER (PARTITION BY category
              ORDER BY row_timestamp
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         )
           AS value_b,
  FIRST_VALUE(value_c IGNORE NULLS)
    OVER (PARTITION BY category
              ORDER BY row_timestamp
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         )
           AS value_c
FROM
  mytable
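It works, but the window functions return one row per input row, so the DISTINCT may have to collapse hundreds or thousands of identical rows per category. Less than ideal.
If it were only for one or two columns, this might work (but it's for a dozen or so columns, so it's horrible)...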
WITH
  sorted_value_a AS
  (
    SELECT
      category,
      value_a,
      -- NULLs sort last, so row_ordinal = 1 is the earliest non-NULL value
      ROW_NUMBER() OVER (PARTITION BY category
                             ORDER BY value_a IS NULL, row_timestamp
                        )
                          AS row_ordinal
    FROM
      mytable
  ),
  sorted_value_b AS
  (
    SELECT
      category,
      value_b,
      ROW_NUMBER() OVER (PARTITION BY category
                             ORDER BY value_b IS NULL, row_timestamp
                        )
                          AS row_ordinal
    FROM
      mytable
  ),
  sorted_value_c AS
  (
    SELECT
      category,
      value_c,
      ROW_NUMBER() OVER (PARTITION BY category
                             ORDER BY value_c IS NULL, row_timestamp
                        )
                          AS row_ordinal
    FROM
      mytable
  )
SELECT
  a.category,
  a.value_a,
  b.value_b,
  c.value_c
FROM
  sorted_value_a   AS a
INNER JOIN
  sorted_value_b   AS b
    ON  b.category    = a.category
    AND b.row_ordinal = 1   -- keep only the top-ranked row of each CTE
INNER JOIN
  sorted_value_c   AS c
    ON  c.category    = a.category
    AND c.row_ordinal = 1
WHERE
  a.row_ordinal = 1
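A per-column ROW_NUMBER() ranking only gives the expected result if NULLs sort after non-NULL values and only the top-ranked row of each CTE is kept. The sketch below checks that logic on the sample data. It uses Python's built-in `sqlite3` module with an in-memory SQLite database (window functions need SQLite 3.25 or later) purely as a local stand-in; it is an assumption that this transfers to Redshift, and it says nothing about Redshift execution plans.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real table (not Redshift).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (category TEXT, row_timestamp INTEGER,"
    " value_a INTEGER, value_b INTEGER, value_c INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        ("01", 1, None, None, 4),
        ("01", 10, 7, None, None),
        ("01", 100, None, 1, 2),
        ("01", 999, 6, 3, 6),
        ("02", 1, 1, None, None),
        ("02", 10, None, 2, None),
        ("02", 100, None, 1, 9),
        ("02", 999, 6, 3, 2),
    ],
)

# "<column> IS NULL" evaluates to 0 for real values and 1 for NULLs,
# so non-NULL rows are ranked first, then ordered by timestamp.
QUERY = """
WITH
  sorted_value_a AS (
    SELECT category, value_a,
           ROW_NUMBER() OVER (PARTITION BY category
                              ORDER BY value_a IS NULL, row_timestamp) AS row_ordinal
    FROM mytable
  ),
  sorted_value_b AS (
    SELECT category, value_b,
           ROW_NUMBER() OVER (PARTITION BY category
                              ORDER BY value_b IS NULL, row_timestamp) AS row_ordinal
    FROM mytable
  ),
  sorted_value_c AS (
    SELECT category, value_c,
           ROW_NUMBER() OVER (PARTITION BY category
                              ORDER BY value_c IS NULL, row_timestamp) AS row_ordinal
    FROM mytable
  )
SELECT a.category, a.value_a, b.value_b, c.value_c
FROM sorted_value_a AS a
INNER JOIN sorted_value_b AS b
        ON b.category = a.category AND b.row_ordinal = 1
INNER JOIN sorted_value_c AS c
        ON c.category = a.category AND c.row_ordinal = 1
WHERE a.row_ordinal = 1
ORDER BY a.category
"""

rows = conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Without the `row_ordinal = 1` conditions the joins would pair every row of a category with every other row of that category, which is why the filter belongs in each join.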
【Comments on the question】:
【Answer 1】: Hmm, I don't know if this is aesthetically pleasing, but you could do:
select category, value_a, value_b, value_c, value_d
from (select category,
             -- With a descending sort, LAG looks at later timestamps, so on the
             -- earliest row it finds the first non-NULL value in timestamp order.
             coalesce(value_a, lag(value_a ignore nulls) over (partition by category order by row_timestamp desc)) as value_a,
             coalesce(value_b, lag(value_b ignore nulls) over (partition by category order by row_timestamp desc)) as value_b,
             coalesce(value_c, lag(value_c ignore nulls) over (partition by category order by row_timestamp desc)) as value_c,
             -- value_d stands for any further column; it is not in the sample data
             coalesce(value_d, lag(value_d ignore nulls) over (partition by category order by row_timestamp desc)) as value_d,
             -- seqnum = 1 marks the earliest row of each category
             row_number() over (partition by category order by row_timestamp) as seqnum
      from mytable t
     ) t
where seqnum = 1;
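The COALESCE/LAG idea amounts to carrying the nearest non-NULL value along the sort order and then keeping a single row per category. To end up with the first non-NULL value by timestamp, the carry has to run from the latest row back to the earliest one, and the earliest row is the one that is kept. The following Python sketch simulates that pass on the sample data; the helper name `backfill_to_earliest_row` is hypothetical, and the code models the logic only, not how Redshift evaluates window functions.

```python
from itertools import groupby
from operator import itemgetter

# (category, row_timestamp, value_a, value_b, value_c), NULL written as None
SAMPLE_ROWS = [
    ("01", 1, None, None, 4),
    ("01", 10, 7, None, None),
    ("01", 100, None, 1, 2),
    ("01", 999, 6, 3, 6),
    ("02", 1, 1, None, None),
    ("02", 10, None, 2, None),
    ("02", 100, None, 1, 9),
    ("02", 999, 6, 3, 2),
]


def backfill_to_earliest_row(rows):
    """Hypothetical helper mimicking COALESCE(col, LAG(col IGNORE NULLS))
    over a descending timestamp sort, keeping the earliest row per category."""
    result = {}
    ordered = sorted(rows, key=itemgetter(0, 1))
    for category, group in groupby(ordered, key=itemgetter(0)):
        carried = [None, None, None]
        # Walk from the latest row to the earliest, like the descending window.
        for _category, _ts, *values in reversed(list(group)):
            # A row keeps its own value if present, otherwise the carried one.
            carried = [v if v is not None else c for v, c in zip(values, carried)]
        # After the loop, carried holds the values computed for the earliest row.
        result[category] = tuple(carried)
    return result


if __name__ == "__main__":
    print(backfill_to_earliest_row(SAMPLE_ROWS))
```

Running the carry in the opposite direction and keeping the latest row would return the last non-NULL value per column instead, which is the LAST_VALUE() counterpart of the same pattern.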
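【Comments】:
Hopefully better than collapsing it with DISTINCT. Good idea, thanks, I'll try it and let you know :)
Replacing DISTINCT with ROW_NUMBER() = 1 saved a lot of cost, including better execution plans for subsequent joins, etc. Because we happened to need a combination of FIRST_VALUE() and LAST_VALUE() (which I hadn't realized at the time of my OP), I never tested the benefit of FIRST_VALUE(this IGNORE NULLS) versus COALESCE(this, LAG(this IGNORE NULLS)).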