Poor performance of UNION query Redshift
【Title】: Poor performance of UNION query Redshift 【Posted】: 2020-07-01 08:25:14 【Question】: I have a Redshift UNION query that performs very poorly. The query is as follows:
WITH a1 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders1
GROUP BY revenue_month),
a2 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders2
GROUP BY revenue_month),
b1 AS (SELECT
revenue_month,
amount_type,
SUM(amount) AS amount
FROM monthly
GROUP BY revenue_month,amount_type)
SELECT 'a1' AS data_set, 'revenue' AS amount_type, a1.revenue AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost1' AS amount_type, a1.cost1 AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost2' AS amount_type, a1.cost2 AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost3' AS amount_type, a1.cost3 AS amount FROM a1 UNION
SELECT 'a2' AS data_set, 'revenue' AS amount_type, a2.revenue AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost1' AS amount_type, a2.cost1 AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost2' AS amount_type, a2.cost2 AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost3' AS amount_type, a2.cost3 AS amount FROM a2 UNION
SELECT 'b1' AS data_set, b1.amount_type, b1.amount FROM b1
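The goal of the UNION part is to transform a1 and a2 into the same result-set schema as b1, and ultimately to end up with one combined data set.
When run on their own, the a1 and a2 subqueries each take about 60 seconds to return 6,000 rows, and b1 takes 5 seconds to return 500 rows. Those run times are acceptable to me, but the "combined" query above runs for as long as 20 minutes.
I think the fetching part is where this query spends too much time. I have tried UNION ALL, but performance did not improve much. It would be great if I could somehow transform a1 and a2 into the b1 schema without using UNION, but I have not been able to do so.
Any help would be greatly appreciated. Thanks.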
【Comments】:
【Answer 1】: You basically want to unpivot the a1 and a2 tables.
I would do it like this:
WITH
seq (idx) AS (
select 'revenue' UNION ALL
select 'cost1' UNION ALL
select 'cost2' UNION ALL
select 'cost3'
),
a1 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders1
GROUP BY revenue_month),
a2 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders2
GROUP BY revenue_month),
b1 AS (SELECT
revenue_month,
amount_type,
SUM(amount) AS amount
FROM monthly
GROUP BY revenue_month,amount_type)
SELECT
'a1' AS data_set,
seq.idx AS amount_type,
CASE seq.idx
WHEN 'revenue' THEN a1.revenue
WHEN 'cost1' THEN a1.cost1
WHEN 'cost2' THEN a1.cost2
WHEN 'cost3' THEN a1.cost3
END AS amount
FROM a1 CROSS JOIN seq
UNION ALL
SELECT
'a2' AS data_set,
seq.idx AS amount_type,
CASE seq.idx
WHEN 'revenue' THEN a2.revenue
WHEN 'cost1' THEN a2.cost1
WHEN 'cost2' THEN a2.cost2
WHEN 'cost3' THEN a2.cost3
END AS amount
FROM a2 CROSS JOIN seq
UNION ALL
SELECT
'b1' AS data_set,
b1.amount_type,
b1.amount
FROM b1
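To check that the CROSS JOIN unpivot returns the same rows as the one-SELECT-per-column UNION ALL chain, here is a minimal sketch. It runs against SQLite through Python's sqlite3 module rather than Redshift, so it says nothing about Redshift run times or the errors Redshift may raise. The a1 table and its values are hypothetical toy data standing in for the a1 CTE.

```python
import sqlite3

# Hypothetical toy data standing in for the a1 CTE (one row per revenue_month).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a1 (revenue_month TEXT, revenue REAL, cost1 REAL, cost2 REAL, cost3 REAL);
    INSERT INTO a1 VALUES
        ('2020-01', 100, 10, 20, 30),
        ('2020-02', 100, 11, 21, 31);
""")

# Unpivot with one SELECT per column, glued together with UNION ALL.
union_all_sql = """
    SELECT 'a1' AS data_set, 'revenue' AS amount_type, revenue AS amount FROM a1 UNION ALL
    SELECT 'a1', 'cost1', cost1 FROM a1 UNION ALL
    SELECT 'a1', 'cost2', cost2 FROM a1 UNION ALL
    SELECT 'a1', 'cost3', cost3 FROM a1
"""

# Unpivot with a single reference to a1, CROSS JOINed to a four-row label table.
cross_join_sql = """
    WITH seq (idx) AS (
        SELECT 'revenue' UNION ALL SELECT 'cost1' UNION ALL
        SELECT 'cost2' UNION ALL SELECT 'cost3'
    )
    SELECT 'a1' AS data_set, seq.idx AS amount_type,
           CASE seq.idx
               WHEN 'revenue' THEN a1.revenue
               WHEN 'cost1' THEN a1.cost1
               WHEN 'cost2' THEN a1.cost2
               WHEN 'cost3' THEN a1.cost3
           END AS amount
    FROM a1 CROSS JOIN seq
"""

union_all_rows = sorted(conn.execute(union_all_sql).fetchall())
cross_join_rows = sorted(conn.execute(cross_join_sql).fetchall())

# Plain UNION removes duplicate rows. Because revenue_month is not selected,
# the two months with the same revenue collapse into a single row.
union_sql = union_all_sql.replace("UNION ALL", "UNION")
union_rows = sorted(conn.execute(union_sql).fetchall())

print(union_all_rows == cross_join_rows)
print(len(union_all_rows), len(union_rows))
```

Note the side effect of plain UNION in this toy data: since revenue_month is not part of the output, two months with identical amounts are merged, which is another reason to prefer UNION ALL when the goal is only to stack rows.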
【Comments】:
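Thanks for the quick answer, @botchniaque. I have tested that the CROSS JOIN approach transforms a1 and a2 correctly. However, the final query, run with the UNION ALL of a1, a2, and b1, throws the error XX000: This type of correlated subquery pattern is not supported due to internal error. I have checked this link, but this query pattern is not in that list. Any idea what is causing it?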
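What is the exact query you are using? Are you using my example directly? Maybe moving the unpivoting of a1 and a2 into their own CTEs (with a1_unpivot as (...), a2_unpivot as (...)) would help?
I am puzzled by the error you are getting, because there are no subqueries there.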
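Hi, thanks for the prompt reply! Really appreciated. I actually tried exactly what you suggested, moving the unpivoting into their own CTEs, but strangely the same error was thrown. When CROSS JOIN and UNION are both combined in the query, something goes wrong in the way Redshift reads it, and it always fails with the same error, whereas running the individual CTEs works without a problem. After playing with it for a whole day, though, what worked for me was converting b1 to the result schema of a1 and a2, performing a UNION on the three CTEs, and then finally applying the CROSS JOIN to unpivot.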
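【Answer 2】:
Thanks for all the help, @botchniaque. Your CROSS JOIN suggestion solved the problem, although there is still something about that query pattern that Redshift cannot read. The final query that worked for me is this: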
WITH a1 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders1
GROUP BY revenue_month),
a2 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders2
GROUP BY revenue_month),
b1 AS (SELECT
revenue_month,
SUM(CASE WHEN amount_type = 'revenue' THEN amount ELSE 0 END) AS revenue,
SUM(CASE WHEN amount_type = 'cost1' THEN amount ELSE 0 END) AS cost1,
SUM(CASE WHEN amount_type = 'cost2' THEN amount ELSE 0 END) AS cost2,
SUM(CASE WHEN amount_type = 'cost3' THEN amount ELSE 0 END) AS cost3
FROM (SELECT
revenue_month,
amount_type,
SUM(amount) AS amount
FROM monthly
GROUP BY revenue_month,amount_type) AS b0
GROUP BY revenue_month)
SELECT
ab.data_set,
ab.revenue_month,
seq.amount_type,
CASE seq.amount_type
WHEN 'revenue' THEN ab.revenue
WHEN 'cost1' THEN ab.cost1
WHEN 'cost2' THEN ab.cost2
WHEN 'cost3' THEN ab.cost3
END AS amount
FROM
(SELECT 'a1' AS data_set, a1.revenue_month, a1.revenue, a1.cost1, a1.cost2, a1.cost3 FROM a1 UNION ALL
SELECT 'a2' AS data_set, a2.revenue_month, a2.revenue, a2.cost1, a2.cost2, a2.cost3 FROM a2 UNION ALL
SELECT 'b1' AS data_set, b1.revenue_month, b1.revenue, b1.cost1, b1.cost2, b1.cost3 FROM b1) AS ab
CROSS JOIN (SELECT 'revenue' AS amount_type UNION ALL
SELECT 'cost1' AS amount_type UNION ALL
SELECT 'cost2' AS amount_type UNION ALL
SELECT 'cost3' AS amount_type) AS seq
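Basically, it first pivots b1 into the same schema as a1 and a2. It then combines all three data sets into ab with UNION ALL. Finally, it uses a CROSS JOIN to unpivot the combined data set.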
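The three steps (pivot the long table to wide, stack the wide data sets, unpivot once) can be sketched on toy data as follows. This is an illustration under assumptions, not the production query: it runs on SQLite through Python's sqlite3 module rather than Redshift, the table contents are hypothetical, only two amount types are used, and monthly is pivoted with a single GROUP BY instead of the nested aggregation shown above.

```python
import sqlite3

# Hypothetical toy tables: a1 is already wide, monthly is long (one row per amount).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a1 (revenue_month TEXT, revenue REAL, cost1 REAL);
    CREATE TABLE monthly (revenue_month TEXT, amount_type TEXT, amount REAL);
    INSERT INTO a1 VALUES ('2020-01', 100, 10);
    INSERT INTO monthly VALUES
        ('2020-01', 'revenue', 40), ('2020-01', 'revenue', 60), ('2020-01', 'cost1', 7);
""")

pipeline_sql = """
    WITH b1 AS (                      -- step 1: pivot the long table to wide
        SELECT revenue_month,
               SUM(CASE WHEN amount_type = 'revenue' THEN amount ELSE 0 END) AS revenue,
               SUM(CASE WHEN amount_type = 'cost1' THEN amount ELSE 0 END) AS cost1
        FROM monthly
        GROUP BY revenue_month
    ),
    ab AS (                           -- step 2: stack the wide data sets
        SELECT 'a1' AS data_set, revenue_month, revenue, cost1 FROM a1 UNION ALL
        SELECT 'b1' AS data_set, revenue_month, revenue, cost1 FROM b1
    ),
    seq (amount_type) AS (SELECT 'revenue' UNION ALL SELECT 'cost1')
    SELECT ab.data_set, ab.revenue_month, seq.amount_type,   -- step 3: unpivot once
           CASE seq.amount_type
               WHEN 'revenue' THEN ab.revenue
               WHEN 'cost1' THEN ab.cost1
           END AS amount
    FROM ab CROSS JOIN seq
    ORDER BY ab.data_set, seq.amount_type
"""

rows = conn.execute(pipeline_sql).fetchall()
for row in rows:
    print(row)
```

Carrying the data_set label through the stacking step is what lets the final SELECT report which source each amount came from.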
【Comments】: