Poor performance of UNION query in Redshift


【Title】: Poor performance of UNION query Redshift 【Posted】: 2020-07-01 08:25:14 【Question】:

I have a Redshift UNION query that performs very poorly. The query is as follows:

WITH a1 AS (SELECT
            revenue_month,
            SUM(revenue) AS revenue,
            SUM(cost1) AS cost1,
            SUM(cost2) AS cost2,
            SUM(cost3) AS cost3
            FROM orders1
            GROUP BY revenue_month),

     a2 AS (SELECT
            revenue_month,
            SUM(revenue) AS revenue,
            SUM(cost1) AS cost1,
            SUM(cost2) AS cost2,
            SUM(cost3) AS cost3
            FROM orders2
            GROUP BY revenue_month),

     b1 AS (SELECT
            revenue_month,
            amount_type,
            SUM(amount) AS amount
            FROM monthly
            GROUP BY revenue_month,amount_type)
             
SELECT 'a1' AS data_set, 'revenue' AS amount_type, a1.revenue AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost1' AS amount_type, a1.cost1 AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost2' AS amount_type, a1.cost2 AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost3' AS amount_type, a1.cost3 AS amount FROM a1 UNION

SELECT 'a2' AS data_set, 'revenue' AS amount_type, a2.revenue AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost1' AS amount_type, a2.cost1 AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost2' AS amount_type, a2.cost2 AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost3' AS amount_type, a2.cost3 AS amount FROM a2 UNION

SELECT 'b1' AS data_set, b1.amount_type, b1.amount FROM b1

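The goal of the UNION part is to transform a1 and a2 into the same result-set schema as b1, and to end up with one combined dataset.

When run on their own, the a1 and a2 subqueries each take about 60 seconds to return 6000 rows, and b1 takes 5 seconds to return 500 rows. Those run times are acceptable to me, but the "combined" query above runs for as long as 20 minutes.

I think the final combining (UNION) step is where this query spends too much time. I have tried UNION ALL, but performance did not improve much. It would be great if I could somehow transform a1 and a2 into the b1 schema without using UNION, but I have not been able to do that.

Any help would be much appreciated. Thanks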
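Note on the two operators mentioned above: UNION removes duplicate rows from the combined result, while UNION ALL only concatenates the branches. Since the query does not select revenue_month, two months with an identical amount would collapse into a single row under UNION. The following is a minimal sketch of that behaviour, assuming Python's built-in sqlite3 module as a stand-in for Redshift; the table contents are made up for illustration and are not from the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-aggregated data: two months that happen to share the same revenue.
conn.execute("CREATE TABLE a1 (revenue_month TEXT, revenue INTEGER, cost1 INTEGER)")
conn.executemany(
    "INSERT INTO a1 VALUES (?, ?, ?)",
    [("2020-01", 100, 40), ("2020-02", 100, 55)],
)

# Same shape as the question's query: revenue_month is not selected.
query = """
SELECT 'a1' AS data_set, 'revenue' AS amount_type, revenue AS amount FROM a1
{op}
SELECT 'a1' AS data_set, 'cost1' AS amount_type, cost1 AS amount FROM a1
"""

union_rows = conn.execute(query.format(op="UNION")).fetchall()
union_all_rows = conn.execute(query.format(op="UNION ALL")).fetchall()

print(len(union_rows))      # 3: the two identical 'revenue' rows were merged
print(len(union_all_rows))  # 4: every row is kept
```

Apart from any speed difference, this suggests UNION ALL (or adding revenue_month to the select list) is the safer choice for this kind of reshaping.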

【Question comments】:

【Answer 1】:

You basically want to unpivot the a1 and a2 tables.

I would do it like this:

WITH
     seq (idx) AS (
         select 'revenue' UNION ALL
         select 'cost1' UNION ALL
         select 'cost2' UNION ALL
         select 'cost3'
     ),
     a1 AS (SELECT
                revenue_month,
                SUM(revenue) AS revenue,
                SUM(cost1) AS cost1,
                SUM(cost2) AS cost2,
                SUM(cost3) AS cost3
            FROM orders1
            GROUP BY revenue_month),

     a2 AS (SELECT
                revenue_month,
                SUM(revenue) AS revenue,
                SUM(cost1) AS cost1,
                SUM(cost2) AS cost2,
                SUM(cost3) AS cost3
            FROM orders2
            GROUP BY revenue_month),

     b1 AS (SELECT
                revenue_month,
                amount_type,
                SUM(amount) AS amount
            FROM monthly
            GROUP BY revenue_month,amount_type)

SELECT
    'a1' AS data_set,
    seq.idx AS amount_type,
    CASE seq.idx
       WHEN 'revenue' THEN a1.revenue
       WHEN 'cost1' THEN a1.cost1
       WHEN 'cost2' THEN a1.cost2
       WHEN 'cost3' THEN a1.cost3
    END AS amount
FROM a1 CROSS JOIN seq

UNION ALL

SELECT
    'a2' AS data_set,
    seq.idx AS amount_type,
    CASE seq.idx
        WHEN 'revenue' THEN a2.revenue
        WHEN 'cost1' THEN a2.cost1
        WHEN 'cost2' THEN a2.cost2
        WHEN 'cost3' THEN a2.cost3
    END AS amount
FROM a2 CROSS JOIN seq

UNION ALL

SELECT 
       'b1' AS data_set, 
       b1.amount_type, 
       b1.amount 
FROM b1

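【Comments】:

Thanks @botchniaque for the quick answer. I have tested that the CROSS JOIN approach transforms a1 and a2 correctly. However, the final query, which runs with the UNION ALL of a1, a2 and b1, throws the error XX000: This type of correlated subquery pattern is not supported due to internal error. I have checked this link, but this query pattern is not in that list. Any idea what is causing it?

What is the exact query you are using? Are you using my example directly? Maybe moving the unpivoting of a1 and a2 into their own CTEs (with a1_unpivot as (...), a2_unpivot as (...)) would help?

I am puzzled by the error you are getting, because there is no subquery there.

Hi, thanks for the prompt reply! Really appreciated. I actually tried exactly what you suggested, moving the "unpivoting" into their own CTEs, but strangely the same error was thrown. When CROSS JOIN and UNION are both part of the query, something goes wrong in the way Redshift reads it, and it always fails with the same error, while the individual CTEs run without problems. However, after playing with it for a whole day, what worked for me was to pivot b1 into the result schema of a1 and a2, UNION the three CTEs, and then finally CROSS JOIN to unpivot.

【Answer 2】: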

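Thanks @botchniaque for all the help. Your CROSS JOIN suggestion solved the problem, although there is still something about that query pattern that Redshift cannot read. The final query that worked for me is this: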

WITH a1 AS (SELECT
            revenue_month,
            SUM(revenue) AS revenue,
            SUM(cost1) AS cost1,
            SUM(cost2) AS cost2,
            SUM(cost3) AS cost3
            FROM orders1
            GROUP BY revenue_month),

     a2 AS (SELECT
            revenue_month,
            SUM(revenue) AS revenue,
            SUM(cost1) AS cost1,
            SUM(cost2) AS cost2,
            SUM(cost3) AS cost3
            FROM orders2
            GROUP BY revenue_month),

     b1 AS (SELECT
            revenue_month,
            SUM(CASE WHEN amount_type = 'revenue' THEN amount ELSE 0 END) AS revenue,
            SUM(CASE WHEN amount_type = 'cost1' THEN amount ELSE 0 END) AS cost1,
            SUM(CASE WHEN amount_type = 'cost2' THEN amount ELSE 0 END) AS cost2,
            SUM(CASE WHEN amount_type = 'cost3' THEN amount ELSE 0 END) AS cost3

            FROM (SELECT
                  revenue_month,
                  amount_type,
                  SUM(amount) AS amount
                  FROM monthly
                  GROUP BY revenue_month,amount_type) AS b0
                  
            GROUP BY revenue_month)

SELECT
ab.data_set,
ab.revenue_month,
seq.amount_type,
CASE seq.amount_type
    WHEN 'revenue' THEN ab.revenue
    WHEN 'cost1' THEN ab.cost1
    WHEN 'cost2' THEN ab.cost2
    WHEN 'cost3' THEN ab.cost3
END AS amount

FROM            
            
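-- data_set is added to each branch so that ab.data_set in the outer SELECT resolves
(SELECT 'a1' AS data_set, a1.revenue_month, a1.revenue, a1.cost1, a1.cost2, a1.cost3 FROM a1 UNION ALL
 SELECT 'a2' AS data_set, a2.revenue_month, a2.revenue, a2.cost1, a2.cost2, a2.cost3 FROM a2 UNION ALL
 SELECT 'b1' AS data_set, b1.revenue_month, b1.revenue, b1.cost1, b1.cost2, b1.cost3 FROM b1) AS ab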
 
 CROSS JOIN (SELECT 'revenue' AS amount_type UNION ALL
             SELECT 'cost1' AS amount_type UNION ALL
             SELECT 'cost2' AS amount_type UNION ALL
             SELECT 'cost3' AS amount_type) AS seq

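Basically, it first pivots b1 into the same schema as a1 and a2. It then combines all three datasets into ab with UNION ALL. Finally, it uses CROSS JOIN to unpivot the combined dataset.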
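The unpivot step described above can be tried on a small scale. This is only a sketch: it assumes Python's built-in sqlite3 module as a stand-in for Redshift, uses a hypothetical two-row ab table with made-up numbers, and keeps just two of the four amount columns for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical wide table standing in for the combined "ab" dataset.
conn.execute(
    "CREATE TABLE ab (data_set TEXT, revenue_month TEXT, revenue INTEGER, cost1 INTEGER)"
)
conn.executemany(
    "INSERT INTO ab VALUES (?, ?, ?, ?)",
    [("a1", "2020-01", 100, 40), ("a2", "2020-01", 70, 30)],
)

# Each wide row is repeated once per amount_type by the CROSS JOIN,
# and the CASE expression picks the matching column for that copy.
unpivot_sql = """
SELECT ab.data_set,
       ab.revenue_month,
       seq.amount_type,
       CASE seq.amount_type
           WHEN 'revenue' THEN ab.revenue
           WHEN 'cost1'   THEN ab.cost1
       END AS amount
FROM ab
CROSS JOIN (SELECT 'revenue' AS amount_type UNION ALL
            SELECT 'cost1' AS amount_type) AS seq
ORDER BY ab.data_set, seq.amount_type
"""

rows = conn.execute(unpivot_sql).fetchall()
for row in rows:
    print(row)
```

Each input row yields one output row per entry in seq, so the wide table is scanned once instead of once per amount column.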

【Comments】:
