SQL query performance improvement for WITH query in DB2

【Posted】: 2018-07-11 18:47:49 【Question】:

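The query I have given in the example runs very slowly. I have close to 4 million records in the my_task table.

Can we make any kind of performance improvement to it?

Take the following table as an example.

Here I have put plain numbers in start_dt and end_dt instead of timestamp values.

Additional note: where end_dt is empty, the row is an active record that an employee is currently working on.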

T_ID |start_dt |end_dt |code       |p_id
-----|---------|-------|-----------|---
1    |8        |4      |INPROGRESS |110
1    |4        |       |ASSIGNED   |110
4    |10       |4      |INPROGRESS |110
4    |4        |       |ASSIGNED   |110
5    |4        |4      |INPROGRESS |110
6    |12       |12     |INPROGRESS |110
6    |8        |8      |ASSIGNED   |110
6    |8        |       |DONE       |110
2    |12       |12     |INPROGRESS |210
2    |8        |8      |ASSIGNED   |210
2    |8        |       |DONE       |210
3    |12       |12     |INPROGRESS |111

The output looks like this:

P_ID |avg_bgn_diff |assigned |in_progress |completed | comp_diff
-----|-------------|---------|------------|----------|----------
110  | 4           |   2     |    1       |     1    |      10
210  | null        |   0     |    0       |     1    |      8
111  | null        |   0     |    1       |     0    |      null

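Output explanation: I have masked the original query with made-up names, so some table references may be broken; I apologize for that in advance.

- The MY_TASK table has a unique T_ID.
- The MY_PEOPLE table is the employee table.
- The MY_TASK_REF table contains the details of who has which task.
- A task has statuses, because every status change creates a record in the task table. The statuses are, for example, ASSIGNED, INPROGRESS, and DONE.
- A row without an END_DT is an active record.
- avg_bgn_diff (the first output field): the average time of all active (END_DT is null) ASSIGNED tasks.
- assigned | in_progress | completed: how many active tasks each employee has in each category.
- comp_diff: the average completion time per employee. An employee starts working when the record goes to INPROGRESS. The average covers the tasks that reached status DONE today, measured from the start date of INPROGRESS to the start date of DONE.

I have the following query: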

WITH a AS (
    SELECT
        t1.t_id AS t_id,
        t1.start_dt AS start_dt,
        t1.end_dt AS end_dt,
        t1.code AS code,
        t2.p_id AS p_id
    FROM
        my_task t2
        INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
        INNER JOIN my_people p1 ON t2.p_id = p1.p_id
    WHERE
        -- ignore DONE tasks
        t1.t_id NOT IN (
            SELECT t.t_id
            FROM my_task t
            WHERE t.code = 'DONE' AND trunc(t.execution_dt) < trunc(current_timestamp)
        )
        and p1.department_id = '1234' 
    ORDER BY p_id DESC
) SELECT
    d.p_id,
    d.avg_bgn_diff
    ,e.assigned
    ,e.in_progress
    ,e.completed
    ,g.comp_diff
  FROM
  -- find average time for persons for diff ASSIGNMENT
    (
        SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
        FROM(
                SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
                FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
                LEFT OUTER JOIN  ( SELECT p_id, t_id,start_dt FROM a WHERE 
                     code = 'ASSIGNED' AND   end_dt IS NULL ) x ON x.p_id = b.p_id
            ) c  GROUP BY C.p_id
    ) d
    -- find count of each codes person has
    INNER JOIN (
        SELECT 
            p_id,
            SUM( CASE WHEN code = 'ASSIGNED' THEN 1 ELSE 0 END ) AS assigned,
            SUM( CASE WHEN code = 'INPROGRESS' THEN 1 ELSE 0 END ) AS in_progress,
            SUM( CASE WHEN code = 'DONE' AND trunc(start_dt) = trunc(current_timestamp)
                    THEN 1 ELSE 0 END ) AS completed
        FROM
            a where end_dt IS NULL
        GROUP BY p_id
    ) e on D.p_id=E.p_id 
    -- find total avg diff of entire task took to complete.
    LEFT OUTER JOIN (
        SELECT F.p_id,AVG(f.bgn_diff) AS comp_diff
        FROM
            (
                SELECT a.p_id, timestampdiff(4,b.start_dt - a.start_dt) AS bgn_diff
                FROM (
                        SELECT p_id, t_id, start_dt FROM a WHERE code = 'INPROGRESS'
                    ) a
                    INNER JOIN (
                        SELECT p_id, t_id, start_dt FROM a
                        WHERE code = 'DONE' AND   trunc(start_dt) = trunc(current_timestamp)
                    ) b ON a.t_id = b.t_id
            ) f GROUP BY F.p_id
    ) g ON D.p_id=G.p_id
WITH ur;

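Can we write this in a different way to improve performance?

Note: indexes exist on all the necessary columns.

Thanks in advance.

【Question comments】:

For starters, try replacing the NOT IN with a left join.

@DanielMarcus I had that idea. Any other changes?

Why are there so many nested queries? I feel a lot of them could also be rewritten with left joins - for example, at the end you select from table "a" three times; you should be able to do a single select and use conditional logic if needed.

@DanielMarcus I thought about it for a while but could not come up with a complete left-join query. Could you write the "find average time for persons for diff ASSIGNMENT" part as an example for me?

Please explain how your data set matches up with your query results - I see what appear to be a lot of inconsistencies.

【Answer 1】: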

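If you provide the query's EXPLAIN plan, the list of indexes, and a better explanation of what you are trying to do (and correct the syntax error with table reference c), we could certainly do better, but this version of the query will probably speed things up.

Note the comments throughout!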

WITH Filtered_Task AS (SELECT My_Task_Ref.t_id,
                                My_Task_Ref.start_dt, My_Task_Ref.end_dt,
                                My_Task_Ref.code,
                                Task_A.p_id
                         FROM My_Task AS Task_A
                         JOIN My_Task_Ref
                           ON My_Task_Ref.t_id = Task_A.t_id
                         JOIN My_People
                           ON My_People.p_id = My_Task_Ref.p_id
                              AND My_People.department_id = '1234'
                         -- NOT IN should be fine, I just default to NOT EXISTS
                         WHERE NOT EXISTS (SELECT 1
                                           FROM My_Task AS Task_B
                                           WHERE Task_B.t_id = Task_A.t_id
                                           AND Task_B.code = 'DONE'
                                           -- Calling a function on a column can 
                                           -- cause indices to be ignored
                                           AND Task_B.execution_dt < TIMESTAMP(CURRENT_DATE)))

SELECT Average_Time_And_Code_Count.p_id,
       Average_Time_And_Code_Count.average_begin_difference,
       COALESCE(Average_Time_And_Code_Count.assigned, 0),
       COALESCE(Average_Time_And_Code_Count.in_progress, 0),
       COALESCE(Average_Time_And_Code_Count.completed, 0),
       Average_Complete_Time.average_complete_difference
FROM (SELECT p_id,
             -- The join you had previously was almost certainly duplicating 
             -- some rows, distorting the results.
             AVG(CASE WHEN code = 'ASSIGNED' 
                      -- TIMESTAMPDIFF works off an estimate, and will be wrong
                      -- if a task takes more than a month.
                      THEN TIMESTAMPDIFF(4, CURRENT_TIMESTAMP - start_dt) END) AS average_begin_difference,
             SUM(CASE WHEN code = 'ASSIGNED' 
                               THEN 1 END) AS assigned,
             SUM(CASE WHEN code = 'INPROGRESS' 
                               THEN 1 END) AS in_progress,
             SUM(CASE WHEN code = 'DONE' 
                                    AND start_dt >= TIMESTAMP(CURRENT_DATE) 
                               THEN 1 END) AS completed
      FROM Filtered_Task
      WHERE end_dt IS NULL
      GROUP BY p_id) AS Average_Time_And_Code_Count
-- I'm not convinced this measures what you think it does,
-- but I'm not sure what it is you think you _are_ measuring....
LEFT JOIN (SELECT InProgress.p_id,
                  -- Aggregate per person so the join yields one row per p_id
                  AVG(TIMESTAMPDIFF(4, Done.start_dt - InProgress.start_dt)) AS average_complete_difference
           FROM Filtered_Task AS InProgress
           JOIN Filtered_Task AS Done
             ON InProgress.t_id = Done.t_id
                AND Done.code = 'DONE'
                AND Done.start_dt >= TIMESTAMP(CURRENT_DATE)
           WHERE InProgress.code = 'INPROGRESS'
           GROUP BY InProgress.p_id) AS Average_Complete_Time
       ON Average_Complete_Time.p_id = Average_Time_And_Code_Count.p_id
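
The comment in the query above notes that TIMESTAMPDIFF works off an estimate (Db2 documents that it assumes 30 days in a month). As a minimal sketch of one possible alternative, assuming Db2 for LUW and timestamps that share a single time zone without DST shifts, the elapsed minutes can be derived from DAYS and MIDNIGHT_SECONDS instead. The literal timestamps and the names t, start_ts, end_ts, exact_minutes and estimated_minutes are made up for illustration and do not come from the original schema.

```sql
-- Hypothetical sample interval spanning more than one month.
SELECT
    -- Whole days converted to seconds, plus the difference in seconds since
    -- midnight, then converted to minutes (integer division truncates).
    ( (DAYS(end_ts) - DAYS(start_ts)) * 86400
      + MIDNIGHT_SECONDS(end_ts) - MIDNIGHT_SECONDS(start_ts) ) / 60
        AS exact_minutes,
    -- The estimate-based result, shown only for comparison.
    TIMESTAMPDIFF(4, CHAR(end_ts - start_ts)) AS estimated_minutes
FROM (VALUES (TIMESTAMP('2018-01-15-08.00.00'),
              TIMESTAMP('2018-03-20-09.30.00'))) AS t (start_ts, end_ts);
```

If this arithmetic is needed in several places, it could be wrapped in a user-defined function so the query text stays readable.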

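【Comments】:

@Clockwork-Mouse I have added a detailed explanation of the output. "TIMESTAMPDIFF works off an estimate, and will be wrong if a task takes more than a month." What is the solution for this?

@JBaba - I'm going to redirect you to this existing answer. That one covers hours, but you should be able to convert it to minutes. Note that both timestamps must be in the same time zone, and it must be one without DST, otherwise you will have to do a lot of extra complicated math first. Personally, I would recommend creating a function to contain the math for you - TIMESTAMP_DIFF(unit, start, end)

【Answer 2】:

Try removing ORDER BY p_id DESC from the first query; ORDER BY is usually very expensive. Also in the first query, the NOT IN appears to look at the same base table my_task, so I would suggest putting the filter directly in the WHERE clause.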

WITH a AS (
SELECT
    t1.t_id AS t_id,
    t1.start_dt AS start_dt,
    t1.end_dt AS end_dt,
    t1.code AS code,
    t2.p_id AS p_id
FROM
    my_task t2
    INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
    INNER JOIN my_people p1 ON t2.p_id = p1.p_id
WHERE
    -- ignore DONE tasks
    t2.code <> 'DONE' AND trunc(t2.execution_dt) < trunc(current_timestamp)
    and p1.department_id = '1234' )

Also, it is better to try to reduce the depth/number of subqueries. So something like

 SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
    FROM(
            SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
            FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
            LEFT OUTER JOIN  ( SELECT p_id, t_id,start_dt FROM a WHERE 
                 code = 'ASSIGNED' AND   end_dt IS NULL ) x ON x.p_id = b.p_id
        ) c  GROUP BY C.p_id

could become...

SELECT a.p_id,AVG(timestampdiff(4,current_timestamp - a.start_dt)) AS avg_bgn_diff
FROM a
WHERE end_dt IS NULL OR (code = 'ASSIGNED' AND end_dt IS NULL )
GROUP BY a.p_id
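
Note that the predicate `end_dt IS NULL OR (code = 'ASSIGNED' AND end_dt IS NULL)` is logically the same as `end_dt IS NULL`, so the average above runs over every active row, not only the ASSIGNED ones. A minimal sketch of one way to restrict the average while keeping a single pass is conditional aggregation. It uses plain numbers for start_dt as in the question's sample table; the rows in the inline table and the constant 12 standing in for "now" are assumptions made for illustration.

```sql
-- Hypothetical inline data; in the real query "a" is the first CTE.
WITH a (p_id, code, start_dt, end_dt) AS (
    VALUES (110, 'ASSIGNED', 4, CAST(NULL AS INTEGER)),
           (110, 'DONE',     8, CAST(NULL AS INTEGER)),
           (210, 'DONE',     8, CAST(NULL AS INTEGER))
)
SELECT p_id,
       -- CASE yields NULL for non-ASSIGNED rows, and AVG skips NULLs,
       -- so only active ASSIGNED rows contribute to the average.
       AVG(CASE WHEN code = 'ASSIGNED' THEN 12 - start_dt END) AS avg_bgn_diff
FROM a
WHERE end_dt IS NULL
GROUP BY p_id;
```

A person with active rows but no active ASSIGNED row still appears in the result, with a NULL average, which matches the shape of the expected output in the question.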

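【Comments】:

Your suggestion 1 is not possible, because I want to remove all records of tasks that are DONE with a date earlier than current_timestamp. Your suggestion removes only the DONE records, not all of a task's records.

@JBaba If you "want" something, please describe it in your question and don't make people guess - if you forget to specify something, that is not the fault of the people trying to help you!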
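
The requirement described in the comment above (drop every row of a task that has a DONE row dated before today) can also be written as a left-join anti-join, which is the rewrite suggested in the question comments. This is only a sketch under assumptions: the inline table task and its three rows are hypothetical, and the column names t_id, code and execution_dt are borrowed from the NOT IN subquery of the question.

```sql
-- Hypothetical inline data: task 1 was DONE two days ago, task 2 is still open.
WITH task (t_id, code, execution_dt) AS (
    VALUES (1, 'ASSIGNED',   CURRENT_TIMESTAMP - 3 DAYS),
           (1, 'DONE',       CURRENT_TIMESTAMP - 2 DAYS),
           (2, 'INPROGRESS', CURRENT_TIMESTAMP - 1 DAY)
)
SELECT t.t_id, t.code
FROM task AS t
LEFT JOIN task AS done
       ON done.t_id = t.t_id
      AND done.code = 'DONE'
      -- Comparing the bare column keeps the predicate index-friendly.
      AND done.execution_dt < TIMESTAMP(CURRENT_DATE)
-- Keep only rows whose task has no matching DONE row before today.
WHERE done.t_id IS NULL;
```

With this sample data both rows of task 1 are filtered out and only the row of task 2 remains, which is the behaviour the asker describes. Whether this form is faster than NOT IN or NOT EXISTS depends on the optimizer and the available indexes, so it is worth comparing the EXPLAIN plans.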
