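SQL query performance improvement for WITH query in DB2
【Posted】: 2018-07-11 18:47:49
【Question】:
The query shown below runs very slowly. The my_task table holds close to 4 million records.
Is there any kind of performance improvement we can make to it?
Take the following table as an example.
I have used plain numbers for start_dt and end_dt here instead of values in timestamp format.
Note that a row with an empty end_dt is an active record that is currently being handled by a worker.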
T_ID |start_dt |end_dt |code |p_id
-----|---------|-------|-----------|---
1 |8 |4 |INPROGRESS |110
1 |4 | |ASSIGNED |110
4 |10 |4 |INPROGRESS |110
4 |4 | |ASSIGNED |110
5 |4 |4 |INPROGRESS |110
6 |12 |12 |INPROGRESS |110
6 |8 |8 |ASSIGNED |110
6 |8 | |DONE |110
2 |12 |12 |INPROGRESS |210
2 |8 |8 |ASSIGNED |210
2 |8 | |DONE |210
3 |12 |12 |INPROGRESS |111
The output looks like this:
P_ID |avg_bgn_diff |assigned |in_progress |completed | comp_diff
-----|-------------|---------|------------|----------|----------
110 | 4 | 2 | 1 | 1 | 10
210 | null | 0 | 0 | 1 | 8
111 | null | 0 | 1 | 0 | null
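Explanation of the output:
I have masked the original query with made-up names, so some table references may be broken; I apologize in advance for that.
The MY_TASK table has a unique T_ID.
The MY_PEOPLE table is the employee table.
The MY_TASK_REF table contains the details of who has which task.
A task has statuses, because every status change creates a new record in the task table. The statuses are, for example, ASSIGNED, INPROGRESS and DONE.
Wherever END_DT is absent, the row represents an active record.
The first output field, avg_bgn_diff, is the average elapsed time of all active 'ASSIGNED' tasks (those whose END_DT is null).
The output fields assigned | in_progress | completed show how many active tasks each employee has in each category.
comp_diff is the average completion time per employee. An employee starts working when the record moves to INPROGRESS. We average over the tasks that reached status DONE today, taking the difference between the start date of the INPROGRESS record and the start date of the DONE record.
I have the following query: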
WITH a AS (
SELECT
t1.t_id AS t_id,
t1.start_dt AS start_dt,
t1.end_dt AS end_dt,
t1.code AS code,
t2.p_id AS p_id
FROM
my_task t2
INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
INNER JOIN my_people p1 ON t2.p_id = p1.p_id
WHERE
-- ignore DONE tasks
t1.t_id NOT IN (
SELECT t.t_id
FROM my_task t
WHERE t.code = 'DONE' AND trunc(t.execution_dt) < trunc(current_timestamp)
)
and p1.department_id = '1234'
ORDER BY p_id DESC
) SELECT
d.p_id,
d.avg_bgn_diff
,e.assigned
,e.in_progress
,e.completed
,g.comp_diff
FROM
-- find average time for persons for diff ASSIGNMENT
(
SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
FROM(
SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
LEFT OUTER JOIN ( SELECT p_id, t_id,start_dt FROM a WHERE
code = 'ASSIGNED' AND end_dt IS NULL ) x ON x.p_id = b.p_id
) c GROUP BY C.p_id
) d
-- find count of each codes person has
INNER JOIN (
SELECT
p_id,
SUM( CASE WHEN code = 'ASSIGNED' THEN 1 ELSE 0 END ) AS assigned,
SUM( CASE WHEN code = 'INPROGRESS' THEN 1 ELSE 0 END ) AS in_progress,
SUM( CASE WHEN code = 'DONE' AND trunc(start_dt) = trunc(current_timestamp)
THEN 1 ELSE 0 END ) AS completed
FROM
a where end_dt IS NULL
GROUP BY p_id
) e on D.p_id=E.p_id
-- find total avg diff of the time the entire task took to complete.
LEFT OUTER JOIN (
SELECT F.p_id,AVG(f.bgn_diff) AS comp_diff
FROM
(
SELECT a.p_id, timestampdiff(4,b.start_dt - a.start_dt) AS bgn_diff
FROM (
SELECT p_id, t_id, start_dt FROM a WHERE code = 'INPROGRESS'
) a
INNER JOIN (
SELECT p_id, t_id, start_dt FROM a
WHERE code = 'DONE' AND trunc(start_dt) = trunc(current_timestamp)
) b ON a.t_id = b.t_id
) f GROUP BY F.p_id
) g ON D.p_id=G.p_id
WITH ur;
Can we write this in a different way to improve performance?
Note: indexes exist on all the necessary columns.
Thanks in advance.
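【Comments】:
For starters, try replacing the NOT IN with a left join.
@DanielMarcus I had that idea. Any other changes?
Why are there so many nested queries? I feel a lot of them could also be rewritten with left joins. For example, at the end you select from table "a" three times; you should be able to do a single select and use conditional logic where needed.
@DanielMarcus I thought about it for a while but could not come up with a complete left-join query. Could you write the "find average time for persons for diff ASSIGNMENT" part as an example for me?
Please explain how your dataset matches your query results. I see what look like a lot of inconsistencies.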
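As an illustration of the first comment's suggestion, here is a minimal sketch of the NOT IN filter from the question rewritten as a LEFT JOIN anti-join. It assumes the table and column names from the question; the alias done_t is introduced here for illustration only, and the day comparison uses TIMESTAMP(CURRENT_DATE) in place of trunc(). Whether this is actually faster depends on the optimizer and the available indexes, so it is worth comparing the EXPLAIN plans before adopting it.

```sql
-- Sketch only: anti-join form of the "ignore DONE tasks" filter.
SELECT t1.t_id, t1.start_dt, t1.end_dt, t1.code, t2.p_id
FROM my_task t2
INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
INNER JOIN my_people p1 ON t2.p_id = p1.p_id
-- done_t (hypothetical alias) matches tasks that were DONE before today
LEFT OUTER JOIN my_task done_t
       ON done_t.t_id = t1.t_id
      AND done_t.code = 'DONE'
      AND done_t.execution_dt < TIMESTAMP(CURRENT_DATE)
WHERE done_t.t_id IS NULL          -- keep only rows with no such match
  AND p1.department_id = '1234';
```

One behavioural difference to keep in mind: NOT IN returns no rows at all if the subquery produces a NULL t_id, whereas the anti-join (like NOT EXISTS) simply ignores such rows.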
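【Answer 1】:
If you provide the query's EXPLAIN plan, a list of the indexes, and a better explanation of what you are trying to do (and correct the syntax error with table reference c), we could certainly do better, but this version of the query will probably speed things up.
Note the comments throughout!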
WITH Filtered_Task AS (SELECT My_Task_Ref.t_id,
My_Task_Ref.start_dt, My_Task_Ref.end_dt,
My_Task_Ref.code,
Task_A.p_id
FROM My_Task AS Task_A
JOIN My_Task_Ref
ON My_Task_Ref.t_id = Task_A.t_id
JOIN My_People
ON My_People.p_id = My_Task_Ref.p_id
AND My_People.department_id = '1234'
-- NOT IN should be fine, I just default to NOT EXISTS
WHERE NOT EXISTS (SELECT 1
FROM My_Task AS Task_B
WHERE Task_B.t_id = Task_A.t_id
AND Task_B.code = 'DONE'
-- Calling a function on a column can
-- cause indices to be ignored
AND Task_B.execution_dt < TIMESTAMP(CURRENT_DATE)))
SELECT Average_Time_And_Code_Count.p_id,
Average_Time_And_Code_Count.average_begin_difference,
COALESCE(Average_Time_And_Code_Count.assigned, 0),
COALESCE(Average_Time_And_Code_Count.in_progress, 0),
COALESCE(Average_Time_And_Code_Count.completed, 0),
Average_Complete_Time.average_complete_difference
FROM (SELECT p_id,
-- The join you had previously was almost certainly duplicating
-- some rows, distorting the results.
AVG(CASE WHEN code = 'ASSIGNED'
-- TIMESTAMPDIFF works off an estimate, and will be wrong
-- if a task takes more than a month.
THEN TIMESTAMPDIFF(4, CURRENT_TIMESTAMP - start_dt) END) AS average_begin_difference,
SUM(CASE WHEN code = 'ASSIGNED'
THEN 1 END) AS assigned,
SUM(CASE WHEN code = 'INPROGRESS'
THEN 1 END) AS in_progress,
SUM(CASE WHEN code = 'DONE'
AND start_dt >= TIMESTAMP(CURRENT_DATE)
THEN 1 END) AS completed
FROM Filtered_Task
WHERE end_dt IS NULL
GROUP BY p_id) AS Average_Time_And_Code_Count
-- I'm not convinced this measures what you think it does,
-- but I'm not sure what it is you think you _are_ measuring....
-- AVG and GROUP BY give one row per person; p_id is qualified to avoid ambiguity
LEFT JOIN (SELECT InProgress.p_id,
AVG(TIMESTAMPDIFF(4, Done.start_dt - InProgress.start_dt)) AS average_complete_difference
FROM Filtered_Task AS InProgress
JOIN Filtered_Task AS Done
ON InProgress.t_id = Done.t_id
AND Done.code = 'DONE'
AND Done.start_dt >= TIMESTAMP(CURRENT_DATE)
WHERE InProgress.code = 'INPROGRESS'
GROUP BY InProgress.p_id) AS Average_Complete_Time
ON Average_Complete_Time.p_id = Average_Time_And_Code_Count.p_id
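The comment in the query above warns that TIMESTAMPDIFF works from an estimate. One possible way around that, sketched here under the assumption that both timestamps are in the same time zone with no DST change between them, is to compute the difference from DAYS and MIDNIGHT_SECONDS. The column names start_ts and end_ts and the two literal timestamps are made up for illustration.

```sql
-- Sketch only: whole minutes between two timestamps without TIMESTAMPDIFF.
-- Total seconds are computed first, then divided, so that a negative
-- time-of-day difference does not skew the integer division.
SELECT (BIGINT(DAYS(end_ts) - DAYS(start_ts)) * 86400
        + (MIDNIGHT_SECONDS(end_ts) - MIDNIGHT_SECONDS(start_ts))) / 60
       AS elapsed_minutes
FROM (VALUES (TIMESTAMP('2018-05-01-09.00.00'),
              TIMESTAMP('2018-07-11-18.47.49'))) AS t (start_ts, end_ts);
```

The same expression could replace the TIMESTAMPDIFF(4, ...) calls inside the AVG aggregates, or be wrapped in a user-defined function if it is needed in several places.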
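【Comments】:
@Clockwork-Mouse I have added a detailed explanation of the output. "TIMESTAMPDIFF works off an estimate, and will be wrong if a task takes more than a month." What is the workaround for this?
@JBaba - I'm going to redirect you to this existing answer. That one covers hours, but you should be able to convert it to minutes. Note that both timestamps must be in the same time zone, and one without DST, or you will have to do a lot of extra complicated math first. Personally, I'd recommend creating a function that wraps the math for you - TIMESTAMP_DIFF(unit, start, end).
【Answer 2】: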
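Try removing ORDER BY p_id DESC from the first query; ORDER BY is usually quite expensive. Also in the first query, the NOT IN appears to look at the same base table, my_task, so I suggest putting the filter directly in the WHERE clause.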
WITH a AS (
SELECT
t1.t_id AS t_id,
t1.start_dt AS start_dt,
t1.end_dt AS end_dt,
t1.code AS code,
t2.p_id AS p_id
FROM
my_task t2
INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
INNER JOIN my_people p1 ON t2.p_id = p1.p_id
WHERE
-- ignore DONE tasks
t2.code <> 'DONE' AND trunc(t2.execution_dt) < trunc(current_timestamp)
and p1.department_id = '1234' )
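The trunc() calls in the block above wrap the column in a function, which Answer 1 notes can cause indexes to be ignored. A minimal sketch of the equivalent range comparisons follows, reusing the TIMESTAMP(CURRENT_DATE) form from Answer 1 and the table and column names from the question. It assumes execution_dt and start_dt are TIMESTAMP columns; whether an index is then used still depends on the actual index definitions.

```sql
-- Sketch only: day comparisons without applying a function to the column.

-- "before today": same rows as trunc(execution_dt) < trunc(current_timestamp)
SELECT t.t_id
FROM my_task t
WHERE t.code = 'DONE'
  AND t.execution_dt < TIMESTAMP(CURRENT_DATE);

-- "today": same rows as trunc(start_dt) = trunc(current_timestamp)
SELECT r.t_id
FROM my_task_ref r
WHERE r.code = 'DONE'
  AND r.start_dt >= TIMESTAMP(CURRENT_DATE)
  AND r.start_dt <  TIMESTAMP(CURRENT_DATE + 1 DAY);
```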
Also, it is a good idea to reduce the depth and number of subqueries. So something like
SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
FROM(
SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
LEFT OUTER JOIN ( SELECT p_id, t_id,start_dt FROM a WHERE
code = 'ASSIGNED' AND end_dt IS NULL ) x ON x.p_id = b.p_id
) c GROUP BY C.p_id
could become...
SELECT a.p_id,AVG(timestampdiff(4,current_timestamp - a.start_dt)) AS
avg_bgn_diff
FROM a
WHERE end_dt IS NULL OR (code = 'ASSIGNED' AND end_dt IS NULL )
GROUP BY a.p_id
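【Comments】:
Your suggestion 1 is not possible, because I want to remove all records of any task that was completed before current_timestamp, whereas your suggestion removes only the DONE records, not all of them.
@JBaba If you "want" something, describe it in your question and don't make people guess - if you forget to specify something, that is not the fault of the people trying to help you!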