Query optimization with inner query
Posted: 2014-03-25 21:53:25
[Question]:
I have a query that takes a very long time, and I am raising it here in the hope that I have missed something. Here is the query (it basically says "give me all funds that have at least one position"):
SELECT org_name.legacy_id,
       org_name.`name`,
       org_desc.description,
       org_name.instrument_style_code,
       org_name.investment_orientation,
       org_name.is_active,
       org_name.organization_id,
       mgr_org.eng_name as manager_name,
       mgrs.manager_org_id as manager_organization_id,
       mgrs.manager_legacy_id as manager_legacy_id
FROM ownership_organization_names org_name
INNER JOIN (SELECT fund.legacy_id
            FROM ownership_organization_names fund
            INNER JOIN ownership_ownerships own
                    ON fund.legacy_id = own.legacy_id
            LEFT JOIN ownership_unconsolidated_holding_positions pos
                   ON own.ownership_id = pos.ownership_id
            GROUP BY fund.legacy_id
            HAVING COUNT(pos.holding_position_id) > 0) funds_with_positions
        ON funds_with_positions.legacy_id = org_name.legacy_id
LEFT JOIN ownership_organization_descriptions org_desc
       ON org_name.legacy_id = org_desc.legacy_id
LEFT JOIN ownership_fund_mgrs mgrs
       ON org_name.legacy_id = mgrs.fund_legacy_id
LEFT JOIN organization mgr_org
       ON mgr_org.id = mgrs.manager_org_id
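The inner query on its own takes 42 seconds duration and 320 seconds fetch time (that doesn't sound right!) and returns 135,683 rows.
The whole query takes 372 seconds duration and 2 seconds fetch time (which definitely doesn't sound right).
Here is the EXPLAIN output for the query (duration 350 seconds); apologies for the formatting (or lack of it):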
id | select_type | table      | type   | possible_keys    | key              | key_len | ref                                  | rows   | Extra
1  | PRIMARY     | <derived2> | ALL    |                  |                  |         |                                      | 135683 |
1  | PRIMARY     | org_name   | ref    | PRIMARY          | PRIMARY          | 8       | funds_with_positions.legacy_id       | 22303  |
1  | PRIMARY     | org_desc   | eq_ref | PRIMARY          | PRIMARY          | 8       | funds_with_positions.legacy_id       | 1      |
1  | PRIMARY     | mgrs       | ref    | PRIMARY          | PRIMARY          | 8       | people_directory.org_name.legacy_id  | 665    |
1  | PRIMARY     | mgr_org    | eq_ref | PRIMARY          | PRIMARY          | 8       | people_directory.mgrs.manager_org_id | 1      |
2  | DERIVED     | fund       | index  | PRIMARY          | PRIMARY          | 16      |                                      | 46728  | Using index
2  | DERIVED     | own        | ref    | legacy_id_idx    | legacy_id_idx    | 9       | people_directory.fund.legacy_id      | 15     | Using where
2  | DERIVED     | pos        | ref    | ownership_id_idx | ownership_id_idx | 9       | people_directory.own.ownership_id    | 3      |
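I have indexed every join column, and I got a huge performance gain by moving the subquery into an INNER JOIN instead of the WHERE clause.
I also tried creating an indexed temporary table and joining to it. Populating it took about 360 seconds, but the outer joins on top of it then became trivial (about 1 second). That tells me the inner query is where the time goes, but I am not sure what else I can do to optimize it.
I come from a Microsoft SQL Server background and am assuming the same general principles apply. I have seen various threads about changing the database storage engine and tuning buffer sizes, but before going down that route I would like to know whether I have exhausted every way of optimizing the query itself.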
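Update: In the end, the biggest performance gain came from removing an unnecessary join in my inner query, which took it from about 360 seconds down to about 70 seconds. However, trying some of the other, logically equivalent optimizations produced some interesting quirks.
As suggested, I tried: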
SELECT
    org_name.legacy_id,
    org_name.`name`,
    org_desc.description,
    org_name.instrument_style_code,
    org_name.investment_orientation,
    org_name.is_active,
    org_name.organization_id,
    mgr_org.eng_name as manager_name,
    mgrs.manager_org_id as manager_organization_id,
    mgrs.manager_legacy_id as manager_legacy_id
FROM ownership_organization_names org_name
INNER JOIN (SELECT own.legacy_id
            FROM ownership_ownerships own
            WHERE EXISTS (SELECT 1
                          FROM ownership_unconsolidated_holding_positions pos
                          WHERE own.ownership_id = pos.ownership_id)
           ) funds_with_positions ON funds_with_positions.legacy_id = org_name.legacy_id
LEFT JOIN ownership_organization_descriptions org_desc on org_name.legacy_id = org_desc.legacy_id
LEFT JOIN ownership_fund_mgrs mgrs on org_name.legacy_id = mgrs.fund_legacy_id
LEFT JOIN organization mgr_org on mgr_org.id = mgrs.manager_org_id
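MySQL Workbench reported a query duration of 242.422 seconds, the fetch portion timed out, and the client returned the error "Error Code: 2008 MySQL client ran out of memory".
Moving the WHERE EXISTS style subquery into the WHERE clause did eventually return, but it took 0.234 seconds duration / 157.781 seconds fetch. I suspect those numbers are not accurate at all.
I am curious about the thinking behind this optimization of moving the derived table into the WHERE clause as a subquery. Wouldn't INNER JOINing to the derived table early reduce the number of rows the query has to work with, rather than filtering later in the WHERE clause?
Admittedly, I am not familiar with the WHERE EXISTS operator, or at least I have never thought to use it much. What are its implications for performance and memory usage compared with the subquery/derived-table approach I originally had?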
[Answer 1]:
Focus on the subquery:
(SELECT fund.legacy_id
 FROM ownership_organization_names fund INNER JOIN
      ownership_ownerships own
      ON fund.legacy_id = own.legacy_id LEFT JOIN
      ownership_unconsolidated_holding_positions pos
      ON own.ownership_id = pos.ownership_id
 GROUP BY fund.legacy_id
 HAVING COUNT(pos.holding_position_id) > 0
) funds_with_positions
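I observe that fund is not needed; you can use own.legacy_id instead. The left outer join is also unnecessary, because you are only looking for matches. This reduces the query to: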
(SELECT own.legacy_id
 FROM ownership_ownerships own JOIN
      ownership_unconsolidated_holding_positions pos
      ON own.ownership_id = pos.ownership_id
 GROUP BY own.legacy_id
 HAVING COUNT(*) > 0
) funds_with_positions
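The equivalence of the original and the simplified subquery can be checked on a toy data set. The following is only a minimal sketch: it assumes SQLite (through Python's sqlite3 module) rather than MySQL, cut-down tables containing just the columns the subquery touches, and made-up sample rows. It says nothing about timing, only that both forms return the same legacy_id values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the tables in the question.
    CREATE TABLE ownership_organization_names (legacy_id INTEGER, name TEXT);
    CREATE TABLE ownership_ownerships (ownership_id INTEGER, legacy_id INTEGER);
    CREATE TABLE ownership_unconsolidated_holding_positions (
        holding_position_id INTEGER NOT NULL, ownership_id INTEGER);

    INSERT INTO ownership_organization_names VALUES
        (1, 'Fund A'), (2, 'Fund B'), (3, 'Fund C');
    INSERT INTO ownership_ownerships VALUES (10, 1), (11, 1), (20, 2), (30, 3);
    -- Fund 1 has three positions over two ownerships, fund 3 has one, fund 2 none.
    INSERT INTO ownership_unconsolidated_holding_positions VALUES
        (100, 10), (101, 10), (102, 11), (300, 30);
""")

# Subquery as written in the question: extra join to fund, LEFT JOIN, COUNT on a column.
original = """
    SELECT fund.legacy_id
    FROM ownership_organization_names fund
    INNER JOIN ownership_ownerships own ON fund.legacy_id = own.legacy_id
    LEFT JOIN ownership_unconsolidated_holding_positions pos
           ON own.ownership_id = pos.ownership_id
    GROUP BY fund.legacy_id
    HAVING COUNT(pos.holding_position_id) > 0
"""

# Simplified subquery: no fund table, plain JOIN instead of LEFT JOIN.
simplified = """
    SELECT own.legacy_id
    FROM ownership_ownerships own
    JOIN ownership_unconsolidated_holding_positions pos
      ON own.ownership_id = pos.ownership_id
    GROUP BY own.legacy_id
    HAVING COUNT(*) > 0
"""

def legacy_ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

original_ids = legacy_ids(original)
simplified_ids = legacy_ids(simplified)
print(original_ids, simplified_ids)  # [1, 3] [1, 3]
```

Dropping fund is only safe because the outer query joins the result back to ownership_organization_names anyway; an ownership row whose legacy_id has no matching organization would show up in the simplified subquery but is discarded by that outer join.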
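This query still requires an explicit aggregation, which can be expensive. For performance, I would be inclined to try the following: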
(SELECT own.legacy_id
 FROM ownership_ownerships own
 WHERE EXISTS (SELECT 1
               FROM ownership_unconsolidated_holding_positions pos
               WHERE own.ownership_id = pos.ownership_id
              )
) funds_with_positions
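The whole subquery is only being used as a filter. So my final suggestion is to remove the subquery entirely and add the following where clause to the outer query: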
WHERE EXISTS (SELECT 1
              FROM ownership_ownerships own
              WHERE own.legacy_id = org_name.legacy_id AND
                    EXISTS (SELECT 1
                            FROM ownership_unconsolidated_holding_positions pos
                            WHERE own.ownership_id = pos.ownership_id
                           )
             )
I am assuming the tables all have the right indexes for this. For this piece of the query, you want an index on ownership_unconsolidated_holding_positions(ownership_id) and one on ownership_ownerships(legacy_id, ownership_id).
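As a rough illustration of the final suggestion, the sketch below creates the two indexes and runs the nested EXISTS filter against the outer table. It is not the setup from the question: it assumes SQLite (through Python's sqlite3 module) instead of MySQL, cut-down tables, made-up sample rows, and invented index names (idx_pos_ownership, idx_own_legacy_ownership). The outer alias is org_name, as in the question's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the tables in the question.
    CREATE TABLE ownership_organization_names (legacy_id INTEGER, name TEXT);
    CREATE TABLE ownership_ownerships (ownership_id INTEGER, legacy_id INTEGER);
    CREATE TABLE ownership_unconsolidated_holding_positions (
        holding_position_id INTEGER NOT NULL, ownership_id INTEGER);

    -- The two indexes suggested above; the index names are made up.
    CREATE INDEX idx_pos_ownership
        ON ownership_unconsolidated_holding_positions (ownership_id);
    CREATE INDEX idx_own_legacy_ownership
        ON ownership_ownerships (legacy_id, ownership_id);

    INSERT INTO ownership_organization_names VALUES
        (1, 'Fund A'), (2, 'Fund B'), (3, 'Fund C');
    INSERT INTO ownership_ownerships VALUES (10, 1), (11, 1), (20, 2), (30, 3);
    INSERT INTO ownership_unconsolidated_holding_positions VALUES
        (100, 10), (101, 10), (102, 11), (300, 30);
""")

# The derived table is gone; the filter is a correlated EXISTS on the outer row.
exists_filter = """
    SELECT org_name.legacy_id, org_name.name
    FROM ownership_organization_names org_name
    WHERE EXISTS (SELECT 1
                  FROM ownership_ownerships own
                  WHERE own.legacy_id = org_name.legacy_id
                    AND EXISTS (SELECT 1
                                FROM ownership_unconsolidated_holding_positions pos
                                WHERE own.ownership_id = pos.ownership_id))
    ORDER BY org_name.legacy_id
"""

rows = conn.execute(exists_filter).fetchall()
print(rows)  # [(1, 'Fund A'), (3, 'Fund C')]
```

Each organization appears at most once, however many ownerships or positions it has, so no GROUP BY is needed. Whether the optimizer actually uses the indexes, and how fast the result is, depends on the database engine and version and is not shown here.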
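[Comments]:
Great observation about the extra, unnecessary join in the inner query; that alone gave a nice boost! However, the WHERE EXISTS approach produced some interesting quirks (I have added them at the end of my original question), and I hope you can offer some insight into how this operator behaves in terms of performance.
@manning18 . . . The intent was for the where exists to go in the outermost query, not in a subquery.
Sorry, I should have mentioned that I did move the WHERE EXISTS into the outer query, but it took about 160 seconds. In the update I showed the WHERE EXISTS in the inner query because that version gave the MySQL Workbench out-of-memory exception, and I am more curious about what causes it.
[Answer 2]: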
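Assuming pos.holding_position_id is not nullable, COUNT(pos.holding_position_id) > 0 is true as soon as there is a matching record in ownership_unconsolidated_holding_positions, so you should not really be using a LEFT OUTER JOIN. Use a plain JOIN instead, since it filters rows out earlier in the game.
As your description of the problem already says, the subquery is only there to find out whether any funds are available for a given organization. It sounds like you would be better off with the more readable WHERE EXISTS(). An extra benefit is that you no longer need the aggregation to avoid duplicates.
Also, the aliases fund and org_name both refer to the same table. Is that on purpose, because multiple records can share the same legacy_id? (Quite possible!) Or do both always refer to the same record? If the latter is true, you may be able to optimize the query further.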
SELECT org_name.legacy_id,
       org_name.`name`,
       org_desc.description,
       org_name.instrument_style_code,
       org_name.investment_orientation,
       org_name.is_active,
       org_name.organization_id,
       mgr_org.eng_name as manager_name,
       mgrs.manager_org_id as manager_organization_id,
       mgrs.manager_legacy_id as manager_legacy_id
FROM ownership_organization_names org_name
LEFT JOIN ownership_organization_descriptions org_desc
       ON org_name.legacy_id = org_desc.legacy_id
LEFT JOIN ownership_fund_mgrs mgrs
       ON org_name.legacy_id = mgrs.fund_legacy_id
LEFT JOIN organization mgr_org
       ON mgr_org.id = mgrs.manager_org_id
WHERE EXISTS ( SELECT *
               FROM ownership_organization_names fund
               JOIN ownership_ownerships own
                 ON fund.legacy_id = own.legacy_id
               JOIN ownership_unconsolidated_holding_positions pos
                 ON own.ownership_id = pos.ownership_id
               WHERE fund.legacy_id = org_name.legacy_id )
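The point about duplicates can be made concrete with a small experiment. This is a sketch under stated assumptions, not a reproduction of the asker's environment: SQLite (through Python's sqlite3 module) stands in for MySQL, the tables are cut down, and the sample rows are made up. It compares joining to a derived table that is not aggregated (the form tried in the question's update) with the EXISTS filter above. Whether the extra rows contributed to the long fetch time or the out-of-memory error reported in the question is a guess that the thread does not confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the tables in the question.
    CREATE TABLE ownership_organization_names (legacy_id INTEGER, name TEXT);
    CREATE TABLE ownership_ownerships (ownership_id INTEGER, legacy_id INTEGER);
    CREATE TABLE ownership_unconsolidated_holding_positions (
        holding_position_id INTEGER NOT NULL, ownership_id INTEGER);

    INSERT INTO ownership_organization_names VALUES
        (1, 'Fund A'), (2, 'Fund B'), (3, 'Fund C');
    -- Fund 1 has two ownerships, and both of them have positions.
    INSERT INTO ownership_ownerships VALUES (10, 1), (11, 1), (20, 2), (30, 3);
    INSERT INTO ownership_unconsolidated_holding_positions VALUES
        (100, 10), (101, 10), (102, 11), (300, 30);
""")

# Derived table without GROUP BY or DISTINCT: one row per qualifying ownership.
join_to_derived = """
    SELECT org_name.legacy_id
    FROM ownership_organization_names org_name
    INNER JOIN (SELECT own.legacy_id
                FROM ownership_ownerships own
                WHERE EXISTS (SELECT 1
                              FROM ownership_unconsolidated_holding_positions pos
                              WHERE own.ownership_id = pos.ownership_id)
               ) funds_with_positions
            ON funds_with_positions.legacy_id = org_name.legacy_id
    ORDER BY org_name.legacy_id
"""

# EXISTS in the outer WHERE clause: a pure filter, it never multiplies rows.
exists_in_where = """
    SELECT org_name.legacy_id
    FROM ownership_organization_names org_name
    WHERE EXISTS (SELECT *
                  FROM ownership_organization_names fund
                  JOIN ownership_ownerships own
                    ON fund.legacy_id = own.legacy_id
                  JOIN ownership_unconsolidated_holding_positions pos
                    ON own.ownership_id = pos.ownership_id
                  WHERE fund.legacy_id = org_name.legacy_id)
    ORDER BY org_name.legacy_id
"""

join_ids = [row[0] for row in conn.execute(join_to_derived)]
exists_ids = [row[0] for row in conn.execute(exists_in_where)]
print(join_ids)    # [1, 1, 3]
print(exists_ids)  # [1, 3]
```

Fund 1 comes back twice from the join because the derived table holds one row per ownership that has positions. The original GROUP BY collapsed those rows; the EXISTS filter never creates them in the first place.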