Duplicate WHERE criteria in joined sub-queries
Posted: 2017-11-24 10:16:03
Question:
I wrote a query that joins fourteen tables. When the criteria match many rows, the query takes a long time. This is the original query, with a large IN criterion:
SELECT r.source_uri AS su_on_r, r.title AS t_on_r, r.subtitle AS s_on_r, r.artist_name AS an_on_r,
  r.asin AS a_on_r, r.country AS c_on_r, r.release_date AS rd_on_r,
  string_agg(DISTINCT barcode.barcode::TEXT, '|') AS b_on_barcode,
  string_agg(DISTINCT genre.genre::TEXT, '|') AS g_on_genre,
  string_agg(DISTINCT typ.type::TEXT, '|') AS t_on_typ,
  string_agg(tag.voted_tag::TEXT, '|') AS vt_on_tag,
  IMAGE.uri AS u_on_image, IMAGE.width AS w_on_image, IMAGE.height AS h_on_image, IMAGE.score AS s_on_image,
  string_agg(DISTINCT imageType.image_type::TEXT, '|') AS it_on_imageType,
  string_agg(tag.votes::TEXT, '|') AS v_on_tag,
  string_agg(DISTINCT url.url::TEXT, '|') AS u_on_url,
  event.label_name AS ln_on_event, event.cat AS c_on_event,
  m.position AS p_on_m, m.title AS t_on_m, m.format AS f_on_m,
  t.position AS p_on_t, t.title AS t_on_t,
  string_agg(DISTINCT t.duration::TEXT, '|') AS d_on_t,
  string_agg(DISTINCT tArtist.artist::TEXT, '|') AS a_on_tArtist,
  string_agg(DISTINCT tComposer.composer::TEXT, '|') AS c_on_tComposer,
  string_agg(DISTINCT tIsrc.isrc::TEXT, '|') AS i_on_tIsrc
FROM release r
LEFT JOIN release_barcode barcode ON r.source_uri = barcode.source_uri
LEFT JOIN release_genre genre ON r.source_uri = genre.source_uri
LEFT JOIN release_type typ ON r.source_uri = typ.source_uri
LEFT JOIN release_voted_tag tag ON r.source_uri = tag.source_uri
LEFT JOIN release_image IMAGE ON r.source_uri = IMAGE.source_uri
LEFT JOIN release_image_type imageType ON IMAGE.id = imageType.image_id
LEFT JOIN release_url url ON r.source_uri = url.source_uri
LEFT JOIN release_event event ON r.source_uri = event.source_uri
LEFT JOIN medium m ON r.source_uri = m.source_uri
LEFT JOIN track t ON m.id = t.medium
LEFT JOIN track_artist tArtist ON t.id = tArtist.track
LEFT JOIN track_composer tComposer ON t.id = tComposer.track
LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
WHERE r.source_uri IN (
'https://api.discogs.com/releases/1955915'
,'https://api.discogs.com/releases/8602631'
,[and so on for about thirty more URIs]
)
GROUP BY su_on_r, t_on_r, s_on_r, an_on_r, a_on_r, c_on_r, rd_on_r, u_on_image, w_on_image, h_on_image, s_on_image, ln_on_event, c_on_event, p_on_m, t_on_m, f_on_m, p_on_t, t_on_t;
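See the EXPLAIN output, which is large because of the GROUP BY statement: https://explain.depesz.com/s/dV5o
You can see that the aggregation works on more than 90k rows. The row count is this large because of the number of joins: the many 1:m tables multiply the row count with every additional join.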
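The row multiplication can be reproduced on a toy scale. The sketch below is self-contained and uses inline VALUES lists in place of the real tables, so the data ('r1', the three genres, the four URLs) is made up purely for illustration; only the table and column names come from the query above.

```sql
-- Toy data only: one release with 3 genres and 4 urls.
WITH release(source_uri) AS (
    VALUES ('r1')
),
release_genre(source_uri, genre) AS (
    VALUES ('r1', 'rock'), ('r1', 'pop'), ('r1', 'jazz')
),
release_url(source_uri, url) AS (
    VALUES ('r1', 'u1'), ('r1', 'u2'), ('r1', 'u3'), ('r1', 'u4')
)
SELECT count(*) AS joined_rows
FROM release r
LEFT JOIN release_genre g ON r.source_uri = g.source_uri
LEFT JOIN release_url   u ON r.source_uri = u.source_uri;
-- joined_rows = 12 (1 release x 3 genres x 4 urls)
```

Each further 1:m table joined on the same key multiplies the count again, which is how roughly thirty releases can end up as tens of thousands of rows before the GROUP BY collapses them.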
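First attempt: moving the aggregation into joined sub-queries
So I wondered how to rewrite the query so that it does not have to combine all of these rows. I decided to write the joins as sub-queries and to move the aggregation into those sub-queries.
My first attempt was (an example for release_barcode, repeated for every table):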
LEFT JOIN (
SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
FROM release_barcode
GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri
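This reduces the number of rows returned, and I no longer need the large sort, because there is no GROUP BY in the top-level query.
However, this is slower! That is because the query planner does not appear to apply the top-level query's criteria first. Instead, the whole tables are joined together.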
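One alternative that is not part of the original post is a correlated LATERAL sub-query, which refers to the outer row directly instead of aggregating the whole table. The following is only a sketch, assuming PostgreSQL 9.3 or later (where LATERAL is available); whether the planner actually produces a better plan for the full fourteen-table query would have to be checked with EXPLAIN.

```sql
-- Sketch: the sub-query is correlated to r, so it only needs the
-- release_barcode rows that belong to the current release.
SELECT r.source_uri AS su_on_r
     , barcode.b_on_barcode
FROM release r
LEFT JOIN LATERAL (
    SELECT string_agg(DISTINCT b.barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode b
    WHERE b.source_uri = r.source_uri
) AS barcode ON true
WHERE r.source_uri IN (
    'https://api.discogs.com/releases/1955915'
   ,'https://api.discogs.com/releases/8602631'
);
```

Because the aggregate has no GROUP BY, the sub-query always yields exactly one row (with NULL when there are no barcodes), so joining it with ON true keeps one output row per release.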
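Next attempt: duplicating the criteria in the sub-queries
So I tried something different: to force the filtering inside each sub-query, I simply duplicated the criteria: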
LEFT JOIN (
SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
FROM release_barcode
WHERE source_uri IN (
'https://api.discogs.com/releases/1955915'
,'https://api.discogs.com/releases/8602631'
,[and so on for about thirty more URIs]
)
GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri
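The WHERE clause is simply repeated in every sub-query.
The results speak for themselves: https://explain.depesz.com/s/exSw
It is a more complex query, but it is 100 times faster!
But of course, repeating the criteria smells very clunky.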
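One way to keep the filter in every sub-query while writing the URI list only once is to put the list in a CTE and reference that CTE everywhere. This is a sketch rather than a tested fix: the CTE name wanted is hypothetical, source_uri is assumed to be a text column, and only two of the joined tables are shown. The resulting plan may differ from the one with literal IN lists, so it should be compared with EXPLAIN.

```sql
-- "wanted" is a hypothetical name; the URI list is written once here.
WITH wanted(source_uri) AS (
    VALUES ('https://api.discogs.com/releases/1955915'::text)
         , ('https://api.discogs.com/releases/8602631')
)
SELECT r.source_uri AS su_on_r
     , barcode.b_on_barcode
     , genre.g_on_genre
FROM release r
LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode
    WHERE source_uri IN (SELECT source_uri FROM wanted)
    GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri
LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT genre::TEXT, '|') AS g_on_genre
    FROM release_genre
    WHERE source_uri IN (SELECT source_uri FROM wanted)
    GROUP BY source_uri
) AS genre ON r.source_uri = genre.source_uri
WHERE r.source_uri IN (SELECT source_uri FROM wanted);
```

The filter is still applied in each sub-query, as in the fast version, but a change to the list now happens in one place.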
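So my question is twofold:
Is there a name for this type of optimization, and is it frowned upon?
Is there a better way to avoid the duplication (see my first attempt)?
Comments: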
stackoverflow.com/help/mcve comes to mind.
Try disabling nested loops with set enable_nestloop to off and re-run the first query.
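A minimal sketch of how that experiment could be run without affecting other sessions. The query is shortened to a single join for readability; the full original query would go in its place. Note that EXPLAIN with ANALYZE really executes the statement, so it takes as long as the query itself.

```sql
BEGIN;
-- SET LOCAL only lasts until the end of this transaction.
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT r.source_uri
     , string_agg(DISTINCT barcode.barcode::TEXT, '|') AS b_on_barcode
FROM release r
LEFT JOIN release_barcode barcode ON r.source_uri = barcode.source_uri
WHERE r.source_uri IN (
    'https://api.discogs.com/releases/1955915'
   ,'https://api.discogs.com/releases/8602631'
)
GROUP BY r.source_uri;

ROLLBACK;
```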
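14 tables is too many for the planner; it will switch to geqo mode (see geqo_threshold and join_collapse_limit). Try to isolate a tight cluster of tables that reference each other, put that cluster into a CTE, and reference this CTE in the main query. (This reduces your query to 7+8 range table entries.)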
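The two settings named in this comment can be inspected and raised per transaction. The sketch below uses arbitrary example values (20 and 16) that are assumptions, not recommendations from the thread; higher values let the planner search more join orders at the cost of longer planning time.

```sql
-- Show the values currently in effect for this session.
SHOW geqo_threshold;
SHOW join_collapse_limit;

BEGIN;
-- Example values only; tune them against the measured planning time.
SET LOCAL geqo_threshold = 20;
SET LOCAL join_collapse_limit = 16;
-- Run EXPLAIN (ANALYZE) on the fourteen-table query here and compare
-- the reported planning and execution times with the original plan.
ROLLBACK;
```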
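@JustMe That seems to have made things worse... I started the query about 30 seconds ago and it is still running...
@jarih Given the number of joins, and that the issue is the repetition of a large criteria list, feel free to suggest ways to make it smaller.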
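Answer 1:
Increase geqo_threshold (or even join_collapse_limit). Note: this could push the planning time above one second.
Reduce the number of range table entries by splitting tightly related tables off into CTEs:
[The row size is rather large.] Avoid fat indexes and fat tables (for example the %uri fields: put them into a separate table and refer to it by a surrogate key).
[A next step could be: put the whole query (without the aggregation) into a third CTE and do the aggregation in the main query.]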
WITH rel AS (
SELECT * FROM release
WHERE source_uri IN (
'https://api.discogs.com/releases/1955915'
,'https://api.discogs.com/releases/8602631'
-- ,[and so on for about thirty more URIs]
)
) -- closes the rel CTE
, media AS (
SELECT *
FROM medium m -- ON r.source_uri = m.source_uri
LEFT JOIN track t ON m.id = t.medium
LEFT JOIN track_artist tArtist ON t.id = tArtist.track
LEFT JOIN track_composer tComposer ON t.id = tComposer.track
LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
)
SELECT r.source_uri AS su_on_r, r.title AS t_on_r, r.subtitle AS s_on_r, r.artist_name AS an_on_r
, r.asin AS a_on_r, r.country AS c_on_r, r.release_date AS rd_on_r
, string_agg(DISTINCT barcode.barcode::TEXT, '|') AS b_on_barcode
, string_agg(DISTINCT genre.genre::TEXT, '|') AS g_on_genre
, string_agg(DISTINCT typ.type::TEXT, '|') AS t_on_typ
, string_agg(tag.voted_tag::TEXT, '|') AS vt_on_tag
, img.uri AS u_on_image, img.width AS w_on_image
, img.height AS h_on_image, img.score AS s_on_image
, string_agg(DISTINCT imageType.image_type::TEXT, '|') AS it_on_imageType
, string_agg(tag.votes::TEXT, '|') AS v_on_tag
, string_agg(DISTINCT url.url::TEXT, '|') AS u_on_url
, event.label_name AS ln_on_event, event.cat AS c_on_event
, m.position AS p_on_m, m.title AS t_on_m, m.format AS f_on_m
, m.position AS p_on_t, m.title AS t_on_t -- <<-- !! need to fix these in the CTE
, string_agg(DISTINCT m.duration::TEXT, '|') AS d_on_t
, string_agg(DISTINCT m.artist::TEXT, '|') AS a_on_tArtist
, string_agg(DISTINCT m.composer::TEXT, '|') AS c_on_tComposer
, string_agg(DISTINCT m.isrc::TEXT, '|') AS i_on_tIsrc
FROM rel r -- <<--- ########################## CTE
LEFT JOIN release_barcode barcode ON r.source_uri = barcode.source_uri
LEFT JOIN release_genre genre ON r.source_uri = genre.source_uri
LEFT JOIN release_type typ ON r.source_uri = typ.source_uri
LEFT JOIN release_voted_tag tag ON r.source_uri = tag.source_uri
LEFT JOIN release_image img ON r.source_uri = img.source_uri
LEFT JOIN release_image_type imageType ON img.id = imageType.image_id
LEFT JOIN release_url url ON r.source_uri = url.source_uri
LEFT JOIN release_event event ON r.source_uri = event.source_uri
LEFT JOIN media m ON r.source_uri = m.source_uri -- <<--- ########################## CTE
GROUP BY su_on_r, t_on_r, s_on_r, an_on_r
, a_on_r, c_on_r, rd_on_r
, u_on_image, w_on_image, h_on_image
, s_on_image, ln_on_event, c_on_event
, p_on_m, t_on_m, f_on_m, p_on_t, t_on_t
;
Note: I made some mistakes when moving terms into the media CTE. There is also some renaming left to do...
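As a rough idea of what that renaming could look like, the sketch below gives every column of the media CTE a unique name, so that medium and track columns such as position and title no longer collide under SELECT *. The output aliases (medium_position, track_title and so on) are hypothetical; the source columns are the ones used in the original query. Only the media-related part of the main query is shown.

```sql
-- Sketch only: the output column aliases are hypothetical.
WITH media AS (
    SELECT m.source_uri
         , m.position  AS medium_position
         , m.title     AS medium_title
         , m.format    AS medium_format
         , t.position  AS track_position
         , t.title     AS track_title
         , t.duration
         , tArtist.artist
         , tComposer.composer
         , tIsrc.isrc
    FROM medium m
    LEFT JOIN track t ON m.id = t.medium
    LEFT JOIN track_artist tArtist ON t.id = tArtist.track
    LEFT JOIN track_composer tComposer ON t.id = tComposer.track
    LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
)
SELECT r.source_uri AS su_on_r
     , m.medium_position AS p_on_m, m.medium_title AS t_on_m, m.medium_format AS f_on_m
     , m.track_position AS p_on_t, m.track_title AS t_on_t
     , string_agg(DISTINCT m.duration::TEXT, '|') AS d_on_t
     , string_agg(DISTINCT m.artist::TEXT, '|') AS a_on_tArtist
     , string_agg(DISTINCT m.composer::TEXT, '|') AS c_on_tComposer
     , string_agg(DISTINCT m.isrc::TEXT, '|') AS i_on_tIsrc
FROM release r
LEFT JOIN media m ON r.source_uri = m.source_uri
WHERE r.source_uri IN (
    'https://api.discogs.com/releases/1955915'
   ,'https://api.discogs.com/releases/8602631'
)
GROUP BY su_on_r, p_on_m, t_on_m, f_on_m, p_on_t, t_on_t;
```

Like the answer's version, this media CTE is not filtered by source_uri, so on PostgreSQL versions that always materialize CTEs (before version 12) it would be built for the whole medium table. Adding the same filter inside the CTE is one way to avoid that.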