Duplicate WHERE criteria in joined sub-queries

Posted: 2017-11-24 10:16:03

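I've written a query that joins fourteen tables. When the criteria match a lot of rows, the query takes a long time. Here is the original query, with a big IN criterion: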

SELECT r.source_uri AS su_on_r, r.title AS t_on_r, r.subtitle AS s_on_r, r.artist_name AS an_on_r,
       r.asin AS a_on_r, r.country AS c_on_r, r.release_date AS rd_on_r,
       string_agg(DISTINCT barcode.barcode::TEXT, '|') AS b_on_barcode,
       string_agg(DISTINCT genre.genre::TEXT, '|') AS g_on_genre,
       string_agg(DISTINCT typ.type::TEXT, '|') AS t_on_typ,
       string_agg(tag.voted_tag::TEXT, '|') AS vt_on_tag,
       IMAGE.uri AS u_on_image, IMAGE.width AS w_on_image, IMAGE.height AS h_on_image, IMAGE.score AS s_on_image,
       string_agg(DISTINCT imageType.image_type::TEXT, '|') AS it_on_imageType,
       string_agg(tag.votes::TEXT, '|') AS v_on_tag,
       string_agg(DISTINCT url.url::TEXT, '|') AS u_on_url,
       event.label_name AS ln_on_event, event.cat AS c_on_event,
       m.position AS p_on_m, m.title AS t_on_m, m.format AS f_on_m,
       t.position AS p_on_t, t.title AS t_on_t,
       string_agg(DISTINCT t.duration::TEXT, '|') AS d_on_t,
       string_agg(DISTINCT tArtist.artist::TEXT, '|') AS a_on_tArtist,
       string_agg(DISTINCT tComposer.composer::TEXT, '|') AS c_on_tComposer,
       string_agg(DISTINCT tIsrc.isrc::TEXT, '|') AS i_on_tIsrc
FROM release r
LEFT JOIN release_barcode barcode ON r.source_uri = barcode.source_uri
LEFT JOIN release_genre genre ON r.source_uri = genre.source_uri
LEFT JOIN release_type typ ON r.source_uri = typ.source_uri
LEFT JOIN release_voted_tag tag ON r.source_uri = tag.source_uri
LEFT JOIN release_image IMAGE ON r.source_uri = IMAGE.source_uri
LEFT JOIN release_image_type imageType ON IMAGE.id = imageType.image_id
LEFT JOIN release_url url ON r.source_uri = url.source_uri
LEFT JOIN release_event event ON r.source_uri = event.source_uri
LEFT JOIN medium m ON r.source_uri = m.source_uri
LEFT JOIN track t ON m.id = t.medium
LEFT JOIN track_artist tArtist ON t.id = tArtist.track
LEFT JOIN track_composer tComposer ON t.id = tComposer.track
LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
WHERE r.source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  ,[and so on for about thirty more URIs]
  )
GROUP BY su_on_r, t_on_r, s_on_r, an_on_r, a_on_r, c_on_r, rd_on_r, u_on_image, w_on_image, h_on_image, s_on_image, ln_on_event, c_on_event, p_on_m, t_on_m, f_on_m, p_on_t, t_on_t;

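Looking at the EXPLAIN output, most of the cost comes from the GROUP BY: https://explain.depesz.com/s/dV5o

You can see that the aggregation works on more than 90k rows. The row count is that large because of the number of joins: every 1:m table multiplies the rows produced so far, so the total grows exponentially with the number of such joins.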
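The multiplication is easy to reproduce in miniature. The following sketch uses made-up inline data (one hypothetical release 'rel-1' with three barcodes and four genres, none of it taken from the real tables) and only two of the joins from the query above:

```sql
-- Illustration only: the rows below are invented, not real Discogs data.
WITH r(source_uri) AS (
    VALUES ('rel-1')
),
barcode(source_uri, barcode) AS (
    VALUES ('rel-1', 'b1'), ('rel-1', 'b2'), ('rel-1', 'b3')
),
genre(source_uri, genre) AS (
    VALUES ('rel-1', 'g1'), ('rel-1', 'g2'), ('rel-1', 'g3'), ('rel-1', 'g4')
)
SELECT count(*) AS joined_rows          -- 3 barcodes x 4 genres = 12 rows
FROM r
LEFT JOIN barcode ON r.source_uri = barcode.source_uri
LEFT JOIN genre   ON r.source_uri = genre.source_uri;
```

Each further 1:m join multiplies that count again, and all of those rows have to be sorted and de-duplicated by the GROUP BY and the DISTINCT aggregates.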

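First attempt: moving the aggregation into joined sub-queries

So I wondered how to rewrite the query so that it doesn't have to combine all of these rows. I decided to write the joins as sub-queries and move the aggregation into those sub-queries.

My first attempt was (one example for release_barcode, repeated for all tables):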

LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode
    GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri

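This reduces the number of rows returned, and I no longer need the big sort because there is no GROUP BY in the top-level query.

However, this is slower! The query planner does not seem to apply the criteria of the top-level query first. Instead, it aggregates and joins the entire tables.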
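One way to check whether the outer filter reaches a sub-query is to look at the plan for a cut-down version with a single sub-query. This is only a sketch: it reuses the table and column names from the question and the two URIs shown there, and the plan it produces will depend on the PostgreSQL version, the indexes, and the table statistics.

```sql
-- Sketch: a reduced form of the first attempt, with just one sub-query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT r.source_uri, barcode.b_on_barcode
FROM release r
LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode
    GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri
WHERE r.source_uri IN (
    'https://api.discogs.com/releases/1955915'
   ,'https://api.discogs.com/releases/8602631'
);
```

If the node that reads release_barcode reports roughly as many rows as the table contains, the sub-query was aggregated in full before the join, which would match the slowdown described above.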

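Next attempt: duplicating the criteria in the sub-queries

So I tried something different: to force the filtering inside each sub-query, I simply duplicated the criteria: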

LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode
    WHERE source_uri IN (
      'https://api.discogs.com/releases/1955915'
      ,'https://api.discogs.com/releases/8602631'
      ,[and so on for about thirty more URIs]
      )
    GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri

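The WHERE clause is simply repeated in every sub-query.

The results speak for themselves: https://explain.depesz.com/s/exSw

A more complex query, but 100 times faster!

But of course, repeating the criteria smells really clunky.

So my question is twofold:

Is there a name for this type of optimization, and is it frowned upon?
Is there a nicer way of doing this that avoids the repetition (cf. my first attempt)?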

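Comments:

***.com/help/mcve comes to mind.

Try disabling nested loops with set enable_nestloop to off and re-run the first query.

14 tables is too many for the planner; it will go into geqo mode (see geqo_threshold and join_collapse_limit). Try to isolate a tight cluster of referencing tables, put that cluster into a CTE, and reference this CTE in the main query. (This would reduce your query to 7+8 range table entries.)

@JustMe That seems to have made it worse... I started the query about 30 seconds ago and it's still running...

@jarih Feel free to suggest how to make it smaller, given that the problem is about the number of joins and the repetition of large criteria.

Answer 1: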

Increase geqo_threshold (and maybe join_collapse_limit). Note: this could increase the planning time to more than a second.
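As a sketch, both settings can be changed for the current session only. The value 16 is purely illustrative (it is just larger than the fourteen tables in the question) and is not a recommendation from this answer:

```sql
-- Inspect the current values (stock PostgreSQL defaults are 12 and 8).
SHOW geqo_threshold;
SHOW join_collapse_limit;

-- Illustrative values for this session only.
SET geqo_threshold = 16;
SET join_collapse_limit = 16;

-- ... run EXPLAIN ANALYZE on the query here and compare plans ...

-- Return to the configured defaults afterwards.
RESET geqo_threshold;
RESET join_collapse_limit;
```

Because SET without LOCAL lasts until the session ends or the setting is reset, this is a low-risk way to see whether the extra planning time pays for itself before changing anything globally.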

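[The row sizes are rather large.] Avoid fat indexes and fat tables; for example, move the %uri field into a separate table and refer to it through a surrogate key.

Reduce the number of range table entries by splitting closely related tables off into CTEs, as in the query below. [A next step could be: put the entire query (without the aggregation) into a third CTE and do the aggregation in the main query.]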
WITH rel AS (
        SELECT * FROM release
        WHERE source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  -- ,[and so on for about thirty more URIs]
        )
        )
, media AS (
        -- Explicit column list: medium and track both have position and title,
        -- so SELECT * would yield ambiguous column names.
        SELECT m.source_uri
                , m.position AS m_position, m.title AS m_title, m.format AS m_format
                , t.position AS t_position, t.title AS t_title, t.duration
                , tArtist.artist, tComposer.composer, tIsrc.isrc
        FROM medium m -- joined to rel on source_uri in the main query
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN track_artist tArtist ON t.id = tArtist.track
        LEFT JOIN track_composer tComposer ON t.id = tComposer.track
        LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
        )
SELECT r.source_uri AS su_on_r, r.title AS t_on_r, r.subtitle AS s_on_r, r.artist_name AS an_on_r
        , r.asin AS a_on_r, r.country AS c_on_r, r.release_date AS rd_on_r
        , string_agg(DISTINCT barcode.barcode::TEXT, '|') AS b_on_barcode
        , string_agg(DISTINCT genre.genre::TEXT, '|') AS g_on_genre
        , string_agg(DISTINCT typ.type::TEXT, '|') AS t_on_typ
        , string_agg(tag.voted_tag::TEXT, '|') AS vt_on_tag
        , img.uri AS u_on_image, img.width AS w_on_image
        , img.height AS h_on_image, img.score AS s_on_image
        , string_agg(DISTINCT imageType.image_type::TEXT, '|') AS it_on_imageType
        , string_agg(tag.votes::TEXT, '|') AS v_on_tag
        , string_agg(DISTINCT url.url::TEXT, '|') AS u_on_url
        , event.label_name AS ln_on_event, event.cat AS c_on_event
        , m.m_position AS p_on_m, m.m_title AS t_on_m, m.m_format AS f_on_m
        , m.t_position AS p_on_t, m.t_title AS t_on_t
        , string_agg(DISTINCT m.duration::TEXT, '|') AS d_on_t
        , string_agg(DISTINCT m.artist::TEXT, '|') AS a_on_tArtist
        , string_agg(DISTINCT m.composer::TEXT, '|') AS c_on_tComposer
        , string_agg(DISTINCT m.isrc::TEXT, '|') AS i_on_tIsrc
FROM rel r -- CTE
LEFT JOIN release_barcode barcode       ON      r.source_uri = barcode.source_uri
LEFT JOIN release_genre genre           ON      r.source_uri = genre.source_uri
LEFT JOIN release_type typ              ON      r.source_uri = typ.source_uri
LEFT JOIN release_voted_tag tag         ON      r.source_uri = tag.source_uri
LEFT JOIN release_image img             ON      r.source_uri = img.source_uri
  LEFT JOIN release_image_type imageType         ON img.id = imageType.image_id
LEFT JOIN release_url url               ON      r.source_uri = url.source_uri
LEFT JOIN release_event event           ON      r.source_uri = event.source_uri
LEFT JOIN media m                       ON      r.source_uri = m.source_uri -- CTE
GROUP BY su_on_r, t_on_r, s_on_r, an_on_r
        , a_on_r, c_on_r, rd_on_r
        , u_on_image, w_on_image, h_on_image
        , s_on_image, ln_on_event, c_on_event
        , p_on_m, t_on_m, f_on_m, p_on_t, t_on_t
        ;

Note: medium and track share the column names position and title, so the media CTE has to rename them when the terms are moved into it. The m_ and t_ prefixes used here are an arbitrary choice.
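On the second part of the question (avoiding the repeated IN list), one option that combines the rel CTE with the asker's per-table aggregation is a LATERAL sub-query. This is a sketch under assumptions, not part of the original answer: it assumes PostgreSQL 9.3 or later (where LATERAL is available), shows only two of the aggregated tables, and has not been timed against the real data.

```sql
-- Sketch only: the URI list appears once, in the rel CTE.
WITH rel AS (
    SELECT *
    FROM release
    WHERE source_uri IN (
        'https://api.discogs.com/releases/1955915'
       ,'https://api.discogs.com/releases/8602631'
       -- ,[and so on for about thirty more URIs]
    )
)
SELECT r.source_uri AS su_on_r
     , r.title      AS t_on_r
     , barcode.b_on_barcode
     , genre.g_on_genre
FROM rel r
-- Each LATERAL sub-query is evaluated per row of rel, so it only
-- aggregates the rows that belong to an already filtered release.
LEFT JOIN LATERAL (
    SELECT string_agg(DISTINCT b.barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode b
    WHERE b.source_uri = r.source_uri
) AS barcode ON true
LEFT JOIN LATERAL (
    SELECT string_agg(DISTINCT g.genre::TEXT, '|') AS g_on_genre
    FROM release_genre g
    WHERE g.source_uri = r.source_uri
) AS genre ON true;
```

Because an aggregate without GROUP BY always returns exactly one row, each sub-query contributes one row per release and the top-level query needs no GROUP BY. An index on source_uri in each child table would likely matter here, since every sub-query looks up rows by that column.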
