Adding second join condition increases query time exponentially

Posted 2023-03-31

Published: 2015-07-17 13:57:14

Question:

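So, I'm working with Redshift (which is based on Postgres). Unfortunately, I can't share my data (for obvious reasons), but this is more of a conceptual question anyway. I will, of course, share my code.

This query returns almost instantly: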

select
    count(*)
from
    table_one as c
inner join
    table_two as z
on
    regexp_replace(c.telephone_number, '[^0-9]', '') = regexp_replace(z.affected_phone_number, '[^0-9]', '');

But this one runs for hours:

select
    count(*)
from
    table_one as c
inner join
    table_two as z
on
    regexp_replace(c.telephone_number, '[^0-9]', '') = regexp_replace(z.affected_phone_number, '[^0-9]', '')
    or c.email = z.requester_email;

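Why does adding a second join condition with or cause this?

(I can work around it with a union, but I'm interested in learning what is going on here...)

In case it helps, here is the output of explain...

Query plan for the problem query: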

QUERY PLAN
XN Aggregate  (cost=159728183882.77..159728183882.77 rows=1 width=0)
  ->  XN Nested Loop DS_BCAST_INNER  (cost=0.00..159726036322.85 rows=859023969 width=0)
        Join Filter: ((regexp_replace(("inner".telephone_number)::text, '[^0-9]'::text, ''::text, 1) = regexp_replace(("outer".affected_phone_number)::text, '[^0-9]'::text, ''::text, 1)) OR (("inner".email)::text = ("outer".requester_email)::text))
        ->  XN Seq Scan on table_two z  (cost=0.00..4447.40 rows=444740 width=36)
        ->  XN Seq Scan on table_one c  (cost=0.00..3853.89 rows=385389 width=32)
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

Query plan for the non-problem query:

QUERY PLAN
XN Aggregate  (cost=62358556140.01..62358556140.01 rows=1 width=0)
  ->  XN Hash Join DS_BCAST_INNER  (cost=4817.36..62356413666.21 rows=856989520 width=0)
        Hash Cond: (regexp_replace(("outer".affected_phone_number)::text, '[^0-9]'::text, ''::text, 1) = regexp_replace(("inner".telephone_number)::text, '[^0-9]'::text, ''::text, 1))
        ->  XN Seq Scan on table_two z  (cost=0.00..4447.40 rows=444740 width=12)
        ->  XN Hash  (cost=3853.89..3853.89 rows=385389 width=8)
              ->  XN Seq Scan on table_one c  (cost=0.00..3853.89 rows=385389 width=8)
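
Comparing the two plans, the fast query is executed as a Hash Join on the normalized phone number, while the slow one falls back to a Nested Loop. A hash join needs an equality key to hash on, and a disjunction of two unrelated equalities provides no single key, so each of the roughly 1.7 × 10^11 row pairs (444740 × 385389, per the planner's estimates) has to be tested against the Join Filter. The union workaround mentioned in the question could look like the sketch below. It assumes each table has a unique row identifier; the column names c.id and z.id are hypothetical and do not come from the question.

```sql
-- ASSUMPTION: c.id and z.id are unique row identifiers.
-- These column names are hypothetical, not taken from the question.
select
    count(*)
from (
    -- Pairs matched on the normalized phone number.
    -- A single equality condition, so a hash join is possible.
    select c.id as c_id, z.id as z_id
    from table_one as c
    inner join table_two as z
        on regexp_replace(c.telephone_number, '[^0-9]', '')
         = regexp_replace(z.affected_phone_number, '[^0-9]', '')

    union

    -- Pairs matched on email.
    -- Again a single equality condition.
    select c.id as c_id, z.id as z_id
    from table_one as c
    inner join table_two as z
        on c.email = z.requester_email
) as matched_pairs;
```

union (rather than union all) drops pairs that match on both phone and email, so the count should agree with the or version as long as the id columns really are unique.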

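Comments:

What indexes do you have on your "table_two"? It might help to have a composite index on (requester_email, affected_phone_number).

You should also add the table definitions to the question. Redshift is an MPP database, so physical design is critical.

DS_BCAST_INNER: A copy of the entire inner table is broadcast to all the compute nodes. That is all that network traffic plus a nested loop join. No wonder it gets so much slower.

Answer 1: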

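Without access to the database, we can only guess why it is slow.

Guessing is not an appropriate tool for performance tuning.

Use the EXPLAIN statement to see how Postgres actually handles the two statements.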
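
As a minimal sketch of that suggestion, prefixing the slow query from the question with explain prints the execution plan without running the query itself:

```sql
-- Show the plan for the slow query without executing it.
explain
select
    count(*)
from
    table_one as c
inner join
    table_two as z
on
    regexp_replace(c.telephone_number, '[^0-9]', '') = regexp_replace(z.affected_phone_number, '[^0-9]', '')
    or c.email = z.requester_email;
```

The join operator on the second line of the output (Hash Join versus Nested Loop) is the first thing worth comparing between the two statements.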

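Comments:

Explain output added... Clearly the nested loop join is the problem, but I'm looking for a theoretical explanation of why the query results in a nested loop join, and whether there is a workaround other than union...

Answer 2: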

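Are you using a sortkey in your table schema?

If not, or if it is not on the appropriate fields, the data is stored on the nodes in the order it was inserted. That would lead to the loop you are talking about.

When specifying the table schema, make sure to include the most frequently used columns in the sortkey; remember that a sort key can span multiple columns: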

CREATE TABLE schemaex.a1.account_revenue (
    account_id varchar(30) NOT NULL,
    date date NOT NULL distkey,
    registration_date timestamp,
    revenue float(8),
    cost varchar(8)
)
compound sortkey(account_id, date);

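Using the sort key fields as the join key and the filter condition, respectively, should significantly reduce the execution time of joins and aggregations.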
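
For illustration, a query shaped along those lines might look like the sketch below. The accounts table and the date literal are hypothetical and only serve as an example; whether the sort key actually helps depends on how both tables are distributed and sorted.

```sql
-- ASSUMPTION: schemaex.a1.accounts is a hypothetical second table
-- that also has an account_id column; it is not part of the answer.
select
    a.account_id,
    sum(r.revenue) as total_revenue
from
    schemaex.a1.account_revenue as r
inner join
    schemaex.a1.accounts as a
on
    a.account_id = r.account_id      -- leading sort key column used as the join key
where
    r.date >= '2015-01-01'           -- second sort key column used as the filter
group by
    a.account_id;
```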

Best Practices Sort Key
