Redshift SQL query - optimize
Posted: 2020-05-08 12:30:56
Question: I have a query that takes more than 15 minutes to execute in Redshift. The query is triggered from an AWS Lambda function whose timeout is 15 minutes, so I would like to know whether the query can be optimized to return its results faster.
Here is my SQL query:
insert into
test.qa_locked
select
'1d8db587-f5ab-41f4-9c2b-c4e21e0c7481',
'ABC-013505',
'ABC-013505-2-2020',
user_id,
cast(TIMEOFDAY() as timestamp)
from
(
select
user_id
from
(
select
contact_id
from
test.qa_locked
)
where
contact_cnt <= 1
)
)
The plan is as follows:
XN Subquery Scan "*SELECT*" (cost=1000028198481.69..1000028198481.75 rows=1 width=218)
-> XN Subquery Scan derived_table1 (cost=1000028198481.69..1000028198481.73 rows=1 width=210)
-> XN Window (cost=1000028198481.69..1000028198481.71 rows=1 width=56)
-> XN Sort (cost=1000028198481.69..1000028198481.70 rows=1 width=56)
-> XN Network (cost=1645148.05..28198481.68 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_OUTER (cost=1645148.05..28198481.68 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.76..28091814.71 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.09..7491814.01 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.68..6805146.91 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.16..6438479.71 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.65..6071812.51 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.29..6071812.13 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.96..6071811.77 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_NONE (cost=1645144.50..5598477.96 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.47..5598477.91 rows=1 width=84)
-> XN Hash NOT IN Join DS_DIST_OUTER (cost=1645142.59..5078476.00 rows=1 width=84)
-> XN Hash NOT IN Join DS_BCAST_INNER (cost=1645142.57..4065142.63 rows=1 width=600)
-> XN Hash Left Join DS_DIST_BOTH (cost=1201145.21..3221145.24 rows=1 width=1116)
-> XN Seq Scan on contacts xa (cost=1201145.21..1201145.21 rows=1 width=640)
-> XN Hash (cost=0.00..0.00 rows=1 width=556)
-> XN Seq Scan on accounts ya (cost=0.00..0.00 rows=1 width=556)
-> XN Hash (cost=443997.35..443997.35 rows=1 width=32)
-> XN Subquery Scan "IN_subquery" (cost=23989.76..443997.35 rows=1 width=32)
-> XN Unique (cost=23989.76..443997.34 rows=1 width=516)
-> XN Nested Loop DS_BCAST_INNER (cost=23989.76..443997.34 rows=1 width=516)
-> XN Seq Scan on accounts con (cost=0.00..0.00 rows=1 width=516)
-> XN Hash NOT IN Join DS_DIST_OUTER (cost=23989.76..83997.32 rows=1 width=26)
-> XN Seq Scan on campaign_exclusion_list cam (cost=0.00..7.53 rows=1 width=26)
-> XN Hash (cost=23989.75..23989.75 rows=1 width=32)
-> XN Subquery Scan "IN_subquery" (cost=0.00..23989.75 rows=1 width=32)
-> XN Unique (cost=0.00..23989.74 rows=1 width=18)
-> XN Seq Scan on campaign_inclusion_list (cost=0.00..23989.74 rows=1 width=18)
-> XN Hash (cost=0.01..0.01 rows=1 width=516)
-> XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=516)
-> XN Unique (cost=0.00..0.00 rows=1 width=516)
-> XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=516)
-> XN Hash (cost=1.88..1.88 rows=1 width=210)
-> XN Seq Scan on bh_email_open_clicks (cost=0.00..1.88 rows=1 width=210)
-> XN Hash (cost=0.01..0.01 rows=1 width=210)
-> XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=210)
-> XN Unique (cost=0.00..0.00 rows=1 width=28)
-> XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=28)
-> XN Hash (cost=0.45..0.45 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.45 rows=1 width=210)
-> XN Hash (cost=0.32..0.32 rows=1 width=402)
-> XN Subquery Scan "IN_subquery" (cost=0.30..0.32 rows=1 width=402)
-> XN HashAggregate (cost=0.30..0.31 rows=1 width=402)
-> XN Seq Scan on campaign_extraction_history (cost=0.00..0.30 rows=1 width=402)
-> XN Hash (cost=0.35..0.35 rows=1 width=402)
-> XN Subquery Scan "IN_subquery" (cost=0.33..0.35 rows=1 width=402)
-> XN HashAggregate (cost=0.33..0.34 rows=1 width=402)
-> XN Seq Scan on campaign_extraction_history (cost=0.00..0.33 rows=1 width=402)
-> XN Hash (cost=0.50..0.50 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
-> XN Hash (cost=0.50..0.50 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
-> XN Hash (cost=0.40..0.40 rows=1 width=402)
-> XN Seq Scan on campaign_extraction_history (cost=0.00..0.40 rows=1 width=402)
-> XN Hash (cost=0.30..0.30 rows=30 width=402)
-> XN Seq Scan on ce_locked_records_tb (cost=0.00..0.30 rows=30 width=402)
-> XN Hash (cost=0.27..0.27 rows=1 width=210)
-> XN Subquery Scan "IN_subquery" (cost=0.26..0.27 rows=1 width=210)
-> XN HashAggregate (cost=0.26..0.26 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.25 rows=1 width=210)
Please suggest whether there is any way to optimize this query.
Comments:
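Was this query written by hand, or generated by some BI tool? The first thing to note is that it is very complex, with all of those sub-selects. It would be great to get rid of them. Another thing is that it has 26 NOT IN operators, which are known to be bad for efficiency. NOT IN has to select a large amount of data and then check that the value being tested is not in any of the returned rows. That makes things very slow in any database. They also seem to account for most of the cost figures. There are also 31 subqueries (SELECT) in the query.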
@JohnRotenstein The query is generated by a stored procedure. All of the code is written by hand. What would be an efficient alternative?
NOT IN can often be replaced with a LEFT OUTER JOIN, followed by a check that the joined field is NULL. There is plenty of discussion of this online, for example: SQL performance on LEFT OUTER JOIN vs NOT EXISTS, Consider using NOT EXISTS instead of NOT IN with a subquery - Redgate Software, and NOT IN vs. NOT EXISTS vs. OUTER APPLY vs. OUTER JOIN.
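As a rough sketch of that rewrite, here is one of the question's NOT IN filters on bh_leads expressed three ways. The outer SELECT list is only illustrative, and the sketch assumes contact_id is never NULL on either side; when NULLs are possible, NOT IN behaves differently from the other two forms, so the results should be compared before switching.

```sql
-- Original pattern: NOT IN with a subquery.
SELECT xa.contact_id
FROM contacts xa
WHERE xa.contact_id NOT IN (
    SELECT contact_id
    FROM bh_leads
    WHERE UPPER(lob) = 'ABC'
      AND agency_id = '1002'
);

-- Anti-join: LEFT OUTER JOIN, then keep only the rows that found no match.
-- The subquery's filters move into the ON clause, not the WHERE clause.
SELECT xa.contact_id
FROM contacts xa
LEFT OUTER JOIN bh_leads bl
       ON bl.contact_id = xa.contact_id
      AND UPPER(bl.lob) = 'ABC'
      AND bl.agency_id = '1002'
WHERE bl.contact_id IS NULL;

-- Equivalent NOT EXISTS form (correlated one level deep only).
SELECT xa.contact_id
FROM contacts xa
WHERE NOT EXISTS (
    SELECT 1
    FROM bh_leads bl
    WHERE bl.contact_id = xa.contact_id
      AND UPPER(bl.lob) = 'ABC'
      AND bl.agency_id = '1002'
);
```

Whether either form is actually faster here would need to be confirmed with EXPLAIN against the real tables.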
@JohnRotenstein Could you rewrite my query to make it faster?
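The EXPLAIN plan shows cost numbers. You should focus on reducing or eliminating the steps with a high cost. The high costs may also be caused by the DS_DIST_INNER and DS_DIST_BOTH activity, where rows are redistributed between nodes to perform the join. This can often be avoided by giving the joined tables the same DISTKEY, or by replicating a table on all nodes. See: Evaluating the query plan - Amazon Redshift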
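A minimal sketch of both options follows. It assumes, purely for illustration, that contacts and bh_leads are mostly joined on contact_id and that campaign_exclusion_list is small enough to copy to every node; neither is confirmed by the question, and a table can have only one DISTKEY, so the choice depends on the joins that matter most.

```sql
-- Co-locate rows that are joined on contact_id (assumed join column),
-- so the join can run without redistributing either table.
ALTER TABLE contacts ALTER DISTKEY contact_id;
ALTER TABLE bh_leads ALTER DISTKEY contact_id;

-- Replicate a small lookup table to all nodes (assumed to be small).
ALTER TABLE campaign_exclusion_list ALTER DISTSTYLE ALL;
```

After a change like this, re-running EXPLAIN should show whether the DS_DIST_INNER and DS_DIST_BOTH steps on those tables have become DS_DIST_NONE or DS_DIST_ALL_NONE.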
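Answer 1:
This feels like a query that has been added to again and again, with a lot of code duplication and a lot of unnecessary table scans.
Bear in mind that my main experience is with MSSQL rather than Redshift, but for the most part the same principles apply.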
(
lower(xa.primary_function) in (
select
lower(param_val)
from
ce_campaign_spec_tb
where
job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
and param = 'primary_function'
and relation_id = 4
)
and lower(xa.role) in (
select
lower(param_val)
from
ce_campaign_spec_tb
where
job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
and param = 'role'
and relation_id = 4
)
and lower(xa.title) in (
select
lower(title)
from
contacts con
inner join ce_campaign_spec_tb camp on lower(con.title) ilike '%' || trim(
both ' '
from
camp.param_val
) || '%'
where
job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
and param = 'title'
and relation_id = 4
)
)
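Without knowing what this does, you appear to have repeated this block of code 5 times, with the relation ID as the only change. You start with id 4, then 2, then 1, then 3, then 5, but apart from the id nothing seems to change. There may be subtle differences, but as it stands you scan the table 5 times instead of once with a single predicate. Depending on the size of the table, that could be a lot of data to scan.
A few lines further on: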
and xa.contact_id not in (
select
contact_id
from
bh_leads
where
(CURRENT_DATE - creation_date :: date) <= 60
and UPPER(LOB) = 'ABC'
and agency_id = '1002'
)
and xa.contact_id not in (
select
contact_id
from
bh_leads
where
(CURRENT_DATE - creation_date :: date) <= 60
and UPPER(LOB) = 'ABC'
and sponsor_id = '8306'
)
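Again, 2 table scans over almost the same data. The only difference is that one checks the value of agency_id and the other checks sponsor_id. This could be done in a single statement instead of 2.
Further down: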
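As a sketch, the two bh_leads predicates above could be merged by moving the difference into an OR, so that bh_leads is scanned once. Excluding everything in set A and everything in set B is the same as excluding everything in their union, so this should return the same rows as the original pair:

```sql
and xa.contact_id not in (
    select
        contact_id
    from
        bh_leads
    where
        (CURRENT_DATE - creation_date :: date) <= 60
        and UPPER(LOB) = 'ABC'
        -- the only part that differed between the two original subqueries
        and (
            agency_id = '1002'
            or sponsor_id = '8306'
        )
)
```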
and email_id not in (
select
distinct email_id
from
contacts
where
is_email_suppressed = 1
)
Earlier, you referenced contacts (xa) and put this in the where clause as a predicate:
and xa.is_email_suppressed = 0
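Without knowing the exact schema of the tables involved I cannot be sure, but the two appear to do largely the same thing.
Also, from the Redshift documentation here: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
It appears that you can create temporary tables that last for a single session. Most of the subqueries could be prepared in advance so that you can join against their result sets. For example, if you first prepare a temporary result set with the valid results for the campaign_extraction_history table, you could replace the following predicates with a single left join: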
AND contact_id NOT IN (
select
contact_id
from
campaign_extraction_history
where
sf_oms_campaign_id = 'ABC-013505-2-2020'
and sf_campaign_id = 'ABC-013505'
and (CURRENT_DATE - creation_date :: date) < 1
and channel = 'BOTH'
and (
UPPER(STATUS) = 'EXTRACTED'
OR UPPER(STATUS) = 'LAUNCHED'
OR UPPER(STATUS) = 'CONFIRMED'
)
)
AND contact_id NOT IN (
select
contact_id
from
campaign_extraction_history
where
creation_date :: date = CURRENT_DATE
and channel = 'BOTH'
and (
UPPER(STATUS) = 'EXTRACTED'
OR UPPER(STATUS) = 'LAUNCHED'
OR UPPER(STATUS) = 'CONFIRMED'
)
group by
contact_id
having
count(*) > 10
)
AND contact_id NOT IN (
select
contact_id
from
campaign_extraction_history
where
sf_campaign_id = 'ABC-013505'
and channel = 'BOTH'
and (
UPPER(STATUS) = 'EXTRACTED'
OR UPPER(STATUS) = 'LAUNCHED'
OR UPPER(STATUS) = 'CONFIRMED'
)
group by
contact_id
having
count(*) >= 3
)
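A minimal sketch of that idea for the three predicates above. The temp table name excluded_contacts and the outer SELECT are hypothetical, and the sketch assumes contact_id is never NULL in campaign_extraction_history. The filters shared by all three subqueries go into the WHERE clause, and each rule becomes a conditional count in the HAVING clause:

```sql
-- One pass over campaign_extraction_history collects every contact
-- that any of the three rules would exclude.
CREATE TEMP TABLE excluded_contacts AS
SELECT contact_id
FROM campaign_extraction_history
WHERE channel = 'BOTH'
  AND UPPER(status) IN ('EXTRACTED', 'LAUNCHED', 'CONFIRMED')
GROUP BY contact_id
HAVING
    -- rule 1: at least one row for this exact campaign from today
    SUM(CASE WHEN sf_oms_campaign_id = 'ABC-013505-2-2020'
              AND sf_campaign_id = 'ABC-013505'
              AND (CURRENT_DATE - creation_date :: date) < 1
             THEN 1 ELSE 0 END) > 0
    -- rule 2: more than 10 rows created today
    OR SUM(CASE WHEN creation_date :: date = CURRENT_DATE
                THEN 1 ELSE 0 END) > 10
    -- rule 3: 3 or more rows for this sf_campaign_id
    OR SUM(CASE WHEN sf_campaign_id = 'ABC-013505'
                THEN 1 ELSE 0 END) >= 3;

-- The three NOT IN predicates then become a single anti-join.
SELECT xa.contact_id
FROM contacts xa
LEFT JOIN excluded_contacts ex
       ON ex.contact_id = xa.contact_id
WHERE ex.contact_id IS NULL;
```

The hard-coded campaign ids are copied from the question; in the stored procedure they would presumably be parameters.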
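There are more places where you can combine queries and fetch the data from a table in one pass. For example, you exclude many email_id values, but in different places, in different statements and subqueries. They could most likely be handled in a single statement.
Perhaps the best way to improve performance is to ask yourself what the query is trying to do and what it is trying to exclude, and then rewrite the whole query. That may be quite a lot of work, but it will probably be faster in the long run.
Comments: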
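On your first point - I agree that I have repeated the same block 5 times. But that is because I want the conditions for the same relation_id to be combined with AND. E.g. relation_id = 1 should have its conditions ANDed together. Then I take the next one, separated by OR. How can this be modified?
Replace 'and relation_id = 4' with 'and relation_id between 1 and 5', or replace the 1 and 5 with the minimum and maximum taken from another query. If the valid IDs do not always increase by 1, you could also use a CTE or a temp table to build the list of valid IDs (or any other sub-select) for use within the transaction.
I need to compare the values of primary_function, role and function for the same relation_id. It cannot be between 1 and 5.
If you use relation id between 1 and 5, you get 5 results. You then find the single rows where all of the columns match (primary_function, role and function).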