How can I make this query run faster in postgres

Posted: 2015-07-20 21:23:34

Question:

I have this query, which takes 86 seconds to execute.

select cust_id customer_id,
       cust_first_name customer_first_name,
       cust_last_name customer_last_name,
       cust_prf customer_prf,
       cust_birth_country customer_birth_country,
       cust_login customer_login,
       cust_email_address customer_email_address,
       date_year ddyear,
       sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr,
       's' stock_type
 from customer, stock, date
 where customer_k = stock_customer_k
   and stock_soldate_k = date_k
 group by cust_id, cust_first_name, cust_last_name, cust_prf, cust_birth_country, cust_login, cust_email_address, date_year;

The EXPLAIN ANALYZE output:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=639753.55..764040.06 rows=2616558 width=213) (actual time=81192.575..86536.398 rows=190581 loops=1)
   Group Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
   ->  Sort  (cost=639753.55..646294.95 rows=2616558 width=213) (actual time=81192.468..83977.960 rows=2685453 loops=1)
         Sort Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
         Sort Method: external merge  Disk: 460920kB
         ->  Hash Join  (cost=6527.66..203691.58 rows=2616558 width=213) (actual time=60.500..2306.082 rows=2685453 loops=1)
               Hash Cond: (stock.stock_customer_k = customer.customer_k)
               ->  Merge Join  (cost=1423.66..144975.59 rows=2744641 width=30) (actual time=8.820..1412.109 rows=2750311 loops=1)
                     Merge Cond: (date.date_k = stock.stock_soldate_k)
                     ->  Index Scan using date_key_idx on date (cost=0.29..2723.33 rows=73049 width=8) (actual time=0.013..7.164 rows=37622 loops=1)
                     ->  Index Scan using stock_soldate_k_index on stock  (cost=0.43..108829.12 rows=2880404 width=30) (actual time=0.004..735.043 rows=2750312 loops=1)
                ->  Hash  (cost=3854.00..3854.00 rows=100000 width=191) (actual time=51.650..51.650 rows=100000 loops=1)
                     Buckets: 16384  Batches: 1  Memory Usage: 16139kB
                     ->  Seq Scan on customer  (cost=0.00..3854.00 rows=100000 width=191) (actual time=0.004..30.341 rows=100000 loops=1)
 Planning time: 1.761 ms
 Execution time: 86621.807 ms

I have work_mem=512MB. I have created indexes on cust_id, customer_k, stock_customer_k, stock_soldate_k and date_k.
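For reference, two of those index names (date_key_idx and stock_soldate_k_index) are visible in the EXPLAIN output below; the remaining definitions in this sketch are guesses at what the listed indexes might look like:

```sql
-- The first two names appear in the query plan; the rest are illustrative.
CREATE INDEX date_key_idx          ON date (date_k);
CREATE INDEX stock_soldate_k_index ON stock (stock_soldate_k);
CREATE INDEX stock_customer_k_idx  ON stock (stock_customer_k);
CREATE INDEX customer_k_idx        ON customer (customer_k);
CREATE INDEX cust_id_idx           ON customer (cust_id);
```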

customer has about 100,000 rows, stock about 3,000,000 rows, and date about 80,000 rows.

How can I make this query run faster? Any help would be appreciated!

Table definitions

date

 Column              |     Type      | Modifiers
---------------------+---------------+-----------
 date_k              | integer       | not null
 date_id             | character(16) | not null
 date_date           | date          |
 date_year           | integer       |

stock

Column                 |     Type     | Modifiers
-----------------------+--------------+-----------
 stock_soldate_k       | integer      |
 stock_soltime_k       | integer      |
 stock_customer_k      | integer      |
 stock_ds_price        | numeric(7,2) |
 stock_es_price        | numeric(7,2) |
 stock_ls_price        | numeric(7,2) |
 stock_ws_price        | numeric(7,2) |

customer:

Column                     |         Type          | Modifiers
---------------------------+-----------------------+-----------
 customer_k                | integer               | not null
 customer_id               | character(16)         | not null
 cust_first_name           | character(20)         |
 cust_last_name            | character(30)         |
 cust_prf                  | character(1)          |
 cust_birth_country        | character varying(20) |
 cust_login                | character(13)         |
 cust_email_address        | character(50)         |

TABLE "stock" CONSTRAINT "st1" FOREIGN KEY (stock_soldate_k) REFERENCES date(date_k)

"st2" FOREIGN KEY (stock_customer_k) REFERENCES customer(customer_k)

Comments:

It would be helpful to see the definitions of the tables, indexes and constraints.

Answer 1:

Try this:

with stock_grouped as
     (select stock_customer_k, date_year, sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr
      from stock, date
      where stock_soldate_k = date_k
      group by stock_customer_k, date_year)
select cust_id customer_id,
       cust_first_name customer_first_name,
       cust_last_name customer_last_name,
       cust_prf customer_prf,
       cust_birth_country customer_birth_country,
       cust_login customer_login,
       cust_email_address customer_email_address,
       date_year ddyear,
       total_yr,
       's' stock_type
from customer, stock_grouped
where customer_k = stock_customer_k

This query performs the grouping before the join to customer.


Answer 2:

A big performance penalty you are getting is that some 450MB of intermediate data is written out to disk: Sort Method: external merge  Disk: 460920kB. This happens because the planner first has to satisfy the join conditions between all 3 tables, including the possibly inefficient table customer, before the aggregation sum() can take place, even though the aggregation could perfectly well be performed on table stock alone.
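Since work_mem is already 512MB and the sort spills roughly 450MB to disk, one quick experiment before rewriting anything is to raise work_mem for the current session only. Whether the sort then fits in memory is not guaranteed, as an in-memory sort can need more space than the on-disk figure suggests:

```sql
-- Session-local setting; does not affect other connections.
SET work_mem = '2GB';
-- ... re-run the query / EXPLAIN ANALYZE here ...
RESET work_mem;
```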

The query

Since your tables are fairly large, you are better off reducing the number of eligible rows as soon as possible, preferably before any joins take place. In this case that means aggregating over table stock in a sub-query and joining the result to the other two tables:

SELECT c.cust_id AS customer_id,
       c.cust_first_name AS customer_first_name,
       c.cust_last_name AS customer_last_name,
       c.cust_prf AS customer_prf,
       c.cust_birth_country AS customer_birth_country,
       c.cust_login AS customer_login,
       c.cust_email_address AS customer_email_address,
       d.date_year AS ddyear,
       ss.total_yr,
       's' stock_type
FROM (
    SELECT 
      stock_customer_k AS ck,
      stock_soldate_k AS sdk,
      sum((stock_ls_price-stock_ws_price-stock_ds_price+stock_es_price)*0.5) AS total_yr
    FROM stock
    GROUP BY 1, 2) ss
JOIN customer c ON c.customer_k = ss.ck
JOIN date d ON d.date_k = ss.sdk;

The sub-query on stock will yield far fewer rows, depending on the average number of orders per customer per date. Also, in the sum() function, multiplying by 0.5 is far cheaper than dividing by 2 (although in the grand scheme of things it will be relatively marginal).

Data model

You should also take a serious look at your data model.

In table customer you use data types like char(30), which always occupy 30 bytes in the row, even when you store just 'X'. Using varchar(30) is far more efficient when many strings are shorter than the declared maximum width, because it takes up less space and therefore requires fewer page reads (and writes of intermediate data).
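A sketch of that change (the column list here is illustrative; in practice you would convert all of the char columns):

```sql
-- Rewrites the table, so expect a lock and significant I/O on a large table.
ALTER TABLE customer
  ALTER COLUMN cust_last_name     TYPE varchar(30),
  ALTER COLUMN cust_email_address TYPE varchar(50);
```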

Table stock uses numeric(7,2) for its prices. The numeric data type gives exact results over many repeated operations on the data, but it is also slow. In your scenario the double precision data type will be much faster and just as accurate. For presentation purposes you can round the value to the desired precision.

As a suggestion, create a table stock_f with double precision data types instead of numeric, copy all the data over from stock to stock_f, and run the query against that table.
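That suggestion could be sketched as follows (column names are taken from the table definitions above; stock_f itself is a new, hypothetical table):

```sql
-- Copy of stock with prices cast from numeric(7,2) to double precision.
CREATE TABLE stock_f AS
SELECT stock_soldate_k,
       stock_soltime_k,
       stock_customer_k,
       stock_ds_price::double precision AS stock_ds_price,
       stock_es_price::double precision AS stock_es_price,
       stock_ls_price::double precision AS stock_ls_price,
       stock_ws_price::double precision AS stock_ws_price
FROM   stock;
```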

