Very slow distinct and sort method in Postgres

Posted 2023-04-15

【Title】Very slow distinct and sort method in Postgres 【Asked】2016-11-03 15:54:22 【Question】:

I have the following view: http://pastebin.com/jgLeM3cd, and my database is about 10 GB. The problem is that, because of the DISTINCT, the view executes very, very slowly.

SELECT DISTINCT 
    users.id AS user_id, 
    contacts.id AS contact_id,
    contact_types.name AS relationship, 
    channels.name AS channel,
    feed_items.send_at AS sent_at, 
    feed_items.body AS message,
    feed_items.from_id, 
    feed_items.feed_id
FROM feed_items
JOIN channels ON feed_items.channel_id = channels.id
JOIN feeds ON feed_items.feed_id = feeds.id
JOIN contacts ON feeds.contact_id = contacts.id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
WHERE contacts.is_fake = false;

For example, here is the EXPLAIN ANALYZE output for the view with LIMIT 10: https://explain.depesz.com/s/K8q2

   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7717200.06..7717200.28 rows=10 width=1113) (actual time=118656.704..118656.726 rows=10 loops=1)
   ->  Unique  (cost=7717200.06..7780174.02 rows=2798843 width=1113) (actual time=118656.702..118656.723 rows=10 loops=1)
         ->  Sort  (cost=7717200.06..7724197.16 rows=2798843 width=1113) (actual time=118656.700..118656.712 rows=10 loops=1)
               Sort Key: users.id, contacts.id, contact_types.name, channels.name, feed_items.send_at, feed_items.body, feed_items.from_id, feed_items.feed_id
               Sort Method: external merge  Disk: 589888kB
               ->  Hash Join  (cost=22677.02..577531.86 rows=2798843 width=1113) (actual time=416.072..12918.259 rows=5301453 loops=1)
                     Hash Cond: (feed_items.channel_id = channels.id)
                     ->  Hash Join  (cost=22675.84..539046.59 rows=2798843 width=601) (actual time=416.052..10703.796 rows=5301636 loops=1)
                           Hash Cond: (contacts.contact_type_id = contact_types.id)
                           ->  Hash Join  (cost=22674.73..500479.61 rows=2820650 width=89) (actual time=416.038..8494.439 rows=5303074 loops=1)
                                 Hash Cond: (feed_items.feed_id = feeds.id)
                                 ->  Seq Scan on feed_items  (cost=0.00..223787.54 rows=6828254 width=77) (actual time=0.025..2300.762 rows=6820169 loops=1)
                                 ->  Hash  (cost=18314.88..18314.88 rows=250788 width=16) (actual time=415.830..415.830 rows=268669 loops=1)
                                       Buckets: 4096  Batches: 16  Memory Usage: 806kB
                                       ->  Hash Join  (cost=1642.22..18314.88 rows=250788 width=16) (actual time=19.562..337.146 rows=268669 loops=1)
                                             Hash Cond: (feeds.contact_id = contacts.id)
                                             ->  Seq Scan on feeds  (cost=0.00..11888.11 rows=607111 width=8) (actual time=0.013..116.339 rows=607117 loops=1)
                                             ->  Hash  (cost=1517.99..1517.99 rows=9938 width=12) (actual time=19.537..19.537 rows=9945 loops=1)
                                                   Buckets: 1024  Batches: 1  Memory Usage: 427kB
                                                   ->  Hash Join  (cost=619.65..1517.99 rows=9938 width=12) (actual time=5.743..16.746 rows=9945 loops=1)
                                                         Hash Cond: (contacts.user_id = users.id)
                                                         ->  Seq Scan on contacts  (cost=0.00..699.58 rows=9938 width=12) (actual time=0.005..5.981 rows=9945 loops=1)
                                                               Filter: (NOT is_fake)
                                                               Rows Removed by Filter: 14120
                                                         ->  Hash  (cost=473.18..473.18 rows=11718 width=4) (actual time=5.728..5.728 rows=11800 loops=1)
                                                               Buckets: 2048  Batches: 1  Memory Usage: 415kB
                                                               ->  Seq Scan on users  (cost=0.00..473.18 rows=11718 width=4) (actual time=0.004..2.915 rows=11800 loops=1)
                           ->  Hash  (cost=1.05..1.05 rows=5 width=520) (actual time=0.004..0.004 rows=5 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                 ->  Seq Scan on contact_types  (cost=0.00..1.05 rows=5 width=520) (actual time=0.002..0.003 rows=5 loops=1)
                     ->  Hash  (cost=1.08..1.08 rows=8 width=520) (actual time=0.012..0.012 rows=8 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 1kB
                           ->  Seq Scan on channels  (cost=0.00..1.08 rows=8 width=520) (actual time=0.006..0.007 rows=8 loops=1)
 Total runtime: 118765.513 ms
(34 rows)

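I have created b-tree indexes on almost all of the columns that are used, except feed_items.body, because that is a text column. I have also increased work_mem, but it did not help. Any ideas on how to speed this up?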

【Comments】:

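Performance questions should include EXPLAIN ANALYZE and some information about table sizes, indexes, current timings, desired timings, and so on. "Slow" is a relative term, and we need a real value to compare against.

@a_horse_with_no_name Yes, but "slow" is still a relative term: the question does not say how much time is acceptable, nor does it show the CREATE TABLE statements and index information.

Do you really need the distinct on all columns?

The biggest problem is that there is not enough work_mem available to do the sort in memory ("Sort Method: external merge  Disk: 589888kB"). One way to fix that is to increase work_mem, e.g. set session work_mem='1GB'; (if you have enough memory).

If you only need distinct users, then distinct on (user_id) might be faster (because the sort/distinct is then done on a single integer column only).

"I've created b-tree indexes on almost all columns that are used except ..." That is not a data model, it is a spreadsheet with indexes. Without the table definitions (including keys, indexes, and approximate sizes), your question cannot be answered.

【Answer 1】: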

As others have said in the comments:

Use DISTINCT on as few fields as possible.

Maybe all you need is a GROUP BY...

Increasing work_mem may help, but it is not the definitive solution (your query is very inefficient, and it will degrade again as the database grows...).
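A minimal sketch of the first and third suggestions, assuming the same tables and columns as the question's view. The 1GB figure is the value suggested in the comment thread, not a tuned recommendation. Note that the DISTINCT ON variant returns one row per user, which is a different result set from the original view, so it only fits if that is what the application actually needs.

```sql
-- Raise the per-sort memory budget for this session only.
-- Assumption: the server has enough free RAM for a 1GB sort.
SET work_mem = '1GB';

-- Sketch: deduplicate on a single integer column instead of all eight.
-- DISTINCT ON keeps the first row of each group in ORDER BY order,
-- here the most recent feed item per user.
SELECT DISTINCT ON (users.id)
    users.id AS user_id,
    contacts.id AS contact_id,
    feed_items.send_at AS sent_at,
    feed_items.feed_id
FROM feed_items
JOIN feeds ON feed_items.feed_id = feeds.id
JOIN contacts ON feeds.contact_id = contacts.id
JOIN users ON contacts.user_id = users.id
WHERE contacts.is_fake = false
ORDER BY users.id, feed_items.send_at DESC;

-- Return to the server default afterwards.
RESET work_mem;
```

If the sort now fits in memory, the plan's "Sort Method" line should no longer report an external merge on disk.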

Also:

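Indexes are of little help in large-scan queries like this one: an index makes it faster to pick out specific rows, but a full scan over an index is much more expensive than a sequential scan over the table (or join).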

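The only exception is when you need to pick just a few records from a large table. But the planner has a hard time guessing that, so you need to force it by using a subquery or a CTE ("WITH" clause).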
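As a hedged illustration of the CTE approach, the sketch below filters contacts first and joins only the surviving rows. It reuses the table and column names from the question; the alias real_contacts is made up for this example, the column list is shortened for brevity, and nothing here has been run against the real data.

```sql
-- Sketch only: narrow contacts before joining the large tables.
WITH real_contacts AS (
    -- Per the posted plan, 9,945 of 24,065 contacts pass this filter.
    SELECT id, user_id, contact_type_id
    FROM contacts
    WHERE is_fake = false
)
SELECT DISTINCT
    rc.user_id,
    rc.id AS contact_id,
    feed_items.send_at AS sent_at,
    feed_items.feed_id
FROM real_contacts rc
JOIN feeds ON feeds.contact_id = rc.id
JOIN feed_items ON feed_items.feed_id = feeds.id;
```

Before PostgreSQL 12, a CTE is always planned as a separate step, which is what makes it usable as a "force". From version 12 on, the planner may inline a CTE like this one, and writing `AS MATERIALIZED` keeps the old behavior. Whether either form beats the plain join depends on the data, so compare the plans before adopting it.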

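Along the same lines as increasing work_mem, PostgreSQL 9.6 ships with parallel scan support (which must be enabled manually first): if your server runs that version, or you have the chance to upgrade, it could also improve response time (even though your query seems to need improvement anyway... ;-)).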
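A small sketch of enabling parallel query for one session on PostgreSQL 9.6, where max_parallel_workers_per_gather defaults to 0. The value 4 is only illustrative, and the count query is a stand-in chosen because it scans the largest table in the question.

```sql
-- PostgreSQL 9.6: parallel query is off until this setting is above 0.
SET max_parallel_workers_per_gather = 4;  -- illustrative value

-- If the planner chooses a parallel plan, the output contains a Gather
-- node with "Workers Planned" and "Workers Launched" lines.
EXPLAIN (ANALYZE)
SELECT count(*)
FROM feed_items;

RESET max_parallel_workers_per_gather;
```

The planner may still pick a serial plan if it estimates that to be cheaper, so the Gather node is something to look for rather than a guaranteed result.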

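So my advice is to minimize the data involved in the joins, especially in the first join. In other words: join order matters. Keep in mind that (fortunately) you have no left joins, so every join is in fact a potential filter; starting from the smaller table (or the one from which you will select fewer rows) can greatly reduce the memory the joins need.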

For example (based on your query and not tested at all; remember that your data distribution matters):

SELECT DISTINCT
    users.id AS user_id,
    contacts.id AS contact_id,
    contact_types.name AS relationship,
    channels.name AS channel,
    feed_items.send_at AS sent_at,
    feed_items.body AS message,
    feed_items.from_id,
    feed_items.feed_id
-- Start from contacts, because it is the only table where rows are
-- actually filtered out:
FROM contacts
JOIN feeds ON (
    contacts.is_fake = false -- Filter here to reduce join size
    and feeds.contact_id = contacts.id -- Actual join condition
)
JOIN feed_items ON feed_items.feed_id = feeds.id
JOIN channels ON channels.id = feed_items.channel_id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
;

But, again: everything depends on your actual data.

Give it a try, run EXPLAIN ANALYZE, find the most expensive part, and think about strategies to improve it.
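One possible way to run that check is sketched below. The view name contact_feed is hypothetical, since the question links to the view definition without naming it here; BUFFERS is an optional extra that shows how much data each node read.

```sql
-- contact_feed is a hypothetical name for the view from the question.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM contact_feed
LIMIT 10;
```

In the plan posted in the question, two things stand out when read this way: the Sort node spills about 590 MB to disk ("external merge"), and the top Hash Join returns 5,301,453 rows against an estimate of 2,798,843.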

These are just some random ideas, but I hope they help you a bit.

Good luck!

【Comments】:

A derived table (which needs an alias, btw) won't change anything, and neither will the order of the tables in the FROM clause.
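The comment's point can be checked with a session-level experiment. PostgreSQL is free to reorder explicit inner joins as long as the number of joined items stays within join_collapse_limit (default 8), and the six-table query in the question is under that limit. The sketch below assumes the question's table names and uses a shortened query; setting the limit to 1 is a diagnostic trick, not a recommended production setting.

```sql
SHOW join_collapse_limit;  -- 8 unless the server configuration changed it

-- With the limit at 1, the planner joins tables in the order written.
SET join_collapse_limit = 1;

EXPLAIN
SELECT contacts.id, feed_items.feed_id
FROM contacts
JOIN feeds ON feeds.contact_id = contacts.id
JOIN feed_items ON feed_items.feed_id = feeds.id
WHERE contacts.is_fake = false;

RESET join_collapse_limit;
```

Comparing this plan with the one produced under the default setting shows whether the written join order makes any difference for this data.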
