How can I optimize PostgreSQL ARRAY_AGG queries for large tables?

【Posted】: 2021-03-22 19:06:42 【Question】:

I'm using PostgreSQL for its array functionality. Here is my schema:

CREATE TABLE questions (
  id INTEGER PRIMARY KEY,
  product_id INTEGER UNIQUE NOT NULL,
  body VARCHAR(1000) NOT NULL,
  date_written DATE NOT NULL DEFAULT current_date,
  asker_name VARCHAR(60) NOT NULL,
  asker_email VARCHAR(60) NOT NULL,
  reported BOOLEAN DEFAULT FALSE,
  helpful INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE answers (
  id INTEGER PRIMARY KEY NOT NULL,
  question_id INTEGER NOT NULL,
  body VARCHAR(1000) NOT NULL,
  date_written DATE NOT NULL DEFAULT current_date,
  answerer_name VARCHAR(60) NOT NULL,
  answerer_email VARCHAR(60) NOT NULL,
  reported BOOLEAN DEFAULT FALSE,
  helpful INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE photos (
  id INTEGER UNIQUE,
  answer_id INTEGER NOT NULL,
  photo VARCHAR(200)
);

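I'm trying to query my answers table to get a list of all answers for a given question id, including an array of all photos that exist for each answer_id. The results should be sorted by helpfulness in descending order. So far I have a large query that returns the result I'm looking for, but its execution time is 729.595 ms. I'm trying to bring the query time down to 200 ms. I have the following indexes in place to try to improve the query time: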

    indexname    |                                 indexdef
-----------------+---------------------------------------------------------------------------
 answer_id       | CREATE UNIQUE INDEX answer_id ON public.answers USING btree (id)
 question_id     | CREATE INDEX question_id ON public.answers USING btree (question_id)
 idx_reported_id | CREATE INDEX idx_reported_id ON public.answers USING btree (reported, id)
 answers_pkey    | CREATE UNIQUE INDEX answers_pkey ON public.answers USING btree (id)

   indexname    |                                  indexdef
----------------+----------------------------------------------------------------------------
 id             | CREATE UNIQUE INDEX id ON public.questions USING btree (id)
 idx_q_reported | CREATE INDEX idx_q_reported ON public.questions USING btree (id, reported)
 questions_pkey | CREATE UNIQUE INDEX questions_pkey ON public.questions USING btree (id)

   indexname   |                              indexdef
---------------+---------------------------------------------------------------------
 photos_id_key | CREATE UNIQUE INDEX photos_id_key ON public.photos USING btree (id)
 p_links       | CREATE INDEX p_links ON public.photos USING btree (photo)

In my analysis, I noticed that the GroupAggregate is time-consuming:

GroupAggregate (cost=126222.21..126222.71 rows=25 width=129) (actual time=729.497..729.506 rows=5 loops=1)
  Group Key: answers.id

Is there any way to avoid the time-consuming GROUP BY? Am I missing something with my indexes? Here is the query itself:

SELECT answers.id, 
       question_id, 
       body, 
       date_written, 
       answerer_name, 
       answerer_email, 
       reported, 
       helpful, 
       ARRAY_AGG(photo) as photos 
FROM answers 
  LEFT JOIN photos ON answers.id = photos.answer_id 
WHERE reported IS false 
  AND answers.id IN (SELECT id 
                     FROM answers 
                     WHERE question_id = 20012) 
GROUP BY answers.id 
ORDER BY helpful DESC;

Thanks!

【Question comments】:

【Answer 1】:

I think you can skip the subquery:

SELECT answers.id, question_id, body, date_written, answerer_name, answerer_email, reported, helpful, ARRAY_AGG(photo) as photos 
FROM answers 
LEFT JOIN photos ON answers.id = photos.answer_id 
WHERE reported IS false AND question_id = 20012 
GROUP BY answers.id, question_id, body, date_written, answerer_name, answerer_email, reported, helpful
ORDER BY helpful DESC;

You can add a btree index on photos.answer_id, since that column is used in the join clause.
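
The index suggested above could be created roughly as follows. This is a minimal sketch: the index name idx_photos_answer_id is a made-up name for illustration, and the question's existing index list shows no index on photos.answer_id, so the join currently has no direct way to look up the photos of a given answer.

```sql
-- Hypothetical index name; any unused name works.
-- Lets the join fetch the photos of one answer without scanning the whole photos table.
CREATE INDEX idx_photos_answer_id ON photos USING btree (answer_id);

-- Refresh planner statistics so the new index is costed with current data.
ANALYZE photos;

-- Re-check the plan and timing after the change.
EXPLAIN (ANALYZE, BUFFERS)
SELECT answers.id, ARRAY_AGG(photo) AS photos
FROM answers
LEFT JOIN photos ON answers.id = photos.answer_id
WHERE reported IS false AND question_id = 20012
GROUP BY answers.id;
```

Whether the planner actually picks the index depends on table sizes and statistics, so the EXPLAIN output is the thing to compare against the original 729 ms plan.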

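You were also missing those same columns in the GROUP BY clause. (PostgreSQL accepts GROUP BY answers.id on its own when id is the table's primary key, so listing them is optional there.)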

【Comments】:

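I would try changing the order of the WHERE conditions to WHERE question_id = 20012 AND reported IS false, so that the number of rows is restricted first, although PostgreSQL may well do this automatically.

【Answer 2】: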

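A common approach is to aggregate first and then join to the aggregated result (instead of aggregating the full join result). You also don't need the IN condition: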

SELECT a.id, 
       a.question_id, 
       a.body, 
       a.date_written, 
       a.answerer_name, 
       a.answerer_email, 
       a.reported, 
       a.helpful, 
       p.photos 
FROM answers a
  LEFT JOIN (
    select answer_id, array_agg(photo) as photos 
    from photos
    group by answer_id
  ) p ON a.id = p.answer_id 
WHERE a.reported IS false 
  AND a.question_id = 20012
ORDER BY a.helpful DESC;
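
A variation on the same "aggregate first" idea, offered as a sketch rather than as part of the original answer: the derived table above groups the entire photos table, whereas a LATERAL subquery aggregates only the photos of the answers that survive the WHERE clause. It assumes an index on photos.answer_id exists (the question's index list does not show one); without it, each lateral lookup would scan photos. The alias ph is introduced here for illustration.

```sql
SELECT a.id,
       a.question_id,
       a.body,
       a.date_written,
       a.answerer_name,
       a.answerer_email,
       a.reported,
       a.helpful,
       p.photos
FROM answers a
  LEFT JOIN LATERAL (
    -- Runs once per qualifying answer row; an aggregate without GROUP BY
    -- always returns exactly one row, so no answer is dropped.
    SELECT array_agg(ph.photo) AS photos
    FROM photos ph
    WHERE ph.answer_id = a.id
  ) p ON true
WHERE a.reported IS false
  AND a.question_id = 20012
ORDER BY a.helpful DESC;
```

For an answer with no photos, this form yields NULL in the photos column; wrapping it as COALESCE(p.photos, '{}') gives an empty array instead, if that is what the caller expects.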

【Comments】:
