Optimizing GROUP BY of jsonb array values in Postgres
Posted: 2021-03-18 16:09:07
Updated for better hardware.
I have the following simplified schema in Postgres 12, where a dataset has many cfiles and each cfile has property_values stored as jsonb:
SELECT * FROM cfiles;
 id | dataset_id | property_values (jsonb)
----+------------+-----------------------------------------------
  1 |          1 | {"Sample Names": ["SampA", "SampB"], ...other properties}
  2 |          1 | {"Sample Names": ["SampA", "SampC"], ...other properties}
  3 |          1 | {"Sample Names": ["SampD"], ...other properties}
Some of the "Sample Names" arrays are large, with up to ~100 short strings (~1K characters in total).
I'm using the query below to group by each value in the "Sample Names" jsonb arrays. The query gives the results I want, but takes ~15 seconds over ~45K rows on an AWS t2.micro with 1 vCPU and 1GB of RAM (explain query plan at the bottom):
SELECT
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
max(cfiles.last_modified) as last_modified,
string_agg(DISTINCT(users.email), ', ') as user_email,
string_agg(DISTINCT(groups.name), ', ') as group_name
FROM cfiles
JOIN datasets ON cfiles.dataset_id=datasets.id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cfiles.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cfiles.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
WHERE
cfiles.tid=5
-- Query needs to support sample name filtering eg:
-- AND property_values ->> 'Sample Names' like '%test%'
-- As well as filtering by columns from the other joined tables
GROUP BY sample_names
-- Below is required for pagination
ORDER BY "sample_names" desc NULLS LAST
All ID and join columns are indexed.
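As an aside, the commented-out sample-name filter can't use any of those btree indexes; below is a sketch of two expression indexes that could support it (the index names are invented, pg_trgm availability is assumed, and none of this is tested against the real schema):

-- A containment filter such as: property_values -> 'Sample Names' @> '["test"]'
-- can use a GIN index built directly on the array expression:
CREATE INDEX cfiles_sample_names_gin
ON cfiles USING gin ((property_values -> 'Sample Names'));

-- A substring filter such as: property_values ->> 'Sample Names' LIKE '%test%'
-- needs a trigram index instead (requires the pg_trgm extension):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX cfiles_sample_names_trgm
ON cfiles USING gin ((property_values ->> 'Sample Names') gin_trgm_ops);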
I've tried to cut the query down further, but I'm finding it tricky because it's hard to tie each incremental reduction to an improvement, and my final result needs to include everything.
Just cfiles on its own, without any joins, still takes ~12 seconds:
SELECT
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
max(cfiles.last_modified) as last_modified
FROM cfiles
WHERE
cfiles.tid=5
GROUP BY sample_names
ORDER BY "sample_names" desc NULLS LAST
LIMIT 20
OFFSET 0
I tried more of a CTE pattern below, but it was slower and didn't produce correct results:
WITH cf AS (
SELECT
cfiles.id as id,
cfiles.dataset_id as dataset_id,
jsonb_array_elements_text(property_values -> 'Sample Names') as sample_names,
cfiles.last_modified as last_modified
FROM cfiles
WHERE
cfiles.tid=5
GROUP BY sample_names, id, dataset_id, last_modified
ORDER BY "sample_names" desc NULLS LAST
LIMIT 20
OFFSET 0
)
SELECT cf.sample_names as sample_names,
max(cf.last_modified) as last_modified,
string_agg(DISTINCT(users.email), ', ') as user_email,
string_agg(DISTINCT(groups.name), ', ') as group_name
FROM cf
JOIN datasets ON cf.dataset_id=datasets.id
LEFT JOIN user_permissions ON (user_permissions.cfile_id=cf.id OR user_permissions.dataset_id=datasets.id)
LEFT JOIN users on users.id=user_permissions.user_id
LEFT JOIN group_permissions ON (group_permissions.cfile_id=cf.id OR group_permissions.dataset_id=datasets.id)
LEFT JOIN groups ON groups.id=group_permissions.group_id
LEFT JOIN user_groups ON groups.id=user_groups.group_id
GROUP BY sample_names
ORDER BY "sample_names" desc NULLS LAST
Is there any other way to get this query's performance down to a few seconds? I can rearrange it, and use temp tables and indexes, in whatever way is most efficient.
Would increasing RAM help?
Updated EXPLAIN ANALYZE VERBOSE output:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1046905.99..1046907.17 rows=20 width=104) (actual time=15394.929..15409.256 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), (max(cfiles.last_modified)), (string_agg(DISTINCT (users.email)::text, ', '::text)), (string_agg(DISTINCT (groups.name)::text, ', '::text))
-> GroupAggregate (cost=1046905.99..1130738.74 rows=1426200 width=104) (actual time=15394.927..15409.228 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), max(cfiles.last_modified), string_agg(DISTINCT (users.email)::text, ', '::text), string_agg(DISTINCT (groups.name)::text, ', '::text)
Group Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
-> Sort (cost=1046905.99..1057967.74 rows=4424700 width=104) (actual time=15394.877..15400.483 rows=11067 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), cfiles.last_modified, users.email, groups.name
Sort Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))) DESC NULLS LAST
Sort Method: external merge Disk: 163288kB
-> ProjectSet (cost=41.40..74530.20 rows=4424700 width=104) (actual time=0.682..2933.628 rows=3399832 loops=1)
Output: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)), cfiles.last_modified, users.email, groups.name
-> Nested Loop Left Join (cost=41.40..51964.23 rows=44247 width=526) (actual time=0.587..1442.326 rows=46031 loops=1)
Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name
Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 2391425
-> Nested Loop Left Join (cost=38.55..11694.81 rows=44247 width=510) (actual time=0.473..357.751 rows=46016 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id, groups.name
Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 616311
-> Hash Join (cost=35.81..4721.54 rows=44247 width=478) (actual time=0.388..50.189 rows=44255 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id
Inner Unique: true
Hash Cond: (cfiles.dataset_id = datasets.id)
-> Seq Scan on public.cfiles (cost=0.00..4568.77 rows=44247 width=478) (actual time=0.012..20.676 rows=44255 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 1567
-> Hash (cost=28.14..28.14 rows=614 width=8) (actual time=0.363..0.363 rows=614 loops=1)
Output: datasets.id
Buckets: 1024 Batches: 1 Memory Usage: 32kB
-> Seq Scan on public.datasets (cost=0.00..28.14 rows=614 width=8) (actual time=0.004..0.194 rows=614 loops=1)
Output: datasets.id
-> Materialize (cost=2.74..4.39 rows=9 width=48) (actual time=0.000..0.003 rows=14 loops=44255)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name
-> Hash Right Join (cost=2.74..4.35 rows=9 width=48) (actual time=0.049..0.071 rows=14 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name
Hash Cond: (user_groups.group_id = groups.id)
-> Seq Scan on public.user_groups (cost=0.00..1.38 rows=38 width=8) (actual time=0.003..0.011 rows=38 loops=1)
Output: user_groups.id, user_groups.tid, user_groups.user_id, user_groups.group_id, user_groups.created_at, user_groups.updated_at
-> Hash (cost=2.62..2.62 rows=9 width=56) (actual time=0.039..0.039 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Hash Right Join (cost=1.20..2.62 rows=9 width=56) (actual time=0.021..0.035 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
Hash Cond: (groups.id = group_permissions.group_id)
-> Seq Scan on public.groups (cost=0.00..1.24 rows=24 width=40) (actual time=0.003..0.008 rows=24 loops=1)
Output: groups.id, groups.tid, groups.name, groups.description, groups.default_uview, groups.created_at, groups.updated_at
-> Hash (cost=1.09..1.09 rows=9 width=24) (actual time=0.010..0.010 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, group_permissions.group_id
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on public.group_permissions (cost=0.00..1.09 rows=9 width=24) (actual time=0.003..0.006 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, group_permissions.group_id
-> Materialize (cost=2.85..4.78 rows=52 width=48) (actual time=0.000..0.010 rows=52 loops=46016)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
-> Hash Left Join (cost=2.85..4.52 rows=52 width=48) (actual time=0.037..0.076 rows=52 loops=1)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
Inner Unique: true
Hash Cond: (user_permissions.user_id = users.id)
-> Seq Scan on public.user_permissions (cost=0.00..1.52 rows=52 width=24) (actual time=0.003..0.014 rows=52 loops=1)
Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
-> Hash (cost=2.38..2.38 rows=38 width=40) (actual time=0.029..0.030 rows=38 loops=1)
Output: users.email, users.id
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on public.users (cost=0.00..2.38 rows=38 width=40) (actual time=0.004..0.016 rows=38 loops=1)
Output: users.email, users.id
Planning Time: 0.918 ms
Execution Time: 15442.689 ms
(67 rows)
Time: 15575.174 ms (00:15.575)
Update 2
After increasing work_mem to 256MB, the cfiles query without any joins dropped from 12 seconds to 2 seconds, but the full query still takes 11 seconds - new plan below:
Limit (cost=1197580.62..1197582.26 rows=20 width=104) (actual time=11049.784..11057.060 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), (max(cfiles.last_modified)), (string_agg(DISTINCT (users.email)::text, ', '::text)), (string_agg(DISTINCT (groups.name)::text, ', '::text))
-> GroupAggregate (cost=1197580.62..1313691.62 rows=1423800 width=104) (actual time=11049.783..11057.056 rows=20 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), max(cfiles.last_modified), string_agg(DISTINCT (users.email)::text, ', '::text), string_agg(DISTINCT (groups.name)::text, ', '::text)
Group Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)))
-> Sort (cost=1197580.62..1215107.62 rows=7010800 width=80) (actual time=11049.741..11051.064 rows=11067 loops=1)
Output: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))), cfiles.last_modified, users.email, groups.name
Sort Key: (jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text))) DESC NULLS LAST
Sort Method: external merge Disk: 163248kB
-> ProjectSet (cost=39.41..88894.93 rows=7010800 width=80) (actual time=0.314..1309.381 rows=3399832 loops=1)
Output: jsonb_array_elements_text((cfiles.property_values -> 'Sample Names'::text)), cfiles.last_modified, users.email, groups.name
-> Hash Left Join (cost=39.41..53139.85 rows=70108 width=502) (actual time=0.230..413.324 rows=46031 loops=1)
Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name
Hash Cond: (groups.id = user_groups.group_id)
-> Nested Loop Left Join (cost=37.55..51994.13 rows=44279 width=510) (actual time=0.218..405.437 rows=44535 loops=1)
Output: cfiles.property_values, cfiles.last_modified, users.email, groups.name, groups.id
Join Filter: ((user_permissions.cfile_id = cfiles.id) OR (user_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 2313897
-> Nested Loop Left Join (cost=34.70..11695.59 rows=44279 width=500) (actual time=0.166..99.664 rows=44520 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id, groups.name, groups.id
Join Filter: ((group_permissions.cfile_id = cfiles.id) OR (group_permissions.dataset_id = datasets.id))
Rows Removed by Join Filter: 396532
-> Hash Join (cost=33.16..4718.97 rows=44279 width=478) (actual time=0.141..30.449 rows=44255 loops=1)
Output: cfiles.property_values, cfiles.last_modified, cfiles.id, datasets.id
Inner Unique: true
Hash Cond: (cfiles.dataset_id = datasets.id)
-> Seq Scan on public.cfiles (cost=0.00..4568.77 rows=44279 width=478) (actual time=0.016..12.724 rows=44255 loops=1)
Output: cfiles.id, cfiles.tid, cfiles.uuid, cfiles.dataset_id, cfiles.path, cfiles.name, cfiles.checksum, cfiles.size, cfiles.last_modified, cfiles.content_type, cfiles.locked, cfiles.property_values, cfiles.created_at, cfiles.updated_at
Filter: (cfiles.tid = 5)
Rows Removed by Filter: 1567
-> Hash (cost=25.48..25.48 rows=614 width=8) (actual time=0.119..0.120 rows=614 loops=1)
Output: datasets.id
Buckets: 1024 Batches: 1 Memory Usage: 32kB
-> Index Only Scan using datasets_pkey on public.datasets (cost=0.28..25.48 rows=614 width=8) (actual time=0.010..0.057 rows=614 loops=1)
Output: datasets.id
Heap Fetches: 0
-> Materialize (cost=1.54..2.70 rows=9 width=38) (actual time=0.000..0.001 rows=9 loops=44255)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
-> Hash Left Join (cost=1.54..2.65 rows=9 width=38) (actual time=0.015..0.021 rows=9 loops=1)
Output: group_permissions.cfile_id, group_permissions.dataset_id, groups.name, groups.id
Inner Unique: true
Hash Cond: (group_permissions.group_id = groups.id)
-> Seq Scan on public.group_permissions (cost=0.00..1.09 rows=9 width=24) (actual time=0.002..0.003 rows=9 loops=1)
Output: group_permissions.id, group_permissions.tid, group_permissions.group_id, group_permissions.dataset_id, group_permissions.cfile_id, group_permissions.read, group_permissions.share, group_permissions.write_meta, group_permissions.manage_files, group_permissions.delete_files, group_permissions.task_id, group_permissions.notified_at, group_permissions.first_downloaded_at, group_permissions.created_at, group_permissions.updated_at
-> Hash (cost=1.24..1.24 rows=24 width=22) (actual time=0.009..0.010 rows=24 loops=1)
Output: groups.name, groups.id
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on public.groups (cost=0.00..1.24 rows=24 width=22) (actual time=0.003..0.006 rows=24 loops=1)
Output: groups.name, groups.id
-> Materialize (cost=2.85..4.78 rows=52 width=42) (actual time=0.000..0.003 rows=52 loops=44520)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
-> Hash Left Join (cost=2.85..4.52 rows=52 width=42) (actual time=0.021..0.036 rows=52 loops=1)
Output: user_permissions.cfile_id, user_permissions.dataset_id, users.email
Inner Unique: true
Hash Cond: (user_permissions.user_id = users.id)
-> Seq Scan on public.user_permissions (cost=0.00..1.52 rows=52 width=24) (actual time=0.002..0.005 rows=52 loops=1)
Output: user_permissions.id, user_permissions.tid, user_permissions.user_id, user_permissions.dataset_id, user_permissions.cfile_id, user_permissions.read, user_permissions.share, user_permissions.write_meta, user_permissions.manage_files, user_permissions.delete_files, user_permissions.task_id, user_permissions.notified_at, user_permissions.first_downloaded_at, user_permissions.created_at, user_permissions.updated_at
-> Hash (cost=2.38..2.38 rows=38 width=34) (actual time=0.016..0.017 rows=38 loops=1)
Output: users.email, users.id
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on public.users (cost=0.00..2.38 rows=38 width=34) (actual time=0.002..0.010 rows=38 loops=1)
Output: users.email, users.id
-> Hash (cost=1.38..1.38 rows=38 width=8) (actual time=0.009..0.010 rows=38 loops=1)
Output: user_groups.group_id
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on public.user_groups (cost=0.00..1.38 rows=38 width=8) (actual time=0.002..0.005 rows=38 loops=1)
Output: user_groups.group_id
Planning Time: 0.990 ms
Execution Time: 11081.013 ms
(69 rows)
Time: 11143.367 ms (00:11.143)
【Comments】:
Hi. The query is not using the index you created because the expression is not in the WHERE clause. Could you share the CREATE TABLE statements together with the indexes? With so many tables involved, things would be much easier if you could put together a minimal reproducible example. Is datasets.id indexed?
Thanks @JimJones - datasets.id is indexed. While preparing the minimal example I realized it takes only 3 seconds without the LEFT JOINs, so I'll try to refactor that part and report back.
What do you get if you strip out every table except cfiles?
Taking a minute to sort 3,399,832 rows occupying 163,296kB is astonishing. What is your hardware like?
Thanks @jjanes, you were right - my laptop hardware was too slow. I've updated everything for AWS, and the query plan is somewhat different too.
【Answer 1】:
I think you will have a hard time doing better than this with the current schema. Could you normalize the data, so that you have a table with one row per (tid, "Sample Names", id) combination, or one row per unique "Sample Names" value, or one per (tid, "Sample Names")?
And I don't think there is one generic answer to "as well as filtering by columns from the other joined tables". It depends on how selective the filters are and whether they are indexable.
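A minimal sketch of the kind of normalization being suggested is a side table with one row per (cfile, sample name). The table, column, and index names here are invented, and keeping it in sync with cfiles is left to the application or a trigger:

CREATE TABLE cfile_sample_names (
cfile_id bigint NOT NULL REFERENCES cfiles (id),
tid bigint NOT NULL,
sample_name text NOT NULL
);

-- one-time backfill: unnest each jsonb array into individual rows
INSERT INTO cfile_sample_names (cfile_id, tid, sample_name)
SELECT id, tid, jsonb_array_elements_text(property_values -> 'Sample Names')
FROM cfiles;

-- lets the GROUP BY / ORDER BY / LIMIT run off an ordered index
CREATE INDEX cfile_sample_names_tid_name ON cfile_sample_names (tid, sample_name);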
【Discussion】:
The right questions - unfortunately I'm tied to the current schema, but maybe I can implement some more normalized cache tables. Following this post I tried increasing work_mem to 256MB (twice the Sort Method: external merge Disk: 163288kB from the explain), but it didn't seem to have any effect. Is that because it has already hit some processing maximum, or is AWS imposing another limit?
Oh, actually it did make a huge difference to the cfiles query - down from 12 seconds to 2 seconds - so maybe I can rework the joins. I've updated with the new plan.
Increasing work_mem for a sort usually doesn't help much unless it is large enough for the entire sort to be done in memory. Since the on-disk sort format is more compact than the in-memory format, that generally means around 3x the reported external merge disk usage.
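As a rough illustration of that sizing rule, a session-level setting of about 3x the 163288kB the sort reported - roughly 500MB - might let the whole sort happen in memory. The multiplier is approximate and varies with the data:

-- try in a session first; ~3x the "external merge Disk" figure
SET work_mem = '512MB';
-- then re-run the query under EXPLAIN (ANALYZE, BUFFERS) and look for
-- "Sort Method: quicksort Memory: ..." instead of "external merge Disk: ..."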