Indexing jsonb column in Postgresql and IN clause

Posted 2023-04-12

[Title] Indexing jsonb column in Postgresql and IN clause [Published] 2020-07-18 09:29:54 [Question]:

I'm stuck with jsonb indexing and need help. I have a table with a jsonb column:

+-------+----------+------------------------------------------------------------+-------+
|id     |measure_id|parameters                                                  |value  |
+-------+----------+------------------------------------------------------------+-------+
|564174 |19        |{"1": 12, "2": 59, "5": 79, "6": 249, "7": 248, "8": 412}   |42.461 |
|564176 |19        |{"1": 12, "2": 59, "5": 80, "6": 249, "7": 248, "8": 412}   |46.198 |
|568244 |19        |{"1": 12, "2": 316, "5": 129, "6": 249, "7": 248, "8": 412} |19.482 |
|568246 |19        |{"1": 12, "2": 316, "5": 130, "6": 249, "7": 248, "8": 412} |20.051 |
|572313 |19        |{"1": 12, "2": 331, "5": 113, "6": 249, "7": 248, "8": 412} |7.098  |
|596434 |19        |{"1": 193, "2": 297, "5": 124, "6": 249, "7": 248, "8": 412}|103.253|
|682354 |22        |{"1": 427, "2": 25, "5": 121, "6": 426, "9": 441, "11": 428}|0.132  |
|686423 |22        |{"1": 427, "2": 60, "5": 72, "6": 426, "9": 443, "11": 428} |0.000  |
|1682439|44        |{"1": 193, "2": 518, "5": 91, "6": 426, "9": 429, "11": 431}|8.321  |
|1686787|44        |{"1": 193, "2": 515, "5": 96, "6": 426, "9": 429, "11": 431}|23.062 |
+-------+----------+------------------------------------------------------------+-------+

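These are statistics: each row has a measure and a set of parameter settings. The number of parameters differs from measure to measure, which is why I put them in a jsonb column. What I have to do:

Select all distinct measures and parameters: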

 SELECT DISTINCT 
     measure_id, 
     jsonb_object_keys(parameters) AS parameter_id, 
     parameters -> jsonb_object_keys(parameters) AS parameter_value_id 
 FROM data;

Select data from this table:

 SELECT d.id,
    d.measure_id,
    CAST(d.attributes as TEXT) as attributes,
    CAST(d.parameters as TEXT) as parameters,
    d.value
 FROM data d
 WHERE d.measure_id=19
   AND (jsonb_extract_path(d.parameters, '1')::bigint in (12))
   AND (jsonb_extract_path(d.parameters, '2')::bigint in (2,59))
   AND (jsonb_extract_path(d.parameters, '5')::bigint in (79, 80, 129, 130, 113))
   AND (jsonb_extract_path(d.parameters, '6')::bigint in (249))
   AND (jsonb_extract_path(d.parameters, '7')::bigint in (248))
   AND (jsonb_extract_path(d.parameters, '8')::bigint in (412))
 ORDER BY d.id;

Both queries run slowly. My indexes:

CREATE INDEX idx_data_measure ON data USING btree (measure_id);

CREATE INDEX idx_data_parameters
ON data USING btree (((parameters ->> '1'::text)::bigint), ((parameters ->> '2'::text)::bigint),
         ((parameters ->> '5'::text)::bigint), ((parameters ->> '6'::text)::bigint),
         ((parameters ->> '7'::text)::bigint), ((parameters ->> '8'::text)::bigint),
         ((parameters ->> '9'::text)::bigint), ((parameters ->> '10'::text)::bigint),
         ((parameters ->> '11'::text)::bigint), ((parameters ->> '458'::text)::bigint),
         ((parameters ->> '717'::text)::bigint), ((parameters ->> '718'::text)::bigint),
         ((parameters ->> '719'::text)::bigint), ((parameters ->> '720'::text)::bigint));

I tried creating a combined index:

CREATE INDEX idx_data_parameters ON data USING btree (measure_id, ((parameters ->> '1'::text)::bigint),...

But that didn't help.

I tried EXPLAIN ANALYZE, but honestly I don't understand it :(

EXPLAIN ANALYZE
SELECT DISTINCT
        measure_id,
        jsonb_object_keys(parameters) AS parameter_id,
        parameters -> jsonb_object_keys(parameters) AS parameter_value_id
FROM data;

QUERY PLAN
Unique  (cost=2212571.28..2222400.17 rows=982889 width=72) (actual time=79346.142..84316.123 rows=5050 loops=1)
  ->  Sort  (cost=2212571.28..2215028.50 rows=982889 width=72) (actual time=79346.141..82358.141 rows=5586011 loops=1)
        Sort Key: measure_id, (jsonb_object_keys(parameters)), ((parameters -> (jsonb_object_keys(parameters))))
        Sort Method: external merge  Disk: 202816kB
        ->  Gather  (cost=1000.00..2034108.05 rows=982889 width=72) (actual time=2467.949..63448.545 rows=5586011 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Result  (cost=0.00..1934819.15 rows=40953700 width=72) (actual time=2432.167..63305.298 rows=1862004 loops=3)
                    ->  ProjectSet  (cost=0.00..1218129.40 rows=40953700 width=156) (actual time=2432.151..62251.992 rows=1862004 loops=3)
                          ->  Parallel Seq Scan on data  (cost=0.00..1010289.37 rows=409537 width=124) (actual time=2432.118..61448.821 rows=327630 loops=3)
Planning Time: 0.417 ms
Execution Time: 84406.575 ms

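I feel that I have the wrong indexes, but I can't work out how to create the right ones. As far as I understand, GIN is not a good idea because I need IN clauses for the parameters, so I made a BTREE. Please help.

EDIT 1: PG version: PostgreSQL 11.8. Also updated the query to fit the sample data.

EDIT 2: Query plan for the data selection SELECT...WHERE...: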

Sort  (cost=1030.03..1030.04 rows=1 width=83) (actual time=63.659..63.661 rows=5 loops=1)
  Sort Key: id
  Sort Method: quicksort  Memory: 26kB
  Buffers: shared hit=4881
  ->  Index Scan using idx_data_measure on data d  (cost=0.55..1030.02 rows=1 width=83) (actual time=0.044..63.635 rows=5 loops=1)
        Index Cond: (measure_id = 19)
        Filter: (((jsonb_extract_path(parameters, VARIADIC '{2}'::text[]))::bigint = ANY ('{2,59}'::bigint[])) AND ((jsonb_extract_path(parameters, VARIADIC '{1}'::text[]))::bigint = 12) AND ((jsonb_extract_path(parameters, VARIADIC '{6}'::text[]))::bigint = 249) AND ((jsonb_extract_path(parameters, VARIADIC '{7}'::text[]))::bigint = 248) AND ((jsonb_extract_path(parameters, VARIADIC '{8}'::text[]))::bigint = 412) AND ((jsonb_extract_path(parameters, VARIADIC '{5}'::text[]))::bigint = ANY ('{79,80,129,130,113}'::bigint[])))
        Rows Removed by Filter: 28733
        Buffers: shared hit=4881
Planning Time: 0.451 ms
Execution Time: 64.973 ms

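I can see that idx_data_measure is being used, and that's all...

[Comments]:

Your select distinct query has to open up and expand the parameters object for every row. The index you created on this table is far more work (and uses far more space) than a wide, sparse table or a pair of related tables would be.

I added the pg version to the post and edited the query to fit the sample data.

@a_horse_with_no_name That works for a single value. But how do I handle multiple values? Make it where parameters @> '...' OR parameters @> '...'? That would be a very long query, because users can choose any set of parameters.

It seems the condition on measure_id is already good enough to use the index on that column, so you might not need any additional index.

@a_horse_with_no_name Well, I have 3M records on my dev machine and over 200M records on the server, and this index is not enough there.

Then you should add the explain (analyze) output from the production server.

[Answer 1]: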

With Postgres 12, I would try something like this:

create index on data (measure_id);
create index on data using gin (parameters);

The operators @> and @? can use a GIN index, so maybe this gives better performance.

select *
from data
where measure_id = 19 
  and parameters @? '$."5" ? (@ == 79 || @ == 80 || @ == 129 || @ == 130 || @ == 113)'
  and parameters @> '{"1": 12, "2": 59, "6": 249, "7": 248, "8": 412}';

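In Postgres 11 you cannot use the @? operator, so you could split the condition into one part that uses the @> operator (so that the GIN index can be used) and another part that uses an IN condition. This will only be efficient if the condition using the @> operator is highly selective.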

select *
from data
where measure_id = 19 
  and parameters @> '{"1": 12, "2": 59, "6": 249, "7": 248, "8": 412}'
  and (parameters ->> '5')::bigint in (79,80,129,130,113);
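
The question's original filter allowed two values for parameter "2" (2 and 59), while the query above hard-codes 59. One way to keep such a multi-valued parameter on the GIN index, along the lines of the OR idea raised in the comments, is to OR together one containment test per allowed value. This is only a sketch; whether the planner actually combines the index scans (for example with a BitmapOr) should be checked with EXPLAIN on the real data.

```sql
-- Sketch for Postgres 11: single-valued parameters go into one @> test,
-- a multi-valued parameter becomes an OR of @> tests (one per allowed value).
SELECT *
FROM data
WHERE measure_id = 19
  AND parameters @> '{"1": 12, "6": 249, "7": 248, "8": 412}'
  AND (parameters @> '{"2": 2}' OR parameters @> '{"2": 59}')
  -- long value lists can stay as a plain IN filter applied after the index lookup
  AND (parameters ->> '5')::bigint IN (79, 80, 129, 130, 113);
```

The query grows with the number of selected values, so it is probably best generated by the application rather than written by hand.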

Note that there is no need to use jsonb_extract_path(). I don't know for sure, but maybe using ->> is faster.
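
There is a second reason to prefer ->> here: an expression index is only considered when the query contains the same expression that was indexed. The idx_data_parameters index from the question is built on (parameters ->> '1')::bigint and so on, while the question's query filters on jsonb_extract_path(...)::bigint, so those conditions cannot match that index. A sketch of the question's query rewritten so that the expressions line up; whether the planner then prefers that index over idx_data_measure depends on the data and is not guaranteed:

```sql
-- Same filter as in the question, written with the expressions used in idx_data_parameters
SELECT d.id,
       d.measure_id,
       d.parameters,
       d.value
FROM data d
WHERE d.measure_id = 19
  AND (d.parameters ->> '1')::bigint IN (12)
  AND (d.parameters ->> '2')::bigint IN (2, 59)
  AND (d.parameters ->> '5')::bigint IN (79, 80, 129, 130, 113)
  AND (d.parameters ->> '6')::bigint IN (249)
  AND (d.parameters ->> '7')::bigint IN (248)
  AND (d.parameters ->> '8')::bigint IN (412)
ORDER BY d.id;
```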

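The plan you posted is for the query that retrieves all rows from the table and also unnests every element in the JSON. No index can speed that up.

But maybe doing the unnesting only once would be faster: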

SELECT DISTINCT
        d.measure_id,
        p.ky AS parameter_id,
        p.value AS parameter_value_id
FROM data d
  cross join jsonb_each(d.parameters) as p(ky, value)

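But again: no index can help that query, because you are retrieving all rows from the table.

[Comments]:

Will try the GIN index and the query for v11. Should the measure and parameters indexes be kept separate or combined into one USING GIN (measure_id, parameters)?

If you want to combine them, you need to install the btree_gin extension.

OK, thanks, will try it and come back with an update.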
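
The combined index mentioned in the last comments could look roughly like the sketch below. The index name is made up for illustration, and it assumes measure_id is an integer or bigint column, both of which btree_gin covers.

```sql
-- btree_gin is a contrib extension that provides GIN operator classes
-- for plain scalar types, so they can share one GIN index with a jsonb column.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Hypothetical index name; one GIN index over both columns
CREATE INDEX idx_data_measure_parameters
    ON data USING gin (measure_id, parameters);
```

With this index in place, a condition such as measure_id = 19 AND parameters @> '{...}' can be answered from a single index, though it is worth comparing the plan against the two separate indexes before settling on one.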