Extracting unique result and reducing cost in star schema in postgres
Posted: 2020-11-02 09:20:27
Question:
I have the following star schema, involving a bunch of tables shown below, to determine the availability of books in a library.
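f_book_availability: this fact table holds the availability of books along with references to the dimension tables, which I explain below. Index: primary key.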
+-------------------------------------------------------------------------------+
| id | book_id |publisherid | location_id | genre_id | date_id| available |
| | | | | | | |
+-------------------------------------------------------------------------------+
| 1 | 1 | 1 | 72 | 1 | 1 | 1 |
| | | | | | | |
+-------------------------------------------------------------------------------+
| 2 | 2 | 1 | 60 | 2 | 1 | 1 |
| | | | | | | |
+-------------------------------------------------------------------------------+
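d_book: this dimension table contains details about the book, such as name and type. Type has only the ids 1 and 2: 1 means "published by a publisher" and 2 means "self-published", and there is no reference table for it. Index: primary key.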
+-----------+-----------+------------+
| id | type | name |
+------------------------------------+
| 1 | 1 | LOR |
+------------------------------------+
| 2 | 2 | My life |
+-----------+-----------+------------+
d_publisher: this dimension table contains publisher information. Index: primary key.
+-----------+------------
| id | name |
+-----------------------+
| 1 | abc |
+-----------------------+
| 2 | def |
+-----------+------------
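d_location: this dimension is a tricky one. (Note that the schema already exists and I cannot modify it.) It has the id of a place and the id of a parent place. The hierarchy is stored against the id: for example, if you take id 72, which is a leaf node and denotes a rack in a library, you can see that the leaf node has four entries with different parents, from which you can read the hierarchy country -> city -> library -> rack. This holds for every place. For example, given the location of a library, you can find the hierarchy country -> city -> library. Indexes: 1) primary key (id & parentId), 2) parentId.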
+-----------+-----------+------------+------------+
| id | parent_id | name | c_code |
| | | | |
+-------------------------------------------------+
| 1 | 1 | France | FR |
+-------------------------------------------------+
| 4 | 1 | Paris | FR |
+-------------------------------------------------+
| 4 | 4 | Paris | FR |
+-------------------------------------------------+
| 25 | 1 | GtLibrary | FR |
+-------------------------------------------------+
| 25 | 4 | GtLibrary | FR |
+-------------------------------------------------+
| 25 | 25 | GtLibrary | FR |
+-------------------------------------------------+
| 72 | 1 | Rack1 | FR |
+-------------------------------------------------+
| 72 | 4 | Rack1 | FR |
+-------------------------------------------------+
| 72 | 25 | Rack1 | FR |
+-------------------------------------------------+
| 72 | 72 | Rack1 | FR |
+-----------+-----------+------------+------------+
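The layout above behaves like a closure table: one row per ancestor plus one row pointing at the location itself. Here is a minimal sketch of how the hierarchy can be read back, assuming (as in the sample rows) that every location has a self row and exactly one row per ancestor, so that the number of rows for an id equals its depth. It uses Python's `sqlite3` only to stay self-contained; the SQL is plain and should carry over to Postgres, though it has not been run there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE d_location (id INT, parent_id INT, name TEXT, c_code TEXT, "
    "PRIMARY KEY (id, parent_id))"
)
rows = [
    (1, 1, "France", "FR"),
    (4, 1, "Paris", "FR"), (4, 4, "Paris", "FR"),
    (25, 1, "GtLibrary", "FR"), (25, 4, "GtLibrary", "FR"), (25, 25, "GtLibrary", "FR"),
    (72, 1, "Rack1", "FR"), (72, 4, "Rack1", "FR"),
    (72, 25, "Rack1", "FR"), (72, 72, "Rack1", "FR"),
]
conn.executemany("INSERT INTO d_location VALUES (?, ?, ?, ?)", rows)

# Assumption: depth of a location = number of rows it has (ancestors + itself).
# Ordering the ancestors of one id by their own depth gives the path from the root.
PATH_SQL = """
SELECT anc.name
FROM d_location loc
JOIN (SELECT id, MIN(name) AS name, COUNT(*) AS depth
      FROM d_location
      GROUP BY id) anc
  ON anc.id = loc.parent_id
WHERE loc.id = ?
ORDER BY anc.depth
"""
path = [name for (name,) in conn.execute(PATH_SQL, (72,))]
print(" -> ".join(path))
```

If a location could be missing its self row or one of its ancestor rows, the row count would no longer equal the depth and this ordering would not be reliable.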
d_genre: this dimension table holds genre information. Index: primary key.
+-----------+------------
| id | name |
+-----------------------+
| 1 | fantasy |
+-----------------------+
| 2 | horror |
+-----------+------------
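d_date: this dimension table contains all dates. (I have left out other columns such as month, day, year, day of week and week number for simplicity; I mention them only so you know the table holds more than just the date, because otherwise it looks silly :) ) Indexes: 1) primary key, 2) date.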
+-----------+--------------
| id | date |
+-------------------------+
| 1 | 2020-11-25|
+-------------------------+
| 2 | 2019-10-24|
+-----------+--------------
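From these tables, I am trying to find out exactly whether a book is available on a given date, together with information such as its publisher, genre, date, location and immediate parent location.
I wrote the following query: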
select
    fba.id,
    location.c_code as country,
    parentLocation.name as parentPlace,
    location.id as locationId,
    location.name as locationName,
    publisher.id as "publisherId",
    publisher.name as publisherName,
    case when book.type = 1 then 'published' else 'self-published' end as "bookType",
    book.type as typeId,
    genre.name as genreName,
    book.id as "bookId",
    book.name as bookTitle,
    d."date",
    fba.available
from f_book_availability fba
join d_book book on fba.book_id = book.id
join d_publisher publisher on fba.publisherid = publisher.id
join d_location location on fba.location_id = location.id
join d_location parentLocation on location.parent_id = parentLocation.id
join d_genre genre on fba.genre_id = genre.id
join d_date d on fba.date_id = d.id
where
    location.id <> location.parent_id
    and d."date" >= now() and d."date" <= '2020-12-01'
    and location.c_code = 'FR'
    and book.type = 1
    and genre.name = 'fantasy'
Actual output:
+--------+---------+-------------+------------+--------------+-------------+---------------+-----------+--------+-----------+--------+-----------+------------+-----------+
| fba_id | country | parentPlace | locationId | locationName | publisherId | publisherName | bookType  | typeId | genreName | bookId | bookTitle | date       | available |
+--------+---------+-------------+------------+--------------+-------------+---------------+-----------+--------+-----------+--------+-----------+------------+-----------+
| 1      | FR      | France      | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | Paris       | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | Paris       | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | GtLibrary   | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | GtLibrary   | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
| 1      | FR      | GtLibrary   | 72         | Rack1        | 1           | abc           | published | 1      | fantasy   | 1      | LOR       | 2020-11-25 | 1         |
+--------+---------+-------------+------------+--------------+-------------+---------------+-----------+--------+-----------+--------+-----------+------------+-----------+
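Expected output: as you can see above, there are duplicates because of the join with the parent location and the condition on it.
The best outcome would be to get a single row, the last one above, showing the rack and the library. It would be easier if location had an immediateParent flag, but I cannot change the schema.
The second-best outcome would be distinct records, so in this case only 3 records: rack and library, rack and city, rack and country. That would be acceptable if I cannot achieve the first option. However, the DISTINCT clause is very costly and I do not know how to reduce the cost.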
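The first option (one row per fact, with only the immediate parent) can be sketched without changing the schema. This is a minimal sketch under the same assumption as the sample data: every location has a self row plus one row per ancestor, so the immediate parent is the ancestor that has exactly one row fewer than the location. The CTE names `loc_depth` and `direct_parent` are made up for this example, the fact table is reduced to three columns, and `sqlite3` is used only to make it runnable; the SQL itself is plain and has not been tested on Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_location (id INT, parent_id INT, name TEXT, c_code TEXT,
                         PRIMARY KEY (id, parent_id));
INSERT INTO d_location VALUES
  (1, 1, 'France', 'FR'),
  (4, 1, 'Paris', 'FR'), (4, 4, 'Paris', 'FR'),
  (25, 1, 'GtLibrary', 'FR'), (25, 4, 'GtLibrary', 'FR'), (25, 25, 'GtLibrary', 'FR'),
  (72, 1, 'Rack1', 'FR'), (72, 4, 'Rack1', 'FR'),
  (72, 25, 'Rack1', 'FR'), (72, 72, 'Rack1', 'FR');
-- Fact table reduced to the columns this sketch needs.
CREATE TABLE f_book_availability (id INT PRIMARY KEY, location_id INT, available INT);
INSERT INTO f_book_availability VALUES (1, 72, 1), (2, 60, 1);
""")

DIRECT_PARENT_SQL = """
WITH loc_depth AS (          -- rows per id = ancestors + self (assumption)
    SELECT id, COUNT(*) AS depth FROM d_location GROUP BY id
),
direct_parent AS (           -- keep only the ancestor exactly one level up
    SELECT loc.id, loc.parent_id
    FROM d_location loc
    JOIN loc_depth child ON child.id = loc.id
    JOIN loc_depth par   ON par.id = loc.parent_id
    WHERE par.depth = child.depth - 1
)
SELECT fba.id, self.name, par.name, fba.available
FROM f_book_availability fba
JOIN direct_parent dp ON dp.id = fba.location_id
-- Join on the self rows so each location contributes exactly one name row.
JOIN d_location self ON self.id = dp.id AND self.parent_id = dp.id
JOIN d_location par  ON par.id = dp.parent_id AND par.parent_id = dp.parent_id
ORDER BY fba.id
"""
result = conn.execute(DIRECT_PARENT_SQL).fetchall()
print(result)
```

Fact row 2 drops out here only because location 60 is not among the sample d_location rows. Root locations (depth 1) have no direct parent and are excluded as well, which mirrors the `location.id <> location.parent_id` filter in the original query.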
Cost (EXPLAIN output):
Unique (cost=12070116.66..12072185.22 rows=63648 width=179)
  -> Sort (cost=12070116.66..12070275.78 rows=63648 width=179)
        Sort Key: fba.id, parentLocation.name, location.id, location.name, publisher.id, publisher.name, (CASE WHEN (book.type = 1) THEN 'published'::text ELSE 'self-published'::text END), genre.name, book.id, book.name, d.date, fba.available
        -> Hash Join (cost=3316.39..12059378.75 rows=63648 width=179)
              Hash Cond: (location.parent_id= parentLocation.id)
              -> Hash Join (cost=2601.08..12057653.07 rows=19090 width=141)
                    Hash Cond: (fa.publisher_id = publisher.id)
                    -> Hash Join (cost=2475.28..12057477.06 rows=19090 width=120)
                          Hash Cond: (fba.date_id = d.id)
                          -> Gather (cost=2466.05..12051967.29 rows=2092656 width=124)
                                Workers Planned: 2
                                -> Hash Join (cost=1466.05..11841701.69 rows=871940 width=124)
                                      Hash Cond: (fba.location_id = location.id)
                                      -> Hash Join (cost=820.92..11816871.09 rows=1374762 width=99)
                                            Hash Cond: (fa.book_id = book.id)
                                            -> Hash Join (cost=8.30..11808952.45 rows=2706393 width=54)
                                                  Hash Cond: (fba.genre_id = genre.id)
                                                  -> Parallel Seq Scan on f_book_availability fba (cost=0.00..11047094.57 rows=278758458 width=45)
                                                  -> Hash (cost=8.29..8.29 rows=1 width=25)
                                                        -> Seq Scan on d_genre genre(cost=0.00..8.29 rows=1 width=25)
                                                              Filter: ((tech_en)::text = 'fantasy'::text)
                                            -> Hash (cost=702.26..702.26 rows=8829 width=53)
                                                  -> Seq Scan on d_book book (cost=0.00..702.26 rows=8829 width=53)
                                                        Filter: (type = 1)
                                      -> Hash (cost=613.88..613.88 rows=2500 width=33)
                                            -> Seq Scan on d_location location (cost=0.00..613.88 rows=2500 width=33)
                                                  Filter: ((id <> parent_id) AND ((c_code)::text = 'FR'::text))
                          -> Hash (cost=8.86..8.86 rows=29 width=8)
                                -> Index Scan using date_unique on d_date d (cost=0.28..8.86 rows=29 width=8)
                                      Index Cond: ((date >= now()) AND (date <= '2020-12-01'::date))
                    -> Hash (cost=98.69..98.69 rows=2169 width=29)
                          -> Seq Scan on d_publisher publisher (cost=0.00..98.69 rows=2169 width=29)
              -> Hash (cost=546.25..546.25 rows=13525 width=22)
                    -> Seq Scan on d_location parentLocation (cost=0.00..546.25 rows=13525 width=22)
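For the second option (one row per ancestor), the duplicates come from `parentLocation` matching every row that shares the ancestor's id, so DISTINCT may not be needed at all. The sketch below restricts `parentLocation` to its self row (`id = parent_id`), assuming, as in the sample data, that every location has exactly one such row. The fact table is reduced to three columns and `sqlite3` is used only to make the example runnable; the SQL is plain but untested on Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_location (id INT, parent_id INT, name TEXT, c_code TEXT,
                         PRIMARY KEY (id, parent_id));
INSERT INTO d_location VALUES
  (1, 1, 'France', 'FR'),
  (4, 1, 'Paris', 'FR'), (4, 4, 'Paris', 'FR'),
  (25, 1, 'GtLibrary', 'FR'), (25, 4, 'GtLibrary', 'FR'), (25, 25, 'GtLibrary', 'FR'),
  (72, 1, 'Rack1', 'FR'), (72, 4, 'Rack1', 'FR'),
  (72, 25, 'Rack1', 'FR'), (72, 72, 'Rack1', 'FR');
-- Fact table reduced to the columns this sketch needs.
CREATE TABLE f_book_availability (id INT PRIMARY KEY, location_id INT, available INT);
INSERT INTO f_book_availability VALUES (1, 72, 1);
""")

NO_DISTINCT_SQL = """
SELECT fba.id, parentLocation.name, location.name
FROM f_book_availability fba
JOIN d_location location ON fba.location_id = location.id
JOIN d_location parentLocation
  ON location.parent_id = parentLocation.id
 AND parentLocation.id = parentLocation.parent_id   -- self row only: one match per ancestor
WHERE location.id <> location.parent_id
ORDER BY parentLocation.id
"""
result = conn.execute(NO_DISTINCT_SQL).fetchall()
print(result)
```

Going by the estimates in the plan above, the Sort and Unique steps add only about 12.8 thousand cost units (12072185.22 versus 12059378.75 for the top Hash Join), while the Parallel Seq Scan on f_book_availability alone is estimated at about 11.05 million. So removing DISTINCT tidies the result but is unlikely to change the total much; the scan of the fact table is where most of the estimated cost sits.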
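Apologies if this post does not belong here, and for any typos: it took me two hours to create those ASCII tables, and there may be mistakes because I changed the column names just for the sake of the example.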
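Comments:
Hi - I'm afraid your data design is fundamentally broken and needs to be fixed for your query to work. In dimensional modelling we create unique surrogate keys as the PKs of the dimensions and use them as FKs in the fact table. Unfortunately, in your design (as you have pointed out), you are using what I assume is a source ID as the identifier in d_location, and it is not unique for the records in that table, hence your joins and query don't work. Any solution other than redesigning your data will just be working around the problem rather than actually fixing it.
Upvoted for the suggestion. I see your point. Actually d_location has a composite key of id and parent_id, so this dimension helps us greatly in viewing data with its hierarchy in other use cases.
There is basically nothing wrong with the data in d_location (though I would denormalize all the parent fields into this table), so I'm not suggesting you need to redesign that. It is the key design of the table that is wrong and needs correcting: you shouldn't have a compound PK in a dimension table, and as you've discovered, your queries don't work.
Makes sense. Thanks for your time and response. If you post it as an answer, I will happily mark it as the solution.
Answer 1:
Adding this as an answer so that you can tick it :)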
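There is basically nothing wrong with the data in d_location (though I would denormalize all the parent fields into this table), so I'm not suggesting you need to redesign it. It is the key design of the table that is wrong and needs correcting: you shouldn't have a compound PK in a dimension table, and as you've discovered, your queries don't work.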
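One way to picture the suggestion is a derived table with a single-column unique key and the parent names denormalized into columns. This is a sketch only: the table name `d_location_flat`, its index name and its column names are hypothetical, the four levels are taken from the hierarchy described in the question (country -> city -> library -> rack), and it assumes a location's depth equals its number of rows in d_location. It also assumes a derived table may be created alongside the existing schema, which the question does not confirm. `sqlite3` is used to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_location (id INT, parent_id INT, name TEXT, c_code TEXT,
                         PRIMARY KEY (id, parent_id));
INSERT INTO d_location VALUES
  (1, 1, 'France', 'FR'),
  (4, 1, 'Paris', 'FR'), (4, 4, 'Paris', 'FR'),
  (25, 1, 'GtLibrary', 'FR'), (25, 4, 'GtLibrary', 'FR'), (25, 25, 'GtLibrary', 'FR'),
  (72, 1, 'Rack1', 'FR'), (72, 4, 'Rack1', 'FR'),
  (72, 25, 'Rack1', 'FR'), (72, 72, 'Rack1', 'FR');

-- Hypothetical flattened dimension: one row per location, ancestors as columns.
CREATE TABLE d_location_flat AS
SELECT loc.id AS location_id,
       MAX(CASE WHEN anc.depth = 1 THEN anc.name END) AS country,
       MAX(CASE WHEN anc.depth = 2 THEN anc.name END) AS city,
       MAX(CASE WHEN anc.depth = 3 THEN anc.name END) AS library,
       MAX(CASE WHEN anc.depth = 4 THEN anc.name END) AS rack
FROM d_location loc
JOIN (SELECT id, MIN(name) AS name, COUNT(*) AS depth
      FROM d_location
      GROUP BY id) anc
  ON anc.id = loc.parent_id
GROUP BY loc.id;

-- A single-column unique key, so a fact row can match at most one location row.
CREATE UNIQUE INDEX d_location_flat_key ON d_location_flat (location_id);
""")

flat = conn.execute("SELECT * FROM d_location_flat ORDER BY location_id").fetchall()
for row in flat:
    print(row)
```

With one row per location, joining the fact table on `location_id` cannot multiply rows, and the immediate parent of a rack is simply the `library` column.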