Postgres - Replacing variety of names in a column for a respective unique identifier
【Asked】: 2013-09-27 17:07:42
【Question】: The list below holds people's names before and after marriage. Over time, some of them divorced and remarried and/or changed their names again. What I want to do is collect every name a person has had over their lifetime and add a new column with a unique identifier for each person.
This is the actual table, called Names:
Name_before                Name_after
Misti Gulick               Misti Gulick Thibodeaux
Faye Leaton                Faye Leaton Hemby
Arden Peck                 Arden Peck Mroz
Carlton Kingsley           Carlton Kingsley Mcelveen
Dolly Verhey               Dolly Verhey Irish
Gaynell Pasquale           Gaynell Pasquale Ayala
Misti Gulick Thibodeaux    Misti Thibodeaux
Faye Leaton Hemby          Faye Hemby
Arden Peck Mroz            Arden Mroz
Carlton Kingsley Mcelveen  Carlton Mcelveen
Dolly Verhey Irish         Dolly Irish
Gaynell Pasquale Ayala     Gaynell Ayala
Misti Thibodeaux           Misti Trey Thibodeaux
Faye Hemby                 Faye Barrett Hemby
Arden Mroz                 Arden Justin Mroz
Carlton Mcelveen           Carlton Tameka Mcelveen
Dolly Irish                Dolly Jeremiah Irish
Gaynell Ayala              Gaynell Cherry Ayala
The desired table would look like this:
Name_before                Name_after                 Identifier
Misti Gulick               Misti Gulick Thibodeaux    Misti Gulick
Faye Leaton                Faye Leaton Hemby          Faye Leaton
Arden Peck                 Arden Peck Mroz            Arden Peck
Carlton Kingsley           Carlton Kingsley Mcelveen  Carlton Kingsley
Dolly Verhey               Dolly Verhey Irish         Dolly Verhey
Gaynell Pasquale           Gaynell Pasquale Ayala     Gaynell Pasquale
Misti Gulick Thibodeaux    Misti Thibodeaux           Misti Gulick
Faye Leaton Hemby          Faye Hemby                 Faye Leaton
Arden Peck Mroz            Arden Mroz                 Arden Peck
Carlton Kingsley Mcelveen  Carlton Mcelveen           Carlton Kingsley
Dolly Verhey Irish         Dolly Irish                Dolly Verhey
Gaynell Pasquale Ayala     Gaynell Ayala              Gaynell Pasquale
Misti Thibodeaux           Misti Trey Thibodeaux      Misti Gulick
Faye Hemby                 Faye Barrett Hemby         Faye Leaton
Arden Mroz                 Arden Justin Mroz          Arden Peck
Carlton Mcelveen           Carlton Tameka Mcelveen    Carlton Kingsley
Dolly Irish                Dolly Jeremiah Irish       Dolly Verhey
Gaynell Ayala              Gaynell Cherry Ayala       Gaynell Pasquale
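What I tried was to find the values of Name_after that also occur in Name_before, and to repeat that until there were no more matches. Each time I create one of these tables, the number of names shrinks.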
create table name_temp1 as
select *
from Names
where Name_after in (select distinct(Name_before) from Names)
order by Name_before, Name_after;
create table name_temp2 as
select *
from name_temp1
where Name_after in (select distinct(Name_before) from name_temp1)
order by Name_before, Name_after;
create table name_temp3 as
select *
from name_temp2
where Name_after in (select distinct(Name_before) from name_temp2)
order by Name_before, Name_after;
Then I would use a query with a CASE expression:
-- note: there are no join conditions, so this is a cross join of all four tables
select *, case when n3.Name_before = n2.Name_after
then case when n2.Name_before = n1.Name_after
then n1.Name_after else n.Name_after end end
from Names n, name_temp1 n1, name_temp2 n2, name_temp3 n3;
I know this is neither elegant nor efficient. Could some of you help me improve it? Other suggestions are welcome too! Thanks,
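【Comments】:
Are these change records in chronological order? (There should be an ordering key, or at least a key, IMHO.)
@wildplasser: Actually, if the names are unique, you don't need any order. There is exactly one row per person with no predecessor and exactly one with no successor. Unambiguous. But of course, the structure is hardly useful for practical purposes.
Yes, but my sister has married the same person twice (with no marriage in between, and she kept her maiden name both times...). In most cases, a bit of implied order helps avoid mismatches.
【Answer 1】:
Schema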
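The goal of the whole operation should be a normalized schema: a table person with a surrogate primary key person_id (since there is no obvious natural primary key). I suggest a serial column for that.
Plus a table person_name with a foreign key referencing person: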
CREATE TEMP TABLE person(
person_id serial PRIMARY KEY -- implicit primary key constraint
-- probably more attributes belonging to the person
);
CREATE TEMP TABLE person_name(
person_name_id serial PRIMARY KEY
,person_id int NOT NULL REFERENCES person(person_id) -- foreign key
,name text NOT NULL
,step int DEFAULT 0
-- possibly more attributes that belong to the person at this step only
);
(person_id, name) cannot be made UNIQUE, because the same person can use the same name more than once in a lifetime.
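To extract the data, I would use a single query with a recursive CTE. However, if anybody ever used the same name twice, your operation is bound to be ambiguous. You may get nonsensical results or circular dependencies that cannot be resolved without additional information.
The rows in person_name with step = 0 hold your "Identifier".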
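Whether the data meets that uniqueness requirement can be checked before running anything recursive. The following is a minimal sketch in Python, assuming the rows have already been fetched as (name_before, name_after) pairs. The function name find_ambiguous_names and the extra "Faye Leaton Smith" row are illustrative assumptions, not part of the answer.

```python
from collections import Counter


def find_ambiguous_names(pairs):
    """Return names that occur more than once in either column.

    pairs: iterable of (name_before, name_after) tuples.
    A name repeated within one column means a chain forks or merges,
    so following name_before -> name_after links is no longer unambiguous.
    """
    pairs = list(pairs)
    before = Counter(b for b, _ in pairs)
    after = Counter(a for _, a in pairs)
    dup_before = {name for name, count in before.items() if count > 1}
    dup_after = {name for name, count in after.items() if count > 1}
    return sorted(dup_before | dup_after)


rows = [
    ("Misti Gulick", "Misti Gulick Thibodeaux"),
    ("Misti Gulick Thibodeaux", "Misti Thibodeaux"),
    ("Faye Leaton", "Faye Leaton Hemby"),
]
clean = find_ambiguous_names(rows)

# Hypothetical extra row: a second person who also starts out as "Faye Leaton".
clashing = find_ambiguous_names(rows + [("Faye Leaton", "Faye Leaton Smith")])
```

An empty result only rules out forks and merges. It does not detect a closed loop such as A -> B, B -> A, where every name is unique per column but no chain has a starting row.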
Query
For this query I assume UNIQUE names (otherwise it cannot work).
WITH RECURSIVE p_start AS (
SELECT row_number() OVER (ORDER BY n.name_before) AS person_id, n.*
FROM names n
LEFT JOIN names n2 ON n2.name_after = n.name_before
WHERE n2.name_after IS NULL
)
, pers AS (
SELECT person_id, name_after AS name, 1 AS step
FROM p_start
UNION ALL
SELECT p.person_id, n.name_after, p.step + 1
FROM pers p
JOIN names n ON n.name_before = p.name
-- WHERE p.step < 10 -- If query doesn't finish, stop the infinite recursion
)
SELECT person_id, name_before AS name, 0 AS step
FROM p_start
UNION ALL
SELECT person_id, name, step
FROM pers
ORDER BY person_id, step
-> SQLfiddle demo.
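For readers who find the recursion easier to follow procedurally, here is a rough Python sketch of the same idea: find the rows with no predecessor, number them, then walk each chain one step at a time. The function name assign_person_ids and the four sample rows are assumptions made for illustration. The sketch also sorts by plain code point order, whereas row_number() in Postgres follows the database collation, so the numbering can differ on other data.

```python
def assign_person_ids(pairs):
    """Number each chain of names and each step within it.

    pairs: iterable of (name_before, name_after) tuples with unique names.
    Returns (person_id, name, step) tuples ordered by person_id, then step.
    """
    pairs = list(pairs)
    next_name = dict(pairs)                      # name_before -> name_after
    all_after = {after for _, after in pairs}

    # Chain starts: names that never occur as a name_after (like p_start).
    starts = sorted(before for before, _ in pairs if before not in all_after)

    result = []
    for person_id, name in enumerate(starts, start=1):
        step = 0
        seen = set()
        # Follow the links (like the recursive part of pers). The seen set
        # plays the role of the commented-out step limit: it stops a loop
        # if the data is not as clean as assumed.
        while name is not None and name not in seen:
            seen.add(name)
            result.append((person_id, name, step))
            name = next_name.get(name)
            step += 1
    return result


rows = [
    ("Misti Gulick", "Misti Gulick Thibodeaux"),
    ("Faye Leaton", "Faye Leaton Hemby"),
    ("Misti Gulick Thibodeaux", "Misti Thibodeaux"),
    ("Faye Leaton Hemby", "Faye Hemby"),
]
chains = assign_person_ids(rows)
```

As in the query, the entry with step 0 for each person_id is the name that serves as the identifier.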
One-stop shop
With the schema above in place, you can do everything in a single query: populate the new tables and return the result:
WITH RECURSIVE p_start AS (
SELECT row_number() OVER (ORDER BY n.name_before) AS person_id, n.*
FROM names n
LEFT JOIN names n2 ON n2.name_after = n.name_before
WHERE n2.name_after IS NULL
)
, pers AS (
SELECT person_id, name_after AS name, 1 AS step
FROM p_start
UNION ALL
SELECT p.person_id, n.name_after, p.step + 1
FROM pers p
JOIN names n ON n.name_before = p.name
-- WHERE p.step < 10 -- If query doesn't finish, stop the infinite recursion
)
, ins_person AS (
INSERT INTO person(person_id)
SELECT person_id FROM p_start
)
INSERT INTO person_name(person_id, name, step)
SELECT person_id, name_before, 0 AS step
FROM p_start
UNION ALL
SELECT person_id, name, step
FROM pers
ORDER BY person_id, step
RETURNING *
-> SQLfiddle demo.
Finally, initialize the sequence for person so that you don't get duplicate key violations later:
SELECT setval('person_person_id_seq', (SELECT max(person_id) FROM person))
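【Comments】:
I had to resort to the implied order of the records to break potential cycles. Yours is cleaner, and more sensitive (but mine is less fragile, IMHO).
@wildplasser: About the schema: I also considered using a self-join to reference the previous name in person_name (like your parent_id). That makes later additions or deletions easier. I had even suggested indexes etc. (check the change log). But I went with the static numbering in step for simplicity. It's complex enough as it is.
The serial numbering at load time is a bit of a hack. BTW: this is the kind of thing I prefer to do in steps, since some of the intermediate steps may be needed later. Keeping everything in one table is a way to keep the mess to a minimum. (SQL is not SAS, dammit! ;-) The canon_id in the same table is pure win, IMHO. (It automatically implies an FK index on the column, which in most cases (conversion) is sufficient.)
@wildplasser: Both serials are by the book. person_name_id is not strictly necessary, but I prefer a surrogate pk over a composite pk on three columns. I agree, doing things in steps is a sane approach. In this case, if anything goes wrong, just recreate the new tables.
From a data-modelling point of view, at least two tables are needed. Plus at least a date or version number as an additional key component (this is effectively a history table). But even if you use temp/scratch tables (which seems like a sane approach), the added serial can serve as a surrogate. An order is still needed in the source table, though...
【Answer 2】: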
DROP SCHEMA IF EXISTS tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
-- make some data
CREATE TABLE names_org
( name_id SERIAL NOT NULL PRIMARY KEY
, name_org varchar
, name_new varchar
);
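-- The two columns in the data lines below must be separated by a single TAB
-- (the default delimiter of COPY's text format).
COPY names_org (name_org,name_new) FROM stdin;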
Misti Gulick Misti Gulick Thibodeaux
Faye Leaton Faye Leaton Hemby
Arden Peck Arden Peck Mroz
Carlton Kingsley Carlton Kingsley Mcelveen
Dolly Verhey Dolly Verhey Irish
Gaynell Pasquale Gaynell Pasquale Ayala
Misti Gulick Thibodeaux Misti Thibodeaux
Faye Leaton Hemby Faye Hemby
Arden Peck Mroz Arden Mroz
Carlton Kingsley Mcelveen Carlton Mcelveen
Dolly Verhey Irish Dolly Irish
Gaynell Pasquale Ayala Gaynell Ayala
Misti Thibodeaux Misti Trey Thibodeaux
Faye Hemby Faye Barrett Hemby
Arden Mroz Arden Justin Mroz
Carlton Mcelveen Carlton Tameka Mcelveen
Dolly Irish Dolly Jeremiah Irish
Gaynell Ayala Gaynell Cherry Ayala
\.
SELECT * FROM names_org;
And the alterations and updates (done in steps, for clarity):
--Add a few self-referencing fields
--
ALTER TABLE names_org
-- points to the **first** entry for this person
ADD COLUMN canon_id INTEGER
REFERENCES names_org (name_id)
-- points to the **nearest previous** entry for this person
, ADD COLUMN parent_id INTEGER
REFERENCES names_org (name_id)
;
-- Update from **the nearest** previous record; if any
UPDATE names_org dst
SET parent_id = src.name_id
FROM names_org src
-- src is the previous row for this person
WHERE src.name_new = dst.name_org
AND src.name_id < dst.name_id
-- The nearest: eliminate the middlemen
AND NOT EXISTS (SELECT *
FROM names_org nx
WHERE nx.name_new = dst.name_org
AND nx.name_id < dst.name_id
AND nx.name_id > src.name_id
);
-- Add the final newnames (at the end of the chains) to the table, too.
-- These are the name strings that only occur in name_new,
-- but never in name_org
INSERT INTO names_org (name_org, parent_id)
SELECT name_new, name_id
FROM names_org src
WHERE NOT EXISTS (
SELECT *
FROM names_org nx
WHERE nx.parent_id = src.name_id
);
-- Find canonical parent (the head of the chain)
WITH RECURSIVE list AS (
SELECT name_id AS canon_id
, name_id AS this_id
FROM names_org
WHERE parent_id IS NULL
UNION ALL
SELECT list.canon_id AS canon_id
, this.name_id AS this_id
FROM list
JOIN names_org this ON this.parent_id = list.this_id
)
UPDATE names_org this
SET canon_id = list.canon_id
FROM list
WHERE this.name_id = list.this_id
;
-- Now we can drop the new name and rename the org name
ALTER TABLE names_org DROP COLUMN name_new ;
ALTER TABLE names_org RENAME COLUMN name_org TO current_name ;
SELECT * FROM names_org;
Result:
ALTER TABLE
UPDATE 12
INSERT 0 6
UPDATE 24
ALTER TABLE
ALTER TABLE
 name_id |       current_name        | canon_id | parent_id
---------+---------------------------+----------+-----------
       1 | Misti Gulick              |        1 |
       2 | Faye Leaton               |        2 |
       3 | Arden Peck                |        3 |
       4 | Carlton Kingsley          |        4 |
       5 | Dolly Verhey              |        5 |
       6 | Gaynell Pasquale          |        6 |
       7 | Misti Gulick Thibodeaux   |        1 |         1
       8 | Faye Leaton Hemby         |        2 |         2
       9 | Arden Peck Mroz           |        3 |         3
      10 | Carlton Kingsley Mcelveen |        4 |         4
      11 | Dolly Verhey Irish        |        5 |         5
      12 | Gaynell Pasquale Ayala    |        6 |         6
      13 | Misti Thibodeaux          |        1 |         7
      14 | Faye Hemby                |        2 |         8
      15 | Arden Mroz                |        3 |         9
      16 | Carlton Mcelveen          |        4 |        10
      17 | Dolly Irish               |        5 |        11
      18 | Gaynell Ayala             |        6 |        12
      19 | Misti Trey Thibodeaux     |        1 |        13
      20 | Faye Barrett Hemby        |        2 |        14
      21 | Arden Justin Mroz         |        3 |        15
      22 | Carlton Tameka Mcelveen   |        4 |        16
      23 | Dolly Jeremiah Irish      |        5 |        17
      24 | Gaynell Cherry Ayala      |        6 |        18
(24 rows)
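Note: this awkward structure unifies the canonical name/number (the start of the linked list) and the update chain (a backward-linked list), all combined in a single table.
The update steps could probably be merged into a single statement, but I couldn't be bothered. And, as Erwin commented, the procedure is very sensitive to typos, false hits, mismatches and missing records. In particular, character-set glitches can be very painful.
In most cases, some manual steps will be needed somewhere in the process.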
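One way to take some of the pain out of character-set and whitespace glitches is to compare names through a normalized key rather than as raw strings. The sketch below is only an illustration of that idea in Python. The function name normalize_name and the chosen policy (NFKC normalization, whitespace collapsing, case folding) are assumptions, not something either answer prescribes, and it does nothing for genuine typos.

```python
import unicodedata


def normalize_name(raw):
    """Return a comparison key for a name (illustrative policy only)."""
    text = unicodedata.normalize("NFKC", raw)   # unify equivalent Unicode forms
    text = " ".join(text.split())               # trim and collapse whitespace
    return text.casefold()                      # make the key case-insensitive


# Composed and decomposed spellings of the same character yield the same key.
same_key = normalize_name("Zo\u00eb Peck") == normalize_name("Zoe\u0308  peck ")
```

Such a key could be stored in an extra column and used for the joins, while the original spelling stays available for display.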
And, to make things more complete: a view that emulates the desired table:
CREATE VIEW triple_view AS
SELECT
COALESCE(prev.current_name ,this.current_name) AS name_before
, this.current_name AS name_after
, abs.current_name AS identifier
FROM names_org this
JOIN names_org prev ON prev.name_id = this.parent_id
JOIN names_org abs ON abs.name_id = this.canon_id
;
SELECT * FROM triple_view;
Result of this view:
        name_before        |        name_after         |    identifier
---------------------------+---------------------------+------------------
 Misti Gulick              | Misti Gulick Thibodeaux   | Misti Gulick
 Faye Leaton               | Faye Leaton Hemby         | Faye Leaton
 Arden Peck                | Arden Peck Mroz           | Arden Peck
 Carlton Kingsley          | Carlton Kingsley Mcelveen | Carlton Kingsley
 Dolly Verhey              | Dolly Verhey Irish        | Dolly Verhey
 Gaynell Pasquale          | Gaynell Pasquale Ayala    | Gaynell Pasquale
 Misti Gulick Thibodeaux   | Misti Thibodeaux          | Misti Gulick
 Faye Leaton Hemby         | Faye Hemby                | Faye Leaton
 Arden Peck Mroz           | Arden Mroz                | Arden Peck
 Carlton Kingsley Mcelveen | Carlton Mcelveen          | Carlton Kingsley
 Dolly Verhey Irish        | Dolly Irish               | Dolly Verhey
 Gaynell Pasquale Ayala    | Gaynell Ayala             | Gaynell Pasquale
 Misti Thibodeaux          | Misti Trey Thibodeaux     | Misti Gulick
 Faye Hemby                | Faye Barrett Hemby        | Faye Leaton
 Arden Mroz                | Arden Justin Mroz         | Arden Peck
 Carlton Mcelveen          | Carlton Tameka Mcelveen   | Carlton Kingsley
 Dolly Irish               | Dolly Jeremiah Irish      | Dolly Verhey
 Gaynell Ayala             | Gaynell Cherry Ayala      | Gaynell Pasquale
(18 rows)