Parse name components from the form Last,First,Middle,Suffix
【Posted】: 2021-05-06 19:40:31
【Question】:
Right now I have a full-name column whose values have no standardized form. The form usually follows Last,First Middle,Suffix, but not every row is that clean. Some example forms include:
Last,First Middle,Suffix
Last,First Middle Suffix
Last,First,Suffix
Last,First Suffix
Last,First Middle
Last,First
The desired result is to have each component in its own column, with NULL for components that do not exist. Below are some examples of the expected results.
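+-------------------------+-----------+------------+----------+--------+
| FULLNAME                | FIRSTNAME | MIDDLENAME | LASTNAME | SUFFIX |
+-------------------------+-----------+------------+----------+--------+
| Johnson,John Johnny,Jr. | John      | Johnny     | Johnson  | Jr     |
| Anderson,Andrew A, Sr.  | Andrew    | A          | Anderson | Sr     |
| Smith,Smitty Jr.        | Smitty    | NULL       | Smith    | Jr     |
| Abegnale,Frank          | Frank     | NULL       | Abegnale | NULL   |
| Henry,King III          | King      | NULL       | Henry    | III    |
| Garcia,Jerome John      | Jerome    | John       | Garcia   | NULL   |
+-------------------------+-----------+------------+----------+--------+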
My current solution looks like this:
SELECT
    FullName
    ,SUBSTRING(FullNM, 1, CHARINDEX(',', FullNM) - 1) AS LastName
    ,CASE
        WHEN LEN(SUBSTRING(FullNM, CHARINDEX(',',FullNM)+ 1,99)) - LEN(REPLACE(SUBSTRING(FullNM, CHARINDEX(',',FullNM)+ 1,99), ' ', '')) > 0
        THEN REPLACE(SUBSTRING(FullNM, LEN(FullNM) - CHARINDEX(' ', REVERSE(FullNM))+1, 99), '.', '')
        ELSE NULL
    END AS MiddleName
    ,CASE
        WHEN LEN(SUBSTRING(FullNM, CHARINDEX(',',FullNM) + 1, 99)) - LEN(REPLACE(SUBSTRING(FullNM, CHARINDEX(',',FullNM)+ 1, 99), ' ', '')) > 0
        THEN SUBSTRING(FullNM, CHARINDEX(',',FullNM) + 1, (LEN(SUBSTRING(FullNM, CHARINDEX(',',FullNM)+ 1, 99)) - LEN(SUBSTRING(FullNM, LEN(FullNM) - CHARINDEX(' ', REVERSE(FullNM)) + 1, 99))))
        ELSE SUBSTRING(FullNM,CHARINDEX(',',FullNM) + 1, 99)
    END AS FirstNM
FROM MyTable
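Unfortunately, I can't think of a good way to handle the suffix, especially when there is no middle name. With the current code, a suffix, if present, ends up in MiddleNM.
Any help or suggestions are greatly appreciated!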
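【Comments】:
Henry,King III and Garcia,Jerome John are syntactically identical. You need some additional source of information to tell apart the outputs you want for them.
Do you have a table of suffixes?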
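To illustrate the point made in these comments, here is a minimal T-SQL sketch of how a suffix lookup can separate Henry,King III from Garcia,Jerome John without XML. It is not taken from either answer: the table variables @Names and @Suffixes, the contents of the suffix list, and all column aliases are assumptions made for the example. The sketch takes everything before the first comma as the last name, turns the remaining comma into a space, treats the first token as the first name, and treats the final token as a suffix only when it appears in the lookup (after dropping any period). Searching in the string with a space appended means CHARINDEX never returns 0, which avoids the "Invalid length parameter passed to the LEFT or SUBSTRING function" error.

```sql
-- Hypothetical sample data; @Names and @Suffixes are assumed names for this sketch.
DECLARE @Names TABLE (FullName VARCHAR(100));
INSERT INTO @Names (FullName) VALUES
    ('Johnson,John Johnny,Jr.'),
    ('Anderson,Andrew A, Sr.'),
    ('Smith,Smitty Jr.'),
    ('Abegnale,Frank'),
    ('Henry,King III'),
    ('Garcia,Jerome John');

-- Assumed suffix list, stored without periods; extend it to match the real data.
DECLARE @Suffixes TABLE (Suffix VARCHAR(10) PRIMARY KEY);
INSERT INTO @Suffixes (Suffix) VALUES ('Jr'), ('Sr'), ('II'), ('III'), ('IV');

SELECT n.FullName
    , NULLIF(f.FirstName, '') AS FirstName
    -- Middle name: the tail minus the suffix token (if any); empty becomes NULL.
    , NULLIF(RTRIM(CASE WHEN s.Suffix IS NULL THEN t.Tail
                        ELSE LEFT(t.Tail, LEN(t.Tail) - LEN(l.LastToken)) END), '') AS MiddleName
    -- NULLIF makes LastName NULL (instead of raising an error) when there is no comma.
    , LEFT(n.FullName, NULLIF(c.Comma1, 0) - 1) AS LastName
    , s.Suffix
FROM @Names AS n
CROSS APPLY (SELECT CHARINDEX(',', n.FullName) AS Comma1) AS c
-- Everything after the first comma, with a remaining comma replaced by one space.
CROSS APPLY (SELECT REPLACE(REPLACE(LTRIM(RTRIM(
                 SUBSTRING(n.FullName, c.Comma1 + 1, LEN(n.FullName)))), ', ', ','), ',', ' ') AS Rest) AS r
-- Position of the first space; the appended space guarantees a result of at least 1.
CROSS APPLY (SELECT CHARINDEX(' ', r.Rest + ' ') AS Space1) AS p
CROSS APPLY (SELECT LEFT(r.Rest, p.Space1 - 1) AS FirstName) AS f
-- Whatever follows the first name: middle name and/or suffix, possibly empty.
CROSS APPLY (SELECT LTRIM(SUBSTRING(r.Rest, p.Space1 + 1, LEN(r.Rest))) AS Tail) AS t
-- Final token of the tail, found by searching the reversed string.
CROSS APPLY (SELECT RIGHT(t.Tail, CHARINDEX(' ', REVERSE(t.Tail) + ' ') - 1) AS LastToken) AS l
LEFT JOIN @Suffixes AS s
    ON s.Suffix = REPLACE(l.LastToken, '.', '');
```

For the six sample rows this is intended to produce the expected table from the question, with suffixes returned without the period. The limitation the comments describe still applies: a middle name that happens to match an entry in the suffix list would be classified as a suffix, and whether the match is case-sensitive depends on the collation in use.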
【Answer 1】:
Please try the following solution. It uses XQuery to tokenize the FullName column.
As Gordon said, you may need to extend the list of legitimate suffixes.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, FullName VARCHAR(100));
INSERT INTO @tbl (FullName) VALUES
('Johnson,John Johnny,Jr.'),
('Anderson,Andrew A, Sr.'),
('Smith,Smitty Jr.'),
('Abegnale,Frank'),
('Henry,King III'),
('Garcia,Jerome John');
DECLARE @suffix TABLE (suffix VARCHAR(10));
INSERT INTO @suffix (suffix) VALUES
('Jr.'), ('III'), ('Sr.');
-- DDL and sample data population, end
DECLARE @separator CHAR(1) = ',';
;WITH rs AS
(
    SELECT *
        , TRY_CAST('<root><x><![CDATA[' +
            REPLACE(FullName + SPACE(1) COLLATE Czech_BIN2, @separator, ']]></x><x><![CDATA[') +
            ']]></x></root>' AS XML) AS xmldata
    FROM @tbl
), cte AS
(
    SELECT rs.*, x.pos, x.size
        , LEFT(c.value('(x[2]/text())[1]', 'VARCHAR(30)'), pos - 1) AS FirstName
        , c.value('(x[1]/text())[1]', 'VARCHAR(30)') AS LastName
        , TRIM(RIGHT(c.value('(x[2]/text())[1]', 'VARCHAR(30)'), IIF((size - pos) < 0, 0, size-pos+1))) AS MiddleName
        , TRIM(COALESCE(c.value('(x[3]/text())[1]', 'VARCHAR(30)')
            , RIGHT(c.value('(x[2]/text())[1]', 'VARCHAR(30)'), IIF((size - pos) < 0, 0, size-pos+1)))) AS Suffix
    FROM rs CROSS APPLY xmldata.nodes('/root') AS t(c)
        CROSS APPLY (SELECT CHARINDEX(SPACE(1), c.value('(x[2]/text())[1]', 'VARCHAR(30)'))
            , LEN(c.value('(x[2]/text())[1]', 'VARCHAR(30)'))) AS x(pos, size)
)
SELECT ID, FullName, cte.FirstName
    , IIF(MiddleName IN (SELECT Suffix FROM @suffix), '', cte.MiddleName) AS MiddleName
    , cte.LastName
    , IIF(cte.Suffix NOT IN (SELECT Suffix FROM @suffix), '', cte.Suffix) AS Suffix
FROM cte;
Output
+----+-------------------------+-----------+------------+----------+--------+
| ID | FullName                | FirstName | MiddleName | LastName | Suffix |
+----+-------------------------+-----------+------------+----------+--------+
|  1 | Johnson,John Johnny,Jr. | John      | Johnny     | Johnson  | Jr.    |
|  2 | Anderson,Andrew A, Sr.  | Andrew    | A          | Anderson | Sr.    |
|  3 | Smith,Smitty Jr.        | Smitty    |            | Smith    | Jr.    |
|  4 | Abegnale,Frank          | Frank     |            | Abegnale |        |
|  5 | Henry,King III          | King      |            | Henry    | III    |
|  6 | Garcia,Jerome John      | Jerome    | John       | Garcia   |        |
+----+-------------------------+-----------+------------+----------+--------+
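【Comments】:
Thanks for your answer. I'm getting the error "Invalid length parameter passed to the LEFT or SUBSTRING function." I understand I could introduce a fake "space". Do you have a solution that doesn't use XML? I'd like this to be easier to understand, especially for people who may not have much SQL experience.
@EthanT, (1) Please edit your original question and add a DDL and sample data population section. (2) I prefer to use XML and XQuery for simple string tokenization.
【Answer 2】:
A bit of a headache and one SQL Server bug later, here is one possible solution you can try.
It is somewhat verbose and, as already suggested, it does require you to know the possible suffixes in advance.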
with
num as (select top(30) Row_Number() over(order by (select null)) n from sys.messages),
parts as (
    select top(1000) n.Fullname, w.n, /*sql server optimizer issue workaround*/
        IsNull(Iif( CharIndex(',',name1)=0 and CharIndex(' ',name1)=0 ,name1,null),
               Iif( CharIndex(',',name2)=0 and CharIndex(' ',name2)=0 ,name2,null)) part
    from Names n
    cross apply (
        select num.n, Substring(n.Fullname, num.n, CharIndex(',', n.Fullname + ',', num.n) - num.n) name1,
            Substring(n.Fullname, num.n, CharIndex(' ', n.Fullname + ' ', num.n) - num.n) name2
        from num
        where num.n<DataLength(n.Fullname) and Substring(',' + n.Fullname, num.n, 1) in (',', ' ')
    )w
)
select Fullname,
    Max(FirstName) firstName,
    Max(Iif(MiddleName=FirstName or MiddleName=Lastname,null,MiddleName))MiddleName,
    Max(Lastname) LastName,
    Max(Suffix) Suffix
from (
    select Fullname,
        case when Lag(n) over(partition by Fullname order by n)=Min(n) over(partition by Fullname) then part end FirstName,
        case when Lead(n) over(partition by Fullname order by n)=Max(n) over(partition by Fullname) and Suffix is null
            or n=Max(n) over(partition by Fullname) and Suffix is null then part end MiddleName,
        case when n=Min(n) over(partition by Fullname) then part end Lastname,
        case when n=Max(n) over(partition by Fullname) and Suffix is not null then part end Suffix
    from parts p
    left join Suffixes s on s.Suffix=p.part
    where p.part !=''
)x
group by Fullname
order by Fullname
See this working fiddle example.