Parse name components from the form Last,First,Middle,Suffix

Posted: 2021-05-06 19:40:31

Question:

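Right now I have a full-name column in which the names do not follow a standardized form. The form usually follows Last,First Middle,Suffix, but not every row is that vanilla. Some example forms include:

Last,First Middle,Suffix
Last,First Middle Suffix
Last,First,Suffix
Last,First Suffix
Last,First Middle
Last,First

The desired result is to have each component in its own column, with NULL for any component that is not present. Here are some examples of the expected results.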

FULLNAME                    FIRSTNAME  MIDDLENAME LASTNAME  SUFFIX
Johnson,John Johnny,Jr.     John       Johnny     Johnson   Jr
Anderson,Andrew A, Sr.      Andrew     A          Anderson  Sr
Smith,Smitty Jr.            Smitty     NULL       Smith     Jr
Abegnale,Frank              Frank      NULL       Abegnale  NULL
Henry,King III              King       NULL       Henry     III
Garcia,Jerome John          Jerome     John       Garcia    NULL
     

My current solution looks like this:

SELECT
FullName
-- Everything before the first comma
,SUBSTRING(FullName, 1, CHARINDEX(',', FullName) - 1) AS LastName
-- If the text after the first comma contains a space, take the last space-delimited word
,CASE
      WHEN LEN(SUBSTRING(FullName, CHARINDEX(',',FullName)+ 1,99)) - LEN(REPLACE(SUBSTRING(FullName, CHARINDEX(',',FullName)+ 1,99), ' ', '')) > 0
      THEN REPLACE(SUBSTRING(FullName, LEN(FullName) - CHARINDEX(' ', REVERSE(FullName))+1, 99), '.', '')
      ELSE NULL
    END AS MiddleName
-- The text after the first comma, minus that last word when there is one
,CASE
      WHEN LEN(SUBSTRING(FullName, CHARINDEX(',',FullName) + 1, 99)) - LEN(REPLACE(SUBSTRING(FullName, CHARINDEX(',',FullName)+ 1, 99), ' ', '')) > 0
      THEN SUBSTRING(FullName, CHARINDEX(',',FullName) + 1, (LEN(SUBSTRING(FullName, CHARINDEX(',',FullName)+ 1, 99)) - LEN(SUBSTRING(FullName, LEN(FullName) - CHARINDEX(' ', REVERSE(FullName)) + 1, 99))))
      ELSE SUBSTRING(FullName,CHARINDEX(',',FullName) + 1, 99)
    END AS FirstNM
FROM MyTable

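Unfortunately, I just can't figure out a good way to handle the suffix, especially when there is no middle name. With the current code, a suffix, if present, ends up in MiddleName.

Any help or advice is greatly appreciated!

Comments:

Henry,King III and Garcia,Jerome John are syntactically identical. You need some additional source of information to tell apart the outputs you want for them.

Do you have a table of suffixes?

Answer 1: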

Please try the following solution. It uses XQuery to tokenize the FullName column.

As Gordon said, you may need to extend the list of legitimate suffixes.
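The tokenization step is easier to follow in isolation. The following is a minimal sketch, not part of the original answer: the variable @name and the aliases Token1 to Token3 are illustrative names. It wraps each comma-separated piece in an <x> element and then reads the pieces back by position.

```sql
-- Sketch only: @name and the Token1..Token3 aliases are illustrative.
DECLARE @name VARCHAR(100) = 'Johnson,John Johnny,Jr.';

-- Turn "a,b,c" into <root><x>a</x><x>b</x><x>c</x></root>.
-- CDATA shields characters such as & or < that would otherwise break the XML.
DECLARE @xml XML = TRY_CAST(
      '<root><x><![CDATA['
    + REPLACE(@name, ',', ']]></x><x><![CDATA[')
    + ']]></x></root>' AS XML);

-- Read the tokens back by position; a missing token comes back as NULL.
SELECT @xml.value('(/root/x[1]/text())[1]', 'VARCHAR(30)') AS Token1
     , @xml.value('(/root/x[2]/text())[1]', 'VARCHAR(30)') AS Token2
     , @xml.value('(/root/x[3]/text())[1]', 'VARCHAR(30)') AS Token3;
```

For this input the three tokens should be Johnson, John Johnny and Jr., which is why the second token still has to be split on its first space to separate the first name from the middle name.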

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, FullName VARCHAR(100));
INSERT INTO @tbl (FullName) VALUES
('Johnson,John Johnny,Jr.'),
('Anderson,Andrew A, Sr.'),    
('Smith,Smitty Jr.'),  
('Abegnale,Frank'),        
('Henry,King III'),          
('Garcia,Jerome John');

DECLARE @suffix TABLE (suffix VARCHAR(10));
INSERT INTO @suffix (suffix) VALUES
('Jr.'), ('III'), ('Sr.');
-- DDL and sample data population, end

DECLARE @separator CHAR(1) = ',';

;WITH rs AS
(
   SELECT * 
      , TRY_CAST('<root><x><![CDATA[' + 
            REPLACE(FullName + SPACE(1) COLLATE Czech_BIN2, @separator, ']]></x><x><![CDATA[') + 
         ']]></x></root>' AS XML) AS xmldata
   FROM @tbl
), cte AS
(
SELECT rs.*, x.pos, x.size
    , LEFT(c.value('(x[2]/text())[1]', 'VARCHAR(30)'), pos - 1) AS FirstName
    , c.value('(x[1]/text())[1]', 'VARCHAR(30)') AS LastName
    , TRIM(RIGHT(c.value('(x[2]/text())[1]', 'VARCHAR(30)'), IIF((size - pos) < 0, 0, size-pos+1))) AS MiddleName
    , TRIM(COALESCE(c.value('(x[3]/text())[1]', 'VARCHAR(30)')
        , RIGHT(c.value('(x[2]/text())[1]', 'VARCHAR(30)'), IIF((size - pos) < 0, 0, size-pos+1)))) AS Suffix
FROM rs CROSS APPLY xmldata.nodes('/root') AS t(c)
    CROSS APPLY (SELECT CHARINDEX(SPACE(1), c.value('(x[2]/text())[1]', 'VARCHAR(30)'))
        , LEN(c.value('(x[2]/text())[1]', 'VARCHAR(30)'))) AS x(pos, size)
)
SELECT ID, FullName, cte.FirstName
    , IIF(cte.MiddleName IN (SELECT Suffix FROM @suffix), '', cte.MiddleName) AS MiddleName
    , cte.LastName
    , IIF(cte.Suffix NOT IN (SELECT Suffix FROM @suffix), '', cte.Suffix) AS Suffix
FROM cte;

Output

+----+-------------------------+-----------+------------+----------+--------+
| ID |        FullName         | FirstName | MiddleName | LastName | Suffix |
+----+-------------------------+-----------+------------+----------+--------+
|  1 | Johnson,John Johnny,Jr. | John      | Johnny     | Johnson  | Jr.    |
|  2 | Anderson,Andrew A, Sr.  | Andrew    | A          | Anderson | Sr.    |
|  3 | Smith,Smitty Jr.        | Smitty    |            | Smith    | Jr.    |
|  4 | Abegnale,Frank          | Frank     |            | Abegnale |        |
|  5 | Henry,King III          | King      |            | Henry    | III    |
|  6 | Garcia,Jerome John      | Jerome    | John       | Garcia   |        |
+----+-------------------------+-----------+------------+----------+--------+

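Comments:

Thanks for the answer. I'm getting the error "Invalid length parameter passed to the LEFT or SUBSTRING function". I know I could introduce a fake "space". Do you have any solution that doesn't use XML? I'd like this to be easier to understand, especially for people who may not have much SQL experience.

@EthanT, (1) please edit your original question and add a DDL and sample data population section. (2) I prefer XML and XQuery for simple string tokenization.

Answer 2:

A bit of a headache and a SQL Server bug later, here is one possible solution you can try.

It's a bit verbose and, as suggested, it does require you to know in advance what the possible suffixes are.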
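The query reads from two tables, Names and Suffixes, that the answer does not define. A possible setup is sketched here; only the table and column names are taken from the query, while the data types, the primary key and the sample rows (copied from the question) are assumptions.

```sql
-- Hypothetical schema: only the table and column names come from the query.
CREATE TABLE Names (Fullname VARCHAR(100) NOT NULL);
CREATE TABLE Suffixes (Suffix VARCHAR(10) NOT NULL PRIMARY KEY);

-- Sample rows copied from the question.
INSERT INTO Names (Fullname) VALUES
('Johnson,John Johnny,Jr.'),
('Anderson,Andrew A, Sr.'),
('Smith,Smitty Jr.'),
('Abegnale,Frank'),
('Henry,King III'),
('Garcia,Jerome John');

-- The query joins on the whole token (s.Suffix = p.part), so each suffix
-- is stored as it appears in the data, including the trailing period.
INSERT INTO Suffixes (Suffix) VALUES ('Jr.'), ('Sr.'), ('III');
```

Any suffix missing from this list would be treated as an ordinary name part, so the list has to cover every suffix present in the real data.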

with 
    num as (select top(30) Row_Number() over(order by (select null)) n from sys.messages),
    parts as (
        select top(1000) n.Fullname, w.n, /*sql server optimizer issue workaround*/
            IsNull(Iif( CharIndex(',',name1)=0 and CharIndex(' ',name1)=0 ,name1,null),
            Iif( CharIndex(',',name2)=0 and CharIndex(' ',name2)=0 ,name2,null)) part
        from Names n
        cross apply (
            select num.n, Substring(n.Fullname, num.n, CharIndex(',', n.Fullname + ',', num.n) - num.n) name1, 
                Substring(n.Fullname, num.n, CharIndex(' ', n.Fullname + ' ', num.n) - num.n) name2
            from num
            where num.n<DataLength(n.Fullname) and Substring(',' + n.Fullname, num.n, 1) in (',', ' ')
        )w
)
select Fullname, 
    Max(FirstName) firstName, 
    Max(Iif(MiddleName=FirstName or MiddleName=Lastname,null,MiddleName))MiddleName, 
    Max(Lastname) LastName, 
    Max(Suffix) Suffix
from (
    select Fullname,
        case when Lag(n) over(partition by Fullname order by n)=Min(n) over(partition by Fullname) then part end FirstName,
        case when Lead(n) over(partition by Fullname order by n)=Max(n) over(partition by Fullname) and Suffix is null 
            or n=Max(n) over(partition by Fullname) and Suffix is null then part end MiddleName,
        case when n=Min(n) over(partition by Fullname) then part end Lastname,
        case when n=Max(n) over(partition by Fullname) and Suffix is not null then part end Suffix
    from parts p
    left join Suffixes s on s.Suffix=p.part
    where p.part !=''
)x
group by Fullname
order by Fullname

See this working fiddle example.
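For readers who want something without XML or a numbers table, a plain string-function version is also possible. The sketch below is not taken from either answer. It assumes the suffix is always the last word of the name, that periods can be discarded, and that a suffix lookup table exists (here the table variable @suffix, with an assumed list of values). Appending a sentinel comma or space before each CHARINDEX call keeps the positions above zero, which avoids the "Invalid length parameter" error on rows that lack a delimiter.

```sql
-- Sample data copied from the question; the suffix list is an assumption and needs extending.
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, FullName VARCHAR(100));
INSERT INTO @tbl (FullName) VALUES
('Johnson,John Johnny,Jr.'), ('Anderson,Andrew A, Sr.'), ('Smith,Smitty Jr.'),
('Abegnale,Frank'), ('Henry,King III'), ('Garcia,Jerome John');

DECLARE @suffix TABLE (suffix VARCHAR(10) PRIMARY KEY);
INSERT INTO @suffix (suffix) VALUES ('Jr'), ('Sr'), ('II'), ('III');

SELECT t.ID, t.FullName
     , NULLIF(f.FirstName, '') AS FirstName
     , NULLIF(LTRIM(SUBSTRING(g.Given, e.sp + 1, 100)), '') AS MiddleName
     , b.LastName
     , s.suffix AS Suffix
FROM @tbl AS t
-- Position of the first comma; the appended comma keeps it above zero.
CROSS APPLY (SELECT CHARINDEX(',', t.FullName + ',') AS c1) AS a
-- LastName is everything before that comma. Rest is the remainder with
-- commas turned into spaces and periods dropped.
CROSS APPLY (SELECT LEFT(t.FullName, a.c1 - 1) AS LastName
                  , LTRIM(RTRIM(REPLACE(REPLACE(
                        SUBSTRING(t.FullName, a.c1 + 1, 100), ',', ' '), '.', ''))) AS Rest) AS b
-- Distance of the last space from the end of Rest (0 when Rest is one word).
CROSS APPLY (SELECT CHARINDEX(' ', REVERSE(b.Rest)) AS lastSp) AS c
-- The last word of Rest is the only suffix candidate.
CROSS APPLY (SELECT CASE WHEN c.lastSp > 0
                         THEN RIGHT(b.Rest, c.lastSp - 1) END AS LastWord) AS d
LEFT JOIN @suffix AS s ON s.suffix = d.LastWord
-- Strip the last word only when it matched a known suffix.
CROSS APPLY (SELECT CASE WHEN s.suffix IS NOT NULL
                         THEN RTRIM(LEFT(b.Rest, LEN(b.Rest) - c.lastSp))
                         ELSE b.Rest END AS Given) AS g
-- First space in the given names; the appended space keeps it above zero.
CROSS APPLY (SELECT CHARINDEX(' ', g.Given + ' ') AS sp) AS e
CROSS APPLY (SELECT LEFT(g.Given, e.sp - 1) AS FirstName) AS f
ORDER BY t.ID;
```

Each CROSS APPLY names one intermediate value, so the steps can be read top to bottom. The limitation raised in the comments still applies: a middle name that happens to match an entry in the suffix list would be reported as a suffix.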
