TSQL: Parsing strings with various characters
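【Posted】: 2019-08-16 16:10:59
【Question】: I have a table with a filename column, and each vendor names its files differently. Each filename contains a last name, a first name, and sometimes a middle name, separated by a variety of characters: comma plus space, comma without a space, a space between words, no space between words, one underscore, two underscores, and so on.
Is there a good way to extract the desired result? (This is a one-time data conversion, so it doesn't have to be pretty.)
The sample code below shows what I tried, using various SUBSTRING/CHARINDEX combinations.
Filename examples: (note the commas, spaces, missing spaces, underscores, and double underscores)
Desired result:
Sample code/test data (in a temp table)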
IF OBJECT_ID('tempdb..#dob') IS NOT NULL DROP TABLE #dob
CREATE TABLE #dob (
FILENAME VARCHAR(MAX)
,StudentID INT
,FullName VARCHAR(500)
,LastName VARCHAR(500)
,FirstName VARCHAR(500)
,MiddleName VARCHAR(500)
)
INSERT INTO #dob
( FILENAME )
VALUES
('Last, First, Middle_DOB ID.pdf')
,('Denver, John C 11_23_1980_123456.pdf')
,('Denver John_11-23-1980, 1234567.pdf')
,('Denver,John,Clifford_ 01_22_1980_123456.pdf')
,('Denver, John, 11-23-1980, 1234567.pdf')
,('Denver, John__01_22_1980_123456.pdf')
--This is what I tried.
SELECT FILENAME
,fullname
,LastName
,FirstName
,MiddleName
,SUBSTRING(FileName,1, CHARINDEX(' ', FileName, (charindex(' ', FileName, 1))+2)) AS test1
,SUBSTRING(FileName,1, CHARINDEX('_', FileName, (charindex(' ', FileName, 1))+2)) AS test2
,SUBSTRING(FileName,1, CHARINDEX(',', FileName, (charindex(', ', FileName, 1))+1)) AS test3
,SUBSTRING(FileName,1, CHARINDEX(' ', FileName, (charindex('__', FileName, 1))+2)) AS test4
,SUBSTRING(FileName,1, CHARINDEX('__', FileName, (charindex(' ', FileName, 1))+2)) AS test5
FROM #dob
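【Comments】:
This is ugly, to say the least. Thankfully it's a one-time thing and not an ongoing nightmare. You'll have to make several passes at this, because the formats are all over the place.
【Answer 1】: This is a slippery slope, but if your real data is close to the sample, consider the following.
Example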
SELECT FILENAME
,LastName = Pos1
,FirstName = Pos2
,MiddleName = case when try_convert(int,left(Pos3,1)) is null then Pos3 else '' end
FROM #dob A
Cross Apply ( values ( replace(
replace(
replace(
replace(FileName,', ',',')
,' ,',',')
,' ',',')
,'_',',')
)
) B(CleanString)
Cross Apply [dbo].[tvf-Str-Parse-Row](CleanString,',') C
Returns
FILENAME                                      LastName  FirstName  MiddleName
Last, First, Middle_DOB ID.pdf                Last      First      Middle
Denver, John C 11_23_1980_123456.pdf          Denver    John       C
Denver John_11-23-1980, 1234567.pdf           Denver    John
Denver,John,Clifford_ 01_22_1980_123456.pdf   Denver    John       Clifford
Denver, John, 11-23-1980, 1234567.pdf         Denver    John
Denver, John__01_22_1980_123456.pdf           Denver    John
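To see why Pos3 ends up holding either a middle name or a number, it can help to look at the intermediate CleanString on its own. The following is a minimal sketch that reuses the same nested REPLACE chain on two of the sample filenames; the inline VALUES list and the alias v are only for illustration and are not part of the answer.

```sql
-- Sketch: show the normalized string that is handed to the split function.
-- The REPLACE chain is the one from the answer; "v" is an illustrative alias.
SELECT  v.FileName
       ,CleanString = replace(
                       replace(
                        replace(
                         replace(v.FileName, ', ', ',')   -- comma + space -> comma
                        , ' ,', ',')                      -- space + comma -> comma
                       , ' ', ',')                        -- remaining spaces -> comma
                      , '_', ',')                         -- underscores -> comma
FROM (VALUES ('Denver, John C 11_23_1980_123456.pdf')
            ,('Denver John_11-23-1980, 1234567.pdf')) v(FileName);
-- Row 1 CleanString: Denver,John,C,11,23,1980,123456.pdf
-- Row 2 CleanString: Denver,John,11-23-1980,1234567.pdf
```

In the first row the third token is a letter, so it is kept as the middle name. In the second row the third token starts with a digit, so the try_convert(int, left(Pos3,1)) check returns a non-NULL value and MiddleName becomes an empty string.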
TVF (if interested)
CREATE FUNCTION [dbo].[tvf-Str-Parse-Row] (@String varchar(max),@Delimiter varchar(10))
Returns Table
As
Return (
Select Pos1 = ltrim(rtrim(xDim.value('/x[1]','varchar(max)')))
,Pos2 = ltrim(rtrim(xDim.value('/x[2]','varchar(max)')))
,Pos3 = ltrim(rtrim(xDim.value('/x[3]','varchar(max)')))
,Pos4 = ltrim(rtrim(xDim.value('/x[4]','varchar(max)')))
,Pos5 = ltrim(rtrim(xDim.value('/x[5]','varchar(max)')))
,Pos6 = ltrim(rtrim(xDim.value('/x[6]','varchar(max)')))
,Pos7 = ltrim(rtrim(xDim.value('/x[7]','varchar(max)')))
,Pos8 = ltrim(rtrim(xDim.value('/x[8]','varchar(max)')))
,Pos9 = ltrim(rtrim(xDim.value('/x[9]','varchar(max)')))
From ( values (cast('<x>' + replace((Select replace(@String,@Delimiter,'§§Split§§') as [*] For XML Path('')),'§§Split§§','</x><x>')+'</x>' as xml))) as A(xDim)
)
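The core of the function is the conversion of a delimited string into XML, followed by positional lookups. The sketch below reproduces that idea on a single hard-coded value, outside any function. The variable names are hypothetical. It also skips the inner FOR XML PATH('') step, which the TVF uses to entity-escape characters such as & and < before the markers are turned into tags, so this simplified form assumes the input contains no such characters.

```sql
-- Sketch: split one comma-delimited string by position using XML.
-- @CleanString and @x are illustrative names, not part of the answer.
DECLARE @CleanString varchar(max) = 'Denver,John,C,11,23,1980,123456.pdf';

-- Wrap every token in <x>...</x> by replacing the delimiter with a
-- closing tag followed by an opening tag.
DECLARE @x xml = CAST('<x>' + REPLACE(@CleanString, ',', '</x><x>') + '</x>' AS xml);

-- (/x)[n] selects the n-th <x> element; a missing position yields NULL.
SELECT  Pos1 = @x.value('(/x)[1]', 'varchar(max)')
       ,Pos2 = @x.value('(/x)[2]', 'varchar(max)')
       ,Pos3 = @x.value('(/x)[3]', 'varchar(max)')
       ,Pos9 = @x.value('(/x)[9]', 'varchar(max)');
```

Because each position becomes its own column, the caller can address tokens by ordinal (Pos1, Pos2, Pos3) without relying on row order.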
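【Comments】:
@JM1 Cross Apply B "cleans" the string; this may need adjusting over time.
@JM1 Cross Apply C calls the table-valued function. The TVF converts the string to XML and then produces 9 columns.
@JM1 The top SELECT just applies a little logic to determine whether there are 2 or 3 names.
@JM1 Happy to help. Full disclosure, I was a late adopter of XML. It is worth your time; also, if you are on 2016+, JSON :)
@JM1 This is XML, not JSON. JSON support was introduced in SQL Server 2016. To learn the XML part, I'd google "SQL Server XML tutorial" or start with mssqltips.com/sqlservertip/2889/basic-sql-server-xml-querying
【Answer 2】: Here is something I tried using PATINDEX; see if it helps.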
SELECT FILENAME
,fullname
,LastName
,FirstName
,MiddleName
,ISNULL(LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)-1),'')
+' '
+ISNULL(LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))-1),'')
+' '
+ISNULL(IIF(PATINDEX('%[, _]%',LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')))))>1
,IIF(PATINDEX('%[a-z]%',LEFT(LEFT(LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')))),PATINDEX('%[, _]%',LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')))))),1))=1,
LEFT(LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')))),PATINDEX('%[, _]%',LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')))))-1)
,NULL)
,NULL),'') FULLNAME
,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)-1) LASTNAME
,LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))-1) FIRSTNAME
,IIF(PATINDEX('%[, _]%',LEFT(LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')),PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)+PATINDEX('%[, _]%',LTRIM(REPLACE(FILENAME,LEFT(FILENAME,PATINDEX('%[, _]%',FILENAME)),'')))),'')))))>1
,IIF(PATINDEX('%[a-z

,NULL)
,NULL) MIDDLENAME
FROM #dob
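The nested expressions above repeat the same three steps: find the next delimiter, cut off the token in front of it, and skip any further delimiters. As a rough sketch of that same PATINDEX idea (not the answerer's exact logic), the steps can be named once each with CROSS APPLY. All aliases (s, p1, r0, r1, t3, and so on) are hypothetical, and the sample rows are inlined so the block runs without the temp table.

```sql
-- Sketch only: a step-by-step variant of the PATINDEX approach.
-- Aliases are illustrative; they do not appear in the original answer.
SELECT  s.FileName
       ,LastName   = LEFT(s.FileName, p1.pos - 1)
       ,FirstName  = LEFT(r1.rest, p2.pos - 1)
       -- keep the third token only if it starts with a letter
       ,MiddleName = CASE WHEN t3.token LIKE '[A-Za-z]%' THEN t3.token ELSE '' END
FROM (VALUES ('Last, First, Middle_DOB ID.pdf')
            ,('Denver, John C 11_23_1980_123456.pdf')
            ,('Denver John_11-23-1980, 1234567.pdf')
            ,('Denver, John__01_22_1980_123456.pdf')) s(FileName)
-- position of the first delimiter (comma, space, or underscore); NULL if none
CROSS APPLY (VALUES (NULLIF(PATINDEX('%[, _]%', s.FileName), 0))) p1(pos)
-- remove the last name together with the delimiter that follows it
CROSS APPLY (VALUES (STUFF(s.FileName, 1, p1.pos, ''))) r0(rest)
-- skip any additional leading delimiters (e.g. the space after a comma)
CROSS APPLY (VALUES (SUBSTRING(r0.rest, PATINDEX('%[^, _]%', r0.rest), LEN(r0.rest)))) r1(rest)
-- same three steps again for the first name
CROSS APPLY (VALUES (NULLIF(PATINDEX('%[, _]%', r1.rest), 0))) p2(pos)
CROSS APPLY (VALUES (STUFF(r1.rest, 1, p2.pos, ''))) r2(rest)
CROSS APPLY (VALUES (SUBSTRING(r2.rest, PATINDEX('%[^, _]%', r2.rest), LEN(r2.rest)))) r3(rest)
-- third token: a middle name or the start of the date
CROSS APPLY (VALUES (NULLIF(PATINDEX('%[, _]%', r3.rest), 0))) p3(pos)
CROSS APPLY (VALUES (LEFT(r3.rest, p3.pos - 1))) t3(token);
```

The NULLIF wrappers turn a "no delimiter found" result of 0 into NULL, so LEFT receives NULL instead of a negative length and the row yields NULL rather than raising an error. This assumes every filename contains at least one non-delimiter character after each name, as the sample data does.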
【Comments】:
Thank you for your time; this does produce the requested output.