减少多值连接 SQL 查询中的重复记录
Posted
技术标签:
【中文标题】减少多值连接 SQL 查询中的重复记录【英文标题】:Reduce Duplicate Records In Multi-Value Join SQL Query 【发布时间】:2021-04-07 21:25:36 【问题描述】:背景故事: 我创建了一堆分析客户数据的存储过程。我正在查看供应商列表并尝试找出可能的重复项。它工作得很好,但每条记录都有 2 个可能的地址,当在两个地址中都找到匹配项时,我会得到重复的结果。理想情况下,我只需要记录在结果中出现一次。
过程:
我创建了一个“干净”版本的地址,我在其中删除了特殊字符并标准化为 USPS 标准。这有助于我匹配 West v W v W. 或 PO Box v P.O. Box v P O Box 等。然后,我从两个地址([cleanAddress]
和 [cleanRemit_Address]
)中获取所有不同的地址值并放入主列表。然后,我使用HAVING COUNT(*) > 1
与源表进行比较,以确定哪些地址出现了多次。最后,我将多次出现的最终地址列表与源数据结合起来输出。
问题:
如果您查看底部附近的结果,您会发现我有 2 组几乎相同的骗子,只是地址略有不同。 Address
和 Remit_Address
本质上是相同的,因此它会在“SouthWestern Medical”和“NERO CO”的 [cleanAddress]
和 [cleanRemit_Address]
值上找到匹配项,因此这两组骗子在列表中出现了两次一次(请参阅底部的所需结果)。
我需要匹配 [cleanAddress]
或 [cleanRemit_Address]
但我不知道如何限制每条记录在结果中出现一次。
查询:
--SQL (Address List): Combines all addresses for a master list of all addresses in the table
SELECT * INTO [address_list] FROM (
SELECT DISTINCT [NewAdd] FROM
(
SELECT DISTINCT [cleanAddress] AS [NewAdd]
FROM [sample_data]
WHERE
( [cleanAddress] IS NOT NULL AND [cleanAddress] <> '' ) AND
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [cleanAddress]
UNION
SELECT DISTINCT [cleanRemit_Address] AS [NewAdd]
FROM [sample_data]
WHERE
( [cleanRemit_Address] IS NOT NULL AND [cleanRemit_Address] <> '' ) AND
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [cleanRemit_Address]
) q1
) q2
ORDER BY
[NewAdd]
--SQL (Address Dupes): Determines which addresses appear in the data more than once
SELECT * INTO [dupe_addresses] FROM (
SELECT [NewAdd]
FROM [address_list] n
LEFT JOIN [sample_data] pv ON
(
( n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) ) OR
( n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
)
WHERE
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
GROUP BY [NewAdd]
HAVING COUNT(*) > 1
) q1
ORDER BY [NewAdd]
--SQL (Address Query): Outputs the information of the matched addresses
SELECT
'Address Match' AS [Reason],
pv.[Supplier_No],
pv.[Name],
pv.[Address],
pv.[City],
pv.[State],
pv.[Zip],
pv.[Country],
pv.[Remit_Address],
pv.[Remit_City],
pv.[Remit_State],
pv.[Remit_Zip],
pv.[Remit_Country]
FROM
[dupe_addresses] n
LEFT JOIN [sample_data] pv
ON (
(n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ) )
OR
(n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL ) )
)
WHERE
( [Supplier_No] IS NOT NULL AND [Supplier_No] <> '' )
样本数据:
CREATE TABLE [sample_data] (
[Supplier_No] varchar(255),
[Name] varchar(255),
[Address] varchar(255),
[City] varchar(255),
[State] varchar(255),
[Zip] varchar(255),
[Country] varchar(255),
[Remit_Address] varchar(255),
[Remit_City] varchar(255),
[Remit_State] varchar(255),
[Remit_Zip] varchar(255),
[Remit_Country] varchar(255),
[cleanAddress] varchar(255),
[cleanRemit_Address] varchar(255),
CONSTRAINT [suppliers_pk] PRIMARY KEY ([Supplier_No])
)
INSERT INTO [sample_data] VALUES
('1039104','Geez Companies','100 Aero Hudson Rd','Streetsboro','OH','44241','','100 Aero Hudson Road','Streetsboro','OH','44241','USA','100 Aero Hudson Rd','100 Aero Hudson Rd'),
('1218409','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 W Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243789','SouthWestern Medical','100 West Balor Ave','Osceola','AR','72370','USA','SouthWestern Medical100 West Balor Ave','Osceola','AR','72370','USA','100 W Balor Ave','SouthWestern Medical100 W Balor Ave'),
('1243636','SIRI SYSTEMS','15 BRAD ROAD','WEXFORD','PA','15090','','','','','','','15 BRAD RD',''),
('1152482','FLEETWOOD MACK','22 WINDSOCK CT','ADDISON','IL','60101','','PO BOX 951','CHICAGO','IL','60694-5124','','22 WINDSOCK CT','PO BOX 951'),
('1224483','Aerospace Junction','211500 Communicate Ave','Mingo Junction','OH','43939','USA','P O Box 99','Mingo Junction','OH','43939','USA','211500 Communicate Ave','PO Box 99'),
('1243397','Squeezy Felt','SCHREIBER DIST','NEW KENSINGTON','PA','15068','','','','','','','SCHREIBER DIST',''),
('1230895','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1243782','NERO CO','28 North US State Highway 99','Osceola','AR','72370','USA','PO Box 204','Cape Girardeau','MO','63702-2045','USA','28 N US State Hwy 99','PO Box 204'),
('1135880','RICHARD PRYOR SEMINARS','PO BOX 2194','KANSAS CITY','MO','64121-9468','USA','RICHARD PRYOR SEMINARS P O BOX 2194','KANSAS CITY','MO','64121-9468','USA','PO BOX 2194','RICHARD PRYOR SEMINARS PO BOX 2194'),
('1241328','INFINITY AND BEYOND','P.O. BOX 169','GASTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1259522','MILES STONES','PO BOX 169','GASSTONIA','NC','28053-0269','USA','','','','','','PO BOX 169',''),
('1255253','AT&T','PO Box 50221','Carol Stream','IL','60197','USA','','','','','','PO Box 50221',''),
('1135513','AT&T','PO Box 50221','Carol Stream','IL','60197-5080','USA','','','','','','PO Box 50221',''),
('1119161','Machine Co, Inc','3306 N Thorne Blvd','Chattanooga','TN','','','PO BOX 5301','CHATTANOOGA','TN','37406','USA','3306 N Thorne Blvd','PO BOX 5301'),
('1176587','Topsy Turvy','365 Welmington Road','Chicago','IL','60606','USA','','','','','','365 Welmington Rd',''),
('2156671','Topsy Turvvy, Inc.','P.O. Box 55217','Columbus','OH','43081','','365 Welmington Road','Chicago','IL','60606','USA','','365 Welmington Rd')
当前结果:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA Southern Lawn Care1004 W Hale Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
期望的结果:
Reason Supplier_No Name Address City State Zip Country Remit_Address Remit_City Remit_State Remit_Zip Remit_Country
Address Match 1218409 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 W Balor Ave Osceola AR 72370 USA
Address Match 1243789 SouthWestern Medical 100 West Balor Ave Osceola AR 72370 USA SouthWestern Medical100 West Balor Ave Osceola AR 72370 USA
Address Match 1230895 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1243782 NERO CO 28 North US State Highway 99 Osceola AR 72370 USA PO Box 204 Cape Girardeau MO 63702-2045 USA
Address Match 1176587 Topsy Turvy 365 Welmington Road Chicago IL 60606 USA
Address Match 2156671 Topsy Turvvy, Inc. P.O. Box 55217 Columbus OH 43081 365 Welmington Road Chicago IL 60606 USA
Address Match 1241328 INFINITY AND BEYOND P.O. BOX 169 GASTONIA NC 28053-0269 USA
Address Match 1259522 MILES STONES PO BOX 169 GASSTONIA NC 28053-0269 USA
Address Match 1255253 AT&T PO Box 50221 Carol Stream IL 60197 USA
Address Match 1135513 AT&T PO Box 50221 Carol Stream IL 60197-5080 USA
【问题讨论】:
@DaleK 是的,Supplier_No 是唯一值 @DaleK 将 PK 添加到表中,以便更清晰 【参考方案1】:只需将row_number
每个供应商添加到最终结果集中并仅过滤掉第 1 行。
注意row_number
函数需要一个order by
子句,用于确定哪些您希望保留的重复行。更改它以适合您的情况。
WITH cte AS (
SELECT
'Address Match' AS [Reason],
pv.[Supplier_No],
pv.[Name],
pv.[Address],
pv.[City],
pv.[State],
pv.[Zip],
pv.[Country],
pv.[Remit_Address],
pv.[Remit_City],
pv.[Remit_State],
pv.[Remit_Zip],
pv.[Remit_Country]
, ROW_NUMBER() OVER (PARTITION BY pv.[Supplier_No] ORDER BY pv.[Name]) rn
FROM dupe_addresses n
LEFT JOIN sample_data pv
ON (
(n.[NewAdd] = pv.[cleanAddress] AND ( [Address] <> '' AND [Address] IS NOT NULL ))
OR (n.[NewAdd] = pv.[cleanRemit_Address] AND ( [Remit_Address] <> '' AND [Remit_Address] IS NOT NULL))
)
WHERE ([Supplier_No] IS NOT NULL AND [Supplier_No] <> '')
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY Supplier_No, [Name];
【讨论】:
我喜欢它,但是它引入了一个差异问题。在我的结果中,骗子被分组在一起,因此很容易看到它们,但是使用这种方法,匹配的骗子记录可能在数百行之外,因此很难阅读。您不能真正按[Address]
排序,因为它可以与[Remit_Address]
匹配
我不明白?我们正在删除受骗者...所以您不会有多个受骗者?正如我在答案中评论的那样,您应该 order by
任何对您有意义的逻辑 - 所以更改 order by
逻辑以满足您的需要。
没有好的字段可以排序,最好的排序方法是使用我的查询匹配它们的顺序。当它在[Address]
上找到匹配项时,将两个“SouthWestern Medical”记录(具有不同的供应商 ID)组合在一起。查看输出并查看重复数据非常容易。我需要一种方法将匹配的(重复)组合在一起,以便输出类似于我在 Desired Output 中输入的内容
给你一个想法,我正在使用的数据集有 122k 条记录。如果我删除重复的条目,他们需要解决超过 25k 的重复项。我会说可能 60% 的人出于某种原因需要继续被骗,因此具有这种可读性对于他们及时审查至关重要。
您的示例查询根本没有排序依据,因此结果的顺序完全无法预测。你可以看到可供订购的信息,你可以使用任何你喜欢的逻辑,但你必须决定这个逻辑到底是什么。以上是关于减少多值连接 SQL 查询中的重复记录的主要内容,如果未能解决你的问题,请参考以下文章