SQL: Find duplicates across several fields (no unique ID)

Original title: SQL Find duplicate with several field (no unique ID) WORK AROUND
Posted: 2018-02-02 10:29:20

Question:

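I am trying to find duplicate vendors in a database using several fields from the vendor table and the vendor_addr table. The problem is that the more inner joins I add, the more potential results the query loses. There are no duplicate vendor IDs, but I want to find vendors that are potentially the same.

Here is my query so far: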

SELECT 
     o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,SAME_ADDRESS_NB
    ,SAME_POSTAL_NB
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
FROM VENDOR o

JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID

INNER JOIN (
    SELECT vndr_name_shrt_user, COUNT(*) AS SAME_USER_NUM
    FROM VENDOR 
    WHERE COUNTRY = 'CANADA'
    AND VENDOR_STATUS = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
) oc on o.vndr_name_shrt_user = oc.vndr_name_shrt_user

INNER JOIN ( SELECT VENDOR_NAME_SHORT, COUNT(*) AS SAME_SHORT_NAME
    FROM VENDOR 
    WHERE COUNTRY = 'CANADA'
    AND VENDOR_STATUS = 'A'
    GROUP BY VENDOR_NAME_SHORT
    HAVING COUNT(*) > 1
) oc on o.VENDOR_NAME_SHORT = oc.VENDOR_NAME_SHORT

INNER JOIN (SELECT POSTAL, COUNT(*) AS SAME_POSTAL_NB
    FROM vendor_addr 
    WHERE COUNTRY = 'CANADA'
    AND COUNTRY ='CANADA'
    AND POSTAL != ' '
    GROUP BY POSTAL
    HAVING COUNT(*) > 1
) oc on b.POSTAL = oc.POSTAL

INNER JOIN (SELECT ADDRESS1, COUNT(*) AS SAME_ADDRESS_NB
    FROM ps_vendor_addr 
    WHERE COUNTRY = 'CANADA'
    AND COUNTRY ='CANADA'
    AND ADDRESS1 != ' '
    GROUP BY ADDRESS1
    HAVING COUNT(*) > 1
) oc on b.ADDRESS1 = oc.ADDRESS1   
WHERE O.COUNTRY ='CANADA' 
    AND B.COUNTY = 'CANADA';

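Comments:

Why are you using inner joins? Use a LEFT OUTER JOIN wherever you do not want to lose data.

Please provide a minimal reproducible example, including the DDL statements for your tables, DML statements for some sample data, and the expected output for that data.

Thank you, sir. Working on it.

Answer 1: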

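Your joins look a bit odd, for more than one reason. First, you use inner joins, which eliminate every row except those that show all of the duplicate indicators, and that is not what you want. Second, you give every derived table the same alias, oc; that will not work, and you will not get far with it.

Instead, I suggest repeating the base query once per duplicate indicator, as follows (I removed the same_address_nb and same_postal_nb fields; you will see why):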

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND ...

For each of these duplicate indicators, add a nested query in place of the ellipsis shown above, like this (the example checks for duplicates in vndr_name_shrt_user):

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
    ,'SAME_USER_NUM' as duplicateFlag
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND o.vndr_name_shrt_user in 
(
    SELECT 
        vndr_name_shrt_user
    FROM VENDOR 
    WHERE COUNTRY = o.country
    AND VENDOR_STATUS = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
) 

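You can UNION ALL these queries together and then see all the duplicates.

As a side note, you check country = 'canada' twice in the last two derived tables.

Update: showing multiple duplicate flags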

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
    ,'SAME_USER_NUM' as duplicateFlag
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND o.vndr_name_shrt_user in 
(
    SELECT 
        vndr_name_shrt_user
    FROM VENDOR 
    WHERE COUNTRY = o.country
    AND VENDOR_STATUS = 'A'
    GROUP BY vndr_name_shrt_user
    HAVING COUNT(*) > 1
) 

UNION ALL

select 
    o.vendor_id
    ,o.vndr_name_shrt_user
    ,O.COUNTRY 
    ,O.VENDOR_NAME_SHORT 
    ,B.POSTAL
    ,B.ADDRESS1
    ,OC.SAME_SHORT_NAME
    ,oc.SAME_USER_NUM
    ,'VENDOR_NAME_SHORT' as duplicateFlag
from VENDOR o
JOIN vendor_addr B ON o.VENDOR_ID = B.VENDOR_ID
WHERE O.COUNTRY ='CANADA'
AND B.COUNTY = 'CANADA'
AND o.VENDOR_NAME_SHORT in 
(
    SELECT 
        VENDOR_NAME_SHORT
    FROM VENDOR 
    WHERE COUNTRY = o.country
    AND VENDOR_STATUS = 'A'
    GROUP BY VENDOR_NAME_SHORT
    HAVING COUNT(*) > 1
) 
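As an alternative to repeating the base query once per indicator, the counts can be computed in a single pass with analytic COUNT. This is only a sketch under assumptions, not the method from this answer: it reuses the table and column names from the question, assumes vendor_addr has a COUNTRY column, and counts joined vendor/address rows, so a vendor with several addresses inflates its own name counts.

```sql
-- Sketch only: names come from the question; b.country is an assumption.
SELECT *
FROM (
    SELECT
         o.vendor_id
        ,o.vndr_name_shrt_user
        ,o.vendor_name_short
        ,b.postal
        ,b.address1
         -- Number of joined rows sharing each candidate duplicate key
        ,COUNT(*) OVER (PARTITION BY o.vndr_name_shrt_user) AS same_user_num
        ,COUNT(*) OVER (PARTITION BY o.vendor_name_short)   AS same_short_name
         -- Blank postal codes and addresses are counted as 0, not as matches
        ,COUNT(CASE WHEN b.postal   <> ' ' THEN 1 END)
             OVER (PARTITION BY b.postal)                   AS same_postal_nb
        ,COUNT(CASE WHEN b.address1 <> ' ' THEN 1 END)
             OVER (PARTITION BY b.address1)                 AS same_address_nb
    FROM vendor o
    JOIN vendor_addr b ON o.vendor_id = b.vendor_id
    WHERE o.country = 'CANADA'
      AND o.vendor_status = 'A'
      AND b.country = 'CANADA'
) v
-- Keep a row if ANY indicator fires, instead of requiring all of them
WHERE v.same_user_num   > 1
   OR v.same_short_name > 1
   OR v.same_postal_nb  > 1
   OR v.same_address_nb > 1;
```

Because the indicators are combined with OR, a vendor that matches on only one field is still returned, and the four count columns show which fields triggered the match.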

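Comments:

Won't having only one duplicate flag fill the query with duplicate flags? Or do I create 'SAME_USER_NUM' as duplicateFlag2?

You can put the different duplicate flags in the last column; I will update the query with an example.

Should I remove OC.SAME_SHORT_NAME, oc.SAME_USER_NUM, since I created them in my original query? Also, I get too many results as an error. Thank you very much.

Yes, I would remove those fields too, because you will now have the IDs of the potentially duplicated vendors together with the field on which they are suspected to be duplicates. How did you determine that you are getting false positives in this list?

I am reading up on false positives at the moment.

Sorry, with the nested selects and the UNION ALL I still get the "too many rows" error.

Answer 2:

Let's have some interesting data, with chained duplicates on different attributes: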

CREATE TABLE data ( ID, A, B, C ) AS
  SELECT 1, 1, 1, 1 FROM DUAL UNION ALL -- Related to #2 on column A
  SELECT 2, 1, 2, 2 FROM DUAL UNION ALL -- Related to #1 on column A, #3 on B & C, #5 on C
  SELECT 3, 2, 2, 2 FROM DUAL UNION ALL -- Related to #2 on columns B & C, #5 on C
  SELECT 4, 3, 3, 3 FROM DUAL UNION ALL -- Related to #5 on column A
  SELECT 5, 3, 4, 2 FROM DUAL UNION ALL -- Related to #2 and #3 on column C, #4 on A
  SELECT 6, 5, 5, 5 FROM DUAL;          -- Unrelated

Now we can get some of the relationships using analytic functions (without any joins):

SELECT d.*,
       LEAST(
         FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
         FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
         FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
       ) AS duplicate_of
FROM   data d;

This gives:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            2
 4 3 3 3            4
 5 3 4 2            2
 6 5 5 5            6

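But this does not show that #4 is related to #5, which is related to #2, which in turn is related to #1 ...

Those chains can be found with a hierarchical query: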

SELECT id, a, b, c,
       CONNECT_BY_ROOT( id ) AS duplicate_of
FROM   data
CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c );

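But that produces many, many duplicate rows (it does not know where to start the hierarchy, so it takes every row in turn as the root). Instead, you can use the first query to give the hierarchical query its starting points: the rows where the ID and DUPLICATE_OF values are the same: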

SELECT id, a, b, c,
       CONNECT_BY_ROOT( id ) AS duplicate_of
FROM   (
  SELECT d.*,
         LEAST(
           FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
           FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
           FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
         ) AS duplicate_of
  FROM   data d
)
START WITH id = duplicate_of
CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c );

This gives:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            1
 4 3 3 3            1
 5 3 4 2            1
 1 1 1 1            4
 2 1 2 2            4
 3 2 2 2            4
 4 3 3 3            4
 5 3 4 2            4
 6 5 5 5            6

Some rows are still duplicated, because the search hits a local minimum at #4 ... these can be removed with a simple GROUP BY:

SELECT id, a, b, c,
       MIN( duplicate_of ) AS duplicate_of
FROM   (
  SELECT id, a, b, c,
         CONNECT_BY_ROOT( id ) AS duplicate_of
  FROM   (
    SELECT d.*,
           LEAST(
             FIRST_VALUE( id ) OVER ( PARTITION BY a ORDER BY id ),
             FIRST_VALUE( id ) OVER ( PARTITION BY b ORDER BY id ),
             FIRST_VALUE( id ) OVER ( PARTITION BY c ORDER BY id )
           ) AS duplicate_of
    FROM   data d
  )
  START WITH id = duplicate_of
  CONNECT BY NOCYCLE ( PRIOR a = a OR PRIOR b = b OR PRIOR c = c )
)
GROUP BY id, a, b, c;

Which gives the output:

ID A B C DUPLICATE_OF
-- - - - ------------
 1 1 1 1            1
 2 1 2 2            1
 3 2 2 2            1
 4 3 3 3            1
 5 3 4 2            1
 6 5 5 5            6
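To check by hand which column links which pair of rows in the sample data, a plain self-join can list the direct (one-step) relationships. This is only an illustrative sketch against the data table defined above; the output column names (related_id, same_a, same_b, same_c) are made up here.

```sql
-- List each directly related pair once (d1.id < d2.id) and
-- flag the column(s) on which the two rows agree.
SELECT d1.id AS id,
       d2.id AS related_id,
       CASE WHEN d1.a = d2.a THEN 'Y' ELSE 'N' END AS same_a,
       CASE WHEN d1.b = d2.b THEN 'Y' ELSE 'N' END AS same_b,
       CASE WHEN d1.c = d2.c THEN 'Y' ELSE 'N' END AS same_c
FROM   data d1
       JOIN data d2
         ON  d1.id < d2.id
         AND ( d1.a = d2.a OR d1.b = d2.b OR d1.c = d2.c )
ORDER BY d1.id, d2.id;
```

For the six sample rows this lists the pairs (1,2), (2,3), (2,5), (3,5) and (4,5). Chaining them connects #1 through #5 while #6 stays alone, which agrees with the final output above.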

Comments:

It takes a lot of time to process: SELECT vendor_id, VENDOR_NAME_SHORT, VNDR_NAME_SHRT_USR, NAME1, MIN( duplicate_of ) AS duplicate_of FROM (SELECT vendor_id, VENDOR_NAME_SHORT, VNDR_NAME_SHRT_USR, NAME1, CONNECT_BY_ROOT( vendor_id) AS duplicate_of FROM (SELECT D.*, LEAST(FIRST_VALUE(vendor_id) OVER (PARTITION BY VENDOR_NAME_SHORT ORDER BY vendor_id), FIRST_VALUE(vendor_id) OVER (PARTITION BY VNDR_NAME_SHRT_USR ORDER BY vendor_id), FIRST_VALUE(partition BY NAME1 ORDER BY vendor_id ) ) AS duplicate_of FROM PS_VENDOR d ) START WITH vendor_id = duplicate_of CONNECT BY NOCYCLE (PRIOR VENDOR_NAME_SHORT = VENDOR_NAME_SHORT OR PRIOR VNDR_NAME_SHRT_USR = VNDR_NAME_SHRT_USR OR PRIOR NAME1 = NAME1 ) ) GROUP BY vendor_id, VENDOR_NAME_SHORT, VNDR_NAME_SHORT
