Remove Duplicates with Caveats

【Title】: Remove Duplicates with Caveats 【Posted】: 2010-09-14 03:12:32 【Question】:

I have a table with the columns rowID, longitude, latitude, businessName, url, and caption. It might look like this:

rowID | long | lat | businessName | url     | caption
  1   |  20  | -20 | Pizza Hut    | yum.com | null

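How can I delete all the duplicates while keeping only the copy that has a URL (first priority) or, if no copy has a URL, the copy that has a caption (second priority), and delete the rest?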

【Comments】:

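Are the duplicates based on the business name?

Guessing a duplicate means the same long + lat + businessName?

Duplicates are based on long + lat + businessName. Ideally, only one row per long + lat + businessName is left at the end: the one that best fits the scenario.

【Answer 1】: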

This solution is brought to you by last week's "things I learned on Stack Overflow":

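-- Delete every row that is not ranked first within its
-- (BusinessName, lat, long) group.
DELETE restaurant
WHERE rowID in 
(SELECT rowID
    FROM restaurant
    EXCEPT
    SELECT rowID 
    FROM (
        -- NULLs sort last under DESC in SQL Server, so a row with a url beats one
        -- without, then a row with a caption; rowID breaks ties so only one row ranks 1.
        SELECT rowID, Rank() over (Partition BY BusinessName, lat, long ORDER BY url DESC, caption DESC, rowID) AS Rank
        FROM restaurant
        ) rs WHERE Rank = 1)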

Caveat: I have not tested this on a real database.

【Comments】:

【Answer 2】:

Here is my looping technique. It will probably get voted down for not being mainstream, and I'm fine with that.

DECLARE @LoopVar int

DECLARE
  @long int,
  @lat int,
  @businessname varchar(30),
  @winner int

SET @LoopVar = (SELECT MIN(rowID) FROM Locations)

WHILE @LoopVar is not null
BEGIN
  --initialize the variables.
  SELECT 
    @long = null,
    @lat = null,
    @businessname = null,
    @winner = null

  -- load data from the known good row.  
  SELECT
    @long = long,
    @lat = lat,
    @businessname = businessname
  FROM Locations
  WHERE rowID = @LoopVar

  --find the winning row with that data
  SELECT top 1 @Winner = rowID
  FROM Locations
  WHERE @long = long
    AND @lat = lat
    AND @businessname = businessname
  ORDER BY
    CASE WHEN URL is not null THEN 1 ELSE 2 END,
    CASE WHEN Caption is not null THEN 1 ELSE 2 END,
    RowId

  --delete any losers.
  DELETE FROM Locations
  WHERE @long = long
    AND @lat = lat
    AND @businessname = businessname
    AND @winner != rowID

  -- prep the next loop value.
  SET @LoopVar = (SELECT MIN(rowID) FROM Locations WHERE @LoopVar < rowID)
END

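【Comments】:

I have used a very similar approach. This type of loop is also faster than a CURSOR. It has the added benefit that it does not peg the server's CPU. I put similar code in the other post you linked in your question.

What if rowID is a char(11) variable? It is the primary key, but can you select min(foo) on a string?

For a type to serve as a real primary key, it must establish an ordering on the table. Ordering by char(11) is no problem.

【Answer 3】: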

A set-based solution:

delete t1
from T as t1
where /* delete if there is a "better" row
         with same long, lat and businessName */
  exists(
    select * from T as t2 where
      t1.rowID <> t2.rowID
      and t1.long = t2.long
      and t1.lat = t2.lat
      and t1.businessName = t2.businessName 
      and
        case when t1.url is null then 0 else 4 end
          /* 4 points for non-null url */
        + case when t1.caption is null then 0 else 2 end
          /* 2 points for non-null caption */
        + case when t1.rowID > t2.rowId then 0 else 1 end
          /* 1 point for having smaller rowId */
        <=
          /* <= so that, on equal url/caption points, the larger rowId is deleted */
        case when t2.url is null then 0 else 4 end
        + case when t2.caption is null then 0 else 2 end
        )
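To make the scoring idea easier to check, here is a minimal sketch that runs it against an in-memory SQLite database through Python's `sqlite3` module. It is not SQL Server, so the DELETE is written without a table alias, and the sample rows are invented for illustration. It assumes the 2-point bonus is meant for a non-null caption and that a tie on url/caption points should delete the larger rowID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (rowID INTEGER PRIMARY KEY, long INT, lat INT,"
    " businessName TEXT, url TEXT, caption TEXT)"
)
# Hypothetical sample data, not taken from the question.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 20, -20, "Pizza Hut", None, None),
        (2, 20, -20, "Pizza Hut", None, "Great pizza"),
        (3, 20, -20, "Pizza Hut", "yum.com", None),
        (4, 5, 5, "Taco Bell", None, "Tacos"),
        (5, 5, 5, "Taco Bell", None, "More tacos"),
        (6, 1, 1, "Subway", None, None),
    ],
)

# A row is deleted when some other row in the same group scores at least
# as high: 4 points for a url, 2 for a caption, 1 for the smaller rowID.
conn.execute("""
    DELETE FROM T
    WHERE EXISTS (
        SELECT 1 FROM T AS t2
        WHERE t2.rowID <> T.rowID
          AND t2.long = T.long
          AND t2.lat = T.lat
          AND t2.businessName = T.businessName
          AND CASE WHEN T.url IS NULL THEN 0 ELSE 4 END
              + CASE WHEN T.caption IS NULL THEN 0 ELSE 2 END
              + CASE WHEN T.rowID > t2.rowID THEN 0 ELSE 1 END
              <= CASE WHEN t2.url IS NULL THEN 0 ELSE 4 END
                 + CASE WHEN t2.caption IS NULL THEN 0 ELSE 2 END
    )
""")

kept = [row[0] for row in conn.execute("SELECT rowID FROM T ORDER BY rowID")]
print(kept)
```

In this sample, the row with a url wins its group, the older of two caption-only rows wins the second group, and the row without duplicates is left alone.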

【Comments】:

【Answer 4】:
delete MyTable
from MyTable
left outer join (
        select min(rowID) as rowID, long, lat, businessName
        from MyTable
        where url is not null
        group by long, lat, businessName
    ) as HasUrl
    on MyTable.long = HasUrl.long
    and MyTable.lat = HasUrl.lat
    and MyTable.businessName = HasUrl.businessName
left outer join (
        select min(rowID) as rowID, long, lat, businessName
        from MyTable
        where caption is not null
        group by long, lat, businessName
    ) HasCaption
    on MyTable.long = HasCaption.long
    and MyTable.lat = HasCaption.lat
    and MyTable.businessName = HasCaption.businessName
left outer join (
        select min(rowID) as rowID, long, lat, businessName
        from MyTable
        where url is null
            and caption is null
        group by long, lat, businessName
    ) HasNone 
    on MyTable.long = HasNone.long
    and MyTable.lat = HasNone.lat
    and MyTable.businessName = HasNone.businessName
where MyTable.rowID <> 
        coalesce(HasUrl.rowID, HasCaption.rowID, HasNone.rowID)
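Before running a delete like this, it can help to list the winning rows first. The sketch below is one way to do that, using Python's `sqlite3` module with an in-memory SQLite database as a stand-in for SQL Server. The sample rows are made up, and the three outer joins are collapsed into conditional `MIN` aggregates, which is an assumption about an equivalent formulation rather than the answer's own code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (rowID INTEGER PRIMARY KEY, long INT, lat INT,"
    " businessName TEXT, url TEXT, caption TEXT)"
)
# Hypothetical sample data, not taken from the question.
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 20, -20, "Pizza Hut", None, "Great pizza"),
        (2, 20, -20, "Pizza Hut", "yum.com", None),
        (3, 20, -20, "Pizza Hut", "yum.com", "Great pizza"),
        (4, 5, 5, "Taco Bell", None, None),
        (5, 5, 5, "Taco Bell", None, "Tacos"),
    ],
)

# Same fallback chain as the COALESCE: smallest rowID with a url, else
# smallest rowID with a caption, else smallest rowID in the group.
WINNERS = """
    SELECT COALESCE(
               MIN(CASE WHEN url IS NOT NULL THEN rowID END),
               MIN(CASE WHEN caption IS NOT NULL THEN rowID END),
               MIN(rowID)
           )
    FROM MyTable
    GROUP BY long, lat, businessName
"""

# Preview the winners, then delete everything else.
winners = sorted(row[0] for row in conn.execute(WINNERS))
conn.execute(f"DELETE FROM MyTable WHERE rowID NOT IN ({WINNERS})")
remaining = [
    row[0] for row in conn.execute("SELECT rowID FROM MyTable ORDER BY rowID")
]
print(winners, remaining)
```

Note that under this rule row 2 (url only) is kept over row 3 (url and caption) because it has the smaller rowID. The caption is only consulted when no row in the group has a url.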

【Comments】:

【Answer 5】:

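Similar to another answer, but you want to delete based on row number rather than rank. This version also mixes in a common table expression: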


;WITH GroupedRows AS
(   SELECT rowID, Row_Number() OVER (Partition BY BusinessName, lat, long ORDER BY url DESC, caption DESC) rowNum 
    FROM restaurant
)
DELETE r
FROM restaurant r
JOIN GroupedRows gr ON r.rowID = gr.rowID
WHERE gr.rowNum > 1
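The difference between ranking and row numbering only shows up when two rows tie on both url and caption. Here is a small sketch of that case, run on an in-memory SQLite database via Python's `sqlite3` module. This assumes a SQLite build with window functions (3.25 or later), and the sample rows and the `survivors` helper are hypothetical.

```python
import sqlite3

# Hypothetical sample data: rows 1 and 2 tie exactly (same url, same caption).
ROWS = [
    (1, 20, -20, "Pizza Hut", "yum.com", None),
    (2, 20, -20, "Pizza Hut", "yum.com", None),
    (3, 20, -20, "Pizza Hut", None, "Great pizza"),
]


def survivors(window_function):
    """Return the rowIDs numbered 1, i.e. the rows a delete would keep."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE restaurant (rowID INTEGER PRIMARY KEY, long INT,"
        " lat INT, businessName TEXT, url TEXT, caption TEXT)"
    )
    conn.executemany("INSERT INTO restaurant VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    query = f"""
        SELECT rowID FROM (
            SELECT rowID,
                   {window_function}() OVER (
                       PARTITION BY businessName, lat, long
                       ORDER BY url DESC, caption DESC
                   ) AS n
            FROM restaurant
        )
        WHERE n = 1
        ORDER BY rowID
    """
    kept = [row[0] for row in conn.execute(query)]
    conn.close()
    return kept


rank_kept = survivors("RANK")
row_number_kept = survivors("ROW_NUMBER")
print(rank_kept)        # both tied rows are ranked 1, so a duplicate survives
print(row_number_kept)  # exactly one row is numbered 1
```

Which of the tied rows `ROW_NUMBER` keeps is not defined by this ordering. Adding rowID as a final ORDER BY column would make the choice deterministic.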

【Comments】:

【Answer 6】:

If possible, can you homogenize the rows first and then remove the duplicates?

Step 1:

UPDATE BusinessLocations
SET BusinessLocations.url = LocationsWithUrl.url
FROM BusinessLocations
INNER JOIN (
  SELECT long, lat, businessName, url, caption
  FROM BusinessLocations 
  WHERE url IS NOT NULL) LocationsWithUrl 
    ON BusinessLocations.long = LocationsWithUrl.long
    AND BusinessLocations.lat = LocationsWithUrl.lat
    AND BusinessLocations.businessName = LocationsWithUrl.businessName

UPDATE BusinessLocations
SET BusinessLocations.caption = LocationsWithCaption.caption
FROM BusinessLocations
INNER JOIN (
  SELECT long, lat, businessName, url, caption
  FROM BusinessLocations 
  WHERE caption IS NOT NULL) LocationsWithCaption 
    ON BusinessLocations.long = LocationsWithCaption.long
    AND BusinessLocations.lat = LocationsWithCaption.lat
    AND BusinessLocations.businessName = LocationsWithCaption.businessName

Step 2: Remove the duplicates.
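Step 2 is left open in the answer. One possible reading, sketched here with Python's `sqlite3` module and an in-memory SQLite database, is to keep the smallest rowID in each group once every copy carries the same url and caption. The sample rows are invented, and step 1 is rewritten with correlated subqueries and `MAX` (an assumption made so that the copied value is deterministic and the statement runs on SQLite).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BusinessLocations (rowID INTEGER PRIMARY KEY, long INT,"
    " lat INT, businessName TEXT, url TEXT, caption TEXT)"
)
# Hypothetical sample data, not taken from the question.
conn.executemany(
    "INSERT INTO BusinessLocations VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 20, -20, "Pizza Hut", "yum.com", None),
        (2, 20, -20, "Pizza Hut", None, "Great pizza"),
        (3, 5, 5, "Taco Bell", None, None),
    ],
)

# Step 1: copy a non-null url and caption (if any) to every row in the group.
for column in ("url", "caption"):
    conn.execute(f"""
        UPDATE BusinessLocations
        SET {column} = (
            SELECT MAX(b.{column})
            FROM BusinessLocations AS b
            WHERE b.long = BusinessLocations.long
              AND b.lat = BusinessLocations.lat
              AND b.businessName = BusinessLocations.businessName
        )
    """)

# Step 2: the copies are now interchangeable, so keep the smallest rowID.
conn.execute("""
    DELETE FROM BusinessLocations
    WHERE rowID NOT IN (
        SELECT MIN(rowID)
        FROM BusinessLocations
        GROUP BY long, lat, businessName
    )
""")

remaining = conn.execute(
    "SELECT rowID, url, caption FROM BusinessLocations ORDER BY rowID"
).fetchall()
print(remaining)
```

Unlike the other answers, this approach merges data: the surviving row can end up with a url from one copy and a caption from another, which may or may not be what the question wants.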

【Comments】:
