SQL Server: Grouping sets of duplicate rows

Posted 2023-03-29


【Title】SQL Server : Grouping sets of duplicate rows 【Published】2021-12-08 22:16:02 【Question】:

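I have geographic products that may contain duplicates.

I return a list of possible duplicates and display them on a map so the user can check and remove them.

To help the user cross-reference rows, I want to colour-code the duplicates. Two or more rows that appear to be duplicates should share the same ColourGroup, so that the user compares, say, two green rows or three red rows.

I want to return a unique number for each ColourGroup.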

SQL

This returns a list of Products that may have duplicates, based on their Latitude, Longitude, ProductType and Price +/- 5%.

WITH Prods AS
(
    SELECT p.ProductID,
           p.ProductType,
           p.Price,
           p.Price + ((p.Price / 100) * 5) AS PriceUpper,   -- Price + 5%
           p.Price - ((p.Price / 100) * 5) AS PriceLower,   -- Price - 5%
           ROUND(p.Latitude, 3) AS Latitude,
           ROUND(p.Longitude, 3) AS Longitude
    FROM Products p
    WHERE p.Latitude IS NOT NULL AND p.Longitude IS NOT NULL
)
SELECT DISTINCT
       a.ProductID,
       b.ProductID AS Duplicate,
       a.Latitude,
       a.Longitude
FROM Prods a
INNER JOIN Prods b
    ON a.ProductID <> b.ProductID
   AND a.Latitude = b.Latitude
   AND a.Longitude = b.Longitude
   AND a.ProductType = b.ProductType
   AND (b.Price < a.PriceUpper AND b.Price > a.PriceLower)

Desired result

Products

ProductID   Price    Latitude   Longitude   ProductType
ID1         500      12.34      56.78       Widget
ID2         505      12.34      56.78       Widget
ID3         200      12.34      56.78       Widget
ID4         800      12.34      56.78       Widget
ID5         500      12.34      56.78       Doodad

ID6         300      98.76      54.32       Doodad
ID7         295      98.76      54.32       Doodad
ID8         302      98.76      54.32       Doodad
ID9         100      98.76      54.32       Doodad

ID10        250      12.34      56.78       Thingamy
ID11        600      12.34      56.78       Thingamy

I want to return the following:

ProductID   Duplicate  Latitude   Longitude   ColourGroup
ID1         ID2        12.34      56.78       1
ID2         ID1        12.34      56.78       1

ID6         ID7        98.76      54.32       2
ID6         ID8        98.76      54.32       2
ID7         ID6        98.76      54.32       2
ID7         ID8        98.76      54.32       2
ID8         ID6        98.76      54.32       2
ID8         ID7        98.76      54.32       2

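ID3 and ID4 don't match ID1 or ID2 because their prices are outside +/- 5%, and they don't match each other. ID5 doesn't match ID1 or ID2 because it is a different ProductType, even though it is at the same location.

ID9 doesn't match ID6, ID7 or ID8.

ID10 and ID11 don't match each other.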
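
As a worked check of the +/- 5% band, the sketch below compares the prices of ID2, ID3 and ID4 against ID1's price of 500, using the same strict comparisons as the join in the query. The inline VALUES list and the column alias VsID1 are only for illustration and are not part of the original query.

```sql
-- ID1 costs 500, so its band is 475 < Price < 525.
SELECT v.ProductID,
       v.Price,
       CASE WHEN v.Price > 500 * 0.95 AND v.Price < 500 * 1.05
            THEN 'match'
            ELSE 'no match'
       END AS VsID1
FROM (VALUES ('ID2', 505), ('ID3', 200), ('ID4', 800)) AS v(ProductID, Price);
```

Only ID2 (505) falls inside the band; 200 and 800 are well outside it.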

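How can I identify and number the sets of duplicates so that I can colour-code them later?

Ideally, rather than having a thousand colours, the ColourGroup number would reset for each Lat/Lng, so that I can get by with a set of about ten colours.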

【Question comments】:

【Answer 1】:

The dense_rank window function comes in handy here. It numbers groups according to the criteria you pass in the "over()" clause. Try this:

with pct_diff as (
    select sd.ProductID,
           sd_1.ProductID as duplicate,
           sd.latitude,
           sd.longitude,
           sd.producttype,
           round((1.0 * min(sd.price) over (partition by sd.latitude,
                                                         sd.longitude,
                                                         sd.ProductType))
                 / (1.0 * sd.price) * 100 / 5, 0) as pct_diff
      from some_data sd
      join some_data sd_1
        on sd.latitude = sd_1.latitude
       and sd.longitude = sd_1.longitude
       and sd.ProductType = sd_1.ProductType
       and sd.productid <> sd_1.productid
       and sd.price between sd_1.price - sd_1.Price * 0.05
                        and sd_1.price + sd_1.Price * 0.05
)
select ProductID,
       duplicate,
       Latitude,
       Longitude,
       ProductType,
       dense_rank() over (order by Latitude,
                                   Longitude,
                                   ProductType,
                                   pct_diff) as color_group
  from pct_diff
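
One possible alternative, offered only as a sketch and not as the answer author's method: tag every matched row with the smallest ProductID in its set, then let DENSE_RANK number those keys. The table variable @Products, the CTE names Pairs and Keyed, and the columns GroupKey and ColourGroupPerLocation are all made up for this illustration, and the column types are guesses. The sketch assumes matches are mutual and transitive, as they are in the sample data; a chain such as 100, 104, 108, where the two ends do not match each other, would need a different approach.

```sql
-- Sample rows copied from the question; column types are assumptions.
DECLARE @Products TABLE (
    ProductID   varchar(10),
    Price       decimal(10, 2),
    Latitude    decimal(9, 6),
    Longitude   decimal(9, 6),
    ProductType varchar(20)
);

INSERT INTO @Products (ProductID, Price, Latitude, Longitude, ProductType) VALUES
('ID1',  500, 12.34, 56.78, 'Widget'),
('ID2',  505, 12.34, 56.78, 'Widget'),
('ID3',  200, 12.34, 56.78, 'Widget'),
('ID4',  800, 12.34, 56.78, 'Widget'),
('ID5',  500, 12.34, 56.78, 'Doodad'),
('ID6',  300, 98.76, 54.32, 'Doodad'),
('ID7',  295, 98.76, 54.32, 'Doodad'),
('ID8',  302, 98.76, 54.32, 'Doodad'),
('ID9',  100, 98.76, 54.32, 'Doodad'),
('ID10', 250, 12.34, 56.78, 'Thingamy'),
('ID11', 600, 12.34, 56.78, 'Thingamy');

WITH Prods AS (
    SELECT ProductID,
           ProductType,
           Price,
           Price * 1.05 AS PriceUpper,
           Price * 0.95 AS PriceLower,
           ROUND(Latitude, 3)  AS Latitude,
           ROUND(Longitude, 3) AS Longitude
    FROM @Products
    WHERE Latitude IS NOT NULL AND Longitude IS NOT NULL
),
Pairs AS (
    -- Same matching rule as the question: location, type, price within +/- 5%.
    SELECT a.ProductID, b.ProductID AS Duplicate, a.Latitude, a.Longitude
    FROM Prods a
    INNER JOIN Prods b
        ON a.ProductID <> b.ProductID
       AND a.Latitude = b.Latitude
       AND a.Longitude = b.Longitude
       AND a.ProductType = b.ProductType
       AND b.Price < a.PriceUpper AND b.Price > a.PriceLower
),
Keyed AS (
    -- GroupKey = smallest ProductID among the row itself and its matches.
    SELECT ProductID, Duplicate, Latitude, Longitude,
           CASE WHEN MIN(Duplicate) OVER (PARTITION BY ProductID) < ProductID
                THEN MIN(Duplicate) OVER (PARTITION BY ProductID)
                ELSE ProductID
           END AS GroupKey
    FROM Pairs
)
SELECT ProductID, Duplicate, Latitude, Longitude,
       -- One number per duplicate set across the whole result.
       DENSE_RANK() OVER (ORDER BY Latitude, Longitude, GroupKey) AS ColourGroup,
       -- Numbering that restarts at 1 for every location.
       DENSE_RANK() OVER (PARTITION BY Latitude, Longitude
                          ORDER BY GroupKey) AS ColourGroupPerLocation
FROM Keyed
ORDER BY ColourGroup, ProductID, Duplicate;
```

On the sample data this gives ColourGroup 1 to the ID1/ID2 rows and ColourGroup 2 to the ID6/ID7/ID8 rows, while ColourGroupPerLocation is 1 for both sets because each location holds only one set of duplicates. The per-location variant is the one that would let a small palette be reused from one location to the next.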

【Comments】:

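Thanks for the help, but this ignores the Price +/- 5%. That is necessary because each location has many similar products per ProductType. Price has to be included to avoid most of the false positives.

Ah, sorry. I hadn't noticed that.

@user2470281 please check the updated answer.

Thanks, but... I already have the code that selects on +/- 5%. The problem is that DENSE_RANK still ranks only by location and product type. It will group products that may be the same type but have very different prices. It should only group products that explicitly match each other on location, product type and Price +/- 5%.

@user2470281 I have updated the arguments to dense_rank. Let's see whether that does the trick. If not, could you provide more test data, especially some rows that fall outside the +/- 5% tolerance?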
