How to remove duplicates while sorting by unique datetime

Posted 2023-03-29

【Title】: How to remove duplicates while sorting by unique datetime 【Posted】: 2021-10-28 02:02:25 【Question】:

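I'm working with a fairly poor data source: the information I need sits inside a delimited varchar(max) column. The data can be repeated across multiple rows, so I'm trying to remove those duplicates.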

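This can be done by trimming the column I'm interested in, because whenever a duplicate occurs the "ID" is appended again to the end of the column. I then take a DISTINCT over the trimmed value and concatenate the results; not pretty.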

Sample data and the query I'm currently using are on SQL Fiddle.

Table data

| id | callID |                callDateTime |                             history |
|----|--------|-----------------------------|-------------------------------------|
|  1 |      1 | 2021-01-01 10:00:00.0000000 |         Amount: 10, Ref:123, ID:123 |
|  2 |      1 | 2021-01-01 10:01:00.0000000 | Amount: 10, Ref:123, ID:123, ID:123 |
|  3 |      2 | 2021-01-01 11:00:00.0000000 |       Amount:12.44, Ref:SIS, ID:124 |
|  4 |      2 | 2021-01-01 11:02:00.0000000 |       Amount:11.22, Ref:Dad, ID:124 |
|  5 |      2 | 2021-01-01 11:01:00.0000000 |       Amount:11.22, Ref:Mum, ID:124 |
|  6 |      3 | 2021-01-01 12:00:00.0000000 |                   Amount:11, ID:125 |

Query

select CallID, Concat([1],',', [2],',',[3])
from
(
  select CallID, historyEdit, ROW_NUMBER() over (partition by callID order by callID) as rowNum
  from
  (
    select distinct callID, 
    substring(history, 0, charindex(', ID:',history)) historyEdit
    from test
  ) a
 )b
PIVOT(max(historyEdit) for rowNum IN ([1],[2],[3])) piv

Result

| CallID |                                                                   |
|--------|-------------------------------------------------------------------|
|      1 |                                             Amount: 10, Ref:123,, |
|      2 | Amount:11.22, Ref:Dad,Amount:11.22, Ref:Mum,Amount:12.44, Ref:SIS |
|      3 |                                                       Amount:11,, |

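The problem is that I need the concatenated parts to appear in the order the events occurred. Above you can see that CallID 2 is in the wrong order, because entry 3 appears before entry 2. I did try sorting the base table by callDateTime first and then running the query, but the results seemed somewhat random: sometimes they come out in the right order and sometimes they don't. I assume this is because the ROW_NUMBER() in my query does not order by callDateTime anywhere.

Including callDateTime in the result stops DISTINCT from returning unique rows, because callDateTime is still unique for each duplicated row.

I'm using SQL Server v12 (SQL Server 2014).

Desired result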

| CallID |                                                                   |
|--------|-------------------------------------------------------------------|
|      1 |                                             Amount: 10, Ref:123,, |
|      2 | Amount:12.44, Ref:SIS,Amount:11.22, Ref:Mum,Amount:11.22, Ref:Dad |
|      3 |                                                       Amount:11,, |

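【Comments】:

Can you be clearer about what you are trying to accomplish? Which duplicate data do you want to remove?

【Answer 1】:

If I understand correctly, you want to split the history for each callid and recombine it without duplicates. If so, you can use string_split() and string_agg():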

select callid, string_agg(value, ', ')
from (select distinct t.callid, s.value
      from test t cross apply
           (select trim(s.value) as value
            from string_split(t.history, ',') s
           ) s
     ) st
group by callid;

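Here is a dbfiddle.

【Discussion】:

Results-wise, I believe you've understood what I'm after, although I'm not on this version of SQL Server, so I don't have access to string_agg or string_split. It's worth adding that the history field can contain "duplicate" data that is genuinely repeated. The only time a repeat is not genuine is when the entire history column for a given callID is duplicated once the final "ID: xxx" part is removed, which is why I went a different route. I've updated a few things in the question to make it clearer.

【Answer 2】: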

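If you are sure of an upper bound on the number of records, you can use a TOP clause in the select to order the records, and then transpose the result: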

select callID, historyEdit
   from
   (
     select distinct top 100000 callID, callDateTime,
     substring(history, 0, charindex(', ID:',history)) historyEdit
     from test
     order by callDateTime
   )t

Please see the result here.
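As the question notes, callDateTime is unique per row, so a DISTINCT that includes it cannot collapse the duplicates. An ORDER BY inside a derived table is also not guaranteed to carry through to the outer query. A minimal sketch of one possible workaround, assuming the `test` table and columns from the question and at most three distinct entries per call: group by the trimmed history, take the earliest callDateTime per entry, and number the rows by that timestamp before pivoting. The alias `history` for the output column is only illustrative.

```sql
-- Sketch only: table and column names follow the question's sample data.
SELECT callID, CONCAT([1], ',', [2], ',', [3]) AS history
FROM (
    SELECT callID,
           historyEdit,
           -- Number each distinct entry by the first time it was seen
           ROW_NUMBER() OVER (PARTITION BY callID
                              ORDER BY MIN(callDateTime)) AS rowNum
    FROM (
        SELECT callID,
               callDateTime,
               -- Keep everything before the first ', ID:' marker
               SUBSTRING(history, 0, CHARINDEX(', ID:', history)) AS historyEdit
        FROM test
    ) a
    -- GROUP BY replaces DISTINCT so callDateTime can be aggregated away
    GROUP BY callID, historyEdit
) b
PIVOT (MAX(historyEdit) FOR rowNum IN ([1], [2], [3])) piv
ORDER BY callID;
```

Because the window function's ORDER BY uses MIN(callDateTime), the numbering no longer depends on how the engine happens to read the rows.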

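【Discussion】:

callDateTime will cause DISTINCT to still return the duplicate data, because callDateTime is unique for that row.

@Chris But if I'm not mistaken, you could also use GROUP BY and pick one of them (the min or max callDateTime). Please look here.

【Answer 3】: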

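One way to build the string is to CROSS APPLY an ordinal splitter that breaks the "history" column into components that can be enumerated. The result is very close to what was provided in the question. Perhaps the expected results provided are not exactly representative? Something like this:

The ordinal splitter is described here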

CREATE FUNCTION [dbo].[DelimitedSplit8K_LEAD]
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT 0 UNION ALL
                 SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT t.N+1
                   FROM cteTally t
                  WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0) 
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
        Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))
   FROM cteStart s
;

Query

with
unq_cte as (
    select distinct callID
    from #test),
exp_cte as (
    select callID, callDateTime , dl.*,
           row_number() over (partition by callID, dl.Item order by callDateTime) as rn
    from #test t
         cross apply dbo.DelimitedSplit8K_LEAD(t.history, ',') dl)
select t.callID, 
       stuff((select ',' + case when rn>1 then '' else Item end
              from exp_cte tt
              where t.callID = tt.callID
                    and ltrim(rtrim(Item)) not like 'ID%'
              order by tt.callDateTime, tt.ItemNumber for xml path('')), 1, 1, '') [value1]
from unq_cte t
group by t.callID;

callID  value1
1       Amount: 10, Ref:123,,
2       Amount:12.44, Ref:SIS,Amount:11.22, Ref:Mum,, Ref:Dad
3       Amount:11
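For comparison, here is a hedged sketch of a different approach that treats each trimmed history value as a whole instead of splitting it on commas. It assumes the permanent `test` table from the question (not the `#test` temp table used above) and relies only on features available in SQL Server 2014, using FOR XML PATH in place of string_agg(), which was only added in SQL Server 2017. The names `deduped`, `firstSeen` and `history` are illustrative.

```sql
-- Sketch only: dedupe whole entries, then concatenate in first-seen order.
WITH deduped AS (
    SELECT callID,
           SUBSTRING(history, 0, CHARINDEX(', ID:', history)) AS historyEdit,
           MIN(callDateTime) AS firstSeen   -- earliest occurrence of each entry
    FROM test
    GROUP BY callID, SUBSTRING(history, 0, CHARINDEX(', ID:', history))
)
SELECT c.callID,
       -- STUFF removes the leading comma produced by the concatenation
       STUFF((SELECT ',' + d.historyEdit
              FROM deduped d
              WHERE d.callID = c.callID
              ORDER BY d.firstSeen
              FOR XML PATH(''), TYPE).value('.', 'varchar(max)'), 1, 1, '') AS history
FROM (SELECT DISTINCT callID FROM deduped) c
ORDER BY c.callID;
```

Unlike the PIVOT form in the question, this is not limited to three entries per call and does not leave trailing commas when a call has fewer entries.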
