T-SQL query to flag repeat records

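I have a table with more than 500,000 records. Each record has a LineNumber field, which is not unique and is not part of the primary key. Each record also has a CreatedOn field.

I need to update all 500,000+ records to identify repeat records.

A repeat record is defined as a record that has the same LineNumber as another record created within the seven days before its own CreatedOn.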

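[Image: example rows]

In the example image, row 4 is a repeat because it occurred only five days after row 1. Row 6 is not a repeat, even though it occurred only four days after row 4: row 4 is itself a repeat, so row 6 can only be compared with row 1, which is nine days before row 6. Therefore row 6 is not a repeat.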
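The rule described above can be pinned down with a small reference implementation, which may be handy for checking the output of any set-based query. This is only a sketch in Python, not part of the original thread: the function name `flag_repeats`, the sample dates, and the choice to treat a gap of exactly seven days as a repeat are all assumptions.

```python
from datetime import datetime, timedelta

def flag_repeats(rows, window=timedelta(days=7)):
    """rows: iterable of (job_id, line_number, created_on).
    Returns {job_id: is_repeat}. Rows with a NULL (None) line number are skipped.
    Assumption: a gap of exactly `window` still counts as a repeat."""
    last_non_repeat = {}  # line_number -> CreatedOn of the latest non-repeat
    flags = {}
    # Walk the rows in CreatedOn order, breaking ties by job_id
    for job_id, line, created in sorted(rows, key=lambda r: (r[2], r[0])):
        if line is None:
            continue
        anchor = last_non_repeat.get(line)
        if anchor is not None and created - anchor <= window:
            flags[job_id] = True       # compared only with the last non-repeat
        else:
            flags[job_id] = False
            last_non_repeat[line] = created
    return flags

# Hypothetical dates that reproduce the gaps described for rows 1, 4 and 6
rows = [
    (1, "1005", datetime(2009, 7, 1)),
    (4, "1005", datetime(2009, 7, 6)),   # 5 days after row 1
    (6, "1005", datetime(2009, 7, 10)),  # 4 days after row 4, 9 days after row 1
]
print(flag_repeats(rows))  # {1: False, 4: True, 6: False}
```

The key point is that each row is compared with the most recent non-repeat for its LineNumber, not with the immediately preceding row.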

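I do not know how to update the IsRepeat field without stepping through each record one by one with a cursor or something similar.

I do not believe a cursor is the way to go, but I am stuck for any other possible solution.

I have considered that Common Table Expressions might help, but I have no experience with them and no idea where to start.

Basically, this same process needs to be run on the table every day, because the table is truncated and repopulated daily. Once the table is repopulated, I have to go through and re-mark each record as a repeat or not.

Any assistance would be most appreciated.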

UPDATE

Here is a script to create the table and insert test data:

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
DROP TABLE [dbo].[Job]
GO

USE [Test]
GO

/****** Object:  Table [dbo].[Job]    Script Date: 08/18/2009 07:55:25 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[Job]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[Job](
    [JobID] [int] IDENTITY(1,1) NOT NULL,
    [LineNumber] [nvarchar](20) NULL,
    [IsRepeat] [bit] NULL,
    [CreatedOn] [smalldatetime] NOT NULL,
 CONSTRAINT [PK_Job] PRIMARY KEY CLUSTERED 
(
    [JobID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
END
GO


SET NOCOUNT ON

INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-01 07:52:08')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-01 08:30:01')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-01 09:30:35')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-01 10:51:10')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-02 09:22:30')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-02 10:27:28')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-02 11:15:33')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-02 13:01:13')
INSERT INTO dbo.Job VALUES ('1014',NULL,'2009-07-03 12:05:56')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-03 13:57:34')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-03 15:38:54')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-04 16:32:20')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-05 13:46:46')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-05 15:08:35')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-05 15:19:50')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-05 16:37:19')
INSERT INTO dbo.Job VALUES ('1019',NULL,'2009-07-05 17:14:09')
INSERT INTO dbo.Job VALUES ('1009',NULL,'2009-07-05 20:55:08')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-06 08:29:29')
INSERT INTO dbo.Job VALUES ('1002',NULL,'2009-07-07 11:22:38')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-07 12:25:23')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-08 09:32:07')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-08 09:46:33')
INSERT INTO dbo.Job VALUES ('1016',NULL,'2009-07-08 10:09:08')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-09 10:45:04')
INSERT INTO dbo.Job VALUES ('1027',NULL,'2009-07-09 11:31:23')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-09 13:10:06')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-09 15:04:06')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-09 17:32:16')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-09 19:51:28')
INSERT INTO dbo.Job VALUES ('1000',NULL,'2009-07-10 15:09:42')
INSERT INTO dbo.Job VALUES ('1025',NULL,'2009-07-10 16:15:31')
INSERT INTO dbo.Job VALUES ('1006',NULL,'2009-07-10 21:55:43')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-11 08:49:03')
INSERT INTO dbo.Job VALUES ('1022',NULL,'2009-07-11 16:47:21')
INSERT INTO dbo.Job VALUES ('1026',NULL,'2009-07-11 18:23:16')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-11 19:49:31')
INSERT INTO dbo.Job VALUES ('1029',NULL,'2009-07-12 11:57:26')
INSERT INTO dbo.Job VALUES ('1003',NULL,'2009-07-13 08:32:20')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-13 09:31:32')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 09:52:54')
INSERT INTO dbo.Job VALUES ('1021',NULL,'2009-07-14 11:22:31')
INSERT INTO dbo.Job VALUES ('1023',NULL,'2009-07-14 11:54:14')
INSERT INTO dbo.Job VALUES (NULL,NULL,'2009-07-14 15:17:08')
INSERT INTO dbo.Job VALUES ('1005',NULL,'2009-07-15 13:27:08')
INSERT INTO dbo.Job VALUES ('1010',NULL,'2009-07-15 14:10:56')
INSERT INTO dbo.Job VALUES ('1011',NULL,'2009-07-15 15:20:50')
INSERT INTO dbo.Job VALUES ('1028',NULL,'2009-07-15 15:39:18')
INSERT INTO dbo.Job VALUES ('1012',NULL,'2009-07-15 16:06:17')
INSERT INTO dbo.Job VALUES ('1017',NULL,'2009-07-16 11:52:08')

SET NOCOUNT OFF
GO
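Answer

This ignores rows where LineNumber is null. How should IsRepeat be handled in that case?

It works for the test data. Whether it will be efficient enough for production volumes, I do not know.

Where there are duplicate (LineNumber, CreatedOn) pairs, one is chosen arbitrarily (the one with the smallest JobId).

Basic idea:

  1. Get all pairs of JobIds, by line number, that are more than seven days apart.
  2. For each pair, count the rows that fall more than seven days after the left side, up to and including the right side. (CNT)
  3. Then, if JobId X is not a repeat, the next non-repeat is the right side of the pair that has X on the left and CNT = 1.
  4. Use a recursive CTE that starts with the first row for each line number.
  5. The recursive element uses the pairs with their counts to get the next row.
  6. Finally, update: set IsRepeat to 0 for all non-repeats and to 1 for everything else.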
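Steps 2 and 3 are the least obvious part, so here is a rough Python sketch of the counting idea for a single line number. It is an illustration only, not the answer's code: the function names are made up, the JobId values assume the identity column starts at 1 for the test script, and the sketch assumes a strict "more than seven days" comparison for both the pairing and the count.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def next_non_repeat(rows):
    """rows: list of (job_id, created_on) for ONE line number.
    Returns {left_job_id: right_job_id} for the pairs whose count is 1."""
    steps = {}
    for l_id, l_on in rows:
        begin = l_on + WINDOW
        for r_id, r_on in rows:
            if not begin < r_on:
                continue  # pair is not more than seven days apart
            # Count rows after `begin`, up to and including the right row
            cnt = sum(1 for j_id, j_on in rows
                      if begin < j_on
                      and (j_on < r_on or (j_on == r_on and j_id <= r_id)))
            if cnt == 1:  # nothing else lies in between: r is the next candidate
                steps[l_id] = r_id
    return steps

def non_repeats(rows):
    """Start at the first row and follow the CNT = 1 pairs."""
    steps = next_non_repeat(rows)
    current = min(rows, key=lambda r: (r[1], r[0]))[0]
    result = [current]
    while current in steps:
        current = steps[current]
        result.append(current)
    return result

# LineNumber '1005' from the test data (JobIds assumed to start at 1)
rows_1005 = [
    (4, datetime(2009, 7, 1, 10, 51, 10)),
    (5, datetime(2009, 7, 2, 9, 22, 30)),
    (23, datetime(2009, 7, 8, 9, 46, 33)),
    (27, datetime(2009, 7, 9, 13, 10, 6)),
    (34, datetime(2009, 7, 11, 8, 49, 3)),
    (40, datetime(2009, 7, 13, 9, 31, 32)),
    (45, datetime(2009, 7, 15, 13, 27, 8)),
]
print(non_repeats(rows_1005))  # [4, 27]
```

A count of 1 means the right-hand row is the only row in the interval, so it is the first row outside the seven-day window of the left-hand row.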

; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber
            , LeftJobId
            , RightJobId
            , BeginCreatedOn
            , EndCreatedOn) as
        (select l.LineNumber
            , l.JobId
            , r.JobId
            , dateadd(day, 7, l.CreatedOn)
            , r.CreatedOn
        from Job l
        inner join Job r
            on l.LineNumber = r.LineNumber
            and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
            and l.JobId <> r.JobId)
    -- Count the rows after BeginCreatedOn (exclusive, to match the
    -- "more than seven days" test in the pairing above) up to and
    -- including EndCreatedOn.
    -- When CreatedOn = EndCreatedOn, count only rows with
    -- JobId <= RightJobId, to handle ties in CreatedOn.
    , AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as
        (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
        from AllPairsByLineNumberAtLeast7DaysApart ap
        inner join Job j
            on j.LineNumber = ap.LineNumber
            and ap.BeginCreatedOn < j.CreatedOn
            and (j.CreatedOn < ap.EndCreatedOn
                or (j.CreatedOn = ap.EndCreatedOn 
                    and j.JobId <= ap.RightJobId))
         group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
    , Step1 (LineNumber, JobId, CreatedOn, RN) as
        (select LineNumber, JobId, CreatedOn
            , row_number() over 
                (partition by LineNumber order by CreatedOn, JobId)
        from Job)
    , Results (JobId, LineNumber, CreatedOn) as    
        -- Start with the first rows.
        (select JobId, LineNumber, CreatedOn
        from Step1
        where RN = 1
        and LineNumber is not null
        -- get the next row
        union all
        select j.JobId, j.LineNumber, j.CreatedOn
        from Results r
        inner join AllPairsCount apc on apc.LeftJobId = r.JobId
        inner join Job j
            on j.JobId = apc.RightJobId
            and apc.CNT = 1)
    update j
    set IsRepeat = case when R.JobId is not null then 0 else 1 end
    from Job j
    left outer join Results r
        on j.JobId = R.JobId
    where j.LineNumber is not null

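Edit:

After I turned off my computer last night, I realized I had made things more complicated than they needed to be. Here is a more straightforward (and, on the test data, slightly more efficient) query:

Basic idea:

  1. Generate PotentialSteps (FromJobId, ToJobId). These are the pairs where, if FromJobId is not a repeat, then ToJobId is not a repeat either. (ToJobId is the first row for the same LineNumber that is more than seven days after FromJobId.)
  2. Use a recursive CTE that starts from the first JobId for each LineNumber and then steps, via PotentialSteps, to each non-repeat JobId.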

; with PotentialSteps (FromJobId, ToJobId) as
    (select FromJobId, ToJobId
    from (select f.JobId as FromJobId
            , t.JobId as ToJobId
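            -- Partition by the From row so that each FromJobId gets its own
            -- first To row (partitioning by LineNumber would keep only one
            -- step per line number)
            , row_number() over
                 (partition by f.JobId order by t.CreatedOn, t.JobId) as RN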
        from Job f
        inner join Job t
            on f.LineNumber = t.LineNumber
            and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
        where RN = 1)
, NonRepeats (JobId) as
    (select JobId
    from (select JobId
            , row_number() over
                (partition by LineNumber order by CreatedOn, JobId) as RN
        from Job) Start
    where RN = 1
    union all
    select J.JobId
    from NonRepeats NR
    inner join PotentialSteps PS
        on NR.JobId = PS.FromJobId
    inner join Job J
        on PS.ToJobId = J.JobId)
update J
set IsRepeat = case when NR.JobId is not null then 0 else 1 end
from Job J
left outer join NonRepeats NR
on J.JobId = NR.JobId
where J.LineNumber is not null
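
For readers who want to check the stepping logic outside SQL Server, the same idea can be mirrored in a few lines of Python. This is a sketch under assumptions rather than a translation of the query: the function names are invented here, the JobId values assume the identity column starts at 1 for the test script, and only two line numbers from the test data are included.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def potential_steps(rows):
    """rows: list of (job_id, line_number, created_on).
    Maps each From job to the first To job with the same line number
    created more than seven days later. Also returns the grouping."""
    by_line = defaultdict(list)
    for job_id, line, created in rows:
        if line is not None:
            by_line[line].append((created, job_id))
    steps = {}
    for jobs in by_line.values():
        for f_on, f_id in jobs:
            later = [(t_on, t_id) for t_on, t_id in jobs if f_on + WINDOW < t_on]
            if later:
                steps[f_id] = min(later)[1]  # earliest CreatedOn, then lowest JobId
    return steps, by_line

def non_repeats(rows):
    """Start at the first job of each line number and follow the steps."""
    steps, by_line = potential_steps(rows)
    result = set()
    for jobs in by_line.values():
        current = min(jobs)[1]
        result.add(current)
        while current in steps:
            current = steps[current]
            result.add(current)
    return result

# Line numbers '1006' and '1029' from the test data (JobIds assumed to start at 1)
rows = [
    (1, "1006", datetime(2009, 7, 1, 7, 52, 8)),
    (8, "1029", datetime(2009, 7, 2, 13, 1, 13)),
    (10, "1029", datetime(2009, 7, 3, 13, 57, 34)),
    (12, "1006", datetime(2009, 7, 4, 16, 32, 20)),
    (14, "1029", datetime(2009, 7, 5, 15, 8, 35)),
    (21, "1029", datetime(2009, 7, 7, 12, 25, 23)),
    (28, "1006", datetime(2009, 7, 9, 15, 4, 6)),
    (33, "1006", datetime(2009, 7, 10, 21, 55, 43)),
    (38, "1029", datetime(2009, 7, 12, 11, 57, 26)),
]
print(sorted(non_repeats(rows)))  # [1, 8, 28, 38]
```

Every JobId not in the returned set would get IsRepeat = 1. Note that each From job needs its own "first later row", which is why the step lookup is keyed by job rather than by line number.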
Another answer

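-- First mark every row as not a repeat
UPDATE dbo.Job
SET IsRepeat = 0

-- Then flag rows that have another row with the same LineNumber
-- in the seven days up to their own CreatedOn. Note that this compares
-- against every earlier row, including rows that are themselves repeats.
UPDATE j
SET IsRepeat = 1
FROM dbo.Job AS j
WHERE EXISTS
   (SELECT 1 FROM dbo.Job AS i
    WHERE i.LineNumber = j.LineNumber
      AND i.CreatedOn <> j.CreatedOn
      AND i.CreatedOn BETWEEN DATEADD(day, -7, j.CreatedOn) AND j.CreatedOn)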

Note: I hope this helps you somewhat. Let me know if you find any discrepancies on a larger data set.
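
One discrepancy can be seen without a larger data set. The sketch below, in Python with hypothetical dates that reproduce the gaps from the question's example, mimics a plain "any other row in the previous seven days" check. The function name and the dates are assumptions; the point is only to show how such a check treats row 6.

```python
from datetime import datetime, timedelta

def flag_any_earlier(rows, window=timedelta(days=7)):
    """rows: list of (job_id, created_on) for ONE line number.
    Flags a row when any other row with a different CreatedOn falls
    in the seven days up to and including its own CreatedOn."""
    return {
        job_id: any(other != created and created - window <= other <= created
                    for _, other in rows)
        for job_id, created in rows
    }

# Hypothetical dates matching the question: row 4 is five days after row 1,
# row 6 is four days after row 4 and nine days after row 1
rows = [
    (1, datetime(2009, 7, 1)),
    (4, datetime(2009, 7, 6)),
    (6, datetime(2009, 7, 10)),
]
print(flag_any_earlier(rows))  # {1: False, 4: True, 6: True}
```

Row 6 comes out as a repeat because row 4 is within seven days of it, whereas the question wants row 6 compared only with row 1 (the last non-repeat), which would make it a non-repeat.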

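Another answer

I am not proud of this. It makes a lot of assumptions (for example, that CreatedOn is a date only, and that (LineNumber, CreatedOn) is a key). It may need some tuning, and it is only known to work with the test data.

In other words, I wrote this more out of intellectual curiosity than because I think it is a genuine solution. The final select could be turned into an update that sets IsRepeat in the base table, based on the existence of rows in V4. One last note before letting people see the evil: could people please post, in the comments, test data for data sets on which it does not work. It may be possible to turn this into a real solution: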

with V1 as (
select t1.LineNumber,t1.CreatedOn,t2.CreatedOn as PrevDate from
T1 t1 inner join T1 t2 on t1.LineNumber = t2.LineNumber and t1.CreatedOn > t2.CreatedOn and DATEDIFF(DAY,t2.CreatedOn,t1.CreatedOn) < 7
), V2 as (
select v1.LineNumber,v1.CreatedOn,V1.PrevDate from V1
union all
select v1.LineNumber,v1.CreatedOn,v2.PrevDate from v1 inner join v2 on V1.LineNumber = v2.LineNumber and v1.PrevDate = v2.CreatedOn
), V3 as (
select LineNumber,CreatedOn,MIN(PrevDate) as PrevDate from V2 group by LineNumber,CreatedOn
), V4 as (
select LineNumber,CreatedOn from V3 where DATEDIFF(DAY,PrevDate,CreatedOn) < 7
)
select
    T1.LineNumber,
    T1.CreatedOn,
    CASE WHEN V4.LineNumber is Null then 0 else 1 end as IsRepeat
from
    T1
        left join
    V4
        on
            T1.LineNumber = V4.LineNumber and
            T1.CreatedOn = V4.CreatedOn
order by T1.CreatedOn,T1.LineNumber
option (maxrecursion 7)
