Filter out records within n days

【Title】: Filter out records within n days 【Posted】: 2018-08-24 22:12:23 【Question】:

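I don't know what to call this challenge.

I want to flag (for later filtering) certain records. Within each partition defined by the TransTypeID column, any record whose date falls within n days (3 in this example) of the partition's first record should be flagged. That part is simple enough. But within the same partition, if more records occur after the 3-day limit, the first of those becomes the group's new "first" record and starts a new chain, flagging all subsequent records within 3 days of it. And so on.

I have illustrated the desired output in this screenshot, where I want to flag/filter out the rows marked in yellow and keep all other rows.

I have sprayed and prayed with window functions and the like, but can't seem to find an elegant solution. How would you solve this in T-SQL?

sqlfiddle is not responding for SQL Server at the moment, so I am posting the DDL code here: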

DROP TABLE IF EXISTS [dbo].[testTable];

CREATE TABLE [dbo].[testTable](
    [RowID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [CustID] [int] NULL,
    [TransTypeID] [int] NULL,
    [Date] [date] NULL
)
GO
SET IDENTITY_INSERT [dbo].[testTable] ON 
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (1, 9362, 1, CAST(N'2018-01-11' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (2, 9362, 1, CAST(N'2018-01-22' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (3, 9362, 2, CAST(N'2018-01-04' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (4, 9362, 2, CAST(N'2018-01-07' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (5, 9362, 2, CAST(N'2018-01-09' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (6, 9362, 2, CAST(N'2018-01-22' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (7, 9362, 2, CAST(N'2018-01-23' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (8, 9362, 2, CAST(N'2018-01-24' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (9, 9362, 2, CAST(N'2018-01-26' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (10, 9362, 3, CAST(N'2018-01-22' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (11, 9362, 5, CAST(N'2018-01-01' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (12, 9362, 5, CAST(N'2018-01-02' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (13, 9362, 5, CAST(N'2018-01-02' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (14, 9362, 5, CAST(N'2018-01-04' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (15, 9362, 5, CAST(N'2018-01-07' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (16, 9362, 5, CAST(N'2018-01-17' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (17, 9362, 5, CAST(N'2018-02-08' AS Date))
GO
INSERT [dbo].[testTable] ([RowID], [CustID], [TransTypeID], [Date]) VALUES (18, 9362, 5, CAST(N'2018-02-18' AS Date))
GO
SET IDENTITY_INSERT [dbo].[testTable] OFF
GO

【Comments】:

【Answer 1】:

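A recursive CTE should do the trick. First, SELECT the row with the earliest date in each group; row_number() can be used for that. Then recursively UNION ALL, for each group, the row with the earliest date that is later than the maximum date already in the result plus 3 days, thereby skipping the 3-day window. row_number() can be used for this again, with dateadd() handling the date arithmetic.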

WITH [cte]
AS
(
-- Anchor member: the earliest row of each (CustID, TransTypeID) group starts the first chain.
SELECT [x].[RowID],
       [x].[CustID],
       [x].[TransTypeId],
       [x].[Date]
       FROM (SELECT [testTable].[RowID],
                    [testTable].[CustID],
                    [testTable].[TransTypeId],
                    [testTable].[Date],
                    row_number() OVER (PARTITION BY [testTable].[CustId],
                                                    [testTable].[TransTypeID]
                                       ORDER BY [testTable].[Date],
                                                [testTable].[RowID]) [row#]  -- RowID breaks ties between equal dates
                    FROM [dbo].[testTable]) [x]
       WHERE [x].[row#] = 1
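UNION ALL
-- Recursive member: per group, the earliest row dated more than 3 days after the previous chain start begins the next chain.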
SELECT [x].[RowID],
       [x].[CustID],
       [x].[TransTypeId],
       [x].[Date]
       FROM (SELECT [testTable].[RowID],
                    [testTable].[CustID],
                    [testTable].[TransTypeId],
                    [testTable].[Date],
                    row_number() OVER (PARTITION BY [testTable].[CustId],
                                                    [testTable].[TransTypeID]
                                       ORDER BY [testTable].[Date],
                                                [testTable].[RowID]) [row#]  -- RowID breaks ties between equal dates
                    FROM [dbo].[testTable]
                         INNER JOIN [cte]
                                    ON [cte].[CustId] = [testTable].[CustId]
                                       AND [cte].[TransTypeId] = [testTable].[TransTypeID]
                                       AND dateadd(day, 3, [cte].[Date]) < [testTable].[Date]) [x]
       WHERE [x].[row#] = 1
)
SELECT *
       FROM [cte]
       ORDER BY [cte].[CustID],
                [cte].[TransTypeID],
                [cte].[Date];

Result:

RowID | CustID | TransTypeId | Date               
----: | -----: | ----------: | :------------------
    1 |   9362 |           1 | 11/01/2018 00:00:00
    2 |   9362 |           1 | 22/01/2018 00:00:00
    3 |   9362 |           2 | 04/01/2018 00:00:00
    5 |   9362 |           2 | 09/01/2018 00:00:00
    6 |   9362 |           2 | 22/01/2018 00:00:00
    9 |   9362 |           2 | 26/01/2018 00:00:00
   10 |   9362 |           3 | 22/01/2018 00:00:00
   11 |   9362 |           5 | 01/01/2018 00:00:00
   15 |   9362 |           5 | 07/01/2018 00:00:00
   16 |   9362 |           5 | 17/01/2018 00:00:00
   17 |   9362 |           5 | 08/02/2018 00:00:00
   18 |   9362 |           5 | 18/02/2018 00:00:00

db<>fiddle

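(I assumed that the groups are defined not only by [TransTypeID] but also by [CustID]. That wasn't entirely clear to me. If my assumption is wrong, remove [CustID] from the PARTITION BY clauses and from the join condition.)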
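
As a small illustration of the boundary behaviour (a sketch, not part of the original answer): the join predicate uses a strict `<`, so a row dated exactly 3 days after a chain start still counts as inside the window. The literal dates below are taken from rows 3, 4 and 5 of the sample data; the derived-table alias [d] and the column names [Candidate] and [Outcome] are made up for this example.

```sql
-- Boundary check for the predicate dateadd(day, 3, [cte].[Date]) < [testTable].[Date].
-- Chain start: 2018-01-04 (row 3). Candidates: 2018-01-07 (row 4) and 2018-01-09 (row 5).
SELECT [d].[Candidate],
       CASE WHEN dateadd(day, 3, CAST('2018-01-04' AS date)) < [d].[Candidate]
            THEN 'eligible to start a new chain'
            ELSE 'within the 3-day window'
       END AS [Outcome]
       FROM (VALUES (CAST('2018-01-07' AS date)),
                    (CAST('2018-01-09' AS date))) AS [d] ([Candidate]);
-- 2018-01-07 -> within the 3-day window
-- 2018-01-09 -> eligible to start a new chain
```

This matches the result above, where row 4 is absent and row 5 is kept. If rows exactly n days later should start a new chain instead, the comparison would presumably need to be `<=`.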

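【Comments】:

Yes, CustID is part of the group as well. It works flawlessly! I want to flag the records rather than filter them right away, so I insert the result into a temp table, join it back to the original table, and flag the rows with no match as 0 and the others as 1. Nice!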
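
The flagging step described in this comment could also be done in a single statement, without a temp table, by left-joining the base table to the CTE. The following is a minimal sketch under a few assumptions: the variable @n and the output column [IsChainStart] are hypothetical names introduced here, and [RowID] is added to the ORDER BY inside row_number() so that ties on [Date] resolve deterministically. It has not been run against the asker's real data.

```sql
DECLARE @n int = 3;  -- hypothetical parameter: window length in days

WITH [cte]
AS
(
-- Anchor member: earliest row of each (CustID, TransTypeID) group.
SELECT [x].[RowID], [x].[CustID], [x].[TransTypeID], [x].[Date]
       FROM (SELECT [t].[RowID], [t].[CustID], [t].[TransTypeID], [t].[Date],
                    row_number() OVER (PARTITION BY [t].[CustID], [t].[TransTypeID]
                                       ORDER BY [t].[Date], [t].[RowID]) [row#]
                    FROM [dbo].[testTable] [t]) [x]
       WHERE [x].[row#] = 1
UNION ALL
-- Recursive member: earliest row dated more than @n days after the previous chain start.
SELECT [x].[RowID], [x].[CustID], [x].[TransTypeID], [x].[Date]
       FROM (SELECT [t].[RowID], [t].[CustID], [t].[TransTypeID], [t].[Date],
                    row_number() OVER (PARTITION BY [t].[CustID], [t].[TransTypeID]
                                       ORDER BY [t].[Date], [t].[RowID]) [row#]
                    FROM [dbo].[testTable] [t]
                         INNER JOIN [cte]
                                    ON [cte].[CustID] = [t].[CustID]
                                       AND [cte].[TransTypeID] = [t].[TransTypeID]
                                       AND dateadd(day, @n, [cte].[Date]) < [t].[Date]) [x]
       WHERE [x].[row#] = 1
)
-- Every base row is returned; 1 = chain start (keep), 0 = inside a window (filter out later).
SELECT [t].[RowID], [t].[CustID], [t].[TransTypeID], [t].[Date],
       CASE WHEN [cte].[RowID] IS NULL THEN 0 ELSE 1 END AS [IsChainStart]
       FROM [dbo].[testTable] [t]
            LEFT JOIN [cte]
                      ON [cte].[RowID] = [t].[RowID]
       ORDER BY [t].[CustID], [t].[TransTypeID], [t].[Date], [t].[RowID];
```

With the sample data, the rows flagged 1 should be the twelve rows listed in the result above and the remaining six rows should be flagged 0. One caveat: SQL Server stops a recursive CTE after 100 recursion levels by default, so a group with more chain starts than that would need `OPTION (MAXRECURSION 0)` (or a suitably higher limit) appended to the final statement.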
