How to identify the first gap in multiple start and end date ranges for each distinct member in T-SQL
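Asked: 2012-08-29 12:27:01
Question:
I have been working on the following without getting any results, and the deadline is approaching fast. The table also has more than a million rows like the ones shown below. I would appreciate your help with the following.
Objective: group the results by member and build a continuous coverage range for each member by combining the individual date ranges. The ranges may overlap or run back to back, and there must be no break between the start and end dates of the combined range.
I have data in the following format: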
MemberCode ----- ClaimID ----- StartDate ----- EndDate
00001 ----- 012345 ----- 2010-01-15 ----- 2010-01-20
00001 ----- 012350 ----- 2010-01-19 ----- 2010-01-22
00001 ----- 012352 ----- 2010-01-20 ----- 2010-01-25
00001 ----- 012355 ----- 2010-01-26 ----- 2010-01-30
00002 ----- 012357 ----- 2010-01-20 ----- 2010-01-25
00002 ----- 012359 ----- 2010-01-30 ----- 2010-02-05
00002 ----- 012360 ----- 2010-02-04 ----- 2010-02-15
00003 ----- 012365 ----- 2010-02-15 ----- 2010-02-30
...
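Member 00001 above is a valid member, because there is a continuous date range from 2010-01-15 to 2010-01-30 (no gaps). Note that claim ID 012355 for this member starts immediately after the end date of claim ID 012352. This is still valid because it forms a continuous range.
However, member 00002 should be an invalid member, because there is a gap of 5 days between the end date of claim ID 012357 and the start date of claim ID 012359.
What I am trying to do is get a list of only those members who have a claim on every single day of a continuous date range, with no gaps between MIN(StartDate) and MAX(EndDate) for each distinct member. Members who have gaps are discarded.
Thanks in advance.
Update:
Here is how far I have got.
Note: FILLED_DT = Start Date & PresCoverEndDT = End Date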
SELECT PresCoverEndDT, FILLED_DT
FROM
(
SELECT DISTINCT FILLED_DT, ROW_NUMBER() OVER (ORDER BY FILLED_DT) RN
FROM Temp_Claims_PRIOR_STEP_5 T1
WHERE NOT EXISTS
(SELECT * FROM Temp_Claims_PRIOR_STEP_5 T2
WHERE T1.FILLED_DT > T2.FILLED_DT AND T1.FILLED_DT< T2.PresCoverEndDT
AND T1.MBR_KEY = T2.MBR_KEY )
) T1
JOIN (SELECT DISTINCT PresCoverEndDT, ROW_NUMBER() OVER (ORDER BY PresCoverEndDT) RN
FROM Temp_Claims_PRIOR_STEP_5 T1
WHERE NOT EXISTS
(SELECT * FROM Temp_Claims_PRIOR_STEP_5 T2
WHERE T1.PresCoverEndDT > T2.FILLED_DT AND T1.PresCoverEndDT < T2.PresCoverEndDT AND T1.MBR_KEY = T2.MBR_KEY )
) T2
ON T1.RN - 1 = T2.RN
WHERE PresCoverEndDT < FILLED_DT
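The code above seems to be wrong, since I get only one row, and that row is not correct either. My desired output has just one column, as follows: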
Valid_Member_Code
00001
00007
00009
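... and so on.
Comments:
Take a look here: simple-talk.com/sql/t-sql-programming/…
Possible duplicate of islands and gaps tsql
A similar topic is discussed here: ***.com/questions/12088959/merging-unused-timeslots/…
@Chris Gessler - Sorry if the following question is naive. My requirement, however, is per distinct member. How do I use the steps in the link to find the date ranges for each distinct member? Thanks.
@Vijay - If you want to find the invalid member codes, you can self-join and look for t1.EndDate > t2.StartDate. To find all the date gaps, you have to partition the results. Are you using SQL Server, by any chance?
Answer 1:
Try this: http://www.sqlfiddle.com/#!3/c3365/20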
with s as
(
select *, row_number() over(partition by membercode order by startdate) rn
from tbl
)
,gaps as
(
select a.membercode, a.startdate, a.enddate, b.startdate as nextstartdate
,datediff(d, a.enddate, b.startdate) as gap
from s a
join s b on b.membercode = a.membercode and b.rn = a.rn + 1
)
select membercode
from gaps
group by membercode
having sum(case when gap <= 1 then 1 end) = count(*);
See the query progression here: http://www.sqlfiddle.com/#!3/c3365/20
How it works: compare each end date with the next start date of the same member and check the gap in days:
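The queries in this answer read from a table named tbl that is defined only in the linked fiddle. To run them locally, a minimal setup could look like the sketch below. The column types and the integer member codes are assumptions inferred from the outputs shown in this answer, not the fiddle's actual schema. The last end date is 2010-03-02 because that is what the answer's output shows for member 3 (the question's 2010-02-30 is not a valid date).

```sql
-- Hypothetical setup for the queries in this answer.
-- Column names follow the queries; the types are assumed.
CREATE TABLE tbl (
    membercode INT  NOT NULL,
    claimid    INT  NOT NULL,
    startdate  DATE NOT NULL,
    enddate    DATE NOT NULL
);

INSERT INTO tbl (membercode, claimid, startdate, enddate) VALUES
    (1, 12345, '2010-01-15', '2010-01-20'),
    (1, 12350, '2010-01-19', '2010-01-22'),
    (1, 12352, '2010-01-20', '2010-01-25'),
    (1, 12355, '2010-01-26', '2010-01-30'),
    (2, 12357, '2010-01-20', '2010-01-25'),
    (2, 12359, '2010-01-30', '2010-02-05'),
    (2, 12360, '2010-02-04', '2010-02-15'),
    (3, 12365, '2010-02-15', '2010-03-02');
```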
with s as
(
select *, row_number() over(partition by membercode order by startdate) rn
from tbl
)
select a.membercode, a.startdate, a.enddate, b.startdate as nextstartdate
,datediff(d, a.enddate, b.startdate) as gap
from s a
join s b on b.membercode = a.membercode and b.rn = a.rn + 1;
Output:
| MEMBERCODE | STARTDATE | ENDDATE | NEXTSTARTDATE | GAP |
--------------------------------------------------------------
| 1 | 2010-01-15 | 2010-01-20 | 2010-01-19 | -1 |
| 1 | 2010-01-19 | 2010-01-22 | 2010-01-20 | -2 |
| 1 | 2010-01-20 | 2010-01-25 | 2010-01-26 | 1 |
| 2 | 2010-01-20 | 2010-01-25 | 2010-01-30 | 5 |
| 2 | 2010-01-30 | 2010-02-05 | 2010-02-04 | -1 |
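As a side note, the same gap column could probably be computed without the self-join by using LEAD. This is only a sketch: it assumes SQL Server 2012 or later, and the table variable @claims is a hypothetical stand-in for tbl. Unlike the inner join above, it also returns each member's last claim, with NULL in nextstartdate and gap.

```sql
-- Hypothetical sample data standing in for tbl.
DECLARE @claims TABLE (membercode INT, startdate DATE, enddate DATE);

INSERT INTO @claims (membercode, startdate, enddate) VALUES
    (1, '2010-01-15', '2010-01-20'),
    (1, '2010-01-19', '2010-01-22'),
    (1, '2010-01-20', '2010-01-25'),
    (1, '2010-01-26', '2010-01-30'),
    (2, '2010-01-20', '2010-01-25'),
    (2, '2010-01-30', '2010-02-05'),
    (2, '2010-02-04', '2010-02-15');

WITH nxt AS (
    SELECT membercode, startdate, enddate,
           -- start date of the member's next claim, NULL for the last one
           LEAD(startdate) OVER (PARTITION BY membercode
                                 ORDER BY startdate) AS nextstartdate
    FROM @claims
)
SELECT membercode, startdate, enddate, nextstartdate,
       DATEDIFF(d, enddate, nextstartdate) AS gap  -- NULL on each member's last claim
FROM nxt
ORDER BY membercode, startdate;
```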
Then check whether each member's count of gapless rows matches its total row count:
with s as
(
select *, row_number() over(partition by membercode order by startdate) rn
from tbl
)
,gaps as
(
select a.membercode, a.startdate, a.enddate, b.startdate as nextstartdate
,datediff(d, a.enddate, b.startdate) as gap
from s a
join s b on b.membercode = a.membercode and b.rn = a.rn + 1
)
select membercode, count(*) as count, sum(case when gap <= 1 then 1 end) as gapless_count
from gaps
group by membercode;
Output:
| MEMBERCODE | COUNT | GAPLESS_COUNT |
--------------------------------------
| 1 | 3 | 3 |
| 2 | 2 | 1 |
Finally, filter them, keeping only the members who have no gaps in their claims:
with s as
(
select *, row_number() over(partition by membercode order by startdate) rn
from tbl
)
,gaps as
(
select a.membercode, a.startdate, a.enddate, b.startdate as nextstartdate
,datediff(d, a.enddate, b.startdate) as gap
from s a
join s b on b.membercode = a.membercode and b.rn = a.rn + 1
)
select membercode
from gaps
group by membercode
having sum(case when gap <= 1 then 1 end) = count(*);
Output:
| MEMBERCODE |
--------------
| 1 |
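Note that you don't need COUNT(*) > 1 to detect members with two or more claims. Instead of LEFT JOIN we use JOIN, which automatically discards members who have yet to have a second claim. If you opt to use LEFT JOIN instead, here is the (longer) version, with the same output as above: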
with s as
(
select *, row_number() over(partition by membercode order by startdate) rn
from tbl
)
,gaps as
(
select a.membercode, a.startdate, a.enddate, b.startdate as nextstartdate
,datediff(d, a.enddate, b.startdate) as gap
from s a
left join s b on b.membercode = a.membercode and b.rn = a.rn + 1
)
select membercode
from gaps
group by membercode
having sum(case when gap <= 1 then 1 end) = count(gap)
and count(*) > 1; -- members who have two or more claims only
Here is how the data of the query above looks before filtering:
with s as
(
select *, row_number() over(partition by membercode order by startdate) rn
from tbl
)
,gaps as
(
select a.membercode, a.startdate, a.enddate, b.startdate as nextstartdate
,datediff(d, a.enddate, b.startdate) as gap
from s a
left join s b on b.membercode = a.membercode and b.rn = a.rn + 1
)
select * from gaps;
Output:
| MEMBERCODE | STARTDATE | ENDDATE | NEXTSTARTDATE | GAP |
-----------------------------------------------------------------
| 1 | 2010-01-15 | 2010-01-20 | 2010-01-19 | -1 |
| 1 | 2010-01-19 | 2010-01-22 | 2010-01-20 | -2 |
| 1 | 2010-01-20 | 2010-01-25 | 2010-01-26 | 1 |
| 1 | 2010-01-26 | 2010-01-30 | (null) | (null) |
| 2 | 2010-01-20 | 2010-01-25 | 2010-01-30 | 5 |
| 2 | 2010-01-30 | 2010-02-05 | 2010-02-04 | -1 |
| 2 | 2010-02-04 | 2010-02-15 | (null) | (null) |
| 3 | 2010-02-15 | 2010-03-02 | (null) | (null) |
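EDIT, following the requirement clarification:
You clarified that you want to include members who have yet to have a second claim, so do this instead: http://sqlfiddle.com/#!3/c3365/22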
with s as
(
select *, row_number() over(partition by membercode order by startdate) rn
from tbl
)
,gaps as
(
select a.membercode, a.startdate, a.enddate, b.startdate as nextstartdate
,datediff(d, a.enddate, b.startdate) as gap
from s a
left join s b on b.membercode = a.membercode and b.rn = a.rn + 1
)
select membercode
from gaps
group by membercode
having sum(case when gap <= 1 then 1 end) = count(gap)
-- members who have yet to have a second claim are valid too
or count(nextstartdate) = 0;
Output:
| MEMBERCODE |
--------------
| 1 |
| 3 |
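The technique is to count the member's nextstartdate values. If a member has no next start date (i.e. count(nextstartdate) = 0), then they have only a single claim and are valid too, so just attach this OR condition: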
or count(nextstartdate) = 0;
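Actually, the condition below would suffice too. I wanted the query to be more self-documenting, though, so I recommend counting on the member's nextstartdate. Here is the alternative condition for finding members who have yet to have a second claim: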
or count(*) = 1;
By the way, we also have to change the comparison from this:
sum(case when gap <= 1 then 1 end) = count(*)
to this (since we are now using LEFT JOIN):
sum(case when gap <= 1 then 1 end) = count(gap)
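To see why both the count(gap) change and the extra OR condition are needed, here is a small sketch on a hypothetical table variable @g that mimics the LEFT JOIN output: member 1 has two real gaps plus the trailing NULL row, and member 3 has only the NULL row.

```sql
-- Hypothetical rows shaped like the LEFT JOIN result (gap is NULL on a member's last claim).
DECLARE @g TABLE (membercode INT, gap INT NULL);

INSERT INTO @g (membercode, gap) VALUES
    (1, -1), (1, 1), (1, NULL),
    (3, NULL);

SELECT membercode,
       COUNT(*)                             AS all_rows,      -- counts the NULL row too
       COUNT(gap)                           AS non_null_gaps, -- ignores NULLs
       SUM(CASE WHEN gap <= 1 THEN 1 END)   AS gapless_count  -- NULL when no row qualifies
FROM @g
GROUP BY membercode
ORDER BY membercode;
```

For member 1 this gives all_rows = 3, non_null_gaps = 2 and gapless_count = 2, so comparing against count(*) would wrongly reject the member. For member 3, gapless_count is NULL while non_null_gaps is 0, and NULL = 0 is not true, which is why single-claim members need the separate OR condition.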
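Comments:
Thanks a lot, Michael! The updated query returned 37052 valid members out of a total of 1.67 million claims. The query progression and your explanation of the steps were very educational :-)
Hi Michael, since I also need the members who have yet to make a second claim (in other words, members with a single claim are valid in my case), I used LEFT JOIN instead of JOIN as per your note at the end, above. However, I get 0 results. Am I missing something here?
@Vijay The answer has been amended per your clarified requirement ツ
Thanks Michael :-) I was trying to figure out why I was getting 0. The revised code does give me the desired output when I spot-check members at random. Also, thanks for the excellent detailed explanation and the query progression!
Answer 2:
Try this. It partitions the rows by MemberCode and gives them ordinal numbers. It then compares each row with the row that has the next num value; if the difference between one row's end date and the next row's start date is more than one day, the member is invalid: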
DECLARE @t TABLE (MemberCode VARCHAR(100), ClaimID INT, StartDate DATETIME, EndDate DATETIME)
INSERT @t
VALUES
('00001' , 012345 , '2010-01-15' , '2010-01-20')
,('00001' , 012350 , '2010-01-19' , '2010-01-22')
,('00001' , 012352 , '2010-01-20' , '2010-01-25')
,('00001' , 012355 , '2010-01-26' , '2010-01-30')
,('00002' , 012357 , '2010-01-20' , '2010-01-25')
,('00002' , 012359 , '2010-01-30' , '2010-02-05')
,('00002' , 012360 , '2010-02-04' , '2010-02-15')
,('00003' , 012365 , '2010-02-15' , '2010-02-28')
,('00004' , 012366 , '2010-03-18' , '2010-03-23')
,('00005' , 012367 , '2010-03-19' , '2010-03-25')
,('00006' , 012368 , '2010-03-20' , '2010-03-21')
;WITH tbl AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY MemberCode ORDER BY StartDate)
AS num
FROM @t
), invalid AS (
SELECT tbl.MemberCode
FROM tbl
JOIN tbl _tbl ON
tbl.num = _tbl.num - 1
AND tbl.MemberCode = _tbl.MemberCode
WHERE DATEDIFF(DAY, tbl.EndDate, _tbl.StartDate) > 1
)
SELECT MemberCode
FROM tbl
EXCEPT
SELECT MemberCode
FROM invalid
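Comments:
Thanks Ivan. I ran your query on the same table on which I ran Michael's code above. The results differ widely, however: your code returns 328,256 claims (for 112,077 DISTINCT members), whereas Michael's code returns 37,052 valid members.
Thanks for trying it out; that is an interesting result. If you have any sample data for which the query fails to exclude an invalid member, please post it. I would like to know where it fails.
@Vijay I believe I have found one difference between my query and Michael's: mine does not exclude certain rows because I consider them valid. These rows have only one occurrence for a given member code, and I have added them to the sample input data: 0004, 0005, 0006. Do you consider these rows valid or invalid?
Yes, a member code with only one claim is a valid member.
In that case, this query returns those members, which is probably why it returns more members.
Answer 3:
I think your queries return false positives, because they only check the gap between consecutive rows. As I see it, a gap can be covered by an earlier row. Take this example:
Row 1: 2010-01-01 | 2010-01-31
Row 2: 2010-01-10 | 2010-01-15
Row 3: 2010-01-20 | 2010-01-25
Your code reports a gap between rows 2 and 3, whereas that gap is filled by row 1. Your code does not detect this. You should use the MAX(EndDate) of all previous rows in the DATEDIFF function.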
DECLARE @t TABLE (PersonID VARCHAR(100), StartDate DATETIME, EndDate DATETIME)
INSERT @t VALUES('00001' , '2010-01-01' , '2010-01-17')
INSERT @t VALUES('00001' , '2010-01-19' , '2010-01-22')
INSERT @t VALUES('00001' , '2010-01-20' , '2010-01-25')
INSERT @t VALUES('00001' , '2010-01-26' , '2010-01-31')
INSERT @t VALUES('00002' , '2010-01-20' , '2010-01-25')
INSERT @t VALUES('00002' , '2010-02-04' , '2010-02-05')
INSERT @t VALUES('00002' , '2010-02-04' , '2010-02-15')
INSERT @t VALUES('00003' , '2010-02-15' , '2010-02-28')
INSERT @t VALUES('00004' , '2010-03-18' , '2010-03-23')
INSERT @t VALUES('00005' , '2010-03-19' , '2010-03-25')
INSERT @t VALUES('00006' , '2010-01-01' , '2010-04-20')
INSERT @t VALUES('00006' , '2010-01-20' , '2010-01-21')
INSERT @t VALUES('00006' , '2010-01-25' , '2010-01-26')
;WITH tbl AS (
SELECT
*, ROW_NUMBER() OVER (PARTITION BY PersonID ORDER BY StartDate) AS num
FROM @t
), invalid AS (
SELECT tbl.PersonID
FROM tbl
JOIN tbl _tbl ON
tbl.num = _tbl.num - 1 AND tbl.PersonID = _tbl.PersonID
WHERE DATEDIFF(DAY, (SELECT MAX(tbl3.EndDate) FROM tbl tbl3 WHERE tbl3.num <= tbl.num AND tbl3.PersonID = tbl.PersonID), _tbl.StartDate) > 1
)
SELECT PersonID
FROM tbl
EXCEPT
SELECT PersonID
FROM invalid
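The correlated subquery above recomputes the running maximum for every row, which may be slow on millions of rows. As a sketch only, the same comparison (each start date against the latest end date seen so far for that member) could be written with a windowed MAX. This assumes SQL Server 2012 or later for the ROWS frame; the table variable @c and the name PrevMaxEnd are hypothetical.

```sql
-- Hypothetical sample data: 00002 has a real gap, 00003 has a single claim,
-- 00006 has an apparent gap that is covered by its first, long claim.
DECLARE @c TABLE (PersonID VARCHAR(100), StartDate DATETIME, EndDate DATETIME);

INSERT INTO @c (PersonID, StartDate, EndDate) VALUES
    ('00002', '2010-01-20', '2010-01-25'),
    ('00002', '2010-01-30', '2010-02-05'),
    ('00003', '2010-02-15', '2010-02-28'),
    ('00006', '2010-01-01', '2010-04-20'),
    ('00006', '2010-01-20', '2010-01-21'),
    ('00006', '2010-01-25', '2010-01-26');

WITH running AS (
    SELECT PersonID, StartDate,
           -- latest end date among all earlier claims of the same person
           MAX(EndDate) OVER (PARTITION BY PersonID
                              ORDER BY StartDate, EndDate
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS PrevMaxEnd
    FROM @c
)
SELECT PersonID
FROM running
GROUP BY PersonID
-- a person is valid when no claim starts more than one day after PrevMaxEnd;
-- the first claim has PrevMaxEnd = NULL and falls into the ELSE branch
HAVING MAX(CASE WHEN DATEDIFF(DAY, PrevMaxEnd, StartDate) > 1 THEN 1 ELSE 0 END) = 0
ORDER BY PersonID;
```

With this sample data the query returns 00003 and 00006 and leaves out 00002, whose second claim starts five days after the first one ends.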