Select continuous ranges from table

【Posted】: 2012-02-09 15:13:08

【Question】:

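I need to extract continuous ranges from a table, based on consecutive numbers (column N) and on the same "category" those numbers belong to (column C below). Graphically, it looks like this: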

 N  C  D
--------
 1  x  a           C  N1  N2  D1  D2
 2  x  b          ------------------
 3  x  c           x   1   4   a   d     (continuous range with same C)
 4  x  d    ==>    x   6   7   e   f     (new range because "5" is missing)
 6  x  e           y   8  10   g   i     (new range because C changed to "y")
 7  x  f
 8  y  g
 9  y  h
10  y  i

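The SQL Server version is 2005. Thanks.

【Question comments】:

Can this be done with a stored procedure?

If you have access to the SQL Cookbook, this is recipe 10.3. amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/…

Very far-fetched.

@MattFenwick: Thanks, it looks like that recipe can do what I need, once I have had time to digest it.

【Answer 1】: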
DECLARE @myTable Table
(
    N INT,
    C CHAR(1),
    D CHAR(1)
)
INSERT INTO @myTable(N,C,D) VALUES(1,  'x', 'a');
INSERT INTO @myTable(N,C,D) VALUES(2,  'x', 'b');
INSERT INTO @myTable(N,C,D) VALUES(3,  'x', 'c');
INSERT INTO @myTable(N,C,D) VALUES(4,  'x', 'd');
INSERT INTO @myTable(N,C,D) VALUES(6,  'x', 'e');
INSERT INTO @myTable(N,C,D) VALUES(7,  'x', 'f');
INSERT INTO @myTable(N,C,D) VALUES(8,  'y', 'g');
INSERT INTO @myTable(N,C,D) VALUES(9,  'y', 'h');
INSERT INTO @myTable(N,C,D) VALUES(10, 'y', 'i');


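WITH StartingPoints AS(
    -- Rows with no predecessor (N - 1) in the same category start a range.
    SELECT A.*, ROW_NUMBER() OVER(PARTITION BY A.C ORDER BY A.N) AS rownum
    FROM @myTable AS A
    WHERE NOT EXISTS(
        SELECT *
        FROM @myTable B
        WHERE B.C = A.C
          AND B.N = A.N - 1
    )
 ),
 EndingPoints AS(
    -- Rows with no successor (N + 1) in the same category end a range.
    SELECT A.*, ROW_NUMBER() OVER(PARTITION BY A.C ORDER BY A.N) AS rownum
    FROM @myTable AS A
    WHERE NOT EXISTS (
        SELECT *
        FROM @myTable B
        WHERE B.C = A.C
          AND B.N = A.N + 1
    )
 )
-- Within each category, the k-th start pairs with the k-th end.
-- Numbering per category keeps the pairing correct even when the
-- N values of different categories interleave.
SELECT StartingPoints.C,
       StartingPoints.N AS [N1],
       EndingPoints.N AS [N2],
       StartingPoints.D AS [D1],
       EndingPoints.D AS [D2]
FROM StartingPoints
JOIN EndingPoints ON StartingPoints.C = EndingPoints.C
                 AND StartingPoints.rownum = EndingPoints.rownum
ORDER BY StartingPoints.C, StartingPoints.N;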

Result:

C    N1          N2          D1   D2
---- ----------- ----------- ---- ----
x    1           4           a    d
x    6           7           e    f
y    8           10          g    i

【Comments】:

【Answer 2】:

The RANK function is safer than ROW_NUMBER in case any N values are repeated, as in the following example:

declare @ncd table(N int, C char, D char);

insert into @ncd
select 1,'x','a' union all
select 2,'x','b' union all
select 3,'x','c' union all
select 4,'x','d' union all
select 4,'x','e' union all
select 7,'x','f' union all
select 8,'y','g' union all
select 9,'y','h' union all
select 10,'y','i' union all
select 10,'y','j';

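with a as (
    select *
    -- N minus its rank is constant within a run of consecutive N values.
    -- Note: RANK leaves a gap after ties, so a duplicated N that is
    -- followed by N+1 still splits the run; DENSE_RANK has no such gap.
    , r = N-rank()over(partition by C order by N)
    from @ncd
)
select C
, N1=MIN(N)
, N2=MAX(N)
, D1=MIN(D)   -- assumes D sorts in the same order as N
, D2=MAX(D)
from a
group by C, r;   -- group by C too, so equal r values in different categories stay apart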

Result, which correctly tolerates the duplicated 4 and 10:

C    N1          N2          D1   D2
---- ----------- ----------- ---- ----
x    1           4           a    e
x    7           7           f    f
y    8           10          g    j
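
One caveat is worth checking against your own data: RANK skips numbers after a tie, so when a duplicated N is followed by N+1 the difference N - RANK() changes and the run is split in two. The following is a minimal sketch of the same idea using DENSE_RANK, which does not skip. The sample rows are hypothetical (not from the question): category x has a duplicated 3 in the middle of a run, and category y reuses the same N values to show why the grouping includes C.

```sql
declare @ncd table(N int, C char, D char);

-- Hypothetical rows: x has a duplicated 3 followed by 4; y overlaps x in N.
insert into @ncd
select 1,'x','a' union all
select 2,'x','b' union all
select 3,'x','c' union all
select 3,'x','d' union all
select 4,'x','e' union all
select 1,'y','f' union all
select 2,'y','g';

with a as (
    select *
    -- DENSE_RANK gives ties the same rank without leaving a gap,
    -- so N minus the rank stays constant across the duplicated 3.
    , r = N - dense_rank() over (partition by C order by N)
    from @ncd
)
select C
, N1 = min(N)
, N2 = max(N)
, D1 = min(D)   -- as in the answer above, assumes D sorts like N
, D2 = max(D)
from a
group by C, r   -- r is 0 for both categories here, so C must be part of the key
order by C, N1;

-- Expected rows:
-- x  1  4  a  e
-- y  1  2  f  g
```

With RANK in place of DENSE_RANK, the x rows would come back as two ranges (1 to 3 and 4 to 4), because the rank jumps from 3 to 5 after the tie.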

【Comments】:

【Answer 3】:

Using this answer as a starting point, I arrived at the following:

;
WITH data (N, C, D) AS (
  SELECT 1,  'x', 'a' UNION ALL
  SELECT 2,  'x', 'b' UNION ALL
  SELECT 3,  'x', 'c' UNION ALL
  SELECT 4,  'x', 'd' UNION ALL
  SELECT 6,  'x', 'e' UNION ALL
  SELECT 7,  'x', 'f' UNION ALL
  SELECT 8,  'y', 'g' UNION ALL
  SELECT 9,  'y', 'h' UNION ALL
  SELECT 10, 'y', 'i'
),
ranked AS (
  SELECT
    curr.*,
    Grp     = curr.N - ROW_NUMBER() OVER (PARTITION BY curr.C ORDER BY curr.N),
    IsStart = CASE WHEN pred.C IS NULL THEN 1 ELSE 0 END,
    IsEnd   = CASE WHEN succ.C IS NULL THEN 1 ELSE 0 END
  FROM data AS curr
    LEFT JOIN data AS pred ON curr.C = pred.C AND curr.N = pred.N + 1
    LEFT JOIN data AS succ ON curr.C = succ.C AND curr.N = succ.N - 1
)
SELECT
  C,
  N1 = MIN(N),
  N2 = MAX(N),
  D1 = MAX(CASE IsStart WHEN 1 THEN D END),
  D2 = MAX(CASE IsEnd   WHEN 1 THEN D END)
FROM ranked
WHERE 1 IN (IsStart, IsEnd)
GROUP BY C, Grp
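
To see why Grp identifies each range, it can help to look at the intermediate values. The sketch below only recomputes the row number and Grp for the same sample rows; the column name RowNum is added here for illustration. Within one C, N and the row number both grow by 1 along a run, so their difference stays constant, and it jumps as soon as a number is missing.

```sql
WITH data (N, C, D) AS (
  SELECT 1,  'x', 'a' UNION ALL
  SELECT 2,  'x', 'b' UNION ALL
  SELECT 3,  'x', 'c' UNION ALL
  SELECT 4,  'x', 'd' UNION ALL
  SELECT 6,  'x', 'e' UNION ALL
  SELECT 7,  'x', 'f' UNION ALL
  SELECT 8,  'y', 'g' UNION ALL
  SELECT 9,  'y', 'h' UNION ALL
  SELECT 10, 'y', 'i'
)
SELECT
  N, C, D,
  RowNum = ROW_NUMBER() OVER (PARTITION BY C ORDER BY N),
  Grp    = N - ROW_NUMBER() OVER (PARTITION BY C ORDER BY N)
FROM data
ORDER BY C, N;

-- N   C  D  RowNum  Grp
-- 1   x  a  1       0
-- 2   x  b  2       0
-- 3   x  c  3       0
-- 4   x  d  4       0
-- 6   x  e  5       1    (5 is missing, so Grp changes)
-- 7   x  f  6       1
-- 8   y  g  1       7    (numbering restarts for y)
-- 9   y  h  2       7
-- 10  y  i  3       7
```

Grouping by C and Grp therefore yields one group per range.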

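【Comments】:

【Answer 4】:

Write a stored procedure. It will create and populate a temporary table with the columns C, N1, N2, D1 and D2.

- Create the temporary table
- Loop with a cursor over the entries of the table containing N, C, D, ordered by N
- Use variables to detect a new range (Ni …)
- Insert into the temporary table for each detected range (new range detected or end of the cursor)

Let me know if you need a code example.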
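
The steps above could be sketched roughly as follows. This is an illustration, not the answerer's code: it is written as a plain batch rather than a stored procedure, the temp table names #src and #ranges and all variable names are assumptions, and the cursor orders by C, N rather than N alone so that categories with interleaved N values are still handled. It assumes N is unique within a category.

```sql
-- Sample data (the table name #src is an assumption for this sketch).
CREATE TABLE #src (N INT, C CHAR(1), D CHAR(1));
INSERT INTO #src (N, C, D)
SELECT 1,'x','a' UNION ALL SELECT 2,'x','b' UNION ALL SELECT 3,'x','c' UNION ALL
SELECT 4,'x','d' UNION ALL SELECT 6,'x','e' UNION ALL SELECT 7,'x','f' UNION ALL
SELECT 8,'y','g' UNION ALL SELECT 9,'y','h' UNION ALL SELECT 10,'y','i';

-- Result table: one row per detected range.
CREATE TABLE #ranges (C CHAR(1), N1 INT, N2 INT, D1 CHAR(1), D2 CHAR(1));

DECLARE @N INT, @C CHAR(1), @D CHAR(1);                            -- current row
DECLARE @C0 CHAR(1), @N1 INT, @N2 INT, @D1 CHAR(1), @D2 CHAR(1);   -- open range
DECLARE @open BIT;
SET @open = 0;

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT N, C, D FROM #src ORDER BY C, N;
OPEN cur;
FETCH NEXT FROM cur INTO @N, @C, @D;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @open = 1 AND @C = @C0 AND @N = @N2 + 1
    BEGIN
        -- The row continues the open range: move its end forward.
        SET @N2 = @N;
        SET @D2 = @D;
    END
    ELSE
    BEGIN
        -- The row starts a new range: store the previous one, if any.
        IF @open = 1
            INSERT INTO #ranges (C, N1, N2, D1, D2)
            VALUES (@C0, @N1, @N2, @D1, @D2);
        SELECT @C0 = @C, @N1 = @N, @N2 = @N, @D1 = @D, @D2 = @D, @open = 1;
    END
    FETCH NEXT FROM cur INTO @N, @C, @D;
END

-- End of the cursor: store the last open range.
IF @open = 1
    INSERT INTO #ranges (C, N1, N2, D1, D2)
    VALUES (@C0, @N1, @N2, @D1, @D2);

CLOSE cur;
DEALLOCATE cur;

SELECT C, N1, N2, D1, D2 FROM #ranges ORDER BY C, N1;
-- Expected rows:
-- x  1  4   a  d
-- x  6  7   e  f
-- y  8  10  g  i

DROP TABLE #ranges;
DROP TABLE #src;
```

The loop reads each row once, but it does so row by row, so the set-based queries in the other answers are usually the better fit for very large tables.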

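【Comments】:

Thanks, I will work out the code. I was hoping for a non-cursor solution, though. The source table may be millions of rows long.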
