Finding rows with consecutive increase in the values of a column

【Posted】: 2012-04-27 16:32:38 【Question】:

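I have a SQL table that stores daily stock prices. New records are inserted every day after the market closes. I want to find the stocks whose prices are increasing consecutively.

The table has many columns, but this is the relevant subset: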

quoteid     stockid      closeprice     createdate
--------------------------------------------------
    1           1               1       01/01/2012
    2           2              10       01/01/2012
    3           3              15       01/01/2012

    4           1               2       01/02/2012
    5           2              11       01/02/2012
    6           3              13       01/02/2012

    7           1               5       01/03/2012
    8           2              13       01/03/2012
    9           3              17       01/03/2012

   10           1               7       01/04/2012
   11           2              14       01/04/2012
   12           3              18       01/04/2012

   13           1               9       01/05/2012
   14           2              11       01/05/2012
   15           3              10       01/05/2012

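The quoteid column is the primary key.

In the table, the closing price of stock id 1 increases every day. Stock id 3 fluctuates a lot, and the price of stock id 2 fell on the last day.

I am looking for a result like this: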

stockid     Consecutive Count (CC)
----------------------------------
    1                5
    2                4

Even better would be output that also includes the dates of the streak:

stockid     Consecutive Count (CC)      StartDate      EndDate
---------------------------------------------------------------
    1                5                 01/01/2012    01/05/2012
    2                4                 01/01/2012    01/04/2012

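StartDate is when the price started to increase and EndDate is when the bull run actually ended.

I don't think this is an easy problem. I have looked at other posts here that deal with this kind of consecutive scenario, but they don't fit my needs. If you know of a post similar to mine, please let me know.

【Question comments】:

What is the minimum length of consecutive increases you want: just more than one day? Or should increases somehow be offset by decreases? I assume you want to see multiple runs, if the data contains them.

Are there any gaps in the data (weekends, for example), and what needs to be done about them?

I have no rule for the consecutive increase; the price just has to be greater than the previous day's. Yes, I am looking for multiple runs. I will run this query against the last 3 months, 6 months or more of data. The data will have gaps; we can use the primary key column to get the previous day's record.

【Answer 1】: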

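In any case, it helps to frame this in terms of a row number that increases per stock (the actual quoteid values aren't really helpful here). Counting the days captured in this table is the easiest option. If you want something else (business days only, ignoring weekends/holidays, or similar), it gets more involved; you would probably need a calendar file. You will want an index over [stockid, createdate], if you don't have one already.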

WITH StockRow AS (SELECT stockId, closePrice, createdDate,
                         ROW_NUMBER() OVER(PARTITION BY stockId 
                                           ORDER BY createdDate) rn
                  FROM Quote),

     RunGroup AS (SELECT Base.stockId, Base.createdDate,
                         MAX(Restart.rn) OVER(PARTITION BY Base.stockId
                                              ORDER BY Base.createdDate) groupingId
                  FROM StockRow Base
                  LEFT JOIN StockRow Restart
                         ON Restart.stockId = Base.stockId
                            AND Restart.rn = Base.rn - 1
                            AND Restart.closePrice > Base.closePrice)

SELECT stockId, 
       COUNT(*) AS consecutiveCount, 
       MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate

This yields the following results from the supplied data:

Increasing_Run
stockId   consecutiveCount  startDate    endDate
===================================================
1         5                 2012-01-01   2012-01-05
2         4                 2012-01-01   2012-01-04
3         3                 2012-01-02   2012-01-04

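SQL Fiddle Example (the fiddle also includes an example with multiple runs)

This analysis ignores all gaps and correctly matches all runs (including the next time a positive run starts).

So what is happening here?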

StockRow AS (SELECT stockId, closePrice, createdDate,
                    ROW_NUMBER() OVER(PARTITION BY stockId 
                                      ORDER BY createdDate) rn
             FROM Quote)

This CTE serves one purpose: we need a way to find the next/previous row, so first we number each row in (date) order...

RunGroup AS (SELECT Base.stockId, Base.createdDate,
                    MAX(Restart.rn) OVER(PARTITION BY Base.stockId
                                         ORDER BY Base.createdDate) groupingId
             FROM StockRow Base
             LEFT JOIN StockRow Restart
                    ON Restart.stockId = Base.stockId
                       AND Restart.rn = Base.rn - 1
                       AND Restart.closePrice > Base.closePrice)

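... then join the rows to each other based on that index. If your database supports LAG()/LEAD(), using them instead will almost certainly be the better option. There is one key thing here, though: a match happens only when a row is out of sequence (lower than the previous row). Otherwise the value ends up null (with LAG(), you would need something like CASE afterwards to achieve this). You get a temporary set that looks like this (shown for stock 3):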

B.rn   B.closePrice   B.createdDate  R.rn   R.closePrice   R.createdDate  groupingId
1      15             2012-01-01     -      -              -              -
2      13             2012-01-02     1      15             2012-01-01     1
3      17             2012-01-03     -      -              -              1
4      18             2012-01-04     -      -              -              1
5      10             2012-01-05     4      18             2012-01-04     4

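... so Restart only has values when the previous row's price was greater than the "current" row's. MAX() is used as a window function here to carry forward the greatest row index seen so far. Because MAX() ignores nulls, every following row keeps that index until another mismatch occurs (which supplies a new, larger value). At this point we essentially have the intermediate result of a gaps-and-islands query, ready for the final aggregation.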
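To make the running MAX() concrete, here is a minimal Python sketch that reproduces the groupingId column of the table above for stock 3. The function name `grouping_ids` is hypothetical and exists only for this illustration; `None` stands in for SQL NULL.

```python
# Stock 3 closing prices from the question, already in date order.
closes = [15, 13, 17, 18, 10]


def grouping_ids(prices):
    """Mimic MAX(Restart.rn) OVER (... ORDER BY createdDate) for one stock."""
    ids = []
    current = None  # running maximum so far; None plays the role of SQL NULL
    for rn, price in enumerate(prices, start=1):
        # The LEFT JOIN only matches when the previous row (rn - 1) closed higher.
        if rn > 1 and prices[rn - 2] > price:
            current = rn - 1  # Restart.rn, always larger than any earlier match
        ids.append(current)
    return ids


print(grouping_ids(closes))  # [None, 1, 1, 1, 4]
```

Rows that share a grouping id belong to the same non-decreasing run, which is what the final GROUP BY relies on.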

SELECT stockId, 
       COUNT(*) AS consecutiveCount, 
       MIN(createdDate) AS startDate, MAX(createdDate) AS endDate
FROM RunGroup
GROUP BY stockId, groupingId
HAVING COUNT(*) >= 3
ORDER BY stockId, startDate

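The last part of the query gets the start and end dates of each run and counts the number of entries between those dates. If the date calculation needs anything more complicated, it would probably have to happen at this point. The GROUP BY shows one of the few legitimate instances of not including a grouped column (groupingId) in the SELECT clause. The HAVING clause eliminates runs that are "too short".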
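The answer notes that LAG() would likely be a better fit than the self-join. The following is a rough sketch of that variant, not the author's query: it swaps the running MAX() for a running SUM() over a restart flag, and it runs against SQLite through Python's `sqlite3` module purely so the example is self-contained (window functions need SQLite 3.25 or later). The table layout follows the question; the names `Flagged`, `restart` and `find_runs` are assumptions made for this illustration.

```python
import sqlite3

# Sample data from the question as (stockid, closeprice, createdate), ISO dates.
PRICES = {1: [1, 2, 5, 7, 9], 2: [10, 11, 13, 14, 11], 3: [15, 13, 17, 18, 10]}
ROWS = [(stock, price, f"2012-01-{day:02d}")
        for stock, closes in PRICES.items()
        for day, price in enumerate(closes, start=1)]

QUERY = """
WITH Flagged AS (
    -- restart = 1 when the price dropped compared with the previous row
    SELECT stockid, createdate,
           CASE WHEN closeprice < LAG(closeprice) OVER (PARTITION BY stockid
                                                        ORDER BY createdate)
                THEN 1 ELSE 0 END AS restart
    FROM Quote
),
RunGroup AS (
    -- a running total of the flags gives every run its own id
    SELECT stockid, createdate,
           SUM(restart) OVER (PARTITION BY stockid
                              ORDER BY createdate) AS groupingId
    FROM Flagged
)
SELECT stockid, COUNT(*) AS consecutiveCount,
       MIN(createdate) AS startDate, MAX(createdate) AS endDate
FROM RunGroup
GROUP BY stockid, groupingId
HAVING COUNT(*) >= ?
ORDER BY stockid, startDate
"""


def find_runs(rows, min_length=3):
    """Load the rows into an in-memory table and return the qualifying runs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Quote (stockid INTEGER, closeprice INTEGER, createdate TEXT)")
        conn.executemany("INSERT INTO Quote VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY, (min_length,)).fetchall()
    finally:
        conn.close()


for run in find_runs(ROWS):
    print(run)
```

As in the original query, equal prices do not end a run; changing `<` to `<=` in the CASE expression would make a flat day start a new run.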

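【Comments】:

This counts consecutive increases, but it also counts consecutive stagnation (the price neither rising nor falling). Can the code be modified to count only consecutive increases?

I figured it out. We can do this by replacing > with >= in AND Restart.closePrice > Base.closePrice

【Answer 2】: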

I would try a CTE, roughly like this:

-- cc counts day-over-day increases, so a run spanning N days has cc = N - 1
with increase (stockid, startdate, enddate, cc) as
(
    -- anchor: every pair of adjacent days where the price rose
    select d2.stockid, d1.createdate as startdate, d2.createdate as enddate, 1
    from quote d1, quote d2
    where d1.stockid = d2.stockid
    and d2.closeprice > d1.closeprice
    and dateadd(day, 1, d1.createdate) = d2.createdate

    union all

    -- recursive step: extend an existing run one day further back
    select d2.stockid, d1.createdate as startdate, cend.enddate as enddate, cend.cc + 1
    from quote d1, quote d2, increase cend
    where d1.stockid = d2.stockid and d2.stockid = cend.stockid
    and d2.closeprice > d1.closeprice
    and d2.createdate = cend.startdate
    and dateadd(day, 1, d1.createdate) = d2.createdate
)
-- keep only the longest run for each stock and end date
select o.stockid, o.cc, o.startdate, o.enddate
from increase o where cc = (select max(cc) from increase i where i.stockid = o.stockid and i.enddate = o.enddate)

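This assumes there are no gaps. If there are, the condition dateadd(day, 1, d1.createdate) = d2.createdate has to be replaced with something else that indicates whether d2 is the "next" day after d1.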
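One way to read "next" when the data has gaps is "the next row for the same stock in date order", which is what joining on a per-stock row number achieves. A small Python sketch of that pairing follows; the function name `next_day_pairs` and the sample dates are made up for this illustration.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def next_day_pairs(quotes):
    """Yield (stockid, earlier_date, later_date) for adjacent quotes of each stock.

    `quotes` is an iterable of (stockid, createdate) tuples in any order.
    """
    ordered = sorted(quotes, key=itemgetter(0, 1))
    for stockid, group in groupby(ordered, key=itemgetter(0)):
        dates = [created for _, created in group]
        # Pair every date with the one that follows it, however many
        # calendar days lie in between.
        for earlier, later in zip(dates, dates[1:]):
            yield stockid, earlier, later


# Thursday, Friday, then Monday: the weekend is a gap in the data.
sample = [(1, date(2012, 1, 9)), (1, date(2012, 1, 5)), (1, date(2012, 1, 6))]
for pair in next_day_pairs(sample):
    print(pair)
```

A strict "plus one calendar day" comparison would drop the Friday to Monday pair, while pairing by position keeps it.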

【Answer 3】:

This is the final working SQL for my needs. Testing shows that it works fine. I am using @Oran's method for the CC (consecutive count).

WITH StockRow (stockId, [close], createdDate, rowNum)
 as
 (
     SELECT stockId,         [close],                   createdDate,
            ROW_NUMBER() OVER(PARTITION BY stockId ORDER BY createdDate)
     FROM dbo.Quote
     where createddate >= '01/01/2012' --Beginning of this year
     ),

     RunStart (stockId, [close], createdDate, runId) as (
     SELECT      a.stockId,       a.[close], a.createdDate,
            ROW_NUMBER() OVER(PARTITION BY a.stockId ORDER BY a.createdDate)
     FROM StockRow as a
     LEFT JOIN StockRow as b
     ON b.stockId = a.stockId
     AND b.rowNum = a.rowNum - 1
     AND b.[close] < a.[close]
     WHERE b.stockId IS NULL)
     ,
 RunEnd (stockId, [close], createdDate, runId) as (
     SELECT a.stockId, a.[close], a.createdDate,
            ROW_NUMBER() OVER(PARTITION BY a.stockId ORDER BY a.createdDate)
     FROM StockRow as a
     LEFT JOIN StockRow as b
     ON b.stockId = a.stockId
     AND b.rowNum = a.rowNum + 1
     AND b.[close] > a.[close]
     WHERE b.stockId IS NULL) 

SELECT a.stockId, s.companyname, s.Symbol,
       a.createdDate AS startdate, b.createdDate AS enddate,
       (SELECT COUNT(r.createdDate)
        FROM dbo.quote r
        WHERE r.stockid = b.stockid
          AND r.createdDate BETWEEN a.createdDate AND b.createdDate) AS BullRunDuration
FROM RunStart AS a
JOIN RunEnd AS b
  ON b.stockId = a.stockId
JOIN dbo.stock AS s
  ON a.stockid = s.stockid
 AND b.runId = a.runId
 AND b.[close] > a.[close]
 AND (SELECT COUNT(r.createdDate)
      FROM dbo.quote r
      WHERE r.stockid = b.stockid
        AND r.createdDate BETWEEN a.createdDate AND b.createdDate) > 2 -- trying to avoid clutter
ORDER BY 6 DESC, a.stockid
