Fill date gaps and interpolate values


【Title】: Fill date gaps and Interpolate values 【Posted】: 2019-07-31 14:16:22 【Question】:

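I have a very simple SQL Server table containing a date (at day granularity), an equipment name, and the cumulative engine hours. The raw data table shows that there are gaps between dates. I need to fill those gaps and interpolate, so that the new rows get an hours value. The "Desired result" table shows what the final product should look like.

My initial idea was to create a "dates" table (with a recursive function) and use a left join to build the complete table, but filling the hours column with interpolated data at that stage is beyond me. Any ideas?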

Raw data

+------------+-----------+-------+
| Date       | Equipment | Hours |
+------------+-----------+-------+
| 2019/01/01 | EQ1       | 50    |
| 2019/01/02 | EQ1       | 67    |
| 2019/01/03 | EQ1       | 87    |
| 2019/01/04 | EQ1       | 105   |
| 2019/01/07 | EQ1       | 150   |
| 2019/01/08 | EQ1       | 169   |
| 2019/01/09 | EQ1       | 187   |
| 2019/01/12 | EQ1       | 247   |
| 2019/01/13 | EQ1       | 265   |
+------------+-----------+-------+
| 2019/01/01 | EQ2       | 150   |
| 2019/01/02 | EQ2       | 168   |
| 2019/01/03 | EQ2       | 187   |
| 2019/01/04 | EQ2       | 205   |
| 2019/01/05 | EQ2       | 222   |
| 2019/01/06 | EQ2       | 239   |
| 2019/01/07 | EQ2       | 255   |
| 2019/01/10 | EQ2       | 306   |
| 2019/01/13 | EQ2       | 357   |
+------------+-----------+-------+

Desired result

+------------+-----------+-------+
| Date       | Equipment | Hours |
+------------+-----------+-------+
| 2019/01/01 | EQ1       | 50    |
| 2019/01/02 | EQ1       | 67    |
| 2019/01/03 | EQ1       | 87    |
| 2019/01/04 | EQ1       | 105   |
| 2019/01/05 | EQ1       | 120   |
| 2019/01/06 | EQ1       | 135   |
| 2019/01/07 | EQ1       | 150   |
| 2019/01/08 | EQ1       | 169   |
| 2019/01/09 | EQ1       | 187   |
| 2019/01/10 | EQ1       | 207   |
| 2019/01/11 | EQ1       | 227   |
| 2019/01/12 | EQ1       | 247   |
| 2019/01/13 | EQ1       | 265   |
+------------+-----------+-------+
| 2019/01/01 | EQ2       | 150   |
| 2019/01/02 | EQ2       | 168   |
| 2019/01/03 | EQ2       | 187   |
| 2019/01/04 | EQ2       | 205   |
| 2019/01/05 | EQ2       | 222   |
| 2019/01/06 | EQ2       | 239   |
| 2019/01/07 | EQ2       | 255   |
| 2019/01/08 | EQ2       | 272   |
| 2019/01/09 | EQ2       | 289   |
| 2019/01/10 | EQ2       | 306   |
| 2019/01/11 | EQ2       | 323   |
| 2019/01/12 | EQ2       | 340   |
| 2019/01/13 | EQ2       | 357   |
+------------+-----------+-------+

【Comments on the question】:

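Your initial idea is correct; a calendar table is one way to do it. You also have a gaps-and-islands problem. The interpolated values come from the average daily change between the end of one island and the start of the next.

【Answer 1】: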

You can try this query.

DECLARE @SampleTable TABLE ( [Date] Date, Equipment VARCHAR(10),  Hours INT)
INSERT INTO @SampleTable VALUES
('2019/01/01','EQ1', 50 ),
('2019/01/02','EQ1', 67 ),
('2019/01/03','EQ1', 87 ),
('2019/01/04','EQ1', 105),
('2019/01/07','EQ1', 150),
('2019/01/08','EQ1', 169),
('2019/01/09','EQ1', 187),
('2019/01/12','EQ1', 247),
('2019/01/13','EQ1', 265),

('2019/01/01','EQ2', 150),
('2019/01/02','EQ2', 168),
('2019/01/03','EQ2', 187),
('2019/01/04','EQ2', 205),
('2019/01/05','EQ2', 222),
('2019/01/06','EQ2', 239),
('2019/01/07','EQ2', 255),
('2019/01/10','EQ2', 306),
('2019/01/13','EQ2', 357)


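-- Recursive CTE: one row per equipment per day, from its first to its last recorded date.
-- Note: recursion stops after 100 levels by default; for longer date ranges,
-- append OPTION (MAXRECURSION 0) after the final ORDER BY clause.
;WITH CTE AS (
    SELECT MIN([Date]) [Date], Equipment FROM @SampleTable T GROUP BY Equipment 
    UNION ALL
    SELECT DATEADD(DAY,1,CTE.[Date]),  CTE.Equipment FROM CTE 
        WHERE EXISTS( SELECT * FROM @SampleTable T WHERE T.Equipment = CTE.Equipment and DATEADD(DAY,1,CTE.[Date] ) <= T.[Date]  )
)
SELECT  CTE.[Date], CTE.Equipment, 
    -- Linear interpolation: hours at the previous reading (X1) plus days elapsed times the daily slope.
    -- Hours is INT, so the slope is computed with integer division.
    X1.Hours +  
        DATEDIFF(DAY, X1.[Date],CTE.[Date]) * 
        CASE WHEN DATEDIFF(DAY, X1.[Date],X2.[Date]) > 0 
            THEN (X2.Hours - X1.Hours ) / DATEDIFF(DAY, X1.[Date], X2.[Date]) 
            ELSE 0 END [Hours]  -- X1 and X2 are the same row on dates that already have a reading
    FROM CTE
        -- X1: latest reading on or before the current date
        OUTER APPLY( SELECT TOP 1 * FROM @SampleTable S1 WHERE S1.Equipment = CTE.Equipment and CTE.[Date]  >= S1.[Date] ORDER BY S1.Date DESC) X1
        -- X2: earliest reading on or after the current date
        OUTER APPLY( SELECT TOP 1 * FROM @SampleTable S1 WHERE S1.Equipment = CTE.Equipment and CTE.[Date]  <= S1.[Date] ORDER BY S1.Date ASC ) X2
ORDER BY CTE.Equipment, CTE.[Date]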

Result:

Date       Equipment  Hours
---------- ---------- -----------
2019-01-01 EQ1        50
2019-01-02 EQ1        67
2019-01-03 EQ1        87
2019-01-04 EQ1        105
2019-01-05 EQ1        120
2019-01-06 EQ1        135
2019-01-07 EQ1        150
2019-01-08 EQ1        169
2019-01-09 EQ1        187
2019-01-10 EQ1        207
2019-01-11 EQ1        227
2019-01-12 EQ1        247
2019-01-13 EQ1        265

2019-01-01 EQ2        150
2019-01-02 EQ2        168
2019-01-03 EQ2        187
2019-01-04 EQ2        205
2019-01-05 EQ2        222
2019-01-06 EQ2        239
2019-01-07 EQ2        255
2019-01-08 EQ2        272
2019-01-09 EQ2        289
2019-01-10 EQ2        306
2019-01-11 EQ2        323
2019-01-12 EQ2        340
2019-01-13 EQ2        357
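
One caveat that may be worth checking before using this query on other data: Hours is declared as INT in the sample table, so the slope is computed with integer division. The sample gaps happen to divide evenly, which is why the result matches the desired output exactly. The sketch below uses made-up readings (not from the question) to show the difference; the variable names are hypothetical.

```sql
-- Hypothetical readings: 100 hours, then 150 hours four days later.
DECLARE @Hours1 INT = 100, @Hours2 INT = 150, @GapDays INT = 4;

SELECT
    ( @Hours2 - @Hours1 ) / @GapDays                        AS IntegerSlope,  -- 12 (fraction truncated)
    CAST( @Hours2 - @Hours1 AS DECIMAL(10, 2) ) / @GapDays  AS DecimalSlope;  -- 12.5
```

If fractional hours matter, casting the difference to a decimal type before dividing is one possible adjustment; whether to round the final value is a design choice for the asker.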


【Answer 2】:

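Below is some prototype logic to solve your problem.

This logic assumes you have a Dates table (it can be a table variable, a temp table, etc.). You can find code online for how to create one (a simple approach: How to create a Calendar table for 100 years in Sql).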
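
As a rough illustration of such a Dates table, the sketch below fills a temp table with one row per day using a recursive CTE. The date range, the variable names, and the temp table name #DatesTable are assumptions made for this example; the query in this answer refers to a table called DatesTable, so adjust one or the other to match.

```sql
-- Hypothetical date range; adjust it to cover your data.
DECLARE @StartDate DATE = '2019-01-01';
DECLARE @EndDate   DATE = '2019-01-31';

CREATE TABLE #DatesTable ( [Date] DATE NOT NULL PRIMARY KEY );

-- Recursive CTE: start at @StartDate and add one day at a time until @EndDate.
WITH Dates AS (
    SELECT @StartDate AS [Date]
    UNION ALL
    SELECT DATEADD( DAY, 1, [Date] ) FROM Dates WHERE [Date] < @EndDate
)
INSERT INTO #DatesTable ( [Date] )
SELECT [Date] FROM Dates
OPTION ( MAXRECURSION 0 );  -- lift the default limit of 100 recursion levels
```

For a permanent calendar table a one-off load like this is usually enough, since the table only needs to be extended occasionally.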

-- 3. Final result: should return values only for missing days 
SELECT DT.[Date], FilterredGaps.Equipment,
    -- Logic: Hours value at the start of the gap + ( number of days between the start and "current" date * average hours change )
    FilterredGaps.[Hours] + DATEDIFF( DAY, FilterredGaps.[Date], DT.[Date] ) * AvgHoursChange AS [Hours]
FROM
    -- 2. Filter out consecutive days and calculate Avg Hour Change
    ( SELECT *,
        -- Calculate avg daily change (this is integer division if [Hours] is an integer column;
        -- if you have duplicate dates for a given Equipment, you may get divide-by-zero errors)
        (( NextHours - [Hours] ) / DATEDIFF( DAY, [Date], NextDate )) AS AvgHoursChange
    FROM
        -- 1. Find gaps
        ( SELECT *,
            -- Find next date and next hours value
            LEAD( [Date] ) OVER ( PARTITION BY Equipment ORDER BY [Date] ) AS NextDate,
            LEAD( [Hours] ) OVER ( PARTITION BY Equipment ORDER BY [Date] ) AS NextHours
        FROM EquipmentTable ) AS Gaps
    -- Leave only gaps of more than 1 day
    WHERE DATEADD( DAY, 1, [Date] ) < NextDate ) AS FilterredGaps
        -- Finally join filtered gaps to the dates table to get only the missing dates,
        -- i.e. dates strictly between the start of the gap and the next recorded date
        INNER JOIN DatesTable AS DT ON FilterredGaps.[Date] < DT.[Date] AND DT.[Date] < FilterredGaps.NextDate

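The idea comes from: https://www.mssqltips.com/sqlservertutorial/9130/sql-server-window-functions-gaps-and-islands-problem/ I strongly recommend reading that article to get familiar with the problem and the suggested solutions.

Note: this code is untested.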
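
Since the query in this answer is meant to return only the missing days, getting the full table from the question would also need the rows that already exist. A minimal sketch of one way to combine the two, assuming the same EquipmentTable ( [Date], Equipment, [Hours] ) and DatesTable ( [Date] ) names used in the answer; the CTE name Gaps and alias G are chosen for this example, and like the answer it has not been run against a real database.

```sql
WITH Gaps AS (
    -- Pair every reading with the next reading for the same equipment.
    SELECT [Date], Equipment, [Hours],
        LEAD( [Date] )  OVER ( PARTITION BY Equipment ORDER BY [Date] ) AS NextDate,
        LEAD( [Hours] ) OVER ( PARTITION BY Equipment ORDER BY [Date] ) AS NextHours
    FROM EquipmentTable
)
-- Rows that already exist
SELECT [Date], Equipment, [Hours]
FROM EquipmentTable
UNION ALL
-- Interpolated rows for calendar dates strictly inside a gap
SELECT DT.[Date], G.Equipment,
    G.[Hours] + DATEDIFF( DAY, G.[Date], DT.[Date] ) * ( G.NextHours - G.[Hours] )
        / DATEDIFF( DAY, G.[Date], G.NextDate )
FROM Gaps AS G
    INNER JOIN DatesTable AS DT ON G.[Date] < DT.[Date] AND DT.[Date] < G.NextDate
ORDER BY Equipment, [Date];
```

Multiplying before dividing keeps the arithmetic in integers while truncating only the final value rather than the daily slope; on the sample data both orders give the same numbers.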


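【Answer 3】:

I would create a view and use a table-valued Crontab function based on https://github.com/atifaziz/NCrontab/wiki/SQL-Server-Crontab to generate the date/time series, which is then LEFT OUTER JOIN-ed.

Computing the value requires the previous date that has a value (for the current equipment), the next date that has a value (for the current equipment), the date of the current row, and the previous and next values. I would implement this as two scalar-valued functions (which may return NULL if there is no previous or next row). One gets the previous/next date (parameters: @currentDate and a BIT @next; when @next is not set, the previous one is returned), and one gets the previous/next hours (same parameters). The result could also be a single string combining the date and the hours, which is then parsed; it is best to measure which performs better. If the current date has a value, the next-date logic returns that date.

Then create a scalar-valued function that takes these values and performs a calculation like the following (please verify that I have not made any mistakes):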

myGapInDays = NextDate - PreviousDate
myHourDiff = NextHours - PreviousHours
myIncrementPerDay (FLOAT) = myHourDiff / myGapInDays
myFactor = CurrentDate - PreviousDate
myResult = PreviousHours + Round(myFactor * myIncrementPerDay)
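
The calculation above could be written as a T-SQL scalar-valued function roughly as follows. This is only a sketch: the function name dbo.InterpolateHours and its parameter names are hypothetical, and it assumes the previous/next dates and hours have already been looked up by the helper functions described earlier.

```sql
-- Hypothetical function name and parameters, for illustration only.
CREATE FUNCTION dbo.InterpolateHours
(
    @PreviousDate  DATE,
    @PreviousHours INT,
    @NextDate      DATE,
    @NextHours     INT,
    @CurrentDate   DATE
)
RETURNS INT
AS
BEGIN
    -- No neighbour on one side: nothing to interpolate between.
    IF @PreviousDate IS NULL OR @NextDate IS NULL
        RETURN NULL;

    DECLARE @GapInDays INT = DATEDIFF( DAY, @PreviousDate, @NextDate );

    -- Guard against division by zero when both neighbours fall on the same date.
    IF @GapInDays = 0
        RETURN @PreviousHours;

    -- Daily increment as FLOAT, so fractional slopes are not truncated before rounding.
    DECLARE @IncrementPerDay FLOAT = CAST( @NextHours - @PreviousHours AS FLOAT ) / @GapInDays;
    DECLARE @Factor INT = DATEDIFF( DAY, @PreviousDate, @CurrentDate );

    RETURN @PreviousHours + CAST( ROUND( @Factor * @IncrementPerDay, 0 ) AS INT );
END;
```

For example, with the EQ1 readings around the first gap, SELECT dbo.InterpolateHours( '2019-01-04', 105, '2019-01-07', 150, '2019-01-05' ) works out to 105 + 1 * 15 = 120, which matches the desired result in the question.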

Hope this helps.
