Fill date gaps and interpolate values
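【Title】Fill date gaps and Interpolate values 【Posted】2019-07-31 14:16:22 【Question】I have a very simple SQL Server table containing the date (one row per day), the equipment name, and the cumulative engine hours. The raw data table shows gaps between the dates. I need to fill those gaps and interpolate, so that the new rows get an Hours value. The "Desired result" table shows what the final product should look like.
My initial idea was to create a "dates" table (recursive function) and use a LEFT JOIN to build the complete table, but populating the Hours column with interpolated data at that stage is beyond me. Any ideas?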
Raw data
+------------+-----------+-------+
| Date       | Equipment | Hours |
+------------+-----------+-------+
| 2019/01/01 | EQ1       | 50    |
| 2019/01/02 | EQ1       | 67    |
| 2019/01/03 | EQ1       | 87    |
| 2019/01/04 | EQ1       | 105   |
| 2019/01/07 | EQ1       | 150   |
| 2019/01/08 | EQ1       | 169   |
| 2019/01/09 | EQ1       | 187   |
| 2019/01/12 | EQ1       | 247   |
| 2019/01/13 | EQ1       | 265   |
+------------+-----------+-------+
| 2019/01/01 | EQ2       | 150   |
| 2019/01/02 | EQ2       | 168   |
| 2019/01/03 | EQ2       | 187   |
| 2019/01/04 | EQ2       | 205   |
| 2019/01/05 | EQ2       | 222   |
| 2019/01/06 | EQ2       | 239   |
| 2019/01/07 | EQ2       | 255   |
| 2019/01/10 | EQ2       | 306   |
| 2019/01/13 | EQ2       | 357   |
+------------+-----------+-------+
Desired result
+------------+-----------+-------+
| Date       | Equipment | Hours |
+------------+-----------+-------+
| 2019/01/01 | EQ1       | 50    |
| 2019/01/02 | EQ1       | 67    |
| 2019/01/03 | EQ1       | 87    |
| 2019/01/04 | EQ1       | 105   |
| 2019/01/05 | EQ1       | 120   |
| 2019/01/06 | EQ1       | 135   |
| 2019/01/07 | EQ1       | 150   |
| 2019/01/08 | EQ1       | 169   |
| 2019/01/09 | EQ1       | 187   |
| 2019/01/10 | EQ1       | 207   |
| 2019/01/11 | EQ1       | 227   |
| 2019/01/12 | EQ1       | 247   |
| 2019/01/13 | EQ1       | 265   |
+------------+-----------+-------+
| 2019/01/01 | EQ2       | 150   |
| 2019/01/02 | EQ2       | 168   |
| 2019/01/03 | EQ2       | 187   |
| 2019/01/04 | EQ2       | 205   |
| 2019/01/05 | EQ2       | 222   |
| 2019/01/06 | EQ2       | 239   |
| 2019/01/07 | EQ2       | 255   |
| 2019/01/08 | EQ2       | 272   |
| 2019/01/09 | EQ2       | 289   |
| 2019/01/10 | EQ2       | 306   |
| 2019/01/11 | EQ2       | 323   |
| 2019/01/12 | EQ2       | 340   |
| 2019/01/13 | EQ2       | 357   |
+------------+-----------+-------+
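【Comments】:
Your initial idea is right: a calendar table is one way to do it. You also have a gaps-and-islands problem. The interpolated values would be averaged between the ends of adjacent islands. 【Answer 1】: You can try this query.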
DECLARE @SampleTable TABLE ( [Date] Date, Equipment VARCHAR(10), Hours INT)
INSERT INTO @SampleTable VALUES
('2019/01/01','EQ1', 50 ),
('2019/01/02','EQ1', 67 ),
('2019/01/03','EQ1', 87 ),
('2019/01/04','EQ1', 105),
('2019/01/07','EQ1', 150),
('2019/01/08','EQ1', 169),
('2019/01/09','EQ1', 187),
('2019/01/12','EQ1', 247),
('2019/01/13','EQ1', 265),
('2019/01/01','EQ2', 150),
('2019/01/02','EQ2', 168),
('2019/01/03','EQ2', 187),
('2019/01/04','EQ2', 205),
('2019/01/05','EQ2', 222),
('2019/01/06','EQ2', 239),
('2019/01/07','EQ2', 255),
('2019/01/10','EQ2', 306),
('2019/01/13','EQ2', 357)
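-- Recursive CTE: one row per equipment per day, from its first reading to its last
;WITH CTE AS (
    SELECT MIN([Date]) [Date], Equipment FROM @SampleTable T GROUP BY Equipment
    UNION ALL
    SELECT DATEADD(DAY,1,CTE.[Date]), CTE.Equipment FROM CTE
    WHERE EXISTS( SELECT * FROM @SampleTable T WHERE T.Equipment = CTE.Equipment AND DATEADD(DAY,1,CTE.[Date]) <= T.[Date] )
)
SELECT CTE.[Date], CTE.Equipment,
    -- Previous reading + days elapsed * average daily change (integer division, because Hours is INT)
    X1.Hours +
    DATEDIFF(DAY, X1.[Date], CTE.[Date]) *
    CASE WHEN DATEDIFF(DAY, X1.[Date], X2.[Date]) > 0
         THEN (X2.Hours - X1.Hours) / DATEDIFF(DAY, X1.[Date], X2.[Date])
         ELSE 0 END [Hours] -- the date has its own reading (X1 = X2), so nothing is added
FROM CTE
-- X1: latest reading on or before the date; X2: earliest reading on or after it
OUTER APPLY( SELECT TOP 1 * FROM @SampleTable S1 WHERE S1.Equipment = CTE.Equipment AND CTE.[Date] >= S1.[Date] ORDER BY S1.[Date] DESC ) X1
OUTER APPLY( SELECT TOP 1 * FROM @SampleTable S1 WHERE S1.Equipment = CTE.Equipment AND CTE.[Date] <= S1.[Date] ORDER BY S1.[Date] ASC ) X2
ORDER BY CTE.Equipment, CTE.[Date]
-- For date ranges longer than 100 days, append OPTION (MAXRECURSION 0) to lift the default recursion limit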
Result:
Date Equipment Hours
---------- ---------- -----------
2019-01-01 EQ1 50
2019-01-02 EQ1 67
2019-01-03 EQ1 87
2019-01-04 EQ1 105
2019-01-05 EQ1 120
2019-01-06 EQ1 135
2019-01-07 EQ1 150
2019-01-08 EQ1 169
2019-01-09 EQ1 187
2019-01-10 EQ1 207
2019-01-11 EQ1 227
2019-01-12 EQ1 247
2019-01-13 EQ1 265
2019-01-01 EQ2 150
2019-01-02 EQ2 168
2019-01-03 EQ2 187
2019-01-04 EQ2 205
2019-01-05 EQ2 222
2019-01-06 EQ2 239
2019-01-07 EQ2 255
2019-01-08 EQ2 272
2019-01-09 EQ2 289
2019-01-10 EQ2 306
2019-01-11 EQ2 323
2019-01-12 EQ2 340
2019-01-13 EQ2 357
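One detail worth checking before reusing this query on other data: Hours is declared as INT, so the per-day increment is computed with integer division. The sample gaps happen to divide evenly, which is why the output matches the desired result exactly. The sketch below uses made-up readings (105 and 151 hours, 3 days apart, not taken from the question) to show the difference between the truncated and the decimal increment.

```sql
-- Hypothetical readings, chosen so the gap does not divide evenly
DECLARE @PrevHours INT = 105, @NextHours INT = 151, @GapDays INT = 3;

SELECT
    -- INT / INT truncates: 46 / 3 gives 15
    (@NextHours - @PrevHours) / @GapDays AS IntPerDay,
    -- Multiplying by 1.0 first forces decimal arithmetic: roughly 15.33
    (@NextHours - @PrevHours) * 1.0 / @GapDays AS DecimalPerDay;
```

If fractional hours matter, one option is to apply the same `* 1.0` (or a CAST to DECIMAL) inside the CASE expression and round only the final value.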
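【Answer 2】: Below is prototype logic to solve your problem.
This logic assumes you have a Dates table (it can be a table variable, a temp table, etc.). You can find code online showing how to create one (a simple approach: How to create a Calendar table for 100 years in Sql)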
-- 3. Final result: returns values only for the missing days
SELECT DT.[Date], FilterredGaps.Equipment,
    -- Logic: Hours value at the start of the gap + ( number of days between the start and "current" date * average hours change )
    FilterredGaps.[Hours] + DATEDIFF( DAY, FilterredGaps.[Date], DT.[Date] ) * AvgHoursChange AS [Hours]
FROM
    -- 2. Filter out consecutive days and calculate Avg Hour Change
    ( SELECT *,
        -- Calculate avg daily change (if you have duplicate dates for a given Equipment, you may get divide by zero errors)
        (( NextHours - Hours ) / DATEDIFF( DAY, [Date], NextDate )) AS AvgHoursChange
      FROM
        -- 1. Find gaps
        ( SELECT *,
            -- Find next date and next hours value
            LEAD( [Date] ) OVER ( PARTITION BY Equipment ORDER BY [Date] ) AS NextDate,
            LEAD( [Hours] ) OVER ( PARTITION BY Equipment ORDER BY [Date] ) AS NextHours
          FROM EquipmentTable ) AS Gaps
      -- Leave only gaps of more than 1 day
      WHERE DATEADD( DAY, 1, [Date] ) < NextDate ) AS FilterredGaps
-- Finally join filtered gaps to the dates table to get only the dates strictly inside each gap
INNER JOIN DatesTable AS DT ON FilterredGaps.[Date] < DT.[Date] AND DT.[Date] < FilterredGaps.NextDate
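The idea comes from: https://www.mssqltips.com/sqlservertutorial/9130/sql-server-window-functions-gaps-and-islands-problem/ I strongly recommend reading that article to get familiar with the problem and the suggested solutions.
Note: this code is untested.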
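The query above reads from two tables it does not define, EquipmentTable and DatesTable. To experiment with it, they could be set up roughly as follows. This is only a sketch: the table and column names follow the query, while the column types, the three sample rows, and the date range are assumptions made for illustration.

```sql
-- Hypothetical setup: names follow the query above, types and range are assumptions
CREATE TABLE EquipmentTable ([Date] DATE, Equipment VARCHAR(10), [Hours] INT);

INSERT INTO EquipmentTable VALUES
('2019-01-04', 'EQ1', 105),
('2019-01-07', 'EQ1', 150),   -- 2019-01-05 and 2019-01-06 are missing
('2019-01-08', 'EQ1', 169);

CREATE TABLE DatesTable ([Date] DATE NOT NULL PRIMARY KEY);

-- Recursive CTE that emits one row per day in the chosen range
WITH Days AS (
    SELECT CAST('2019-01-01' AS DATE) AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date]) FROM Days WHERE [Date] < '2019-01-13'
)
INSERT INTO DatesTable ([Date])
SELECT [Date] FROM Days
OPTION (MAXRECURSION 0);      -- lifts the default limit of 100 recursion levels
```

Because that query returns only the missing days, it would still need a UNION ALL with the rows already in EquipmentTable to produce the full desired result.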
【Answer 3】: I would create a view and use a table-valued Crontab function based on https://github.com/atifaziz/NCrontab/wiki/SQL-Server-Crontab to generate the date/time series, and then LEFT OUTER JOIN the data to it.
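Computing a value requires the previous date that has a value (for the current equipment), the next date that has a value (for the current equipment), the date of the current row, and the previous and next values. I would implement this as 2 scalar-valued functions (which may return NULL if there is no previous or next reading). One gets the previous/next date (parameters: @currentDate and a BIT @next; when @next is not set it returns the previous one) and one gets the previous/next hours (same parameters). The result could also be a combined string of date and hours that is parsed afterwards; it is best to measure which performs better. If the current date has a value, the next-date logic returns that date.
Then create a scalar-valued function that takes these values and performs a calculation like the following (verify that I have not made any mistakes):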
myGapInDays = NextDate - PreviousDate
myHourDiff = NextHours - PreviousHours
myIncrementPerDay (FLOAT) = myHourDiff / myGapInDays
myFactor = CurrentDate - PreviousDate
myResult = PreviousHours + Round(myFactor * myIncrementPerDay)
Hope this helps.
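The calculation in the pseudocode above could be written as a T-SQL scalar function along these lines. This is a sketch, not the answerer's code: the function name dbo.InterpolateHours and its parameter list are hypothetical, and it expects the previous/next dates and hours to have been looked up already. Note that T-SQL's ROUND needs an explicit length argument.

```sql
-- Hypothetical helper; the name and parameters are illustrative only
CREATE FUNCTION dbo.InterpolateHours
(
    @PreviousDate  DATE,
    @PreviousHours INT,
    @NextDate      DATE,
    @NextHours     INT,
    @CurrentDate   DATE
)
RETURNS INT
AS
BEGIN
    -- No neighbouring reading on one side: nothing to interpolate between
    IF @PreviousDate IS NULL OR @NextDate IS NULL RETURN NULL;

    DECLARE @GapInDays INT = DATEDIFF(DAY, @PreviousDate, @NextDate);

    -- Previous and next date coincide: the current date has its own reading
    IF @GapInDays = 0 RETURN @PreviousHours;

    DECLARE @IncrementPerDay FLOAT = CAST(@NextHours - @PreviousHours AS FLOAT) / @GapInDays;
    DECLARE @Factor INT = DATEDIFF(DAY, @PreviousDate, @CurrentDate);

    RETURN @PreviousHours + CAST(ROUND(@Factor * @IncrementPerDay, 0) AS INT);
END;
GO

-- Gap from the question: 105 hours on 2019-01-04, 150 hours on 2019-01-07
SELECT dbo.InterpolateHours('2019-01-04', 105, '2019-01-07', 150, '2019-01-05') AS [Hours]; -- 120
```

GO is a batch separator understood by client tools such as SSMS and sqlcmd; it is needed here because CREATE FUNCTION has to be the only statement in its batch.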