SQL moving average
【Title】: SQL moving average 【Posted】: 2012-05-24 09:23:28 【Question】: How do I create a moving average in SQL?
Current table:
Date Clicks
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520
2012-05-04 1,330
2012-05-05 2,260
2012-05-06 3,540
2012-05-07 2,330
Desired table or output:
Date Clicks 3 day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 4,360
2012-05-04 1,330 3,330
2012-05-05 2,260 3,120
2012-05-06 3,540 3,320
2012-05-07 2,330 3,010
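【Comments on the question】:
What database system are you using?
@BrianWebster: In a comment on my (now deleted) post he said he is using Hive. But you removed that tag.
OK, fixed. Honestly, I didn't realize that was a database system.
【Answer 1】: This is an evergreen Joe Celko question. I am ignoring which DBMS platform is used. In any case, Joe was able to answer it more than 10 years ago with standard SQL.
Quoting Joe Celko's SQL Puzzles and Answers: "That last update attempt suggests that we could use the predicate to construct a query that would give us a moving average:"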
SELECT S1.sample_time, AVG(S2.load) AS avg_prev_hour_load
FROM Samples AS S1, Samples AS S2
WHERE S2.sample_time
BETWEEN (S1.sample_time - INTERVAL 1 HOUR)
AND S1.sample_time
GROUP BY S1.sample_time;
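"Is the extra column or the query approach better? The query is technically better because the UPDATE approach will denormalize the database. However, if the historical data being recorded is not going to change and computing the moving average is expensive, you might consider using the column approach."
MS SQL example: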
CREATE TABLE #TestDW
( Date1 datetime,
LoadValue Numeric(13,6)
);
INSERT INTO #TestDW VALUES('2012-06-09' , '3.540' );
INSERT INTO #TestDW VALUES('2012-06-08' , '2.260' );
INSERT INTO #TestDW VALUES('2012-06-07' , '1.330' );
INSERT INTO #TestDW VALUES('2012-06-06' , '5.520' );
INSERT INTO #TestDW VALUES('2012-06-05' , '3.150' );
INSERT INTO #TestDW VALUES('2012-06-04' , '2.230' );
The SQL Puzzles query:
SELECT S1.date1, AVG(S2.LoadValue) AS avg_prev_3_days
FROM #TestDW AS S1, #TestDW AS S2
WHERE S2.date1
BETWEEN DATEADD(d, -2, S1.date1 )
AND S1.date1
GROUP BY S1.date1
order by 1;
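【Discussion】:
Thanks for the information, but I'm having trouble translating it to see how it solves the problem. Could you give the query you would use for the table in the question?
This is better because it can be modified to find an N-month moving average.
【Answer 2】: One way is to join the same table to itself several times.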
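-- Averages today plus the three previous days (four data points).
-- For the 3-day average in the question, drop P3 and divide by 3.
-- [Current] is bracketed because CURRENT is a reserved word in T-SQL.
-- isnull(..., 0) counts a missing day as zero clicks.
select
([Current].Clicks
+ isnull(P1.Clicks, 0)
+ isnull(P2.Clicks, 0)
+ isnull(P3.Clicks, 0)) / 4 as MovingAvg4
from
MyTable as [Current]
left join MyTable as P1 on P1.Date = DateAdd(day, -1, [Current].Date)
left join MyTable as P2 on P2.Date = DateAdd(day, -2, [Current].Date)
left join MyTable as P3 on P3.Date = DateAdd(day, -3, [Current].Date)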
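Adjust the DateAdd component of the ON clauses depending on whether you want the moving average to run strictly from the past through today, or from a few days back through a few days ahead.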
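The difference between the two window placements can be sketched in a few lines of Python. This is only an illustration using the question's data; the helper name `window_average` and its `back`/`ahead` parameters are made up here, and unlike the isnull(..., 0) query it skips missing days instead of counting them as zero.

```python
from datetime import date, timedelta

# Data from the question: one row per day.
clicks = {
    date(2012, 5, 1): 2230,
    date(2012, 5, 2): 3150,
    date(2012, 5, 3): 5520,
    date(2012, 5, 4): 1330,
    date(2012, 5, 5): 2260,
    date(2012, 5, 6): 3540,
    date(2012, 5, 7): 2330,
}


def window_average(day, back, ahead):
    """Average clicks over [day - back, day + ahead]; days with no row are skipped."""
    values = [
        clicks[day + timedelta(days=offset)]
        for offset in range(-back, ahead + 1)
        if day + timedelta(days=offset) in clicks
    ]
    return sum(values) / len(values)


# Trailing window: May 2, 3, 4 (offsets -2, -1, 0 in the ON clauses).
trailing = window_average(date(2012, 5, 4), back=2, ahead=0)
# Centered window: May 3, 4, 5 (offsets -1, 0, +1 in the ON clauses).
centered = window_average(date(2012, 5, 4), back=1, ahead=1)

print(round(trailing, 2), round(centered, 2))
```

In the SQL version the same choice is made by changing the day offsets passed to DateAdd.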
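This works nicely when you need a moving average over only a few data points. It is not an optimal solution for moving averages with many data points.
【Discussion】:
Left join those. (Look, the first two have none.)
Isn't doing 4 joins a costly operation on a large table?
It depends on the data, but in my experience it is a very fast operation.
【Answer 3】:
select t2.date, round(sum(ct.clicks)/3) as avg_clicks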
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date
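Example here.
Obviously, you can change the interval to whatever you need. You could also use count() instead of the magic number to make it easier to change, but that would also slow it down.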
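As a rough sketch of the count() idea, the query below takes the window length as a single parameter and returns NULL when fewer than that many rows fall inside the window. It runs against an in-memory SQLite database through Python's sqlite3 module, so it uses julianday() where the answer uses MySQL's datediff(); the `:n` parameter and the CASE guard are assumptions added for illustration, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clickstable (date TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO clickstable VALUES (?, ?)",
    [
        ("2012-05-01", 2230),
        ("2012-05-02", 3150),
        ("2012-05-03", 5520),
        ("2012-05-04", 1330),
        ("2012-05-05", 2260),
        ("2012-05-06", 3540),
        ("2012-05-07", 2330),
    ],
)

WINDOW_DAYS = 3  # the only place the window length is set

query = """
SELECT t2.date,
       -- NULL unless the window holds exactly :n rows
       CASE WHEN COUNT(ct.clicks) = :n
            THEN ROUND(AVG(ct.clicks)) END AS avg_clicks
FROM clickstable AS t2
JOIN clickstable AS ct
  ON julianday(t2.date) - julianday(ct.date) BETWEEN 0 AND :n - 1
GROUP BY t2.date
ORDER BY t2.date
"""
rows = conn.execute(query, {"n": WINDOW_DAYS}).fetchall()
for row in rows:
    print(row)
```

With this shape the first two dates come back with a NULL average, which is what the question's desired output shows.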
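【Discussion】:
Your first two entries are 1-day and 2-day averages. The question asks for those entries to be NULL.
【Answer 4】:
A general template for rolling averages that scales well to large data sets: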
WITH moving_avg AS (
SELECT 0 AS [lag] UNION ALL
SELECT 1 AS [lag] UNION ALL
SELECT 2 AS [lag] UNION ALL
SELECT 3 AS [lag] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1]) AS [avg_value1],
AVG([value2]) AS [avg_value2]
FROM [data_table]
CROSS JOIN moving_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
For a weighted rolling average:
WITH weighted_avg AS (
SELECT 0 AS [lag], 1.0 AS [weight] UNION ALL
SELECT 1 AS [lag], 0.6 AS [weight] UNION ALL
SELECT 2 AS [lag], 0.3 AS [weight] UNION ALL
SELECT 3 AS [lag], 0.1 AS [weight] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1] * [weight]) / AVG([weight]) AS [wavg_value1],
AVG([value2] * [weight]) / AVG([weight]) AS [wavg_value2]
FROM [data_table]
CROSS JOIN weighted_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
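To see what the weighted template computes, here is a small Python sketch of the same idea: every row is pushed forward by each lag, as DATEADD(day, [lag], [date]) does, and each reference day then divides the weighted sum by the sum of the weights that actually arrived. The day numbers and values are hypothetical; only the lag weights come from the template.

```python
from collections import defaultdict

# Lag -> weight, as in the weighted_avg CTE.
weights = {0: 1.0, 1: 0.6, 2: 0.3, 3: 0.1}

# Hypothetical data: day number -> value.
values = {1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}

weighted_sum = defaultdict(float)
weight_total = defaultdict(float)

# The cross join: every row is paired with every lag.
for day, value in values.items():
    for lag, weight in weights.items():
        reference_day = day + lag  # plays the role of DATEADD(day, lag, date)
        weighted_sum[reference_day] += value * weight
        weight_total[reference_day] += weight

# AVG(v * w) / AVG(w) equals SUM(v * w) / SUM(w) within each group.
wavg = {day: weighted_sum[day] / weight_total[day] for day in sorted(weighted_sum)}

for day, value in wavg.items():
    print(day, round(value, 3))
```

Note that the cross join also produces reference days after the last data row (days 5 to 7 here), built only from older values, so in practice you may want to filter those out.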
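【Discussion】:
Interesting approach to weighting. It won't work (well) for more granular time points (timestamps rather than dates), though.
@msciwoj Outside of an academic exercise, what use is a fixed-weight rolling average over non-uniform intervals? Wouldn't you first record the data, or compute the weights from the interval size?
Absolutely uniform. You just throw each point into the appropriate weight bucket based on its distance from the current time point. For example, "weight = 1 for data points within 24 hours of the current data point; weight = 0.5 for data points within 48 hours…". In that case it matters how far apart consecutive data points (such as 6:12 AM and 11:48 PM) are from each other… One use case I can think of is trying to smooth a histogram where the data points are not dense enough.
【Answer 5】:
select *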
, (select avg(c2.clicks) from #clicks_table c2
where c2.date between dateadd(dd, -2, c1.date) and c1.date) mov_avg
from #clicks_table c1
【Discussion】:
【Answer 6】: Use a different join predicate:
-- periods covers the current date and the two days before it
SELECT [current].date
,avg(periods.clicks)
FROM [current] left outer join [current] as periods
ON periods.date BETWEEN dateadd(d,-2, [current].date) AND [current].date
GROUP BY [current].date HAVING COUNT(*) >= 3
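The HAVING clause prevents any date that does not have at least N values from being returned.
【Discussion】:
This doesn't show the rows for May 1 and May 2, which the asker wants to see with NULLs.
【Answer 7】:
Assuming x is the value to average and xDate is the date value:
SELECT avg(x) FROM myTable WHERE xDate BETWEEN dateadd(d, -2, xDate) AND xDate
【Discussion】:
【Answer 8】: In Hive, maybe you could try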
select date, clicks, avg(clicks) over (order by date rows between 2 preceding and current row) as moving_avg from clicktable;
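The window-function form can be tried without a Hive cluster. The sketch below runs an equivalent query in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer); the named window `w` and the COUNT(*) guard that leaves the first two rows NULL are additions made here for illustration, not part of the Hive answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicktable (date TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO clicktable VALUES (?, ?)",
    [
        ("2012-05-01", 2230),
        ("2012-05-02", 3150),
        ("2012-05-03", 5520),
        ("2012-05-04", 1330),
        ("2012-05-05", 2260),
        ("2012-05-06", 3540),
        ("2012-05-07", 2330),
    ],
)

query = """
SELECT date,
       clicks,
       -- only report the average once the frame holds three rows
       CASE WHEN COUNT(*) OVER w = 3 THEN AVG(clicks) OVER w END AS moving_avg
FROM clicktable
WINDOW w AS (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

A ROWS frame counts rows, not calendar days, so if a date is missing from the table the "3-day" window silently spans a longer period.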
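【Discussion】:
【Answer 9】: For this, I would create an auxiliary/dimension date table such as
create table date_dim(date date, date_1 date, date_2 date, date_3 date ...)
date is the key, date_1 stands for today, date_2 covers today and the day before; date_3 ...
Then you can do an equi-join in Hive.
Using a view like: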
select date, date from date_dim
union all
select date, date_add(date, -1) from date_dim
union all
select date, date_add(date, -2) from date_dim
union all
select date, date_add(date, -3) from date_dim
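One way to read this suggestion: the view expands every date into one row per day of its window, after which the moving average needs only a plain equality join. The sketch below tries that in SQLite via Python's sqlite3 module. The table and view names (`clicks_table`, `date_window`), the `day` column and the 3-day window are all assumptions for illustration; the answer's own view reaches back three days and is built from date_dim.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clicks_table (day TEXT, clicks INTEGER);
    INSERT INTO clicks_table VALUES
        ('2012-05-01', 2230), ('2012-05-02', 3150), ('2012-05-03', 5520),
        ('2012-05-04', 1330), ('2012-05-05', 2260), ('2012-05-06', 3540),
        ('2012-05-07', 2330);

    -- One row per (day, day inside its trailing 3-day window).
    CREATE VIEW date_window AS
    SELECT day, day AS window_day FROM clicks_table
    UNION ALL
    SELECT day, date(day, '-1 day') FROM clicks_table
    UNION ALL
    SELECT day, date(day, '-2 day') FROM clicks_table;
    """
)

rows = conn.execute(
    """
    SELECT w.day, AVG(c.clicks) AS moving_avg, COUNT(c.clicks) AS days_found
    FROM date_window AS w
    JOIN clicks_table AS c ON c.day = w.window_day  -- equality join only
    GROUP BY w.day
    ORDER BY w.day
    """
).fetchall()
for row in rows:
    print(row)
```

The days_found column shows how many window days actually had data, which makes it easy to blank out the incomplete windows at the start.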
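【Discussion】:
【Answer 10】: Note: this is NOT an answer but an enhanced code sample of Diego Scaravaggi's answer. I am posting it as an answer because the comment section is insufficient. Note that I have parameterized the period of the moving average.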
declare @p int = 3
declare @t table(d int, bal float)
insert into @t values
(1,94),
(2,99),
(3,76),
(4,74),
(5,48),
(6,55),
(7,90),
(8,77),
(9,16),
(10,19),
(11,66),
(12,47)
select a.d, avg(b.bal)
from
@t a
left join @t b on b.d between a.d-(@p-1) and a.d
group by a.d
【Discussion】:
【Answer 11】:
--@p1 is period of moving average, @o1 is offset
declare @p1 as int
declare @o1 as int
set @p1 = 5;
set @o1 = 3;
with np as(
select *, rank() over(partition by cmdty, tenor order by markdt) as r
from p_prices p1
where
1=1
)
, x1 as (
select s1.*, avg(s2.val) as avgval from np s1
inner join np s2
on s1.cmdty = s2.cmdty and s1.tenor = s2.tenor
and s2.r between s1.r - (@p1 - 1) - (@o1) and s1.r - (@o1)
group by s1.cmdty, s1.tenor, s1.markdt, s1.val, s1.r
)
-- final SELECT added so the CTE chain forms a complete statement
select * from x1;
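The period/offset arithmetic in the join condition (s2.r between s1.r - (@p1 - 1) - @o1 and s1.r - @o1) may be easier to follow in a short Python sketch. The function name and sample data are illustrative (the balances are borrowed from the earlier answer), and unlike the inner join, which averages whatever rows exist, this version returns None until a full window is available.

```python
def offset_moving_average(values, period, offset):
    """Average `period` consecutive values ending `offset` rows before each row.

    Returns None for rows where the full window is not available yet.
    """
    result = []
    for i in range(len(values)):
        end = i - offset            # last row of the window (inclusive)
        start = end - (period - 1)  # first row of the window
        if start < 0:
            result.append(None)
        else:
            window = values[start:end + 1]
            result.append(sum(window) / period)
    return result


balances = [94, 99, 76, 74, 48, 55, 90, 77]
result = offset_moving_average(balances, period=3, offset=1)
print(result)
```

With offset=0 this reduces to the ordinary trailing moving average over `period` rows.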
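【Discussion】:
【Answer 12】: I'm not sure that your expected result (output) shows a classic "simple moving (rolling) average" over 3 days. Because, for example, the first triple of numbers by definition gives: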
ThreeDaysMovingAverage = (2.230 + 3.150 + 5.520) / 3 = 3.6333333
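but you expect 4.360, which is confusing.
Nevertheless, I suggest the following solution, which uses the window function AVG. This approach is much more efficient (clearer and less resource-intensive) than the SELF-JOIN introduced in the other answers (and I'm surprised that no one has given a better solution).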
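The same check can be extended to every row. The short Python sketch below recomputes the trailing 3-day mean for each date and prints it next to the figure from the question's desired output; it reads the click counts as whole numbers (2,230 as 2230), which is an assumption about the thousands separator and does not change the comparison.

```python
clicks = [2230, 3150, 5520, 1330, 2260, 3540, 2330]

# "3 day Moving Average" column as given in the question.
expected = [None, None, 4360, 3330, 3120, 3320, 3010]

# Trailing 3-day simple moving average; None until three days are available.
computed = [
    None if i < 2 else sum(clicks[i - 2:i + 1]) / 3
    for i in range(len(clicks))
]

for exp, comp in zip(expected, computed):
    shown = None if comp is None else round(comp, 2)
    print(exp, shown)
```

Only the 2012-05-04 row comes close to the question's figure, so the expected column does not appear to be a plain 3-day simple moving average.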
-- Oracle-SQL dialect
with
data_table as (
select date '2012-05-01' AS dt, 2.230 AS clicks from dual union all
select date '2012-05-02' AS dt, 3.150 AS clicks from dual union all
select date '2012-05-03' AS dt, 5.520 AS clicks from dual union all
select date '2012-05-04' AS dt, 1.330 AS clicks from dual union all
select date '2012-05-05' AS dt, 2.260 AS clicks from dual union all
select date '2012-05-06' AS dt, 3.540 AS clicks from dual union all
select date '2012-05-07' AS dt, 2.330 AS clicks from dual
),
param as (select 3 days from dual)
select
dt AS "Date",
clicks AS "Clicks",
case when rownum >= p.days then
avg(clicks) over (order by dt
rows between p.days - 1 preceding and current row)
end
AS "3 day Moving Average"
from data_table t, param p;
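You can see that AVG is wrapped in case when rownum >= p.days then to force NULLs in the first rows, where a "3 day Moving Average" is meaningless.
【Discussion】:
【Answer 13】: We can apply Joe Celko's "dirty" left outer join method (as cited above by Diego Scaravaggi) to answer the question as it was asked.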
declare @ClicksTable table ([Date] date, Clicks int)
insert into @ClicksTable
select '2012-05-01', 2230 union all
select '2012-05-02', 3150 union all
select '2012-05-03', 5520 union all
select '2012-05-04', 1330 union all
select '2012-05-05', 2260 union all
select '2012-05-06', 3540 union all
select '2012-05-07', 2330
This query:
SELECT
T1.[Date],
T1.Clicks,
-- AVG ignores NULL values so we have to explicitly NULLify
-- the days when we don't have a full 3-day sample
CASE WHEN count(T2.[Date]) < 3 THEN NULL
ELSE AVG(T2.Clicks)
END AS [3-Day Moving Average]
FROM @ClicksTable T1
LEFT OUTER JOIN @ClicksTable T2
ON T2.[Date] BETWEEN DATEADD(d, -2, T1.[Date]) AND T1.[Date]
GROUP BY T1.[Date], T1.Clicks
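produces output in the requested layout, with the first two days left NULL. Because Clicks is an int, AVG truncates to a whole number, and the values are the true 3-day means rather than the figures listed in the question:
Date Clicks 3-Day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 3,633
2012-05-04 1,330 3,333
2012-05-05 2,260 3,036
2012-05-06 3,540 2,376
2012-05-07 2,330 2,710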
【Discussion】: