每日销售数据的时间序列预测
Posted
技术标签:
【中文标题】每日销售数据的时间序列预测【英文标题】:Time Series Forecasting of daily sales data 【发布时间】:2021-09-20 14:39:40 【问题描述】:您好,我是时间序列预测和 R 的新手。我收集了我兄弟新创业公司的每日销售数据来学习预测(从 28-03-19 到 7-7-21)。我想预测未来 30 天,有人可以帮我预测这些数据。我正在共享此 CSV 文件的链接。我按照 kaggle 某人展示的步骤尝试使用 ARIMA 进行预测,但与 MAPE 值为 28% 的测试数据集的实际值相比,我得到的值完全不同。希望有人能指导我解决这个问题。
https://easyupload.io/hltskc
这是 2021 年的示例数据:
data.frame(date=seq(as.Date("2021-01-01"),as.Date("2021-07-07"),by="1 day"),sales=c(100,105 ,167, 106 ,112, 107, 202, 98, 120, 109 ,114, 195, 110, 121, 89, 128, 104, 194 ,107 ,127, 117, 100, 117 ,205,116, 112, 119, 129, 161, 132, 110, 114, 118, 194, 114, 113, 113 ,172, 101 ,161 ,102, 135, 97, 122, 170, 126 ,160, 110, 118, 108, 111, 163, 110, 123 ,102 ,116, 181, 119, 155, 108, 122, 169, 115, 122, 116, 168 ,115 ,101, 117, 113 ,163, 115 ,107, 106, 171 ,109, 119, 107, 101 ,166, 105, 102 ,174,108, 102, 114, 97, 114, 149, 100, 111, 94,110 ,108, 100 , 92 ,104, 112, 160, 105, 98, 91,117 , 44, 60 , 36 , 50 , 51 , 54, 62, 61 , 62 , 50 , 59 , 85 , 49, 61 , 56 , 63, 39, 110 , 56 , 54 , 55, 56, 63 , 44, 115, 55, 50, 96 ,129 , 61, 59, 98 , 90 ,153, 90, 82 , 98, 79, 149, 97 , 85, 92, 78, 100 , 69, 152, 88, 76 , 91 ,145, 106, 69, 84, 72, 144 , 76, 74 , 94 ,70 , 86 ,76 ,137 , 71 ,87 , 91, 62 ,150 , 66 , 77 ,88, 135, 93 , 62 , 83, 83 , 72, 71 ,148 , 91 , 68 , 78 , 95 ,124, 69 ,78))
【问题讨论】:
您好,欢迎来到 SO。请查看How to Ask 以获取提示。提供一些数据、创建great reproducible example 并提供所需输出的示例是一个好的开始。您可以使用dput()
或data.frame()
编辑您的问题并在其中放置一些示例数据。通常,SO 上的人不喜欢点击外部链接。
@najeel 习惯上赞成并接受解决您查询的答案。如果他们没有,请发布有关您的问题的澄清详细信息。
【参考方案1】:
这是对多个模型的输出进行平均(取中值)并生成预测的一种相当基本的方法。请注意,在建模之前或之后都没有进行必要的 EDA、模型调整或模型验证。此外,我不确定这种平均预测间隔的方法是否合理。如需最佳实践方法,请参阅here。
# Install pacakges if they are not already installed: necessary_packages => vector
necessary_packages <- c("forecast", "ggplot2")
# Create a vector containing the names of any packages needing installation:
# new_pacakges => vector
new_packages <- necessary_packages[!(necessary_packages %in%
installed.packages()[, "Package"])]
# If the vector has more than 0 values, install the new pacakges
# (and their) associated dependencies:
if (length(new_packages) > 0)
install.packages(new_packages, dependencies = TRUE)
# Initialise the packages in the session: list of boolean => stdout (console)
lapply(necessary_packages, require, character.only = TRUE)
# Coerce to multiple seasonality time series object:
y <- msts(
df$sales,
start = c(
min(as.integer(strftime(df$date, "%Y"))),
1
),
seasonal.periods = c(
7.009615384615385,
30.5
)
)
# Function to coalesce fit values with actuals:
coalesceActualsFit <- function(fit_obj)
# Coalesce actual values with fit values:
# res => double vector
res <- transform(
setNames(
data.frame(
cbind(
fit_obj$x,
fit_obj$fitted
)[,1:2]
),
c("x", "xhat")
),
xhat = ifelse(is.na(xhat), x, xhat)
)[,2,drop=TRUE]
# Explicitly define returned object:
# double vector => Global Env()
return(res)
# Store the forecast period (n-days):
fcast_period <- 30
# Fit a holt winters model:
hw_fit <- HoltWinters(y)
# Remove NAs from the fitted values:
hw_fit_vals <- coalesceActualsFit(hw_fit)
# Forecast out n-days:
hw_fcast <- forecast(hw_fit, h = fcast_period)
# Fit a neural network:
nnetar_fit <- nnetar(y, P = 10, repeats = 50)
# Remove NAs from the fitted values:
nnetar_fit_vals <- coalesceActualsFit(nnetar_fit)
# Forecast out n-days:
nnetar_fcast <- forecast(nnetar_fit, h = fcast_period)
# Fit an stlf model:
stlf_fit <- stlf(y)
# Forecast out n-days:
stlf_fcast <- forecast(stlf_fit, h = fcast_period)
# Create an ensemble for the predictions:
med_predict <- apply(
cbind(hw_fit_vals, nnetar_fit_vals, stlf_fit$fitted),
1,
median
)
# Create an ensemble for the forecasts:
med_fcast <- apply(
cbind(hw_fcast$mean, nnetar_fcast$mean, stlf_fcast$mean),
1,
median
)
# Calculate the enesmble prediction intervals:
fcast_pis <- do.call(
cbind,
lapply(
c("lower", "upper"),
function(x)
y <- data.frame(
cbind(
hw_fcast[[x]],
nnetar_fcast[[x]],
stlf_fcast[[x]]
)
)
pi_80 <- apply(
y[,endsWith(names(y), "80.")],
1,
median
)
pi_95 <- apply(
y[,endsWith(names(y), "95.")],
1,
median
)
cbind(
setNames(data.frame(pi_80), paste0(x, "_80")),
setNames(data.frame(pi_95), paste0(x, "_95"))
)
)
)
# Combine them into a single vector:
pt_fcasts <- c(
med_predict,
med_fcast
)
# Create a data.frame containing the original data and
# the point forecasts:
fcast_df <- transform(
data.frame(
actuals = c(df$sales, rep(NA_real_, fcast_period)),
pnt_fcasts = pt_fcasts,
date =
as.Date(
c(
df$date,
seq(
max(df$date) + 1,
max(df$date) + fcast_period,
by = 1
)
),
"%Y-%m-%d"
)
),
pnt_fcasts = ifelse(is.na(pnt_fcasts), sales, pnt_fcasts)
)
# Adjust the forecast predition intervals to be the same
# length as the combined data:
fcast_pis_adjusted <- transform(
rbind(
setNames(
data.frame(
do.call(
cbind,
lapply(seq_len(ncol(fcast_pis)), function(i)
pt_fcasts[1:nrow(df)]
)
)
),
names(fcast_pis)
),
fcast_pis
),
date = fcast_df$date
)
# Merge with the point forecast data:
chart_data <- merge(
fcast_df,
fcast_pis_adjusted,
by = "date"
)
# Chart the forecast vs actuals:
ggplot(chart_data, aes(date)) +
geom_line(aes(y = actuals, colour = "Actuals")) +
geom_line(aes(y = lower_80, colour = "Lower 80 PI")) +
geom_line(aes(y = lower_95, colour = "Lower 95 PI")) +
geom_line(aes(y = upper_80, colour = "Upper 80 PI")) +
geom_line(aes(y = upper_95, colour = "Upper 95 PI")) +
geom_line(aes(y = pnt_fcasts, colour = "Point Forecast")) +
xlab("Date") +
ylab("Sales") +
labs(colour = "Series")
数据:
df <- data.frame(
date = seq(as.Date("2021-01-01"), as.Date("2021-07-07"), by = "1 day"),
sales = c(
100,
105 ,
167,
106 ,
112,
107,
202,
98,
120,
109 ,
114,
195,
110,
121,
89,
128,
104,
194 ,
107 ,
127,
117,
100,
117 ,
205,
116,
112,
119,
129,
161,
132,
110,
114,
118,
194,
114,
113,
113 ,
172,
101 ,
161 ,
102,
135,
97,
122,
170,
126 ,
160,
110,
118,
108,
111,
163,
110,
123 ,
102 ,
116,
181,
119,
155,
108,
122,
169,
115,
122,
116,
168 ,
115 ,
101,
117,
113 ,
163,
115 ,
107,
106,
171 ,
109,
119,
107,
101 ,
166,
105,
102 ,
174,
108,
102,
114,
97,
114,
149,
100,
111,
94,
110 ,
108,
100 ,
92 ,
104,
112,
160,
105,
98,
91,
117 ,
44,
60 ,
36 ,
50 ,
51 ,
54,
62,
61 ,
62 ,
50 ,
59 ,
85 ,
49,
61 ,
56 ,
63,
39,
110 ,
56 ,
54 ,
55,
56,
63 ,
44,
115,
55,
50,
96 ,
129 ,
61,
59,
98 ,
90 ,
153,
90,
82 ,
98,
79,
149,
97 ,
85,
92,
78,
100 ,
69,
152,
88,
76 ,
91 ,
145,
106,
69,
84,
72,
144 ,
76,
74 ,
94 ,
70 ,
86 ,
76 ,
137 ,
71 ,
87 ,
91,
62 ,
150 ,
66 ,
77 ,
88,
135,
93 ,
62 ,
83,
83 ,
72,
71 ,
148 ,
91 ,
68 ,
78 ,
95 ,
124,
69 ,
78
)
)
【讨论】:
【参考方案2】:一个简单的出路是 facebook 发布的先知包。这个预测很容易解决,但您必须进行质量评估(对预测进行评分)。
library(prophet) # possibly have to install it
set.seed(1234) # make it reproducable
# set the column names that prophet requires (ds has to be date type)
names(df) <- c("ds", "y")
# build the model in automatic mode (there are some parameter to tweak)
mod <- prophet::prophet(df)
# build the future df (days you want to predict
fdf <- prophet::make_future_dataframe(mod, periods = 92)
# make the prediction
forec <- predict(mod, fdf)
# plot the data and forecast
plot(mod, forec)
head(forec$yhat)
[1] 121.2439 137.5267 140.1426 132.0549 134.5965 136.5455
您可能希望根据转换需求(BoxCox、日志、差异等)来研究分析时间序列的各个方面。查看您的数据(黑点)可以解释为,由于四月和五月的数据相对较低,因此存在一个外部回归量(附加解释变量)。这个独立变量可能是可预测的/已知的,因此对预测质量有重大影响。由于您使用的是每日数据,因此可能会有一些信息被假期所掩盖,而这些信息在预测者的建模中没有正确表示。最后的建议涉及数据长度:根据您的预测范围,大约 6 个月的输入可能是不够的,特别是如果需要考虑一些年度模式(经验法则是使用至少 2 年的数据,更好 3 )。
您可能对预测包及其功能感兴趣,但我认为它们在每日间隔和短输入数据时表现不佳:
library(forecast)
# make time series object
tsdf <- ts(df$y, frequency = 365)
# make an auto.arima forecast model
mod2 <- forecast::auto.arima(log(tsdf))
# predict
plot(forecast::forecast(mod2, h = 92))
# make a "hip" prediction using neural net with 3 hidden neurons
mod3 <- forecast::nnetar(tsdf, P = 3, repeats = 1000)
plot(forecast::forecast(mod3, h = 92))
一些你可能会看/读/尝试的东西:
Short Openbook about TS in R
Openbook from the creator of the forecast about it and the theory behind
Decision trees implementation for TS
使用的数据:
df <- data.frame(date=seq(as.Date("2021-01-01"),as.Date("2021-07-07"),by="1 day"),sales=c(100,105 ,167, 106 ,112, 107, 202, 98, 120, 109 ,114, 195, 110, 121, 89, 128, 104, 194 ,107 ,127, 117, 100, 117 ,205,116, 112, 119, 129, 161, 132, 110, 114, 118, 194, 114, 113, 113 ,172, 101 ,161 ,102, 135, 97, 122, 170, 126 ,160, 110, 118, 108, 111, 163, 110, 123 ,102 ,116, 181, 119, 155, 108, 122, 169, 115, 122, 116, 168 ,115 ,101, 117, 113 ,163, 115 ,107, 106, 171 ,109, 119, 107, 101 ,166, 105, 102 ,174,108, 102, 114, 97, 114, 149, 100, 111, 94,110 ,108, 100 , 92 ,104, 112, 160, 105, 98, 91,117 , 44, 60 , 36 , 50 , 51 , 54, 62, 61 , 62 , 50 , 59 , 85 , 49, 61 , 56 , 63, 39, 110 , 56 , 54 , 55, 56, 63 , 44, 115, 55, 50, 96 ,129 , 61, 59, 98 , 90 ,153, 90, 82 , 98, 79, 149, 97 , 85, 92, 78, 100 , 69, 152, 88, 76 , 91 ,145, 106, 69, 84, 72, 144 , 76, 74 , 94 ,70 , 86 ,76 ,137 , 71 ,87 , 91, 62 ,150 , 66 , 77 ,88, 135, 93 , 62 , 83, 83 , 72, 71 ,148 , 91 , 68 , 78 , 95 ,124, 69 ,78))
【讨论】:
以上是关于每日销售数据的时间序列预测的主要内容,如果未能解决你的问题,请参考以下文章
如何跳过 ARIMA 模型(python)中的第一个滞后 N 天?
时间序列预测 - ARIMA/ARIMAX 与 R 中的每日数据