How to solve several independent time series at the same time using scikit linear regression model
Posted: 2016-03-29 22:57:11
Question:
I am trying to predict several independent time series simultaneously with the sklearn linear regression model, but I can't seem to get it right.
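My data is organized as follows: Xn is a matrix in which each row contains a forecast window of 4 observations, and yn holds the target value for each row of Xn.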
import numpy as np
# training data
X1=np.array([[-0.31994,-0.32648,-0.33264,-0.33844],[-0.32648,-0.33264,-0.33844,-0.34393],[-0.33264,-0.33844,-0.34393,-0.34913],[-0.33844,-0.34393,-0.34913,-0.35406],[-0.34393,-0.34913,-.35406,-0.35873],[-0.34913,-0.35406,-0.35873,-0.36318],[-0.35406,-0.35873,-0.36318,-0.36741],[-0.35873,-0.36318,-0.36741,-0.37144],[-0.36318,-0.36741,-0.37144,-0.37529],[-0.36741,-.37144,-0.37529,-0.37896],[-0.37144,-0.37529,-0.37896,-0.38069],[-0.37529,-0.37896,-0.38069,-0.38214],[-0.37896,-0.38069,-0.38214,-0.38349],[-0.38069,-0.38214,-0.38349,-0.38475],[-.38214,-0.38349,-0.38475,-0.38593],[-0.38349,-0.38475,-0.38593,-0.38887]])
X2=np.array([[-0.39265,-0.3929,-0.39326,-0.39361],[-0.3929,-0.39326,-0.39361,-0.3931],[-0.39326,-0.39361,-0.3931,-0.39265],[-0.39361,-0.3931,-0.39265,-0.39226],[-0.3931,-0.39265,-0.39226,-0.39193],[-0.39265,-0.39226,-0.39193,-0.39165],[-0.39226,-0.39193,-0.39165,-0.39143],[-0.39193,-0.39165,-0.39143,-0.39127],[-0.39165,-0.39143,-0.39127,-0.39116],[-0.39143,-0.39127,-0.39116,-0.39051],[-0.39127,-0.39116,-0.39051,-0.3893],[-0.39116,-0.39051,-0.3893,-0.39163],[-0.39051,-0.3893,-0.39163,-0.39407],[-0.3893,-0.39163,-0.39407,-0.39662],[-0.39163,-0.39407,-0.39662,-0.39929],[-0.39407,-0.39662,-0.39929,-0.4021]])
# target values
y1=np.array([-0.34393,-0.34913,-0.35406,-0.35873,-0.36318,-0.36741,-0.37144,-0.37529,-0.37896,-0.38069,-0.38214,-0.38349,-0.38475,-0.38593,-0.38887,-0.39184])
y2=np.array([-0.3931,-0.39265,-0.39226,-0.39193,-0.39165,-0.39143,-0.39127,-0.39116,-0.39051,-0.3893,-0.39163,-0.39407,-0.39662,-0.39929,-0.4021,-0.40506])
The normal procedure for a single time series, which works as expected, is as follows:
from sklearn.linear_model import LinearRegression
# train the 1st half, predict the 2nd half
half = len(y1) // 2 # or y2, as they have the same length; // keeps the slice index an integer
LR = LinearRegression()
LR.fit(X1[:half], y1[:half])
pred = LR.predict(X1[half:])
r_2 = LR.score(X1[half:],y1[half:])
But how can the linear regression model be applied to several independent time series at the same time? I tried the following:
y_stack = np.vstack((y1[None],y2[None]))
X_stack = np.vstack((X1[None],X2[None]))
print 'y1 shape:',y1.shape, 'X1 shape:',X1.shape
print 'y_stack shape:',y_stack.shape, 'X_stack:',X_stack.shape
y1 shape: (16,) X1 shape: (16, 4)
y_stack shape: (2, 16) X_stack: (2, 16, 4)
But fitting the linear model then fails:
LR.fit(X_stack[:,half:],y_stack[:,half:])
The error states that the number of dimensions is higher than expected:
C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
394 if not allow_nd and array.ndim >= 3:
395 raise ValueError("Found array with dim %d. %s expected <= 2."
--> 396 % (array.ndim, estimator_name))
397 if force_all_finite:
398 _assert_all_finite(array)
ValueError: Found array with dim 3. Estimator expected <= 2.
Any suggestions or hints are much appreciated.
Update
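I could use a for loop, but since n is actually 10000 or more, I was hoping to find a solution based on array operations, as these are the explicit strength of numpy, scipy and, hopefully, sklearn.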
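For reference, the loop-based baseline mentioned in the update could look roughly like the sketch below. It is an illustration only: `fit_predict_loop` is a hypothetical helper, the data are synthetic stand-ins rather than the series above, and plain `np.linalg.lstsq` with an appended column of ones is assumed in place of sklearn's `LinearRegression` to keep the snippet dependency-free.

```python
import numpy as np

def fit_predict_loop(X_stack, y_stack, half):
    """Fit one least-squares model per series on the first `half` rows
    and predict the remaining rows (hypothetical helper, not from the post)."""
    preds = []
    for X, y in zip(X_stack, y_stack):
        # Append a column of ones so the last coefficient acts as the intercept.
        A_train = np.column_stack([X[:half], np.ones(half)])
        A_test = np.column_stack([X[half:], np.ones(len(X) - half)])
        coef, _, _, _ = np.linalg.lstsq(A_train, y[:half], rcond=None)
        preds.append(A_test @ coef)
    return np.array(preds)

# Synthetic stand-in data: 3 series, 16 windows of 4 observations each,
# generated from an exact linear relation so the fit can be checked.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3, 16, 4))
w_demo = rng.normal(size=(3, 4))
y_demo = np.einsum('ijk,ik->ij', X_demo, w_demo) + 0.5
pred_demo = fit_predict_loop(X_demo, y_demo, half=8)
```

The cost of this approach is one Python-level iteration and one solver call per series, which is what makes it unattractive for n around 10000.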
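Comments:
Why can't you treat the data as one set of dependent and independent variables?
@Riyaz My data, i.e. the individual time series, are not correlated with each other, which I would expect has to be the case when treating them as one set?
Yes, regression does rely on correlation. Can't you build two separate models then?
@Riyaz In this post I use two as an example, but in reality n can be in the range of 10000 or more.
Possible duplicate of ***.com/q/30442377/1461210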
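Answer 1:
@ali_m I don't think this is a duplicate question, although the two are partially related. And yes, it is possible to fit and predict several time series simultaneously with a model similar to sklearn's linear regression.
I created a new class, LinearRegression_Multi: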
import numpy as np

class LinearRegression_Multi:
    def stacked_lstsq(self, L, b, rcond=1e-10):
        """
        Solve L x = b via SVD least squares, cutting off small singular values.
        L is an array of shape (..., M, N) and b of shape (..., M).
        Returns x of shape (..., N).
        """
        u, s, v = np.linalg.svd(L, full_matrices=False)
        s_max = s.max(axis=-1, keepdims=True)
        s_min = rcond*s_max
        # invert only the singular values above the cutoff; the rest stay zero
        inv_s = np.zeros_like(s)
        inv_s[s >= s_min] = 1/s[s>=s_min]
        x = np.einsum('...ji,...j->...i', v,
                      inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
        return np.conj(x, x)

    def center_data(self, X, y):
        """ Centers each series to have mean zero along the sample axis (axis 1).
        X has shape (n_series, n_samples, n_features), y has shape (n_series, n_samples).
        """
        # center X
        X_mean = np.average(X,axis=1)
        X_std = np.ones(X.shape[0::2])  # shape (n_series, n_features), no scaling applied
        X = X - X_mean[:,None,:]
        # center y
        y_mean = np.average(y,axis=1)
        y = y - y_mean[:,None]
        return X, y, X_mean, y_mean, X_std

    def set_intercept(self, X_mean, y_mean, X_std):
        """ Calculate the intercept_ of each series.
        """
        self.coef_ = self.coef_ / X_std # not really necessary, X_std is all ones
        self.intercept_ = y_mean - np.einsum('ij,ij->i',X_mean,self.coef_)

    def scores(self, y_pred, y_true ):
        """
        The coefficient R^2 is defined as (1 - u/v), where u is the residual
        sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
        sum of squares ((y_true - y_true.mean()) ** 2).sum().
        Returns one R^2 value per series.
        """
        u = ((y_true - y_pred) ** 2).sum(axis=-1)
        v = ((y_true - y_true.mean(axis=-1)[None].T) ** 2).sum(axis=-1)
        r_2 = 1 - u/v
        return r_2

    def fit(self,X, y):
        """ Fit one linear model per series.
        """
        # get coefficients by applying linear regression on stack
        X_, y, X_mean, y_mean, X_std = self.center_data(X, y)
        self.coef_ = self.stacked_lstsq(X_, y)
        self.set_intercept(X_mean, y_mean, X_std)

    def predict(self, X):
        """Predict using the linear model of each series.
        """
        return np.einsum('ijx,ix->ij',X,self.coef_) + self.intercept_[None].T
It can be applied as follows, using the same variables declared in the question:
LR_Multi = LinearRegression_Multi()
LR_Multi.fit(X_stack[:,:half], y_stack[:,:half])
y_stack_pred = LR_Multi.predict(X_stack[:,half:])
R2 = LR_Multi.scores(y_stack_pred, y_stack[:,half:])
The R^2 values for the multiple time series are:
array([ 0.91262442, 0.67247516])
This does indeed match what the standard sklearn linear regression gives when each series is fitted separately:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(X1[:half], y1[:half])
R2_1 = LR.score(X1[half:],y1[half:])
LR.fit(X2[:half], y2[:half])
R2_2 = LR.score(X2[half:],y2[half:])
print R2_1, R2_2
0.912624422097 0.67247516054
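As a cross-check of the stacked idea, here is a minimal alternative sketch that is not the class above: it appends a column of ones for the intercept and relies on `np.linalg.pinv` operating on the last two axes of a stack. The names `fit_stacked` and `predict_stacked` are hypothetical, the data are synthetic, and on badly conditioned inputs the result may differ from the SVD routine above because the cutoff for small singular values is not the same.

```python
import numpy as np

def fit_stacked(X, y):
    """X: (n_series, n_samples, n_features), y: (n_series, n_samples).
    Returns coef of shape (n_series, n_features) and intercept of shape (n_series,)."""
    ones = np.ones(X.shape[:2] + (1,))
    A = np.concatenate([X, ones], axis=-1)   # add the intercept column
    # pinv(A) has shape (n_series, n_features + 1, n_samples)
    beta = np.einsum('inm,im->in', np.linalg.pinv(A), y)
    return beta[:, :-1], beta[:, -1]

def predict_stacked(X, coef, intercept):
    """Apply each series' own coefficients to its own rows."""
    return np.einsum('ijk,ik->ij', X, coef) + intercept[:, None]

# Synthetic stand-in data: 5 noisy series, 20 samples, 4 features each.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5, 20, 4))
y_demo = rng.normal(size=(5, 20))
coef, intercept = fit_stacked(X_demo, y_demo)
y_hat = predict_stacked(X_demo, coef, intercept)
```

For well-conditioned problems this should agree with fitting each series separately, which is easy to verify against `np.linalg.lstsq` on a handful of series before trusting it on all of them.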
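Answer 2:
If you need to build separate models, there is no way to use the power of numpy to improve performance, because you have many different tasks. The only thing you can do is run them concurrently in different threads (making use of your CPU's multiple cores), or even split the computation across different machines.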
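The concurrent variant described here could be sketched as follows. This is an assumption-laden illustration: `fit_one` and `fit_parallel` are hypothetical names, `np.linalg.lstsq` stands in for the sklearn model, and the data are synthetic. Whether threads give any speedup depends on the problem size and on how much time is spent inside the numerical library, so it should be measured rather than assumed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_one(args):
    """Least-squares fit of a single series; the last coefficient is the intercept."""
    X, y = args
    A = np.column_stack([X, np.ones(len(X))])
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def fit_parallel(X_stack, y_stack, max_workers=4):
    """Fit every series independently in a thread pool (results keep input order)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return np.array(list(pool.map(fit_one, zip(X_stack, y_stack))))

# Synthetic stand-in data: 6 series following exact linear relations.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(6, 12, 4))
beta_true = rng.normal(size=(6, 5))
y_demo = np.einsum('ijk,ik->ij', X_demo, beta_true[:, :4]) + beta_true[:, 4:]
beta_fit = fit_parallel(X_demo, y_demo)
```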
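If you believe all of the data fit the same model, then the obvious solution is to merge all Xn and yn and learn on the merged data. That will certainly be faster than computing separate models.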
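Merging the data into a single model could look like the sketch below. It assumes every series shares one set of coefficients, uses a hypothetical helper `fit_pooled`, synthetic data, and `np.linalg.lstsq` in place of sklearn's `LinearRegression`.

```python
import numpy as np

def fit_pooled(X_stack, y_stack):
    """Stack all series row-wise and fit one shared linear model.
    X_stack: (n_series, n_samples, n_features), y_stack: (n_series, n_samples)."""
    X_all = X_stack.reshape(-1, X_stack.shape[-1])
    y_all = y_stack.reshape(-1)
    A = np.column_stack([X_all, np.ones(len(X_all))])  # intercept column
    beta, _, _, _ = np.linalg.lstsq(A, y_all, rcond=None)
    return beta[:-1], beta[-1]

# Synthetic stand-in data: 4 series generated from the same weights and offset.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(4, 16, 4))
w_true = np.array([0.5, -1.0, 2.0, 0.25])
b_true = -0.3
y_demo = X_demo @ w_true + b_true
coef, intercept = fit_pooled(X_demo, y_demo)
```

This only makes sense when the series really do follow a common relation; for unrelated series the pooled coefficients describe none of them well.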
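But the real issue is not computational performance, it is the result you want to obtain. If you need different models, you have no choice but to compute them separately. If you need a single model, just merge the data. Otherwise, if you compute separate models, you run into another problem: how to obtain the final parameters from all of those models.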