CV split for time series data
Posted: 2018-10-07 07:19:14
Question: I am currently working with time series data and trying to generate cross-validation sets for model evaluation. This is what I did:
# Splitting train & validation
df['date_posted'] = pd.to_datetime(df['date_posted'])
df_train = df[(df['date_posted'].dt.year > 2009) & (df['date_posted'].dt.year < 2014)].copy()
df_test = df[df['date_posted'].dt.year >= 2014].copy()
from sklearn.model_selection import GroupShuffleSplit
groups = df_train.groupby(df_train['date_posted'].dt.year).groups
X = df_train.drop(['short_description', 'is_exciting'], axis=1).copy()
y = df_train['is_exciting']
cv = GroupShuffleSplit().split(X, y, groups)
# Baseline model
from sklearn.model_selection import KFold
clf_lgbm = lgbm.LGBMClassifier(is_unbalance=True, random_state=0, n_jobs=-1)
# cv = KFold(n_splits=10, random_state=0)
results = cross_val_score(clf_lgbm, X, y, cv=cv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Here is the error traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-24-7091271a325a> in <module>()
6 # cv = KFold(n_splits=10, random_state=0)
7
----> 8 results = cross_val_score(clf_lgbm, X, y, cv=cv)
9
10 print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
~\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
340 n_jobs=n_jobs, verbose=verbose,
341 fit_params=fit_params,
--> 342 pre_dispatch=pre_dispatch)
343 return cv_results['test_score']
344
~\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)
192 X, y, groups = indexable(X, y, groups)
193
--> 194 cv = check_cv(cv, y, classifier=is_classifier(estimator))
195 scorers, _ = _check_multimetric_scoring(estimator, scoring=scoring)
196
~\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in check_cv(cv, y, classifier)
1913 "object (from sklearn.model_selection) "
1914 "or an iterable. Got %s." % cv)
-> 1915 return _CVIterableWrapper(cv)
1916
1917 return cv # New style cv objects are passed without any modification
~\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in __init__(self, cv)
1815 """Wrapper class for old style cv objects and iterables."""
1816 def __init__(self, cv):
-> 1817 self.cv = list(cv)
1818
1819 def get_n_splits(self, X=None, y=None, groups=None):
~\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
1201 to an integer.
1202 """
-> 1203 X, y, groups = indexable(X, y, groups)
1204 for train, test in self._iter_indices(X, y, groups):
1205 yield train, test
~\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
227 else:
228 result.append(np.array(X))
--> 229 check_consistent_length(*result)
230 return result
231
~\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
202 if len(uniques) > 1:
203 raise ValueError("Found input variables with inconsistent numbers of"
--> 204 " samples: %r" % [int(l) for l in lengths])
205
206
ValueError: Found input variables with inconsistent numbers of samples: [439599, 439599, 4]
X has shape (439599, 51) and y has shape (439599,).
Any help would be greatly appreciated.
Answer 1: You can check the requirements for the groups parameter in the documentation of GroupShuffleSplit:
groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set.
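The length of groups must equal the number of samples, and each value is the label of the group that the corresponding sample belongs to.
The output of this line: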
groups = df_train.groupby(df_train['date_posted'].dt.year).groups
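is a dictionary whose keys are the group labels and whose values are the indices of the rows belonging to that group. In your case it has 4 entries (one per year), which is where the 4 in the error message comes from. That is not what scikit-learn expects.
For example, for this data: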
Index A B #Columns
0 0 2
1 3 1
2 0 0
3 0 0
4 0 2
If you want the first three rows to belong to group1 and the last two rows to belong to group2, you need to pass groups as:
groups=['group1','group1','group1','group2','group2']
or
groups=[0, 0, 0, 1, 1]
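Note that the length here is 5, matching the number of rows in my data, and each value is the group that the row at that position belongs to.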
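The following is a minimal sketch of how such a per-row groups array behaves with GroupShuffleSplit on the five-row example above. The y values, n_splits, test_size and random_state are made up for illustration and are not from the original post.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# The five-row example: columns A and B.
X = np.array([[0, 2], [3, 1], [0, 0], [0, 0], [0, 2]])
y = np.array([0, 1, 0, 1, 0])       # illustrative labels (assumption)
groups = np.array([0, 0, 0, 1, 1])  # one group label per row

# With two groups and test_size=0.5, one whole group goes to each side.
splitter = GroupShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
splits = list(splitter.split(X, y, groups=groups))

for train_idx, test_idx in splits:
    print("train groups:", set(groups[train_idx]),
          "test groups:", set(groups[test_idx]))
```

Rows that share a group label never end up on both sides of the same split, which is the reason groups needs one entry per row.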
So in your case, you can convert the returned dict into a per-row array with the following code:
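import numpy as np
groups_proper = np.zeros(len(df_train))
# df_train keeps the index labels of df after filtering, so translate each
# label to its row position first (this assumes the index is unique).
for year, labels in groups.items():
    positions = df_train.index.get_indexer(labels)
    groups_proper[positions] = year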
Then pass it in:
cv = GroupShuffleSplit().split(X, y, groups=groups_proper)
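As a possible shortcut, the year column itself already has one value per row, so it can serve as groups without going through the dict. The sketch below runs on synthetic data: the feature column, the random labels, and the use of LogisticRegression in place of LGBMClassifier are assumptions made only so the example is self-contained. It also passes the splitter object together with groups= to cross_val_score instead of a pre-built generator.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit, cross_val_score

# Synthetic stand-in for df_train: dates spread over 2010-2013.
rng = np.random.default_rng(0)
n = 400
offsets = pd.to_timedelta(rng.integers(0, 4 * 365, n), unit="D")
df_train = pd.DataFrame({
    "date_posted": pd.to_datetime("2010-01-01") + offsets,
    "feature": rng.normal(size=n),         # hypothetical feature
    "is_exciting": rng.integers(0, 2, n),  # random labels, illustration only
})

X = df_train[["feature"]]
y = df_train["is_exciting"]

# One group label per row: the year in which the row was posted.
groups = df_train["date_posted"].dt.year.to_numpy()

# Hold out one of the four years in each split.
cv = GroupShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
results = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))
```

Note that GroupShuffleSplit picks the held-out years at random, so a split may train on later years and test on an earlier one. If chronological order matters for the evaluation, sklearn.model_selection.TimeSeriesSplit is worth considering instead.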