How to use a predefined split for RandomizedSearchCV

Posted 2023-03-12

Title: How to use predefined split for RandomizedSearchCV
Asked: 2020-06-28 16:08:46
Question:

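I am trying to regularize my random forest regressor using RandomizedSearchCV. With RandomizedSearchCV the train and test sets are not specified explicitly, and I need to be able to specify them so that I can preprocess them after the split. I then found this helpful QnA and this one. But I still don't know how to do it because, in my case, I am using cross-validation. I have tried appending my train and test sets from the cross-validation, but it doesn't work. It raises ValueError: could not broadcast input array from shape (1824,9) into shape (1824), which refers to my X_test.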

x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()

imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
X_test[:,1:8] = imp.transform(X_test[:,1:8])

le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.transform(X_test[:,8])

train_indices = X_train, y_test
test_indices = X_test, y_test
my_test_fold = np.append(train_indices, test_indices)
pds = PredefinedSplit(test_fold=my_test_fold)

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rfr = RandomForestRegressor()
rfr_random = RandomizedSearchCV(estimator = rfr , 
                               param_distributions = random_grid,
                               n_iter = 100,
                               cv = pds, verbose=2, random_state=42, n_jobs = -1)  # I'll be filling the cv parameter with the predefined split
rfr_random.fit(X_train, y_train)

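Comments:

First of all, your for train_index, test_index in kf.split(x): loop makes no sense at all, because you overwrite the folds on every iteration. Put a print inside the loop to get a better idea of what you are doing. Second, as for your question, using cv = kf will achieve what you want. Fix the random seed for reproducibility.

Hi, thank you for your answer. But if I remove for train_index, test_index in kf.split(x):, I won't be able to preprocess my train and test sets, which has to be done after the split. I need to specify my train and test sets explicitly so that I can access them for preprocessing.

Answer 1: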

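I think your best option is to use a Pipeline together with a ColumnTransformer. A pipeline lets you specify several computation steps, including pre- and post-processing, and a column transformer applies different transformations to different columns. In your case it would look something like this: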

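from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# make_pipeline and make_column_transformer take their steps as positional arguments, not a list.
pipeline = make_pipeline(
    make_column_transformer(
        (SimpleImputer(strategy='median'), list(range(1, 8))),
        (make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            # OrdinalEncoder replaces LabelEncoder, which only encodes targets and cannot be a pipeline step.
            OrdinalEncoder(),
        ), [8]),  # a list keeps the selected column 2-D, as the imputer expects
    ),
    RandomForestRegressor()
)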

You then use this model as an ordinary estimator, with the usual fit and predict API. In particular, you pass it to the randomized search:

rfr_random = RandomizedSearchCV(estimator = pipeline, ...)

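The preprocessing steps will now be fitted on the training part of each split and applied to both parts before the random forest is fitted.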
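
To make the idea concrete, here is a minimal sketch of the whole search on synthetic data. The data, the column positions, the small parameter ranges, and the use of OrdinalEncoder are assumptions made for illustration; they are not taken from the question. Note that when the estimator is a pipeline built with make_pipeline, each key in param_distributions has to be prefixed with the lowercased step name followed by two underscores.

```python
import numpy as np
from scipy.stats import randint
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the real data (invented for illustration):
# columns 0-2 are numeric with some missing values, column 3 is categorical.
rng = np.random.default_rng(0)
n = 120
numeric = rng.normal(size=(n, 3))
numeric[rng.random(size=(n, 3)) < 0.1] = np.nan
X = np.empty((n, 4), dtype=object)
X[:, :3] = numeric
X[:, 3] = rng.choice(["conventional", "organic"], size=n)
y = rng.normal(size=n)

# Preprocessing lives inside the estimator, so it is re-fitted on the
# training part of every CV split and never sees the held-out part.
preprocess = make_column_transformer(
    (SimpleImputer(strategy="median"), [0, 1, 2]),
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    ), [3]),
)
pipeline = make_pipeline(preprocess, RandomForestRegressor(random_state=0))

# Keys are "<step name>__<parameter>"; make_pipeline names the forest step
# "randomforestregressor".
param_distributions = {
    "randomforestregressor__n_estimators": randint(10, 50),
    "randomforestregressor__max_depth": [3, 5, None],
    "randomforestregressor__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=5,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

With this layout there is no need to hand preprocessed train and test arrays to the search: passing the raw X and y is enough, and cv can be any splitter, including a KFold with a fixed random_state.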

This will almost certainly need further tweaking for your data, but hopefully you get the idea.
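
As for the literal question of a predefined split: PredefinedSplit expects test_fold to hold one integer per sample, giving the index of the test fold that sample belongs to, with -1 meaning "never in a test set". The ValueError in the question most likely comes from passing the data arrays themselves to np.append instead of such a fold index per row. The following is a minimal sketch on a toy array; the array sizes and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, PredefinedSplit

X = np.arange(20).reshape(10, 2)  # toy data: 10 samples, 2 features

# One fixed train/validation split: the first 7 rows are always training
# rows (-1), the last 3 rows form test fold 0.
test_fold = np.array([-1] * 7 + [0] * 3)
pds = PredefinedSplit(test_fold=test_fold)
for train_idx, test_idx in pds.split():
    print("train:", train_idx, "test:", test_idx)

# The same mechanism can reproduce a KFold: record, for every row,
# the number of the fold in which it is held out.
kf = KFold(n_splits=5)
fold_of_row = np.empty(len(X), dtype=int)
for fold, (_, test_idx) in enumerate(kf.split(X)):
    fold_of_row[test_idx] = fold
pds_kf = PredefinedSplit(test_fold=fold_of_row)
print(pds_kf.get_n_splits())
```

Either object can be passed as cv to RandomizedSearchCV. The search is then fitted on the full X and y, because the splitter only supplies row indices and the search slices the data itself.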
