GridSearchCV on a working pipeline returns ValueError

Posted 2023-03-12

[Posted]: 2019-03-17 04:59:12 [Question]:

I am using GridSearchCV to find the best parameters for my pipeline.

My pipeline seems to work well. I can run:

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

and I get decent results.

But GridSearchCV clearly doesn't like something, and I can't figure out what.

My pipeline:

feats = FeatureUnion([('age', age),
                      ('education_num', education_num),
                      ('is_education_favo', is_education_favo),
                      ('is_marital_status_favo', is_marital_status_favo),
                      ('hours_per_week', hours_per_week),
                      ('capital_diff', capital_diff),
                      ('sex', sex),
                      ('race', race),
                      ('native_country', native_country)
                     ])

pipeline = Pipeline([
        ('adhocFC',AdHocFeaturesCreation()),
        ('imputers', KnnImputer(target = 'native-country', n_neighbors = 5)),
        ('features',feats),('clf',LogisticRegression())])

My grid search:

hyperparameters = {'imputers__n_neighbors': [5, 21, 41], 'clf__C': [1.0, 2.0]}

GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' , refit = False) #change n_jobs = 2, refit = False

GSCV.fit(X_train, y_train)
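
For reference, here is a minimal, self-contained sketch of the same pattern: a parameter grid written as a dict whose keys follow the `<step name>__<parameter>` convention. The synthetic data, the `scale` step and the `param_grid` values are assumptions made for illustration and are not the asker's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Two named steps; the names are what the grid keys refer to.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Keys are "<step name>__<parameter name>", values are the candidates to try.
param_grid = {"scale__with_mean": [True, False], "clf__C": [1.0, 2.0]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc", refit=False)
search.fit(X, y)

# 2 x 2 candidate combinations were evaluated.
print(len(search.cv_results_["params"]))
print(search.best_params_)
```

With `refit=False` the search still records `best_params_` for a single metric, but no `best_estimator_` is fitted.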

I get 11 warnings like this one:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
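
The warning is about assigning into a slice of a DataFrame. The asker's transformer code is not shown, so the following is only a sketch of the usual remedy on a made-up DataFrame (the column names are hypothetical): take an explicit copy of the slice, then assign through `.loc`.

```python
import pandas as pd

# Made-up data; not the asker's dataset.
df = pd.DataFrame({"age": [25, 40, 31], "hours": [40, 50, 20]})

# Take an explicit copy so later assignments clearly target a new object.
subset = df[df["age"] > 30].copy()

# Assign through .loc, as the warning text recommends.
subset.loc[:, "overtime"] = subset["hours"] > 40

print(subset)
```

The original `df` is left unchanged, which is usually what a transformer's `transform` method should guarantee anyway.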

And this is the error message:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:14: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in <module>()
      3 GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' , refit = False) #change n_jobs = 2, refit = False
      4
----> 5 GSCV.fit(X_train, y_train)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups)
    943             train/test set.
    944
--> 945         return self._fit(X, y, groups, ParameterGrid(self.param_grid))
    946
    947

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in _fit(self, X, y, groups, parameter_iterable)
    562
    563
--> 564           for parameters in parameter_iterable
    565           for train, test in cv_iter)
    566

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759
    760             else:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    606                 return False
    607             else:
--> 608                 self._dispatch(tasks)
    609                 return True
    610

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    569
    570
--> 571             job = self._backend.apply_async(batch, callback=cb)
    572
    573

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    107     def apply_async(self, func, callback=None):
    108         """Schedule a func to be run"""
--> 109         result = ImmediateResult(func)
    110         if callback:
    111             callback(result)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    324         # Don't delay the application, to avoid keeping the input
    325         # arguments in memory
--> 326         self.results = batch()
    327
    328     def get(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
    236
    237         else:
--> 238             estimator.fit(X_train, y_train, **fit_params)
    239
    240     except Exception as e:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    495         else:
    496             # fit method of arity 2 (supervised transformation)
--> 497             return self.fit(X, y, **fit_params).transform(X)
    498
    499

 in fit(self, X, y)
     16         self.ohe.fit(X_full)
     17         #Create a DataFrame that does not contain any null value, categ variables are OHE, each row has
---> 18         X_ohe_full = self.ohe.transform(X_full[~X[self.col].isnull()].drop(self.col, axis=1))
     19
     20         #Fit the classifier on the rows where col is null

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2057             return self._getitem_multilevel(key)
   2058         else:
-> 2059             return self._getitem_column(key)
   2060
   2061     def _getitem_column(self, key):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2064         # get column
   2065         if self.columns.is_unique:
-> 2066             return self._get_item_cache(key)
   2067
   2068         # duplicate columns & possible reduce dimensionality

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3550                 loc = indexer.item()
   3551             else:
-> 3552                 raise ValueError("cannot label index with a null key")
   3553
   3554         return self.iget(loc, fastpath=fastpath)

ValueError: cannot label index with a null key

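[Comments]:

Your definition of hyperparameters looks fine, and your instantiation of GridSearchCV looks correct. The problem may be related to your data. How did you create X_train, X_test and y_train? Could you post the full code used to create or import the data and build these three variables? That might give some clues about the problem.

[Answer 1]: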

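Without more information, I believe this happens because your X_train and y_train variables are pandas DataFrames, which the underlying scikit-learn code does not always handle well: a classifier's .fit method, for example, expects an array-like object.

When you pass in pandas DataFrames, they end up being indexed as if they were NumPy arrays, which is not reliable in pandas.

Try converting your training data to NumPy arrays: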

X_train_arr = X_train.to_numpy()
y_train_arr = y_train.to_numpy()
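
A further check that may be worth running, offered as a hypothesis that the source does not confirm: the traceback ends in `X[self.col]` with a null key, which suggests `self.col` was `None` inside the grid search. GridSearchCV fits clones of the pipeline, and `sklearn.base.clone` rebuilds each estimator from `get_params()`, which looks up attributes named after the `__init__` arguments. If `KnnImputer` stores its `target` argument under a different name such as `col`, a clone might not carry the value over, while a direct `pipeline.fit` would still work. The sketch below uses a hypothetical `ColumnTargetImputer` as a stand-in, since the real class is not shown, and illustrates the convention that survives cloning.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ColumnTargetImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the question's KnnImputer (source not shown)."""

    def __init__(self, target=None, n_neighbors=5):
        # Store every constructor argument unmodified under its own name;
        # get_params() and clone() look the values up by these names.
        self.target = target
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        # No-op fit; a real imputer would learn from X here.
        return self

    def transform(self, X):
        # Pass-through; a real imputer would fill missing values here.
        return X


imputer = ColumnTargetImputer(target="native-country", n_neighbors=5)
cloned = clone(imputer)

# The clone is a new object that keeps both constructor arguments.
print(cloned.get_params())
```

Running `clone(pipeline)` on the real pipeline and inspecting the `imputers` step in the same way would show whether the target column name is preserved.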
