Problem when branching an Sklearn Pipeline into a GridSearchCV
Posted: 2020-08-08 10:05:07

I am trying to build a pipeline with my own functions. To do so, I inherited BaseEstimator and TransformerMixin from sklearn.base and defined my own transform methods.
When I run pipeline.fit(X, y), it works fine.
The problem is when I try to create a GridSearchCV object with the pipeline. I get the following error: ValueError: operands could not be broadcast together with shapes (730,36) (228,) (730,36).
730 is just the number of rows of the matrix X divided by 'cv' = 2, the number of folds I chose for the cross-validation in the GridSearchCV.
I have no idea how to debug this. I tried some prints in the middle of my functions and the results look odd.
I am attaching the functions I created as well as the pipeline. I would be really happy if someone could help.
Here are the functions I created for the pipeline:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
import pandas as pd

class MissingData(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None, strategies=("most_frequent", "mean")):
        print('Started MissingData')
        X_ = X.copy()
        # Categorical variables handling
        categorical_variables = list(X_.select_dtypes(include=['category', 'object']))
        imp_category = SimpleImputer(strategy=strategies[0])
        X_[categorical_variables] = pd.DataFrame(imp_category.fit_transform(X_[categorical_variables]))
        # Numeric variables handling
        numerical_variables = list(set(X_.columns) - set(categorical_variables))
        imp_numerical = SimpleImputer(strategy=strategies[1])
        X_[numerical_variables] = pd.DataFrame(imp_numerical.fit_transform(X_[numerical_variables]))
        print('Finished MissingData')
        print('Inf: ', X_.isnull().sum().sum())
        return X_
class OHEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def encode_and_drop_original_and_first_dummy(self, df, feature_to_encode):
        # drop_first=True takes care of the dummy variable trap
        dummies = pd.get_dummies(df[feature_to_encode], prefix=feature_to_encode, drop_first=True)
        res = pd.concat([df, dummies], axis=1)
        res = res.drop([feature_to_encode], axis=1)
        return res

    def transform(self, X, y=None, categorical_variables=None):
        X_ = X.copy()
        if categorical_variables is None:
            categorical_variables = list(X_.select_dtypes(include=['category', 'object']))
        print('Started Encoding')
        # Update X_ with the one-hot encoded version of all features in categorical_variables
        for feature_to_encode in categorical_variables:
            X_ = self.encode_and_drop_original_and_first_dummy(X_, feature_to_encode)
        print('Finished Encoding')
        print('Inf: ', X_.isnull().sum().sum())
        return X_
Here is the pipeline with the GridSearchCV:
pca = PCA(n_components=10)
pipeline = Pipeline([('MissingData', MissingData()), ('OHEncode', OHEncode()),
                     ('scaler', StandardScaler()), ('pca', pca), ('rf', LinearRegression())])

parameters = {'pca__n_components': [5, 15, 30, 45, 64]}

grid = GridSearchCV(pipeline, param_grid=parameters, cv=2)
grid.fit(X, y)
Finally, here is the full output, including my prints and the error:
Started MissingData
Finished MissingData
Inf: 57670
Started Encoding
Finished Encoding
Inf: 26280
Started MissingData
Finished MissingData
Inf: 0
Started Encoding
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\extmath.py:765: RuntimeWarning: invalid value encountered in true_divide
updated_mean = (last_sum + new_sum) / updated_sample_count
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\extmath.py:706: RuntimeWarning: Degrees of freedom <= 0 for slice.
result = op(x, *args, **kwargs)
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
FitFailedWarning)
Finished Encoding
Inf: 0
Started MissingData
Finished MissingData
Inf: 57670
Started Encoding
Finished Encoding
Inf: 26280
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-67-f78b56dad89d> in <module>
15
16 #pipeline.set_params(rf__n_estimators = 50)
---> 17 grid.fit(X, y)
18
19 #rf_val_predictions = pipeline.predict(X)
~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
710 return results
711
--> 712 self._run_search(evaluate_candidates)
713
714 # For multi-metric evaluation, store the best_index_, best_params_ and
~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1151 def _run_search(self, evaluate_candidates):
1152 """Search all candidates in param_grid"""
-> 1153 evaluate_candidates(ParameterGrid(self.param_grid))
1154
1155
~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
689 for parameters, (train, test)
690 in product(candidate_params,
--> 691 cv.split(X, y, groups)))
692
693 if len(out) < 1:
~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in __call__(self, iterable)
1005 self._iterating = self._original_iterator is not None
1006
-> 1007 while self.dispatch_one_batch(iterator):
1008 pass
1009
~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
833 return False
834 else:
--> 835 self._dispatch(tasks)
836 return True
837
~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in _dispatch(self, batch)
752 with self._lock:
753 job_idx = len(self._jobs)
--> 754 job = self._backend.apply_async(batch, callback=cb)
755 # A job can complete so quickly than its callback is
756 # called before we get here, causing self._jobs to
~\AppData\Roaming\Python\Python37\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
207 def apply_async(self, func, callback=None):
208 """Schedule a func to be run"""
--> 209 result = ImmediateResult(func)
210 if callback:
211 callback(result)
~\AppData\Roaming\Python\Python37\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
588 # Don't delay the application, to avoid keeping the input
589 # arguments in memory
--> 590 self.results = batch()
591
592 def get(self):
~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in __call__(self)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in <listcomp>(.0)
254 with parallel_backend(self._backend, n_jobs=self._n_jobs):
255 return [func(*args, **kwargs)
--> 256 for func, args, kwargs in self.items]
257
258 def __len__(self):
~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
542 else:
543 fit_time = time.time() - start_time
--> 544 test_scores = _score(estimator, X_test, y_test, scorer)
545 score_time = time.time() - start_time - fit_time
546 if return_train_score:
~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py in _score(estimator, X_test, y_test, scorer)
589 scores = scorer(estimator, X_test)
590 else:
--> 591 scores = scorer(estimator, X_test, y_test)
592
593 error_msg = ("scoring must return a number, got %s (%s) "
~\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_scorer.py in __call__(self, estimator, *args, **kwargs)
87 *args, **kwargs)
88 else:
---> 89 score = scorer(estimator, *args, **kwargs)
90 scores[name] = score
91 return scores
~\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
369 def _passthrough_scorer(estimator, *args, **kwargs):
370 """Function that wraps estimator.score"""
--> 371 return estimator.score(*args, **kwargs)
372
373
~\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
114
115 # lambda, but not partial, allows help() to work with update_wrapper
--> 116 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
117 # update the docstring of the returned function
118 update_wrapper(out, self.fn)
~\AppData\Roaming\Python\Python37\site-packages\sklearn\pipeline.py in score(self, X, y, sample_weight)
611 Xt = X
612 for _, name, transform in self._iter(with_final=False):
--> 613 Xt = transform.transform(Xt)
614 score_params = {}
615 if sample_weight is not None:
~\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\_data.py in transform(self, X, copy)
804 else:
805 if self.with_mean:
--> 806 X -= self.mean_
807 if self.with_std:
808 X /= self.scale_
ValueError: operands could not be broadcast together with shapes (730,36) (228,) (730,36)
Comments:
Each of your transformers produces an array with different dimensions. So I suggest you run each one independently and check the output dimensions (e.g. x = MissingData() and then x.fit(...)).
Thanks for your reply, @Ghanem. If that were the case, pipeline.fit on its own shouldn't work either, right? But it does. Running the transformations separately works fine: X = MissingData().transform(X); X = OHEncode().transform(X); X = StandardScaler().fit_transform(X); X = pca.fit_transform(X); rf = LinearRegression(); rf.fit(X, y). What would you recommend in this case? Thanks!
Hard to tell from the information given. I suggest you try the following: 1) move PCA(n_components=10) inside the pipeline and check whether it works. 2) remove the StandardScaler, apply the GS again, and as a next step remove the OHEncode and test.
The problem is indeed in OHEncode. It has nothing to do with PCA or StandardScaler. I think I know the reason: in OHEncode I one-hot encode all the categorical features. The issue is that, since cross-validation only uses part of the data for training, it's possible that some categorical values don't appear in the training split and therefore don't get encoded, which causes a problem when we try to predict on them. Do you have any suggestion on how to handle this? I'm probably not the first person to face this issue. Should I give up doing this part of the preprocessing inside a Pipeline?
Good, now the problem is clear. I suggest you edit your question and add these final details about the error so it's better documented ;).
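Following up on the diagnosis in these comments, here is a minimal sketch of one common way to keep the one-hot encoding inside the pipeline so that every fold produces the same columns: a ColumnTransformer with OneHotEncoder(handle_unknown='ignore'). This is a hedged illustration, not code from the thread; it swaps the custom MissingData/OHEncode steps for built-in transformers, the step names are placeholders, and make_column_selector assumes scikit-learn >= 0.22.

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

categorical = make_column_selector(dtype_include=['category', 'object'])
numerical = make_column_selector(dtype_exclude=['category', 'object'])

preprocess = ColumnTransformer([
    # Impute then encode the categorical columns; handle_unknown='ignore' maps a
    # category unseen in the training fold to an all-zero row instead of failing.
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))]),
     categorical),
    # Impute the numeric columns.
    ('num', SimpleImputer(strategy='mean'), numerical),
])

pipeline = Pipeline([('prep', preprocess),
                     ('scaler', StandardScaler()),
                     ('pca', PCA(n_components=10)),
                     ('rf', LinearRegression())])

grid = GridSearchCV(pipeline, param_grid={'pca__n_components': [5, 15, 30, 45, 64]}, cv=2)
# grid.fit(X, y)

The grid over pca__n_components mirrors the one in the question; the 'sparse' keyword is called 'sparse_output' in newer scikit-learn releases.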
Answer 1:
First point, I hope you are using the OneHotEncoder (OHE) class from sklearn. Then, define an OHE object in the constructor of OHEncode and fit it with all the categorical values you have (to make them "visible" in every GridSearch iteration). Then, in the transform function of OHEncode, apply the transformation using that OHE object.
Don't put the OHE object in the fit function, because then you would run into the same error: at every GridSearch iteration, the fit and transform functions are applied.
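A minimal sketch of what this answer suggests, filling in details the answer leaves open with my own assumptions: the X_full constructor argument (assumed to be already imputed, e.g. MissingData().transform(X), since OneHotEncoder does not accept missing values), drop='first' to mimic get_dummies(drop_first=True), and the attribute names are all mine.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class OHEncode(BaseEstimator, TransformerMixin):
    def __init__(self, X_full, categorical_variables=None):
        # Keep constructor args as attributes so GridSearchCV can clone the step.
        self.X_full = X_full
        self.categorical_variables = categorical_variables
        self._cats = (categorical_variables if categorical_variables is not None
                      else list(X_full.select_dtypes(include=['category', 'object'])))
        # Fit the encoder once on the full data, so every category is already
        # known in every GridSearch fold.
        self._ohe = OneHotEncoder(drop='first', sparse=False)
        self._ohe.fit(X_full[self._cats])

    def fit(self, X, y=None):
        return self  # nothing to learn here; the encoder was fitted in __init__

    def transform(self, X, y=None):
        # get_feature_names was renamed get_feature_names_out in newer sklearn.
        encoded = pd.DataFrame(self._ohe.transform(X[self._cats]),
                               columns=self._ohe.get_feature_names(self._cats),
                               index=X.index)
        return pd.concat([X.drop(columns=self._cats), encoded], axis=1)

With this, the pipeline would be built roughly as OHEncode(MissingData().transform(X)) for the encoding step, which is essentially what is tried in the comments below.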
Comments:
Good idea. One problem though: the OneHotEncoder from sklearn doesn't handle missing values. And I'm starting to think I don't really understand how Pipeline works. When I try your suggestion I run into the following. This works: X_ = MissingData().transform(X); pipeline = Pipeline([('OHEncode', OHEncode(X_)), ('scaler', StandardScaler()), ('reduce_dim', PCA()), ('rf', RandomForestRegressor(random_state=1))]), but the following does not: X_ = MissingData().transform(X); pipeline = Pipeline([('missingdata', MissingData()), ('OHEncode', OHEncode(X_)), ('scaler', StandardScaler()), ('reduce_dim', PCA()), ('rf', RandomForestRegressor(random_state=1))]). If I have already imputed the missing values, adding MissingData to my pipeline shouldn't make any difference, yet I get the error: 'Input contains NaN'.
Also impute the missing data with the sklearn imputer, here: scikit-learn.org/stable/modules/generated/…
Create a pipeline in the constructor of OHEncode with both OneHotEncoder and SimpleImputer and only fit it there.. apply the transform later.
I'm already using SimpleImputer. Why should I do both the one-hot encoding and the imputing in the OHEncode constructor? Do you have any idea why my approach doesn't work? As far as I've tested, my MissingData() function works fine; it always returns data without any missing values.
I didn't notice that in your code. Did you try putting the OneHotEncoder inside the constructor? .. following my solution?