Problem when branching a Sklearn Pipeline into a GridSearchCV

Posted: 2020-08-08 10:05:07

Question:

I'm trying to build a pipeline with my own functions. To do this, I inherited BaseEstimator and TransformerMixin from sklearn base and defined my own transform methods.

When I run pipeline.fit(X, y), it works fine.

The problem is when I try to create a GridSearchCV object with the pipeline. I get the following error: ValueError: operands could not be broadcast together with shapes (730,36) (228,) (730,36).

730 is the number of rows of the matrix X divided by cv = 2, the number of folds I chose for cross-validation in GridSearchCV.

I don't know how to debug this. I added some prints in the middle of my functions, and the results look strange.

I'm attaching the functions I created along with the pipeline. I'd be very glad if someone could help.

Here are the functions I created for the pipeline:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class MissingData(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None, strategies=("most_frequent", "mean")):
        print('Started MissingData')
        X_ = X.copy()

        # Categorical variables handling
        categorical_variables = list(X_.select_dtypes(include=['category', 'object']))
        imp_category = SimpleImputer(strategy=strategies[0])
        X_[categorical_variables] = pd.DataFrame(imp_category.fit_transform(X_[categorical_variables]))

        # Numeric variables handling
        numerical_variables = list(set(X_.columns) - set(categorical_variables))
        imp_numerical = SimpleImputer(strategy=strategies[1])
        X_[numerical_variables] = pd.DataFrame(imp_numerical.fit_transform(X_[numerical_variables]))
        print('Finished MissingData')

        print('Inf: ', X_.isnull().sum().sum())
        return X_

class OHEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def encode_and_drop_original_and_first_dummy(self, df, feature_to_encode):
        # drop_first=True takes care of the dummy variable trap
        dummies = pd.get_dummies(df[feature_to_encode], prefix=feature_to_encode, drop_first=True)
        res = pd.concat([df, dummies], axis=1)
        res = res.drop([feature_to_encode], axis=1)
        return res

    def transform(self, X, y=None, categorical_variables=None):
        X_ = X.copy()
        if categorical_variables is None:
            categorical_variables = list(X_.select_dtypes(include=['category', 'object']))
        print('Started Encoding')
        # Update X with the one-hot encoded version of all features in categorical_variables
        for feature_to_encode in categorical_variables:
            X_ = self.encode_and_drop_original_and_first_dummy(X_, feature_to_encode)
        print('Finished Encoding')
        print('Inf: ', X_.isnull().sum().sum())
        return X_
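As an aside, both transformers above learn their statistics inside transform rather than fit, so every call re-learns them from whatever fold is passed in. A minimal leakage-free variant of the imputation step (my own sketch, not the asker's code; the name MissingDataFixed is made up, and it assumes the frame contains both categorical and numeric columns) might look like:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class MissingDataFixed(BaseEstimator, TransformerMixin):
    """Impute categorical and numeric columns with statistics learned in fit."""

    def fit(self, X, y=None):
        # Split columns by dtype once, at fit time
        self.cat_cols_ = list(X.select_dtypes(include=['category', 'object']))
        self.num_cols_ = [c for c in X.columns if c not in self.cat_cols_]
        # Fit the imputers here, so transform reuses the same learned values
        self.cat_imputer_ = SimpleImputer(strategy='most_frequent').fit(X[self.cat_cols_])
        self.num_imputer_ = SimpleImputer(strategy='mean').fit(X[self.num_cols_])
        return self

    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.cat_cols_] = self.cat_imputer_.transform(X_[self.cat_cols_])
        X_[self.num_cols_] = self.num_imputer_.transform(X_[self.num_cols_])
        return X_
```

Because the imputers are fitted once in fit, transform applies the same learned statistics to every fold instead of refitting on each one.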

And here is the pipeline with GridSearchCV:

pca = PCA(n_components=10)
pipeline = Pipeline([('MissingData', MissingData()), ('OHEncode', OHEncode()) , 
          ('scaler', StandardScaler()) , ('pca', pca), ('rf', LinearRegression())])

parameters = {'pca__n_components': [5, 15, 30, 45, 64]}

grid = GridSearchCV(pipeline, param_grid=parameters, cv = 2)
grid.fit(X, y)

Finally, the full output, including my prints and the error:

Started MissingData
Finished MissingData
Inf:  57670
Started Encoding
Finished Encoding
Inf:  26280
Started MissingData
Finished MissingData
Inf:  0
Started Encoding
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\extmath.py:765: RuntimeWarning: invalid value encountered in true_divide
  updated_mean = (last_sum + new_sum) / updated_sample_count
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\extmath.py:706: RuntimeWarning: Degrees of freedom <= 0 for slice.
  result = op(x, *args, **kwargs)
C:\Users\menoci\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

  FitFailedWarning)
Finished Encoding
Inf:  0
Started MissingData
Finished MissingData
Inf:  57670
Started Encoding
Finished Encoding
Inf:  26280
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-f78b56dad89d> in <module>
     15 
     16 #pipeline.set_params(rf__n_estimators = 50)
---> 17 grid.fit(X, y)
     18 
     19 #rf_val_predictions = pipeline.predict(X)

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    710                 return results
    711 
--> 712             self._run_search(evaluate_candidates)
    713 
    714         # For multi-metric evaluation, store the best_index_, best_params_ and

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1151     def _run_search(self, evaluate_candidates):
   1152         """Search all candidates in param_grid"""
-> 1153         evaluate_candidates(ParameterGrid(self.param_grid))
   1154 
   1155 

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
    689                                for parameters, (train, test)
    690                                in product(candidate_params,
--> 691                                           cv.split(X, y, groups)))
    692 
    693                 if len(out) < 1:

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in __call__(self, iterable)
   1005                 self._iterating = self._original_iterator is not None
   1006 
-> 1007             while self.dispatch_one_batch(iterator):
   1008                 pass
   1009 

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    833                 return False
    834             else:
--> 835                 self._dispatch(tasks)
    836                 return True
    837 

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in _dispatch(self, batch)
    752         with self._lock:
    753             job_idx = len(self._jobs)
--> 754             job = self._backend.apply_async(batch, callback=cb)
    755             # A job can complete so quickly than its callback is
    756             # called before we get here, causing self._jobs to

~\AppData\Roaming\Python\Python37\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    207     def apply_async(self, func, callback=None):
    208         """Schedule a func to be run"""
--> 209         result = ImmediateResult(func)
    210         if callback:
    211             callback(result)

~\AppData\Roaming\Python\Python37\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    588         # Don't delay the application, to avoid keeping the input
    589         # arguments in memory
--> 590         self.results = batch()
    591 
    592     def get(self):

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in __call__(self)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

~\AppData\Roaming\Python\Python37\site-packages\joblib\parallel.py in <listcomp>(.0)
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    255             return [func(*args, **kwargs)
--> 256                     for func, args, kwargs in self.items]
    257 
    258     def __len__(self):

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    542     else:
    543         fit_time = time.time() - start_time
--> 544         test_scores = _score(estimator, X_test, y_test, scorer)
    545         score_time = time.time() - start_time - fit_time
    546         if return_train_score:

~\AppData\Roaming\Python\Python37\site-packages\sklearn\model_selection\_validation.py in _score(estimator, X_test, y_test, scorer)
    589         scores = scorer(estimator, X_test)
    590     else:
--> 591         scores = scorer(estimator, X_test, y_test)
    592 
    593     error_msg = ("scoring must return a number, got %s (%s) "

~\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_scorer.py in __call__(self, estimator, *args, **kwargs)
     87                                       *args, **kwargs)
     88             else:
---> 89                 score = scorer(estimator, *args, **kwargs)
     90             scores[name] = score
     91         return scores

~\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
    369 def _passthrough_scorer(estimator, *args, **kwargs):
    370     """Function that wraps estimator.score"""
--> 371     return estimator.score(*args, **kwargs)
    372 
    373 

~\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

~\AppData\Roaming\Python\Python37\site-packages\sklearn\pipeline.py in score(self, X, y, sample_weight)
    611         Xt = X
    612         for _, name, transform in self._iter(with_final=False):
--> 613             Xt = transform.transform(Xt)
    614         score_params = {}
    615         if sample_weight is not None:

~\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\_data.py in transform(self, X, copy)
    804         else:
    805             if self.with_mean:
--> 806                 X -= self.mean_
    807             if self.with_std:
    808                 X /= self.scale_

ValueError: operands could not be broadcast together with shapes (730,36) (228,) (730,36) 

Comments:

Each of your transformers produces an array with different dimensions. So I suggest you take each result independently and check the output dimensions (e.g. x = MissingData() and then x.fit(...)).

Thanks for the reply, @Ghanem. If that were the case, pipeline.fit on its own shouldn't work either, right? But it does. Performing the transformations separately also works: X = MissingData().transform(X); X = OHEncode().transform(X); X = StandardScaler().fit_transform(X); X = pca.fit_transform(X); rf = LinearRegression(); rf.fit(X, y). What do you recommend in this case? Thanks!

It's hard to tell from the information given. I suggest you try the following: 1) move PCA(n_components=10) inside the pipeline and check whether it works; 2) remove StandardScaler and run the GridSearch again, then remove OHEncode and test.

The problem is indeed in OHEncode. It has nothing to do with PCA or StandardScaler. I think I know the reason: in OHEncode I one-hot encode all the categorical features. The problem is that, since cross-validation trains on only part of the data, some categorical values may not appear in the training fold and therefore never get encoded, so things break when we try to predict on them. Do you have a suggestion for how to handle this? I'm probably not the first person to face it. Should I drop the Pipeline for this part of the processing?

Great, now the problem is clear. I suggest you edit your question and add these final details about the error, for better archiving ;).

Answer 1:

First, I hope you're using the OneHotEncoder (OHE) class from sklearn. Then, define an OHE object in OHEncode's constructor and fit it on all the categorical values you have (to make them "visible" in every GridSearch iteration). Then, in OHEncode's transform function, apply the transformation using that OHE object.

Don't put the OHE object in the fit function, because then you'll hit the same error: at every GridSearch iteration, both the fit and transform functions are applied.
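A sketch along those lines (my own variant, not the answer's exact code: instead of fitting the encoder in the constructor on all the data, it fits in fit but passes handle_unknown='ignore', so categories unseen in a training fold don't change the output width; the name OHEncodeFixed and the generated column names are made up):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class OHEncodeFixed(BaseEstimator, TransformerMixin):
    """One-hot encoding whose column set is frozen when fit is called."""

    def fit(self, X, y=None):
        self.categorical_ = list(X.select_dtypes(include=['category', 'object']))
        # Categories unseen at fit time become all-zero rows instead of
        # growing the matrix, so every CV fold yields the same width.
        self.ohe_ = OneHotEncoder(handle_unknown='ignore')
        self.ohe_.fit(X[self.categorical_])
        return self

    def transform(self, X, y=None):
        dummies = self.ohe_.transform(X[self.categorical_]).toarray()
        dummy_df = pd.DataFrame(dummies, index=X.index)
        dummy_df.columns = ['ohe_%d' % i for i in range(dummies.shape[1])]
        return pd.concat([X.drop(columns=self.categorical_), dummy_df], axis=1)
```

Since the encoder is fitted once per training fold and frozen, the scaler downstream always sees the same number of columns at score time.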

Comments:

Good idea. One problem, though: sklearn's OneHotEncoder doesn't handle missing values. And I'm starting to think I don't really understand how Pipeline works. When I tried your suggestion I ran into the following. This works: X_ = MissingData().transform(X); pipeline = Pipeline([('OHEncode', OHEncode(X_)), ('scaler', StandardScaler()), ('reduce_dim', PCA()), ('rf', RandomForestRegressor(random_state=1))]) — but the following does not: X_ = MissingData().transform(X); pipeline = Pipeline([('missingdata', MissingData()), ('OHEncode', OHEncode(X_)), ('scaler', StandardScaler()), ('reduce_dim', PCA()), ('rf', RandomForestRegressor(random_state=1))]). If I've already imputed the missing values, adding MissingData to my pipeline shouldn't make any difference, yet I get the error: 'Input contains NaN'.

Impute the missing data with the sklearn imputer as well, here it is: scikit-learn.org/stable/modules/generated/… — create a pipeline inside OHEncode's constructor with both OneHotEncoder and SimpleImputer, fit only that, and apply the transform later.

I'm already using SimpleImputer. Why should I do both one-hot encoding and imputing in the OHEncode constructor? Do you have any idea why my approach doesn't work? As far as I've tested, my MissingData() function works fine; it always returns data without any missing values.

I didn't notice that in your code. Did you try putting the OneHotEncoder in the constructor, following my solution?
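The column-mismatch mechanism identified in the comment thread can be reproduced in a few lines: pd.get_dummies infers the dummy columns from whatever values happen to be present, so two CV folds with different category sets produce differently shaped matrices (toy data, for illustration only):

```python
import pandas as pd

# Hypothetical toy column: the category 'c' only appears in the second half,
# mimicking a CV split where train and test folds see different category sets.
full = pd.Series(['a', 'b', 'a', 'c'], name='color')
train, test = full.iloc[:2], full.iloc[2:]

train_dummies = pd.get_dummies(train, prefix='color', drop_first=True)
test_dummies = pd.get_dummies(test, prefix='color', drop_first=True)

# The two folds produce different dummy columns, so a scaler fitted on the
# train fold cannot broadcast over the test fold's matrix.
print(list(train_dummies.columns))  # ['color_b']
print(list(test_dummies.columns))   # ['color_c']
```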
