Custom sklearn transformer in Pipeline throwing IndexError for cross_validate but not when using GridSearchCV

Posted: 2019-01-12 16:21:16

Question:

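I created a custom transformer (TopQuantile()) using sklearn's TransformerMixin and BaseEstimator classes, shown below. It essentially runs np.percentile() or pd.DataFrame.quantile() on numpy or pandas input features/columns to find out which values in a feature fall within a user-specified quantile and which do not, then writes the count for each row to a new numpy/pandas column.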
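The counting idea described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the transformer from the question: the helper name `top_quantile_counts` and the toy array are hypothetical, and the special handling of majority-negative features is deliberately left out.

```python
import numpy as np

def top_quantile_counts(X, q=0.90):
    """Count, per row, how many values reach the q-quantile of their column.

    Hypothetical helper for illustration only; it ignores the
    majority-negative case that the full transformer handles.
    """
    # One threshold per column (axis=0 reduces over the rows)
    thresholds = np.percentile(X, q * 100, axis=0)
    # Boolean mask of "in the top quantile", summed across each row
    return (X >= thresholds).sum(axis=1)

# Toy data: 4 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [100.0, 5.0]])

counts = top_quantile_counts(X)
print(counts)
```

The thresholds are computed column by column, so the result has one count per sample regardless of how many rows are passed in.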

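The problem is that when I run my pipeline with cross_validate, it throws IndexError: index 10 is out of bounds for axis 1 with size 10. I have looked and looked, and this does not seem to make any sense: all the calculations in my transformer's fit() method only assume that the same number of features/columns is provided as input X, and it does not care how many rows there are (although the IndexError reports that axis = 1 does not have the expected size).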

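Now for the strangest part: when I run my pipeline inside GridSearchCV, it runs perfectly fine and gives me the output I expect! Why would cross_validate throw such a basic error, one that suggests an inherent flaw in my transformer, while GridSearchCV accepts it? Please help. Below are a copy of my transformer, the Pipeline I am using, the GridSearchCV call, and the cross_validate call (please note that I am using Python 2.7, as required by the course for which I am doing this project):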

Custom transformer:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class TopQuantile(BaseEstimator, TransformerMixin):
    '''
    Engineer a new feature using the top quantile values of a given set of features.

    For every value in those features, check to see if the value is within the top q-quantile
    of that feature. If so, increase the count for that sample by +1. New feature is an integer count
    of how often each sample had a value in the top q-quantile of the specified features.

    This class's fit(), transform(), and fit_transform() methods all assume a pandas DataFrame as input.
    '''

    def __init__(self, new_feature_name = 'top_finance', feature_list = None, q = 0.90):
        '''
        Constructor for TopQuantile objects. 

        Parameters
        ----------
        new_feature_name: str. Name of the feature that will be added as a pandas DataFrame column
                            upon transformation. Only used if X is a DataFrame.

        feature_list: list of str or int.
            If X is a Dataframe: Names of feature columns that should be included in 
                                    the count of top quantile membership.
            If X is a 2D numpy array: Integer positions for the columns to be used

        q: float. Corresponds to the percentage quantile you want to be counting for. For example,
            q = 0.90 looks at the 90% percentile (top decile).
        '''
        self.new_feature_name = new_feature_name
        self.feature_list = feature_list
        self.q = q

    def fit(self, X, y = None):
        '''
        Calculates the q-quantile properly both for features that are largely positive
        and ones that are largely negative (as DataFrame.quantile() does not do this correctly).
        For example, if most of a feature's data points are between (-1E5,0), the "top decile"
        should not be -100, it should be -1E4.

        Parameters
        ----------
        X: features DataFrame or numpy array, one feature per column
        y: labels DataFrame/numpy array, ignored
        '''


        if isinstance(X, pd.DataFrame):
            #Is self.feature_list something other than a list of strings?
            if not isinstance(self.feature_list[0], str):
                raise TypeError('feature_list is not a list of strings')

            #Majority-negative features need to check df.quantile(1-q)
                #in order to be using correct quantile value
            pos = X.loc[:,self.feature_list].quantile(self.q)
            neg = X.loc[:,self.feature_list].quantile(1.0-self.q)

            #Replace negative quantile values of neg within pos to create 
            #merged Series with proper quantile values for majority-positive
            #and majority-negative features
            pos.loc[neg < 0] = neg.loc[neg < 0]
            self.quants = pos

        #Are features a NumPy array?
        elif isinstance(X, np.ndarray):
            #Is self.feature_list something other than a list of int?
            if not isinstance(self.feature_list[0], int):
                raise TypeError('feature_list is not a list of integers')

            #Majority-negative features need to check df.quantile(1-q)
                #in order to be using correct quantile value
            pos = np.percentile(X[:, self.feature_list], self.q * 100, axis = 0)
            neg = np.percentile(X[:, self.feature_list], (1.0 - self.q) * 100, axis = 0)

            #It's easier to work in a DataFrame, and now we don't need to know column names,
            #so let's switch over to a DataFrame for a moment
            #pos = pd.DataFrame(pos)
            #neg = pd.DataFrame(neg)

            #Replace negative quantile values of neg within pos to create 
            #merged Series with proper quantile values for majority-positive
            #and majority-negative features
            pos[neg < 0] = neg[neg < 0]
            self.quants = pos

        else:
            raise TypeError('Features need to be either pandas DataFrame or numpy array')

        #Return the fitted transformer, as the sklearn estimator convention expects
        return self

    def transform(self, X):
        '''
        Using quantile information from fit(), adds a new feature to X that contains integer counts
        of how many times a sample had a value that was in the top q-quantile of its feature, limited
        to only features in self.feature_list

        Parameters
        ----------
        X: features DataFrame or numpy array, one feature per column

        Returns
        ----------
        If X is a DataFrame: Input DataFrame with additional column for new_feature, called self.new_feature_name

        If X is a 2D numpy array: same as for the DataFrame case, except is a numpy array with no column names

        '''
        #Change all values in X to True or False if they are or are not within the
            #top q-quantile
        if isinstance(X, pd.DataFrame):
            self.boolean = X.loc[:,self.feature_list].abs() >= self.quants.abs()

            #Sum across each row to produce the counts
            X[self.new_feature_name] = self.boolean.sum(axis = 1)


        elif isinstance(X, np.ndarray):
            self.boolean = np.absolute(X[:,self.feature_list]) >= np.absolute(self.quants)            
            X = np.vstack((X.T, np.sum(self.boolean, axis = 1))).T

        else:
            raise TypeError('Features need to be either pandas DataFrame or numpy array')    

        return X

    def fit_transform(self, X, y = None):
        '''
        Provides the identical output to running fit() and then transform() in one nice little package.

        Parameters
        ----------
        X: features DataFrame or 2D numpy array, one feature per column
        y: labels DataFrame, ignored
        '''

        self.fit(X, y)
        return self.transform(X)

Pipeline:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, RobustScaler
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np

#Suppress the warnings coming from GridSearchCV to reduce output messages
import warnings
import sklearn.exceptions

warnings.filterwarnings("ignore",category=sklearn.exceptions.UndefinedMetricWarning)

features = df.drop(columns = ['poi'])
labels = df['poi']

#--------------------------------- CROSS-VALIDATION -----------------------------------------
#Shuffled and stratified cross-validation binning for this tuning exercise
cv_10 = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state = 42)

#--------------------------------- IMPUTATION -----------------------------------------
#Imputation using the median of each feature
imp = Imputer(missing_values='NaN', strategy='median')

#--------------------------------- FEATURE ENGINEERING -----------------------------------------
#Feature Engineering with TopQuantile() to count the top quantile financial features
feats = ['salary', 'total_payments', 'bonus', 'total_stock_value', 'expenses', 
         'exercised_stock_options', 'other', 'restricted_stock']

#Since numpy needs the columns as integer positions instead of names...
feats_loc_list = []
for e in feats:
    feats_loc_list.append(features.columns.get_loc(e))

topQ = TopQuantile(feature_list = feats_loc_list)

#--------------------------------- FEATURE SCALING -----------------------------------------
#Feature Scaling via RobustScaler()
scaler = RobustScaler()

#--------------------------------- FEATURE SELECTION -----------------------------------------
#Feature Selection via SelectPercentile(f_classif, percentile = 75)
selector = SelectPercentile(score_func = f_classif, percentile = 75)

#--------------------------------- TUNING -----------------------------------------
#FeatureUnion to keep track of kNN and SVM model results
knn = KNeighborsClassifier()
knn_param_grid = {'kNN__n_neighbors': range(1,21,1), 'kNN__weights': ['uniform', 'distance'],
                  'kNN__p': [1,2]}

#Hyperparameter tuning

knn_pipe = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                    ('select', selector), ('kNN', knn)])

GridSearchCV call:

knn_gs = GridSearchCV(knn_pipe, knn_param_grid, scoring = ['precision', 'recall', 'f1'],
                      cv = cv_10, refit = 'f1', return_train_score = False)
knn_gs.fit(features, labels)

cross_validate call:

knn_pipe_tuned = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                    ('select', selector), ('kNN', knn_gs.best_estimator_)])


cv_1000 = StratifiedShuffleSplit(n_splits=1000, test_size=0.2, random_state=42)

from sklearn.model_selection import cross_validate
knn_scores = cross_validate(knn_pipe_tuned, features, labels, groups=None, 
                            scoring=['precision', 'recall', 'f1'], cv=cv_1000)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-147-4f04d5e63a0b> in <module>()
     12 from sklearn.model_selection import cross_validate
     13 knn_scores = cross_validate(knn_pipe_tuned, features, labels, groups=None, 
---> 14                             scoring=['precision', 'recall', 'f1'], cv=cv_1000)
     15 
     16 knn_cv_results = pd.DataFrame(knn_scores)

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)
    204             fit_params, return_train_score=return_train_score,
    205             return_times=True)
--> 206         for train, test in cv.split(X, y, groups))
    207 
    208     if return_train_score:

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627 

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590 

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333 
    334     def get(self):

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
    459 
    460     except Exception as e:

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    248         Xt, fit_params = self._fit(X, y, **fit_params)
    249         if self._final_estimator is not None:
--> 250             self._final_estimator.fit(Xt, y, **fit_params)
    251         return self
    252 

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    246             This estimator
    247         """
--> 248         Xt, fit_params = self._fit(X, y, **fit_params)
    249         if self._final_estimator is not None:
    250             self._final_estimator.fit(Xt, y, **fit_params)

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit(self, X, y, **fit_params)
    211                 Xt, fitted_transformer = fit_transform_one_cached(
    212                     cloned_transformer, None, Xt, y,
--> 213                     **fit_params_steps[name])
    214                 # Replace the transformer of the step with the fitted
    215                 # transformer. This is necessary when loading the transformer

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
    360 
    361     def __call__(self, *args, **kwargs):
--> 362         return self.func(*args, **kwargs)
    363 
    364     def call_and_shelve(self, *args, **kwargs):

/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit_transform_one(transformer, weight, X, y, **fit_params)
    579                        **fit_params):
    580     if hasattr(transformer, 'fit_transform'):
--> 581         res = transformer.fit_transform(X, y, **fit_params)
    582     else:
    583         res = transformer.fit(X, y, **fit_params).transform(X)

<ipython-input-108-dfcab4b62582> in fit_transform(self, X, y)
    138         '''
    139 
--> 140         self.fit(X, y)
    141         return self.transform(X)

<ipython-input-108-dfcab4b62582> in fit(self, X, y)
     73             #Majority-negative features need to check df.quantile(1-q)
     74                 #in order to be using correct quantile value
---> 75             pos = np.percentile(X[:, self.feature_list], self.q * 100, axis = 0)
     76             neg = np.percentile(X[:, self.feature_list], (1.0 - self.q) * 100, axis = 0)
     77 

IndexError: index 10 is out of bounds for axis 1 with size 10

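Comments:

Only half related to your question, but by setting knn_pipe_tuned = knn_gs, or simply calling knn_scores = cross_validate(knn_gs, features, labels, groups=None, scoring=['precision', 'recall', 'f1'], cv=cv_1000), you would be doing nested cross-validation, which gives you unbiased CV results.

Answer 1: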

When you pass a pipeline to GridSearchCV, best_estimator_ is also a Pipeline object (regardless of whether you tuned only a single part of that pipeline or all of it).
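A minimal sketch of this behaviour, assuming a recent scikit-learn release rather than the Python 2.7 setup from the question. The toy data from make_classification and the two-step pipeline are stand-ins chosen for illustration, not the asker's data or steps.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, purely for illustration (not the dataset from the question)
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('kNN', KNeighborsClassifier())])
gs = GridSearchCV(pipe, {'kNN__n_neighbors': [1, 3, 5]}, cv=3)
gs.fit(X, y)

best = gs.best_estimator_
print(type(best).__name__)               # the refit estimator is the whole Pipeline
print([name for name, _ in best.steps])  # it still has every step of the original

# If only the tuned classifier is wanted, it can be pulled out by step name
tuned_knn = best.named_steps['kNN']
```

Only the last line yields a bare classifier; best_estimator_ itself already carries the preprocessing steps.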

So when you do this:

knn_pipe_tuned = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                ('select', selector), ('kNN', knn_gs.best_estimator_)])

you are actually doing this:

knn_pipe_tuned = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                ('select', selector), ('kNN', Pipeline([('impute', imp),           
                                                        ('engineer',topQ), 
                                                        ('scale', scaler),
                                                        ('select', selector), 
                                                        ('kNN', knn)]))])

So this will once again impute, engineer, scale, and select data that has already been through all of those steps. I am sure that is not what you want.
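One plausible reading of the traceback, given the nesting above: the outer select step drops columns, so the inner TopQuantile receives fewer columns than the integer positions in feature_list were computed for. The sketch below reproduces that failure mode with plain NumPy; the array shapes and the `feature_list` values are made up for illustration.

```python
import numpy as np

# 11 columns, so integer positions 0..10 are all valid
X = np.zeros((5, 11))
feature_list = [2, 10]

first_pass = X[:, feature_list]   # works on the full-width array

# Stand-in for the output of a column-dropping step such as a feature selector:
# only 10 columns remain, so position 10 no longer exists
X_reduced = X[:, :10]

try:
    X_reduced[:, feature_list]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

Because the positions were looked up once against the original feature columns, any step that changes the column count before the transformer runs can invalidate them.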

When running cross_validate, you only need to do this:

knn_pipe_tuned = knn_gs.best_estimator_
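As a rough end-to-end sketch of that fix, assuming a recent scikit-learn release and using toy data and a shortened pipeline in place of the ones from the question:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit,
                                     cross_validate)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, purely for illustration (not the dataset from the question)
X, y = make_classification(n_samples=80, n_features=5, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('kNN', KNeighborsClassifier())])
gs = GridSearchCV(pipe, {'kNN__n_neighbors': [1, 3, 5]}, scoring='f1', cv=3)
gs.fit(X, y)

# Pass the tuned pipeline directly; do not wrap it in another Pipeline
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores = cross_validate(gs.best_estimator_, X, y,
                        scoring=['precision', 'recall', 'f1'], cv=cv)

print(sorted(scores.keys()))
```

Passing gs itself instead of gs.best_estimator_ would rerun the grid search inside every split, which is the nested cross-validation mentioned in the comment on the question.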

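Comments:

Oh! I completely missed that. I had been assuming that best_estimator_ was only the estimator, not the whole pipeline, but what you say makes complete sense. And since the code now runs smoothly and produces results that appear to be of the proper scale, I am marking this as "done". Thanks for the help!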
