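Custom sklearn transformer in Pipeline throwing IndexError for cross_validate but not when using GridSearchCV
【Posted】: 2019-01-12 16:21:16
【Question】:
I created a custom transformer (TopQuantile()) using sklearn's TransformerMixin and BaseEstimator classes, shown below. It essentially runs np.percentile() or pd.DataFrame.quantile() on the numpy or pandas input features/columns to find which values in a feature fall within a user-specified quantile and which do not, then writes the per-row count to a new numpy/pandas column.
The problem is that when I run my pipeline with cross_validate, it throws IndexError: index 10 is out of bounds for axis 1 with size 10. I have looked at this over and over and it doesn't seem to make sense: all of the calculations in my transformer's fit() method only assume that the same number of features/columns is provided in the input X, and it doesn't care how many rows there are (even though the IndexError complains that axis = 1 (rows) doesn't have the expected count).
Now the strangest part: when I run my pipeline in GridSearchCV, it runs perfectly fine and gives the output I expect! Why would cross_validate throw such a basic error, one suggesting that my transformer has an inherent flaw, while GridSearchCV accepts it? Please help. Below are a copy of my transformer, the Pipeline I'm using, the GridSearchCV call, and the cross_validate call (note that I'm using Python 2.7, as required by the course I'm doing this project for):
Custom transformer: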
from sklearn.base import TransformerMixin, BaseEstimator
#Module-level imports so that np and pd are visible inside the methods
import numpy as np
import pandas as pd

class TopQuantile(BaseEstimator, TransformerMixin):
    '''
    Engineer a new feature using the top quantile values of a given set of features.
    For every value in those features, check to see if the value is within the top q-quantile
    of that feature. If so, increase the count for that sample by +1. New feature is an integer count
    of how often each sample had a value in the top q-quantile of the specified features.
    This class's fit(), transform(), and fit_transform() methods all assume a pandas DataFrame as input.
    '''

    def __init__(self, new_feature_name = 'top_finance', feature_list = None, q = 0.90):
        '''
        Constructor for TopQuantile objects.
        Parameters
        ----------
        new_feature_name: str. Name of the feature that will be added as a pandas DataFrame column
                        upon transformation. Only used if X is a DataFrame.
        feature_list: list of str or int.
                        If X is a Dataframe: Names of feature columns that should be included in
                        the count of top quantile membership.
                        If X is a 2D numpy array: Integer positions for the columns to be used
        q: float. Corresponds to the percentage quantile you want to be counting for. For example,
                        q = 0.90 looks at the 90% percentile (top decile).
        '''
        self.new_feature_name = new_feature_name
        self.feature_list = feature_list
        self.q = q

    def fit(self, X, y = None):
        '''
        Calculates the q-quantile properly both for features that are largely positive
        and ones that are largely negative (as DataFrame.quantile() does not do this correctly).
        For example, if most of a feature's data points are between (-1E5,0), the "top decile"
        should not be -100, it should be -1E4.
        Parameters
        ----------
        X: features DataFrame or numpy array, one feature per column
        y: labels DataFrame/numpy array, ignored
        '''
        if isinstance(X, pd.DataFrame):
            #Is self.feature_list something other than a list of strings?
            if not isinstance(self.feature_list[0], str):
                raise TypeError('feature_list is not a list of strings')
            #Majority-negative features need to check df.quantile(1-q)
            #in order to be using correct quantile value
            pos = X.loc[:,self.feature_list].quantile(self.q)
            neg = X.loc[:,self.feature_list].quantile(1.0-self.q)
            #Replace negative quantile values of neg within pos to create
            #merged Series with proper quantile values for majority-positive
            #and majority-negative features
            pos.loc[neg < 0] = neg.loc[neg < 0]
            self.quants = pos
        #Are features a NumPy array?
        elif isinstance(X, np.ndarray):
            #Is self.feature_list something other than a list of int?
            if not isinstance(self.feature_list[0], int):
                raise TypeError('feature_list is not a list of integers')
            #Majority-negative features need to check df.quantile(1-q)
            #in order to be using correct quantile value
            pos = np.percentile(X[:, self.feature_list], self.q * 100, axis = 0)
            neg = np.percentile(X[:, self.feature_list], (1.0 - self.q) * 100, axis = 0)
            #It's easier to work in a DataFrame, and now we don't need to know column names,
            #so let's switch over to a DataFrame for a moment
            #pos = pd.DataFrame(pos)
            #neg = pd.DataFrame(neg)
            #Replace negative quantile values of neg within pos to create
            #merged Series with proper quantile values for majority-positive
            #and majority-negative features
            pos[neg < 0] = neg[neg < 0]
            self.quants = pos
        else:
            raise TypeError('Features need to be either pandas DataFrame or numpy array')
        #Return self so that fit() follows the sklearn estimator convention
        return self

    def transform(self, X):
        '''
        Using quantile information from fit(), adds a new feature to X that contains integer counts
        of how many times a sample had a value that was in the top q-quantile of its feature, limited
        to only features in self.feature_list
        Parameters
        ----------
        X: features DataFrame or numpy array, one feature per column
        Returns
        ----------
        If X is a DataFrame: Input DataFrame with additional column for new_feature, called self.new_feature_name
        If X is a 2D numpy array: same as for the DataFrame case, except is a numpy array with no column names
        '''
        #Change all values in X to True or False if they are or are not within the
        #top q-quantile
        if isinstance(X, pd.DataFrame):
            self.boolean = X.loc[:,self.feature_list].abs() >= self.quants.abs()
            #Sum across each row to produce the counts
            X[self.new_feature_name] = self.boolean.sum(axis = 1)
        elif isinstance(X, np.ndarray):
            self.boolean = np.absolute(X[:,self.feature_list]) >= np.absolute(self.quants)
            #Append the per-row counts as a new last column
            X = np.vstack((X.T, np.sum(self.boolean, axis = 1))).T
        else:
            raise TypeError('Features need to be either pandas DataFrame or numpy array')
        return X

    def fit_transform(self, X, y = None):
        '''
        Provides the identical output to running fit() and then transform() in one nice little package.
        Parameters
        ----------
        X: features DataFrame or 2D numpy array, one feature per column
        y: labels DataFrame, ignored
        '''
        self.fit(X, y)
        return self.transform(X)
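As a rough illustration of the counting step the transformer performs on a numpy array, here is a minimal NumPy-only sketch on made-up data. It is not the class above: the helper name top_quantile_counts and the toy array are assumptions introduced for illustration, and the sketch deliberately skips the majority-negative handling.

```python
import numpy as np

def top_quantile_counts(X, cols, q=0.90):
    """Hypothetical helper: for each row, count how many of the selected
    columns hold a value at or above that column's q-quantile."""
    thresholds = np.percentile(X[:, cols], q * 100, axis=0)  # one threshold per column
    in_top = X[:, cols] >= thresholds  # thresholds broadcast across the rows
    return in_top.sum(axis=1)

# Toy data: 5 samples, 3 features (made up for illustration)
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0],
              [5.0, 50.0, 500.0]])

counts = top_quantile_counts(X, cols=[0, 2], q=0.90)
X_new = np.column_stack((X, counts))  # append the counts as a new last column
```

Note that the output has one more column than the input, so any column positions computed against the original array only stay valid as long as the transformer always receives that same original layout.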
Pipeline:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, RobustScaler
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np

#Suppress the warnings coming from GridSearchCV to reduce output messages
import warnings
import sklearn.exceptions
warnings.filterwarnings("ignore",category=sklearn.exceptions.UndefinedMetricWarning)

features = df.drop(columns = ['poi'])
labels = df['poi']

#--------------------------------- CROSS-VALIDATION -----------------------------------------
#Shuffled and stratified cross-validation binning for this tuning exercise
cv_10 = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state = 42)

#--------------------------------- IMPUTATION -----------------------------------------
#Imputation using the median of each feature
imp = Imputer(missing_values='NaN', strategy='median')

#--------------------------------- FEATURE ENGINEERING -----------------------------------------
#Feature Engineering with TopQuantile() to count the top quantile financial features
feats = ['salary', 'total_payments', 'bonus', 'total_stock_value', 'expenses',
         'exercised_stock_options', 'other', 'restricted_stock']

#Since numpy needs the columns as integer positions instead of names...
feats_loc_list = []
for e in feats:
    feats_loc_list.append(features.columns.get_loc(e))

topQ = TopQuantile(feature_list = feats_loc_list)

#--------------------------------- FEATURE SCALING -----------------------------------------
#Feature Scaling via RobustScaler()
scaler = RobustScaler()

#--------------------------------- FEATURE SELECTION -----------------------------------------
#Feature Selection via SelectPercentile(f_classif, percentile = 75)
selector = SelectPercentile(score_func = f_classif, percentile = 75)

#--------------------------------- TUNING -----------------------------------------
#FeatureUnion to keep track of kNN and SVM model results
knn = KNeighborsClassifier()
knn_param_grid = {'kNN__n_neighbors': range(1,21,1), 'kNN__weights': ['uniform', 'distance'],
                  'kNN__p': [1,2]}

#Hyperparameter tuning
knn_pipe = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                     ('select', selector), ('kNN', knn)])
GridSearchCV call:
knn_gs = GridSearchCV(knn_pipe, knn_param_grid, scoring = ['precision', 'recall', 'f1'],
                      cv = cv_10, refit = 'f1', return_train_score = False)
knn_gs.fit(features, labels)
cross_validate call:
knn_pipe_tuned = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                           ('select', selector), ('kNN', knn_gs.best_estimator_)])
cv_1000 = StratifiedShuffleSplit(n_splits=1000, test_size=0.2, random_state=42)
from sklearn.model_selection import cross_validate
knn_scores = cross_validate(knn_pipe_tuned, features, labels, groups=None,
                            scoring=['precision', 'recall', 'f1'], cv=cv_1000)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-147-4f04d5e63a0b> in <module>()
12 from sklearn.model_selection import cross_validate
13 knn_scores = cross_validate(knn_pipe_tuned, features, labels, groups=None,
---> 14 scoring=['precision', 'recall', 'f1'], cv=cv_1000)
15
16 knn_cv_results = pd.DataFrame(knn_scores)
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)
204 fit_params, return_train_score=return_train_score,
205 return_times=True)
--> 206 for train, test in cv.split(X, y, groups))
207
208 if return_train_score:
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
623 return False
624 else:
--> 625 self._dispatch(tasks)
626 return True
627
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in _dispatch(self, batch)
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
589 self._jobs.append(job)
590
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in apply_async(self, func, callback)
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
112 if callback:
113 callback(result)
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.pyc in __init__(self, batch)
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
333
334 def get(self):
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
456 estimator.fit(X_train, **fit_params)
457 else:
--> 458 estimator.fit(X_train, y_train, **fit_params)
459
460 except Exception as e:
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
248 Xt, fit_params = self._fit(X, y, **fit_params)
249 if self._final_estimator is not None:
--> 250 self._final_estimator.fit(Xt, y, **fit_params)
251 return self
252
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
246 This estimator
247 """
--> 248 Xt, fit_params = self._fit(X, y, **fit_params)
249 if self._final_estimator is not None:
250 self._final_estimator.fit(Xt, y, **fit_params)
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit(self, X, y, **fit_params)
211 Xt, fitted_transformer = fit_transform_one_cached(
212 cloned_transformer, None, Xt, y,
--> 213 **fit_params_steps[name])
214 # Replace the transformer of the step with the fitted
215 # transformer. This is necessary when loading the transformer
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
360
361 def __call__(self, *args, **kwargs):
--> 362 return self.func(*args, **kwargs)
363
364 def call_and_shelve(self, *args, **kwargs):
/Users/emigre459/anaconda3/envs/ML_MiniProjects/lib/python2.7/site-packages/sklearn/pipeline.pyc in _fit_transform_one(transformer, weight, X, y, **fit_params)
579 **fit_params):
580 if hasattr(transformer, 'fit_transform'):
--> 581 res = transformer.fit_transform(X, y, **fit_params)
582 else:
583 res = transformer.fit(X, y, **fit_params).transform(X)
<ipython-input-108-dfcab4b62582> in fit_transform(self, X, y)
138 '''
139
--> 140 self.fit(X, y)
141 return self.transform(X)
<ipython-input-108-dfcab4b62582> in fit(self, X, y)
73 #Majority-negative features need to check df.quantile(1-q)
74 #in order to be using correct quantile value
---> 75 pos = np.percentile(X[:, self.feature_list], self.q * 100, axis = 0)
76 neg = np.percentile(X[:, self.feature_list], (1.0 - self.q) * 100, axis = 0)
77
IndexError: index 10 is out of bounds for axis 1 with size 10
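The last frame of the traceback can be reproduced in isolation. In NumPy, axis 1 of a 2D array indexes columns, so the message says that the X reaching fit() has only 10 columns while feature_list asks for column position 10. The sketch below uses made-up shapes and a hypothetical feature_list; it only shows the indexing failure, not the asker's data.

```python
import numpy as np

X = np.zeros((5, 10))       # 5 rows, 10 columns: valid column positions are 0..9
feature_list = [0, 3, 10]   # hypothetical positions computed against a wider array

message = None
try:
    # Same indexing pattern as in TopQuantile.fit()
    np.percentile(X[:, feature_list], 90, axis=0)
except IndexError as err:
    message = str(err)
    print(message)
```

The number of rows plays no role here; only the mismatch between the stored column positions and the width of the array that actually arrives matters.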
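【Comments】:
Only half related to your question, but by setting knn_pipe_tuned = knn_gs, or simply calling knn_scores = cross_validate(knn_gs, features, labels, groups=None, scoring=['precision', 'recall', 'f1'], cv=cv_1000), you would be doing nested cross-validation, which would give you unbiased CV results.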
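A minimal sketch of the nested cross-validation idea from the comment above, assuming a recent scikit-learn and using synthetic data from make_classification instead of the asker's dataset. The pipeline is shortened to two steps, and the split counts and parameter grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the real features/labels (assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = Pipeline([('scale', RobustScaler()), ('kNN', KNeighborsClassifier())])
param_grid = {'kNN__n_neighbors': [3, 5, 7]}

# Inner splits tune the hyperparameters; outer splits score the tuned model
inner_cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
outer_cv = StratifiedShuffleSplit(n_splits=4, test_size=0.2, random_state=1)

search = GridSearchCV(pipe, param_grid, scoring='f1', cv=inner_cv)

# Passing the unfitted search object re-runs the tuning inside every outer split
nested_scores = cross_validate(search, X, y,
                               scoring=['precision', 'recall', 'f1'], cv=outer_cv)
```

Because the hyperparameters are re-selected on each outer training split, the outer test scores are not influenced by data that was used for tuning. The cost is that the whole grid search runs once per outer split.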
【Answer 1】:
When you pass a pipeline to GridSearchCV, best_estimator_ is also a pipeline object (regardless of whether you tuned a single part of that pipeline or all of its parts).
So when you do this:
knn_pipe_tuned = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                           ('select', selector), ('kNN', knn_gs.best_estimator_)])
you are actually doing this:
knn_pipe_tuned = Pipeline([('impute', imp), ('engineer',topQ), ('scale', scaler),
                           ('select', selector), ('kNN', Pipeline([('impute', imp),
                                                                   ('engineer',topQ),
                                                                   ('scale', scaler),
                                                                   ('select', selector),
                                                                   ('kNN', knn)]))])
So this will again impute, engineer, scale, and select data that has already passed through all of those steps. I am sure this is not what you want.
When doing the cross_validate, you only need to do this:
knn_pipe_tuned = knn_gs.best_estimator_
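The point that best_estimator_ is the whole pipeline can be checked directly. The sketch below assumes a recent scikit-learn, synthetic data, and a shortened two-step pipeline; none of the variable names come from the question. It also hints at why the nested version fails: the traceback shows the error inside the inner pipeline's TopQuantile.fit(), which receives the array already reduced by the outer select step while feature_list still holds positions from the original feature set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the real features/labels (assumption)
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

pipe = Pipeline([('scale', RobustScaler()), ('kNN', KNeighborsClassifier())])
gs = GridSearchCV(pipe, {'kNN__n_neighbors': [1, 3, 5]}, cv=3)
gs.fit(X, y)

best = gs.best_estimator_
is_pipeline = isinstance(best, Pipeline)        # the refit estimator is a Pipeline
step_names = [name for name, _ in best.steps]   # same steps as the pipeline passed in
tuned_knn = best.named_steps['kNN']             # only the tuned final classifier
print(is_pipeline, step_names, tuned_knn.n_neighbors)
```

If a freshly built pipeline really is wanted, best.named_steps['kNN'] would be the object to place in its last step, although reusing best_estimator_ directly, as the answer suggests, is simpler.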
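【Discussion】:
Oh! I completely missed that. I had been assuming that best_estimator_ was only the estimator, not the entire pipeline, but what you say makes perfect sense. And since the code now runs smoothly and produces results that look to be of the appropriate scale, I'm marking this as "done". Thanks for your help!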