我正在尝试实现 GridSearchCV 来调整 K 最近邻分类器的参数

Posted

技术标签:

【中文标题】我正在尝试实现 GridSearchCV 来调整 K 最近邻分类器的参数【英文标题】:I am trying to implement GridSearchCV to tune the parameters of K nearest neighbor classifier 【发布时间】:2016-10-07 07:16:50 【问题描述】:
import sklearn.cross_validation
import sklearn.grid_search
import sklearn.metrics
import sklearn.neighbors
import sklearn.decomposition
import sklearn
import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=['Id', 'ClumpThickness', 'UniformityCellSize', 'UniformityCellShape', 'MarginalAdhesion', 'EpithelialCellSize', 'BareNuclei', 'BlandChromatin', 'NormalNucleoli', 'Mitoses','Class'])
X = df.iloc[0:699,1:10]
Y = df.iloc[0:699,-1:]

print X.shape, Y.shape

(699, 9) (699, 1)

X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(X,Y,test_size=0.33,random_state=42)

k = np.arange(20)+1
parameters = 'n_neighbors': k
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn,parameters,cv=10)
print Y_train.shape
print X_train.shape
clf

(468, 1) (468, 9)

GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=, iid=True, n_jobs=1,
       param_grid='n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20]),
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

clf.fit(X_train,Y_train)

现在,当我尝试将训练数据拟合到 clf 时,它显示以下错误 -

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-49-d072fe7672f3> in <module>()
----> 1 clf.fit(X_train,Y_train)

/home/vagisha/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
    802 
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805 
    806 

    /home/vagisha/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
        530                                  'of samples (%i) than data (X: %i samples)'
        531                                  % (len(y), n_samples))
    --> 532         cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
        533 
        534         if self.verbose > 0:

    /home/vagisha/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in check_cv(cv, X, y, classifier)
       1675         if classifier:
       1676             if type_of_target(y) in ['binary', 'multiclass']:
    -> 1677                 cv = StratifiedKFold(y, cv)
       1678             else:
       1679                 cv = KFold(_num_samples(y), cv)
/home/vagisha/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, shuffle, random_state)
    531         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    532             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 533                 label_test_folds = test_folds[y == label]
    534                 # the test split can be too big because we used
    535                 # KFold(max(c, self.n_folds), self.n_folds) instead of

IndexError: too many indices for array

【问题讨论】:

【参考方案1】:

我相信您的目标“Y_train”必须是一维数组 (468, )。试试:

X_train.ravel()

在训练分类器之前。

【讨论】:

有趣的是,KNeighborsClassifier 对象仅在标签数组为(n_samples,1) 而不是(n_samples,) 时发出警告。 代码通过使用 X_train.valuesY_train.valuesX_trainY_train 更改为 numpy 数组并将Y_train 重新整形为 (468,) 来工作

以上是关于我正在尝试实现 GridSearchCV 来调整 K 最近邻分类器的参数的主要内容,如果未能解决你的问题,请参考以下文章

使用 GridSearchCV 调整 scikit-learn 的随机森林超参数

使用 GridSearchCV 进行超参数调整

在 python 中使用 gridsearchcv 进行梯度提升分类器的参数调整

使用 GridSearchCv 优化 SVR() 参数

如何保存 GridSearchCV 对象?

为啥 GridSearchCV 模型结果与我手动调整的模型不同?