Implementing GridSearchCV and Pipelines to perform Hyperparameters Tuning for KNN Algorithm

【Posted】: 2022-01-17 14:09:04

【Question】:

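I have been reading about hyperparameter tuning for the KNN algorithm, and I understand that the best practice is to make sure that, for each fold, my dataset is normalized and oversampled using a pipeline (to avoid data leakage and overfitting). What I am trying to do is identify the number of neighbors (n_neighbors) that gives me the best accuracy in training. In the code, I set the candidate numbers of neighbors to list(range(1,50)) and the number of cross-validation folds to cv=10.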

My code is as follows:

# dataset reading & preprocessing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# oversampling
from imblearn.over_sampling import SMOTE

#KNN Model related Libraries
import cuml 
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier

#loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")

#filling missing values with zeros
df = df.fillna(0)

# replace the byte-string labels with numeric strings
df["command response"] = df["command response"].replace({"b'0'": "0", "b'1'": "1"})
df["binary result"] = df["binary result"].replace({"b'0'": "0", "b'1'": "1"})

#change the datatype of some features to be able to be used later 
df["command response"] = pd.to_numeric(df["command response"]).astype(float)
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)

# dataset splitting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]

# splitting the dataset into train and test for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)

# making a pipeline that normalizes, oversamples, and classifies, to be passed to GridSearchCV
pipe = Pipeline([
        ('normalization', MinMaxScaler()),
        ('oversampling', SMOTE()),
        ('classifier', KNeighborsClassifier(metric='eculidean', output='input'))
])

#Using GridSearchCV
neighbors = list(range(1,50))
parameters = {
    'classifier__n_neighbors': neighbors
}

grid_search = GridSearchCV(pipe, parameters, cv=10)
grid_search.fit(X_train, y_bin_train)

print("Best Accuracy: {}".format(grid_search.best_score_))
print("Best num of neighbors: {}".format(grid_search.best_estimator_.get_params()['classifier__n_neighbors']))

At the step grid_search.fit(X_train, y_bin_train), the program repeatedly produces the following error:

/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:619: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py", line 266, in fit
    self._final_estimator.fit(Xt, yt, **fit_params_last_step)
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 176, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.fit
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/nearest_neighbors.pyx", line 397, in cuml.neighbors.nearest_neighbors.NearestNeighbors.fit
ValueError: Metric  is not valid. Use sorted(cuml.neighbors.VALID_METRICSeculidean[brute]) to get valid options.

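I am not sure where this error comes from. Is it because I am importing the KNN algorithm from the cuML library instead of sklearn? Or is something wrong with my Pipeline and GridSearchCV implementation?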


【Answer 1】:

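This error indicates that you passed an invalid value for the metric parameter (the value is invalid in both scikit-learn and cuML). You misspelled "euclidean" as "eculidean". With the spelling corrected, the search runs: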

import cuml
from sklearn import datasets

from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import SMOTE

from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier

X, y = datasets.make_classification(
    n_samples=100
)

pipe = Pipeline([
        ('normalization', MinMaxScaler()),
        ('oversampling', SMOTE()),
        ('classifier', KNeighborsClassifier(metric='euclidean', output='input'))
])

parameters = {
    'classifier__n_neighbors': [1,3,6]
}

grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)

Output:

GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('normalization', MinMaxScaler()),
                                       ('oversampling', SMOTE()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid={'classifier__n_neighbors': [1, 3, 6]})
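The error message in the question points at cuml.neighbors.VALID_METRICS as the list of accepted names. A small sketch of checking a metric name before fitting, assuming scikit-learn's analogous sklearn.neighbors.VALID_METRICS dictionary as a CPU stand-in; check_metric is a hypothetical helper, not part of either library:

```python
import difflib

from sklearn.neighbors import VALID_METRICS  # dict: algorithm name -> valid metric names


def check_metric(name, algorithm="brute"):
    """Hypothetical helper: return `name` if valid, else raise with a suggestion."""
    valid = sorted(VALID_METRICS[algorithm])
    if name in valid:
        return name
    # Suggest the closest valid spelling, if any is reasonably similar
    suggestions = difflib.get_close_matches(name, valid, n=1)
    hint = " Did you mean '{}'?".format(suggestions[0]) if suggestions else ""
    raise ValueError("Metric '{}' is not valid.{}".format(name, hint))


print(check_metric("euclidean"))

try:
    check_metric("eculidean")
except ValueError as err:
    message = str(err)
    print(message)
```

The set of valid names may differ between cuML and scikit-learn, so the same check against cuML would need to read its own VALID_METRICS.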

【Comments】:

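Thank you for the help, I did not expect that to be the cause! Apart from the error (now fixed), do you think my implementation of the Pipeline and GridSearchCV is correct? I want to make sure that X_train is normalized, and that X_train and y_bin_train are oversampled on each fold, before being fed into the KNN classifier. Does my code do this correctly, so that I can get valid results from it?

As written, your pipeline will run within each fold, which is what you want for your preprocessing steps.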
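One way to see that the preprocessing is refitted inside every fold is to count how often the scaler is fitted. The sketch below is illustrative only: CountingScaler is a hypothetical helper, scikit-learn's KNeighborsClassifier stands in for the cuML one, the SMOTE step is left out to keep the example dependent on scikit-learn alone, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier  # assumption: CPU stand-in for cuML
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class CountingScaler(MinMaxScaler):
    """Hypothetical helper: a MinMaxScaler that counts how often it is fitted."""

    n_fits = 0  # class-level, so the count survives GridSearchCV's cloning

    def fit(self, X, y=None):
        CountingScaler.n_fits += 1
        return super().fit(X, y)


X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([
    ("normalization", CountingScaler()),
    ("classifier", KNeighborsClassifier(metric="euclidean")),
])
parameters = {"classifier__n_neighbors": [1, 3, 6]}

grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)

# The scaler is fitted on the training part of each fold for each candidate,
# plus once more when the best pipeline is refitted on all the data.
print("Scaler fits:", CountingScaler.n_fits)
print("Best accuracy: {:.3f}".format(grid_search.best_score_))
print("Best num of neighbors: {}".format(
    grid_search.best_params_["classifier__n_neighbors"]))
```

With 3 candidate values and 2 folds, the scaler is fitted 3 × 2 times during the search and once more for the final refit, so it never sees the held-out part of a fold. Note that the tuned value is read through the step-prefixed key classifier__n_neighbors.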
