Implementing GridSearchCV and Pipelines to perform Hyperparameter Tuning for the KNN Algorithm

Posted 2023-03-16

【Title】: Implementing GridSearchCV and Pipelines to perform Hyperparameters Tuning for KNN Algorithm 【Posted】: 2022-01-17 14:09:04 【Question】:

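I have been reading about hyperparameter tuning for the KNN algorithm, and I understand that the best practice is to make sure that, for each fold, the dataset is normalized and oversampled inside a pipeline (to avoid data leakage and overfitting). What I'm trying to do is find the number of neighbors (n_neighbors) that gives me the best accuracy in training. In the code, I set the candidate numbers of neighbors to the list range(1,50) and the number of folds to cv=10.

My code is below: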

# dataset reading & preprocessing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# oversampling
from imblearn.over_sampling import SMOTE

#KNN Model related Libraries
import cuml 
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier

#loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")

#filling missing values with zeros
df = df.fillna(0)

# replace the byte-string labels with plain digit strings
df["command response"] = df["command response"].replace({"b'0'": "0", "b'1'": "1"})
df["binary result"] = df["binary result"].replace({"b'0'": "0", "b'1'": "1"})

#change the datatype of some features to be able to be used later 
df["command response"] = pd.to_numeric(df["command response"]).astype(float)
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)

# dataset splitting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]

# splitting the dataset into train and test for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)

# making a pipeline that normalizes, oversamples, and applies the classifier inside GridSearchCV
pipe = Pipeline([
        ('normalization', MinMaxScaler()),
        ('oversampling', SMOTE()),
        ('classifier', KNeighborsClassifier(metric='eculidean', output='input'))
])

#Using GridSearchCV
neighbors = list(range(1,50))
parameters = {
    'classifier__n_neighbors': neighbors
}

grid_search = GridSearchCV(pipe, parameters, cv=10)
grid_search.fit(X_train, y_bin_train)

print("Best Accuracy: {}".format(grid_search.best_score_))
print("Best num of neighbors: {}".format(grid_search.best_estimator_.get_params()['classifier__n_neighbors']))

At the step grid_search.fit(X_train, y_bin_train), the program repeatedly gives me this error:

/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:619: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py", line 266, in fit
    self._final_estimator.fit(Xt, yt, **fit_params_last_step)
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 176, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.fit
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/nearest_neighbors.pyx", line 397, in cuml.neighbors.nearest_neighbors.NearestNeighbors.fit
ValueError: Metric  is not valid. Use sorted(cuml.neighbors.VALID_METRICSeculidean[brute]) to get valid options.

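I'm not sure where this error comes from. Is it because I'm importing the KNN algorithm from the cuML library instead of sklearn? Or is something wrong with my Pipeline and GridSearchCV implementation?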

【Comments】:

【Answer 1】:

This error indicates that you passed an invalid value for the metric parameter (this applies to both scikit-learn and cuML). You misspelled "euclidean".

import cuml
from sklearn import datasets

from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import SMOTE

from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier

X, y = datasets.make_classification(
    n_samples=100
)

pipe = Pipeline([
        ('normalization', MinMaxScaler()),
        ('oversampling', SMOTE()),
        ('classifier', KNeighborsClassifier(metric='euclidean', output='input'))
])

parameters = {
    'classifier__n_neighbors': [1,3,6]
}

grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)
GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('normalization', MinMaxScaler()),
                                       ('oversampling', SMOTE()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid={'classifier__n_neighbors': [1, 3, 6]})
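
Since the root cause is a misspelled string, a small guard can catch it before any fitting starts. The sketch below is not part of scikit-learn or cuML: `KNOWN_METRICS` and `check_metric` are hypothetical names, and the list is illustrative only. The error message in the question points to `cuml.neighbors.VALID_METRICS` for the real options.

```python
import difflib

# Hypothetical, illustrative list. The real set of valid metrics depends on
# the library, its version, and the chosen algorithm.
KNOWN_METRICS = ["euclidean", "manhattan", "chebyshev", "minkowski", "cosine"]


def check_metric(name, known=KNOWN_METRICS):
    """Return `name` if it is known; otherwise raise with a spelling hint."""
    if name in known:
        return name
    # Look for the closest known spelling, if any is similar enough.
    suggestions = difflib.get_close_matches(name, known, n=1)
    hint = f" Did you mean '{suggestions[0]}'?" if suggestions else ""
    raise ValueError(f"Unknown metric '{name}'.{hint}")
```

Calling `check_metric('eculidean')` before building the pipeline would raise a `ValueError` that suggests the intended spelling, instead of failing once per fold inside the grid search.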

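【Discussion】:

Thanks for the help, I didn't expect that! Apart from the error (now fixed), do you think my Pipeline and GridSearchCV implementation is correct? I want to make sure that X_train is normalized, and that X_train and y_bin_train are oversampled on every fold, before they are fed to the KNN classifier. Does my code do this correctly, so that I can trust the results?

As written, your pipeline will run within the folds, which is what you want for your preprocessing steps.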
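
One way to check this behavior is to record how many rows the scaler is fitted on. The following is a minimal sketch under some assumptions: it uses scikit-learn's own `Pipeline` and `KNeighborsClassifier` in place of the imbalanced-learn pipeline and cuML, it leaves out SMOTE so that it runs with scikit-learn alone, and `RecordingScaler` and `fit_sizes` are hypothetical helpers introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Number of rows seen by the scaler on each call to fit().
fit_sizes = []


class RecordingScaler(MinMaxScaler):
    """MinMaxScaler that records how many rows it was fitted on (illustrative)."""

    def fit(self, X, y=None):
        fit_sizes.append(len(X))
        return super().fit(X, y)


X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([
    ('normalization', RecordingScaler()),
    ('classifier', KNeighborsClassifier(metric='euclidean')),
])

parameters = {'classifier__n_neighbors': [1, 3, 6]}

# Default n_jobs runs in a single process, so fit_sizes is updated in place.
grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)

print("Rows seen by the scaler on each fit:", fit_sizes)
print("Best params:", grid_search.best_params_)
```

Every entry in `fit_sizes` except the last should be smaller than the full dataset, because the scaler is refitted on each fold's training part only; the last entry comes from the final refit on all rows. The selected value can be read from `grid_search.best_params_['classifier__n_neighbors']`.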
