Implementing GridSearchCV and Pipelines to perform Hyperparameters Tuning for KNN Algorithm
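【Title】: Implementing GridSearchCV and Pipelines to perform Hyperparameters Tuning for KNN Algorithm 【Posted】: 2022-01-17 14:09:04 【Question】: I have been reading about performing hyperparameter tuning for the KNN algorithm, and I understand that the best practice is to make sure that, for each fold, my dataset is normalized and oversampled using a pipeline (to avoid data leakage and overfitting).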
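What I am trying to do is find the number of neighbors (n_neighbors) that gives me the best accuracy in training. In the code, I set the candidate numbers of neighbors to the list range(1, 50) and the number of cross-validation iterations (folds) to cv=10.
My code is as follows: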
# dataset reading & preprocessing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# oversampling
from imblearn.over_sampling import SMOTE
#KNN Model related Libraries
import cuml
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier
#loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")
#filling missing values with zeros
df = df.fillna(0)
# replace the byte-string labels with plain digit strings
df["command response"] = df["command response"].replace({"b'0'": "0", "b'1'": "1"})
df["binary result"] = df["binary result"].replace({"b'0'": "0", "b'1'": "1"})
#change the datatype of some features to be able to be used later
df["command response"] = pd.to_numeric(df["command response"]).astype(float)
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)
# dataset splitting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]
# splitting the dataset into train and test for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)
# making a pipeline that normalizes, oversamples and applies the classifier before GridSearchCV
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='eculidean', output='input'))
])
#Using GridSearchCV
neighbors = list(range(1,50))
parameters = {
    'classifier__n_neighbors': neighbors
}
grid_search = GridSearchCV(pipe, parameters, cv=10)
grid_search.fit(X_train, y_bin_train)
print("Best Accuracy: {}".format(grid_search.best_score_))
print("Best num of neighbors: {}".format(grid_search.best_params_['classifier__n_neighbors']))
At the step grid_search.fit(X_train, y_bin_train), the program repeatedly gives me this error:
/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:619: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py", line 266, in fit
self._final_estimator.fit(Xt, yt, **fit_params_last_step)
File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/neighbors/kneighbors_classifier.pyx", line 176, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.fit
File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/neighbors/nearest_neighbors.pyx", line 397, in cuml.neighbors.nearest_neighbors.NearestNeighbors.fit
ValueError: Metric is not valid. Use sorted(cuml.neighbors.VALID_METRICSeculidean[brute]) to get valid options.
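I am not sure where this error comes from. Is it because I am importing the KNN algorithm from the cuML library instead of sklearn? Or is there something wrong with my Pipeline and GridSearchCV implementation?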
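【Answer 1】: This error indicates that you passed an invalid value for the metric parameter (it would be invalid in both scikit-learn and cuML). You misspelled "euclidean" as "eculidean".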
import cuml
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier
X, y = datasets.make_classification(
    n_samples=100
)
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='euclidean', output='input'))
])
parameters = {
    'classifier__n_neighbors': [1, 3, 6]
}
grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)
Output:
GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('normalization', MinMaxScaler()),
                                       ('oversampling', SMOTE()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid={'classifier__n_neighbors': [1, 3, 6]})
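A typo like this can be caught before the grid search starts by checking the metric name against the list of accepted names. The sketch below is only an illustration and uses scikit-learn's `sklearn.neighbors.VALID_METRICS` so that it runs without a GPU; the cuML error message points to `cuml.neighbors.VALID_METRICS` for the same purpose. The helper name `check_metric` is hypothetical and not part of either library.

```python
from sklearn.neighbors import VALID_METRICS


def check_metric(metric, algorithm="brute"):
    """Hypothetical helper: True if `metric` is accepted for `algorithm`."""
    return metric in VALID_METRICS[algorithm]


print(check_metric("eculidean"))  # False: the misspelled name from the question
print(check_metric("euclidean"))  # True: the corrected name
```

Running such a check once up front is cheaper than discovering the problem as a FitFailedWarning on every fold.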
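【Comments】:
Thanks for your help, I did not expect that! Apart from the error (now fixed), do you think my implementation of the Pipeline and GridSearchCV is correct? I want to make sure that X_train is normalized, and that X_train and y_bin_train are oversampled on every fold before being passed to the KNN classifier. Does my code do this correctly, so that I can get correct results from it?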
As written, your pipeline will run inside each fold, which is what you want for your preprocessing steps.
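One way to see that the pipeline is refitted inside each fold is to record how many rows the scaler is fitted on. The following is a minimal sketch under stated assumptions, not the asker's setup: it swaps in scikit-learn's `Pipeline` and `KNeighborsClassifier` and leaves out SMOTE so that it runs without cuML or imblearn, and `RecordingScaler` and `fit_sizes` are hypothetical names introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical log of how many rows the scaler was fitted on each time.
fit_sizes = []


class RecordingScaler(MinMaxScaler):
    """MinMaxScaler that records the number of rows passed to fit()."""

    def fit(self, X, y=None):
        fit_sizes.append(X.shape[0])
        return super().fit(X, y)


X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([
    ("normalization", RecordingScaler()),
    ("classifier", KNeighborsClassifier(metric="euclidean")),
])

# 2 candidate values x 5 folds, followed by one final refit on all rows.
grid_search = GridSearchCV(pipe, {"classifier__n_neighbors": [1, 3]}, cv=5)
grid_search.fit(X, y)

print(fit_sizes)  # every entry except the last is smaller than 100
print(grid_search.best_params_["classifier__n_neighbors"])
```

Every cross-validation fit sees only the training part of its fold, and only the final refit on the best parameters uses all 100 rows. A sampler step in an imblearn pipeline is applied at the same point, during fitting only, so the held-out part of each fold is never oversampled.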