Ridge Regression Grid Search with Pipeline

Posted 2023-02-23

[Title] Ridge Regression Grid Search with Pipeline [Posted] 2019-12-14 01:58:04 [Question]:

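I am trying to optimize the hyperparameters of a ridge regression, and I also want to add polynomial features. The pipeline itself works, but I get an error when I try to use GridSearchCV. Here is the code: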

# Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from collections import Counter
from IPython.core.display import display, HTML
sns.set_style('darkgrid')

# Data Preprocessing 
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target

# X and y Variables
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 25)

# Building the Model ------------------------------------------------------------------------

# Fitting the regressor to the Training set
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

steps = [
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge())
]

ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
# Predicting the Test set results
y_pred = ridge_pipe.predict(X_test)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = ridge_pipe, X = X_train, y = y_train, cv = 10)
accuracies.mean()
#accuracies.std()

# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV

parameters = [ {'alpha': np.arange(0, 0.2, 0.01) } ]

grid_search = GridSearchCV(estimator = ridge_pipe, 
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)  # <-- GETTING ERROR IN HERE

Error:

ValueError: Invalid parameter ridge for estimator

What should I do, or is there a better way to use ridge regression with a pipeline? I would also appreciate some resources on grid search, since I am new to this.

[Comments]:

Maybe this can help you.

[Answer 1]:

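There are two problems in your code. First, since you are using a pipeline, you need to specify in the parameter grid which step of the pipeline each parameter belongs to. See the official documentation for more information: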

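The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__', as in the example below.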

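In this case, since alpha belongs to the ridge regression and you named that step model in the pipeline definition, you need to rename the key alpha to model__alpha (step name, two underscores, parameter name):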

steps = [
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge())  # <------ Whatever string you assign here will be used later
]

# Since you have named the step 'model', the key must be 'model__alpha' (two underscores)
parameters = [ {'model__alpha': np.arange(0, 0.2, 0.01) } ]
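One way to check which keys a parameter grid may use is to ask the pipeline itself. The following is a minimal sketch, assuming the same three steps as above; `get_params()` and `set_params()` are the standard scikit-learn estimator methods, while the variable names `valid_keys` and `alpha_after` are illustrative only.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

ridge_pipe = Pipeline([
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge()),
])

# Every key returned by get_params() is a valid param_grid key for GridSearchCV.
valid_keys = sorted(ridge_pipe.get_params().keys())
print([k for k in valid_keys if k.startswith('model__')])

# Nested parameters are addressed as <step name>__<parameter name>.
ridge_pipe.set_params(model__alpha=0.1)
alpha_after = ridge_pipe.named_steps['model'].alpha
```

A bare `'alpha'` does not appear in that list, which is why GridSearchCV rejects it with an "Invalid parameter" error.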

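Next, you need to understand that this dataset is for regression. You should not use accuracy here, which is a classification metric; use a regression-based scoring function instead, such as mean_squared_error. Here are some other metrics for regression that you can use. Something like this: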

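from sklearn.metrics import mean_squared_error, make_scorer
# MSE is a loss (lower is better), so tell the scorer to negate it;
# otherwise GridSearchCV would select the parameters with the highest error.
scoring_func = make_scorer(mean_squared_error, greater_is_better=False)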

grid_search = GridSearchCV(estimator = ridge_pipe, 
                           param_grid = parameters,
                           scoring = scoring_func,  #<--- Use the scoring func defined above
                           cv = 10,
                           n_jobs = -1)
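Putting both fixes together, the search might look like the sketch below. It rests on several assumptions that are not part of the original answer: synthetic data from `make_regression` stands in for the Boston dataset (which newer scikit-learn releases no longer ship), the built-in scoring string `'neg_mean_squared_error'` replaces the custom scorer, the grid starts at 0.01 rather than 0, and `cv=5` is used to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the dataset from the question).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=25)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=25)

ridge_pipe = Pipeline([
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge()),
])

# Keys use <step name>__<parameter name>; alpha=0 would disable regularization.
parameters = {'model__alpha': np.arange(0.01, 0.2, 0.01)}

grid_search = GridSearchCV(estimator=ridge_pipe,
                           param_grid=parameters,
                           scoring='neg_mean_squared_error',
                           cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(-grid_search.best_score_)  # mean cross-validated MSE of the best alpha
```

Because scikit-learn always maximizes the score, error metrics are exposed in negated form; flipping the sign of `best_score_` recovers the mean squared error.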

Here is a link to a Google Colab notebook with working code.

[Comments]:

[Answer 2]:

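For the GridSearchCV parameters, the parameter name for ridge should be 'ridge__alpha' (note the two underscores), not just 'alpha'. The prefix before the double underscore must match the name given to the Ridge step in the pipeline.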
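The prefix `ridge` applies when the step is actually named that way. As a small sketch of one case where it is, `make_pipeline` names each step after its lowercased class name; the variable names here are illustrative and not taken from the question.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# make_pipeline derives step names from the class names, lowercased.
auto_pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), Ridge())

step_names = [name for name, _ in auto_pipe.steps]
print(step_names)

# With these automatic names, the grid key for Ridge's alpha is 'ridge__alpha'.
has_ridge_alpha = 'ridge__alpha' in auto_pipe.get_params()
```

With the explicit `Pipeline(steps)` from the question, where the step is called `'model'`, the equivalent key is `'model__alpha'`.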

[Comments]:
