RFE ranking with GridSearch


【Title】RFE ranking with Gridsearch 【Posted】2020-11-05 22:58:48 【Question】:

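I want to use RFE for feature selection inside a pipeline. I have no trouble getting it to work in a pipeline without GridSearch. However, as soon as I add GridSearch, I keep getting a ValueError (note: the model without RFE works fine).

I tried using feature_selection as suggested in this thread: Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error, but that produces the same error.

What could be going wrong?

My error: ValueError: Invalid parameter alpha for estimator RFE(estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=True, random_state=None, solver='auto', tol=0.001), n_features_to_select=4, step=1, verbose=1). Check the list of available parameters with `estimator.get_params().keys()`.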
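The check that the error message suggests can be sketched as follows. This is a minimal illustration rather than the original setup: it keeps only the scaler and an RFE step, named "ridge" as in the failing pipeline further down, and lists the parameter names that a param_grid could refer to. The variable name param_names is introduced here for illustration only.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced pipeline for illustration: the RFE step is named "ridge",
# matching the step name used in the failing example.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", RFE(estimator=Ridge(), n_features_to_select=4)),
])

# Every key returned by get_params() is a valid name for param_grid.
param_names = sorted(pipeline.get_params().keys())
for name in param_names:
    if name.startswith("ridge__"):
        print(name)
```

The list should contain ridge__estimator__alpha but not ridge__alpha: alpha belongs to the Ridge model wrapped by RFE, not to RFE itself, so it sits one level deeper.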

This works fine:

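import numpy as np
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X (a pandas DataFrame) and y are assumed to be loaded already
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4, verbose=1)

# Set up the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('reg', rfe)]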
          
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Fit the pipeline to the training set: 
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

print()
# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))
# Print the features that are not eliminated
print(X.columns[rfe.support_])
print()

print("R^2: {}".format(pipeline.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

This does not work:

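from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Note: Ridge's normalize parameter was removed in scikit-learn 1.2
rfe = RFE(estimator=Ridge(normalize=True), n_features_to_select=4, verbose=1)

# Set up the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('ridge', rfe)]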
          
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Define hyperparameters and range of Grid Search
parameters = {"ridge__alpha": np.linspace(0, 1, 100)}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# run cross validation
cv = GridSearchCV(pipeline, param_grid = parameters, cv=3)

# Fit the pipeline to the training set: 
cv.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = cv.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(cv.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
print("Tuned Model Parameters: {}".format(cv.best_params_))

Using feature_selection does not work either:

selector = feature_selection.RFE(Ridge(normalize=True))

#setup the pipeline steps
steps = [('scaler', StandardScaler()),
         ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('RFE', selector)]
          
# Create the pipeline: pipeline
pipeline = Pipeline(steps)


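【Answer 1】:

This question is old, but in case anyone stumbles upon it:

You can reach the hyperparameter alpha, or any other parameter of the estimator passed to RFE(estimator=...), by adding the estimator__ prefix after the pipeline step name, for example rfe__estimator__alpha: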

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFE

# X_train and y_train are assumed to be defined already
model = RFE(estimator=Ridge())

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("rfe", model)
    ]
)

param = {
    # A float step is the fraction of features removed per iteration
    # and must lie strictly between 0 and 1
    "rfe__step": np.linspace(0.1, 0.9, 9),
    # <step name>__estimator__<parameter> reaches the Ridge inside RFE
    "rfe__estimator__alpha": np.logspace(-3, 3, 7)
}

# Pass the splitter itself; GridSearchCV calls split() internally
tscv = TimeSeriesSplit(n_splits=5)

gridsearch = GridSearchCV(estimator=pipe, cv=tscv, param_grid=param, refit=True, return_train_score=True, n_jobs=-1)
fit = gridsearch.fit(X_train, y_train)
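To get back to the feature ranking the question asks about, the fitted RFE step can be read from the refitted best pipeline. The sketch below is only an illustration under stated assumptions: the data comes from make_regression (synthetic, not the asker's data), the grid is reduced to alpha, and the names search and best_rfe are introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: 8 features, 4 of them informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=4,
                       noise=5.0, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rfe", RFE(estimator=Ridge(), n_features_to_select=4)),
])

# Tune the alpha of the Ridge model wrapped by the "rfe" step.
param_grid = {"rfe__estimator__alpha": np.logspace(-3, 3, 7)}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

# With refit=True (the default), best_estimator_ is the pipeline refitted
# with the best parameters on all data passed to fit().
best_rfe = search.best_estimator_.named_steps["rfe"]
print(search.best_params_)
print(best_rfe.ranking_)   # 1 = selected; larger = eliminated earlier
print(best_rfe.support_)   # boolean mask of the selected features
```

GridSearchCV works on clones of the pipeline, so the RFE object that was passed in is not fitted by the search. ranking_ and support_ are therefore read from best_estimator_ rather than from the original rfe variable.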
