Simultaneous feature selection and hyperparameter tuning

【Posted】2021-08-29 16:02:06 【Question】:

I am trying to do hyperparameter tuning and feature selection on an sklearn SVC model.

I tried the code below, but I get an error.

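import numpy as np
# HalvingGridSearchCV is experimental and must be enabled before it is imported
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([('anova', SelectPercentile(f_classif)),
                ('svc',  SVC( probability = True))])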

score_means = list()
score_params = list()
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)

params = {
    "C": np.logspace(-3, 17, 21),
    "gamma": np.logspace(-20, 1, 21),
    'class_weight' : [None, 'balanced']
}

halving_search = HalvingGridSearchCV(estimator = clf,
                                     param_grid = params,
                                     scoring = 'neg_brier_score',
                                     factor = 2, 
                                     
                                     verbose = 2,
                                     cv = 2)


for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    this_scores = halving_search.fit(x_train, y_train)
    score_means.append(this_scores.best_score_)
    score_params.append(this_scores.best_params_)

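Running the pipeline with cross_val_score, separately from HalvingGridSearchCV, works, but I want to find out which combination of features and hyperparameters produces the best model.

When I run the code above, I get the following error: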

Traceback (most recent call last):

  File "<ipython-input-83-cf714445297c>", line 4, in <module>
    this_scores = halving_search.fit(x_train, y_train)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search_successive_halving.py", line 213, in fit
    super().fit(X, y=y, groups=None, **fit_params)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 841, in fit
    self._run_search(evaluate_candidates)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search_successive_halving.py", line 320, in _run_search
    more_results=more_results)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 809, in evaluate_candidates
    enumerate(cv.split(X, y, groups))))

  File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):

  File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()

  File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]

  File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 581, in _fit_and_score
    estimator = estimator.set_params(**cloned_parameters)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 150, in set_params
    self._set_params('steps', **kwargs)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py", line 54, in _set_params
    super().set_params(**params)

  File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\base.py", line 233, in set_params
    (key, self))

ValueError: Invalid parameter C for estimator Pipeline(steps=[('anova', SelectPercentile(percentile=1)),
                ('svc', SVC(probability=True))]). Check the list of available parameters with `estimator.get_params().keys()`.

It looks like the halving search is trying to pass the pipeline as the input for C.
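
The error message suggests listing the parameter names the estimator accepts. A minimal sketch of that check, assuming the same two-step pipeline as in the question (the variable names `param_names` and `svc_names` are illustrative only):

```python
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([("anova", SelectPercentile(f_classif)),
                ("svc", SVC(probability=True))])

# A pipeline exposes the parameters of each step under <step>__<parameter>.
param_names = sorted(clf.get_params().keys())

# Keep only the names that belong to the 'svc' step.
svc_names = [name for name in param_names if name.startswith("svc__")]
print(svc_names)
```

The bare name `C` does not appear in this list, while `svc__C` does, which is consistent with the `Invalid parameter C` error.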


【Answer 1】:

You want to perform a grid search over a Pipeline object. When defining parameters for the different steps of a pipeline, you must use the <step>__<parameter> syntax:

params = {
    "svc__C": np.logspace(-3, 17, 21),
    "svc__gamma": np.logspace(-20, 1, 21),
    "svc__class_weight" : [None, 'balanced']
}

See the user guide for more information.
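
Because the selector's percentile is also a step parameter, it can probably go into the same grid as `anova__percentile`, so a single search covers both feature selection and the SVC hyperparameters without the outer loop. The following is only a sketch under assumptions: the data is synthetic (`make_classification` stands in for `x_train` and `y_train`), the grid is much smaller than the one in the question, and `accuracy` replaces `neg_brier_score` to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
# HalvingGridSearchCV is experimental and must be enabled before import.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the asker's training data (assumption).
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=5, random_state=0)

clf = Pipeline([("anova", SelectPercentile(f_classif)),
                ("svc", SVC())])

# Every key uses the <step>__<parameter> syntax, including the selector.
params = {
    "anova__percentile": [25, 50, 100],
    "svc__C": np.logspace(-1, 1, 3),
    "svc__gamma": np.logspace(-2, -1, 2),
}

search = HalvingGridSearchCV(estimator=clf,
                             param_grid=params,
                             scoring="accuracy",
                             factor=2,
                             cv=2,
                             random_state=0)
search.fit(X, y)

# Best combination of percentile and SVC settings found by the search.
print(search.best_params_)
print(search.best_score_)
```

With this layout, `best_params_` reports the chosen percentile together with the SVC settings, so the separate `score_means` and `score_params` lists are no longer needed.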
