Error while using scikit-learn Pipeline and GridSearchCV

【Title】: Error while using scikit-learn Pipeline and GridSearchCV 【Posted】: 2018-01-27 23:47:14 【Question】:

I want to try out different pipeline configurations for text classification.

This is what I wrote:

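from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import (SelectKBest, SelectFdr, SelectFwe, SelectFpr,
                                       chi2, f_classif, mutual_info_classif)
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([('c_vect', CountVectorizer()), ('feat_select', SelectKBest()),
                 ('ridge', RidgeClassifier())])

parameters = {'c_vect__max_features': [10, 50, 100, None],
              'feat_select__score_func': [chi2, f_classif, mutual_info_classif, SelectFdr, SelectFwe, SelectFpr],
              'ridge__solver': ['sparse_cg', 'lsqr', 'sag'], 'ridge__tol': [1e-2, 1e-3], 'ridge__alpha': [0.01, 0.1, 1]}

gs_clf = GridSearchCV(pipe, parameters, n_jobs=5)
gs_clf = gs_clf.fit(clean_train_data, train_labels_list)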

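But I get the error below, even though, according to the SelectKBest documentation here, SelectFdr should be one of the available feature selection functions: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html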

Traceback (most recent call last):
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/pipeline.py", line 222, in _fit
    **fit_params_steps[name])
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/base.py", line 521, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/feature_selection/base.py", line 76, in transform
    mask = self.get_support()
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/feature_selection/base.py", line 47, in get_support
    mask = self._get_support_mask()
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/feature_selection/univariate_selection.py", line 503, in _get_support_mask
    scores = _clean_nans(self.scores_)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/feature_selection/univariate_selection.py", line 30, in _clean_nans
    scores = as_float_array(scores, copy=True)
  File ".../anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py", line 93, in as_float_array
    return X.astype(return_dtype)
TypeError: float() argument must be a string or a number, not 'SelectFdr'

Any idea why this happens?

【Answer 1】:

SelectFdr, SelectFwe, and SelectFpr are selector classes, just like SelectKBest. They are not scoring functions.

The available scoring functions are listed in the documentation:

For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif

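These classes (SelectFdr, SelectFwe, SelectFpr) use f_classif as their default scoring function. So you need to remove them from the 'feat_select__score_func' list in your parameters.

If you do want to try them, you can change the parameter grid like this: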
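To make the distinction concrete, here is a minimal sketch. The count matrix X and the labels y are made up for illustration. It shows that a score function such as chi2 is a plain callable returning per-feature scores and p-values, while SelectFdr is an estimator class with fit/transform, in the same family as SelectKBest.

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, SelectKBest, chi2

# Made-up, non-negative count matrix (chi2 requires non-negative features).
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 0]])
y = np.array([1, 1, 0, 0])

# A score function maps (X, y) to one score and one p-value per feature.
scores, pvalues = chi2(X, y)
print(scores.shape, pvalues.shape)

# SelectFdr and SelectKBest are classes that expose fit/transform.
for selector_cls in (SelectKBest, SelectFdr):
    print(selector_cls.__name__, isinstance(selector_cls, type), hasattr(selector_cls, "fit"))

# chi2 is a function, not a class.
print(isinstance(chi2, type))
```

So the 'feat_select__score_func' entry should only list callables of the first kind.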

parameters = {'c_vect__max_features': [10, 50, 100, None],
              'feat_select': [SelectKBest(), SelectFdr(), SelectFwe(), SelectFpr()],
              'feat_select__score_func': [chi2, f_classif, mutual_info_classif],
              'ridge__solver': ['sparse_cg', 'lsqr', 'sag'],
              'ridge__tol': [1e-2, 1e-3], 'ridge__alpha': [0.01, 0.1, 1]}

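Note the new "feat_select" parameter in there. Yes, you can even swap out the transformer object inside the pipeline through the grid you pass to GridSearchCV. Hope this helps. Feel free to ask if anything is unclear.

【Comments】: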
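A runnable sketch of this idea, under some assumptions: the tiny corpus and labels below are hypothetical stand-ins for clean_train_data and train_labels_list, and the grid is trimmed so it runs quickly. It also splits the grid into a list of two dicts. As far as I can tell from the docs, SelectFdr, SelectFwe, and SelectFpr expect a score function that returns (scores, pvalues), while mutual_info_classif returns scores only, so that combination is kept for SelectKBest.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import (SelectFdr, SelectFpr, SelectFwe, SelectKBest,
                                       chi2, f_classif, mutual_info_classif)
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for clean_train_data / train_labels_list.
pos = ["good great movie", "great film good", "good great film",
       "good movie", "great film"] * 4
neg = ["bad awful movie", "awful film bad", "bad awful film",
       "bad movie", "awful film"] * 4
docs = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

pipe = Pipeline([("c_vect", CountVectorizer()),
                 ("feat_select", SelectKBest()),
                 ("ridge", RidgeClassifier())])

ridge_grid = {"ridge__alpha": [0.1, 1]}
parameters = [
    # SelectKBest only ranks the scores, so a scores-only function is fine here.
    {"feat_select": [SelectKBest(k=4)],
     "feat_select__score_func": [chi2, f_classif, mutual_info_classif],
     **ridge_grid},
    # The p-value based selectors get score functions that return p-values.
    {"feat_select": [SelectFdr(), SelectFwe(), SelectFpr()],
     "feat_select__score_func": [chi2, f_classif],
     **ridge_grid},
]

gs_clf = GridSearchCV(pipe, parameters, cv=5)
gs_clf.fit(docs, labels)
print(gs_clf.best_params_)
```

Passing a list of dicts makes GridSearchCV search each grid separately, which is one way to avoid selector/score-function pairs that cannot work together.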

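Thanks a lot! I didn't know you could do that. I have another, slightly different question. SelectFdr tries to reduce false positives, right? Is there a function that reduces false negatives? If not, is there a way to specify which label should be treated as the positive one in the pipeline?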
