Using GridSearchCV for RandomForestRegressor

【Title】: Using GridSearchCV for RandomForestRegressor 【Posted】: 2015-03-09 12:51:54 【Question】:

I am trying to use GridSearchCV with RandomForestRegressor, but I always get ValueError: Found array with dim 100. Expected 500. Consider this toy example:

import numpy as np

from sklearn import ensemble
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import r2_score

if __name__ == '__main__':

    X = np.random.rand(1000, 2)
    y = np.random.rand(1000)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=1)

    # Set the parameters by cross-validation
    tuned_parameters = {'n_estimators': [500, 700, 1000], 'max_depth': [None, 1, 2, 3], 'min_samples_split': [1, 2, 3]}

    # clf = ensemble.RandomForestRegressor(n_estimators=500, n_jobs=1, verbose=1)
    clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5, scoring=r2_score, n_jobs=-1, verbose=1)
    clf.fit(X_train, y_train)
    print(clf.best_estimator_)

This is what I get:

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Traceback (most recent call last):
  File "C:\Users\abudis\Dropbox\machine_learning\toy_example.py", line 21, in <module>
    clf.fit(X_train, y_train)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.py", line 1240, in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.py", line 1296, in _score
    score = scorer(estimator, X_test, y_test)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\metrics.py", line 2324, in r2_score
    y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\metrics.py", line 65, in _check_reg_targets
    y_true, y_pred = check_arrays(y_true, y_pred)
  File "C:\Users\abudis\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\validation.py", line 254, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 100. Expected 500

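For some reason GridSearchCV thinks the n_estimators parameter should be equal to the size of each fold. If I change the first value of n_estimators in the tuned_parameters dictionary, I get the same ValueError with a different expected value.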

Training just one model with clf = ensemble.RandomForestRegressor(n_estimators=500, n_jobs=1, verbose=1) works fine, so I am not sure whether I am doing something wrong or whether there is a bug somewhere in scikit-learn.

【Answer 1】:

This looks like a bug, but in your case it should work if you use RandomForestRegressor's own scorer (which happens to be the R^2 score) by not specifying any scoring function in GridSearchCV:

clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5, 
                   n_jobs=-1, verbose=1)
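
As a quick check of the claim that the regressor's default score is R^2, here is a minimal sketch comparing `score` with `r2_score`. It assumes a recent scikit-learn, where `train_test_split` is imported from `sklearn.model_selection` rather than the `sklearn.cross_validation` module used in the question. The data sizes and `n_estimators=20` are arbitrary values chosen to keep the run short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Random toy data, as in the question (smaller, to run quickly).
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = rng.rand(200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_train, y_train)

# The estimator's own score() versus an explicit R^2 computation.
default_score = forest.score(X_test, y_test)
explicit_r2 = r2_score(y_test, forest.predict(X_test))
print(default_score, explicit_r2)
```

Both printed values should agree, which is why omitting `scoring` gives an R^2-based grid search for a regressor.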

Edit: As @jnothman mentioned in #4081, this is the real problem:

scoring does not accept a metric function. It accepts a function with the signature (estimator, X, y_true=None) -> float score. You can use scoring='r2' or scoring=make_scorer(r2_score).
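
The three valid ways of requesting R^2 can be sketched as follows, assuming a recent scikit-learn in which `GridSearchCV` is imported from `sklearn.model_selection` (the question uses the older `sklearn.grid_search` module). The function `r2_scorer` is a hypothetical helper written only to illustrate the required signature, and the small grid and data set are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = rng.rand(100)

param_grid = {'n_estimators': [10, 20], 'max_depth': [None, 2]}


def r2_scorer(estimator, X, y_true):
    # Hypothetical helper: a callable with the scorer signature
    # (estimator, X, y_true) -> float.
    return r2_score(y_true, estimator.predict(X))


scoring_options = [
    ('string', 'r2'),                          # built-in scorer name
    ('make_scorer', make_scorer(r2_score)),    # metric wrapped as a scorer
    ('callable', r2_scorer),                   # hand-written scorer
]

best_params = {}
for name, scoring in scoring_options:
    search = GridSearchCV(RandomForestRegressor(random_state=0),
                          param_grid, cv=3, scoring=scoring)
    search.fit(X, y)
    best_params[name] = search.best_params_
    print(name, search.best_params_, search.best_score_)

# A fitted forest supports len(), which returns its number of trees.
n_trees = len(search.best_estimator_)
print(n_trees == search.best_params_['n_estimators'])
```

Passing `r2_score` itself means it gets called as `r2_score(estimator, X_test, y_test)`, so the estimator lands in the `y_true` position. Since a fitted forest reports its number of trees through `len()`, this is likely why the error message in the question quoted the `n_estimators` value as the expected array size.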

【Comments】:

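I created two issues, #4080 and #4081.

Ah, okay. Yes, I specified the scoring parameter because I did not really know what the default for regressors was (mse or r2). Removing it completely works, thanks!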
