Grid Search Returns the Exactly Same Result Given a Custom Model

Posted 2023-03-12

【Title】: Grid Search Returns the Exactly Same Result Given a Custom Model 【Posted】: 2020-11-25 22:09:33 【Question】:

I wrapped a Scikit-Learn random forest model in a class, as follows:

from sklearn.base import BaseEstimator, RegressorMixin

class Model(BaseEstimator, RegressorMixin):
    def __init__(self, model):
        self.model = model
    
    def fit(self, X, y):
        self.model.fit(X, y)
        
        return self
    
    def score(self, X, y):
           
        from sklearn.metrics import mean_squared_error
        
        return mean_squared_error(y_true=y, 
                                  y_pred=self.model.predict(X), 
                                  squared=False)
    
    def predict(self, X):
        return self.model.predict(X)

class RandomForest(Model):
    def __init__(self, n_estimators=100, 
                 max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        
        self.n_estimators=n_estimators 
        self.max_depth=max_depth
        self.min_samples_split=min_samples_split
        self.min_samples_leaf=min_samples_leaf
        self.max_features=max_features
           
        from sklearn.ensemble import RandomForestRegressor
 
        self.model = RandomForestRegressor(n_estimators=self.n_estimators, 
                                           max_depth=self.max_depth, 
                                           min_samples_split=self.min_samples_split,
                                           min_samples_leaf=self.min_samples_leaf, 
                                           max_features=self.max_features,
                                           random_state = 777)
    
    
    def get_params(self, deep=True):
        return {"n_estimators": self.n_estimators,
                "max_depth": self.max_depth,
                "min_samples_split": self.min_samples_split,
                "min_samples_leaf": self.min_samples_leaf,
                "max_features": self.max_features}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

I mostly followed the official Scikit-Learn guide, available at https://scikit-learn.org/stable/developers/develop.html

This is what my grid search looks like:

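import pandas as pd
from sklearn.model_selection import GridSearchCV

# X and y are the question's own feature matrix and target (not shown here)
grid_search = GridSearchCV(estimator=RandomForest(), 
                            param_grid={'max_depth': [1, 3, 6], 'n_estimators': [10, 100, 300]},
                            n_jobs=-1, 
                            scoring='neg_root_mean_squared_error',
                            cv=5, verbose=True).fit(X, y)
    
print(pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score'))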

The grid search output and the printed grid_search.cv_results_ are shown below:

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.210918      0.002450         0.016754        0.000223   
1       0.207049      0.001675         0.016579        0.000147   
2       0.206495      0.002001         0.016598        0.000158   
3       0.206799      0.002417         0.016740        0.000144   
4       0.207534      0.001603         0.016668        0.000269   
5       0.206384      0.001396         0.016605        0.000136   
6       0.220052      0.024280         0.017247        0.001137   
7       0.226838      0.027507         0.017351        0.000979   
8       0.205738      0.003420         0.016246        0.000626   

  param_max_depth param_n_estimators                                   params  \
0               1                 10   {'max_depth': 1, 'n_estimators': 10}   
1               1                100  {'max_depth': 1, 'n_estimators': 100}   
2               1                300  {'max_depth': 1, 'n_estimators': 300}   
3               3                 10   {'max_depth': 3, 'n_estimators': 10}   
4               3                100  {'max_depth': 3, 'n_estimators': 100}   
5               3                300  {'max_depth': 3, 'n_estimators': 300}   
6               6                 10   {'max_depth': 6, 'n_estimators': 10}   
7               6                100  {'max_depth': 6, 'n_estimators': 100}   
8               6                300  {'max_depth': 6, 'n_estimators': 300}   

   split0_test_score  split1_test_score  split2_test_score  split3_test_score  \
0          -5.246725          -3.200585          -3.326962          -3.209387   
1          -5.246725          -3.200585          -3.326962          -3.209387   
2          -5.246725          -3.200585          -3.326962          -3.209387   
3          -5.246725          -3.200585          -3.326962          -3.209387   
4          -5.246725          -3.200585          -3.326962          -3.209387   
5          -5.246725          -3.200585          -3.326962          -3.209387   
6          -5.246725          -3.200585          -3.326962          -3.209387   
7          -5.246725          -3.200585          -3.326962          -3.209387   
8          -5.246725          -3.200585          -3.326962          -3.209387   

   split4_test_score  mean_test_score  std_test_score  rank_test_score  
0          -2.911422        -3.579016        0.845021                1  
1          -2.911422        -3.579016        0.845021                1  
2          -2.911422        -3.579016        0.845021                1  
3          -2.911422        -3.579016        0.845021                1  
4          -2.911422        -3.579016        0.845021                1  
5          -2.911422        -3.579016        0.845021                1  
6          -2.911422        -3.579016        0.845021                1  
7          -2.911422        -3.579016        0.845021                1  
8          -2.911422        -3.579016        0.845021                1  
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    3.2s finished

My question is: why does the grid search return exactly the same scores on every data split, whatever the hyperparameters are?

My hypothesis is that the grid search evaluates only one parameter combination (e.g. {'max_depth': 1, 'n_estimators': 10}) on all data splits. If so, why does that happen?

Finally, how can I make the grid search return correct results for all data splits?

【Comments】:

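Your hypothesis does not hold; cv_results_ clearly shows that every hyperparameter combination was tried (which is why you have 9 entries): see the columns param_max_depth and param_n_estimators. Little more can be said without your data, but the first debugging step is to run it without your wrapper class (i.e. with the native scikit-learn RF). From what you have shown, I don't see why you don't simply use RandomForestRegressor directly; why is this wrapper class needed?

@desertnaut If I use RandomForestRegressor() from scikit-learn, it works fine, meaning it returns correct results for all data splits.

【Answer 1】: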

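Your set_params method does not actually change the hyperparameters of the RandomForestRegressor instance stored in the self.model attribute. Instead, it sets attributes directly on your RandomForest instance (attributes that did not exist before and that do not affect the actual model!). So the grid search repeatedly sets these inconsequential new attributes, and the model that is actually fitted is the same every time. (Likewise, the get_params method reads the RandomForest attributes, which are distinct from the RandomForestRegressor attributes.)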
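To make the effect concrete, here is a minimal sketch of the same pattern. BrokenWrapper is a hypothetical, cut-down stand-in for the question's RandomForest class (only max_depth is kept), so treat it as an illustration rather than the asker's exact code:

```python
from sklearn.ensemble import RandomForestRegressor


class BrokenWrapper:
    """Hypothetical, cut-down stand-in for the question's RandomForest class."""

    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        # The inner model is built exactly once, from the constructor arguments.
        self.model = RandomForestRegressor(max_depth=self.max_depth,
                                           random_state=777)

    def set_params(self, **parameters):
        # Same logic as in the question: only the wrapper's attributes change.
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


wrapper = BrokenWrapper().set_params(max_depth=3)
wrapper_depth = wrapper.max_depth      # updated on the wrapper
inner_depth = wrapper.model.max_depth  # the inner model was never touched
print(wrapper_depth, inner_depth)      # prints: 3 None
```

The grid search does the same thing internally: it clones the estimator and then calls set_params on the clone, so every candidate ends up fitting an inner forest with the constructor defaults.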

You should be able to fix most of this by having set_params call self.model.set_params (and having get_params use self.model.<parameter_name> rather than just self.<parameter_name>).
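A rough sketch of that forwarding approach might look like the following. The class name ForwardingRandomForest, the reduced parameter list, and the synthetic data are all assumptions made for illustration (the question's X and y are not available), and RegressorMixin is listed first because recent scikit-learn versions recommend putting mixins before BaseEstimator:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


class ForwardingRandomForest(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper whose parameters live on the inner model."""

    def __init__(self, n_estimators=100, max_depth=None):
        self.model = RandomForestRegressor(n_estimators=n_estimators,
                                           max_depth=max_depth,
                                           random_state=777)

    def get_params(self, deep=True):
        # Read the values from the inner model, not from the wrapper.
        return {"n_estimators": self.model.n_estimators,
                "max_depth": self.model.max_depth}

    def set_params(self, **parameters):
        # Forward the new values to the model that is actually fitted.
        self.model.set_params(**parameters)
        return self

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


# Synthetic regression data, used only because the original data is not shown.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=200)

search = GridSearchCV(estimator=ForwardingRandomForest(),
                      param_grid={"max_depth": [1, 6], "n_estimators": [5, 20]},
                      scoring="neg_root_mean_squared_error",
                      cv=3).fit(X, y)
mean_scores = search.cv_results_["mean_test_score"]
print(mean_scores)  # the candidates should no longer share a single score
```

Note that forwarding only works for parameter names the inner RandomForestRegressor itself accepts; a wrapper with extra parameters of its own would need to handle those separately.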

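I think there is another problem, although I don't see how your example runs at all: you instantiate the model attribute using self.<parameter_name>, but these are never defined in __init__.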
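An alternative that sidesteps both problems, and that follows the convention in the scikit-learn developer guide linked in the question, is to let __init__ only store its arguments and to build the inner model inside fit. The sketch below is one possible shape for that; the name LazyRandomForest, the trimmed parameter list, and the toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor


class LazyRandomForest(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper that builds the inner model at fit time."""

    def __init__(self, n_estimators=100, max_depth=None):
        # Only store the arguments; the inherited get_params/set_params
        # from BaseEstimator then work without any custom code.
        self.n_estimators = n_estimators
        self.max_depth = max_depth

    def fit(self, X, y):
        # Built here, so it always sees the current parameter values.
        self.model_ = RandomForestRegressor(n_estimators=self.n_estimators,
                                            max_depth=self.max_depth,
                                            random_state=777)
        self.model_.fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Toy data, used only to show that set_params now reaches the fitted model.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 2))
y = X[:, 0] ** 2 + X[:, 1]

est = LazyRandomForest().set_params(max_depth=3, n_estimators=10)
est.fit(X, y)
fitted_depth = est.model_.max_depth
print(est.get_params(), fitted_depth)
```

Because the inner forest is created in fit, any value the grid search sets on the wrapper is picked up the next time the wrapper is fitted.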

【Discussion】:

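Thank you very much for your solution! Yes, you are right; I had not assigned the attributes to the wrapper correctly. As for the other problem, the attributes never being set in __init__(), I missed pasting that part into this post. I have edited the original question. Thanks for your comment!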
