Iterating GridSearchCV over multiple datasets gives identical result for each

Posted: 2022-01-21 13:53:22

【Question】:

I am trying to run a grid search in Scikit-learn for a particular algorithm, with different hyperparameters, over several training datasets stored in a dedicated dictionary. First, I set up the scoring metrics and the model to use:

scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
grid_search = {}

for key in X_train_d.keys():
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    model = XGBClassifier(objective="binary:logistic", random_state=42)
    space = dict()
    space['n_estimators'] = [50]  # 200
    space['learning_rate'] = [0.5]  # 0.01, 0.3, 0.5
    grid_search = GridSearchCV(model, space, scoring=scoring, cv=cv, n_jobs=3, verbose=2, refit='balanced_accuracy')

Then I create an empty dictionary, which should be filled with as many GridSearchCV objects as there are keys in X_train_d, via:

grid_result = {}
for key in X_train_d.keys():
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])

Finally, I create as many datasets as there are keys, reporting scoring and other information, via:

df_grid_results = {}
for key in X_train_d.keys():
    df_grid_results[key] = pd.DataFrame(grid_search.cv_results_)
    df_grid_results[key] = (
        df_grid_results[key]
        .set_index(df_grid_results[key]["params"].apply(
            lambda x: "_".join(str(val) for val in x.values()))
        )
        .rename_axis('kernel')
    )

Everything works "perfectly", in the sense that no error is shown, except that when I inspect the different GridSearchCV objects or the df_grid_results datasets, the results are all identical, as if the model had been fitted on the same dataset over and over, even though the X_train_d and Y_train_d dictionaries contain different datasets.

Of course, when I fit the models individually, e.g.:

model1_cv = grid_search.fit(X_train_d[1], Y_train_d[1])
model2_cv = grid_search.fit(X_train_d[2], Y_train_d[2])

the results differ, as expected.

I feel I'm missing something very silly and obvious here. Can anyone help? Thanks!

【Comments】:

Welcome to Stack Overflow. Please provide a working piece of code so that we can try it and help; X_train_d is not defined here. It looks like you overwrite the grid_search variable on every iteration, so only the last one is kept, which would explain your results. You have to define and use grid_search within the same loop before moving on to the next one.

【Answer 1】:

As Malo pointed out, the problem is that in your last loop you are copying the grid-search results of the last dataset into all of the dataframes. Moreover, the multiple loops in your code are not really needed: you can simplify it to a single loop and save the results directly in a dataframe, as follows:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# features datasets
X_train_d = {
    'd1': np.random.normal(0, 1, (100, 3)), 
    'd2': np.random.normal(0, 1, (100, 5))
}

# labels datasets
Y_train_d = {
    'd1': np.random.choice([0, 1], 100), 
    'd2': np.random.choice([0, 1], 100)
}

# parameter grid
param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.3, 0.5]}

# evaluation metrics
scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']

# cross-validation splits
cv = StratifiedKFold(n_splits=5)

# results data frame
df_grid_results = pd.DataFrame()

for key in X_train_d.keys():

    # run the grid search
    grid_search = GridSearchCV(
        estimator=XGBClassifier(objective='binary:logistic', random_state=42), 
        param_grid=param_grid, 
        scoring=scoring, 
        cv=cv, 
        n_jobs=3, 
        verbose=2, 
        refit='balanced_accuracy'
    )
    
    grid_search.fit(X_train_d[key], Y_train_d[key])
    
    # save the grid search results in the data frame
    df_temp = pd.DataFrame(grid_search.cv_results_)
    df_temp['dataset'] = key
    
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df_grid_results = pd.concat([df_grid_results, df_temp], ignore_index=True)

df_grid_results = df_grid_results.set_index(df_grid_results['params'].apply(lambda x: '_'.join(str(val) for val in x.values()))).rename_axis('kernel')

print(df_grid_results[['dataset', 'mean_test_accuracy', 'mean_test_balanced_accuracy', 'mean_test_f1', 'mean_test_precision', 'mean_test_recall']])
#         dataset  mean_test_accuracy  mean_test_balanced_accuracy  mean_test_f1  mean_test_precision  mean_test_recall  
# kernel                                                             
# 0.3_50       d1                0.40                     0.403232      0.349067             0.399953          0.335556  
# 0.3_100      d1                0.38                     0.382323      0.356022             0.368983          0.355556  
# 0.5_50       d1                0.43                     0.429596      0.351857             0.391209          0.335556  
# 0.5_100      d1                0.41                     0.409596      0.342767             0.365812          0.335556  
# 0.3_50       d2                0.55                     0.540025      0.448419             0.501948          0.436111
# 0.3_100      d2                0.57                     0.556692      0.462381             0.515996          0.436111  
# 0.5_50       d2                0.62                     0.607449      0.536695             0.587857          0.502778  
# 0.5_100      d2                0.64                     0.629672      0.571682             0.607857          0.547222  
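The identical results in the question can be reproduced without sklearn at all: `fit` returns the estimator object itself, so storing its return value under several keys just stores the same object repeatedly, and every key then reflects the last fit. A minimal sketch with a hypothetical `Search` class standing in for `GridSearchCV`:

```python
class Search:
    """Hypothetical stand-in: like sklearn estimators, fit() returns self."""
    def fit(self, X):
        self.result_ = sum(X)  # stands in for cv_results_
        return self

search = Search()  # created once, outside the loop -- the bug
results = {}
for key, data in {"d1": [1, 2], "d2": [10, 20]}.items():
    results[key] = search.fit(data)

# Every value is the SAME object, so all keys show the last fit only:
print(results["d1"] is results["d2"])  # True
print(results["d1"].result_)           # 30 (from d2, not d1)
```

Creating a fresh `GridSearchCV` inside the loop, as in the answer above, avoids this aliasing.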

【Comments】:

【Answer 2】:

It looks like you overwrite the grid_search variable on every iteration, so only the last one is kept. This would explain your results. You have to define and use grid_search within the same loop before moving on to the next one. Please provide working code and data and I will edit your code.

The idea is this:

scoring = ['accuracy', 'balanced_accuracy', 'f1', 'precision', 'recall']
grid_search = {}
grid_result = {}

for key in X_train_d.keys():
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    model = XGBClassifier(objective="binary:logistic", random_state=42)
    space = dict()
    space['n_estimators'] = [50]  # 200
    space['learning_rate'] = [0.5]  # 0.01, 0.3, 0.5
    grid_search = GridSearchCV(model, space, scoring=scoring, cv=cv, n_jobs=3, verbose=2, refit='balanced_accuracy')
    grid_result[key] = grid_search.fit(X_train_d[key], Y_train_d[key])

df_grid_results = {}
for key in X_train_d.keys():
    # read the results from the per-key fitted object, not the loop-local grid_search
    df_grid_results[key] = pd.DataFrame(grid_result[key].cv_results_)
    df_grid_results[key] = (
        df_grid_results[key]
        .set_index(df_grid_results[key]["params"].apply(
            lambda x: "_".join(str(val) for val in x.values()))
        )
        .rename_axis('kernel')
    )
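The final `set_index` step in both answers builds a compact string index (e.g. `0.5_50`) out of each `params` dict. In isolation, with made-up rows shaped like `cv_results_`, the transformation looks like this:

```python
import pandas as pd

# made-up rows mimicking the "params" column of GridSearchCV.cv_results_
df = pd.DataFrame({
    "params": [
        {"learning_rate": 0.3, "n_estimators": 50},
        {"learning_rate": 0.5, "n_estimators": 100},
    ],
    "mean_test_accuracy": [0.40, 0.64],
})

# join each dict's values into one string and use it as the row index
df = (
    df.set_index(df["params"].apply(lambda x: "_".join(str(v) for v in x.values())))
      .rename_axis("kernel")
)
print(df.index.tolist())  # ['0.3_50', '0.5_100']
```

Note that the label order follows the insertion order of each params dict, so all rows must use the same key order for the labels to be comparable.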

【Comments】:

Thanks Malo. Unfortunately, your solution led to the same problem: all results are identical. Also, I can't share any confidential data about this project.
