How to plot a heatmap and find the best hyperparameters for a decision tree after grid search

【Posted】: 2019-10-11 15:31:48

【Question】:

I need to plot a heatmap to find the best hyperparameters for a decision tree after running a grid search on the DonorsChoose dataset from Kaggle.

There are two hyperparameters here:

max_depth=[1, 5, 10, 50, 100, 500]
min_samples_split=[5, 10, 100, 500]

X_tr_bow = hstack((X_train_price_norm,X_train_categories_ohe,X_train_state_ohe,X_train_teacher_ohe,X_train_grade_ohe,X_train_essay__bow,X_train_clean_title__bow)).tocsr()

X_tr_bow is the data I fit the grid search on.

Dimensions of X_tr_bow - (53531, 7980) (53531,)

%%time
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
import math
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
lr_bow = DecisionTreeClassifier()
#alphas=list(map(lambda x: float(pow(10,x)),list(range(-15,16,1))))
#alphas=[0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 2500, 5000, 10000]
max_depth=[1, 5, 10, 50, 100, 500]
min_samples_split=[5, 10, 100, 500]
parameters = {'max_depth':max_depth,'min_samples_split':min_samples_split}

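# return_train_score=True is needed for cv_results_['mean_train_score'] in scikit-learn 0.21 and later
clf = GridSearchCV(lr_bow, parameters, cv= 10, scoring='roc_auc', return_train_score=True)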

clf.fit(X_tr_bow, y_train) 

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']


print("Best cross-validation score: {:.2f}".format(clf.best_score_))
print("Best parameters: ", clf.best_params_)

import pandas as pd
pvt = pd.pivot_table(pd.DataFrame( clf.cv_results_['param_max_depth'],clf.cv_results_['param_min_samples_split'],clf.cv_results_['mean_train_score'],clf.cv_results_['mean_test_score']),
index='param_alpha', columns='param_l1_ratio')
# values='mean_test_score'
pvt
import seaborn as sns       
ax = sns.heatmap(pvt)

Here is the error I get:

Best cross-validation score: 0.59
Best parameters:  {'max_depth': 50, 'min_samples_split': 500}
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
   1650                 blocks = [make_block(values=blocks[0],
-> 1651                                      placement=slice(0, len(axes[0])))]
   1652 

6 frames
ValueError: Wrong number of items passed 1, placement implies 24

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1689         raise ValueError("Empty data passed with indices specified.")
   1690     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 1691         passed, implied))
   1692 
   1693 

ValueError: Shape of passed values is (24, 1), indices imply (24, 24)
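
The shapes in this traceback are consistent with how `pd.DataFrame` reads positional arguments: its leading parameters are `data`, `index`, and `columns`. In the call above, `param_max_depth` would therefore be taken as the data (24 values, shape (24, 1)), `param_min_samples_split` as the row index, and `mean_train_score` as 24 column labels, which implies a (24, 24) frame. The `param_alpha` and `param_l1_ratio` names passed to `pivot_table` also do not match this search and appear to be left over from a different model. Below is a minimal sketch of the same mistake on a 4-row toy input; the scores are made-up placeholders, not results from the question.

```python
import pandas as pd

# Toy stand-ins for the cv_results_ arrays (hypothetical values)
depths = [1, 1, 5, 5]
splits = [5, 10, 5, 10]
scores = [0.55, 0.56, 0.60, 0.61]

# Keyword form of the positional call: data, then index, then columns.
# Four values of data cannot fill a 4 x 4 frame, so pandas raises ValueError.
try:
    pd.DataFrame(data=depths, index=splits, columns=scores)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# A dict of named columns builds one column per array instead.
frame = pd.DataFrame({
    "max_depth": depths,
    "min_samples_split": splits,
    "mean_test_score": scores,
})
```

Building the frame from a dict (or directly from `clf.cv_results_`) gives named columns that `pivot_table` can then refer to.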

【Answer 1】:

In case anyone is still looking for an answer, the following code worked for me:

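import pandas as pd
import seaborn as sn

# rand_search_cv is the fitted search object (clf in the question)
results = pd.DataFrame.from_dict(rand_search_cv.cv_results_)
# select the score columns before aggregating; mean_train_score requires return_train_score=True
max_scores = results.groupby(['param_min_samples_split', 'param_max_depth'])[['mean_test_score', 'mean_train_score']].max()
max_scores = max_scores.unstack()
sn.heatmap(max_scores.mean_test_score, annot=True, fmt='.4g');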
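
The same reshaping can also be written with `pivot_table`, which is what the question was attempting. The sketch below is one possible way to do it, not the answerer's method: `scores_to_grid` is a hypothetical helper name, and `fake_results` is a made-up stand-in for `clf.cv_results_` so the example runs without fitting a model.

```python
import pandas as pd

def scores_to_grid(cv_results, row_param, col_param, score="mean_test_score"):
    """Reshape a GridSearchCV-style cv_results_ dict into a 2-D table of scores."""
    frame = pd.DataFrame({
        row_param: list(cv_results["param_" + row_param]),
        col_param: list(cv_results["param_" + col_param]),
        score: list(cv_results[score]),
    })
    # One row per value of row_param, one column per value of col_param
    return frame.pivot_table(index=row_param, columns=col_param, values=score)

# Hypothetical 2 x 2 search results (placeholder scores, not real output)
fake_results = {
    "param_max_depth": [1, 1, 5, 5],
    "param_min_samples_split": [5, 10, 5, 10],
    "mean_test_score": [0.55, 0.56, 0.60, 0.61],
}
grid = scores_to_grid(fake_results, "max_depth", "min_samples_split")
```

With a fitted search, `scores_to_grid(clf.cv_results_, "max_depth", "min_samples_split")` should give a table that can be passed to `sns.heatmap(grid, annot=True, fmt=".3f")`; the best cell should correspond to `clf.best_params_`.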
