How to plot the random forest tree corresponding to the best parameters

Posted 2023-03-12


【Title】: How to plot the random forest tree corresponding to best parameter 【Posted】: 2020-09-18 13:16:51 【Question】:

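Python: 3.6

Windows: 10

I have a few questions about random forests and the problem at hand:

I am using a grid search to run a regression problem with random forest. I want to plot the tree corresponding to the best-fit parameters found by the search. Here is the code.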

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    # X, y and random_grid are defined elsewhere (not shown in the question)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

    # Use the random grid to search for best hyperparameters
    # First create the base model to tune
    rf = RandomForestRegressor()
    # Random search of parameters, using 5 fold cross validation,
    # search across 50 different combinations, and use all available cores
    rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 50, cv = 5, verbose=2, random_state=56, n_jobs = -1)
    # Fit the random search model
    rf_random.fit(X_train, y_train)

    rf_random.best_params_

The best parameters are:

    {'n_estimators': 1000,
     'min_samples_split': 5,
     'min_samples_leaf': 1,
     'max_features': 'auto',
     'max_depth': 5,
     'bootstrap': True}

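How do I plot this tree using the above parameters?

My dependent variable y lies in the range [0, 1] (continuous), and all predictor variables are binary or categorical. In general, which algorithm is well suited to this kind of input and output feature space? I tried random forest (it did not give very good results). Note that the y variable here is a ratio, which is why it lies between 0 and 1. Example: expense on food / total expense.

The data above is skewed, meaning the dependent variable y has value = 1 in 60% of the data and lies strictly between 0 and 1 in the rest, e.g. 0.66, 0.87, and so on.

Since my data has only binary 0/1 variables and categorical variables A, B, C, do I need to convert it to one-hot encoded variables to use random forest?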

【Comments on the question】:

【Answer 1】:

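Regarding the plot (I'm afraid your other questions are too broad for SO; the general idea is to avoid asking multiple questions at once):

Fitting your RandomizedSearchCV produces rf_random.best_estimator_, which is itself a random forest with the parameters shown in your question (including 'n_estimators': 1000).

According to the docs, a fitted RandomForestRegressor includes the attribute:

estimators_: list of DecisionTreeRegressor

The collection of fitted sub-estimators.

So, to plot any single tree of your random forest, you should use either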

    from sklearn import tree
    tree.plot_tree(rf_random.best_estimator_.estimators_[k])

or

    from sklearn import tree
    tree.export_graphviz(rf_random.best_estimator_.estimators_[k])

for the desired k in [0, 999] in your case (in the general case, in [0, n_estimators-1]).
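The following is a minimal end-to-end sketch of that approach, not the asker's actual setup. The synthetic data, the small random_grid, and the feature names f0..f3 are assumptions made up for illustration, since the question does not show them. It fits a small RandomizedSearchCV, takes tree k from best_estimator_.estimators_, draws it with plot_tree, and also obtains the Graphviz DOT text with export_graphviz.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): binary predictors, target clipped to [0, 1]
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 4)).astype(float)
y = np.clip(0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, 200), 0, 1)

# Hypothetical, deliberately tiny search space so the example runs quickly
random_grid = {
    "n_estimators": [10, 20],
    "max_depth": [2, 3],
    "min_samples_split": [2, 5],
}
rf_random = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=random_grid,
    n_iter=4, cv=3, random_state=56,
)
rf_random.fit(X, y)

# The refitted forest that uses the best parameters found by the search
best_rf = rf_random.best_estimator_

k = 0  # index of the tree to draw, anywhere in [0, n_estimators - 1]
feature_names = ["f0", "f1", "f2", "f3"]

# Option 1: draw the tree with matplotlib
fig, ax = plt.subplots(figsize=(12, 6))
annotations = tree.plot_tree(
    best_rf.estimators_[k], feature_names=feature_names, filled=True, ax=ax
)

# Option 2: get Graphviz DOT source as a string (out_file=None returns it)
dot_source = tree.export_graphviz(
    best_rf.estimators_[k], out_file=None, feature_names=feature_names
)
```

In an interactive session, plt.show() or fig.savefig(...) displays or stores the figure; the DOT string can be rendered with Graphviz if it is installed.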

【Comments】:

【Answer 2】:

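Before answering your questions, allow me to take a step back.

Ideally, the best_params_ output of RandomizedSearchCV should be explored further with GridSearchCV. RandomizedSearchCV samples your parameter space without trying every possible option. Then, once you have the best_params_ of RandomizedSearchCV, you can investigate all possible options within a narrower range.

You did not include the random_grid parameter in your code, but I would expect you to run a GridSearchCV like this: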

    from sklearn.model_selection import GridSearchCV

    # Create the parameter grid based on the results of RandomizedSearchCV
    param_grid = {
        'max_depth': [4, 5, 6],
        'min_samples_leaf': [1, 2],
        'min_samples_split': [4, 5, 6],
        'n_estimators': [990, 1000, 1010]
    }

    # Set up the grid search (GridSearchCV has no random_state argument)
    grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
                               cv = 5, n_jobs = -1, verbose = 2)
    # Fit the grid search model
    grid_search.fit(X_train, y_train)

What the above does is go through every possible combination of the parameters in param_grid and give you the best one.
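To get a feel for how many combinations that is, the grid can be counted before fitting anything. This small sketch uses scikit-learn's ParameterGrid on the param_grid values from this answer; cv = 5 is taken from the GridSearchCV call above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": [4, 5, 6],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [4, 5, 6],
    "n_estimators": [990, 1000, 1010],
}

# ParameterGrid enumerates the Cartesian product that GridSearchCV iterates over
n_candidates = len(ParameterGrid(param_grid))  # 3 * 2 * 3 * 3
cv = 5
n_fits = n_candidates * cv  # one fit per candidate per fold

print(n_candidates, n_fits)  # prints: 54 270
```

With the default refit=True there is one additional fit of the best candidate on the full training set. At roughly 1000 trees per forest this is a substantial amount of work, which is why the exhaustive search is only run over a narrow range.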

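Now to your questions:

A random forest is a combination of many trees, so there is no single tree to plot. What you can do is plot one or more of the trees used by the random forest. This can be done with the plot_tree function. Read the documentation and this SO question to understand it further.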
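As a rough illustration of plotting more than one tree, the sketch below draws the first three trees of a small forest side by side. The synthetic data, the feature names x0..x2, and the forest settings are assumptions chosen only to keep the figure readable; they are not taken from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree

# Synthetic stand-in data (assumption): binary predictors, ratio-like target
rng = np.random.RandomState(1)
X = rng.randint(0, 2, size=(150, 3))
y = np.clip(0.6 * X[:, 0] + 0.2 * X[:, 2] + rng.normal(0, 0.05, 150), 0, 1)

forest = RandomForestRegressor(n_estimators=10, max_depth=2, random_state=0)
forest.fit(X, y)

# One subplot per tree; each entry of estimators_ is a DecisionTreeRegressor
n_to_plot = 3
fig, axes = plt.subplots(1, n_to_plot, figsize=(15, 4))
for k, ax in enumerate(axes):
    plot_tree(forest.estimators_[k], feature_names=["x0", "x1", "x2"],
              filled=True, ax=ax)
    ax.set_title(f"tree {k}")
```

Because each tree is trained on a different bootstrap sample, the plotted trees generally differ from one another even though they share the same hyperparameters.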

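Did you try a simple linear regression first?

This will affect the metric you use to evaluate the fit/accuracy of your model. Precision, recall and F1 score come to mind when dealing with imbalanced/skewed data.

Yes, categorical variables need to be converted to dummy variables before fitting a random forest.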
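A minimal sketch of that conversion, assuming the data lives in a pandas DataFrame: the column names flag, group and y and the six rows are invented for illustration. pd.get_dummies expands the categorical column into one 0/1 column per level, and the already-binary column is passed through unchanged.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: one binary predictor, one categorical predictor (A/B/C),
# and a ratio-valued target in [0, 1]
df = pd.DataFrame({
    "flag": [0, 1, 1, 0, 1, 0],
    "group": ["A", "B", "C", "A", "C", "B"],
    "y": [1.0, 0.66, 0.87, 1.0, 0.9, 1.0],
})

# One-hot encode only the categorical column; dtype=int yields 0/1 integers
X = pd.get_dummies(df[["flag", "group"]], columns=["group"], dtype=int)
y = df["y"]

encoded_columns = list(X.columns)
print(encoded_columns)  # ['flag', 'group_A', 'group_B', 'group_C']

# The encoded frame is all-numeric, so the forest can be fitted on it directly
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
```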

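【Comments】:

Your suggestion above for plotting the tree works for the random forest classifier, but not for the regressor.

@MAC According to the scikit-learn documentation, the plot_tree function can be used for both classifiers and regressors. Though I must admit I have never applied it to a regressor.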

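I have written: grid = GridSearchCV(estimator=xgb, param_grid=params, scoring='neg_mean_squared_error', n_jobs=4, verbose=3) and grid.fit(X_train, y_train). Now how do I plot the tree based on the best estimator??

@MAC XGBoost and random forest are both ensembles of multiple decision trees. There is no single tree that represents the best parameters. However, a specific tree in a trained XGBoost model can be plotted with plot_tree(grid.best_estimator_, num_trees=0), passing the fitted model held by the search object rather than grid itself. Replace 0 with the index of the decision tree you want to visualize. To find out how many trees your model has, look at the n_estimators of grid.best_estimator_.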
