Get all features in decision path to leaf node (Random Forest)

Posted 2023-03-12


Asked: 2017-09-07 17:58:45

Question:

I have the following sample code for a simple random forest classifier on the iris dataset, using just 2 decision trees. The code is best run in a Jupyter notebook.

# Setup
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
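from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import tree  # needed for tree.export_graphviz below
from sklearn.ensemble import RandomForestClassifier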
from sklearn.metrics import confusion_matrix
import numpy as np
# Set seed for reproducibility
np.random.seed(1015)

# Load the iris data
iris = load_iris()

# Create the train-test datasets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

np.random.seed(1039)

# Just fit a simple random forest classifier with 2 decision trees
rf = RandomForestClassifier(n_estimators = 2)
rf.fit(X = X_train, y = y_train)

# Draw the decision trees inline in IPython
# Adapted from: http://scikit-learn.org/stable/modules/tree.html
from IPython.display import display, Image
import pydotplus

# Now plot the trees individually
for dtree in rf.estimators_:
    dot_data = tree.export_graphviz(dtree
                                    , out_file = None
                                    , filled   = True
                                    , rounded  = True
                                    , special_characters = True)  
    graph = pydotplus.graph_from_dot_data(dot_data)  
    img = Image(graph.create_png())
    display(img)
    # draw_tree(inp_tree = dtree)  # draw_tree is never defined in this snippet; display(img) above already renders the tree
    #print(dtree.tree_.feature)

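The output for the first tree is:

As can be seen, the first decision tree has 8 leaf nodes and the second decision tree (not shown) has 6 leaf nodes.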

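How can I extract a simple numpy array that contains, for every decision tree and every leaf node in that tree:

- the classification result of that leaf node (i.e. the most frequent class it predicts)
- all the features (as booleans) used in the decision path to that same leaf node?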

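In the example above, we would have:

- 2 trees, indexed 0, 1
- for tree 0, 8 leaf nodes indexed 0, 1, ..., 7
- for tree 1, 6 leaf nodes indexed 0, 1, ..., 5
- for each leaf node in each tree, a single most frequent predicted class, i.e. 0, 1, or 2 for the iris dataset
- for each leaf node, a set of boolean values, one for each of the 4 features used to build the tree. A feature counts as True if it is used one or more times in the decision path to the leaf node, and as False if it is never used in that path.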

Any help adapting the code above (the loop) to produce this numpy array would be appreciated.

Thanks

Comments:

Have you looked at the code in the tree class? In particular, I think the code of the export_graphviz function is a good starting point: github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/tree/…

@user4687531 When I try to run your code, I get name 'draw_tree' is not defined. Any idea why?

Decision nodes can be accessed in Python, see ***.com/questions/50600290/…

Answer 1:

This is similar to the question here: how extraction decision rules of random forest in python

You can use the snippet provided by @jonnor (I have also used it, with modifications):

import numpy
from sklearn.model_selection import train_test_split
from sklearn import metrics, datasets, ensemble

def print_decision_rules(rf):

    for tree_idx, est in enumerate(rf.estimators_):
        tree = est.tree_
        assert tree.value.shape[1] == 1 # no support for multi-output

        print('TREE: {}'.format(tree_idx))

        iterator = enumerate(zip(tree.children_left, tree.children_right, tree.feature, tree.threshold, tree.value))
        for node_idx, data in iterator:
            left, right, feature, th, value = data

            # left: index of left child (if any)
            # right: index of right child (if any)
            # feature: index of the feature to check
            # th: the threshold to compare against
            # value: values associated with classes            

            # for a classifier, value holds the per-class weights at this node;
            # argmax picks the majority class
            class_idx = numpy.argmax(value[0])

            if left == -1 and right == -1:
                print('{} LEAF: return class={}'.format(node_idx, class_idx))
            else:
                # scikit-learn sends a sample to the left child when feature <= threshold
                print('{} NODE: if feature[{}] <= {} then next={} else next={}'.format(node_idx, feature, th, left, right))


digits = datasets.load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target)
estimator = ensemble.RandomForestClassifier(n_estimators=3, max_depth=2)
estimator.fit(Xtrain, ytrain)

# Print the rules of every tree in the fitted forest
print_decision_rules(estimator)
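The snippet above prints the rules but does not build the array the question asks for. A minimal sketch of one way to do that follows, assuming a fitted single-output classifier. The function name `leaf_feature_table` and the column layout are illustrative choices, not part of scikit-learn, and leaves are numbered here in left-to-right traversal order within each tree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def leaf_feature_table(rf, n_features):
    """One row per leaf: [tree index, leaf index, majority class, used_0 .. used_{n-1}].

    Hypothetical helper for illustration; the layout is an assumption.
    """
    rows = []
    for tree_idx, est in enumerate(rf.estimators_):
        t = est.tree_
        leaf_idx = 0
        # Depth-first walk from the root, carrying the features seen so far.
        stack = [(0, np.zeros(n_features, dtype=bool))]
        while stack:
            node, used = stack.pop()
            left, right = t.children_left[node], t.children_right[node]
            if left == -1 and right == -1:  # leaf node
                # Index into rf.classes_; for iris this is the label itself.
                cls = int(np.argmax(t.value[node][0]))
                rows.append([tree_idx, leaf_idx, cls] + used.astype(int).tolist())
                leaf_idx += 1
            else:
                used = used.copy()  # do not mutate the parent's record
                used[t.feature[node]] = True
                # Push right first so the left subtree is visited first.
                stack.append((right, used))
                stack.append((left, used))
    return np.array(rows)

iris = load_iris()
rf = RandomForestClassifier(n_estimators=2, random_state=0).fit(iris.data, iris.target)
table = leaf_feature_table(rf, iris.data.shape[1])
print(table)
```

Each row can be sliced as needed: `table[:, 2]` holds the predicted class per leaf and `table[:, 3:].astype(bool)` the per-feature usage flags.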

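Another approach, with visualization:

To visualize the decision path, you can use the dtreeviz library from https://explained.ai/decision-tree-viz/index.html

It has excellent visualizations, for example:

Source: https://explained.ai/decision-tree-viz/images/samples/sweets-TD-3-X.svg

Take a look at their shadowDecisionTree implementation for more information about the decision path. At https://explained.ai/decision-tree-viz/index.html they also provide an example: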

shadow_tree = ShadowDecTree(tree_model, X_train, y_train, feature_names, class_names)

You can then use something like the get_leaf_sample_counts method.
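If adding a dependency is not desirable, scikit-learn itself exposes the path a given sample takes through a tree via `decision_path` and `apply`. The sketch below is only an illustration on the iris data (the model settings and the choice of the first five samples are assumptions); it reports, per sample, the leaf reached and which features were tested on the way.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=2, random_state=0).fit(iris.data, iris.target)

est = rf.estimators_[0]                # first tree of the forest
X = iris.data[:5]                      # a few samples to trace
node_indicator = est.decision_path(X)  # sparse (n_samples, n_nodes) indicator matrix
leaf_ids = est.apply(X)                # node id of the leaf each sample ends in

for i in range(X.shape[0]):
    # Node ids visited by sample i, read from the CSR row
    start, stop = node_indicator.indptr[i], node_indicator.indptr[i + 1]
    path_nodes = node_indicator.indices[start:stop]
    # Drop the leaf itself; only internal nodes test a feature
    split_nodes = path_nodes[path_nodes != leaf_ids[i]]
    used = np.zeros(X.shape[1], dtype=bool)
    used[est.tree_.feature[split_nodes]] = True
    print(i, leaf_ids[i], used)
```

Note that `leaf_ids` are node ids within the tree, not consecutive leaf numbers, so they would need remapping to match the 0, 1, ..., 7 indexing used in the question.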
