What is the difference between threshold and feature (for each trained node) in scikit-learn's DecisionTreeClassifier?

Posted 2023-03-12


Asked: 2021-06-02 22:06:33

Question:

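I have been going through the data structures of DecisionTreeClassifier in scikit-learn. In short, I just read this page, https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html, which helped me a lot because I need to extract the internal data of a trained decision tree. However, a question came up. Each node has a threshold value and a feature value. The threshold I understand: in the test phase, a feature vector (from the test data) is given to the tree as input, one of its features is mapped to a node, and we compare that feature (from the test data) against the node's threshold.

What exactly is the feature (from the training data) in the trained tree? Here is the code snippet.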

import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf.fit(X_train, y_train)
n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right

# This is an array where one feature value 
# is associated with each node in the tree trained.
# What's the meaning of the feature for each node
# in the trained tree?
feature = clf.tree_.feature
threshold = clf.tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)

# what this shows is `[ 3 -2  2 -2 -2]`, 
# where the 1st, 3rd, 4th nodes are leaves 
# and associated with -2. 
# What are 3 and 2 on the other split node? 
# How were these values determined?
print(feature)

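In this example the feature vector has 4 dimensions, and the tree has 5 nodes, counting both leaf and non-leaf nodes. feature is [ 3 -2 2 -2 -2], where every node except the 0th and the 2nd is a leaf. The non-leaf nodes are associated with the values 3 and 2. What does that mean? Does it mean that, for a feature vector (from the test data) x=(x0, x1, x2, x3), we use x3 at node 0 and compare it with that node's threshold, and we use x2 at node 2 and compare it with that node's threshold?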

Answer 1:

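I suggest checking this answered question; it is very useful. In any case, I will explain it with an example.

First, the definition given by Adam: clf.tree_.feature returns the nodes/leaves in depth-first-search order. It starts at the root node and follows the left children until it reaches a leaf (encoded as -2). When it reaches a leaf, it climbs back up until it reaches a split node with an unvisited child. From that node it descends the hierarchy again until it reaches a leaf.

Let's look at an example. First we plot the decision tree: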

fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf, 
                   feature_names=iris.feature_names,  
                   class_names=iris.target_names,
                   filled=True)

Now let's print the features (clf.tree_.feature):

array([ 3, -2,  2, -2, -2], dtype=int64)

I will also print the feature names:

iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
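The two outputs above can be combined so that each node shows the name of the feature it splits on. The following is a minimal sketch, assuming the arrays are copied by hand from the outputs above rather than read from a fitted clf; the name TREE_UNDEFINED is used here only as a readable label for the -2 leaf marker.

```python
# Leaf marker stored in tree_.feature for nodes that do not split.
TREE_UNDEFINED = -2

# Copied from the outputs shown above (not recomputed here).
feature = [3, -2, 2, -2, -2]
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Each entry of `feature` is a column index into the feature vector,
# so it can be used directly to look up the column name.
labels = [
    feature_names[f] if f != TREE_UNDEFINED else "leaf"
    for f in feature
]

for node_id, label in enumerate(labels):
    print(f"node {node_id}: {label}")
```

With a fitted classifier, the same list comprehension should work on clf.tree_.feature and iris.feature_names directly.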

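Let's go deeper. We have clf.tree_.feature equal to [ 3 -2 2 -2 -2]. As the definition says, we explore the tree downwards, starting from the left. Let's go through it in order:

What is the first feature we find? The one at the root: 'petal width (cm)', i.e. feature 3. Feature=[3]

Next we go down and to the left (the orange leaf). Since it is a leaf, we return -2. Feature=[3, -2]

Now we go back up to the root node, and this time we go to the right. Guess which feature it is? 'petal length (cm)', feature number 2. Feature=[3, -2, 2]

We go down to the left (the green leaf). Since it is a leaf, we return -2. Feature=[3, -2, 2, -2]

We go back up to 'petal length (cm)' and now move to the right (the strong purple leaf). We are at a leaf node again, so we return -2. Feature=[3, -2, 2, -2, -2]

So in the end we get: Feature=[3, -2, 2, -2, -2]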
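To tie this back to the original question: at prediction time, the entry in feature says which column of the sample to read at a node, and the entry in threshold says what to compare it with (scikit-learn sends the sample left when the value is <= the threshold). Below is a minimal sketch of that walk. The feature array comes from the output above and the children arrays are inferred from the walkthrough, but the threshold values are illustrative assumptions only; the real ones should be read from clf.tree_.threshold. The function name decision_path is hypothetical.

```python
# feature: from the output above. children_*: inferred from the walkthrough
# (node 0 -> leaf 1 and split node 2; node 2 -> leaves 3 and 4).
feature = [3, -2, 2, -2, -2]
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
# ASSUMPTION: illustrative thresholds, not taken from the original post.
threshold = [0.8, -2.0, 4.95, -2.0, -2.0]


def decision_path(x):
    """Return the node ids visited by sample x, from the root to a leaf."""
    node = 0
    path = [node]
    # A leaf has identical (both -1) left and right child ids.
    while children_left[node] != children_right[node]:
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
        path.append(node)
    return path


# x = (x0, x1, x2, x3): node 0 reads x3, node 2 reads x2.
print(decision_path([5.1, 3.5, 1.4, 0.2]))  # x3 <= 0.8, stops at leaf 1
print(decision_path([6.0, 3.0, 4.5, 1.5]))  # x3 > 0.8, then x2 <= 4.95
print(decision_path([6.5, 3.0, 5.5, 2.0]))  # x3 > 0.8, then x2 > 4.95
```

So the reading proposed in the question appears to be right: 3 and 2 are indices into the feature vector, not feature values.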

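Comments:

Thanks! Is the fact that the traversal is done in a depth-first-search manner documented somewhere? I could not find the original information. I can confirm, though, that the results on other datasets match what DFS would give. The example shown here is not a great one, because DFS and BFS produce the same traversal order for it.

I could not find documentation confirming that depth-first search is applied. Going by the link I gave earlier in my answer, I am almost 100% sure that DFS is used. In any case, to confirm it, try another simple example with more nodes, where it is easy to tell whether the traversal is done in DFS or BFS order.

I think pre-order traversal is the more fitting description of the node indexing here.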
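One way to run the check suggested in the comments is to recompute both traversal orders from the children arrays and see which one reproduces the node ids 0, 1, 2, ... The sketch below uses a hand-built 7-node tree (an assumption for illustration, not scikit-learn output) numbered in pre-order, plus the 5-node tree from the question; the function names preorder and breadth_first are hypothetical helpers.

```python
from collections import deque


def preorder(children_left, children_right):
    """Depth-first, node-before-children order, left child first."""
    order, stack = [], [0]
    while stack:
        node = stack.pop()
        order.append(node)
        if children_left[node] != children_right[node]:  # split node
            stack.append(children_right[node])  # pushed first, popped last
            stack.append(children_left[node])
    return order


def breadth_first(children_left, children_right):
    """Level-by-level order, left child first."""
    order, queue = [], deque([0])
    while queue:
        node = queue.popleft()
        order.append(node)
        if children_left[node] != children_right[node]:
            queue.append(children_left[node])
            queue.append(children_right[node])
    return order


# The 5-node tree from the question: both orders coincide.
small_left, small_right = [1, -1, 3, -1, -1], [2, -1, 4, -1, -1]
print(preorder(small_left, small_right))       # [0, 1, 2, 3, 4]
print(breadth_first(small_left, small_right))  # [0, 1, 2, 3, 4]

# Hand-built 7-node tree, numbered in pre-order: the orders differ.
big_left, big_right = [1, 2, -1, -1, 5, -1, -1], [4, 3, -1, -1, 6, -1, -1]
print(preorder(big_left, big_right))       # [0, 1, 2, 3, 4, 5, 6]
print(breadth_first(big_left, big_right))  # [0, 1, 4, 2, 3, 5, 6]
```

For a fitted classifier, pass clf.tree_.children_left and clf.tree_.children_right; if preorder(...) returns list(range(clf.tree_.node_count)), the node ids follow pre-order. One caveat worth testing rather than assuming: the scikit-learn documentation says trees with max_leaf_nodes set are grown best-first, so the ordering in that case may not match either traversal on larger trees.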
