Extract rule path of data point through decision tree with sklearn python

Posted: 2018-11-06 07:03:13

【Question】

I'm working with a decision tree model, and I want to extract the decision path of each data point in order to understand what leads to Y rather than to predict it. How can I do that? I couldn't find any documentation on it.

【Comments】:

【Answer 1】:

Here is an example using the iris dataset.

from sklearn.datasets import load_iris
from sklearn import tree
import graphviz 

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

dot_data = tree.export_graphviz(clf, out_file=None, 
                                feature_names=iris.feature_names,  
                                class_names=iris.target_names,  
                                filled=True, rounded=True,  
                                special_characters=True)  
graph = graphviz.Source(dot_data)  
# this will create an iris.pdf file showing the tree with its rule paths
graph.render("iris")


Edit: the code below comes from the sklearn documentation, with a few small changes to achieve your goal.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# The decision estimator has an attribute called tree_  which stores the entire
# tree structure and allows access to low level attributes. The binary tree
# tree_ is represented as a number of parallel arrays. The i-th element of each
# array holds information about the node `i`. Node 0 is the tree's root. NOTE:
# Some of the arrays only apply to either leaves or split nodes, resp. In this
# case the values of nodes of the other type are arbitrary!
#
# Among those arrays, we have:
#   - left_child, id of the left child of the node
#   - right_child, id of the right child of the node
#   - feature, feature used for splitting the node
#   - threshold, threshold value at the node

n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

# The tree structure can be traversed to compute various properties such
# as the depth of each node and whether or not it is a leaf.
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has %s nodes and has "
      "the following tree structure:"
      % n_nodes)
for i in range(n_nodes):
    if is_leaves[i]:
        print("%snode=%s leaf node." % (node_depth[i] * "\t", i))
    else:
        print("%snode=%s test node: go to node %s if X[:, %s] <= %s else to "
              "node %s."
              % (node_depth[i] * "\t",
                 i,
                 children_left[i],
                 feature[i],
                 threshold[i],
                 children_right[i],
                 ))
print()
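# With this split, the printout above describes a 5-node tree: node 0 splits on
# X[:, 3] <= 0.8, node 2 splits on X[:, 2] <= 4.95, and nodes 1, 3 and 4 are
# leaves (consistent with the sample output at the end of this answer).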

# First let's retrieve the decision path of each sample. The decision_path
# method allows to retrieve the node indicator functions. A non zero element of
# indicator matrix at the position (i, j) indicates that the sample i goes
# through the node j.

node_indicator = estimator.decision_path(X_test)
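# For example, node_indicator.toarray()[0] would be [1, 0, 1, 0, 1] here,
# meaning sample 0 passes through nodes 0, 2 and 4 (matching the output below).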

# Similarly, we can also have the leaves ids reached by each sample.

leave_id = estimator.apply(X_test)

# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's do it for a single sample.

# HERE IS WHAT YOU WANT
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:

    if leave_id[sample_id] == node_id:  # <-- changed != to ==
        #continue # <-- comment out
        print("leaf node  reached, no decision here".format(leave_id[sample_id])) # <--

    else: # <-- added else to iterate through decision nodes
        if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]], # <-- changed i to sample_id
                 threshold_sign,
                 threshold[node_id]))

This will print the following at the end:

Rules used to predict sample 0:
decision id node 0 : (X[0, 3] (= 2.4) > 0.800000011920929)
decision id node 2 : (X[0, 2] (= 5.1) > 4.949999809265137)
leaf node 4 reached, no decision here
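
As the comments below note, you can get the rule path for every data point by looping over sample indices. A minimal sketch (reusing node_indicator, leave_id, feature, threshold and X_test from the code above; the one-line rule format is just an illustration):

# print one compact rule line per test sample
for sample_id in range(X_test.shape[0]):
    node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                        node_indicator.indptr[sample_id + 1]]
    rules = []
    for node_id in node_index:
        if leave_id[sample_id] == node_id:
            continue  # leaf node: no decision to record
        if X_test[sample_id, feature[node_id]] <= threshold[node_id]:
            sign = "<="
        else:
            sign = ">"
        rules.append("X[%d] %s %s" % (feature[node_id], sign, threshold[node_id]))
    print("sample %d: %s" % (sample_id, " and ".join(rules)))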

【Discussion】:

Thanks for your answer, but I'm looking for the path of each data point. For example: row number - 1, rule - petal length > 2.45 and petal width > 1.75...
Do you mean the rules that were used to predict a sample?
Yes, the final result would be the sample index and its rules. Thanks!
@AdiCohen See my updated answer; it does what you ask. The important code for you starts right after the sample_id = 0 command. By changing sample_id you can print the rules for every sample! Cheers

【Answer 2】:

The code

from sklearn.tree import export_text  # in older sklearn versions this lived at sklearn.tree.export
# feature_names must be a list of strings; list(X_train) assumes X_train is a DataFrame
tree_rules = export_text(clf, feature_names=list(X_train))
print(tree_rules)

will give you the rules the tree has built and helps in understanding the predictions.
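
For reference, export_text prints one indented line per split; on an iris tree the output looks roughly like this (exact thresholds depend on the fitted tree):

|--- petal width (cm) <= 0.80
|   |--- class: 0
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: 1
|   |--- petal width (cm) >  1.75
|   |   |--- class: 2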

【Discussion】:
