ML Decision Tree classifier is only splitting on the same tree / asking about the same attribute

Posted: 2021-03-18 21:58:00

Problem description:

I am currently building a decision tree classifier using Gini impurity and information gain, splitting the tree on whichever attribute gives the largest gain at each step. However, it keeps splitting on the same attribute every time and only adjusts the value it asks about in its question. This results in very low accuracy, typically around 30%, because it only ever considers the first attribute.

Finding the best split

 # Used to find the best split for data among all attributes

def split(r):
    max_ig = 0
    max_att = 0
    max_att_val = 0
    i = 0

    curr_gini = gini_index(r)
    n_att = len(att)

    for c in range(n_att):
        if c == 3:
            continue

        c_vals = get_column(r, c)

        while i < len(c_vals):
            # Value of the current attribute that is being tested
            curr_att_val = r[i][c]
            true, false = fork(r, c, curr_att_val)
            ig = gain(true, false, curr_gini)

            if ig > max_ig:
                max_ig = ig
                max_att = c
                max_att_val = r[i][c]
            i += 1

    return max_ig, max_att, max_att_val
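
The helpers gini_index, gain and get_column are not shown in the question. Purely for context, here is a minimal sketch of what they might look like, assuming the class label sits in column 3 (which is what the c == 3 skip suggests); these names and details are assumptions, not the asker's actual code.

# Hypothetical helpers assumed by split(); not the asker's original code
from collections import Counter

LABEL_COL = 3  # assumed label column, inferred from the c == 3 skip


def get_column(rows, c):
    # Collect the values of column c across all rows
    return [row[c] for row in rows]


def gini_index(rows):
    # Gini impurity of the label distribution in rows
    counts = Counter(row[LABEL_COL] for row in rows)
    total = len(rows)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def gain(true_rows, false_rows, current_gini):
    # Information gain of a split, weighting each branch by its size
    total = len(true_rows) + len(false_rows)
    if not true_rows or not false_rows:
        return 0.0
    p = len(true_rows) / total
    return current_gini - p * gini_index(true_rows) - (1 - p) * gini_index(false_rows)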

Comparing and splitting the data into true/false branches

# Used to compare and test if the current row is greater than or equal to the test value
# in order to split up the data

def compare(r, test_c, test_val):
    if r[test_c].isdigit():
        return r[test_c] == test_val

    elif float(r[test_c]) >= float(test_val):
        return True

    else:
        return False


# Splits the data into two lists for the true/false results of the compare test

def fork(r, c, test_val):
    true = []
    false = []

    for row in r:

        if compare(row, c, test_val):
            true.append(row)
        else:
            false.append(row)

    return true, false
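
To make the behaviour of compare and fork concrete, here is a small hand-made example; the rows and column layout (abv, ibu, srm, style) are invented for illustration and are not the asker's dataset.

# Hypothetical rows: [abv, ibu, srm, style], label in column 3
rows = [
    ["5.0", "40", "10", "IPA"],
    ["4.2", "15", "4", "Lager"],
    ["7.5", "60", "12", "IPA"],
]

# Split on column 0 (abv) with test value "5.0". "5.0".isdigit() is False,
# so compare falls through to the float branch and keeps rows with abv >= 5.0.
true_rows, false_rows = fork(rows, 0, "5.0")
print(len(true_rows), len(false_rows))  # prints: 2 1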

Building the tree recursively

def rec_tree(r):
    ig, att, curr_att_val = split(r)

    # No information gain left, so this becomes a leaf
    if ig == 0:
        return Leaf(r)

    # Split the rows on the best attribute/value and recurse on each branch
    true_rows, false_rows = fork(r, att, curr_att_val)

    true_branch = rec_tree(true_rows)
    false_branch = rec_tree(false_rows)

    return Node(att, curr_att_val, true_branch, false_branch)
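
The Leaf and Node classes and the prediction step are not shown in the question either. Below is a minimal sketch of how the finished tree could be used, under the assumption that Leaf stores the rows that reached it and Node stores the attribute index, test value and two branches; again, these details are assumptions.

# Assumed class shapes and a classify() walker; not part of the asker's code
from collections import Counter


class Leaf:
    def __init__(self, rows):
        # Predict the majority label of the rows that reached this leaf (label assumed in column 3)
        self.prediction = Counter(row[3] for row in rows).most_common(1)[0][0]


class Node:
    def __init__(self, att, att_val, true_branch, false_branch):
        self.att = att
        self.att_val = att_val
        self.true_branch = true_branch
        self.false_branch = false_branch


def classify(row, node):
    # Walk the tree until a Leaf is reached and return its prediction
    if isinstance(node, Leaf):
        return node.prediction
    if compare(row, node.att, node.att_val):
        return classify(row, node.true_branch)
    return classify(row, node.false_branch)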

Comments:

Answer 1:

My working solution was to change the split function as follows. Honestly, I could not see what was wrong with the original, but it may be obvious to someone else. (The key difference appears to be that i is reset to 0 inside the loop over attributes, so the inner while loop actually runs for every column; in the original, i was only initialised once, so after the first attribute the while condition was already false.) The working function is below:

def split(r):
    max_ig = 0
    max_att = 0
    max_att_val = 0

    # Calculates gini for the rows provided
    curr_gini = gini_index(r)
    no_att = len(r[0])

    # Goes through the different attributes
    for c in range(no_att):

        # Skip the label column (beer style)
        if c == 3:
            continue
        column_vals = get_column(r, c)

        i = 0
        while i < len(column_vals):
            # Value we want to check
            att_val = r[i][c]

            # Use the attribute value to fork the data into true and false streams
            true, false = fork(r, c, att_val)

            # Calculate the information gain
            ig = gain(true, false, curr_gini)

            # If this gain is the highest found then mark this as the best choice
            if ig > max_ig:
                max_ig = ig
                max_att = c
                max_att_val = r[i][c]
            i += 1

    return max_ig, max_att, max_att_val
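
A quick way to check that the fix takes effect is to build a tree on a few rows and print which attribute index each internal node tests; with the corrected split, the nodes should no longer all test the same column. The data below is made up for illustration and reuses the Leaf/Node sketch from above.

# Made-up smoke test, not part of the original answer
rows = [
    ["5.0", "40", "10", "IPA"],
    ["4.2", "15", "4", "Lager"],
    ["7.5", "60", "12", "IPA"],
    ["4.8", "20", "5", "Lager"],
]

tree = rec_tree(rows)


def print_splits(node, depth=0):
    # Print the attribute index and test value used at each internal node
    if isinstance(node, Leaf):
        print("  " * depth + "Leaf")
        return
    print("  " * depth + "split on attribute %d (test value %s)" % (node.att, node.att_val))
    print_splits(node.true_branch, depth + 1)
    print_splits(node.false_branch, depth + 1)


print_splits(tree)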

Comments:
