Machine Learning: Decision Trees (Part 2), Algorithm Implementation
Posted by Lyndon_zheng
Decision tree
Machine Learning (5): Decision Trees (Part 1), Principles covered how decision trees are generated and pruned, and walked through the algorithm flows of CART, ID3, and C4.5. CART, which supports both regression and classification and is based on Gini impurity, is not implemented here. This post implements ID3 and C4.5, which are based on information entropy; since no pruning is involved here, the two produce exactly the same final result. Let's first look at the overall ID3 algorithm framework (C4.5 is essentially the same, differing only in how the feature is selected):
- Algorithm 4.1 ID3(D)
- Input: an attribute-valued dataset D
- Output: a decision tree
- if D is "pure" OR the attribute set is empty then
- return class
- end if
- for all attributes a ∈ D do
- compute the information gain and select the best feature
- end for
- a_best = best attribute (feature)
- Tree = create a decision node that tests a_best in the root
- D_v = induced sub-datasets of D based on a_best
- for all D_v do
- Tree_v = ID3(D_v)
- attach Tree_v to the corresponding branch of Tree
- end for
- return Tree
Algorithm implementation
(1) Create the training dataset:
Read the data from a .txt file, strip whitespace, split each line, and finally return the dataset together with the attribute (feature) names.
from math import log    # needed by calcEntropy below
import operator         # needed for sorting class counts in createTreeID3/createTreeC4

# process training data set
# input: path to the data file
# output: data_set, attribute
def proData(path):
    with open(path) as fileset:    # loading data file
        dataset = [data.strip().split('\t') for data in fileset.readlines()]
    attribute = dataset[0]         # the first row holds the attribute (feature) names
    del(dataset[0])
    return dataset, attribute
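For orientation, here is a minimal sketch of the file layout that proData expects: a header row of attribute names followed by one tab-separated sample per line. The values below are made up for illustration (the attribute names are taken from the output in section (8); the name of the last column is an assumption):
# Hypothetical lenses.txt excerpt (tab-separated, header row first):
#
#   age      prescriptor   astigmatic   tearRate   class
#   young    myope         no           reduced    no lenses
#   young    myope         no           normal     soft
#
# dataset, attribute = proData(path)
# attribute  -> ['age', 'prescriptor', 'astigmatic', 'tearRate', 'class']
# dataset[0] -> ['young', 'myope', 'no', 'reduced', 'no lenses']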
(2) Compute the information entropy:
First count the total number of training samples, then count the number of samples of each class label, turn the counts into probabilities, and finally compute the entropy.
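For reference, the quantity computed by the function below is the Shannon entropy of the class labels: $H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$, where $p_k$ is the fraction of the samples in D that belong to class k.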
# calculate the information entropy
# input: dataset
# output: entropy
def calcEntropy(dataset):
    numEntries = len(dataset)
    attributeCounts = {}                        # class label -> count
    for item in dataset:
        currentAttribute = item[-1]             # the class label is the last column
        if currentAttribute not in attributeCounts.keys():
            attributeCounts[currentAttribute] = 0
        attributeCounts[currentAttribute] += 1
    entropy = 0.0
    for key in attributeCounts:
        prob = float(attributeCounts[key])/numEntries
        entropy -= prob * log(prob, 2)
    return entropy
(3) Split the sub-dataset:
After the best splitting feature has been selected, build the new sub-samples according to that feature, group them by its values, and remove the chosen attribute (feature) from each sub-sample.
# split data based on different values of an attribute
# input: dataset, index of the attribute (axis), attribute value
# output: sub-dataset whose samples take the given value, with that column removed
def splitData(dataset, axis, value):
    splitdata = []
    for feature in dataset:
        if feature[axis] == value:
            tempFeaVec = feature[:axis]            # copy everything before the chosen column
            tempFeaVec.extend(feature[axis+1:])    # and everything after it
            splitdata.append(tempFeaVec)
    return splitdata
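A quick illustration of what splitData returns (toy values, not the real dataset):
# Toy example of splitData (made-up values):
# dataset = [['young', 'myope', 'no lenses'],
#            ['pre',   'myope', 'soft']]
# splitData(dataset, 0, 'young') -> [['myope', 'no lenses']]
# Only rows whose column 0 equals 'young' are kept, and column 0 itself is removed.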
(4) Select the best feature:
In ID3 the best feature is selected by information gain; in C4.5 it is selected by the information gain ratio, as formalized below.
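For reference, the two criteria are (standard definitions, matching the code below):
- Information gain (ID3): $g(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)$, where $D_v$ is the subset of D in which attribute A takes the value v.
- Information gain ratio (C4.5): $g_R(D, A) = g(D, A) / H_A(D)$, where the split entropy is $H_A(D) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$.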
ID3: information gain
# calculate the information gain of each feature and select the best one (ID3)
# input: dataset
# output: index of the best feature
def selectBestFeature(dataset):
    numFeatures = len(dataset[0]) - 1          # the last column is the class label
    baseEntropy = calcEntropy(dataset)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset]   # all values of the i-th attribute
        uniqueVals = set(featList)                          # distinct values of that attribute
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
C4.5: information gain ratio
# calculate the information gain ratio of each feature and select the best one (C4.5)
# input: dataset
# output: index of the best feature
def selectBestFeature_C4(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcEntropy(dataset)
    bestInfoGainRatio = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset]   # all values of the i-th attribute
        uniqueVals = set(featList)                          # distinct values of that attribute
        newEntropy = 0.0; Splitentropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet)
            Splitentropy -= prob * log(prob, 2)
        if Splitentropy == 0.0:                             # single-valued attribute: skip to avoid division by zero
            continue
        infoGainRatio = (baseEntropy - newEntropy)/Splitentropy
        if infoGainRatio > bestInfoGainRatio:
            bestInfoGainRatio = infoGainRatio
            bestFeature = i
    return bestFeature
(5) Build the decision tree:
First compute the information gain (or gain ratio) of every attribute (feature) with respect to the empirical entropy and select the best attribute accordingly; then split the original dataset into sub-datasets by the values of that attribute and recursively build the sub-trees, until a sub-dataset can no longer be split or the attribute set is empty.
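Both functions below store the tree as a nested dict: each internal node has the form {attribute: {value: subtree or class label}}. A truncated sketch of the shape, taken from the final output in section (8):
# Shape of the returned tree (fragment of the final result in section (8)):
# {'tearRate': {'reduced': 'no lenses',
#               'normal': {'astigmatic': {...}}}}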
ID3 decision tree construction
# train decision tree ID3
# input: dataset, attributes
# output: decision tree (nested dict)
def createTreeID3(dataset, attributes):
    classList = [example[-1] for example in dataset]
    classCount = {}
    if classList.count(classList[0]) == len(classList):
        return classList[0]        # stop splitting when all data belong to the same label
    if len(dataset[0]) == 1:       # stop splitting when the attribute set is empty, return the majority class
        for value in classList:
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    bestFeature = selectBestFeature(dataset)
    bestAttribute = attributes[bestFeature]
    myTree = {bestAttribute: {}}
    del(attributes[bestFeature])
    featureValues = [example[bestFeature] for example in dataset]   # values that define the child nodes
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]                               # copy so sibling branches see the same attribute list
        myTree[bestAttribute][value] = createTreeID3(splitData(dataset, bestFeature, value), subattributes)
    return myTree
C4.5 decision tree construction
# train decision tree C4.5
# input: dataset, attributes
# output: decision tree (nested dict)
def createTreeC4(dataset, attributes):
    classList = [example[-1] for example in dataset]
    classCount = {}
    if classList.count(classList[0]) == len(classList):
        return classList[0]        # stop splitting when all data belong to the same label
    if len(dataset[0]) == 1:       # stop splitting when the attribute set is empty, return the majority class
        for value in classList:
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    bestFeature = selectBestFeature_C4(dataset)
    bestAttribute = attributes[bestFeature]
    myTree = {bestAttribute: {}}
    del(attributes[bestFeature])
    featureValues = [example[bestFeature] for example in dataset]   # values that define the child nodes
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]                               # copy so sibling branches see the same attribute list
        myTree[bestAttribute][value] = createTreeC4(splitData(dataset, bestFeature, value), subattributes)
    return myTree
(6) Main function:
Specify the location of the data file and print/plot the final result.
# main function
if __name__ == "__main__":
    # data_set processing
    dataset = []
    attributes = []
    path = 'F:\\Program\\Python\\Machine_Learning\\Decision_tree\\lenses.txt'
    dataset, attributes = proData(path)
    myTreeID3 = createTreeID3(dataset, attributes)
    dataset, attributes = proData(path)      # reload, because createTreeID3 modifies attributes in place
    myTreeC4 = createTreeC4(dataset, attributes)
    print(myTreeID3)
    createPlot(myTreeID3)                    # createPlot is defined in the plotting code of section (7)
    print(myTreeC4)
    createPlot(myTreeC4)
(7) Plotting function:
The generated decision tree is not very intuitive to read as text. With a small plotting helper, the resulting tree can be drawn directly, which gives a much clearer picture.
# Project: Machine learning-decision tree
# Author: Lyndon
# date: 2015/10/27
from matplotlib import pyplot as plt

# define the format of the text boxes and arrows
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrowArgs = dict(arrowstyle="<-")
# calculate the number of leaves and the depth of the tree
# input: decision tree
# output: number of leaves, depth of the tree
def calNumLeaves(tree):
    numLeaves = 0
    maxDepth = 0
    firstNode = list(tree.keys())[0]                    # attribute tested at this node
    secondDict = tree[firstNode]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':    # check if the child is an internal node
            subnumLeaves, submaxDepth = calNumLeaves(secondDict[key])
            numLeaves += subnumLeaves
            thisDepth = 1 + submaxDepth
        else:                                           # the child is a leaf
            numLeaves += 1
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return numLeaves, maxDepth
# plot a node (decision node or leaf) and the text on the edge to its parent
# input: node text, edge text, center position, parent position, node type
# output: null
def plotsubtree(node, text, center, parent, nodeType):
    createPlot.ax1.annotate(node, xy=parent, xycoords='axes fraction',
                            xytext=center, textcoords='axes fraction',
                            va='center', ha='center', bbox=nodeType, arrowprops=arrowArgs)
    xMid = (parent[0] - center[0])/2.0 + center[0]
    yMid = (parent[1] - center[1])/2.0 + center[1]
    createPlot.ax1.text(xMid, yMid, text, va='center', ha='center', rotation=30)
# plot the tree recursively
# input: tree, parent position, edge text
# output: null
def plotTree(tree, parent, nodetxt):
    numLeaves, depth = calNumLeaves(tree)
    firstNode = list(tree.keys())[0]
    center = (plotTree.xOff + (1 + float(numLeaves))/2.0/plotTree.num, plotTree.yOff)
    plotsubtree(firstNode, nodetxt, center, parent, decisionNode)
    secondDict = tree[firstNode]
    plotTree.yOff -= 1.0/plotTree.depth                 # move one level down
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], center, str(key))
        else:
            plotTree.xOff += 1.0/plotTree.num
            plotsubtree(secondDict[key], str(key), (plotTree.xOff, plotTree.yOff), center, leafNode)
    plotTree.yOff += 1.0/plotTree.depth                 # move back up one level
# plot the whole tree
# input: tree
# output: null
def createPlot(tree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.num, plotTree.depth = calNumLeaves(tree)
    plotTree.xOff = -0.5/plotTree.num; plotTree.yOff = 1.0
    plotTree(tree, (0.5, 1.0), '')
    plt.show()
(8) Classification results:
Text output of the decision tree:
{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescriptor': {'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'pre': 'soft', 'presbyopic': {'prescriptor': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
Decision tree plot (as drawn by createPlot):
In this example there is no pruning step, so the ID3 and C4.5 implementations produce the same final tree.
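The post stops at building and plotting the tree. To classify a new sample with the generated nested-dict tree one can simply walk it; below is a minimal sketch that is not part of the original code (the classify helper and its usage are assumptions; attributes is the full header list as freshly returned by proData):
# Hypothetical helper (not in the original post): walk the nested-dict tree for one sample
def classify(tree, attributes, sample):
    firstNode = list(tree.keys())[0]              # attribute tested at this node
    featureIndex = attributes.index(firstNode)    # position of that attribute in the sample
    branch = tree[firstNode][sample[featureIndex]]
    if isinstance(branch, dict):                  # internal node: keep descending
        return classify(branch, attributes, sample)
    return branch                                 # leaf: class label

# Example (hypothetical): classify(myTreeID3, attributes, ['young', 'myope', 'no', 'reduced'])
# returns 'no lenses' for the tree shown above.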
PS:
This post implements the ID3 and C4.5 decision tree algorithms in Python, simply applying information gain and the information gain ratio to perform classification. The code is adapted from Machine Learning in Action (complete code and data).