机器学习——决策树(下)算法实现

Posted Lyndon_zheng

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了机器学习——决策树(下)算法实现相关的知识,希望对你有一定的参考价值。

Decision tree

机器学习(5)——决策树(上)原理中介绍了决策树的生成和剪枝原理。介绍了CART,ID3,C4.5等算法的算法流程,其中CART算法可以实现回归和分类,是基于基尼不纯度实现的,这里并未实现。这里主要实现了ID3和C4.5算法,是基于信息熵的,在本处因为没有涉及剪枝,他们最终得到的结果都是一样的。我们先来看ID3的整个算法框架(C4.5也基本类似,不同之处是特征选取的区别):

  • Algotithm 4.1 ID3(D)
  • Input: an attribute-valued dataset D
  • Output: a decision tree
    1. if D is “pure” OR Attribute is null then
    2.   return class
    3. end if
    4. for all attribute aD do
    5.   computer the imformation gain and select best feature
    6. end for
    7. abest= Best attribute feature
    8. Tree= Create a decision node that feature abest in root
    9. Dv= Induced sub-dataset for feature abest
    10. for all Dv do
    11.    Treev=ID3(Dv)
    12. end for
    13. return Tree

算法实现

(1)创建训练数据集:
从.txt文件中读取数据,并去掉空格,分割数据,最终返回dataset数据集合attribute特征类别。

# process training data set
# input: directory
# output: data_set, attribute

def proData(path):
    fileset = open(path)   #loading data file
    dataset = [data.strip().split('\\t') for data in fileset.readlines()]
    attribute = dataset [0]
    del(dataset[0])
    return dataset,attribute

(2)计算信息熵:
先统计训练数据的总量,然后统计每个标签类别的数目,得到其概率,最后计算信息熵

H(X)=i=1npilogpi

# calculate the information entropy
# input: dataset
# output: entropy

def calcEntropy(dataset):
    numEntries = len (dataset)
    attributeCounts = 
    for item in dataset:
        currentAttribute = item[-1]
        if currentAttribute not in attributeCounts.keys():
            attributeCounts[currentAttribute]=0
        attributeCounts[currentAttribute]+=1
    entropy = 0.0
    for key in attributeCounts:
        prob = float (attributeCounts[key])/numEntries
        entropy -= prob *log(prob,2)
    return entropy

(3)划分子数据集:
选取最好的分类特征之后,依据该特征得到新的子训练样本,将子样本进行归类,并去掉本次已选的属性(特征)。

# split data based on different values of attribute
# input: dataset
# output: split data 
def splitData(dataset,axis,value):
    splitdata = [] 
    for feature in dataset:
        if feature[axis] == value:
            #del(feature[axis])
            tempFeaVec = feature[:axis]
            tempFeaVec.extend(feature[axis+1:])
            splitdata.append(tempFeaVec)
    return splitdata

(4)选取最好特征:
在ID3算法中,依据信息增益选取最好的特征,在C4.5中依据信息增益比选取最好特征。
ID3:信息增益

# calculate the entropy of different features
# input: dataset
# output: best feature
def selectBestFeature(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcEntropy(dataset)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset] # Select attribute types
        uniqueVals = set(featList)                       # Set different values of same attribute
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet) 
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

C4.5:信息增益率

# calculate the information gain ratio for different features
# input: dataset
# output: best feature
def selectBestFeature_C4(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcEntropy(dataset)
    bestInfoGainRatio = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset] # Select attribute types
        uniqueVals = set(featList)                       # Set different values of same attribute
        newEntropy = 0.0;Splitentropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet) 
            Splitentropy -= prob *log(prob,2)
        infoGainRatio = (baseEntropy - newEntropy)/Splitentropy
        if (infoGainRatio > bestInfoGainRatio):
            bestInfoGainRatio = infoGainRatio
            bestFeature = i
    return bestFeature

(5)创建决策树:
首先计算所有属性(特征)对于原经验熵的信息增益(率),据此选取出最好的属性(特征),然后根据所选的最好属性(特征)将原数据集分成不同的子数据集,并迭代计算子数据集的树,直到子数据集不可分或属性集合为空为止。
ID3决策树生成

# train decision tree ID3
# input: dataset, attribute
# output: decision tree
def createTreeID3(dataset,attributes):
    classList = [example[-1] for example in dataset]
    classCount = 
    if classList.count(classList[0]) ==len(classList):
        return classList[0]                             # stop splitting when all data belong to same labels
    if len(dataset[0]) == 1:                            # stop splitting when attribute = NULL, return the max class
        for value in classList:
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value]+=1
        sortedClassCount = sorted(classCount.iteritems(), key = operator.itemgetter(1), reverse = True)
        return sortedClassCount[0][0] 
    bestFeature = selectBestFeature(dataset)
    bestAttribute = attributes [bestFeature]
    myTree = bestAttribute:
    del(attributes [bestFeature])
    featureValues = [example[bestFeature] for example in dataset] # select the training data of the child node
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]
        myTree[bestAttribute][value] = createTreeID3(splitData(dataset, bestFeature, value), subattributes)
    return myTree

C4.5决策树生成

# train decision tree C4.5
# input: dataset, attribute
# output: decision tree
def createTreeC4(dataset,attributes):
    classList = [example[-1] for example in dataset]
    classCount = 
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataset[0]) == 1:
        for value in classList: 
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value] +=1
        sortedClassCount = sorted(classCount.iteritems(), key = operator.itemgetter(1), reverse = True)
        return sortedClassCount[0][0] 
    bestFeature = selectBestFeature_C4(dataset)
    bestAttribute = attributes [bestFeature]
    myTree = bestAttribute:
    del(attributes [bestFeature])
    featureValues = [example[bestFeature] for example in dataset] # select the training data of the child node
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]
        myTree[bestAttribute][value] = createTreeC4(splitData(dataset, bestFeature, value), subattributes)
    return myTree

(6)主函数:
给定数据所在位置,并输出最终的效果。

# main function
if __name__=="__main__":
    # data_set processing
    dataset = []
    attributes = []
    path='F:\\Program\\Python\\Machine_Learning\\Decision_tree\\lenses.txt'
    dataset,attributes = proData(path)
    myTreeID3 = createTreeID3(dataset,attributes)
    dataset,attributes = proData(path)
    myTreeC4 = createTreeC4(dataset, attributes)
    print str(myTreeID3)
    createPlot(myTreeID3)
    print str(myTreeC4)
    createPlot(myTreeC4)

(7)画图函数:
生成的决策树通过文本形式观看不是很直观,设计一个画图子函数之后,可以很直观的将生成的决策树打印出来,看到不错的效果。

# Project: Machine learning-decision tree
# Author: Lyndon
# date: 2015/10/27

from matplotlib import pyplot as plt

# define the format of text and arrow
decisionNode = dict(boxstyle ="sawtooth",fc = "0.8")
leafNode = dict(boxstyle = "round4", fc = "0.8")
arrowRrgs = dict (arrowstyle = "<-")

# calculate the number of tree leaves and the depth of tree 
# input: decision tree
# output: numbers of node, depth of the tree
def calNumLeaves(tree):
    numLeaves = 0
    maxDepth = 0
    firstNode = tree.keys()[0]
    secondDict = tree[firstNode]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':        #check if the node is leaf
            subnumLeaves,submaxDepth = calNumLeaves(secondDict[key])
            numLeaves += subnumLeaves
            thisDepth = 1 +submaxDepth
        else: 
            numLeaves +=1
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return numLeaves,maxDepth

# plot the node and leaf
# input: node,leaf, center, parent,   
# output: null
def plotsubtree(node,text,center,parent,nodeType):
    createPlot.ax1.annotate(node,xy=parent,xycoords='axes fraction',
                            xytext=center,textcoords='axes fraction',
                            va='center',ha='center',bbox=nodeType,arrowprops=arrowRrgs)
    xMid = (parent[0]-center[0])/2.0+center[0]
    yMid = (parent[1]-center[1])/2.0+center[1]
    createPlot.ax1.text(xMid,yMid,text,va='center',ha='center',rotation=30)

# plot the tree
# input: tree
# output: null
def plotTree(tree,parent,nodetxt):
    numLeaves, depth = calNumLeaves(tree)
    firstNode = tree.keys()[0]
    center = (plotTree.xOff+(1+float(numLeaves))/2.0/plotTree.num,plotTree.yOff )
    plotsubtree(firstNode, nodetxt, center, parent, decisionNode)
    secondDict = tree[firstNode]
    plotTree.yOff -=1.0/plotTree.depth 
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict': 
            plotTree(secondDict[key], center, str(key))
        else:
            plotTree.xOff += 1.0/plotTree.num
            plotsubtree(secondDict[key], str(key), (plotTree.xOff,plotTree.yOff), center, leafNode)
    plotTree.yOff += 1.0/plotTree.depth

# plot the Tree
# input: Tree
# output: Null
def createPlot(tree):
    fig = plt.figure(1,facecolor='white')
    fig.clf()
    axprops = dict(xticks=[],yticks=[])
    createPlot.ax1 = plt.subplot(111,frameon=False,**axprops) 
    plotTree.num, plotTree.depth = calNumLeaves(tree)
    plotTree.xOff = -0.5/plotTree.num; plotTree.yOff = 1.0
    plotTree(tree,(0.5,1.0),'')
    plt.show()

(8)分类结果:
决策树文本输出:
‘tearRate’‘reduced’: ‘no lenses’, ‘normal’: ‘astigmatic’: ‘yes’: ‘prescriptor’: ‘hyper’: ‘age’: ‘pre’: ‘no lenses’, ‘presbyopic’: ‘no lenses’, ‘young’: ‘hard’, ‘myope’: ‘hard’, ‘no’: ‘age’: ‘pre’: ‘soft’, ‘presbyopic’: ‘prescriptor’: ‘hyper’: ‘soft’, ‘myope’: ‘no lenses’, ‘young’: ‘soft’
决策树图示:

在本例中,没有剪枝过程,ID3和C4.5算法实现的最终结果一样。
PS:
本文主要通过Python实现了决策树中的ID3和C4.5算法,只是简单的应用了信息增益和信息增益率来实现分类,代码参考了《机器学习实战》,完整代码及数据

以上是关于机器学习——决策树(下)算法实现的主要内容,如果未能解决你的问题,请参考以下文章

机器学习--决策树

机器学习决策树(划分选择算法流程剪枝处理,连续值与缺失值处理)

《机器学习》(周志华)第4章 决策树 笔记 理论及实现——“西瓜树”

统计学习笔记之决策树

机器学习之决策树

机器学习算法实践:决策树 (Decision Tree)(转载)