Decision Tree
Posted by geekdanny
1. Basics
Basic structure of a tree: nodes (internal nodes and leaf nodes) + directed edges.
A decision tree is also called a judgment tree; its structure encodes a set of if-then rules.
Characteristics: highly readable and fast at classification.
2. Core Idea
A decision tree = a set of classification rules induced from the training data (the model) + minimization of a loss function as the objective (the strategy) + recursive selection of the optimal feature (the algorithm).
3. Algorithm Derivation
Generating the Decision Tree
Feature selection (recursively choosing the optimal feature: the algorithm part)
Information entropy: $$Ent(D)=-\sum_{k=1}^{\left| y \right|} p_{k} \log_{2} p_{k}$$
Information entropy measures the purity of a sample set; here $p_{k}$ is the proportion of samples belonging to the $k$-th class.
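As a quick worked example, take the class proportions $8/17$ and $9/17$ of the watermelon dataset used in 西瓜书: $$Ent(D)=-\frac{8}{17}\log_{2}\frac{8}{17}-\frac{9}{17}\log_{2}\frac{9}{17}\approx 0.998$$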
Information gain: $$Gain(D,a)=Ent(D)-\sum_{v=1}^{V} \frac{\left| D^{v} \right|}{\left| D \right|} Ent(D^{v})$$
Here attribute $a$ has $V$ possible values, which partition the samples into subsets $D^{1},\dots,D^{V}$. The well-known ID3 algorithm uses information gain as its selection criterion.
Information gain has a drawback: it is biased toward attributes with many possible values.
The remedy is to introduce the
gain ratio: $$Gain\_ratio(D,a)=\frac{Gain(D,a)}{IV(a)}$$, where $IV(a)=-\sum_{v=1}^{V} \frac{\left| D^{v} \right|}{\left| D \right|} \log_{2} \frac{\left| D^{v} \right|}{\left| D \right|}$
Note, however, that the gain-ratio criterion is in turn biased toward attributes with fewer possible values.
Is there a selection criterion with no bias toward the number of values at all?
There is!
CART (classification and regression tree) uses the Gini index to choose the splitting attribute.
Gini value: $$Gini(D)=\sum_{k=1}^{\left| y \right|} \sum_{k^{\prime} \neq k} p_{k}p_{k^{\prime}}=1-\sum_{k=1}^{\left| y \right|}p_{k}^{2}$$
Gini index: $$Gini\_index(D,a)=\sum_{v=1}^{V} \frac{\left| D^{v} \right|}{\left| D \right|}Gini(D^{v})$$
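As a minimal sketch of these purity measures in plain Python (the function names entropy and gini are mine, not part of the implementation later in this post):

from collections import Counter
from math import log2

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = ['good', 'good', 'bad', 'bad', 'bad']
print(entropy(labels))  # ~0.971
print(gini(labels))     # 0.48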
Optimizing the Decision Tree
Many practical issues come up while the tree is being generated; the main techniques for improving it are
pruning, and the handling of continuous and missing values.
Pruning
Pruning is the decision-tree learner's main weapon against overfitting.
Pre-pruning: before splitting a node on the chosen optimal feature, estimate the effect of the split, and perform it only if it improves accuracy on a validation set.
Post-pruning: first train the complete tree, then examine the non-leaf nodes bottom-up; replace a node's subtree with a leaf whenever that improves the tree's generalization on the validation set (a sketch of this decision follows).
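A rough sketch of the post-pruning decision at a single node; Accuracy and MajorityLeaf are hypothetical helpers, not defined anywhere in this post:

def PruneIfBetter(node, df_valid):
    # post-pruning at one non-leaf node: keep the subtree only if it beats
    # a majority-vote leaf on the validation set
    acc_subtree = Accuracy(node, df_valid)  # hypothetical: validation accuracy of the subtree
    leaf = MajorityLeaf(node)               # hypothetical: a leaf carrying the node's majority label
    acc_leaf = Accuracy(leaf, df_valid)
    return leaf if acc_leaf >= acc_subtree else node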
Handling Continuous and Missing Values
Continuous values: discretize the continuous attribute, using bisection (see the formula below).
Missing values: re-weight the affected samples and redistribute them into the sub-nodes.
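For bisection, if the continuous attribute $a$ takes sorted values $a^{1}\le a^{2}\le \dots \le a^{n}$ on $D$, the candidate split points are the midpoints of adjacent values (西瓜书 section 4.4): $$T_{a}=\left\{ \frac{a^{i}+a^{i+1}}{2} \mid 1\le i\le n-1 \right\}$$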
4. Implementation
An implementation of ID3 follows.
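The code below relies on a Node class that the original post never shows; judging from how it is used (fields attr, label, and attrDown), a minimal sketch would be:

class Node(object):
    def __init__(self, attr=None, label=None, attrDown=None):
        self.attr = attr        # attribute this node splits on (None for a leaf)
        self.label = label      # majority class label of the node's data
        self.attrDown = attrDown if attrDown is not None else {}  # attribute value -> child Node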
# generate the decision tree recursively
def TreeGenerate(df):
    """
    @param df: the pandas dataframe of the dataset
    @return root: the root node of the decision tree
    """
    newNode = Node(None, None, {})
    labelArr = df[df.columns[-1]]
    labelCount = NodeLabel(labelArr)
    if labelCount:  # proceed only if the label set is non-empty
        newNode.label = max(labelCount, key=labelCount.get)  # majority class of this node
        if len(labelCount) == 1 or len(labelArr) == 0:  # stop if the node is pure or empty
            return newNode
        # get the optimal attribute for a new branching
        newNode.attr, divValue = OptArr(df)
        if newNode.attr is None:  # no attribute improves the gain: keep this node as a leaf
            return newNode
        # recursion
        if divValue == 0:  # categorical variable
            valueCount = ValueCount(df[newNode.attr])
            for value in valueCount:
                dfV = df[df[newNode.attr].isin([value])]  # get the subset
                # delete the current attribute
                dfV = dfV.drop(newNode.attr, axis=1)
                newNode.attrDown[value] = TreeGenerate(dfV)
        else:  # continuous variable
            # left and right child
            valueL = "<=%.3f" % divValue
            valueR = ">%.3f" % divValue
            dfVL = df[df[newNode.attr] <= divValue]
            dfVR = df[df[newNode.attr] > divValue]
            newNode.attrDown[valueL] = TreeGenerate(dfVL)
            newNode.attrDown[valueR] = TreeGenerate(dfVR)
    return newNode
# NodeLabel()
# count how often each class label appears
def NodeLabel(labelArr):
    """
    @param labelArr: data array of class labels
    @return labelCount: dict of the appeared labels and their counts
    """
    labelCount = {}  # store the count of each label
    for label in labelArr:
        if label in labelCount:
            labelCount[label] += 1
        else:
            labelCount[label] = 1
    return labelCount
# OptArr()
# find the attribute of the current dataset with the largest information gain,
# to be used for the next branching
def OptArr(df):
    """
    @param df: pandas dataframe of the dataset
    @return optArr: the optimal attribute for branching (None if no attribute helps)
    @return divValue: 0 for a discrete variable;
                      the bisection split point t for a continuous variable
    """
    infoGain = 0
    optArr = None   # stays None if no attribute improves the gain
    divValue = 0
    for attrId in df.columns[1:-1]:  # skip the ID column and the label column
        infoGainTmp, divValueTmp = InfoGain(df, attrId)
        if infoGainTmp > infoGain:
            infoGain = infoGainTmp
            optArr = attrId
            divValue = divValueTmp
    return optArr, divValue
# InfoGain()
# compute the information gain of one attribute
def InfoGain(df, index):
    """
    @param df: the pandas dataframe of the dataset
    @param index: the attribute ID (column name)
    @return: infoGain and divValue
    """
    infoGain = InfoEnt(df.values[:, -1])  # entropy of the class labels over the whole set
    divValue = 0  # stays 0 for a discrete attribute
    n = len(df[index])
    # for a continuous variable, use the bisection method
    if df[index].dtype.kind in 'fi':  # float or int dtype
        subInfoEnt = {}
        # sort by this attribute and rebuild the positional index
        df = df.sort_values([index], ascending=True)
        df = df.reset_index(drop=True)
        dataArr = df[index]
        labelArr = df[df.columns[-1]]
        # continuous values: discretize by bisection (西瓜书 section 4.4)
        for i in range(n - 1):
            div = (dataArr[i] + dataArr[i + 1]) / 2
            subInfoEnt[div] = ((i + 1) * InfoEnt(labelArr[0:i + 1]) / n
                               + (n - i - 1) * InfoEnt(labelArr[i + 1:]) / n)
        # key=lambda x: x[1] compares the dict items by their entropy value,
        # so min() returns the split point with the lowest weighted entropy
        divValue, subInfoEntMin = min(subInfoEnt.items(), key=lambda x: x[1])
        infoGain -= subInfoEntMin
    # discrete variable
    else:
        dataArr = df[index]
        labelArr = df[df.columns[-1]]
        valueCount = ValueCount(dataArr)
        for key in valueCount:
            keyLabelArr = labelArr[dataArr == key]
            infoGain -= valueCount[key] * InfoEnt(keyLabelArr) / n
    return infoGain, divValue
# ValueCount()
# count how often each value of an attribute appears
def ValueCount(dataArr):
    """
    @param dataArr: the data array of one attribute
    @return valueCount: dict of the appeared values and their counts
    """
    valueCount = {}
    for value in dataArr:
        if value in valueCount:
            valueCount[value] += 1
        else:
            valueCount[value] = 1
    return valueCount
# InfoEnt()
# compute the information entropy of a label array
def InfoEnt(labelArr):
    """
    @param labelArr: data array of class labels
    @return ent: the information entropy over the classes
    """
    from math import log2
    ent = 0
    n = len(labelArr)
    labelCount = NodeLabel(labelArr)
    for key in labelCount:
        ent -= (labelCount[key] / n) * log2(labelCount[key] / n)
    return ent
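A quick sanity check (two of three samples in one class gives entropy of about 0.918):

print(InfoEnt(['yes', 'yes', 'no']))  # ~0.918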
# Predict()
# make a prediction for one sample, starting from root
def Predict(root, df_sample):
    """
    @param root: the root node of a trained tree
    @param df_sample: a one-row dataframe holding the sample
    @return: the predicted class label
    """
    import re  # use a regular expression to get the number in the branch key
    while root.attr is not None:
        if df_sample[root.attr].dtype.kind in 'fi':  # continuous variable
            # recover divValue from the first key in root.attrDown, e.g. "<=0.381"
            for key in list(root.attrDown):
                num = re.findall(r"\d+\.?\d*", key)
                divValue = float(num[0])
                break
            if df_sample[root.attr].values[0] <= divValue:
                key = "<=%.3f" % divValue
                root = root.attrDown[key]
            else:
                key = ">%.3f" % divValue
                root = root.attrDown[key]
        # categorical variable
        else:
            key = df_sample[root.attr].values[0]
            # check whether the attribute value has a child branch
            if key in root.attrDown:
                root = root.attrDown[key]
            else:
                break
    return root.label
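For example (assuming df holds the dataset and root a tree trained on it):

sample = df[df.index == 0]    # a one-row dataframe
print(Predict(root, sample))  # predicted class label for that row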
# DrawPng()
# visualize the tree using pydotplus/graphviz
def DrawPng(root, out_file):
    """
    @param root: the root node of the tree
    @param out_file: the output path of the PNG file
    """
    from pydotplus import graphviz
    g = graphviz.Dot()  # generate a new dot graph
    TreeToGraph(0, g, root)
    g2 = graphviz.graph_from_dot_data(g.to_string())
    g2.write_png(out_file)
# TreeToGraph()
# build the graph from the root node downwards
def TreeToGraph(i, g, root):
    """
    @param i: node number in this tree
    @param g: pydotplus.graphviz.Dot() object
    @param root: the root node
    @return i: node number after modification
    @return g_node: the current root node in graphviz
    """
    from pydotplus import graphviz
    # "好瓜" (good melon) is the class column of the watermelon dataset
    if root.attr is None:
        g_node_label = "Node:%d\n好瓜:%s" % (i, root.label)
    else:
        g_node_label = "Node:%d\n好瓜:%s\n属性:%s" % (i, root.label, root.attr)
    g_node = str(i)
    g.add_node(graphviz.Node(g_node, label=g_node_label))
    for value in list(root.attrDown):
        i, g_child = TreeToGraph(i + 1, g, root.attrDown[value])
        g.add_edge(graphviz.Edge(g_node, g_child, label=value))
    return i, g_node
With the basic tree machinery complete, we can use the functions above to process the data.
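Here df is assumed to be the watermelon dataset loaded with pandas; the file name is hypothetical, but the layout must match what OptArr expects (first column an ID, last column the class label):

import pandas as pd

df = pd.read_csv('watermelon_3.csv')  # hypothetical path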
root = TreeGenerate(df)
# estimate the accuracy with repeated random half splits
from random import sample
accuracy_scores = []
for t in range(10):
    train = sample(range(len(df.index)), len(df.index) // 2)
    df_train = df.iloc[train]  # select rows by position
    df_test = df.drop(train)
    # generate the tree
    root = TreeGenerate(df_train)
    # test the accuracy
    pred_true = 0
    for i in df_test.index:
        label = Predict(root, df[df.index == i])
        if label == df_test[df_test.columns[-1]][i]:
            pred_true += 1
    accuracy = pred_true / len(df_test.index)
    accuracy_scores.append(accuracy)
# k-fold cross validation, a model evaluation method
n = len(df.index)
k = 5
m = n // k  # fold size
accuracy_scores = []  # reset so that only the k-fold results are reported below
for fold in range(k):
    test = list(range(fold * m, fold * m + m))  # row positions belonging to this fold
    df_train = df.drop(test)
    df_test = df.iloc[test]
    root = TreeGenerate(df_train)  # generate the tree
    # test the accuracy
    pred_true = 0
    for i in df_test.index:
        label = Predict(root, df[df.index == i])
        if label == df_test[df_test.columns[-1]][i]:
            pred_true += 1
    accuracy = pred_true / len(df_test.index)
    accuracy_scores.append(accuracy)
# print the prediction accuracy results
accuracy_sum = 0
print("accuracy: ", end="")
for i in range(k):
    print("%.3f " % accuracy_scores[i], end="")
    accuracy_sum += accuracy_scores[i]
print("\naverage accuracy: {:.3f}".format(accuracy_sum / k))
# visualization
# draw the decision tree using pydotplus.graphviz
root = TreeGenerate(df)
DrawPng(root, "decision_tree_ID3.png")