Ensemble Learning

Posted 2020-08-22


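Ensemble learning is an important and popular branch of machine learning. It combines several weak classifiers into one strong classifier, in the spirit of the proverb "three cobblers together outwit Zhuge Liang" (roughly, "two heads are better than one"). Weak classifiers are typically built from decision trees, neural networks, Bayesian classifiers, k-nearest neighbors, and so on. Researchers have shown theoretically that ensembles can improve classifier performance, for statistical, computational, and representational reasons.

The three main families of ensemble algorithms are boosting, bagging, and stacking. In boosting, every weak classifier comes from the same learning algorithm, but the sample weights are updated each round: the weights of the samples misclassified in the previous round are increased. After T rounds there are T weak classifiers, and the weight of each classifier depends on its error in its own round. Bagging also uses a single type of weak classifier, but each one is trained on data obtained by bootstrap sampling. Stacking has two layers: the first layer trains T weak classifiers with different algorithms and produces a new dataset of the same size as the original; the second layer trains a new algorithm on that new dataset.

An ensemble is only effective under two preconditions: 1. the error rate of each weak classifier must not exceed 0.5; 2. the weak classifiers must differ substantially from one another, otherwise the ensemble gains little.

By the relationship between the base classifiers, ensemble learning is divided into heterogeneous and homogeneous ensembles. In a heterogeneous ensemble the weak classifiers are of different kinds; in a homogeneous ensemble they are of the same kind and differ only in their parameters. How can different base classifiers be obtained? Mainly in the following five ways:

Varying the kind of base classifier, i.e. the algorithm it is built from.
Processing the data differently, e.g. boosting, bagging, stacking, cross-validation, hold-out test.
Processing and selecting the input features.
Processing the outputs, e.g. the error-correcting output codes proposed by some researchers.
Introducing random perturbations.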
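The first precondition above (error rate below 0.5) can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that the T classifiers make independent errors with the same error rate e and are combined by simple majority vote; real classifiers are correlated, so this is an idealized picture. The function name `majority_vote_error` is hypothetical.

```python
from math import comb


def majority_vote_error(T, e):
    """Probability that a majority of T independent classifiers is wrong,
    when each one errs with probability e (T should be odd to avoid ties)."""
    k_min = T // 2 + 1  # smallest number of wrong votes that makes the majority wrong
    return sum(comb(T, k) * e**k * (1 - e) ** (T - k) for k in range(k_min, T + 1))


# Compare ensembles of 1, 5 and 21 classifiers at three individual error rates.
for e in (0.3, 0.5, 0.6):
    print(e, [round(majority_vote_error(T, e), 4) for T in (1, 5, 21)])
```

Under these assumptions the ensemble error shrinks as T grows when e is below 0.5, stays at 0.5 when e equals 0.5, and grows when e is above 0.5, which is why weak classifiers worse than random guessing do not help.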

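Base classifiers are usually combined by simple voting, Bayesian voting, combination based on Dempster-Shafer (D-S) evidence theory, or combination based on different feature subsets. The main tool for analyzing the performance of the base learners is bias-variance analysis. Some general experimental conclusions reported so far: boosting ensembles are clearly better than bagging, but on some datasets boosting does worse than a single classifier; ensembles built by randomizing the initial weights of artificial neural networks often do as well as bagging; boosting depends on the dataset to some extent, whereas bagging is less sensitive to it; boosting can reduce both bias and variance, whereas bagging can only reduce variance and has little effect on bias.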

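The following first walks through the relevant material from the book "Statistical Learning Methods" (《统计学习方法》).

Boosting

1. The AdaBoost algorithm

The idea of boosting is to combine multiple classifiers to obtain a more accurate classification result.

How AdaBoost is categorized

"Statistical Learning Methods" calls AdaBoost the representative boosting algorithm. Boosting is a commonly used statistical learning method that is widely applied and effective. In classification problems it learns multiple classifiers by changing the weights of the training samples, and combines these classifiers linearly to improve classification performance.

"Machine Learning in Action" (《机器学习实战》) calls AdaBoost the most popular meta-algorithm, where a meta-algorithm is "an algorithm over learning algorithms", i.e. a way of combining other algorithms.

The basic idea of AdaBoost

Train for multiple rounds, producing multiple classifiers.
In each round, increase the weights of misclassified samples and decrease the weights of correctly classified samples.
Decrease the weight of classifiers with a high error rate and increase the weight of classifiers with a high accuracy.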

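The AdaBoost algorithm

Given a training dataset for binary classification

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

where each sample point consists of an instance and a label: instance $x_i \in \mathcal{X} \subseteq \mathbf{R}^n$, label $y_i \in \mathcal{Y} = \{-1, +1\}$, $\mathcal{X}$ is the instance space and $\mathcal{Y}$ is the label set. AdaBoost uses the following algorithm to learn a sequence of weak (base) classifiers from the training data and combine them linearly into a strong classifier.

Algorithm (AdaBoost)

Input: training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$; a weak learning algorithm;

Output: final classifier $G(x)$

(1) Initialize the weight distribution of the training data

$$D_1 = (w_{11}, \ldots, w_{1i}, \ldots, w_{1N}), \quad w_{1i} = \frac{1}{N}, \quad i = 1, 2, \ldots, N$$

The subscript of each w has two parts: the first is the current iteration, matching the subscript of D; the second is the index of the weight, matching its position. Initially all weights are equal.

(2) For $m = 1, 2, \ldots, M$ (the book does not explain M; it is the number of training iterations, chosen by the user. Each iteration produces one classifier, so there are M classifiers in the end):

(a) Learn from the training data with weight distribution $D_m$ to obtain the base classifier

$$G_m(x): \mathcal{X} \to \{-1, +1\}$$

(b) Compute the classification error rate of $G_m(x)$ on the training data

$$e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} w_{mi} I(G_m(x_i) \neq y_i)$$

The name "classification error rate" can be misleading: it is actually a weighted sum.

(c) Compute the coefficient of $G_m(x)$

$$\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$$

The logarithm here is the natural logarithm. $\alpha_m$ expresses the importance of $G_m(x)$ in the final classifier. From the formula, when $e_m \le \frac{1}{2}$ we have $\alpha_m \ge 0$, and $\alpha_m$ increases as $e_m$ decreases, so a base classifier with a smaller error rate plays a larger role in the final classifier.

Why this particular formula? It follows from the derivation of the forward stagewise algorithm, which is introduced in a later section.

(d) Update the weight distribution of the training data

$$D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,i}, \ldots, w_{m+1,N})$$

$$w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp(-\alpha_m y_i G_m(x_i)), \quad i = 1, 2, \ldots, N$$

y only takes the values +1 and -1, so the formula above can be written as:

$$w_{m+1,i} = \begin{cases} \dfrac{w_{mi}}{Z_m} e^{-\alpha_m}, & G_m(x_i) = y_i \\ \dfrac{w_{mi}}{Z_m} e^{\alpha_m}, & G_m(x_i) \neq y_i \end{cases}$$

Here $Z_m$ is the normalization factor

$$Z_m = \sum_{i=1}^{N} w_{mi} \exp(-\alpha_m y_i G_m(x_i))$$

which makes $D_{m+1}$ a probability distribution.

It follows that the weights of samples misclassified by the base classifier are enlarged, while the weights of correctly classified samples are reduced. Relative to the correctly classified samples, the weights of misclassified samples are amplified by a factor of $e^{2\alpha_m} = \frac{1 - e_m}{e_m}$. Misclassified samples therefore play a larger role in the next round. Leaving the training data unchanged while repeatedly changing the distribution of its weights, so that the data plays a different role in learning each base classifier, is a characteristic of AdaBoost.

(3) Build the linear combination of base classifiers

$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$

and obtain the final classifier

$$G(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$$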
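Steps (b) to (d) of the algorithm can be traced by hand for a single round. The sketch below is a minimal illustration in plain Python, not the book's code: the function name `adaboost_round`, the ten labels, and the predictions of the hypothetical base classifier are all made up for the example, and it assumes the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: returns (e_m, alpha_m, D_{m+1})."""
    # (b) weighted error: total weight of the misclassified samples
    e = sum(w for w, y, g in zip(weights, y_true, y_pred) if y != g)
    # (c) classifier coefficient, natural logarithm
    alpha = 0.5 * math.log((1 - e) / e)
    # (d) unnormalized update w * exp(-alpha * y * G(x)), then divide by Z_m
    unnorm = [w * math.exp(-alpha * y * g) for w, y, g in zip(weights, y_true, y_pred)]
    z = sum(unnorm)
    return e, alpha, [u / z for u in unnorm]


y_true = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]  # hypothetical classifier, wrong on 3 samples
w1 = [0.1] * 10                                  # step (1): uniform initial weights

e1, alpha1, w2 = adaboost_round(w1, y_true, y_pred)
print("e_1 =", round(e1, 4), "alpha_1 =", round(alpha1, 4))
print("D_2 =", [round(w, 5) for w in w2])
```

In this example the three misclassified samples end up carrying half of the total weight after the update, so the next base classifier cannot do well by repeating the same mistakes.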

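Interpreting AdaBoost

AdaBoost has another interpretation: it can be regarded as the binary classification learning method whose model is an additive model, whose loss function is the exponential function, and whose learning algorithm is the forward stagewise algorithm.

Why study the forward stagewise algorithm at all, instead of going straight to the AdaBoost code? Because only by understanding the forward stagewise algorithm can one understand why AdaBoost can be combined with decision trees.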

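The forward stagewise algorithm

Consider the additive model

$$f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)$$

where $b(x; \gamma_m)$ is a basis function, $\gamma_m$ is the parameter of the basis function, and $\beta_m$ is its coefficient. Clearly, the linear combination $f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$ built by AdaBoost is an additive model.

Given training data and a loss function $L(y, f(x))$, learning the additive model $f(x)$ becomes an empirical risk minimization problem, i.e. a loss minimization problem:

$$\min_{\beta_m, \gamma_m} \sum_{i=1}^{N} L\left(y_i, \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m)\right)$$

This is usually a complex optimization problem. The idea of the forward stagewise algorithm is this: because the model being learned is additive, if each step, going from front to back, learns only one basis function and its coefficient and so gradually approaches the objective above, the optimization becomes simpler. (L is presumably short for loss; it takes the correct answer $y_i$ and the model's prediction and returns the loss value.) Concretely, each step only needs to optimize the following loss function:

$$\min_{\beta, \gamma} \sum_{i=1}^{N} L(y_i, \beta b(x_i; \gamma))$$

In other words, instead of optimizing all M basis functions at once, each step concentrates on just one.

Given the training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, a loss function $L(y, f(x))$ and a set of basis functions $\{b(x; \gamma)\}$, the forward stagewise algorithm for learning the additive model $f(x)$ is as follows:

Algorithm (forward stagewise algorithm)

Input: training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, loss function $L(y, f(x))$, basis function set $\{b(x; \gamma)\}$;

Output: additive model $f(x)$

(1) Initialize $f_0(x) = 0$

(2) For $m = 1, 2, \ldots, M$

(a) Minimize the loss function

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))$$

to obtain the parameters $\beta_m, \gamma_m$

(b) Update

$$f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$$

(3) Obtain the additive model

$$f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)$$

In this way, the forward stagewise algorithm reduces the problem of solving for all parameters $\beta_m, \gamma_m$ from m=1 to M simultaneously to solving for each pair $\beta_m, \gamma_m$ one after another.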
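The algorithm above can be sketched for one concrete choice of loss and basis function. The example below assumes squared loss and one-split regression stumps on 1-D inputs, in which case minimizing the loss in step (a) amounts to fitting the stump to the current residuals, and the coefficient $\beta_m$ is absorbed into the leaf values. The function names and the ten data points are illustrative assumptions, not part of the original text.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) of the one-split stump
    that minimizes squared error against the residuals (1-D inputs)."""
    best = None
    for s in sorted(set(xs))[:-1]:            # candidate thresholds, split as x <= s
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        c1 = sum(left) / len(left)            # leaf value = mean residual in the leaf
        c2 = sum(right) / len(right)
        loss = sum((r - c1) ** 2 for r in left) + sum((r - c2) ** 2 for r in right)
        if best is None or loss < best[0]:
            best = (loss, s, c1, c2)
    return best[1], best[2], best[3]


def forward_stagewise(xs, ys, M):
    """Forward stagewise fitting with squared loss; returns stumps and loss history."""
    preds = [0.0] * len(xs)                   # step (1): f_0(x) = 0
    stumps, losses = [], []
    for _ in range(M):                        # step (2)
        residuals = [y - p for y, p in zip(ys, preds)]
        s, c1, c2 = fit_stump(xs, residuals)  # (a) minimize the loss for one basis function
        stumps.append((s, c1, c2))
        preds = [p + (c1 if x <= s else c2) for x, p in zip(xs, preds)]  # (b) update f_m
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return stumps, losses


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]  # illustrative data
stumps, losses = forward_stagewise(xs, ys, M=6)
for (s, c1, c2), loss in zip(stumps, losses):
    print(f"split x<={s}: left={c1:.2f} right={c2:.2f} squared loss={loss:.2f}")
```

Each round only searches over a single stump, yet the training loss never increases from one round to the next, which is the simplification the forward stagewise idea buys.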

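The forward stagewise algorithm and AdaBoost

AdaBoost can be derived from the forward stagewise algorithm; the relationship is stated as a theorem.

Theorem. AdaBoost is a special case of the forward stagewise additive algorithm, in which the model is an additive model made of base classifiers and the loss function is the exponential function.

Proof. The forward stagewise algorithm learns an additive model. When the basis functions are base classifiers, this additive model is equivalent to AdaBoost's final classifier

$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$

which consists of the base classifiers $G_m(x)$ and their coefficients $\alpha_m$, m=1,2,…,M. The forward stagewise algorithm learns the basis functions one at a time, which matches the way AdaBoost learns the base classifiers one at a time. We now show that when the loss function of the forward stagewise algorithm is the exponential loss function

$$L(y, f(x)) = \exp[-y f(x)]$$

its concrete learning steps are equivalent to those of AdaBoost.

Suppose that after m-1 rounds the forward stagewise algorithm has obtained $f_{m-1}(x)$:

$$f_{m-1}(x) = f_{m-2}(x) + \alpha_{m-1} G_{m-1}(x) = \alpha_1 G_1(x) + \cdots + \alpha_{m-1} G_{m-1}(x)$$

In round m it obtains $\alpha_m$, $G_m(x)$ and $f_m(x)$:

$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$

The goal is for the $\alpha_m$ and $G_m(x)$ found by the forward stagewise algorithm to make the exponential loss of $f_m(x)$ on the training dataset T as small as possible, i.e.

$$(\alpha_m, G_m(x)) = \arg\min_{\alpha, G} \sum_{i=1}^{N} \exp[-y_i (f_{m-1}(x_i) + \alpha G(x_i))]$$

This can be rewritten as

$$(\alpha_m, G_m(x)) = \arg\min_{\alpha, G} \sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i \alpha G(x_i)]$$

where $\bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]$ (a sum in the exponent can be pulled out as a product). Because $\bar{w}_{mi}$ depends on neither α nor G, it plays no part in the minimization. It does depend on $f_{m-1}(x)$, however, and therefore changes from round to round.

We now show that the $\alpha_m^*$ and $G_m^*(x)$ that minimize the expression above are exactly the $\alpha_m$ and $G_m(x)$ produced by AdaBoost.

The minimization can be solved in two steps.

First, find $G_m^*(x)$. For any α>0, the $G(x)$ that minimizes the expression is given by

$$G_m^*(x) = \arg\min_{G} \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G(x_i))$$

where $\bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]$.

This classifier $G_m^*(x)$ is the base classifier $G_m(x)$ of AdaBoost, because it is the base classifier that minimizes the weighted classification error rate on the training data in round m.

Next, find $\alpha_m^*$. In the objective,

$$\sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i \alpha G(x_i)] = \sum_{y_i = G_m(x_i)} \bar{w}_{mi} e^{-\alpha} + \sum_{y_i \neq G_m(x_i)} \bar{w}_{mi} e^{\alpha} = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G(x_i)) + e^{-\alpha} \sum_{i=1}^{N} \bar{w}_{mi}$$

The transformation is simple: when y and G agree the exponent is negative, otherwise it is positive. The second equality uses the same fact, only expressed with the indicator function I.

Substituting the $G_m^*(x)$ found above, differentiating with respect to α and setting the derivative to 0 gives the minimizing α:

$$\alpha_m^* = \frac{1}{2} \log \frac{1 - e_m}{e_m}$$

where $e_m$ is the classification error rate:

$$e_m = \frac{\sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} \bar{w}_{mi}} = \sum_{i=1}^{N} w_{mi} I(y_i \neq G_m(x_i))$$

This $\alpha_m^*$ is identical to the $\alpha_m$ of step 2(c) of the AdaBoost algorithm.

Finally, consider the sample weight update in each round. From

$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$

and $\bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]$, we get

$$\bar{w}_{m+1,i} = \bar{w}_{mi} \exp[-y_i \alpha_m G_m(x_i)]$$

This differs from the sample weight update in step 2(d) of the AdaBoost algorithm only by the normalization factor, so the two are equivalent.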
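The closed form for $\alpha_m^*$ can be checked numerically. After dividing the objective by $\sum_i \bar{w}_{mi}$, it becomes $(e^{\alpha} - e^{-\alpha}) e_m + e^{-\alpha}$, a function of α alone once $G_m$ is fixed. The sketch below compares the closed-form minimizer with a brute-force grid search; the function names, the grid range and the three error rates are arbitrary choices for illustration.

```python
import math


def exp_loss(alpha, e):
    """Normalized exponential loss (e^alpha - e^-alpha) * e + e^-alpha
    for a base classifier with weighted error rate e."""
    return (math.exp(alpha) - math.exp(-alpha)) * e + math.exp(-alpha)


def closed_form_alpha(e):
    """alpha* = 1/2 * log((1 - e) / e), from setting the derivative to zero."""
    return 0.5 * math.log((1 - e) / e)


def grid_argmin(e, lo=-3.0, hi=3.0, steps=60001):
    """Brute-force minimizer of exp_loss over an evenly spaced grid of alphas."""
    grid = (lo + (hi - lo) * k / (steps - 1) for k in range(steps))
    return min(grid, key=lambda a: exp_loss(a, e))


for e in (0.1, 0.3, 0.45):
    print(e, round(closed_form_alpha(e), 4), round(grid_argmin(e), 4))
```

The two columns agree up to the grid resolution, and the minimizing α shrinks toward 0 as the error rate approaches 0.5.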

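Boosting trees

A boosting tree is a boosting method that uses classification trees or regression trees as base classifiers. Boosting trees are considered one of the best-performing methods in statistical learning.

Boosting in practice uses an additive model (a linear combination of basis functions) together with the forward stagewise algorithm. A boosting method whose basis functions are decision trees is called a boosting tree. For classification problems the decision tree is a binary classification tree; for regression problems it is a binary regression tree. The base classifier seen in the book's worked example can be viewed as a simple decision tree with a root node directly connected to two leaf nodes, a so-called decision stump. The boosting tree model can be written as an additive model of decision trees:

$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$$

where $T(x; \Theta_m)$ denotes a decision tree, $\Theta_m$ is the parameter of the decision tree, and M is the number of trees.

The boosting tree algorithm

The boosting tree algorithm uses the forward stagewise algorithm. First set the initial boosting tree $f_0(x) = 0$; the model at step m is

$$f_m(x) = f_{m-1}(x) + T(x; \Theta_m)$$

where $f_{m-1}(x)$ is the current model, and the parameter $\Theta_m$ of the next decision tree is determined by empirical risk minimization

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m))$$

Because a linear combination of trees can fit the training data well, even when the relationship between inputs and outputs is very complex, the boosting tree is a highly capable learning algorithm.

Boosting tree algorithms for different problems are broadly similar; they differ mainly in the loss function used: squared error loss for regression, exponential loss for classification, and a general loss function for general decision problems.

For binary classification, the boosting tree algorithm only needs to restrict the base classifiers in AdaBoost to binary classification trees, so in this case the boosting tree algorithm is a special case of AdaBoost. Next we look at how it is applied, using the code from "Machine Learning in Action".

A Python implementation of boosting trees

AdaBoost + decision trees = boosting trees. Here is how it is implemented in Python.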

from numpy import *

def loadSimpData():
    datMat = matrix([[ 1. ,  2.1],
        [ 2. ,  1.1],
        [ 1.3,  1. ],
        [ 1. ,  1. ],
        [ 2. ,  1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat,classLabels

def loadDataSet(fileName):      #general function to parse tab-delimited floats
    dataMat = []; labelMat = []
    with open(fileName) as fr:  #the with-block closes the file when done
        for line in fr:
            curLine = line.strip().split('\t')
            #every field except the last is a feature, the last one is the label
            dataMat.append([float(v) for v in curLine[:-1]])
            labelMat.append(float(curLine[-1]))
    return dataMat,labelMat

def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the data
    retArray = ones((shape(dataMatrix)[0],1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:,dimen] > threshVal] = -1.0
    return retArray
    

def buildStump(dataArr,classLabels,D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m,n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))
    minError = inf #init error sum, to +infinity
    for i in range(n):#loop over all dimensions
        rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();
        stepSize = (rangeMax-rangeMin)/numSteps
        for j in range(-1,int(numSteps)+1):#loop over all range in current dimension
            for inequal in ['lt', 'gt']: #go over less than and greater than
                threshVal = (rangeMin + float(j) * stepSize)
                predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)#call stump classify with i, j, lessThan
                errArr = mat(ones((m,1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T*errArr  #calc total error multiplied by D
                #print "split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError)
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump,minError,bestClasEst


def adaBoostTrainDS(dataArr,classLabels,numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m,1))/m)   #init D to all equal
    aggClassEst = mat(zeros((m,1)))
    for i in range(numIt):
        bestStump,error,classEst = buildStump(dataArr,classLabels,D)#build Stump
        #print "D:",D.T
        alpha = float(0.5*log((1.0-error)/max(error,1e-16)))#calc alpha, throw in max(error,eps) to account for error=0
        bestStump['alpha'] = alpha  
        weakClassArr.append(bestStump)                  #store Stump Params in Array
        #print "classEst: ",classEst.T
        expon = multiply(-1*alpha*mat(classLabels).T,classEst) #exponent for D calc, getting messy
        D = multiply(D,exp(expon))                              #Calc New D for next iteration
        D = D/D.sum()
        #calc training error of all classifiers, if this is 0 quit for loop early (use break)
        aggClassEst += alpha*classEst
        #print "aggClassEst: ",aggClassEst.T
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))
        errorRate = aggErrors.sum()/m
        print("total error: ",errorRate)
        if errorRate == 0.0: break
    return weakClassArr,aggClassEst

def adaClassify(datToClass,classifierArr):
    dataMatrix = mat(datToClass)#do stuff similar to last aggClassEst in adaBoostTrainDS
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m,1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'],
                                 classifierArr[i]['ineq'])#call stump classify
        aggClassEst += classifierArr[i]['alpha']*classEst
        print(aggClassEst)
    return sign(aggClassEst)

def plotROC(predStrengths, classLabels):
    import matplotlib.pyplot as plt
    cur = (1.0,1.0) #cursor
    ySum = 0.0 #variable to calculate AUC
    numPosClas = sum(array(classLabels)==1.0)
    yStep = 1/float(numPosClas); xStep = 1/float(len(classLabels)-numPosClas)
    sortedIndicies = predStrengths.argsort()#get sorted index, it's reverse
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)
    #loop through all the values, drawing a line segment at each point
    for index in sortedIndicies.tolist()[0]:
        if classLabels[index] == 1.0:
            delX = 0; delY = yStep;
        else:
            delX = xStep; delY = 0;
            ySum += cur[1]
        #draw line from cur to (cur[0]-delX,cur[1]-delY)
        ax.plot([cur[0],cur[0]-delX],[cur[1],cur[1]-delY], c='b')
        cur = (cur[0]-delX,cur[1]-delY)
    ax.plot([0,1],[0,1],'b--')
    plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
    plt.title('ROC curve for AdaBoost horse colic detection system')
    ax.axis([0,1,0,1])
    plt.show()
    print("the Area Under the Curve is: ",ySum*xStep)

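2. Gradient boosting

Gradient boosting is an extension of AdaBoost. It no longer requires the error function to be the exponential error function; it can be any error function (since gradient descent is used to optimize the error function, the error function must be smooth).

Within this framework we can use different hypotheses and models to solve classification or regression problems.

Gradient boosting for regression

When gradient boosting is applied to a regression problem, the error function is chosen to be the squared error.

Next, we take a first-order Taylor expansion of this error function in the variable s around $s_n$. What remains to be minimized is then $\sum_n h(x_n) \cdot 2(s_n - y_n)$, and to make this expression small, $h(x_n)$ must point in the opposite direction to $(s_n - y_n)$.

The magnitude of $h(x_n)$ is not what we care about, because the step length is determined by the parameter η.
If we forced $h(x_n)$ to have magnitude 1 or some other constant, we would get a constrained optimization problem, which is harder to solve. Instead, we add the penalty term $h(x_n)^2$ to the objective (meaning that the larger $h(x_n)$ is, the less we want it).
Completing the square gives an expression in $(h(x_n) - (y_n - s_n))^2$, so finding the penalized approximate functional gradient is equivalent to a regression problem from $x_n$ to the residual $y_n - s_n$.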
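The completing-the-square step mentioned above can be written out explicitly. This is a reconstruction from the quantities named in the text ($h$, $s_n$, $y_n$), with constant factors that do not affect the minimizer left as they are:

```latex
\begin{aligned}
\min_{h} \sum_{n} \Bigl( 2\,h(x_n)\,(s_n - y_n) + h(x_n)^2 \Bigr)
&= \min_{h} \sum_{n} \Bigl( \bigl(h(x_n) - (y_n - s_n)\bigr)^2 - (y_n - s_n)^2 \Bigr) \\
&= \min_{h} \sum_{n} \bigl(h(x_n) - (y_n - s_n)\bigr)^2 + \text{constant}
\end{aligned}
```

The term $(y_n - s_n)^2$ does not depend on $h$, so the minimizing $h$ is the least-squares fit of the residuals $y_n - s_n$ on $x_n$.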

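Determining the step size η:

Having determined $g_t$, we next determine the step size η. Here the error function can be written in terms of the residual $(y_n - s_n)$, as $\min_\eta \sum_n ((y_n - s_n) - \eta g_t(x_n))^2$. This is a single-variable linear regression problem, where the input is the data transformed by $g_t$ and the output is the residual.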
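The single-variable regression for η has the closed-form solution $\eta = \sum_n g_t(x_n)(y_n - s_n) / \sum_n g_t(x_n)^2$ (least squares without an intercept). The sketch below computes it for made-up values; `best_step`, `sq_err` and the numbers are illustrative assumptions, and the hypothetical $g_t$ outputs only +1 or -1 so that the step size is visible.

```python
def best_step(g_vals, residuals):
    """Least-squares eta for min_eta sum_n (residual_n - eta * g_n)^2 (no intercept)."""
    num = sum(g * r for g, r in zip(g_vals, residuals))
    den = sum(g * g for g in g_vals)
    if den == 0:
        raise ValueError("g_t is identically zero on the training set")
    return num / den


def sq_err(eta, g_vals, residuals):
    """Squared error of the residuals after taking a step of size eta along g_t."""
    return sum((r - eta * g) ** 2 for g, r in zip(g_vals, residuals))


ys = [3.0, 2.5, 4.0, -1.0, -2.0]         # targets y_n (made up)
scores = [0.0] * len(ys)                  # current scores s_n
residuals = [y - s for y, s in zip(ys, scores)]
g_vals = [1.0, 1.0, 1.0, -1.0, -1.0]      # outputs g_t(x_n) of a hypothetical base learner

eta = best_step(g_vals, residuals)
scores = [s + eta * g for s, g in zip(scores, g_vals)]  # update s_n <- s_n + eta * g_t(x_n)
print("eta =", eta, "new scores =", scores)
```

One consequence worth noting: if $g_t$ is a regression tree whose leaf values are the mean residuals of its leaves, this formula returns η = 1, because the numerator and denominator are then equal.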

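Gradient boosted decision trees

Putting the steps above together gives the algorithm for gradient boosted decision trees (GBDT):

1. In each iteration, solve a regression problem; here the CART algorithm can be used to solve the regression problem on $\{x_n, (y_n - s_n)\}$;

2. Then transform the data with $g_t$ and solve the single-variable linear regression problem on $\{g_t(x_n), y_n - s_n\}$;

3. Update the scores $s_n$;

4. After T iterations, obtain G(x).

This GBDT algorithm can be seen as the regression version of AdaBoost-DTree.

The code is as follows: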

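import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Gradient boosted regression trees, a simple implementation.
 * Depends on the Sample and Cart classes, which are not shown here.
 * @author ysh  1208706282
 *
 */
public class Gbdt {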
    static List<Sample> mSamples;
    static List<Double> mTrainTarget;
    static List<Double> mTrainCurTarget;
    static List<Cart> mCarts;
    /**
     * Load the data (regression tree)
     * @param path
     * @param regex
     * @throws Exception
     */
    public  void loadData(String path,String regex) throws Exception{
        mSamples = new ArrayList<Sample>();
        mTrainTarget = new ArrayList<Double>();
        mTrainCurTarget = new ArrayList<Double>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = null;
        String splits[] = null;
        Sample sample = null;
        while(null != (line=reader.readLine())){
            splits = line.split(regex);
            sample = new Sample();
            sample.label = Double.valueOf(splits[0]);
            mTrainTarget.add(sample.label);
            mTrainCurTarget.add(0.0);
            sample.feature = new ArrayList<Double>(splits.length-1);
            for(int i=0;i<splits.length-1;i++){
                sample.feature.add(Double.valueOf(splits[i+1]));
            }
            
            mSamples.add(sample);
        }
        reader.close();
    }
    /**
     * Update the training targets (each label becomes the current residual)
     * @author Administrator
     *
     */
    static class Update implements Runnable{
        int from;
        int to;
        Cart cart;
        public Update(int from,int to,Cart cart){
            this.from = from;
            this.to = to;
            this.cart = cart;
        }
        @Override
        public void run() {
            // TODO Auto-generated method stub
            Sample sample = null;
            for(int i=from;i<to;i++){
                sample = mSamples.get(i);
                mTrainCurTarget.set(i, mTrainCurTarget.get(i)+cart.classify(sample));
                sample.label = mTrainTarget.get(i)-mTrainCurTarget.get(i);
            }
        }
        
    }
    public void updateTarget(Cart cart) throws InterruptedException{
        /*Sample sample = null;
        for(int i=0;i<mSamples.size();i++){
            sample = mSamples.get(i);
            mTrainCurTarget.set(i, mTrainCurTarget.get(i)+cart.classify(sample));
            sample.label = mTrainTarget.get(i)-mTrainCurTarget.get(i);
        }*/
        Update update = null;
        int num = 10;
        Thread ths[] = new Thread[num];
        int size = mSamples.size();
        
        for(int i=0;i<num;i++){
            update = new Update(i*size/num,(i+1)*size/num,cart);
            ths[i] = new Thread(update);
            ths[i].start();
        }
        for(int i=0;i<num;i++){
            ths[i].join();
        }
    }
    public void train(int iters) throws InterruptedException{
        mCarts = new ArrayList<Cart>(iters);
        Random ran = new Random();
        ran.setSeed(100);
        for(int iter=0;iter<iters;iter++){
            System.out.println("start iter "+iter+"  time:"+System.currentTimeMillis()/1000);
            Cart cart = new Cart();
            cart.mFeatureRate = 0.8;
            cart.mMaxDepth = 5;
            cart.mMinLeaf = 1;
            cart.mRandom = ran;
            cart.setData(mSamples);
            cart.train();
            mCarts.add(cart);
            updateTarget(cart);
            System.out.println("end iter "+iter+"  time:"+System.currentTimeMillis()/1000);
        }
    }
    public double classify(Sample sample){
        double ret = 0;
        for(Cart cart:mCarts){
            ret += cart.classify(sample);
        }
        return ret;
    }
    /**
     * @param args
     * @throws Exception 
     */
    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        System.out.println(System.currentTimeMillis());
        Gbdt gbdt = new Gbdt();
        gbdt.loadData("F:/2016-contest/20161001/train_data_1.csv", ",");
        gbdt.train(100);
        List<Sample> samples = Cart.loadTestData("F:/2016-contest/20161001/valid_data_1.csv", true, ",");
        double sum = 0;
        for(Sample s:samples){
            double val = gbdt.classify(s);
            sum += (val-s.label)*(val-s.label);
            System.out.println(val+"  "+s.label);
        }
        System.out.println(sum/samples.size()+"  "+sum);
        System.out.println(System.currentTimeMillis());
    }

}

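Bagging

In boosting, the model learned in each round is closely tied to the ones before it, so training is sequential. In bagging, the models are made as different from one another as possible and are trained independently, so training is parallel. How bagging works: from a dataset of m samples, draw T sampled datasets of n samples each; train one model on each sampled dataset; then combine the models, usually by simple voting, simple averaging, or confidence comparison. Training T models costs roughly T times as much as one base learner, but because the models are independent they can be trained in parallel, so the algorithm is fairly efficient. Bagging focuses on reducing variance, while boosting focuses on reducing bias.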
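The sampling step of bagging can be sketched as follows. The example assumes bootstrap sampling with n = m, i.e. each sampled dataset has the same size as the original and is drawn with replacement (the setting also used by random forests); the function name `bootstrap_indices`, the seed and the dataset size are illustrative.

```python
import random


def bootstrap_indices(m, rng):
    """Draw m indices from range(m) uniformly at random, with replacement."""
    return [rng.randrange(m) for _ in range(m)]


rng = random.Random(0)        # fixed seed so the run is repeatable
m = 10000                     # size of the original dataset

idx = bootstrap_indices(m, rng)
in_bag = set(idx)                                      # distinct samples that were drawn
out_of_bag = [i for i in range(m) if i not in in_bag]  # samples never drawn

unique_fraction = len(in_bag) / m
print("distinct samples in the bootstrap set:", round(unique_fraction, 3))
print("out-of-bag samples:", len(out_of_bag))
```

For large m the fraction of distinct samples approaches 1 - 1/e, about 63.2%, so each model sees a noticeably different subset of the data, and the remaining out-of-bag samples can serve as a held-out set for that model.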

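Random forests:

A random forest is one embodiment of the bagging idea; here is a brief introduction.

A random forest is a classifier that uses multiple trees to train on and predict samples. It was first proposed by Leo Breiman and Adele Cutler, and the name has been registered as a trademark. Put simply, a random forest consists of multiple CART (Classification And Regression Tree) trees. Each tree is trained on a set sampled with replacement from the full training set, which means that some samples of the full training set may appear several times in one tree's training set, and others may not appear in it at all. When training each node of a tree, the features considered are drawn at random, without replacement, from all features. Following Leo Breiman's suggestion, if the total number of features is M, the number of features drawn can be sqrt(M), 1/2 sqrt(M), or 2 sqrt(M).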

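The training procedure of a random forest can therefore be summarized as follows:

(1) Given a training set S, a test set T, and feature dimension F, fix the parameters: the number of CART trees t, the depth of each tree d, the number of features used at each node f, and the stopping conditions: the minimum number of samples at a node s and the minimum information gain at a node m.

For each tree i = 1, ..., t:

(2) Draw from S, with replacement, a training set S(i) of the same size as S, use it as the sample set of the root node, and start training from the root node.

(3) If the current node meets a stopping condition, make it a leaf node. For a classification problem, the predicted output of the leaf is the class c(j) with the most samples at the node, with probability p equal to the share of c(j) in the node's sample set; for a regression problem, the predicted output is the mean of the sample values at the node. Then continue training the other nodes. If the current node does not meet a stopping condition, randomly select f features from the F features without replacement. Using these f features, find the single feature k and threshold th that give the best split; samples at the current node whose k-th feature is less than th go to the left child, and the rest go to the right child. Continue training the other nodes. The criterion for judging split quality is discussed later.

(4) Repeat (2) and (3) until every node has been trained or marked as a leaf node.

(5) Repeat (2), (3) and (4) until all CART trees have been trained.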

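Prediction with a random forest proceeds as follows:

For each tree i = 1, ..., t:

(1) Starting from the root node of the current tree, use the threshold th of the current node to decide whether to go to the left child (<th) or the right child (>=th), until a leaf node is reached, and output its predicted value.

(2) Repeat (1) until all t trees have output a prediction. For a classification problem, the output is the class with the largest total predicted probability over all trees, i.e. the p of each c(j) is accumulated; for a regression problem, the output is the average of the outputs of all trees.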
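The aggregation in step (2) can be sketched as below. It assumes each tree has already produced its leaf output: a (class, probability) pair for classification, or a number for regression. The function names and the example outputs are hypothetical.

```python
from collections import defaultdict


def forest_classify(tree_outputs):
    """tree_outputs: one (class_label, probability) pair per tree.
    Returns the class whose accumulated probability is largest."""
    totals = defaultdict(float)
    for label, p in tree_outputs:
        totals[label] += p
    return max(totals, key=totals.get)


def forest_regress(tree_outputs):
    """tree_outputs: one numeric prediction per tree. Returns their mean."""
    return sum(tree_outputs) / len(tree_outputs)


# Five trees voting between classes "A" and "B" (made-up leaf outputs).
votes = [("A", 0.6), ("B", 0.9), ("A", 0.55), ("B", 0.8), ("A", 0.7)]
print(forest_classify(votes))

# Summing probabilities is not the same as counting votes: "A" has two votes,
# but "B" has the larger total probability here.
print(forest_classify([("A", 0.4), ("A", 0.4), ("B", 0.9)]))

print(forest_regress([2.0, 3.0, 4.0]))
```

Accumulating leaf probabilities lets a confident tree outweigh several uncertain ones, which plain vote counting would not allow.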

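Note: regarding the criterion for judging split quality, since CART is used, the criterion is CART's own, which differs from those of ID3 and C4.5.

For classification problems (assigning a sample to a class), i.e. problems with a discrete target, CART uses the Gini value as the criterion. It is defined as $Gini = 1 - \sum_i P(i)^2$, where P(i) is the proportion of class-i samples in the dataset at the current node. For example, with 2 classes and 100 samples at the current node, 70 of the first class and 30 of the second, Gini = 1 - 0.7×0.7 - 0.3×0.3 = 0.42. The more evenly the classes are distributed, the larger the Gini value; the more uneven the distribution, the smaller the Gini value. When searching for the best split feature and threshold, the criterion is argmax(Gini - GiniLeft - GiniRight), i.e. find the feature f and threshold th that maximize the Gini value of the current node minus the Gini values of the left and right children (in standard CART each child's Gini is weighted by its share of the node's samples).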
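The Gini computation can be reproduced in a few lines. The sketch below computes the Gini value from class counts and, as an assumption that goes slightly beyond the formula in the text, weights each child by its share of the samples when scoring a split, as standard CART does. The function names and the example split are illustrative; every node is assumed to be non-empty.

```python
def gini(counts):
    """Gini value of a node, given the number of samples in each class."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def split_gain(parent, left, right):
    """Decrease in Gini when 'parent' is split into 'left' and 'right',
    with each child weighted by its share of the parent's samples."""
    n = sum(parent)
    weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)
    return gini(parent) - weighted


print(round(gini([70, 30]), 2))    # the 70/30 node from the text: 0.42
print(gini([50, 50]))              # evenly mixed node, the largest value for 2 classes
print(gini([100, 0]))              # pure node

# A hypothetical split of the 70/30 node into a pure child and a mixed child.
print(round(split_gain([70, 30], [60, 0], [10, 30]), 2))
```

A split is better the more it lowers the weighted Gini of the children, so the search over features and thresholds simply keeps the candidate with the largest gain.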

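For regression problems things are simpler: argmax(Var - VarLeft - VarRight) is used directly as the criterion, i.e. choose the split that maximizes the variance Var of the training set at the current node minus the variance VarLeft of the left child and the variance VarRight of the right child.

The code is as follows (note that this implementation selects attributes by information gain ratio over discrete attribute values, not by the Gini criterion described above):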

// random_forest.h
#ifndef _DECISION_TREE_H_
#define _DECISION_TREE_H_
#include <string>
#include <vector>
#include <set>
#include <ctime> 
#include <cstdlib>   // rand, srand
#include <algorithm>
#include <cmath>

using namespace std;

//the data structure for a tuple
struct TupleData
{
vector<int> A;
char label;
};

struct TNode
{
int attrNum;    
int attr;    
char label;
};

struct decision_tree
{
TNode node;
vector<decision_tree*> childs;
};

void init(char * trainname, char * testname);
int readData(vector<TupleData> &data, const char* fileName);
int stringtoint(string s);
void sub_init();
void calculate_ArrtNum();
void calculate_attributes();
void RandomSelectData(vector<TupleData> &data, vector<TupleData> &subdata);
double Entropy(double p, double s);
int creat_classifier(decision_tree *&p, const vector<TupleData> &samples, vector<int> &attributes);
int BestGainArrt(const vector<TupleData> &samples, vector<int> &attributes);
bool Allthesame(const vector<TupleData> &samples, char ch);
char Majorityclass(const vector<TupleData> &samples);
void RandomSelectAttr(vector<int> &data, vector<int> &subdata);
char testClassifier(decision_tree *p, TupleData d);
void testData();
void freeClassifier(decision_tree *p);
void freeArrtNum();
void showResult();
#endif //_DECISION_TREE_H_


// Implementation file for random_forest.h
#include <iostream>
#include <fstream>
#include <sstream>
#include "random_forest.h"

using namespace std;

vector<decision_tree*> alltrees;

vector<TupleData>    trainAll,
                    train,    
test;    

vector<int>     attributes;    

int trainAllNum=0;    
int testAllNum=0;    
int MaxAttr;    
int *ArrtNum;
unsigned int F;
int tree_num=100;
const int leafattrnum=-1;
int TP=0,
FN=0,
FP=0,
TN=0,
TestP=0,
TestN=0;

void init(char * trainname, char * testname)
{
trainAllNum=readData(trainAll, trainname);
testAllNum=readData(test, testname);
calculate_attributes();
double temp=(double)trainAllNum;
temp=log(temp)/log(2.0);
//     F=round(temp)+1;
F = (unsigned int)floor(temp+0.5)+1;
if(F>MaxAttr) F=MaxAttr;
//cout<<"f="<<F<<endl;
}

void sub_init()
{
RandomSelectData(trainAll, train);
calculate_ArrtNum();
}


int readData(vector<TupleData> &data, const char* fileName)
{
ifstream fin;
fin.open(fileName);
string line;

int datanum=0;

while(getline(fin,line))
{
TupleData d;
        istringstream stream(line);
        string str;
while(stream>>str)
{
if(str.find('+')==0)
{
d.label='+';
}
else if(str.find('-')==0)
{
d.label='-';
}
else
{
int j=stringtoint(str);
d.A.push_back(j);
}
}

data.push_back(d);    
datanum++;
}

fin.close();
return datanum;
}

void RandomSelectData(vector<TupleData> &data, vector<TupleData> &subdata)
{
int index;
subdata.clear();
int d=0;
while (d < trainAllNum)
{
index = rand() % trainAllNum;
subdata.push_back(data.at(index));
d++;
}
}

void calculate_attributes()
{
TupleData d=trainAll.at(0);
MaxAttr=d.A.size();
attributes.clear();

for (int i = 0; i < MaxAttr; i++)
{
attributes.push_back(i);
}

ArrtNum=new int[MaxAttr];
}


int stringtoint(string s)
{
int sum=0;
for(int i=0; s[i]!='\0';i++)
{
int j=int(s[i])-48;
sum=sum*10+j;
}
return sum;
}

void calculate_ArrtNum()
{
for(int i=0; i<MaxAttr;i++) ArrtNum[i]=0;
for (vector<TupleData>::const_iterator it = train.begin(); it != train.end(); it++)    
{
int i=0;
for (vector<int>::const_iterator intt=(*it).A.begin(); intt!=(*it).A.end();intt++)
{
int valuemax=(*intt)+1;   //(*it).A.at(i)???
if(valuemax>ArrtNum[i]) ArrtNum[i]=valuemax;
i++;
}
}
}


double Entropy(double p, double s)
{
double n = s - p;
double result = 0;
if (n != 0)
result += - double(n) / s * log(double(n) / s) / log(2.0);
if (p != 0)
result += double(-p) / s * log(double(p) / s) / log(2.0);
return result;
}

int creat_classifier(decision_tree *&p, const vector<TupleData> &samples, vector<int> &attributes)
{
if (p == NULL)
p = new decision_tree();
if (Allthesame(samples, '+'))
{
p->node.label = '+';
p->node.attrNum = leafattrnum;
p->childs.clear();
return 1;
}
if (Allthesame(samples, '-'))
{
p->node.label = '-';
p->node.attrNum = leafattrnum;
p->childs.clear();
return 1;
}
if (attributes.size() == 0)
{
p->node.label = Majorityclass(samples);
p->node.attrNum = leafattrnum;
p->childs.clear();
return 1;
}
p->node.attrNum = BestGainArrt(samples, attributes);

p->node.label = ' ';

vector<int> newAttributes;
for (vector<int>::iterator it = attributes.begin(); it != attributes.end(); it++)
if ((*it) != p->node.attrNum)
newAttributes.push_back((*it));

int maxvalue=ArrtNum[p->node.attrNum];
vector<TupleData>* subSamples = new vector<TupleData>[maxvalue];
for (int i = 0; i < maxvalue; i++)
subSamples[i].clear();

for (vector<TupleData>::const_iterator it = samples.begin(); it != samples.end(); it++)
{
subSamples[(*it).A.at(p->node.attrNum)].push_back((*it));
}

decision_tree *child;
for (int i = 0; i < maxvalue; i++)
{
child = new decision_tree;
child->node.attr = i;
if (subSamples[i].size() == 0)
child->node.label = Majorityclass(samples);
else
creat_classifier(child, subSamples[i], newAttributes);
p->childs.push_back(child);
}
delete[] subSamples;
return 0;
}

int BestGainArrt(const vector<TupleData> &samples, vector<int> &attributes)
{
int attr, 
bestAttr = 0,
p = 0,
s = (int)samples.size();

for (vector<TupleData>::const_iterator it = samples.begin(); it != samples.end(); it++)
{
if ((*it).label == '+')
p++;
}

double infoD;
double bestResult = 0;
infoD=Entropy(p, s);

vector<int> m_attributes;
RandomSelectAttr(attributes, m_attributes);

for (vector<int>::iterator it = m_attributes.begin(); it != m_attributes.end(); it++)
{
attr = (*it);
double result = infoD;

int maxvalue=ArrtNum[attr];
int* subN = new int[maxvalue];
int* subP = new int[maxvalue];
int* sub = new int[maxvalue];
for (int i = 0; i < maxvalue; i++)
{
subN[i] = 0;
subP[i] = 0;
sub[i]=0;
}
for (vector<TupleData>::const_iterator jt = samples.begin(); jt != samples.end(); jt++)
{
if ((*jt).label == '+')
subP[(*jt).A.at(attr)] ++;
else
subN[(*jt).A.at(attr)] ++;
sub[(*jt).A.at(attr)]++;
}

double SplitInfo=0;
for(int i=0; i<maxvalue; i++)
{
if (sub[i] == 0) continue;   // skip empty branches: 0*log(0) would evaluate to NaN
double partsplitinfo;
partsplitinfo=-double(sub[i])/s*log(double(sub[i])/s)/log(2.0);
SplitInfo=SplitInfo+partsplitinfo;
}

double infoattr=0;
for (int i = 0; i < maxvalue; i++)
{
double partentropy;
partentropy=Entropy(subP[i], subP[i] + subN[i]);
infoattr=infoattr+((double)(subP[i] + subN[i])/(double)(s))*partentropy;
}
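result=result-infoattr;
// gain ratio; guard against SplitInfo == 0 (all samples share one attribute value)
result = (SplitInfo > 0) ? result/SplitInfo : 0;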

if (result > bestResult)
{
bestResult = result;
bestAttr = attr;
}
delete[] subN;
delete[] subP;
delete[] sub;
}

if (bestResult == 0)
{
bestAttr=attributes.at(0);
}
return bestAttr;
}

void RandomSelectAttr(vector<int> &data, vector<int> &subdata)
{
int index;
unsigned int dataNum=data.size();
subdata.clear();
if(dataNum<=F)
{
for (vector<int>::iterator it = data.begin(); it != data.end(); it++)
{
int attr = (*it);
subdata.push_back(attr);
}
}
else
{
set<int> AttrSet;
AttrSet.clear();
while (AttrSet.size() < F)
{
index = rand() % dataNum;
if (AttrSet.count(index) == 0)
{
AttrSet.insert(index);
subdata.push_back(data.at(index));
}
}
}
}

bool Allthesame(const vector<TupleData> &samples, char ch)
{
for (vector<TupleData>::const_iterator it = samples.begin(); it != samples.end(); it++)
if ((*it).label != ch)
return false;
return true;
}

char Majorityclass(const vector<TupleData> &samples)
{
int p = 0, n = 0;
for (vector<TupleData>::const_iterator it = samples.begin(); it != samples.end(); it++)
if ((*it).label == '+')
p++;
else
n++;
if (p >= n)
return '+';
else
return '-';
}

char testClassifier(decision_tree *p, TupleData d)
{
if (p->node.label != ' ')
return p->node.label;
int attrNum = p->node.attrNum;
if (d.A.at(attrNum) < 0)
return ' ';
return testClassifier(p->childs.at(d.A.at(attrNum)), d);
}

void testData()
{
for (vector<TupleData>::iterator it = test.begin(); it != test.end(); it++)
{
if((*it).label=='+') TestP++;
else TestN++;

int p=0, n=0;
for(int i=0;i<tree_num;i++)
{
if(testClassifier(alltrees.at(i), (*it))=='+')  p++;
else n++;
}

if(p>n)
{
if((*it).label=='+') TP++;
else FP++;
}
else
{
if((*it).label=='+') FN++;
else TN++;
}
}
}


void freeClassifier(decision_tree *p)
{
if (p == NULL)
return;
for (vector<decision_tree*>::iterator it = p->childs.begin(); it != p->childs.end(); it++)
{
freeClassifier(*it);
}
delete p;
}

void freeArrtNum()
{
delete[] ArrtNum;
}

void showResult()
{
cout<<"Train size:    "<< trainAllNum<<endl;
cout<<"Test size:    "<<testAllNum<<endl;    
cout << "True positive:    " << TP << endl;
cout << "False negative:    "<< FN<<endl;
cout << "False positive:    "<<FP<<endl;
cout << "True negative:    "<<TN<<endl;

//     cout << TP << endl;
//     cout << FN<<endl;
//     cout <<FP<<endl;
//     cout <<TN<<endl;
}

int main(int argc, char **argv)
{
if (argc < 3)
{
cerr << "usage: " << argv[0] << " <train file> <test file>" << endl;
return 1;
}
char * trainfile=argv[1];
char * testfile=argv[2];

//cout<<"input the F and tree_num"<<endl;
//cin>>F>>tree_num;

srand((unsigned)time(NULL)); 

init(trainfile, testfile);

for(int i=0; i<tree_num; i++)
{
sub_init();
decision_tree * root=NULL;
creat_classifier(root, train, attributes);
alltrees.push_back(root);
}

testData();

for (vector<decision_tree *>::const_iterator it = alltrees.begin(); it != alltrees.end(); it++)
{
freeClassifier((*it));
}

freeArrtNum();

showResult();
return 0;
}