Predictive Models vs. Explanatory Models: Differences and Connections

Posted 2023-05-02

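Statistical models are powerful tools for developing and testing theory, through causal explanation, prediction, and description. Many disciplines use statistical models and tend to assume that a model with high explanatory power also has high predictive power. The tension between explanation and prediction is pervasive, so we need to understand the relationship between the two and handle it deliberately.

1. Introduction

1.1 Explanatory models

Causal theoretical model: a statistical model is used to test causal hypotheses, typically by measuring the underlying effect of a variable X on Y.

The role of an explanatory model is usually to build theory by way of causal hypotheses.

1.2 Predictive models

A predictive model applies statistical models or data mining algorithms to predict new or future observations: given new observations of X, predict the new outcome Y. Predictions include time-series forecasts, point predictions, interval predictions, predictive distributions, and rankings, and may be produced with Bayesian or frequentist approaches, data mining algorithms, or statistical models.

1.3 Descriptive models

A descriptive model summarizes and represents the structure of the data in a more compact form.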

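1.4 The scientific value of predictive models

Statisticians have often regarded predictive modeling as unscientific and have tended to dismiss it. Even within the statistics community opinion is divided, and work whose main purpose is prediction has been labelled unacademic.

Predictive modeling is nevertheless a necessary scientific endeavor. Its main functions are:

(1) Large, rich datasets are often complex, with patterns that are hard to hypothesize in advance. Predictive models can help uncover potential new mechanisms.

(2) Predictive models can be used to discover new measures and evaluation schemes.

(3) When mining complex patterns and relationships, predictive models usually give better results.

(4) Scientific progress requires rigorous, relevant research, and predictive models sit between theory and experiment. An explanatory model can account for the causal relationships among variables, but its predictive power may fall short of that of a predictive model.

(5) Assessing predictive power offers a straightforward way to compare the predictive ability of competing explanatory models.

(6) Predictive models matter for quantifying predictability and creating a benchmark, because a predictive model can achieve higher predictive accuracy than an explanatory model. A low level of predictive accuracy usually signals the need for new data collection, new measurements, or new empirical approaches. When an explanatory model's accuracy is close to the predictive benchmark, our understanding of the phenomenon is fairly complete. When it falls well short of the benchmark, further exploration and understanding are still needed.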

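1.5 Differences between predictive and explanatory models

The conflict between predictive and explanatory modeling lies in their scientific foundations.

The distinction arises because measured data cannot exactly represent the underlying constructs or the relationship between them and the outcome.

In explanatory modeling, X and y are tools for estimating the function f, which in turn is used to test causal hypotheses.

In predictive modeling, by contrast, f is itself the tool, used to generate predictions of y. In fact, even if the underlying causal relationship really is y = f(x), a different function f1 fitted to different data x1 may predict better than f on x, because a biased estimator can yield better predictions than an unbiased one.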
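The claim that a biased estimator can beat an unbiased one is easy to check numerically. The following is a minimal simulation sketch, not taken from the paper: the true mean, noise level, sample size, and shrinkage factor are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma, n = 1.0, 3.0, 5      # hypothetical true mean, noise level, sample size
shrink = 0.5                    # hypothetical shrinkage factor (introduces bias)
n_sim = 20000                   # number of simulated datasets

samples = rng.normal(mu, sigma, size=(n_sim, n))
unbiased = samples.mean(axis=1)     # sample mean: unbiased, higher variance
biased = shrink * unbiased          # shrunken mean: biased, lower variance

# Mean squared error of each estimator around the true mean.
mse_unbiased = np.mean((unbiased - mu) ** 2)
mse_biased = np.mean((biased - mu) ** 2)

print(f"MSE of unbiased estimator: {mse_unbiased:.3f}")
print(f"MSE of biased estimator:   {mse_biased:.3f}")
```

With these settings the theoretical values are sigma²/n = 1.8 for the sample mean and 0.5² × 1.8 + 0.5² × 1² = 0.7 for the shrunken mean, so the biased estimator has the lower error. A different choice of mu or shrink can reverse the ordering.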

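Causation vs. association: in explanatory modeling, f represents the underlying causal function and X is assumed to cause y. In predictive modeling, f only captures the association between X and Y.

Theory vs. data: in explanatory modeling, f is built carefully on theory to support the hypothesized causal relationship between X and Y. In predictive modeling, a direct causal interpretation of the relationship between X and Y is not required, although a transparent f is sometimes desirable.

Retrospective vs. prospective: predictive modeling is forward-looking, and f is used to predict new data. Explanatory modeling is retrospective: f is used to test existing hypotheses against existing data.

Bias vs. variance:

In explanatory modeling the goal is to minimize bias to obtain the most accurate representation of the underlying theory. Predictive modeling instead seeks to minimize the combination of bias and estimation variance, sometimes sacrificing theoretical accuracy for improved empirical accuracy.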
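The trade-off can be made concrete with the standard decomposition of expected squared prediction error at a point x. The notation is introduced here for illustration and is an assumption, not the post's own: the data follow y = f(x) + ε with noise variance σ², and f̂ is the fitted model.

```latex
\mathbb{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{estimation variance}}
```

An explanatory analysis concentrates on the second term, while a predictive analysis tries to make the sum of the second and third terms small.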

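1.6 Void in the statistics literature

The debate over whether to use predictive or explanatory models is long-standing, but it has not been translated into statistical language. It has surfaced mainly as a controversy in model selection: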

There may be no significant difference between the point of view of inferring the true structure and that of making a prediction if an infinitely large quantity of data is available or if the data are noiseless. However, in modeling based on a finite quantity of real data, there is a significant gap between these two points of view, because an optimal model for prediction purposes may be different from one obtained by estimating the ‘true model.’

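2.1 Study design and data collection

(1) Sample size:

Data collection differs between explanation and prediction, starting with sample size.

In explanatory modeling, the goal is to estimate the theory-based f and use it for inference, so statistical power is the main consideration. Reducing bias requires enough data to test the model, but beyond a certain amount, additional data yield negligible gains in precision. In predictive modeling, f is usually determined by the data, and more data generally lead to better results.

(2) Sampling scheme:

In hierarchical data, increasing group size is more effective than increasing the number of groups for predictive models, while the opposite holds for explanatory models.

(3) Experimental setting:

Explanatory modeling needs data suited to causal interpretation, which is constrained by the experimental environment and the available resources. It also needs very clean data.

Predictive modeling benefits from additional, more varied data: the more dimensions, the better.

(4) Data collection instrument:

Explanatory modeling needs a sound theory behind the instrument, for example the psychological meaning of each item. Predictive modeling is more concerned with the quality of the data used for prediction and with the meaning of the data being clear.

(5) Design of experiments:

Factorial designs focus on causal explanation, that is, on finding the factors that affect the outcome.

Response surface methodology designs use optimization techniques and nonlinear transformations, and are geared toward prediction.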

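2.2 Data preparation

(1) Handling missing values

If only a small fraction of the data is missing, an explanatory model can simply drop the affected records. A predictive model does not need to discard them.

In regression, adding dummy variables that flag missing values can improve predictive performance, but this is not appropriate for an explanatory model.

Whether missingness carries information about the outcome is often unclear, and its meaning is hard to interpret, so using missingness in an explanatory model is usually not justified.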
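The dummy-variable treatment of missing values mentioned for predictive models can be sketched as follows. This is a minimal illustration on a made-up matrix, assuming mean imputation as the fill rule; the post does not specify how the gaps themselves should be filled.

```python
import numpy as np

# Hypothetical feature matrix with missing entries encoded as NaN.
X = np.array([
    [1.0, 20.0],
    [2.0, np.nan],
    [np.nan, 35.0],
    [4.0, 50.0],
])

missing = np.isnan(X)                       # boolean mask of missing cells
col_means = np.nanmean(X, axis=0)           # per-column mean, ignoring NaN
X_filled = np.where(missing, col_means, X)  # mean-impute the gaps

# Append one 0/1 indicator column per feature that has any missing value.
has_missing = missing.any(axis=0)
indicators = missing[:, has_missing].astype(float)
X_aug = np.hstack([X_filled, indicators])

print(X_aug)
```

The indicator columns let a predictive model learn from the fact that a value was missing, which is exactly the information an explanatory model would struggle to interpret.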

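(2) Data partitioning

Overfitting is usually guarded against by evaluating the model on a holdout test set. When the dataset is small, cross-validation or other resampling methods such as the bootstrap make predictive modeling feasible.

The purpose of data partitioning is to minimize the combination of bias and variance. A smaller sample usually leads to higher bias when f is estimated from the data, as is common in predictive modeling, and partitioning accepts some of that bias in exchange for lower sampling variance. It is therefore a key step in predictive modeling and of little help in explanatory modeling.

When explanatory modeling does use data partitioning, it is usually to assess the robustness of the model and its predictive power.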

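2.3 EDA

In explanatory modeling, EDA is directed at specific causal relationships. In predictive modeling, EDA is more free-form, aimed at discovering as yet unknown relationships that can support the model, possibly without any formal formulation.

EDA can be open-ended exploration, or it can be used to check existing hypotheses, assess candidate models, detect collinearity, and choose variable transformations.

Dimension reduction can reduce sampling variance in predictive modeling. PCA and other dimension-reduction methods are hard to interpret, but their outputs can be fed into a model as compressed variables.

2.4 Variable selection

In explanatory modeling, variables are chosen according to the causal structure among them and how each variable is operationalized. The emphasis is on causality.

Predictive modeling is mainly concerned with the association between X and y rather than causation. The main criteria are the strength of association with the response, data quality, and data availability. In time-series modeling, X must be available before y is observed.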

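2.5 Choice of methods

causation–association, theory–data, retrospective–prospective and bias–variance

These four dimensions lead to different choices of method. Explanatory models can be connected easily to the underlying theory.

In predictive modeling, the top priority is a model that produces accurate predictions, even if f is less transparent. Although transparency is often lacking, in many cases accuracy is improved first and understanding the model comes afterwards.

The bias-variance dimension is especially useful for improving predictive models. Ridge regression and the lasso, for example, penalize the coefficients, introducing bias but lowering variance. Ensemble methods such as bagging and boosting are further examples.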

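2.6 Model evaluation and selection

Choosing the best model from a set of candidates and evaluating its performance are done differently in explanatory and predictive modeling.

Validation:

In explanatory modeling, validation has two parts: model validation checks whether f adequately represents F, and model fit checks whether the model fits the existing data well.

In predictive modeling, the focus is on generalization, that is, performance on a holdout test set.

For explanatory models, validation mainly examines whether the model is over- or under-specified, goodness-of-fit tests, and model diagnostics such as residual analysis.

For predictive models, the biggest challenge is avoiding overfitting, which is checked by comparing performance on the training and test sets.

Some validation checks matter to one kind of model and not the other. Checking for collinearity, for example, is highly relevant to explanatory models: multicollinearity can inflate standard errors, so a large literature is devoted to removing it. For predictive models, by contrast, multicollinearity is not a sin.

Removing multicollinearity is essential for interpreting coefficients, that is, for assessing the effect of one variable on another. It also matters when evaluating how a change in a variable affects the outcome, or when monitoring fluctuations.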
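One common way to quantify how much collinearity inflates coefficient variance is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing column j on the other columns. The sketch below is an illustration added here, not something the post itself uses; the helper name `vif` and the synthetic data are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # regress column j on the rest
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()       # R^2 of that auxiliary regression
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)              # unrelated to the others
X = np.column_stack([x1, x2, x3])

vif_values = vif(X)
print(vif_values)
```

The first two columns should show large values and the third a value close to 1. A frequently quoted rule of thumb treats a VIF above roughly 10 as a sign of problematic collinearity, though the threshold is a convention rather than a theorem.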

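Model evaluation

Two kinds of power are considered: explanatory power and predictive power.

Explanatory modeling looks at the relationship between the variables and the outcome. Researchers commonly report R² and the statistical significance of the overall F statistic to indicate the strength of the effect.

Predictive modeling focuses on predictive accuracy, that is, how f performs on new data. The appropriate metric depends on the task: ranking models and classification models, for example, call for different metrics.

Model selection

In explanatory modeling, candidate models are compared on explanatory power. Stepwise methods add or drop variables, with each step justified by a statistical criterion. Selection is mainly done with AIC and BIC.

AIC and BIC estimate different things. If the question of which estimator is better is to make sense, we must decide whether the average likelihood of a family [=BIC] or its predictive accuracy [=AIC] is what we want to estimate.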
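To make the difference between the two criteria tangible, here is a small sketch that computes both for least-squares fits under a Gaussian error assumption, using the usual forms n·ln(RSS/n) + 2k and n·ln(RSS/n) + k·ln(n), each up to a constant shared by all models on the same data. The helper name, the polynomial candidates, and the data are hypothetical.

```python
import numpy as np

def gaussian_aic_bic(y, y_hat, k):
    """AIC and BIC of a least-squares fit with k estimated coefficients,
    up to an additive constant shared by all models on the same data."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)            # the true relationship is linear

results = {}
for degree in (1, 5):                        # a simple and an over-flexible model
    coefs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coefs, x)
    results[degree] = gaussian_aic_bic(y, y_hat, k=degree + 1)
    print(degree, results[degree])
```

The two criteria share the goodness-of-fit term and differ only in the penalty: 2 per parameter for AIC versus ln(n) per parameter for BIC, so BIC penalizes extra parameters more heavily once n exceeds about 7.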

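2.7 Model use and reporting

Explanatory models are used to test existing causal theories, checking whether the statistical results support them.

In predictive modeling, f is typically used to predict new data. In practice, the goal is often for predictive models to support scientific research, for instance by generating new hypotheses that lead to new theory. Papers on explanatory modeling report on theory building, unobserved parameters, and statistical inference, while papers on predictive modeling focus on predictive power and on comparing the results of different models.

Summary:

(1) A modeling study needs a clearly stated goal to optimize for.

(2) Whether the goal is prediction or explanation, it is worth building both kinds of model so that each can validate the other.

For a predictive model, explanation may not be required, but being able to explain its purpose and significance is important.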

Reference:

[1] Shmueli G. To explain or to predict?[J]. Statistical science, 2010, 25(3): 289-310.

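Reference notes for a Renmin University assignment

Basic machine learning concepts

How do machine learning, statistical models, and data mining differ?

Machine learning and statistical modeling are not very different. Regression is the same in both and the underlying algorithms are largely shared; only the emphasis differs. From the statistical point of view, regression is mainly about the explanatory power of the model: the relationship between x and y, and above all the coefficients. From the machine learning point of view, the focus is predictive accuracy.

Machine learning and data mining are also much alike and use essentially the same algorithms. The difference lies in the workflow: data mining includes feature engineering work and interpretation of the results for the specific application.

What is the difference between supervised and unsupervised learning?

In supervised learning, the data include a y, called the target variable or the explained variable. In unsupervised learning there is no designated y, and the task is to find patterns or regularities in the x's.

Supervised learning splits into two subtypes: classification and regression. In classification the y to predict is discrete, such as sex or blood type. In regression y is continuous and real-valued, such as income or temperature.

What is the difference between classification and clustering?

Classification predicts which class an object of unknown class belongs to. Clustering partitions a group of objects according to chosen measures; it is not a prediction problem.

What is cross-validation?

Cross-validation means that the data used to build a model are not the same as the data used to validate it. In practice, the data are split into several parts. The simplest split has two parts, one for training the model and one for testing it. Strictly speaking that is a holdout split; k-fold cross-validation repeats it so that each part serves as the test set once.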
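The difference between a single holdout split and k-fold cross-validation can be sketched with `sklearn.model_selection`. This is an illustrative example on synthetic data; the sample sizes, noise level, and seeds are arbitrary choices, and it assumes a scikit-learn version that provides the `model_selection` module (0.18 or later).

```python
from sklearn import datasets, linear_model, model_selection

# Synthetic regression data; the sizes and seed are arbitrary choices.
X, y = datasets.make_regression(n_samples=100, n_features=5,
                                noise=10.0, random_state=0)

model = linear_model.LinearRegression()

# Holdout: one split, one score.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0)
holdout_r2 = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: five splits, five scores (R^2 by default for regressors).
cv_scores = model_selection.cross_val_score(model, X, y, cv=5)

print(holdout_r2)
print(cv_scores, cv_scores.mean())
```

The holdout score depends on which rows happen to land in the test set, while the spread of the five fold scores gives a rough sense of that variability.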

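When is feature engineering used?

Feature engineering means cleaning up and transforming the raw data, with the main aim of exposing the information that helps predict y.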

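How do you load sklearn's built-in datasets?

from sklearn import datasets
from sklearn import model_selection as cross_validation  # cross_validation was removed in scikit-learn 0.20; the alias keeps later calls unchanged
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn import neighbors
from sklearn import svm
from sklearn import ensemble
from sklearn import cluster

%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import seaborn as sns

sklearn ships with many built-in datasets, available through the datasets module imported above. Loading the Boston housing data, for example, returns a dictionary-like object. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this call requires an older version.

boston = datasets.load_boston()
print(boston.keys())

['data', 'feature_names', 'DESCR', 'target']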
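On scikit-learn versions where `load_boston` is no longer available, the same loading pattern can be tried on another bundled regression dataset. The sketch below uses `load_diabetes` purely as a stand-in; it is not the dataset the original walkthrough analyzes.

```python
from sklearn import datasets

# load_diabetes is used here only as a stand-in regression dataset.
diabetes = datasets.load_diabetes()

print(sorted(diabetes.keys()))
print(diabetes.data.shape)     # feature matrix
print(diabetes.target.shape)   # target vector
```

The returned object behaves like a dictionary and also exposes its entries as attributes, which is why both `diabetes.keys()` and `diabetes.data` work.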

How do you find out the size, format, and type of a dataset?

print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

boston.data # feature matrix: crime rate, rooms per dwelling, and so on

array([[  6.32000000e-03,   1.80000000e+01,   2.31000000e+00, ...,
          1.53000000e+01,   3.96900000e+02,   4.98000000e+00],
       [  2.73100000e-02,   0.00000000e+00,   7.07000000e+00, ...,
          1.78000000e+01,   3.96900000e+02,   9.14000000e+00],
       [  2.72900000e-02,   0.00000000e+00,   7.07000000e+00, ...,
          1.78000000e+01,   3.92830000e+02,   4.03000000e+00],
       ..., 
       [  6.07600000e-02,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.96900000e+02,   5.64000000e+00],
       [  1.09590000e-01,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.93450000e+02,   6.48000000e+00],
       [  4.74100000e-02,   0.00000000e+00,   1.19300000e+01, ...,
          2.10000000e+01,   3.96900000e+02,   7.88000000e+00]])

boston.data.shape

(506, 13)

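boston.target.shape # target variable: house price

(506,)

Besides the bundled datasets, the datasets module can also generate synthetic data, which is useful for studying how different algorithms behave.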

datasets.make_regression

<function sklearn.datasets.samples_generator.make_regression>

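Regression

In regression, y is a continuous value. Linear regression is one option, but decision trees and other methods can do regression too: any model with continuous output can serve as a regression model.

How do regression in sklearn and in statsmodels compare?

With the same model, both give the same solution. statsmodels prints more output and leans toward interpreting the parameters, while sklearn is geared toward predictive accuracy.

How do you use cross-validation, and how does it relate to overfitting?

Here is a complete example.

np.random.seed(123)
X_all, y_all = datasets.make_regression(n_samples=50, n_features=50, n_informative=10)  # only 10 features are informative; the other 40 are noise
print(X_all.shape, y_all.shape)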

(50, 50) (50,)

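This dataset is tricky: it has few samples and many variables, so overfitting is likely. Only a few variables are informative and the rest are noise, so a regression easily picks up noise terms. Traditional methods struggle here, so we use machine learning methods.

To guard against overfitting we hold out part of the data: one part is used for training and the other for validation.

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_all, y_all, train_size=0.5)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
print(type(X_train))

(25, 50) (25,)
(25, 50) (25,)
<type 'numpy.ndarray'>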

How do you fit a linear model to the data?

model = linear_model.LinearRegression() # instantiate the model
model.fit(X_train, y_train) # fit the linear regression to the training data

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

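What is a residual? How do you judge whether a model is good?

A residual is the difference between the true y and the predicted y. The smaller the residuals, the better the fit.

def sse(resid):
    return np.sum(resid**2) # sum of squared errors (SSE), one measure of regression quality

The sum of squared residuals is one measure of regression quality: the smaller it is, the better the model fits the data it was computed on.

resid_train = y_train - model.predict(X_train)
sse_train = sse(resid_train)
print(sse_train)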

5.87164948974e-25

resid_test = y_test - model.predict(X_test)
sse_test = sse(resid_test)
sse_test

194948.84691187815

model.score(X_train, y_train) # coefficient of determination, R-squared

1.0

model.score(X_test, y_test)

0.26275088549060643

The test-set score is only 0.26, which is poor.
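The score reported by `model.score` for a regressor is the coefficient of determination, which ties directly to the residual sum of squares used earlier: R² = 1 - SSE / SST. A minimal sketch with made-up numbers (the helper name `r_squared` and the arrays are hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    sse = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    sst = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1.0 - sse / sst

# Hypothetical values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

r2 = r_squared(y_true, y_pred)
print(r2)
```

Here SSE is 1 and SST is 20, so R² is about 0.95. A model that always predicts the mean of y scores 0, and on held-out data the value can even be negative when the model does worse than that baseline.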

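Plot the residual of each sample and the coefficient of each variable.

def plot_residuals_and_coeff(resid_train, resid_test, coeff):
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].bar(np.arange(len(resid_train)), resid_train) # residual of each training sample
    axes[0].set_xlabel("sample number")
    axes[0].set_ylabel("residual")
    axes[0].set_title("training data")
    axes[1].bar(np.arange(len(resid_test)), resid_test) # residual of each test sample
    axes[1].set_xlabel("sample number")
    axes[1].set_ylabel("residual")
    axes[1].set_title("testing data")
    axes[2].bar(np.arange(len(coeff)), coeff) # coefficient of each variable
    axes[2].set_xlabel("coefficient number")
    axes[2].set_ylabel("coefficient")
    fig.tight_layout()
    return fig, axes

fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_) # training residuals, test residuals, coefficient values

[Figure: training residuals, test residuals, and coefficients for the linear regression]

The training residuals in the first panel are small (the training SSE computed earlier is essentially zero), while the test residuals are far too large, ranging from about -150 to 250. The model is overfitted: it does fine on the training set and very badly on the test set.

Only 10 of the true variables are informative, yet many coefficients in the plot are non-zero, so there is a lot of redundancy. With many variables and few samples, what can be done?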

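What can you do when there are more variables than samples?

One approach is principal component analysis: reduce the dimension, screen the variables, and then fit the model. This is somewhat cumbersome.

Another approach is regularization.

What is regularization? What are the two common methods?

Regularization has two main variants in statistics. The first is ridge regression, which adds the coefficients to the loss function. The classic loss function is the residual sum of squares; ridge adds the sum of the squares of the 50 coefficients, so the fit must keep both the residual sum of squares and the weights small, which suppresses redundant weights.

model = linear_model.Ridge(alpha=5) # alpha sets the strength of the penalty on the coefficients
model.fit(X_train, y_train)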
model.fit(X_train, y_train)

Ridge(alpha=5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

resid_train = y_train - model.predict(X_train)
sse_train = np.sum(resid_train**2)
print(sse_train)

2963.35374445

resid_test = y_test - model.predict(X_test)
sse_test = np.sum(resid_test**2)
print(sse_test)

187177.590437

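The residual sum of squares is still fairly high.

model.score(X_train, y_train), model.score(X_test, y_test)

(0.99197132152011414, 0.29213988699168503)

The test-set R² was 0.26 before and is 0.29 here, a slight improvement.

fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_)

[Figure: training residuals, test residuals, and coefficients for ridge regression]

In the plot, the training residuals have grown and the test residuals have shrunk a little, while the coefficients have not improved much. Ridge regression helps slightly here, but not noticeably.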
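The ridge penalty has a closed-form solution, which makes it easy to see why it shrinks coefficients without zeroing them. The sketch below is an illustration with plain NumPy on synthetic data and without an intercept term; it is not how `linear_model.Ridge` is implemented internally, and the helper name is hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - X w||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)   # alpha > 0 makes A invertible
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 50))               # fewer samples than features
w_true = np.zeros(50)
w_true[:10] = rng.normal(size=10)           # only 10 informative features
y = X @ w_true + 0.1 * rng.normal(size=25)

w_small = ridge_closed_form(X, y, alpha=0.1)
w_large = ridge_closed_form(X, y, alpha=100.0)

print(np.linalg.norm(w_small), np.linalg.norm(w_large))
print(np.count_nonzero(w_large))            # ridge shrinks but does not produce exact zeros
```

A larger alpha gives a coefficient vector with a smaller norm, yet all 50 coefficients stay non-zero, which is consistent with the coefficient plot changing little under ridge.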

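Next we use the other regularization method, the lasso. The idea is the same as ridge regression: both the residual sum of squares and the coefficients go into the loss function, so that both are kept small. The lasso formula differs slightly: it sums the absolute values of the weights, whereas ridge sums their squares. That small difference is enough to drive many weights to exactly 0. Let's look at the result.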
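Written out, the two objectives differ only in the norm used for the penalty. The forms below follow the scikit-learn documentation for `Ridge` and `Lasso`; the symbols are introduced here for illustration, with w the coefficient vector, n the number of samples, and α the penalty strength (the `alpha` argument).

```latex
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

Because the two losses are scaled differently, the same numeric alpha does not mean the same amount of regularization in both models.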

model = linear_model.Lasso(alpha=1.0)
model.fit(X_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

resid_train = y_train - model.predict(X_train)
sse_train = sse(resid_train)
print(sse_train)

256.539066413

resid_test = y_test - model.predict(X_test)
sse_test = sse(resid_test)
print(sse_test)

691.523154567

fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_)

[Figure: training residuals, test residuals, and coefficients for the lasso]

The test residuals are clearly smaller, now ranging from about -10 to 15. Many coefficients have become 0, and only a few are actually active. This is the strength of the lasso: it copes with noisy settings, and for high-dimensional, noisy data it performs variable selection while fitting the model.

alphas = np.logspace(-4, 2, 100) # try 100 different values of alpha

coeffs = np.zeros((len(alphas), X_train.shape[1]))
sse_train = np.zeros_like(alphas)
sse_test = np.zeros_like(alphas)

for n, alpha in enumerate(alphas):
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    coeffs[n, :] = model.coef_
    resid = y_train - model.predict(X_train)
    sse_train[n] = np.sum(resid**2)
    resid = y_test - model.predict(X_test)
    sse_test[n] = np.sum(resid**2)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharex=True)

for n in range(coeffs.shape[1]):
    axes[0].plot(np.log10(alphas), coeffs[:, n], color='k', lw=0.5)

axes[1].semilogy(np.log10(alphas), sse_train, label="train")
axes[1].semilogy(np.log10(alphas), sse_test, label="test")
axes[1].legend(loc=0)

axes[0].set_xlabel(r"$\log_{10}\alpha$", fontsize=18)
axes[0].set_ylabel(r"coefficients", fontsize=18)
axes[1].set_xlabel(r"$\log_{10}\alpha$", fontsize=18)
axes[1].set_ylabel(r"sse", fontsize=18)
fig.tight_layout()

[Figure: lasso coefficients (left) and training/test SSE (right) as functions of log10(alpha)]

As alpha approaches 0, the loss function contains no weight term, that is, no penalty, and the regression gives the same result as the unregularized fit earlier.

As alpha grows, many noise factors drop to 0, which amounts to variable selection.

Our goal is the best predictive performance, so we naturally pick the alpha at which the test-set residual sum of squares is smallest.

To find that point in practice, use LassoCV, which picks the best alpha automatically.

model = linear_model.LassoCV() # tries a grid of alpha values using cross-validation

model.fit(X_all, y_all)

LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)

model.alpha_ # the alpha selected automatically

0.06559238747534718

resid_train = y_train - model.predict(X_train)
sse_train = sse(resid_train)
print(sse_train)

1.76481994041

resid_test = y_test - model.predict(X_test)
sse_test = sse(resid_test)
print(sse_test)

1.31238073253

The results look very good. Note, however, that LassoCV was fitted on X_all, which contains the test rows, so the test-set figures here are optimistic.

model.score(X_train, y_train), model.score(X_test, y_test)

(0.9999952185351132, 0.99999503689532787)

fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_)

[Figure: training residuals, test residuals, and coefficients for LassoCV]

Both the training and test residuals are small, and the coefficients of the irrelevant variables have been shrunk to 0.
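For an evaluation that keeps the test rows out of the alpha search, LassoCV can be fitted on the training portion only. The following is a sketch under assumptions: the synthetic data, sample sizes, noise level, and seeds are arbitrary choices made here, not the settings of the walkthrough.

```python
from sklearn import datasets, linear_model, model_selection

# Synthetic data: sizes, noise level, and seeds are arbitrary choices.
X, y = datasets.make_regression(n_samples=200, n_features=50, n_informative=10,
                                noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.5, random_state=0)

# The cross-validation used to pick alpha happens inside the training set only.
model = linear_model.LassoCV(cv=5)
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)          # evaluated on untouched data
n_selected = int((model.coef_ != 0).sum())     # number of non-zero coefficients

print(model.alpha_, test_r2, n_selected)
```

Keeping the test rows out of both the fit and the alpha search means the reported test score reflects data the model has never seen.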

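Classification

How does classification differ from regression?

Classification and regression both make predictions but have different targets. In regression y is a continuous value; in classification the predicted y is discrete. Predicting whether it will rain tomorrow, for example, has two outcomes, rain or no rain. That is a binary classification problem and can be coded as 0 and 1. Recognizing which Arabic numeral an image shows has 10 possible classes, 0, 1, ..., 9, and is a multiclass classification problem.

The logistic regression used earlier with statsmodels is a classification model. Here we use more of the classifiers in sklearn.

Which classification models does sklearn provide?

Generalized linear models
- Ridge regression
- Logistic regression
- Bayesian regression
Support vector machines
Nearest neighbors
Naive Bayes
Decision trees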

iris = datasets.load_iris() # load the iris dataset

print(iris.target_names)
print(iris.feature_names)

['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print(iris.data.shape)
print(iris.target.shape)

(150, 4)
(150,)

X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, train_size=0.7) # 70% for training, 30% for testing

classifier = linear_model.LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

y_test_pred = classifier.predict(X_test)

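How do you evaluate a classifier? What is a confusion matrix?

The metrics module is used to check model performance. Its classification_report prints several metrics that measure how well the model does.

print(metrics.classification_report(y_test, y_test_pred)) # true y and predicted y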

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        15
          1       1.00      0.75      0.86        16
          2       0.78      1.00      0.88        14

avg / total       0.93      0.91      0.91        45

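precision is the precision, recall is the recall, and f1-score is the F1 value. These values show that the model performs well overall, although recall for class 1 is only 0.75 and precision for class 2 is 0.78.

A confusion matrix can also be used to evaluate a classifier. Each column corresponds to a predicted class, and the column total is the number of samples predicted as that class. Each row corresponds to a true class, and the row total is the number of samples that actually belong to that class. Each entry is the number of samples of the row's true class that were predicted as the column's class. If all non-zero entries lie on the diagonal, every prediction is correct.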

metrics.confusion_matrix(y_test, y_test_pred)

array([[15,  0,  0],
       [ 0, 12,  4],
       [ 0,  0, 14]])
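The numbers in the classification report can be recomputed by hand from this confusion matrix, which is a useful check on how the two views relate. A small sketch with NumPy, copying the matrix printed above:

```python
import numpy as np

# Confusion matrix copied from the logistic regression output above:
# rows are true classes, columns are predicted classes.
cm = np.array([[15,  0,  0],
               [ 0, 12,  4],
               [ 0,  0, 14]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # divide by column totals
recall = true_positives / cm.sum(axis=1)      # divide by row totals
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(accuracy, 2))
```

The four versicolor samples predicted as virginica lower the recall of class 1 to 12/16 and the precision of class 2 to 14/18, matching the 0.75 and 0.78 in the report.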

y_test.shape

(45,)

classifier = tree.DecisionTreeClassifier() # decision tree
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[12,  0,  0],
       [ 0, 13,  2],
       [ 0,  2, 16]])

classifier = neighbors.KNeighborsClassifier() # k-nearest neighbors
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[12,  0,  0],
       [ 0, 14,  1],
       [ 0,  2, 16]])

classifier = svm.SVC() # support vector machine
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[12,  0,  0],
       [ 0, 15,  0],
       [ 0,  3, 15]])

classifier = ensemble.RandomForestClassifier()
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[12,  0,  0],
       [ 0, 14,  1],
       [ 0,  2, 16]])

train_size_vec = np.linspace(0.1, 0.9, 30) # try different training-set proportions

classifiers = [tree.DecisionTreeClassifier,
               neighbors.KNeighborsClassifier,
               svm.SVC,
               ensemble.RandomForestClassifier
              ]

cm_diags = np.zeros((3, len(train_size_vec), len(classifiers)), dtype=float) # holds the results

for n, train_size in enumerate(train_size_vec):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        iris.data, iris.target, train_size=train_size)

    for m, Classifier in enumerate(classifiers): 
        classifier = Classifier()
        classifier.fit(X_train, y_train)
        y_test_pred = classifier.predict(X_test)
        cm_diags[:, n, m] = metrics.confusion_matrix(y_test, y_test_pred).diagonal()
        cm_diags[:, n, m] /= np.bincount(y_test)

fig, axes = plt.subplots(1, len(classifiers), figsize=(12, 3))

for m, Classifier in enumerate(classifiers): 
    axes[m].plot(train_size_vec, cm_diags[2, :, m], label=iris.target_names[2])
    axes[m].plot(train_size_vec, cm_diags[1, :, m], label=iris.target_names[1])
    axes[m].plot(train_size_vec, cm_diags[0, :, m], label=iris.target_names[0])
    axes[m].set_title(type(Classifier()).__name__)
    axes[m].set_ylim(0, 1.1)
    axes[m].set_xlim(0.1, 0.9)
    axes[m].set_ylabel("classification accuracy")
    axes[m].set_xlabel("training size ratio")
    axes[m].legend(loc=4)

fig.tight_layout()

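[Figure: per-class classification accuracy versus training size ratio for each classifier]

Does sample size affect classification results?

The plot shows that when the training sample is too small, some models perform poorly, for example the decision tree, k-nearest neighbors, and SVC. KNN and SVM depend heavily on the amount of data. As the train size grows, classification accuracy becomes fairly high.

Clustering

Clustering is an unsupervised learning method.

What are the use cases for clustering?

It mainly addresses the problem of dividing a set of objects into groups. One example is customer segmentation: choose several measures and cluster the customers into groups so that features are similar within a group and clearly different between groups.

What is the most widely used clustering method?

K-means clustering.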

X, y = iris.data, iris.target
np.random.seed(123)
n_clusters = 3 # other values can be tried
c = cluster.KMeans(n_clusters=n_clusters) # instantiate
c.fit(X) # note that fit takes no y here

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

y_pred = c.predict(X)
print(y_pred[::8])
print(y[::8])

[1 1 1 1 1 1 1 2 2 2 2 2 2 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2]

The cluster labels come out as 1, 2, 0 while the true labels are 0, 1, 2. To compare against the true values, the labels need to be remapped.

idx_0, idx_1, idx_2 = (np.where(y_pred == n) for n in range(3)) # remap the cluster labels
y_pred[idx_0], y_pred[idx_1], y_pred[idx_2] = 2, 0, 1
print(y_pred[::8])
print(y[::8])

[0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2]
[0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2]
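The hand-written mapping above depends on the particular label order K-means happened to produce. One way to avoid hard-coding it is a majority vote: relabel each cluster with the most common true label among its members. The sketch below is an illustration on toy arrays; the helper name `remap_clusters` is hypothetical and the approach assumes true labels are available, which is only the case in a demonstration like this one.

```python
import numpy as np

def remap_clusters(y_true, y_cluster):
    """Relabel each cluster with the most frequent true label among its members."""
    y_mapped = np.empty_like(y_true)
    for cluster_id in np.unique(y_cluster):
        members = (y_cluster == cluster_id)
        majority = np.bincount(y_true[members]).argmax()
        y_mapped[members] = majority
    return y_mapped

# Toy labels mimicking the 1, 2, 0 ordering seen above, with one misassigned point.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_cluster = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

y_mapped = remap_clusters(y_true, y_cluster)
print(y_mapped)
```

A majority vote can send two clusters to the same label when the clustering is poor. A one-to-one assignment, for example via the Hungarian algorithm, avoids that at the cost of an extra dependency.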

print(metrics.confusion_matrix(y, y_pred)) # in a real clustering task there is no confusion matrix, because there is no true y

[[50  0  0]
 [ 0 48  2]
 [ 0 14 36]]
