python多元线性回归怎么计算

Posted 2023-03-17

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了python多元线性回归怎么计算相关的知识，希望对你有一定的参考价值。

参考技术A

1、什么是多元线性回归模型？

当y值的影响因素不唯一时,采用多元线性回归模型。

y =y=β0+β1x1+β2x2+...+βnxn

例如商品的销售额可能不电视广告投入,收音机广告投入,报纸广告投入有关系,可以有 sales =β0+β1*TV+β2* radio+β3*newspaper.

2、使用pandas来读取数据

pandas 是一个用于数据探索、数据分析和数据处理的python库

[python] view plain copy

import pandas as pd

[html] view plain copy

<pre name="code" class="python"># read csv file directly from a URL and save the results

data = pd.read_csv('/home/lulei/Advertising.csv')

# display the first 5 rows

data.head()

上面代码的运行结果：

上面显示的结果类似一个电子表格，这个结构称为Pandas的数据帧(data frame)，类型全称：pandas.core.frame.DataFrame.

pandas的两个主要数据结构：Series和DataFrame：

Series类似于一维数组，它有一组数据以及一组与之相关的数据标签(即索引)组成。

DataFrame是一个表格型的数据结构，它含有一组有序的列，每列可以是不同的值类型。DataFrame既有行索引也有列索引，它可以被看做由Series组成的字典。

[python] view plain copy

# display the last 5 rows

data.tail()

[html] view plain copy

# check the shape of the DataFrame(rows, colums)

data.shape

(200,4)

3、分析数据

特征：

TV：对于一个给定市场中单一产品，用于电视上的广告费用（以千为单位）

Radio：在广播媒体上投资的广告费用

Newspaper：用于报纸媒体的广告费用

响应：

Sales：对应产品的销量

在这个案例中，我们通过不同的广告投入，预测产品销量。因为响应变量是一个连续的值，所以这个问题是一个回归问题。数据集一共有200个观测值，每一组观测对应一个市场的情况。

注意：这里推荐使用的是seaborn包。网上说这个包的数据可视化效果比较好看。其实seaborn也应该属于matplotlib的内部包。只是需要再次的单独安装。

[python] view plain copy

import seaborn as sns

import matplotlib.pyplot as plt

# visualize the relationship between the features and the response using scatterplots

sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', size=7, aspect=0.8)

plt.show()#注意必须加上这一句，否则无法显示。

[html] view plain copy

这里选择TV、Radio、Newspaper 作为特征，Sales作为观测值

[html] view plain copy

返回的结果：

[python] view plain copy

sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', size=7, aspect=0.8, kind='reg')

plt.show()

4、线性回归模型

优点：快速；没有调节参数；可轻易解释；可理解。

缺点：相比其他复杂一些的模型，其预测准确率不是太高，因为它假设特征和响应之间存在确定的线性关系，这种假设对于非线性的关系，线性回归模型显然不能很好的对这种数据建模。

线性模型表达式： y=β0+β1x1+β2x2+...+βnxn 其中

y是响应

β0是截距

β1是x1的系数，以此类推

在这个案例中： y=β0+β1∗TV+β2∗Radio+...+βn∗Newspaper

scikit-learn要求X是一个特征矩阵，y是一个NumPy向量。

pandas构建在NumPy之上。

因此，X可以是pandas的DataFrame，y可以是pandas的Series，scikit-learn可以理解这种结构。

[python] view plain copy

#create a python list of feature names

feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the original DataFrame

X = data[feature_cols]

# equivalent command to do this in one line

X = data[['TV', 'Radio', 'Newspaper']]

# print the first 5 rows

print X.head()

# check the type and shape of X

print type(X)

print X.shape

[python] view plain copy

# select a Series from the DataFrame

y = data['Sales']

# equivalent command that works if there are no spaces in the column name

y = data.Sales

# print the first 5 values

print y.head()

（2）、构建训练集与测试集

[html] view plain copy

<pre name="code" class="python"><span style="font-size:14px;">##构造训练集和测试集

from sklearn.cross_validation import train_test_split #这里是引用了交叉验证

X_train,X_test, y_train, y_test = train_test_split(X, y, random_state=1)

[html] view plain copy

print X_train.shape

print y_train.shape

print X_test.shape

print y_test.shape

注：上面的结果是由train_test_spilit()得到的，但是我不知道为什么我的版本的sklearn包中居然报错：

处理方法：1、我后来重新安装sklearn包。再一次调用时就没有错误了。

2、自己写函数来认为的随机构造训练集和测试集。(这个代码我会在最后附上。)

[html] view plain copy

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()

model=linreg.fit(X_train, y_train)

print model

print linreg.intercept_

print linreg.coef_

[html] view plain copy

# pair the feature names with the coefficients

zip(feature_cols, linreg.coef_)

对于给定了Radio和Newspaper的广告投入，如果在TV广告上每多投入1个单位，对应销量将增加0.0466个单位。就是加入其它两个媒体投入固定，在TV广告上每增加1000美元（因为单位是1000美元），销量将增加46.6（因为单位是1000）。但是大家注意这里的newspaper的系数居然是负数，所以我们可以考虑不使用newspaper这个特征。这是后话，后面会提到的。

（4）、预测

[python] view plain copy

y_pred = linreg.predict(X_test)

print y_pred

[python] view plain copy

print type(y_pred)

5、回归问题的评价测度

(1) 评价测度

对于分类问题，评价测度是准确率，但这种方法不适用于回归问题。我们使用针对连续数值的评价测度(evaluation metrics)。
这里介绍3种常用的针对线性回归的测度。

1)平均绝对误差(Mean Absolute Error, MAE)

(2)均方误差(Mean Squared Error, MSE)

(3)均方根误差(Root Mean Squared Error, RMSE)

这里我使用RMES。

[python] view plain copy

<pre name="code" class="python">#计算Sales预测的RMSE

print type(y_pred),type(y_test)

print len(y_pred),len(y_test)

print y_pred.shape,y_test.shape

from sklearn import metrics

import numpy as np

sum_mean=0

for i in range(len(y_pred)):

sum_mean+=(y_pred[i]-y_test.values[i])**2

sum_erro=np.sqrt(sum_mean/50)

# calculate RMSE by hand

print "RMSE by hand:",sum_erro

[python] view plain copy

import matplotlib.pyplot as plt

plt.figure()

plt.plot(range(len(y_pred)),y_pred,'b',label="predict")

plt.plot(range(len(y_pred)),y_test,'r',label="test")

plt.legend(loc="upper right") #显示图中的标签

plt.xlabel("the number of sales")

plt.ylabel('value of sales')

plt.show()

直到这里整个的一次多元线性回归的预测就结束了。

6、改进特征的选择
在之前展示的数据中，我们看到Newspaper和销量之间的线性关系竟是负关系（不用惊讶，这是随机特征抽样的结果。换一批抽样的数据就可能为正了），现在我们移除这个特征，看看线性回归预测的结果的RMSE如何？

依然使用我上面的代码，但只需修改下面代码中的一句即可：

[python] view plain copy

#create a python list of feature names

feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the original DataFrame

X = data[feature_cols]

# equivalent command to do this in one line

#X = data[['TV', 'Radio', 'Newspaper']]#只需修改这里即可<pre name="code" class="python" style="font-size: 15px; line-height: 35px;">X = data[['TV', 'Radio']] #去掉newspaper其他的代码不变

最后的到的系数与测度如下：

LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

备注：

注：上面的结果是由train_test_spilit()得到的，但是我不知道为什么我的版本的sklearn包中居然报错：

处理方法：1、我后来重新安装sklearn包。再一次调用时就没有错误了。

2、自己写函数来认为的随机构造训练集和测试集。(这个代码我会在最后附上。)

[python] view plain copy

import random

[python] view plain copy

<span style="font-family:microsoft yahei;">######自己写一个随机分配数的函数，分成两份，并将数值一次存储在对应的list中##########

def train_test_split(ylabel, random_state=1):

import random