The Watermelon Book and Me 2 -- Linear Models

Posted by ssozhno-1


The previous chapter covered basic concepts and the ideas behind model selection and evaluation. This chapter studies linear regression and puts it into practice, combining it with that model selection and evaluation knowledge.

 

1. Linear regression (LinearRegression), also known as ordinary least squares (OLS)
The mathematical details of least squares will not be repeated here. One point worth noting: linear regression has no hyperparameters to tune. That is an advantage, but it also means there is no way to control the model's complexity.
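For reference, the standard OLS formulation (not specific to this post) chooses the weights w and intercept b that minimize the squared error over the n training samples:

```latex
\min_{w,\,b} \; \sum_{i=1}^{n} \left( w^{\mathsf{T}} x_i + b - y_i \right)^2
```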

Implementation steps:
First, split the data:
In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function:
train_test_split(X, y, test_size, train_size, random_state)
X, y: arrays of samples and targets.
test_size: should be between 0 and 1; by default test_size=0.25.
train_size: if None, the value is automatically set to the complement of the test size.
random_state: an int seed that makes the split reproducible.
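A minimal sketch of the split on toy data (the arrays here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 targets

# hold out 25% of the samples for testing, with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```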
Fit the training set:
lr = linear_model.LinearRegression().fit(X_train, y_train)
The "slope" parameters (w, also called weights or coefficients) are stored in the coef_ attribute (a NumPy array), while the offset or intercept (b) is stored in the intercept_ attribute (a float):
print("lr.coef_:", lr.coef_)
print("lr.intercept_:", lr.intercept_)
Note: the trailing underscore in coef_ and intercept_ follows a scikit-learn convention: values derived from the training data are stored in attributes ending with an underscore, which distinguishes them from parameters set by the user.

Next we check performance on the training and test sets:
print("Training set score:{:.2f}".format(lr.score(X_train,y_train)))
print("Test set score:{:.2f}".format(lr.score(X_test,y_test)))
Both scores come out around 0.66. This score is the coefficient of determination R², which measures how well the predictions track the true outputs; R²=1 (a perfect linear fit y=kx+b) is the best possible. An R² of about 0.66 on both sets means the model performs poorly on both the training and test data, i.e. it is underfitting.
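For reference, the score used above follows the standard definition of the coefficient of determination, where \hat{y}_i are the predictions and \bar{y} is the mean of the true targets:

```latex
R^2 = 1 - \frac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \bar{y} \right)^2}
```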
(Note: with a fixed random_state the evaluation result is fixed; to get an averaged result, repeat the evaluation with different values of random_state.)

The other situation is when the training result is good but the test result is poor, for example:
Training set score: 0.95
Test set score: 0.61
A performance gap like this between the training and test sets is a clear sign of overfitting, so we should look for a model that lets us control complexity. The most common alternative to standard linear regression is ridge regression.
Ridge regression uses what is known as L2 regularization.
It is implemented in linear_model.Ridge. Ridge scores lower than LinearRegression on the training set but higher on the test set: because ridge is a more constrained model, it is less prone to overfitting.
The Ridge model trades off model simplicity (coefficients close to 0) against training-set performance. How much weight each side carries can be specified by the user through the alpha parameter, which defaults to 1.0. With alpha=0, Ridge behaves like LinearRegression.
Mathematically, Ridge penalizes the L2 norm of the coefficients, i.e. the Euclidean length of w.
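A quick sketch of that trade-off on synthetic data (the data and alpha values here are made up for illustration, not the extended Boston set used below): as alpha grows, the Euclidean length of the coefficient vector shrinks.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(60, 8)                        # synthetic features
y = X @ rng.randn(8) + 0.5 * rng.randn(60)  # noisy linear targets

# fit Ridge with increasing regularization strength
norms = []
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(ridge.coef_))  # Euclidean length of w

print(norms)  # the norm shrinks as alpha increases
```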

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets,linear_model
import mglearn
################################################################################
# X,y=mglearn.datasets.make_wave(n_samples=70)
# j=range(1,6)
# Training_score=[]
# Testing_score=[]
# for i in j:
#     X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=(20*i))
#     lr= linear_model.LinearRegression()
#     lrf=lr.fit(X_train,y_train)
#     y_predict=lr.predict(X_test)
#     #print("lr.coef_:{}".format(lrf.coef_))
#     #print("lr.intercept_:{:.2f}".format(lrf.intercept_))
#     plt.plot(X_test,y_predict,label=i)
#     Training_score.append(lrf.score(X_train,y_train))
#     Testing_score.append(lrf.score(X_test,y_test))
# plt.scatter(X_test,y_test)
# Training_average_score=sum(Training_score)/len(Training_score)
# Testing_average_score =sum(Testing_score)/len(Testing_score)
# print("Training_average_score:",Training_average_score)
# print("Testing_average_score:",Testing_average_score)
# plt.legend()
# plt.show()
#
#
#     # Training_average_score: 0.6820603963437258
#     # Testing_average_score: 0.5628366479885483
# ################################################################################
##without regularization
#
# X,y=mglearn.datasets.load_extended_boston()
# j=range(1,100)
# Training_score=[]
# Testing_score=[]
# lr=linear_model.LinearRegression()
# # print("X_test.shape :",X_test.shape,"y_predict shape:",y_predict.shape)   #X_test.shape : (127, 104) y_predict shape: (127,)
#
# for i in j:
#     X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=i)
#     lrf=lr.fit(X_train,y_train)
#     y_predict=lr.predict(X_test)
#     Training_score.append(lrf.score(X_train,y_train))
#     Testing_score.append(lrf.score(X_test,y_test))
# Training_average_score=sum(Training_score)/len(Training_score)
# Testing_average_score =sum(Testing_score)/len(Testing_score)
#     # print("coef_:",lrf.coef_)
# print("Training_average_score:",Training_average_score)
# print("Testing_average_score:",Testing_average_score)
#
#
# ##Training_average_score: 0.9368326849685069
# ##Testing_average_score: 0.7915812891905217
# ################################################################################
##Ridge Regression
X,y=mglearn.datasets.load_extended_boston()
j=range(0,100)
Training_score=[]
Testing_score=[]
# Ridge=linear_model.Ridge(normalize=True)  # Training_average_score: 0.7854478510103376; Testing_average_score: 0.751578166025869
# (note: the normalize parameter was deprecated and later removed from scikit-learn; scale the features beforehand instead)
Ridge=linear_model.Ridge()
for i in j:
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=i)
    Ridgef=Ridge.fit(X_train,y_train)
    y_predict=Ridgef.predict(X_test)
    Training_score.append(Ridgef.score(X_train,y_train))
    Testing_score.append(Ridgef.score(X_test,y_test))
# print("coef_:",Ridgef.coef_)
Training_average_score=sum(Training_score)/len(Training_score)
Testing_average_score =sum(Testing_score)/len(Testing_score)
print("Training_average_score:",Training_average_score)
print("Testing_average_score:",Testing_average_score)
#Training_average_score: 0.8628368582406927
#Testing_average_score: 0.8224515767272093
# ################################################################################
