The Watermelon Book and Me, Part 2: Linear Models
Posted by ssozhno-1
The previous chapter covered the basic concepts and the ideas behind model selection and evaluation. This chapter studies linear regression and puts it into practice, combining it with what we learned about model selection and evaluation.
1. Linear Regression (LinearRegression), fitted by ordinary least squares (OLS)
The mathematical details of least squares will not be repeated here. One additional point worth noting: linear regression has no hyperparameters, which is an advantage, but it also means there is no way to control model complexity.
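For reference, even though the derivation is skipped here: OLS chooses the weights w and intercept b that minimize the sum of squared residuals, and for small problems this can be solved in closed form (the normal equation). A minimal NumPy sketch on synthetic data (the numbers below are illustrative, not from this post):

import numpy as np

# Synthetic data: y = 2x + 1 plus noise (illustrative values)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.3, size=50)

# Append a column of ones so the intercept b is learned as an extra weight
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation w = (X^T X)^{-1} X^T y; lstsq is the numerically stable route
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("slope w:", w[0], "intercept b:", w[1])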
Concrete implementation steps:
First, split the data into training and test sets:
In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function (a usage sketch follows this parameter list):
train_test_split(X, y, test_size, train_size, random_state)
X, y: the input arrays (features and targets).
test_size: a float between 0.0 and 1.0, the fraction of the data held out for testing; by default, test_size=0.25.
train_size: if None, the value is automatically set to the complement of the test size.
random_state: an int; it seeds the shuffling, making the split reproducible.
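A minimal usage sketch, assuming the wave dataset from mglearn (the same dataset the full script at the end of this post loads):

import mglearn
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=70)
# Hold out 25% of the samples; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)  # 52 training samples, 18 test samples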
Then fit the model on the training set:
lr = linear_model.LinearRegression().fit(X_train, y_train)
The "slope" parameters (w, also called weights or coefficients) are stored in the coef_ attribute (a NumPy array), while the offset or intercept (b) is stored in the intercept_ attribute (a float):
print("lr.coef_:",lr.coef_)
print("lr.intercept_:",lr.intercept_)
Note: the trailing underscore in coef_ and intercept_ is a scikit-learn convention: values derived from the training data are stored in attributes ending with an underscore, to distinguish them from parameters set by the user.
Next we check performance on the training and test sets:
print("Training set score:{:.2f}".format(lr.score(X_train,y_train)))
print("Testing set score:",lr.score(X_test,y_test))
Both scores come out around 0.66. The score returned here is R², the coefficient of determination (for simple linear regression with an intercept, this equals the squared correlation between predictions and targets); it measures how well the predicted outputs track the actual outputs. R² = 1 means a perfect fit (the data lie exactly on the line y = kx + b), so R² ≈ 0.66 on both the training and the test set means the model does poorly on both: it is underfitting.
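To make the score concrete, here is a small sketch (an addition, reusing the lr, X_test, y_test names from the running example) that computes R² from its standard definition and can be checked against score():

import numpy as np

# R^2 = 1 - SS_res / SS_tot; this is what score() returns for regressors
y_pred = lr.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)  # should match lr.score(X_test, y_test)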
(A note on testing: for a fixed random_state the evaluation result is deterministic, so to obtain an averaged result you should repeat the split with different random_state values.)
The other situation is when the training result is very good but the test result is poor, for example:
Training set score: 0.95
Test set score: 0.61
A performance gap like this between the training and test sets is a clear sign of overfitting, so we should look for a model that lets us control complexity. The most commonly used alternative to standard linear regression is ridge regression.
Ridge regression uses what is known as L2 regularization.
Ridge regression is implemented in linear_model.Ridge. Ridge scores lower than LinearRegression on the training set, but higher on the test set: because Ridge is a more constrained model, it is less prone to overfitting.
The Ridge model trades off the simplicity of the model (coefficients close to zero) against training-set performance. How much weight each side gets can be specified by the user through the alpha parameter, which defaults to 1.0; with alpha=0, Ridge behaves the same as LinearRegression.
Mathematically, Ridge penalizes the L2 norm of the coefficients, i.e. the Euclidean length of w: it minimizes the squared training error plus alpha * ||w||².
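A minimal sketch (not from the original script) of how alpha steers this trade-off, reusing the extended Boston data that the full script below also loads: a larger alpha shrinks the coefficients harder, lowering the training score but often helping the test score.

import mglearn
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for alpha in [0.1, 1.0, 10.0]:
    ridge = linear_model.Ridge(alpha=alpha).fit(X_train, y_train)
    # Larger alpha -> smaller ||w||, i.e. a simpler, more constrained model
    print("alpha={:>4}: train={:.2f} test={:.2f} ||w||={:.1f}".format(
        alpha, ridge.score(X_train, y_train),
        ridge.score(X_test, y_test), np.linalg.norm(ridge.coef_)))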
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model
import mglearn

################################################################################
# Experiment 1: LinearRegression on the wave dataset, averaged over 5 splits
# X, y = mglearn.datasets.make_wave(n_samples=70)
# j = range(1, 6)
# Training_score = []
# Testing_score = []
# for i in j:
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=(20 * i))
#     lr = linear_model.LinearRegression()
#     lrf = lr.fit(X_train, y_train)
#     y_predict = lr.predict(X_test)
#     # print("lr.coef_:{}".format(lrf.coef_))
#     # print("lr.intercept_:{:.2f}".format(lrf.intercept_))
#     plt.plot(X_test, y_predict, label=i)
#     Training_score.append(lrf.score(X_train, y_train))
#     Testing_score.append(lrf.score(X_test, y_test))
# plt.scatter(X_test, y_test)
# Training_average_score = sum(Training_score) / len(Training_score)
# Testing_average_score = sum(Testing_score) / len(Testing_score)
# print("Training_average_score:", Training_average_score)
# print("Testing_average_score:", Testing_average_score)
# plt.legend()
# plt.show()
#
# Training_average_score: 0.6820603963437258
# Testing_average_score: 0.5628366479885483
################################################################################
# Experiment 2: LinearRegression on extended Boston, without regularization
# X, y = mglearn.datasets.load_extended_boston()
# j = range(1, 100)
# Training_score = []
# Testing_score = []
# lr = linear_model.LinearRegression()
# # print("X_test.shape :", X_test.shape, "y_predict shape:", y_predict.shape)  # X_test.shape : (127, 104) y_predict shape: (127,)
# for i in j:
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=i)
#     lrf = lr.fit(X_train, y_train)
#     y_predict = lr.predict(X_test)
#     Training_score.append(lrf.score(X_train, y_train))
#     Testing_score.append(lrf.score(X_test, y_test))
# Training_average_score = sum(Training_score) / len(Training_score)
# Testing_average_score = sum(Testing_score) / len(Testing_score)
# # print("coef_:", lrf.coef_)
# print("Training_average_score:", Training_average_score)
# print("Testing_average_score:", Testing_average_score)
#
# Training_average_score: 0.9368326849685069
# Testing_average_score: 0.7915812891905217
################################################################################
# Experiment 3: Ridge regression on extended Boston, averaged over 100 splits
X, y = mglearn.datasets.load_extended_boston()
j = range(0, 100)
Training_score = []
Testing_score = []
# Ridge = linear_model.Ridge(normalize=True)  # Training_average_score: 0.7854478510103376; Testing_average_score: 0.751578166025869
Ridge = linear_model.Ridge()
for i in j:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=i)
    Ridgef = Ridge.fit(X_train, y_train)
    y_predict = Ridgef.predict(X_test)
    Training_score.append(Ridgef.score(X_train, y_train))
    Testing_score.append(Ridgef.score(X_test, y_test))
    # print("coef_:", Ridgef.coef_)
Training_average_score = sum(Training_score) / len(Training_score)
Testing_average_score = sum(Testing_score) / len(Testing_score)
print("Training_average_score:", Training_average_score)
print("Testing_average_score:", Testing_average_score)
# Training_average_score: 0.8628368582406927
# Testing_average_score: 0.8224515767272093
################################################################################