Different scores when using scikit-learn pipeline vs. doing it manually
[Posted]: 2019-12-05 08:33:22 [Question]: Below is a simple example that uses MinMaxScaler, PolynomialFeatures, and a LinearRegression regressor.
Via a pipeline:
pipeLine = make_pipeline(MinMaxScaler(),PolynomialFeatures(), LinearRegression())
pipeLine.fit(X_train,Y_train)
print(pipeLine.score(X_test,Y_test))
print(pipeLine.steps[2][1].intercept_)
print(pipeLine.steps[2][1].coef_)
0.4433729905419167
3.4067909278765605
[ 0. -7.60868833 5.87162697]
Doing it manually:
X_trainScaled = MinMaxScaler().fit_transform(X_train)
X_trainScaledandPoly = PolynomialFeatures().fit_transform(X_trainScaled)
X_testScaled = MinMaxScaler().fit_transform(X_test)
X_testScaledandPoly = PolynomialFeatures().fit_transform(X_testScaled)
reg = LinearRegression()
reg.fit(X_trainScaledandPoly,Y_train)
print(reg.score(X_testScaledandPoly,Y_test))
print(reg.intercept_)
print(reg.coef_)
print(reg.intercept_ == pipeLine.steps[2][1].intercept_)
print(reg.coef_ == pipeLine.steps[2][1].coef_)
0.44099256691782807
3.4067909278765605
[ 0. -7.60868833 5.87162697]
True
[ True True True]
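[Comments]:
Is it possible that X_test and X_train have different min/max values? Could you try it with a defined dataset and add that to your question?
You should not call fit_transform twice. You should fit on the training data and then call only transform on the test data.
Thanks, everyone :) I can see the mistake in my approach now :)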
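The comments point at the scaler's learned minimum and maximum. As a minimal sketch with made-up numbers (the arrays below are hypothetical and not from the question), the same test values land on different scales depending on which data the MinMaxScaler was fitted on:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up one-feature data; the test range is narrower than the training range.
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.0], [4.0]])

# Learn min/max from the training data, then reuse them on the test data.
scaler = MinMaxScaler().fit(X_train)
test_scaled_ok = scaler.transform(X_test)

# Mistake from the question: a second scaler learns min/max from the test data.
test_scaled_refit = MinMaxScaler().fit_transform(X_test)

print(scaler.data_min_, scaler.data_max_)  # min and max seen during training
print(test_scaled_ok.ravel())              # test values on the training scale
print(test_scaled_refit.ravel())           # test values stretched to fill [0, 1]
```

The regression coefficients were learned on the training scale, so feeding the model the re-stretched test values changes its predictions and therefore its score.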
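[Answer 1]:
The problem is in your manual steps: you refit the scaler on the test data. You need to fit it on the training data and then use that fitted instance on the test data. For details, see: How to normalize the Train and Test data using MinMaxScaler sklearn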
from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
X, y = make_regression(n_features=3, n_samples=50, n_informative=1, noise=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)
pipeLine = make_pipeline(MinMaxScaler(),PolynomialFeatures(), LinearRegression())
pipeLine.fit(X_train,Y_train)
print(pipeLine.score(X_test,Y_test))
print(pipeLine.steps[2][1].intercept_)
print(pipeLine.steps[2][1].coef_)
# Fit the scaler on the training data only, then reuse it for the test data
scaler = MinMaxScaler().fit(X_train)
X_trainScaled = scaler.transform(X_train)
X_trainScaledandPoly = PolynomialFeatures().fit_transform(X_trainScaled)
# transform only: do not refit the scaler on X_test
X_testScaled = scaler.transform(X_test)
X_testScaledandPoly = PolynomialFeatures().fit_transform(X_testScaled)
reg = LinearRegression()
reg.fit(X_trainScaledandPoly,Y_train)
print(reg.score(X_testScaledandPoly,Y_test))
print(reg.intercept_)
print(reg.coef_)
print(reg.intercept_ == pipeLine.steps[2][1].intercept_)
print(reg.coef_ == pipeLine.steps[2][1].coef_)
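To check the equivalence on a repeatable run, here is a hedged sketch that compares the pipeline with both manual variants. The `random_state=0` values and the names `pipe`, `poly`, and `refit_score` are assumptions added here, not part of the original answer, and `np.isclose` replaces `==` so that tiny floating-point differences do not cause a false mismatch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# random_state=0 is an arbitrary choice made here so the run is repeatable.
X, y = make_regression(n_features=3, n_samples=50, n_informative=1,
                       noise=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(MinMaxScaler(), PolynomialFeatures(), LinearRegression())
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

# Manual route: every transformer is fitted on the training data only.
scaler = MinMaxScaler().fit(X_train)
poly = PolynomialFeatures().fit(scaler.transform(X_train))
reg = LinearRegression().fit(poly.transform(scaler.transform(X_train)), y_train)
manual_score = reg.score(poly.transform(scaler.transform(X_test)), y_test)

# Same model, but the scaler is refitted on the test data (the original bug).
X_test_refit = poly.transform(MinMaxScaler().fit_transform(X_test))
refit_score = reg.score(X_test_refit, y_test)

print(np.isclose(pipe_score, manual_score))  # pipeline vs. correct manual route
print(pipe_score, refit_score)               # compare with the refitted variant
```

This mirrors what `Pipeline.score` does internally: it calls `transform`, never `fit_transform`, on each intermediate step before scoring with the final estimator.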
[Discussion]: