Input Variables With Inconsistent Numbers of Samples for Polynomial Regression
Posted: 2021-06-14 04:19:01
Question: I am trying to run a polynomial regression and ran into a problem when fitting the model. I get:
ValueError: Found input variables with inconsistent numbers of samples: [1040, 260]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
x = BTCdata.iloc[:, [1, 2, 4, 5]]
y = BTCdata.iloc[:,3]
x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
model = LinearRegression()
model.fit(x_, y)
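Comments:
Could you post BTCdata, or a link or something similar, so the error can be reproduced?
Yes, of course, my mistake for not doing that: drive.google.com/file/d/13VnQZbKB9UTOeNplT6GjzTZvH8CqxQcr/… The sheet is 'FinalBTC'. I just did a simple pd.read_excel(path).
Answer 1:
The problem is in this line: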
x = np.array(x).reshape((-1, 1))
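By doing this, you turn a data frame with n rows and m columns into an array with n x m rows and 1 column. In your example, x ends up with 260 x 4 = 1040 rows while y has 260, which raises this error.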
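The shape change described above can be reproduced without the spreadsheet. This is a minimal sketch that assumes a zero-filled stand-in array with the same dimensions as the question's data (260 rows, 4 feature columns); the variable names are illustrative only.

```python
import numpy as np

# Stand-in for the question's data: 260 rows, 4 feature columns (assumed shape).
x = np.zeros((260, 4))
y = np.zeros(260)

# reshape((-1, 1)) forces a single column, so all 260 * 4 values are stacked
# into one long column instead of staying as 4 features per row.
x_flat = np.array(x).reshape((-1, 1))
y_col = np.array(y).reshape((-1, 1))

print(x_flat.shape)  # (1040, 1)
print(y_col.shape)   # (260, 1)
```

The two first dimensions, 1040 and 260, are exactly the pair reported in the ValueError.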
If your goal is to convert the data to a numpy array before passing it to the model, you only need to do the following:
x = x.to_numpy()
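Put together, the corrected flow might look like the sketch below. It assumes randomly generated data in place of the real spreadsheet, and the column names `a` to `d` and `target` are hypothetical; only the shapes mirror the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for BTCdata: 260 rows, 4 features and 1 target.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((260, 5)), columns=["a", "b", "c", "d", "target"])

# to_numpy() keeps the (n_samples, n_features) layout; no reshape is needed.
x = data[["a", "b", "c", "d"]].to_numpy()
y = data["target"].to_numpy()

poly_features = PolynomialFeatures(degree=4, include_bias=False)
x_poly = poly_features.fit_transform(x)

# x_poly and y now agree on the number of samples, so fit() succeeds.
model = LinearRegression()
model.fit(x_poly, y)

print(x.shape, x_poly.shape, y.shape)
```

Both `x_poly` and `y` have 260 rows here, which is the consistency that scikit-learn checks before fitting.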
Answer 2:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
#
BTCdata = pd.read_excel('BitcoinRegression.xlsx', sheet_name='FinalBTC')
x = BTCdata.iloc[:, [1, 2, 4, 5]]
print(x.shape)
y = BTCdata.iloc[:,3]
print(y.shape)
#x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
#model = LinearRegression()
#model.fit(x_, y)
mod = sm.OLS(y, x_).fit()
mod.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: BTC R-squared: 0.886
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 46.86
Date: Wed, 17 Mar 2021 Prob (F-statistic): 2.63e-85
Time: 20:49:58 Log-Likelihood: -2299.3
No. Observations: 260 AIC: 4675.
Df Residuals: 222 BIC: 4810.
Df Model: 37
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.0089 0.019 -0.468 0.640 -0.046 0.028
x2 0.0033 0.004 0.797 0.426 -0.005 0.012
x3 2.621e-05 3.55e-05 0.737 0.462 -4.38e-05 9.62e-05
x4 0.0005 0.001 0.789 0.431 -0.001 0.002
x5 -0.0238 0.067 -0.355 0.723 -0.156 0.108
x6 0.0790 0.688 0.115 0.909 -1.277 1.435
x7 0.0942 0.131 0.722 0.471 -0.163 0.352
x8 0.9679 1.276 0.759 0.449 -1.546 3.482
x9 0.0184 0.133 0.139 0.890 -0.243 0.280
x10 0.0093 0.013 0.726 0.469 -0.016 0.035
x11 0.0957 0.125 0.766 0.444 -0.150 0.342
x12 0.0001 0.000 0.864 0.389 -0.000 0.000
x13 0.0008 0.001 0.599 0.550 -0.002 0.003
x14 0.0207 0.026 0.783 0.435 -0.031 0.073
x15 3.594e-05 2.89e-05 1.245 0.214 -2.09e-05 9.28e-05
x16 -0.0004 0.001 -0.496 0.621 -0.002 0.001
x17 0.0158 0.010 1.621 0.106 -0.003 0.035
x18 -0.0068 0.002 -2.945 0.004 -0.011 -0.002
x19 -0.0014 0.007 -0.202 0.840 -0.015 0.012
x20 -0.0389 0.086 -0.454 0.650 -0.208 0.130
x21 0.1104 0.043 2.558 0.011 0.025 0.195
x22 0.7337 0.819 0.896 0.371 -0.881 2.348
x23 -1.4583 0.432 -3.378 0.001 -2.309 -0.607
x24 0.0601 0.031 1.913 0.057 -0.002 0.122
x25 0.0192 0.021 0.893 0.373 -0.023 0.061
x26 0.0403 0.091 0.445 0.657 -0.138 0.219
x27 -0.5110 0.224 -2.284 0.023 -0.952 -0.070
x28 0.0697 0.078 0.892 0.374 -0.084 0.224
x29 -0.1316 0.039 -3.397 0.001 -0.208 -0.055
x30 0.0054 0.103 0.052 0.958 -0.198 0.209
x31 0.0003 0.000 0.951 0.343 -0.000 0.001
x32 0.0060 0.007 0.856 0.393 -0.008 0.020
x33 -0.0124 0.012 -1.078 0.282 -0.035 0.010
x34 0.3317 0.394 0.842 0.400 -0.444 1.108
x35 -4.886e-09 1.1e-09 -4.439 0.000 -7.05e-09 -2.72e-09
x36 1.387e-07 3.68e-08 3.767 0.000 6.62e-08 2.11e-07
x37 5.106e-07 3.44e-06 0.148 0.882 -6.28e-06 7.3e-06
x38 4.652e-07 2.91e-07 1.601 0.111 -1.07e-07 1.04e-06
x39 -1.623e-06 5.17e-07 -3.138 0.002 -2.64e-06 -6.04e-07
x40 -8.446e-05 9.05e-05 -0.933 0.352 -0.000 9.39e-05
x41 -8.729e-06 7.38e-06 -1.182 0.238 -2.33e-05 5.82e-06
x42 -0.0017 0.002 -0.804 0.422 -0.006 0.002
x43 0.0007 0.000 1.705 0.090 -0.000 0.001
x44 -1.815e-05 2.11e-05 -0.862 0.390 -5.96e-05 2.33e-05
x45 9.562e-06 3.43e-06 2.788 0.006 2.8e-06 1.63e-05
x46 0.0012 0.001 1.413 0.159 -0.000 0.003
x47 5.405e-05 6.5e-05 0.831 0.407 -7.41e-05 0.000
x48 0.0069 0.044 0.156 0.876 -0.080 0.093
x49 -0.0078 0.006 -1.414 0.159 -0.019 0.003
x50 0.0001 0.000 0.307 0.759 -0.001 0.001
x51 0.1505 0.090 1.669 0.096 -0.027 0.328
x52 0.1555 0.046 3.410 0.001 0.066 0.245
x53 -0.0296 0.024 -1.210 0.227 -0.078 0.019
x54 0.0016 0.001 2.182 0.030 0.000 0.003
x55 -2.28e-05 8.77e-06 -2.600 0.010 -4.01e-05 -5.52e-06
x56 -0.0045 0.003 -1.594 0.112 -0.010 0.001
x57 -0.0002 0.000 -0.947 0.344 -0.001 0.000
x58 -0.0067 0.237 -0.028 0.977 -0.474 0.461
x59 0.0134 0.021 0.629 0.530 -0.029 0.055
x60 0.0020 0.002 1.123 0.262 -0.002 0.006
x61 0.0277 0.016 1.689 0.093 -0.005 0.060
x62 -0.3824 0.413 -0.926 0.355 -1.196 0.431
x63 0.3528 0.179 1.970 0.050 -0.000 0.706
x64 -0.0282 0.005 -5.708 0.000 -0.038 -0.018
x65 -0.0002 0.000 -0.695 0.488 -0.001 0.000
x66 0.0098 0.009 1.142 0.255 -0.007 0.027
x67 0.0901 0.103 0.873 0.384 -0.113 0.293
x68 -0.1941 0.648 -0.300 0.765 -1.471 1.083
x69 0.0237 0.021 1.128 0.261 -0.018 0.065
==============================================================================
Omnibus: 127.728 Durbin-Watson: 0.552
Prob(Omnibus): 0.000 Jarque-Bera (JB): 851.418
Skew: 1.861 Prob(JB): 1.31e-185
Kurtosis: 11.046 Cond. No. 4.00e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4e+16. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
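The 69 coefficients (x1 to x69) in the summary match the number of columns that a degree-4 polynomial expansion of 4 inputs produces once the bias column is excluded. A small sketch of that count, using only the standard library:

```python
from math import comb

n_inputs = 4  # columns selected with iloc[:, [1, 2, 4, 5]]
degree = 4    # PolynomialFeatures(degree=4, include_bias=False)

# Monomials of total degree <= 4 in 4 variables, including the constant term.
n_with_bias = comb(n_inputs + degree, degree)

# include_bias=False drops the constant column.
n_features = n_with_bias - 1

print(n_features)  # 69
```

With 69 regressors for 260 observations, many of the terms are strongly correlated, which is consistent with the large condition number reported in note [2].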