Why does scikit-learn's Multiple Regression method generate an error if the input is a numpy array?
【Posted】: 2021-10-31 04:42:36
【Question】: I am trying to create a multiple regression model with scikit-learn's linear_model module. All the examples I can find online load the variables into the model from a pandas DataFrame. I am trying to use numpy arrays instead, which results in the error described below.
Trying to create a Multiple Regression model y = a0 + a1*x1 + a2*x2.
The independent variables x1 and x2 are one-dimensional arrays with 36 values each:
x1 = [ 790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
1390, 1405, 1395]
x2 = [1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
1600, 1600, 2500]
Combining the independent variables into one numpy array:
X = np.array([x1, x2])
X = array([[ 790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
1390, 1405, 1395],
[1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
1600, 1600, 2500]], dtype=int64)
The target variable:
y = array([ 99, 95, 95, 90, 105, 105, 90, 92, 98, 99, 99, 101, 99,
94, 97, 97, 99, 104, 104, 105, 94, 99, 99, 99, 99, 102,
104, 114, 109, 114, 115, 117, 104, 108, 109, 120], dtype=int64)
Training the model generates an error:
regr = linear_model.LinearRegression()
regr.fit(X, y)
This produces the following error. Why?
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-5d359e69e27d> in <module>
1 regr = linear_model.LinearRegression()
----> 2 regr.fit(X, y)
3
4 #predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
5 predictedCO2 = regr.predict([[3300, 1300]])
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\linear_model\_base.py in fit(self, X, y, sample_weight)
503
504 n_jobs_ = self.n_jobs
--> 505 X, y = self._validate_data(X, y, accept_sparse=['csr', 'csc', 'coo'],
506 y_numeric=True, multi_output=True)
507
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
430 y = check_array(y, **check_y_params)
431 else:
--> 432 X, y = check_X_y(X, y, **check_params)
433 out = X, y
434
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
810 y = y.astype(np.float64)
811
--> 812 check_consistent_length(X, y)
813
814 return X, y
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
253 uniques = np.unique(lengths)
254 if len(uniques) > 1:
--> 255 raise ValueError("Found input variables with inconsistent numbers of"
256 " samples: %r" % [int(l) for l in lengths])
257
ValueError: Found input variables with inconsistent numbers of samples: [2, 36]
【Comments】:
Your X has its rows and columns swapped (hence the "2 samples" in the error message). Transpose it and you should be fine.
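To see where the [2, 36] in the message comes from, it can help to print the array shapes. scikit-learn expects X with shape (n_samples, n_features), so it treats the first axis as the number of samples. The snippet below is only a minimal sketch: the five-value lists are shortened stand-ins for the question's data, and the names X_wrong and X_right are made up for illustration.

```python
import numpy as np

# Shortened, illustrative stand-ins for the question's x1, x2 and y.
x1 = [790, 1160, 929, 865, 1140]
x2 = [1000, 1200, 1000, 900, 1500]
y = np.array([99, 95, 95, 90, 105])

X_wrong = np.array([x1, x2])  # each list becomes a ROW
X_right = X_wrong.T           # after transposing, each list is a COLUMN

print(X_wrong.shape)  # first axis is 2, so sklearn would see 2 samples
print(X_right.shape)  # first axis now matches len(y)
print(len(X_right) == len(y))
```

With the full 36-value lists the same check gives (2, 36) before transposing and (36, 2) after, which lines up with the 36 entries in y.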
【Answer 1】:
Indeed, as @BenReiniger noted in the comment, the rows and columns are swapped.
I tested your code and added .T to X to transpose it, which fixes the problem:
>>> import numpy as np
>>> import sklearn.linear_model
>>> x1 = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
... 990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
... 1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
... 1390, 1405, 1395]
>>> x2 = [1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
... 1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
... 2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
... 1600, 1600, 2500]
>>> X = np.array([x1, x2]).T
>>> y = np.array([99, 95, 95, 90, 105, 105, 90, 92, 98, 99, 99, 101, 99,
... 94, 97, 97, 99, 104, 104, 105, 94, 99, 99, 99, 99, 102,
... 104, 114, 109, 114, 115, 117, 104, 108, 109, 120])
>>> regr = sklearn.linear_model.LinearRegression()
>>> regr.fit(X, y)
LinearRegression()
As expected, we get a LinearRegression() object back.
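As a possible alternative to transposing, np.column_stack builds the (n_samples, n_features) layout directly from the two lists. The following is a sketch rather than part of the original answer: it reuses the question's data, and the query point [[3300, 1300]] is taken from the predict call visible in the traceback.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x1 = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
      990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
      1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
      1390, 1405, 1395]
x2 = [1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
      1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
      2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
      1600, 1600, 2500]
y = np.array([99, 95, 95, 90, 105, 105, 90, 92, 98, 99, 99, 101, 99,
              94, 97, 97, 99, 104, 104, 105, 94, 99, 99, 99, 99, 102,
              104, 114, 109, 114, 115, 117, 104, 108, 109, 120])

# Stack the 1-D lists as columns: one row per sample, one column per feature.
X = np.column_stack((x1, x2))
print(X.shape)

regr = LinearRegression()
regr.fit(X, y)

# coef_ holds a1 and a2, intercept_ holds a0 in y = a0 + a1*x1 + a2*x2.
print(regr.intercept_, regr.coef_)

# New inputs to predict() must also be 2-D: one row per sample.
predicted = regr.predict([[3300, 1300]])
print(predicted)
```

The same shape rule applies to predict, which is why the query is written as a nested list with a single row.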
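【Comments】:
Hi @Alex, if this or any other answer solved your problem, please consider accepting it by clicking the check mark. This signals to the wider community that you have found a solution, and gives some reputation to both the answerer and yourself. There is no obligation to do so.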