Why does scikit-learn's multiple regression method raise an error if the input is a numpy array?

Posted: 2021-10-31 04:42:36

Question:

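I am trying to create a multiple regression model with scikit-learn's linear_model module. Every example I can find online loads the variables into the model from a pandas DataFrame. I am trying to use numpy arrays instead, which produces the error described below.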

Trying to create a Multiple Regression model y = a0 + a1*x1 + a2*x2.

The independent variables x1 and x2 are one-dimensional arrays with 36 values each:

x1 = [ 790, 1160,  929,  865, 1140,  929, 1109, 1365, 1112, 1150,  980,
         990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
        1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
        1390, 1405, 1395]

x2 = [1000, 1200, 1000,  900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
        1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
        2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
        1600, 1600, 2500]

Combining the independent variables into one numpy array:
X = np.array([x1, x2])

X = array([[ 790, 1160,  929,  865, 1140,  929, 1109, 1365, 1112, 1150,  980,
         990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
        1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
        1390, 1405, 1395],
       [1000, 1200, 1000,  900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
        1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
        2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
        1600, 1600, 2500]], dtype=int64)

The target variable:
y = array([ 99,  95,  95,  90, 105, 105,  90,  92,  98,  99,  99, 101,  99,
        94,  97,  97,  99, 104, 104, 105,  94,  99,  99,  99,  99, 102,
       104, 114, 109, 114, 115, 117, 104, 108, 109, 120], dtype=int64)

Training the model generates an error:
regr = linear_model.LinearRegression()
regr.fit(X, y)

This produces the following error. Why?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-5d359e69e27d> in <module>
      1 regr = linear_model.LinearRegression()
----> 2 regr.fit(X, y)
      3 
      4 #predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
      5 predictedCO2 = regr.predict([[3300, 1300]])

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\linear_model\_base.py in fit(self, X, y, sample_weight)
    503 
    504         n_jobs_ = self.n_jobs
--> 505         X, y = self._validate_data(X, y, accept_sparse=['csr', 'csc', 'coo'],
    506                                    y_numeric=True, multi_output=True)
    507 

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    810         y = y.astype(np.float64)
    811 
--> 812     check_consistent_length(X, y)
    813 
    814     return X, y

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    253     uniques = np.unique(lengths)
    254     if len(uniques) > 1:
--> 255         raise ValueError("Found input variables with inconsistent numbers of"
    256                          " samples: %r" % [int(l) for l in lengths])
    257 

ValueError: Found input variables with inconsistent numbers of samples: [2, 36]

Comments:

Your X has its rows and columns swapped (hence the "2 samples" in the error message). Transpose it and you should be fine.

Answer 1:

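Indeed, as @BenReiniger noted in the comment above, the rows and columns are swapped. scikit-learn expects X to have shape (n_samples, n_features), but np.array([x1, x2]) has shape (2, 36), so it is read as 2 samples against the 36 values in y. I tested your code and added .T to X to transpose it, which solves the problem: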

>>> import numpy as np
>>> import sklearn.linear_model

>>> x1 = [790, 1160,  929,  865, 1140,  929, 1109, 1365, 1112, 1150,  980,
...       990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
...       1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
...       1390, 1405, 1395]
>>> x2 = [1000, 1200, 1000,  900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
...       1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
...       2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
...       1600, 1600, 2500]
>>> X = np.array([x1, x2]).T
>>> y = np.array([99,  95,  95,  90, 105, 105,  90,  92,  98,  99,  99, 101,  99,
...               94,  97,  97,  99, 104, 104, 105,  94,  99,  99,  99,  99, 102,
...               104, 114, 109, 114, 115, 117, 104, 108, 109, 120])
>>> regr = sklearn.linear_model.LinearRegression()
>>> regr.fit(X, y)
LinearRegression()

As expected, fit returns a LinearRegression() object.
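
If it helps to see the two layouts side by side, here is a minimal sketch. It uses small made-up values rather than the data from the question, and np.column_stack is swapped in as one alternative to .T (both should give the same array here). The names X_rows and prediction, and the numbers themselves, are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data only (NOT the values from the question).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
# y is constructed as 1 + 2*x1 + 3*x2 so the fitted coefficients are easy to check.
y = np.array([9.0, 8.0, 19.0, 18.0, 29.0])

# Stacking the lists as rows gives shape (n_features, n_samples) = (2, 5).
# Passing this to fit() is the layout that triggers the "inconsistent numbers of samples" error.
X_rows = np.array([x1, x2])

# column_stack puts each variable in its own column:
# shape (n_samples, n_features) = (5, 2), which is what fit() expects.
X = np.column_stack([x1, x2])

regr = LinearRegression().fit(X, y)

# predict() also expects one row per sample, hence the nested list.
prediction = regr.predict([[6.0, 5.0]])

print(X_rows.shape, X.shape)
print(regr.intercept_, regr.coef_, prediction)
```

Printing X.shape before calling fit is a cheap way to catch this kind of mix-up: the first dimension should match len(y).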

Comments:

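Hi @Alex, if this or any other answer solved your problem, please consider accepting it by clicking the check mark. This shows the wider community that you have found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do so.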
