Fit and predict a linear regression for each row of a database in Python

Posted 2023-03-11

【Title】Fit and predict linear regression for each row in the database in Python 【Published】2021-12-30 10:53:14 【Question】:

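Good evening everyone. I'm new to Python and I'm trying to learn by replicating a model I built in Excel.

I need to replicate the TREND function to fit a small linear model between two extreme points, say

A = (1, 0.15) B = (5, 0.2)

and use it to predict at a given value (say 4.2).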
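As a sanity check on the target behaviour, the two-point case above can be reproduced with a degree-1 least-squares fit. This is only a minimal sketch using numpy.polyfit (not part of the original question); the names x_known, y_known and prediction are illustrative.

```python
import numpy as np

# The two known points A = (1, 0.15) and B = (5, 0.2)
x_known = [1, 5]
y_known = [0.15, 0.2]

# A degree-1 polynomial through two points is the straight line joining them
slope, intercept = np.polyfit(x_known, y_known, deg=1)

# Evaluate the line at the new x value from the question
prediction = slope * 4.2 + intercept
print(prediction)
```

With only two points the fitted line passes exactly through both, so the result is the same as linear interpolation between A and B.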

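For the purposes of this code, I need to fit one model for each row of the database. The x values are always x_1 = 1 and x_2 = 5, while the y values differ from row to row.

I tried using LinearRegression() from the sklearn.linear_model package, together with model.predict, like this: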

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = {'New_x':[5, 2.1, 4.5, 3.0],
        'X1':[1, 1, 1, 1],
        'X2':[5, 5, 5, 5],
        'Y1':[0.15, 0.7, 1.35, 0.2],
        'Y2':[0.2, 0.85, 1.55, 0.4]}

df=pd.DataFrame(data,index=["1","2","3","4"])

model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
prediction=model.predict(df["New_x"].values.reshape(-1,1))

But I get this error:

    ValueError                                Traceback (most recent call last)
<ipython-input-88-da83cb57bf4a> in <module>()
     18 
     19 model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
---> 20 prediction=model.predict(df["New_x"].values.reshape(-1,1))
     21 
     22 #model = LinearRegression().fit(SEC_ERBA_sample[["Vertex1","Vertex2"]], SEC_ERBA_sample[["SENIOR_1Y","SENIOR_5Y"]])

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
    254             Returns predicted values.
    255         """
--> 256         return self._decision_function(X)
    257 
    258     _preprocess_data = staticmethod(_preprocess_data)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
    239         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    240         return safe_sparse_dot(X, self.coef_.T,
--> 241                                dense_output=True) + self.intercept_
    242 
    243     def predict(self, X):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 

ValueError: shapes (4,1) and (2,2) not aligned: 1 (dim 1) != 2 (dim 0)

So I assume LinearRegression().fit is fitting a single model based on the column values. Is there a way to fit and predict a linear regression for each row?

【Comments】:

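Could you add the full traceback to the question? It makes debugging easier.
@user17242583 The full traceback is a bit convoluted, but I've added a representative example.

【Answer 1】: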

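I think this is a simple coding mistake, but it may point to a deeper conceptual issue, so I'll try to give you a broader answer. sklearn.base.BaseEstimator#fit trains an ML model by relating a set of features X to a set of ground-truth values y. In your example, you are in effect training two multivariate regression models (one per target) to estimate the Y1 and Y2 variables from X1 and X2: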

model = LinearRegression().fit(df[["X1","X2"]], df[["Y1","Y2"]])

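So the model learns to estimate those two variables given the two others. At prediction time, it needs exactly the same variables (X1 and X2) to predict the values of interest: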

predictions = model.predict(df[["New_x1", "New_x2"]])

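If the New_x2 information is not available at test (prediction) time, you either have to estimate it as well or drop it from training altogether.

A simple abstract example: if a model is trained to estimate your preferred T-shirt size from your height and weight, you need to know both height and weight at test (prediction) time to get a correct size estimate.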
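The feature-count mismatch can be made concrete with a small sketch. It reuses the values from the question as plain NumPy arrays; the names X, Y, new_x and mismatch_raised are illustrative, and the exact wording of the error message depends on the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same values as the question: two feature columns (X1, X2), two targets (Y1, Y2)
X = np.array([[1, 5], [1, 5], [1, 5], [1, 5]], dtype=float)
Y = np.array([[0.15, 0.2], [0.7, 0.85], [1.35, 1.55], [0.2, 0.4]])

model = LinearRegression().fit(X, Y)

# coef_ has shape (n_targets, n_features): the model expects 2 features
print(model.coef_.shape)

# New_x provides a single feature column, so predict() rejects it
new_x = np.array([5, 2.1, 4.5, 3.0]).reshape(-1, 1)
try:
    model.predict(new_x)
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("predict failed:", exc)

# Input with the same two columns used in training is accepted
print(model.predict(X).shape)
```

Note that supplying two columns only removes the shape error. Because X1 and X2 are identical in every row here, a single model fitted across rows has nothing to distinguish one row from another, so it does not give the per-row line the question asks for.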

【Comments】:

【Answer 2】:

I found a solution using iterrows(). It is still incomplete because I can't save the output, but I think I'll open a separate, more focused question for that.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = {'New_x':[5, 2.1, 4.5, 3.0],
        'X1':[1., 1, 1, 1],
        'X2':[5., 5, 5, 5],
        'Y1':[0.15, 0.7, 1.35, 0.2],
        'Y2':[0.2, 0.85, 1.55, 0.4]}

df=pd.DataFrame(data,index=["1","2","3","4"])

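The last block runs the linear regression row by row. Using iterrows() is generally discouraged, since many operations can be done in other ways (including vectorized ones), but in this case I did not find an alternative that solves the problem.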

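for index, row in df.iterrows():
    # Fit a two-point model for this row; X must be 2D (n_samples, n_features)
    model=LinearRegression().fit(np.array([row["X1"],row["X2"]]).reshape(-1,1),
                                 np.array([row["Y1"],row["Y2"]]))
    # predict() also expects a 2D array, so wrap the scalar New_x
    print(model.predict(np.array([[row["New_x"]]])))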
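Since each row defines its line by exactly two points, the per-row fit can also be written in closed form with column arithmetic, which avoids the loop and stores the result. The following is a minimal sketch of that idea rather than the approach used in this answer; the column name "Prediction" is hypothetical, and it assumes X1 and X2 differ in every row.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "New_x": [5, 2.1, 4.5, 3.0],
        "X1": [1.0, 1, 1, 1],
        "X2": [5.0, 5, 5, 5],
        "Y1": [0.15, 0.7, 1.35, 0.2],
        "Y2": [0.2, 0.85, 1.55, 0.4],
    },
    index=["1", "2", "3", "4"],
)

# Slope and intercept of the line through (X1, Y1) and (X2, Y2), one per row
slope = (df["Y2"] - df["Y1"]) / (df["X2"] - df["X1"])
intercept = df["Y1"] - slope * df["X1"]

# Evaluate each row's line at that row's New_x and keep it as a new column
df["Prediction"] = intercept + slope * df["New_x"]
print(df)
```

With two points per row, the least-squares line and the line through the two points coincide, so this should match what the row-by-row LinearRegression loop computes.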

【Comments】:
