Logistic Regression Model (binary) crosstab error = shape of passed values issue

[Posted] 2021-05-07 02:46:41

[Question]:

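I am currently trying to run a logistic regression on a dataset. I dummy-encoded my categorical variables, standardized my continuous variables, and filled null values with -1 (this works for my dataset). I get no errors through these steps until I try to run my crosstab, which complains about the shape of the values I passed. I get the same error for LogR both with and without CV. My code is below. I left out the encoding, since it does not seem to be the problem, and the code for LogR without CV, since it is essentially the same apart from the CV.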

import pandas as pd

# read in the df w/ encoded variables
allyrs=pd.read_csv("C:/Users/cyrra/OneDrive/Documents/Pythonread/HDS805/CS1W1/modelready_working.csv")

# Find locations of where I need to trim the data down selecting only the encoded variables
allyrs.columns.get_loc("BMI_C__-1.0")
23
allyrs.columns.get_loc("N_BMIR")
152

# Finding the location of the Y col
allyrs.columns.get_loc("CM")
23

#create new X and y for binary LR
y_bi = allyrs[["CM"]]
X_bi = allyrs.iloc[0:1305720, 23:152]

I then checked the lengths of both variables and inspected all the columns in the X set; everything is there. The shapes are: y_bi = 1305720 rows × 1 column, X_bi = 1305720 rows × 129 columns.

# Create test/train for bi column
from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
                                                    train_size=0.8,test_size = 0.2)

I checked the sizes of Xbi_train and ybi_train again: Xbi_train = 1044576 rows × 129 columns, ybi_train = 1044576 rows × 1 column.

# LRw/CV for the binary col
from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)

# Set predicted (checking to see if its an array)
logitbi_cv.predict(Xbi_train)
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

# Set predicted to its own variable 
pred_logitbi_cv = logitbi_cv.predict(Xbi_train)

# Cross tab LR w/0ut
from sklearn.metrics import confusion_matrix
ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)

Error:

[OUT]:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
   1701         blocks = _form_blocks(arrays, names, axes)
-> 1702         mgr = BlockManager(blocks, axes)
   1703         mgr._consolidate_inplace()

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
    322             if block.shape[1:] != mgr_shape[1:]:
--> 323                 raise construction_error(tot_items, block.shape[1:], self.axes)
    324         if len(self.items) != tot_items:

ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-121-c669b17c171f> in <module>
      1 # LR W/ CV
      2 # Cross tab LR w/0ut
----> 3 ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)

~\anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)
    596         **dict(zip(unique_colnames, columns)),
    597     
--> 598     df = DataFrame(data, index=common_idx)
    599     original_df_cols = df.columns
    600 

~\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    527 
    528         elif isinstance(data, dict):
--> 529             mgr = init_dict(data, index, columns, dtype=dtype)
    530         elif isinstance(data, ma.MaskedArray):
    531             import numpy.ma.mrecords as mrecords

~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
    285             arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
    286         ]
--> 287     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    288 
    289 

~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
     93     axes = [columns, index]
     94 
---> 95     return create_block_manager_from_arrays(arrays, arr_names, axes)
     96 
     97 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
   1704         return mgr
   1705     except ValueError as e:
-> 1706         raise construction_error(len(arrays), arrays[0].shape, axes, e)
   1707 
   1708 

ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)

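I realize this is saying that the number of rows passed to the crosstab does not match, but can someone tell me why this is happening or where I went wrong? I am following the example code exactly as given in the book I am using, only with my own data.

Thank you so much!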

[Answer 1]:

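Your target variable should have shape (n,) rather than (n,1), which is what you get when you call y_bi = allyrs[["CM"]]. See the relevant help page. There should have been a warning about this, because the fit will not work, but I guess it was missed somehow.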
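To make the difference concrete, here is a minimal sketch on a small made-up frame (the data and the name `demo` are assumptions for illustration, not the asker's dataset). Double brackets select a one-column DataFrame, single brackets select a Series, and the row counts are equal in both cases, which is why a length check alone does not reveal the problem.

```python
import pandas as pd

# Made-up stand-in for the real data (illustrative only)
demo = pd.DataFrame({"x1": [0, 1, 1, 0], "CM": [0, 1, 0, 1]})

y_2d = demo[["CM"]]   # double brackets: one-column DataFrame
y_1d = demo["CM"]     # single brackets: Series

print(type(y_2d).__name__, y_2d.shape, y_2d.ndim)  # DataFrame (4, 1) 2
print(type(y_1d).__name__, y_1d.shape, y_1d.ndim)  # Series (4,) 1
print(len(y_2d) == len(y_1d))                      # True

# An existing one-column DataFrame can be flattened without reloading
y_fixed = y_2d["CM"]   # y_2d.squeeze("columns") gives the same Series
print(y_fixed.shape)   # (4,)
```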

If you call y_bi = allyrs["CM"] instead, for example with some dummy data that I set up:

import numpy as np
import pandas as pd

np.random.seed(111)
allyrs = pd.DataFrame(np.random.binomial(1,0.5,(100,4)),columns=['x1','x2','x3','CM'])
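# Note: iloc[:, :4] also picks up the CM column, so the target is among the features;
# use iloc[:, :3] to select only x1..x3
X_bi = allyrs.iloc[:,:4]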
y_bi = allyrs["CM"]

Then run the train/test split, followed by the fit:

from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
                                                    train_size=0.8,test_size = 0.2)

from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)

pred_logitbi_cv =logitbi_cv.predict(Xbi_train)
pd.crosstab(ybi_train, pred_logitbi_cv)

col_0   0   1
CM           
0      39   0
1       0  41
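If the split has already been run with a one-column DataFrame target, the data does not need to be reloaded. The sketch below uses made-up data (the frame, the seed, and the `random_state` values are assumptions for illustration). It flattens the target before fitting, then builds the table with both `pd.crosstab` and `sklearn.metrics.confusion_matrix`, which the question imports but does not use. The features here deliberately exclude the `CM` column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Made-up binary data (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(200, 4)),
                  columns=["x1", "x2", "x3", "CM"])

X = df[["x1", "x2", "x3"]]   # features only, target excluded
y_2d = df[["CM"]]            # shape (n, 1), as in the question

X_train, X_test, y_train_2d, y_test_2d = train_test_split(
    X, y_2d, train_size=0.8, random_state=0)

# Flatten the one-column DataFrame to a Series of shape (n,)
y_train = y_train_2d["CM"]

model = LogisticRegressionCV(cv=2, random_state=0).fit(X_train, y_train)
pred = model.predict(X_train)

# Both calls now receive 1-D inputs of equal length
ct = pd.crosstab(y_train, pred, rownames=["actual"], colnames=["predicted"])
cm = confusion_matrix(y_train, pred, labels=[0, 1])
print(ct)
print(cm)
```

Because `pred` is a plain NumPy array without an index, `pd.crosstab` pairs it with `y_train` by position, so both must come from the same rows in the same order.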

[Comments]:

Glad it worked for you :)
