Code throws error over the specification of the columns

Posted: 2021-02-08 18:57:05

Question:

I keep getting the following ValueError when I run my model:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Here is the full traceback:

Traceback (most recent call last):

  File "/usr/lib/python3.8/site-packages/sklearn/utils/__init__.py", line 425, in _get_column_indices
    all_columns = X.columns

AttributeError: 'numpy.ndarray' object has no attribute 'columns'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/home/user/Python Practice/Working/Playstore/untitled0.py", line 48, in <module>
    run.fit(x,y)

  File "/usr/lib/python3.8/site-packages/sklearn/pipeline.py", line 330, in fit
    Xt = self._fit(X, y, **fit_params_steps)

  File "/usr/lib/python3.8/site-packages/sklearn/pipeline.py", line 292, in _fit
    X, fitted_transformer = fit_transform_one_cached(

  File "/usr/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)

  File "/usr/lib/python3.8/site-packages/sklearn/pipeline.py", line 740, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 529, in fit_transform
    self._validate_remainder(X)

  File "/usr/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py", line 327, in _validate_remainder
    cols.extend(_get_column_indices(X, columns))

  File "/usr/lib/python3.8/site-packages/sklearn/utils/__init__.py", line 427, in _get_column_indices
    raise ValueError("Specifying the columns using strings is only "

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Here is my code:

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from category_encoders import CatBoostEncoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

data = pd.read_csv("data.csv",index_col=("Unnamed: 0"))
y = data.Installs
x = data.drop("Installs",axis=1)


strat = ["mean","median","most_frequent","constant"]
num_imp = SimpleImputer(strategy=strat[0])
obj_imp = SimpleImputer(strategy=strat[2])

# Set up the scaler
sc = StandardScaler()

# Set up Encoders
cb = CatBoostEncoder()
oh = OneHotEncoder(sparse=True)


# Set up columns
obj = list(x.select_dtypes(include="object"))
num = list(x.select_dtypes(exclude="object"))


cb_col = [i for i in obj if len(x[i].unique())>30]
oh_col = [i for i in obj if len(x[i].unique())<10]

# First Pipeline
imp = make_pipeline((num_imp))
enc_cb = make_pipeline((cb),(obj_imp))
enc_oh = make_pipeline((oh),(obj_imp))

# Col Transformation
col = make_column_transformer((imp,num),(sc,num))
cb_ = make_column_transformer((enc_cb,cb_col))
oh_ = make_column_transformer((enc_oh,oh_col))

model = AdaBoostRegressor(random_state=(0))

run = make_pipeline((col),(cb_),(oh_),(model))
run.fit(x,y)

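Any ideas on how to fix this? The data needed can be found here. Initially, I tried doing all the column transformations at once under a single transformer variable, but that did not work, and I was advised to separate them before running again. I did that, and you can see the result above. I could use some help. Thanks!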


Answer 1:

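I would not split the column transformers up like this. With this setup, in your run pipeline the first ColumnTransformer, col, converts the input from a pandas DataFrame into a numpy array. After that, cb_ cannot select columns by name (and worse, the column order has changed, so you cannot rely on the column indices from the original data either).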
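
To make the failure concrete, here is a minimal sketch on a made-up two-column frame (the column names `size` and `category` are hypothetical, not taken from the asker's data). It assumes scikit-learn's default output configuration, where a ColumnTransformer returns an array rather than a DataFrame, and shows a second name-based transformer failing on that array.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the real data
df = pd.DataFrame({"size": [1.0, 2.0, 3.0], "category": ["a", "b", "a"]})

# First transformer: selects a column by name, which works on a DataFrame
first = make_column_transformer((StandardScaler(), ["size"]))
out = first.fit_transform(df)

# With default settings the result is a plain array with no column names
print(type(out).__name__)
print(hasattr(out, "columns"))

# Second transformer: also selects by name, but now receives the array
second = make_column_transformer((StandardScaler(), ["size"]))
try:
    second.fit(out)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the error message may differ between scikit-learn versions, but the cause is the same: string column selectors need an input that still carries column names.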

See my answer to your other question for what I think is the simplest way to build this pipeline.
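
The linked answer is not reproduced on this page, so the following is only a sketch of what a single-ColumnTransformer layout could look like, not the answerer's actual code. It uses a made-up toy frame (the `Rating` and `Type` columns and their values are hypothetical) and uses OneHotEncoder for the categorical branch; a CatBoostEncoder branch for high-cardinality columns could presumably be added as a third tuple in the same transformer. Imputation is placed before scaling and encoding, which is an assumption about the intended order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import AdaBoostRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the real data set is not available here
x = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, 4.5, 4.0, 3.2],
    "Type": pd.Series(["Free", "Paid", np.nan, "Free", "Free", "Paid"], dtype=object),
})
y = pd.Series([1000, 50, 500, 10000, 5000, 10], name="Installs")

# Column lists written out explicitly for the toy frame
num = ["Rating"]
obj = ["Type"]

# One sub-pipeline per column group: impute first, then scale or encode
num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
obj_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# A single ColumnTransformer sees the original DataFrame, so every
# name-based selection happens while column names still exist
preprocess = make_column_transformer((num_pipe, num), (obj_pipe, obj))

run = make_pipeline(preprocess, AdaBoostRegressor(random_state=0))
run.fit(x, y)
predictions = run.predict(x)
print(predictions.shape)
```

Because all branches live in one transformer, no step ever has to select named columns from an array produced by an earlier step.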

Comments:

Thanks. I ended up fixing it myself in a way similar to what you suggested.
