When I try to fit a scikit-learn model with 1 more feature, I get this error: "ValueError: Found input variables with inconsistent numbers of samples"


【Posted】: 2019-11-11 01:26:10

【Question】:

My code works fine:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    df_amazon = pd.read_csv("datasets/amazon_alexa.tsv", sep="\t")

    X = df_amazon['variation'] # the features we want to analyze
    ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

    X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

    # Create pipeline using Bag of Words
    # (predictors, bow_vector and classifier are defined earlier; not shown here)
    pipe = Pipeline([('cleaner', predictors()),
                     ('vectorizer', bow_vector),
                     ('classifier', classifier)])

    pipe.fit(X_train, y_train)

But if I try to add 1 more feature to the model, replacing

    X = df_amazon['variation']

with

    X = df_amazon[['variation','verified_reviews']] 

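then when I call fit, I get this error message from sklearn:

ValueError: Found input variables with inconsistent numbers of samples: [2, 2205]

So fit works when X_train and y_train have shapes (2205,) and (2205,),

but not when the shapes change to (2205, 2) and (2205,).

What is the best way to deal with this?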

【Comments】:

Did you use CountVectorizer?

Yes, I did. Maybe the problem is related to the pipeline.

【Answer 1】:

The data must have shape (n_samples, n_features). Try transposing X (X.T).
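A quick shape check may help when weighing this suggestion. The sketch below uses a made-up three-row frame (illustrative data, not the Alexa dataset) and suggests that a two-column selection already has shape (n_samples, n_features), so transposing it leaves 2 rows against n labels. It also shows that iterating over a DataFrame yields its column names, which is presumably where the 2 in the error comes from once the frame reaches a text vectorizer.

```python
import pandas as pd

# Made-up stand-in for the real data (assumption: same column names as in the question).
df = pd.DataFrame({
    "variation": ["Black", "White", "Black"],
    "verified_reviews": ["Love it", "Not great", "Works fine"],
    "feedback": [1, 0, 1],
})

X = df[["variation", "verified_reviews"]]
y = df["feedback"]

print(X.shape)    # (3, 2): already (n_samples, n_features)
print(X.T.shape)  # (2, 3): transposing gives 2 "samples" against 3 labels
print(y.shape)    # (3,)

# Iterating over a DataFrame yields column names, not rows,
# so a text vectorizer fed the whole frame sees only 2 "documents".
documents = list(X)
print(documents)  # ['variation', 'verified_reviews']
```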

【Comments】:

If I try transposing X, X = df_amazon[['variation','verified_reviews']].T, the error changes to ValueError: Found input variables with inconsistent numbers of samples: [2, 3150]

【Answer 2】:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.DataFrame(
        data=[['Heather Gray Fabric', 'I received the echo as a gift.', 1],
              ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0]],
        columns=['variation', 'review', 'feedback'])

    vect = CountVectorizer()
    vect.fit_transform(df[['variation', 'review']])

    # now look at the vocabulary that has been created
    print(vect.vocabulary_)

    # Output: features were generated only for the column names,
    # not for the content of the columns
    # {'variation': 1, 'review': 0}

    # So you need to build one column that contains both variation and review,
    # and pass that column to your model.
    # Note: + joins the strings without a separator, so adjacent words fuse
    # (see 'fabrici' and 'fabricwithout' below).
    df['variation_review'] = df['variation'] + df['review']

    vect.fit_transform(df['variation_review'])
    print(vect.vocabulary_)

    # Output as posted:
    # {'heather': 8,
    #  'gray': 6,
    #  'fabrici': 3,
    #  'received': 9,
    #  'the': 11,
    #  'echo': 2,
    #  'as': 0,
    #  'gift': 5,
    #  'sandstone': 10,
    #  'fabricwithout': 4,
    #  'having': 7,
    #  'cellphone': 1}
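The fused tokens 'fabrici' and 'fabricwithout' in that vocabulary come from joining the two columns with no separator. A minimal variation, assuming the same two example rows, joins them with a space so that the last word of one column and the first word of the next remain separate tokens:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(
    data=[['Heather Gray Fabric', 'I received the echo as a gift.', 1],
          ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0]],
    columns=['variation', 'review', 'feedback'])

# Join with an explicit space so words from the two columns do not fuse.
df['variation_review'] = df['variation'] + ' ' + df['review']

vect = CountVectorizer()
counts = vect.fit_transform(df['variation_review'])  # one row per sample

vocab = sorted(vect.vocabulary_)
print(vocab)
print(counts.shape)
```

With the separator in place, 'fabric' should appear as a single shared token for both rows.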

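【Comments】:

Indeed, df['variation_review'] = df['variation'] + df['review'] solves the problem, but I don't know whether it is a good solution, since 'variation' is a category and 'review' is text. qaiser, what do you think?

Check this link: ***.com/questions/39121104/…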
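One possible way to treat the category and the text differently, offered only as a sketch and not necessarily what the linked question proposes, is scikit-learn's ColumnTransformer (available since version 0.20). The data below is made up, and LogisticRegression is an assumed stand-in for the classifier, which the question does not show.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame standing in for amazon_alexa.tsv (illustrative data only).
df = pd.DataFrame({
    "variation": ["Heather Gray Fabric", "Sandstone Fabric",
                  "Heather Gray Fabric", "Sandstone Fabric"],
    "verified_reviews": ["I received the echo as a gift",
                         "Cannot use many of her features",
                         "Love it, works great",
                         "Stopped working after a week"],
    "feedback": [1, 0, 1, 0],
})

X = df[["variation", "verified_reviews"]]  # shape (n_samples, 2)
y = df["feedback"]

preprocess = ColumnTransformer([
    # A list selector passes a 2-D frame, which OneHotEncoder expects.
    ("variation", OneHotEncoder(handle_unknown="ignore"), ["variation"]),
    # A plain string selector passes a 1-D column, which CountVectorizer expects.
    ("review", CountVectorizer(), "verified_reviews"),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),  # assumption: any classifier could go here
])

pipe.fit(X, y)
features = pipe.named_steps["preprocess"].transform(X)
print(features.shape)  # one row per sample; one-hot columns plus word-count columns
```

The string-versus-list selector distinction matters here: passing ["verified_reviews"] as a list would hand CountVectorizer a 2-D frame and reproduce the column-name problem described in Answer 2.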
