MultinomialNB in python/pandas returns "objects are not aligned" error when predicting

【Posted】: 2014-09-04 10:33:10

【Question】:

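I have a number of email subject lines and performance ratings, and I want to use them to predict which subject lines will perform well. When I run MultinomialNB, I get an "objects are not aligned" error. Here is the code.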

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

input=pd.read_csv('subject_tool_input_500.csv')
input.subject[input.subject.isnull()]=' '
good=np.asarray(input.unique_open_performance>0)
subjects=input.subject

classifier = MultinomialNB()
count_vectorizer = CountVectorizer(strip_accents='unicode')
counts=count_vectorizer.fit_transform(subjects)

classifier.fit(counts,good)
classifier.predict('test subject line')

This returns the following error.

>>> classifier.predict('test subject line')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 63, in predict
    jll = self._joint_log_likelihood(X)
  File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 457, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T)
  File "/Library/Python/2.7/site-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot
    return np.dot(a, b)
ValueError: objects are not aligned
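
The last frame of the traceback shows np.dot rejecting operands whose shapes do not line up. A minimal sketch of that kind of failure, using made-up array sizes rather than the real 1172387-feature model (the names feature_log_prob, ok_row and bad_row are illustrative only):

```python
import numpy as np

# Hypothetical sizes: a model trained on 5 features and 2 classes.
feature_log_prob = np.zeros((2, 5))

# A correctly vectorized sample has one column per training feature.
ok_row = np.ones((1, 5))
scores = np.dot(ok_row, feature_log_prob.T)  # one score per class

# A sample with a different number of columns cannot be multiplied.
bad_row = np.ones((1, 3))
try:
    np.dot(bad_row, feature_log_prob.T)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
```

Whatever is passed to predict has to carry the same number of feature columns as the matrix the classifier was fitted on; a raw string does not.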

Here is the input I am using.

>>> subjects
0                         Thanksgiving Dinner Delivered
1           It's Not Too Late To Order for Thanksgiving
2               Stress Free Christmas Gift They'll Love
3     Save $10 On Christmas Gift Certificates - Inst...
4                    Need a Last Minute Christmas Gift?
5                           Give Mom Something Special!
6             Yummy Steaks For Dad - $15 Off Your Order
7     Order a romantic dinner today and get it by Va...
8     Taiyo Yuden Unveils Latest in SAW Filter and D...
9     Taiyo Yuden New Noise Reducing Ferrite Bead Ch...
10    Lithium Ion Capacitors Are Ultimate Replacemen...
11                                 Art Wolfe Newsletter
12                          Art Wolfe Seminar Tour 2014
13                     Art Wolfe Spring 2014 Newsletter
14                    Day of the Dead Sale at Art Wolfe
...
8797625                                 Подписка на рассылку
8797626                                 Подписка на рассылку
8797627                             Ramadan Mubarak from MFP
8797628                   Ramadan Mubarak from Insaan Relief
8797629              UK Muslims! You have one new message...
8797630    Open House - 1249 Los Robles Place, Pomona CA ...
8797631    Open House - Custom Built Home by Conrad Buff ...
8797632    Open House - Custom built by Buff, Smith & Hen...
8797633    Open House - Custom Built Home by Conrad Buff ...
8797634    Open House - Custom Built Home by Conrad Buff ...
8797635    Open House - Custom Built Home by Conrad Buff ...
8797636    Open House - Buff, Smith & Hensman custom buil...
8797637    RAMADAN PROGRAMS: Dars-e-Qur'an in Rawalpindi ...
8797638               Dars-e-Qur'an by Shaykh Hammad Mahmood
8797639               Dars-e-Qur'an by Shaykh Hammad Mahmood
Name: subject, Length: 8797640, dtype: object
>>> counts
<8797640x1172387 sparse matrix of type '<type 'numpy.int64'>'
    with 62516240 stored elements in Compressed Sparse Column format>
>>> good
array([ True, False,  True, ..., False,  True,  True], dtype=bool)

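I don't know why this is happening. Last week I was able to get this working without pandas, but I have been trying to use a data frame to make some follow-up work easier.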

【Comments】:

Does it work if you change the line subjects=input.subject to subjects=input.subject.values?

Unfortunately, subjects=input.subjects.values did not help.

【Answer 1】:

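You need to pass the tf-idf matrix, not just the counts:

# Assumes tfidf_transformer is a fitted sklearn TfidfTransformer
# and that the classifier was trained on tf-idf features.
subcount=count_vectorizer.transform(["this is a test subject"])
tfidf = tfidf_transformer.transform(subcount)
classifier.predict(tfidf)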
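
The snippet above never defines tfidf_transformer. A minimal sketch of one way the full flow could look, assuming it is scikit-learn's TfidfTransformer fitted on the training counts and that the classifier is trained on the resulting tf-idf matrix (the training subjects and labels here are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for the real subject lines and labels.
train_subjects = [
    "Thanksgiving Dinner Delivered",
    "Need a Last Minute Christmas Gift?",
    "Art Wolfe Newsletter",
    "Open House this weekend",
]
train_good = [True, True, False, False]

count_vectorizer = CountVectorizer(strip_accents="unicode")
tfidf_transformer = TfidfTransformer()

# Fit both steps on the training data only.
train_counts = count_vectorizer.fit_transform(train_subjects)
train_tfidf = tfidf_transformer.fit_transform(train_counts)

classifier = MultinomialNB()
classifier.fit(train_tfidf, train_good)

# New text goes through the same two fitted steps via transform().
subcount = count_vectorizer.transform(["this is a test subject"])
tfidf = tfidf_transformer.transform(subcount)
prediction = classifier.predict(tfidf)
```

If the classifier was fitted on raw counts, as in the question, predicting on the counts alone is the consistent choice and the tf-idf step is optional.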

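【Comments】:

【Answer 2】:

I'm an idiot. I also needed to get the counts for the subject line I was trying to predict, so the ending should look more like this.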

subcount=count_vectorizer.transform(["this is a test subject"])
classifier.predict(subcount)

Hopefully people in the future will see this and not make the same mistake.
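
For anyone who wants to try the fix end to end, here is a minimal self-contained sketch. The subject lines and labels are made up to stand in for the CSV, and the names train_subjects, train_good and new_counts are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data standing in for subject_tool_input_500.csv.
train_subjects = [
    "Thanksgiving Dinner Delivered",
    "Need a Last Minute Christmas Gift?",
    "Art Wolfe Newsletter",
    "Open House this weekend",
]
train_good = [True, True, False, False]

# Learn the vocabulary and build the training count matrix.
count_vectorizer = CountVectorizer(strip_accents="unicode")
train_counts = count_vectorizer.fit_transform(train_subjects)

classifier = MultinomialNB()
classifier.fit(train_counts, train_good)

# transform() (not fit_transform) reuses the training vocabulary, so the
# new row has the same number of columns as train_counts.
new_counts = count_vectorizer.transform(["Christmas dinner gift"])
prediction = classifier.predict(new_counts)
```

Note that transform takes a list of strings even for a single subject line; the result is a one-row sparse matrix that predict accepts.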

【Comments】:
