ValueError: np.nan is an invalid document, expected byte or unicode string
Posted: 2019-08-28 01:05:58
[Question] I am trying to do sentiment analysis on Uber reviews. I am using a naive Bayes classifier from sklearn, with review data from Kaggle, but the test data is in an xlsx sheet, so I used pandas to create a dataframe:
import pandas as pd
test=pd.read_excel("uber.xlsx",sep="\t",encoding="ISO-8859-1");
test.head(3)
Since it returned a dtype: object column, I used this to convert it to a list:
test_text = []
for comments in comments_t:
    test_text.append(comments)
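A more compact way to build that list, and one that also skips any empty cells pandas reads in as NaN, would be something like the following sketch (the column name "comments" is an assumption, since the sheet's actual header is not shown in the question):
# "comments" is an assumed column name; use the sheet's actual review column
test_text = test["comments"].dropna().astype(str).tolist()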
My code for classifying the text against the training data:
# Training Phase
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB().fit(train_documents,labels)
def sentiment(word):
    return classifier.predict(count_vectorizer.transform([word]))
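Note that count_vectorizer is not defined in the snippet above; presumably it was fitted on the raw training texts earlier, roughly like this sketch (the variable name raw_train_texts is an assumption, the other names are reused from the question):
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
# raw_train_texts is an assumed name for the list of training review strings
train_documents = count_vectorizer.fit_transform(raw_train_texts)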
But when predicting, it returns this ValueError:
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
1084
1085 # use the same matrix-building strategy as fit_transform
-> 1086 _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1087 if self.binary:
1088 X.data.fill(1)
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
940 for doc in raw_documents:
941 feature_counter = {}
--> 942 for feature in analyze(doc):
943 try:
944 feature_idx = vocabulary[feature]
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
326 tokenize)
327 return lambda doc: self._word_ngrams(
--> 328 tokenize(preprocess(self.decode(doc))), stop_words)
329
330 else:
/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in decode(self, doc)
141
142 if doc is np.nan:
--> 143 raise ValueError("np.nan is an invalid document, expected byte or "
144 "unicode string.")
145
ValueError: np.nan is an invalid document, expected byte or unicode string.
I tried to follow this to solve it:
https://***.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document
[Answer 1] The Uber data I found on Kaggle is https://www.kaggle.com/purvank/uber-rider-reviews-dataset/downloads/Uber_Ride_Reviews.csv/2
Now, to address your problem:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
df = pd.read_csv('Uber_Ride_Reviews.csv')
df.head()
Out[7]:
ride_review ... sentiment
0 I completed running New York Marathon requeste... ... 0
1 My appointment time auto repairs required earl... ... 0
2 Whether I using Uber ride service Uber Eats or... ... 0
3 Why hard understand I trying retrieve Uber cab... ... 0
4 I South Beach FL I staying major hotel ordered... ... 0
df.columns
Out[8]: Index(['ride_review', 'ride_rating', 'sentiment'], dtype='object')
vect = CountVectorizer()
vect1 = vect.fit_transform(df['ride_review'])
classifier = BernoulliNB()
classifier.fit(vect1,df['sentiment'])
# predicting a new comment gives the output below
new_test_= vect.transform(['uber ride is very good'])
classifier.predict(new_test_)
Out[5]: array([0], dtype=int64)
# but when calling your sentiment function you are only passing the word; you also
# need to pass the classifier and the CountVectorizer instance
def sentiment(word, classifier, vect):
    return classifier.predict(vect.transform([word]))
# calling the above function on a new comment
sentiment('uber ride is very good', classifier, vect)
Out[10]: array([0], dtype=int64)
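Tying this back to the original xlsx test data: the ValueError is most likely raised because the review column contains empty cells, which pandas reads as NaN, so they have to be dropped (or filled) before calling transform. A rough sketch using the objects fitted above, again assuming the review column is called "comments":
test = pd.read_excel("uber.xlsx")
# drop the NaN cells that trigger "np.nan is an invalid document"
clean_reviews = test["comments"].dropna().astype(str)
predictions = classifier.predict(vect.transform(clean_reviews))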
[Comments]:
Finally, an answer without the vectorizer in a pipeline ;), thanks!