Errors on every single classifier with my data in Python (Naive Bayes, Neural Network, etc.)
【Title】: Errors on every single classifier with my data in Python (Naive Bayes, Neural Network, etc.) 【Posted】: 2021-11-29 08:50:02 【Question】: I seem to get an error on every classifier I use. I am using the Enron data (Enron 1 - 5) and trying to create a spam filter.
Let's take Naive Bayes as an example.
How I get the data:
import os
import time

# Collect the ham/spam emails from the training set (environ 1, 3 and 5)
ham_training_location = os.listdir("Data/training(environ1,3,5)/ham")
spam_training_location = os.listdir("Data/training(environ1,3,5)/spam")
training_data = []
counter_three = 0
start_timer = time.perf_counter()
for path_of_file in spam_training_location:
    if counter_three < 10:  # only read the first 10 spam files
        with open("Data/training(environ1,3,5)/spam/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
            text = file_to_open.read()
        training_data.append([text, "spam"])
        counter_three = counter_three + 1
counter_four = 0
for path_of_file in ham_training_location:
    if counter_four < 10:  # only read the first 10 ham files
        with open("Data/training(environ1,3,5)/ham/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
            file_text = file_to_open.read()
        training_data.append([file_text, "ham"])
        counter_four = counter_four + 1
I then repeat the above for the testing set.
ham_testing_location = os.listdir("Data/testing(environ2,4)/ham")
spam_testing_location = os.listdir("Data/testing(environ2,4)/spam")
testing_data = []
counter_one = 0
for path_of_file in spam_testing_location:
    if counter_one < 10:  # only read the first 10 spam files
        with open("Data/testing(environ2,4)/spam/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
            text = file_to_open.read()
        testing_data.append([text, "spam"])
        counter_one = counter_one + 1
counter_two = 0
for path_of_file in ham_testing_location:
    if counter_two < 10:  # only read the first 10 ham files
        with open("Data/testing(environ2,4)/ham/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
            file_text = file_to_open.read()
        testing_data.append([file_text, "ham"])
        counter_two = counter_two + 1
# Stop the timer that was started before the first loop
end_timer = time.perf_counter()
print("Ham file (test) collection time in seconds: ", end_timer - start_timer)
I then convert them to numpy arrays:
training_data = np.array(training_data)
testing_data = np.array(testing_data)
How I split the data and initialise the Naive Bayes classifier:
# The below code splits the training and testing data up, so we now have the feature and label for each
x_train = training_data[:, 0] # training feature
y_train = training_data[:, 1] # training label
x_test = testing_data[:, 0] # testing feature
y_test = testing_data[:, 1] # testing label
gnb = GaussianNB() # suitable for numeric features
gnb.fit(x_train, np.ravel(y_train, order='C'))
The last line above gives the error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I then followed the advice and reshaped:
x_train_nb = x_train.reshape(-1, 1)
y_train_nb = y_train.reshape(-1, 1)
x_test_nb = x_test.reshape(-1, 1)
y_test_nb = y_test.reshape(-1, 1)
Then I get another error:
ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
Other classifiers:
This happens with every other classifier I try. For example, if I use the same data with a neural network:
from keras.models import Sequential
from keras.layers import Dense, Input
model = Sequential()
model.add(Input(shape=(1,)))
model.add(Dense(16, input_dim=1, activation='relu', input_shape=(20,)))
model.add(Dense(12, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=10, batch_size=64)
I get the error:
ValueError: Shapes (None, 1) and (None, 2) are incompatible
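【Comments】:
Please do not include print statements without their results, commented-out code, or code with errors here; all of these just create unnecessary clutter. The code here should be minimal (edited).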
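【Answer 1】:
What is your X_train data? Is it a set of emails? What is your y_train? The main problem is that you have not converted the text data into a numeric format.
Assuming x_train is a set of emails in text format, you need to convert the text into vectors that an ML model can understand. (For simplicity, use a count or TF-IDF vectorizer.)
Convert y_train (ham or spam) to a numeric format, i.e. map "ham" to 0 and "spam" to 1.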
y_train = pd.Series(y_train).map(lambda x: 0 if x == "ham" else 1)
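The two steps above (vectorize the text, map the labels to integers) can be sketched on a tiny stand-in for the question's `[text, label]` arrays. The example emails are invented for illustration and are not from the Enron data; `MultinomialNB` is used here because it accepts the sparse matrix that `TfidfVectorizer` returns. Treat this as a minimal sketch rather than a tuned spam filter.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the [text, label] pairs built in the question.
training_data = np.array([
    ["win a free prize now, click here", "spam"],
    ["cheap pills, limited offer, click now", "spam"],
    ["meeting moved to monday at ten", "ham"],
    ["please review the attached gas contract", "ham"],
])
testing_data = np.array([
    ["click here for a free offer", "spam"],
    ["can we review the contract on monday", "ham"],
])

x_train, y_train = training_data[:, 0], training_data[:, 1]
x_test, y_test = testing_data[:, 0], testing_data[:, 1]

# Map the string labels to integers: ham -> 0, spam -> 1.
y_train_num = np.where(y_train == "spam", 1, 0)
y_test_num = np.where(y_test == "spam", 1, 0)

# Learn the vocabulary on the training emails only,
# then reuse the same vocabulary for the test emails.
vectorizer = TfidfVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)  # sparse, one row per email
x_test_vec = vectorizer.transform(x_test)        # same number of columns

clf = MultinomialNB()
clf.fit(x_train_vec, y_train_num)
predictions = clf.predict(x_test_vec)

print("predicted:", predictions)
print("expected: ", y_test_num)
```

Note that the vectorizer takes a 1-D sequence of strings, so the text column is passed as-is and is not reshaped to `(-1, 1)`.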
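I don't know what the Enron data looks like, but here is a simple example of multi-class classification on the 20newsgroups dataset, which ships as a dataset in the sklearn library.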
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import naive_bayes

data = datasets.fetch_20newsgroups()
email = data.data

# Fit the tfidf vectorizer on the email text, then transform the text into vectors
tfidf = TfidfVectorizer()
tfidf.fit(email)
email_data = tfidf.transform(email)

# Fit the naive_bayes classifier on the transformed `email_data`
# data.target is an array of integers from `0-19`, one per news group class.
clf = naive_bayes.MultinomialNB()
clf.fit(email_data, data.target)

# Now test the classifier with a test sample.
test = ["I am having a great day, my Ducati 6700, with 3-stroke engine is roaring."]
test_vector = tfidf.transform(test)
out = clf.predict(test_vector)
print(f'Out={out}')
print(f'Test sample belongs to Class={data.target_names[out[0]]}')
# Output
Out=array([8])
Test sample belongs to Class=rec.motorcycles
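I did not split the data into training and testing sets; I am only trying to show you what a basic implementation looks like. Also, you don't need 2 neurons in layers.Dense(2), because you are only trying to decide whether an email is ham or spam. So layers.Dense(1, activation="sigmoid") and loss="binary_crossentropy" are enough.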
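To illustrate why a single sigmoid unit is enough, and why a two-unit softmax output does not line up with a single 0/1 label column, here is a small NumPy sketch of the two loss computations. It does not use Keras; the probabilities in `p_spam` are made-up values standing in for model outputs, and all variable names are chosen for this illustration.

```python
import numpy as np

y = np.array([0, 1, 1, 0])               # ham -> 0, spam -> 1, shape (4,)
p_spam = np.array([0.1, 0.8, 0.6, 0.3])  # hypothetical sigmoid outputs, shape (4,)

# Binary cross-entropy: one probability per email,
# so the labels can stay a single 0/1 column.
bce = -np.mean(y * np.log(p_spam) + (1 - y) * np.log(1 - p_spam))

# Categorical cross-entropy: two probabilities per email (ham, spam),
# so the labels must be one-hot encoded to shape (4, 2) to match.
y_one_hot = np.eye(2)[y]
p_two_col = np.column_stack([1 - p_spam, p_spam])
cce = -np.mean(np.sum(y_one_hot * np.log(p_two_col), axis=1))

print("label shapes:", y.shape, "vs one-hot", y_one_hot.shape)
print("losses agree:", np.isclose(bce, cce))
```

For two classes the two formulations give the same loss value, so the single-output version loses nothing; the two-output version only works if the labels are one-hot encoded to two columns first.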
【Discussion】:
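Thank you very much, this helped a lot. I used your code to change y_test and y_train to binary. For tf-idf, I fitted it on the training emails, but I also had to add .ravel() to the X_train variable because it was (40, 1) and tfidf expects (40,). Once I did that, I printed the shape and got (28, 1885), which looks large (1885 columns). Anyway, I defined the model with gnb = GaussianNB(), and when I tried to fit it with gnb.fit(X_train, np.ravel(y_train, order='C')) I got the error message: A sparse matrix was passed, but dense data is required.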
Assuming your X_train with shape=(28, 1885) is the result of applying the tfidf vectorizer, the way to convert the sparse array to a dense one is to use X_train.toarray(), and then fit GaussianNB().
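The sparse-to-dense conversion can be sketched as follows. The four emails and their labels are invented for illustration, so the column count here is tiny compared with the (28, 1885) matrix mentioned above; the number of columns is simply the size of the vocabulary the vectorizer learned.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical emails and labels (ham -> 0, spam -> 1).
emails = np.array([
    "win a free prize now, click here",
    "cheap pills, limited offer, click now",
    "meeting moved to monday at ten",
    "please review the attached gas contract",
])
labels = np.array([1, 1, 0, 0])

# TfidfVectorizer returns a SciPy sparse matrix.
X_sparse = TfidfVectorizer().fit_transform(emails)

# GaussianNB requires dense input, so convert to a regular NumPy array first.
X_dense = X_sparse.toarray()

gnb = GaussianNB()
gnb.fit(X_dense, labels)

print("sparse input:", sparse.issparse(X_sparse))
print("dense shape: ", X_dense.shape)
```

A dense array stores every zero explicitly, so on a large corpus this conversion can use a lot of memory; a classifier that accepts sparse input, such as `MultinomialNB`, avoids the conversion entirely.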