Errors on every single classifier with my data in Python (Naive Bayes, neural network, etc.)


【Title】: Errors on every single classifier with my data in Python (Naive Bayes, Neural Network, etc.) 【Posted】: 2021-11-29 08:50:02 【Question】:

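I seem to get an error with every classifier I use. I am using the Enron data (Enron 1 - 5) and trying to build a spam filter.

Let's take Naive Bayes as an example.

How I load the data: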

import os
import time

# Collect the ham/spam emails from the training set (Enron 1, 3 and 5)
ham_training_location = os.listdir("Data/training(environ1,3,5)/ham")
spam_training_location = os.listdir("Data/training(environ1,3,5)/spam")
training_data = []

start_timer = time.perf_counter()

# Read only the first 10 spam emails; `with` closes each file after reading
for path_of_file in spam_training_location[:10]:
    with open("Data/training(environ1,3,5)/spam/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
        training_data.append([file_to_open.read(), "spam"])

# Read only the first 10 ham emails
for path_of_file in ham_training_location[:10]:
    with open("Data/training(environ1,3,5)/ham/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
        training_data.append([file_to_open.read(), "ham"])

I then repeat the above for the test set.

# Collect the ham/spam emails from the testing set (Enron 2 and 4)
ham_testing_location = os.listdir("Data/testing(environ2,4)/ham")
spam_testing_location = os.listdir("Data/testing(environ2,4)/spam")
testing_data = []

# Read only the first 10 spam emails
for path_of_file in spam_testing_location[:10]:
    with open("Data/testing(environ2,4)/spam/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
        testing_data.append([file_to_open.read(), "spam"])

# Read only the first 10 ham emails
for path_of_file in ham_testing_location[:10]:
    with open("Data/testing(environ2,4)/ham/" + path_of_file, "r", encoding="Latin-1") as file_to_open:
        testing_data.append([file_to_open.read(), "ham"])

# start_timer was set before the first training loop, so this covers all four loops
end_timer = time.perf_counter()
print("Total file collection time in seconds: ", end_timer - start_timer)

I then convert them to NumPy arrays:

training_data = np.array(training_data)
testing_data = np.array(testing_data)

How I split the data and initialize the Naive Bayes classifier:

# The below code splits the training and testing data up, so we now have the feature and label for each
x_train = training_data[:, 0]  # training feature
y_train = training_data[:, 1]  # training label
x_test = testing_data[:, 0]  # testing feature
y_test = testing_data[:, 1]  # testing label


from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()  # suitable for numeric features
gnb.fit(x_train, np.ravel(y_train, order='C'))

The last line above raises the error:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I then followed the suggestion and reshaped:

x_train_nb = x_train.reshape(-1, 1)
y_train_nb = y_train.reshape(-1, 1)
x_test_nb = x_test.reshape(-1, 1)
y_test_nb = y_test.reshape(-1, 1)

Then I get another error:

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'

Other classifiers:

This happens with every other classifier I try. For example, if I use the same data with a neural network:

from keras.models import Sequential
from keras.layers import Dense, Input

model = Sequential()
model.add(Input(shape=(1,)))
model.add(Dense(16, input_dim=1, activation='relu', input_shape=(20,)))
model.add(Dense(12, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=10, batch_size=64)

I get the error:

ValueError: Shapes (None, 1) and (None, 2) are incompatible

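【Comments】:

Please do not include print statements without results, commented-out code, or code that raises unrelated errors here; all of this only creates unnecessary clutter. The code here should be minimal (edited).

【Answer 1】: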

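What is your X_train data? Is it a set of emails? What is your y_train? The main problem is that you have not converted the text data to a numeric format.

Assuming x_train is a set of emails in text form, you need to convert the text into vectors that an ML model can understand (for simplicity, use a count or TF-IDF vectorizer).

Convert y_train (ham/spam) to a numeric format, i.e. map "ham" to 0 and "spam" to 1.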

y_train = pd.Series(y_train).map(lambda x: 0 if x == "ham" else 1)
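The two steps above can be sketched roughly as follows, assuming scikit-learn is installed. The four toy emails and the variable names are made up for illustration and are not the Enron data. CountVectorizer turns each text into a row of word counts, and the labels are mapped to 0/1 before fitting. MultinomialNB is swapped in here for GaussianNB because it accepts the sparse count matrix directly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for x_train / y_train
x_train = np.array([
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "please review the attached report",
])
y_train = np.array(["spam", "spam", "ham", "ham"])

# Step 1: text -> numeric feature matrix (one column per vocabulary word)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(x_train)  # sparse matrix, shape (4, n_words)

# Step 2: string labels -> integers (ham = 0, spam = 1)
y_train_num = np.where(y_train == "ham", 0, 1)

clf = MultinomialNB()
clf.fit(X_train_vec, y_train_num)

# New text must go through the SAME fitted vectorizer (transform, not fit)
X_new = vectorizer.transform(["claim your free money"])
prediction = clf.predict(X_new)
print(X_train_vec.shape, y_train_num, prediction)
```

Note that the 1-D array of strings is passed to the vectorizer as-is; no reshape(-1, 1) is needed, because the vectorizer expects one document per element.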

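I don't know what the Enron data looks like, but here is a simple example of multi-class classification on the 20newsgroups dataset, which ships as a dataset in the sklearn library.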

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import naive_bayes

data = datasets.fetch_20newsgroups()
email=data.data

# Fit the TF-IDF vectorizer on the email text, then transform the text into vectors

tfidf=TfidfVectorizer()
tfidf.fit(email)
email_data = tfidf.transform(email)

# fit the transformed `email_data` to the naive_bayes classifier
# data.target is an array containing integers from `0-19` for each class of news group.

clf = naive_bayes.MultinomialNB()
clf.fit(email_data, data.target)

# Now test the classifier with a test sample.
test = ["Iam having a great day, my Ducati 6700, with 3-stroke engine is roaring."]
test_vector = tfidf.transform(test)
out = clf.predict(test_vector)
print(f'Out={out!r}')
print(f'Test sample belongs to Class={data.target_names[out[0]]}')

# Output
# Out=array([8])
# Test sample belongs to Class=rec.motorcycles

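I did not split the data into training and test sets; I was only trying to show what a basic implementation looks like. Also, you do not need 2 neurons in layers.Dense(2), because you are only trying to determine whether an email is ham or spam. So layers.Dense(1, activation="sigmoid") with loss="binary_crossentropy" is enough.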
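To see where "Shapes (None, 1) and (None, 2) are incompatible" comes from, it can help to look at the label shapes alone. The sketch below uses only NumPy, and the label array is made-up data. A 2-unit softmax output with categorical cross-entropy expects one-hot labels of shape (n, 2), while a single sigmoid unit with binary cross-entropy expects one 0/1 value per sample.

```python
import numpy as np

# Hypothetical labels, as they come out of the data-loading step
y_train = np.array(["spam", "ham", "ham", "spam"])

# Binary encoding: ham = 0, spam = 1 -> shape (4,)
y_binary = (y_train == "spam").astype("int64")

# Option A: Dense(1, activation="sigmoid") + binary cross-entropy
# wants one number per sample, so a column of shape (4, 1) works.
y_for_sigmoid = y_binary.reshape(-1, 1)

# Option B: Dense(2, activation="softmax") + categorical cross-entropy
# wants one column per class, i.e. one-hot rows of shape (4, 2).
y_for_softmax = np.eye(2, dtype="int64")[y_binary]

print(y_for_sigmoid.shape)  # (4, 1)
print(y_for_softmax.shape)  # (4, 2)
print(y_for_softmax)
```

Pairing a (n, 1) label array with a 2-unit softmax output is the kind of mismatch the error message describes; either option above makes the label shape and the output layer agree. The inputs still have to be numeric vectors as well, which is a separate fix.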

【Comments】:

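Thank you very much, this helped a lot. I used your code to convert y_test and y_train to binary. For TF-IDF, I fitted it on the training emails, but I also had to add .ravel() to the X_train variable, because it was (40, 1) and tfidf needs (40,). Once I did that, I printed the shape and got (28, 1885), which seems large (1885). Anyway, I defined the model with gnb = GaussianNB(), and when I tried to fit it with gnb.fit(X_train, np.ravel(y_train, order='C')) I got the error: A sparse matrix was passed, but dense data is required

Assuming your X_train with shape=(28, 1885) is the result of fitting the tfidf vectorizer, the way to convert the sparse array to a dense one is to use X_train.toarray(), and then fit GaussianNB().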
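A minimal sketch of that last suggestion, assuming scikit-learn is installed. The toy emails and labels are hypothetical and only stand in for the real training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy emails and already-encoded labels (ham = 0, spam = 1)
emails = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(emails)  # SciPy sparse matrix

# GaussianNB requires dense input, so convert before fitting
X_dense = X_sparse.toarray()

gnb = GaussianNB()
gnb.fit(X_dense, labels)

# Test data goes through the same fitted vectorizer, then the same conversion
X_new = tfidf.transform(["free money prize"]).toarray()
print(X_dense.shape, gnb.predict(X_new))
```

A dense array stores every zero explicitly, so memory use grows with the number of emails times the vocabulary size. For larger datasets, MultinomialNB may be the more practical choice because it accepts the sparse matrix as-is.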
