Complement Naive Bayes and weighted classes in sklearn

Posted 2023-03-12

[Title] Complement Naive Bayes and weighted class in sklearn [Posted] 2021-03-11 16:35:57 [Question]:

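I am trying to implement a Complement Naive Bayes classifier with sklearn. My data has very imbalanced classes (30k samples of class 0 and 6k samples of class 1), and I am trying to compensate for this with class weights.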

The question originally included an image showing the shape of the dataset.

I tried computing the weights with the compute_class_weight function and then passing them to fit when training the model:

import numpy as np
import seaborn as sn
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.naive_bayes import ComplementNB

#Import the csv data
data = pd.read_csv('output_pt900.csv')

#Create the header of the csv file
header = []

for x in range(0,2500):
    header.append('pixel' + str(x))
header.append('status')

#Add the header to the csv data
data.columns = header

#Replace the b's and the f's in the status column by 0 and 1 
data['status'] = data['status'].replace('b',0)
data['status'] = data['status'].replace('f',1)

print(data)

#Drop the NaN values
data = data.dropna()

#Separate the features variables and the status
y = data['status']
x = data.drop('status',axis=1)

#Split the original dataset into two other: train and test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

all_together = y_train.to_numpy()
unique_classes = np.unique(all_together)

c_w = class_weight.compute_class_weight(class_weight='balanced', classes=unique_classes, y=all_together)

clf = ComplementNB()

clf.fit(x_train,y_train, c_w)

y_predict = clf.predict(x_test)

cm = confusion_matrix(y_test, y_predict)

svm = sn.heatmap(cm, cmap='Blues', annot=True, fmt='g')
figure=svm.get_figure()
figure.savefig('confusion_matrix_cnb.png', dpi=400)
plt.show()

But I got this error:

ValueError: sample_weight.shape == (2,), expected (29752,)!

Does anyone know how to use class weights with an sklearn model?

[Answer 1]:

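compute_class_weight returns an array whose length equals the number of unique classes, holding the weight to assign to the instances of each class. So with 2 unique classes, c_w has length 2 and contains the weights for the samples labeled 0 and 1.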
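To make the shape mismatch concrete, here is a minimal sketch that calls compute_class_weight on a label vector built from the approximate class counts given in the question (30k and 6k). The synthetic `y` is an assumption standing in for the real `status` column.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Approximate label counts from the question: 30k of class 0, 6k of class 1.
# This synthetic vector is a stand-in for the real 'status' column.
y = np.array([0] * 30000 + [1] * 6000)

c_w = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

# One entry per class, not one per sample.
print(c_w.shape)
# 'balanced' uses n_samples / (n_classes * class_count) for each class.
print(c_w)
```

The rarer class receives the larger weight, but the array still has only two entries, which is why fit rejects it as a sample_weight.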

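When you call fit on your model, the sample_weight parameter expects one weight per sample. That explains the error you received. To fix it, use the c_w returned by compute_class_weight to build an array of per-sample weights. You can do that with [c_w[i] for i in all_together], which works here because the labels 0 and 1 coincide with the indices of c_w. Your fit call ends up looking like: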

clf.fit(x_train, y_train, sample_weight=[c_w[i] for i in all_together])
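For reference, a self-contained sketch of the same idea on synthetic data. The feature matrix `X`, the label vector `y`, and the `weight_of` dictionary are hypothetical and only stand in for the pixel data from the question. The dictionary lookup is one way to avoid depending on the labels being exactly 0 and 1, and compute_sample_weight is shown as an sklearn helper that builds the per-sample array directly.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Synthetic stand-in for the pixel data (assumption): non-negative features,
# 300 samples of class 0 and 60 samples of class 1.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(360, 20))
y = np.array([0] * 300 + [1] * 60)

classes = np.unique(y)
c_w = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Map each label to its class weight. Going through a dict avoids relying on
# the labels happening to equal the positions 0 and 1 in c_w.
weight_of = dict(zip(classes, c_w))
sample_weight = np.array([weight_of[label] for label in y])

# compute_sample_weight builds an equivalent per-sample array in one call.
sample_weight_direct = compute_sample_weight("balanced", y)

clf = ComplementNB()
clf.fit(X, y, sample_weight=sample_weight)
y_predict = clf.predict(X)
```

Both arrays have one entry per training sample, which is the shape that fit checks for.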
