How to apply tf-idf to whole dataset (training and testing dataset) instead of only training dataset within naive bayes classifier class?
【Posted】: 2019-10-21 22:31:02 【Question】: I have a naive Bayes classifier class that classifies mails as spam or ham, and tf-idf is already implemented inside it. However, the tf-idf part is computed only on the training dataset.
Here is the classifier class:

from math import log

class SpamClassifier(object):
    def __init__(self, traindata):
        self.mails, self.labels = traindata['Review'], traindata['Polarity']

    def train(self):
        self.calc_TF_and_IDF()
        self.calc_TF_IDF()

    def calc_TF_and_IDF(self):
        noOfMessages = self.mails.shape[0]
        self.spam_mails, self.ham_mails = self.labels.value_counts()[1], self.labels.value_counts()[0]
        self.total_mails = self.spam_mails + self.ham_mails
        self.spam_words = 0
        self.ham_words = 0
        self.tf_spam = dict()
        self.tf_ham = dict()
        self.idf_spam = dict()
        self.idf_ham = dict()
        for i in range(noOfMessages):
            message = self.mails[i]
            count = list()  # Tracks which words have occurred in this message (for IDF).
            for word in message:
                if self.labels[i]:
                    self.tf_spam[word] = self.tf_spam.get(word, 0) + 1
                    self.spam_words += 1
                else:
                    self.tf_ham[word] = self.tf_ham.get(word, 0) + 1
                    self.ham_words += 1
                if word not in count:
                    count += [word]
            for word in count:
                if self.labels[i]:
                    self.idf_spam[word] = self.idf_spam.get(word, 0) + 1
                else:
                    self.idf_ham[word] = self.idf_ham.get(word, 0) + 1

    def calc_TF_IDF(self):
        self.prob_spam = dict()
        self.prob_ham = dict()
        self.sum_tf_idf_spam = 0
        self.sum_tf_idf_ham = 0
        for word in self.tf_spam:
            self.prob_spam[word] = (self.tf_spam[word]) * log((self.spam_mails + self.ham_mails) \
                                   / (self.idf_spam[word] + self.idf_ham.get(word, 0)))
            self.sum_tf_idf_spam += self.prob_spam[word]
        for word in self.tf_spam:
            self.prob_spam[word] = (self.prob_spam[word] + 1) / (self.sum_tf_idf_spam + len(list(self.prob_spam.keys())))
        for word in self.tf_ham:
            self.prob_ham[word] = (self.tf_ham[word]) * log((self.spam_mails + self.ham_mails) \
                                  / (self.idf_spam.get(word, 0) + self.idf_ham[word]))
            self.sum_tf_idf_ham += self.prob_ham[word]
        for word in self.tf_ham:
            self.prob_ham[word] = (self.prob_ham[word] + 1) / (self.sum_tf_idf_ham + len(list(self.prob_ham.keys())))
        self.prob_spam_mail, self.prob_ham_mail = self.spam_mails / self.total_mails, self.ham_mails / self.total_mails

    def classify(self, processed_message):
        pSpam, pHam = 0, 0
        for word in processed_message:
            if word in self.prob_spam:
                pSpam += log(self.prob_spam[word])
            else:
                pSpam -= log(self.sum_tf_idf_spam + len(list(self.prob_spam.keys())))
            if word in self.prob_ham:
                pHam += log(self.prob_ham[word])
            else:
                pHam -= log(self.sum_tf_idf_ham + len(list(self.prob_ham.keys())))
        pSpam += log(self.prob_spam_mail)
        pHam += log(self.prob_ham_mail)
        return pSpam >= pHam

    def predict(self, testdata):
        result = []
        for (i, message) in enumerate(testdata):
            #processed_message = process_message(message)
            result.append(int(self.classify(message)))
        return result
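The weighting inside calc_TF_IDF is the classic tf × log(N / df) scheme followed by Laplace smoothing. A standalone sketch of that arithmetic on made-up counts (all names and numbers below are illustrative, not taken from the question's data):

```python
from math import log

# Toy statistics: term frequency of each word in spam mails,
# and document frequency (number of mails containing the word).
total_mails = 4
tf_spam = {"free": 3, "money": 1}
df = {"free": 2, "money": 1}

# tf-idf weight per word, mirroring prob_spam before smoothing:
weights = {w: tf_spam[w] * log(total_mails / df[w]) for w in tf_spam}

# Laplace-smoothed "probability" as used by the classifier:
total = sum(weights.values())
probs = {w: (weights[w] + 1) / (total + len(weights)) for w in weights}
print(probs)
```

Because every smoothed value shares the same denominator, the results sum to 1 and can be treated as per-class word probabilities in classify.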
This is how I call the classifier:
sc_tf_idf = SpamClassifier(traindata)
sc_tf_idf.train()
preds_tf_idf = sc_tf_idf.predict(testdata['Review'])
testdata['Predictions'] = preds_tf_idf
print(testdata['Polarity'], testdata['Predictions'])
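To turn the two printed columns into a single score, the comparison can be reduced to an accuracy figure (a sketch with hypothetical 0/1 label lists, not the question's actual data):

```python
# Hypothetical true labels and predictions for five test mails.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]

# Fraction of positions where prediction matches the true label.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # → 0.8
```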
How can I apply the tf-idf computation inside the classifier to the whole dataset (both the training and the testing dataset)?
【Comments】:
【Answer 1】: You should not compute tf-idf over the training and test data together. The dataset should first be split into training and test (and validation) sets, and then tf-idf computed for each set separately. If tf-idf is computed before the split, the model learns some "features" of the test/validation data and reports overly optimistic performance. See the detailed answer here.
Additionally, you can use TfidfVectorizer from sklearn.
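A minimal sketch of that approach, using toy data (the texts and labels are invented for illustration): the vectorizer is fitted on the training texts only, and the same fitted vocabulary/IDF is then reused to transform the test texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["free money now", "meeting at noon", "win free prize", "lunch tomorrow"]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham
test_texts = ["free prize inside", "noon meeting moved"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF from train only
X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary/IDF

clf = MultinomialNB()
clf.fit(X_train, train_labels)
preds = clf.predict(X_test)
print(preds)
```

Words unseen during fitting (such as "inside" above) are simply ignored at transform time, which is the behaviour that prevents test-set information from leaking into the features.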
【Discussion】: