Difficulties to get the correct posterior value in a Naive Bayes implementation
Posted: 2021-02-20 23:51:36

For learning purposes, I am trying to implement this "course" using Python, but without using scikit-learn or similar libraries.
My attempt is the following code:
import pandas, math
training_data = [
['A great game','Sports'],
['The election was over','Not sports'],
['Very clean match','Sports'],
['A clean but forgettable game','Sports'],
['It was a close election','Not sports']
]
text_to_predict = 'A very close game'
data_frame = pandas.DataFrame(training_data, columns=['data','label'])
data_frame = data_frame.applymap(lambda s:s.lower() if type(s) == str else s)
text_to_predict = text_to_predict.lower()
labels = data_frame.label.unique()
word_frequency = data_frame.data.str.split(expand=True).stack().value_counts()
unique_words_set = set()
unique_words = data_frame.data.str.split().apply(unique_words_set.update)
total_unique_words = len(unique_words_set)
word_frequency_per_labels = []
for l in labels:
    word_frequency_per_label = data_frame[data_frame.label == l].data.str.split(expand=True).stack().value_counts()
    for w, f in word_frequency_per_label.iteritems():
        word_frequency_per_labels.append([w, f, l])
word_frequency_per_labels_df = pandas.DataFrame(word_frequency_per_labels, columns=['word','frequency','label'])
laplace_smoothing = 1
results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    results.append([l, math.prod(p)])
print(results)
result = pandas.DataFrame(results, columns=['labels','posterior']).sort_values('posterior',ascending = False).labels.iloc[0]
print(result)
In the blog course, their results are:

But my results are:
[['sports', 4.607999999999999e-05], ['not sports', 1.4293831139825827e-05]]
So, what am I doing wrong in my Python implementation? How can I get the same results?

Thanks in advance
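For reference, the Laplace-smoothed likelihood that the inner loop is meant to compute can be checked by hand for a single word. This is a quick sketch using only the five training sentences above (no pandas needed):

```python
# P(w | label) = (count of w in label + 1) / (total words in label + total unique words)
sports_text = "a great game very clean match a clean but forgettable game"
all_text = sports_text + " the election was over it was a close election"

sports_words = sports_text.split()          # 11 word occurrences in the Sports class
unique_words = set(all_text.split())        # 14 distinct words across both classes

p_game_sports = (sports_words.count("game") + 1) / (len(sports_words) + len(unique_words))
print(p_game_sports)  # (2 + 1) / (11 + 14) = 0.12
```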
【Answer 1】: You haven't multiplied by the priors p(Sports) = 3/5 and p(Not sports) = 2/5. Simply scale your answers by those ratios and you will get the correct result. Everything else looks fine.
So, for example, your math.prod(p) computation implements p(a|Sports) x p(very|Sports) x p(close|Sports) x p(game|Sports), but it omits the term p(Sports). Adding it (and doing the same for the Not sports case) fixes the problem.
In code this can be done as follows:
prior = (data_frame.label == l).mean()
results.append([l,prior*math.prod(p)])
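As a quick sanity check, using only the numbers already printed in the question: scaling the likelihood-only scores by these priors reproduces the kind of values the blog reports.

```python
# Likelihood-only scores, copied from the question's output:
likelihoods = {"sports": 4.607999999999999e-05, "not sports": 1.4293831139825827e-05}

# Priors: 3 of the 5 training sentences are Sports, 2 are Not sports.
priors = {"sports": 3 / 5, "not sports": 2 / 5}

posteriors = {label: priors[label] * p for label, p in likelihoods.items()}
print(posteriors)  # sports ~2.76e-05, not sports ~5.72e-06 -- Sports still wins, now by the correct margin
```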
【Answer 2】: @nick's answer is correct and should get the bounty.
Here is an alternative implementation (from scratch, without pandas) that also supports normalization of the probabilities and handles words that are not in the training set:
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set

def tokenize(text: str):
    return [word.lower() for word in text.split()]

def normalize(result: Dict[str, float]):
    total = sum(v for v in result.values())
    for k in result.keys():
        result[k] /= total

@dataclass
class Model:
    labels: Set[str] = field(default_factory=set)
    words: Set[str] = field(default_factory=set)
    prob_labels: Dict[str, float] = field(default_factory=lambda: defaultdict(float))  # P(label)
    prob_words: Dict[str, Dict[str, float]] = field(default_factory=lambda: defaultdict(lambda: defaultdict(float)))  # P(word | label) as prob_words[label][word]

    def predict(self, text: str, norm=True) -> Dict[str, float]:  # P(label | text) as model.predict(text)[label]
        result = {label: self.prob_labels[label] for label in self.labels}
        for word in tokenize(text):
            for label in self.labels:
                if word in self.words:
                    result[label] *= self.prob_words[label][word]
        if norm:
            normalize(result)
        return result

    def train(self, data):
        prob_words_denominator = defaultdict(int)
        for row in data:
            text = row[0]
            label = row[1].lower()
            self.labels.add(label)
            self.prob_labels[label] += 1.0
            for word in tokenize(text):
                self.words.add(word)
                self.prob_words[label][word] += 1.0
                prob_words_denominator[label] += 1.0
        for label in self.labels:
            self.prob_labels[label] /= len(data)
            for word in self.words:
                self.prob_words[label][word] = (self.prob_words[label][word] + 1.0) / (prob_words_denominator[label] + len(self.words))
training_data = [
['A great game','Sports'],
['The election was over','Not sports'],
['Very clean match','Sports'],
['A clean but forgettable game','Sports'],
['It was a close election','Not sports']
]
text_to_predict = 'A very close game'
model = Model()
model.train(training_data)
print(model.predict(text_to_predict, norm=False))
print(model.predict(text_to_predict))
print(model.predict("none of these words is in training data"))
Output:
{'sports': 2.7647999999999997e-05, 'not sports': 5.7175324559303314e-06}
{'sports': 0.8286395560004286, 'not sports': 0.1713604439995714}
{'sports': 0.6, 'not sports': 0.4}
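The second line of output is just the first line rescaled to sum to 1. A minimal check, using only the printed unnormalized values:

```python
# Unnormalized posteriors from the first line of output:
unnormalized = {"sports": 2.7647999999999997e-05, "not sports": 5.7175324559303314e-06}

# Normalize so the probabilities sum to 1, as normalize() does above.
total = sum(unnormalized.values())
normalized = {label: p / total for label, p in unnormalized.items()}
print(normalized)  # roughly {'sports': 0.8286, 'not sports': 0.1714}
```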
【Comments】:

Thanks @pietroppeter. Your example nicely shows how to structure things well with an sklearn-style API :)!