Semi-supervised Naive Bayes with NLTK [closed]

Posted: 2012-10-13 11:36:29

Question: I have built a semi-supervised version of NLTK's Naive Bayes in Python, based on the EM (expectation-maximization) algorithm. However, in some iterations of EM I am getting negative log-likelihoods (the log-likelihood of EM must be positive in every iteration), so I believe there must be some mistake in my code. After carefully reviewing it, I cannot see why this happens. It would be greatly appreciated if anyone could spot an error in the code below:

(Reference material of semi-supervised Naive Bayes)
Main loop of the EM algorithm
#initial assumptions:
#Bernoulli NB: only feature presence (value 1) or absence (value None) is computed
#initial data:
#C: classifier trained with labeled data
#labeled_data: an array of tuples (feature dic, label)
#features: dictionary that outputs feature dictionary for a given document id
for iteration in range(1, self.maxiter):

    #Expectation: compute probabilities for each class for each unlabeled document
    #An array of tuples (feature dictionary, probability dist) is built
    unlabeled_data = [(features[id], C.prob_classify(features[id])) for id in U]

    #Maximization: given the probability distributions of previous step,
    #update label, feature-label counts and update classifier C
    #gen_freqdists is a custom function, see below
    #gen_probdists is the original NLTK function
    l_freqdist_act, ft_freqdist_act, ft_values_act = self.gen_freqdists(labeled_data, unlabeled_data)
    l_probdist_act, ft_probdist_act = self.gen_probdists(l_freqdist_act, ft_freqdist_act, ft_values_act, ELEProbDist)
    C = nltk.NaiveBayesClassifier(l_probdist_act, ft_probdist_act)

    #Compute log-likelihood
    #NLTK Naive Bayes classifier prob_classify func gives logprob(class) + logprob(doc|class)
    #for labeled data, sum logprobs output by the classifier for the label
    #for unlabeled data, sum logprobs output by the classifier for each label
    log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic, label) in labeled_data])
    log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic, ignore) in unlabeled_data for label in l_freqdist_act.samples()])

    #Continue until convergence
    if log_lh_old == "first":
        if self.debug: print "\tM: #iteration 1", log_lh, "(FIRST)"
        log_lh_old = log_lh
    else:
        log_lh_diff = log_lh - log_lh_old
        if self.debug: print "\tM: #iteration", iteration, log_lh_old, "->", log_lh, "(", log_lh_diff, ")"
        if log_lh_diff < self.log_lh_diff_min: break
        log_lh_old = log_lh
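As a sanity check on the loop above: EM guarantees the log-likelihood is non-decreasing across iterations, but it is usually negative (each term is the log of a probability ≤ 1), so "positive" is not the right invariant to test. A minimal, NLTK-free sketch of such a check (the function name and the history values are hypothetical):

```python
def llh_non_decreasing(history, tol=1e-9):
    # EM's log-likelihood may well be negative, but it should never drop
    # (beyond numerical noise) between consecutive iterations.
    return all(b >= a - tol for a, b in zip(history, history[1:]))

print(llh_non_decreasing([-120.5, -98.2, -97.9, -97.9]))  # healthy run
print(llh_non_decreasing([-120.5, -98.2, -99.3]))         # a drop signals a bug
```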
Custom function gen_freqdists, used to create the needed frequency distributions
def gen_freqdists(self, instances_l, instances_ul):
    l_freqdist = FreqDist()              #frequency distrib. of labels
    ft_freqdist = defaultdict(FreqDist)  #dictionary of freq. distrib. for ft-label pairs
    ft_values = defaultdict(set)         #dictionary of possible values for each ft (only 1/None)
    fts = set()                          #set of all fts

    #counts for labeled data
    for (ftdic, label) in instances_l:
        l_freqdist.inc(label, 1)
        for f in ftdic.keys():
            fts.add(f)
            ft_freqdist[label, f].inc(1, 1)
            ft_values[f].add(1)

    #counts for unlabeled data
    #we must compute maximum a posteriori label estimate
    #and update label/ft occurrences accordingly
    for (ftdic, probs) in instances_ul:
        map_l = probs.max()        #label with highest probability
        map_p = probs.prob(map_l)  #probability of map_l
        l_freqdist.inc(map_l, count=map_p)
        for f in ftdic.keys():
            fts.add(f)
            ft_freqdist[map_l, f].inc(1, count=map_p)
            ft_values[f].add(1)

    #features not appearing in documents get implicit None values
    for l in l_freqdist.samples():
        num_samples = l_freqdist[l]
        for f in fts:
            count = ft_freqdist[l, f].N()
            ft_freqdist[l, f].inc(None, num_samples - count)
            ft_values[f].add(None)

    #return computed frequency distributions
    return l_freqdist, ft_freqdist, ft_values
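The counting scheme in gen_freqdists is effectively "hard EM": each unlabeled document contributes only its MAP label's posterior as a fractional count, rather than spreading its weight over all labels as full EM would. A small illustration of that difference, with made-up posterior values and no NLTK dependency:

```python
from collections import defaultdict

# hypothetical posteriors for two unlabeled documents
posteriors = [{'pos': 0.7, 'neg': 0.3}, {'pos': 0.4, 'neg': 0.6}]

hard = defaultdict(float)  # MAP-only counts, as in gen_freqdists above
soft = defaultdict(float)  # full-EM counts: weight spread over all labels
for probs in posteriors:
    map_l = max(probs, key=probs.get)  # analogue of probs.max()
    hard[map_l] += probs[map_l]        # only the MAP label gets weight
    for label, p in probs.items():
        soft[label] += p               # every label gets its posterior mass

print(dict(hard))  # hard counts: pos gets 0.7, neg gets 0.6
print(dict(soft))  # soft counts: the full 2.0 of mass is distributed
```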
Comments:
You don't actually need to compute the log-likelihood to test EM for convergence; you could also check whether the results from prob_classify, or even the probabilities inside the model, have stabilized.
Answer 1:
I think you are summing the wrong values.

This is the code where you should be computing the sum of log probabilities:
#Compute log-likelihood
#NLTK Naive Bayes classifier prob_classify func gives logprob(class) + logprob(doc|class)
#for labeled data, sum logprobs output by the classifier for the label
#for unlabeled data, sum logprobs output by the classifier for each label
log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic, label) in labeled_data])
log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic, ignore) in unlabeled_data for label in l_freqdist_act.samples()])
According to the NLTK documentation for prob_classify (on NaiveBayesClassifier), it returns a ProbDistI object (not logprob(class) + logprob(doc|class)). When you get that object, you are calling its prob method for the given label. You probably want to call logprob instead, and negate that return value.
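To see why summing prob values cannot give a log-likelihood, here is a minimal sketch with made-up posterior values (no NLTK needed): a sum of probabilities grows without bound and stays positive, while a true log-likelihood is a sum of non-positive terms. (NLTK's ProbDistI.logprob uses base-2 logs, but the sign argument is the same.)

```python
import math

# hypothetical posterior probabilities of the correct label for three documents
probs = [0.9, 0.8, 0.95]

wrong = sum(probs)                       # about 2.65: a sum of probabilities, not a likelihood
right = sum(math.log(p) for p in probs)  # about -0.38: a proper (negative) log-likelihood

print(wrong, right)
```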
Comments:
+1, but even if you call logprob, you still won't get the number the OP wants. The result of prob_classify does not represent P(class)*P(doc|class), but rather P(class)*P(doc|class)/sum(P(doc|class') for class' in classes). I guess it will converge to the same set of parameters, though.
@larsmans, good point. If you really need the LLH, you can estimate P(doc) = sum(P(doc|class)) from the data and multiply it back in. I suspect you are right that it won't have much effect on final convergence.
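The point about normalization can be sketched numerically: prob_classify returns a normalized posterior, so recovering the joint log-probability requires multiplying the evidence P(doc) back in. The joint scores below are made up purely for illustration:

```python
import math

# hypothetical unnormalized joint scores P(class) * P(doc|class)
joint = {'pos': 0.06, 'neg': 0.02}

evidence = sum(joint.values())                           # P(doc) = 0.08
posterior = {c: j / evidence for c, j in joint.items()}  # what prob_classify exposes

# log posterior + log evidence recovers the joint log-probability
recovered = math.log(posterior['pos']) + math.log(evidence)
print(abs(recovered - math.log(joint['pos'])) < 1e-12)   # the two agree
```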
Thanks @seggy and @larsmans! I had a mistake in the log-likelihood computation. Just one last thing to make sure: do you think the function gen_freqdists is OK? I think incrementing the label/feature-label frequency counts with the MAP class for the unlabeled data, and then updating the None values accordingly, should be enough, but I'm not completely sure about it.