Train and Predict with a Naive Bayes Classifier in Just 50 Lines of Code
The Naive Bayes classifier is one of the most versatile machine learning algorithms that I have seen around during my meager experience as a graduate student, and I wanted to do a toy implementation for fun. At its core, the implementation is reduced to a form of counting, and the entire Python module, including a test harness, took only 50 lines of code. I haven’t really evaluated the performance, so I welcome any comments. I am a Python amateur, and am sure that experienced Python hackers can trim a few rough edges off this code.
Intuition and Design
Here is a definition of the classifier functionality (from Wikipedia):

classify(F1, ..., Fn) = argmax over Cj of [ p(Cj) * product over i of p(Fi | Cj) ]
This means that, for each possible class label, we multiply together the conditional probabilities of each feature given that label, and then multiply by the prior probability of the label p(Cj). So, to implement the classifier, all we need to do is compute these individual conditional probabilities p(Fi | Cj) for each feature and each label, multiply them together with the prior p(Cj), and return the label for which the product is largest.
In order to compute these individual conditional probabilities, we use the Maximum Likelihood Estimation (MLE) method. In short, we approximate these probabilities using counts from the input/training vectors.
Hence we have: p(Fi | Cj) = count(Fi ^ Cj) / count(Cj)
That is, we count from the training corpus, the ratio of the number of occurrences of the feature Fi and the label Cj together to the total number of occurrences of the label Cj.
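For instance (the numbers here are made up purely for illustration): if the label 'yes' occurs 9 times in the training data, and in 2 of those 9 vectors the feature outlook has the value 'sunny', then p(outlook = sunny | yes) = count(outlook = sunny ^ yes) / count(yes) = 2/9.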
Zero Probability Problem
What if a particular feature Fa and a particular label Cb never occur together in the training dataset? Whenever they occur together in the test data, p(Fa | Cb) will be zero, and hence the overall product will also be zero. This is a problem with maximum likelihood estimates: just because a particular observation was not made during training does not mean that it can never occur in the test data. In order to remedy this issue, we use what is known as smoothing. The simplest kind of smoothing, used in this code, is called “add one smoothing”. Essentially, the probability of an unseen event should be greater than zero, and we achieve this by adding one to each zero count. The net effect is that we redistribute some of the probability mass from the non-zero-count observations to the zero-count observations. Consequently, we also need to increase the total count for each label by the number of possible observations, in order to keep the total probability mass at 1.
For example, if we have two classes C = 0 and C = 1, then the smoothed MLE probabilities can be written as:
p-smoothed(Fi | Cj) = [count(Fi ^ Cj) + 1] / [count(Cj) + N], where N is the total number of possible feature values (possible observations) across all features.
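Continuing the made-up numbers from above: if count(outlook = overcast ^ no) = 0, count(no) = 5, and the features have N = 10 possible values in total, then p-smoothed(outlook = overcast | no) = (0 + 1) / (5 + 10) = 1/15, rather than the zero that the un-smoothed MLE would give.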
Code
For simplicity, we will use Weka’s ARFF file format as input. We have a single class called Model which has a few dictionaries and lists to store the counts and feature vector details. In this implementation, we only deal with discrete valued features.
from __future__ import division

import collections
import math

class Model:
    def __init__(self, arffFile):
        self.trainingFile = arffFile
        self.features = {}  #all feature names and their possible values (including the class label)
        self.featureNameList = []  #this is to maintain the order of features as in the arff
        self.featureCounts = collections.defaultdict(lambda: 1)  #keyed by tuples of the form (label, feature_name, feature_value)
        self.featureVectors = []  #contains all the values and the label as the last entry
        self.labelCounts = collections.defaultdict(lambda: 0)  #these will be smoothed later
The dictionary ‘features’ saves all possible values for each feature. ‘featureNameList‘ is simply a list that contains the names of the features in the same order that they appear in the ARFF file; this is needed because our features dictionary has no intrinsic order, so we maintain the feature order explicitly. ‘featureCounts‘ contains the actual counts of co-occurrence of each feature value with each label value. The keys of this dictionary are tuples of the form (class_label, feature_name, feature_value). Hence, if we have observed the feature F1 with the value ‘x’ for the label ‘yes’ fifteen times, the dictionary will contain the key (‘yes’, ‘F1’, ‘x’) with a count of 16 (fifteen observations plus the add-one baseline). Note how the default count in this dictionary is ‘1’ instead of ‘0’: this is because we are smoothing the counts. The ‘featureVectors‘ list contains all the input feature vectors from the ARFF file; the last entry in each vector is the class label itself, as is the convention with Weka ARFF files. Finally, ‘labelCounts‘ stores the counts of the class labels themselves, i.e. how many times we saw the label Cj during training.
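As a quick aside on the defaultdict trick, here is a standalone sketch (with made-up keys, not part of the module) showing the behaviour: any key we have never incremented silently reports a count of 1, which is exactly the add-one baseline.

import collections

featureCounts = collections.defaultdict(lambda: 1)
featureCounts[('yes', 'outlook', 'sunny')] += 1       # one observed co-occurrence: stored count becomes 2
print(featureCounts[('no', 'outlook', 'overcast')])   # never observed, yet reported as 1 rather than 0
                                                      # (note: merely reading a missing key also inserts it, which is harmless here)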
We also have the following member functions in the Model class:
    def GetValues(self):
        file = open(self.trainingFile, 'r')
        for line in file:
            if line[0] != '@':  #start of actual data
                self.featureVectors.append(line.strip().lower().split(','))
            else:  #feature definitions
                if line.strip().lower().find('@data') == -1 and (not line.lower().startswith('@relation')):
                    self.featureNameList.append(line.strip().split()[1])
                    self.features[self.featureNameList[len(self.featureNameList) - 1]] = line[line.find('{')+1: line.find('}')].strip().split(',')
        file.close()
The above method simply reads the feature names (including the class label), their possible values, and the feature vectors themselves, and populates the appropriate data structures defined above.
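As a quick illustration (not from the original post; the file name and its contents are assumptions for demonstration), suppose tiny.arff contains only the lines shown in the comments below, and the Model class above is already defined in the same session. After GetValues() the structures would hold roughly the following:

# tiny.arff (assumed contents):
#   @relation weather
#   @attribute outlook {sunny,overcast,rainy}
#   @attribute play {yes,no}
#   @data
#   sunny,no
#   overcast,yes
m = Model("tiny.arff")
m.GetValues()
print(m.featureNameList)   # ['outlook', 'play']
print(m.features)          # {'outlook': ['sunny', 'overcast', 'rainy'], 'play': ['yes', 'no']} (dict order may vary)
print(m.featureVectors)    # [['sunny', 'no'], ['overcast', 'yes']]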
    def TrainClassifier(self):
        for fv in self.featureVectors:
            self.labelCounts[fv[len(fv)-1]] += 1  #update count of the label
            for counter in range(0, len(fv)-1):
                self.featureCounts[(fv[len(fv)-1], self.featureNameList[counter], fv[counter])] += 1
        for label in self.labelCounts:  #increase label counts (smoothing). remember that the last feature is actually the label
            for feature in self.featureNameList[:len(self.featureNameList)-1]:
                self.labelCounts[label] += len(self.features[feature])
The TrainClassifier method simply counts the co-occurrences of each feature value with each class label, and stores them under keys of the 3-tuple form described above. These counts are automatically smoothed by add-one smoothing, since the default count in this dictionary is ‘1’. The label counts are also adjusted, by incrementing each label’s count by the total number of possible feature values, so that the smoothed probabilities still sum to one.
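To make the adjustment concrete (assumed feature definitions, not taken from the post): if the data set has four features with 3, 3, 2 and 2 possible values respectively, the two nested loops add 3 + 3 + 2 + 2 = 10 to every label’s count, which is exactly the N appearing in the smoothed denominator count(Cj) + N from earlier.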
    def Classify(self, featureVector):  #featureVector is a simple list like the ones that we use to train
        probabilityPerLabel = {}  #store the final probability for each class label
        for label in self.labelCounts:
            logProb = 0
            for counter, featureValue in enumerate(featureVector):  #use the position, not .index(), so repeated values still map to the right feature
                logProb += math.log(self.featureCounts[(label, self.featureNameList[counter], featureValue)]/self.labelCounts[label])
            probabilityPerLabel[label] = (self.labelCounts[label]/sum(self.labelCounts.values())) * math.exp(logProb)
        print probabilityPerLabel
        return max(probabilityPerLabel, key = lambda classLabel: probabilityPerLabel[classLabel])
Finally, we have the Classify method, which accepts a single feature vector (as a list) and computes, for each label, the product of the individual smoothed conditional probabilities multiplied by the prior probability of that label. The computed probability for each label is stored in the ‘probabilityPerLabel‘ dictionary, and in the last line we return the label with the highest probability. Note that the multiplication is actually done as addition in the log domain, since the numbers involved are extremely small.
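To see why the sum of logs is preferable to a direct product, here is a tiny standalone sketch (not from the original code; the probabilities are made up) showing how a long chain of small factors underflows to zero while the log-domain version survives:

import math

probs = [1e-5] * 100              # 100 tiny conditional probabilities (made-up values)

direct = 1.0
for p in probs:
    direct *= p                   # underflows to 0.0 long before the loop finishes

logSum = sum(math.log(p) for p in probs)
print(direct)                     # 0.0
print(logSum)                     # roughly -1151; still perfectly usable for comparing labels

The module does convert back out of the log domain with math.exp at the end, which works for short feature vectors like the tennis data; with many features it would be safer to compare log(prior) + logProb across labels directly.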
Here is the complete code, including a test method:
#Author: Krishnamurthy Koduvayur Viswanathan

from __future__ import division

import collections
import math

class Model:
    def __init__(self, arffFile):
        self.trainingFile = arffFile
        self.features = {}  #all feature names and their possible values (including the class label)
        self.featureNameList = []  #this is to maintain the order of features as in the arff
        self.featureCounts = collections.defaultdict(lambda: 1)  #keyed by tuples of the form (label, feature_name, feature_value)
        self.featureVectors = []  #contains all the values and the label as the last entry
        self.labelCounts = collections.defaultdict(lambda: 0)  #these will be smoothed later

    def TrainClassifier(self):
        for fv in self.featureVectors:
            self.labelCounts[fv[len(fv)-1]] += 1  #update count of the label
            for counter in range(0, len(fv)-1):
                self.featureCounts[(fv[len(fv)-1], self.featureNameList[counter], fv[counter])] += 1
        for label in self.labelCounts:  #increase label counts (smoothing). remember that the last feature is actually the label
            for feature in self.featureNameList[:len(self.featureNameList)-1]:
                self.labelCounts[label] += len(self.features[feature])

    def Classify(self, featureVector):  #featureVector is a simple list like the ones that we use to train
        probabilityPerLabel = {}
        for label in self.labelCounts:
            logProb = 0
            for counter, featureValue in enumerate(featureVector):  #use the position, not .index(), so repeated values still map to the right feature
                logProb += math.log(self.featureCounts[(label, self.featureNameList[counter], featureValue)]/self.labelCounts[label])
            probabilityPerLabel[label] = (self.labelCounts[label]/sum(self.labelCounts.values())) * math.exp(logProb)
        print probabilityPerLabel
        return max(probabilityPerLabel, key = lambda classLabel: probabilityPerLabel[classLabel])

    def GetValues(self):
        file = open(self.trainingFile, 'r')
        for line in file:
            if line[0] != '@':  #start of actual data
                self.featureVectors.append(line.strip().lower().split(','))
            else:  #feature definitions
                if line.strip().lower().find('@data') == -1 and (not line.lower().startswith('@relation')):
                    self.featureNameList.append(line.strip().split()[1])
                    self.features[self.featureNameList[len(self.featureNameList) - 1]] = line[line.find('{')+1: line.find('}')].strip().split(',')
        file.close()

    def TestClassifier(self, arffFile):
        file = open(arffFile, 'r')
        for line in file:
            if line[0] != '@':
                vector = line.strip().lower().split(',')
                print "classifier: " + self.Classify(vector) + " given " + vector[len(vector) - 1]

if __name__ == "__main__":
    model = Model("/home/tennis.arff")
    model.GetValues()
    model.TrainClassifier()
    model.TestClassifier("/home/tennis.arff")
Download the sample ARFF file to try it out.
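If you just want something to paste into a file, here is a rough sketch of what the tennis ARFF might look like (modeled on Weka's standard nominal weather data; abbreviated, and not guaranteed to match the linked file exactly). Save it somewhere and adjust the path in the __main__ block accordingly.

@relation tennis
@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {true,false}
@attribute play {yes,no}
@data
sunny,hot,high,false,no
sunny,hot,high,true,no
overcast,hot,high,false,yes
rainy,mild,high,false,yes
rainy,cool,normal,true,no
overcast,cool,normal,true,yes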
Update: I found a bug in the second-to-last line of the GetValues() function. This line reads the possible attribute values from the ARFF file and stores them in self.features, but it did not deal with whitespace correctly. Update this line to:
self.features[self.featureNameList[len(self.featureNameList) - 1]] = [featureName.strip() for featureName in line[line.find('{')+1: line.find('}')].strip().split(',')]