Applying a Support Vector Machine (SVM) to Spam Email Classification

Posted 加载Python技能


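Learning approach: learning can be viewed as an input-method-output process. Typing the code out several times yourself is an effective way to learn; it is Learn by Doing in practice.
Goal: apply a linear support vector machine to spam email classification.
Environment: python3, jupyter notebook, numpy, scipy, sklearn, nltk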

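Algorithms

Gaussian kernel function

import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # Flatten both inputs so row and column vectors are handled alike
    x1 = x1.flatten()
    x2 = x2.flatten()

    # Similarity = exp(-||x1 - x2||^2 / (2 * sigma^2))
    sim = np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))
    return sim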
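
For reference, the function above evaluates the Gaussian (RBF) kernel. Written out as a formula, followed by a small worked example (the vectors and the value of sigma are illustrative choices, not values taken from this article's data):

```latex
K(x_1, x_2) = \exp\left( -\frac{\lVert x_1 - x_2 \rVert^2}{2\sigma^2} \right)

% Worked example with x_1 = (1, 2, 1), x_2 = (0, 4, -1), \sigma = 2:
\lVert x_1 - x_2 \rVert^2 = 1^2 + (-2)^2 + 2^2 = 9, \qquad
K = \exp\left( -\frac{9}{2 \cdot 2^2} \right) = \exp(-1.125) \approx 0.3247
```

Identical vectors give a similarity of 1, and the value decays toward 0 as the vectors move apart; a larger sigma makes that decay slower.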

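Processing the email

import re

import nltk
import numpy as np

def process_email(email_contents):
    # Map each vocabulary word to its 1-based index for fast lookup
    vocab_list = get_vocab_list()
    word_to_index = {word: index for index, word in vocab_list.items()}
    word_indices = np.array([], dtype=np.int64)

    # Normalize: lower-case, strip HTML tags, then replace numbers,
    # URLs, email addresses and dollar signs with placeholder tokens
    email_contents = email_contents.lower()
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)
    print('==== Processed Email ====')

    stemmer = nltk.stem.porter.PorterStemmer()
    # Split on punctuation and any whitespace (including newlines); the
    # hyphen is escaped so that '.-:' is not read as a character range
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]', email_contents)

    for token in tokens:
        # Drop remaining non-alphanumeric characters, then stem
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        token = stemmer.stem(token)
        if len(token) < 1:
            continue
        # Record the index only if the stemmed token is in the vocabulary
        if token in word_to_index:
            word_indices = np.append(word_indices, word_to_index[token])
        print(token)
    print('==================')
    return word_indices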
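
To see what the normalization step does before stemming, here is a minimal sketch that applies the same substitutions to a single made-up sentence. The helper name `normalize_email` and the sample text are assumptions for illustration only; the sketch uses just the standard library, so it runs without nltk or vocab.txt.

```python
import re

def normalize_email(text):
    """Apply the placeholder substitutions used in process_email (illustrative helper)."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                   # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)                # digits -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)
    text = re.sub(r'[$]+', 'dollar', text)
    return text

# Made-up sample text, not taken from emailSample1.txt
sample = 'Visit http://example.com NOW and win $100! Mail me@example.com'
print(normalize_email(sample))
# visit httpaddr now and win dollarnumber! mail emailaddr
```

Replacing every URL, address, and amount with one shared token lets the classifier learn from the presence of such items rather than from their exact values, which rarely repeat across emails.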

def get_vocab_list():
    # Each line of vocab.txt holds "<index> <word>"; return {index: word}
    vocab_dict = {}
    with open('vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocab_dict[int(val)] = key
    return vocab_dict
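
The vocabulary is stored as index to word, while the email processing needs the opposite direction. The sketch below parses a three-line in-memory stand-in for vocab.txt and builds the reverse mapping; `parse_vocab`, `demo_file`, and the three entries are assumptions for illustration, not the real file contents.

```python
import io

def parse_vocab(lines):
    """Parse '<index> <word>' lines into {index: word} (illustrative helper)."""
    vocab = {}
    for line in lines:
        index, word = line.split()
        vocab[int(index)] = word
    return vocab

# Hypothetical excerpt standing in for vocab.txt
demo_file = io.StringIO('1\taa\n2\tab\n3\tabil\n')
vocab = parse_vocab(demo_file)

# Reverse mapping: one dictionary lookup per token instead of a full scan
word_to_index = {word: index for index, word in vocab.items()}

print(vocab[3])                   # abil
print(word_to_index.get('ab'))    # 2
print(word_to_index.get('zzz'))   # None
```

With a vocabulary of a couple of thousand words, the reverse dictionary avoids scanning every entry for each token in the email.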

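Email feature vector

def email_features(word_indices):
    # One feature per vocabulary word, so the vector lines up with the
    # columns of the training matrix
    n = len(get_vocab_list())
    features = np.zeros(n)
    # word_indices are 1-based; shift them to 0-based array positions
    features[word_indices - 1] = 1
    return features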
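
The feature vector is a simple binary bag of words: position j is 1 if vocabulary word j + 1 appears in the email and 0 otherwise. A minimal pure-Python sketch of that mapping follows, assuming a hypothetical vocabulary of six words; the name `email_features_list` and the indices are made up for illustration.

```python
def email_features_list(word_indices, vocab_size):
    """Build a 0/1 vector from 1-based word indices (illustrative helper)."""
    features = [0] * vocab_size
    for idx in word_indices:
        features[idx - 1] = 1   # 1-based vocabulary index -> 0-based position
    return features

# Words 2 and 5 occur in the email; word 2 occurs twice but is still marked once
features_demo = email_features_list([2, 5, 2], vocab_size=6)
print(features_demo)        # [0, 1, 0, 0, 1, 0]
print(sum(features_demo))   # 2
```

Because the vector records presence only, repeated words do not raise the value above 1.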

Step-by-step walkthrough

Step 1: process the email

with open('emailSample1.txt', 'r') as f:
    file_contents = f.read()
word_indices = process_email(file_contents)

Step 2: build the feature vector

features = email_features(word_indices)

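Step 3: train a linear SVM spam classifier

import scipy.io as scio
from sklearn import svm

data = scio.loadmat('spamTrain.mat')
X = data['X']
y = data['y'].flatten()
c = 0.1
# C must be passed by keyword in current scikit-learn releases
clf = svm.SVC(C=c, kernel='linear')
clf.fit(X, y)
p = clf.predict(X)
print('Training accuracy: {:.2f}%'.format(np.mean(p == y) * 100))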

Step 4: evaluate on the test set

data = scio.loadmat('spamTest.mat')
Xtest = data['Xtest']
ytest = data['ytest'].flatten()
p = clf.predict(Xtest)
print('Test accuracy: {:.2f}%'.format(np.mean(p == ytest) * 100))
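
The predictions only become informative once they are compared with the true labels. As a rough sketch of that comparison, accuracy is the fraction of positions where prediction and label agree. The helper `accuracy` and the five demo labels are assumptions for illustration, not results from spamTest.mat.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels (illustrative helper)."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Made-up labels: 1 = spam, 0 = not spam
p_demo = [1, 0, 0, 1, 0]
y_demo = [1, 0, 1, 1, 0]
print('Accuracy: {:.2f}%'.format(accuracy(p_demo, y_demo) * 100))
# Accuracy: 80.00%
```

With NumPy arrays, `np.mean(p == ytest)` computes the same quantity in one expression.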

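Step 5: print the words most indicative of spam

vocab_list = get_vocab_list()
# Sort feature columns by weight, largest (most spam-like) first
weights = clf.coef_.flatten()
indices = np.argsort(weights)[::-1]
print(indices)

for i in range(15):
    # Column j of X corresponds to vocabulary key j + 1
    word = vocab_list[indices[i] + 1]
    print('{} ({:0.6f})'.format(word, weights[indices[i]]))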
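
In a linear SVM each feature has one weight, and a large positive weight pushes an email toward the spam class. The sketch below ranks a hypothetical four-word vocabulary by made-up weights to show the index bookkeeping: weight positions are 0-based, vocabulary keys are 1-based. The names `top_words`, `demo_vocab`, and `demo_weights` are assumptions, not output of the trained model.

```python
def top_words(weights, vocab, k):
    """Return the k (word, weight) pairs with the largest weights."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    # Column j (0-based) corresponds to vocabulary key j + 1 (1-based)
    return [(vocab[j + 1], weights[j]) for j in order[:k]]

# Hypothetical vocabulary and weights, for illustration only
demo_vocab = {1: 'click', 2: 'meeting', 3: 'free', 4: 'thanks'}
demo_weights = [0.42, -0.31, 0.57, -0.08]

for word, weight in top_words(demo_weights, demo_vocab, k=2):
    print('{} ({:0.6f})'.format(word, weight))
# free (0.570000)
# click (0.420000)
```

Forgetting the offset of one would shift every printed word to its neighbour in the vocabulary, which is easy to miss because the output still looks plausible.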

