Spam Classification with a Support Vector Machine (SVM)
Posted by 加载Python技能
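Learning approach: learning can be viewed as an input-method-output process. Typing the code out several times is an effective way to learn, and it puts Learn by Doing into practice.
Goal: apply a linear support vector machine to spam classification.
Environment: python3, jupyter notebook, numpy, scipy, sklearn, nltk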
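Algorithm
Gaussian kernel function
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # Treat both inputs as flat vectors
    x1 = x1.flatten()
    x2 = x2.flatten()
    # Similarity: exp(-||x1 - x2||^2 / (2 * sigma^2))
    sim = np.exp(np.sum((x1 - x2) ** 2) / (-2 * sigma ** 2))
    return sim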
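As a quick sanity check of the kernel, here is a minimal sketch that evaluates it on two small vectors. The sample vectors and the value of `sigma` are illustrative assumptions, not part of the original post. The last line shows the `gamma` value that would correspond to the same `sigma` if scikit-learn's built-in `rbf` kernel were used, since that kernel is defined as exp(-gamma * ||x - x'||^2).

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    """Return exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    diff = np.ravel(x1) - np.ravel(x2)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

# Illustrative inputs (assumed for this example)
x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2.0

# Squared distance is 1 + 4 + 4 = 9, so the result is exp(-9 / 8)
sim = gaussian_kernel(x1, x2, sigma)
print(round(float(sim), 6))  # 0.324652

# Equivalent gamma for an RBF kernel written as exp(-gamma * ||x - x'||^2)
gamma = 1.0 / (2 * sigma ** 2)
print(gamma)  # 0.125
```

Identical vectors give a similarity of 1, and the similarity decays toward 0 as the vectors move apart. A smaller `sigma` makes that decay faster.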
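Processing the email
import re
import nltk

def process_email(email_contents):
    vocab_list = get_vocab_list()
    # Invert the vocabulary so each word maps to its 1-based index
    word_to_index = {word: idx for idx, word in vocab_list.items()}
    word_indices = np.array([], dtype=np.int64)

    # Normalize: lowercase, strip HTML tags, then replace numbers, URLs,
    # email addresses and dollar signs with placeholder tokens
    email_contents = email_contents.lower()
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    print('==== Processed Email ====')
    stemmer = nltk.stem.porter.PorterStemmer()
    # Split on punctuation and on any whitespace, including newlines
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]', email_contents)
    for token in tokens:
        # Drop remaining non-alphanumeric characters, then stem the word
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        token = stemmer.stem(token)
        if len(token) < 1:
            continue
        # Record the index of every token found in the vocabulary
        if token in word_to_index:
            word_indices = np.append(word_indices, word_to_index[token])
        print(token)
    print('==================')
    return word_indices

def get_vocab_list():
    # Each line of vocab.txt holds "<index> <word>"; indices start at 1
    vocab_dict = {}
    with open('vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocab_dict[int(val)] = key
    return vocab_dict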
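The normalization rules inside process_email can be tried on their own, without nltk or vocab.txt. The following is a minimal sketch using only the standard library. The helper name `normalize` and the sample sentence are assumptions made for illustration.

```python
import re

def normalize(text):
    """Apply the same substitutions as process_email, in the same order."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                 # strip HTML tags
    text = re.sub(r'[0-9]+', 'number', text)              # any run of digits
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)
    text = re.sub(r'[$]+', 'dollar', text)
    return text

# Hypothetical sample email text
sample = 'Visit <b>http://example.com</b> or mail bob@example.com, only $100!'
normalized = normalize(sample)
print(normalized.split())
# ['visit', 'httpaddr', 'or', 'mail', 'emailaddr', 'only', 'dollarnumber!']
```

Note that the email pattern matches up to the next whitespace, so the trailing comma after the address is absorbed into `emailaddr`. The remaining punctuation, such as the final `!`, is removed later by the tokenizing step.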
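Email feature vector
def email_features(word_indices):
    # One binary feature per vocabulary word
    n = len(get_vocab_list())
    features = np.zeros(n)
    # word_indices are 1-based, while NumPy arrays are 0-based
    features[word_indices - 1] = 1
    return features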
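To make the index-to-vector mapping concrete, here is a small sketch with a toy vocabulary size. The function name `indices_to_features` and the sizes are assumptions for illustration. It keeps the same convention as above: word indices start at 1, array positions start at 0.

```python
import numpy as np

def indices_to_features(word_indices, vocab_size):
    """Turn 1-based word indices into a binary vector of length vocab_size."""
    features = np.zeros(vocab_size)
    idx = np.asarray(word_indices, dtype=np.int64)
    features[idx - 1] = 1  # shift to 0-based positions
    return features

# Words 2 and 5 occur in the email; the repeated 2 has no extra effect
features_demo = indices_to_features([2, 5, 2], vocab_size=6)
print(features_demo)  # [0. 1. 0. 0. 1. 0.]
```

The vector only records whether a word is present, not how often it occurs, so repeated indices leave the result unchanged.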
Steps
Step 1: process the email
with open('emailSample1.txt', 'r') as f:
    file_contents = f.read()
word_indices = process_email(file_contents)
Step 2: build the feature vector
features = email_features(word_indices)
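Step 3: train a linear SVM spam classifier
import scipy.io as scio
from sklearn import svm

data = scio.loadmat('spamTrain.mat')
X = data['X']
y = data['y'].flatten()
c = 0.1
# C must be passed by keyword in current scikit-learn releases
clf = svm.SVC(C=c, kernel='linear')
clf.fit(X, y)
p = clf.predict(X)  # predictions on the training set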
Step 4: test the spam classifier
data = scio.loadmat('spamTest.mat')
Xtest = data['Xtest']
ytest = data['ytest'].flatten()
p = clf.predict(Xtest)  # predictions on the test set
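Steps 3 and 4 store the predictions in p but do not summarize them. One common way to do so is the percentage of predictions that match the labels. The sketch below uses made-up arrays in place of the real p and ytest, and the helper name `accuracy_percent` is an assumption.

```python
import numpy as np

def accuracy_percent(predictions, labels):
    """Percentage of predictions that equal the corresponding labels."""
    predictions = np.ravel(predictions)
    labels = np.ravel(labels)
    return float(np.mean(predictions == labels) * 100)

# Made-up predictions and labels: 4 of the 5 entries agree
p_demo = np.array([1, 0, 1, 1, 0])
y_demo = np.array([1, 0, 0, 1, 0])
acc = accuracy_percent(p_demo, y_demo)
print('Accuracy: {:.1f}%'.format(acc))
```

With the real data, the same call would be `accuracy_percent(p, y)` after step 3 and `accuracy_percent(p, ytest)` after step 4.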
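Step 5: print the words most indicative of spam
vocab_list = get_vocab_list()
# Sort feature columns by weight, largest first
indices = np.argsort(clf.coef_).flatten()[::-1]
print(indices)
for i in range(15):
    # Column j of X corresponds to vocabulary entry j + 1
    print('{} ({:0.6f})'.format(vocab_list[indices[i] + 1], clf.coef_.flatten()[indices[i]]))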
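The ranking in step 5 can be illustrated without training a model. The sketch below uses a hypothetical four-word vocabulary and made-up weights shaped like clf.coef_ for a binary linear SVM, that is (1, n_features). It assumes, as in vocab.txt, that vocabulary keys start at 1 while weight columns start at 0.

```python
import numpy as np

# Hypothetical toy data, not taken from the real model
vocab_demo = {1: 'hello', 2: 'click', 3: 'meeting', 4: 'free'}
weights_demo = np.array([[-0.2, 0.9, -0.5, 0.4]])

def top_words(weights, vocab, k):
    """Return the k (word, weight) pairs with the largest weights."""
    flat = np.ravel(weights)
    order = np.argsort(flat)[::-1][:k]  # column positions, largest weight first
    # Column j belongs to vocabulary key j + 1
    return [(vocab[int(j) + 1], float(flat[j])) for j in order]

top = top_words(weights_demo, vocab_demo, k=2)
for word, weight in top:
    print('{} ({:0.6f})'.format(word, weight))
# click (0.900000)
# free (0.400000)
```

Large positive weights push the decision toward the positive class, which is why these words are read as the strongest spam indicators.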