ML-9 Support Vector Machines: scikit-learn SVM Experiments

Posted by yifanrensheng


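Contents

  1. scikit-learn SVM API overview
  2. Iris classification with an SVM
  3. Comparing classifier accuracy on the iris data
  4. Comparing SVM kernel functions
  5. Outlier detection (OneClassSVM)
  6. Classification summary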

1. scikit-learn SVM API overview

1.1 Library overview: classification algorithms

[Figure]

[Figure]

svm.SVC API notes (see also this other post: https://www.cnblogs.com/yifanrensheng/p/11863324.html)

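Parameters:

  1. C: penalty coefficient on the error term, default 1.0. It must be a positive number. The larger C is, the more weight training puts on the total training error, so the model fits the training set better but may overfit.
  2. kernel: the kernel function used by the SVM. Options: linear, poly, rbf, sigmoid, and precomputed (rarely used; it requires passing a precomputed square kernel matrix of shape (n_samples, n_samples) instead of the raw features). Default: rbf.
  3. degree: degree of the polynomial when kernel is poly; the other kernels ignore it. Default: 3.
  4. gamma: kernel coefficient for poly, rbf, and sigmoid. With gamma='auto' the value is 1/n_features. Since scikit-learn 0.22 the default is 'scale', which uses 1/(n_features * X.var()).
  5. coef0: independent term of the kernel when kernel is poly or sigmoid. Default: 0.
  6. probability: whether to enable probability estimates. Default: False. Enabling it is usually not recommended, because it adds an internal cross-validation step that slows down training.
  7. shrinking: whether to use the shrinking heuristic. Default: True.
  8. tol: tolerance of the stopping criterion; training stops once the change in the optimization error falls below this value. Default: 1e-3.
  9. cache_size: size of the kernel cache used during training, in MB. Default: 200.
  10. class_weight: weight for each class. Default: None (every class has weight 1).
  11. max_iter: maximum number of iterations. The default of -1 means no limit.
  12. decision_function_shape: shape of the decision function, either ovo or ovr. The parameter exists from version 0.17 on, and ovr has been the default since 0.19; ovr is the recommended choice.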
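To make the gamma entry concrete, here is a minimal sketch that computes both settings by hand. The toy matrix X and the helper name gamma_value are made up for illustration; the two formulas are the ones the scikit-learn documentation gives for gamma='auto' and gamma='scale'.

```python
import numpy as np


def gamma_value(X, mode):
    """Return the numeric gamma implied by mode ('auto' or 'scale') for data X."""
    n_features = X.shape[1]
    if mode == "auto":
        return 1.0 / n_features
    if mode == "scale":
        # X.var() is the variance over all entries of X, not a per-feature variance
        return 1.0 / (n_features * X.var())
    raise ValueError("mode must be 'auto' or 'scale'")


# Hypothetical toy data: 4 samples, 2 features
X = np.array([[0.0, 0.0], [0.0, 4.0], [4.0, 0.0], [4.0, 4.0]])
gamma_auto = gamma_value(X, "auto")
gamma_scale = gamma_value(X, "scale")
print(gamma_auto, gamma_scale)
```

Because 'scale' divides by the variance of the data, it adapts to the units of the features, whereas 'auto' depends only on how many features there are.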

1.2 Library overview: regression algorithms

[Figure]

1.3 OneClassSVM

[Figure]

2. Iris classification with an SVM


# Author: yifan
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import svm  # SVM module
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The original filtered sklearn.exceptions.ChangedBehaviorWarning, which no longer
# exists in recent scikit-learn releases, so that import and filter are dropped.

# Font settings: only needed when plot labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

## Load the data
iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'
path = './datas/iris.data'  # path to the data file
data = pd.read_csv(path, header=None)
x, y = data[list(range(4))], data[4]
# Encode the text labels as integers, e.g. a, b, c -> 0, 1, 2;
# pd.Categorical(y).categories maps each code back to its original label
y = pd.Categorical(y).codes
x = x[[0, 1]]  # keep the first two columns (sepal length and sepal width)

## Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, train_size=0.8)

## Build the SVM classifier
clf = svm.SVC(C=1, kernel='rbf', gamma=0.1)
# A larger gamma fits the training set more closely, but it can overfit and
# make the fit on the test set worse.
# A smaller gamma generalizes better and keeps training and test accuracy close,
# but too small a value underfits, which lowers accuracy on both sets.

## Train the model
# fit returns the estimator, e.g.
# SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
#     max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
clf.fit(x_train, y_train)

## Compute the accuracy of the model
print(clf.score(x_train, y_train))
print('Training accuracy:', accuracy_score(y_train, clf.predict(x_train)))
print(clf.score(x_test, y_test))
print('Test accuracy:', accuracy_score(y_test, clf.predict(x_test)))

# Plot
N = 500
x1_min, x2_min = x.min()
x1_max, x2_max = x.max()
t1 = np.linspace(x1_min, x1_max, N)
t2 = np.linspace(x2_min, x2_max, N)
x1, x2 = np.meshgrid(t1, t2)  # grid of sample points
grid_show = np.dstack((x1.flat, x2.flat))[0]  # grid points as an (N*N, 2) array

grid_hat = clf.predict(grid_show)  # predicted class for every grid point
grid_hat = grid_hat.reshape(x1.shape)  # same shape as the grid

cm_light = mpl.colors.ListedColormap(['#00FFCC', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
plt.figure(facecolor='w')
## Decision regions
plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light, shading='auto')
## All samples
plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)
## Test set: circle the test samples (edgecolors is needed, otherwise hollow markers are invisible)
plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', edgecolors='k', zorder=10)
## Labels
plt.xlabel(iris_feature[0], fontsize=13)
plt.ylabel(iris_feature[1], fontsize=13)
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.title('Iris SVM feature classification', fontsize=16)
plt.grid(True, ls=':')  # the b= keyword was renamed in newer Matplotlib, so pass it positionally
plt.tight_layout(pad=1.5)
plt.show()

Result:

0.85
Training accuracy: 0.85
0.7333333333333333
Test accuracy: 0.7333333333333333

[Figure: decision regions of the SVC on sepal length and sepal width]
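The comments about gamma in the listing can be checked with a small sweep. The sketch below is an illustration under assumptions, not the author's experiment: it loads the copy of iris bundled with scikit-learn (load_iris) instead of ./datas/iris.data, and the list of gamma values is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: use the iris copy bundled with scikit-learn rather than ./datas/iris.data
X, y = load_iris(return_X_y=True)
X = X[:, :2]  # sepal length and sepal width only, as in the listing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, train_size=0.8
)

scores = {}
for gamma in (0.01, 0.1, 1, 10, 100):  # illustrative values
    clf = SVC(C=1, kernel="rbf", gamma=gamma).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this gamma
    scores[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print("gamma=%-6s train=%.3f test=%.3f" % (gamma, *scores[gamma]))
```

Comparing the two columns for each gamma shows where the gap between training and test accuracy opens up, which is the overfitting the comments describe.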

3. Comparing classifier accuracy on the iris data


# Author: yifan
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Font settings: only needed when plot labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

## Load the data
iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'
path = './datas/iris.data'  # path to the data file
data = pd.read_csv(path, header=None)
x, y = data[list(range(4))], data[4]
y = pd.Categorical(y).codes
x = x[[0, 1]]

## Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=28, train_size=0.6)

## Build the classifiers
svm = SVC(C=1, kernel='linear')
lr = LogisticRegression()
# Ridge regularization targets the case where features outnumber samples and
# plain linear classification performs poorly
rc = RidgeClassifier()
# An SVM can also degrade when the number of features is much larger than the number of samples
knn = KNeighborsClassifier()

## Train the models
svm.fit(x_train, y_train)
lr.fit(x_train, y_train)
rc.fit(x_train, y_train)
knn.fit(x_train, y_train)

## Evaluate (score1 = training accuracy, score2 = test accuracy)
svm_score1 = accuracy_score(y_train, svm.predict(x_train))
svm_score2 = accuracy_score(y_test, svm.predict(x_test))

lr_score1 = accuracy_score(y_train, lr.predict(x_train))
lr_score2 = accuracy_score(y_test, lr.predict(x_test))

rc_score1 = accuracy_score(y_train, rc.predict(x_train))
rc_score2 = accuracy_score(y_test, rc.predict(x_test))

knn_score1 = accuracy_score(y_train, knn.predict(x_train))
knn_score2 = accuracy_score(y_test, knn.predict(x_test))

## Plot the accuracies
x_tmp = [0, 1, 2, 3]
y_score1 = [svm_score1, lr_score1, rc_score1, knn_score1]
y_score2 = [svm_score2, lr_score2, rc_score2, knn_score2]

plt.figure(facecolor='w')
plt.plot(x_tmp, y_score1, 'r-', lw=2, label='Training accuracy')
plt.plot(x_tmp, y_score2, 'g-', lw=2, label='Test accuracy')
plt.xlim(0, 3)
plt.ylim(np.min((np.min(y_score1), np.min(y_score2))) * 0.9, np.max((np.max(y_score1), np.max(y_score2))) * 1.1)
plt.legend(loc='lower right')
plt.title('Accuracy of different classifiers on the iris data', fontsize=16)
plt.xticks(x_tmp, ['SVM', 'Logistic', 'Ridge', 'KNN'], rotation=0)
plt.grid(True)
plt.show()

### Plot the decision regions for comparison
N = 500
x1_min, x2_min = x.min()
x1_max, x2_max = x.max()

t1 = np.linspace(x1_min, x1_max, N)
t2 = np.linspace(x2_min, x2_max, N)
x1, x2 = np.meshgrid(t1, t2)  # grid of sample points
grid_show = np.dstack((x1.flat, x2.flat))[0]  # grid points as an (N*N, 2) array

cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
plt.figure(facecolor='w', figsize=(14, 7))

# One subplot per model; the original repeated this block four times
models = [(svm, 'SVM'), (lr, 'Logistic'), (rc, 'Ridge'), (knn, 'KNN')]
for idx, (model, name) in enumerate(models, start=1):
    # Predict every grid point and reshape to the grid layout
    grid_hat = model.predict(grid_show).reshape(x1.shape)
    plt.subplot(2, 2, idx)
    ## Decision regions
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light, shading='auto')
    ## All samples
    plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)
    ## Test set: circle the test samples
    plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', edgecolors='k', zorder=10)
    ## Labels
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('Iris %s feature classification' % name, fontsize=16)
    plt.grid(True, ls=':')
plt.tight_layout(pad=1.5)
plt.show()

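Result:

[Figure: training and test accuracy of SVM, Logistic, Ridge, and KNN]

[Figure: decision regions of SVM, Logistic, Ridge, and KNN]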
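The decision-region plots in this post all rest on the same trick: build a grid with np.meshgrid, flatten it into an (n_points, 2) array that a classifier can predict on, then reshape the predictions back to the grid. The sketch below walks through that on a tiny grid. The threshold rule standing in for clf.predict is a made-up placeholder, and np.column_stack is offered only as an equivalent, arguably more readable alternative to the np.dstack(...)[0] idiom.

```python
import numpy as np

# Small illustrative grid: 3 points along the first axis, 2 along the second
t1 = np.linspace(0.0, 1.0, 3)
t2 = np.linspace(10.0, 20.0, 2)
x1, x2 = np.meshgrid(t1, t2)  # both have shape (2, 3)

# Same construction as in the listings (with ravel() in place of .flat)
grid_a = np.dstack((x1.ravel(), x2.ravel()))[0]
# Equivalent construction: one row per grid point, one column per feature
grid_b = np.column_stack((x1.ravel(), x2.ravel()))
print(grid_a.shape, grid_b.shape)

# Placeholder for clf.predict(grid): label each point by a simple threshold
labels = (grid_b.sum(axis=1) > 15).astype(int)
# Back to the meshgrid layout, ready for plt.pcolormesh(x1, x2, label_grid)
label_grid = labels.reshape(x1.shape)
print(label_grid)
```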


4. Comparing SVM kernel functions


# Author: yifan
import time

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Font settings: only needed when plot labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

## Load the data
iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'
path = './datas/iris.data'  # path to the data file
data = pd.read_csv(path, header=None)
x, y = data[list(range(4))], data[4]
y = pd.Categorical(y).codes
x = x[[0, 1]]

## Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=28, train_size=0.6)

## Build one SVM classifier per kernel
svm1 = SVC(C=1, kernel='linear')
svm2 = SVC(C=1, kernel='rbf')
svm3 = SVC(C=1, kernel='poly')
svm4 = SVC(C=1, kernel='sigmoid')

## Train the models, recording a timestamp before and after each fit
t0 = time.time()
svm1.fit(x_train, y_train)
t1 = time.time()
svm2.fit(x_train, y_train)
t2 = time.time()
svm3.fit(x_train, y_train)
t3 = time.time()
svm4.fit(x_train, y_train)
t4 = time.time()

### Evaluate (score1 = training accuracy, score2 = test accuracy)
svm1_score1 = accuracy_score(y_train, svm1.predict(x_train))
svm1_score2 = accuracy_score(y_test, svm1.predict(x_test))

svm2_score1 = accuracy_score(y_train, svm2.predict(x_train))
svm2_score2 = accuracy_score(y_test, svm2.predict(x_test))

svm3_score1 = accuracy_score(y_train, svm3.predict(x_train))
svm3_score2 = accuracy_score(y_test, svm3.predict(x_test))

svm4_score1 = accuracy_score(y_train, svm4.predict(x_train))
svm4_score2 = accuracy_score(y_test, svm4.predict(x_test))

## Plot accuracy and training time
x_tmp = [0, 1, 2, 3]
t_score = [t1 - t0, t2 - t1, t3 - t2, t4 - t3]
y_score1 = [svm1_score1, svm2_score1, svm3_score1, svm4_score1]
y_score2 = [svm1_score2, svm2_score2, svm3_score2, svm4_score2]
kernel_names = ['linear-SVM', 'rbf-SVM', 'poly-SVM', 'sigmoid-SVM']

plt.figure(facecolor='w', figsize=(12, 6))

plt.subplot(121)
plt.plot(x_tmp, y_score1, 'r-', lw=2, label='Training accuracy')
plt.plot(x_tmp, y_score2, 'g-', lw=2, label='Test accuracy')
plt.xlim(-0.3, 3.3)
plt.ylim(np.min((np.min(y_score1), np.min(y_score2))) * 0.9, np.max((np.max(y_score1), np.max(y_score2))) * 1.1)
plt.legend(loc='lower left')
plt.title('Model prediction accuracy', fontsize=13)
plt.xticks(x_tmp, kernel_names, rotation=0)
plt.grid(True)

plt.subplot(122)
plt.plot(x_tmp, t_score, 'b-', lw=2, label='Training time')
plt.title('Model training time', fontsize=13)
plt.xticks(x_tmp, kernel_names, rotation=0)
plt.xlim(-0.3, 3.3)
plt.grid(True)
plt.suptitle('Comparison of SVM kernel functions on the iris data', fontsize=16)

plt.show()

### Plot the decision regions for comparison
N = 500
x1_min, x2_min = x.min()
x1_max, x2_max = x.max()

t1 = np.linspace(x1_min, x1_max, N)
t2 = np.linspace(x2_min, x2_max, N)
x1, x2 = np.meshgrid(t1, t2)  # grid of sample points
grid_show = np.dstack((x1.flat, x2.flat))[0]  # grid points as an (N*N, 2) array

cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
plt.figure(facecolor='w', figsize=(14, 7))

# One subplot per kernel; the original repeated this block four times
models = [(svm1, 'Linear-SVM'), (svm2, 'rbf-SVM'), (svm3, 'poly-SVM'), (svm4, 'sigmoid-SVM')]
for idx, (model, name) in enumerate(models, start=1):
    # Predict every grid point and reshape to the grid layout
    grid_hat = model.predict(grid_show).reshape(x1.shape)
    plt.subplot(2, 2, idx)
    ## Decision regions
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light, shading='auto')
    ## All samples
    plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)
    ## Test set: circle the test samples
    plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', edgecolors='k', zorder=10)
    ## Labels
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('Iris %s feature classification' % name, fontsize=16)
    plt.grid(True, ls=':')
plt.tight_layout(pad=1.5)
plt.show()

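Result:

[Figure: prediction accuracy and training time of the linear, rbf, poly, and sigmoid kernels]

[Figure: decision regions of the linear, rbf, poly, and sigmoid kernels]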
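For reference, the four kernels compared above can be written out directly. The sketch below follows the kernel definitions in the scikit-learn SVM documentation; the helper names k_linear, k_poly, k_rbf, and k_sigmoid and the two sample vectors are made up for illustration, and the parameters gamma, coef0, and degree correspond to the SVC arguments of the same names.

```python
import math

import numpy as np


def k_linear(x, z):
    # <x, z>
    return float(np.dot(x, z))


def k_poly(x, z, gamma=1.0, coef0=0.0, degree=3):
    # (gamma * <x, z> + coef0) ** degree
    return float((gamma * np.dot(x, z) + coef0) ** degree)


def k_rbf(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2)
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))


def k_sigmoid(x, z, gamma=1.0, coef0=0.0):
    # tanh(gamma * <x, z> + coef0)
    return float(np.tanh(gamma * np.dot(x, z) + coef0))


# Hypothetical sample vectors
x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

for name, value in [
    ("linear", k_linear(x, z)),
    ("poly", k_poly(x, z)),
    ("rbf", k_rbf(x, z, gamma=0.5)),
    ("sigmoid", k_sigmoid(x, z)),
]:
    print(name, value)
```

The rbf kernel depends only on the distance between the two points, which is why its decision regions can bend around clusters, while the linear kernel can only produce straight boundaries.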

5. Outlier detection (OneClassSVM)


# Author: yifan
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.font_manager
from sklearn import svm

# Font settings: only needed when plot labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

# Generate simulated data
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
# Training data: two clusters around (2, 2) and (-2, -2)
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Test data drawn from the same two clusters
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Some outlier points
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))

# Train the model
clf = svm.OneClassSVM(nu=0.01, kernel="rbf", gamma=0.1)
clf.fit(X_train)

# Get the predictions
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
# predict returns 1 for points that belong to the class and -1 for points that do not
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

# Decision values on the grid, used for plotting
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(facecolor='w')
plt.title("Outlier detection")
# Draw the regions
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 9), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='darkred')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')
# Draw the points
s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=s, edgecolors='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='blueviolet', s=s, edgecolors='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='gold', s=s, edgecolors='k')

# Axes, legend, and labels
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
# legend_elements() supplies a handle for the contour line; the original used
# a.collections[0], which is deprecated in recent Matplotlib releases
plt.legend([a.legend_elements()[0][0], b1, b2, c],
           ["Separating hyperplane", "Training samples", "Test samples", "Outliers"],
           loc="upper left",
           prop=matplotlib.font_manager.FontProperties(size=11))
# There are 200 training points, 40 test points, and 20 outliers
# (the original label said /40 for the outliers)
plt.xlabel("Training errors: %d/200 ; test errors: %d/40 ; outlier errors: %d/20"
           % (n_error_train, n_error_test, n_error_outliers))
plt.show()

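Result:

[Figure: OneClassSVM outlier detection with the learned boundary, training samples, test samples, and outliers]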
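As a smaller companion to the plot, the sketch below shows what predict and decision_function return for individual points. It reuses the same model settings, but the fixed random seed and the two probe points are assumptions added so the run is repeatable. According to the scikit-learn documentation, nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(42)  # assumption: fixed seed so the sketch is repeatable
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]  # two clusters around (2, 2) and (-2, -2)

clf = svm.OneClassSVM(nu=0.01, kernel="rbf", gamma=0.1).fit(X_train)

# Two hypothetical probe points: a cluster centre and a point far from both clusters
probes = np.array([[2.0, 2.0], [10.0, 10.0]])
pred = clf.predict(probes)             # +1 = inlier, -1 = outlier
score = clf.decision_function(probes)  # positive inside the boundary, negative outside

# Share of training points that the fitted model itself flags as outliers
train_outlier_fraction = np.mean(clf.predict(X_train) == -1)
print(pred, score, train_outlier_fraction)
```

The zero level of decision_function is the boundary drawn in dark red in the figure, so the sign of the score and the predicted label always tell the same story.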

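6. Classification summary

This section compares logistic regression, KNN, decision trees, random forests, GBDT, AdaBoost, and SVM classifiers on simulated datasets generated with scikit-learn.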


# Author: yifan
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.colors import ListedColormap
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegressionCV

# Font settings: only needed when plot labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

# Build the datasets
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.4, random_state=1),
            linearly_separable]

# Modelling: collect all the algorithms in a list
names = ["Nearest Neighbors", "Logistic", "Decision Tree", "Random Forest", "AdaBoost", "GBDT", "svm"]
classifiers = [
    KNeighborsClassifier(3),
    LogisticRegressionCV(),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(n_estimators=10, learning_rate=1.5),
    GradientBoostingClassifier(n_estimators=10, learning_rate=1.5),
    svm.SVC(C=1, kernel='rbf')
]

## Plot
figure = plt.figure(figsize=(27, 9), facecolor='w')
i = 1
h = .02  # grid step size

for ds in datasets:
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['r', 'b', 'y'])
    # First column of each row: the raw data
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1
    # Plot each algorithm
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        # hasattr checks whether an object has a given attribute;
        # here it checks whether clf provides decision_function.
        # np.c_ stacks the flattened grid coordinates as columns.
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        # Test accuracy in the lower-right corner of the panel
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=25, horizontalalignment='right')
        i += 1

## Show the figure
figure.subplots_adjust(left=.02, right=.98)
plt.show()
# plt.savefig("cs.png")

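Result:

[Figure: decision surfaces and test accuracy of the seven classifiers on the three simulated datasets]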
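The hasattr branch in the listing exists because not every scikit-learn classifier exposes the same scoring method. The sketch below isolates that step; the helper name continuous_score and the probe points are made up for illustration, and the two classifiers are just one example of each case.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def continuous_score(clf, points):
    """Return one continuous score per point for a fitted binary classifier."""
    if hasattr(clf, "decision_function"):
        return clf.decision_function(points)
    # Fall back to the predicted probability of the positive class
    return clf.predict_proba(points)[:, 1]


X, y = make_moons(noise=0.3, random_state=0)
points = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, -0.5]])  # hypothetical probe points

for clf in (SVC(C=1, kernel="rbf"), KNeighborsClassifier(3)):
    clf.fit(X, y)
    z = continuous_score(clf, points)
    print(type(clf).__name__, hasattr(clf, "decision_function"), z.shape)
```

The two kinds of score live on different scales (signed margins versus probabilities between 0 and 1), so the colour shading is comparable within a panel but not across panels.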
