Translated Article: Chapter 6, Sections 4-7

Posted 2020-09-20 DY数据科学实验室


Visualizing an SVM fit

To visualize the fitted model, you can use the plot function to generate a scatter plot of the input data together with the SVM fit. In this plot, support vectors and classes are distinguished by symbol and color. In addition, the class regions are drawn as a filled contour plot, which makes misclassified samples easy to identify.

Getting ready

In this recipe, we will use two datasets: the iris dataset and the telecom churn dataset. For the telecom churn dataset, you need to have completed the previous recipe by training a support vector machine with svm and saving the fitted model in model.

How to do it...

Perform the following steps to visualize the SVM fit object:

1. Use svm to train a support vector machine on the iris dataset, and use the plot function to visualize the fitted model:

> library(e1071)
> data(iris)
> model.iris = svm(Species~., iris)
> plot(model.iris, iris, Petal.Width ~ Petal.Length,
+      slice = list(Sepal.Width = 3, Sepal.Length = 4))

2. Visualize the SVM fit object, model, using the plot function with the dimensions total_day_minutes and total_intl_charge:

> plot(model, trainset, total_day_minutes ~ total_intl_charge)

Figure 7: The SVM classification plot of trained SVM fit based on churn dataset

How it works...

In this recipe, we demonstrate how to use the plot function to visualize the SVM fit. For the first plot, we train a support vector machine on the iris dataset and then use the plot function to visualize the fitted SVM.

In the argument list, we pass the fitted model as the first argument and the dataset (this should be the same data used to build the model) as the second. The third argument is a formula naming the two dimensions used to generate the classification plot. The plot function can only draw a scatter plot in two dimensions (the x-axis and y-axis), so we select Petal.Length and Petal.Width as the two dimensions.

From Figure 6, we find Petal.Length assigned to the x-axis, Petal.Width assigned to the y-axis, and data points with X and O symbols scattered on the plot. The X symbol marks a support vector and the O symbol marks any other data point. These two symbols can be changed through the svSymbol and dataSymbol options. Both the support vectors and the true classes are colored according to their label (green refers to virginica, red to versicolor, and black to setosa). The last argument, slice, is needed when the model has more than two variables: it holds the variables that are not plotted at constant values. In this example, we fix the additional variables Sepal.Width and Sepal.Length at 3 and 4, respectively.
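As a rough sketch of the symbol options described here, the following redraws the iris plot with different markers. It assumes the e1071 package is installed; the pch codes 17 and 1 are an illustrative choice and do not come from the recipe.

```r
library(e1071)

data(iris)
model.iris <- svm(Species ~ ., data = iris)

# Support vectors as filled triangles (pch 17), other points as open circles (pch 1).
# These pch codes are an illustrative assumption, not values taken from the recipe.
plot(model.iris, iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4),
     svSymbol = 17, dataSymbol = 1)

# Number of observations that ended up as support vectors
model.iris$tot.nSV
```

Changing only the markers leaves the fitted model untouched, so the class regions in the background stay the same.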

Next, we take the same approach to draw the SVM fit for the customer churn data. In this example, we use total_day_minutes and total_intl_charge as the two dimensions of the scatter plot. As Figure 7 shows, the red and black support vectors and data points are packed closely together in the central region of the plot, and there is no simple way to separate them.

See also

There are other parameters, such as fill, grid, symbolPalette, and so on, that can be configured to change the layout of the plot. You can use the help function to view the following document for further information:

> ?plot.svm

Predicting labels based on a model trained by a support vector machine

In the previous recipe, we trained an SVM on the training dataset. The training process finds the optimal hyperplane that separates the training data with the maximum margin. We can then use the SVM fit to predict the label (category) of new observations. In this recipe, we will demonstrate how to use the predict function to predict values based on a model trained by SVM.

Getting ready

You need to have completed the previous recipe by generating a fitted SVM and saving the fitted model in model.

How to do it...

Perform the following steps to predict the labels of the testing dataset:

1. Predict the labels of the testing dataset based on the fitted SVM and the attributes of the testing dataset:

> svm.pred = predict(model, testset[, !names(testset) %in% c("churn")])

2. Then, use the table function to generate a classification table from the predicted labels and the true labels of the testing dataset:

> svm.table = table(svm.pred, testset$churn)
> svm.table

svm.pred yes  no
     yes  70  12
     no   71 865

3. Next, use classAgreement to calculate the coefficients of agreement for the classification table:

> classAgreement(svm.table)
$diag
[1] 0.9184676

$kappa
[1] 0.5855903

$rand
[1] 0.850083

$crand
[1] 0.5260472

4. Now, use confusionMatrix to measure the prediction performance based on the classification table:

> library(caret)
> confusionMatrix(svm.table)
Confusion Matrix and Statistics

svm.pred yes  no
     yes  70  12
     no   71 865

               Accuracy : 0.9185
                 95% CI : (0.8999, 0.9345)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : 1.251e-08

                  Kappa : 0.5856
 Mcnemar's Test P-Value : 1.936e-10

            Sensitivity : 0.49645
            Specificity : 0.98632
         Pos Pred Value : 0.85366
         Neg Pred Value : 0.92415
             Prevalence : 0.13851
         Detection Rate : 0.06876
   Detection Prevalence : 0.08055
      Balanced Accuracy : 0.74139

       'Positive' Class : yes

How it works...

In this recipe, we first used the predict function to obtain the predicted labels of the testing dataset. Next, we used the table function to generate a classification table that compares the predicted labels with the true labels of the testing dataset. So far, the evaluation procedure is very similar to the one described in the previous chapter.

We then introduced a new function, classAgreement, which computes several coefficients of agreement between the rows and columns of a two-way contingency table. The coefficients are diag, kappa, rand, and crand. The diag coefficient is the proportion of data points on the main diagonal of the classification table; kappa is diag corrected for agreement by chance (the probability of random agreement); rand is the Rand index, which measures the similarity between two data clusterings; and crand is the Rand index adjusted for the chance grouping of elements.
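To make these coefficients concrete, the following sketch recomputes them by hand in base R from the classification table shown in the recipe. The formulas are the standard textbook definitions of proportion agreement, Cohen's kappa, and the (adjusted) Rand index; the variable names are illustrative, and this is not claimed to be how classAgreement is implemented internally.

```r
# Classification table from the recipe: rows = predicted, columns = actual
tab <- matrix(c(70, 71, 12, 865), nrow = 2,
              dimnames = list(pred = c("yes", "no"), actual = c("yes", "no")))
n <- sum(tab)

# diag: proportion of observations on the main diagonal
diag_coef <- sum(diag(tab)) / n

# kappa: diag corrected for the agreement expected by chance
p_chance <- sum(rowSums(tab) * colSums(tab)) / n^2
kappa_coef <- (diag_coef - p_chance) / (1 - p_chance)

# rand and crand: pair-counting agreement between the two labelings
pairs_of <- function(x) sum(choose(x, 2))
total_pairs <- choose(n, 2)
cell_pairs <- pairs_of(tab)
row_pairs <- pairs_of(rowSums(tab))
col_pairs <- pairs_of(colSums(tab))

rand_coef <- (total_pairs + 2 * cell_pairs - row_pairs - col_pairs) / total_pairs
expected_pairs <- row_pairs * col_pairs / total_pairs
crand_coef <- (cell_pairs - expected_pairs) /
  ((row_pairs + col_pairs) / 2 - expected_pairs)

round(c(diag = diag_coef, kappa = kappa_coef,
        rand = rand_coef, crand = crand_coef), 4)
```

The four values should agree with the classAgreement output in step 3 up to rounding.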

Finally, we used confusionMatrix from the caret package to measure the performance of the classification model. The accuracy of 0.9185 shows that the trained support vector machine classifies most of the observations correctly. However, accuracy alone is not a good measure of a classification model; one should also consider sensitivity and specificity. Here, for example, the sensitivity of 0.49645 means that the model detects only about half of the actual yes cases.
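The point about accuracy can be illustrated with a small base R sketch that compares the SVM against a hypothetical baseline that predicts no for every customer. The baseline is an assumption introduced only for illustration; the counts come from the classification table in the recipe.

```r
# Classification table from the recipe: rows = predicted, columns = actual
tab <- matrix(c(70, 71, 12, 865), nrow = 2,
              dimnames = list(pred = c("yes", "no"), actual = c("yes", "no")))

actual_yes <- sum(tab[, "yes"])   # 141 customers who churned
actual_no  <- sum(tab[, "no"])    # 877 customers who did not
n <- sum(tab)

# SVM metrics, with "yes" as the positive class
svm_accuracy    <- sum(diag(tab)) / n
svm_sensitivity <- tab["yes", "yes"] / actual_yes
svm_specificity <- tab["no", "no"] / actual_no

# Hypothetical baseline: always predict "no"
base_accuracy    <- actual_no / n   # equals the no information rate
base_sensitivity <- 0               # never detects a churner
base_specificity <- 1

round(rbind(svm = c(accuracy = svm_accuracy,
                    sensitivity = svm_sensitivity,
                    specificity = svm_specificity),
            always_no = c(base_accuracy, base_sensitivity, base_specificity)), 4)
```

The baseline reaches an accuracy close to 0.86 without identifying a single churner, which is why sensitivity and specificity are worth reading alongside accuracy.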

There's more...

Besides using SVM to predict the category of new observations, you can use SVM to predict continuous values. In other words, you can use SVM to perform regression analysis.

In the following example, we show how to perform a simple regression prediction with a fitted SVM whose type is specified as eps-regression.

Perform the following steps to train a regression model with SVM:

1. Train a support vector machine on the Quartet dataset:

> library(car)
> data(Quartet)
> model.regression = svm(Quartet$y1~Quartet$x,type="eps-regression")

2. Use the predict function to obtain prediction results:

> predict.y = predict(model.regression, Quartet$x)
> predict.y
       1        2        3        4        5        6        7        8
8.196894 7.152946 8.807471 7.713099 8.533578 8.774046 6.186349 5.763689
       9       10       11
8.726925 6.621373 5.882946

3. Plot the training data points as circles and the predicted points as red squares on the same plot:

> plot(Quartet$x, Quartet$y1, pch=19)
> points(Quartet$x, predict.y, pch=15, col="red")

Tuning a support vector machine

Besides using different feature sets and kernel functions in support vector machines, one trick that you can use to tune performance is to adjust the gamma and cost arguments. One possible approach to testing different gamma and cost combinations is to write a for loop that generates all the combinations of gamma and cost as inputs to train different support vector machines. Fortunately, the e1071 package provides a tuning function, tune.svm, which makes the tuning much easier. In this recipe, we will demonstrate how to tune a support vector machine through the use of tune.svm.
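For comparison, the manual for-loop approach mentioned above might look roughly like the following sketch. It assumes the e1071 package is installed and uses the built-in iris data as a stand-in, because the churn trainset from the earlier recipe is not available here. The 5-fold setting, the seed, and the variable names are illustrative assumptions.

```r
library(e1071)

data(iris)
set.seed(1)  # cross-validation folds are random; the seed is arbitrary

# All 12 combinations of the gamma and cost values used in this recipe
grid <- expand.grid(gamma = 10^(-6:-1), cost = 10^(1:2))
grid$cv.accuracy <- NA_real_

for (i in seq_len(nrow(grid))) {
  # cross = 5 asks svm() for a 5-fold cross-validated accuracy estimate
  fit <- svm(Species ~ ., data = iris,
             gamma = grid$gamma[i], cost = grid$cost[i], cross = 5)
  grid$cv.accuracy[i] <- fit$tot.accuracy  # overall accuracy in percent
}

# Combination with the highest cross-validated accuracy
grid[which.max(grid$cv.accuracy), ]
```

tune.svm wraps this kind of loop, along with the resampling and the bookkeeping, in a single call.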

Getting ready

You need to have completed the previous recipe by preparing a training dataset, trainset.

How to do it...

Perform the following steps to tune the support vector machine:

1. First, tune the support vector machine using tune.svm:

> tuned = tune.svm(churn~., data = trainset, gamma = 10^(-6:-1),
+                  cost = 10^(1:2))

2. Next, you can use the summary function to obtain the tuning result:

> summary(tuned)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.01  100

- best performance: 0.08077885

- Detailed performance results:
   gamma cost      error dispersion
1  1e-06   10 0.14774780 0.02399512
2  1e-05   10 0.14774780 0.02399512
3  1e-04   10 0.14774780 0.02399512
4  1e-03   10 0.14774780 0.02399512
5  1e-02   10 0.09245223 0.02046032
6  1e-01   10 0.09202306 0.01938475
7  1e-06  100 0.14774780 0.02399512
8  1e-05  100 0.14774780 0.02399512
9  1e-04  100 0.14774780 0.02399512
10 1e-03  100 0.11794484 0.02368343
11 1e-02  100 0.08077885 0.01858195
12 1e-01  100 0.12356135 0.01661508

3. After retrieving the best parameters from the tuning result, you can retrain the support vector machine with them:

> model.tuned = svm(churn~., data = trainset,
+                   gamma = tuned$best.parameters$gamma,
+                   cost = tuned$best.parameters$cost)
> summary(model.tuned)

Call:
svm(formula = churn ~ ., data = trainset, gamma = 10^-2, cost = 100)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  100
      gamma:  0.01

Number of Support Vectors:  547

 ( 304 243 )

Number of Classes:  2

Levels:
 yes no

4. Then, use the predict function to predict labels based on the fitted SVM:

> svm.tuned.pred = predict(model.tuned, testset[, !names(testset) %in% c("churn")])

5. Next, generate a classification table based on the predicted and original labels of the testing dataset:

> svm.tuned.table = table(svm.tuned.pred, testset$churn)
> svm.tuned.table

svm.tuned.pred yes  no
           yes  95  24
           no   46 853

6. Also, use classAgreement to measure the performance:

> classAgreement(svm.tuned.table)
$diag
[1] 0.9312377

$kappa
[1] 0.691678

$rand
[1] 0.871806

$crand
[1] 0.6303615

7. Finally, you can use a confusion matrix to measure the performance of the retrained model:

> confusionMatrix(svm.tuned.table)
Confusion Matrix and Statistics

svm.tuned.pred yes  no
           yes  95  24
           no   46 853

               Accuracy : 0.9312
                 95% CI : (0.9139, 0.946)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : 1.56e-12

                  Kappa : 0.6917
 Mcnemar's Test P-Value : 0.01207

            Sensitivity : 0.67376
            Specificity : 0.97263
         Pos Pred Value : 0.79832
         Neg Pred Value : 0.94883
             Prevalence : 0.13851
         Detection Rate : 0.09332
   Detection Prevalence : 0.11690
      Balanced Accuracy : 0.82320

       'Positive' Class : yes

How it works...

To tune the support vector machine, you can use a trial and error method to find the best gamma and cost parameters. In other words, you generate a variety of combinations of gamma and cost and train a different support vector machine with each.

In this example, we generate gamma values from 10^-6 to 10^-1 and cost values of either 10 or 100. The tuning function, tune.svm, therefore evaluates 12 sets of parameters. For each combination, it performs 10-fold cross-validation and reports the error along with its dispersion. The combination with the lowest error is regarded as the best parameter set. From the summary table, we find that gamma with a value of 0.01 and cost with a value of 100 are the best parameters for the SVM fit.
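The selection rule can be checked against the detailed performance results printed by summary(tuned). The sketch below copies those twelve rows into a data frame by hand (the name perf is illustrative) and picks the row with the lowest error in base R; it illustrates the rule and is not a description of tune.svm's internal code.

```r
# Detailed performance results copied from the summary(tuned) output
perf <- data.frame(
  gamma = rep(10^(-6:-1), times = 2),
  cost  = rep(c(10, 100), each = 6),
  error = c(0.14774780, 0.14774780, 0.14774780, 0.14774780, 0.09245223, 0.09202306,
            0.14774780, 0.14774780, 0.14774780, 0.11794484, 0.08077885, 0.12356135),
  dispersion = c(0.02399512, 0.02399512, 0.02399512, 0.02399512, 0.02046032, 0.01938475,
                 0.02399512, 0.02399512, 0.02399512, 0.02368343, 0.01858195, 0.01661508)
)

# Best parameter set: the row with the lowest cross-validation error
best <- perf[which.min(perf$error), ]
best

# For contrast: the row with the lowest dispersion is a different combination
lowest_dispersion <- perf[which.min(perf$dispersion), ]
lowest_dispersion
```

The lowest error belongs to gamma = 0.01 and cost = 100, matching the best parameters reported in the summary, whereas the lowest dispersion belongs to gamma = 0.1 and cost = 100.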

After obtaining the best parameters, we can then train a new support vector machine with gamma equal to 0.01 and cost equal to 100. Additionally, we can obtain a classification table from the predicted labels and the true labels of the testing dataset, and a confusion matrix from that table. From the output of the confusion matrix, you can compare the newly trained model with the original model: accuracy rises from 0.9185 to 0.9312, and sensitivity from 0.49645 to 0.67376.
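The comparison between the two models can be laid out side by side with a short base R sketch. It rebuilds both classification tables from the counts printed in the recipes; the helper name metrics is hypothetical and introduced only for this illustration.

```r
# Hypothetical helper: accuracy, sensitivity and specificity with "yes" as positive
metrics <- function(tab) {
  c(accuracy    = sum(diag(tab)) / sum(tab),
    sensitivity = tab["yes", "yes"] / sum(tab[, "yes"]),
    specificity = tab["no", "no"] / sum(tab[, "no"]))
}

labels <- list(pred = c("yes", "no"), actual = c("yes", "no"))

# Counts copied from svm.table and svm.tuned.table (rows = predicted, columns = actual)
original <- matrix(c(70, 71, 12, 865), nrow = 2, dimnames = labels)
tuned    <- matrix(c(95, 46, 24, 853), nrow = 2, dimnames = labels)

comparison <- rbind(original = metrics(original), tuned = metrics(tuned))
round(comparison, 4)
```

In this comparison, tuning gains a large amount of sensitivity at the cost of a small drop in specificity.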