Article Translation: Chapter 7, Recipes 7-9

Posted by DY Data Science Lab


7. Selecting features using the caret package

Feature selection searches for the subset of features that minimizes predictive error. We can apply it to identify which attributes are required to build an accurate model. The caret package provides a recursive feature elimination function, rfe, which can help select the required features automatically. In the following recipe, we will demonstrate how to use the caret package to perform feature selection.

How to do it...

Perform the following steps to select features:

1. Transform the feature named international_plan of the training dataset, trainset, to intl_yes and intl_no:

> intl_plan = model.matrix(~ trainset.international_plan - 1, data=data.frame(trainset$international_plan))
> colnames(intl_plan) = c("trainset.international_planno"="intl_no", "trainset.international_planyes"="intl_yes")

2. Transform the feature named voice_mail_plan of the training dataset, trainset, to voice_yes and voice_no:

> voice_plan = model.matrix(~ trainset.voice_mail_plan - 1, data=data.frame(trainset$voice_mail_plan))
> colnames(voice_plan) = c("trainset.voice_mail_planno"="voice_no", "trainset.voice_mail_planyes"="voice_yes")

3. Remove the international_plan and voice_mail_plan attributes and combine the training dataset, trainset, with the data frames intl_plan and voice_plan:

> trainset$international_plan = NULL
> trainset$voice_mail_plan = NULL
> trainset = cbind(intl_plan, voice_plan, trainset)

4. Transform the feature named international_plan of the testing dataset, testset, to intl_yes and intl_no:

> intl_plan = model.matrix(~ testset.international_plan - 1, data=data.frame(testset$international_plan))
> colnames(intl_plan) = c("testset.international_planno"="intl_no", "testset.international_planyes"="intl_yes")

5. Transform the feature named voice_mail_plan of the testing dataset, testset, to voice_yes and voice_no:

> voice_plan = model.matrix(~ testset.voice_mail_plan - 1, data=data.frame(testset$voice_mail_plan))
> colnames(voice_plan) = c("testset.voice_mail_planno"="voice_no", "testset.voice_mail_planyes"="voice_yes")

6. Remove the international_plan and voice_mail_plan attributes and combine the testing dataset, testset, with the data frames intl_plan and voice_plan:

> testset$international_plan = NULL
> testset$voice_mail_plan = NULL
> testset = cbind(intl_plan, voice_plan, testset)

7. We then create a feature selection algorithm using linear discriminant analysis:

> ldaControl = rfeControl(functions = ldaFuncs, method = "cv")
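The control object from step 7 is normally handed to caret's rfe function. The following is only a sketch of that hand-off, not the recipe's own continuation: it assumes the caret and MASS packages are installed and uses the built-in iris data in place of the churn data, and the names ctrl and lda_profile, the 5-fold setting, and the sizes argument are illustrative choices.

```r
library(caret)  # provides rfe, rfeControl, ldaFuncs, predictors

set.seed(1)  # cross-validation folds are random

# Control object: LDA-based helper functions with 5-fold cross-validation
# (the fold count is an assumption made for this sketch).
ctrl <- rfeControl(functions = ldaFuncs, method = "cv", number = 5)

# Recursive feature elimination on the four numeric iris predictors,
# trying subsets of 1 to 3 variables as well as the full set.
lda_profile <- rfe(x = iris[, 1:4], y = iris$Species,
                   sizes = 1:3, rfeControl = ctrl)

# Names of the predictors kept in the best-performing subset.
predictors(lda_profile)
```

Printing lda_profile shows the resampled performance for each subset size, which is what the selection is based on.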

In this recipe, we perform feature selection using the caret package. As there are factor-coded attributes within the dataset, we first use a function called model.matrix to transform the factor-coded attributes into multiple binary attributes. Therefore, we transform the international_plan attribute to intl_yes and intl_no. Additionally, we transform the voice_mail_plan attribute to voice_yes and voice_no.
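To make the model.matrix step concrete, here is a minimal sketch on a hypothetical four-row data frame (toy is not part of the recipe; it stands in for the churn training set). It uses only base R.

```r
# Hypothetical toy data standing in for the churn training set.
toy <- data.frame(international_plan = factor(c("no", "yes", "no", "yes")))

# "- 1" drops the intercept, so every factor level gets its own 0/1 column.
intl_plan <- model.matrix(~ international_plan - 1, data = toy)

# Rename the generated columns; factor levels are ordered alphabetically
# (no, yes), so the new names follow that order.
colnames(intl_plan) <- c("intl_no", "intl_yes")

intl_plan
```

Each row has exactly one 1 across the two columns, which is why the original factor column can be dropped once the binary columns are bound back onto the dataset.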

8. Measuring the performance of the regression model

To measure the performance of a regression model, we can calculate the distance between the predicted output and the actual output as a quantifier of the model's performance. Here, we often use the root mean square error (RMSE), relative square error (RSE), and R-Square as common measurements. In the following recipe, we will illustrate how to compute these measurements from a built regression model.

The performance of a regression model is measured by the distance between the predicted value and the actual value. We often use three measurements, root mean square error, relative square error, and R-Square, as quantifiers of the performance of regression models. In this recipe, we first load the Quartet data from the car package. We then use the lm function to fit a linear model of y3 on x.
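For reference, the three measurements can be written out as follows. The notation is an assumption introduced here, not taken from the source: y_i is the actual value, \hat{y}_i the predicted value, \bar{y} the mean of the actual values, and n the number of observations.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2},
\qquad
\mathrm{RSE} = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (\bar{y} - y_i)^2},
\qquad
R^2 = 1 - \mathrm{RSE}
```

RSE compares the model's squared error with that of always predicting the mean, so a value near 0 (and an R-Square near 1) indicates a better fit.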

How to do it...

Perform the following steps to measure the performance of the regression model:

1. Load the Quartet dataset from the car package:

> library(car)
> data(Quartet)

2. Plot the attribute y3 against x, and add the line fitted with the lm function:

> plot(Quartet$x, Quartet$y3)
> lmfit = lm(Quartet$y3~Quartet$x)
> abline(lmfit, col="red")

Figure 4: The linear regression plot

3. You can retrieve predicted values by using the predict function:

> predicted = predict(lmfit, newdata=Quartet[c("x")])

4. Now, you can calculate the root mean square error:

> actual = Quartet$y3
> rmse = (mean((predicted - actual)^2))^0.5
> rmse
[1] 1.118286

5. You can calculate the relative square error:

> mu = mean(actual)
> rse = mean((predicted - actual)^2) / mean((mu - actual)^2)
> rse
[1] 0.333676

6. Also, you can use R-Square as a measurement:

> rsquare = 1 - rse
> rsquare
[1] 0.666324

7. Then, you can plot the attribute y3 against x using the rlm function from the MASS package:

> library(MASS)
> plot(Quartet$x, Quartet$y3)
> rlmfit = rlm(Quartet$y3~Quartet$x)
> abline(rlmfit, col="red")

Figure 5: The robust linear regression plot on the Quartet dataset


9. Measuring prediction performance with a confusion matrix

To measure the performance of a classification model, we can first generate a classification table based on our predicted labels and actual labels. Then, we can use a confusion matrix to obtain performance measures such as precision, recall, specificity, and accuracy. In this recipe, we will demonstrate how to retrieve a confusion matrix.

How to do it...

Perform the following steps to generate a classification measurement:

1. Train an svm model using the training dataset:

> svm.model = train(churn ~ .,
+                   data = trainset,
+                   method = "svmRadial")

2. You can then predict labels using the fitted model, svm.model:

> svm.pred = predict(svm.model, testset[,! names(testset) %in% c("churn")])

3. Next, you can generate a classification table:

> table(svm.pred, testset[,c("churn")])
svm.pred yes  no
     yes  73  16
     no   68 861

4. Lastly, you can generate a confusion matrix using the prediction results and the actual labels from the testing dataset:

> confusionMatrix(svm.pred, testset[,c("churn")])
Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes  73  16
       no   68 861

               Accuracy : 0.9175
                 95% CI : (0.8989, 0.9337)
    No Information Rate : 0.8615
    P-Value [Acc > NIR] : 2.273e-08
                  Kappa : 0.5909
 Mcnemar's Test P-Value : 2.628e-08

            Sensitivity : 0.51773
            Specificity : 0.98176
         Pos Pred Value : 0.82022
         Neg Pred Value : 0.92680
             Prevalence : 0.13851
         Detection Rate : 0.07171
   Detection Prevalence : 0.08743
      Balanced Accuracy : 0.74974

       'Positive' Class : yes

In this recipe, we demonstrate how to obtain a confusion matrix to measure the performance of a classification model. First, we use the train function from the caret package to train an svm model. Next, we use the predict function to extract the predicted labels of the svm model using the testing dataset. Then, we use the table function to obtain the classification table of predicted labels against actual labels.
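The headline statistics in the confusionMatrix output can be recomputed by hand from the four cells of the classification table. The sketch below does this in base R; the variable names (tp, fp, fn, tn) are introduced here for illustration, with "yes" treated as the positive class as in the output above.

```r
# Counts copied from the classification table
# (rows = predicted label, columns = actual label).
tp <- 73   # predicted yes, actual yes
fp <- 16   # predicted yes, actual no
fn <- 68   # predicted no,  actual yes
tn <- 861  # predicted no,  actual no

total <- tp + fp + fn + tn

accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)   # recall for the "yes" class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # reported as Pos Pred Value
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity,
        precision = precision,
        balanced_accuracy = balanced_accuracy), 4)
```

The low sensitivity next to the high accuracy reflects the class imbalance: only 141 of the 1018 test cases are actual "yes" cases, and 68 of those are missed.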

-------- Chinese translation excerpted from Baidu Translate

郎小敏
