4. Performing cross-validation with the caret package
The caret (classification and regression training) package provides many functions for training regression and classification models. Like the e1071 package, it includes a function for performing k-fold cross-validation. In this recipe, we will demonstrate how to perform k-fold cross-validation using the caret package.
Getting ready
In this recipe, we will continue to use the telecom churn dataset as the input data source to perform k-fold cross-validation.
How to do it...
Perform the following steps to run k-fold cross-validation with the caret package:
1. First, set up the control parameter to train with 10-fold cross-validation in 3 repetitions:
> control = trainControl(method="repeatedcv", number=10, repeats=3)
2. Then, you can train the classification model on telecom churn data with rpart:
> model = train(churn~., data=trainset, method="rpart",
preProcess="scale", trControl=control)
3. Finally, you can examine the output of the generated model:
> model
2315 samples
16 predictor
2 classes: 'yes', 'no'
Pre-processing: scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 2084, 2083, 2082, 2084, 2083, 2084, ...
Resampling results across tuning parameters:
  cp      Accuracy  Kappa  Accuracy SD  Kappa SD
  0.0556  0.904     0.531  0.0236       0.155
  0.0746  0.867     0.269  0.0153       0.153
  0.0760  0.860     0.212  0.0107       0.141
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05555556.
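The steps above can be tried end to end without the churn data. The following is a minimal sketch, assuming the caret and rpart packages are installed; the built-in iris dataset and the seed value are stand-ins chosen for illustration, so the numbers will differ from the output shown above.

```r
# Minimal sketch: repeated 10-fold cross-validation with caret.
# iris replaces the churn data purely so the example is self-contained.
library(caret)

set.seed(2)  # arbitrary seed so the fold assignment is reproducible

# 10 folds, repeated 3 times, as in the recipe
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Fit an rpart tree; caret tries a few cp values and keeps the best one
model <- train(Species ~ ., data = iris, method = "rpart",
               preProcess = "scale", trControl = control)

print(model$results)   # one row per candidate cp value
print(model$bestTune)  # the cp value selected by accuracy
```

The `results` element holds the same table that printing the model displays, which is convenient when you want to compare candidate values programmatically.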
How it works...
In this recipe, we demonstrate how convenient it is to conduct k-fold cross-validation using the caret package. In the first step, we set up the training control and select the option to perform 10-fold cross-validation in three repetitions. Repeating the k-fold procedure is called repeated k-fold cross-validation, and it is used to test the stability of the model: if the model is stable, the results should be similar across repetitions. Then, we apply rpart to the training dataset, with the option to scale the data, and train the model with the control options configured in the previous step.
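To make the idea of stability across repetitions concrete, here is a hand-rolled sketch of repeated k-fold cross-validation. It is not how caret implements the procedure (caret stratifies its folds by class, for example, while this sketch uses plain random folds), and iris again stands in for the churn data.

```r
# Illustrative only: repeated k-fold cross-validation written by hand.
library(rpart)

set.seed(2)   # arbitrary seed for reproducibility
k <- 10
repeats <- 3
n <- nrow(iris)

rep_accuracy <- sapply(seq_len(repeats), function(r) {
  # Draw a fresh random assignment of rows to folds for each repetition
  fold <- sample(rep(seq_len(k), length.out = n))
  fold_acc <- sapply(seq_len(k), function(i) {
    fit  <- rpart(Species ~ ., data = iris[fold != i, ])          # train on k-1 folds
    pred <- predict(fit, iris[fold == i, ], type = "class")       # test on the held-out fold
    mean(pred == iris$Species[fold == i])
  })
  mean(fold_acc)  # mean accuracy for this repetition
})

print(rep_accuracy)      # one value per repetition
print(sd(rep_accuracy))  # a small spread suggests a stable model
```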
See also
You can configure the resampling method in trainControl, in which you can specify boot, boot632, cv, repeatedcv, LOOCV, LGOCV, none, oob, adaptive_cv, adaptive_boot, or adaptive_LGOCV. For more detailed information on how to choose the resampling method, view the trainControl documentation:
> ?trainControl
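As a rough illustration of switching resampling methods, the sketch below fits the same rpart model under three of the methods listed above. It assumes caret and rpart are installed; the iris data, the seed, and the choice of five resampling iterations are arbitrary values picked for the example.

```r
# Sketch: compare a few resampling methods on the same model.
library(caret)

set.seed(2)  # arbitrary seed for reproducibility
methods <- c("cv", "boot", "LGOCV")

results <- sapply(methods, function(m) {
  # number = folds for "cv", resampling iterations for "boot" and "LGOCV"
  ctrl <- trainControl(method = m, number = 5)
  fit  <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
  max(fit$results$Accuracy)  # best resampled accuracy across cp values
})

print(results)  # named vector, one estimate per resampling method
```

The estimates usually differ slightly from one method to another, because each method holds out a different share of the data on each iteration.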
Ranking the variable importance with the rminer package
Besides using the caret package to generate variable importance, you can use the rminer package to generate the variable importance of a classification model. In the following recipe, we will illustrate how to use rminer to obtain the variable importance of a fitted model.
Getting ready
In this recipe, we will continue to use the telecom churn dataset as the input data source to rank the variable importance.
How to do it...
Perform the following steps to rank the variable importance with rminer:
1. Install and load the package, rminer:
> install.packages("rminer")
> library(rminer)
2. Fit the svm model with the training set:
> model=fit(churn~.,trainset,model="svm")
3. Use the Importance function to obtain the variable importance:
> VariableImportance=Importance(model,trainset,method="sensv")
4. Plot the variable importance ranked by the variance:
> L=list(runs=1,sen=t(VariableImportance$imp),sresponses=VariableImportance$sresponses)
> mgraph(L,graph="IMP",leg=names(trainset),col="gray",Grid=10)
Figure 2: The visualization of variable importance using the rminer package
How it works...
Similar to the caret package, the rminer package can also generate the variable importance of a classification model. In this recipe, we first train the svm model on the training dataset, trainset, with the fit function. Then, we use the Importance function to rank the variable importance with a sensitivity measure. Finally, we use mgraph to plot the rank of the variable importance.
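The idea behind a variance-based sensitivity measure can be sketched by hand: vary one input across its range while holding the others fixed, and record how much the model's response changes. The code below is a simplified illustration of that idea, not rminer's implementation; the rpart model, the iris data, the choice of the "virginica" class probability as the response, and the seven grid levels are all assumptions made for the example.

```r
# Illustrative one-dimensional sensitivity analysis (not rminer's code).
library(rpart)

fit <- rpart(Species ~ ., data = iris)
predictors <- setdiff(names(iris), "Species")

# One-row data frame holding every predictor at its mean
baseline <- as.data.frame(lapply(iris[predictors], mean))
levels_per_var <- 7  # arbitrary grid size for this sketch

sens_var <- sapply(predictors, function(v) {
  grid <- baseline[rep(1, levels_per_var), , drop = FALSE]
  # Sweep predictor v from its minimum to its maximum; others stay at the mean
  grid[[v]] <- seq(min(iris[[v]]), max(iris[[v]]), length.out = levels_per_var)
  p <- predict(fit, grid, type = "prob")[, "virginica"]
  var(p)  # variance of the response along the sweep
})

# Normalize so the importances sum to 1
importance <- sens_var / sum(sens_var)
print(sort(importance, decreasing = TRUE))
```

A predictor that the model never uses produces a flat response along its sweep and therefore an importance of zero.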
6. Finding highly correlated features with the caret package
When performing regression or classification, some models perform better if highly correlated attributes are removed. The caret package provides the findCorrelation function, which can be used to find attributes that are highly correlated to each other. In this recipe, we will demonstrate how to find highly correlated features using the caret package.
How to do it...
Perform the following steps to find highly correlated attributes:
1. Remove the features that are not numeric:
> new_train = trainset[,! names(trainset) %in% c("churn", "international_plan", "voice_mail_plan")]
2. Then, you can obtain the correlation of each attribute:
> cor_mat = cor(new_train)
3. Next, we use findCorrelation to search for highly correlated attributes with a cutoff equal to 0.75:
> highlyCorrelated = findCorrelation(cor_mat, cutoff=0.75)
4. We then obtain the names of the highly correlated attributes:
> names(new_train)[highlyCorrelated]
[1] "total_intl_minutes" "total_day_charge" "total_eve_minutes" "total_night_minutes"
How it works...
In this recipe, we search for highly correlated attributes using the caret package. In order to retrieve the correlation of each attribute, one should first remove nonnumeric attributes. Then, we compute the correlation matrix of the remaining attributes. Next, we use findCorrelation to find highly correlated attributes with the cutoff set to 0.75. Finally, we obtain the names of the highly correlated attributes.
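The following sketch shows findCorrelation on a small synthetic data frame, where the correlation structure is known in advance. It assumes the caret package is installed; the column names, seed, and noise level are made up for the example. Note that findCorrelation returns the column indices it suggests removing, which is one column from each highly correlated pair rather than both.

```r
# Sketch: findCorrelation on synthetic data with one redundant column.
library(caret)

set.seed(2)  # arbitrary seed for reproducibility
a <- rnorm(200)
toy <- data.frame(
  a      = a,
  a_copy = a + rnorm(200, sd = 0.1),  # nearly identical to a
  b      = rnorm(200)                 # generated independently of a
)

cor_mat <- cor(toy)
flagged <- findCorrelation(cor_mat, cutoff = 0.75)  # indices suggested for removal

print(round(cor_mat, 2))
print(names(toy)[flagged])  # one column from the correlated pair

# Guard against an empty result: toy[, -integer(0)] would drop every column
reduced <- if (length(flagged) > 0) toy[, -flagged, drop = FALSE] else toy
print(names(reduced))
```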