In this tutorial, we will use the Recursive Feature Elimination (RFE) method to select attributes in the Pima Indians Diabetes dataset. For this purpose the caret library will be used.
First we will split the dataset into training and validation sets.
# ensure the results are repeatable
set.seed(1234)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
data <- PimaIndiansDiabetes
# split into training and validation datasets
set.seed(1234)
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]
trainData <- trainData[complete.cases(trainData),]
validationData <- validationData[complete.cases(validationData),]
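Before running feature selection, it can be worth confirming that the random split preserved the class balance in both sets. A minimal check (the exact proportions will vary with the seed):

```r
# compare the proportion of each diabetes outcome in the two splits
prop.table(table(trainData$diabetes))
prop.table(table(validationData$diabetes))
# and the size of each split
nrow(trainData)
nrow(validationData)
```

If the class proportions diverged badly between the splits, a stratified split (e.g. caret's createDataPartition) would be the safer choice.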
Next we will use the rfe function of the caret package, configuring it with the rfeControl function.
# define the control using the naive Bayes selection functions, with 10-fold cross-validation
control <- rfeControl(functions=nbFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(trainData[,1:8], trainData[,9], sizes=c(1:8), rfeControl=control)
Let’s see the results.
# summarize the results
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.7453 0.3681 0.04390 0.12449
## 2 0.7546 0.4182 0.04598 0.12317
## 3 0.7715 0.4707 0.05759 0.14388 *
## 4 0.7584 0.4479 0.04145 0.10478
## 5 0.7546 0.4457 0.04429 0.10095
## 6 0.7490 0.4344 0.04393 0.09394
## 7 0.7509 0.4290 0.04686 0.11122
## 8 0.7491 0.4201 0.03507 0.07767
##
## The top 3 variables (out of 3):
## glucose, age, mass
# list the chosen features
predictors(results)
## [1] "glucose" "age" "mass"
# plot the results
plot(results, type=c("g", "o"))
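The object returned by rfe also exposes the selection outcome directly, which is handy when the result feeds into later code. A brief sketch:

```r
# optimal number of variables chosen by RFE
results$optsize
# the selected variable names (equivalent to predictors(results))
results$optVariables
# resampling performance for each subset size, as a data frame
results$results
```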
We will use the top 3 variables selected by RFE in the Naive Bayes algorithm:
library(e1071)
(f <- as.formula(paste("diabetes", paste(results$optVariables, collapse=" + "), sep=" ~ ")))
## diabetes ~ glucose + age + mass
model <- naiveBayes(diabetes ~ ., data=trainData, laplace = 1)
simpler_model <- naiveBayes(f, data=trainData, laplace = 1)
pred <- predict(model, validationData)
simpler_pred <- predict(simpler_model, validationData)
library(MLmetrics)
train_pred <- predict(model, trainData)
train_simpler_pred <- predict(simpler_model, trainData)
paste("Accuracy in training all attributes",
Accuracy(train_pred, trainData$diabetes), sep=" - ")
## [1] "Accuracy in training all attributes - 0.760299625468165"
paste("Accuracy in training RFE attributes",
Accuracy(train_simpler_pred, trainData$diabetes), sep=" - ")
## [1] "Accuracy in training RFE attributes - 0.764044943820225"
paste("Accuracy in validation all attributes",
Accuracy(pred, validationData$diabetes), sep=" - ")
## [1] "Accuracy in validation all attributes - 0.764957264957265"
paste("Accuracy in validation RFE attributes",
Accuracy(simpler_pred, validationData$diabetes), sep=" - ")
## [1] "Accuracy in validation RFE attributes - 0.799145299145299"
Using a simpler formula with 3 out of 8 attributes, we were able to get better generalization results on the validation dataset.