Introduction

We will apply a wrapper feature selection method, forward search with decision trees as the evaluation model, to the Breast Cancer Wisconsin dataset. First we split the dataset into training and validation sets.

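# Missing values are coded as "?" in the raw file
data <- read.table('breast-cancer-wisconsin.data', na.strings = "?", sep=",")
# Drop the first column (the sample ID), which is not a predictor
data <- data[,-1]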
names(data) <- c("ClumpThickness", 
                 "UniformityCellSize", 
                 "UniformityCellShape", 
                 "MarginalAdhesion",
                 "SingleEpithelialCellSize",
                 "BareNuclei",
                 "BlandChromatin",
                 "NormalNucleoli",
                 "Mitoses",
                 "Class")
data$Class <- factor(data$Class, levels=c(2,4), labels=c("benign", "malignant"))
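# Fix the seed so the split is reproducible
set.seed(1234)
# Assign each row to group 1 (training) or 2 (validation) with
# probabilities 0.7 and 0.3, so the split is only approximately 70/30
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))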
trainData <- data[ind==1,]
validationData <- data[ind==2,]
# remove cases with missing data
trainData <- trainData[complete.cases(trainData),]
validationData <- validationData[complete.cases(validationData),]
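
The wrapper step described in the introduction can be pictured as a greedy loop: start from an empty set, add the attribute that most improves a decision tree's cross-validated accuracy, and stop when no addition helps. The code below is a minimal sketch under assumptions, not the procedure that produced the formula f used in the next section. The names cv_accuracy and forward_search_sketch, the 5-fold scheme, and the toy data frame are all illustrative, and rpart is assumed as the decision tree implementation.

```r
library(rpart)  # decision trees; a recommended package bundled with R

# Mean k-fold cross-validated accuracy of an rpart tree that only sees
# the columns named in `attrs`. Names and defaults are illustrative.
cv_accuracy <- function(attrs, data, target = "Class", k = 5) {
  # Deterministic fold ids, so every subset is scored on the same folds
  folds <- rep(seq_len(k), length.out = nrow(data))
  fml <- reformulate(attrs, response = target)
  fold_acc <- vapply(seq_len(k), function(i) {
    tree <- rpart(fml, data = data[folds != i, ], method = "class")
    pred <- predict(tree, data[folds == i, ], type = "class")
    mean(as.character(pred) == as.character(data[folds == i, target]))
  }, numeric(1))
  mean(fold_acc)
}

# Greedy forward search: repeatedly add the single attribute that gives
# the highest score, stopping when no candidate strictly improves it.
forward_search_sketch <- function(attributes, evaluate) {
  selected <- character(0)
  best <- -Inf
  repeat {
    candidates <- setdiff(attributes, selected)
    if (length(candidates) == 0) break
    scores <- vapply(candidates,
                     function(a) evaluate(c(selected, a)),
                     numeric(1))
    if (max(scores) <= best) break
    best <- max(scores)
    selected <- c(selected, candidates[which.max(scores)])
  }
  selected
}

# Toy data (hypothetical): the class depends on `signal` only.
set.seed(1)
toy <- data.frame(signal = rnorm(200), noise = rnorm(200))
toy$Class <- factor(ifelse(toy$signal > 0, "malignant", "benign"))

chosen <- forward_search_sketch(c("signal", "noise"),
                                function(a) cv_accuracy(a, toy))
print(reformulate(chosen, response = "Class"))
```

With the split above in scope, the same two functions could be called with setdiff(names(trainData), "Class") as the attribute list and trainData in place of toy. Scoring every candidate subset on identical folds keeps the comparison between subsets from being driven by fold noise.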

Naive Bayes

We use a Naive Bayes classifier to compare the attributes chosen by forward search against the full attribute set, measuring accuracy on both the training and validation sets.

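library(e1071)
# Model trained on all 9 attributes
model <- naiveBayes(Class ~ ., data=trainData, laplace = 1)
# Model trained only on the attributes in f, the formula
# produced by the forward search step
simpler_model <- naiveBayes(f, data=trainData, laplace = 1)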

pred <- predict(model, validationData)
simpler_pred <- predict(simpler_model, validationData)

library(MLmetrics)
train_pred <- predict(model, trainData)
train_simpler_pred <- predict(simpler_model, trainData)
paste("Accuracy in training all attributes", 
      Accuracy(train_pred, trainData$Class), sep=" - ")
## [1] "Accuracy in training all attributes - 0.957805907172996"
paste("Accuracy in training forward search attributes", 
      Accuracy(train_simpler_pred, trainData$Class), sep=" - ")
## [1] "Accuracy in training forward search attributes - 0.953586497890295"
paste("Accuracy in validation all attributes", 
      Accuracy(pred, validationData$Class), sep=" - ")
## [1] "Accuracy in validation all attributes - 0.976076555023923"
paste("Accuracy in validation forward search attributes", 
      Accuracy(simpler_pred, validationData$Class), sep=" - ")
## [1] "Accuracy in validation forward search attributes - 0.971291866028708"
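
Accuracy is a single number and does not show which class the errors fall in. As a small illustration, the sketch below builds a confusion matrix with base R's table() and recovers accuracy from its diagonal. The two label vectors are made-up toy values standing in for validationData$Class and the output of predict(); they are not results from the dataset.

```r
# Toy labels (hypothetical), using the same class levels as the tutorial
lv <- c("benign", "malignant")
actual <- factor(c("benign", "benign", "malignant", "malignant", "benign"),
                 levels = lv)
predicted <- factor(c("benign", "malignant", "malignant", "malignant", "benign"),
                    levels = lv)

# Rows are predictions, columns are true classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy is the share of cases on the diagonal of the matrix
acc <- sum(diag(cm)) / sum(cm)

# It agrees with the proportion of matching labels
stopifnot(isTRUE(all.equal(acc, mean(predicted == actual))))
```

Replacing the toy vectors with pred and validationData$Class would give the corresponding table for the full model.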

On the Breast Cancer Wisconsin dataset, feature selection did not outperform using all attributes: the forward search subset scored slightly lower on both the training and validation sets. A likely cause is that the 9 attributes were handpicked by domain experts and carry predictive power together, so removing some of them does not produce better results.
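
The gap between the two validation accuracies is small. The printed values are consistent with 204 and 203 correct predictions out of 209 validation cases; these counts are inferred from the output above, not stated in it. Under that assumption, a rough sketch with base R's binom.test() shows how much uncertainty surrounds each figure.

```r
# Assumption: counts inferred from the printed validation accuracies
# (204/209 and 203/209), not taken from the original analysis.
n_valid <- 209
correct_all <- 204  # all attributes
correct_fwd <- 203  # forward search attributes

# Exact (Clopper-Pearson) 95% confidence intervals for each accuracy
ci_all <- binom.test(correct_all, n_valid)$conf.int
ci_fwd <- binom.test(correct_fwd, n_valid)$conf.int
print(round(rbind(all_attributes = ci_all, forward_search = ci_fwd), 3))

# TRUE when the two intervals share at least one value
overlap <- ci_all[1] <= ci_fwd[2] && ci_fwd[1] <= ci_all[2]
print(overlap)
```

Both models are scored on the same validation cases, so separate intervals are only a rough guide; a paired comparison of the two prediction vectors would be the more careful check.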