In this section we will use the Correlation Feature Selection (CFS) method to select attributes to be used when training and testing the machine learning models. We will use the Iris dataset for this purpose.

data = iris
# split into training and validation datasets
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]
trainData <- trainData[complete.cases(trainData),]
validationData <- validationData[complete.cases(validationData),]

Using the CFS method

The CFS method is found in the FSelector package and to use it we call the cfs method.

subset <- cfs(Species ~ ., trainData)
f <- as.simple.formula(subset, "Species")
## Species ~ Petal.Length + Petal.Width
## <environment: 0x114e18a70>

Now that we have the formula with the CFS filter method we can use it in various classification algorithms.

Naive Bayes

For example in the Naive Bayes algorithm, we will use both the formula that includes all the attributes for predicting the Species and the formula derived from CFS.

model <- naiveBayes(Species ~ ., data=trainData, laplace = 1)
simpler_model <- naiveBayes(f, data=trainData, laplace = 1)

pred <- predict(model, validationData)
simpler_pred <- predict(simpler_model, validationData)

train_pred <- predict(model, trainData)
train_simpler_pred <- predict(simpler_model, trainData)
paste("Accuracy in training all attributes", 
      Accuracy(train_pred, trainData$Species), sep=" - ")
## [1] "Accuracy in training all attributes - 0.964285714285714"
paste("Accuracy in training CFS attributes", 
      Accuracy(train_simpler_pred, trainData$Species), sep=" - ")
## [1] "Accuracy in training CFS attributes - 0.955357142857143"
paste("Accuracy in validation all attributes", 
      Accuracy(pred, validationData$Species), sep=" - ")
## [1] "Accuracy in validation all attributes - 0.947368421052632"
paste("Accuracy in validation CFS attributes", 
      Accuracy(simpler_pred, validationData$Species), sep=" - ")
## [1] "Accuracy in validation CFS attributes - 0.973684210526316"

The accuracy in the training set is increased when using all the attrbitues, but decreased in the validation set in relation to the simpler model that only used 2 attributes.