“Raw” Principal Components Analysis

First we will perform Principal Components Analysis (PCA) step by step, without using a premade function.

Step 1: Data

Make the dataset and plot it.

d <- c(2.5, 2.4, 0.5, 0.7, 2.2, 2.9, 1.9, 2.2, 3.1, 3.0, 2.3, 
       2.7, 2, 1.6, 1, 1.1, 1.5, 1.6, 1.1, 0.9)
data <- matrix(d, ncol=2, byrow = T)
plot(data, xlab="x1", ylab="x2")

Step 2: Subtract the mean

We subtract the mean of each attribute (x1, x2) from the respective values (centering the data) using the function scale.

data_norm <- scale(data, scale=F)
plot(data_norm, xlab="x1", ylab="x2")
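
The same centering can also be done by hand; a quick sanity check, assuming the objects defined above:

means <- colMeans(data)
data_centered <- sweep(data, 2, means, "-")
# should be TRUE (scale() adds an attribute, so we ignore attributes)
all.equal(data_centered, data_norm, check.attributes = FALSE)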

Step 3: Calculate the covariance matrix

S <- cov(data_norm)
print(S)
##           [,1]      [,2]
## [1,] 0.6165556 0.6154444
## [2,] 0.6154444 0.7165556
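
The same matrix can be computed by hand from the centered data, since S = X'X / (n - 1); a quick check, assuming the objects above:

n <- nrow(data_norm)
# crossprod(X) computes t(X) %*% X
S_manual <- crossprod(data_norm) / (n - 1)
print(S_manual)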

Step 4: Compute the eigenvectors of the covariance matrix

Since the covariance matrix is symmetric and positive semi-definite, its singular value decomposition coincides with its eigendecomposition, so svd gives us the eigenvalues (d) and eigenvectors (the columns of u).

udv <- svd(S)
print(udv)
## $d
## [1] 1.2840277 0.0490834
## 
## $u
##            [,1]       [,2]
## [1,] -0.6778734 -0.7351787
## [2,] -0.7351787  0.6778734
## 
## $v
##            [,1]       [,2]
## [1,] -0.6778734 -0.7351787
## [2,] -0.7351787  0.6778734
# asp=1 keeps the aspect ratio at 1 so that the eigenvectors appear perpendicular
plot(data_norm, asp=1, xlab="x1", ylab="x2")
arrows(0,0,udv$u[1,1],udv$u[2,1], lwd=1)
arrows(0,0,udv$u[1,2],udv$u[2,2], lwd=0.5)
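
Because S is symmetric, eigen() yields the same eigenvalues and (up to sign) the same eigenvectors as the svd above; a quick check:

e <- eigen(S)
print(e$values)   # compare with udv$d
print(e$vectors)  # compare with udv$u (columns may be flipped in sign)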

Step 5: Choosing components

barplot(udv$d)

print(cumsum(udv$d)/sum(udv$d))
## [1] 0.9631813 1.0000000

We can see that the 1st component accounts for about 96% of the total variance.

Step 6: Picking the 1st component

We transform the 2D dataset into a 1D dataset using just the 1st PC.

data_new <- t(udv$u[,1,drop=FALSE]) %*% t(data_norm)
data_new
##            [,1]    [,2]       [,3]       [,4]      [,5]       [,6]
## [1,] -0.8279702 1.77758 -0.9921975 -0.2742104 -1.675801 -0.9129491
##            [,7]     [,8]      [,9]    [,10]
## [1,] 0.09910944 1.144572 0.4380461 1.223821
# plot the 1D projection against itself (all points fall on a line)
plot(data_new, data_new, asp=1, xlab="PC1", ylab="PC1")
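
To see what is lost by keeping only the 1st PC, we can map the 1D scores back into the original 2D space and overlay them on the centered data; a minimal sketch using the objects above:

# back-projection onto the first eigenvector
data_reconstructed <- t(udv$u[,1,drop=FALSE] %*% data_new)
plot(data_norm, asp=1, xlab="x1", ylab="x2")
points(data_reconstructed, pch=4, col="red")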

PCA using an R function

Next we will use the prcomp function, which performs PCA in one step. We will apply it to the Iris dataset.

# load the built-in iris dataset
data = iris
set.seed(1234)
ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.7, 0.3))
trainData <- data[ind==1,]
validationData <- data[ind==2,]

library(class)
library(stats)

trainData <- trainData[complete.cases(trainData),]
validationData <- validationData[complete.cases(validationData),]

trainDataX <- trainData[,-ncol(trainData)]
logTrainDataX <- log(trainDataX)
train.pca <- prcomp(logTrainDataX, center = TRUE, scale. = TRUE)
summary(train.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7226 0.9313 0.36484 0.17916
## Proportion of Variance 0.7419 0.2168 0.03328 0.00802
## Cumulative Proportion  0.7419 0.9587 0.99198 1.00000

From the summary, you can see that the first 2 PCs explain more than 95% of the variance.
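
The proportions reported by summary come directly from the squared standard deviations of the components; a quick check using the fitted object:

prop_var <- train.pca$sdev^2 / sum(train.pca$sdev^2)
print(prop_var)
print(cumsum(prop_var))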

trainDataY <- trainData$Species
validationDataX <- validationData[,-ncol(validationData)]
# Let's also transform the validation data
logValidationDataX <- log(validationDataX)
validation.pca <- predict(train.pca, newdata=logValidationDataX)
validationDataY <- validationData$Species
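
For prcomp, predict is essentially centering and scaling the new data with the training parameters and projecting onto the rotation matrix; a quick check, assuming the objects above:

manual.pca <- scale(logValidationDataX,
                    center = train.pca$center,
                    scale  = train.pca$scale) %*% train.pca$rotation
# should be TRUE
all.equal(manual.pca, validation.pca, check.attributes = FALSE)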

# prediction without PCA, using all four original features
prediction = knn(trainDataX, validationDataX, trainDataY, k = 3)
# prediction using only the first 2 principal components
prediction_pca = knn(train.pca$x[,1:2], validation.pca[,1:2], trainDataY, k = 3)

cat("Confusion matrix:\n")
## Confusion matrix:
xtab = table(prediction, validationData$Species)
print(xtab)
##             
## prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         12         1
##   virginica       0          0        15
cat("\nEvaluation:\n\n")
## 
## Evaluation:
accuracy = sum(prediction == validationData$Species)/length(validationData$Species)
cat(paste("Accuracy:\t", format(accuracy, digits=2), "\n",sep=" "))
## Accuracy:     0.97
cat("Confusion matrix PCA:\n")
## Confusion matrix PCA:
xtab = table(prediction_pca, validationData$Species)
print(xtab)
##               
## prediction_pca setosa versicolor virginica
##     setosa         10          0         0
##     versicolor      0         10         3
##     virginica       0          2        13
cat("\nEvaluation PCA:\n\n")
## 
## Evaluation PCA:
accuracy = sum(prediction_pca == validationData$Species)/length(validationData$Species)
cat(paste("Accuracy PCA:\t", format(accuracy, digits=2), "\n",sep=" "))
## Accuracy PCA:     0.87
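
Both accuracies can also be read directly off the confusion matrices, since the correct predictions lie on the diagonal; for the PCA table above:

# correct predictions / total predictions
sum(diag(xtab)) / sum(xtab)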

Plotting

Using PCA we can also plot high-dimensional data in 2D using just the first two components.

plot(train.pca$x[trainData$Species == 'setosa',1:2], col="blue", ylim = c(-3, 3), xlim=c(-3,3), asp=1)
points(train.pca$x[trainData$Species == 'versicolor',1:2], pch = 3, col="red")
points(train.pca$x[trainData$Species == 'virginica',1:2], pch = 4, col="green")
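
A legend could be added so the three species are easier to tell apart; a small addition to the plot above:

legend("topright", legend=c("setosa", "versicolor", "virginica"),
       col=c("blue", "red", "green"), pch=c(1, 3, 4))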

In this case, PCA is not the best method for dimensionality reduction to improve accuracy. From the plot one can observe that the versicolor and virginica species are intermixed in 2D, and thus the k-NN algorithm will have trouble separating the two.