Introduction to R

The basics

R is an open source programming language and a free environment, mainly used for statistical computing and graphics. Information about R you can find in the official website. By searching with the keyword R with other topic-specific words in sites like Google, one can find additional information from sites, blog posts, tutorials, documents etc.

Even through R comes with its own environment: command line and graphical interfaces, one can use the popular RStudio, which offers additional graphical functionalities.

When in the R environment (the R prompt is >) one can exit by calling the quit() function or q() for short. When asked if you want to save the workspace, if you reply with a y for yes, all the variables that you have during the current R session will be saved into a file names .Rdata in the current working directory. If you later start R in the same directory, the variables and their names will be automatically loaded.

To check which is your current working directory, your can enter:

getwd()

To set the working directory one can use the setwd function:

setwd("~/Desktop")

What you type at the R prompt is an expression, which R attempts to evaluate and type the result. For example getwd() is an expression that is evaluated by calling the function getwd() with no arguments. The same for 42

## [1] 42

and the same for

(100 * 2 - 12 ^ 2) / 7 * 5 + 2

## [1] 42

There are also predefined constants like pi or e

sin(pi/2)

## [1] 1

To find out the documentation of a specific function you can enter ?sum or help(sum). To search for functions, there is the help.search("sin") function to help you with that. For certain functions on can see examples of use by using the expression example(plot). Comments start with #, while to assign values to variables you can use <- or =. For example:

a <- 42
b <- (42 + a) / 2
print(a)

## [1] 42

print(b)

## [1] 42

With ls() one can check all the variables existing in the current R session.

ls()

##  [1] "a"                   "accuracy"            "b"                  
##  [4] "centers.matrix"      "clustering"          "col"                
##  [7] "control"             "ctree"               "ctrl"               
## [10] "d"                   "data"                "data_new"           
## [13] "data_norm"           "dataset"             "dbscanClust"        
## [16] "demo"                "df"                  "distances"          
## [19] "entTree"             "evaluation"          "evaluator"          
## [22] "f"                   "fileURL"             "fulldata"           
## [25] "groups"              "hc"                  "ind"                
## [28] "indices"             "inTrain"             "kmeansClust"        
## [31] "kNNFit"              "kNNFit1"             "kNNFit2"            
## [34] "kNNFit4"             "knnFit5"             "knnPredict"         
## [37] "knnPredict2"         "logTrainDataX"       "logValidationDataX" 
## [40] "model"               "outlier.list1"       "outlier.list2"      
## [43] "outlier.scores"      "outliers"            "pch"                
## [46] "PimaIndiansDiabetes" "precision"           "pred"               
## [49] "pred_obj"            "pred_prob"           "prediction"         
## [52] "prediction_pca"      "recall"              "results"            
## [55] "ROCcurve"            "rules"               "S"                  
## [58] "simpler_model"       "simpler_pred"        "Sonar"              
## [61] "subset"              "testing"             "train_pred"         
## [64] "train_simpler_pred"  "train.pca"           "trainData"          
## [67] "trainDataX"          "trainDataY"          "training"           
## [70] "trans"               "tree"                "tree_with_params"   
## [73] "udv"                 "validation.pca"      "validationData"     
## [76] "validationDataX"     "validationDataY"     "x"                  
## [79] "x1"                  "x2"                  "xtab"               
## [82] "y"                   "y1"                  "y2"                 
## [85] "ytest"

# while to delete all the variables in the current session you can use the call: 
rm(list=ls())

Vectors

Create the vector a = (10, 5, 3, 100, -2, 5, -50)

a <- c(10, 5, 3, 100, -2, 5, -50)
a

## [1]  10   5   3 100  -2   5 -50

Select the elements of the vector with indices 1, 3, 4, and 5:

a[c(1,3:4)]

## [1]  10   3 100

The above expression uses the c() function for combining values and the : operator that generates sequences from:to with step 1. Another easy way of specifying sequences is to use the seq function.

c(1, 2, 7, 10)

## [1]  1  2  7 10

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 6, by=1)

## [1] 1 2 3 4 5 6

seq(1,6, by=2)

## [1] 1 3 5

seq(1,by=2, length=6)

## [1]  1  3  5  7  9 11

Type ?seq to get to know the function.

To check the type of a variable there is the class function:

class(a)

## [1] "numeric"

To check which a elements have a value greater than 5:

a > 5

## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE

which(a>5)

## [1] 1 4

# returns the indices for which the values are TRUE

To get the positive elements of a:

b <- a > 0
positives <- a[b]
positives

## [1]  10   5   3 100   5

# or more succintly
positives <- a[a>0]
positives

## [1]  10   5   3 100   5

To check the length of a vector:

length(a)

## [1] 7

One can also bind vectors by column (cbind()) or by row (rbind())

c <- 1:7
rbind(a,c)

##   [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## a   10    5    3  100   -2    5  -50
## c    1    2    3    4    5    6    7

cbind(a,c)

##        a c
## [1,]  10 1
## [2,]   5 2
## [3,]   3 3
## [4,] 100 4
## [5,]  -2 5
## [6,]   5 6
## [7,] -50 7

Matrices

To create matrics use the matrix() function

matrix(10,3, 2)

##      [,1] [,2]
## [1,]   10   10
## [2,]   10   10
## [3,]   10   10

# or
matrix(c(1,2,3,4,5,6), 3, 2)

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

# or 
matrix(c(1,2,3), 3, 2)

##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3

But let’s examine how are we calling the matrix function:

args(matrix)

## function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) 
## NULL

So the first argument are the data, then with nrow or ncol arguments we can declare the number of rows and columns and with the argument byrow we declare that we want to fill in the matrix column-by-column if byrow=FALSE and row-by-row if byrow=TRUE. In the above calls we didn’t use the byrow argument because the function matrix has a default value byrow=FALSE as we can also check from the documentation, ?matrix.

m = matrix(1:9, byrow = TRUE, nrow=3)
m

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Here we have filled in a matrix with values 1 to 9, by row, with the number of rows equal to 3. This gives us a square 3x3 matrix. R is pretty smart in knowing that the number of columns should be 3 as well!

We can also call cbind and rbind and other functions like rowSums, colSums, mean, t for transpose etc.

m2 <- rbind(m, m)
m2

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]    1    2    3
## [5,]    4    5    6
## [6,]    7    8    9

rowSums(m2)

## [1]  6 15 24  6 15 24

colSums(m2)

## [1] 24 30 36

mean(m2)

## [1] 5

For element wise multiplication on can use the * operator while for matrix multiplication you can use the %*% operator.

am <- matrix(10:18, byrow = TRUE, nrow = 3)
am

##      [,1] [,2] [,3]
## [1,]   10   11   12
## [2,]   13   14   15
## [3,]   16   17   18

bm <- matrix(c(3,6,7,10,8,1,2,3,2), byrow = TRUE, nrow = 3)
bm

##      [,1] [,2] [,3]
## [1,]    3    6    7
## [2,]   10    8    1
## [3,]    2    3    2

am * bm

##      [,1] [,2] [,3]
## [1,]   30   66   84
## [2,]  130  112   15
## [3,]   32   51   36

am %*% bm

##      [,1] [,2] [,3]
## [1,]  164  184  105
## [2,]  209  235  135
## [3,]  254  286  165

t(am)

##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18

Data frames

Unlike matrices, data frames can store values of different types in their columns. They are used extensively in R for data analysis. As rows usually we have the observations (or samples) and as columns we have the characteristics (or attributes or features). When we read from a file, the result is read as a data frame. Download the zip file r-novice-inflammation.zip and unzip it in the Desktop. Examine the file inflammation-01.csv with a text editor to see what we are going to be loading. The read the file:

data <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
# Notice the use of the path including data/ since we previously set the working directory as the Desktop
getwd()

## [1] "/Users/kyrcha/Workspaces/github/ml-tutorials"

dir()

##  [1] "_site.yml"                    "ar.html"                     
##  [3] "ar.pdf"                       "ar.Rmd"                      
##  [5] "breast-cancer-wisconsin.data" "cfs.html"                    
##  [7] "cfs.Rmd"                      "classification.html"         
##  [9] "classification.Rmd"           "clustering.html"             
## [11] "clustering.Rmd"               "CNAME"                       
## [13] "data"                         "data-eng.html"               
## [15] "data-eng.Rmd"                 "dbscan_files"                
## [17] "dbscan.html"                  "dbscan.pdf"                  
## [19] "dbscan.Rmd"                   "docs"                        
## [21] "dt_files"                     "dt.html"                     
## [23] "dt.pdf"                       "dt.Rmd"                      
## [25] "finder.html"                  "finder.Rmd"                  
## [27] "footer.html"                  "footer.tex"                  
## [29] "head.tex"                     "hierarchical_files"          
## [31] "hierarchical.html"            "hierarchical.pdf"            
## [33] "hierarchical.Rmd"             "index.html"                  
## [35] "index.Rmd"                    "intro.Rmd"                   
## [37] "kmeans.pdf"                   "kmeans.Rmd"                  
## [39] "knn.Rmd"                      "ml-tutorials.Rproj"          
## [41] "nb.pdf"                       "nb.Rmd"                      
## [43] "nn.Rmd"                       "other.Rmd"                   
## [45] "outlier.pdf"                  "outlier.Rmd"                 
## [47] "pca.Rmd"                      "r.Rmd"                       
## [49] "README.md"                    "render.sh"                   
## [51] "retail.data.gz"               "rfe.Rmd"                     
## [53] "site_libs"                    "style.css"                   
## [55] "uc.Rmd"                       "wrapper.Rmd"

The dir function return the files and directories of the file system. The argument header=FALSE lets the read.csv function know that there is no header row to give the columns names.

With head(data) I can check if the data are loaded correctly. It return the first few rows:

head(data)

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  0  0  1  3  1  2  4  7  8   3   3   3  10   5   7   4   7   7  12  18
## 2  0  1  2  1  2  1  3  2  2   6  10  11   5   9   4   4   7  16   8   6
## 3  0  1  1  3  3  2  6  2  5   9   5   7   4   5   4  15   5  11   9  10
## 4  0  0  2  0  4  2  2  1  6   7  10   7   9  13   8   8  15  10  10   7
## 5  0  1  1  3  3  1  3  5  2   4   4   7   6   5   3  10   8  10   6  17
## 6  0  0  1  2  2  4  2  1  6   4   7   6   6   9   9  15   4  16  18  12
##   V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
## 1   6  13  11  11   7   7   4   6   8   8   4   4   5   7   3   4   2   3
## 2  18   4  12   5  12   7  11   5  11   3   3   5   4   4   5   5   1   1
## 3  19  14  12  17   7  12  11   7   4   2  10   5   4   2   2   3   2   2
## 4  17   4   4   7   6  15   6   4   9  11   3   5   6   3   3   4   2   3
## 5   9  14   9   7  13   9  12   6   7   7   9   6   3   2   2   4   2   0
## 6  12   5  18   9   5   3  10   3  12   7   8   4   7   3   5   4   4   3
##   V39 V40
## 1   0   0
## 2   0   1
## 3   1   1
## 4   2   1
## 5   1   1
## 6   2   1

Other function I can use are:

# type of the variable
class(data)

## [1] "data.frame"

# dimensions
dim(data)

## [1] 60 40

# structure
str(data)

## 'data.frame':    60 obs. of  40 variables:
##  $ V1 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V2 : int  0 1 1 0 1 0 0 0 0 1 ...
##  $ V3 : int  1 2 1 2 1 1 2 1 0 1 ...
##  $ V4 : int  3 1 3 0 3 2 2 2 3 2 ...
##  $ V5 : int  1 2 3 4 3 2 4 3 1 1 ...
##  $ V6 : int  2 1 2 2 1 4 2 1 5 3 ...
##  $ V7 : int  4 3 6 2 3 2 2 2 6 5 ...
##  $ V8 : int  7 2 2 1 5 1 5 3 5 3 ...
##  $ V9 : int  8 2 5 6 2 6 5 5 5 5 ...
##  $ V10: int  3 6 9 7 4 4 8 3 8 8 ...
##  $ V11: int  3 10 5 10 4 7 6 7 2 6 ...
##  $ V12: int  3 11 7 7 7 6 5 8 4 8 ...
##  $ V13: int  10 5 4 9 6 6 11 8 11 12 ...
##  $ V14: int  5 9 5 13 5 9 9 5 12 5 ...
##  $ V15: int  7 4 4 8 3 9 4 10 10 13 ...
##  $ V16: int  4 4 15 8 10 15 13 9 11 6 ...
##  $ V17: int  7 7 5 15 8 4 5 15 9 13 ...
##  $ V18: int  7 16 11 10 10 16 12 11 10 8 ...
##  $ V19: int  12 8 9 10 6 18 10 18 17 16 ...
##  $ V20: int  18 6 10 7 17 12 6 19 11 8 ...
##  $ V21: int  6 18 19 17 9 12 9 20 6 18 ...
##  $ V22: int  13 4 14 4 14 5 17 8 16 15 ...
##  $ V23: int  11 12 12 4 9 18 15 5 12 16 ...
##  $ V24: int  11 5 17 7 7 9 8 13 6 14 ...
##  $ V25: int  7 12 7 6 13 5 9 15 8 12 ...
##  $ V26: int  7 7 12 15 9 3 3 10 14 7 ...
##  $ V27: int  4 11 11 6 12 10 13 6 6 3 ...
##  $ V28: int  6 5 7 4 6 3 7 10 13 8 ...
##  $ V29: int  8 11 4 9 7 12 8 6 10 9 ...
##  $ V30: int  8 3 2 11 7 7 2 7 11 11 ...
##  $ V31: int  4 3 10 3 9 8 8 4 4 2 ...
##  $ V32: int  4 5 5 5 6 4 8 9 6 5 ...
##  $ V33: int  5 4 4 6 3 7 4 3 4 4 ...
##  $ V34: int  7 4 2 3 2 3 2 5 7 5 ...
##  $ V35: int  3 5 2 3 2 5 3 2 6 1 ...
##  $ V36: int  4 5 3 4 4 4 5 5 3 4 ...
##  $ V37: int  2 1 2 2 2 4 4 3 2 1 ...
##  $ V38: int  3 1 2 3 0 3 1 2 1 2 ...
##  $ V39: int  0 0 1 2 1 2 1 2 0 0 ...
##  $ V40: int  0 1 1 1 1 1 1 1 0 0 ...

# statistical summarization of the data frame
summary(data)

##        V1          V2             V3              V4             V5       
##  Min.   :0   Min.   :0.00   Min.   :0.000   Min.   :0.00   Min.   :1.000  
##  1st Qu.:0   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :0   Median :0.00   Median :1.000   Median :2.00   Median :2.000  
##  Mean   :0   Mean   :0.45   Mean   :1.117   Mean   :1.75   Mean   :2.433  
##  3rd Qu.:0   3rd Qu.:1.00   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :0   Max.   :1.00   Max.   :2.000   Max.   :3.00   Max.   :4.000  
##        V6             V7            V8              V9       
##  Min.   :1.00   Min.   :1.0   Min.   :1.000   Min.   :2.000  
##  1st Qu.:2.00   1st Qu.:2.0   1st Qu.:2.000   1st Qu.:4.000  
##  Median :3.00   Median :4.0   Median :4.000   Median :5.000  
##  Mean   :3.15   Mean   :3.8   Mean   :3.883   Mean   :5.233  
##  3rd Qu.:4.00   3rd Qu.:5.0   3rd Qu.:5.250   3rd Qu.:7.000  
##  Max.   :5.00   Max.   :6.0   Max.   :7.000   Max.   :8.000  
##       V10             V11             V12             V13       
##  Min.   :2.000   Min.   : 2.00   Min.   : 2.00   Min.   : 3.00  
##  1st Qu.:3.750   1st Qu.: 4.00   1st Qu.: 3.75   1st Qu.: 5.00  
##  Median :6.000   Median : 6.00   Median : 5.50   Median : 9.50  
##  Mean   :5.517   Mean   : 5.95   Mean   : 5.90   Mean   : 8.35  
##  3rd Qu.:7.000   3rd Qu.: 9.00   3rd Qu.: 8.00   3rd Qu.:11.00  
##  Max.   :9.000   Max.   :10.00   Max.   :11.00   Max.   :12.00  
##       V14              V15              V16            V17        
##  Min.   : 3.000   Min.   : 3.000   Min.   : 3.0   Min.   : 4.000  
##  1st Qu.: 5.000   1st Qu.: 5.000   1st Qu.: 6.0   1st Qu.: 6.750  
##  Median : 8.000   Median : 8.000   Median :10.0   Median : 8.500  
##  Mean   : 7.733   Mean   : 8.367   Mean   : 9.5   Mean   : 9.583  
##  3rd Qu.:10.000   3rd Qu.:12.000   3rd Qu.:13.0   3rd Qu.:13.000  
##  Max.   :13.000   Max.   :14.000   Max.   :15.0   Max.   :16.000  
##       V18             V19             V20             V21       
##  Min.   : 5.00   Min.   : 5.00   Min.   : 5.00   Min.   : 5.00  
##  1st Qu.: 7.75   1st Qu.: 8.00   1st Qu.: 8.75   1st Qu.: 9.00  
##  Median :11.00   Median :11.50   Median :13.00   Median :14.00  
##  Mean   :10.63   Mean   :11.57   Mean   :12.35   Mean   :13.25  
##  3rd Qu.:13.00   3rd Qu.:15.00   3rd Qu.:16.00   3rd Qu.:16.25  
##  Max.   :17.00   Max.   :18.00   Max.   :19.00   Max.   :20.00  
##       V22             V23             V24             V25      
##  Min.   : 4.00   Min.   : 4.00   Min.   : 4.00   Min.   : 4.0  
##  1st Qu.: 8.00   1st Qu.: 7.75   1st Qu.: 6.75   1st Qu.: 7.0  
##  Median :13.00   Median :11.00   Median :10.00   Median :10.5  
##  Mean   :11.97   Mean   :11.03   Mean   :10.17   Mean   :10.0  
##  3rd Qu.:15.25   3rd Qu.:15.00   3rd Qu.:13.25   3rd Qu.:13.0  
##  Max.   :19.00   Max.   :18.00   Max.   :17.00   Max.   :16.0  
##       V26              V27             V28             V29        
##  Min.   : 3.000   Min.   : 3.00   Min.   : 3.00   Min.   : 3.000  
##  1st Qu.: 5.750   1st Qu.: 7.00   1st Qu.: 5.00   1st Qu.: 5.000  
##  Median : 9.000   Median :10.00   Median : 7.00   Median : 7.000  
##  Mean   : 8.667   Mean   : 9.15   Mean   : 7.25   Mean   : 7.333  
##  3rd Qu.:12.000   3rd Qu.:12.00   3rd Qu.: 9.00   3rd Qu.:10.000  
##  Max.   :15.000   Max.   :14.00   Max.   :13.00   Max.   :12.000  
##       V30              V31              V32            V33       
##  Min.   : 2.000   Min.   : 2.000   Min.   :2.00   Min.   :2.000  
##  1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.:4.00   1st Qu.:4.000  
##  Median : 7.000   Median : 6.000   Median :6.00   Median :5.000  
##  Mean   : 6.583   Mean   : 6.067   Mean   :5.95   Mean   :5.117  
##  3rd Qu.:10.000   3rd Qu.: 8.000   3rd Qu.:8.00   3rd Qu.:6.000  
##  Max.   :11.000   Max.   :10.000   Max.   :9.00   Max.   :8.000  
##       V34           V35           V36             V37             V38     
##  Min.   :1.0   Min.   :1.0   Min.   :1.000   Min.   :1.000   Min.   :0.0  
##  1st Qu.:2.0   1st Qu.:2.0   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.0  
##  Median :4.0   Median :3.0   Median :4.000   Median :2.000   Median :1.0  
##  Mean   :3.6   Mean   :3.3   Mean   :3.567   Mean   :2.483   Mean   :1.5  
##  3rd Qu.:5.0   3rd Qu.:5.0   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:3.0  
##  Max.   :7.0   Max.   :6.0   Max.   :5.000   Max.   :4.000   Max.   :3.0  
##       V39             V40        
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000  
##  Median :1.000   Median :1.0000  
##  Mean   :1.133   Mean   :0.5667  
##  3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :2.000   Max.   :1.0000

There are also different way I can select slices of data from the data frame

# first value in data
data[1, 1]

## [1] 0

# middle value in data
data[30, 20]

## [1] 16

# first four rows and first ten columns
data[1:4, 1:10]

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1  0  0  1  3  1  2  4  7  8   3
## 2  0  1  2  1  2  1  3  2  2   6
## 3  0  1  1  3  3  2  6  2  5   9
## 4  0  0  2  0  4  2  2  1  6   7

# doesn't have to start from the beginning
data[5:10, 1:10]

##    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 5   0  1  1  3  3  1  3  5  2   4
## 6   0  0  1  2  2  4  2  1  6   4
## 7   0  0  2  2  4  2  2  5  5   8
## 8   0  0  1  2  3  1  2  3  5   3
## 9   0  0  0  3  1  5  6  5  5   8
## 10  0  1  1  2  1  3  5  3  5   8

# specific rows and columns
data[c(3, 8, 37, 56), c(10, 14, 29)]

##    V10 V14 V29
## 3    9   5   4
## 8    3   5   6
## 37   6   9  10
## 56   7  11   9

# All columns from row 5
data[5, ]

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 5  0  1  1  3  3  1  3  5  2   4   4   7   6   5   3  10   8  10   6  17
##   V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38
## 5   9  14   9   7  13   9  12   6   7   7   9   6   3   2   2   4   2   0
##   V39 V40
## 5   1   1

# All rows from column 16
data[, 16]

##  [1]  4  4 15  8 10 15 13  9 11  6  3  8 12  3  5 10 11  4 11 13 15  5 14
## [24] 13  4  9 13  6  7  6 14  3 15  4 15 11  7 10 15  6  5  6 15 11 15  6
## [47] 11 15 14  4 10 15 11  6 13  8  4 13 12  9

A subtle point is that the last selection returned a vector instead of a data frame. This is because we selected only a single column. If you don’t want this behavior do:

# All columns from row 5
d1 <- data[5, ]
class(d1)

## [1] "data.frame"

# All rows from column 16
d2 <- data[, 16]
class(d2)

## [1] "integer"

d3 <- data[, 16, drop=FALSE]
class(d3)

## [1] "data.frame"

Other functions you can call are min, max, mean, sd and median to get statistical values of interest:

# first row, all of the columns
patient_1 <- data[1, ]

# max inflammation for patient 1
max(patient_1)

## [1] 18

# max inflammation for patient 2
max(data[2, ])

## [1] 18

# minimum inflammation on day 7
min(data[, 7])

## [1] 1

# mean inflammation on day 7
mean(data[, 7])

## [1] 3.8

# median inflammation on day 7
median(data[, 7])

## [1] 4

# standard deviation of inflammation on day 7
sd(data[, 7])

## [1] 1.725187

To do more complex calculations like the maximum inflammation for all patients, or the average for each day? we need to apply the function max or mean per row or column respectivelly. Luckily there is the function apply that applies a function for each one of the “margins”, 1 for rows and 2 for columns:

args(apply) # args return NULL because it prints the information, but every function must return something!

## function (X, MARGIN, FUN, ...) 
## NULL

max_patient_inflammation <- apply(data, 1, max)
max_patient_inflammation

##  [1] 18 18 19 17 17 18 17 20 17 18 18 18 17 16 17 18 19 19 17 19 19 16 17
## [24] 15 17 17 18 17 20 17 16 19 15 15 19 17 16 17 19 16 18 19 16 19 18 16
## [47] 19 15 16 18 14 20 17 15 17 16 17 19 18 18

avg_day_inflammation <- apply(data, 2, mean)
avg_day_inflammation

##         V1         V2         V3         V4         V5         V6 
##  0.0000000  0.4500000  1.1166667  1.7500000  2.4333333  3.1500000 
##         V7         V8         V9        V10        V11        V12 
##  3.8000000  3.8833333  5.2333333  5.5166667  5.9500000  5.9000000 
##        V13        V14        V15        V16        V17        V18 
##  8.3500000  7.7333333  8.3666667  9.5000000  9.5833333 10.6333333 
##        V19        V20        V21        V22        V23        V24 
## 11.5666667 12.3500000 13.2500000 11.9666667 11.0333333 10.1666667 
##        V25        V26        V27        V28        V29        V30 
## 10.0000000  8.6666667  9.1500000  7.2500000  7.3333333  6.5833333 
##        V31        V32        V33        V34        V35        V36 
##  6.0666667  5.9500000  5.1166667  3.6000000  3.3000000  3.5666667 
##        V37        V38        V39        V40 
##  2.4833333  1.5000000  1.1333333  0.5666667

Now let’s do some plotting

plot(avg_day_inflammation)

max_day_inflammation <- apply(data, 2, max)
plot(max_day_inflammation)

min_day_inflammation <- apply(data, 2, min)
plot(min_day_inflammation)

plot is a function with many arguments so you will probably need to study a lot of examples to do what you want (change an axis, name an axis, change the plot points and/or lines, add title, add grids, add legend, color the graph, add arrows and text etc.)

R Scripts

So far we have been typing directly into the R command line. What we could also do is save a sequence of commands in an R source file to run it at will. The way to do this is to have such a file with an .R extension and use the function source to run it.

If the source file contains an analysis from the beginning to the end it is a good practice to always clear your session of variables using rm(list=ls()). On the other hand if it is used as a library, for example to load some functions you have created, then you probably should not do it. You can also include other source files inside your current source file using the function (you guessed it): source

Because it is a good practice to have coding guidelines/conventions for standardizing the way you write your script to make it more readable, Google has some: https://google.github.io/styleguide/Rguide.xml

Functions

Let’s learn how to create functions by creating a fuction fahr_to_kelvin that converts temperatures from Fahrenheit to Kelvin:

fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

To run a function:

# freezing point of water
fahr_to_kelvin(32)

## [1] 273.15

# boiling point of water
fahr_to_kelvin(212)

## [1] 373.15

Let’s also create a function that converts Kelvin to Celcius:

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

#absolute zero in Celsius
kelvin_to_celsius(0)

## [1] -273.15

We can also use functions inside functions

fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

# freezing point of water in Celsius
fahr_to_celsius(32.0)

## [1] 0

or we can obtain this result by function chaining:

# freezing point of water in Celsius
kelvin_to_celsius(fahr_to_kelvin(32.0))

## [1] 0

For loops

Like all the programming language R also has for loops to do recurring tasks. In general the syntax is:

for (variable in collection) {
  do things with variable
}

Let’s do an example:

best_practice <- c("Let", "the", "computer", "do", "the", "work")

print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}

print_words(best_practice)

## [1] "Let"
## [1] "the"
## [1] "computer"
## [1] "do"
## [1] "the"
## [1] "work"

or another example:

len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
}
# Number of vowels
len

## [1] 5

Making decisions (if & else)

To make decisions in your R scripts, R provides you with the standard if-else conditional statements

num <- 37
if (num > 100) {
  print("greater")
} else {
  print("not greater")
}

## [1] "not greater"

print("done")

## [1] "done"

# or let's create a function that uses conditionals

sign <- function(num) {
  if (num > 0) {
    return(1)
  } else if (num == 0) {
    return(0)
  } else {
    return(-1)
  }
}

sign(-3)

## [1] -1

You can make decisions using the logical operators

equal (==)
greater than or equal to (>=),
less than or equal to (<=),
and not equal to (!=).

We can also combine tests. An ampersand, &, symbolizes ???and???. A vertical bar, |, symbolizes ???or???.

Datasets and statistics

By running the function data() we can see some datasets that are currently included in the R installation. To check for example the iris dataset, we can for example use the function:

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

To see the structure of it:

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

For a statistical summary

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

For the Species column, one can observe that the summary function did not do the standard statistical calculation like it did with the other variables. From the str function we can see that the Species column is a factor, which is what R uses to declare categorial or ordinal values. More on that in the next section.

We can also get the full attribute of Sepal.Length for example by name

iris$Sepal.Length

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

and run the standard statistics:

print("mean")

## [1] "mean"

mean(iris$Sepal.Length)

## [1] 5.843333

print("median")

## [1] "median"

median(iris$Sepal.Length)

## [1] 5.8

print("min")

## [1] "min"

min(iris$Sepal.Length)

## [1] 4.3

print("max")

## [1] "max"

max(iris$Sepal.Length)

## [1] 7.9

print("sd")

## [1] "sd"

sd(iris$Sepal.Length)

## [1] 0.8280661

print("var")

## [1] "var"

var(iris$Sepal.Length)

## [1] 0.6856935

print("range")

## [1] "range"

range(iris$Sepal.Length)

## [1] 4.3 7.9

# or other functions like:
print("sort")

## [1] "sort"

sort(iris$Sepal.Length)

##   [1] 4.3 4.4 4.4 4.4 4.5 4.6 4.6 4.6 4.6 4.7 4.7 4.8 4.8 4.8 4.8 4.8 4.9
##  [18] 4.9 4.9 4.9 4.9 4.9 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.1 5.1
##  [35] 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.2 5.2 5.2 5.2 5.3 5.4 5.4 5.4 5.4 5.4
##  [52] 5.4 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.6 5.6 5.6 5.6 5.6 5.6 5.7 5.7 5.7
##  [69] 5.7 5.7 5.7 5.7 5.7 5.8 5.8 5.8 5.8 5.8 5.8 5.8 5.9 5.9 5.9 6.0 6.0
##  [86] 6.0 6.0 6.0 6.0 6.1 6.1 6.1 6.1 6.1 6.1 6.2 6.2 6.2 6.2 6.3 6.3 6.3
## [103] 6.3 6.3 6.3 6.3 6.3 6.3 6.4 6.4 6.4 6.4 6.4 6.4 6.4 6.5 6.5 6.5 6.5
## [120] 6.5 6.6 6.6 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.8 6.8 6.8 6.9 6.9 6.9
## [137] 6.9 7.0 7.1 7.2 7.2 7.2 7.3 7.4 7.6 7.7 7.7 7.7 7.7 7.9

print("length")

## [1] "length"

length(iris$Sepal.Length)

## [1] 150

Factors

The factor() command is used to create and modify factors in R:

sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels():

levels(sex)

## [1] "female" "male"

nlevels(sex)

## [1] 2

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., ???low???, ???medium???, ???high???) or it is required by particular type of analysis. Additionally, specifying the order of the levels allows us to compare levels:

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)

## [1] "high"   "low"    "medium"

food <- factor(food, levels = c("low", "medium", "high"))
levels(food)

## [1] "low"    "medium" "high"

min(food) ## doesn't work

## Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low", : 'min' not meaningful for factors

food <- factor(food, levels = c("low", "medium", "high"), ordered=TRUE)
levels(food)

## [1] "low"    "medium" "high"

min(food) ## works!

## [1] low
## Levels: low < medium < high

Challenge

Use the dataset AirPassengers that comes with R and refers to number of passengers traveled every month from 1949 to 1960 in thousands. Because the dataset is of type Time-Series or ts, you can make it a data.frame through the following commands:

dn = list(paste("Y", as.character(1949:1960), sep = ""), month.abb)
airmat = matrix(AirPassengers, 12, byrow = TRUE, dimnames = dn)
air = as.data.frame(t(airmat))

Then try to answer the next questions/problems:

Use the help functionality to try and learn about the functions used above.
How many passengers traveled in average for the year 1951?
Which is the maximum number of passengers for the months January and February?
Calculate the summation per year and assign the result to a vector.
Plot the vector nicely (names in axes, point and lines for the graph, title the graph, add grid lines)
Repeat the last two bullets for every month for all the years.

Tip: to transform a row of the data frame to a vector you can use unlist (e.g. unlist(air["Jan",])