##Data
<-2
xprint(x)
[1] 2
##function
log(2)
[1] 0.6931472
R is a Statistical Programming language, it consists of 2 types of objects: data and functions.
##Data
<-2
xprint(x)
[1] 2
##function
log(2)
[1] 0.6931472
Data is stored in variables and can take many forms. To store a value in a variable use “<-”, above we set the variable x equal to 2. There are many data types in R, we will go through some of them.
#real numbers
=29.333
num num
[1] 29.333
#Some math
#adding and subtraction
2+3-2
[1] 3
#multiplying and dividing
<-5*(10/25)
num num
[1] 2
#Strings
<-"hello"
word word
[1] "hello"
='hello' word
Booleans take on either TRUE or FALSE values, and can be very useful in R. You can set booleans to the result of a comparison of two data types, some of the syntax is below:
#booleans can be initialize in a variety of ways, for example
#must capitalize the true or false
FALSE
[1] FALSE
F
[1] FALSE
T
[1] TRUE
<-TRUE
myBoolean myBoolean
[1] TRUE
<- 3<4
myBoolean2 myBoolean2
[1] TRUE
<-"this"=="that"
myBoolean3 myBoolean3
[1] FALSE
## && (and) is TRUE if BOTH input booleans are true
## || (or) is TRUE if AT LEAST one input boolean is true
<-myBoolean2&&myBoolean
myBoolean4 myBoolean4
[1] TRUE
Vectors in R are used frequently, they are “lists” or “arrays” of all the same data type.
##vectors are created with c(data,data,data)
<-c(2,3,4,5,6,7,8,9,10)
myVector myVector
[1] 2 3 4 5 6 7 8 9 10
#a:b is a shortcut for a sequence from a to b adding 1
#you can create vectors of sequences using seq(), for more type ?seq in the console
<-2:10
myVector2 myVector2
[1] 2 3 4 5 6 7 8 9 10
as.numeric(2:10)
[1] 2 3 4 5 6 7 8 9 10
as.double(2:10)
[1] 2 3 4 5 6 7 8 9 10
<-rep(NA,l=20)
myVector2
#These do not have to be numbers, they can be vectors, Strings, booleans...
<-c(myVector,myVector)
myVector myVector
[1] 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10
<-c("this","is","a","vector","of","strings")
myVector3 myVector3
[1] "this" "is" "a" "vector" "of" "strings"
#access elements with square brackets []
1] myVector[
[1] 2
#more advanced accesssing
#access elements 1 to 5
1:5] myVector[
[1] 2 3 4 5 6
#access elements 1, 4 and 6
c(1,4,6)] myVector[
[1] 2 5 7
#access elements that are greater than 2
>2] myVector[myVector
[1] 3 4 5 6 7 8 9 10 3 4 5 6 7 8 9 10
-c(1,4,6)] myVector[
[1] 3 4 6 8 9 10 2 3 4 5 6 7 8 9 10
We can perform mathematical operations and comparisons on vectors
<-1:10
x x
[1] 1 2 3 4 5 6 7 8 9 10
#adds 1 to every element
+1 x
[1] 2 3 4 5 6 7 8 9 10 11
#this works for comparisons
<4 x
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
<4] x[x
[1] 1 2 3
#multiplies element 1 to element 1 of second vectors
*-(1:10) x
[1] -1 -4 -9 -16 -25 -36 -49 -64 -81 -100
#beware repetition
-c(1,2) x
[1] 0 0 2 2 4 4 6 6 8 8
# mathematical operations on the vector apply to each element
#squares each element
^2 x
[1] 1 4 9 16 25 36 49 64 81 100
#log each element
log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
[8] 2.0794415 2.1972246 2.3025851
#Example: Dot Product
<-c(1,2,3)
x<-c(2,5,8)
y#sum adds the elements of the vector together
sum(x*y)
[1] 36
You can also use matrices in R.
#you can create a matrix with matrix(vector of data,nrow=number of rows,ncol=number of columns)
#You can see it will fill in the data down the columns first
<-matrix(1:9,nrow=3,ncol=3); myMatrix myMatrix
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
myMatrix
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
#rbind and cbind add a row or column repectively to the matrix
#you can create matrices with rbind(rowvector1,rowvector2,...), or with cbind(column vector 1, column vector 2,...)
<-rbind(c(2,3,4),c(3,4,5),c(1,2,3))
myMatrix myMatrix
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 1 2 3
<-cbind(c(1,2,3),c(4,5,6),c(7,8,9))
myMatrix2 myMatrix2
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
<-cbind(myMatrix2,c(10,11,12))
myMatrix3 myMatrix3
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
<-cbind(c(10,11,12),myMatrix2) myMatrix3
We can also do Matrix math:
#again math functions apply to every element
^2 myMatrix
[,1] [,2] [,3]
[1,] 4 9 16
[2,] 9 16 25
[3,] 1 4 9
#multiply with '%*%
%*%myMatrix myMatrix2
[,1] [,2] [,3]
[1,] 21 33 45
[2,] 27 42 57
[3,] 33 51 69
#we can find the inverse with 'solve()
<-matrix(c(1,0,1,-2,3,0,1,4,2),nrow=3)
X X
[,1] [,2] [,3]
[1,] 1 -2 1
[2,] 0 3 4
[3,] 1 0 2
solve(X)
[,1] [,2] [,3]
[1,] -1.2 -0.8 2.2
[2,] -0.8 -0.2 0.8
[3,] 0.6 0.4 -0.6
#check dimension
dim(X)
[1] 3 3
#We can also transpose with t()
t(X)
[,1] [,2] [,3]
[1,] 1 0 1
[2,] -2 3 0
[3,] 1 4 2
#Some times to multiply vectors we have to trun them into matrix types
<-c(1,2,3)
myVector<-matrix(myVector,ncol=1) newM
Functions are objects that take an input and transform it into some output, just like in mathematics. We have already seen some, such as log()
.
They are called with this format output<-functionName(input)
.
R has many, many functions, to learn more about a function type ?functionName
and the documentation will come up.
#A simple function
#here the function log is called, with the parameter 2, and the output is stored in the variable x
<-log(2)
x x
[1] 0.6931472
#A more complicated function
#What are the parameters?
#not rep(a,n) gives a vector of size n where all elements are a
<-sample(x=1:10,size=4,replace=TRUE,prob=rep(1/10,10))
s s
[1] 8 1 3 3
We have seen other people’s functions but we can also make our own! Let’s see an example first:
#recall the dot product example...
=function(a,b){
dotProd<-sum(a*b)
valuereturn(value)
}#calling our function
dotProd(x,y)
[1] 10.39721
What exactly does this code say?
dotProd
return()
ends the function, and sends back the variable in the bracketsBack to built in functions… R
is a statistical software, what does that mean? It already includes many common statistical functions! For most common distributions there are functions for the pdf, cdf, inverse cdf as well as one to get a sample from that distribution. The syntax is in the format: dDistName(x,parameters)
, pDistName(x,parameters)
, qDistName(x,parameters)
and rDistName(x,parameters)
respectively. This will make more sense in the example below…
#The normal distribution, sd is the standard deviation
#pdf
dnorm(c(2,3,5),mean=0,sd=1)
[1] 5.399097e-02 4.431848e-03 1.486720e-06
#cdf
pnorm(c(2,3,5),mean=0,sd=1)
[1] 0.9772499 0.9986501 0.9999997
#inverse cdf
qnorm(c(0.2,.5,.3),mean=0,sd=1)
[1] -0.8416212 0.0000000 -0.5244005
#random sample of size 10
rnorm(10,mean=0,sd=1)
[1] 0.7892969 -0.6448078 -0.5486580 -0.1903205 -1.4756191 0.1602373
[7] 1.6674049 1.9578599 -0.5441672 -0.1758582
R is very good for plotting! There are many types of plots in R, here are some useful plotting functions, this list is not exhaustive…
plot(x,y,...)
produces a scatter plot.abline(a=intercept,b=slope,...)
curve(expr,...)
evaluates an expression along a grid to create a curvehist(data)
creates a histogramPlot functions have many parameters, some include col
which changes the color and add which should be set to TRUE
if the plot should be added to the existing plot. The best way to learn plots is with examples, I have included a regression example below.
#simulate errors
<-rnorm(100)
epsilon<-rexp(100)
x<-9+2*x+epsilon
y
#scatter plot with true line
plot(x,y)
abline(a=9,b=2,col="blue")
#least squares line
<-lm(y~x)
lmmsummary(lmm)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.8147 -0.5909 0.1677 0.7571 2.1544
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.02940 0.15413 58.58 <2e-16 ***
x 1.95493 0.09951 19.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.047 on 98 degrees of freedom
Multiple R-squared: 0.7975, Adjusted R-squared: 0.7954
F-statistic: 385.9 on 1 and 98 DF, p-value: < 2.2e-16
abline(lmm$coefficients[1],lmm$coefficients[2],col="red",lty=2)
#histogram of residuals
hist(epsilon,freq = F)
#x is what you want to evaluate the grid along
curve(dnorm(x),add=T,col="blue")
If statements are essential in programming, and they are a form of ‘Control Structure’. They take the form if(boolean variable){some task}
.
When the computer runs through the code, it checks if the boolean value is TRUE
, and if it is, it executes the code in the curly brackets, code in curly brackets is called a block. A simple example…
<-"nice"
jim
if(jim=="nice"){
="nice"
alice }
Placing an else{some code}
after the if statement will execute the code in it’s block if the code in the above if statement was not executed. The if and else must be in the same block so I have surrounded them in curly brackets.
<-"nice"
jim##same block
{if(jim=="nice"){
="nice"
alice
}else{
="not nice"
alice
}
alice }
[1] "nice"
<-"mean"
jim##same block
{if(jim=="nice"){
="nice"
alice
}else{
="not nice"
alice
}
alice }
[1] "not nice"
You may also use else if(boolean){block}
, which executes it’s block if the above (else) if statement(s) did not execute. See below:
<-"okay"
jim##same block
{if(jim=="nice"){
="nice"
alice
}else if(jim=="okay"){
="okay"
alice
}#Here if jim is not okay or nice, then we check if he is neutral.
else if(jim=="neutral"){
="neutral"
alice
}else{
="not nice"
alice
}
alice }
[1] "okay"
Lastly you may put if statements inside of other if statements, called ‘nested ifs’.
<-"nice"
jim##same block
if(jim=="nice"){
=sample(c("nice","not nice"),1)
aliceif(alice=="nice"){
print(alice)
}else{
print(alice)
} }
[1] "not nice"
Loops execute operations within their blocks repeatedly. There are 2 types of loops you will generally use, for loops and while loops. For loops repeat the block a set number of times, while while loops repeat until a condition is satisfied. You can also nest loops, like if statements.
#calculate 2 to the power of ten
<-1
x#this reads for i in 1 to 10, this can be any vector that i loops through, not just a sequential one
for(i in 1:10){
<-x*2
x
}
x
[1] 1024
for(i in 1:10){
<-x+i
x
}
=2:5
vec
for(i in vec){
<-x+i
x
}
#calculate power of 2 less that 1000
<-1
xwhile(2*x<1000){
<-x*2
x
} x
[1] 512
#nested loop
for(i in c(10,9,8,7,6,5,4,3,2,1)){
<-NULL
vfor(j in 1:i){
<-c(v,"*")
v
}print(v)
}
[1] "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*" "*"
[1] "*" "*" "*" "*"
[1] "*" "*" "*"
[1] "*" "*"
[1] "*"
You can also use the replicate function, which replicates a line of code a specified number of times. This gives a 10 by 5 matrix.
replicate(5,rnorm(10))
[,1] [,2] [,3] [,4] [,5]
[1,] -0.558998847 0.20101252 0.99886213 0.01208339 0.002006059
[2,] -1.715691896 0.08149414 0.79363149 -1.81678678 1.099775725
[3,] 0.904014958 0.61217064 -0.39064218 -0.44711259 1.628679368
[4,] -1.089662971 1.24960221 -0.87667942 1.14733194 1.581504669
[5,] -0.638169040 1.32519145 -0.82286412 -0.20608013 0.682351123
[6,] -0.007373101 -0.56919494 1.88983534 0.56852285 -0.872566382
[7,] 1.280209408 0.36214545 0.30692552 -0.12931446 -1.248042412
[8,] 1.150381910 -1.03543118 0.94982366 1.54951408 -0.377573932
[9,] -0.635902346 0.20622436 1.08120440 0.85929714 0.180561794
[10,] -1.095068391 1.52204419 -0.03901713 1.27652443 1.422437595
Similar functions include sapply()
and apply()
. sapply(X,FUN,...)
applies the function that the parameter FUN
is set to to individual elements of a vector. apply(X,MARGIN,FUN,...)
applies FUN
to the rows or columns depending on what MARGIN
is set to, 1 for rows and 2 for columns.
Here we generate 10000 samples of size 100 from the exponential distribution, with
#10000 samples, each of size 100 from the exponential distribution
<-replicate(10000,rexp(100,rate=2))
x#x is 100 by 10000, each column is a sample
dim(x)
[1] 100 10000
#calculate sample variances
<-apply(x,2,sd);
S_Vector
# S_Vector
#get the t value
<-qt(1-0.005,99)
tval#calculate the means
<-apply(x,2,mean); length(means) means
[1] 10000
# lower and upper bounds
<-means-S_Vector*tval/10
lower<-means+S_Vector*tval/10
upper<-rbind(lower,upper)
intervals#example interval
1] intervals[,
lower upper
0.3295037 0.5658261
#we now check each interval to see if it contains the mean
<-0
successesfor(i in 1:ncol(intervals)){
#if 0.5 is in the interval, add 1
if((intervals[1,i]<0.5)&&(intervals[2,i]>0.5))
<-successes+1
successes
}#here is the coverage probability...
<-successes/ncol(intervals)
coverage.prob coverage.prob
[1] 0.9848
Something more advanced…
#Vectorzing the function changes the way the function calculates when it is a passed a vector as a parameter...
#it will run the function once per element if it is vectorized instead of passing the vector as a parameter and running once
<-Vectorize(function(alpha){
getCovProb#10000 samples, each of size 100 from the exponential distribution
<-replicate(10000,rexp(100,rate=2))
x#x is 100 by 10000, each column is a sample
dim(x)
#calculate sample variances
<-apply(x,2,sd)
S_Vector#get the t value
<-qt(1-alpha/2,99)
tval#calculate the means
<-apply(x,2,mean)
means# lower and upper bounds
<-means-S_Vector*tval/10
lower<-means+S_Vector*tval/10
upper<-rbind(lower,upper)
intervals#example interval
1]
intervals[,
#we now check each interval to see if it contains the mean
<-0
successesfor(i in 1:ncol(intervals)){
#if 0.5 is in the interval, add 1
if((intervals[1,i]<0.5)&(intervals[2,i]>0.5))
<-successes+1
successes
}#here is the coverage probability...
<-successes/ncol(intervals)
coverage.probreturn(coverage.prob)
})#here we find the coverage probability for many alphas
<-seq(from=0.001,to=0.1,by=0.005)
alphas<-getCovProb(alphas)
coverages#adds a scatter plot
plot(alphas,coverages)
#adds a line
abline(a=1,b=-1,col="blue")
For more information you can visit here . It is also very easy to find tutorials on the web (Youtube is good), you could also look at the book by Lafaye, Drouilhet and Liquet (2013).