This document is a practical overview of R. Later in the course we will revisit many of these concepts and examine them more deeply.
R is a programming language for statistical computing, data analysis, and graphics. It is a re-implementation of the S language, which was developed in the 1970’s at Bell Laboratories.
R is a high level and dynamic language. Memory management and variable typing, among other activities, are handled automatically in R.
R is one of the main computing tools used for data analysis in applied research, and for research in statistics itself.
Using R
There are two main ways to run R:
Interactive mode: You can use R interactively by typing statements or short programs directly into the terminal at the R prompt.
Sourcing scripts: Most of the time, you should write your programs using a text editor. Save your program as a text file with the extension .R, like myprog.R. Then you can run your program in the interpreter by typing source('myprog.R') at the R prompt.
You may need to set the directory path to point to the location where you saved your script. If you are working in the terminal you can usually avoid this step by launching R from within the directory that holds your scripts. If you are using a GUI of some sort there will usually be a menu that allows you to set the working directory.
Make sure you save your programs as text files.
Variables and types in R
A variable is a symbol or name, like x, that holds a value. The value can be any R object. Every R object has a type, which you can discover using the typeof function. There are 24 built-in types in R. One of these types is called S4, which is a generic type that you can use to extend the language.
In addition to having a type, values in R also have a mode and a class. We will talk more about these later, but note that you can obtain the mode and the class of a value using the built-in mode() and class() functions.
Also note that the built-in str function gives you information about the internal representation of a value.
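As a quick illustration of these inspection functions, the following sketch applies them to a small numeric array. The variable name v is only an example.

```r
# A small numeric array (the name v is illustrative)
v = c(3, 5, 9)

typeof(v)   # "double"
mode(v)     # "numeric"
class(v)    # "numeric"

# str prints a compact summary of the value's structure
str(v)
```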
Array types
Five of the built-in R data types are array types; these are perhaps the most important data types in R. They are called logical, integer, double, complex, and character, based on the type of value stored in the array. You will likely use double especially often.
An array is a collection of values all of the same type (so it is a homogeneous data structure). The values are stored as a sequence, but as we will see below they can be treated semantically as having multiple dimensions.
The following creates a literal one-dimensional array of length 3:
x = c(3, 5, 9)
The character c here is a built-in function called “combine” for creating arrays. You can use typeof(x) to confirm that this statement creates a double array.
Even though the values in the array x are all whole numbers, the type of x is still double. Numeric literals such as 3 are stored as doubles in R, so the combine function creates a double array when given such values as inputs. To create an array of integer type, you can use
x = as.integer(x)
In R there are no pure scalar types. A scalar value is simply an array of length 1.
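A brief sketch of integer values and length-one arrays. It uses the L suffix for integer literals, which is standard R syntax but is not introduced above; the variable names are illustrative.

```r
# as.integer converts an existing double array to integer type
d = c(3, 5, 9)
typeof(d)               # "double"
typeof(as.integer(d))   # "integer"

# The L suffix marks an integer literal
typeof(5L)              # "integer"

# A "scalar" is just an array of length 1
s = 7
length(s)               # 1
```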
Note that the combine function always flattens its arguments, that is,
c(1, 2, c(3, 4), c(5, 6))
creates the flat array [1, 2, 3, 4, 5, 6], not a nested array like [1, 2, [3, 4], [5, 6]].
Multidimensional arrays
The arrays created using the combine function c() are dimensionless, or flat, arrays. One way to create a multidimensional array is using the array function:
x = array(c(1, 2, 3, 4, 5, 6), c(3, 2))
The first argument to array is a flat array that contains the data, and the second argument gives the dimensions (rows, columns for a two-dimensional array). Note that the array is filled column-wise, so the state of x is
1 4
2 5
3 6
You can also create a multidimensional array by taking a flat array and adding a dim (for “dimensions”) attribute to it:
x = c(1, 2, 3, 4, 5, 6)
dim(x) = c(3, 2)
Arrays with more than two dimensions can be created, but are not used very often:
x = c(1, 2, 3, 4, 5, 6, 7, 8)
dim(x) = c(2, 2, 2)
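To see how the values are laid out in a three-dimensional array, the sketch below indexes a few elements of the 2 x 2 x 2 array just defined. The fill order generalizes the column-wise rule: the first index varies fastest.

```r
x = c(1, 2, 3, 4, 5, 6, 7, 8)
dim(x) = c(2, 2, 2)

x[1, 1, 1]   # 1
x[2, 1, 1]   # 2 (the first index varies fastest)
x[1, 2, 1]   # 3
x[1, 1, 2]   # 5
x[, , 2]     # the second 2 x 2 slice, holding 5, 6, 7, 8
```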
Function types
In R, functions are “first class objects”. This means that they can be treated like any other value: they can be assigned to variables, passed as arguments, and returned from other functions. A function is defined using the function keyword, as follows:
f = function(x) { x }
The code in braces is the function body. The final statement of the function body is the return value of the function. This function is the identity function – it simply returns its input. You can also make the return explicit:
f = function(x) { return(x) }
or even
f = function(x) return(x)
We will consider functions in much more detail later. Note that the formal type of a function is called closure. A closure is a standard term in computing language theory that refers to a function together with the environment it encloses, which defines variables (other than the function arguments) that are used in the function. We will discuss closures in more detail later.
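A minimal sketch of a closure, to make the idea concrete. The names make_adder and add3 are hypothetical and are used only for illustration.

```r
# make_adder is a hypothetical helper that returns a new function
make_adder = function(n) {
    function(x) { x + n }   # n is looked up in the enclosing environment
}

add3 = make_adder(3)
add3(10)       # 13
typeof(add3)   # "closure"
```

The returned function remembers the value of n that was in effect when make_adder was called, even though n is not one of its own arguments.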
Variable names
You can use simple variable names like x, y, A, and a (note that A and a are different variable names). You can also use longer names like counter, index1, or subject_id.
A variable name may contain digits, but it cannot begin with a digit or an underscore. It may contain underscores (_) and periods (.), but not operators (* - + < > = & |), most punctuation (such as parentheses and braces), or the comment character (#). Very few computing languages allow the dot (period) as a character in variable names, but it is allowed in R. Some people choose not to use periods in R variable names since it can be confusing to people who frequently work with other languages.
Be careful about clobbering built-in symbols with your own variable names. You could create a variable named pi, but then the name pi would refer to your value rather than the built-in constant, until you remove your variable with rm(pi).
Function signatures
A function signature specifies what arguments can be passed into a function. Functions in R have zero or more positional arguments, and zero or more keyword arguments.
For example, the array function we saw earlier has the following signature:
array(data = NA, dim = length(data), dimnames = NULL)
All three arguments of this function are keyword arguments. When calling a function, keyword arguments can be passed by name or by position. If passed by position, the arguments must appear in the order specified in the signature. If passed by name, the arguments can be in any order. For example, the following are equivalent
x = array(c(1, 2, 3, 4, 5, 6), c(3, 2))
y = array(data=c(1, 2, 3, 4, 5, 6), dim=c(3, 2))
z = array(dim=c(3, 2), data=c(1, 2, 3, 4, 5, 6))
Since keyword arguments have default values, they are optional. For example, we are not providing a value for the dimnames argument to array. Positional arguments must always be provided.
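The following sketch defines a small function with one required argument and one optional argument. The function scale_values and its argument names are hypothetical, chosen only to illustrate the calling conventions described above.

```r
# scale_values is a hypothetical function: x has no default (required),
# factor has a default value (optional)
scale_values = function(x, factor = 2) {
    x * factor
}

scale_values(c(1, 2, 3))                    # 2 4 6
scale_values(c(1, 2, 3), 10)                # 10 20 30
scale_values(factor = 10, x = c(1, 2, 3))   # 10 20 30
```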
Getting Information from R
You can get some documentation about almost any R command or function using the help command. For example, the following produces some documentation about the array function.
help(array)
You can get the current value of a variable by typing its name at the prompt. You can also use the print function to display its value. The following displays the value of the variable x.
print(x)
Since printing a function (closure) displays its source code, you can obtain the source code of most R functions by typing the name of the function (without parentheses or arguments). Try typing array at the prompt to see the source code of this built-in function.
Comments
A comment is anything you write in your program code that is ignored by the computer. Comments help other people understand your code. Any text following a # character is a comment.
x = c(3, 5, 2) # These are the doses of the new drug formulation.
Arithmetic
You can use R like a calculator. The familiar scalar arithmetic operations +, -, *, and / (addition, subtraction, multiplication, and division) are all built-in. After the following program is run, z will have the value 12, a will have the value 2, and w will have the value 24.
x = 5
y = 7
z = x + y
a = y - x
w = a*z
Other binary arithmetic operators are exponentiation and modular division (remainders).
x = 5^2
y = 23 %% 5
R automatically promotes integers to doubles when doing division. For example, consider the following
> typeof(as.integer(1) + as.integer(2))
[1] "integer"
> typeof(as.integer(1) * as.integer(2))
[1] "integer"
> typeof(as.integer(1) / as.integer(2))
[1] "double"
Arithmetic expressions
You can evaluate more complicated expressions by following standard mathematical conventions. If in doubt about precedence, use parentheses (but don’t over-use them).
x = 5
y = (x+1) / (x-1)^2 + 1/x
It is possible (and useful) to modify the value of a variable in-place, using an expression that involves its current value. After the following program, the value of x will be 6.
x = 5
x = x + 1
Rounding
R provides several rounding functions: floor rounds toward negative infinity, ceiling rounds toward positive infinity, round rounds to the nearest integer, and trunc rounds toward zero.
v = ceiling(3.8)
w = floor(3.8)
x = ceiling(-3.8)
y = trunc(-3.8)
z = round(-3.8)
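For reference, the values produced by these calls can be checked as follows. The last two lines illustrate a detail not covered above: for values exactly halfway between two integers, R's round follows the IEC 60559 convention of rounding to the even neighbor.

```r
ceiling(3.8)    # 4
floor(3.8)      # 3
ceiling(-3.8)   # -3
trunc(-3.8)     # -3
round(-3.8)     # -4

# Halfway cases go to the even neighbor
round(0.5)      # 0
round(2.5)      # 2
```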
Higher mathematical functions and rounding error
The square root function is sqrt, and fractional powers are also allowed, so sqrt(x) is the same as x^0.5. The natural log function is denoted log, and the exponential function e^x is denoted by exp(x). The trigonometric functions are denoted in the usual way. The mathematical constant pi is also provided.
w = sqrt(2)
x = exp(3)
y = log(x)
z = tan(pi/3)
Numbers can be represented in exponential notation in R, for example, using 1.5e7 gives the value of 1.5*10^7. Double precision numbers have limited range and precision, so results can be inaccurate for very large and very small numbers. For example, exp(-800) is exactly 0 in R. As another example, the value of x = tan(pi/2) should be infinity, but since the value of pi used by the computer is approximate, you will see that x is actually a very large finite number.
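A short sketch of how rounding error shows up in everyday comparisons. The functions all.equal and is.finite, and the constant .Machine$double.eps, are part of base R but are not otherwise discussed here.

```r
# Decimal fractions are usually not exactly representable
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE (comparison with a tolerance)

# Spacing between 1 and the next larger double
.Machine$double.eps

# tan(pi/2) is large but finite, and exp(-800) underflows to 0
is.finite(tan(pi/2))                # TRUE
exp(-800) == 0                      # TRUE
```

When comparing computed doubles, testing equality up to a tolerance is usually safer than using ==.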
Infinity, undefined values, and missing data
Undefined values are represented by NaN (not a number):
x = sqrt(-1)
Overflow to infinity, and underflow to zero can arise from some operations due to the limited precision of floating point arithmetic:
exp(800) # yields positive infinity (overflow)
exp(-800) # yields 0 (underflow)
The special value called NA stands for not available. This value indicates that the corresponding data point is missing or not available for some reason.
An arithmetic expression involving NaN will generally have a value of NaN (and an expression involving NA will generally have the value NA):
1 + 2 + 3 + NA
1 + 2 + 3 + NaN
A mathematical expression involving Inf will generally evaluate as it should mathematically:
3 + Inf # Yields Inf
3 + 1/Inf # Yields 3
Inf + -Inf # Yields NaN
Functions are often configurable by the user to modify how they handle NA arguments, for example:
> mean(c(1, 2, 3, NA))
[1] NA
> mean(c(1, 2, 3, NA), na.rm=TRUE)
[1] 2
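To locate missing or undefined values before deciding how to handle them, the base functions is.na and is.nan can be used. A small sketch, with an illustrative vector v:

```r
v = c(1, NA, 3, NaN)

is.na(v)               # FALSE TRUE FALSE TRUE (NaN also counts as missing)
is.nan(v)              # FALSE FALSE FALSE TRUE
sum(is.na(v))          # 2
sum(v, na.rm = TRUE)   # 4
```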
Assignment by value
Variable values are assigned “by value” in R. Therefore after running the following lines of code, the value of x is (5, 4) and the value of y is (99, 4).
x = c(5, 4)
y = x
y[1] = 99
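The same behavior applies when a value is passed to a function: changes made inside the function do not affect the caller's variable. The function set_first below is hypothetical and serves only to illustrate this.

```r
# set_first is a hypothetical function that modifies its argument
set_first = function(v) {
    v[1] = 99
    v
}

x = c(5, 4)
y = set_first(x)
x   # 5 4   (unchanged)
y   # 99 4
```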
Boolean expressions
Boolean expressions evaluate to either TRUE or FALSE. For example, 3 + 2 < 5 is FALSE, 10 - 4 > 5 is TRUE, and 10 + 4 == 7 + 7 is TRUE (note that you must use two equals signs for testing equality to avoid confusion with assignment statements).
The & (and) operator is TRUE only if the expressions on both sides of the operator are TRUE. For example, (3 < 5) & (2 > 0) is TRUE, and (2 < 3) & (5 > 5) is FALSE.
The | (or) operator is TRUE if at least one of the expressions surrounding it is TRUE. For example, (3 < 5) | (2 > 3) is TRUE, and (2 < 1) | (5 > 5) is FALSE.
The ! operator (not) performs “logical negation”. It evaluates to TRUE for FALSE statements and to FALSE for TRUE statements. For example, !(2 < 1) is TRUE and !(3 < 6) is FALSE.
Boolean expressions can be combined, using parentheses to confirm precedence. For example,
((5>4) & !(3<2)) | (6>7)
is TRUE.
More on generating vectors using the array and seq functions
One use of the array function is to create a vector with the same value repeated a given number of times. For example,
z = array(3, 5)
constructs a vector of 5 consecutive 3’s, [3, 3, 3, 3, 3]. You can also use the array function to concatenate several copies of an array together end-to-end. For example
z = array(c(3,5), 10)
creates the vector [3, 5, 3, 5, 3, 5, 3, 5, 3, 5]. Note that the second parameter (10 in this case) refers to the length of the result, not the number of times that the first parameter is repeated.
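An alternative that is common in practice is the built-in rep function, which is not otherwise used in this document. A short sketch showing that it produces the same vectors:

```r
rep(3, 5)                       # 3 3 3 3 3
rep(c(3, 5), times = 5)         # 3 5 3 5 3 5 3 5 3 5
rep(c(3, 5), length.out = 10)   # same result, specified by total length
```

With rep you can choose whether to give the number of repetitions (times) or the length of the result (length.out).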
You can reshape a vector into a multi-dimensional array:
V = seq(1, 11, 2)
M = array(V, c(3,2))
which yields the array
1 7
3 9
5 11
The seq function generates an arithmetic sequence (i.e. a sequence of values with a fixed spacing between consecutive elements). For example
z = seq(3, 8)
creates a vector variable called z that contains the values [3, 4, 5, 6, 7, 8]. You can also use
z = seq(8, 3)
to create the vector of values [8, 7, 6, 5, 4, 3], or
z = seq(3, 8, by=2)
to create the vector of values [3, 5, 7], where the by parameter causes every second value in the range to be returned.
Vector and matrix arithmetic
Vectors and matrices of the same shape can be operated on arithmetically, with the operations acting element-wise. For example,
x = seq(3, 12, 2)
y = seq(10, 6)
z = x + y
calculates the vector (pointwise) sum as follows:
x = [3, 5, 7, 9, 11]
y = [10, 9, 8, 7, 6]
z = [13, 14, 15, 16, 17]
Element-wise operations on vectors and arrays
Many of the mathematical functions in R act element-wise on vectors and matrices. For example, in
x = c(9, 16, 25, 36)
y = sqrt(x)
the value of y will be y = [3, 4, 5, 6]. Other functions acting element-wise include log, exp, and the trigonometric functions.
Reducing operations on vectors and arrays
The values in a vector can be summed using the sum function. For example,
v = seq(1, 100, 2)
x = sum(v)
calculates the sum of the odd integers between 1 and 100. There is also a product function called prod, but it is rarely used.
The max and min functions calculate the largest and smallest value, respectively, in a vector or matrix.
v = array(seq(1, 100, 2), c(25,2))
mx = max(v)
mn = min(v)
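The functions mean, median, sd, IQR, and var calculate the corresponding descriptive statistic from the values in a vector or matrix. One exception to keep in mind is that var, when applied to a matrix, returns the covariance matrix of its columns rather than a single variance.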
Element-wise Boolean operations
Most Boolean operators act element-wise.
vec = c(3, 2, 8, 6, 5, 6, 11, 0)
ix = (vec %% 2 == 1)
This creates the vector ix with value [TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE].
Counting
Frequently it is useful to count how many elements within a vector satisfy some condition. For example, if we wanted to know how many of the integers between 1 and 100 are divisible by 7, we could use
vec = seq(100)
v7 = (vec %% 7 == 0)
x = sum(v7)
Note that when summing the Boolean vector v7, true values are counted as 1 and false values are counted as 0.
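The same idea extends to proportions and positions. The sketch below repeats the divisibility example and also uses mean and which, two base functions that work naturally with Boolean vectors.

```r
vec = seq(100)
v7 = (vec %% 7 == 0)

sum(v7)     # 14 multiples of 7 between 1 and 100
mean(v7)    # 0.14, the proportion of TRUE values
which(v7)   # positions of the TRUE values: 7 14 21 ... 98
```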
Operating on matrices by row or column
The apply function applies a function to every row or to every column of a matrix. For example
mat = array(seq(8), c(4,2))
rs = apply(mat, 1, sum)
cs = apply(mat, 2, sum)
takes the matrix mat
1 5
2 6
3 7
4 8
and computes the row sums rs = [6, 8, 10, 12] or the column sums cs = [10, 26].
The first argument to apply is a matrix, and the second argument is either 1 (to operate on the rows) or 2 (to operate on the columns). The third argument to apply (sum in the example) can be any function that maps vectors to scalars (e.g. mean, max, sd). If you replace sum with max in the example, you will get rs = [5, 6, 7, 8] and cs = [4, 8].
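For the common case of sums, base R also provides rowSums and colSums, which give the same results as the apply calls. The sketch below also passes an anonymous function as the third argument to apply, as one possible way to compute a custom per-row summary.

```r
mat = array(seq(8), c(4, 2))

# Built-in shortcuts for row and column sums
rowSums(mat)   # 6 8 10 12
colSums(mat)   # 10 26

# An anonymous function also works as the third argument:
# here, the range (max minus min) of each row
apply(mat, 1, function(r) max(r) - min(r))   # 4 4 4 4
```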
Element access and slicing
To access the 5th element of the vector vec, use vec[5]. To access the value in row 3, column 2 of a matrix mat, use mat[3, 2].
To access the entire third row of a matrix mat use mat[3, ]. To access the entire second column of the matrix use mat[, 2].
To access the 2 x 3 submatrix spanning rows 3 and 4, and columns 5, 6 and 7 of a matrix mat, use mat[3:4, 5:7].
Here are some examples of element access and slicing:
# Create a 5 x 2 matrix
mat = array(seq(10), c(5, 2))
# Change every value in the fourth row to -1
mat[4, ] = -1
# Change the value in the upper-right corner to -9
mat[1, 2] = -9
# Change every value in rows 2 and 3 (a 2x2 block) to 0
mat[2:3, ] = 0
One-dimensional arrays expand if a value is assigned to a position outside the current array. For example if vec has state [3, 1, 4] and we set vec[5] = 9, the new state of vec is [3, 1, 4, NA, 9].
Since a scalar is a vector of length one, you can index a variable that may otherwise appear to be a scalar:
x = 7
x[2] = 9
After these two lines of code execute, the state of x is [7, 9].
You can create an empty vector by assigning NULL to a variable. Doing
vec = NULL
vec[3] = 1
yields a vector with state [NA, NA, 1]. Assigning the empty array, i.e. vec = c(), is equivalent.
You can access elements at arbitrary positions using an index array:
x = rnorm(10)
print(x[c(1, 8, 9)])
This is useful when selecting elements that meet a condition:
x = rnorm(20)
print(x[x < 0])
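Because rnorm produces different values on every run, here is the same kind of selection on a fixed vector so that the results can be checked by hand. The values are arbitrary.

```r
x = c(-1.5, 2, 0.3, -4)

x < 0          # TRUE FALSE FALSE TRUE
x[x < 0]       # -1.5 -4.0
x[c(1, 4)]     # the same elements, selected by position
which(x < 0)   # 1 4
```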
Learning the shape of a vector or array
The length function returns the length of a vector, e.g. the length function in the example below returns 3:
x = c(4, 3, 1)
length(x)
The dim function gives you the numbers of rows and columns in a multidimensional array. For a two-dimensional array, dim returns a vector of two values. The first return value is the number of rows and the second return value is the number of columns.
M = seq(10)
M = array(M, c(5,2))
# d will be c(5,2)
d = dim(M)
# nrow will be 5, and ncol will be 2
nrow = dim(M)[1]
ncol = dim(M)[2]
The length function returns the total number of elements in an array, regardless of whether it has a dim attribute:
a = array(0, c(3, 2))
print(length(a))
Extending arrays
You can append a row to a two-dimensional array with rbind, and a column to a two-dimensional array with cbind.
M = array(seq(10), c(5,2))
# Create a new array which is M extended by one row at
# its lower edge.
A = rbind(M, c(37,38))
# Create a new array which is M extended by one column at
# its right edge.
B = cbind(M, c(37,38,39,40,41))
Note that a row you are appending to M should have the same number of elements as M has columns, and a column you are appending to M should have the same number of elements as M has rows.
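A quick way to confirm the effect of rbind and cbind is to compare dimensions before and after, as in this sketch:

```r
M = array(seq(10), c(5, 2))
A = rbind(M, c(37, 38))
B = cbind(M, c(37, 38, 39, 40, 41))

dim(M)   # 5 2
dim(A)   # 6 2
dim(B)   # 5 3
A[6, ]   # 37 38 (the appended row)
```

Note that M itself is unchanged; rbind and cbind return new arrays.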
Removing elements and slices from vectors
To remove the value in a specific position from a vector, use a negative index:
vec = c(3, 1, 2, 6, 5, 8, 7, 9)
# Remove the value in the third position (which is 2)
# Note: does not remove all 3's from the vector.
# The value of A will be c(3,1,6,5,8,7,9)
A = vec[-3]
Negative indices in an array can be used to remove an entire row or column:
a = rnorm(6)
a = array(a, c(3, 2))
print(a[-2,])
print(a[,-1])
Negative indexing with an index array removes the values at all of the listed positions.
# -3:-7 is c(-3, -4, -5, -6, -7), therefore the values at
# positions 3, 4, 5, 6, and 7 are removed. The value of B will be
# c(3, 1, 9)
B = vec[-3:-7]
# 0 indices are skipped. -3:0 is c(-3, -2, -1, 0) which removes
# the elements at positions 1, 2 and 3. The value of C will
# therefore be c(6, 5, 8, 7, 9).
C = vec[-3:0]
Loops
Loops are used to carry out a sequence of related operations without having to write the code for each step explicitly.
Suppose we want to sum the integers from 1 to 10. We could use the following.
x = 0
for (i in 1:10) {
    x = x + i
}
The for statement creates a loop in which the looping variable i takes on the values 1, 2, …, 10 in sequence. For each value of the looping variable, the code inside the braces {} is executed (this code is called the body of the loop). Each execution of the loop body is called an iteration.
In the above program, x is an accumulator variable, meaning that its value is repeatedly updated while the program runs. It’s important to remember to initialize accumulator variables (to zero in the example).
To clarify, we can add a print statement inside the loop body.
x = 0
for (i in 1:10) {
    x = x + i
    print(c(i,x))
}
Run the above code. The output (to the screen) should look like this:
[1] 1 1
[1] 2 3
[1] 3 6
[1] 4 10
[1] 5 15
[1] 6 21
[1] 7 28
[1] 8 36
[1] 9 45
[1] 10 55
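For comparison, the same totals can be obtained without an explicit loop. This sketch uses sum together with cumsum, a base function that returns all of the running totals at once.

```r
sum(1:10)      # 55
cumsum(1:10)   # 1 3 6 10 15 21 28 36 45 55
```

The loop is still worth understanding, since not every computation has a built-in vectorized equivalent.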
More on loops
Loops can run over any vector of values. For example,
x = 1
for (v in c(3, 4, 7, 2)) {
    x = x*v
}
calculates the product 3 * 4 * 7 * 2 = 168.
Loops can be nested. For example,
x = 0
for (i in seq(4)) {
    for (j in seq(i)) {
        x = x + i*j
    }
}
calculates the following sum of products:
1*1 + 2*1 + 2*2 + 3*1 + 3*2 + 3*3 + 4*1 + 4*2 + 4*3 + 4*4 = 65.
which can be represented as a triangular array, as follows.
      1*1
i     2*1  2*2
      3*1  3*2  3*3
      4*1  4*2  4*3  4*4
            j
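As a cross-check on the nested loop, the triangular sum can also be computed from a table of products. The sketch below uses outer and lower.tri, two base functions not otherwise covered here.

```r
# Table of all products i*j for i, j in 1..4
m = outer(1:4, 1:4)

# Keep only the entries with j <= i (lower triangle, including the diagonal)
sum(m[lower.tri(m, diag = TRUE)])   # 65
```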
while loops
A while loop is used when it is not known ahead of time how many loop iterations are needed. For example, suppose we wish to calculate the partial sums of the harmonic series
1/1 + 1/2 + 1/3 + ...
until the partial sum exceeds 5 (since the harmonic series diverges the partial sum must eventually exceed any given constant).
The following program will produce the first value n such that the nth partial sum of the harmonic series is greater than 5:
# Initialize
n = 0
x = 0
while (x <= 5) {
    n = n + 1
    x = x + 1/n
}
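To experiment with other thresholds, the loop can be wrapped in a function. The name first_exceeding and its argument target are hypothetical; the body is the same while loop as above.

```r
# first_exceeding is a hypothetical helper wrapping the loop above
first_exceeding = function(target) {
    n = 0
    x = 0
    while (x <= target) {
        n = n + 1
        x = x + 1/n
    }
    n
}

first_exceeding(5)   # 83
first_exceeding(2)   # 4
```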
Conditional execution (if/else blocks)
An if block can be used to make on-the-fly decisions about what statements of a program get executed. For example,
y = 7
if (y < 10) {
    x = 2
} else {
    x = 1
}
The value of x after executing this code is 2. This doesn’t appear very useful, but if statements can be very useful inside a loop. For example, the following program places the sum of the odd integers up to 100 in A and the sum of the even integers up to 100 in B.
A = 0
B = 0
for (k in 1:100) {
    if (k %% 2 == 1) {
        A = A + k
    } else {
        B = B + k
    }
}
The following demonstrates an even more complicated if/else if/else
construction.
A = 0
B = 0
C = 0
D = 0
for (k in (1:100)) {
    # An if construct.
    if ((k %% 2 == 1) & (k < 50)) {
        A = A + k
    } else if ((k %% 2 == 1) & (k >= 50)) {
        B = B + k
    } else {
        C = C + k
    }
    # An independent if construct.
    if (k >= 50) {
        D = D + k
    }
}
The preceding program places the sum of odd integers between 1 and 49 in A, the sum of odd integers between 50 and 100 in B, the sum of even integers between 1 and 100 in C, and the sum of all integers greater than or equal to 50 in D.
break and next in loops
A break statement is used to exit a loop when a certain condition is met. A next statement results in the current iteration of the loop stopping immediately, with the loop continuing at the beginning of the next iteration.
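The following program adds up the odd integers in increasing order and stops once the running sum reaches 50 or more. The final value of x is 64, the sum of the odd integers from 1 to 15.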
x = 0
for (k in seq(100)) {
    # Skip even numbers, but keep looping.
    if (k %% 2 == 0) {
        next
    }
    # Quit looping once the sum reaches 50 or more.
    if (x >= 50) {
        break
    }
    x = x + k
}
Lists
A list in R is an ordered collection of arbitrary objects.
x = list(1, "cat", 5.5)
To index a list use the [[]] indexing notation, for example, x[[2]] is “cat”. This example uses the list data type in a way that is typical of list data structures in other languages, namely as an ordered collection of objects of arbitrary type (i.e. as an inhomogeneous data container).
A list can also have named entries:
x = list(age=37, name="Sally", height=5.4)
Named lists are somewhat like associative arrays (hash maps, etc.) in other languages. To access an element of a named list use the $ operator, e.g. x$age is 37 in the example above. You can also use double brackets [[]], e.g. x[["age"]]. However, unlike most standard map implementations, R’s named lists are ordered, and the keys can be non-unique. For example,
list(a=1, b=2, b=3)
creates a list with three values.
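The sketch below explores this list with duplicate names. It also contrasts double brackets with single brackets, a distinction that is not covered above: single brackets return a sub-list rather than the element itself.

```r
x = list(a = 1, b = 2, b = 3)

length(x)   # 3
names(x)    # "a" "b" "b"
x$b         # 2 (the first entry named b)
x[[3]]      # 3 (positional access reaches the second b)

# Single brackets return a sub-list rather than the element itself
typeof(x["a"])     # "list"
typeof(x[["a"]])   # "double"
```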