Statistics 506, Fall 2016

Introduction to R


This document is a practical overview of R. Later in the course we will revisit many of these concepts and examine them more deeply.

R is a programming language for statistical computing, data analysis, and graphics. It is a re-implementation of the S language, which was developed in the 1970s at Bell Laboratories.

R is a high level and dynamic language. Memory management and variable typing, among other activities, are handled automatically in R.

R is one of the main computing tools used for data analysis in applied research, and for research in statistics itself.

Using R

There are two main ways to run R:

  • Interactive mode: You can use R interactively by typing statements or short programs directly into the terminal at the R prompt.

  • Sourcing scripts: Most of the time, you should write your programs using a text editor. Save your program as a text file with extension .R, like myprog.R. Then you can run your program in the interpreter by typing source('myprog.R') at the R prompt.

You may need to set the directory path to point to the location where you saved your script. If you are working in the terminal you can usually avoid this step by launching R from within the directory that holds your scripts. If you are using a GUI of some sort there will usually be a menu that allows you to set the working directory.

Make sure you save your programs as text files.

Variables and types in R

A variable is a symbol or name, like x, that holds a value. The value can be any R object. Every R object has a type, which you can discover using the typeof function. There are 24 built-in types in R. One of these types is called S4, which is a generic type that you can use to extend the language.

In addition to having a type, values in R also have a mode and a class. We will talk more about these later, but note that you can obtain the mode and the class of a value using the built-in mode() and class() functions.

Also note that the built-in str function gives you information about the internal representation of a value.
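
As a small illustration (the variable names x and s are only placeholders), the following sketch inspects two values with the functions mentioned above. The comments show what a standard R session should report.

```r
x = c(3, 5, 9)
typeof(x)   # "double"
mode(x)     # "numeric"
class(x)    # "numeric"

s = "cat"
typeof(s)   # "character"

# str prints a compact summary of the internal representation
str(x)
```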

Array types

Five of the built-in R data types are array types, and these are perhaps the most important data types in R. They are called logical, integer, double, complex, and character, based on the type of value stored in the array. You will likely use double especially often.

An array is a collection of values all of the same type (so it is a homogeneous data structure). The values are stored as a sequence, but as we will see below they can be treated semantically as having multiple dimensions.

The following creates a literal one-dimensional array of length 3:

x = c(3, 5, 9)

The character c here is a built-in function called “combine” for creating arrays. You can use typeof(x) to confirm that this statement creates a double array.

Even though the values in the array x are all whole numbers, the type of x is still double. Numeric literals such as 3 are stored as doubles in R, so the combine function creates a double array from them. To create an array of integer type, you can use

x = as.integer(x)

In R there are no pure scalar types. A scalar value is simply an array of length 1.

Note that the combine function always flattens its arguments, that is,

c(1, 2, c(3, 4), c(5, 6))

creates the flat array [1, 2, 3, 4, 5, 6], not a nested array like [1, 2, [3, 4], [5, 6]].

Multidimensional arrays

The arrays created using the combine function c() are dimensionless, or flat, arrays. One way to create a multidimensional array is to use the array function:

x = array(c(1, 2, 3, 4, 5, 6), c(3, 2))

The first argument to array is a flat array that contains the data; the second argument gives the dimensions (rows, then columns, for a two-dimensional array). Note that the array is filled column-wise, so the state of x is

1 4
2 5
3 6

You can also create a multidimensional array by taking a flat array and adding a dim (for “dimensions”) attribute to it:

x = c(1, 2, 3, 4, 5, 6)
dim(x) = c(3, 2)

Arrays with more than two dimensions can be created, but are not used very often:

x = c(1, 2, 3, 4, 5, 6, 7, 8)
dim(x) = c(2, 2, 2)
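
To make the column-wise fill order concrete for more than two dimensions, here is a short sketch that looks up individual elements of the 2 x 2 x 2 array. It uses bracket indexing, which is covered in the section on element access and slicing. The first index varies fastest, then the second, then the third.

```r
x = c(1, 2, 3, 4, 5, 6, 7, 8)
dim(x) = c(2, 2, 2)

dim(x)       # 2 2 2
x[2, 1, 1]   # 2
x[1, 2, 1]   # 3
x[1, 1, 2]   # 5

# The second 2 x 2 slice holds the values 5, 6, 7, 8
x[, , 2]
```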

Function types

In R, functions are “first class objects”. This means that they can be treated like any other variable. A function is defined using the function keyword, as follows:

f = function(x) { x }

The code in braces is the function body. The final statement of the function body is the return value of the function. This function is the identity function – it simply returns its input. You can also make the return explicit:

f = function(x) { return(x) }

or even

f = function(x) return(x)

We will consider functions in much more detail later. Note that the formal type of a function is called closure. A closure is a standard term in computing language theory that refers to a function as well as an environment it encloses that defines variables (other than the function arguments) that are used in the function. We will discuss closures in more detail later.
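
As a brief preview, here is a minimal sketch of a closure. The names make_adder and add3 are made up for this illustration. The inner function uses the variable n, which is not one of its arguments but lives in the environment where the inner function was created.

```r
# make_adder returns a new function that remembers the value of n
make_adder = function(n) {
    function(x) { x + n }
}

add3 = make_adder(3)
add3(10)       # 13
typeof(add3)   # "closure"
```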

Variable names

You can use simple variable names like x, y, A, and a (note that A and a are different variable names). You can also use longer names like counter, index1, or subject_id.

A variable name may contain digits, but it cannot begin with a digit. It may contain underscores (_) and periods (.), but not operators (* - + < > = & |), most punctuation ((), {}), or the comment character (#). Very few computing languages allow the dot (period) as a character in variable names, but it is allowed in R. Some people choose not to use periods in R variable names since it can be confusing to people who frequently work with other languages.

Be careful about clobbering built-in symbols with your own variable names. You could create a variable named pi, but the name pi would then refer to your value rather than to the built-in constant.

Function signatures

A function signature specifies what arguments can be passed into a function. Functions in R have zero or more positional arguments, and zero or more keyword arguments.

For example, the array function we saw earlier has the following signature:

array(data = NA, dim = length(data), dimnames = NULL)

All three arguments of this function are keyword arguments. When calling a function, keyword arguments can be passed by name or by position. If passed by position, the arguments must appear in the order specified in the signature. If passed by name, the arguments can be in any order. For example, the following are equivalent

x = array(c(1, 2, 3, 4, 5, 6), c(3, 2))
y = array(data=c(1, 2, 3, 4, 5, 6), dim=c(3, 2))
z = array(dim=c(3, 2), data=c(1, 2, 3, 4, 5, 6))

Since keyword arguments have default values, they are optional. For example, we are not providing a value for the dimnames argument to array. Positional arguments must always be provided.
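
The same rules apply to functions you write yourself. The sketch below defines a hypothetical function scale_values (not a built-in) with one positional argument and two keyword arguments, and calls it in three ways.

```r
# x is positional (no default); center and scale are keyword arguments
scale_values = function(x, center = 0, scale = 1) {
    (x - center) / scale
}

a = scale_values(c(2, 4, 6))                  # defaults used: 2 4 6
b = scale_values(c(2, 4, 6), 2, 2)            # passed by position: 0 1 2
d = scale_values(scale = 2, x = c(2, 4, 6))   # passed by name, any order: 1 2 3
```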

Getting Information from R

You can get some documentation about almost any R command or function using the help command. For example, the following produces some documentation about the array function.

help(array)

You can get the current value of a variable by typing its name at the prompt. You can also use the print function to display its value. The following displays the value of the variable x.

print(x)

Since the value bound to a function (closure) variable is the source code of the function, you can easily obtain the source code of any R function by typing the name of the function (without passing any arguments). Try typing array at the prompt to see the source code of this built-in function.

Comments

A comment is anything you write in your program code that is ignored by the computer. Comments help other people understand your code. Any text following a # character is a comment.

x = c(3, 5, 2)  # These are the doses of the new drug formulation.

Arithmetic

You can use R like a calculator. The familiar scalar arithmetic operations +, -, *, and / (addition, subtraction, multiplication, and division) are all built-in. After the following program is run, z will have the value 12, a will have the value 2, and w will have the value 24.

x = 5
y = 7
z = x + y
a = y - x
w = a*z

Other binary arithmetic operators are exponentiation and modular division (remainders).

x = 5^2
y = 23 %% 5

R automatically promotes integers to doubles when doing division. For example, consider the following

> typeof(as.integer(1) + as.integer(2))
[1] "integer"

> typeof(as.integer(1) * as.integer(2))
[1] "integer"

> typeof(as.integer(1) / as.integer(2))
[1] "double"

Arithmetic expressions

You can evaluate more complicated expressions by following standard mathematical conventions. If in doubt about precedence, use parentheses (but don’t over-use them).

x = 5
y = (x+1) / (x-1)^2 + 1/x

It is possible (and useful) to modify the value of a variable in-place, using an expression that involves its current value. After the following program, the value of x will be 6.

x = 5
x = x + 1

Rounding

R provides several rounding functions: floor rounds toward negative infinity, ceiling rounds toward positive infinity, round rounds to the nearest integer, and trunc rounds toward zero.

v = ceiling(3.8)
w = floor(3.8)
x = ceiling(-3.8)
y = trunc(-3.8)
z = round(-3.8)
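
For reference, the sketch below lists the values these calls produce, along with two extra cases. The second argument to round (the number of decimal places) is a standard feature that is not otherwise used in this document.

```r
ceiling(3.8)        # 4
floor(3.8)          # 3
ceiling(-3.8)       # -3
floor(-3.8)         # -4
trunc(-3.8)         # -3
round(-3.8)         # -4
round(3.14159, 2)   # 3.14
```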

Higher mathematical functions and rounding error

The square root function is sqrt, and fractional powers are also allowed, so sqrt(x) is the same as x^0.5. The natural log function is denoted log, and the exponential function e^x is denoted by exp(x). The trigonometric functions are denoted in the usual way. The mathematical constant pi is also provided.

w = sqrt(2)
x = exp(3)
y = log(x)
z = tan(pi/3)

Numbers can be represented in exponential notation in R, for example, using 1.5e7 gives the value of 1.5*10^7. The accuracy of double precision numbers becomes poor for very large and very small numbers. For example, exp(-800) is exactly 0 in R. As another example, the value of x = tan(pi/2) should be infinity, but since the value of pi used by the computer is approximate, you will see that x is actually a very large finite number.
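
The following sketch shows a few consequences of finite precision. The all.equal function, which compares numbers up to a small tolerance, is not discussed elsewhere in this document but is a standard way to test approximate equality.

```r
# Squaring the square root of 2 does not give back exactly 2
sqrt(2)^2 == 2                    # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE

# Underflow: the true value is positive but too small to represent
exp(-800) == 0                    # TRUE

# tan(pi/2) is a very large, but finite, number
is.finite(tan(pi/2))              # TRUE
```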

Infinity, undefined values, and missing data

Undefined values are represented by NaN (not a number):

x = sqrt(-1)

Overflow to infinity, and underflow to zero can arise from some operations due to the limited precision of floating point arithmetic:

exp(800)   # yields positive infinity
exp(-800)  # underflows to zero

The special value called NA stands for not available. This value indicates that the corresponding data point is missing or not available for some reason.

An arithmetic expression involving NaN will generally have the value NaN, and an expression involving NA will generally have the value NA:

1 + 2 + 3 + NA
1 + 2 + 3 + NaN

A mathematical expression involving Inf will generally evaluate as it should mathematically:

3 + Inf    # Yields Inf
3 + 1/Inf  # Yields 3
Inf + -Inf # Yields NaN

Functions are often configurable by the user to modify how they handle NA arguments, for example:

> mean(c(1, 2, 3, NA))
[1] NA
> mean(c(1, 2, 3, NA), na.rm=TRUE)
[1] 2
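
To find out where the missing or undefined values are, you can use the built-in is.na and is.nan functions. The sketch below uses a made-up vector x. Note that is.na is TRUE for both NA and NaN, while is.nan is TRUE only for NaN.

```r
x = c(1, NA, 3, NaN)

is.na(x)                # FALSE  TRUE FALSE  TRUE
is.nan(x)               # FALSE FALSE FALSE  TRUE
sum(is.na(x))           # 2 values are missing or undefined
mean(x, na.rm = TRUE)   # 2, the mean of the remaining values 1 and 3
```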

Assignment by value

Variable values are assigned “by value” in R. Therefore after running the following lines of code, the value of x is (5, 4) and the value of y is (99, 4).

x = c(5, 4)
y = x
y[1] = 99
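
The same behavior applies to function arguments. In the sketch below, the hypothetical function set_first modifies its argument, but the change is only visible in the returned value and the caller's variable is unchanged.

```r
set_first = function(v) {
    v[1] = 99   # modifies the function's own copy of v
    v
}

x = c(5, 4)
y = set_first(x)
x   # 5 4
y   # 99 4
```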

Boolean expressions

Boolean expressions evaluate to either TRUE or FALSE. For example,

3 + 2 < 5

is FALSE,

10 - 4 > 5

is TRUE, and

10 + 4 == 7 + 7

is TRUE (note that you must use two equals signs for testing equality to avoid confusion with assignment statements).

The & (and) operator is TRUE only if the expressions on both sides of the operator are TRUE. For example,

(3 < 5) & (2 > 0)

is TRUE, and

(2 < 3) & (5 > 5)

is FALSE.

The | (or) operator is TRUE if at least one of the expressions surrounding it is TRUE. For example,

(3 < 5) | (2 > 3)

is TRUE, and

(2 < 1) | (5 > 5)

is FALSE.

The ! operator (not) performs “logical negation”. It evaluates to TRUE for FALSE statements and to FALSE for TRUE statements. For example

!(2 < 1)

is TRUE, and

!(3 < 6)

is FALSE.

Boolean expressions can be combined, using parentheses to confirm precedence. For example,

((5>4) & !(3<2)) | (6>7)

is TRUE.

More on generating vectors using the array and seq functions

One use of the array function is to create a vector with the same value repeated a given number of times. For example,

z = array(3, 5)

constructs a vector of 5 consecutive 3’s, [3, 3, 3, 3, 3]. You can also use the array function to concatenate several copies of an array together end-to-end. For example

z = array(c(3,5), 10)

creates the vector [3, 5, 3, 5, 3, 5, 3, 5, 3, 5]. Note that the second parameter (10 in this case) refers to the length of the result, not the number of times that the first parameter is repeated.

You can reshape a vector into a multi-dimensional array:

V = seq(1, 11, 2)
M = array(V, c(3,2))

which yields the array

1   7
3   9
5  11

The seq function generates an arithmetic sequence (i.e. a sequence of values with a fixed spacing between consecutive elements). For example

z = seq(3, 8)

creates a vector variable called z that contains the values [3, 4, 5, 6, 7, 8]. You can also use

z = seq(8, 3)

to create the vector of values [8, 7, 6, 5, 4, 3], or

z = seq(3, 8, by=2)

to create the vector of values [3, 5, 7], where the by parameter causes every second value in the range to be returned.
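
A few other common ways to build sequences are sketched below. The colon operator and the length.out argument are standard features of R that are not introduced in the text above.

```r
a = 3:8                         # shorthand for seq(3, 8): 3 4 5 6 7 8
b = seq(10, 1, by = -3)         # negative step: 10 7 4 1
d = seq(0, 1, length.out = 5)   # five equally spaced values: 0 0.25 0.5 0.75 1
e = seq(4)                      # single argument counts from 1: 1 2 3 4
```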

Vector and matrix arithmetic

Vectors and matrices of the same shape can be operated on arithmetically, with the operations acting element-wise. For example,

x = seq(3, 12, 2)
y = seq(10, 6)
z = x + y

calculates the vector (pointwise) sum as follows:

x = [3, 5, 7, 9, 11]
y = [10, 9, 8, 7, 6]
z = [13, 14, 15, 16, 17]

Element-wise operations on vectors and arrays

Many of the mathematical functions in R act element-wise on vectors and matrices. For example, in

x = c(9, 16, 25, 36)
y = sqrt(x)

the value of y will be y = [3, 4, 5, 6]. Other functions acting element-wise include log, exp, and the trigonometric functions.

Reducing operations on vectors and arrays

The values in a vector can be summed using the sum function. For example,

v = seq(1, 100, 2)
x = sum(v)

calculates the sum of the odd integers between 1 and 100. There is also a product function called prod, but it is rarely used.

The max and min functions calculate the largest and smallest value, respectively, in a vector or matrix.

v = array(seq(1, 100, 2), c(25,2))
mx = max(v)
mn = min(v)

The functions mean, median, sd, IQR, and var calculate the corresponding descriptive statistic from the values in a vector or matrix.
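
As a quick sketch, here are several of these reducing functions applied to a small made-up vector. Note that var and sd use the n - 1 denominator (the sample variance and sample standard deviation).

```r
v = c(2, 4, 4, 4, 5, 5, 7, 9)

sum(v)        # 40
mean(v)       # 5
median(v)     # 4.5
var(v)        # 32/7, about 4.571
sd(v)         # sqrt(32/7), about 2.138
prod(1:5)     # 120
```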

Element-wise Boolean operations

Most Boolean operators act element-wise.

vec = c(3, 2, 8, 6, 5, 6, 11, 0)
ix = (vec %% 2 == 1)

This creates the vector ix with value [TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE].

Counting

Frequently it is useful to count how many elements within a vector satisfy some condition. For example, if we wanted to know how many of the integers between 1 and 100 are divisible by 7, we could use

vec = seq(100)
v7 = (vec %% 7 == 0)
x = sum(v7)

Note that when summing the Boolean vector v7, true values are counted as 1 and false values are counted as 0.
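
Two closely related idioms are sketched below: taking the mean of a Boolean vector gives the proportion of TRUE values, and the built-in which function returns the positions of the TRUE values.

```r
vec = seq(100)
v7 = (vec %% 7 == 0)

count = sum(v7)      # 14 multiples of 7
prop = mean(v7)      # 0.14, the proportion of multiples of 7
pos = which(v7)      # positions of the TRUE values: 7, 14, ..., 98
```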

Operating on matrices by row or column

The apply function applies a function to every row or to every column of a matrix. For example

mat = array(seq(8), c(4,2))
rs = apply(mat, 1, sum)
cs = apply(mat, 2, sum)

takes the matrix mat

1  5
2  6
3  7
4  8

and computes the row sums rs = [6, 8, 10, 12] or the column sums cs = [10, 26].

The first argument to apply is a matrix, and the second argument is either 1 (to operate on the rows) or 2 (to operate on the columns).

The third argument to apply (sum in the example) can be any function that maps vectors to scalars (e.g. mean, max, sd). If you replace sum with max in the example, you will get rs = [5, 6, 7, 8] and cs = [4, 8].
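
The function passed to apply can also be one you define yourself. The sketch below passes an anonymous function that computes the spread (maximum minus minimum) of each row. It also shows the built-in rowSums and colSums functions, which give the same result as apply with sum.

```r
mat = array(seq(8), c(4, 2))

# Spread of each row, using a function defined on the spot
spread = apply(mat, 1, function(r) max(r) - min(r))   # 4 4 4 4

# Mean of each column
cm = apply(mat, 2, mean)                              # 2.5 6.5

rowSums(mat)   # 6 8 10 12
colSums(mat)   # 10 26
```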

Element access and slicing

To access the 5th element of the vector vec, use vec[5]. To access the value in row 3, column 2 of a matrix mat, use mat[3, 2].

To access the entire third row of a matrix mat use mat[3, ]. To access the entire second column of the matrix use mat[, 2].

To access the 2 x 3 submatrix spanning rows 3 and 4, and columns 5, 6 and 7 of a matrix mat, use mat[3:4, 5:7].

Here are some examples of element access and slicing:

# Create a 5 x 2 matrix
mat = array(seq(10), c(5, 2))

# Change every value in the fourth row to -1
mat[4, ] = -1

# Change the value in the upper-right corner to -9
mat[1, 2] = -9

# Change every value in rows 2 and 3 (a 2x2 submatrix) to 0
mat[2:3, ] = 0
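
To check your understanding, the sketch below repeats the assignments above and prints the result. The final state of mat is shown in the comments.

```r
mat = array(seq(10), c(5, 2))
mat[4, ] = -1
mat[1, 2] = -9
mat[2:3, ] = 0
print(mat)

# Final state of mat:
#    1  -9
#    0   0
#    0   0
#   -1  -1
#    5  10
```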

One-dimensional arrays expand if a value is assigned to a position outside the current array. For example if vec has state [3, 1, 4] and we set vec[5] = 9, the new state of vec is [3, 1, 4, NA, 9].

Since a scalar is a vector of length one, you can index a variable that may otherwise appear to be a scalar:

x = 7
x[2] = 9

After these two lines of code execute, the state of x is [7, 9].

You can create an empty vector by assigning NULL to a variable. Doing

vec = NULL
vec[3] = 1

yields a vector with state [NA, NA, 1]. Assigning to the empty array, i.e. vec = c(), is equivalent.

You can access elements at arbitrary positions using an index array:

x = rnorm(10)
print(x[c(1, 8, 9)])

This is useful when selecting elements that meet a condition:

x = rnorm(20)
print(x[x < 0])
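
Because rnorm produces different values on every run, here is a sketch of the same idea with a fixed, made-up vector. It also shows that a Boolean index can be used on the left side of an assignment to replace the selected elements.

```r
x = c(4, -2, 7, -5, 0)

neg = (x < 0)        # FALSE TRUE FALSE TRUE FALSE
selected = x[neg]    # -2 -5
where = which(neg)   # positions 2 and 4

# Replace every negative value with 0
x[neg] = 0
x                    # 4 0 7 0 0
```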

Learning the shape of a vector or array

The length function returns the length of a vector. In the example below it returns 3:

x = c(4, 3, 1)
length(x)

The dim function gives you the numbers of rows and columns in a multidimensional array. For a two-dimensional array, dim returns a vector of two values. The first return value is the number of rows and the second return value is the number of columns.

M = seq(10)
M = array(M, c(5,2))

# d will be c(5,2)
d = dim(M)

# nrow will be 5, and ncol will be 2
nrow = dim(M)[1]
ncol = dim(M)[2]

The length function returns the total number of elements in an array, regardless of whether it has a dim attribute:

a = array(0, c(3, 2))
print(length(a))

Extending arrays

You can append a row to a two-dimensional array with rbind, and a column to a two-dimensional array with cbind.

M = array(seq(10), c(5,2))

# Create a new array which is M extended by one row at
# its lower edge.
A = rbind(M, c(37,38))

# Create a new array which is M extended by one column at
# its right edge.
B = cbind(M, c(37,38,39,40,41))

Note that a row you are appending to M should have the same number of elements as M has columns, and a column you are appending to M should have the same number of elements as M has rows.

Removing elements and slices from vectors

To remove the value in a specific position from a vector, use a negative index:

vec = c(3, 1, 2, 6, 5, 8, 7, 9)

# Remove the value in the third position (which is 2)
# Note: does not remove all 3's from the vector.
# The value of A will be c(3,1,6,5,8,7,9)
A = vec[-3]

Negative indices in an array can be used to remove an entire row or column:

a = rnorm(6)
a = array(a, c(3, 2))
print(a[-2,])
print(a[,-1])

Negative indexing with an index array removes the values at every position listed in the index array.

# -3:-7 is c(-3, -4, -5, -6, -7), therefore the values at
# positions 3, 4, 5, 6, and 7 are removed.  The value of B will be
# c(3, 1, 9)
B = vec[-3:-7]

# 0 indices are skipped.  -3:0 is c(-3, -2, -1, 0) which removes
# the elements at positions 1, 2 and 3.  The value of C will
# therefore be c(6, 5, 8, 7, 9).
C = vec[-3:0]

Loops

Loops are used to carry out a sequence of related operations without having to write the code for each step explicitly.

Suppose we want to sum the integers from 1 to 10. We could use the following.

x = 0
for (i in 1:10) {
    x = x + i
}

The for statement creates a loop in which the looping variable i takes on the values 1, 2, …, 10 in sequence. For each value of the looping variable, the code inside the braces {} is executed (this code is called the body of the loop). Each execution of the loop body is called an iteration.

In the above program, x is an accumulator variable, meaning that its value is repeatedly updated while the program runs. It’s important to remember to initialize accumulator variables (to zero in the example).

To clarify, we can add a print statement inside the loop body.

x = 0
for (i in 1:10) {
    x = x + i
    print(c(i,x))
}

Run the above code. The output (to the screen) should look like this:

[1] 1 1
[1] 2 3
[1] 3 6
[1]  4 10
[1]  5 15
[1]  6 21
[1]  7 28
[1]  8 36
[1]  9 45
[1] 10 55
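
For comparison, the same results can be obtained without writing a loop. The sketch below uses the sum function together with the built-in cumsum function, which returns all of the running totals at once.

```r
total = sum(1:10)        # 55
running = cumsum(1:10)   # 1 3 6 10 15 21 28 36 45 55
```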

More on loops

Loops can run over any vector of values. For example,

x = 1
for (v in c(3, 4, 7, 2)) {
    x = x*v
}

calculates the product 3 * 4 * 7 * 2 = 168.

Loops can be nested. For example,

x = 0
for (i in seq(4)) {
    for (j in seq(i)) {
        x = x + i*j
    }
}

calculates the following sum of products:

1*1 + 2*1 + 2*2 + 3*1 + 3*2 + 3*3 + 4*1 + 4*2 + 4*3 + 4*4 = 65.

which can be represented as a triangular array, as follows.

   1*1
i  2*1  2*2
   3*1  3*2  3*3
   4*1  4*2  4*3  4*4
           j
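
One way to double-check the nested loop is to build the full table of products and sum its lower triangle. This sketch uses the built-in outer and lower.tri functions, which are not otherwise covered in this document.

```r
# Nested loop, as above
x = 0
for (i in seq(4)) {
    for (j in seq(i)) {
        x = x + i*j
    }
}

# m[i, j] holds i*j; keep only the entries with j <= i
m = outer(1:4, 1:4)
total = sum(m[lower.tri(m, diag = TRUE)])

x       # 65
total   # 65
```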

while loops

A while loop is used when it is not known ahead of time how many loop iterations are needed. For example, suppose we wish to calculate the partial sums of the harmonic series

1/1 + 1/2 + 1/3 + ...

until the partial sum exceeds 5 (since the harmonic series diverges the partial sum must eventually exceed any given constant).

The following program will produce the first value n such that the nth partial sum of the harmonic series is greater than 5:

# Initialize
n = 0
x = 0

while (x <= 5) {
    n = n + 1
    x = x + 1/n
}
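
The same loop can be wrapped in a function so that the threshold is easy to change. The function name first_n_exceeding and its argument threshold are made up for this sketch.

```r
# Returns the first n such that 1/1 + 1/2 + ... + 1/n exceeds threshold
first_n_exceeding = function(threshold) {
    n = 0
    x = 0
    while (x <= threshold) {
        n = n + 1
        x = x + 1/n
    }
    n
}

first_n_exceeding(2)   # 4
first_n_exceeding(5)   # 83
```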

Conditional execution (if/else blocks)

An if block can be used to make on-the-fly decisions about what statements of a program get executed. For example,

y = 7
if (y < 10) {
    x = 2
} else {
    x = 1
}

The value of x after executing this code is 2. This doesn’t appear very useful, but if statements can be very useful inside a loop. For example, the following program places the sum of the odd integers up to 100 in A and the sum of the even integers up to 100 in B.

A = 0
B = 0
for (k in 1:100) {
    if (k %% 2 == 1)  {
        A = A + k
    } else {
        B = B + k
    }
}
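
The two sums can also be computed without a loop, by selecting elements with a Boolean index. This is a sketch for cross-checking the loop. The names odd_sum and even_sum are chosen for this example.

```r
k = 1:100
odd_sum = sum(k[k %% 2 == 1])    # 1 + 3 + ... + 99 = 2500
even_sum = sum(k[k %% 2 == 0])   # 2 + 4 + ... + 100 = 2550
```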

The following demonstrates an even more complicated if/else if/else construction.

A = 0
B = 0
C = 0
D = 0
for (k in (1:100)) {

    # An if construct.
    if ((k %% 2 == 1) & (k < 50)) {
        A = A + k
    } else if ((k %% 2 == 1) & (k >= 50)) {
        B = B + k
    } else {
        C = C + k
    }

    # An independent if construct.
    if (k >= 50) {
        D = D + k
    }
}

The preceding program places the sum of odd integers between 1 and 49 in A, the sum of odd integers between 50 and 100 in B, the sum of even integers between 1 and 100 in C, and the sum of all integers greater than or equal to 50 in D.

break and next in loops

A break statement is used to exit a loop when a certain condition is met. A next statement results in the current iteration of the loop stopping immediately, with the loop continuing at the beginning of the next iteration.

The following program adds up successive odd integers, starting from 1, and stops once the running sum has reached 50 or more (the final value of x is 64).

x = 0
for (k in seq(100))
{
    # Skip even numbers, but keep looping.
    if (k %% 2 == 0) {
        next
    }

    # Quit looping once the sum is at least 50.
    if (x >= 50) {
        break
    }
    x = x + k
}
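
For comparison, here is a sketch of a variant in which the break condition tests the looping variable k rather than the running sum x. This version adds up exactly the odd integers between 1 and 49.

```r
x = 0
for (k in seq(100)) {
    # Quit looping once k reaches 50.
    if (k >= 50) {
        break
    }

    # Skip even numbers, but keep looping.
    if (k %% 2 == 0) {
        next
    }
    x = x + k
}
x   # 625
```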

Lists

A list in R is an ordered collection of arbitrary objects.

x = list(1, "cat", 5.5)

To index a list use the [[]] indexing notation, for example, x[[2]] is “cat”. This example uses the list data type in a way that is typical of list data structures in other languages, namely as an ordered collection of objects of arbitrary type (i.e. as an inhomogeneous data container).

A list can also have named entries:

x = list(age=37, name="Sally", height=5.4)

Named lists are somewhat like associative arrays (hash maps, etc.) in other languages. To access an element of a named list use the $ operator, e.g. x$age is 37 in the example above. You can also use double brackets [[]], e.g. x[["age"]]. However unlike most standard map implementations, R’s named lists are ordered, and the keys can be non-unique. For example,

list(a=1, b=2, b=3)

creates a list with three values.
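
The sketch below explores such a list a little further. When a name is repeated, lookup by name returns the first matching entry, so the later entry can only be reached by position.

```r
x = list(a = 1, b = 2, b = 3)

length(x)     # 3
names(x)      # "a" "b" "b"
x$b           # 2, the first entry named b
x[[3]]        # 3, the second entry named b, reached by position

# Assigning to a name that is not yet present appends a new entry
x$c = "new"
length(x)     # 4
```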