Statistics 506, Fall 2016

The R language


This page takes a deeper look at the core R language. It builds on the material covered on our introduction to R page.

Background and history

R is a programming language and “environment for statistical computing.” The meaning of the term “environment” here is that the power of R derives as much or more from the libraries and tools that accompany it as from the language itself.

R first appeared in the early 1990s (version 1.0.0 was released in 2000), and was written to provide an open-source statistical computing environment that would be familiar to users of the then-popular language/environment called “S”.

Although the syntax of R resembles S, the implementation is based on another programming language called Scheme. The original developers of R, Ross Ihaka and Robert Gentleman, described the origins of R as follows:

… The resulting language [R] is very similar in appearance to S, but the underlying implementation and semantics are derived from Scheme. In fact, we implemented the language by first writing an interpreter for a Scheme subset and then progressively mutating it to resemble S.

The R interpreter is written in C, and to fundamentally understand how R works it helps to have a bit of an understanding of C. We don’t want to assume much C knowledge here, so we will keep this connection slightly vague. Although the R interpreter is written in C, R is capable of calling compiled libraries that are written in other languages. For example, much of the linear algebra code used in R is written in Fortran.

Variables and types

A variable in R is a symbol, like x, that appears in the source code, and that is linked to an underlying value. A “value” has one of 24 basic types. In R, the set of basic types is fixed, and is listed on page 2 here.

R is a dynamic language, meaning that while values have types, variables do not. This means that you can assign different values to the same variable:

x = 4                 # typeof(x) is double
x = 4L                # typeof(x) is integer
x = "cat"             # typeof(x) is character
x = function(y){y+1}  # typeof(x) is closure
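
The same check can be run on other kinds of values. The following is a small sketch using only the base function typeof; the name types is just an illustrative variable, and the comments give the type name that each call returns.

```r
# typeof reports the basic type of a value, not of the variable holding it
types = c(
    typeof(TRUE),           # "logical"
    typeof(1:3),            # "integer"
    typeof(c(1.5, 2)),      # "double"
    typeof("cat"),          # "character"
    typeof(list(1, "a")),   # "list"
    typeof(NULL),           # "NULL"
    typeof(sum),            # "builtin" (a function implemented in C)
    typeof(quote(x)),       # "symbol"
    typeof(quote(x + 1))    # "language"
)
print(types)
```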

Scope

At each instant while a program is running, the program has a state that refers to all existing objects and the values that they hold at that particular point in the program’s execution. In a computing language, name binding (also known as name resolution) refers to the way that a symbol in the source code is “bound” to a concrete value that exists in the current state.

Scoping rules describe the way in which symbols in the source code are bound to values in the environment. For simple programs, scoping rules are straightforward. An R program in a single file that calls no functions has only a single scope. Using functions makes things more complicated. Suppose our program consists of:

x = 3
f = function(y){x+y}

This environment has two variables (x and f) and two values (a double and a closure). This program defines two nested scopes: one in the body of the function f, and one at the top level. When we call f, say by evaluating f(7), we receive a value (10). When the interpreter encounters the expression x+y in the function body of f, it must find a value for x (the value for y is provided as the function argument). Since there is no variable with name x defined in the scope of the function, the interpreter searches in the enclosing scope to find a value for x. Since the symbol x is bound to the value 3 in the enclosing scope, this is the value for x that is used when evaluating the function.

Now consider the following program:

x = 3
f = function(y){x=10; x+y}

In this case, the function defines its own variable named x, which exists in the scope of the function. Again, the interpreter encounters the expression x+y and must find a value for x. Name resolution starts in the scope of the function body and already finds a value for x (x=10). The value of 3 is “out of scope”. Thus f(7) evaluates to 17.
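
One consequence worth checking is that the assignment inside f creates a new local variable and leaves the top-level x untouched. A minimal sketch, reusing the definitions above:

```r
x = 3
f = function(y){x = 10; x + y}

f(7)    # 17, using the local x
x       # 3, the top-level x is unchanged by the call
```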

Most modern computing languages follow a scoping logic called lexical scoping. In contrast, early computing languages used a different scoping logic called dynamic scoping. A few languages in modern use such as Unix shell languages are dynamically scoped, but the main reason to study dynamic scoping at this point is to better understand lexical scoping.

The following example illustrates the difference between lexical scoping and dynamic scoping. In a lexically scoped language, like R, the following code produces a function g such that g(7) evaluates to 10. The reason for this is that the variable x referenced in the body of the function f is resolved to the closest enclosing scope in the source code in which a variable named x is defined. That gives us the value x = 3. This scope is defined “lexically”, meaning that it is defined according to where f is defined in the source code, and always binds to the same variable, regardless of where f is called.

x = 3
f = function(y){y+x}
g = function(y){x=10; f(y)}

In a dynamically scoped language, g(7) would evaluate to 17, since the value of x would resolve to whatever value the symbol x is bound to in the scope where the function is called.
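
R itself is lexically scoped, but we can imitate the dynamic behavior by explicitly looking up x in the caller's frame. The sketch below does this with the base functions get and parent.frame; the names f_dynamic and g_dynamic are hypothetical and used only for illustration.

```r
x = 3

# Lexical lookup: x is resolved where f is defined (the top level)
f = function(y){y + x}
g = function(y){x = 10; f(y)}

# Emulated dynamic lookup: x is fetched from the frame of whoever called us
f_dynamic = function(y){y + get("x", envir = parent.frame())}
g_dynamic = function(y){x = 10; f_dynamic(y)}

g(7)           # 10
g_dynamic(7)   # 17
```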

When R is determining where to bind a variable, it starts in the scope of the expression where the variable appears, then moves out to the next enclosing scope, and so on, until the name can be bound. Once a scoping level where the symbol appears is reached, a binding can be made, even if the symbol appears “below” the expression involving the symbol in the source code. To clarify this, consider that in the following program, f and x appear in the same scope, so the value of x in the function call is 3, even though x has not been defined at the point where f is defined (however x must be defined before it is actually needed to produce a value).

f = function(y){y+x}  # x here will be 3, even though it is not yet defined in the source
g = function(y){f(y)}
x = 3
g(5)                  # by the time we call f, x has been defined

This, however, fails with an error (object 'x' not found):

f = function(y){y+x}
g = function(y){f(y)}
g(5)                  # at the time we call f, x has not been defined
x = 3

The upshot of all this is that the decision about where to bind a variable is based on the scope as defined from the source code, but the value of this bound variable is determined dynamically at the point of execution.
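
To make this concrete, here is a small sketch in which the same function f returns different results as the top-level x changes, and fails when x does not exist yet. It assumes a fresh session in which no variable named x has been defined; the name result is illustrative.

```r
f = function(y){y + x}

# Calling f before x exists raises an "object 'x' not found" error,
# which we catch here so that the script can continue
result = tryCatch(f(5), error = function(e) "error: x is not defined yet")
print(result)

x = 3
f(5)     # 8

x = 100
f(5)     # 105: same binding, new value
```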

One more notable feature of R’s scoping and name resolution behavior is that the binding distinguishes between function and non-function values, and passes over values of the “wrong type”. In the following example, strict lexical scoping would use the value f=10 when evaluating f(y) inside of g. But since this value of f is not a function, it is passed over in favor of the definition of f in the outer scope.

f = function(y){y}
g = function(y){f=10; f(y)}

To get very esoteric you can do this:

f = function(y){y}
g = function(y){f=10; f(f + y)}

In the expression f(f+y) that is located in the body of g, the first f will bind to the function f and the second f will bind to the 10.
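
We can confirm this behavior directly. The sketch below also shows that get accepts a mode argument that applies the same "functions only" rule to a lookup by name; the helper h is hypothetical.

```r
f = function(y){y}
g = function(y){f = 10; f(f + y)}
g(3)    # 13: the outer f is called on the value 10 + 3

# get skips the local numeric f when asked for a function
h = function(){f = 10; get("f", mode = "function")}
is.function(h())    # TRUE
```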

Occasionally it is useful to access the value to which a symbol is bound programmatically, using the string name of the symbol. The function get can be used for this purpose, for example:

get("+")
get("x")
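
A slightly longer sketch of how get might be used in practice follows. The variable names (plus, fnames, results) are illustrative; exists is a base function that tests whether a name can be resolved without raising an error.

```r
x = 3

plus = get("+")     # the addition function, looked up by name
plus(x, 5)          # 8

exists("x")                      # TRUE
exists("no_such_symbol_here")    # FALSE

# Look up several functions by name and apply each one to the same data
fnames = c("min", "max")
results = sapply(fnames, function(nm) get(nm)(c(4, 9, 1)))
results             # min = 1, max = 9
```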

Garbage collection

When a variable goes out of scope, the data it refers to can no longer be reached via that variable. The job of the garbage collector is to keep track of which data objects are still reachable, and protect their memory. Any memory that was previously allocated to objects that are no longer reachable can be “deallocated” and made available to store other objects. For example:

# After this line is executed there is one symbol and one value
# in the state
x = 3

# After this line is executed there are two symbols and two values
# in the state
y = c(3, 4)

# Immediately after this line is executed, there are two symbols
# and three values in the state.  However the value c(3, 4),
# previously bound to y, is not referenced by any variable, so it
# can be garbage collected
y = c(5, 6)

In languages with references, two variables can refer to the same underlying data object. The garbage collector must be carefully written to only deallocate memory that is not reachable through any variable. Since R only permits references in very limited circumstances, R’s garbage collector is relatively simple.
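
The following sketch illustrates both points. It assumes nothing beyond base R and the utils package: ordinary vectors behave as independent values after assignment, environments are one of the few objects that behave as references, and gc can be called to request a collection (R also runs the collector on its own, so calling gc is rarely necessary).

```r
# Ordinary values are not shared references: changing b leaves a alone
a = c(1, 2)
b = a
b[1] = 99
a[1]                       # 1

# Environments are an exception: e1 and e2 refer to the same object
e1 = new.env()
e2 = e1
assign("v", 1, envir = e2)
get("v", envir = e1)       # 1

# A large vector becomes unreachable once its only name is removed
y = numeric(1e6)
size_bytes = as.numeric(object.size(y))   # roughly 8 MB
rm(y)
invisible(gc())            # request a collection
```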

Attributes

Any object in R can possess attributes, which are key/value pairs that are part of the object’s “data”. You can see the current attributes defined for an object using the attributes function. For example, a number (which is actually a vector of length 1) has no attributes when first defined:

x = 7
attributes(x)

You can now define an attribute on x using the attr function:

attr(x, "color") = "red"

You can retrieve this attribute as follows:

color_x = attr(x, "color")

and you can remove this attribute by assigning NULL to it:

attr(x, "color") = NULL
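
Several familiar features of R objects are stored as attributes. The sketch below shows that names and matrix dimensions are ordinary attributes, and uses the base function structure to create an object and set an attribute in one step; the variable names are illustrative.

```r
x = 1:6
attributes(x)                  # NULL
attr(x, "dim") = c(2, 3)       # x is now a 2 x 3 matrix
is.matrix(x)                   # TRUE

v = c(a = 1, b = 2)
attr(v, "names")               # "a" "b"

w = structure(7, color = "red")
attr(w, "color")               # "red"
```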

S3 classes and method dispatching

R has a simple object system called “S3” and a more complex object system called “S4”. Here we will discuss the S3 object system, which mainly consists of a feature called “class-based method dispatch” or “generic function object orientation”.

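Any R object can have a class, which is simply an attribute called “class” (an object without this attribute, like the number below, reports an implicit class derived from its type). You can access the class information with the class function, e.g.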

x = 7
class(x)

An R object can have a single class, or a sequence of classes. If you are extending the functionality of an existing object, you should usually extend its class list, for example:

class(x) = c(class(x), "green")

We have now extended x to have class ‘numeric’ and class ‘green’.

Now we can create a generic function called do_something that acts differently for objects belonging to class ‘red’ and to class ‘green’. We also can define a default do_something that works for objects that are neither ‘red’ nor ‘green’.

do_something.red = function(x) { print("I am red") }
do_something.green = function(x) { print("I am green") }
do_something.default = function(x) { print("I am neither red nor green") }
do_something = function(x) { UseMethod("do_something") }

x = 7
do_something(x)

class(x) = c(class(x), "red")
do_something(x)
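
When an object has several classes, dispatch tries them in the order in which they appear in the class vector, and a method can hand over to the next candidate with NextMethod. The following is a minimal sketch; the generic describe and its methods are hypothetical names introduced only for this illustration.

```r
describe = function(obj) { UseMethod("describe") }
describe.default = function(obj) { "default" }
describe.green = function(obj) { "green" }
describe.red = function(obj) {
    rest = NextMethod()          # continue with the next class in the list
    paste("red, then", rest)
}

z = structure(1, class = c("red", "green"))
describe(z)             # "red, then green"
describe(unclass(z))    # "default"
inherits(z, "green")    # TRUE
```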

Note that if your generic function has multiple arguments, the dispatch is based by default on the class of the first argument. If you want to dispatch based on a different argument, you can pass the name of the argument’s dummy variable as a second argument to UseMethod, like this:

do_something.red = function(x, y) { print("I am red") }
do_something.green = function(x, y) { print("I am green") }
do_something.default = function(x, y) { print("I am neither red nor green") }
do_something = function(x, y) { UseMethod("do_something", y) }

Now if we create red and green objects and pass them into do_something:

x = 5
class(x) = "red"
y = 7
class(y) = "green"
do_something(x, y)

then we see that the dispatch is based on y. Although we can customize which argument controls the dispatch behavior, it must always be a single argument. In S3 classes, there is no simple way to dispatch based on the combined class information for several arguments (i.e. you cannot use this pattern to select an implementation from the types of all the arguments, as overloaded functions do in, say, C++).

You can search the namespace for class information using the methods function. For example, methods(print) will return all the concrete print methods that can be called via the print generic. Note that in some cases the concrete methods are not exported from their namespace, meaning that they will not be accessible via their name. For example, if you type

methods(t.test)

you will see that t.test is a generic function that dispatches to either t.test.default or t.test.formula. Both of these function names are displayed with a trailing “*” character, meaning that they are not exported from the namespace of the t.test function. You can determine the namespace of t.test by typing t.test at the prompt; the printed output ends with <environment: namespace:stats>, so the namespace is stats. Therefore, we can access, say, t.test.default via

stats:::t.test.default

We can also use the methods function to get the methods that are defined for a given class. For example, methods(class="lm") will list all the methods defined for class lm.
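
As an alternative to the ::: operator, the utils package provides getS3method, which retrieves a method by the names of its generic and class whether or not the method is exported. A brief sketch (the variable names ms and m are illustrative):

```r
# All methods registered for class "lm"
ms = utils::methods(class = "lm")
length(ms) > 0      # TRUE

# Retrieve the unexported default method of t.test
m = utils::getS3method("t.test", "default")
is.function(m)      # TRUE
```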

Functions and lazy evaluation

Essentially everything that happens in an R program involves calling a function. For example, binary operators like “+” are functions with “syntactic sugar” to place the operator between its arguments. This is true of many languages, but somewhat unusually in R you can also call the operator as a function:

'+'(3, 5)
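
Because operators are ordinary functions, they can be passed to other functions. The sketch below quotes the operator names with backticks, which is the more common spelling; it also shows that indexing and even if are functions. The variable names are illustrative.

```r
`+`(3, 5)                           # 8

doubled = sapply(1:3, `*`, 2)       # calls `*`(i, 2) for each i
doubled                             # 2 4 6

total = Reduce(`+`, 1:4)            # ((1 + 2) + 3) + 4
total                               # 10

second = `[`(c(10, 20, 30), 2)      # same as c(10, 20, 30)[2]
second                              # 20

`if`(TRUE, "yes", "no")             # "yes"
```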

R function arguments are evaluated using lazy evaluation. This means that when an expression is used as a function argument, it is not actually evaluated at the point of entry to the function. Instead, it is passed as an unevaluated expression called a promise which is then evaluated when (and if) the argument is actually used in the function.

The following example illustrates how this works. In this code, we seem to be evaluating the function h, and passing its result as the argument to the function f.

f = function(x) { 1 }
f(h())

However, we have never defined the function h, so we might expect to get an error when calling f. In most languages, function arguments are evaluated eagerly, meaning that they are evaluated at the point of entry to the function. But in R, the code h() is passed as a promise (an unevaluated expression) to the function f. In this case, since x is never referenced inside the body of f, the promise is never evaluated.

Now suppose we rewrite f so that the argument x is referenced, which forces it to be evaluated:

f = function(x) { x; 1 }
f(h())

In this case, we get an error, because the expression x in f forces evaluation of the promise h(), which contains undefined symbols (namely, h).
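
We can observe the moment at which a promise is evaluated by giving the argument a side effect. In the sketch below, the helper note and the vector events are hypothetical names used to record the order of events. The second part shows a related consequence: a default argument is also a promise, so it sees the values that exist when it is first used.

```r
events = character(0)
note = function(msg) { events <<- c(events, msg) }

f = function(x) { note("entered f"); x; note("leaving f"); 1 }
f({ note("argument evaluated"); 42 })
events    # "entered f" "argument evaluated" "leaving f"

# The default for b is evaluated when b is first used, after a has changed
g = function(a, b = a * 2) { a = 10; b }
g(1)      # 20
```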

Since function arguments are evaluated lazily, the body of a function can access the expressions used to define the function arguments. For example, the function

f = function(x) { substitute(x) }

returns the “language object” represented by the argument x (we will talk more about the substitute function below, but here it is used to prevent the language object x from being evaluated). To better understand this, consider the following code:

y = 3
f(2*y + 1)

In most computing languages, the expression 2*y + 1 would be immediately evaluated to produce a value of 7. This evaluation would take place prior to calling the function f. Then this 7 would be passed into f as the dummy argument x. In this case, inside of f we would have no way of knowing how the value x = 7 was obtained.

However in a language like R with lazy evaluation, the expression 2*y + 1 is passed into f as a promise. We can access the code of this promise using the substitute function. If we return the value produced by substitute, we obtain the expression itself, rather than the value it produces after being evaluated.

The result of calling substitute is a “language object” (one of the 24 basic R types). It prints as a string, but there is actually a lot more to it. To convert a language object to a string representation of the expression, use the deparse function. This is useful, for example, if you want to label the axes of a graph with the expressions used to produce the data being graphed:

my_plot = function(x, y) {
    plot(x, y, xlab=deparse(substitute(x)), ylab=deparse(substitute(y)))
}

x = rnorm(100)
my_plot(x, x^2)
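
A language object can also be inspected without plotting anything. The following sketch captures an expression and looks at its type, its string form, and its components; the function name capture is hypothetical, and note that y does not need to exist because the expression is never evaluated.

```r
capture = function(x) { substitute(x) }

e = capture(2*y + 1)
typeof(e)       # "language"
class(e)        # "call"
deparse(e)      # "2 * y + 1"

# A call behaves like a list: the function first, then its arguments
e[[1]]          # the symbol `+`
e[[3]]          # 1
```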

The substitute function produces a language object but does not evaluate it. As shown above, we can then convert this language object to a string, but there are other things that we can do with a language object. As the name suggests, substitute can also be used to substitute values for variables into an expression. For example,

substitute(2*x + 3*y, list(x=4, y=2))

returns the language object 2*4 + 3*2. The list argument to substitute is providing an environment in which the substitutions take place. There is also an R object called an “environment” that could be used in place of the list argument.

Finally, if you have a language object that you would like to actually evaluate, you can do so with the eval function, e.g.

m = substitute(2*x + 3*y, list(x=4, y=2))
eval(m)
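
The eval function can also supply the variable values at evaluation time, through a list or an environment, so that the same language object can be evaluated under different bindings. A minimal sketch follows; it uses the base function quote, which returns its argument unevaluated, and the names e, env and s are illustrative.

```r
e = quote(2*x + 3*y)

# Evaluate with values taken from a list
eval(e, list(x = 4, y = 2))     # 14

# Evaluate with values taken from an environment
env = new.env()
assign("x", 1, envir = env)
assign("y", 1, envir = env)
eval(e, env)                    # 5

# substitute accepts the same environment in place of a list
s = substitute(2*x + 3*y, env)
deparse(s)                      # "2 * 1 + 3 * 1"
```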

As another example,

f = function(x) { eval(substitute(x)) }

is equivalent, as long as the argument expression evaluates the same way inside f as it does in the caller, to

f = function(x) { x }
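
The two versions can differ when the argument refers to a variable that is local to the caller, because eval with no second argument evaluates the expression in the frame of the function that calls eval. The sketch below shows such a case and one possible fix using parent.frame; all function and variable names here are hypothetical.

```r
f_eval = function(x) { eval(substitute(x)) }    # evaluates in f_eval's own frame
f_plain = function(x) { x }                     # forces the promise in the caller's frame

caller = function() {
    local_value = 5
    plain = f_plain(local_value + 1)
    # local_value is not visible from inside f_eval, so this raises an error
    via_eval = tryCatch(f_eval(local_value + 1), error = function(e) NA)
    c(plain = plain, via_eval = via_eval)
}
caller()          # plain = 6, via_eval = NA

# Evaluating in the caller's frame restores the expected behavior
f_fixed = function(x) { eval(substitute(x), parent.frame()) }
caller_fixed = function() { local_value = 5; f_fixed(local_value + 1) }
caller_fixed()    # 6
```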

Exercise

To test your ability to read and understand R code, take a look at the built-in chisq.test function. First, type chisq.test at the prompt to see the source code for this function, and see what you recognize and what looks unfamiliar. Then save this source into a text file using

cat(deparse(chisq.test), file="my_chisq.R", sep="\n")

Next you should edit the top of this file to read mychisq <- function... and save the file, since the source doesn’t actually assign the function to a variable. You can now use

source('my_chisq.R')

to load the script, and you can then use mychisq in place of the built-in chisq.test (with one caveat that we will discuss below).

If you get stuck figuring out what a particular line in the source code does, you can add a call to the browser function (i.e. add browser() on its own line to the source in my_chisq.R). Then run source('my_chisq.R') again and call your mychisq function. The execution will now stop at the point where you inserted the browser call, and you can inspect all the variables in the local state.

The one technical issue that you need to address is that this line:

tmp <- .Call(C_chisq_sim, sr, sc, B, E)

calls a C library function C_chisq_sim which cannot be called from outside its home package. To address this, simply change this line to read

tmp <- .Call(stats:::C_chisq_sim, sr, sc, B, E)

Resources

R language definition

Documentation for R internals

A technical assessment of the R language

Hadley Wickham’s Advanced R

An early paper by the developers of R discussing the origins and design of the language