Statistics 506, Fall 2016

Data formats and data structures


Memory and addressing

The physical storage of data on a computer is always in a binary format. A single binary value is called a bit, and 8 bits taken together form a byte. A byte can hold numerical values from 0 to 255 when interpreted as an unsigned integer.
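As a small illustration of the range of a byte, the following Python sketch writes a few values as 8 binary digits and checks that 256 does not fit in a single byte. The variable names are only for illustration.

```python
# A byte is 8 bits, so it can take 2**8 = 256 distinct values (0 through 255).
n_values = 2 ** 8
largest = n_values - 1

# format(..., "08b") writes an integer using 8 binary digits.
bits_of_largest = format(largest, "08b")   # all eight bits set
bits_of_five = format(5, "08b")            # 4 + 1

# The bytes type only accepts values that fit in a single byte.
b = bytes([0, 5, 255])
try:
    bytes([256])
    fits = True
except ValueError:
    fits = False

print(n_values, largest, bits_of_largest, bits_of_five, list(b), fits)
```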

Most processors are designed to operate on multi-byte values called words. Most modern computers use 8 byte words.

Every location in the computer’s memory has a memory address. The size of a memory address is usually the word size (8 bytes is the current standard).
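One way to check the address size on the machine you are using is to ask Python for the size of a C pointer. This is only a sketch: the result depends on the platform and on how the interpreter was built, and is 8 on a typical 64-bit system.

```python
import struct
import sys

# Size in bytes of a C pointer (a memory address) for this interpreter.
pointer_bytes = struct.calcsize("P")

# sys.maxsize is the largest value of a signed integer of that same width.
largest_index = sys.maxsize

print(pointer_bytes, largest_index)
```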

Modern systems use a virtual memory model for application processes in which the application works with virtual addresses rather than directly with physical memory addresses. The virtual addresses are then converted to physical addresses by the operating system, working together with the processor's memory management hardware. The details are complex and involve memory pages.

Memory available to an application process

When a process is started it is granted a chunk of memory to use. The operating system generally prevents the application from accessing memory outside this assigned address space (attempts to do so should yield a “segmentation fault”). This is called memory isolation. In some special situations two or more applications can share memory, but this has to be specifically requested by the processes.

Applications can request additional pages of memory if more memory is needed than is initially granted. Swapping occurs when the operating system moves pages from primary memory to secondary storage (e.g. a hard drive or SSD), and brings them back when they are needed again. This may occur if a process using a lot of memory is inactive or if main memory is exhausted.

The memory available to an application process consists of two distinct parts. The main difference between the two parts is how the data are organized within each of them.

  • The stack is a relatively small amount of memory that is organized as a data structure called a stack. A stack is a data structure in which objects can only be added to or removed from the end of the structure. The underlying storage for the stack is usually a contiguous block of memory. The stack is generally used for smaller data values that are needed for a relatively short time, such as local variables or function arguments. Access to the stack is very fast because it is facilitated directly by the CPU. Each process (or technically each thread of a process if it is multithreaded) has its own stack, and the maximum stack size is typically fixed when the application begins executing (it cannot be grown beyond that limit).

  • The heap contains everything that is not in the stack. Despite its name, the memory heap is not organized as the tree-like data structure that is also called a heap. It is simply a pool of memory from which blocks can be allocated and released in any order. The underlying storage for heap variables is contained in memory pages that can be added and removed as needed. The physical memory for the heap is therefore not necessarily contiguous.
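The "add or remove only at the end" behavior of a stack can be sketched with a Python list. This illustrates the stack data structure only. It does not show how the operating system manages the call stack.

```python
# A Python list used as a stack: items are added and removed only at the end.
stack = []
stack.append("a")   # push
stack.append("b")
stack.append("c")

top = stack.pop()      # removes the most recently pushed item
second = stack.pop()   # then the one pushed before it

print(top, second, stack)
```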

When programming in C or C++ there is syntax support in the language for placing data on either the stack or on the heap. Interpreted languages like R, Stata, Python, etc. place an abstraction layer between the user’s source code and the underlying OS-level memory management.

Representation of text data

Traditionally, text data on a computer was stored in 1 byte values (using 1 byte per character), using the ASCII encoding to convert the bit patterns (interpreted as unsigned integers) to characters. ASCII covers the 26 letters of the Latin alphabet used to write English (lower and upper case), plus numerical digits, punctuation, and a few other special characters. As non-western languages became more widely used in computing, various two-byte encoding systems were developed. Dozens of one- and two-byte encoding systems came into use, and it became quite problematic to handle them seamlessly in applications.

A string of text is simply a byte sequence that is interpreted using one of these encoding systems. However, text strings are not self-describing: they do not contain a header or other attached meta-data, so an application or user holding text data has to know what the encoding is. Also, using different encodings to represent different languages means that it is not always possible to have a single string of text that contains characters from multiple languages.
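To see why the encoding has to be known, the sketch below decodes the same single byte under two different one-byte encodings (Latin-1 and Windows-1251, chosen here only as examples) and obtains two different characters.

```python
raw = bytes([0xE9])   # one byte with the value 233

as_latin1 = raw.decode("latin-1")   # Latin small letter e with acute accent
as_cp1251 = raw.decode("cp1251")    # Cyrillic small letter short i

# The byte is outside the ASCII range, so decoding it as ASCII fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(hex(ord(as_latin1)), hex(ord(as_cp1251)), ascii_ok)
```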

More recently, almost all text is written using Unicode with the UTF-8 encoding system. Unicode assigns a numeric code point to each character, and UTF-8 stores each code point using a variable length coding of 1 to 4 bytes. Over one million code points (characters in various languages plus many typographical symbols) can be represented in Unicode, including essentially all characters from most modern and extinct human languages. Most web documents have by now transitioned to Unicode encoding.

The variable length encoding in UTF-8 is implemented via a continuation pattern. If the first bit of a byte is 0, the byte is a complete one-byte (ASCII) character. If the first two bits are 11, the byte is the first byte of a multi-byte code point. If the first two bits are 10, it is a continuation byte.
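The bit patterns can be inspected directly. The following sketch encodes four sample characters (chosen only for illustration) that need 1, 2, 3, and 4 bytes in UTF-8, and counts code points by skipping continuation bytes. A single-byte character starts with a 0 bit, the first byte of a longer sequence starts with 11, and the remaining bytes start with 10.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

def count_code_points(data):
    """Count code points in UTF-8 bytes by skipping continuation bytes."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a code point.
    return sum(1 for b in data if (b & 0b11000000) != 0b10000000)

# Latin a, e with acute accent, euro sign, and an emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(ch.encode("utf-8").hex(), utf8_bits(ch))

text = "".join(samples)
n_bytes = len(text.encode("utf-8"))                  # 1 + 2 + 3 + 4
n_points = count_code_points(text.encode("utf-8"))   # one per character
```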

Variables

As noted above, a single number or character can be stored in a fixed-size object of 1-8 bytes in length. These are often called “primitive” or “scalar” values.

Every programming language has “variables”, but the semantics of how they behave can differ. In any standard language, you can create a variable and assign a value to it:

x = 4
y = "a"

A variable is a symbol (i.e. the variable’s name), linked to a small region of memory used to store the variable’s value. In statically typed programming languages (like C, with some exceptions), every variable must be given a type, which restricts the values that the variable can hold. Other languages (like R, Python, or Matlab) allow a variable to hold a value of any type, and usually also allow a variable to hold different types of values at different times. This is the difference between a statically typed language and a dynamically typed language. The terms “strongly typed” and “weakly typed” refer to a different property, namely how readily a language implicitly converts values from one type to another.
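In Python, for instance, the same name can hold values of different types at different times, yet the language still refuses to combine unrelated types implicitly. A minimal sketch:

```python
x = 4
first_type = type(x).__name__    # the name x currently refers to an integer

x = "a"
second_type = type(x).__name__   # the same name now refers to a string

# Adding an integer to a string is rejected rather than silently converted.
try:
    4 + "a"
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(first_type, second_type, mixed_ok)
```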

Another important issue arises when we assign a variable a value taken from another variable:

x = 4
y = x

Since “4” is a primitive value, in any standard language the assignment to y above will be assignment “by value”. This means that if we subsequently change the value of y, i.e.

y = y + 1

the value of x is not affected (i.e. the value of y will become 5, but the value of x will still be 4). The underlying reason for this is that x and y are backed by distinct memory locations, so writing into the memory for y does not affect the memory for x. This is called value semantics.
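The same steps can be run in Python. Python reaches this result somewhat differently (integers are immutable objects and y is rebound to a new one), but the observable behavior matches the description above.

```python
x = 4
y = x        # y starts out with the same value as x
y = y + 1    # changing y afterwards ...

print(x, y)  # ... leaves x unchanged
```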

Compound data and copy/reference semantics

A compound data value is any value that is not primitive. For example, arrays, matrices, lists, file handles, and functions can all be viewed as being compound data. In many programming languages, variables holding compound values are assigned “by reference”. This means that if we create an array

 x = [3, 1, 2]

then assign this array to a new variable

 y = x

and finally change the value of one element of the array through the new variable

 y[1] = 99

then this change is visible through both x and y. This is called reference semantics.

The underlying reason that compound data structures have reference semantics is that the memory backing the variable does not hold the array itself, it simply holds a reference to the array data. For example, x could hold the memory address where the array is stored. When we assign x to y (i.e. y = x), the value of y is the same memory address that holds the data of x. As a result, when we change the contents of y we automatically change the contents of x.
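Python lists behave this way, so the example can be run as written. The sketch also shows one way to get an independent copy when that is what is wanted. Here z is a name introduced only for illustration.

```python
x = [3, 1, 2]
y = x              # y refers to the same list object as x
y[1] = 99          # modify the list through y

same_object = x is y   # both names refer to one underlying list

z = list(x)        # an explicit copy with its own storage
z[0] = -1          # does not affect x

print(x, y, z, same_object)
```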

If you know any C or C++ then you should be familiar with the way in which pointers give rise to reference semantics. However note that pointers are an implementation detail, whereas reference semantics is a behavior that can be implemented in various ways.

Assignments in R behave as if they are always copies. It is not easy to obtain a reference in R (i.e. it is not easy to have two variables link to the same underlying data or memory). It can be achieved using environments, which we will consider later.

R uses “copy on write” to implement its copy semantics. This means that when you have an assignment x = y, initially x and y are references to the same object, but if you ever try to change the value of either x or y, a copy is generated at that point (not at the point of assignment). Thus the value of y is not affected when you change x (or vice versa).
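The idea of copy on write can be sketched in Python with a small wrapper class. CowVector and its methods are hypothetical names invented for this illustration. This is not how R is implemented internally, and a real implementation would track sharing more carefully (for example with reference counts).

```python
class CowVector:
    """Hypothetical copy-on-write wrapper around a list (illustration only)."""

    def __init__(self, data):
        self._data = data       # underlying storage, possibly shared
        self._shared = False    # True if another CowVector may use _data

    def assign(self):
        """Mimic an assignment: return a new variable sharing the storage."""
        other = CowVector(self._data)
        self._shared = True
        other._shared = True
        return other

    def set(self, i, value):
        if self._shared:
            # The copy is made here, at the first write, not at assignment.
            self._data = list(self._data)
            self._shared = False
        self._data[i] = value

    def get(self, i):
        return self._data[i]

    def shares_storage_with(self, other):
        return self._data is other._data


x = CowVector([3, 1, 2])
y = x.assign()                             # no copy yet
shared_before = x.shares_storage_with(y)
y.set(1, 99)                               # triggers the copy
shared_after = x.shares_storage_with(y)

print(shared_before, shared_after, x.get(1), y.get(1))
```

In this simplified version a variable stays flagged as shared even after its partner has copied, so it may make one unnecessary copy on its own first write. That is wasteful but still gives the correct copy semantics.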

Data structures

Next we will give a high level overview of a few very important data structures that can be used in many different programming languages. Later we will discuss the specific data structures that are commonly used in R.

Arrays

A basic data structure is an array (also sometimes called a vector). An array holds an ordered sequence of values that all have the same type. As a result, the storage for the array can be contiguous in memory and it is very fast to jump to an arbitrary value in the array. For example, suppose you want to store a sequence of 100 double precision values in R in an array. The data for the array will be stored in an 800 (= 100 * 8) byte contiguous block of memory.
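The arithmetic can be checked with Python's array module, used here as a stand-in for an R numeric vector. The sketch assumes a platform where a C double occupies 8 bytes, which is the usual case.

```python
import struct
from array import array

a = array("d", [0.0] * 100)   # "d" means C double precision values
a[42] = 3.5

total_bytes = a.itemsize * len(a)   # 8 bytes per value times 100 values

# Because the storage is contiguous, element k starts at byte k * itemsize.
raw = a.tobytes()
value_at_42 = struct.unpack_from("d", raw, 42 * a.itemsize)[0]

print(a.itemsize, total_bytes, len(raw), value_at_42)
```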

An array is an example of a homogeneous data structure, since it holds data values that all have the same type.

Multidimensional arrays (e.g. matrices) are also homogeneous data structures that can be stored in contiguous memory. The array must be vectorized to pack the values into a sequence. For example, an m x n two-dimensional array (i.e. matrix) A can be packed two ways, described here using zero-based indices and positions: “row-wise”, where A[i, j] is stored in position n*i + j of the sequence, or “column-wise”, where A[i, j] is stored in position m*j + i of the sequence.
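The two packing rules can be written out as a short sketch. Indices and positions are counted from zero, and the function names and the 2 x 3 example matrix are chosen only for illustration.

```python
def row_major_pos(i, j, m, n):
    """Position of A[i, j] when an m x n matrix is packed row by row."""
    return n * i + j

def col_major_pos(i, j, m, n):
    """Position of A[i, j] when an m x n matrix is packed column by column."""
    return m * j + i

A = [[1, 2, 3],
     [4, 5, 6]]
m, n = 2, 3

row_packed = [None] * (m * n)
col_packed = [None] * (m * n)
for i in range(m):
    for j in range(n):
        row_packed[row_major_pos(i, j, m, n)] = A[i][j]
        col_packed[col_major_pos(i, j, m, n)] = A[i][j]

print(row_packed)
print(col_packed)
```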

A technique called indirection can be used to store inhomogeneous values in an array. For example, we could have an array such as

[1, 3.5, [-2, 0], "cat"]

This array contains four values: an integer, a floating point value, another array, and a string. Since the storage sizes and format for these four elements are different, it is not convenient to pack the data into a contiguous block of memory. Instead, we can take a reference to each element and store the references in the array. The reference values themselves are either memory addresses or pointers of some type (generally 8 byte objects), hence an array of references is itself a homogeneous sequence, even if the data pointed to by the references is inhomogeneous.

Since this type of array contains references to the data rather than the data themselves, the use of indirection makes access somewhat less efficient than in the case of arrays that do not use indirection (but processors are optimized for working with memory addresses, so the cost is not very great). For example, if we want to read the value in position 3 of the array using indirection, we first extract the reference value in position 3 of the array, which points to a location in memory, then we jump to that location in memory and extract the actual value of interest. Note that we can have multiple layers of indirection, for example, if an element in the array is itself an array.
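Python lists work this way: the list stores references to its elements rather than the elements themselves. The sketch below builds the example array and shows that the nested array is reached through a reference. The name inner is introduced only for illustration.

```python
inner = [-2, 0]
arr = [1, 3.5, inner, "cat"]

element_types = [type(v).__name__ for v in arr]

# arr holds a reference to the nested list, not a copy of it.
same = arr[2] is inner

# A change made through the other name is therefore visible through arr.
inner[0] = 7
via_arr = arr[2][0]   # two steps: follow the reference, then index into it

print(element_types, same, via_arr)
```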

Lists

A doubly linked list, which we will simply call a “list”, takes the notion of indirection a step further. The core element of a doubly linked list is a “node” of three values (previous, next, value), where value is the data value of a single element of the list, previous is the memory location of the previous node in the list, and next is the memory location of the subsequent node in the list.

Lists of this type are much slower to index than arrays. If we want to access the value in position 100 of a doubly linked list, we must start at the beginning and jump from node to node until we reach the 100th node. On the other hand, if we want to access the value in position 100 of an array whose values use 8 bytes of storage, we simply jump straight to byte 792 (= 99 * 8, counting bytes from zero) in the array and retrieve the next 8 bytes of data.

An advantage of the list structure is that it is easy to delete or insert values without copying the entire list. To delete the value in position 10, simply adjust the next field of node 9 to point to node 11, and adjust the previous field of node 11 to point to node 9. The memory for node 10 can then be reclaimed. A similar process allows new nodes to be inserted.
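A minimal sketch of such a list, with the node-to-node walk and the deletion step described above. The class and function names are hypothetical, positions are counted from 1 as in the text, and delete handles only interior nodes to keep the example short.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.previous = None
        self.next = None

def build(values):
    """Link the values into a chain and return the first node (or None)."""
    head = None
    tail = None
    for v in values:
        node = Node(v)
        if tail is None:
            head = node
        else:
            tail.next = node
            node.previous = tail
        tail = node
    return head

def node_at(head, position):
    """Walk from the head to the node at a 1-based position."""
    node = head
    for _ in range(position - 1):
        node = node.next
    return node

def delete(node):
    """Unlink an interior node by rewiring its two neighbours."""
    node.previous.next = node.next
    node.next.previous = node.previous

def to_list(head):
    out = []
    node = head
    while node is not None:
        out.append(node.value)
        node = node.next
    return out

head = build(range(1, 13))    # nodes holding the values 1 through 12
delete(node_at(head, 10))     # node 9 now points to node 11 and vice versa
print(to_list(head))
```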

Associative arrays

Another very important data structure is variously called a “map”, a “hash table”, an “associative array”, or a “dictionary”. Semantically, these are all maps from “keys” to “values”. The keys can ordinarily be any primitive type, and the values can generally be any type at all. For example, the following is a dictionary expressed in Python syntax:

{"Canada" : "Ottawa", "France" : "Paris", "Japan" : "Tokyo"}

This dictionary maps country names to the name of their capital city. It is a simple map from strings to strings. In some languages, we can also have a map with heterogeneous value types:

{1 : "snake", "a" : [1, 2, 3], 45.2 : {1 : 2, 2 : 4, 3 : 9}}

The key feature of an associative array is that locating a single element should require a fixed amount of time regardless of the size of the dictionary (or at worst should be logarithmic in the size). This requires the use of “hashing” algorithms (which give roughly constant lookup time) or trees (which give logarithmic lookup time) to map the keys to memory addresses.
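A rough sketch of the hashing approach is given below. TinyHashMap is a hypothetical class written for this illustration, and it is far simpler than Python's built-in dict. The hash of a key selects one bucket, so a lookup only examines the few pairs in that bucket, not the whole table.

```python
class TinyHashMap:
    """Hypothetical fixed-size hash table with chaining (illustration only)."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash value of the key determines which bucket to look in.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for k, pair in enumerate(bucket):
            if pair[0] == key:
                bucket[k] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)


capitals = TinyHashMap()
capitals.put("Canada", "Ottawa")
capitals.put("France", "Paris")
capitals.put("Japan", "Tokyo")
print(capitals.get("France"))
```

A practical hash table would also grow the number of buckets as items are added, so that each bucket stays short.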

There is an important type of database called a key/value store. A dictionary is a simple in-memory implementation of a key/value store.