Python concepts for programming with data

Goals

This document briefly reviews some of the Python concepts that are most useful and relevant when programming with data. It is not a comprehensive summary of the Python language (many such resources are available). It should be especially useful to people who are comfortable using R or Matlab, and want to be able to do some of their work in Python.

The core language and its packages

Python is a general purpose programming language. The core language is useful for many data management tasks. But it is the libraries that make Python a powerful language for statistical analysis and other forms of advanced numerical computing. The central library for this type of work is NumPy, which provides efficient storage and manipulation of numerical and other homogeneously-typed arrays. Many other libraries are also useful for numerical computing with Python. This document considers only the core Python language and NumPy.

Python is currently undergoing a transition from version 2 to version 3. Most people using Python for data and numerical programming are still using version 2. This document was developed and tested using Python version 2.7. Most of it should also apply to earlier 2.x releases, but some features used below (set literals, and dictionary and set comprehensions) require version 2.7.

Indentation

Python uses indentation (tabs or spaces) to define blocks of code. The line preceding each indented block terminates with a colon (:). Most other modern programming languages use braces to define code blocks.
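
As a small illustration of how indentation delimits blocks, here is a sketch using a hypothetical function named classify (it is not part of any library). The function body is indented one level, and the bodies of the if and else branches are indented one level further.

```python
def classify(n):
    # Everything indented under "def" is the function body
    if n % 2 == 0:
        # Indented under "if": runs only when n is even
        label = "even"
    else:
        # Indented under "else": runs only when n is odd
        label = "odd"
    # Back at the function-body level: runs in both cases
    return label

result = [classify(n) for n in (3, 4)]  # ["odd", "even"]
```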

Numbers and arithmetic in Python

Python has four numeric types (NumPy adds several more), but usually you will be using just two of them: integers (type int) and floats (type float). Literal numbers in Python that look like integers are treated as such. Hence, x = 3 creates a variable named x that holds the integer value 3. If you want x to be a float, define it as x = 3., x = 3.0, or x = float(3).

In Python 2, division of integers returns an integer result, so the value of y in the expression y = 3 / 5 is zero.

The exponentiation operator in Python is **, not ^.
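
The sketch below shows one way to get a floating point quotient from two integers, and illustrates that ^ is a valid but different operator (bitwise exclusive-or) rather than exponentiation. It uses // for explicit floor division, so the values in the comments should be the same under Python 2.7 and Python 3.

```python
q_floor = 3 // 5         # Floor division of integers: 0
q_float = float(3) / 5   # Convert one operand to float first: 0.6
power = 2 ** 10          # Exponentiation: 1024
xor = 2 ^ 10             # Bitwise exclusive-or, not a power: 8
```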

Python has a subset of C's in-place arithmetic operators:

x += y ## Add y to x in place
x -= y ## Subtract y from x in place
x *= y ## Multiply x by y in place
x /= y ## Divide x by y in place

There are no autoincrement/decrement operators (++ or --) in Python.

See the Python documentation for more information about numeric values and arithmetic in Python.

Strings in Python

Strings (of text) can be enclosed in single quotes or double quotes; the meaning is the same:

name = "Fred"
name = 'Fred'  ## The same

You can concatenate strings using +:

place = "Ann Arbor" + ", " + "Michigan"

Strings can be replicated using *:

## v will hold the string 'abcabcabc'
v = "abc" * 3

The in operator tests whether a substring is contained within a string:

## This is True
"grapes" in "apples, grapes, oranges"

There are many useful string methods in Python; see the Python documentation for details.
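
To give a flavor of the string methods, here is a brief sketch that cleans up and splits a line of text. The example string is made up for illustration; the methods used (strip, split, upper, replace, join, startswith) are all standard.

```python
line = "  Ann Arbor, Michigan, USA \n"

clean = line.strip()               # Remove leading/trailing whitespace
fields = clean.split(", ")         # ['Ann Arbor', 'Michigan', 'USA']
shout = fields[1].upper()          # 'MICHIGAN'
swapped = clean.replace("USA", "United States")
joined = " | ".join(fields)        # 'Ann Arbor | Michigan | USA'
starts = clean.startswith("Ann")   # True
```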

Basic data structures

Python has three main built-in data structures (lists, dictionaries, and sets), all of which are heavily used in nearly all types of programming.

Lists

A list holds a sequence of arbitrary data types. Literal lists are defined using square brackets:

L = ["a", "egg", 35]

Lists are not "flattened", so [1, 2, [3, 4]] is a list of length 3, and is not the same as [1, 2, 3, 4] which is a list of length 4.

All indexing in Python is zero-based, so the first element of L is L[0] (below we will discuss indexing in more detail).

There are many useful built-in list methods, such as append, index, extend, remove, and pop. A method is a function that is called using the notation object.method(argument). For example:

A = [5, 3, 1, 2]  ## Create a list called "A" with 4 elements
A.append(35)      ## A now has 5 elements
A.index(3)        ## Returns 1

The in operator tests whether something is an element of a list:

5 in [2, 3, 6, 5, 4]               ## Returns True
"dog" in ["cat", 27, "elephant"]   ## Returns False

The range function is used to build lists of regularly spaced integer values (i.e. arithmetic sequences). The NumPy analogue of range is np.arange.

Two lists can be concatenated using the + operator:

In [53]: [1,2,3] + [4,5]
Out[53]: [1, 2, 3, 4, 5]

A variant of a list is a "tuple", which is defined using parentheses instead of square brackets:

L = ("a", "egg", 35)

Tuples behave similarly to lists in most ways, but a key difference is that they are "immutable". This means that once you create a tuple, it cannot be changed.
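
A short sketch of what immutability means in practice: reading from a tuple works as for a list, but assigning to an element raises a TypeError. To "modify" a tuple you build a new one.

```python
T = ("a", "egg", 35)
first = T[0]             # Indexing works as for lists: "a"

try:
    T[0] = "b"           # Item assignment is not allowed on a tuple
    changed = True
except TypeError:
    changed = False      # We end up here

U = T + ("new",)         # Builds a new tuple; T itself is unchanged
```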

Dictionaries

A dictionary in Python is a map from keys to values. The keys can be any immutable type (usually a string or number), and the values can be any Python type. A dictionary in Python is more or less equivalent to what is called a "map", "hash table", or "associative array" in other languages.

You can create a literal dictionary using braces, with colons used to separate the key/value pairs.

H = {"dog": 36, "cat": [1,4,5], "elephant": "mammal"}

Since tuples are immutable, they can be dictionary keys:

H = {(1,2): "A", (5,4): "C"}

You can retrieve the keys and values of a dictionary using the keys and values methods:

In [54]: H = {1: "A", 2: "B"}

In [55]: H.keys()
Out[55]: [1, 2]

In [56]: H.values()
Out[56]: ['A', 'B']
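
The most common dictionary operations are looking up, inserting, and testing for keys. Here is a minimal sketch; sorted is used so that the result does not depend on the order in which the keys happen to be stored.

```python
H = {"dog": 36, "cat": [1, 4, 5]}

weight = H["dog"]                 # Look up a value by key: 36
H["elephant"] = "mammal"          # Insert a new key/value pair
has_cat = "cat" in H              # Membership tests the keys: True
missing = H.get("camel", None)    # Default instead of a KeyError: None
names = sorted(H.keys())          # ['cat', 'dog', 'elephant']

# Collect the key/value pairs in sorted key order
pairs = [(k, H[k]) for k in names]
```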

Dictionaries and lists can be nested in arbitrary ways:

## A list containing two dictionaries
L = [{1: 1, 2: 4, 3: 9}, {1: 1, 2: 8, 3: 27}]

## A dictionary whose values are lists
H = {"cat": [1,2,3], "dog": [7,4,5]}

## A dictionary whose values are dictionaries
D = {"cat": {"name": "Socks", "color": "black"}, "dog": {"name": "Daisy", "color": "brown"}}

Sets

A set in Python is an unordered collection of distinct values:

S = {"cat", "dog", "elephant"}    ## Create a literal set containing three strings
S.add("cat")                      ## Nothing changes
"camel" in S                      ## Returns False
S.add("camel")                    ## Now camel is in the set
"camel" in S                      ## Returns True
S.remove("dog")                   ## Dog is no longer in the set

Values in sets must be immutable, so tuples can be elements of sets, but lists cannot.
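
The following sketch shows this restriction in action, along with a common use of sets: removing duplicates from a list.

```python
S = set()
S.add((1, 2))            # A tuple is immutable, so this works

try:
    S.add([1, 2])        # A list is mutable, so this fails
    added_list = True
except TypeError:
    added_list = False   # We end up here

unique = sorted(set([3, 1, 3, 2, 1]))   # Duplicates removed: [1, 2, 3]
```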

Standard set theoretic operations can be performed on sets:

In [57]: {1, 2, 3, 5} & {2, 3}
Out[57]: set([2, 3])

In [58]: {1, 2, 3, 5} - {2, 3}
Out[58]: set([1, 5])

In [59]: {1, 2, 3, 5} - {2, 4}
Out[59]: set([1, 3, 5])

In [60]: {1, 2, 3, 5} | {2, 4}
Out[60]: set([1, 2, 3, 4, 5])

Indexing and slicing

All indexing in Python and in the standard Python libraries is zero-based. The first element of the list X is X[0] not X[1], and the upper left corner of the matrix X is X[0,0]. This applies to both standard Python data structures, and NumPy data structures.

Negative indices take values from the end of the list, e.g. X[-1] is the last element in the list, X[-2] is the second to last element in the list, and so on.

Slice indexing starts with the first value in the slice, and ends with the position after the last value in the slice. So if X = [3,1,6,4,7,9,8], then X[2:5] is [6,4,7].
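
Here is the same list worked through as a short sketch. Note that the length of a slice X[a:b] is b - a, which is one reason the end position is excluded.

```python
X = [3, 1, 6, 4, 7, 9, 8]

last = X[-1]             # 8
second_last = X[-2]      # 9
middle = X[2:5]          # [6, 4, 7]: positions 2, 3, and 4
n = len(X[2:5])          # The length is stop - start = 3
tail = X[-3:]            # The last three elements: [7, 9, 8]
```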

You can leave out the first or last index of a slice, in which case the slice starts at the beginning of the list or runs to the end of the list, respectively:

X = [3,5,7,2,9]
X[3:]            ## Returns [2, 9]
X[:2]            ## Returns [3, 5]

Slices can have a "stride", for example

X = range(20,130) ## All integers between 20 and 129
print X[10:20:2]  ## Every second element of X between indices 10 and 20
print X[:20:2]    ## Every second element of X from the beginning to index 20
print X[20::2]    ## Every second element of X from index 20 to the end
print X[::2]      ## Every second element of X

If you want to reverse a list, index it using [::-1], for example

L = ["a", "egg", 35]
print L[::-1]

Package imports

Code that is not part of the Python core belongs to "packages" or "modules". You can import packages and modules using the import statement. There are several forms of the import statement:

## Import all of numpy, but everything in the package must 
## be qualified with "numpy".  For example, to call the
## "sum" function in numpy, use "numpy.sum(...)". 
import numpy

## Import all of numpy, and locally rename the package as "np"
## Any numpy function can now be called by prepending "np" to the
## function call, e.g. "np.sum(...)" calls the numpy sum function
import numpy as np

## Import two functions from numpy.  These functions can be
## called directly, e.g. as "sum(...)" or "prod(...)".  Other
## values in numpy cannot be accessed.
from numpy import sum,prod

## Import the sum function from numpy and rename it as npsum.
## It can be called directly as "npsum(...)".  The rest of 
## numpy is not accessible.
from numpy import sum as npsum

## Import all of numpy into the current namespace.  These
## values can now be accessed directly, e.g., to call the 
## "prod" function in numpy, just use "prod(...)".
from numpy import *

Most of the time people do not like to use the from numpy import * form, since it can lead to name clashes.
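
To see why a name clash can matter, here is a sketch that uses the standard library math module in place of numpy (an assumption made only to keep the example free of third-party packages). The built-in pow and math.pow share a name but behave differently.

```python
import math

builtin_value = pow(2, 3)        # Built-in pow on integers gives the int 8
math_value = math.pow(2, 3)      # math.pow always gives a float: 8.0

# After "from math import *" (or "from math import pow"), the bare
# name "pow" would refer to math.pow, so code that relied on the
# built-in behavior could silently change.
same_type = type(builtin_value) == type(math_value)   # False
```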

NumPy arrays

A NumPy array holds values of only one type. It can have one or more dimensions, but must be rectangular. NumPy arrays can be used with many arithmetic operations that are not defined for Python lists.

Here we create an array with 5 rows and 3 columns, initialized to hold all zeros, with each entry stored as an 8 byte floating point value:

X = np.zeros((5,3), dtype=np.float64)

We can also create an array with literal values:

In [58]: X = np.array([[3,2],[9,5],[4,4],[7,8],[2,0]], dtype=np.float64)

In [59]: X
Out[59]: 
array([[ 3.,  2.],
       [ 9.,  5.],
       [ 4.,  4.],
       [ 7.,  8.],
       [ 2.,  0.]])

Slicing NumPy arrays

You can slice rows and columns:

In [60]: X[2,:]
Out[60]: array([ 4.,  4.])

In [61]: X[:,1]
Out[61]: array([ 2.,  5.,  4.,  8.,  0.])

You can slice submatrices:

In [62]: X[2:4,0:2]
Out[62]: 
array([[ 4.,  4.],
       [ 7.,  8.]])

You can also select rows and columns using lists of indices, but you should be careful about how you do this:

In [65]: i1 = [2,3]

In [66]: i2 = [0,1]

In [67]: X[i1,:][:,i2]
Out[67]: 
array([[ 4.,  4.],
       [ 7.,  8.]])

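Passing both index lists at once gives you something different, because the indices are paired up to select the elements X[2,0] and X[3,1]: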

In [68]: X[i1,i2]
Out[68]: array([ 4.,  8.])
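
NumPy also provides np.ix_, which builds the index arrays needed to select a submatrix from a list of row indices and a list of column indices in one step. A brief sketch using the same X, i1, and i2 as above:

```python
import numpy as np

X = np.array([[3, 2], [9, 5], [4, 4], [7, 8], [2, 0]], dtype=np.float64)
i1 = [2, 3]
i2 = [0, 1]

sub = X[np.ix_(i1, i2)]     # Rows 2 and 3, columns 0 and 1 (a 2 x 2 array)
two_step = X[i1, :][:, i2]  # The two-step form gives the same submatrix
paired = X[i1, i2]          # Pairs up indices: X[2, 0] and X[3, 1]
```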

Indexing and slicing can get quite complicated. For more details, see the NumPy indexing documentation.

Vectorization and broadcasting with NumPy arrays

Elementwise operations on NumPy arrays are normally done on arrays with the same shape. For example, if X and Y are both 5 x 2, then X * Y is also a 5 x 2 array, and the i,j element of X * Y is equal to the i,j element of X times the i,j element of Y (i.e. it is the pointwise or elementwise product, not the linear algebraic product which is given by np.dot). This is called "vectorization", and usually will dramatically improve the speed of your code, compared to explicitly coding the operations using iteration.

In some circumstances it is also possible to apply elementwise operations to arrays with different shapes or different numbers of dimensions. This is called "broadcasting". The general rules for broadcasting are a bit complex and are explained in detail in the NumPy documentation. At this point, we will just talk about broadcasting involving 1-dimensional and 2-dimensional arrays.

First we discuss the broadcasting rules for an operation involving one 2-dimensional array and one 1-dimensional array. If you multiply an m x n array X with an n-vector Y, you get the pointwise product of each row of X with Y, e.g.

In [26]: X = np.array([[1,3],[2,4],[3,2]])

In [27]: Y = np.array([2,3])

In [28]: X * Y
Out[28]: 
array([[ 2,  9],
       [ 4, 12],
       [ 6,  6]])

In [29]: X * np.outer(np.ones(3), Y)
Out[29]: 
array([[ 2,  9],
       [ 4, 12],
       [ 6,  6]])

We would obtain the same answer using Y * X, and we could also do this with any binary arithmetic operation. One useful application of this is to center or standardize the columns of a matrix:

X -= X.mean(0) ## Center the columns of X (X must be a floating point array)
X /= X.std(0)  ## Also standardize the columns
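
As a check on the column standardization, the sketch below applies it to a small made-up array and confirms that each column ends up with mean zero and standard deviation one. The array is created with a floating point dtype because in-place operations keep the dtype of the array being modified.

```python
import numpy as np

# In-place operations keep the array's dtype, so start from floats
X = np.array([[1, 3], [2, 4], [3, 2]], dtype=np.float64)

X -= X.mean(0)   # X.mean(0) has shape (2,): one mean per column
X /= X.std(0)    # X.std(0) has shape (2,): one standard deviation per column

col_means = X.mean(0)   # Approximately [0, 0]
col_sds = X.std(0)      # Approximately [1, 1]
```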

Next we discuss the broadcasting rules for two 2-dimensional arrays. If we multiply an m x n matrix X by an m x 1 matrix Y, we get the pointwise product of each column of X with Y:

In [46]: X = np.array([[1,4],[3,2],[5,2]])

In [47]: Y = np.array([[2,],[3,],[1,]])

In [48]: X * Y
Out[48]: 
array([[2, 8],
       [9, 6],
       [5, 2]])

Again, we would get the same result by taking Y * X, and this can be done with any of the binary arithmetic operations.

Sometimes it is useful to quickly convert a vector to a matrix. This can be done as follows:

X = np.array([3,2,5,6,7]) ## A 1-d array
C = X[:,None]             ## A 2-d array with one column
R = X[None,:]             ## A 2-d array with one row

This allows us to easily standardize the rows of a matrix:

In [49]: X = np.array([[3,4],[2,1],[9,8]], dtype=np.float64)

In [50]: X -= X.mean(1)[:,None]

In [51]: X /= X.std(1)[:,None]

References and values

Python variables hold references, not values. Hence the underlying data are not copied when you assign to a variable or pass an argument to a function.

X = [1, 4, 5]  ## Create a list
Y = X          ## Assign the list to another variable
Y[1] = 99      ## Changes both Y and X

If you want a copy of a list, use the list function:

X = [1, 4, 5] ## Create a list
Y = list(X)   ## Create a copy of X and assign it to Y
Y[1] = 99     ## Changes Y, not X

But note that the list function only copies one level deep:

A = [1,2]       ## Create a list
B = [3,4]       ## Another list
X = [A, B, 5]   ## Create a list that includes A and B as elements
Y = list(X)     ## Copy the top level of X
Y[0][0] = 0     ## Changes Y[0], X[0], and A
Y[0] = [0,0]    ## Changes only Y[0]
Y[2] = 99       ## Changes only Y[2], since the value is a scalar
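
If you need a copy that is independent at every level, the standard library's copy module provides deepcopy. A minimal sketch contrasting it with the one-level copy made by list:

```python
import copy

A = [1, 2]
X = [A, [3, 4], 5]

shallow = list(X)           # New outer list, but the inner lists are shared
deep = copy.deepcopy(X)     # New outer list and new inner lists

deep[0][0] = 0              # Changes only the deep copy
shallow[0][0] = 99          # Also changes X[0] and A
```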

In NumPy, slices are references:

X = np.array([[3,4],[5,6],[7,8]]) ## Create a 3x2 array
Y = X[1,:]                        ## A reference to row 1 of X
Y[0] = 99                         ## X also changes

If you want a copy in NumPy, call the copy method explicitly:

X = np.array([[3,4],[5,6],[7,8]]) ## Create a 3x2 array
Y = X[1,:].copy()                 ## A copy of row 1 of X
Y[0] = 99                         ## X is unaffected by this
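
If you are unsure whether two arrays refer to the same data, one way to check is np.shares_memory (this assumes a reasonably recent NumPy; the function is not available in very old releases). A short sketch:

```python
import numpy as np

X = np.array([[3, 4], [5, 6], [7, 8]])

view = X[1, :]              # A slice: refers to X's data
dup = X[1, :].copy()        # An independent copy

view_shares = np.shares_memory(X, view)   # True
dup_shares = np.shares_memory(X, dup)     # False

view[0] = 99                # Writes through to X[1, 0]
dup[1] = -1                 # X is unaffected
```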

The operator "==" applied to basic Python lists, dictionaries, and sets compares whether the values held by two variables are identical. The following code prints "True", even though A and B have separate storage in memory, and can be independently changed.

A = [2,4,1]
B = [2,4,1]
print A == B

If you want to test whether two variables refer to the same underlying object, use "is" instead of "==":

A = [2,3,4]
B = A
C = A
print B is C
print B == C
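
Both lines above print True. The difference between "is" and "==" shows up when a variable holds a separate list with the same contents, as in this sketch:

```python
A = [2, 3, 4]
B = A             # A second name for the same list
C = list(A)       # A separate list with equal contents

same_object = B is A      # True
equal_to_copy = C == A    # True: the contents match
copy_is_same = C is A     # False: different objects

B.append(5)               # Visible through A, but not through C
```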

Equality testing for NumPy arrays behaves differently. If you compare two NumPy arrays, you get the elementwise comparison:

A = np.array([1,3,4,5,7])
B = np.array([1,2,4,5,6])
print A == B               ## Prints [ True False  True  True False]

Functions and methods

Function definitions in Python are straightforward:

def myfunc(x):
    """Returns the argument incremented by 1."""
    return x+1

Functions are "first-class", so they can be passed as arguments and returned as values:

def do_twice(f):
    """Returns a function that composes f with itself."""
    def g(x):
        return f(f(x))
    return g

myfunc_twice = do_twice(myfunc)
myfunc_twice(3)                   ## Returns 5

A "method" is a function that is attached to a specific object. We're not covering the object system here, so we won't get into how to define methods, but you will need to call methods that are defined in packages.

X = np.array([1,4,3,5,6]) ## Define an array
print X.sum()             ## Call the array's sum method
print sum(X)              ## Use Python's built-in sum function

You can define an anonymous function using lambda:

double_twice = do_twice(lambda x: 2*x)
print double_twice(3)                   ## Prints 12
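
A common everyday use of passing functions as arguments is the key argument of the built-in sorted function. In this sketch (the word list is made up), a built-in function and a lambda are each used as the sort key:

```python
words = ["salamander", "cat", "crocodile"]

# Sort by length, using the built-in len function as the key
by_length = sorted(words, key=len)

# Sort by last letter, using an anonymous function as the key
by_last = sorted(words, key=lambda w: w[-1])

# Apply len to each word
lengths = list(map(len, words))
```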

Iteration

Iteration in Python is quite general. Iteration over sequences works as expected:

## Iterate over a list
for x in [1,3,2,4,5]:
    print x

## Iterate over a tuple
for x in (1,3,2,4,5):
    print x

## You can omit the parentheses
for x in 1,3,2,4,5:
    print x

A traditional "for" loop with an integer index looks like this:

for k in range(20):
    do_something()

You can use enumerate if you want to track both the value and its position:

for j,x in enumerate(("cat","dog","mouse")):
    print j,x

You can iterate over several lists in parallel using zip:

A = ["cat", "crocodile", "salamander"]
B = ["mammal", "reptile", "amphibian"]

for a,b in zip(A,B):
    print "A " + a + " is a " + b

If you want to get the grammar right:

for a,b in zip(A,B):
    print "A " + a + " is a" +\
        [" ", "n "][b[0] in "aeiou"] + b
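
Besides driving a loop, zip is handy for building a dictionary from two parallel lists. A short sketch using the same A and B (list is applied to the result of zip so that the code behaves the same in Python 2 and Python 3):

```python
A = ["cat", "crocodile", "salamander"]
B = ["mammal", "reptile", "amphibian"]

pairs = list(zip(A, B))       # [('cat', 'mammal'), ...]
kind = dict(zip(A, B))        # Map each animal to its class
lookup = kind["crocodile"]    # 'reptile'

# zip stops at the end of the shortest input
short = list(zip(A, [1, 2]))  # [('cat', 1), ('crocodile', 2)]
```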

You can iterate over any "iterator", which is an object that defines a method called "next" that returns the next element in a sequence of values. You can get an iterator for lists and tuples by calling the __iter__ method:

S = [3,1,2,5,4]   ## A list
I = S.__iter__()  ## An iterator to S
print I.next()    ## Get the first value
print I.next()    ## then the second value
print I.next()    ## and so on...
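
The same steps can be written with the built-in functions iter and next (next is available from Python 2.6 onward). This form also works in Python 3, where the method itself is named __next__. When the values run out, a StopIteration exception is raised, as this sketch shows:

```python
S = [3, 1, 2]
I = iter(S)            # Same as S.__iter__()

first = next(I)        # 3
second = next(I)       # 1
third = next(I)        # 2

try:
    next(I)            # Nothing left
    exhausted = False
except StopIteration:
    exhausted = True   # This is how a for loop knows when to stop
```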

A generator looks like a function, but uses yield instead of return. Calling it produces an iterator that can be used in a loop or other setting where iteration is taking place.

Here is a simple generator:

def integer_sequence(a, b):
    """Yields the integers from a up to, but not including, b."""
    for i in range(a,b):
        yield i

Here is a more complicated and useful generator:

def primes(maxnum):
    """
    Yields the primes smaller than or equal to maxnum.
    """
    Primes = [] ## Keep track of the primes we have seen so far
    n = 2 ## Start here
    while n <= maxnum:
        isprime = True ## Is it a prime?
        for p in Primes:
            if p**2 > n:
                break
            if n % p == 0:
                isprime = False
                break              
        if isprime:
            yield n
            Primes.append(n)
        n += 1

You can use this in a loop:

for k in primes(100):
    print k

You can also use it in other settings where a sequence of values is expected:

sum(primes(100))
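
Generators produce their values one at a time, on demand, so a generator can even describe a sequence with no end. Here is a minimal sketch using a hypothetical generator named squares, together with islice from the standard library's itertools module to take only the first few values:

```python
from itertools import islice

def squares():
    """Yields 0, 1, 4, 9, ... without end."""
    k = 0
    while True:
        yield k ** 2
        k += 1

first_five = list(islice(squares(), 5))   # [0, 1, 4, 9, 16]

# Values are computed only when they are requested
g = squares()
a = next(g)   # 0
b = next(g)   # 1
```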

Comprehensions

A "list comprehension" creates a list in one line using an embedded for loop. This gives us a list containing the squares of the integers between 0 and 19:

## A list consisting of the squares of the integers from 0 to 19.
X = [k**2 for k in range(20)]

You can do more complicated things, like this:

## A list consisting of the squares of the odd integers smaller than 20.
X = [k**2 for k in range(20) if k % 2 == 1]
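
It may help to see the comprehension next to the explicit loop that it replaces. In this sketch, X and Y end up holding the same list:

```python
# Comprehension form
X = [k**2 for k in range(20) if k % 2 == 1]

# Equivalent explicit loop
Y = []
for k in range(20):
    if k % 2 == 1:
        Y.append(k**2)
```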

A "dictionary comprehension" is a similar concept, but for dictionaries:

## A map from even integers smaller than 100 to their squares
X = {k: k**2 for k in range(100) if k % 2 == 0}

Finally, there is also a "set comprehension":

## A set consisting of integers smaller than 100 that
## are evenly divisible by 7.
X = {k for k in range(100) if k % 7 == 0}

Error handling

Python uses exceptions for error handling. In a lot of script writing, you don't need to understand much about exceptions. If an exception occurs the program will stop and some sort of error will be produced. You would then try to fix the error and run the script again.

Occasionally you might want to handle exceptions yourself. For example, suppose you need to convert a large number of strings to numbers, and the data are not very clean. Then you could write a function like the following that attempts to convert each string to a number, and uses nan to mark values that could not be converted.

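def getfloat(x):
    """Converts x to a float, using nan to mark unparseable values."""
    try:
        return float(x)
    except ValueError:
        ## x is a string that does not look like a number
        return float("nan")
    except TypeError:
        ## x is not a string or a number (e.g. None)
        print "Cannot parse " + repr(x)
        return float("nan")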

This is an alternative to calling the built-in function float directly, which would stop execution as soon as a non-parseable value is encountered.
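
Here is a usage sketch on a small list of made-up raw values. So that it runs on its own, it defines a hypothetical helper named to_float that follows the same idea and treats both ValueError and TypeError as "cannot parse". It uses math.isnan to count the values that could not be converted.

```python
import math

def to_float(x):
    """Converts x to a float, returning nan if it cannot be parsed."""
    try:
        return float(x)
    except (ValueError, TypeError):
        return float("nan")

raw = ["3.5", " 7 ", "n/a", "", None, "1e3"]
values = [to_float(x) for x in raw]

# Count the entries that could not be converted
n_bad = sum(1 for v in values if math.isnan(v))   # 3
```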

When debugging, the code 1/0 can be used as a breakpoint, since this statement raises an exception and halts execution.

Basic introspection

The type of an object is returned by the type function.

Function definitions should normally include a "docstring", e.g.

def f(x):
    """A function that returns 1 for all inputs."""
    return 1

The docstring can be accessed as f.__doc__, or interactively by calling help(f) (in IPython you can also type f?).

Docstrings for a method can be accessed using an instance of the class on which the method is defined. For example, if you want to access the docstring for the append method on lists, you can use [].append.__doc__, where [] is an empty list (any other list would work too).

You can get dictionaries mapping the names of all the currently defined local or global variables to their values by calling the functions locals or globals.

You can get a list of all the methods defined for a given object with the dir function. For example, to get a list of all the methods defined on a list, use dir([]). You can use dir() alone to see all the currently defined values.
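
The introspection tools above can be combined in a few lines. In this sketch, names beginning with an underscore are filtered out of the dir listing to leave only the ordinary methods of a list:

```python
def f(x):
    """A function that returns 1 for all inputs."""
    return 1

kind = type(3.0)                       # The float type
doc = f.__doc__                        # The docstring of f
has_append = "append" in dir([])       # True: lists have an append method
public = [m for m in dir([]) if not m.startswith("_")]
is_list = isinstance([1, 2], list)     # True
```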