Elementary statistical calculations and simulations

Goals

There are several Python packages that provide high-quality routines for statistical analysis. However, it is sometimes useful to be able to do common statistical calculations directly, without relying on libraries. Here we illustrate how this can be done for a few commonly-encountered statistical calculations. We also show how simulations can be used to assess the properties of statistical procedures.

The following import statements are used in the code examples below:

import numpy as np
from scipy.stats.distributions import norm

Comparing the means of two populations

One of the most common elementary statistical analyses is the comparison of two populations in terms of their means, based on data sampled from the two populations. This is often described as a "t-test", which is not a very descriptive term.

If you have vectors X and Y sampled independently from two populations, you can calculate the mean difference as

md = X.mean() - Y.mean()

Alternatively, using function calls you can use

md = np.mean(X) - np.mean(Y)

Note that the second form works for both NumPy arrays and Python lists, but the first form only works for NumPy arrays.
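
For example:

X = [1.0, 2.0, 3.0]     ## a plain Python list
np.mean(X)              ## works, returns 2.0
## X.mean() would raise an AttributeError, since lists have no mean method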

The standard error of the mean difference can be estimated as

se = np.sqrt(np.var(X)/len(X) + np.var(Y)/len(Y))

This can be used to form a 95% confidence interval for the true mean difference

lcl,ucl = md-2*se,md+2*se

You can obtain a Z-score for testing the hypothesis that the true difference is zero

z = md/se

Then you can get a 2-sided p-value for the test using

pvalue = 2*norm.cdf(-np.abs(z))

This p-value is the Z-test p-value, which differs slightly from the more common t-test p-value. If the sample size is not very small, the two approaches give almost the same result.
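
For reference, here is a minimal sketch comparing the Z-test p-value described above to the t-test p-value produced by SciPy; it assumes that scipy.stats.ttest_ind (called with equal_var=False, i.e. Welch's t-test) is an acceptable point of comparison:

from scipy.stats import ttest_ind

X = np.random.normal(size=10)
Y = np.random.normal(size=20)

## Z-test p-value, computed as described above
md = X.mean() - Y.mean()
se = np.sqrt(np.var(X)/len(X) + np.var(Y)/len(Y))
z = md/se
pvalue_z = 2*norm.cdf(-np.abs(z))

## t-test p-value from SciPy, for comparison
tstat, pvalue_t = ttest_ind(X, Y, equal_var=False)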

We can write a small simulation study to assess the coverage probability of this confidence interval.

## Simulation study to assess coverage probability of
## the confidence interval for the comparison of two
## population means.

## Number of simulated data sets
nrep = 1000

## Sample sizes for the two observed samples
nx,ny = 10,20

## Standard deviation for the sample from the second population
sigy = 2

## Estimate the coverage probability using simulation
CP = 0
for it in range(nrep):

    ## Generate a data set that satisfies the null hypothesis
    X = np.random.normal(size=nx)
    Y = sigy*np.random.normal(size=ny)

    ## The mean difference and its standard error
    md = X.mean() - Y.mean()
    se = np.sqrt(X.var()/nx + Y.var()/ny)

    ## Check whether the interval contains the true value
    if (md - 2*se < 0) and (md + 2*se > 0):
        CP += 1

CP /= float(nrep)

Vectorization can be used to do the same simulation analysis using much less code:

## Vectorized version of the simulation study to assess the coverage
## probability of the confidence interval for the comparison of two
## population means.

## Generate nrep data sets simultaneously
X = np.random.normal(size=(nrep,nx))
Y = sigy*np.random.normal(size=(nrep,ny))

## Calculate the mean differences and standard errors for all the data sets simultaneously
MD = X.mean(1) - Y.mean(1)
SE = np.sqrt(X.var(1)/nx + Y.var(1)/ny)

## The coverage probability
CP = np.mean( (MD - 2*SE < 0) & (MD + 2*SE > 0) )

We can also consider non-Gaussian populations:

## Modify the previous simulation study so that one of the 
## populations is exponential.

## Generate nrep data sets
X = -np.log(np.random.uniform(size=(nrep,nx)))
Y = 1 + sigy*np.random.normal(size=(nrep,ny))

## Calculate the mean differences and standard errors
MD = X.mean(1) - Y.mean(1)
SE = np.sqrt(X.var(1)/nx + Y.var(1)/ny)

## The coverage probability
CP = np.mean( (MD - 2*SE < 0) & (MD + 2*SE > 0) )

Odds ratios

The odds ratio is a measure of dependence between two binary variables. Suppose X and Y are two binary data values, observed jointly on each unit. For example, X could be a person's gender (coded 0/1), and Y could be the person's political affiliation (also coded 0/1).

If nij is the number of units observed with X=i and Y=j (for i,j=0,1), then the odds ratio is

(n00 * n11) / (n10 * n01).

Usually we prefer to look at the log odds ratio

log(n00) + log(n11) - log(n10) - log(n01).

Note that this is the natural log, which is also the log that is given by the NumPy function np.log. It is easy to compute the log odds ratio in Python

N = np.array([[35,23], [20,29]])
LOR = np.log(N[0,0]) + np.log(N[1,1]) - np.log(N[0,1]) - np.log(N[1,0])

The standard error of the log odds ratio is

sqrt(1/n00 + 1/n01 + 1/n10 + 1/n11),

which can be computed in Python as

SE = np.sqrt(np.sum(1/N.astype(np.float64)))
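
For example, a 95% confidence interval for the log odds ratio of the table N above can be formed from LOR and SE, and exponentiating the endpoints gives an interval for the odds ratio on its original scale (a minimal sketch using the N, LOR, and SE computed above):

LCL = LOR - 2*SE
UCL = LOR + 2*SE

## Exponentiate to get a confidence interval for the odds ratio itself
OR_LCL, OR_UCL = np.exp(LCL), np.exp(UCL)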

Here is a simulation study assessing the performance of the 95% confidence interval based on this standard error. We also look at the bias of the estimated log odds ratio.

## Simulation study for the log odds ratio

from scipy.stats import rv_discrete

## Cell probabilities
P = np.array([[0.3,0.2],[0.1,0.4]])

## The population log odds ratio
PLOR = np.log(P[0,0]) + np.log(P[1,1]) - np.log(P[0,1]) - np.log(P[1,0])

## Sample size
n = 100

## ravel flattens the table by row (row-major order)
m = rv_discrete(values=((0,1,2,3), P.ravel()))

## Generate the data
D = m.rvs(size=(nrep,n))

## Convert to cell counts
Q = np.zeros((nrep,4))
for j in range(4):
    Q[:,j] = (D == j).sum(1)

## Calculate the log odds ratio
LOR = np.log(Q[:,0]) + np.log(Q[:,3]) - np.log(Q[:,1]) - np.log(Q[:,2])

## The standard error
SE = np.sqrt((1/Q.astype(np.float64)).sum(1))

print "The mean estimated standard error is %.3f" % SE.mean()
print "The standard deviation of the estimates is %.3f" % LOR.std()

## 95% confidence intervals
LCL = LOR - 2*SE
UCL = LOR + 2*SE

## Coverage probability
CP = np.mean((PLOR > LCL) & (PLOR < UCL))

print "The population LOR is %.2f" % PLOR
print "The expected value of the estimated LOR is %.2f" % LOR[np.isfinite(LOR)].mean()
print "The coverage probability of the 95%% CI is %.3f" % CP

Correlation coefficients

The correlation coefficient is a measure of dependence between paired quantitative observations. Given two data vectors X and Y, you can calculate the correlation coefficient using the NumPy function np.corrcoef.

X = np.random.normal(size=100)
Y = np.random.normal(size=100)
r = np.corrcoef(X, Y)[0,1]

Since np.corrcoef returns the 2x2 correlation matrix, the correlation coefficient is in the 0,1 position. That is all we want here, so we immediately select that value from the result.

The standard error of the correlation coefficient is roughly 1/sqrt(n), where n is the sample size. A more accurate statistical analysis can be obtained by applying the Fisher transformation to the correlation coefficient:

F = 0.5*np.log((1+r)/(1-r))

The Fisher-transformed correlation coefficient has a standard error of approximately 1/sqrt(n-3).
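
As a sketch, using the X, Y, r, and F computed above (so n is 100), a 95% confidence interval for the correlation coefficient can be formed on the Fisher transform scale and then mapped back to the correlation scale:

n = len(X)

## 95% interval on the Fisher transform scale
lcl_f, ucl_f = F - 2/np.sqrt(n-3), F + 2/np.sqrt(n-3)

## Back-transform the endpoints to the correlation scale
lcl = (np.exp(2*lcl_f) - 1) / (np.exp(2*lcl_f) + 1)
ucl = (np.exp(2*ucl_f) - 1) / (np.exp(2*ucl_f) + 1)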

Here is a simulation study looking at the sampling properties of the correlation coefficient and its 95% confidence interval.

## Simulation study for the correlation coefficient

## Sample size
n = 20

## Correlation between X and Y
r = 0.3

## Generate matrices X and Y such that the i^th rows of X and Y are
## correlated with correlation coefficient 0.3.
X = np.random.normal(size=(nrep,n))
Y = r*X + np.sqrt(1-r**2)*np.random.normal(size=(nrep,n))

## Get all the correlation coefficients
R = [np.corrcoef(x,y)[0,1] for x,y in zip(X,Y)]
R = np.array(R)

## Fisher transform all the correlation coefficients
F = 0.5*np.log((1+R)/(1-R))

print "The standard deviation of the Fisher transformed " +\
      "correlation coefficients is %.3f" % F.std() 
print "1/sqrt(n-3)=%.3f" % (1/np.sqrt(n-3))

## 95% confidence intervals on the Fisher transform scale
LCL = F - 2/np.sqrt(n-3)
UCL = F + 2/np.sqrt(n-3)

## Convert the intervals back to the correlation scale
LCL = (np.exp(2*LCL)-1) / (np.exp(2*LCL)+1)
UCL = (np.exp(2*UCL)-1) / (np.exp(2*UCL)+1)

CP = np.mean((LCL < r) & (r < UCL))

print "The coverage probability is %.3f" % CP

Analysis of a 2-way contingency table

Suppose we have a 2-way contingency table, and we want to assess the relationship between the rows and the columns. This is often done using a chi square analysis. The first step in performing the chi square analysis is to fit a table to the observed counts, such that the rows and columns of the fitted table are exactly independent. This can be accomplished as follows:

T = np.array([[45,27], [80,35], [22, 18]])

PR = T.sum(1) / float(T.sum())
PC = T.sum(0) / float(T.sum())

E = np.outer(PR, PC)
C = T.sum()*E

The 3x2 table T contains the observed data, E contains the fitted cell probabilities that satisfy exact independence between rows and columns, and C contains the corresponding fitted cell counts. The chi-square statistic, degrees of freedom, and p-value can be calculated as follows

cs = ((T-C)**2/C).sum()
df = (T.shape[0]-1)*(T.shape[1]-1)
pvalue = 1 - chi2.cdf(cs, df)

The chi2 function gives the chi square distribution probabilities, quantiles, etc., and can be obtained using:

from scipy.stats.distributions import chi2
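
As a cross-check, the same statistic and p-value can be obtained from SciPy's chi2_contingency function (a sketch assuming scipy.stats.chi2_contingency; for tables with more than one degree of freedom it applies no continuity correction, so it matches the calculation above):

from scipy.stats import chi2_contingency

## Returns the chi square statistic, p-value, degrees of freedom,
## and the fitted counts under independence
cs2, pvalue2, df2, C2 = chi2_contingency(T)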

Here is a simulation study that looks at the chi square test for a particular 3 x 2 classification. The value of x determines how the results should be interpreted. When x=0, the data are generated under exact independence between rows and columns, so the test should reject only 5% of the time (since it is a level 0.05 test); the simulation assesses whether the test performs as expected. If you set x to a nonzero value, the data are generated under a distribution in which the rows and columns are dependent. In this case, we are doing a power analysis to assess whether a given level of dependence can be detected with the specified sample size.

## Simulation study for contingency tables

## Row and column cell probabilities
PR = np.r_[0.3, 0.4, 0.3]
PC = np.r_[0.6, 0.4]

## Cell probabilities satisfying row/column independence
T = np.outer(PR, PC)

## Modify the probabilities a bit to deviate from independence
x = 0.05 ## set to 0 to give independent rows and columns
T1 = T.copy()
T1[1,:] += [x, -x]

## Sample size
n = 100

## ravel flattens the table by row (row-major order)
m = rv_discrete(values=(range(6), T1.ravel()))

## Generate the data
D = m.rvs(size=(nrep,n))

## Convert to cell counts
Q = np.zeros((nrep,6))
for j in range(6):
    Q[:,j] = (D == j).sum(1)
Q = np.reshape(Q, (nrep, 3, 2))

## Row and column margins
RM = Q.sum(2) / float(n)
CM = Q.sum(1) / float(n)

## Fitted probabilities and counts under independence
E = [np.outer(rm,cm) for rm,cm in zip(RM,CM)]
C = [n*x for x in E]

## Chi-square statistics
CS = [((x-c)**2/c).sum() for x,c in zip(Q,C)]
CS = np.array(CS)

## Test statistic threshold for alpha=0.05
from scipy.stats.distributions import chi2
df = (3-1)*(2-1)
ts = chi2.ppf(0.95, df)

print "Proportion of chi^2 tests that reject the independence hypothesis: %.2f" %\
    np.mean(CS > ts)

Files

stats_calculations.py