There are several Python packages that provide high-quality routines for statistical analysis. However sometimes it is useful to be able to do common statistical calculations directly, without relying on libraries. Here we will illustrate how this can be done using a few commonly-encountered statistical calculations. We also illustrate how simulations can be used to assess the properties of statistical procedures.
The following import statements are used in the code examples below:
import numpy as np from scipy.stats.distributions import norm
One of the most common elementary statistical analyses is the comparison of two populations in terms of their means, based on data sampled from the two populations. This is often described as a "t-test", which is not a very descriptive term.
If you have vectors
Y sampled independently from two
populations, you can calculate the mean difference as
md = X.mean() - Y.mean().
Alternatively, using function calls you can use
md = np.mean(X) - np.mean(Y).
Note that the second form works for both NumPy arrays and Python lists, but the first form only works for NumPy arrays.
The standard error of the mean difference can be estimated as
se = np.sqrt(np.var(X)/len(X) + np.var(Y)/len(Y)).
This can be used to form a 95% confidence interval for the true mean difference
lcl,ucl = md-2*se,md+2*se
You can obtain a Z-score for testing the hypothesis that the true difference is zero
z = md/se
Then you can get a 2-sided p-value for the test using
pvalue = -2*norm.cdf(-np.abs(z))
This p-value is the Z-test p-value, which differs slightly from the more common t-test p-value. If the sample size is not very small, the two approaches give almost the same result.
We can write a small simulation study to assess the coverage probability of this confidence interval.
## Simulation study to assess coverage probability of ## the confidence interval for the comparison of two ## population means. ## Number of simulated data sets nrep = 1000 ## Sample sizes for the two observed samples nx,ny = 10,20 ## Standard deviation for the sample from the second population sigy = 2 ## Estimate the coverage probability using simulation CP = 0 for it in range(nrep): ## Generate a data set that satisfies the null hypothesis X = np.random.normal(size=nx) Y = sigy*np.random.normal(size=ny) ## The Z-score md = X.mean() - Y.mean() se = np.sqrt(X.var()/nx + Y.var()/ny) ## Check whether the interval contains the true value if (md - 2*se < 0) and (md + 2*se > 0): CP += 1 CP /= float(nrep)
Vectorization can be used to do the same simulation analysis using much less code:
## Vectorized version of the simulation study to assess the coverage ## probability of the confidence interval for the comparison of two ## population means. ## Generate nrep data sets simultaneously X = np.random.normal(size=(nrep,nx)) Y = sigy*np.random.normal(size=(nrep,ny)) ## Calculate Z-scores for all the data sets simultaneously MD = X.mean(1) - Y.mean(1) SE = np.sqrt(X.var(1)/nx + Y.var(1)/ny) ## The coverage probability CP = np.mean( (MD - 2*SE < 0) & (MD + 2*SE > 0) )
We can also consider non-Gaussian populations:
## Modify the previous simulation study so that one of the ## populations is exponential. ## Generate nrep data sets X = -np.log(np.random.uniform(size=(nrep,nx))) Y = 1 + sigy*np.random.normal(size=(nrep,ny)) ## Calculate all the Z-scores MD = X.mean(1) - Y.mean(1) SE = np.sqrt(X.var(1)/nx + Y.var(1)/ny) ## The coverage probability CP = np.mean( (MD - 2*SE < 0) & (MD + 2*SE > 0) )
The odds ratio is a measure of dependence between two binary values. Suppose X and Y are two binary data values, jointly observed on each observed unit. For example, X could be a person's gender (coded 0/1), and Y could be the person's political affiliation (also coded 0/1).
If nij is the number of units observed with X=i and Y=j (for i,j=0,1), then the odds ratio is
(n00n11) / (n10n01).
Usually we prefer to look at the log odds ratio
log(n00) + log(n11) - log(n10) - log(n01).
Note that this is the natural log, which is also the log that is given
by the NumPy function
np.log. It is easy to compute the log odds
ratio in Python
N = np.array([[35,23], [20,29]]) LOR = np.log(N[0,0]) + np.log(N[1,1]) - np.log(np.log(N[0,1])) - np.log(N[1,0])
The standard error of the log odds ratio is
sqrt(1/n00 + 1/n01 + 1/n10 + 1/n11),
which can be computed in Python as
SE = np.sqrt(np.sum(1/N.astype(np.float64)))
Here is a simulation study assessing the performance of the 95% confidence interval based on this standard error. We also look at the bias of the estimated log odds ratio.
## Simulation study for the log odds ratio from scipy.stats import rv_discrete ## Cell probabilities P = np.array([[0.3,0.2],[0.1,0.4]]) ## The population log odds ratio PLOR = np.log(P[0,0]) + np.log(P[1,1]) - np.log(P[0,1]) - np.log(P[1,0]) ## Sample size n = 100 ## ravel vectorizes by row m = rv_discrete(values=((0,1,2,3), P.ravel())) ## Generate the data D = m.rvs(size=(nrep,n)) ## Convert to cell counts Q = np.zeros((nrep,4)) for j in range(4): Q[:,j] = (D == j).sum(1) ## Calculate the log odds ratio LOR = np.log(Q[:,0]) + np.log(Q[:,3]) - np.log(Q[:,1]) - np.log(Q[:,2]) ## The standard error SE = np.sqrt((1/Q.astype(np.float64)).sum(1)) print "The mean estimated standard error is %.3f" % SE.mean() print "The standard deviation of the estimates is %.3f" % LOR.std() ## 95% confidence intervals LCL = LOR - 2*SE UCL = LOR + 2*SE ## Coverage probability CP = np.mean((PLOR > LCL) & (PLOR < UCL)) print "The population LOR is %.2f" % PLOR print "The expected value of the estimated LOR is %.2f" % LOR[np.isfinite(LOR)].mean() print "The coverage probability of the 95%% CI is %.3f" % CP
The correlation coefficient is a measure of dependence between paired
quantitative observations. Given two data vectors
can calculate the correlation coefficient using the NumPy function
X = np.random.normal(size=100) Y = np.random.normal(size=100) r = np.corrcoef(X, Y)[0,1]
np.corrcoef returns the 2x2 correlation matrix, the
correlation coefficient is in the 0,1 position. That is all we want
here, so we immediately select that value from the result.
The standard error of the correlation coefficient is roughly 1/sqrt(n), where n is the sample size. A more accurate statistical analysis can be obtained by applying the Fisher transformation to the correlation coefficient:
F = 0.5*np.log((1+r)/(1-r))
The Fisher transformation approximately has standard error 1/sqrt(n-3).
Here is a simulation study looking at the sampling properties of the correlation coefficient and its 95% confidence interval.
## Simulation study for the correlation coefficient ## Sample size n = 20 ## Correlation between X and Y r = 0.3 ## Generate matrices X and Y such that the i^th rows of X and Y are ## correlated with correlation coefficient 0.3. X = np.random.normal(size=(nrep,n)) Y = r*X + np.sqrt(1-r**2)*np.random.normal(size=(nrep,n)) ## Get all the correlation coefficients R = [np.corrcoef(x,y)[0,1] for x,y in zip(X,Y)] R = np.array(R) ## Fisher transform all the correlation coefficients F = 0.5*np.log((1+R)/(1-R)) print "The standard deviation of the Fisher transformed " +\ "correlation coefficients is %.3f" % F.std() print "1/sqrt(n-3)=%.3f" % (1/np.sqrt(n-3)) ## 95% confidence intervals on the Fisher transform scale LCL = F - 2/np.sqrt(n-3) UCL = F + 2/np.sqrt(n-3) ## Convert the intervals back to the correlation scale LCL = (np.exp(2*LCL)-1) / (np.exp(2*LCL)+1) UCL = (np.exp(2*UCL)-1) / (np.exp(2*UCL)+1) CP = np.mean((LCL < r) & (r < UCL)) print "The coverage probability is %.3f" % CP
Suppose we have a 2-way contingency table, and we want to assess the relationship between the rows and the columns. This is often done using a chi square analysis. The first step in performing the chi square analysis is to fit a table to the observed counts, such that the rows and columns of the fitted table are exactly independent. This can be accomplished as follows:
T = np.array([[45,27], [80,35], [22, 18]]) PR = T.sum(1) / float(T.sum()) PC = T.sum(0) / float(T.sum()) E = np.outer(PR, PC) C = T.sum()*E
The 3x2 table T is the observed data, E is the fitted probabilities that satisfy exact independence between rows and columns, and C are the fitted cell counts that satisfy exact independence between rows and columns. The chi-square statistic, degrees of freedom, and p-value can be calculated as follows
cs = ((T-C)**2/C).sum() df = (T.shape-1)*(T.shape-1) pvalue = 1 - chi2.cdf(cs, df)
The chi2 function gives the chi square distribution probabilities, quantiles, etc., and can be obtained using:
from scipy.stats.distributions import chi2
Here is a simulation study that looks at the chi square test for a particular 3 x 2 classification. The value of x determines how the results should be interpreted. When x=0, the data are generated under exact independence between rows and columns. Thus the test should only reject 5% of the time (since it is a level 0.05 test). The simulation assesses whether the test is performing as expected. If you set x to a nonzero value, the data are generated under a distribution in which the rows and columns are dependent. In this, case we are doing a power analysis to assess whether a given level of dependence can be detected using the specified sample size.
## Simulation study for contingency tables ## Row and column cell probabilities PR = np.r_[0.3, 0.4, 0.3] PC = np.r_[0.6, 0.4] ## Cell probabilities satisfying row/column independence T = np.outer(PR, PC) ## Modify the probabilities a bit to deviate from independence x = 0.05 ## set to 0 to give independent rows and columns T1 = T.copy() T1[1,:] += [x, -x] ## Sample size n = 100 ## ravel vectorizes by row m = rv_discrete(values=(range(6), T1.ravel())) ## Generate the data D = m.rvs(size=(nrep,n)) ## Convert to cell counts Q = np.zeros((nrep,6)) for j in range(6): Q[:,j] = (D == j).sum(1) Q = np.reshape(Q, (nrep, 3, 2)) ## Row and column margins RM = Q.sum(2) / float(n) CM = Q.sum(1) / float(n) ## Fitted probabilities and counts under independence E = [np.outer(rm,cm) for rm,cm in zip(RM,CM)] C = [n*x for x in E] ## Chi-square statistics CS = [((x-c)**2/c).sum() for x,c in zip(Q,C)] CS = np.array(CS) ## Test statistic threshold for alpha=0.04 from scipy.stats.distributions import chi2 df = (3-1)*(2-1) ts = chi2.ppf(0.95, df) print "Proportion of chi^2 tests that reject the independence hypothesis: %.2f" %\ np.mean(CS > ts)