Other Statistical Software
Statistics 506
Other Software
While R, Stata and SAS are the most popular statistical software amongst practicing data analysts, there are a number of other pieces of statistical software that non-statistically oriented people tend to use. The goal of these notes is to offer you a passing familiarity with these tools and their strengths and weaknesses so if you encounter someone using them, you can understand where they’re coming from.
None of the material in these notes will appear on problem sets or the midterm.
One common theme that will come up is that none of these other pieces of software offers support for the same wide range of statistical models and analyses that R, Stata and SAS cover. In some cases (Python in particular) they may support the models, but not more advanced usages of them. They will support basic analyses like t-tests and linear regression, though they may require an additional tool (such as Excel’s Analyse-it) to avoid carrying out these analyses manually.
Programming Languages
Python
Python is an open source general purpose programming language that has, in recent years, become popular as a tool for statistical analysis. Its use is covered heavily in Statistics 507, which is why we only touch on it briefly here.
Python supports dynamic documents via Quarto, or others such as Jupyter notebooks. RStudio has support for these, as well as running interactive Python sessions just like R.
Because Python is a general purpose programming language and not purely for statistical analysis, it requires a number of third-party libraries to perform any statistical analysis. It also can be more frustrating to install than R.
NumPy
NumPy adds support for large matrix-style objects and functions associated with them. An example from the quickstart:
>>> import numpy as np
>>> a = np.array([20, 30, 40, 50])
>>> b = np.arange(4)
>>> b
array([0, 1, 2, 3])
>>> c = a - b
>>> c
array([20, 29, 38, 47])
>>> b**2
array([0, 1, 4, 9])
>>> 10 * np.sin(a)
array([ 9.12945251, -9.88031624,  7.4511316 , -2.62374854])
>>> a < 35
array([ True,  True, False, False])
>>> A = np.array([[1, 1],
...               [0, 1]])
>>> B = np.array([[2, 0],
...               [3, 4]])
>>> A * B  # elementwise product
array([[2, 0],
       [0, 4]])
>>> A @ B  # matrix product
array([[5, 4],
       [3, 4]])
>>> A.dot(B)  # another matrix product
array([[5, 4],
       [3, 4]])
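NumPy arrays also expose basic descriptive statistics directly, which is often the first thing a statistician reaches for. A minimal sketch (my own, not from the quickstart) reusing arrays like those above:

```python
import numpy as np

a = np.array([20, 30, 40, 50])

# Basic descriptive statistics on an array
print(a.mean())       # 35.0
print(a.std(ddof=1))  # sample standard deviation
print(np.median(a))   # 35.0

# Reductions can also be taken along an axis of a matrix
A = np.array([[1, 1],
              [0, 1]])
print(A.sum(axis=0))  # column sums: [1 2]
```

Note `ddof=1` to get the sample (rather than population) standard deviation, matching what R's `sd()` reports.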
Pandas
Pandas introduces an analogue of R's data.frame: the DataFrame. It is built on top of NumPy. An example from the getting started tutorials:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "Name": [
...             "Braund, Mr. Owen Harris",
...             "Allen, Mr. William Henry",
...             "Bonnell, Miss. Elizabeth",
...         ],
...         "Age": [22, 35, 58],
...         "Sex": ["male", "male", "female"],
...     }
... )
>>> df
                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female
>>> df["Age"]
0    22
1    35
2    58
Name: Age, dtype: int64
>>> df["Age"].max()
58
>>> titanic = pd.read_csv("data/titanic.csv")
>>> titanic.head()
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
>>> titanic["Age"].shape
(891,)
>>> above_35 = titanic[titanic["Age"] > 35]
>>> above_35.head()
    PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
1             2         1       1  ...  71.2833   C85         C
6             7         0       1  ...  51.8625   E46         S
11           12         1       1  ...  26.5500  C103         S
13           14         0       3  ...  31.2750   NaN         S
15           16         1       2  ...  16.0000   NaN         S

[5 rows x 12 columns]
>>> above_35.shape
(217, 12)
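Beyond selection, most pandas workflows lean on split-apply-combine via groupby, the analogue of R's aggregate or dplyr's group_by. A small sketch (my own, not from the tutorials) using the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

# Split-apply-combine: mean age within each sex
print(df.groupby("Sex")["Age"].mean())

# Boolean indexing can combine multiple conditions with & and |
adults = df[(df["Age"] >= 18) & (df["Sex"] == "male")]
print(adults.shape)  # (2, 3)
```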
SciPy
SciPy implements a large suite of scientific computing functions. A lot of these may not be interesting to us as statisticians except in niche situations, such as fast Fourier transforms or signal processing. However, it does have a stats subpackage that is very handy and implements basic statistical analyses. For example, here's a linear regression example from the SciPy API reference:
>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> x = rng.random(10)
>>> y = 1.6*x + rng.random(10)
>>> res = stats.linregress(x, y)
>>> res.slope
2.0401139933368753
>>> res.intercept
0.22541055389034326
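The stats subpackage also covers the basic hypothesis tests mentioned earlier. A sketch (my own example, not from the API reference) of a two-sample t-test on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(506)

# Two samples whose true means differ by 0.8 standard deviations
x = rng.normal(loc=0.0, scale=1.0, size=100)
y = rng.normal(loc=0.8, scale=1.0, size=100)

# Welch's two-sample t-test (does not assume equal variances)
res = stats.ttest_ind(x, y, equal_var=False)
print(res.statistic, res.pvalue)
```

Setting `equal_var=True` instead gives the classical pooled-variance t-test.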
statsmodels
statsmodels is a library focused solely on statistical data exploration, hypothesis testing, and model estimation. It implements a wide range of statistical analyses, though it does not handle as many models as R's various packages do. An example from getting started:
>>> import statsmodels.api as sm
>>> import pandas
>>> from patsy import dmatrices
>>> df = sm.datasets.get_rdataset("Guerry", "HistData").data
>>> y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
>>> y[:3]
   Lottery
0     41.0
1     38.0
2     66.0
>>> X[:3]
   Intercept  Region[T.E]  Region[T.N]  ...  Region[T.W]  Literacy  Wealth
0        1.0          1.0          0.0  ...          0.0      37.0    73.0
1        1.0          0.0          1.0  ...          0.0      51.0    22.0
2        1.0          0.0          0.0  ...          0.0      13.0    61.0

[3 rows x 7 columns]
>>> mod = sm.OLS(y, X)    # Describe model
>>> res = mod.fit()       # Fit model
>>> print(res.summary())  # Summarize model
                            OLS Regression Results
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Fri, 05 May 2023   Prob (F-statistic):           1.07e-05
Time:                        13:59:50   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      0.000      19.826      57.478
Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232
Wealth          0.4515      0.103      4.390      0.000       0.247       0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
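statsmodels also ships a formula interface (statsmodels.formula.api) that builds the design matrices itself, so the explicit dmatrices step above can be skipped. A sketch on synthetic data standing in for the Guerry dataset (the variable names mirror it, but the numbers here are simulated, not the real data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with the same variable names as the Guerry example
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Literacy": rng.uniform(10, 90, size=60),
    "Wealth": rng.uniform(10, 90, size=60),
    "Region": rng.choice(["C", "E", "N", "S", "W"], size=60),
})
df["Lottery"] = 30 + 0.4 * df["Literacy"] + rng.normal(0, 10, size=60)

# The formula handles the intercept and the Region dummy coding internally
res = smf.ols("Lottery ~ Literacy + Wealth + Region", data=df).fit()
print(res.params["Literacy"], res.rsquared)
```

This is the closest analogue to R's `lm(Lottery ~ Literacy + Wealth + Region, data = df)`.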
Julia
Julia is somewhat of a combination of Python and R - it's technically an open source general-purpose programming language like Python, but is oriented towards numerical and statistical analyses like R. It made a big splash when it appeared on the scene about 10 years ago, and is moderately popular, but never lived up to its promise to dethrone R as the most common statistical software.
Julia was designed to be much more efficient than existing high-level interactive languages. Its syntax is very similar to R. Taken from the Julia documentation:
julia> 3 \ 6
2.0

julia> inv(3) * 6
2.0

julia> A = [4 3; 2 1]; x = [5, 6];

julia> A \ x
2-element Vector{Float64}:
  6.5
 -7.0

julia> inv(A) * x
2-element Vector{Float64}:
  6.5
 -7.0
Here’s a regression example from the GLM package’s documentation:
julia> using DataFrames, GLM, StatsBase

julia> data = DataFrame(X=[1,2,3], Y=[2,4,7])
3×2 DataFrame
 Row │ X      Y
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     2      4
   3 │     3      7

julia> ols = lm(@formula(Y ~ X), data)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.666667    0.62361   -1.07    0.4788   -8.59038    7.25704
X             2.5         0.288675   8.66    0.0732   -1.16797    6.16797
─────────────────────────────────────────────────────────────────────────

julia> round(r2(ols); digits=5)
0.98684

julia> round(aic(ols); digits=5)
5.84252

julia> round.(vcov(ols); digits=5)
2×2 Matrix{Float64}:
  0.38889  -0.16667
 -0.16667   0.08333
Graphical Software
SPSS
SPSS used to be one of the dominant pieces of proprietary statistical software amongst people who wanted to do statistical analysis but weren’t actual statisticians. SPSS has a user-friendly interface that can be driven entirely via the GUI, allowing users to accomplish almost everything without ever writing any code (SPSS calls its code “syntax”). For example, here is the dialog for carrying out a two-sample t-test:
which produces as an output
The corresponding syntax would be:
T-TEST GROUPS=married(0 1)
  /VARIABLES=wage
  /ES DISPLAY(TRUE).
The capitalization is by convention, and is not enforced. Note that the vast majority of SPSS users do not know and never use the syntax; they use the dialog boxes.
SPSS’s use over the last 5-10 years has waned substantially. This is primarily due to the other statistical software catching up in terms of user-friendliness. SPSS also supports far fewer advanced models than R, Stata, SAS, and Python.
Another example, this of linear regression. Often in these dialog boxes, more complex options are in sub-dialogs.
producing as output
The corresponding syntax is:
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT wage
  /METHOD=ENTER married hours collgrad.
Excel
It’s almost impossible to not have some familiarity with Excel. Excel is a good tool for managing small to moderate sized data, and a lot of data projects start out with an Excel spreadsheet. Excel supports some basic statistical tests by default, such as a t-test:
=TTEST(A2:A10,B2:B10,2,1)
In addition, the Analysis ToolPak adds a few more analyses, such as linear regression:
=LINEST(A2:A5,B2:B5,,FALSE)
=LINEST(E2:E12,A2:D12,TRUE,TRUE)
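These spreadsheet formulas map directly onto the same tests in the other software. As a sketch (with made-up numbers): Excel's TTEST with type 1 is a paired t-test, matching scipy's ttest_rel, and LINEST's slope and intercept match an ordinary least-squares fit:

```python
from scipy import stats

# Two columns of paired measurements, as might sit in A2:A10 and B2:B10
a = [23.1, 25.3, 24.8, 26.0, 22.7, 25.5, 24.1, 23.9, 25.0]
b = [26.2, 27.1, 25.9, 28.0, 26.5, 27.3, 26.8, 25.7, 27.0]

# =TTEST(A2:A10, B2:B10, 2, 1): two-tailed (2), paired (1)
paired = stats.ttest_rel(a, b)
print(paired.pvalue)

# =LINEST(A2:A10, B2:B10) fits y = m*x + b by least squares,
# with the A column as y and the B column as x
fit = stats.linregress(b, a)
print(fit.slope, fit.intercept)
```

Type 2 in TTEST corresponds to the pooled two-sample test (`stats.ttest_ind(a, b, equal_var=True)`), and type 3 to Welch's unequal-variance test.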
Analyse-it
The add-in Analyse-it is an optional purchased component that adds more statistical tools to Excel. Here’s some example screenshots from their product page:
Using Excel solely for data-management
A more common use of Excel is to use it to manage and manipulate data. Features such as vlookup and pivot tables can be powerful and often much quicker than writing code.
However, keep in mind that such usage does not fall into the reproducible research paradigm, and you should keep good notes of what you’ve done.
Excel can export to .csv format, which is usually easier to import into statistical software than .xlsx files.
JMP
JMP (pronounced “jump”) is a statistical analysis suite offered by SAS. JMP is designed to be used more for data exploration and visualization than SAS, and as such offers a more GUI-based interaction mode rather than SAS’s code-based interaction. Of the various GUI-based statistical software (SPSS, JMP, Prism) it is the most modern, though as usual, it doesn’t offer the same depth of models.
There is the JSL, JMP Scripting Language, that can be used to generate reproducible scripts.
The main interface to choose your analysis:
Repeating the t-test from SPSS:
Again, repeating the regression from SPSS:
(The discrepancy in coefficients is in how SPSS and JMP handle binary categorical variables. The model fits are identical.)
Graphpad Prism
Graphpad Prism is very similar to JMP in that it is an entirely GUI-based tool that offers a limited subset of analyses. It is very popular amongst users with small data and little statistical experience. It operates similarly to JMP. One quirk is that it often (though not always) wants data stored in a non-rectangular fashion, in a form that would be incompatible with lots of other software.
Numerical Analysis Software
MATLAB
MATLAB is one of several programming languages with a focus on numerical analysis. (There’s also Octave, which is mostly an open-source implementation of MATLAB.) MATLAB primarily comes into use for most statisticians due to its efficient and powerful matrix support. This example comes from the MATLAB help center:
>> A = [1 2 0; 2 5 -1; 4 10 -1]
>> A
A = 3×3
1 2 0
2 5 -1
4 10 -1
>> B = A'
>> C = A * B
>> C
C = 3×3
5 12 24
12 30 59
24 59 117
>> % Let's solve Ax = b
>> b = [1;3;5]
>> x = A\b
>> x
x = 3×1
1
0
-1
>> eig(A)
ans = 3×1
3.7321
0.2679
1.0000
>> svd(A)
ans = 3×1
12.3171
0.5149
0.1577
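For comparison, all of the matrix operations above have direct NumPy equivalents, which is one reason fewer statisticians reach for MATLAB today. A sketch (my own, not from the MATLAB docs) of the same computations:

```python
import numpy as np

A = np.array([[1, 2, 0],
              [2, 5, -1],
              [4, 10, -1]])
b = np.array([1, 3, 5])

# MATLAB's A\b (solve Ax = b) is np.linalg.solve
x = np.linalg.solve(A, b)
print(x)  # [ 1.  0. -1.]

# Eigenvalues (MATLAB's eig) and singular values (svd)
print(np.linalg.eigvals(A))
print(np.linalg.svd(A, compute_uv=False))
```

As in MATLAB, `solve` is preferred over forming `inv(A) @ b` explicitly, for both speed and numerical stability.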
MATLAB also supports a limited set of statistical models. From the fitlm documentation:
>> load carsmall
>> tbl = table(Weight,Acceleration,Model_Year,MPG,'VariableNames',{'Weight','Acceleration','Model_Year','MPG'})
>> lm = fitlm(tbl,'MPG~Weight+Acceleration')
lm =
Linear regression model:
MPG ~ 1 + Weight + Acceleration
Estimated Coefficients:
                     Estimate         SE         tStat       pValue
                    __________    __________    _______    __________

    (Intercept)         45.155        3.4659     13.028    1.6266e-22
    Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
    Acceleration       0.19694       0.14743     1.3359       0.18493
Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743, Adjusted R-Squared: 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27
>> tbl.Model_Year = categorical(tbl.Model_Year)
>> lm = fitlm(tbl,'MPG~Weight+Model_Year')
lm =
Linear regression model:
MPG ~ 1 + Weight + Model_Year
Estimated Coefficients:
                      Estimate         SE         tStat       pValue
                     __________    __________    _______    __________

    (Intercept)           40.11        1.5418     26.016    1.2024e-43
    Weight           -0.0066475    0.00042802    -15.531    3.3639e-27
    Model_Year_76        1.9291       0.74761     2.5804      0.011488
    Model_Year_82        7.9093       0.84975     9.3078    7.8681e-15
Number of observations: 94, Error degrees of freedom: 90
Root Mean Squared Error: 2.92
R-squared: 0.873, Adjusted R-Squared: 0.868
F-statistic vs. constant model: 206, p-value = 3.83e-40
Others
The two big other numerical analysis software are
Miscellaneous
G*Power
G*Power is open-source software used in power analysis/sample size calculations. While most software has built-in power calculations, a lot of analysts prefer a purpose-built tool like G*Power. As with any power analysis, obtaining a useful result from G*Power requires assumed values for all parameters of the model (primarily means and covariance matrices), as well as an understanding that the results are only as good as those guessed parameter values.
Here’s an example of a two-sample t-test. Note that standardized effect sizes (in this case d) can be manually input, or calculated in the side drawer.
G*Power can also generate plots showing how power concerns change as other parameters change.
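For simple designs, the same calculation can be reproduced elsewhere; for instance, statsmodels implements power and sample-size solvers for the two-sample t-test. A sketch (my own example) matching the classic d = 0.5, 80% power scenario:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group to detect d = 0.5 with 80% power at alpha = 0.05
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n)  # roughly 64 per group

# Or, going the other direction: achieved power at a fixed n per group
pw = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05)
print(pw)
```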
Mplus
Mplus is extremely powerful software for fitting path analysis models, also known as structural equation models (SEMs). These are models which can be represented via directed acyclic graphs (DAGs). For example, linear regression can be represented with this:
In this example there is a single outcome, illness. However, more complex models can be represented in a DAG:
While these models can be fit in other software (R’s lavaan package, Stata’s sem command, Amos for SPSS), Mplus is incredibly powerful and can fit complex models that the other software cannot handle.
Unfortunately, the code to fit such models is complex and finicky. For example, here is a relatively simple SEM:
Mplus VERSION 6
MUTHEN & MUTHEN
04/25/2010 10:58 PM

INPUT INSTRUCTIONS

TITLE: cont3
  Classic structural equation model with multiple indicators
  used in a study of the stability of alienation.
  Source:
  Wheaton, B., Muthen, B., Alwin, D., & Summers, G. (1977).
  Assessing the reliability and stability in panel models.
  In D.R. Heise (ed), Sociological Methodology 1977.
  San Francisco: Jossey-Bass.

DATA: FILE IS wheacov.dat;
  TYPE IS COVARIANCE;
  NOBS ARE 932;

VARIABLE: NAMES ARE anomia67 power67 anomia71 power71 educ sei;
  USEVAR = anomia67 power67 anomia71 power71 educ sei;

MODEL:
  ! first the measurement model part using the keyword BY:
  ses BY educ sei;
  alien67 BY anomia67 power67;
  alien71 BY anomia71 power71;
  ! next the structural model part using the keyword ON:
  alien71 ON alien67 ses;
  alien67 ON ses;
  ! and then adding correlated residuals over time using
  ! the keyword WITH:
  anomia67 WITH anomia71;
  power67 WITH power71;

OUTPUT:
  sampstat tech1 standardized modindices(0);
The output from this model is very large; see this example for the full output. Some of the input is skippable (e.g. the description and Source), but most of it must be specified precisely.