Other Statistical Software
Statistics 506

Other Software

While R, Stata and SAS are the most popular statistical software amongst practicing data analysts, there are a number of other statistical software packages in common use, often by people whose primary training is not in statistics. The goal of these notes is to give you a passing familiarity with these tools and their strengths and weaknesses, so that if you encounter someone using them, you can understand where they're coming from.

None of the material in these notes will appear on problem sets or the midterm.

One common theme that will come up is that none of these other packages supports the same wide range of statistical models and analyses that R, Stata and SAS cover. In some cases (Python in particular) they may support the models, but not more advanced uses of them. All of them support basic analyses like t-tests and linear regression, though some may require an additional tool (such as the Analyse-it add-in for Excel) to avoid carrying out these analyses manually.

Programming Languages

Python

Python is an open source general-purpose programming language that has, in recent years, become popular as a tool for statistical analysis. Its use is covered heavily in Statistics 507, which is why we only touch on it briefly here.

Python supports dynamic documents via Quarto, as well as other tools such as Jupyter. RStudio supports both, and can also run interactive Python sessions just like R sessions.
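
For example, here is a minimal sketch of what a Quarto source file with an embedded Python chunk might look like (the file contents are purely illustrative):

---
title: "Example analysis"
format: html
---

Some narrative text describing the analysis.

```{python}
import numpy as np
x = np.array([1, 2, 3, 4])
x.mean()
```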

Because Python is a general-purpose programming language and not purely a statistical tool, it requires a number of third-party libraries to perform any statistical analysis. It can also be more frustrating to install than R.

NumPy

NumPy adds support for large matrix-style objects and functions associated with them. An example from the quickstart:

>>> import numpy as np
>>> a = np.array([20, 30, 40, 50])
>>> b = np.arange(4)
>>> b
array([0, 1, 2, 3])
>>> c = a - b
>>> c
array([20, 29, 38, 47])
>>> b**2
array([0, 1, 4, 9])
>>> 10 * np.sin(a)
array([ 9.12945251, -9.88031624,  7.4511316 , -2.62374854])
>>> a < 35
array([ True,  True, False, False])
>>> A = np.array([[1, 1],
...              [0, 1]])
>>> B = np.array([[2, 0],
...              [3, 4]])
>>> A * B     # elementwise product
array([[2, 0],
       [0, 4]])
>>> A @ B     # matrix product
array([[5, 4],
       [3, 4]])
>>> A.dot(B)  # another matrix product
array([[5, 4],
       [3, 4]])

Pandas

Pandas introduces an analogue of R’s data.frame: the DataFrame. It is built on top of NumPy. An example from the getting started tutorials:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "Name": [
...             "Braund, Mr. Owen Harris",
...             "Allen, Mr. William Henry",
...             "Bonnell, Miss. Elizabeth",
...         ],
...         "Age": [22, 35, 58],
...         "Sex": ["male", "male", "female"],
...     }
... )
...
>>> df
                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female
>>> df["Age"]
0    22
1    35
2    58
Name: Age, dtype: int64
>>> df["Age"].max()
58
>>> titanic = pd.read_csv("data/titanic.csv")
>>> titanic.head()
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
>>> titanic["Age"].shape
(891,)
>>> above_35 = titanic[titanic["Age"] > 35]
>>> above_35.head()
    PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
1             2         1       1  ...  71.2833   C85         C
6             7         0       1  ...  51.8625   E46         S
11           12         1       1  ...  26.5500  C103         S
13           14         0       3  ...  31.2750   NaN         S
15           16         1       2  ...  16.0000   NaN         S

[5 rows x 12 columns]
>>> above_35.shape
(217, 12)

SciPy

SciPy implements a large suite of scientific computing functions. Many of these are of little interest to statisticians except in niche situations, such as fast Fourier transforms or signal processing. However, it does have a stats subpackage that is very handy and implements basic statistical analyses. For example, here's a linear regression example from the SciPy API reference:

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> x = rng.random(10)
>>> y = 1.6*x + rng.random(10)
>>> res = stats.linregress(x, y)
>>> res.slope
2.0401139933368753
>>> res.intercept
0.22541055389034326
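
The stats subpackage also covers simpler analyses like the two-sample t-test that appears throughout these notes. Here is a minimal sketch; the simulated data are purely illustrative, and the results depend on the random draw:

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> group1 = rng.normal(loc=0.0, scale=1.0, size=30)
>>> group2 = rng.normal(loc=0.5, scale=1.0, size=30)
>>> res = stats.ttest_ind(group1, group2)
>>> res.statistic, res.pvalue  # values vary with the simulated data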

statsmodels

statsmodels is a library focused solely on statistical data exploration, hypothesis testing, and model estimation. It implements a wide range of statistical analyses, though it does not handle as many models as R's various packages do. An example from its getting started guide:

>>> import statsmodels.api as sm
>>> import pandas
>>> from patsy import dmatrices
>>> df = sm.datasets.get_rdataset("Guerry", "HistData").data
>>> y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
>>> y[:3]
   Lottery
0     41.0
1     38.0
2     66.0

>>> X[:3]
   Intercept  Region[T.E]  Region[T.N]  ...  Region[T.W]  Literacy  Wealth
0        1.0          1.0          0.0  ...          0.0      37.0    73.0
1        1.0          0.0          1.0  ...          0.0      51.0    22.0
2        1.0          0.0          0.0  ...          0.0      13.0    61.0

[3 rows x 7 columns]
>>> mod = sm.OLS(y, X)    # Describe model
>>> res = mod.fit()       # Fit model
>>> print(res.summary())  # Summarize model
                           OLS Regression Results
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Fri, 05 May 2023   Prob (F-statistic):           1.07e-05
Time:                        13:59:50   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      0.000      19.826      57.478
Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232
Wealth          0.4515      0.103      4.390      0.000       0.247       0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
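
The patsy step can also be skipped by using statsmodels' formula interface, which accepts an R-style formula directly. A brief sketch, reusing the same Guerry data frame loaded above:

>>> import statsmodels.formula.api as smf
>>> mod = smf.ols('Lottery ~ Literacy + Wealth + Region', data=df)
>>> res = mod.fit()
>>> res.params  # same coefficients as in the summary above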

Julia

Julia is somewhat of a combination of Python and R: it's technically an open source general-purpose programming language like Python, but it is oriented towards numerical and statistical analyses like R. It made a big splash when it appeared on the scene about 10 years ago, and is moderately popular, but it never lived up to its promise to dethrone R as the most common statistical software.

Julia was designed to be much more efficient than existing high-level interactive languages. Its syntax is very similar to R. Taken from the Julia documentation:

julia> 3 \ 6
2.0

julia> inv(3) * 6
2.0

julia> A = [4 3; 2 1]; x = [5, 6];

julia> A \ x
2-element Vector{Float64}:
  6.5
 -7.0

julia> inv(A) * x
2-element Vector{Float64}:
  6.5
 -7.0

Here's a regression example from the GLM package documentation:

julia> using DataFrames, GLM, StatsBase

julia> data = DataFrame(X=[1,2,3], Y=[2,4,7])
3×2 DataFrame
 Row │ X      Y
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     2      4
   3 │     3      7

julia> ols = lm(@formula(Y ~ X), data)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.666667    0.62361   -1.07    0.4788   -8.59038    7.25704
X             2.5         0.288675   8.66    0.0732   -1.16797    6.16797
─────────────────────────────────────────────────────────────────────────

julia> round(r2(ols); digits=5)
0.98684

julia> round(aic(ols); digits=5)
5.84252

julia> round.(vcov(ols); digits=5)
2×2 Matrix{Float64}:
  0.38889  -0.16667
 -0.16667   0.08333

Graphical Software

SPSS

SPSS used to be one of the dominant pieces of proprietary statistical software amongst people who wanted to do statistical analysis but weren't trained statisticians. SPSS has a user-friendly interface that can be driven entirely via the GUI, allowing users to accomplish almost everything without ever writing any code (SPSS calls its code "syntax"). For example, here is the dialog for carrying out a two-sample t-test:

which produces the following output:

The corresponding syntax would be:

T-TEST GROUPS=married(0 1)
  /VARIABLES=wage
  /ES DISPLAY(TRUE).

The capitalization is by convention and is not enforced. Note that the vast majority of SPSS users do not know and never use the syntax; they rely on the dialog boxes.

SPSS's use has waned substantially over the last 5-10 years. This is primarily due to other statistical software catching up in terms of user-friendliness. SPSS also supports far fewer advanced models than R, Stata, SAS, and Python.

Another example, this time of linear regression. Often in these dialog boxes, more complex options are found in sub-dialogs.

producing the following output:

The corresponding syntax is:

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT wage
  /METHOD=ENTER married hours collgrad.

Excel

It's almost impossible not to have some familiarity with Excel. Excel is a good tool for managing small to moderately sized data, and a lot of data projects start out as an Excel spreadsheet. Excel supports some basic statistical tests by default, such as a t-test:

=TTEST(A2:A10,B2:B10,2,1)

In addition, the Analysis ToolPak adds a few more analyses, such as linear regression:

=LINEST(A2:A5,B2:B5,,FALSE)
=LINEST(E2:E12,A2:D12,TRUE,TRUE)

Analyse-it

The Analyse-it add-in is an optional, paid component that adds more statistical tools to Excel. Here are some example screenshots from their product page:

Using Excel solely for data-management

A more common use of Excel is for managing and manipulating data. Features such as VLOOKUP and pivot tables can be powerful and are often much quicker than writing code.

However, keep in mind that such usage does not fall into the reproducible research paradigm, and you should keep good notes of what you’ve done.

Excel can export to .csv format, which is usually easier to import into statistical software than .xlsx files.
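
If reproducibility is a concern, the same lookup and pivot-table operations can instead be scripted in pandas after exporting to .csv. Here is a rough sketch; the file names and column names are made up for illustration:

import pandas as pd

# read exported .csv files rather than the .xlsx directly
sales = pd.read_csv("sales.csv")    # columns: store, product, amount
stores = pd.read_csv("stores.csv")  # columns: store, region

# VLOOKUP-style: bring each store's region onto the sales rows
merged = sales.merge(stores, on="store", how="left")

# pivot-table-style: total amount by region and product
summary = merged.pivot_table(index="region", columns="product",
                             values="amount", aggfunc="sum")
print(summary)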

JMP

JMP (pronounced "jump") is a statistical analysis suite offered by SAS. JMP is designed to be used more for data exploration and visualization than SAS, and as such offers a more GUI-based interaction mode rather than SAS's code-based interaction. Of the various GUI-based statistical software (SPSS, JMP, Prism), it is the most modern, though as usual, it doesn't offer the same depth of models.

There is also JSL, the JMP Scripting Language, which can be used to generate reproducible scripts.

The main interface to choose your analysis:

Repeating the t-test from SPSS:

Again, repeating the regression from SPSS:

(The discrepancy in coefficients is in how SPSS and JMP handle binary categorical variables. The model fits are identical.)

GraphPad Prism

GraphPad Prism is very similar to JMP in that it is an entirely GUI-based tool that offers a limited subset of analyses. It is very popular amongst users with small data sets and little statistical experience. It operates similarly to JMP. One quirk is that it often (though not always) wants data stored in a non-rectangular fashion, in a form that would be incompatible with a lot of other software.

Numerical Analysis Software

MATLAB

MATLAB is one of several programming languages with a focus on numerical analysis. (There's also Octave, which is mostly an open-source implementation of MATLAB.) MATLAB primarily comes into use for most statisticians due to its efficient and powerful matrix support. This example comes from the MATLAB help center:

>> A = [1 2 0; 2 5 -1; 4 10 -1]
>> A
A = 3×3

     1     2     0
     2     5    -1
     4    10    -1
>> B = A'
>> C = A * B
>> C
C = 3×3

     5    12    24
    12    30    59
    24    59   117
>> % Let's solve Ax = b
>> b = [1;3;5]
>> x = A\b
>> x
x = 3×1

     1
     0
    -1
>> eig(A)
ans = 3×1

    3.7321
    0.2679
    1.0000
>> svd(A)
ans = 3×1

   12.3171
    0.5149
    0.1577

MATLAB also supports a limited set of statistical models. From the fitlm documentation:

>> load carsmall
>> tbl = table(Weight,Acceleration,Model_Year,MPG,'VariableNames',{'Weight','Acceleration','Model_Year','MPG'})
>> lm = fitlm(tbl,'MPG~Weight+Acceleration')
lm =
Linear regression model:
    MPG ~ 1 + Weight + Acceleration

Estimated Coefficients:
                     Estimate         SE         tStat       pValue
                    __________    __________    _______    __________

    (Intercept)         45.155        3.4659     13.028    1.6266e-22
    Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
    Acceleration       0.19694       0.14743     1.3359       0.18493


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743,  Adjusted R-Squared: 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27
>> tbl.Model_Year = categorical(tbl.Model_Year)
>> lm = fitlm(tbl,'MPG~Weight+Model_Year')
lm =
Linear regression model:
    MPG ~ 1 + Weight + Model_Year

Estimated Coefficients:
                      Estimate         SE         tStat       pValue
                     __________    __________    _______    __________

    (Intercept)           40.11        1.5418     26.016    1.2024e-43
    Weight           -0.0066475    0.00042802    -15.531    3.3639e-27
    Model_Year_76        1.9291       0.74761     2.5804      0.011488
    Model_Year_82        7.9093       0.84975     9.3078    7.8681e-15


Number of observations: 94, Error degrees of freedom: 90
Root Mean Squared Error: 2.92
R-squared: 0.873,  Adjusted R-Squared: 0.868
F-statistic vs. constant model: 206, p-value = 3.83e-40

Others

The two other big numerical analysis software packages are:

Miscellaneous

G*Power

G*Power is freely available software used for power analysis and sample size calculations. While most statistical software has built-in power calculations, a lot of analysts prefer a purpose-built tool like G*Power. As with any power analysis, obtaining a useful result from G*Power requires assumed values for all parameters of the model (primarily means and covariance matrices), as well as an understanding that the results are only as good as the guesses for those parameters.

Here’s an example of a two-sample t-test. Note that standardized effect sizes (in this case \(d\)) can be manually input, or calculated in the side drawer.

G*Power can also generate plots showing how power concerns change as other parameters change.
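
While not G*Power itself, the same sort of calculation can be reproduced in code. Here is a hedged sketch using statsmodels' TTestIndPower to solve for the per-group sample size of a two-sample t-test; the effect size d = 0.5, alpha = 0.05, and power = 0.80 are assumed inputs, as any power analysis requires:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# solve for n per group, given an assumed standardized effect size d
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05,
                                   power=0.80, ratio=1.0,
                                   alternative='two-sided')
print(n_per_group)  # roughly 64 per group under these assumptions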

Mplus

Mplus is extremely powerful software for fitting path analysis models, also known as structural equation models (SEMs). These are models which can be represented via directed acyclic graphs (DAGs). For example, linear regression can be represented with this:

In this example there is a single outcome, illness. However, more complex models can be represented in a DAG:

While these models can be fit in other software (R’s lavaan package, Stata’s sem command, Amos for SPSS), Mplus is incredibly powerful and can fit complex models that the other software cannot handle.

Unfortunately, the code to fit such models is complex and finicky. For example, here is a relatively simple SEM:

Mplus VERSION 6
MUTHEN & MUTHEN
04/25/2010  10:58 PM

INPUT INSTRUCTIONS

  TITLE:  cont3

          Classic structural equation model with multiple indicators
          used in a study of the stability of alienation.

        Source:
          Wheaton, B., Muthen, B., Alwin, D., & Summers, G. (1977).
          Assessing the reliability and stability in panel models.
          In D.R. Heise (ed), Sociological Methodology 1977.
          San Francisco:  Jossey-Bass.

  DATA:  FILE IS wheacov.dat;
         TYPE IS COVARIANCE;
         NOBS ARE 932;

  VARIABLE:  NAMES ARE anomia67 power67 anomia71 power71 educ sei;

             USEVAR =  anomia67 power67 anomia71 power71 educ sei;

  MODEL:

  !        first the measurement model part using the keyword BY:

          ses BY educ sei;
          alien67 BY anomia67 power67;
          alien71 BY anomia71 power71;

  !        next the structural model part using the keyword ON:

          alien71 ON alien67 ses;
          alien67 ON ses;

  !        and then adding correlated residuals over time using
  !        the keyword WITH:

          anomia67 WITH anomia71;
          power67 WITH power71;

  OUTPUT:

          sampstat tech1 standardized modindices(0);

The output from this model is very large; see this example for the full output. Some of the input is optional (e.g. the TITLE description and the Source comment), but most of it must be specified precisely.