Stata is a special-purpose language that is very narrowly focused on data analysis. It is quite different from general-purpose programming languages (e.g. Python) and most other data-oriented programming languages like R.
You can use Stata at U-M by connecting to scs.dsc.umich.edu, mario.dsc.umich.edu, or luigi.dsc.umich.edu. After you have connected to one of these machines, type stata on the command line to start an interactive session. Once you get past the basics, you will usually want to write Stata scripts (“do files”) in an editor and run them from the command line using stata -b do xyz.do (batch mode, which writes the output to xyz.log), or within Stata using do xyz.
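As a minimal sketch of what a do file might contain (the name xyz.do comes from the text above; the auto dataset is an example dataset that ships with Stata and is used here only for illustration):

```stata
* xyz.do: a minimal do file (illustrative contents)
clear all
* Load an example dataset shipped with Stata
sysuse auto
* Show variable names and types
describe
* Descriptive statistics for two variables
summarize mpg weight
* A simple linear regression
regress mpg weight
```

Lines beginning with an asterisk are comments, so a do file can document itself as it goes.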
One of the unique things about Stata is that there is only one “master” data set active at any given time. This data set is the implicit target of most Stata commands. To make this more concrete, consider that in most programming languages, you would use something like the following to fit a regression (this is pseudocode that resembles both R and Python):
d = read_data("data_file.csv")
r = regress(d.y, d.x)
print(r)
Note that each of the three lines above calls a function, using the standard mathematical notation “f(x, …)” for function calls. The first two function calls above return results that we assign to variables (d and r respectively). The third function (print) acts via side-effects and does not return a result.
However, the Stata code for doing this looks quite different:
import delimited using "data_file.csv"
regress y x
You can think of “import delimited” and “regress” as being functions, but they do not accept their arguments using the familiar “f(x, y, z)” syntax style. Also, Stata functions do not return any values; they operate entirely through side-effects. The most important results are printed to the screen. Many other results are accessed by calling the e(...) function with display. For example, display e(N) gives the number of observations in the regression, and display e(r2) gives the r-squared of the regression. Note that these results will always pertain to the most recently fitted model.
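The following sketch shows one way to explore these stored results. It uses the auto dataset that ships with Stata rather than a data file from this tutorial, so the variable names mpg and weight are specific to that example:

```stata
* Illustrative example using the auto dataset that ships with Stata
sysuse auto, clear
regress mpg weight

* List everything the most recent estimation command stored
ereturn list

* Pull out individual results
display e(N)
display e(r2)

* The estimated coefficient for weight
display _b[weight]
```

If you fit a second model, the same display commands report the results for that model instead.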
A survey data example
We can get a better feel for how Stata works by doing some actual analyses. We will look at a US government survey for which the microdata (individual responses) are available on the web. This is the RECS (Residential Energy Consumption Survey), which focuses on home energy usage. You can download the data directly to the server using wget:
wget http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv
This is a csv file, and you can inspect it in the Linux shell using:
less -S recs2009_public.csv
You will also want to get the basic variable documentation:
wget http://www.eia.gov/consumption/residential/data/2009/csv/public_layout.csv
and the more detailed variable documentation here (this is an Excel file so open it in your browser).
You can load the dataset into Stata using the following code (it may take a while to run since the dataset is large):
import delimited recs2009_public.csv
Now that you have loaded the file, it becomes the only dataset that can exist in your Stata session. You can modify the dataset (e.g. you can add or remove rows or columns, transform data, etc.), but you cannot have two datasets in the session at the same time.
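One way to work within this constraint is to take a temporary snapshot of the data, modify it, and then restore the original. The sketch below assumes the RECS data are already loaded and that 1 is a valid code for regionc:

```stata
* Save a temporary copy of the dataset in memory
preserve
* Keep only one region (the value 1 is an assumed region code)
keep if regionc == 1
summarize kwh
* Bring back the full dataset
restore
* The observation count is back to that of the full dataset
count
```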
To see the variable names and types, use:
describe
To display the entire dataset to the screen, you would use list, but you should not do this for the RECS data because it is large and your machine will lock up for quite a while and then display an unmanageable amount of output. Instead, you can view a single variable using list followed by the variable’s name, e.g.
list doeid
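Even a single variable produces a lot of output for a dataset of this size, so it can help to restrict the rows that are listed. A sketch (the cutoff year 2000 is arbitrary):

```stata
* First 10 observations of one variable
list doeid in 1/10

* Several variables, restricted by a condition, among the first 50 observations
list doeid yearmade kwh if yearmade > 2000 & !missing(yearmade) in 1/50

* Number of observations in the dataset
count
```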
The summarize command displays some simple descriptive statistics of a single variable:
summarize yearmade
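Like the estimation commands, summarize leaves its results behind for later use, in r(...) rather than e(...). A short sketch:

```stata
* More detailed statistics, including percentiles
summarize yearmade, detail

* List the results stored by the most recent summarize
return list

* Use individual results
display r(mean)
display r(N)
```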
It is often necessary to calculate summary statistics within strata of the data. For example, we might want to summarize the years that houses were built (yearmade) in each region of the country (regionc). To do this, first we need to sort by the stratifying variable:
sort regionc
Next we use by and summarize together to calculate the summary statistics on each stratum:
by regionc: summarize yearmade
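The sort and by steps can also be combined, and tabstat offers a more compact layout for the same kind of summary. A sketch:

```stata
* bysort sorts the data and then applies the by prefix in one step
bysort regionc: summarize yearmade

* A compact table with one row per region
tabstat yearmade, by(regionc) statistics(n mean sd min max)
```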
To explore some of the categorical variables in the dataset, we can create contingency tables using tabulate. First, we create a one-way contingency table of the region codes:
tabulate regionc
Next we can create a two-way contingency table showing the joint frequencies of two variables:
tabulate regionc ur
You can normalize the data to 100% within each row by using the row option to tabulate:
tabulate regionc ur, row
(There is also a col option that does the analogous normalization within each column.)
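A few variations on the two-way table, as a sketch:

```stata
* Column percentages
tabulate regionc ur, col

* Row and column percentages together, without the raw counts
tabulate regionc ur, row col nofreq

* Pearson chi-squared test of independence
tabulate regionc ur, chi2
```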
A basic regression example
The basic pattern for fitting a linear regression model in Stata is regress y x, where y and x are the dependent and independent variables of the regression, respectively. For example, the following line of code fits a simple linear regression of kwh (energy used, in kilowatt hours) on the year in which the house was made:
regress kwh yearmade
Additional predictor variables can easily be included:
regress kwh yearmade totsqft
A categorical variable will (inappropriately) be treated as a quantitative variable unless you put “i.” before it:
regress kwh yearmade totsqft i.regionc
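With the “i.” prefix, Stata creates an indicator for each level of the variable except a reference level, which by default is the lowest code. The sketch below tests the region indicators jointly and then changes the reference level (choosing level 2 is arbitrary and assumes 2 is a valid code for regionc):

```stata
* Fit the model and test that all of the region indicators are jointly zero
regress kwh yearmade totsqft i.regionc
testparm i.regionc

* Refit using region 2 as the reference category
regress kwh yearmade totsqft ib2.regionc
```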
Conversely, if you want to guarantee that a variable is treated as continuous, put “c.” before it:
regress kwh yearmade c.totsqft i.regionc
To fit a model that has an interaction between two variables, use “#”:
regress kwh yearmade totsqft i.regionc c.yearmade#c.totsqft
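Stata also has a “##” operator that includes both main effects and their interaction in one term. The following sketch fits the same model as the command above, with the terms listed in a different order:

```stata
* c.yearmade##c.totsqft expands to yearmade, totsqft, and their product
regress kwh i.regionc c.yearmade##c.totsqft
```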
Often when using interactions, one or both variables in the interaction should be centered. The following two commands will create a variable yearmadec that contains the mean-centered values of yearmade.
summarize yearmade
gen yearmadec = yearmade - r(mean)
Now we can refit the regression with the centered version of yearmade in the interaction:
regress kwh yearmade totsqft i.regionc c.yearmadec#c.totsqft
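If you want to center both variables and use the centered versions throughout, one possible approach is sketched below. It assumes yearmadec was created as shown above; totsqftc is a new variable name chosen for this example:

```stata
* Compute the mean of totsqft without printing the summary table
summarize totsqft, meanonly

* r(mean) must be used before another command overwrites it
generate totsqftc = totsqft - r(mean)

* Use the centered variables for both the main effects and the interaction
regress kwh yearmadec totsqftc i.regionc c.yearmadec#c.totsqftc
```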
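It might make sense here to consider a regression with the log of electricity usage as the dependent variable. We can generate a log transformed version of kwh using gen (Stata's log function is the natural logarithm, so dividing by log(2) gives the logarithm in base 2):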
gen logkwh = log(kwh)/log(2)
Now we refit the regression:
regress logkwh yearmade totsqft i.regionc c.yearmadec#c.totsqft
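A few quick checks related to the log transform, as a sketch (log10kwh is a variable name chosen for this example):

```stata
* log() is the natural logarithm, so this displays 1
display log(exp(1))

* Changing the base by dividing by the log of the new base
display log(8)/log(2)

* Observations with kwh <= 0 get a missing value for the log
count if kwh <= 0

* A base-10 alternative
generate log10kwh = log10(kwh)
```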
Stata variables and macros
A variable in Stata can only refer to a column of the master data set. Thus all variables that exist in Stata at a given point in time have the same number of values, and these vectors are always aligned. In contrast, in a more typical programming language, a variable can hold any value, and the dimensions of these values can differ from each other.
Macros
A “macro” in Stata is a string that can be interpolated into the source code of your Stata program. Macros behave somewhat like shell variables in a Unix shell like bash. In some ways they are like ordinary variables in other languages (like R), but in other ways they are quite different. In particular, it is important to understand when the value of a macro is “evaluated” by Stata.
To define a new local macro, use the local command:
local i 33
You can retrieve the value of a macro by enclosing it between a backtick and an apostrophe:
display `i'
display `i' + 1
display "x`i'"
Note that a macro is stored in unevaluated form (basically as a string of text) until the point when it is evaluated. The backtick/apostrophe syntax extracts the value of the macro, but does not evaluate it. The macro will be evaluated when placed in the context of a Stata command, such as “display”. Wrapping the macro in double quotes converts the macro to a regular string, which prevents it from being evaluated as a Stata expression:
local i 33+1
display `i'
display "`i'"
Since a macro is just a string, it can contain Stata code that is executed when the macro is evaluated. For example, the following yields a result of “4”.
local m "*2"
display 2`m'
To understand what is happening here, it might help to view the macro in unevaluated form by converting it to a string:
display "2`m'"
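A common practical use of this behavior is to store a list of model terms in a macro and paste it into several commands. The sketch below assumes the RECS data are loaded; predictors is a macro name chosen for this example:

```stata
* Store a list of model terms as text
local predictors yearmade totsqft i.regionc

* Stata sees: regress kwh yearmade totsqft i.regionc
regress kwh `predictors'

* Extend the list and refit
local predictors `predictors' c.yearmade#c.totsqft
regress kwh `predictors'
```

A local macro only exists within the do file or interactive session where it was defined, so these lines need to be run together.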
Since macros contain unevaluated code, when you interpolate them with other code the resulting string is evaluated exactly as constructed. This means that you sometimes have to be careful about things like the order of operations. Run the following and see what you get:
local i 33+1
display 2*`i'
display "2*`i'"
display 2*(`i')
Note that the macros we have seen so far are defined without an equal sign, i.e. we used
local i 33
not
local i = 33
There is an important difference between creating a macro with and without the “=” sign. When the macro is defined using the “=” sign, the right hand side is immediately evaluated and the value of the macro is the result of this evaluation. But when we do not use the “=” sign, the macro is stored in unevaluated form, and will usually be evaluated at a later point. The following examples may help clarify the difference:
local x = 2+2
local y 2+2
display 2*`x'
display 2*`y'
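To see what each macro actually holds, you can print its text by wrapping it in double quotes. The comments in this sketch describe what each line displays:

```stata
local x = 2+2
local y 2+2

* The text stored in x is 4
display "`x'"

* The text stored in y is 2+2
display "`y'"

* Stata sees 2*4, which is 8
display 2*`x'

* Stata sees 2*2+2, which is 6 because multiplication happens before addition
display 2*`y'
```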
Macros are often used with loops. The following loop displays the results of applying summarize to every variable in the RECS dataset where the variable name begins with “age”:
foreach x of varlist age* {
    sum `x'
}
If you just want to obtain the names of these variables, use double quotes to prevent evaluation:
foreach x of varlist age* {
    display "`x'"
}
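The two ideas can be combined: the loop macro can be used both as code (the variable to summarize) and as text (its name in the output). A sketch, followed by a loop over a range of numbers:

```stata
* Print each variable name together with its mean
foreach x of varlist age* {
    quietly summarize `x'
    display "`x': mean = " r(mean)
}

* Loop over the numbers 1, 2, 3
forvalues i = 1/3 {
    display "iteration `i'"
}
```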
Logistic regression with data cleaning
Download the data file messy_logistic.dta and transfer it to your Stata working directory (if you are using the SCS machines, you can use mfile to upload the file to your AFS directory). Then load it into your Stata session using:
use messy_logistic
We want to fit a logistic regression model of y regressed on x1 and x2. But note how y is coded:
tabulate y
It is mostly coded “yes” and “no”, so we need to convert these to numeric values before we can do the regression analysis. Also, some of the values of y are strings containing several blanks. In Stata, empty strings are treated as missing values, but strings consisting exclusively of whitespace are not. Also note that a few of the “yes” values are misspelled, capitalized differently, or have trailing whitespace. We can start by using subinstr to remove all the space characters:
replace y = subinstr(y, " ", "", .)
Now check again with tabulate y and note that the sample size has decreased due to the missing values.
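By default tabulate leaves missing values out of the table. To see how many there are, you could use something like the following sketch:

```stata
* Count the observations where y is the empty string, i.e. missing
count if missing(y)

* Include the missing values as their own row in the table
tabulate y, missing
```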
The variant spellings of “yes” and “no” can be corrected with “replace” statements.
replace y = "yes" if y == "Yes"
replace y = "yes" if y == "yse"
Since logit expects numbers, we encode our strings:
encode y, generate(y1)
At this point y1 is a numeric variable, but each numeric level of y1 is associated with a text label (taken from y). To see how the numerical codes map onto the categories, use:
tabulate y1
tabulate y1, nolabel
The labels are coded 1 / 2, but for logistic regression we need the coding to be 0 / 1, so we define another variable:
gen y2 = y1 - 1
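Subtracting 1 relies on the codes that encode happened to assign. As a sketch of an alternative that states the coding explicitly, assuming y contains only “yes”, “no”, and empty strings after the cleaning steps above (y3 is a variable name chosen for this example):

```stata
* Show the mapping from numeric codes to labels created by encode
label list y1

* 1 when y is "yes", 0 otherwise, and missing when y is missing
generate y3 = (y == "yes") if !missing(y)

* Check that the two versions agree
tabulate y2 y3, missing
```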
We can now fit the logistic regression:
logit y2 x1 x2
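A few common follow-up steps after fitting the model, as a sketch (phat is a variable name chosen for this example):

```stata
* Report odds ratios instead of log odds ratios
logit y2 x1 x2, or

* Predicted probabilities from the fitted model
predict phat, pr
summarize phat

* Number of observations used and the log likelihood
display e(N)
display e(ll)
```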
Basic Stata commands
use, sysuse, import delimited, webuse: for loading data
exit: for closing Stata; you may need to clear first if you have changed the data set
help: for getting information about Stata commands
display: for evaluating an expression and printing the result
describe: for displaying information about the current data set
do: for running a “do file” (a Stata script)
list: for displaying data in the current data set
tabulate: for creating distribution tables
summarize: for producing summary statistics
generate: for creating a new variable
replace: for replacing a variable with a new value
local: for defining a macro
recode: for recoding values in a variable
encode and decode: for converting between string variables and categorical variables
e(): for extracting results
Resources:
A discussion about how Stata differs from other Statistical languages.
Comments about Stata terminology for macros and variables.
Another Stata tutorial
A very formal summary of the Stata programming language (more than we need right now)