Statistics 506, Fall 2016

Minimal introduction to Stata


Stata is a special-purpose language that is very narrowly focused on data analysis. It is quite different from general-purpose programming languages (e.g. Python) and from most other data-oriented programming languages, such as R.

You can use Stata at U-M by connecting to scs.dsc.umich.edu, mario.dsc.umich.edu, or luigi.dsc.umich.edu. After you have connected to one of these machines, type stata on the command line to start an interactive session. Once you get past the basics, you will usually want to write Stata scripts (“do files”) in an editor and run them from the command line using stata -b do xyz.do (batch mode, which writes the output to xyz.log), or within Stata using do xyz.
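As a rough sketch of what a do file might contain, the following lines could be saved as xyz.do (a placeholder name). The example uses the auto dataset that ships with Stata, which is an assumption made only so that it runs without any external files.

```stata
* xyz.do: a minimal do file (the file name is only a placeholder)
* Start from an empty session
clear all
* Load an example dataset that ships with Stata
sysuse auto
* List the variable names and types
describe
* Print descriptive statistics for two variables
summarize price mpg
```

Lines that begin with an asterisk are comments, and they work both in do files and at the interactive prompt.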

One of the unique things about Stata is that there is only one “master” data set active at any given time. This data set is the implicit target of most Stata commands. To make this more concrete, consider that in most programming languages, you would use something like the following to fit a regression (this is pseudocode that resembles both R and Python):

d = read_data("data_file.csv")
r = regress(d.y, d.x)
print(r)

Note that each of the three lines above calls a function, using the standard mathematical notation “f(x, …)” for function calls. The first two function calls above return results that we assign to variables (d and r respectively). The third function (print) acts via side-effects and does not return a result.

However, the Stata code for doing this looks quite different:

import delimited using "data_file.csv"
regress y x

You can think of “import delimited” and “regress” as being functions (Stata calls them commands), but they do not accept their arguments using the familiar “f(x, y, z)” syntax style. Also, Stata commands do not return any values; they operate entirely through side-effects. The most important results are printed to the screen. Many other results are accessed by calling the e(...) function with display. For example,

display e(N)

gives the number of observations in the regression, and

display e(r2)

gives the r-squared of the regression. Note that these results will always pertain to the most recently fitted model.
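To see everything that an estimation command has stored, you can use ereturn list. The sketch below uses the auto dataset that ships with Stata (not the course data) so that it can be run on its own; the choice of price and mpg as variables is purely illustrative.

```stata
* Load an example dataset that ships with Stata
sysuse auto, clear
* Fit a simple linear regression of price on mpg
regress price mpg
* Show every result stored by the most recent estimation command
ereturn list
* Individual results can be displayed or used in expressions
display e(N)
display e(r2)
* Coefficients are available through _b[], e.g. the slope on mpg
display _b[mpg]
```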

A survey data example

We can get a better feel for how Stata works by doing some actual analyses. We will look at a US government survey for which the microdata (individual responses) are available on the web. This is the RECS (Residential Energy Consumption Survey), which focuses on home energy usage. You can download the data directly to the server using wget:

wget http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv

This is a csv file, and you can inspect it in the Linux shell using:

less -S recs2009_public.csv

You will also want to get the basic variable documentation:

wget http://www.eia.gov/consumption/residential/data/2009/csv/public_layout.csv

and the more detailed variable documentation here (this is an Excel file so open it in your browser).

You can load the dataset into Stata using the following code (it may take a while to run since the dataset is large):

import delimited recs2009_public.csv

Now that you have loaded the file, it becomes the only dataset that can exist in your Stata session. You can modify the dataset (e.g. you can add or remove rows or columns, transform data, etc.), but you cannot have two datasets in the session at the same time.
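One practical consequence of the single-dataset rule is that Stata will refuse to load a new dataset over unsaved changes unless you tell it to discard them. The following sketch illustrates this with two datasets that ship with Stata (auto and census); the variable name price_k is made up for the example.

```stata
sysuse auto, clear
* Change the data in memory by adding a new variable
generate price_k = price / 1000
* Stata refuses to load another dataset over unsaved changes;
* capture suppresses the error and stores the return code in _rc
capture sysuse census
display _rc
* The clear option discards the data in memory and loads the new file
sysuse census, clear
describe
```

A nonzero return code from the first attempt indicates that the command failed and that the auto data were still in memory at that point.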

To see the variable names and types, use:

describe

To display the entire dataset to the screen, you would use list, but you should not do this for the RECS data because it is large and your machine will lock up for quite a while and then display an unmanageable amount of output. Instead, you can view a single variable using list followed by the variable’s name, e.g.

list doeid
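Even a single variable produces a lot of output for a large dataset, so it is often convenient to restrict list to a range of rows with in, or to rows meeting a condition with if. Here is a small sketch using the auto dataset that ships with Stata; with the RECS data the analogous command would be something like list doeid in 1/10.

```stata
sysuse auto, clear
* Show two variables for the first five observations only
list make price in 1/5
* Show only the observations that satisfy a condition
list make price if price > 12000
```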

The summarize command displays some simple descriptive statistics of a single variable:

summarize yearmade
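Like regress, summarize stores its results so that later commands can use them, but it stores them in r(...) rather than e(...). The sketch below, which again uses the auto dataset shipped with Stata as a stand-in, shows the detail option and the stored results.

```stata
sysuse auto, clear
* The detail option adds percentiles, skewness and kurtosis
summarize price, detail
* Show all results stored by the most recent r-class command
return list
* Stored results can be used in later expressions
display r(mean)
display r(p50)
```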

We often need to calculate summary statistics within strata of the data. For example, we might want to summarize the years that houses were built (yearmade) in each region of the country (regionc). To do this, first we need to sort by the stratifying variable:

sort regionc

Next we use by and summarize together to calculate the summary statistics on each stratum:

by regionc: summarize yearmade
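The sort and by steps can also be combined using bysort. The following sketch shows both forms on the auto dataset that ships with Stata, using foreign as the stratifying variable (an arbitrary choice for illustration).

```stata
sysuse auto, clear
* bysort sorts by foreign and then runs summarize within each stratum
bysort foreign: summarize mpg
* The same thing in two steps
sort foreign
by foreign: summarize mpg
```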

To explore some of the categorical variables in the dataset, we can create contingency tables using tabulate. First, we create a one-way contingency table of the region codes:

tabulate regionc

Next we can create a two-way contingency table showing the joint frequencies of two variables:

tabulate regionc ur

You can normalize the data to 100% within each row by using the row option to tabulate:

tabulate regionc ur, row

(There is also a col option that normalizes to 100% within each column instead.)
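Two other tabulate options are often useful: nofreq, which suppresses the raw counts, and missing, which keeps observations with missing values in the table. This is a brief sketch on the auto dataset shipped with Stata, where rep78 has a few missing values.

```stata
sysuse auto, clear
* Row percentages only (nofreq suppresses the raw counts)
tabulate rep78 foreign, row nofreq
* By default observations with missing values are dropped from the table;
* the missing option gives them their own row or column
tabulate rep78 foreign, missing
```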

A basic regression example

The basic pattern for fitting a linear regression model in Stata is regress y x, where y and x are the dependent and independent variables of the regression, respectively. For example, the following line of code fits a simple linear regression of kwh (energy used, in kilowatt hours) on the year in which the house was made:

regress kwh yearmade

Additional predictor variables can easily be included:

regress kwh yearmade totsqft

A categorical variable will (inappropriately) be treated as a quantitative variable unless you put “i.” before it:

regress kwh yearmade totsqft i.regionc

Conversely, if you want to guarantee that a variable is treated as continuous, put “c.” before it:

regress kwh yearmade c.totsqft i.regionc
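One way to see what “i.” does is to compare the model degrees of freedom with and without it. The sketch below uses the auto dataset that ships with Stata, where rep78 is a repair record coded 1 to 5; the variables are stand-ins for the RECS variables.

```stata
sysuse auto, clear
* rep78 entered as a single numeric predictor
regress price mpg rep78
display e(df_m)
* rep78 entered as a categorical variable:
* one indicator for each level other than the reference level
regress price mpg i.rep78
display e(df_m)
```

The second model uses more degrees of freedom because each non-reference level of rep78 gets its own coefficient.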

To fit a model that has an interaction between two variables, use “#”:

regress kwh yearmade totsqft i.regionc c.yearmade#c.totsqft
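Stata also has a “##” operator that expands to the main effects together with their interaction. As a sketch (on the auto dataset shipped with Stata, with scalar names chosen only for this example), the following fits the same model both ways and compares the R-squared values.

```stata
sysuse auto, clear
* Main effects plus an explicit interaction term
regress price mpg weight c.mpg#c.weight
scalar r2_long = e(r2)
* The ## operator expands to both main effects and their interaction
regress price c.mpg##c.weight
scalar r2_short = e(r2)
* Both commands fit the same model, so the R-squared values agree
display r2_long
display r2_short
```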

Often when using interactions, one or both variables in the interaction should be centered. The following two commands will create a variable yearmadec that contains the mean centered values of yearmade.

summarize yearmade
gen yearmadec = yearmade - r(mean)

Now we can refit the regression with the centered version of yearmade:

regress kwh yearmade totsqft i.regionc c.yearmadec#c.totsqft
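It can be reassuring to check that centering worked and that it does not change the overall fit. The sketch below does this on the auto dataset that ships with Stata, centering mpg in place of yearmade; the scalar names are made up for the example. Note that r(mean) is overwritten by the next command that stores r() results, so it should be used right after summarize.

```stata
sysuse auto, clear
* Center mpg using the mean stored by summarize
summarize mpg
generate mpgc = mpg - r(mean)
* The centered variable should have a mean of approximately zero
summarize mpgc
scalar centered_mean = r(mean)
* Fit the interaction model with the raw and the centered variable
regress price mpg weight c.mpg#c.weight
scalar r2_raw = e(r2)
regress price mpgc weight c.mpgc#c.weight
scalar r2_centered = e(r2)
* The fit is the same up to rounding; only some coefficients change
display r2_raw
display r2_centered
```

In this comparison the intercept and the coefficient of weight change, because they now describe a car with average mpg rather than one with mpg equal to zero.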

It might make sense here to consider a regression with the log of electricity usage as the dependent variable. We can generate a log transformed version of kwh using gen; dividing the natural log by log(2) gives the logarithm in base 2:

gen logkwh = log(kwh)/log(2)

Now we refit the regression:

regress logkwh yearmade totsqft i.regionc c.yearmadec#c.totsqft
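With a base-2 log, a one-unit increase in the transformed variable corresponds to a doubling on the original scale. The following sketch checks the transformation by inverting it, using price from the auto dataset shipped with Stata in place of kwh; the new variable names are hypothetical.

```stata
sysuse auto, clear
* Base-2 logarithm via the change-of-base formula
generate log2price = log(price) / log(2)
* Undo the transformation to confirm that it is base 2
generate price_back = 2^log2price
* The largest relative discrepancy should be negligible
generate relerr = abs(price_back - price) / price
summarize relerr
```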

Stata variables and macros

A variable in Stata can only refer to a column of the master data set. Thus all variables that exist in Stata at a given point in time have the same number of values, and these vectors are always aligned. In contrast, in a more typical programming language, a variable can hold any value, and the dimensions of these values can differ from each other.

Macros

A “macro” in Stata is a string that can be interpolated into the source code of your Stata program. Macros behave somewhat like shell variables in a Unix shell like bash. In some ways they are like ordinary variables in other languages (like R), but in other ways they are quite different. In particular, it is important to understand when the value of a macro is “evaluated” by Stata.

To define a new local macro, use the local command:

local i 33

You can retrieve the value of a macro by enclosing it between a backtick and an apostrophe:

display `i'
display `i' + 1
display "x`i'"

Note that a macro is stored in unevaluated form (basically as a string of text) until the point when it is evaluated. The backtick/apostrophe syntax extracts the value of the macro, but does not evaluate it. The macro will be evaluated when placed in the context of a Stata command, such as “display”. Wrapping the macro in double quotes converts the macro to a regular string, which prevents it from being evaluated as a Stata expression:

local i 33+1
display `i'
display "`i'"

Since a macro is just a string, it can contain Stata code that is executed when the macro is evaluated. For example, the following yields a result of “4”.

local m "*2"
display 2`m'

To understand what is happening here, it might help to view the macro in unevaluated form by converting it to a string:

display "2`m'"

Since macros contain unevaluated code, when you interpolate them with other code the resulting string is evaluated exactly as constructed. This means that you sometimes have to be careful about things like the order of operations. Run the following and see what you get:

local i 33+1
display 2*`i'
display "2*`i'"
display 2*(`i')

Note that the macros we have seen so far are defined without an equal sign, i.e. we used

local i 33

not

local i = 33

There is an important difference between creating a macro with and without the “=” sign. When the macro is defined using the “=” sign, the right hand side is immediately evaluated and the value of the macro is the result of this evaluation. But when we do not use the “=” sign, the macro is stored in unevaluated form, and will usually be evaluated at a later point. The following examples may help clarify the difference:

local x = 2+2
local y 2+2
display 2*`x'
display 2*`y'
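Here is a second worked example of the same distinction, with the text that Stata actually sees written out in the comments. The macro names a and b are arbitrary.

```stata
* Evaluated at definition time: the macro stores the text 4
local a = 3+1
* Stored unevaluated: the macro stores the text 3+1
local b 3+1
* Quoting shows the stored text
display "`a'"
display "`b'"
* Stata sees 10*4, which is 40
display 10*`a'
* Stata sees 10*3+1, which is 31
display 10*`b'
```

Wrapping the macro in parentheses, as in 10*(`b'), gives 40 in both cases.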

Macros are often used with loops. The following loop displays the results of applying summarize to every variable in the RECS dataset where the variable name begins with “age”:

foreach x of varlist age* {
    sum `x'
}

If you just want to obtain the names of these variables, use double quotes to prevent evaluation:

foreach x of varlist age* {
    display "`x'"
}
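Loops become more useful when combined with stored results. The sketch below, which uses the auto dataset that ships with Stata and an explicit list of variables chosen for illustration, prints one line per variable and also shows a forvalues loop over a range of numbers.

```stata
sysuse auto, clear
* Loop over an explicit list of variables and print each mean
foreach v of varlist mpg weight length {
    quietly summarize `v'
    display "`v': mean = " r(mean)
}
* forvalues loops over a range of numbers
forvalues k = 1/3 {
    display "k squared is " `k'^2
}
```

The quietly prefix suppresses the usual summarize output so that only the display lines are printed.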

Logistic regression with data cleaning

Download the data file messy_logistic.dta and transfer it to your Stata working directory (if you are using the SCS machines, you can use mfile to upload the file to your AFS directory). Then load it into your Stata session using:

use messy_logistic

We want to fit a logistic regression model of y regressed on x1 and x2. But note how y is coded:

tabulate y

It is mostly coded “yes” and “no”; we need to convert these to numeric values before we can do the regression analysis. Also, some of the values of y are strings containing several blanks. In Stata, empty strings are treated as missing values but strings consisting exclusively of whitespace are not. Also note that a few of the “yes” values are misspelled or have trailing whitespace. We can start by using subinstr to remove all the space characters.

replace y = subinstr(y, " ", "", .)
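To see the effect of this step in isolation, here is a sketch on a tiny made-up dataset (the values are invented for illustration and are not taken from messy_logistic.dta). After the spaces are removed, the whitespace-only string becomes empty and is therefore counted as missing.

```stata
clear
* A small hypothetical string variable with stray whitespace
input str8 y
"yes"
"yes  "
"   "
"no"
" no"
end
* Remove every space character from y
replace y = subinstr(y, " ", "", .)
* The whitespace-only value is now an empty string, i.e. missing
count if missing(y)
tabulate y
```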

Now check again with tabulate y and note that the sample size has decreased due to the missing values.

The variant spellings of “yes” and “no” can be corrected with “replace” statements.

replace y = "yes" if y == "Yes"
replace y = "yes" if y == "yse"

Since logit expects numbers, we encode our strings:

encode y, generate(y1)

At this point y1 is a numeric variable, but each numeric level of y1 is associated with a text label (taken from y). To see how the numerical codes map onto the categories, use:

tabulate y1
tabulate y1, nolabel

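The levels are coded 1 / 2 (encode assigns codes in alphabetical order, so once only “no” and “yes” remain, “no” is 1 and “yes” is 2), but for logistic regression we need the coding to be 0 / 1, so we define another variable: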

gen y2 = y1 - 1
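The encoding steps can be checked on a small made-up dataset (again, the values below are invented for illustration). This sketch shows the mapping created by encode and confirms that missing values stay missing after the shift to 0 / 1.

```stata
clear
* A small hypothetical string variable with one missing value
input str3 y
"yes"
"no"
"yes"
""
"no"
end
* encode assigns integer codes in alphabetical order of the strings
encode y, generate(y1)
* Show the mapping between codes and labels (1 = no, 2 = yes here)
label list y1
* Shift to 0/1; a missing y1 stays missing in y2
generate y2 = y1 - 1
tabulate y1 y2, missing
```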

We can now fit the logistic regression:

logit y2 x1 x2

Basic Stata commands

  • use, sysuse, import delimited, webuse for loading data

  • exit for closing Stata; you may need to clear first if you have changed the data set

  • help for getting information about Stata commands

  • display for evaluating an expression and printing the result

  • describe for displaying information about the current data set

  • do for running a “do file” (a Stata script)

  • list for displaying data in the current data set

  • tabulate for creating distribution tables

  • summarize for producing summary statistics

  • generate for creating a new variable

  • replace for changing the values of an existing variable

  • local for defining a macro

  • recode for recoding values in a variable

  • encode and decode for converting between string variables and categorical variables

  • e() for extracting results
