SAS is a programming language and a collection of data analysis procedures. The language is implemented in closed-source software sold by the SAS Institute. SAS is primarily a declarative language, meaning that your program specifies the desired result, and SAS works out the steps needed to produce it.
You can run SAS on the scs computers by connecting to those machines via ssh:
ssh uniqname@scs.itd.umich.edu
The next step is to create a SAS program in a text file, using a text editor like vim or emacs. Once you have created a SAS program, say in a file called analysis.sas, you can run it by typing the following at the shell prompt:

sas analysis.sas
After the program runs, a new file with suffix .log will appear in your working directory. If your program file is called analysis.sas, then the log file will be called analysis.log. If your program creates output, then the output will be contained in a file with suffix .lst; in this example, the output file will be called analysis.lst.
Another option for using SAS is to use a virtual desktop at midesktop.umich.edu. We will not teach that here, so if you want to go that route you will need to work out the details for yourself.
Below we will use the following three data sets to demonstrate SAS programming. You can use wget to download the data files without leaving the Linux shell.
The Residential Energy Consumption Survey (RECS) is a survey of residential building construction features and measures of energy consumption:
http://www.eia.gov/consumption/residential/data/2009/index.cfm?view=microdata
We will use both the SAS7BDAT and csv versions of the survey data files.
The National Health and Nutrition Examination Survey (NHANES) is a multi-wave cross-sectional survey of health measures. We will use two NHANES data files below: the demographics file (DEMO_F.XPT) and the blood pressure examination file (BPX_F.XPT).
We will also work with climate data from the U.S. National Climate Data Center. First, download any of the single-year data files from the NCDC web site:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year
Then download the ghcnd-stations.txt file from the same FTP site (it is located in the directory just above the by_year directory).
Note that some of these text files are compressed (gzipped CSV) files. You can leave them in gzip-compressed form, decompress them with gunzip (e.g. gunzip 2014.csv.gz), or convert them to the even more compact xz format, e.g. using gzip -dc 2014.csv.gz | xz > 2014.csv.xz. Note that conversion from gzip to xz format may take a long time for a large file. Below we will see how to read compressed text files into SAS without decompressing the file on disk.
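As a sketch of the compression workflow above (the file contents here are fabricated; the real 2014.csv.gz comes from the NCDC FTP site):

```shell
# Fabricate a tiny stand-in for 2014.csv.gz.
printf 'USC00010008,20140101,TMAX,106\n' | gzip > 2014.csv.gz

# Stream-decompress to stdout without writing an uncompressed copy to disk;
# this is the same pattern used below with SAS's "filename ... pipe".
gzip -dc 2014.csv.gz

# Convert to the smaller xz format (can be slow for large files):
# gzip -dc 2014.csv.gz | xz > 2014.csv.xz
```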
Loading data
A SAS program usually consists of two types of code blocks, called data steps and proc (for "procedure") steps. A program may contain zero, one, or multiple data and proc steps. A SAS program may also contain additional code elements such as variable and macro definitions. To read data into SAS we can use a data step or a proc. Techniques for loading various types of files are covered here.
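As a minimal sketch of this structure (the dataset name demo and the in-line values are fabricated for illustration), the following program contains one data step followed by one proc step:

```sas
* Data step: build a small temporary dataset from in-line values;
data demo;
    input x y;
    datalines;
1 2
3 4
;
run;

* Proc step: summarize the dataset we just created;
proc means data=demo;
    var x y;
run;
```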
Loading delimited data with import
There are several ways to read delimited data into a SAS program. Recall that “delimited data” are stored in a file where the fields are delimited by some character, such as a comma. The following is a complete SAS program that loads the RECS data from a CSV file:
proc import datafile='recs2009_public.csv' out=recs;
run;
The import procedure reads a delimited data file. In the call to the import procedure, we specify the datafile name and a name for the working dataset as options. The working dataset, called recs here, is an internal dataset that is available for use by subsequent procedures in the same program.
Note that every statement in a SAS program ends with a semicolon, and the program should end with a run statement (run is not always required, but there are a few situations where omitting the run statement will cause problems).
The output data name "recs" is a "one-level dataset name", since it does not contain a period (.). A one-level dataset name usually refers to a temporary dataset that will not be retained when the program exits. Below we will see how to use two-level names that allow us to save a dataset that is created by the SAS program.
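The distinction can be sketched as follows (the library name mydata and dataset names are assumptions, in the style of the examples below): a one-level name lives in the temporary work library, while a two-level name refers to a library defined with libname.

```sas
libname mydata '~';

* One-level name: stored as work.recs and deleted when the session ends;
data recs;
    x = 1;
run;

* Two-level name: saved as recs_copy.sas7bdat in the mydata library;
data mydata.recs_copy;
    set recs;
run;
```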
The program above loads the RECS dataset but exits without doing anything else. The next example loads the dataset, then uses proc print to print the first 10 rows of the data file. The printed results will appear in the listing (.lst) file corresponding to your program name.
proc import datafile='recs2009_public.csv' out=recs;
proc print data=recs(obs=10);
run;
The behavior of the print procedure can be modified using options (which appear on the same line as the proc statement) and statements (which appear below the proc statement). For example, if you want to print only a few of the variables, you can use a var statement as follows:
proc import datafile='recs2009_public.csv' out=recs;
proc print data=recs(obs=10);
var DOEID NWEIGHT;
run;
Loading a SAS7BDAT data file
Next we will rewrite the above program to load the RECS data from a SAS7BDAT format datafile. SAS7BDAT is the native binary format for SAS data files. SAS7BDAT files tend to be larger than text files containing the same data, but they can be processed efficiently since it is sometimes possible for SAS to read only the parts of the file that are used in a particular analysis. Also, as shown below, SAS7BDAT files can be passed directly into SAS procedures without an accompanying import procedure or data step.
First, identify the directory where the SAS7BDAT file is stored, then use libname to define a SAS data library using this directory name. You can then use this data library in any SAS procedure by specifying the data option with a two-level name beginning with your library name. For example, if the data are in the directory /home/data we could use the following program:
libname mydata '/home/data';
proc print data=mydata.recs2009_public_v3(obs=10);
var DOEID NWEIGHT;
run;
You can use the tilde character ~ in libname to indicate your home directory. For example, the following program looks for the RECS data file in your home directory:
libname mydata '~';
proc print data=mydata.recs2009_public_v3(obs=10);
var DOEID NWEIGHT;
run;
A library defined by libname can refer to any directory on your system. For example, libname mydata '~/my_project' defines a library in the my_project directory located within your home directory. If you omit the tilde character, the path is interpreted as an absolute path on the file system. For example, libname locdata '/var/tmp' defines a library in the /var/tmp directory.
Note that a library name such as mydata or locdata in the examples above is not an actual directory or file name on your computer. It is a "handle" that exists inside your SAS program to refer to the actual directory. You can define several libraries in the same program if needed.
To confirm that we have loaded the data successfully, we will run the contents procedure. This procedure places a detailed description of the contents of the dataset into the listing (.lst) file. Note that the SAS7BDAT version of the RECS data file contains long variable descriptions, which are not included in the CSV file.
libname mydata '~';
proc contents data=mydata.recs2009_public_v3;
run;
Loading data containing text and dates
Some data types require special treatment when being read by SAS. To illustrate text and date handling, we will work with the 2014 climate data from the U.S. National Climate Data Center. These data are in long format, and contain measurements of specific weather variables such as daily maximum temperature (TMAX) or precipitation (PRCP). Each row contains one record for a particular day at a particular station, for a particular weather variable. Below we use a SAS data step to read the data into a dataset called ghcnd.
filename ghcnd_gz pipe "xz -dc 2014.csv.xz" lrecl=80;
data ghcnd;
infile ghcnd_gz dsd delimiter=',';
input station : $11. date : yymmdd8. obstype : $ obsval;
format date mmddyy10.;
If you want to use the gzip format file instead of the xz format file, replace the filename line with:

filename ghcnd_gz pipe "gzip -dc 2014.csv.gz" lrecl=80;
In the code above we are reading the data using infile, which gives us more control than using import. There are two important options on the infile line. First, we include the dsd option, which tells SAS to treat consecutive delimiters as indicating a missing value (the same as read.table and fread in R). You should almost always include this option. Second, we specify the delimiter as a comma. Note that for tab delimiters you should use '09'x as the delimiter. The climate data file contains no header, so we name the variables on the input line.
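For example, a hypothetical tab-delimited file values.txt (the file name and variable names are assumptions) could be read with a sketch like this:

```sas
* '09'x is a hexadecimal character literal for the tab character;
data tabdata;
    infile 'values.txt' dsd delimiter='09'x;
    input station : $11. obsval;
run;
```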
The syntax of the input statement is complex; see the SAS documentation for further details. The approach used in the example above is called modified list input. This means that we provide a delimiter and the variable names, and for some of the variables we also provide a format specifier. The colon allows SAS to determine which terms in the input statement are format specifiers and which are variable names.
A character variable format specifier begins with $, followed optionally by the width of the field. There are a number of date format specifiers. The format yymmdd8. is used for dates coded as digits without any separating characters: yymmdd8. reads dates of the form 20140623 (for June 23, 2014), and yymmdd6. reads dates of the form 140623. SAS date variables are internally coded as the number of days since a reference point (January 1, 1960). It is helpful to give the date variable an output format, so that the date values are easy to read in any output that we produce.
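The following sketch (with a fabricated in-line date) shows the yymmdd8. informat on input and the mmddyy10. format on output:

```sas
data d;
    input date : yymmdd8.;
    * Internally stored as days since January 1, 1960;
    * displayed as 06/23/2014 because of the output format;
    format date mmddyy10.;
    datalines;
20140623
;
run;

proc print data=d;
run;
```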
Reading fixed-width files
Another type of text data format that you may encounter is a "fixed width file", in which each variable occupies a defined range of columns in the file. To demonstrate reading fixed width files, we will load the ghcnd-stations.txt file into SAS. The structure of this file is described in the readme.txt file located at the same site as the data file. Based on these column specifications, we can read the fixed-width file as follows:
data stations;
infile 'ghcnd-stations.txt' dsd;
input station $ 1-11 lat 13-20 lon 22-30 elev 32-37 state $ 39-40
name $ 42-71 gsn $ 73-75 hcn 77-79 wmo 81-85;
The input statement above uses column input. This means that every column in the file is specified by two or three tokens: the variable name, an optional format modifier, and the range of columns that the data occupy. Note that when using column input you do not use a colon after the variable name.
Tabular summaries
We can summarize the variables in a dataset using the tabulate procedure. Below we summarize the regionc variable in the RECS dataset, which contains the census region code for each residence. We specify the summary statistics that we wish to compute using the syntax variable*stat, where variable is the variable name and stat is the statistic. Here we compute the number of observations, and the minimum and maximum values of the region code.
libname mydata '~';
proc tabulate data=mydata.recs2009_public_v3;
var regionc;
table regionc*N regionc*max regionc*min;
run;
If we want to tabulate after "grouping by" another variable, then we use a class variable. After defining the class variable, we then use one of two syntax patterns to create the summary table. One option is to use the pattern "classvar, variable*stat" to summarize the variable named variable using the statistic stat while stratifying by another variable named classvar. In this case the regionc strata will appear as multiple rows of the summary table. Note that the cdd65 variable below is the number of "cooling degree days" at the location of a residential structure.
libname mydata '~';
proc tabulate data=mydata.recs2009_public_v3;
var cdd65;
class regionc;
table regionc, cdd65*N cdd65*mean;
run;
The other approach is to use variable*stat*classvar, which places the levels of classvar in the columns of the summary table:
libname mydata '~';
proc tabulate data=mydata.recs2009_public_v3;
var cdd65;
class regionc;
table cdd65*N*regionc cdd65*mean*regionc;
run;
We can combine these two approaches to create a summary that is stratified by two variables:
libname mydata '~';
proc tabulate data=mydata.recs2009_public_v3;
var cdd65;
class regionc typehuq;
table regionc, cdd65*mean*typehuq;
run;
Descriptive statistics
There are several basic SAS procedures for obtaining descriptive statistics, including the means, freq, summary, and univariate procedures. Note that there is a lot of overlap between the means, summary, and univariate procedures. For your own code it might be better to consistently use one of them, but you need to understand all three procedures because you will see them in code written by other people.
The means procedure overlaps somewhat with the tabulate procedure. Below we calculate a set of descriptive statistics (number of observations, mean, minimum, and maximum value) for the variable cdd65, within each level of the regionc variable. Many SAS procedures use the class or by statement to apply an operation separately to subsets of the data defined by the levels of a given categorical variable. First we have an example using the class statement.
libname mydata '~';
proc means data=mydata.recs2009_public_v3;
var cdd65;
class regionc;
run;
Next we obtain the same result using the by statement. Note that when by is used, it is necessary to first sort the data so that the values of the by variable appear in ascending order. Note also that when a procedure does not name its input with a data option, the most recently created dataset becomes its input. So the sort procedure's output becomes the input of the means procedure.
libname mydata '~';
proc sort data=mydata.recs2009_public_v3;
by regionc;
proc means;
var cdd65;
by regionc;
run;
It is also possible to cross-stratify the means by two or more factors. We illustrate this in the program below using the by statement. Note that we must first sort the data by both of the by variables.
libname mydata '~';
proc sort data=mydata.recs2009_public_v3;
by regionc ur;
proc means;
var cdd65;
by regionc ur;
run;
Another useful procedure for calculating descriptive statistics is the freq procedure, which is mainly used for tabulating frequencies. Below is a simple example of the freq procedure that calculates the frequency distribution of the year range during which a house was first occupied:
libname mydata '~';
proc freq data=mydata.recs2009_public_v3;
tables occupyyrange / out=mydata.ocrng_freq;
run;
Note also that in the above program we specified an out option to the tables statement. This produces a SAS7BDAT format datafile in the data directory containing the table information.
Finally we will discuss the summary procedure. This procedure is usually used to create a new dataset that contains summary statistics calculated from another dataset. Most often the summary procedure will involve a by or class statement, because it is rarely interesting to calculate summary statistics on individual records.
The following SAS program calculates the mean and standard deviation of electricity usage (kwh) within each region, after restricting to single-family homes (typehuq = 2). These summary statistics are then printed to the listing file.
libname mydata '~';
data recs;
set mydata.recs2009_public_v3;
if typehuq = 2;
keep doeid regionc kwh;
proc summary;
class regionc;
output out=meanstats
mean(kwh) = mean_kwh
std(kwh) = std_kwh;
proc print data=meanstats;
Selecting data
In many situations you will want to do your analysis using a subset of the overall data set. You can accomplish this by adding a data step to your program. The following program selects only the houses in rural areas and creates a new dataset called rural. This dataset is then used in a freq procedure.
libname mydata '~';
data rural;
set mydata.recs2009_public_v3;
if ur='R';
proc freq data=rural;
tables occupyyrange;
run;
Here is an equivalent way to do the same thing:
libname mydata '~';
data rural;
set mydata.recs2009_public_v3;
if ur='U' then delete;
proc freq data=rural;
tables occupyyrange;
run;
Defining new variables
New variables can be added to the dataset in the data step. The program below adds the number of full and half baths to create a new variable nbath that contains the total number of baths. We then use the freq procedure to confirm that the variable we obtained is correct.
libname mydata '~';
data recs;
set mydata.recs2009_public_v3;
nbath = ncombath + nhafbath;
proc sort data=recs;
by ncombath nhafbath;
proc freq data=recs;
tables nbath;
by ncombath nhafbath;
run;
Saving datasets
Using a data step you can save a dataset, for example a subset or processed version of your raw data. The following program saves the data for only the rural houses in the RECS survey:
libname mydata '~';
data rural;
set mydata.recs2009_public_v3;
if ur='U' then delete;
data mydata.recs_reduced;
set rural;
run;
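The two data steps above can also be collapsed into one, since a data step can write directly to a two-level name; this equivalent sketch filters and saves in a single step:

```sas
libname mydata '~';

* Filter and save in a single data step;
data mydata.recs_reduced;
    set mydata.recs2009_public_v3;
    if ur='U' then delete;
run;
```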
Merging datasets
To demonstrate merging datasets, we will use the two NHANES files described above. These files are in SAS XPORT format, so we need to use a slightly different approach to defining a library handle pointing to the data file:
libname demog xport 'DEMO_F.XPT';
Below we do the merge, and calculate the mean systolic blood pressure by gender (which requires the merged dataset):
libname demog xport 'DEMO_F.XPT';
libname bpx xport 'BPX_F.XPT';
data nhanes;
merge demog.DEMO_F bpx.BPX_F;
by SEQN;
proc means;
var bpxsy1;
class riagendr;
run;
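The mechanics of a match-merge can be sketched with two small fabricated datasets that share a key variable id; records with the same key value are combined into one row:

```sas
data left;
    input id x;
    datalines;
1 10
2 20
;
run;

data right;
    input id y;
    datalines;
1 100
2 200
;
run;

* Match-merge on the shared key; both inputs must be sorted by id;
data both;
    merge left right;
    by id;
run;

proc print data=both;
run;
```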
Data step programming
The following SAS program applies two filters to the rows of a dataset. First, we retain only the single-family homes (typehuq = 2). Second, we retain only the houses that used at least 15000 kilowatt hours of electricity (kwh ge 15000); note that ge is used for "greater than or equal to" comparisons. We also retain only a few of the many columns of the original dataset. After the data step we run proc contents so that we can see what the data set contains.
libname mydata '~';
data recs;
set mydata.recs2009_public_v3;
if typehuq = 2;
if kwh ge 15000;
keep doeid regionc kwh;
proc contents;
run;
We can use multiple data steps to process the data in more complex ways. The following program first determines the mean and standard deviation of electricity usage within each region. We then "remerge" the mean and standard deviation values back into the original dataset so that we can filter on them. The goal of this analysis is to determine how many houses within each region are more than one standard deviation above the mean electricity usage for their region.
libname mydata '~';
data recs1;
set mydata.recs2009_public_v3;
if typehuq = 2;
keep doeid regionc kwh;
proc sort data=recs1 out=recs2;
by regionc;
proc summary;
class regionc;
output out=meanstats1
mean(kwh) = mean_kwh
std(kwh) = std_kwh;
proc sort data=meanstats1 out=meanstats2;
by regionc;
data recs3;
merge recs2 meanstats2(keep=regionc mean_kwh std_kwh);
by regionc;
data recs4;
set recs3;
high_kwh = mean_kwh + std_kwh;
if kwh ge high_kwh;
proc summary;
class regionc;
output out=high_kwh;
proc print data=high_kwh;
run;
When working with grouped datasets, you can process a dataset sorted by the grouping variables and determine when the first record of a new group begins, using the automatic first. variable as shown below:
data recs2;
set recs;
by regionc;
if first.regionc;
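To sketch how the first. flag behaves (with fabricated data), the following program keeps only the first record of each group:

```sas
data g;
    input grp $ val;
    datalines;
a 1
a 2
b 3
;
run;

proc sort data=g;
    by grp;
run;

* first.grp is 1 on the first record of each grp value, 0 otherwise;
data firsts;
    set g;
    by grp;
    if first.grp;
run;

proc print data=firsts;
run;
```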
Pivoting
To pivot between long and wide format data sets in SAS, use the transpose procedure. Below is a complete example of pivoting that takes the daily temperature data, summarizes it as monthly averages for each station, then pivots the month variable into the columns so that we have a wide format dataset in which the stations are in the rows and the months are in the columns. Since the temperature data are stored as tenths of a degree, we divide by 10 to restore the conventional units of degrees Celsius.
filename ghcnd_gz pipe "gzip -dc 2014.csv.gz" lrecl=80;
data ghcnd(rename=(obsval=tmax));
infile ghcnd_gz dsd delimiter=',';
input station : $11. date : yymmdd8. obstype $ obsval;
format date mmddyy10.;
month = month(date);
if obstype = "TMAX";
obsval = obsval / 10;
proc summary data=ghcnd nway;
class station month;
output out=meanstats
mean(tmax) = mntmp;
proc transpose data=meanstats(drop=_TYPE_) out=meanstats2 prefix=month;
by station;
id month;
var mntmp;
proc print data=meanstats2(obs=10);
run;
SAS and SQL
The SQL procedure within SAS allows SQL-like syntax to be used to do things that would otherwise be done in various data and proc steps. There are two important syntax patterns within most SQL procedures:
First, every SQL procedure must contain a select statement, most commonly in the form select ... from .... This means that we are selecting specified columns from a given dataset.
Second, any SQL procedure that creates a dataset must contain the pattern create table ... as. This creates a table with a given name, usually based on the following select statement.
For example, the following SAS program uses the SQL procedure to select three variables from a dataset, then creates a new dataset from them. Note that SQL procedure blocks always end with the quit statement. Also note that inside a SQL procedure call, lists of variable names are separated by commas (which differs from standard SAS syntax).
libname mydata '~';
proc sql;
create table recs as
select doeid, regionc, kwh
from mydata.recs2009_public_v3;
quit;
proc contents data=recs;
run;
We can use the SQL procedure to create new variables. The syntax pattern for this is ... as ...; for example, mean(kwh) as mean_kwh creates a variable named mean_kwh that is calculated by taking the mean of the kwh variable. Furthermore, we have included a group by statement, which causes the means to be calculated within levels of another variable (regionc in this case).
libname mydata '~';
proc sql;
create table recs as
select mean(kwh) as mean_kwh, std(kwh) as std_kwh
from mydata.recs2009_public_v3
group by regionc;
quit;
proc print data=recs;
run;
Another important SAS SQL statement is where, which restricts the created table based on a condition that is checked for each row. The following program calculates the mean and standard deviation of electricity usage for all single family homes (typehuq = 2), grouping by region.
libname mydata '~';
proc sql;
create table recs as
select mean(kwh) as mean_kwh, std(kwh) as std_kwh
from mydata.recs2009_public_v3
where typehuq = 2
group by regionc;
quit;
proc print data=recs;
run;
The having statement is related to the where statement, except that it filters entire groups based on a summary function calculated for the group, rather than filtering individual rows based on their data. The following program excludes all reportable domain units for which the average value of hdd65 is less than 2000.
libname mydata '~';
proc sql;
create table recs as
select doeid, reportable_domain, mean(hdd65) as mean_hdd65, cufeetng
from mydata.recs2009_public_v3
where typehuq = 2
group by reportable_domain
having mean(hdd65) ge 2000;
quit;
proc print data=recs;
run;
Below is a complete program using the SQL procedure to calculate the mean temperature within each cell of a station by month cross classification. This illustrates the use of two variables in a group by operation. Compare this to the proc summary code above using nway.
filename ghcnd_gz pipe "gzip -dc 2014.csv.gz" lrecl=80;
data ghcnd(rename=(obsval=tmax));
infile ghcnd_gz dsd delimiter=',';
input station : $11. date : yymmdd8. obstype $ obsval;
format date mmddyy10.;
month = month(date);
if obstype = "TMAX";
obsval = obsval / 10;
proc sql;
select station, month, mean(tmax) as mntmp
from ghcnd
group by station, month;
quit;
run;