Statistics 506, Fall 2016

Basic SAS


SAS is a programming language and a collection of data analysis procedures. The language is implemented in closed-source software sold by the SAS corporation. SAS is primarily a declarative language, meaning that your program specifies the desired result, and SAS figures out the steps needed to produce the result.

You can run SAS on the scs computers by connecting to those machines via ssh:

ssh uniqname@scs.itd.umich.edu

The next step is to create a SAS program in a text file. You will do this using a text editor like vim or emacs. Once you create a SAS program, say in a file called analysis.sas, then you can run the program by typing the following at the shell prompt:

sas analysis.sas

After the program is run, a new file with suffix .log will appear in your working directory. If your program file is called analysis.sas, then the log file will be called analysis.log.

If your program creates output, then the output will be contained in a file with suffix .lst. In this example, the output file will be called analysis.lst.

Another option for using SAS is using a virtual desktop at midesktop.umich.edu. We will not teach that here so if you want to go that route you will need to figure out the details for yourself.

Below we will use the following three data sets to demonstrate SAS programming. You can use wget to download the data files without leaving the linux shell.

Note that some of these text files are compressed text (csv) files. You can leave them in gzip compressed form, decompress with gunzip, e.g. gunzip 2014.csv.gz, or convert to an even more compressed xz format file, e.g. using gzip -dc 2014.csv.gz | xz > 2014.csv.xz. Note that conversion from gzip to xz format may take a long time for a large file. Below we will see how to read compressed text files into SAS without decompressing the file on the disk.

Loading data

A SAS program usually consists of two types of code blocks called data steps and proc (for “procedure”) steps. A program may contain zero, one, or multiple data and proc steps. A SAS program may also contain additional code elements such as variable and macro definitions. To read data into SAS we can use a data step or a proc. Techniques for loading various types of files are covered here.

Loading delimited data with import

There are several ways to read delimited data into a SAS program. Recall that “delimited data” are stored in a file where the fields are delimited by some character, such as a comma. The following is a complete SAS program that loads the RECS data from a CSV file:

proc import datafile='recs2009_public.csv' out=recs;
run;

The import procedure reads a delimited data file. In the call to the import procedure, we specify the datafile name, and a name for the working dataset as options. The working dataset, called recs here, is an internal dataset that is available for use by subsequent procedures in the same program.

Note that every line of a SAS program ends with a semicolon, and the program must end with a run statement (run is not always required but there are a few situations where omitting the run statement will cause problems).

The output data name “recs” is a “one level dataset name”, since it does not contain a point (.). A one-level dataset name usually refers to a temporary dataset that will not be retained when the program exits. Below we will see how to use two-level names that allow us to save a dataset that is created by the SAS program.

The program above loads the RECS dataset but exits without doing anything. The next example loads the dataset, then uses proc print to print the first 10 rows of the data file. The printed results will appear in the listing (.lst) file corresponding to your program name.

proc import datafile='recs2009_public.csv' out=recs;

proc print data=recs(obs=10);

run;

The behavior of the print procedure can be modified using options (which appear on the same line as the proc call) and statements (which appear below the proc call). For example, if you want to print only a few of the variables, you can use a var statement as follows:

proc import datafile='recs2009_public.csv' out=recs;

proc print data=recs(obs=10);
    var DOEID NWEIGHT;

run;

Loading a SAS7BDAT data file

Next we will rewrite the above program to load the RECS data from a SAS7BDAT format datafile. SAS7BAT is the native binary format for SAS data files. SAS7BDAT files tend to be larger than text files containing the same data, but can be processed efficiently since it is sometimes possible for SAS to read only the parts of the file that are used in a particular analysis. Also, as shown below, SAS7BDAT files can be passed directly into SAS procedures without an accompanying import procedure or data step.

First, identify the directory where the SAS7BDAT file is stored, then use libname to define a SAS data library using this directory name. Then you can use this data library in any SAS procedure by specifying the data option with a two-level name beginning with your library name. For example, if the data are in the directory /home/data we could use the following program:

libname mydata '/home/data';

proc print data=mydata.recs2009_public_v3(obs=10);
    var DOEID NWEIGHT;

run;

You can use the tilde character ~ in libname to indicate your home directory. For example, the following program looks for the RECS data file in your home directory:

libname mydata '~';

proc print data=mydata.recs2009_public_v3(obs=10);
    var DOEID NWEIGHT;

run;

A library defined by libname can refer to any directory on your system. For example,

libname mydata '~/my_project'

defines a library in the directory my_project located within your home directory. If you exclude the tilde character, you are refering to directories on the file system of your local computer. For example,

libname locdata '/var/tmp'

defines a library in the /var/tmp directory of your local machine.

Note that a library name such as mydata or locdata in the examples above is not an actual directory or file name on your computer. It is a “handle” that exists inside your SAS program to refer to the actual directory. You can define several libraries in the same program if needed.

To confirm that we have loaded the data successfully, we will run the contents procedure. This procedure places a detailed description of the contents of the dataset into the listing (.lst) file. Note that the SAS7BDAT version of the RECS data file contains long variable descriptions, which are not included in the CSV file.

libname mydata '~';

proc contents data=mydata.recs2009_public_v3;

run;

Loading data containing text and dates

Some data types require special treatment when being read by SAS. To illustrate text and date handling, we will work with the 2014 climate data from the U.S. National Climate Data Center. These data are in long format, and contain measurements of specific weather variables such as daily maximum temperature (TMAX) or precipitation (PRECIP). Each row contains one record for a particular day at a particular station, for a particular weather variable. Below we use a SAS data step to read the data into a dataset called ghcnd.

filename ghcnd_gz pipe "xz -dc 2014.csv.xz" lrecl=80;

data ghcnd;
    infile ghcnd_gz dsd delimiter=',';
    input station : $11. date : yymmdd8. obstype : $ obsval;
    format date mmddyy10.;

If you want to use the gzip format file instead of the xz format file, replace the filename line with

filename ghcnd_gz pipe "gzip -dc 2014.csv.gz" lrecl=80;

In the code above we are reading the data using infile, which gives us more control than when using import. There are two important options on the infile line. First, we include the dsd option. This tells SAS to treat consecutive delimiters as indicating a missing value (the same as read.table and fread in R). You should almost always incude this option. Also, we specify the delimiter as a comma. Note that for tab delimiters you should use '09'x as the delimiter.

The climate data file contains no header, so we name the variables on the input line.

The syntax of the input line is complex, further documentation is here. The approach used in the example above is called modified list input. This means that we provide a delimiter and the variable names, and for some of the variables we also provide a format specifier. The colon allows SAS to determine which terms in the input statement are format specifiers and which are variable names.

A character variable format specifier begins with $, followed optionally by the width of the field. There are a number of date format specifiers. The format yymmdd8 is used for dates coded as a digits without any separating characters. The format yymmdd8 reads dates of the form 20140623 (for June 23, 2014) and the format yymmdd6 reads dates of the form 140623. These SAS date variables will be internally coded as the days since a reference point (usually January 1, 1960). It is helpful to give the date variable an output format, so that the date values are easy to read in any output that we produce.

Reading fixed-width files

Another type of text data format that you may encounter is a “fixed width file”, in which each variable occupies a defined range of columns in the file. To demonstrate reading fixed width files, we wil load the ghcnd-stations.txt into SAS. The structure of this file is described in the readme.txt file that is located in the same site where the data file is located. Based on these column specifications, we can read the fixed-width file as follows:

data stations;
    infile 'ghcnd-stations.txt' dsd;
    input station $ 1-11 lat 13-20 lon 22-30 elev 32-37 state $ 39-40
         name $ 42-71 gsn $ 73-75 hcn 77-79 wmo 81-85;

The input statement above uses column input. This means that every column in the file is specified by two or three tokens: the variable name, an optional format modifier, and the columns that the data occupy. Note that when using column input you do not use a colon after the variable name.

Tabular summaries

We can summarize the variables in a dataset using the tabulate procedure. Below we summarize the regionc variable in the RECS dataset, which contains the census region code for each residence. We specify the summary statistics that we wish to compute using the syntax variable*stat, where variable is the variable name and stat is the statistic. Here we compute the number of observations, and the minimum and maximum values of the region code.

libname mydata '~';

proc tabulate data=mydata.recs2009_public_v3;
    var regionc;
    table regionc*N regionc*max regionc*min;

run;

If we want to tabulate after “grouping by” another variable, then we use a class variable. After defining the class varible we then use one of two different syntax patterns to create the summary table. One option is to use the pattern “classvar, variable*stat” to summarize the variable named variable using the statistic stat while stratifying by another variable named classvar. In this case the regionc strata will appear as mutiple rows of the summary table. Note that the cdd65 variable below is the number of “cooling degree days” at the location of a residential structure.

libname mydata '~';

proc tabulate data=mydata.recs2009_public_v3;
    var cdd65;
    class regionc;
    table regionc, cdd65*N cdd65*mean;

run;

The other approach is to use variable*stat*classvar, which places the levels of classvar in the columns of the summary table:

libname mydata '~';

proc tabulate data=mydata.recs2009_public_v3;
    var cdd65;
    class regionc;
    table cdd65*N*regionc cdd65*mean*regionc;

run;

We can combine these two approaches to create a summary that is stratified by two variables:

libname mydata '~';

proc tabulate data=mydata.recs2009_public_v3;
    var cdd65;
    class regionc typehuq;
    table regionc, cdd65*mean*typehuq;

run;

Descriptive statistics

There are several basic SAS procedures for obtaining descriptive statistics, including the means, freq, summary, and univariate procedures.

Note that there is alot of overlap between the means, summary and univariate procedures. For your own code it might be better to consistently use one of them. But you need to understand all three procedures because you will see them in code written by other people.

The means procedure overlaps somewhat with the tabulate procedure. Below we calculate a set of descriptive statistics (number of observations, mean, minimum, and maximum value) for the variable cdd65, within each level of the regionc variable.

Many SAS procedures use the class or by statements to apply an operation separately to subsets of the data defined by the levels of a given categorical variable. First we have an example using the class statement.

libname mydata '~';

proc means data=mydata.recs2009_public_v3;
    var cdd65;
    class regionc;

run;

Next we obtain the same result using the by statement. Note that when by is used, it is necessary to first sort the data so that the values of the by variable appear in ascending order. Note also that when we have multiple procedures in a program, the “output” dataset for each procedure becomes the input dataset for the following procedure. So the sort procedure’s output becomes the input of the means procedure.

libname mydata '~';

proc sort data=mydata.recs2009_public_v3;
    by regionc;

proc means;
    var cdd65;
    by regionc;

run;

It is also possible to cross-stratify the means by more than two factors. We illustrate this in the program below using the by statement. Note that we must first sort the data by both of the by variables.

libname mydata '~';

proc sort data=mydata.recs2009_public_v3;
    by regionc ur;

proc means;
    var cdd65;
    by regionc ur;

run;

Another useful procedure for calculating descriptive statistics is the freq procedure, which is mainly used for tabulating frequencies. Below is a simple example of the freq procedure that calculates the frequency distribution of the year range during which a house was first occupied:

libname mydata '~';

proc freq data=mydata.recs2009_public_v3;
    tables occupyyrange / out=mydata.ocrng_freq;

run;

Note also that in the above program we specified an out option to the tables statement. This produces a sas7bdat format datafile in the data directory containing the table information.

Finally we will discuss the summary procedure. This procedure is usually used to create a new dataset that contains summary statistics calculated from another dataset. Most often the summary procedure will involve a by or class statement, because it is uninteresting to calculate summary statistics on individual records.

The following SAS program calculates the mean and standard deviation of electricity usage (kwh) within each region. These summary statistics are then printed to the listing file.

libname mydata '~';

data recs;
    set mydata.recs2009_public_v3;
    if typehuq = 2;
    keep doeid regionc kwh;

proc summary;
    class regionc;
    output out=meanstats
           mean(kwh) = mean_kwh
           std(kwh) = std_kwh;

proc print data=meanstats;

Selecting data

In many situations you will want to do your analysis using a subset of the overall data set. You can accomplish this by adding a data step to your program. The following program selects only the houses in rural areas and creates a new dataset called rural. This dataset is then used in a freq procedure.

libname mydata '~';

data rural;
    set mydata.recs2009_public_v3;
    if ur='R';

proc freq data=rural;
    tables occupyyrange;

run;

Here is an equivalent way to do the same thing:

libname mydata '~';

data rural;
    set mydata.recs2009_public_v3;
    if ur='U' then delete;

proc freq data=rural;
    tables occupyyrange;

run;

Defining new variables

New variables can be added to the dataset in the data step. The program below adds the number of full and half baths to create a new variable nbath that contains the total number of baths. We then use the freq procedure to confirm that the variable we obtained is correct.

libname mydata '~';

data recs;
    set mydata.recs2009_public_v3;
    nbath = ncombath + nhafbath;

proc sort data=recs;
    by ncombath nhafbath;

proc freq data=recs;
    tables nbath;
    by ncombath nhafbath;

run;

Saving datasets

Using a data step you can save a dataset, for example a subset or processed version of your raw data. The following program saves the data for only the rural houses in the RECS survey:

libname mydata '~';

data rural;
    set mydata.recs2009_public_v3;
    if ur='U' then delete;

data mydata.recs_reduced;
    set rural;

run;

Merging datasets

To demonstrate merging datasets, we will use the two NHANES files described above. These files are in SAS XPORT format, so we need to use a slightly different approach to defining a library handle pointing to the data file:

libname demog xport 'DEMO_F.XPT';

Below we do the merge, and calculate the mean systolic blood pressure by gender (which requires the merged dataset):

libname demog xport 'DEMO_F.XPT';
libname bpx xport 'BPX_F.XPT';

data nhanes;
    merge demog.DEMO_F bpx.BPX_F;
        by SEQN;

proc means;
    var bpxsy1;
    class riagendr;

run;

Data step programming

The following SAS program applies two filters to the rows of a dataset. First, we only retain the single-family homes (typehuq = 2). Second, we only retain the houses that used more than 15000 kilowatt hours of electricity (kwh ge 15000). Note that ge is used for “greater than or equal to” comparisons. We also retain only a few of the many columns of the original dataset. After the data step we run a proc contents so that we can see what the data set contains.

libname mydata '~';

data recs;
    set mydata.recs2009_public_v3;
    if typehuq = 2;
    if kwh ge 15000;
    keep doeid regionc kwh;

proc contents;

run;

We can use multiple data steps to process the data in more complex ways. The following program first determines the mean and standard deviation of electricity usage within each region. We then “remerge” the mean and standard deviation values back into the original dataset so we can filter on them. The goal of this analysis is to determine how many houses within each region are more than one standard deviation above the mean electricity usage for their region.

libname mydata '~';

data recs1;
    set mydata.recs2009_public_v3;
    if typehuq = 2;
    keep doeid regionc kwh;

proc sort data=recs1 out=recs2;
    by regionc;

proc summary;
    class regionc;
    output out=meanstats1
        mean(kwh) = mean_kwh
        std(kwh) = std_kwh;

proc sort data=meanstats1 out=meanstats2;
    by regionc;

data recs3;
    merge recs2 meanstats2(keep=regionc mean_kwh std_kwh);
    by regionc;

data recs4;
    set recs3;
    high_kwh = mean_kwh + std_kwh;
    if kwh ge high_kwh;

proc summary;
    class regionc;
    output out=high_kwh;

proc print data=high_kwh;

run;

When working with datasets that are grouped, there are ways to process a dataset sorted by the grouping variables, and to determine when the first record of a new group begins, as shown below:

data recs2;
    set recs;
    by regionc;
    if first.regionc;

Pivoting

To pivot between long and wide format data sets in SAS, use the transpose procedure. Below is a complete example of pivoting that takes the daily temperature data, summarizes it as monthly averages for each station, then pivots the month variable into the columns so that we have a wide format dataset in which the stations are in the rows and the months are in the columns. Since the temperature data are stored as tenths of a degree, we divide by 10 to restore the conventional units of degrees Fahrenheit.

filename ghcnd_gz pipe "gzip -dc 2014.csv.gz" lrecl=80;

data ghcnd(rename=(obsval=tmax));
    infile ghcnd_gz dsd delimiter=',';
    input station : $11. date : yymmdd8. obstype $ obsval;
    format date mmddyy10.;
    month = month(date);
    if obstype = "TMAX";
    obsval = obsval / 10;

proc summary data=ghcnd nway;
    class station month;
    output out=meanstats
        mean(tmax) = mntmp;

proc transpose data=meanstats(drop=_TYPE_) out=meanstats2 prefix=month;
    by station;
    id month;
    var mntmp;

proc print data=meanstats2(obs=10);

run;

SAS and SQL

The SQL procedure within SAS allows SQL-like syntax to be used to do things that would otherwise be done in various data and proc steps. There are two important syntax patterns within most SQL procedures:

First, every SQL procedure must contain a select statement. The most common way this occurs is in the form select ... from .... This means that we are selecting specified columns from a given dataset.

Second, any SQL procedure that creates a dataset must contain the pattern create table ... as. This creates a table with a given name, usually based on the following select statement.

For example, the following SAS program uses the SQL procedure to select three variables from a dataset, then creates a new dataset from them. Note that SQL procedure blocks always end with the quit statement.

Also note that inside a SQL procedure call, lists of variable names are separated by commas (which differs from standard SAS syntax).

libname mydata '~';

proc sql;

    create table recs as
        select doeid, regionc, kwh
        from mydata.recs2009_public_v3;

    quit;

proc contents data=recs;

run;

We can use the SQL procedure to create new variables. The syntax pattern for this is ... as ..., for example, mean(kwh) as mean_kwh creates a variable with name mean_kwh that is calculated by taking the mean of the kwh variable. Furthermore, we have included a group by statement which causes the means to be calculated within levels of another variable (region in this case).

libname mydata '~';

proc sql;

    create table recs as
        select mean(kwh) as mean_kwh, std(kwh) as std_kwh
        from mydata.recs2009_public_v3
        group by regionc;
    quit;

proc print data=recs;

run;

Another import SAS SQL statement is where, which restricts the created table based on a condition that is checked for each row. The following program caculates the mean and standard deviation of electricity usage for all single family homes (typehuq = 2), grouping by region.

libname mydata '~';

proc sql;

    create table recs as
        select mean(kwh) as mean_kwh, std(kwh) as std_kwh
        from mydata.recs2009_public_v3
        where typehuq = 2
        group by regionc;
    quit;

proc print data=recs;

run;

The having statement is related to the where statement, except that it filters entire groups based on a summary function calculated for the group, rather than filtering individual rows based on their data. The following program excludes all reportable domain units for which the average value of hdd65 is less than 2000.

libname mydata '~';

proc sql;

    create table recs as
        select doeid, reportable_domain, mean(hdd65) as mean_hdd65, cufeetng
        from mydata.recs2009_public_v3
        where typehuq = 2
        group by reportable_domain
        having mean(hdd65) ge 2000;
    quit;

proc print data=recs;

run;

Below is a complete program using proc SQL to calculate the mean temperature within each cell of a temperature by month cross classification. This illustrates the use of two variables in a group by operation. Compare this to the proc summary code above using nway.

filename ghcnd_gz pipe "gzip -dc 2014.csv.gz" lrecl=80;

data ghcnd(rename=(obsval=tmax));
    infile ghcnd_gz dsd delimiter=',';
    input station : $11. date : yymmdd8. obstype $ obsval;
    format date mmddyy10.;
    month = month(date);
    if obstype = "TMAX";
    obsval = obsval / 10;

proc sql;
    select station, month, mean(tmax) as mntmp
        from ghcnd
        group by station, month;
    quit;

run;