Statistics 506, Fall 2016

Problem set 4


Due November 21 (changed to November 22)

1) The first problem uses US Medicare payment data, available here. You can choose one year of data for this exercise. The data are documented in this file. You will need to read the data documentation closely to fully understand this, but briefly, each line in the data file represents a number of similar services provided to various patients by a single “provider” (usually a doctor, nurse, pharmacist, etc.). Only the mean and standard deviation of the payment amounts is given, along with some information about the provider, and about the type of service.

Note: You will need to carefully read the data documentation to understand how the questions given below relate to the data in the file.

You should complete this exercise in R using data.table where possible.

Solution: Read data set:

library(data.table)

fname = "zcat /nfs/kshedden/CMS/2013/Medicare_Provider_Util_Payment_PUF_CY2013.txt.gz"

da = fread(fname)

# Drop line with no data
da = da[2:dim(da)[1],]

a) Create a new variable called totpay that contains the amount of money paid from Medicare to the provider for the services represented in each line of data. You will need to look at the variable definitions to figure out how to do this.

Solution:

da[,totpay := LINE_SRVC_CNT * AVERAGE_MEDICARE_PAYMENT_AMT]

b) Using the data you constructed in part (a), obtain the average cost per service for each type of medical service. Identify the service with the highest average cost, and the service with the highest average cost among all services provided 100,000 or more times.

Solution:

first = function(x)x[1]
dax = da[, .(s=sum(totpay), n=sum(LINE_SRVC_CNT), description=first(HCPCS_DESCRIPTION)), by=HCPCS_CODE]
dax[,avg:=s/n]
dax = dax[,.(n, avg, description)]
dax = dax[order(-avg)]
daz = dax[n>100000]

# use head(dax), head(daz) to see the most expensive treatments

The most expensive claim on average is Sipuleucel-t drug treatment ($24,943 per claim). The most expensive claim which occurs at least 100,000 times is pegfilgrastim drug treatment ($2,335 average per claim).

c) Restrict the data to individual providers (see the data documentation for details). Determine the total amount of money paid to each provider. Then do the following two analyses with this data: (i) restrict to all providers who charged more than 1 million dollars in total, create a frequency distribution of the provider types within this set of providers, and report the 10 most frequent provider types; (ii) average the total amount paid to the providers within each provider type, and report the provider types with the two highest and two lowest averages.

Solution:

dai = da[NPPES_ENTITY_CODE=="I"]
dax = dai[, .(s=sum(totpay), provtype=first(PROVIDER_TYPE)), by=NPI]

# i
daz = dax[s>1e6]
daz = daz[,.(n=.N), by=provtype]
daz = daz[order(-n)]
print(daz[1:10])

The result is:

provtype    n
1:          Ophthalmology 1123
2:    Hematology/Oncology  664
3:     Radiation Oncology  357
4:           Rheumatology  269
5:            Dermatology  239
6:             Cardiology  220
7:       Medical Oncology  211
8:      Internal Medicine  128
9:   Diagnostic Radiology   90
10:            Nephrology   83

For part (ii):

avg_type = dax[, .(a=mean(s)), by=provtype]
avg_type = avg_type[order(-a)]

The two highest averages are for opthalmologists and hematologists/oncologists. The two lowest averages are for certified nurse midwives and mass immunization roster billers.

d) Calculate the average cost for each service type within each state. Then take the logarithm of these averages, center them to have mean zero for each service type (over all the states), and calculate the mean of these centered values within each state. Sort the results. Provide a statistical interpretation of the values that are obtained by this process.

Solution:

The code below calculates this result, which can be interpreted as the (approximate) relative difference in costs in one state compared to the country as a whole, averaged over the service types.

dx = da[, .(s=sum(totpay), n=sum(LINE_SRVC_CNT)), by=.(HCPCS_CODE, NPPES_PROVIDER_STATE)]
dx[, m := log(s/n)]

# The mean below is unweighted, but it would also be reasonable to weight
# it by the number of procedures by state
dx = dx[, .(mr = m - mean(m), NPPES_PROVIDER_STATE), by=HCPCS_CODE]

# The mean below is unweighted, but it would also be reasonable to weight
# it by the frequency of each procedure
dx = dx[, .(sm = mean(mr)), by=NPPES_PROVIDER_STATE]
dx = dx[order(-sm)]

e) Use the “law of total variance” to determine the standard deviation of the per-service charges for each service type. Then calculate the mean charge for each service type, and the coefficient of variation (CV) of charges for each service type. Sort the results by CV and report the 5 service types with the greatest CV value.

Solution:

dx = da[, .(tp=totpay, ns=LINE_SRVC_CNT, mn=AVERAGE_MEDICARE_PAYMENT_AMT,
            va=STDEV_MEDICARE_PAYMENT_AMT^2, w=LINE_SRVC_CNT/sum(LINE_SRVC_CNT)),
            by=HCPCS_CODE]
dx = dx[, .(mn=sum(tp)/sum(ns), n=sum(ns), vm=sum(w*(mn-sum(w*mn))^2),
            mv=sum(w*va)), by=HCPCS_CODE]
dx = dx[, vt:=vm+mv]
dx = dx[order(-vt)]
dx = dx[, sd:=sqrt(vt)]
dx = dx[, cv:=sd/mn]
dx = dx[order(-cv)]

The median CV is around 0.3, so a typical procedure varies by around 30%.

2) Obtain a subset of the Medicare data from here. In this exercise you will perform some basic manipulations of this dataset in SAS.

a) Write a SAS program using a data step to load the data into SAS without uncompressing the file. Note that you should use a data step, not proc import. You will need to search the web to see how to set the delimiter for a tab-delimited file. Print the first 10 rows and run proc contents to verify that the data are loaded correctly. Note that you will need to use the dsd option on the infile line to correctly handle empty fields, and you must set the proper delimiter for tab-delimited files.

Solution:

All code is below.

b) Use proc freq to determine the frequency of claims within each state that were paid to female and male providers. Which states had the least and the greatest proportion of claims paid to females?

Solution:

Among the US states, Utah has the highest proportion of male doctors (83%), and Maine has the lowest (64%).

c) Reduce the dataset to have one record for each provider (hint: you can do this via a data step with the first.xxx syntax). Then repeat part (b) using this dataset.

Solution:

Again, Utah has the highest proportion of male doctors (77%), and Maine has the lowest (53%).

Solution:

filename cms pipe "gzip -dc Medicare_Provider_Util_Payment_PUF_CY2013_subset.txt.gz" lrecl=5000;

data cms;
    infile cms dsd delimiter='09'x firstobs=3;
    input NPI NPPES_CREDENTIALS $ NPPES_PROVIDER_GENDER $ NPPES_ENTITY_CODE $
    NPPES_PROVIDER_ZIP $ NPPES_PROVIDER_STATE $ PROVIDER_TYPE $
    MEDICARE_PARTICIPATION_INDICATOR $ PLACE_OF_SERVICE $ HCPCS_CODE $
    HCPCS_DRUG_INDICATOR $ LINE_SRVC_CNT BENE_UNIQUE_CNT BENE_DAY_SRVC_CNT
    AVERAGE_MEDICARE_ALLOWED_AMT STDEV_MEDICARE_ALLOWED_AMT
    AVERAGE_SUBMITTED_CHRG_AMT STDEV_SUBMITTED_CHRG_AMT AVERAGE_MEDICARE_PAYMENT_AMT
    STDEV_MEDICARE_PAYMENT_AMT;

proc print data=cms(obs=10);

proc contents data=cms;

proc freq data=cms;
    tables NPPES_PROVIDER_STATE*NPPES_PROVIDER_GENDER;

proc sort data=cms;
    by NPI;

data cmsu;
    set cms;
    by NPI;
    if first.NPI;

proc print data=cmsu(obs=10);

proc freq data=cmsu;
    tables NPPES_PROVIDER_STATE*NPPES_PROVIDER_GENDER;

run;