Statistics 506, Fall 2016

Problem set 5


Due December 18

There is only one exercise to complete here:

Suppose we are considering a linear regression with a response variable y, and independent variables x1, x2, …, xp. We can fit the regression using all p of the covariates, or we can fit the regression using any subset of the covariates. “All subsets selection” is a technique that fits every possible submodel (2^p of them, counting the intercept-only model) and selects the submodel that is best according to a chosen model selection statistic.

In this exercise you should implement all subsets selection using the R futures package to obtain the fits concurrently.

Hint on how to cycle through all possible models: Represent a particular submodel using a vector of p 0’s and 1’s, where a 0 indicates that the corresponding variable is not included and a 1 indicates that the variable is included. To visit all the models, view these binary vectors as representing integers in base 2 form. Each time you add 1 (in base 2) to the current vector you obtain a new model; once all 2^p models have been visited, the vector wraps around to the starting model and the sequence repeats. For example, if there are three variables, the models are represented as triples (a, b, c), corresponding to the integer a*4 + b*2 + c. The first model is 0 = (0, 0, 0), meaning that none of the variables are included (always include the intercept, which R does by default). Adding 1 to this model in base 2 gives us 1 = (0, 0, 1); adding 1 again gives us 2 = (0, 1, 0), and so on.
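
The counting scheme in the hint can be sketched as follows. This is only an illustration, written in Python rather than R so that it can be run standalone; the function names increment and all_submodels are hypothetical, and the same logic carries over to R directly.

```python
def increment(bits):
    """Add 1 in base 2 to a list of 0/1 digits (most significant digit first).

    Returns a new list; the all-ones vector wraps around to all zeros.
    """
    out = list(bits)
    for i in range(len(out) - 1, -1, -1):
        if out[i] == 0:
            out[i] = 1      # no carry needed, so we are done
            return out
        out[i] = 0          # 1 + 1 = 10 in base 2: write 0 and carry left
    return out              # carried past the top digit: wrapped to all zeros


def all_submodels(p):
    """Return every inclusion vector for p covariates, in counting order."""
    model = [0] * p
    models = []
    for _ in range(2 ** p):
        models.append(model)
        model = increment(model)
    return models


models = all_submodels(3)
for m in models:
    print(m)
```

With p = 3 the list starts at (0, 0, 0), then (0, 0, 1), then (0, 1, 0), matching the example in the hint, and contains 8 distinct models in total.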

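You can use any appropriate model selection statistic to define which model is “best”. Appropriate choices would be AIC, BIC, or adjusted R^2, but not the unadjusted R^2, which never decreases when a variable is added and would therefore always select the full model. AIC, BIC, and adjusted R^2 are all immediately available in R.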
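
To see how these statistics penalize model size, here is a small sketch that computes them from the residual sum of squares of a least-squares fit. It assumes Gaussian errors and drops additive constants that are identical for every submodel fit to the same data, so the AIC and BIC values will not match the numbers reported by R even though the ranking of models should agree. The function name selection_stats and the numbers in the example are hypothetical.

```python
import math


def selection_stats(rss, tss, n, k):
    """Selection statistics for a linear model fit by least squares.

    rss: residual sum of squares of the submodel
    tss: total sum of squares of y about its mean
    n:   number of observations
    k:   number of covariates in the submodel (intercept not counted)
    """
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    # AIC and BIC up to an additive constant shared by all submodels;
    # k + 1 counts the intercept along with the k slopes.
    aic = n * math.log(rss / n) + 2 * (k + 1)
    bic = n * math.log(rss / n) + math.log(n) * (k + 1)
    return {"r2": r2, "adj_r2": adj_r2, "aic": aic, "bic": bic}


# Hypothetical numbers: adding a third covariate barely reduces the RSS.
small = selection_stats(rss=100.0, tss=200.0, n=100, k=2)
big = selection_stats(rss=99.9, tss=200.0, n=100, k=3)
print(small)
print(big)
```

In this made-up example the unadjusted R^2 still goes up for the larger model, while adjusted R^2 goes down and AIC and BIC go up, so all three penalized statistics prefer the smaller model. Remember that larger is better for adjusted R^2 and smaller is better for AIC and BIC.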

You should implement a configurable cap on the number of processes that can be run concurrently (as in the examples in the notes).
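
The idea of a configurable cap can be illustrated with a pool of workers whose size is a parameter. The sketch below uses Python's concurrent.futures with threads purely for illustration; it is not the approach from the course notes, and in R the cap would be set through whatever mechanism the notes describe. The names run_capped and fake_fit are hypothetical, and fake_fit is only a stand-in for fitting one submodel.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_capped(tasks, max_workers):
    """Run zero-argument callables with at most max_workers active at once.

    Returns the results in the same order as tasks.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]


# Demonstration: record the peak number of tasks running simultaneously.
lock = threading.Lock()
state = {"running": 0, "peak": 0}


def fake_fit(model_id):
    """Hypothetical placeholder for fitting one submodel."""
    with lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.01)          # pretend to do some work
    with lock:
        state["running"] -= 1
    return model_id * model_id


tasks = [lambda i=i: fake_fit(i) for i in range(16)]
results = run_capped(tasks, max_workers=4)
print(results[:4], "peak concurrency:", state["peak"])
```

All 16 tasks are submitted up front, but the pool never runs more than max_workers of them at the same time; the rest wait in a queue. Changing the cap is a matter of changing one argument, which makes it easy to compare run times across different settings.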

To demonstrate your code, simulate data from a single population of your choice. Run the procedure on your data (one run is sufficient), and briefly discuss the results. Your discussion should cover your findings regarding the model that is selected, and the time that was required for the computation to complete.
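
One possible way to set up the demonstration is to simulate from a population in which only some covariates have nonzero coefficients, so that the selected model can be compared with the true one. The sketch below (again in Python, with hypothetical names simulate and timed and arbitrary coefficient values) generates such data and shows a simple way to time a computation; the all-subsets search itself is represented only by a placeholder.

```python
import random
import time


def simulate(n, beta, intercept=1.0, sigma=1.0, seed=0):
    """Simulate y = intercept + sum_j beta[j] * x_j + noise.

    Covariates and noise are independent normal draws; the seed makes
    the data reproducible.
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    y = [intercept + sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma)
         for row in X]
    return X, y


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


beta = [2.0, 0.0, -1.0, 0.0, 0.0]              # only x1 and x3 matter
true_model = [1 if b != 0 else 0 for b in beta]
X, y = simulate(n=200, beta=beta)

# Placeholder for the real search: here we only time a trivial computation.
_, elapsed = timed(sum, y)
print("true model:", true_model, "elapsed seconds:", elapsed)
```

Knowing true_model in advance gives the discussion something concrete to check: whether the selected inclusion vector matches it, and if not, which variables were added or dropped.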