Statistics 506, Fall 2016

Due November 3

The exercises in this problem set should primarily use dplyr and tidyr

1) Download this dataset containing information about investments made by Chinese companies and government entities outside of China. This is an Excel spreadsheet. Use the readxl package to load the spreadsheet into R. Then use dplyr and tidyr to answer the following questions:

Solution:

Here is some code to read and process the data:

library(readxl)
library(dplyr)
library(tidyr)

path = "/nfs/kshedden/506/China-Global-Investment-Tracker.xlsx"

da = read_excel(path, skip=5)

# Remove empty rows/columns
da = da[1:988, 1:11]

# Use shorter variable names
da = rename(da, Quantity=`Quantity in Millions`)
da$Region = replace(da$Region, da$Region=="Arab Middle East and North Africa", "N_Afr")
da$Region = replace(da$Region, da$Region=="East Asia", "E_Asia")
da$Region = replace(da$Region, da$Region=="North America", "N_Amer")
da$Region = replace(da$Region, da$Region=="South America", "S_Amer")
da$Region = replace(da$Region, da$Region=="Sub-Saharan Africa", "SS_Afr")
da$Region = replace(da$Region, da$Region=="West Asia", "W_Asia")
da$Region = replace(da$Region, da$Region=="Australia", "Aust")

a) Use dplyr to write a simple check that every country is only listed as belonging to a single region. If the check fails for any country, provide details.

Solution:

The code below shows that Saudi Arabia was listed as being in both North Africa and in West Asia.

g = group_by(da, Country)
f = summarize(g, m=length(unique(Region)))
f = filter(f, f$m >= 2)
f = filter(da, Country==f[[1]])
f = select(f, Country, Region)

b) Construct a table with the regions in the columns and the sectors in the rows, showing the proportion of the total investment for each region that was made in each sector. Shorten the column names and truncate the proportions so that the table fits on the screen without wrapping. Which sector most commonly receives the greatest share of investment? What are the exceptions to this finding?

Solution:

In most regions, energy receives the greatest share of investment, but in Sub-Saharan Africa, metals recieves the greatest share of investment.

g = group_by(da, Region, Sector)
s1 = summarize(g, Quantity=sum(Quantity))
s2 = summarize(group_by(s1, Region), TQuantity=sum(Quantity))
s = left_join(s1, s2)
s$PQuantity = s$Quantity / s$TQuantity
s = spread(select(s, Sector, Region, PQuantity), Region, PQuantity)
s = replace(s, is.na(s), 0)
s[,2:10] = round(s[,2:10], 2)

c) Construct a table showing the mean and standard deviation of the investment amounts in each sector, sorted by the means. Construct the same table for the log investment amounts. Describe the overall relationship between the means and standard deviations. Identify an exception to the overall pattern and explain it in more detail.

Solution:

There is a strong postive relationship between the means and standard deviations (sectors with higher mean investment also have higher standard deviations). However the chemicals sector has an unusually low standard deviation for its mean value.

g = group_by(da, Sector)
s = summarize(g, m=mean(Quantity), s=sd(Quantity))
s = arrange(s, m)

d) Construct a table showing the total investment per sector (column) per year (row). State which sectors contributed the most to investment growth from 2005-2012, and which contributed the most from 2013-2015.

Solution:

Up to 2012, energy investments dominated all other categories, and tended to increase from each year to the next. After 2012, energy investments declined substantially, although they are still large in absolute terms. Beteen 2013 and 2015 there were some years with large investments in entertainment, finance, and in the “other” category, but there is no single category that dominates in this time period.

dt = group_by(da, Sector, Year)
dt = summarize(dt, total=sum(Quantity))
dt = select(dt, Sector, Year, total)
dt = spread(dt, Sector, total)

2) Go to http://cesrusc.org/election, and click the “Detailed Data” link. Then download the 15 csv files with the filenames of the form pop_xxx_x.csv. These files contain estimates of the level of support for the two major candidates in the 2016 US Presidential election based on survey responses. Information about the file format is here.

Load all 15 files, create a column containing the difference between the levels of support for the two candidates, and retain only the date and this difference value. Then add columns to the dataframe for race, education, age, income, and sex, initially setting all values in these columns to NA.

Next, note that each file contains aggregates for a given stratum of the population (e.g. for all women). For each dataframe, set the value of the appropriate stratification variable to the level corresponding to the data in the file, e.g. if you are working with the pop_age_3.csv file, set the age variable to have value 3.

Next, create a loop, and within each loop iteration replace the missing values with a random sample from the observed values for the same variable. Use this dataset to fit a linear model of the support difference on all the demographic variables. Briefly describe your findings.

In addition to the dplyr functions we have discussed in class, you may find the following functions to be useful here: list.files, grepl, substr, replace_na, sample, and bind_rows. Search the internet for information about these or other functions that you may need, but do not use any packages other than dplyr and tidyr.

Solution:

Here is the code to prepare the dataset:

library(dplyr)

dpath = "/nfs/kshedden/506/cesrusc"

fi = list.files(dpath)

adf = NULL
for (f in fi) {

    fn = file.path(dpath, f)
    df = read.table(fn, header=TRUE)
    for (x in c("age", "inc", "rac", "edu", "sex")) {
        df[[x]] = NA
    }

    g = strsplit(f, "[.]")[[1]][1]
    g = strsplit(g, "_")
    g = unlist(g)

    df[[g[2]]] = g[3]

    adf = rbind(adf, df)
}

Here is the code to do the analysis:

nrep = 100
qpos = 0

for (k in 1:nrep) {

    # One imputed data set
    dx = adf
    for (x in colnames(dx)) {
        i0 = which(is.na(dx[[x]]))
        if (length(i0) > 0) {
            i1 = which(!is.na(dx[[x]]))
	    s = sample(dx[[x]][i1], length(i0), replace=TRUE)
            dx[[x]][i0] = s
	}
    }

    dx$y = dx$Trump - dx$Clinton
    md = lm(y ~ rac + edu + inc + age + sex, dx)

    qpos = qpos + (md$coeff > 0)
}

The result of qpos is the number of imputed data sets in which each coefficient estimate is positive. Since the covariates are categorical, the effects are relative to the refereence category, which in R defaults to the lowest coded value.

The key findings are: race categories 2, 3, 4 are never positive (higher support for candidate C among non-white relative to white respondents); education category 3 is never positive (higher support for candidate C among people with a Bachelor’s degree or more education); the two higher income categories are always positive (greater support for candidate T among people in families making more than $35K per year); age categories 2-3 are usually positive (possibly greater support for candidate T among older respondents; sex category 1 is always positive (greater support for candidate T among men).