Statistics 506, Fall 2016

Problem set 1


Due October 4

  • Submit your Stata code as plain-text files (one file for each problem, with the parts clearly marked using comments). Also submit a single PDF file containing your answers to all the questions posed below.

  • Your code should be clearly written and it should be possible to assess it by reading it. Use appropriate variable names and comments.

  • All work for this problem set should be done using Stata. Do not do any data input/output or pre/post processing of results using tools other than Stata.

  • Some of these exercises will require you to use Stata commands or techniques that were not covered in class or in the course notes. You can use the web as needed to identify appropriate approaches. Part of the purpose of these exercises is for you to practice being resourceful and self-sufficient. Questions are welcome at all times, but please make an attempt to locate relevant information yourself first.

1 Use the RECS data discussed in class to answer the following questions:

a) What state has the greatest proportion of stucco-walled construction?

b) Calculate the proportion of each type of wall material for all houses built in each decade. What type of wall material is most consistently dropping in popularity over time?

2 This database contains information about “notable people” throughout history. You can load an Excel file into Stata (Google to learn how), or you can use Excel to convert it to a csv text file and use import delimited.

a) Consider all people born between 1700 and 1900 who have an observed (non-missing) year of death. Calculate the decade of each person’s birth and their lifespan in years. Then use Stata to create a clean table containing only the mean lifespan within each cell of a cross-tabulation between decade of birth and gender. What do you observe about the relationship between the lifespans of notable women and notable men in the 18th and 19th centuries?

b) Use the Haversine formula to calculate the distance between each person’s birth and death location. Then create a table formatted as in part (a) showing the mean distance by decade of birth and gender (you can restrict to the same subset of people). How has this mean distance changed over time?
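For reference, the Haversine formula is commonly written as below. This is a sketch of the standard form, not notation from the course materials: the symbols are assumptions, with latitudes \varphi_1, \varphi_2 and longitudes \lambda_1, \lambda_2 given in radians, and R the radius of the sphere (roughly 6371 km if the mean Earth radius is used). If the coordinates in the data are in degrees, convert them to radians first.

```latex
% Haversine great-circle distance between (\varphi_1, \lambda_1) and (\varphi_2, \lambda_2)
\begin{align*}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,
       \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right), \\
d &= 2R \,\arcsin\!\left(\sqrt{a}\right).
\end{align*}
```

The resulting distance d is in the same units as R.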

c) Calculate the centroid of birth locations for people born in each decade from 1700 to the present. To do this, take each birth location and view it as a point in R^3 (i.e. embedding the sphere in 3-space). Then calculate the usual geometric centroid of these points. Finally, scale this centroid point back to the surface of the sphere (note that this will be undefined if the centroid is exactly at the center of the sphere). The final result should be expressed in latitude/longitude coordinates. Describe how this centroid has varied over the decades from 1700 to the present.
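The steps in part (c) can be sketched in formulas as follows. The notation is an assumption made for illustration: \varphi_i and \lambda_i are the latitude and longitude (in radians) of the i-th birth location within a decade, n is the number of people in that decade, and a unit sphere is used.

```latex
% Step 1: embed each location on the unit sphere in R^3
\begin{align*}
x_i &= \cos\varphi_i \cos\lambda_i, &
y_i &= \cos\varphi_i \sin\lambda_i, &
z_i &= \sin\varphi_i .
\end{align*}

% Step 2: ordinary centroid of the embedded points
\[
(\bar{x}, \bar{y}, \bar{z}) = \frac{1}{n} \sum_{i=1}^{n} (x_i, y_i, z_i).
\]

% Step 3: project back to the sphere and convert to latitude/longitude
% (undefined when \bar{x} = \bar{y} = \bar{z} = 0)
\begin{align*}
\varphi_c &= \operatorname{atan2}\!\left(\bar{z},\ \sqrt{\bar{x}^2 + \bar{y}^2}\right), &
\lambda_c &= \operatorname{atan2}\!\left(\bar{y},\ \bar{x}\right).
\end{align*}
```

Because atan2 depends only on the direction of the centroid vector, dividing by its length before converting gives the same latitude and longitude. Convert the result back to degrees if that is how the data are recorded.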

3 Download the file OHX_D from this location, and determine how to read it into Stata. Then download the file DEMO_D from this location. Merge the two files to create a single Stata dataset, using the SEQN variable for merging. Then select all subjects between 30 and 50 years of age. Use for loops in Stata to calculate the proportion of people who are missing each of the 32 teeth represented in the data file. Finally, create a clean table showing the rate at which each tooth is missing for each gender.