library(DBI)
sakila <- dbConnect(RSQLite::SQLite(), "data/sakila_master.db")
Problem Set #03 Solutions
Statistics 506
Problem 1 Solutions - Vision
The complete Do-file can be found here.
a.
. import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/VIX_D.XPT", clear

. rename _all, lower
(all newnames==oldnames)

. quietly compress

. save ~/Desktop/vision, replace
file ~/Desktop/vision.dta saved

. import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/DEMO_D.XPT", clear

. rename _all, lower
(all newnames==oldnames)

. quietly compress

. merge 1:1 seqn using ~/Desktop/vision

    Result                      Number of obs
    -----------------------------------------
    Not matched                         3,368
        from master                     3,368  (_merge==1)
        from using                          0  (_merge==2)

    Matched                             6,980  (_merge==3)
    -----------------------------------------

. keep if _merge == 3
(3,368 observations deleted)

. count
  6,980
b.
Clean up and rename variables.
. rename viq220 glasses

. replace glasses = . if glasses == 9
(2 real changes made, 2 to missing)

. replace glasses = glasses - 1
(6,545 real changes made)

. rename ridageyr age
Generate a variable capturing the age brackets.
. capture drop agecat

. generate agecat = floor(age/10)

. replace agecat = . if missing(age)
(0 real changes made)
I’ll demonstrate a number of different approaches:
Using mean
Since the average of a binary variable is its proportion, I can use mean to generate the proportions, then just need to clean up the matrix it produces.
. capture matrix drop prop

. quietly mean glasses, over(agecat)

. matrix prop = e(b)'

. matrix prop = prop*100

. matrix colnames prop = "Proportion"

. matrix rownames prop = "10-19" "20-29" "30-39" "40-49" "50-59" ///
>                        "60-69" "70-79" "80-89"

. matrix list prop, format(%3.1f)

prop[8,1]
       Proportion
10-19        67.9
20-29        67.3
30-39        64.1
40-49        63.0
50-59        45.0
60-69        37.8
70-79        33.1
80-89        33.1
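As a quick illustration of the fact that the mean of a 0/1 variable is its proportion of 1's, here is a tiny R sketch (the vector is made up for demonstration and is not part of the NHANES data):

# The mean of a 0/1 vector equals the proportion of 1s (made-up example)
x <- c(0, 1, 1, 0, 1)
mean(x)                  # 0.6
sum(x == 1) / length(x)  # also 0.6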
Using tabulate and mata to manually calculate
tabulate generates a table with the count of 0's and 1's in glasses. I can use mata to quickly calculate proportions, then again clean up the matrix.
. capture matrix drop prop

. tabulate agecat glasses, matcell(x)

           | Glasses/contact
           | lenses worn for
           |     distance
    agecat |         0          1 |     Total
-----------+----------------------+----------
         1 |       670      1,418 |     2,088
         2 |       306        631 |       937
         3 |       269        481 |       750
         4 |       286        487 |       773
         5 |       335        274 |       609
         6 |       392        238 |       630
         7 |       299        148 |       447
         8 |       208        103 |       311
-----------+----------------------+----------
     Total |     2,765      3,780 |     6,545

. mata:
------------------------------------------------- mata (type end to exit) ----------
: x = st_matrix("x")

: x = x[., 2]:/(x[., 1] + x[., 2])

: st_matrix("x", x)

: end
------------------------------------------------------------------------------------

. matrix prop = x*100

. matrix colnames prop = "Proportion"

. matrix rownames prop = "10-19" "20-29" "30-39" "40-49" "50-59" ///
>                        "60-69" "70-79" "80-89"

. matrix list prop, format(%3.1f)

prop[8,1]
       Proportion
10-19        67.9
20-29        67.3
30-39        64.1
40-49        63.0
50-59        45.0
60-69        37.8
70-79        33.1
80-89        33.1
Using table
table is designed for this sort of operation, though it's harder, if not impossible, to format things as nicely.
. table (agecat) (), stat(mean glasses) nototal nformat("%9.3f" mean)

----------------
        |   Mean
--------+-------
agecat  |
  10-19 |  0.679
  20-29 |  0.673
  30-39 |  0.641
  40-49 |  0.630
  50-59 |  0.450
  60-69 |  0.378
  70-79 |  0.331
  80-89 |  0.331
----------------
Using a loop to fill a matrix
This is the most R-like approach: generate an empty matrix and fill it up as we loop through the categories. Finally, we again clean up the matrix prior to displaying it.
. capture matrix drop prop

. matrix define prop = (0\0\0\0\0\0\0\0)

. foreach n of numlist 1/8 {
  2.   local lowerage = `n'*10
  3.   local upperage = (`n'+1)*10
  4.   quietly mean glasses if age >= `lowerage' & age < `upperage'
  5.   matrix prop[`n', 1] = e(b)
  6. }

. matrix prop = prop*100

. matrix colnames prop = "Proportion"

. matrix rownames prop = "10-19" "20-29" "30-39" "40-49" "50-59" ///
>                        "60-69" "70-79" "80-89"

. matrix list prop, format(%3.1f)

prop[8,1]
       Proportion
10-19        67.9
20-29        67.3
30-39        64.1
40-49        63.0
50-59        45.0
60-69        37.8
70-79        33.1
80-89        33.1
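To make the "most R-like" comparison concrete, here is roughly what the analogous loop would look like in R. This is a hypothetical sketch: it assumes the merged NHANES data were loaded into R as a data frame nhanes with the same 0/1 glasses and numeric age variables; it is not part of the Stata solution.

# Hypothetical R analogue of the fill-a-matrix loop (assumes a data.frame
# `nhanes` with 0/1 `glasses` and numeric `age`, mirroring the Stata variables)
prop <- matrix(NA_real_, nrow = 8, ncol = 1,
               dimnames = list(paste(seq(10, 80, 10), seq(19, 89, 10), sep = "-"),
                               "Proportion"))
for (n in 1:8) {
  lower <- n * 10
  upper <- (n + 1) * 10
  prop[n, 1] <- 100 * mean(nhanes$glasses[nhanes$age >= lower & nhanes$age < upper],
                           na.rm = TRUE)
}
round(prop, 1)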
Calculate in-place in the dataset
We can use collapse to replace our dataset with the proportions we want, do some minor cleaning, then just print the data.
. preserve

. collapse (mean) glasses, by(agecat)

. rename agecat Age

. rename glasses Percent

. replace Percent = Percent*100
(8 real changes made)

. format Percent %9.1f

. list, sep(0)

     +-----------------+
     |   Age   Percent |
     |-----------------|
  1. | 10-19      67.9 |
  2. | 20-29      67.3 |
  3. | 30-39      64.1 |
  4. | 40-49      63.0 |
  5. | 50-59      45.0 |
  6. | 60-69      37.8 |
  7. | 70-79      33.1 |
  8. | 80-89      33.1 |
     +-----------------+

. restore
c.
Clean up and rename variables.
. rename riagendr gender

. capture drop female

. generate female = gender == 2

. replace female = . if missing(female)
(0 real changes made)

. rename ridreth1 race

. label define race 1 "Mexican American" ///
>                   2 "Other Hispanic" ///
>                   3 "Non-Hispanic White" ///
>                   4 "Non-Hispanic Black" ///
>                   5 "Multi-racial"

. label values race race

. rename indfmpir pir
Run the models.
. quietly logit glasses c.age

. estimates store m1

. quietly logit glasses c.age i.race i.female

. estimates store m2

. quietly logit glasses c.age i.race i.female c.pir

. estimates store m3
Produce the table.
. estimates table m1 m2 m3, stats(N r2_p aic) eform

-----------------------------------------------------
    Variable |     m1          m2          m3
-------------+---------------------------------------
         age |  .97562897   .97767872     .978056
             |
        race |
 Other His.. |              .8552837    .89045517
 Non-Hispa.. |             .51225603    .60560404
 Non-Hispa.. |              .7696096    .81270706
 Multi-rac~l |             .52152821      .587002
             |
      female |
           1 |             .60526462    .59674151
             |
         pir |                          .89261693
       _cons |  3.5288428   6.2755778    7.5094302
-------------+---------------------------------------
           N |       6545        6545         6247
        r2_p |  .04973123   .07195445    .07339952
         aic |  8475.8866   8287.7609    7909.8082
-----------------------------------------------------
d.
. estimates restore m3
(results m3 are active now)

. logit, or

Logistic regression                                     Number of obs =  6,247
                                                        LR chi2(7)    = 625.30
                                                        Prob > chi2   = 0.0000
Log likelihood = -3946.9041                             Pseudo R2     = 0.0734

-----------------------------------------------------------------------------------
          glasses | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
------------------+----------------------------------------------------------------
              age |    .978056   .0012665   -17.14   0.000     .9755769    .9805414
                  |
             race |
   Other Hispanic |   .8904552   .1498325    -0.69   0.490     .6403015    1.238339
 Non-Hispanic W.. |    .605604   .0455103    -6.67   0.000     .5226635    .7017062
 Non-Hispanic B.. |   .8127071   .0643806    -2.62   0.009     .6958313    .9492139
     Multi-racial |    .587002   .0822693    -3.80   0.000     .4460076    .7725683
                  |
         1.female |   .5967415    .032406    -9.51   0.000     .5364902    .6637595
              pir |   .8926169   .0158059    -6.42   0.000     .8621694    .9241397
            _cons |    7.50943   .6592368    22.97   0.000     6.322398    8.919328
-----------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
The estimated odds ratio for females is ~.60 and statistically significant, providing evidence that the odds of females wearing glasses/contacts for distance vision are significantly lower than the odds for males.
. margins female

Predictive margins                                       Number of obs = 6,247
Model VCE: OIM

Expression: Pr(glasses), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |
          0  |   .6334955   .0083353    76.00   0.000     .6171587    .6498324
          1  |   .5190438   .0084379    61.51   0.000     .5025058    .5355819
------------------------------------------------------------------------------
. margins female, pwcompare(pv)

Pairwise comparisons of predictive margins               Number of obs = 6,247
Model VCE: OIM

Expression: Pr(glasses), predict()

-----------------------------------------------------
             |            Delta-method    Unadjusted
             |   Contrast   std. err.      z    P>|z|
-------------+---------------------------------------
      female |
     1 vs 0  |  -.1144517   .0118695    -9.64   0.000
-----------------------------------------------------
We also see evidence that females have a statistically significantly lower probability of wearing glasses/contact lenses for distance vision than males.
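To connect the odds and probability scales, here is a small R illustration using the margins output above. This is an added sketch, not part of the original solution; note that the marginal odds ratio computed this way need not equal the covariate-adjusted odds ratio of ~.60 from the logit model, since the margins average over the other covariates.

# Relate the predictive margins (probabilities) to the odds scale
p_male   <- 0.6334955   # Pr(glasses) for female == 0, from margins above
p_female <- 0.5190438   # Pr(glasses) for female == 1, from margins above
odds <- function(p) p / (1 - p)
odds(p_female) / odds(p_male)   # marginal odds ratio, about 0.62
p_female - p_male               # difference in probabilities, about -.114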
Problem 2 Solutions - Sakila
a.
dbGetQuery(sakila, "
SELECT l.name, count(l.name) AS count
FROM film AS f
LEFT JOIN language AS l on f.language_id = l.language_id
GROUP BY l.name
ORDER by -count
")
name count
1 English 1000
Trick question: They’re all in English.
b.
R approach:
fc <- dbGetQuery(sakila, "SELECT * FROM film_category")
cat <- dbGetQuery(sakila, "SELECT * FROM category")
catcount <- table(fc$category_id)
maxcat <- which.max(catcount)
c(cat$name[cat$category_id == maxcat], catcount[maxcat])
               15
"Sports"     "74"
With a single query:
dbGetQuery(sakila, "
SELECT c.name, count(c.category_id) AS count
FROM category as c
RIGHT JOIN film_category AS fc ON fc.category_id = c.category_id
GROUP BY c.category_id
ORDER by -count
LIMIT 1
")
name count
1 Sports 74
c.
First, to use R, let's extract all relevant tables into data.frames.
customer <- dbGetQuery(sakila, "SELECT * FROM customer")
address <- dbGetQuery(sakila, "SELECT * FROM address")
city <- dbGetQuery(sakila, "SELECT * FROM city")
country <- dbGetQuery(sakila, "SELECT * FROM country")
Next, we can use merges to mimic what SQL is actually doing. (This produces a warning, but only about variables we don't care about.)
merged1 <- merge(customer, address, by = "address_id",
                 all.x = TRUE, all.y = FALSE)
merged2 <- merge(merged1, city, by = "city_id",
                 all.x = TRUE, all.y = FALSE)
merged3 <- merge(merged2, country, by = "country_id",
                 all.x = TRUE, all.y = FALSE)
Warning in merge.data.frame(merged2, country, by = "country_id", all.x = TRUE, :
column names 'last_update.x', 'last_update.y' are duplicated in the result
t <- table(merged3$country)
t[t == 9]
United Kingdom
             9
Here's a more R-style approach:
cities <- address$city_id[match(customer$address_id, address$address_id)]
countries <- city$country_id[match(cities, city$city_id)]
tcountries <- table(country$country[match(countries, country$country_id)])
tcountries[tcountries == 9]
United Kingdom
             9
Finally, the query:
dbGetQuery(sakila, "
SELECT co.country, count(co.country) AS count
FROM country AS co
RIGHT JOIN
(SELECT country_id
FROM city AS ci
RIGHT JOIN
(SELECT city_id
FROM customer AS c
LEFT JOIN address AS a ON c.address_id = a.address_id
) AS ca ON ca.city_id = ci.city_id
) AS ccc ON ccc.country_id = co.country_id
GROUP BY co.country
HAVING count == 9")
country count
1 United Kingdom 9
Problem 3
dat <- read.csv("data/us-500.csv")
a.
length(dat$email[grepl("net$", dat$email)])/nrow(dat)
[1] 0.14
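An equivalent one-liner, anchoring on a literal ".net" suffix and using mean() directly, is sketched below; with the addresses in this file it should flag the same set of emails.

# Proportion of addresses whose TLD is literally ".net"
mean(grepl("\\.net$", dat$email))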
b.
Checking the username portion first: extract the usernames, then detect anything non-alphanumeric.
emails <- strsplit(dat$email, "@")
usernames <- sapply(emails, "[[", 1)
username_non_alphanumeric <- grepl("[^a-zA-Z0-9]", usernames)
Repeat for domains, stripping off the TLD first.
domains <- sapply(emails, "[[", 2)
domains <- gsub("\\.[a-z]{3}", "", domains)
domain_non_alphanumeric <- grepl("[^a-zA-Z0-9]", domains)
Finally, we can get the proportion.
mean(username_non_alphanumeric | domain_non_alphanumeric)
[1] 0.506
c.
First, make sure that all the phone numbers have the same number of digits, so we don't have leading 1's or anything like that.
table(sapply(dat$phone1, nchar))
12
500
Looks good, so we can assume the first three characters of every number are the area code.
phone1area <- substr(dat$phone1, 1, 3)
phone2area <- substr(dat$phone2, 1, 3)
sort(table(c(phone1area, phone2area)), decreasing = TRUE)[1]
973
 36
d.
In this first approach, we identify any address that ends in a number, then split the string on spaces and store the last entry.
apartments <- dat$address[grepl("[0-9]+$", dat$address)]
numbers <- sapply(strsplit(apartments, " "), function(x) x[length(x)])
numbers <- as.numeric(gsub("#", "", numbers))
hist(log(numbers))
Another approach: here we just extract the numbers at the end of the address.
numbers2 <- regmatches(apartments, regexpr("[0-9]+$", apartments))
hist(log(as.numeric(numbers2)))
Note that for this problem, there were edge cases that would require human intervention to uncover. For example,
apartments[108]
[1] "51120 State Route 18"
Here the "18" is the highway number, not an apartment number. Identifying all cases where this occurs would require full knowledge of all such possible roads - e.g. "State Route", "Highway", "Rt", "Route", etc. In addition, this assumes no user error on input. Generally these cases would require human review of the input or a far more advanced solution. This is true of most human-readable data, and especially true of user-input data. Handling this is not required for this problem.
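A rough way to flag (though not fix) the obvious cases of this kind is to look for common road-type keywords immediately before the trailing number; the keyword list below is illustrative only and deliberately incomplete.

# Flag addresses whose trailing number likely belongs to a named route rather
# than an apartment (keyword list is illustrative, not exhaustive)
route_like <- grepl("(State Route|Highway|Hwy|Route|Rt) [0-9]+$",
                    apartments, ignore.case = TRUE)
apartments[route_like]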
e.
table(substr(numbers, 1, 1))
1 2 3 4 5 6 7 8 9
15 13 12 12 15 11 12 11 17
This is a roughly uniform distribution, rather than the decreasing distribution expected under Benford's law. This data obviously does not appear to be real.
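For reference, Benford's law gives P(d) = log10(1 + 1/d) for the leading digit d; a quick sketch of the counts it would predict here makes the contrast with the table above clear.

# Expected leading-digit counts under Benford's law for these apartment numbers,
# versus the roughly uniform counts actually observed above
benford <- log10(1 + 1 / (1:9))
round(benford * length(numbers))  # decreasing from about 36 (digit 1) to about 5 (digit 9)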
f.
housenumbers <- sapply(strsplit(dat$address, " "), function(x) x[1])
lastnum <- sapply(housenumbers, function(x) {
  substr(x, length(x), length(x))
})
table(lastnum)
lastnum
1 2 3 4 5 6 7 8 9
52 63 67 58 43 55 60 48 54
We see another roughly uniform distribution. It gets a bit tricky to apply Benford's law here: for non-leading digits, Benford's law predicts a distribution that approaches uniform as the position of the digit increases. However, this is even trickier in this case, as the last digit does not hold a fixed position:
table(sapply(housenumbers, nchar))
1 2 3 4 5
106 115 82 109 88
We can look at each length separately.
lens <- sapply(housenumbers, nchar)
for (l in names(table(lens))) {
  print(table(lastnum[lens == l]))
}
1 2 3 4 5 6 7 8 9
12 13 9 16 13 10 11 8 14
1 2 3 4 5 6 7 8 9
10 10 17 12 8 15 18 13 12
1 2 3 4 5 6 7 8 9
10 9 12 9 10 8 12 5 7
1 2 3 4 5 6 7 8 9
11 23 20 12 3 9 8 13 10
1 2 3 4 5 6 7 8 9
9 8 9 9 9 13 11 9 11
We see, if anything, an increase in the frequency of the larger digits at the shorter lengths, but overall these all look like noisy uniform distributions. On the one hand, for later digit positions this is consistent with Benford's law; more realistically, the first-position results (either from part e, or the first table here, where length = 1) being uniform contradict Benford's law for this data, as expected given its artificial nature.
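One way to formalize the "noisy uniform" impression is a chi-square goodness-of-fit test of the digit counts against equal frequencies; this is an added illustration, not part of the original solution.

# Chi-square goodness-of-fit test of the digit counts against a uniform
# distribution over the digits 1-9 (equal probabilities is the default null)
chisq.test(table(lastnum))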