Simulations

Thank you for visiting my notes. I've noticed a large uptick in views lately which is terrific! However, these notes are quite old (2017) and some details may be outdated. I have a more recent class (2023) which covers much of the same information and is hopefully more up-to-date.

If you are using these notes, I'd love to hear from you and how you are using them! Please reach out to jerrick (at) umich (dot) edu.

Introduction

“Simulation study” is a general term for any analysis which involves repeatedly performing the same analysis on data which is in some fashion randomly generated, and aggregating the results. We’ve already talked about one limited form of simulation study in multiple imputation; the imputations are random and the results are aggregated, though the number of imputations is low compared to what we generally think of for simulation studies.

We’re going to talk about three different situations in which a simulation study may be useful:

  1. Permutation tests - A generalization of nonparametric methods in which we wish to test the significance of an estimate but do not know its sampling distribution (or believe the assumptions about its sampling distribution may be violated).
  2. Power analyses - When there is not a closed form power calculation available, we can use a simulation study to estimate the power under certain assumptions.
  3. Assumption violations - If there’s a concern that a model assumption is violated, a simulation can test the effect that the assumption violation may have.

Permutation tests

Permutation tests are generalizations of nonparametric methods; e.g. the Wilcoxon rank-sum test is a specific case. They typically (though not exclusively) apply to treatment effect settings where we are interested in the difference between two groups. We’ll start with a small example before considering simulation versions.

A small example

Let’s start with a small example. Say we’ve got sample data from two groups and are interested in whether the medians of the two groups are significantly different. Here’s the data:

Group 1: 21 28 32
Group 2: 17 22

Define the medians as \(m_1\), estimated by \(\hat{m}_1 = 28\), and \(m_2\), estimated by \(\hat{m}_2 = 19.5\). Their difference \(D = m_1 - m_2\) is estimated by \(\hat{D} = 28 - 19.5 = 8.5\). Define the null hypothesis as

\[ H_0: m_1 = m_2. \]

If we were to perform a traditional test of this hypothesis (e.g. Mood’s median test), we’d compute a \(p\)-value where

\[ p = P\Big(|D| \geq |\hat{D}| \Big| H_0\Big). \]

We could compute that probability manually. First, consider the null hypothesis. An equivalent formulation is that group membership is random; if group membership were random, then the group medians would be equal (at least asymptotically).

With the sample size of 5, there are only a limited number of permutations (specifically \({5 \choose 3} = 10\)) we could have observed if group membership were random:

Group 1    | Group 2 | \(\hat{D}\)
22 28 32   | 17 21   | 9\(*\)
21 28 32   | 17 22   | 8.5\(*\)
21 22 32   | 17 28   | -0.5
21 22 28   | 17 32   | -2.5
17 28 32   | 21 22   | 6.5
17 22 32   | 21 28   | -2.5
17 22 28   | 21 32   | -4.5
17 21 32   | 22 28   | -4
17 21 28   | 22 32   | -6
17 21 22   | 28 32   | -9\(*\)

We see that three of these permutations (marked with \(*\)) are as extreme or more extreme (in absolute value) than the observed 8.5. Therefore, the \(p\)-value is 3/10 = .30.
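
We can verify this by brute force in R, enumerating all \({5 \choose 3}\) assignments with combn; a quick sketch using the data above:

x <- c(21, 28, 32, 17, 22)                 # all five observations pooled
splits <- combn(5, 3)                      # each column: indices assigned to group 1
d <- apply(splits, 2, function(idx) median(x[idx]) - median(x[-idx]))
d_obs <- median(c(21, 28, 32)) - median(c(17, 22))
mean(abs(d) >= abs(d_obs))
## [1] 0.3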

With larger sample sizes

As the sample size increases, we quickly lose the ability to measure every permutation (e.g. even at a reasonable size, \({100 \choose 50} > 10^{29}\)). Instead, we can observe a random subset of the permutations and use that to estimate the distribution of the test statistic, computing the \(p\)-value exactly as described above.

The process is similar to above (a generic sketch in R follows the list):

  1. Compute the observed test statistic \(t_\textrm{obs}\).
  2. Repeatedly re-sample the data in some fashion that simulates the null hypothesis. From each run, compute \(t_i\).
  3. The sample distribution of all \(t_i\) is an estimate of the population distribution of \(t\).
  4. The \(p\)-value can be calculated as

\[ p = \frac{\#\big\{i : |t_i| > |t_\textrm{obs}|\big\}}{\textrm{number of simulations}}. \]
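
Putting the four steps together, a minimal generic sketch might look like the following. The names perm_test, statistic and permute are hypothetical (not from any package): statistic computes \(t\) from the data, and permute re-randomizes the data in a way consistent with the null.

# Hypothetical helper illustrating steps 1-4.
perm_test <- function(data, statistic, permute, reps = 10000) {
  t_obs <- statistic(data)                            # step 1
  t_sim <- replicate(reps, statistic(permute(data)))  # step 2: the t_i's
  # steps 3-4: the t_sim's estimate the null distribution of t
  mean(abs(t_sim) >= abs(t_obs))                      # two-sided p-value
}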

Example

Let’s consider an example based upon the National Supported Work Demonstration data. This program intervened on “hard-to-employ” individuals (e.g. ex-convicts or ex-addicts) by exposing them to 12 months of “employment in a supportive but performance-oriented environment”. The data (named lalonde after Robert LaLonde, whose 1986 paper analyzed it) actually exists in several versions and several packages; we’ll use the version from MatchIt:

data(lalonde, package = "MatchIt")

Let’s ask whether the intervention reduced the percentage of individuals whose income fell below the federal poverty line. The two variables of interest are treat, a binary indicator of treatment status, and re78, total income in 1978.

The tricky part here is that the federal poverty line is based upon the total income for a family of a given size. In 1978 [1], the poverty line was $3,400, with an additional $1,100 for each additional person (so a family of three had a poverty line of $5,600). In the data there is the variable married, a binary indicator of whether the individual was married.

Let’s make the following simplifying (albeit most certainly wrong) assumptions:

  1. Individuals in the data set are independent (e.g. only one family member can participate).
  2. Individuals in the data set are the sole income earners in their family.
  3. Individuals have no children and support no parents/extended family.

In other words, each family is of size 1 or 2 and the re78 variable represents the total household income. Therefore, each individual is below the poverty line if

\[ \textrm{re78} < 3400 + \textrm{married}\times1100. \]

We can compute

\[ q = \frac{1}{n}\sum_{i=1}^n\mathbb{1}\Big\{\textrm{re78} < 3400 + \mathbb{1}\{\textrm{married} = 1\}*1100\Big\}. \]

Note that \(n\) is the sample size and \(\mathbb{1}\{...\}\) represents an indicator that is 1 if the statement inside is true and 0 if false. Therefore the test statistic is

\[ t = q_1 - q_0, \]

or the difference in the proportions between the treatment and control groups. Note that the \(n\) above differs between \(q_1\) and \(q_0\), being the number of treated and control members respectively. If \(t\) is positive, then more individuals in the treatment group are under the poverty line; if \(t\) is negative, then fewer are; if \(t\) is 0, there is no difference.

We could easily work out the distribution of this test statistic if we wanted, but it’s equally easy to run this simulation quickly.

First, let’s calculate \(t_\textrm{obs}\), the value in the true sample.

tstat <- function(txt, married, salary) {
  # Basic sanity checks on the inputs
  stopifnot(length(txt) == length(married),
            length(txt) == length(salary),
            all(married %in% c(0, 1)),
            mean(txt) > 0,
            mean(txt) < 1)
  # Proportion under the poverty line in the treatment and control groups
  q1 <- mean(salary[txt == 1] < 3400 + married[txt == 1]*1100)
  q0 <- mean(salary[txt == 0] < 3400 + married[txt == 0]*1100)
  return(q1 - q0)
}

tobs <- tstat(lalonde$treat,
              lalonde$married,
              lalonde$re78)
tobs
## [1] 0.01106281

Note that this is positive, so we know almost surely that the \(p\)-value will not be significant, but let’s continue.

The null hypothesis is that the treatment is ineffective; in other words, that whether an individual received treatment will not affect their likelihood of being under the poverty line.

So let’s iterate by randomly assigning treatment status. It’s often easier to randomize treatment status than outcome.

married <- lalonde$married
salary <- lalonde$re78
treatment <- lalonde$treat

reps <- 10000
ts <- rep(-999, times = reps)
for (i in 1:reps) {
  treatment.iterate <- sample(treatment)
  ts[i] <- tstat(treatment.iterate,
                 married,
                 salary)
}
library(ggplot2)
qplot(ts, geom = "density") + geom_vline(xintercept = tobs)

So there’s the distribution of the test statistic with a vertical line at the observed test statistic. As you could have anticipated, it is centered around 0. Just from visualizing we can see that it’s unlikely we’ll see a significant difference, but we can calculate the two-sided \(p\)-value easily:

mean(abs(ts) > abs(tobs))
## [1] 0.7995

Power Analysis

Statistical power can be thought of as a test’s ability to detect an effect - the larger the power, the smaller the effect we can detect. A lot of basic tests have closed form solutions for power - t-tests, one-way balanced ANOVA, correlation tests - but more complicated tests and models usually don’t. However, following a similar but distinct pattern to the permutation tests, we can use a simulation to estimate the power of a test, under certain assumptions.

Basics of Power

We have two hypotheses, the null \(H_0\) and the alternative \(H_1\). We have two results, rejecting \(H_0\) or failing to reject \(H_0\). For a given test, we can define the probability of a given result conditional on the truth.

\[\begin{align}
\alpha = \textrm{Type I Error} &= P(\textrm{reject} | H_0)\\
\beta = \textrm{Type II Error} &= P(\textrm{fail to reject} | H_1)\\
1 - \beta = \textrm{power} &= P(\textrm{reject} | H_1)
\end{align}\]

Note that as \(\alpha\) decreases (so we lower the chance of making a Type I error), \(\beta\) increases (we have a higher chance of making a Type II error), which directly lowers the power \(1 - \beta\) (we have a lower chance of detecting an effect if there truly is one).

Balancing these is up to the statistician; the commonly suggested values are \(\alpha = .05\) (hence the threshold for significant \(p\)-values) and power \(1 - \beta = .80\).

Power analyses require several estimates. Specifically, two of the following three must be specified (either from prior knowledge or from a pilot study), and the third can then be computed (R’s built-in power.t.test, sketched below, does this calculation for a two-sample t-test):

  • Sample size
  • Target power
  • Effect size

If we estimate:

  • Sample size and target power, we can examine the lowest effect size that can be detected.
  • Sample size and effect size, we obtain estimates of the power of the test.
  • Target power and effect size, we obtain the sample size needed to power the test appropriately.
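
For tests that do have closed-form solutions, the built-in power.t.test function shows this three-way relationship directly: supply two of n, delta (the effect size) and power, and the missing one is solved for. The particular numbers below are arbitrary illustrations.

power.t.test(n = 50, delta = .5, sd = 1, sig.level = .05)      # solves for power
power.t.test(delta = .5, sd = 1, sig.level = .05, power = .8)  # solves for n per group
power.t.test(n = 50, sd = 1, sig.level = .05, power = .8)      # solves for detectable delta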

The quality of a power analysis is entirely dependent on the quality of the estimates! Poor estimates will give poor power analyses. One remedy is, instead of producing a single estimate of power (or sample size or effect size), to plot the relationship between the three variables, known as a power curve. For example,

Source: http://www.nature.com/ng/journal/v40/n5/fig_tab/ng0508-489_F1.html

From this we can read any of the above such as:

  • With a sample size of 5,000 and 80% power, the smallest effect we could detect would be about .32.
  • With a sample size of 20,000 and an effect size of .1, the power is about 30%.
  • To obtain 80% power to detect an effect of .2, we’d need a sample size somewhere between 10,000 and 20,000, likely around 18,000.

Simulating power

The simulation runs similar to earlier:

  1. For a given sample size and effect size, repeatedly generate data which
    1. is a realistic representation of the data you expect to obtain.
    2. has the effect size baked in.
  2. Perform your analysis and record whether a significant effect was detected.
  3. The power is the proportion of the iterations which detected a significant effect.
  4. Repeat 1-3 as desired with different sample sizes and effect sizes.
  5. Generate the plot(s) as above.

Example

Let’s say we are trying to figure out how to cook the perfect hamburger. We plan to cook a lot of burgers, recording various statistics about the burger and our cooking method, then present these to judges and get a 1-10 score on the hamburger.

Two predictors we might look at are thickness of the burger and temperature the burger was cooked at. In theory these are independent - while a thicker burger might take longer to cook, it can reach the same temperatures that a thin burger can. So we don’t have to worry about correlation between these. (In lots of situations you do!)

We want to fit a regression model predicting the judges’ scores (assume these scores have been validated in some sense, likely by aggregating multiple judges’ scores) based upon temperature and thickness. What we’d like to be able to do is test whether the coefficients on these predictors (after normalization) are significantly different, thus enabling us to say which is more impactful on burger quality.

Here’s the power simulation, with comments throughout:

reps <- 1000
# Lists of sample sizes and effect sizes we'll iterate over
ns <- c(10, 25, 50, 100)
effect_size <- seq(.1, 1, by = .1)
# We'll save the power at each combination here.
# Each row is one combination: power in the 1st column,
# n and effect size in the 2nd and 3rd columns.
results <- matrix(nrow = 0,
                  ncol = 3)

# Loop over each combination and perform `reps`.
for (i in seq_along(ns)) {
  n <- ns[i]
  for (j in seq_along(effect_size)) {
    es <- effect_size[j]
    # We'll save whether each test rejects.
    save <- vector(length = reps)
    for (r in 1:reps) {
      # Burger temps can range from 125 (rare) to 165 (well-done).
      temp <- sample(125:165, n, TRUE)
      temp <- scale(temp)

      # Widths can vary between .5 to 1.5 inches, with a normal distribution.
      width <- rnorm(n, 1, .25)
      # careful, min and max collapse a vector to a single value;
      # pmin and pmax work elementwise.
      width <- pmax(pmin(width, 1.5), .5)
      width <- scale(width)

      # Generate the responses.
      verdict <- 1*temp + (1 + es)*width + rnorm(n)

      # Run the model
      mod <- lm(verdict ~ temp + width)
      # This function from the car package test whether temp = width
      save[r] <- car::linearHypothesis(mod, "temp = width")$Pr[2] < .05
    }
    # Now that we have the results for this choice of n and effect size, 
    # let's save it and move on.
    results <- rbind(results, c(mean(save), n, es))
  }
}
resultsdf <- data.frame(results)

names(resultsdf) <- c("Power", "n", "Effect.size")

# Convert n to a factor to make plotting nicer
resultsdf$n <- as.factor(resultsdf$n)

ggplot(resultsdf,
       aes(x = Effect.size, y = Power, group = n, color = n)) + geom_line() + geom_hline(yintercept = .8)

A few notes about this code:

  • I violate my own best practices by dynamically growing the results matrix. However, this matrix only grows to a total of 80 rows so this slowdown will not be substantial. In addition, in the inner loop, save is not dynamically created, but pre-allocated properly.
  • Note that while I could loop directly over ns and effect_size (e.g. for (n in ns)), looping over i and j makes it easier for me to place the results into the results matrix.
  • lm and car::linearHypothesis are slow. I really should optimize this and would if I had more time!

Looking at the results, we can see, for example, that to obtain 80% power to detect an effect size of .5, we’d need a sample size around 75. With a sample size of 25, we’d only be able to detect effects around .8 with 80% power.

Assumption Violations

Each statistical model includes a variety of assumptions - either in the model itself, such as normality for a t-test or independence of observations for linear regression, or in the data, such as assuming the predictors are measured without error. If these assumptions are violated, the penalty can vary, from an increase in bias and/or variance, to complete invalidation of the results.

Sometimes there are closed form solutions to these situations, where we can either use a different model (such as a mixed effects model instead of least squares if independence is violated) or we can model the violation explicitly (such as in the measurement error literature).

However, in other situations there aren’t. That’s where this sort of simulation comes in. It’s a combination of the previous two simulation styles:

  • Like permutation tests, we’ll randomly generate data each iteration.
  • Like power analyses simulations, we’ll look at a variety of values for some parameters.

The process is similar:

  1. For a given set of parameters, repeatedly generate data which
    1. is a realistic representation of the data you expect to obtain.
    2. has the assumption violation baked in.
  2. Perform your analysis and record the results (see below for which results to record).
  3. Repeat 1-2 as desired with different parameters
  4. Visualize the results to examine the effect of the parameter violation.

How do we decide which results to record? It depends on the goal of your analysis and the effect you expect the violated assumption to have. You want to save whatever test statistic you are concerned with and its associated error, for example, a regression coefficient and its standard error. Sometimes you’re more interested in the model as a whole, so something like \(R^2\) is appropriate. Generally you’ll want to save more rather than less, as assumption violations can sometimes cause issues where you don’t expect them. Saving a variety of statistics allows you to check them all.

For example, for a least squares model, I’d likely save:

  • All coefficients and their errors.
  • \(R^2\) and adjusted \(R^2\).
  • The F-test statistic and associated degrees of freedom.
  • Some summary of the residuals, e.g. MSE or absolute deviation.
  • If you have multiple predictors (and aren’t already saving them all), the percentage of non-zero coefficients you estimate.

Sometimes detecting the effects is easy - if the distribution of the coefficient is not centered around the value you chose, the violation is causing bias. Sometimes it is harder to detect, for example increased variance. In this case, you may consider running the simulation twice, once with the violation and once without, and comparing the distribution of the test statistics.

Example 1

Let’s take a very simple scenario. Let’s say we’re trying to predict age based upon log of salary [2]. To do so, I send two plucky research assistants to two different locations in Ann Arbor with instructions to conduct a pseudo-random convenience sample.

Assistant 1 performs as requested and asks every \(n\)th random person who walks by. Assistant 2 unfortunately attempts to “improve” the results, and skips anyone who looks too young or too old to work. As a result, the variability of assistant 2’s data is reduced.

We can simulate this by generating the data ignoring the restriction, then, for assistant 2’s data, dropping any observations with high or low ages. The threshold we use is one parameter of the simulation; for my purposes I will drop an observation if its age is below a random cutoff between 15 and 25, or above a random cutoff between 60 and 80. The randomization accounts for “looking” too young or old; by keeping a few people who are young but look older (or vice-versa) we may model the situation better.

In a real scenario, we’d want to base the distributions of age and income on the sample data, trying to account for the error. Alternatively, you could generate standardized data and talk in more generalities, but that still assumes the distributions of age and income are not too bizarre.

For generating the income, we’d like the income to be roughly between $15,000 and $200,000 (chosen arbitrarily by me, but you could use the data to set limits). Salary is usually log-normal and the median salary in 2000 was around $46,000. To account for inflation and to simplify, let’s say it’s $50,000.

income | log(income)
$15k   | 9.6
$50k   | 10.8
$200k  | 12.2

So we can draw from a normal distribution with mean 10.8 and standard deviation around .43, so that the lower and upper bounds are roughly 3 standard deviations away.
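
As a rough sanity check of those choices (not part of the simulation itself), we can draw a large sample on the log scale, exponentiate, and confirm the implied incomes land near the intended range:

inc <- exp(rnorm(1e5, 10.8, .43))
round(quantile(inc, c(.001, .5, .999)))
# median near $49k, with the tails roughly at the intended $15k-$200k limits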

For age, we want to use

\[ \textrm{age} = \beta_0 + \beta_1\textrm{log(income)} + \epsilon \]

To make choosing a specific \(\beta_1\) easier, we’ll center log income. This will affect the value of \(\beta_0\) but not its significance (not that we care about it anyways). From trial and error (repeatedly generating and plotting the relationship) I found that values \(\beta_1 = 50\) and \(\epsilon\) having a standard deviation of 25 worked well (it generated a decent relationship with enough noise to be interesting).

n <- 100
reps <- 1000

save_full <- vector(length = reps)
save_adj <- vector(length = reps)
for (i in 1:reps) {
  income <- rnorm(n, 10.8, .43)
  age <- 55 + 50*scale(income, scale = FALSE) + rnorm(n, 0, 25)

  # Let's save the original results just to prove to ourselves that any 
  # discrepancy is due to the changes below.
  save_full[i] <- .lm.fit(cbind(1, income), age)$coef[2]

  # In the second half (assistant 2's data), drop anyone whose age falls
  # outside the (randomized) cutoffs.
  lowdrop <- sample(15:25, n/2, replace = TRUE)
  highdrop <- sample(60:80, n/2, replace = TRUE)

  secondhalf <- (n/2 + 1):n
  todrop <- rep(FALSE, n)
  todrop[secondhalf] <- (age[secondhalf] < lowdrop | age[secondhalf] > highdrop)
  age <- age[!todrop]
  income <- income[!todrop]

  save_adj[i] <- .lm.fit(cbind(1, income), age)$coef[2]
}
qplot(save_full) + geom_vline(xintercept = 50)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(save_adj) + geom_vline(xintercept = 50)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(save_full, save_adj) + geom_abline(slope = 1, intercept = 0)

So we can see that this introduces a substantial bias into the model. If this were real data, we’d either have to restrict our population of interest to those between 25 and 60 and discard anyone outside that range, or discard the second research assistant’s data. Neither is an ideal solution, which is why it’s imperative to convince your collaborators and clients to plan the data collection with you ahead of time!

Example 2

Another situation which can arise is where there’s some new methodology (or an old methodology applied in a non-standard way or in a novel setting) and it’s not clear that it will necessarily function as desired. Typically this is used for detecting bias in the estimates, but it can be used in other situations. The simulation here is somewhat of a hybrid between a permutation test and a simulation to check assumption violations.

The situation in which I’ve most commonly seen this used is estimating the difference between two groups (specifically a treatment effect, but more generally any two groups). Rather than wholly simulating the data, we instead utilize only one group (usually the control group) and randomly split it into two faux groups. We know that, asymptotically, there should be no difference between the two faux groups.

Rather than show code, which is straightforward and would be repetitive at this point, I’ll instead link to a publication which uses this method: Abadie, A., Chingos, M.M. and West, M.R., 2013. Endogenous stratification in randomized experiments (No. w19742). National Bureau of Economic Research.

In the paper, the authors examine a two-stage regression setting in which the predicted outcome from the first stage is used to stratify the sample into the lowest, middle and highest thirds. The second-stage model is fit on each third individually with the goal of estimating the treatment effect. If this procedure worked as anticipated, it would provide evidence about whether a treatment is more effective on those who are at high risk.

To demonstrate the existence of bias, the authors proceed exactly as described: discard the treatment group data, and randomly assign the control group into fake treatment and control groups [3] before performing the analysis. Repeating this many times [4], they generated the distribution of the estimated treatment effect for the terciles:

Note that since the “treatment” group is nothing more than a random subset of the control group, there should be no treatment effect. However, the lower and upper terciles find an effect!

Generating Random Data

I’ve generally assumed you already know how to generate random data appropriately. Let’s be a bit more explicit about it.

Generating from “common” distributions

Most distributions have functions in R to allow easy access to random numbers drawn from them; for example, rnorm, rt, rpois and rgamma for the Normal, t, Poisson and Gamma distributions respectively. They take as an argument n for the sample size, followed by the parameters of the family, e.g. \(\mu\) and \(\sigma\) for rnorm or the degrees of freedom for rt.

See help(Distributions) for a full list of built-in distributions.
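
For example, each function follows the same pattern of n first, then the family’s parameters (the parameter values here are arbitrary):

rnorm(3, mean = 0, sd = 2)      # Normal with mu = 0, sigma = 2
rt(3, df = 10)                  # t with 10 degrees of freedom
rpois(3, lambda = 4)            # Poisson with mean 4
rgamma(3, shape = 2, rate = 1)  # Gamma with shape 2 and rate 1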

Generating from less common distributions

sample

The sample function can simulate draws from any categorical distribution (named or un-named) by modifying the prob argument:

table(sample(0:1, size = 100, replace = TRUE))
##
##  0  1
## 51 49
table(sample(0:1, size = 100, replace = TRUE, prob = c(.3, .7)))
##
##  0  1
## 32 68

These simulate 100 draws each from a Bernoulli distribution with \(p\) of .5 and .7 respectively. (Or, equivalently, each table of counts corresponds to a single draw from Binomial(100, .5) or Binomial(100, .7).)
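
Equivalently, rbinom produces the same sort of draws directly; a quick sketch:

table(rbinom(100, size = 1, prob = .7))  # 100 Bernoulli(.7) draws
rbinom(1, size = 100, prob = .7)         # or a single Binomial(100, .7) count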

For a more custom distribution, consider that the 2010 census of Ann Arbor reported that Ann Arbor was 73% white, 7.7% black, 4.1% Latino and 14.4% Asian (leaving .8% as Other). Let’s simulate a draw from that:

aa <- sample(c("White", "Black", "Latino", "Asian", "Other"),
             size = 1000, replace = TRUE,
             prob = c(.73, .077, .041, .144, .008))
table(aa)
## aa
##  Asian  Black Latino  Other  White
##    144     68     38      9    741

Variable transformations

There are various well known connections between distributions, such as the fact that the sum of squares of \(n\) standard normal variables is distributed \(\chi^2(n)\). We can exploit these connections to sample from a distribution which R does not directly support, so long as it is connected to one R does.
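
For example, a quick sketch checking the normal/chi-squared connection just mentioned by simulation:

k <- 5
z <- matrix(rnorm(10000*k), ncol = k)
chisq_manual <- rowSums(z^2)     # sum of squares of k standard normals
qqplot(chisq_manual, rchisq(10000, df = k))
abline(0, 1)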

For summaries of the various connections, see this paper by Leemis and McQueston. Alternatively, the Wikipedia page for most distributions does a good job of summarizing connections to other distributions.

For example, the Erlang distribution describes the waiting time until the \(k\)th event occurs, given that the average waiting time per event is \(\lambda\). It is known that the sum of \(k\) Exponential(\(\lambda\)) random variables is distributed as Erlang(\(k\), \(\lambda\)). It is also known that the Erlang distribution is a special case of the Gamma distribution in which the shape parameter is an integer.

# Method 1
k <- 15
lambda <- 2
n <- 10000
exps <- matrix(rexp(n*k, rate = 1/lambda), nrow = n)
erlang <- apply(exps, 1, sum)

# Method 2
erlang2 <- rgamma(n, 15, 1/lambda)

qqplot(erlang, erlang2)
abline(0, 1)

Note that \(\lambda\) is inverted in the draws. R loves using the rate rather than the scale, so parameters are often inverted. Just watch out for this.

Inverse uniform sampling

You may have learned about this method in an introductory class. It exploits the fact that the CDF of a random variable, evaluated at that variable, is uniformly distributed. Let \(F_X\) be the CDF of \(X\). If the inverse \(F^{-1}_X\) exists, then for \(U\) drawn from Uniform(0,1), \(F^{-1}_X(U)\) has the same distribution as \(X\).

The downside to this method is that we frequently cannot express the inverse CDF in closed form. So do some research to see if computing the inverse CDF is feasible before attempting this.
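
When the inverse CDF is readily available (R’s q____ functions), the method is a one-liner. A quick sketch using the exponential distribution, where we can compare against rexp directly:

U <- runif(10000)
x <- qexp(U, rate = 2)            # inverse-CDF (quantile) transform of uniforms
qqplot(x, rexp(10000, rate = 2))
abline(0, 1)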

As an example, let’s generate data from the Weibull distribution, which is similar to the Erlang in that it measures time-to-failure. It has two parameters, \(k\) representing the shape and \(\lambda\) representing the scale. This not being a theoretical statistics course, I’m not going to derive the inverse CDF; I’ll merely state that it is

\[ F^{-1}(u) = \lambda\big(-\ln(u)\big)^{1/k}, \]

using the fact that \(U\) and \(1 - U\) have the same Uniform(0,1) distribution.

U <- runif(10000, 0, 1)

weibull <- (-log(U))^(1/4)/6

weibull2 <- rweibull(10000, 4, 1/6)

qqplot(weibull, weibull2)
abline(0,1)

Note that, unlike rexp and rgamma, rweibull is parameterized by the shape and scale directly (not a rate), so both the manual draw and rweibull above use the same scale, \(\lambda = 1/6\). Always double-check which parameterization a given function expects.

Generating correlated data

Normal data

The package “mvtnorm” allows drawing correlated normal data. Similar to rnorm, its rmvnorm function takes a mean and a variance, but multidimensional: a vector of means and a covariance matrix.

library(mvtnorm)
mean <- c(1, 5, 3)
vcov <- matrix(c(1, .4, -.2,
                 .4, 2, 1.2,
                 -.2, 1.2, 1.4),
               nrow = 3)
s <- rmvnorm(10000, mean, vcov)
vcov
##      [,1] [,2] [,3]
## [1,]  1.0  0.4 -0.2
## [2,]  0.4  2.0  1.2
## [3,] -0.2  1.2  1.4
round(cov(s), 2)
##       [,1] [,2]  [,3]
## [1,]  1.01 0.41 -0.20
## [2,]  0.41 1.99  1.21
## [3,] -0.20 1.21  1.41

Be careful with correlation versus covariance here; a covariance of .4 does not imply a correlation of .4:

round(cor(s), 2)
##       [,1] [,2]  [,3]
## [1,]  1.00 0.29 -0.17
## [2,]  0.29 1.00  0.72
## [3,] -0.17 0.72  1.00

Be sure you are working on the scale you are anticipating. Note that if you pass a correlation matrix instead (diagonal elements are all 1 and off-diagonals are in \([-1, 1]\)), then the covariance and correlation matrices coincide. Don’t forget that

\[ \textrm{cor}(X, Y) = \textrm{cor}(aX, bY) \]

for any positive constants \(a\) and \(b\), so if you want \(X\) and \(Y\) to be correlated with some specific variances, generate the correlated data with variance 1 and scale afterwards.
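
For example, to get correlation .4 between a variable with standard deviation 2 and one with standard deviation 5 (arbitrary values for illustration), generate from a correlation matrix with rmvnorm and rescale afterwards; a minimal sketch:

R <- matrix(c(1, .4,
              .4, 1), nrow = 2)          # target correlation matrix
z <- rmvnorm(10000, c(0, 0), R)          # unit-variance correlated normals
x <- cbind(3 + 2*z[, 1], 10 + 5*z[, 2])  # shift/scale to the desired means and sds
round(cor(x), 2)                         # correlation is (approximately) preserved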

Non-normal

We can combine the use of mvtnorm with inverse uniform sampling to generate data from any two correlated distributions. The idea is to draw normal data with the correlation we desire, then transform twice: once with the normal CDF to convert to uniform, then with the inverse CDF of the desired distribution. Doing so should (roughly) preserve the correlation.

For an example, let’s draw two correlated exponentially distributed random variables. The nice part is that most distributions have built-in CDF and inverse CDF (quantile) functions, p____ and q____ respectively.

n <- 10000
m <- c(0, 0)
vcov <- matrix(c(1, .4,
                 .4, 1),
               nrow = 2)

x <- rmvnorm(n, m, vcov)
cor(x)
##           [,1]      [,2]
## [1,] 1.0000000 0.4166103
## [2,] 0.4166103 1.0000000
qplot(x[,1], x[,2])

As expected, we have correlated normal data. Let’s use pnorm to convert to correlated uniform.

U <- pnorm(x)
cor(U)
##           [,1]      [,2]
## [1,] 1.0000000 0.4004628
## [2,] 0.4004628 1.0000000
qplot(U[,1], U[,2])

While the correlation isn’t exactly .4, we’re pretty close. Let’s use qexp to finally convert to exponential.

exp <- qexp(U)
cor(exp)
##           [,1]      [,2]
## [1,] 1.0000000 0.3716019
## [2,] 0.3716019 1.0000000
qplot(exp[,1], exp[,2])

qqplot(exp[,1], rexp(n))
abline(0, 1)

qqplot(exp[,2], rexp(n))
abline(0, 1)

We have correlation 0.37, very close to our initial value of .4, and both variables are good fits for the exponential distribution.

Tips for writing simulations

In general

  • Don’t forget to profile! Sometimes these simulations can be very slow.
  • When using sample, don’t forget that the default is “without replacement.”
  • Save more. You’ll rarely regret saving too much, you will regret having to re-run a slow simulation because you saved too little.
  • Use set.seed while designing and testing, but be sure to remove it before you run your real simulations!
  • Test your code with small sample sizes and few reps - Don’t just jump into the large numbers!
  • You should spend a lot of time thinking through the set-up of the simulation. If you set up your null hypothesis incorrectly (permutation tests), use a nonsensical parameter (power analysis), or generate data which is not representative (assumption violations), the results are useless no matter how cleverly designed the simulation code is.

Permutation test

  • Recall that the margin of error is roughly estimated by \(1/\sqrt{n}\). So 10,000 replicates of a simulation leads to a 1% margin of error. That may not be sufficient, especially if the p-value is small to begin with. You may be better off starting with relatively few iterations (say 1,000) and increasing to a million or more if the p-value is “interesting” (e.g. somewhere below ~.05).
  • Permutation tests are only valid for null hypotheses that can be framed as “no effect”. By this I mean, if we had a null hypothesis of \(H_0: \mu = 5\), we could let \(\phi = \mu - 5\) and then use \(H'_0: \phi = 0\). Most hypotheses can be written this way, but not all; for example, \(H_0: \mu \in (-1, 1)\) cannot.

Power analysis

  • Remember again: A power analysis is only as good as its estimated parameters.

Assumption Violations

  • If you cannot properly replicate your data for the simulation, try running with a series of standard data-generation set-ups (standardized data with varying levels of correlation, varying signal-to-noise ratios, etc.). In other words, vary your parameters as in a power analysis simulation, but over a larger range of feasible values. These results will be less hyper-targeted to your data and situation, but may provide a more overall picture.

  [1] https://www.nlsinfo.org/sites/nlsinfo.org/files/attachments/15128/Family%20Poverty%20Status%20and%20Family%20Poverty%20Level%20Historical%20Variables.pdf

  [2] Notably we’re ignoring gender here.

  [3] They technically used bootstrap sampling but the effect remains the same.

  [4] Again, technically they used an MCMC but a standard simulation would suffice.

Josh Errickson