A simulation I wrote for a Darden professor and his collaborator. They were asking about an experiment in which people draw wealth distributions, and they’re wondering whether people have different “ideal” wealth distributions for various groups. They summarize a wealth distribution with the Gini coefficient, a measure of inequality that ranges from 0 to 1. To help them plan their analysis, I offered to simulate data from a beta distribution, show how to analyze the data using beta regression, and then show how to repeatedly simulate and analyze the data to estimate power for a given sample size.
Simulate data for people’s attitudes toward economic inequality (i.e., the Gini coefficient) as affected by three group structures: NG (no groups), WM (well mixed), and SG (segregated).
Assumed effect size
Let’s say we want our study to detect mean Gini differences between groups if the expected means are as follows. (I just made these up; they should be based on subject matter expertise and on what you consider “important” differences to detect.)
NG = 0.50, SG = 0.55, WM = 0.60
Without going into too much detail, beta regression involves a link function, the default of which (at least in R) is the logit (or log-odds). Therefore we need to rescale our effect sizes to the logit scale. The qlogis() function is the logit function.
NG <- qlogis(0.5) # NG = 0
SG <- qlogis(0.55) - qlogis(0.5) # SG - NG = 0.2
WM <- qlogis(0.6) - qlogis(0.5) # WM - NG = 0.4
We’ll use these as the “True” values in generating data.
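As a quick sanity check (a small sketch using the objects defined above), plogis() should map these coefficients back to the assumed group means:
plogis(c(NG, NG + SG, NG + WM))  # should return 0.50 0.55 0.60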
# set sample size
N <- 300
# create group variable; 100 each
grp <- rep(c("NG", "SG", "WM"), each = N/3)
# generate age; uniform random values from age 18 - 75
age <- round(runif(N, 18, 75))
# parameters (or beta coefficients); include 0.005 for age effect; (assuming older people will score higher gini. I have no idea...)
B <- c(NG, SG, WM, 0.005)
B
## [1] 0.0000000 0.2006707 0.4054651 0.0050000
Precision parameter; smaller means less precise (more variance). This is another assumption! I tried to be somewhat conservative.
phi <- 15
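To get a feel for what phi implies (a small sketch, not part of the original analysis): with this mean/precision parameterization, the variance of a beta variable is mu*(1 - mu)/(1 + phi), so a larger phi means values cluster more tightly around the mean.
# implied SD at mu = 0.5 for two candidate phi values
sqrt(0.5 * (1 - 0.5) / (1 + 15))   # about 0.125 when phi = 15
sqrt(0.5 * (1 - 0.5) / (1 + 50))   # about 0.07 when phi = 50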
Now generate the means; plogis() is the inverse logit function and returns values in (0, 1).
mu <- plogis(B[1] +
               B[2]*(grp == "SG") +
               B[3]*(grp == "WM") +
               B[4]*age)
Finally, generate gini using the rbeta() function:
gini <- rbeta(N, mu * phi, (1 - mu) * phi)
Now combine into a data frame:
d <- data.frame(gini, grp, age)
Histogram of the simulated gini values:
hist(d$gini)
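If you also want to eyeball the group differences (optional), a boxplot of gini by group works:
boxplot(gini ~ grp, data = d)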
Mean Gini by groups:
aggregate(gini ~ grp, data = d, mean)
## grp gini
## 1 NG 0.5574288
## 2 SG 0.6221767
## 3 WM 0.6246971
Now work backwards and try to recover the “True” values of 0.5, 0.55, and 0.6. (The raw group means above also reflect the age effect and sampling noise, so they won’t match these targets exactly.)
Use the betareg package to fit a beta regression model. You only need to install it once, but you need to load it with library() every time you start R.
# install.packages("betareg")
library(betareg)
Now fit the model. This is the same model we used to generate the data, so it’s the “correct” model.
m <- betareg(gini ~ grp + age, data = d)
summary(m)
##
## Call:
## betareg(formula = gini ~ grp + age, data = d)
##
## Quantile residuals:
## Min 1Q Median 3Q Max
## -2.2817 -0.7008 0.0854 0.5702 3.3337
##
## Coefficients (mean model with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.068796 0.096545 0.713 0.476108
## grpSG 0.242973 0.073403 3.310 0.000933 ***
## grpWM 0.269440 0.073407 3.670 0.000242 ***
## age 0.003715 0.001759 2.112 0.034693 *
##
## Phi coefficients (precision model with identity link):
## Estimate Std. Error z value Pr(>|z|)
## (phi) 14.336 1.134 12.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: 201.5 on 5 Df
## Pseudo R-squared: 0.06558
## Number of iterations: 11 (BFGS) + 1 (Fisher scoring)
The coefficients should be somewhat close to the True values defined in B and phi above.
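One quick way to line them up (a small sketch using the B and phi objects defined above; coef() on a betareg fit returns the mean coefficients plus the precision parameter):
cbind(True = c(B, phi), Estimate = coef(m))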
The coefficient tests for grpSG and grpWM tell us whether or not we “detected” the difference in expected gini scores for the groups.
The coefficients are on the logit scale and are hard to interpret. I suggest using the emmeans package to get estimated differences on the gini scale with p-values corrected for multiple comparisons. Below we find the expected differences in mean gini scores between groups, holding age at its mean.
# install.packages("emmeans")
library(emmeans)
emm1 <- emmeans(m, specs = "grp", at = list(age = mean(d$age)))
contrast(emm1, "pairwise")
## contrast estimate SE df z.ratio p.value
## NG - SG -0.05868 0.0177 Inf -3.318 0.0026
## NG - WM -0.06490 0.0176 Inf -3.681 0.0007
## SG - WM -0.00622 0.0174 Inf -0.357 0.9322
##
## P value adjustment: tukey method for comparing a family of 3 estimates
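If you prefer confidence intervals for these differences instead of p-values (a small sketch), confint() also works on the contrast object:
confint(contrast(emm1, "pairwise"))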
Is 300 a large enough sample (assuming 100 in each group) to detect the differences between NG and SG, and between NG and WM?
We can estimate power by simulating data and fitting the model many times. Below we create a function that does this and lets us change the sample size. The function returns TRUE if the p-values for both the “grpSG” and “grpWM” coefficients are less than 0.05.
f <- function(N){
  grp <- rep(c("NG", "SG", "WM"), each = N/3)
  age <- round(runif(N, 18, 75))
  B <- c(0, 0.2, 0.4, 0.005)
  phi <- 15
  mu <- plogis(B[1] +
                 B[2]*(grp == "SG") +
                 B[3]*(grp == "WM") +
                 B[4]*age)
  gini <- rbeta(N, mu * phi, (1 - mu) * phi)
  d <- data.frame(gini, grp, age)
  m <- betareg(gini ~ grp + age, data = d)
  p <- coef(summary(m))$mean[c("grpSG", "grpWM"), "Pr(>|z|)"]
  all(p < 0.05)
}
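A single call returns one TRUE/FALSE result (whether both effects were detected in one simulated data set):
f(N = 300)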
Now replicate the function 1000 times. Use the pbreplicate() function from the pbapply package to get a progress bar on the simulation.
# install.packages("pbapply")
library(pbapply)
res <- pbreplicate(n = 1000, expr = f(N = 300))
Estimated power (the mean of a TRUE/FALSE vector is the proportion of TRUEs):
mean(res)
## [1] 0.795
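The same function can be reused to compare a few candidate sample sizes (a sketch; the sample sizes and the 500 replications below are arbitrary, and this takes a while to run):
sizes <- c(150, 300, 450)
power_est <- sapply(sizes, function(n) mean(pbreplicate(n = 500, expr = f(N = n))))
data.frame(N = sizes, power = power_est)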