Simulate data

Following advice from https://stats.stackexchange.com/questions/321770/simulating-data-for-an-ordered-logit-model.

Below we simulate data for 200 people (n) with 20 observations each. b1 is the effect of grp = “H”. It has the effect of increasing the probability of choosing “in group” when presented with a Human task. I have also added a random effect for each subject, which is some noise drawn from a Normal distribution with mean 0 and standard deviation 0.1.

n <- 200
id <- 1:n
rep <- 1:20
d <- expand.grid(rep = rep, id = id)
d <- d[order(d$id),]
d$rep <- NULL
d$grp <- rep(rep(c("AI", "H"), each = 10), n)

# coefficients
b01 <- 0.1   # intercept 1
b02 <- -0.5  # intercept 2
b1 <- 1.2    # "H" effect

# random effects for subjects
set.seed(99)
z <- rnorm(n = n, 0, sd=0.1)
re <- rep(z, each = 20)

# generate log odds using coefficients
logodds1 <- b01 + b1 * (d$grp=="H") + re
logodds2 <- b02 + b1 * (d$grp=="H") + re

# convert to probabilities; plogis() is inverse logit (ie, converts logodds to
# probabilities)
prob_2to3 <- plogis(logodds1)
prob_3 <- plogis(logodds2)
prob_1 <- 1 - prob_2to3
prob_2 <- prob_2to3 - prob_3

# now simulate 3-level response using probabilities
f <- function(p1, p2, p3)sample(x = c(1:3), size = 1, prob = c(p1, p2, p3))
set.seed(12)
d$y <- mapply(f, p1 = prob_1, p2 = prob_2, p3 = prob_3)
# create ordered factor
d$y <- ordered(d$y, labels = c("out_group", "neither", "in_group"))

Investigate data.

summary(d$y)
## out_group   neither  in_group 
##      1332       546      2122
xtabs(~ grp + y, data = d) |>
  proportions(margin = 1) |>
  round(2)
##     y
## grp  out_group neither in_group
##   AI      0.47    0.14     0.39
##   H       0.20    0.13     0.67

Subjects classify “H” tasks as “in_group” 67% of the time versus only 39% of the time for “AI” tasks.

Fit model

Now we fit a model try to recover the “true” value of b1 = 1.2.

I show two approaches. The first is a mixed-effect proportional odds model using the clmm() function from the ordinal package. It’s actually the appropriate choice here because it assumes the data was generated precisely as I generated it.

Notice it comes very close to the “true” value of 1.2 with an estimate of 1.21571. I also specified 0.1 as the standard deviation for the subject random effect. The model estimates 0.2463. Not too bad. Finally the threshold estimates are quite close the true intercepts of 0.1 and -0.5 in absolute value. (There’s a reason the signs are flipped which is not worth going into at the moment.)

library(ordinal)
m <- clmm(y ~ grp + (1|id), data = d)
summary(m)
## Cumulative Link Mixed Model fitted with the Laplace approximation
## 
## formula: y ~ grp + (1 | id)
## data:    d
## 
##  link  threshold nobs logLik   AIC     niter    max.grad cond.H 
##  logit flexible  4000 -3707.69 7423.39 157(337) 1.64e-05 1.5e+02
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  id     (Intercept) 0.06064  0.2463  
## Number of groups:  id 200 
## 
## Coefficients:
##      Estimate Std. Error z value Pr(>|z|)    
## grpH  1.21571    0.06426   18.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                   Estimate Std. Error z value
## out_group|neither -0.15288    0.04753  -3.217
## neither|in_group   0.47803    0.04819   9.920

We can visualize the model with ggeffects package. It basically shows the proportions we calculated earlier with SE bars provided by the model.

library(ggeffects)
# visualize model
ggpredict(m, terms = c("grp")) |> plot()

We can interpret the grp coefficient by exponentiating. The result is an odds ratio. The estimated odds that “H” tasks are classified into the higher direction (ie, toward the “in group”) are about 3 times higher than the estimated odds that “AI” tasks would be classified in the higher direction. That’s a mouthful. I prefer the above effect plot that communicates the result using probabilities.

exp(1.21571)
## [1] 3.372688

We can also use GEE as follows. The results are very similar to the previous model, but I’m not sure how to easily visualize.

library(geepack)
m2 <- ordgee(y ~ grp, id = id, data = d)
summary(m2)
## 
## Call:
## ordgee(formula = y ~ grp, id = id, data = d)
## 
## Mean Model:
##  Mean Link:                 logit 
##  Variance to Mean Relation: binomial 
## 
##  Coefficients:
##                   estimate     san.se      wald            p
## Inter:out_group  0.1549075 0.04670813  10.99917 0.0009115256
## Inter:neither   -0.4692746 0.04554033 106.18462 0.0000000000
## grpH             1.2222423 0.06635624 339.27446 0.0000000000
## 
## Scale is fixed.
## 
## Correlation Model:
##  Correlation Structure:     independence 
## 
## Returned Error Value:    0 
## Number of clusters:   200   Maximum cluster size: 20