library(lme4)
library(emmeans)
library(ggplot2)
LitterCN <- read.table("Litter20192020.txt", header = TRUE)
LitterCN$Age <- factor(LitterCN$Age)
LitterCN$Plot <- factor(LitterCN$Plot)
LitterCN$Year <- factor(LitterCN$Year)
Notice the response variable ranges from 0.9 to 3.2.
summary(LitterCN$N_perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9149 1.3709 1.6825 1.7726 2.1190 3.2110
Quick visual.
ggplot(LitterCN) +
aes(x = Age, y = log(N_perc)) +
geom_jitter(width = 0.1, height = 0) +
facet_grid(~ Year)
The goal is to model N_perc as a function of Age, taking into account that measurements are clustered within Plots, which are nested within Years.
xtabs(~ Plot + Year, data = LitterCN)
## Year
## Plot 2019 2020
## E1A 2 1
## E1B 1 0
## E1C 2 1
## E1D 3 1
## E1E 2 1
## E2A 3 0
## E2B 0 1
## E2D 2 1
## E2E 3 1
## L1A 2 0
## L1B 2 1
## L1C 2 1
## L1D 0 1
## L1E 3 0
## L2A 2 0
## L2B 0 1
## L2C 2 1
## L2E 2 0
## M1A 2 1
## M1B 2 1
## M1C 2 1
## M1D 3 1
## M1E 2 0
## M2A 3 1
## M2B 3 0
## M2C 0 1
## M2E 2 1
Notice all the zeroes! There are several plots with 0 observations for a given year.
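If you'd rather count the empty cells than eyeball the table, a quick check (my addition, not part of the original analysis) is the following:
tab <- xtabs(~ Plot + Year, data = LitterCN)
sum(tab == 0)  ## number of Plot:Year cells with no observations
sum(tab > 0)   ## number of Plot:Year cells with at least one observation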
Here is the model the student fit. They log-transformed N_perc to help meet modeling assumptions. They elected to specify Year as a random effect, which I don’t agree with. This basically says there are three sources of variation in the data: variation between Years, variation between Plots within Years, and residual variation within Plots.
The model says to fit a random intercept for every Plot:Year combination observed in the data. That’s 42 random effects (a count we can verify after fitting the model below). The size of the data is n = 71.
Nperc.lme <- lmer(log(N_perc) ~ Age + (1|Year/Plot),
data=LitterCN)
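As a quick sanity check on that count (not something the student ran), lme4’s ngrps() reports the number of levels of each random-effect grouping factor in the fitted model:
## number of levels per grouping factor; for these data it should show
## 42 Plot:Year groups and 2 Year groups
ngrps(Nperc.lme)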
Next they used the emmeans package to get estimated mean N_perc for each level of Age.
emm <- emmeans(Nperc.lme, "Age", type = "response")
emm
## Age response SE df lower.CL upper.CL
## E 1.58 0.263 1.16 0.338 7.37
## L 1.39 0.235 1.22 0.340 5.70
## M 1.76 0.293 1.16 0.375 8.26
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
Notice how large the confidence intervals are! Recall the response variable only ranges from about 0.9 to 3.2. Their question was, “why are the intervals so big?”
The answer is the approximate degrees of freedom. Why are they approximate? When the data in a mixed-effect model are unbalanced, the null distributions of the coefficient test statistics are unknown. Hence statisticians have derived a couple of approximations. The one emmeans uses by default is the Kenward-Roger approximation.
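As an aside, emmeans lets you request a different approximation through its lmer.df argument. A Satterthwaite version would look something like the following (shown only for illustration; it is not what the student ran, and the df for this model would still be small):
## alternative df method; the Satterthwaite option requires the lmerTest package
emmeans(Nperc.lme, "Age", type = "response", lmer.df = "satterthwaite")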
Now notice the df values in the output above are only about 1.2. That’s tiny. The t-distribution is parameterized by its degrees of freedom. At df = 1 it becomes the pathological Cauchy distribution, whose variance is undefined, and for df between 1 and 2 the variance is infinite. So a t-distribution with df = 1.2 is going to have huge variability!
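To see how quickly the tails fatten as the df shrink toward 1, compare the 97.5% quantile over a range of df values (a quick illustration of my own):
## 97.5% t quantiles for decreasing df; the quantile explodes as df approaches 1
sapply(c(30, 10, 5, 2, 1.5, 1.2), function(d) qt(0.975, df = d))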
We use the t-distribution to get a quantile, which we multiply by the standard error to get a margin of error. The classic example given in an Intro Stats book is this:
\[\text{MOE} = \text{se} \times 1.96\]
\[\text{CI} = \bar{x} \pm \text{MOE} \]
The 1.96 value is the 97.5% quantile of the standard normal distribution.
qnorm(0.975)
## [1] 1.959964
The 97.5% quantile of a t-distribution with df = 1.16 is MUCH larger.
qt(0.975, df = 1.16)
## [1] 9.228909
Hence the reason the confidence intervals are so big.
Now the follow-up question: why are the df approximations so small? This is tougher to answer. The Kenward-Roger derivation is very complicated (see Appendix A.1 of this paper). Common sense tells us, however, that it’s likely due to fitting 42 random effects to a sample of size 71, with roughly a dozen Plot:Year combinations contributing 0 observations (see the xtabs table above).
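A quick way to see how thinly the data are spread is to tabulate how many observations each Plot:Year cell contributes (my own check, not from the original analysis):
## how many Plot:Year cells have 0, 1, 2, or 3 observations
table(xtabs(~ Plot + Year, data = LitterCN))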
If we treat Year as a fixed effect, the degrees of freedom climb to around 23 to 26, which results in quantiles closer to 1.96.[1]
sapply(23:25, function(x)qt(0.975, df = x))
## [1] 2.068658 2.063899 2.059539
Nperc.lme2 <- lmer(log(N_perc) ~ Age + Year + (1|Plot),
data=LitterCN)
emmeans(Nperc.lme2, "Age", type = "response")
## Age response SE df lower.CL upper.CL
## E 1.62 0.1119 23.9 1.41 1.87
## L 1.36 0.0957 25.8 1.18 1.58
## M 1.71 0.1166 23.3 1.48 1.97
##
## Results are averaged over the levels of: Year
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
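If you want to see how the variance gets reallocated when Year moves from the random part of the model to the fixed part, comparing the estimated variance components is a quick side check (my addition, not part of the original analysis):
## estimated variance components under each parameterization
VarCorr(Nperc.lme)   ## Year and Plot-within-Year both random
VarCorr(Nperc.lme2)  ## only Plot random; Year is fixed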
I happen to think Year should be a fixed effect. Time is usually a fixed effect in mixed-effect models. One way to think about it: if I were to replicate this experiment I would also carry it out over two years but use a different set of plots. Hence Year (time) is a fixed effect but Plot is a random effect.
Let’s look more closely at the default result of emmeans(Nperc.lme, "Age", type = "response"), this time printed as a data frame with full precision:
as.data.frame(emm)
##   Age response        SE       df  lower.CL upper.CL
## 1   E 1.577338 0.2628918 1.158641 0.3375770 7.370157
## 2   L 1.392959 0.2354046 1.223774 0.3402429 5.702791
## 3   M 1.758831 0.2930040 1.156475 0.3745622 8.258938
The first entry, Age = E, has a huge 95% confidence interval of [0.338, 7.37]. The confidence interval is calculated using the t-distribution: you add and subtract a margin of error around the estimated value of 1.577, with the arithmetic done on the log scale and then back-transformed. The margin of error is the standard error on the log scale (about 0.167; more on that below) times the 0.975 quantile from a t-distribution with 1.16 degrees of freedom. That is a tiny df! A t-distribution with df = 1 is a Cauchy distribution with undefined variance! So the quantile is large. We can calculate it in R as follows:
qt(p = 0.975, df = 1.158641)
## [1] 9.25012
Calculate the margin of error using the standard error on the log scale, add and subtract it from the estimated mean on the log scale, and then exponentiate:
se <- 0.2628918/1.577338  ## SE on log scale: 0.1666680
moe <- se * qt(p = 0.975, df = 1.158641)
exp(log(1.577338) + c(-moe, moe))
## [1] 0.3375774 7.3701480
There’s some rounding error, but it matches. The weird standard error calculation is due to back-transforming a standard error that was estimated on the log scale. To get it onto the response scale, you multiply the back-transformed mean by the log-scale standard error:
1.577338 * 0.1666680 = 0.2628918
So to work backwards and recover 0.1666680 (the SE on the log scale), you divide 0.2628918 by 1.577338.
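Both directions of that relationship are easy to check in R:
1.577338 * 0.1666680  ## response-scale SE, ~0.2628918
0.2628918 / 1.577338  ## log-scale SE, ~0.1666680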
Anyway, all that to say: the Kenward-Roger degrees-of-freedom estimates are tiny, probably because of the “empty” Plots within Years. A df estimate for a t-statistic is basically a sample size minus the number of estimated parameters. The largest number of observations in any Plot-within-Year cell is 3, with lots of ones, twos, and zeroes, so the estimated df are barely above 1. Hence lots of uncertainty and thus big confidence intervals.
[1] Many statisticians simply add/subtract 2 standard errors to get confidence intervals in their head. (No reason to obsess over 2 versus 1.96 when we’re dealing with estimates.)