ANCOVA

ANCOVA stands for Analysis of Covariance. It’s simply linear regression with one numeric predictor and one categorical predictor. It can help us assess how the linear association between the numeric variable and our dependent variable changes between the groups.

Load data

Read and prepare Prism data. I believe this data corresponds to Fig 2D.

library(pzfx)
library(tidyverse)
H2SO4_d1 <- read_pzfx("21.09.21 Six Month H2SO4 Study.pzfx")

# reshape data to "long format"
d <- H2SO4_d1 %>% 
  pivot_longer(cols = -ROWTITLE, names_to = "g", values_to = "hue") %>% 
  rename(month = ROWTITLE) %>% 
  mutate(g = factor(str_remove(g, "_[0-9]$")),
         month = as.numeric(str_remove(month, "Month ")))
head(d)

## # A tibble: 6 x 3
##   month g             hue
##   <dbl> <fct>       <dbl>
## 1     1 (-) BPB H2O  59.2
## 2     1 (-) BPB H2O  67.0
## 3     1 (-) BPB H2O  44.0
## 4     1 (-) BPB H2O  52.4
## 5     1 (-) BPB H2O  50.3
## 6     1 (-) BPB H2O  61.5

Visualize data

Raw data.

ggplot(d) +
  aes(x = month, y = hue, color = g) +
  geom_point(position = position_dodge(width = 0.4)) +
  scale_x_continuous(breaks = 1:6)

Raw data with smooth trend lines.

ggplot(d) +
  aes(x = month, y = hue, color = g) +
  geom_point(position = position_dodge(width = 0.4)) +
  geom_smooth() +
  scale_x_continuous(breaks = 1:6)

Means over time +/- 1 SD.

ggplot(d) +
  aes(x = month, y = hue, color = g, group = g) +
  stat_summary(fun.data = "mean_sdl", 
               position = position_dodge(width = 0.4)) +
  scale_x_continuous(breaks = 1:6)

Often want to show +/- 1 standard error (SE) (ie, \(\frac{SD}{\sqrt n}\)).

ggplot(d) +
  aes(x = month, y = hue, color = g, group = g) +
  stat_summary(fun.data = "mean_se", 
               position = position_dodge(width = 0.4)) +
  scale_x_continuous(breaks = 1:6)

Run ANCOVA

We use the lm() function to fit a “linear model”. Since month is numeric and g is categorical, this is an ANCOVA. The formula hue ~ g * month means model hue as a function of g, month, and their interaction.

m <- lm(hue ~ g * month, data = d)
anova(m)

## Analysis of Variance Table
## 
## Response: hue
##            Df Sum Sq Mean Sq   F value    Pr(>F)    
## g           3 582704  194235 3418.9677 < 2.2e-16 ***
## month       1     39      39    0.6857 0.4090848    
## g:month     3   1117     372    6.5513 0.0003596 ***
## Residuals 136   7726      57                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction appears to be significant. This says the effect of month depends on the group and vice versa. How so? Let’s follow up with some plots and comparisons.

Plot model

The ggeffects package allows us to quickly visualize our ANCOVA model.

library(ggeffects)
plot(ggpredict(m, terms = c("month", "g")))

This shows the fitted trajectory of the effect of month on hue for the different groups. We can see the significant interaction appears to be due to the different slopes for (+) BPB H20 (green) and (-) BPB H20 (red).

Are these fitted slopes a “good” fit for the data? We can add data to the plot to assess.

plot(ggpredict(m, terms = c("month", "g")), add.data = TRUE)

The slope doesn’t seem to fit month 3 very well. There’s also much more variability around the fitted line for (-) BPB H20 (red). One of the assumptions of ANCOVA is that the variance around the fitted line is the same for each group.

Are slopes different from 0?

The emtrends function in the emmeans package can help us quickly assess this.

In the first section (“emtrends”) we get the slope for each group in the “month.trend” column as well as 95% confidence intervals for each slope. We see that (-) BPB H2O and (+) BPB H2O are the two groups with confidence intervals that do not contain 0. Therefore we’re confident these slopes are different from 0. (Are these estimated slopes practically significant?) The slopes for the acid groups are small and indistinguishable from 0.

In the second section (“contrasts”) we get Tukey pairwise comparisons between the slopes. The slope for (-) BPB H2O) appears to be different from all other groups.

library(emmeans)
emtrends(m, pairwise ~ g, var = "month")

## $emtrends
##  g             month.trend    SE  df lower.CL upper.CL
##  (-) BPB H2O        -2.881 0.736 136   -4.336    -1.43
##  (-) BPB H2SO4       0.188 0.736 136   -1.267     1.64
##  (+) BPB H2O         1.627 0.736 136    0.172     3.08
##  (+) BPB H2SO4      -0.152 0.736 136   -1.606     1.30
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast                          estimate   SE  df t.ratio p.value
##  ((-) BPB H2O) - ((-) BPB H2SO4)     -3.069 1.04 136  -2.950  0.0193
##  ((-) BPB H2O) - ((+) BPB H2O)       -4.508 1.04 136  -4.333  0.0002
##  ((-) BPB H2O) - ((+) BPB H2SO4)     -2.730 1.04 136  -2.624  0.0471
##  ((-) BPB H2SO4) - ((+) BPB H2O)     -1.439 1.04 136  -1.383  0.5122
##  ((-) BPB H2SO4) - ((+) BPB H2SO4)    0.339 1.04 136   0.326  0.9879
##  ((+) BPB H2O) - ((+) BPB H2SO4)      1.778 1.04 136   1.709  0.3229
## 
## P value adjustment: tukey method for comparing a family of 4 estimates

We can plot each of these results if we like.

em <- emtrends(m, pairwise ~ g, var = "month")
plot(em$emtrends)

plot(em$contrasts)