Load packages

library(haven)
library(mice)
library(mediation)
library(car)
library(ggplot2)

Load data

d <- read_sav("KS_data.sav")

# set as categorical variable
d$PreKTeacherID <- factor(d$PreKTeacherID)

# drop columns we're not using
vars <- c("PreKTeacherID", "WJMath3", "TMathAnx", "MathExFW","WJMath1")
d <- d[,vars]
summary(d)
##  PreKTeacherID    WJMath3         TMathAnx       MathExFW      
##  1052   :  9   Min.   : 48.0   Min.   : 1.0   Min.   :0.00000  
##  3062   :  9   1st Qu.: 93.0   1st Qu.: 2.0   1st Qu.:0.01093  
##  1051   :  8   Median :103.5   Median : 4.0   Median :0.08333  
##  1054   :  8   Mean   :102.9   Mean   : 4.1   Mean   :0.11291  
##  1059   :  8   3rd Qu.:114.0   3rd Qu.: 6.0   3rd Qu.:0.17349  
##  3036   :  8   Max.   :156.0   Max.   :10.0   Max.   :0.34676  
##  (Other):319   NA's   :25                                      
##     WJMath1      
##  Min.   : 54.00  
##  1st Qu.: 88.00  
##  Median : 97.00  
##  Mean   : 97.38  
##  3rd Qu.:107.00  
##  Max.   :154.00  
##  NA's   :10

Visualize data

I made a few plots to get acquainted with the data.

Pairwise scatterplot. Not much variation in MathExFW relative to WJMath3. Only ranges in value from 0 to 0.35.

scatterplotMatrix(d[,-1], )

Also WJMath1 and WJMath3 are highly correlated.

cor(d$WJMath3, d$WJMath1, use = "pairwise.complete")
## [1] 0.8126532

Quick look at variation in WJMath3 among teachers. Seems fairly constant.

ggplot(d) +
  aes(x = reorder(PreKTeacherID, WJMath3, FUN = median, na.rm = TRUE), 
      y = WJMath3) +
  geom_boxplot() +
  labs(x = "Teacher ID") +
  coord_flip() 

Closer look at TMathAnx and WJMath3.

ggplot(d) +
  aes(x = TMathAnx, y = WJMath3) +
  geom_point() +
  geom_smooth()

Closer look at TMathAnx and MathExFW. Points slightly “jittered” since TMathAnx is constant within teacher.

ggplot(d) +
  aes(x = TMathAnx, y = MathExFW) +
  geom_jitter(width = 0.1, height = 0.005) +
  geom_smooth()

Mediation Analysis without imputation

Here use the mediate function from the mediation package. It requires you first fit models for the mediator and outcome. dropobs = TRUE means the models are re-fit using common data rows since each has different combinations of missing data. The argument sims=250 is the number of bootstrap replicates. That should be higher, like around 1000. I did 250 to save some time.

model.M <- lm(MathExFW ~ TMathAnx + WJMath1, data = d)
model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1, data = d)
results <- mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
        boot=TRUE, sims=250, dropobs = TRUE)
## Running nonparametric bootstrap
summary(results)
## 
## Causal Mediation Analysis 
## 
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
## 
##                Estimate 95% CI Lower 95% CI Upper p-value  
## ACME             0.0951       0.0186         0.21   0.016 *
## ADE              0.2949      -0.0595         0.68   0.136  
## Total Effect     0.3899       0.0382         0.78   0.032 *
## Prop. Mediated   0.2438       0.0110         1.64   0.048 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 339 
## 
## 
## Simulations: 250

There is also an optional argument for clustered standard errors. For example I could specify cluster = d$PreKTeacherID to get standard errors clustered on Teacher. However that option is disabled when running the bootstrap. Here’s the mediation analysis on complete data with clustered standard errors and no bootstrap. Of course we get a warning about missing data. And The Average Causal Mediation Effect (ACME) is no longer “significant”.

model.M <- lm(MathExFW ~ TMathAnx + WJMath1, data = d)
model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1, data = d)
results <- mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
                   dropobs = TRUE, cluster = d$PreKTeacherID)
## Warning in mediate(model.M, model.Y, treat = "TMathAnx", mediator =
## "MathExFW", : cluster IDs may not correctly match original observations due to
## missing data
summary(results)
## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value
## ACME             0.0956      -0.0410         0.28    0.18
## ADE              0.2929      -0.1615         0.72    0.21
## Total Effect     0.3884      -0.0956         0.84    0.11
## Prop. Mediated   0.2128      -0.6389         1.84    0.24
## 
## Sample Size Used: 339 
## 
## 
## Simulations: 1000

Mediation Analysis with Multiple Imputation and bootstrap

First impute data 10 times using predictive mean matching (pmm). Notice the warning about “logged events”. That’s due to the high collinarity between WJMath1 and WJMath3.

imp <- mice(d, m=10, method = "pmm", print=FALSE, seed = 12345)
## Warning: Number of logged events: 148

Check imputations to make sure imputed data makes sense in light of warning message (ie, nothing imputed outside the range of the data). The red dots are the imputed values. Imputation number 0 is the original data.

stripplot(imp, WJMath3, pch = 19, xlab = "Imputation number") 

stripplot(imp, WJMath1, pch = 19, xlab = "Imputation number")

Plots look OK for most of the imputations.

Now do mediation analysis on all 10 imputed data sets.

fit <- with(data = imp, expr = {
  model.M <- lm(MathExFW ~ TMathAnx + WJMath1)
  model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1)
  mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
          boot=TRUE, sims=250)
  })
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap
## 
## Running nonparametric bootstrap

Finally pool the results. The only results are for the Average Causal Mediation Effects (acme) and the Average Direct Effects (ade). I’m not sure why they’re duplicated or have “0” and “1” appended to the end. Plus warnings. (?)

summary(pool(fit))
## Warning in get.dfcom(object, dfcom): Infinite sample size assumed.
##     term  estimate  std.error statistic       df   p.value
## 1 acme_0 0.1029198 0.07063795  1.457005 47.41431 0.1517063
## 2 acme_1 0.1029198 0.07063795  1.457005 47.41431 0.1517063
## 3  ade_0 0.3200182 0.28043982  1.141130 87.03192 0.2569462
## 4  ade_1 0.3200182 0.28043982  1.141130 87.03192 0.2569462

Mediation Analysis with Multiple Imputation and clustered SE

Here we include clustered Standard Errors for Teacher but suppress the bootstrap.

fit2 <- with(data = imp, expr = {
  model.M <- lm(MathExFW ~ TMathAnx + WJMath1)
  model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1)
  mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
          cluster = d$PreKTeacherID)
  })

And pool the results.

summary(pool(fit2))
## Warning in get.dfcom(object, dfcom): Infinite sample size assumed.
##     term  estimate std.error statistic       df   p.value
## 1 acme_0 0.1030459 0.1151990 0.8945033 313.9811 0.3717378
## 2 acme_1 0.1030459 0.1151990 0.8945033 313.9811 0.3717378
## 3  ade_0 0.3175318 0.3090569 1.0274218 146.7938 0.3059119
## 4  ade_1 0.3175318 0.3090569 1.0274218 146.7938 0.3059119

This is the best I can do at the moment for mediation analysis with covariates on multiply imputed data.