library(haven)
library(mice)
library(mediation)
library(car)
library(ggplot2)
d <- read_sav("KS_data.sav")
# set as categorical variable
d$PreKTeacherID <- factor(d$PreKTeacherID)
# drop columns we're not using
vars <- c("PreKTeacherID", "WJMath3", "TMathAnx", "MathExFW","WJMath1")
d <- d[,vars]
summary(d)
## PreKTeacherID WJMath3 TMathAnx MathExFW
## 1052 : 9 Min. : 48.0 Min. : 1.0 Min. :0.00000
## 3062 : 9 1st Qu.: 93.0 1st Qu.: 2.0 1st Qu.:0.01093
## 1051 : 8 Median :103.5 Median : 4.0 Median :0.08333
## 1054 : 8 Mean :102.9 Mean : 4.1 Mean :0.11291
## 1059 : 8 3rd Qu.:114.0 3rd Qu.: 6.0 3rd Qu.:0.17349
## 3036 : 8 Max. :156.0 Max. :10.0 Max. :0.34676
## (Other):319 NA's :25
## WJMath1
## Min. : 54.00
## 1st Qu.: 88.00
## Median : 97.00
## Mean : 97.38
## 3rd Qu.:107.00
## Max. :154.00
## NA's :10
I made a few plots to get acquainted with the data.
Pairwise scatterplot. Not much variation in MathExFW
relative to WJMath3
. Only ranges in value from 0 to
0.35.
scatterplotMatrix(d[,-1], )
Also WJMath1
and WJMath3
are highly
correlated.
cor(d$WJMath3, d$WJMath1, use = "pairwise.complete")
## [1] 0.8126532
Quick look at variation in WJMath3
among teachers. Seems
fairly constant.
ggplot(d) +
aes(x = reorder(PreKTeacherID, WJMath3, FUN = median, na.rm = TRUE),
y = WJMath3) +
geom_boxplot() +
labs(x = "Teacher ID") +
coord_flip()
Closer look at TMathAnx
and WJMath3
.
ggplot(d) +
aes(x = TMathAnx, y = WJMath3) +
geom_point() +
geom_smooth()
Closer look at TMathAnx
and MathExFW
.
Points slightly “jittered” since TMathAnx
is constant
within teacher.
ggplot(d) +
aes(x = TMathAnx, y = MathExFW) +
geom_jitter(width = 0.1, height = 0.005) +
geom_smooth()
Here use the mediate
function from the
mediation package. It requires you first fit models for
the mediator and outcome. dropobs = TRUE
means the models
are re-fit using common data rows since each has different combinations
of missing data. The argument sims=250
is the number of
bootstrap replicates. That should be higher, like around 1000. I did 250
to save some time.
model.M <- lm(MathExFW ~ TMathAnx + WJMath1, data = d)
model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1, data = d)
results <- mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
boot=TRUE, sims=250, dropobs = TRUE)
## Running nonparametric bootstrap
summary(results)
##
## Causal Mediation Analysis
##
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
##
## Estimate 95% CI Lower 95% CI Upper p-value
## ACME 0.0951 0.0186 0.21 0.016 *
## ADE 0.2949 -0.0595 0.68 0.136
## Total Effect 0.3899 0.0382 0.78 0.032 *
## Prop. Mediated 0.2438 0.0110 1.64 0.048 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sample Size Used: 339
##
##
## Simulations: 250
There is also an optional argument for clustered standard errors. For
example I could specify cluster = d$PreKTeacherID
to get
standard errors clustered on Teacher. However that option is disabled
when running the bootstrap. Here’s the mediation analysis on complete
data with clustered standard errors and no bootstrap. Of course we get a
warning about missing data. And The Average Causal Mediation Effect
(ACME) is no longer “significant”.
model.M <- lm(MathExFW ~ TMathAnx + WJMath1, data = d)
model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1, data = d)
results <- mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
dropobs = TRUE, cluster = d$PreKTeacherID)
## Warning in mediate(model.M, model.Y, treat = "TMathAnx", mediator =
## "MathExFW", : cluster IDs may not correctly match original observations due to
## missing data
summary(results)
##
## Causal Mediation Analysis
##
## Quasi-Bayesian Confidence Intervals
##
## Estimate 95% CI Lower 95% CI Upper p-value
## ACME 0.0956 -0.0410 0.28 0.18
## ADE 0.2929 -0.1615 0.72 0.21
## Total Effect 0.3884 -0.0956 0.84 0.11
## Prop. Mediated 0.2128 -0.6389 1.84 0.24
##
## Sample Size Used: 339
##
##
## Simulations: 1000
First impute data 10 times using predictive mean matching (pmm). Notice the warning about “logged events”. That’s due to the high collinarity between WJMath1 and WJMath3.
imp <- mice(d, m=10, method = "pmm", print=FALSE, seed = 12345)
## Warning: Number of logged events: 148
Check imputations to make sure imputed data makes sense in light of warning message (ie, nothing imputed outside the range of the data). The red dots are the imputed values. Imputation number 0 is the original data.
stripplot(imp, WJMath3, pch = 19, xlab = "Imputation number")
stripplot(imp, WJMath1, pch = 19, xlab = "Imputation number")
Plots look OK for most of the imputations.
Now do mediation analysis on all 10 imputed data sets.
fit <- with(data = imp, expr = {
model.M <- lm(MathExFW ~ TMathAnx + WJMath1)
model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1)
mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
boot=TRUE, sims=250)
})
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
##
## Running nonparametric bootstrap
Finally pool the results. The only results are for the Average Causal Mediation Effects (acme) and the Average Direct Effects (ade). I’m not sure why they’re duplicated or have “0” and “1” appended to the end. Plus warnings. (?)
summary(pool(fit))
## Warning in get.dfcom(object, dfcom): Infinite sample size assumed.
## term estimate std.error statistic df p.value
## 1 acme_0 0.1029198 0.07063795 1.457005 47.41431 0.1517063
## 2 acme_1 0.1029198 0.07063795 1.457005 47.41431 0.1517063
## 3 ade_0 0.3200182 0.28043982 1.141130 87.03192 0.2569462
## 4 ade_1 0.3200182 0.28043982 1.141130 87.03192 0.2569462
Here we include clustered Standard Errors for Teacher but suppress the bootstrap.
fit2 <- with(data = imp, expr = {
model.M <- lm(MathExFW ~ TMathAnx + WJMath1)
model.Y <- lm(WJMath3 ~ TMathAnx + MathExFW + WJMath1)
mediate(model.M, model.Y, treat='TMathAnx', mediator='MathExFW',
cluster = d$PreKTeacherID)
})
And pool the results.
summary(pool(fit2))
## Warning in get.dfcom(object, dfcom): Infinite sample size assumed.
## term estimate std.error statistic df p.value
## 1 acme_0 0.1030459 0.1151990 0.8945033 313.9811 0.3717378
## 2 acme_1 0.1030459 0.1151990 0.8945033 313.9811 0.3717378
## 3 ade_0 0.3175318 0.3090569 1.0274218 146.7938 0.3059119
## 4 ade_1 0.3175318 0.3090569 1.0274218 146.7938 0.3059119
This is the best I can do at the moment for mediation analysis with covariates on multiply imputed data.