Question from student:
I’m running some simple linear regression analyses for my research, and I’m hoping to get some clarification on when it is appropriate to use dummy coding versus effects coding for categorical variables. As an example, I’ve tried coding gender as 1 and 0 and also as 1 and -1 (as described in the simple effects contrast coding section of the article linked below), and I get very different results, so I’m not sure which to use or which is more appropriate for my research question.
We first simulate a vector of genders. Then we calculate y, which has a mean of 12 if male and 14 (i.e., 12 + 2) if female. We add noise to the means using the rnorm() function with a standard deviation of 2.
We then calculate the simulated means of each gender. Notice this is a balanced design: each group has 50 observations.
set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)
# calculate group means
means <- aggregate(y ~ gender, FUN = mean)
means
## gender y
## 1 female 14.16118
## 2 male 11.71414
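As a quick check on the balance claim, the group sizes can be tabulated directly. This is a minimal sketch; the name counts is introduced here only for illustration.

```r
# re-create the gender vector from the simulation above
gender <- rep(c("male", "female"), each = 50)

# count observations per group; a balanced design has equal counts
counts <- table(gender)
counts
```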
Now we fit a model with dummy coding. This is also called treatment contrasts.
gender_dummy <- ifelse(gender == "male", 0, 1)
m <- lm(y ~ gender_dummy)
summary(m)
##
## Call:
## lm(formula = y ~ gender_dummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4597 -1.1438 -0.2272 1.1526 4.4299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.7141 0.2439 48.038 < 2e-16 ***
## gender_dummy 2.4470 0.3449 7.096 2.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.724 on 98 degrees of freedom
## Multiple R-squared: 0.3394, Adjusted R-squared: 0.3327
## F-statistic: 50.35 on 1 and 98 DF, p-value: 2.036e-10
The intercept is the mean of the male group (the group coded 0), and the slope is the difference in means: the female mean minus the male mean.
# intercept: mean of male
subset(means, gender == "male")
## gender y
## 2 male 11.71414
# slope: difference in means
means$y[1] - means$y[2]
## [1] 2.447044
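In practice the 0/1 variable does not need to be built by hand. The sketch below assumes R's default contrast setting for unordered factors (treatment contrasts) has not been changed; the names gender_f and m_factor are introduced here only for illustration. Storing gender as a factor with "male" as the reference level should reproduce the same intercept and slope.

```r
# re-create the balanced data from above
set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)

# store gender as a factor and make "male" the reference (0) level
gender_f <- relevel(factor(gender), ref = "male")

# show the coding R will use: male = 0, female = 1
contrasts(gender_f)

# fit the model; lm() builds the dummy variable from the factor
m_factor <- lm(y ~ gender_f)
coef(m_factor)
```

The slope is labeled with the factor name followed by the non-reference level, which makes it easier to see which group is being compared to the reference.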
Now we fit a model with effects coding. This is also called sum contrasts or deviation coding.
gender_effects <- ifelse(gender == "male", -1, 1)
m2 <- lm(y ~ gender_effects)
summary(m2)
##
## Call:
## lm(formula = y ~ gender_effects)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4597 -1.1438 -0.2272 1.1526 4.4299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.9377 0.1724 75.031 < 2e-16 ***
## gender_effects 1.2235 0.1724 7.096 2.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.724 on 98 degrees of freedom
## Multiple R-squared: 0.3394, Adjusted R-squared: 0.3327
## F-statistic: 50.35 on 1 and 98 DF, p-value: 2.036e-10
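The intercept is the grand mean, and the slope is the distance of each group mean from the grand mean: the female mean (coded 1) lies one slope above it, and the male mean (coded -1) lies one slope below it. Equivalently, the slope is half the difference in means from the dummy-coded model (2.4470 / 2 = 1.2235).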
# intercept: grand mean
mean(y)
## [1] 12.93766
# slope: absolute difference in each group mean from the grand mean
abs(mean(y) - means$y)
## [1] 1.223522 1.223522
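The same 1/-1 coding can be requested from a factor using contr.sum(), which is part of base R. This is a sketch rather than the only way to do it; the names gender_f and m_sum are introduced here for illustration. Because contr.sum() assigns 1 to the first level and -1 to the last, the levels are ordered so that "female" comes first, matching the hand-coded variable.

```r
# re-create the balanced data from above
set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)

# factor with "female" as the first level and "male" as the last
gender_f <- factor(gender, levels = c("female", "male"))

# assign sum-to-zero contrasts: female = 1, male = -1
contrasts(gender_f) <- contr.sum(2)
contrasts(gender_f)

# fit the model; coefficients should match the hand-coded 1/-1 version
m_sum <- lm(y ~ gender_f)
coef(m_sum)
```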
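Notice the standard errors of the coefficients are smaller in the model with effects coding than in the model with dummy coding. The intercept's standard error is smaller because the grand mean is estimated from all 100 observations, while the male mean is estimated from only 50. The slope's standard error is half as large (0.1724 versus 0.3449) because the slope itself is half the difference in means. The t value and p-value for the slope, the residual standard error, R-squared and the F-statistic are identical in both summaries: the two codings describe the same fitted model and differ only in how the coefficients are parameterized.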
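The relationship between the two codings can be checked directly. The sketch below refits both models on the balanced data and compares fitted values, slopes, and slope standard errors; the helper se() is introduced here only for illustration.

```r
# re-create the balanced data and both codings
set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)
gender_dummy <- ifelse(gender == "male", 0, 1)
gender_effects <- ifelse(gender == "male", -1, 1)
m <- lm(y ~ gender_dummy)
m2 <- lm(y ~ gender_effects)

# both models produce the same fitted values, hence the same residuals
all.equal(fitted(m), fitted(m2))

# helper: extract coefficient standard errors from a fitted model
se <- function(fit) summary(fit)$coefficients[, "Std. Error"]

# ratio of dummy slope to effects slope, and of their standard errors;
# both ratios should be 2, which leaves the t value unchanged
unname(coef(m)[2] / coef(m2)[2])
unname(se(m)[2] / se(m2)[2])
```

Since the fit is the same either way, the choice between codings comes down to which quantities the intercept and slope should represent.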
If we don’t have balance in our groups, the interpretation of coefficients changes slightly.
The intercept is the unweighted mean of the group means, which no longer has to equal the grand mean of y, and the slope is the absolute difference in each group mean from the “mean of group means”.
set.seed(12)
gender <- sample(c("male", "female"), size = 100, replace = TRUE)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)
# calculate means
means <- aggregate(y ~ gender, FUN = mean)
means
## gender y
## 1 female 13.97248
## 2 male 12.32410
gender_effects <- ifelse(gender == "male", -1, 1)
m3 <- lm(y ~ gender_effects)
summary(m3)
##
## Call:
## lm(formula = y ~ gender_effects)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2710 -1.3736 -0.0839 1.4237 4.2925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.1483 0.1932 68.049 < 2e-16 ***
## gender_effects 0.8242 0.1932 4.266 4.6e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.907 on 98 degrees of freedom
## Multiple R-squared: 0.1566, Adjusted R-squared: 0.148
## F-statistic: 18.2 on 1 and 98 DF, p-value: 4.6e-05
# intercept: mean of group means
mean(means$y)
## [1] 13.14829
# slope: absolute difference in each group mean from "mean of group means"
abs(means$y - mean(means$y))
## [1] 0.8241916 0.8241916
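To see what changes without balance, the sketch below prints the group sizes and puts the grand mean, the mean of group means, and the effects-coded intercept side by side. It also refits the dummy-coded model on the same data, whose interpretation does not depend on balance. The name m4 is introduced here for illustration, and the simulated values may differ across R versions because the default behavior of sample() changed in R 3.6.0.

```r
# re-create the unbalanced data from above
set.seed(12)
gender <- sample(c("male", "female"), size = 100, replace = TRUE)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)
means <- aggregate(y ~ gender, FUN = mean)

# group sizes: unequal counts mean the design is unbalanced
table(gender)

# effects coding: intercept tracks the mean of group means, not mean(y)
gender_effects <- ifelse(gender == "male", -1, 1)
m3 <- lm(y ~ gender_effects)
c(grand_mean = mean(y),
  mean_of_group_means = mean(means$y),
  intercept = unname(coef(m3)[1]))

# dummy coding: intercept is still the male mean and the slope is
# still the female mean minus the male mean
gender_dummy <- ifelse(gender == "male", 0, 1)
m4 <- lm(y ~ gender_dummy)
coef(m4)
```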