dummy coding versus effects coding

Question from student:

I’m running some simple linear regression analyses for my research, and I’m hoping to get some clarification on when it is appropriate to use dummy coding versus effects coding for categorical variables. As an example, I’ve tried coding gender as 1 and 0 and also as 1 and -1 (as described in the simple effects contrast coding section of the article linked below), and I get very different results, so I’m not sure which to use or which is more appropriate for my research question.

simulate data

We first simulate a vector of genders. Then we simulate y, which has a mean of 12 for males and 14 (i.e., 12 + 2) for females. The rnorm() function adds normally distributed noise around those means with a standard deviation of 2. We then calculate the sample mean of each gender. Notice this is a balanced design: each group has 50 observations.

set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)

# calculate group means
means <- aggregate(y ~ gender, FUN = mean)
means

##   gender        y
## 1 female 14.16118
## 2   male 11.71414

model with dummy coding

Now we fit a model with dummy coding, also known as treatment contrasts. Males are coded 0 (the reference level) and females are coded 1.

gender_dummy <- ifelse(gender == "male", 0, 1)
m <- lm(y ~ gender_dummy)
summary(m)

## 
## Call:
## lm(formula = y ~ gender_dummy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4597 -1.1438 -0.2272  1.1526  4.4299 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.7141     0.2439  48.038  < 2e-16 ***
## gender_dummy   2.4470     0.3449   7.096 2.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.724 on 98 degrees of freedom
## Multiple R-squared:  0.3394, Adjusted R-squared:  0.3327 
## F-statistic: 50.35 on 1 and 98 DF,  p-value: 2.036e-10

The intercept is the mean of the male group (the group coded 0), and the slope is the difference in means: the female mean minus the male mean.

# intercept: mean of male
subset(means, gender == "male")

##   gender        y
## 2   male 11.71414

# slope: difference in means (female - male)
means$y[1] - means$y[2]

## [1] 2.447044
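
In practice R builds the dummy variable for us when the predictor is a factor, since treatment contrasts are the default for unordered factors. The following is a minimal sketch that assumes the same simulated data as above; the names gender_f and m_factor are illustrative. Because factor levels are sorted alphabetically, "female" would be the reference level by default, so relevel() is used to make "male" the reference and match the manual 0/1 coding.

```r
# re-create the simulated data from above
set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)

# convert to a factor and make "male" the reference level
gender_f <- relevel(factor(gender), ref = "male")

# the treatment contrast matrix: male = 0, female = 1
contrasts(gender_f)

# same estimates as the manual 0/1 coding
m_factor <- lm(y ~ gender_f)
coef(m_factor)
```

The slope is then labeled with the non-reference level (here, female), which makes the direction of the difference explicit.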

model with effects coding

Now we fit a model with effects coding. Males are coded -1 and females are coded 1.

gender_effects <- ifelse(gender == "male", -1, 1)
m2 <- lm(y ~ gender_effects)
summary(m2)

## 
## Call:
## lm(formula = y ~ gender_effects)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4597 -1.1438 -0.2272  1.1526  4.4299 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.9377     0.1724  75.031  < 2e-16 ***
## gender_effects   1.2235     0.1724   7.096 2.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.724 on 98 degrees of freedom
## Multiple R-squared:  0.3394, Adjusted R-squared:  0.3327 
## F-statistic: 50.35 on 1 and 98 DF,  p-value: 2.036e-10

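The intercept is the grand mean, and the slope is the distance of each group mean from the grand mean: the female mean (coded 1) lies 1.2235 above it and the male mean (coded -1) lies 1.2235 below it. The slope is therefore half the difference in means estimated by the dummy-coded model (2.4470 / 2 = 1.2235).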

# intercept: grand mean
mean(y)

## [1] 12.93766

# slope: absolute difference in each group mean from the grand mean
abs(mean(y) - means$y)

## [1] 1.223522 1.223522
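
The same coding is available through contr.sum() when the predictor is a factor. This is a sketch under the same simulated data; gender_f and m_sum are illustrative names. With the levels in alphabetical order, contr.sum() assigns 1 to "female" and -1 to "male" (the last level), which matches the manual coding above.

```r
# re-create the simulated data from above
set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)

# factor levels are sorted alphabetically: "female", "male"
gender_f <- factor(gender)

# sum-to-zero (effects) contrasts: female = 1, male = -1
contrasts(gender_f) <- contr.sum(2)
contrasts(gender_f)

# same estimates as the manual 1/-1 coding
m_sum <- lm(y ~ gender_f)
coef(m_sum)
```

If the levels were ordered differently, the sign of the slope would flip, so it is worth printing the contrast matrix before interpreting the coefficient.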

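Notice the coefficient standard errors for the model with effects coding are smaller than those for the model with dummy coding. The intercept's standard error is smaller because the grand mean is estimated from all 100 observations, whereas the male mean is estimated from only 50. The slope's standard error is half as large (0.1724 versus 0.3449) because the slope itself is half as large. The t value and p-value for the slope, the residual standard error, and the R-squared are identical in both models: the two codings describe the same fit and differ only in what the coefficients mean. Which one to use depends on whether the research question concerns the difference from a reference group (dummy coding) or from the overall mean (effects coding).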
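
One way to check how the two sets of estimates relate is to compare the coefficient tables directly. This is a sketch that refits both models on the same simulated balanced data; the names m_dummy, m_effects, tab_dummy and tab_effects are illustrative.

```r
# re-create the simulated balanced data from above
set.seed(12)
gender <- rep(c("male", "female"), each = 50)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)

m_dummy   <- lm(y ~ ifelse(gender == "male", 0, 1))
m_effects <- lm(y ~ ifelse(gender == "male", -1, 1))

# coefficient tables: Estimate, Std. Error, t value, Pr(>|t|)
tab_dummy   <- coef(summary(m_dummy))
tab_effects <- coef(summary(m_effects))

# ratio of dummy-coded to effects-coded slope, and of their standard errors
tab_dummy[2, "Estimate"] / tab_effects[2, "Estimate"]
tab_dummy[2, "Std. Error"] / tab_effects[2, "Std. Error"]

# the t value for the slope and the fitted values are the same
all.equal(tab_dummy[2, "t value"], tab_effects[2, "t value"])
all.equal(fitted(m_dummy), fitted(m_effects))
```

Both ratios equal 2 and both all.equal() calls return TRUE, since recoding the predictor from 0/1 to -1/1 only rescales and shifts it.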

If the groups are not balanced, the interpretation of the coefficients changes slightly. The intercept is the unweighted mean of the group means, which no longer equals the grand mean, and the slope is the distance of each group mean from this “mean of group means”.
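
A short derivation shows where this comes from. The symbols $\bar{y}_F$ and $\bar{y}_M$ (our notation for the female and male sample means) and $b_0$, $b_1$ (intercept and slope) are introduced here for illustration. With a single two-level predictor, the least-squares fit reproduces the two group means, whatever the group sizes:

$$
\begin{aligned}
\bar{y}_F &= b_0 + b_1(1) = b_0 + b_1 \\
\bar{y}_M &= b_0 + b_1(-1) = b_0 - b_1 \\
b_0 &= \frac{\bar{y}_F + \bar{y}_M}{2}, \qquad b_1 = \frac{\bar{y}_F - \bar{y}_M}{2}
\end{aligned}
$$

Group sizes do not appear in these expressions. In a balanced design the unweighted mean of the group means coincides with the grand mean, which is why the intercept matched mean(y) earlier.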

set.seed(12)
gender <- sample(c("male", "female"), size = 100, replace = TRUE)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)
# calculate means
means <- aggregate(y ~ gender, FUN = mean)
means

##   gender        y
## 1 female 13.97248
## 2   male 12.32410

gender_effects <- ifelse(gender == "male", -1, 1)
m3 <- lm(y ~ gender_effects)
summary(m3)

## 
## Call:
## lm(formula = y ~ gender_effects)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2710 -1.3736 -0.0839  1.4237  4.2925 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     13.1483     0.1932  68.049  < 2e-16 ***
## gender_effects   0.8242     0.1932   4.266  4.6e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.907 on 98 degrees of freedom
## Multiple R-squared:  0.1566, Adjusted R-squared:  0.148 
## F-statistic:  18.2 on 1 and 98 DF,  p-value: 4.6e-05

# intercept: mean of group means
mean(means$y)

## [1] 13.14829

# slope: absolute difference in each group mean from "mean of group means"
abs(means$y - mean(means$y))

## [1] 0.8241916 0.8241916
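
To see how the grand mean and the "mean of group means" relate in the unbalanced case, we can compute both. This is a sketch using the same unbalanced simulation; the names n, group_means, m_effects and m_dummy are illustrative.

```r
# re-create the simulated unbalanced data from above
set.seed(12)
gender <- sample(c("male", "female"), size = 100, replace = TRUE)
y <- rnorm(100, mean = 12 + (gender == "female")*2, sd = 2)

# group sizes and group means (both ordered female, male)
n <- table(gender)
group_means <- tapply(y, gender, mean)
n

# the grand mean weights each group mean by its group size
weighted.mean(group_means, as.vector(n))
mean(y)

# the effects-coded intercept is the unweighted mean of the group means
m_effects <- lm(y ~ ifelse(gender == "male", -1, 1))
coef(m_effects)[1]
mean(group_means)

# dummy coding is unaffected: the intercept is still the male mean
m_dummy <- lm(y ~ ifelse(gender == "male", 0, 1))
coef(m_dummy)[1]
group_means["male"]
```

The interpretation of the dummy-coded coefficients does not depend on balance, whereas the effects-coded intercept should be read as the mean of group means rather than the grand mean whenever group sizes differ.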

Clay Ford

2023-01-07