Simulate data with multicollinearity. Below, the correlation between predictors is set to 0.95, and the coefficients are arbitrarily set to 0.5, -0.5, and 0.9.
library(MASS)
set.seed(10)
r <- 0.95
n <- 300
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                           Sigma = matrix(c(1,r,r,
                                            r,1,r,
                                            r,r,1), byrow = T, nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
  rnorm(n = n, sd = 0.3)
Confirm predictors are highly correlated.
pairs(d)
cor(d)
##           V1        V2        V3         Y
## V1 1.0000000 0.9391058 0.9412956 0.9287819
## V2 0.9391058 1.0000000 0.9510774 0.8662134
## V3 0.9412956 0.9510774 1.0000000 0.9246915
## Y  0.9287819 0.8662134 0.9246915 1.0000000
Now fit a model. This is the “true” model. Estimated coefficients and standard errors look fine.
m <- lm(Y ~ ., data = d)
summary(m)
##
## Call:
## lm(formula = Y ~ ., data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.85077 -0.18219 -0.00848  0.20050  0.76958
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.20462    0.01699  12.042  < 2e-16 ***
## V1           0.65386    0.05677  11.518  < 2e-16 ***
## V2          -0.43534    0.06244  -6.972 2.04e-11 ***
## V3           0.69492    0.06493  10.702  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2928 on 296 degrees of freedom
## Multiple R-squared: 0.9012, Adjusted R-squared: 0.9002
## F-statistic: 899.8 on 3 and 296 DF, p-value: < 2.2e-16
The VIFs are large, but does it matter?
car::vif(m)
##       V1       V2       V3
## 10.66004 12.72734 13.18720
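A reminder of what VIF measures: a predictor’s VIF is 1/(1 - R^2), where R^2 comes from regressing that predictor on the other predictors. A quick check by hand for V3 should reproduce the value above (about 13.2).
# VIF for V3 by hand: regress V3 on the other two predictors
r2 <- summary(lm(V3 ~ V1 + V2, data = d))$r.squared
1/(1 - r2)  # should match car::vif(m)["V3"] above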
Even with a correlation of 0.99 we get reasonable results.
set.seed(20)
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                           Sigma = matrix(c(1,r,r,
                                            r,1,r,
                                            r,r,1), byrow = T, nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
  rnorm(n = n, sd = 0.3)
m <- lm(Y ~ ., data = d)
summary(m)
##
## Call:
## lm(formula = Y ~ ., data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.64289 -0.18245  0.02629  0.17418  0.75999
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.17172    0.01607  10.683  < 2e-16 ***
## V1           0.61767    0.12926   4.778 2.79e-06 ***
## V2          -0.57011    0.12556  -4.540 8.18e-06 ***
## V3           0.85323    0.12906   6.611 1.78e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2771 on 296 degrees of freedom
## Multiple R-squared: 0.9178, Adjusted R-squared: 0.917
## F-statistic: 1102 on 3 and 296 DF, p-value: < 2.2e-16
The VIFs are really high, but we fit the right model and we’re getting reasonable estimates.
car::vif(m)
##       V1       V2       V3
## 67.02465 63.09730 67.19383
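One way to interpret these values: the variance of a coefficient estimate is inflated by a factor of VIF relative to what it would be if the predictors were uncorrelated, so the standard errors are inflated by roughly the square root of the VIF, here a factor of about 8.
# standard errors inflated by about sqrt(VIF) relative to uncorrelated predictors
sqrt(car::vif(m))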
Replicate 1000 times with a correlation of 0.99. Extract the coefficients and their standard errors. Summarize the distributions.
r <- 0.99
rout <- replicate(n = 1000, expr = {
  d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                             Sigma = matrix(c(1,r,r,
                                              r,1,r,
                                              r,r,1),
                                            byrow = T,
                                            nrow = 3)))
  d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
    rnorm(n = n, sd = 0.3)
  m <- lm(Y ~ ., data = d)
  c(coef(m)[-1], coef(summary(m))[-1,2])})
rownames(rout) <- c("V1", "V2", "V3", "V1_SE", "V2_SE", "V3_SE")
apply(rout, 1, summary)
##                V1          V2        V3     V1_SE     V2_SE     V3_SE
## Min.    0.1167174 -1.04914851 0.4731471 0.1173488 0.1206018 0.1162677
## 1st Qu. 0.4000896 -0.59080884 0.8041736 0.1372277 0.1364114 0.1371338
## Median  0.5011103 -0.49355189 0.9002364 0.1421406 0.1417376 0.1420293
## Mean    0.4998070 -0.49957104 0.8999321 0.1425915 0.1421041 0.1424768
## 3rd Qu. 0.5967001 -0.40289622 0.9917656 0.1478256 0.1470749 0.1475023
## Max.    1.0451780 -0.09674094 1.3272648 0.1684968 0.1846908 0.1649418
All coefficients and standard errors are consistently estimated.
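One quick way to see this: the average model-based standard error should be close to the standard deviation of the coefficient estimates across the 1000 simulations.
# average model-based SE versus empirical SD of the estimates
rbind(mean_model_se = rowMeans(rout[4:6, ]),
      empirical_sd = apply(rout[1:3, ], 1, sd))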
Now generate three collinear variables, but only use one to generate the outcome.
set.seed(30)
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                           Sigma = matrix(c(1,r,r,
                                            r,1,r,
                                            r,r,1), byrow = T, nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + rnorm(n = n, sd = 0.3)
Now fit a model using all three predictors. V2 and V3 are highly collinear with V1.
m <- lm(Y ~ ., data = d)
summary(m)
##
## Call:
## lm(formula = Y ~ ., data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.76192 -0.20135  0.01293  0.18471  0.88839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.19062    0.01755  10.858   <2e-16 ***
## V1           0.29380    0.13887   2.116   0.0352 *
## V2           0.17378    0.13683   1.270   0.2051
## V3           0.01418    0.13278   0.107   0.9150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3008 on 296 degrees of freedom
## Multiple R-squared: 0.7421, Adjusted R-squared: 0.7395
## F-statistic: 284 on 3 and 296 DF, p-value: < 2.2e-16
Seems ok. Neither V2 nor V3 is significant. The V1 coefficient estimate is within two standard errors of the true value, 0.5.
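A rough 95% confidence interval, estimate plus or minus two standard errors, makes that easy to check.
# rough 95% CI for V1; should contain the true value 0.5
coef(m)["V1"] + c(-2, 2) * coef(summary(m))["V1", "Std. Error"]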
Replicate 1000 times. This time pull out p-values instead of standard errors.
r <- 0.99
rout <- replicate(n = 1000, expr = {
  d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                             Sigma = matrix(c(1,r,r,
                                              r,1,r,
                                              r,r,1),
                                            byrow = T,
                                            nrow = 3)))
  d$Y <- 0.2 + 0.5*d$V1 + rnorm(n = n, sd = 0.3)
  m <- lm(Y ~ ., data = d)
  c(coef(m)[-1], coef(summary(m))[-1,4])})
rownames(rout) <- c("V1", "V2", "V3", "V1_p", "V2_p", "V3_p")
Check the proportion of times the coefficients are declared significant at the 5% level. V1 (the only predictor used to generate Y) is significant about 94% of the time. V2 and V3, on the other hand, are significant only around the nominal 5% rate, suggesting their p-values are roughly uniformly distributed because V2 and V3 are not associated with Y.
apply(rout[4:6,], 1, function(x)mean(x < 0.05))
##  V1_p  V2_p  V3_p
## 0.937 0.038 0.043
Check distribution of p-values.
hist(rout[5,], main = "V2 p-values")
hist(rout[6,], main = "V3 p-values")
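Both should look roughly uniform. As an additional, informal check, a Kolmogorov-Smirnov test can compare the simulated p-values to a uniform distribution.
# informal check of uniformity for the V2 p-values
ks.test(rout[5,], "punif")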
Example from Applied Linear Statistical Models (Kutner et al. 2005). Predict body fat from triceps skinfold thickness, thigh circumference, and midarm circumference.
bf <- read.table('CH07TA01.txt')
names(bf) <- c("triceps", "thigh", "midarm", "body_fat")
One high pairwise correlation, but the rest look ok.
cor(bf[,1:3])
##           triceps     thigh    midarm
## triceps 1.0000000 0.9238425 0.4577772
## thigh   0.9238425 1.0000000 0.0846675
## midarm  0.4577772 0.0846675 1.0000000
The pairs plot looks ok.
pairs(bf[,1:3])
Fit a series of models and compare coefficients:
m1 <- lm(body_fat ~ triceps, data = bf)
m2 <- lm(body_fat ~ thigh, data = bf)
m3 <- lm(body_fat ~ triceps + thigh, data = bf)
m4 <- lm(body_fat ~ triceps + thigh + midarm, data = bf)
car::compareCoefs(m1, m2, m3, m4, se = FALSE)
## Calls:
## 1: lm(formula = body_fat ~ triceps, data = bf)
## 2: lm(formula = body_fat ~ thigh, data = bf)
## 3: lm(formula = body_fat ~ triceps + thigh, data = bf)
## 4: lm(formula = body_fat ~ triceps + thigh + midarm, data = bf)
##
##             Model 1 Model 2 Model 3 Model 4
## (Intercept)    -1.5   -23.6   -19.2   117.1
## triceps       0.857           0.222   4.334
## thigh                 0.857   0.659  -2.857
## midarm                                -2.19
Notice how the coefficients for triceps and thigh change depending on what other variables are in the model.
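The standard errors change as well. Rerunning compareCoefs with its default, se = TRUE, shows how much the standard errors for triceps and thigh grow once midarm enters the model (output not shown).
# same comparison, but with standard errors displayed (the default)
car::compareCoefs(m3, m4)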
Look at the variance inflation factors for the final model. They’re huge!
car::vif(m4)
##  triceps    thigh   midarm
## 708.8429 564.3434 104.6060
Should we drop a variable? The textbook does not say. VIF simply alerts us to a multicollinearity problem.
The authors instead present ridge regression as a remedial measure. I implement it below, without comment, using the {glmnet} package.
library(glmnet)
cvfit <- cv.glmnet(x = as.matrix(bf[,1:3]),
                   y = as.matrix(bf[,4]),
                   alpha = 0) # alpha = 0 means ridge regression
coef(cvfit, s = "lambda.min")
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept) -10.0097587
## triceps       0.4358802
## thigh         0.4385841
## midarm       -0.1183033
Compare these coefficients to model 4 above. Big difference. However, ridge regression coefficients have no standard errors, and thus no test statistics or p-values.
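For a side-by-side look, bind the two sets of coefficients together. The exact ridge values may differ slightly from run to run because cv.glmnet selects lambda via cross-validation with random folds.
# OLS (model 4) versus ridge coefficients, side by side
data.frame(ols = coef(m4),
           ridge = as.vector(as.matrix(coef(cvfit, s = "lambda.min"))))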
Why was the VIF so big? It turns out midarm is highly correlated with triceps and thigh together. In other words, it’s the conditional association that is the problem. To investigate, regress midarm on triceps and thigh.
mod <- lm(midarm ~ thigh + triceps, data = bf)
summary(mod)
##
## Call:
## lm(formula = midarm ~ thigh + triceps, data = bf)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.58200 -0.30625  0.02592  0.29526  0.56102
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.33083    1.23934   50.29   <2e-16 ***
## thigh       -1.60850    0.04316  -37.26   <2e-16 ***
## triceps      1.88089    0.04498   41.82   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.377 on 17 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9893
## F-statistic: 880.7 on 2 and 17 DF, p-value: < 2.2e-16
Look at the R squared: 0.9904. It turns out that midarm can be almost perfectly predicted by thigh and triceps.
plot(bf$midarm, mod$fitted.values)
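This R-squared is exactly what drives the VIF: as before, a predictor’s VIF is 1/(1 - R^2) from regressing it on the other predictors.
# reproduces car::vif(m4)["midarm"] from above
1/(1 - summary(mod)$r.squared)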
Look again at the correlations of midarm with triceps and thigh: 0.46 and 0.08.
cor(bf[,1:3])
##           triceps     thigh    midarm
## triceps 1.0000000 0.9238425 0.4577772
## thigh   0.9238425 1.0000000 0.0846675
## midarm  0.4577772 0.0846675 1.0000000
Clearly, simply looking at pairwise correlations would not have detected this.
Richard McElreath has this to say in Chapter 6 of his book Statistical Rethinking:
“In the scientific literature, you might encounter a variety of dodgy ways of coping with multicollinearity. Few of them take a causal perspective. Some fields actually teach students to inspect pairwise correlations before fitting a model, to identify and drop highly correlated predictors. This is a mistake. Pairwise correlations are not the problem. It is the conditional associations — not correlations — that matter. And even then, the right thing to do will depend upon what is causing the collinearity. The associations within the data alone are not enough to decide what to do.”