Simulate data with multicollinearity. Below, the correlation between predictors is set to 0.95, and the coefficients are arbitrarily set to 0.5, -0.5, and 0.9.
library(MASS)
set.seed(10)
r <- 0.95
n <- 300
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                           Sigma = matrix(c(1,r,r,
                                            r,1,r,
                                            r,r,1), byrow = T, nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
  rnorm(n = n, sd = 0.3)
Confirm predictors are highly correlated.
pairs(d)
cor(d)
##           V1        V2        V3         Y
## V1 1.0000000 0.9391058 0.9412956 0.9287819
## V2 0.9391058 1.0000000 0.9510774 0.8662134
## V3 0.9412956 0.9510774 1.0000000 0.9246915
## Y  0.9287819 0.8662134 0.9246915 1.0000000
Now fit a model. This is the “true” model. Estimated coefficients and standard errors look fine.
m <- lm(Y ~ ., data = d)
summary(m)
##
## Call:
## lm(formula = Y ~ ., data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.85077 -0.18219 -0.00848  0.20050  0.76958
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.20462    0.01699  12.042  < 2e-16 ***
## V1           0.65386    0.05677  11.518  < 2e-16 ***
## V2          -0.43534    0.06244  -6.972 2.04e-11 ***
## V3           0.69492    0.06493  10.702  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2928 on 296 degrees of freedom
## Multiple R-squared: 0.9012, Adjusted R-squared: 0.9002
## F-statistic: 899.8 on 3 and 296 DF, p-value: < 2.2e-16
The VIFs are large, but does it matter?
car::vif(m)
##       V1       V2       V3
## 10.66004 12.72734 13.18720
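A reminder of what VIF measures: a predictor’s VIF is 1/(1 - R^2), where R^2 comes from regressing that predictor on the other predictors. A quick check by hand for V3 should reproduce the value above (about 13.2).
# VIF for V3 by hand: regress V3 on the other two predictors
r2 <- summary(lm(V3 ~ V1 + V2, data = d))$r.squared
1/(1 - r2)  # should match car::vif(m)["V3"] above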
Even with a correlation of 0.99 we get reasonable results.
set.seed(20)
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                           Sigma = matrix(c(1,r,r,
                                            r,1,r,
                                            r,r,1), byrow = T, nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
  rnorm(n = n, sd = 0.3)
m <- lm(Y ~ ., data = d)
summary(m)
##
## Call:
## lm(formula = Y ~ ., data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.64289 -0.18245  0.02629  0.17418  0.75999
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.17172    0.01607  10.683  < 2e-16 ***
## V1           0.61767    0.12926   4.778 2.79e-06 ***
## V2          -0.57011    0.12556  -4.540 8.18e-06 ***
## V3           0.85323    0.12906   6.611 1.78e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2771 on 296 degrees of freedom
## Multiple R-squared: 0.9178, Adjusted R-squared: 0.917
## F-statistic: 1102 on 3 and 296 DF, p-value: < 2.2e-16
The VIFs are really high, but we fit the right model and we’re getting reasonable estimates.
car::vif(m)
##       V1       V2       V3
## 67.02465 63.09730 67.19383
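One way to interpret these values: the variance of a coefficient estimate is inflated by a factor of VIF relative to what it would be if the predictors were uncorrelated, so the standard errors are inflated by roughly the square root of the VIF, here a factor of about 8.
# standard errors inflated by about sqrt(VIF) relative to uncorrelated predictors
sqrt(car::vif(m))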
Replicate 1000 times with a correlation of 0.99. Extract the coefficients and their standard errors. Summarize the distributions.
r <- 0.99
rout <- replicate(n = 1000, expr = {
  d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                             Sigma = matrix(c(1,r,r,
                                              r,1,r,
                                              r,r,1),
                                            byrow = T,
                                            nrow = 3)))
  d$Y <- 0.2 + 0.5*d$V1 + -0.5*d$V2 + 0.9*d$V3 +
    rnorm(n = n, sd = 0.3)
  m <- lm(Y ~ ., data = d)
  c(coef(m)[-1], coef(summary(m))[-1,2])})
rownames(rout) <- c("V1", "V2", "V3", "V1_SE", "V2_SE", "V3_SE")
apply(rout, 1, summary)
##                V1          V2        V3     V1_SE     V2_SE     V3_SE
## Min.    0.1167174 -1.04914851 0.4731471 0.1173488 0.1206018 0.1162677
## 1st Qu. 0.4000896 -0.59080884 0.8041736 0.1372277 0.1364114 0.1371338
## Median  0.5011103 -0.49355189 0.9002364 0.1421406 0.1417376 0.1420293
## Mean    0.4998070 -0.49957104 0.8999321 0.1425915 0.1421041 0.1424768
## 3rd Qu. 0.5967001 -0.40289622 0.9917656 0.1478256 0.1470749 0.1475023
## Max.    1.0451780 -0.09674094 1.3272648 0.1684968 0.1846908 0.1649418
All coefficients and standard errors are consistently estimated.
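One quick way to see this: the average model-based standard error should be close to the standard deviation of the coefficient estimates across the 1000 simulations.
# average model-based SE versus empirical SD of the estimates
rbind(mean_model_se = rowMeans(rout[4:6, ]),
      empirical_sd = apply(rout[1:3, ], 1, sd))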
Now generate three collinear variables, but only use one to generate the outcome.
set.seed(30)
r <- 0.99
d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                           Sigma = matrix(c(1,r,r,
                                            r,1,r,
                                            r,r,1), byrow = T, nrow = 3)))
d$Y <- 0.2 + 0.5*d$V1 + rnorm(n = n, sd = 0.3)
Now fit a model using all three predictors. V2 and V3 are highly collinear with V1.
m <- lm(Y ~ ., data = d)
summary(m)
##
## Call:
## lm(formula = Y ~ ., data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.76192 -0.20135  0.01293  0.18471  0.88839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.19062    0.01755  10.858   <2e-16 ***
## V1           0.29380    0.13887   2.116   0.0352 *
## V2           0.17378    0.13683   1.270   0.2051
## V3           0.01418    0.13278   0.107   0.9150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3008 on 296 degrees of freedom
## Multiple R-squared: 0.7421, Adjusted R-squared: 0.7395
## F-statistic: 284 on 3 and 296 DF, p-value: < 2.2e-16
Seems ok. Neither V2 nor V3 is significant. The V1 coefficient estimate is within two standard errors of the true value, 0.5.
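A rough 95% confidence interval, estimate plus or minus two standard errors, makes that easy to check.
# rough 95% CI for V1; should contain the true value 0.5
coef(m)["V1"] + c(-2, 2) * coef(summary(m))["V1", "Std. Error"]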
Replicate 1000 times. This time pull out p-values instead of standard errors.
r <- 0.99
rout <- replicate(n = 1000, expr = {
  d <- as.data.frame(mvrnorm(n = n, mu = c(0,0,0),
                             Sigma = matrix(c(1,r,r,
                                              r,1,r,
                                              r,r,1),
                                            byrow = T,
                                            nrow = 3)))
  d$Y <- 0.2 + 0.5*d$V1 + rnorm(n = n, sd = 0.3)
  m <- lm(Y ~ ., data = d)
  c(coef(m)[-1], coef(summary(m))[-1,4])})
rownames(rout) <- c("V1", "V2", "V3", "V1_p", "V2_p", "V3_p")
Check the proportion of times the coefficients are declared significant at the 5% level. V1 (the only predictor used to generate Y) is significant about 94% of the time. V2 and V3, on the other hand, are significant only around the nominal 5% rate, suggesting their p-values are roughly uniformly distributed because V2 and V3 are not associated with Y.
apply(rout[4:6,], 1, function(x)mean(x < 0.05))
##  V1_p  V2_p  V3_p
## 0.937 0.038 0.043
Check distribution of p-values.
hist(rout[5,], main = "V2 p-values")
hist(rout[6,], main = "V3 p-values")
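Both should look roughly uniform. As an additional, informal check, a Kolmogorov-Smirnov test can compare the simulated p-values to a uniform distribution.
# informal check of uniformity for the V2 p-values
ks.test(rout[5,], "punif")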
Example from Applied Linear Statistical Models (Kutner et al. 2005). Predict body fat from triceps skinfold thickness, thigh circumference, and midarm circumference.
bf <- read.table('CH07TA01.txt')
names(bf) <- c("triceps", "thigh", "midarm", "body_fat")
One high pairwise correlation, but the rest look ok.
cor(bf[,1:3])
##           triceps     thigh    midarm
## triceps 1.0000000 0.9238425 0.4577772
## thigh   0.9238425 1.0000000 0.0846675
## midarm  0.4577772 0.0846675 1.0000000
The pairs plot looks ok.
pairs(bf[,1:3])
Fit a series of models and compare coefficients:
m1 <- lm(body_fat ~ triceps, data = bf)
m2 <- lm(body_fat ~ thigh, data = bf)
m3 <- lm(body_fat ~ triceps + thigh, data = bf)
m4 <- lm(body_fat ~ triceps + thigh + midarm, data = bf)
car::compareCoefs(m1, m2, m3, m4, se = FALSE)
## Calls:
## 1: lm(formula = body_fat ~ triceps, data = bf)
## 2: lm(formula = body_fat ~ thigh, data = bf)
## 3: lm(formula = body_fat ~ triceps + thigh, data = bf)
## 4: lm(formula = body_fat ~ triceps + thigh + midarm, data = bf)
##
##             Model 1 Model 2 Model 3 Model 4
## (Intercept)    -1.5   -23.6   -19.2   117.1
## triceps       0.857           0.222   4.334
## thigh                 0.857   0.659  -2.857
## midarm                                -2.19
Notice how the coefficients for triceps and thigh change depending on what other variables are in the model.
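The standard errors change as well. Rerunning compareCoefs with its default, se = TRUE, shows how much the standard errors for triceps and thigh grow once midarm enters the model (output not shown).
# same comparison, but with standard errors displayed (the default)
car::compareCoefs(m3, m4)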
Look at the variance inflation factors for the final model. They’re huge!
car::vif(m4)
##  triceps    thigh   midarm
## 708.8429 564.3434 104.6060
Should we drop a variable? The textbook does not say. VIF simply alerts us to a multicollinearity problem.
The authors instead present ridge regression as a remedial measure. I implement it below, without comment, using the {glmnet} package.
library(glmnet)
cvfit <- cv.glmnet(x = as.matrix(bf[,1:3]),
                   y = as.matrix(bf[,4]),
                   alpha = 0) # alpha = 0 means ridge regression
coef(cvfit, s = "lambda.min")
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept) -10.0097587
## triceps       0.4358802
## thigh         0.4385841
## midarm       -0.1183033
Compare these coefficients to model 4 above. Big difference. However, ridge regression coefficients have no standard errors, and thus no test statistics or p-values.
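For a side-by-side look, bind the two sets of coefficients together. The exact ridge values may differ slightly from run to run because cv.glmnet selects lambda via cross-validation with random folds.
# OLS (model 4) versus ridge coefficients, side by side
data.frame(ols = coef(m4),
           ridge = as.vector(as.matrix(coef(cvfit, s = "lambda.min"))))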
Why was the VIF so big? It turns out midarm is highly correlated with triceps and thigh together. In other words, it’s the conditional association that is the problem. To investigate, regress midarm on triceps and thigh.
mod <- lm(midarm ~ thigh + triceps, data = bf)
summary(mod)
##
## Call:
## lm(formula = midarm ~ thigh + triceps, data = bf)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.58200 -0.30625  0.02592  0.29526  0.56102
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.33083    1.23934   50.29   <2e-16 ***
## thigh       -1.60850    0.04316  -37.26   <2e-16 ***
## triceps      1.88089    0.04498   41.82   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.377 on 17 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9893
## F-statistic: 880.7 on 2 and 17 DF, p-value: < 2.2e-16
Look at the R squared: 0.9904. It turns out that midarm can be almost perfectly predicted by thigh and triceps.
plot(bf$midarm, mod$fitted.values)
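This R-squared is exactly what drives the VIF: as before, a predictor’s VIF is 1/(1 - R^2) from regressing it on the other predictors.
# reproduces car::vif(m4)["midarm"] from above
1/(1 - summary(mod)$r.squared)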
Look again at the correlations of midarm with triceps and thigh: 0.46 and 0.08.
cor(bf[,1:3])
##           triceps     thigh    midarm
## triceps 1.0000000 0.9238425 0.4577772
## thigh   0.9238425 1.0000000 0.0846675
## midarm  0.4577772 0.0846675 1.0000000
Clearly, simply looking at pairwise correlations would not have detected this.
Richard McElreath has this to say in Chapter 6 of his book Statistical Rethinking:
“In the scientific literature, you might encounter a variety of dodgy ways of coping with multicollinearity. Few of them take a causal perspective. Some fields actually teach students to inspect pairwise correlations before fitting a model, to identify and drop highly correlated predictors. This is a mistake. Pairwise correlations are not the problem. It is the conditional associations — not correlations — that matter. And even then, the right thing to do will depend upon what is causing the collinearity. The associations within the data alone are not enough to decide what to do.”