Simulate Data

Below I do the following to generate correlated binary data:

  1. Create a 300 x 4 matrix of random uniform values drawn from the interval [0, 1]. We’ll use these as probabilities. Each row represents a subject and each column represents a latent factor, so we have 300 subjects and 4 factors.
  2. Use the values in column 1 as probabilities for generating zeroes and ones. The rbinom() function generates random data from a binomial distribution. Think of it as flipping 5 coins, each with the given probability of heads, and observing 5 results: 1 = heads, 0 = tails. A result of 0, 0, 1, 1, 0 would mean two heads and three tails. If a subject has a high probability, they’re likely to see more ones, and thus score higher on this factor. (See the short demo after this list.)
  3. Repeat step 2 for each column of the matrix (y).
  4. Combine all groups into one matrix.
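For example, a single call to rbinom() with a high success probability will usually return mostly ones (a quick illustration; this line is not part of the simulation below):

# five Bernoulli trials, each with a 90% chance of returning a 1
rbinom(n = 5, size = 1, prob = 0.9)

Now the full simulation: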
set.seed(1234)
# 300 subjects x 4 latent factors; each entry is a probability in [0, 1]
y <- matrix(runif(300 * 4), ncol = 4)
# for each factor, generate 5 binary items per subject;
# sapply() returns a 5 x 300 matrix, so t() flips it to 300 x 5
g1 <- sapply(y[,1], function(x) rbinom(n = 5, size = 1, prob = x)) |> t()
g2 <- sapply(y[,2], function(x) rbinom(n = 5, size = 1, prob = x)) |> t()
g3 <- sapply(y[,3], function(x) rbinom(n = 5, size = 1, prob = x)) |> t()
g4 <- sapply(y[,4], function(x) rbinom(n = 5, size = 1, prob = x)) |> t()
# combine all four groups into one 300 x 20 item matrix
d <- cbind(g1, g2, g3, g4)
head(d)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0    0    0    0    0    0    0    0    0     1     0     0     0     0
## [2,]    0    0    0    1    0    1    1    1    1     1     0     1     1     0
## [3,]    1    1    0    1    0    0    1    0    1     1     0     0     0     0
## [4,]    0    1    0    0    1    0    1    1    0     1     0     0     1     1
## [5,]    1    0    1    1    1    1    1    1    1     1     0     0     0     1
## [6,]    1    1    1    1    0    1    1    0    0     0     0     0     0     0
##      [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]     0     0     1     1     1     0
## [2,]     0     0     1     1     1     1
## [3,]     0     0     0     1     1     1
## [4,]     0     1     0     0     1     1
## [5,]     1     1     1     1     0     1
## [6,]     0     0     0     0     0     0
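As a quick sanity check (not part of the original write-up), a subject’s mean score on a group of items should track the latent probability that generated it, so the correlation below should be strongly positive:

# correlation between the factor 1 latent probabilities and
# each subject's mean score on factor 1's five items
cor(y[, 1], rowMeans(g1))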

Visualize the Correlation Matrix

With our data simulated, let’s visualize the correlation matrix. First we calculate the correlations using the tetrachoric() function from the psych package. The tetrachoric correlation is appropriate for binary items because it estimates the correlation between the continuous latent variables assumed to underlie the observed 0/1 responses. Then we visualize the matrix using corrplot().

library(psych)
library(corrplot)

r <- tetrachoric(d)$rho
corrplot::corrplot(r, type = "upper", diag = FALSE)

We see four clumps of correlated items, as we should, since we simulated the data that way.
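To see how much the tetrachoric correlations differ from ordinary Pearson correlations on the raw 0/1 data, we can compare a corner of the two matrices (a quick check, not part of the original analysis; the tetrachoric values should be noticeably larger in magnitude):

# Pearson correlations on the raw binary data
r_pearson <- cor(d)
# compare the first five items under each method
round(r[1:5, 1:5], 2)
round(r_pearson[1:5, 1:5], 2)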

EFA

Now let’s try some factor analyses. First we use mostly default settings, which means fa() computes ordinary Pearson correlations from the raw data. It seems to run fine and recovers the four factors.

fa_default <- fa(r = d, fm = "minres", nfactors = 4)
print(fa_default$loadings, cut = 0.3)
## 
## Loadings:
##       MR1    MR2    MR3    MR4   
##  [1,]                0.452       
##  [2,]                0.620       
##  [3,]                0.637       
##  [4,]                0.568       
##  [5,]                0.599       
##  [6,]                       0.577
##  [7,]                       0.671
##  [8,]                       0.553
##  [9,]                       0.564
## [10,]                       0.488
## [11,]         0.598              
## [12,]         0.625              
## [13,]         0.585              
## [14,]         0.617              
## [15,]         0.503              
## [16,]  0.436                     
## [17,]  0.564                     
## [18,]  0.722                     
## [19,]  0.646                     
## [20,]  0.592                     
## 
##                  MR1   MR2   MR3   MR4
## SS loadings    1.829 1.752 1.709 1.689
## Proportion Var 0.091 0.088 0.085 0.084
## Cumulative Var 0.091 0.179 0.264 0.349

It would probably be better to use cor="tet", which creates the correlation matrix using the tetrachoric() function. This results in higher loadings but the same basic result:

fa_mr <- fa(r = d, fm = "minres", nfactors = 4, cor = "tet")
print(fa_mr$loadings, cut = 0.3)
## 
## Loadings:
##       MR1    MR2    MR3    MR4   
##  [1,]                0.562       
##  [2,]                0.756       
##  [3,]                0.775       
##  [4,]                0.693       
##  [5,]                0.734       
##  [6,]                       0.707
##  [7,]                       0.813
##  [8,]                       0.679
##  [9,]                       0.694
## [10,]                       0.605
## [11,]         0.727              
## [12,]         0.758              
## [13,]         0.723              
## [14,]         0.749              
## [15,]         0.623              
## [16,]  0.544                     
## [17,]  0.690                     
## [18,]  0.869                     
## [19,]  0.786                     
## [20,]  0.717                     
## 
##                  MR1   MR2   MR3   MR4
## SS loadings    2.711 2.621 2.562 2.543
## Proportion Var 0.136 0.131 0.128 0.127
## Cumulative Var 0.136 0.267 0.395 0.522

Using maximum likelihood doesn’t seem to make a difference.

fa_ml <- fa(r = d, fm = "ml", nfactors = 4, cor = "tet")
print(fa_ml$loadings, cut = 0.3)
## 
## Loadings:
##       ML1    ML2    ML3    ML4   
##  [1,]                0.561       
##  [2,]                0.728       
##  [3,]                0.782       
##  [4,]                0.704       
##  [5,]                0.739       
##  [6,]                       0.706
##  [7,]                       0.816
##  [8,]                       0.681
##  [9,]                       0.691
## [10,]                       0.608
## [11,]         0.740              
## [12,]         0.756              
## [13,]         0.707              
## [14,]         0.758              
## [15,]         0.620              
## [16,]  0.554                     
## [17,]  0.692                     
## [18,]  0.860                     
## [19,]  0.764                     
## [20,]  0.732                     
## 
##                  ML1   ML2   ML3   ML4
## SS loadings    2.710 2.624 2.560 2.546
## Proportion Var 0.136 0.131 0.128 0.127
## Cumulative Var 0.136 0.267 0.395 0.522

Neither does OLS.

fa_ols <- fa(r = d, fm = "ols", nfactors = 4, cor = "tet")
print(fa_ols$loadings, cut = 0.3)
## 
## Loadings:
##       [,1]   [,2]   [,3]   [,4]  
##  [1,]                0.562       
##  [2,]                0.756       
##  [3,]                0.775       
##  [4,]                0.693       
##  [5,]                0.734       
##  [6,]                       0.707
##  [7,]                       0.813
##  [8,]                       0.679
##  [9,]                       0.694
## [10,]                       0.605
## [11,]         0.727              
## [12,]         0.758              
## [13,]         0.723              
## [14,]         0.749              
## [15,]         0.623              
## [16,]  0.544                     
## [17,]  0.690                     
## [18,]  0.869                     
## [19,]  0.786                     
## [20,]  0.717                     
## 
##                 [,1]  [,2]  [,3]  [,4]
## SS loadings    2.711 2.621 2.562 2.543
## Proportion Var 0.136 0.131 0.128 0.127
## Cumulative Var 0.136 0.267 0.395 0.522
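The three solutions look nearly identical. To quantify that, we could compute congruence coefficients between the loading matrices with psych’s factor.congruence() function; values near 1 (after matching factors across solutions, which may come out in a different order) indicate essentially the same solution. A quick check, not in the original:

# congruence between the minres and maximum likelihood loadings
factor.congruence(fa_mr, fa_ml)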

Working with “Bad” Data

Let’s generate some “bad” 0/1 data, sampling each value independently so there is no underlying correlation, and see what happens.

set.seed(666)
d_bad <- matrix(data = sample(0:1, size = 300 * 20, replace = TRUE), 
                ncol = 20)
r2 <- tetrachoric(d_bad)$rho
corrplot::corrplot(r2, type = "upper", diag = FALSE)
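Since the data are pure noise, the off-diagonal tetrachoric correlations should all hover near zero. We can check their range directly (a quick check, not part of the original analysis):

# smallest and largest off-diagonal correlations
range(r2[upper.tri(r2)])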

Amazingly, this seems to converge to something when using the default Pearson correlations, though of course the result is nonsense.

fa2_default <- fa(r = d_bad, fm = "minres", nfactors = 4)
print(fa2_default$loadings, cut = 0.3)
## 
## Loadings:
##       MR1    MR2    MR3    MR4   
##  [1,]         0.644              
##  [2,]                0.385       
##  [3,]                            
##  [4,]                            
##  [5,]                            
##  [6,]  0.998                     
##  [7,]                            
##  [8,]                            
##  [9,]                            
## [10,]                            
## [11,]                            
## [12,]                            
## [13,]                            
## [14,]                            
## [15,]               -0.405       
## [16,]                            
## [17,]                            
## [18,]                0.309       
## [19,]                       0.511
## [20,]                            
## 
##                  MR1   MR2   MR3   MR4
## SS loadings    1.095 0.651 0.545 0.483
## Proportion Var 0.055 0.033 0.027 0.024
## Cumulative Var 0.055 0.087 0.115 0.139

It would seem more appropriate to use cor = "tet" since our data are 0/1. This results in warnings about a mysterious “ultra-Heywood case”! A Heywood case occurs when an item’s estimated uniqueness falls below 0, which is impossible for real data. While this seems weird and troublesome, in this case the warning is probably working as intended: it’s a sign we’re looking for factors that don’t exist. (Warnings can be a good thing.)

fa2_mr <- fa(r = d_bad, fm = "minres", nfactors = 4, cor = "tet")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully
fa2_mr$uniquenesses[fa2_mr$uniquenesses < 0]
## [1] -0.001767915
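Since uniqueness = 1 - communality, a negative uniqueness is equivalent to a communality greater than 1. We can view the offending item from that angle as well (an equivalent check, not in the original):

# the same item, seen as a communality above 1
fa2_mr$communality[fa2_mr$communality > 1]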

Reducing the number of factors to 1 eliminates the warnings, but of course this “factor” is due to noise.

fa2_mr <- fa(r = d_bad, fm = "minres", nfactors = 1, cor = "tet")
print(fa2_mr$loadings, cut = 0.3)
## 
## Loadings:
##       MR1   
##  [1,]       
##  [2,]  0.308
##  [3,]       
##  [4,]       
##  [5,]       
##  [6,]  0.587
##  [7,]       
##  [8,]       
##  [9,]       
## [10,]       
## [11,] -0.347
## [12,]       
## [13,]       
## [14,]       
## [15,] -0.445
## [16,]       
## [17,]       
## [18,]       
## [19,]       
## [20,]       
## 
##                  MR1
## SS loadings    1.112
## Proportion Var 0.056
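Rather than guessing at the number of factors, we could let parallel analysis suggest one. psych’s fa.parallel() compares the observed eigenvalues to those obtained from random data; for pure noise like d_bad it should suggest few or no meaningful factors (a suggested follow-up, not run in the original post):

# parallel analysis on the noise data, using tetrachoric correlations
fa.parallel(d_bad, fm = "minres", fa = "fa", cor = "tet")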