Below I do the following to generate correlated binary data:
rbinom()
function generates random data from
a binomial distribution. Think of it as flipping 5 coins with the given
probability and observing 5 results: 1 = head, 0 = tails. A result of
0, 0, 1, 1, 0
would mean two heads and three tails. If a
subject has a high probability, they’re more likely to see more ones,
and thus score higher on this factor.set.seed(1234)
y <- matrix(runif(300 * 4), ncol = 4)
g1 <- sapply(y[,1], function(x)rbinom(n = 5, size = 1, prob = x)) |> t()
g2 <- sapply(y[,2], function(x)rbinom(n = 5, size = 1, prob = x)) |> t()
g3 <- sapply(y[,3], function(x)rbinom(n = 5, size = 1, prob = x)) |> t()
g4 <- sapply(y[,4], function(x)rbinom(n = 5, size = 1, prob = x)) |> t()
d <- cbind(g1, g2, g3, g4)
head(d)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [2,] 0 0 0 1 0 1 1 1 1 1 0 1 1 0
## [3,] 1 1 0 1 0 0 1 0 1 1 0 0 0 0
## [4,] 0 1 0 0 1 0 1 1 0 1 0 0 1 1
## [5,] 1 0 1 1 1 1 1 1 1 1 0 0 0 1
## [6,] 1 1 1 1 0 1 1 0 0 0 0 0 0 0
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 0 0 1 1 1 0
## [2,] 0 0 1 1 1 1
## [3,] 0 0 0 1 1 1
## [4,] 0 1 0 0 1 1
## [5,] 1 1 1 1 0 1
## [6,] 0 0 0 0 0 0
With our data simulated, let’s visualize the correlation matrix.
First we calculate correlation using the tetrachoric()
function from the psych package. Then we visualize it using
corrplot()
.
library(psych)
library(corrplot)
r <- tetrachoric(d)$rho
corrplot::corrplot(r, type = "upper", diag = F)
We see we have four clumps of correlated items, as we should since we simulated the data that way.
Now let’s try some factor analyses. First we use the default settings. It seems to run fine and recover the factors.
fa_default <- fa(r = d, fm = "minres", nfactors = 4)
print(fa_default$loadings, cut = 0.3)
##
## Loadings:
## MR1 MR2 MR3 MR4
## [1,] 0.452
## [2,] 0.620
## [3,] 0.637
## [4,] 0.568
## [5,] 0.599
## [6,] 0.577
## [7,] 0.671
## [8,] 0.553
## [9,] 0.564
## [10,] 0.488
## [11,] 0.598
## [12,] 0.625
## [13,] 0.585
## [14,] 0.617
## [15,] 0.503
## [16,] 0.436
## [17,] 0.564
## [18,] 0.722
## [19,] 0.646
## [20,] 0.592
##
## MR1 MR2 MR3 MR4
## SS loadings 1.829 1.752 1.709 1.689
## Proportion Var 0.091 0.088 0.085 0.084
## Cumulative Var 0.091 0.179 0.264 0.349
It would probably be better to use cor="tet"
, which
creates the correlation matrix using the tetrachoric()
function. This results in higher loadings but the same basic result:
fa_mr <- fa(r = d, fm = "minres", nfactors = 4, cor = "tet")
print(fa_mr$loadings, cut = 0.3)
##
## Loadings:
## MR1 MR2 MR3 MR4
## [1,] 0.562
## [2,] 0.756
## [3,] 0.775
## [4,] 0.693
## [5,] 0.734
## [6,] 0.707
## [7,] 0.813
## [8,] 0.679
## [9,] 0.694
## [10,] 0.605
## [11,] 0.727
## [12,] 0.758
## [13,] 0.723
## [14,] 0.749
## [15,] 0.623
## [16,] 0.544
## [17,] 0.690
## [18,] 0.869
## [19,] 0.786
## [20,] 0.717
##
## MR1 MR2 MR3 MR4
## SS loadings 2.711 2.621 2.562 2.543
## Proportion Var 0.136 0.131 0.128 0.127
## Cumulative Var 0.136 0.267 0.395 0.522
Using maximum likelihood doesn’t seem to make a difference.
fa_ml <- fa(r = d, fm = "ml", nfactors = 4, cor = "tet")
print(fa_ml$loadings, cut = 0.3)
##
## Loadings:
## ML1 ML2 ML3 ML4
## [1,] 0.561
## [2,] 0.728
## [3,] 0.782
## [4,] 0.704
## [5,] 0.739
## [6,] 0.706
## [7,] 0.816
## [8,] 0.681
## [9,] 0.691
## [10,] 0.608
## [11,] 0.740
## [12,] 0.756
## [13,] 0.707
## [14,] 0.758
## [15,] 0.620
## [16,] 0.554
## [17,] 0.692
## [18,] 0.860
## [19,] 0.764
## [20,] 0.732
##
## ML1 ML2 ML3 ML4
## SS loadings 2.710 2.624 2.560 2.546
## Proportion Var 0.136 0.131 0.128 0.127
## Cumulative Var 0.136 0.267 0.395 0.522
Neither does OLS.
fa_ols <- fa(r = d, fm = "ols", nfactors = 4, cor = "tet")
print(fa_ols$loadings, cut = 0.3)
##
## Loadings:
## [,1] [,2] [,3] [,4]
## [1,] 0.562
## [2,] 0.756
## [3,] 0.775
## [4,] 0.693
## [5,] 0.734
## [6,] 0.707
## [7,] 0.813
## [8,] 0.679
## [9,] 0.694
## [10,] 0.605
## [11,] 0.727
## [12,] 0.758
## [13,] 0.723
## [14,] 0.749
## [15,] 0.623
## [16,] 0.544
## [17,] 0.690
## [18,] 0.869
## [19,] 0.786
## [20,] 0.717
##
## [,1] [,2] [,3] [,4]
## SS loadings 2.711 2.621 2.562 2.543
## Proportion Var 0.136 0.131 0.128 0.127
## Cumulative Var 0.136 0.267 0.395 0.522
Let’s generate crazy 0/1 data with no correlation and see what happens.
set.seed(666)
d_bad <- matrix(data = sample(0:1, size = 300 * 20, replace = TRUE),
ncol = 20)
r2 <- tetrachoric(d_bad)$rho
corrplot::corrplot(r2, type = "upper", diag = F)
Amazingly, this seems to converge to something using the default correlation calculations, though of course it’s nonsense.
fa2_default <- fa(r = d_bad, fm = "minres", nfactors = 4)
print(fa2_default$loadings, cut = 0.3)
##
## Loadings:
## MR1 MR2 MR3 MR4
## [1,] 0.644
## [2,] 0.385
## [3,]
## [4,]
## [5,]
## [6,] 0.998
## [7,]
## [8,]
## [9,]
## [10,]
## [11,]
## [12,]
## [13,]
## [14,]
## [15,] -0.405
## [16,]
## [17,]
## [18,] 0.309
## [19,] 0.511
## [20,]
##
## MR1 MR2 MR3 MR4
## SS loadings 1.095 0.651 0.545 0.483
## Proportion Var 0.055 0.033 0.027 0.024
## Cumulative Var 0.055 0.087 0.115 0.139
It would seem more appropriate to use cor = "tet"
since
our data is 0/1. This results in warnings about a mysterious
“ultra-Heywood case”! While this seems weird and troublesome, in this
case it’s probably working as intended. That’s probably a sign we’re
looking for factors that don’t exist. (Warnings can be a good thing.) A
Heywood case is when a uniqueness falls below 0.
fa2_mr <- fa(r = d_bad, fm = "minres", nfactors = 4, cor = "tet")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully
fa2_mr$uniquenesses[fa2_mr$uniquenesses < 0]
## [1] -0.001767915
Reducing the number of factors to 1 eliminates the warnings, but of course this “factor” is due to noise.
fa2_mr <- fa(r = d_bad, fm = "minres", nfactors = 1, cor = "tet")
print(fa2_mr$loadings, cut = 0.3)
##
## Loadings:
## MR1
## [1,]
## [2,] 0.308
## [3,]
## [4,]
## [5,]
## [6,] 0.587
## [7,]
## [8,]
## [9,]
## [10,]
## [11,] -0.347
## [12,]
## [13,]
## [14,]
## [15,] -0.445
## [16,]
## [17,]
## [18,]
## [19,]
## [20,]
##
## MR1
## SS loadings 1.112
## Proportion Var 0.056