The {tidyLPA} package provides a wrapper for the {mclust} package.
Latent Profile Analysis (LPA) is a statistical modeling approach for estimating distinct profiles, or groups, of variables.
LPA can be used for different dimensions, or for a single dimension over time.
Works best with numeric data.
The mclust algorithm does not allow for missing data.
The following example is from 2024-11-01. Data from grad student in Psych department. This was the code she sent us.
library(haven)
library(dplyr)
data <- read_sav("data_subset.sav")
# Removing missing cases; The mclust algorithm does not allow for missing data.
data_complete <- data %>%
filter(!is.na(w2cope) & !is.na(w3cope) & !is.na(w4cope) & !is.na(w5cope) &
!is.na(w2DISCRIM) & !is.na(w3DISCRIM) & !is.na(w4DISCRIM) & !is.na(w5DISCRIM),
!is.na(w2GOAL) & !is.na(w3GOAL) & !is.na(w4GOAL) & !is.na(w5GOAL))
Now run a LPA for “coping” measured at four time points. The key
{tidyLPA} function is estimate_profiles()
.
n_profiles = 2:4
argument says to try 2, 3, and 4
groups, or profiles.models = 1
argument says to fit a model with “Equal
variances and covariances fixed to 0”. In {mclust}, this is model “EEI”
with diagonal distribution, equal volume, and equal shape. See Table 3
in Scrucca, et
al 2016.library(tidyLPA)
# Coping LPA
lpa_model <- data_complete %>%
select(w2cope, w3cope, w4cope, w5cope) %>%
estimate_profiles(n_profiles = 2:4, models = 1)
The output lists the results of three models.
lpa_model
## tidyLPA analysis using mclust:
##
## Model Classes AIC BIC Entropy prob_min prob_max n_min n_max BLRT_p
## 1 2 2121.78 2169.04 0.76 0.92 0.94 0.44 0.56 0.01
## 1 3 2078.92 2144.35 0.72 0.85 0.88 0.20 0.46 0.01
## 1 4 2080.73 2164.33 0.72 0.50 0.88 0.07 0.42 0.34
To get the same result with {mclust} we use the Mclust()
function with G = 2:4
and
modelNames = "EEI"
.
library(mclust)
mclust_model <- Mclust(data_complete[,c("w2cope", "w3cope", "w4cope", "w5cope")],
G = 2:4, modelNames = "EEI")
Calling summary()
on the model object only returns the
best fitting model, the model with 3 classes (or profiles, or
“components”). Notice it also assigns observations to the classes,
reported in the “Clustering table”.
summary(mclust_model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## -1021.46 280 18 -2144.347 -2231.189
##
## Clustering table:
## 1 2 3
## 128 95 57
{mclust} estimates means and variances of ellipsoids. Again, see Table 3 in Scrucca, et al 2016. {tidyLPA} is simply a wrapper for {mclust}.
To see the means and variances of a {tidyLPA} object use the
get_estimates()
function. Notice it’s returned as a nice
data frame.
get_estimates(lpa_model)
## # A tibble: 72 × 8
## Category Parameter Estimate se p Class Model Classes
## <chr> <chr> <dbl> <dbl> <dbl> <int> <dbl> <int>
## 1 Means w2cope 2.63 0.0890 1.43e-192 1 1 2
## 2 Means w3cope 2.58 0.0624 0 1 1 2
## 3 Means w4cope 2.66 0.0813 3.14e-235 1 1 2
## 4 Means w5cope 2.61 0.0885 1.03e-190 1 1 2
## 5 Variances w2cope 0.327 0.0333 9.42e- 23 1 1 2
## 6 Variances w3cope 0.268 0.0306 1.93e- 18 1 1 2
## 7 Variances w4cope 0.289 0.0333 4.58e- 18 1 1 2
## 8 Variances w5cope 0.295 0.0356 1.17e- 16 1 1 2
## 9 Means w2cope 3.44 0.0440 0 2 1 2
## 10 Means w3cope 3.47 0.0735 0 2 1 2
## # ℹ 62 more rows
To see the means and variances of a {mclust} object call
summary()
with parameters = TRUE
.
summary(mclust_model, parameters = TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## -1021.46 280 18 -2144.347 -2231.189
##
## Clustering table:
## 1 2 3
## 128 95 57
##
## Mixing probabilities:
## 1 2 3
## 0.4404199 0.3509536 0.2086265
##
## Means:
## [,1] [,2] [,3]
## w2cope 3.107510 3.546986 2.272306
## w3cope 2.921205 3.732339 2.319565
## w4cope 3.077434 3.688877 2.448043
## w5cope 3.034711 3.696047 2.402060
##
## Variances:
## [,,1]
## w2cope w3cope w4cope w5cope
## w2cope 0.2731808 0.0000000 0.0000000 0.000000
## w3cope 0.0000000 0.1816459 0.0000000 0.000000
## w4cope 0.0000000 0.0000000 0.2752901 0.000000
## w5cope 0.0000000 0.0000000 0.0000000 0.283196
## [,,2]
## w2cope w3cope w4cope w5cope
## w2cope 0.2731808 0.0000000 0.0000000 0.000000
## w3cope 0.0000000 0.1816459 0.0000000 0.000000
## w4cope 0.0000000 0.0000000 0.2752901 0.000000
## w5cope 0.0000000 0.0000000 0.0000000 0.283196
## [,,3]
## w2cope w3cope w4cope w5cope
## w2cope 0.2731808 0.0000000 0.0000000 0.000000
## w3cope 0.0000000 0.1816459 0.0000000 0.000000
## w4cope 0.0000000 0.0000000 0.2752901 0.000000
## w5cope 0.0000000 0.0000000 0.0000000 0.283196
Notice the “mixing probabilities”. These are probabilities for class assignment. The minimum and maximum probabilities are reported in the {tidyLPA} output above as “n_min” and “n_max”.
The {tidyLPA} package provides the plot_profiles()
function, which make a plot for all models that were fit.
lpa_model |>
plot_profiles(rawdata = FALSE)
{mclust} provides a classification plot showing pairwise scatterplots with the classification superimposed. You can see the ellipsoids are all the same size (ie, “Equal variances and covariances fixed to 0”)
plot(mclust_model, what = "classification")
Since this data has a time component, we might want to see how these 3 profiles change over time by plotting the mean versus time. The following code is courtesy of Lauren Brideau:
cop_mean <- get_estimates(lpa_model) |>
mutate(Class = as.factor(Class),
Wave = stringr::str_extract(Parameter, "w\\d+")) %>%
filter(Category == "Means")
library(ggplot2)
cop_mean %>%
filter(Classes == 3) %>%
ggplot(., aes(x = Wave, y = Estimate, group = Class)) +
geom_line() +
facet_wrap(vars(Class), labeller = "label_both") +
theme_bw()
We could also add standard error ribbons.
cop_mean %>%
filter(Classes == 3) %>%
ggplot(., aes(x = Wave, y = Estimate, group = Class)) +
geom_line() +
geom_ribbon(mapping = aes(ymin = Estimate - se, ymax = Estimate + se), alpha = 1/4) +
facet_wrap(vars(Class), labeller = "label_both") +
theme_bw()
Use color and combine into one plot.
cop_mean %>%
filter(Classes == 3) %>%
ggplot(., aes(x = Wave, y = Estimate, group = Class,
fill = Class, color = Class)) +
geom_line() +
geom_ribbon(mapping = aes(ymin = Estimate - se, ymax = Estimate + se), alpha = 1/4) +
theme_minimal()
The grad student who contacted us for help was examining two dimensions: “coping” and “discrimination”. She wanted to combine the profiles into two plots. Again, credit to Lauren for the data wrangling and plotting code.
# fit the model for discrimination
lpa_discrim <- data_complete %>%
select(w2DISCRIM, w3DISCRIM, w4DISCRIM, w5DISCRIM) %>%
estimate_profiles(n_profiles = 2:4, models = 1)
# get means
cop_mean <- get_estimates(lpa_model) %>%
mutate(type = as.character("coping"))
dis_mean <- get_estimates(lpa_discrim) %>%
mutate(type = as.character("discrimination"))
#Join data and clean
cop.dis_mean <- bind_rows(cop_mean, dis_mean) %>%
mutate(Parameter = as.factor(Parameter),
Class = as.factor(Class),
Wave = stringr::str_extract(Parameter, "w\\d+")) %>%
#filter to just the mean values
filter(Category == "Means")
# create the plot
cop.dis_mean %>% filter(Classes == 3) %>%
ggplot(., aes(x = Wave, y = Estimate, color = type)) +
geom_point(aes(shape = Class)) +
facet_wrap(vars(Class)) +
geom_line(aes(group = interaction(type, Class))) +
theme_bw()
The documentation for {tidyLPA} states there are only 4 models available when using {mclust}:
These correspond to the following {mclust} models:
The {mclust} package can fit 14 different models. It’s not clear to me why {tidyLPA} is only limited to four of the models.
The mclustBIC()
function allows us try all models and
group sizes (or profiles) ranging from 1 - 9. For example:
out_bic <- mclustBIC(data_complete[,c("w2cope", "w3cope", "w4cope", "w5cope")])
summary(out_bic)
## Best BIC values:
## VEI,6 VEE,5 VEI,5
## BIC -1920.856 -1922.54809 -1936.2656
## BIC diff 0.000 -1.69215 -15.4097
The “best” model is “VEI” with 6 profiles, a model that is not
available in {tidyLPA}. We can then fit the model using
Mclust()
.
mod <- Mclust(data_complete[,c("w2cope", "w3cope", "w4cope", "w5cope")], x = out_bic)
summary(mod)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEI (diagonal, equal shape) model with 6 components:
##
## log-likelihood n df BIC ICL
## -853.367 280 38 -1920.856 -1976.453
##
## Clustering table:
## 1 2 3 4 5 6
## 11 24 73 47 59 66
However, the classification plot has me wondering if maybe 3 or 4 groups would be better.
plot(mod, what = "classification")
See also this blog post on the dangers of falling in love with clustering.
sess <- sessionInfo()
sess$R.version$version.string
## [1] "R version 4.4.1 (2024-06-14 ucrt)"
sess$platform
## [1] "x86_64-w64-mingw32/x64"
packageVersion("tidyLPA")
## [1] '1.1.0'
packageVersion("mclust")
## [1] '6.1.1'
packageVersion("dplyr")
## [1] '1.1.4'
packageVersion("haven")
## [1] '2.5.4'