Polyserial correlation is a measure of association between a continuous variable and an ordinal variable. Two assumptions about this correlation:

1. The two variables are bivariate normally distributed.
2. The ordinal variable is derived from an underlying continuous variable.
The polyserial() function from the {polycor} package calculates polyserial correlation. The help page for this function provides a helpful example that illustrates the assumptions.
First we simulate bivariate normal data using the rmvnorm() function from the {mvtnorm} package. The means are 0, the standard deviations are 1, and the covariance (correlation) is 0.5. This satisfies the first assumption above (i.e., the distribution of the variables is bivariate normal).
library(mvtnorm)
set.seed(12345)
data <- rmvnorm(1000, mean = c(0, 0), sigma = matrix(c(1, .5, .5, 1), 2, 2))
x <- data[,1]
y <- data[,2]
cor(x, y)
## [1] 0.5263698
The sample correlation is close to the 0.5 used to simulate the data.
Now polychotomize y into 4 categories. Notice the thresholds are -1, 0.5, and 1.5. These are arbitrary selections. This satisfies the second assumption above (i.e., y is derived from a continuous variable).
y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
summary(y)
## (-Inf,-1] (-1,0.5] (0.5,1.5] (1.5, Inf]
## 162 525 246 67
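A quick sanity check: under these assumptions, each threshold is simply the standard normal quantile of the cumulative category proportion, so the thresholds can be recovered from the counts above alone, using nothing beyond base R:

```r
# Recover thresholds from cumulative category proportions (counts from summary(y))
counts <- c(162, 525, 246, 67)
qnorm(cumsum(counts) / sum(counts))[1:3]
# approximately -0.9863, 0.4874, 1.4985 -- close to the true -1, .5, 1.5
```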
Now calculate the polyserial correlation using the polyserial() function. The default is to use the 2-step estimate to perform the calculation. Notice the result is similar to the 0.5 used to simulate the data.
library(polycor)
polyserial(x, y)
## [1] 0.5121031
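The 2-step estimate also has a simple closed-form cousin, the ad hoc estimator discussed by Drasgow: scale the Pearson correlation between x and the category codes by the codes' standard deviation divided by the sum of normal densities at the thresholds. A base-R sketch (the data are re-simulated in base R so the snippet stands alone, so the numbers will differ slightly from the rmvnorm() example above):

```r
# Ad hoc closed-form polyserial estimate (an approximation to the 2-step):
#   rho ~= cor(x, codes) * sd(codes) / sum(dnorm(thresholds))
set.seed(12345)
n  <- 1000
x  <- rnorm(n)
yl <- 0.5 * x + sqrt(1 - 0.5^2) * rnorm(n)            # latent y, correlation 0.5
yc <- as.numeric(cut(yl, c(-Inf, -1, .5, 1.5, Inf)))  # category codes 1-4
tau <- qnorm(cumsum(table(yc)) / n)[1:3]              # thresholds from proportions
cor(x, yc) * sd(yc) / sum(dnorm(tau))                 # should land near 0.5
```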
We can request a standard error for the estimated correlation by setting std.err=TRUE. When we do that, the correlation is estimated numerically, which results in a slightly different estimate. This also returns the estimated thresholds. Notice how close they are to the original thresholds we specified in the cut() function: -1, .5, 1.5. Finally, a test of bivariate normality is run. The null hypothesis is that the data are bivariate normal. The test fails to reject this hypothesis.
polyserial(x, y, std.err=TRUE)
##
## Polyserial Correlation, 2-step est. = 0.5085 (0.02413)
## Test of bivariate normality: Chisquare = 8.604, df = 11, p = 0.6584
##
## 1 2 3
## Threshold -0.9863 0.4874 1.499
There is also a maximum likelihood estimator for polyserial correlation that is more computationally demanding. This can be requested by setting ML=TRUE. Notice this also returns estimated standard errors for the threshold estimates.
polyserial(x, y, ML=TRUE, std.err=TRUE)
##
## Polyserial Correlation, ML est. = 0.5083 (0.02466)
## Test of bivariate normality: Chisquare = 8.548, df = 11, p = 0.6635
##
## 1 2 3
## Threshold -0.98560 0.4812 1.50700
## Std.Err. 0.04408 0.0379 0.05847
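To demystify what ML=TRUE maximizes, here is a rough base-R sketch of the polyserial likelihood. For brevity the thresholds are held at their 2-step values rather than estimated jointly as the package does, and the data are re-simulated with base R, so the estimate will be close to (but not identical to) the output above.

```r
# Polyserial likelihood sketch:
#   f(x, y = j) = dnorm(x) * [pnorm((tau_j - rho*x)/s) - pnorm((tau_{j-1} - rho*x)/s)]
# with s = sqrt(1 - rho^2); maximize over rho with thresholds held fixed.
set.seed(12345)
n   <- 1000
x   <- rnorm(n)
yl  <- 0.5 * x + sqrt(1 - 0.5^2) * rnorm(n)            # latent y, correlation 0.5
y   <- as.numeric(cut(yl, c(-Inf, -1, .5, 1.5, Inf)))  # category codes 1-4
tau <- c(-Inf, qnorm(cumsum(table(y)) / n)[1:3], Inf)  # fixed 2-step thresholds

negll <- function(rho) {
  s <- sqrt(1 - rho^2)
  p <- pnorm((tau[y + 1] - rho * x) / s) - pnorm((tau[y] - rho * x) / s)
  -sum(dnorm(x, log = TRUE) + log(p))
}
optimize(negll, c(-0.99, 0.99))$minimum  # should land near 0.5
```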
Drasgow (1986) works through an example calculating the polyserial correlation between the weight of ewes at mating (in lbs) and the number of lambs born to the ewe.
# Table 1: weights and number of lambs born for 25 ewes
X <- c(72, 88, 112, 93, 86, 87, 78,
69, 101, 108, 104, 92, 77,
80, 99, 92, 72, 81, 88,
104, 103, 85, 96, 96, 104)
d <- c(rep(0,3), rep(1, 17), rep(2, 5))
table(d) # ordered categorical variable; number of lambs born
## d
## 0 1 2
## 3 17 5
polyserial(X, d, std.err = TRUE)
##
## Polyserial Correlation, 2-step est. = 0.2153 (0.2228)
## Test of bivariate normality: Chisquare = 5.443, df = 8, p = 0.7093
##
## 1 2
## Threshold -1.175 0.8416
The correlation is estimated to be about 0.22, but the standard error of 0.22 suggests this estimate is very uncertain. The thresholds of -1.18 and 0.84 are the estimated cutpoints where the latent continuous variable was cut to create the ordered values of 0, 1, and 2. But it’s weird to me to think of “number of lambs born” as having an underlying latent variable.
Polychoric correlation is a measure of association between two ordinal variables. Two assumptions about this correlation:

1. The variables underlying the two ordinal variables are bivariate normally distributed.
2. Both ordinal variables are derived from underlying continuous variables.
The polychor() function from the {polycor} package calculates polychoric correlation. The help page for this function provides a helpful example that illustrates the assumptions.
First we simulate bivariate normal data using the rmvnorm() function from the {mvtnorm} package. The means are 0, the standard deviations are 1, and the covariance (correlation) is 0.5. This satisfies the first assumption above (i.e., the distribution of the variables is bivariate normal).
library(mvtnorm)
set.seed(12345)
data <- rmvnorm(1000, mean = c(0, 0), sigma = matrix(c(1, .5, .5, 1), 2, 2))
x <- data[,1]
y <- data[,2]
cor(x, y)
## [1] 0.5263698
The sample correlation is close to the 0.5 used to simulate the data.
Now dichotomize x into two categories and polychotomize y into 4 categories. The thresholds are arbitrary selections. This satisfies the second assumption above (i.e., x and y are derived from continuous variables).
x <- cut(x, c(-Inf, .75, Inf))
summary(x)
## (-Inf,0.75] (0.75, Inf]
## 775 225
y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
summary(y)
## (-Inf,-1] (-1,0.5] (0.5,1.5] (1.5, Inf]
## 162 525 246 67
Now calculate the polychoric correlation using the polychor() function. The default is to use the 2-step estimate to perform the calculation. Notice the result is similar to the 0.5 used to simulate the data.
polychor(x, y)
## [1] 0.5230474
We can request a standard error for the estimated correlation by setting std.err=TRUE. When we do that, the correlation is estimated numerically, which results in a slightly different estimate. This also returns the estimated thresholds for the “row” and “column” variables. This terminology reflects the fact that two ordinal variables are usually presented as a cross tabulation. Notice how close the thresholds are to the original values we specified in the cut() functions above. Finally, a test of bivariate normality is run. The null hypothesis is that the data are bivariate normal. The test fails to reject this hypothesis.
polychor(x, y, std.err=TRUE)
##
## Polychoric Correlation, 2-step est. = 0.5231 (0.03729)
## Test of bivariate normality: Chisquare = 2.758, df = 2, p = 0.2519
##
## Row Threshold
## 0.7554
##
##
## Column Thresholds
## 1 -0.9863
## 2 0.4874
## 3 1.4990
There is also a maximum likelihood estimator for polychoric correlation that is more computationally demanding. This can be requested by setting ML=TRUE. Notice this also returns estimated standard errors for the threshold estimates.
polychor(x, y, ML=TRUE, std.err=TRUE)
##
## Polychoric Correlation, ML est. = 0.5231 (0.03819)
## Test of bivariate normality: Chisquare = 2.739, df = 2, p = 0.2543
##
## Row Threshold
## Threshold Std.Err.
## 0.7537 0.04403
##
##
## Column Thresholds
## Threshold Std.Err.
## 1 -0.9842 0.04746
## 2 0.4841 0.04127
## 3 1.5010 0.06118
Drasgow (1986) works through an example calculating the polychoric correlation between the number of lambs born to ewes in 1953 and the number born in 1952. The following reproduces Table 2 in the article.
labels <- c("No lambs", "1 lamb", "2 lambs")
table2 <- matrix(c(58, 52, 1,
26, 58, 3,
8, 12, 9), byrow = TRUE, nrow = 3,
dimnames = list("1952" = labels, "1953" = labels))
addmargins(table2)
## 1953
## 1952 No lambs 1 lamb 2 lambs Sum
## No lambs 58 52 1 111
## 1 lamb 26 58 3 87
## 2 lambs 8 12 9 29
## Sum 92 122 13 227
The maximum likelihood estimate of polychoric correlation between the two variables is about 0.42 with a standard error of about 0.08. However, the test of bivariate normality calls into question the null hypothesis of underlying normally distributed variables. Perhaps this isn’t surprising since it seems unusual to assume “number of lambs born” would be derived from an underlying continuous variable.
polychor(table2, ML = TRUE, std.err = TRUE)
##
## Polychoric Correlation, ML est. = 0.4191 (0.07616)
## Test of bivariate normality: Chisquare = 11.54, df = 3, p = 0.009157
##
## Row Thresholds
## Threshold Std.Err.
## 1 -0.02988 0.08299
## 2 1.13300 0.10630
##
##
## Column Thresholds
## Threshold Std.Err.
## 1 -0.2422 0.08361
## 2 1.5940 0.13720
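To see roughly what the 2-step polychoric estimator is doing under the hood, here is a base-R sketch (an illustration, not the package's actual code): the thresholds come from the marginal proportions, and the correlation is then chosen to maximize the multinomial likelihood, where each cell probability is a bivariate normal rectangle probability.

```r
# Rough base-R sketch of the 2-step polychoric idea (not polycor's actual code)
counts <- matrix(c(58, 52, 1,
                   26, 58, 3,
                    8, 12, 9), byrow = TRUE, nrow = 3)
n <- sum(counts)
a <- c(-Inf, qnorm(cumsum(rowSums(counts)) / n)[1:2], Inf)  # row thresholds
b <- c(-Inf, qnorm(cumsum(colSums(counts)) / n)[1:2], Inf)  # column thresholds

# P(X <= x, Y <= y) for a standard bivariate normal with correlation rho
bvn_cdf <- function(x, y, rho) {
  if (x == -Inf || y == -Inf) return(0)
  if (x == Inf && y == Inf) return(1)
  if (x == Inf) return(pnorm(y))
  if (y == Inf) return(pnorm(x))
  integrate(function(t) dnorm(t) * pnorm((y - rho * t) / sqrt(1 - rho^2)),
            -Inf, x)$value
}

# negative multinomial log-likelihood; each cell is a rectangle probability
negll <- function(rho) {
  ll <- 0
  for (i in 1:3) for (j in 1:3) {
    p <- bvn_cdf(a[i + 1], b[j + 1], rho) - bvn_cdf(a[i], b[j + 1], rho) -
         bvn_cdf(a[i + 1], b[j], rho) + bvn_cdf(a[i], b[j], rho)
    ll <- ll + counts[i, j] * log(p)
  }
  -ll
}
optimize(negll, c(-0.9, 0.9))$minimum  # should land near the 0.42 reported above
```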
Another measure of association between two ordinal variables is Goodman and Kruskal’s gamma, available in the {DescTools} package. This statistic is based on examining pairs of observations. Use the conf.level argument to request a confidence interval.
library(DescTools)
GoodmanKruskalGamma(table2, conf.level = 0.95)
## gamma lwr.ci upr.ci
## 0.4581637 0.2700843 0.6462431
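The "pairs of observations" idea can be made concrete by computing gamma by hand in base R: count concordant pairs (both variables move in the same direction) and discordant pairs (they move in opposite directions), then take their normalized difference.

```r
# Goodman-Kruskal gamma by hand: (C - D) / (C + D), where C and D count
# concordant and discordant pairs of observations
tab <- matrix(c(58, 52, 1,
                26, 58, 3,
                 8, 12, 9), byrow = TRUE, nrow = 3)
C <- D <- 0
for (i in 1:3) for (j in 1:3) {
  C <- C + tab[i, j] * sum(tab[row(tab) > i & col(tab) > j])  # concordant
  D <- D + tab[i, j] * sum(tab[row(tab) > i & col(tab) < j])  # discordant
}
(C - D) / (C + D)   # 0.4581637, matching GoodmanKruskalGamma()
```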
If one of the ordinal variables can be considered a dependent variable, Somers’ Delta, aka Somers’ d, can also be used to measure association. Setting direction = "row" treats the row variable (“1952” in this table) as the dependent variable. This result is appreciably smaller than gamma. This is because Somers’ d also counts pairs tied on the dependent variable in its denominator.
SomersDelta(table2, direction = "row", conf.level = 0.95)
## somers lwr.ci upr.ci
## 0.2893046 0.1575945 0.4210147
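The smaller value can be checked by hand. Somers’ d keeps gamma’s numerator (C − D) but enlarges the denominator to every pair of observations that differs on the column variable, including pairs tied on the row variable; dividing by that count reproduces the value above (a base-R sketch):

```r
# Somers' d by hand: (C - D) / (pairs untied on the column variable)
tab <- matrix(c(58, 52, 1,
                26, 58, 3,
                 8, 12, 9), byrow = TRUE, nrow = 3)
C <- D <- 0
for (i in 1:3) for (j in 1:3) {
  C <- C + tab[i, j] * sum(tab[row(tab) > i & col(tab) > j])  # concordant
  D <- D + tab[i, j] * sum(tab[row(tab) > i & col(tab) < j])  # discordant
}
n <- sum(tab)
untied_col <- (n^2 - sum(colSums(tab)^2)) / 2  # pairs differing on the column variable
(C - D) / untied_col   # 0.2893046, matching SomersDelta(direction = "row")
```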
For derivations of the gamma and delta statistics, see the StatLab article Understanding Somers’ D.
R.version.string
## [1] "R version 4.4.1 (2024-06-14 ucrt)"
packageVersion("polycor")
## [1] '0.8.1'
packageVersion("DescTools")
## [1] '0.99.56'