Polyserial correlation

Polyserial correlation is a measure of association between a continuous variable and an ordinal variable. It rests on two assumptions:

  1. The joint distribution of the two variables is bivariate normal.
  2. The ordinal variable is derived from a latent continuous variable.

The polyserial() function from the {polycor} package calculates polyserial correlation. The help page for this function provides a helpful example that illustrates the assumptions.

First we simulate bivariate normal data using the rmvnorm() function from the {mvtnorm} package. The means are 0, the standard deviations are 1, and the covariance (correlation) is 0.5. This satisfies the first assumption above (i.e., the distribution of the variables is bivariate normal).

library(mvtnorm)
set.seed(12345)
data <- rmvnorm(1000, mean = c(0, 0), sigma = matrix(c(1, .5, .5, 1), 2, 2))
x <- data[,1]
y <- data[,2]
cor(x, y)  
## [1] 0.5263698

The sample correlation is close to the 0.5 used to simulate the data.

Now polychotomize y into 4 categories. Notice the thresholds are -1, 0.5, and 1.5. These are arbitrary selections. This satisfies the second assumption above (i.e., y is derived from a continuous variable).

y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
summary(y)
##  (-Inf,-1]   (-1,0.5]  (0.5,1.5] (1.5, Inf] 
##        162        525        246         67

Now calculate polyserial correlation using the polyserial() function. The default is to use the 2-step estimate to perform the calculation. Notice the result is similar to the 0.5 used to simulate the data.

library(polycor)
polyserial(x, y)
## [1] 0.5121031
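The 2-step estimate without a standard error has a closed form (Drasgow 1986): estimate the thresholds from the marginal proportions of y, then rescale the Pearson correlation between x and the integer-scored y. The sketch below illustrates that formula; it is not a transcript of polyserial()'s internals, and it uses fresh variable names and a base-R simulation so the objects above are untouched.

```r
# Ad hoc 2-step estimate: rho = r_xy * sd(y) / sum(dnorm(tau)),
# where y is scored 1, 2, ..., sd() is the population standard deviation
# (hence the sqrt((n-1)/n) factor), and tau holds thresholds estimated
# from the marginal proportions of y.
set.seed(12345)
n <- 1000
x2 <- rnorm(n)
y2_latent <- 0.5 * x2 + sqrt(1 - 0.5^2) * rnorm(n)  # latent y; cor = 0.5
y2 <- cut(y2_latent, c(-Inf, -1, .5, 1.5, Inf), labels = FALSE)  # scores 1..4

tau <- qnorm(cumsum(prop.table(table(y2))))[-4]  # drop qnorm(1) = Inf
rho <- cor(x2, y2) * sd(y2) * sqrt((n - 1) / n) / sum(dnorm(tau))
rho  # close to the 0.5 used to simulate the data
```

The key step is that the covariance between x and the scored y equals rho times the sum of the normal densities at the thresholds, so dividing the observed covariance by that sum recovers the latent correlation.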

We can request a standard error for the estimated correlation by setting std.err=TRUE. When we do, the correlation is estimated numerically, which results in a slightly different estimate. The estimated thresholds are also returned. Notice how close they are to the thresholds we specified in the cut() function: -1, 0.5, and 1.5. Finally, a test of bivariate normality is run. The null hypothesis is that the data are bivariate normal; the test fails to reject it.

polyserial(x, y, std.err=TRUE) 
## 
## Polyserial Correlation, 2-step est. = 0.5085 (0.02413)
## Test of bivariate normality: Chisquare = 8.604, df = 11, p = 0.6584
## 
##                 1      2     3
## Threshold -0.9863 0.4874 1.499

There is also a maximum likelihood estimator for polyserial correlation that is more computationally demanding. This can be requested by setting ML=TRUE. Notice this returns estimated standard errors for the threshold estimates.

polyserial(x, y, ML=TRUE, std.err=TRUE) 
## 
## Polyserial Correlation, ML est. = 0.5083 (0.02466)
## Test of bivariate normality: Chisquare = 8.548, df = 11, p = 0.6635
## 
##                  1      2       3
## Threshold -0.98560 0.4812 1.50700
## Std.Err.   0.04408 0.0379 0.05847

Example

Drasgow (1986) works through an example calculating the polyserial correlation between the weight of ewes at mating (in lbs) and the number of lambs born to each ewe.

# Table 1: weights and number of lambs born for 25 ewes
X <- c(72, 88, 112, 93, 86, 87, 78, 
       69, 101, 108, 104, 92, 77,
       80, 99, 92, 72, 81, 88,
       104, 103, 85, 96, 96, 104)
d <- c(rep(0,3), rep(1, 17), rep(2, 5))
table(d) # ordered categorical variable; number of lambs born
## d
##  0  1  2 
##  3 17  5
polyserial(X, d, std.err = TRUE)
## 
## Polyserial Correlation, 2-step est. = 0.2153 (0.2228)
## Test of bivariate normality: Chisquare = 5.443, df = 8, p = 0.7093
## 
##                1      2
## Threshold -1.175 0.8416

The correlation is estimated to be about 0.22, but the standard error of 0.22 suggests this estimate is very uncertain. The thresholds of -1.18 and 0.84 are the estimated cutpoints where the latent continuous variable was cut to create the ordered values of 0, 1, and 2. But it’s weird to me to think of “number of lambs born” as having an underlying latent continuous variable.

Polychoric correlation

Polychoric correlation is a measure of association between two ordinal variables. It rests on two assumptions:

  1. The joint distribution of the two variables is bivariate normal.
  2. The ordinal variables are derived from latent continuous variables.

The polychor() function from the {polycor} package calculates polychoric correlation. The help page for this function provides a helpful example that illustrates the assumptions.

First we simulate bivariate normal data using the rmvnorm() function from the {mvtnorm} package. The means are 0, the standard deviations are 1, and the covariance (correlation) is 0.5. This satisfies the first assumption above (i.e., the distribution of the variables is bivariate normal).

library(mvtnorm)
set.seed(12345)
data <- rmvnorm(1000, mean = c(0, 0), sigma = matrix(c(1, .5, .5, 1), 2, 2))
x <- data[,1]
y <- data[,2]
cor(x, y)  
## [1] 0.5263698

The sample correlation is close to the 0.5 used to simulate the data.

Now dichotomize x into two categories and polychotomize y into 4 categories. The thresholds are arbitrary selections. This satisfies the second assumption above (i.e., x and y are derived from continuous variables).

x <- cut(x, c(-Inf, .75, Inf))
summary(x)
## (-Inf,0.75] (0.75, Inf] 
##         775         225
y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
summary(y)
##  (-Inf,-1]   (-1,0.5]  (0.5,1.5] (1.5, Inf] 
##        162        525        246         67

Now calculate polychoric correlation using the polychor() function. The default is to use the 2-step estimate to perform the calculation. Notice the result is similar to the 0.5 used to simulate the data.

polychor(x, y)
## [1] 0.5230474
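The 2-step approach can be sketched by hand: estimate the row and column thresholds from the marginal proportions of the table, then find the correlation that maximizes the multinomial likelihood, with cell probabilities computed from the bivariate normal CDF. This is an illustration of the idea, not a transcript of polychor()'s internals; it simulates fresh data with base R (so the objects above are untouched) and uses pmvnorm() from {mvtnorm} for the rectangle probabilities.

```r
library(mvtnorm)  # for pmvnorm()

# Fresh simulated data with latent correlation 0.5, cut as above
set.seed(12345)
n <- 1000
xs <- rnorm(n)
ys <- 0.5 * xs + sqrt(1 - 0.5^2) * rnorm(n)
tab <- table(cut(xs, c(-Inf, .75, Inf)),
             cut(ys, c(-Inf, -1, .5, 1.5, Inf)))

# Step 1: thresholds from marginal proportions; qnorm(1) = Inf closes the grid
a <- c(-Inf, qnorm(cumsum(prop.table(rowSums(tab)))))
b <- c(-Inf, qnorm(cumsum(prop.table(colSums(tab)))))

# Step 2: maximize the multinomial log-likelihood over rho, where cell (i, j)
# has probability P(a[i] < X <= a[i+1], b[j] < Y <= b[j+1])
loglik <- function(rho) {
  R <- matrix(c(1, rho, rho, 1), 2)
  p <- outer(seq_len(nrow(tab)), seq_len(ncol(tab)),
             Vectorize(function(i, j)
               pmvnorm(lower = c(a[i], b[j]),
                       upper = c(a[i + 1], b[j + 1]), corr = R)))
  sum(tab * log(p))
}
rho_2step <- optimize(loglik, c(-0.99, 0.99), maximum = TRUE)$maximum
rho_2step  # close to the 0.5 used to simulate the data
```

Because the thresholds are fixed in step 1, only a one-dimensional optimization over rho is needed in step 2, which is what makes the 2-step estimator cheaper than full maximum likelihood.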

We can request a standard error for the estimated correlation by setting std.err=TRUE. When we do, the correlation is estimated numerically, which results in a slightly different estimate. The estimated thresholds for the “row” and “column” variables are also returned; this terminology reflects the fact that two ordinal variables are usually presented as a cross tabulation. Notice how close the thresholds are to the ones we specified in the cut() function above. Finally, a test of bivariate normality is run. The null hypothesis is that the data are bivariate normal; the test fails to reject it.

polychor(x, y, std.err=TRUE) 
## 
## Polychoric Correlation, 2-step est. = 0.5231 (0.03729)
## Test of bivariate normality: Chisquare = 2.758, df = 2, p = 0.2519
## 
##   Row Threshold        
##   0.7554
## 
## 
##   Column Thresholds         
## 1 -0.9863
## 2  0.4874
## 3  1.4990

There is also a maximum likelihood estimator for polychoric correlation that is more computationally demanding. This can be requested by setting ML=TRUE. Notice this returns estimated standard errors for the threshold estimates.

polychor(x, y, ML=TRUE, std.err=TRUE) 
## 
## Polychoric Correlation, ML est. = 0.5231 (0.03819)
## Test of bivariate normality: Chisquare = 2.739, df = 2, p = 0.2543
## 
##   Row Threshold
##   Threshold Std.Err.
##      0.7537  0.04403
## 
## 
##   Column Thresholds
##   Threshold Std.Err.
## 1   -0.9842  0.04746
## 2    0.4841  0.04127
## 3    1.5010  0.06118

Example

Drasgow (1986) works through an example calculating the polychoric correlation between the number of lambs born to ewes in 1953 versus 1952. The following reproduces Table 2 in the article.

labels <- c("No lambs", "1 lamb", "2 lambs")
table2 <- matrix(c(58, 52, 1,
                   26, 58, 3,
                   8, 12, 9), byrow = TRUE, nrow = 3,
                 dimnames = list("1952" = labels, "1953" = labels))

addmargins(table2)
##           1953
## 1952       No lambs 1 lamb 2 lambs Sum
##   No lambs       58     52       1 111
##   1 lamb         26     58       3  87
##   2 lambs         8     12       9  29
##   Sum            92    122      13 227

The maximum likelihood estimate of the polychoric correlation between the two variables is about 0.42 with a standard error of about 0.08. However, the test of bivariate normality rejects the null hypothesis of underlying normally distributed variables (p ≈ 0.009). Perhaps this isn’t surprising, since it seems unusual to assume “number of lambs born” would be derived from an underlying continuous variable.

polychor(table2, ML = TRUE, std.err = TRUE) 
## 
## Polychoric Correlation, ML est. = 0.4191 (0.07616)
## Test of bivariate normality: Chisquare = 11.54, df = 3, p = 0.009157
## 
##   Row Thresholds
##   Threshold Std.Err.
## 1  -0.02988  0.08299
## 2   1.13300  0.10630
## 
## 
##   Column Thresholds
##   Threshold Std.Err.
## 1   -0.2422  0.08361
## 2    1.5940  0.13720

Gamma and Delta

Another measure of association between two ordinal variables is Goodman and Kruskal’s Gamma, available in the {DescTools} package. This statistic is based on counting concordant and discordant pairs of observations. Use the conf.level argument to request a confidence interval.

library(DescTools)
GoodmanKruskalGamma(table2, conf.level = 0.95)
##     gamma    lwr.ci    upr.ci 
## 0.4581637 0.2700843 0.6462431

If one of the ordinal variables can be considered a dependent variable, Somers’ Delta, aka Somers’ d, can also be used to measure association. The direction argument specifies which variable is treated as dependent: direction = "row" treats the row variable (“1952”) as dependent, while direction = "column" treats the column variable (“1953”) as dependent. The result is appreciably smaller than Gamma. This is because Somers’ d also counts pairs tied on the dependent variable in its denominator, while Gamma uses only concordant and discordant pairs.

SomersDelta(table2, direction = "row", conf.level = 0.95)
##    somers    lwr.ci    upr.ci 
## 0.2893046 0.1575945 0.4210147
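Both statistics can be computed by hand from pair counts, which makes the difference in their denominators concrete. A short sketch, rebuilding Drasgow’s Table 2 so the block stands alone:

```r
# Drasgow's Table 2 again (rows = 1952, columns = 1953)
tab <- matrix(c(58, 52, 1,
                26, 58, 3,
                 8, 12, 9), byrow = TRUE, nrow = 3)

# Count concordant pairs (C: both variables order the pair the same way)
# and discordant pairs (D: the variables order the pair in opposite ways)
nr <- nrow(tab); nc <- ncol(tab)
C <- D <- 0
for (i in 1:nr) for (j in 1:nc) {
  C <- C + tab[i, j] * sum(tab[(1:nr) > i, (1:nc) > j])
  D <- D + tab[i, j] * sum(tab[(1:nr) > i, (1:nc) < j])
}

n <- sum(tab)
# Gamma discards all ties:
(C - D) / (C + D)  # 0.4581637
# Somers' d (row variable dependent) divides instead by all pairs untied on
# the column variable, which includes pairs tied on the row variable:
(C - D) / ((n^2 - sum(colSums(tab)^2)) / 2)  # 0.2893046
```

Here C = 6448 and D = 2396, so Gamma divides the difference 4052 by the 8844 concordant-plus-discordant pairs, while Somers’ d divides it by the 14006 pairs untied on the column variable; the larger denominator is why Somers’ d is smaller.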

For derivations of the Gamma and Delta statistics, see the StatLab article, Understanding Somers’ D.

References

Drasgow, F. (1986). Polychoric and polyserial correlations. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences (Vol. 7, pp. 68–74). Wiley.

Session Information

R.version.string
## [1] "R version 4.4.1 (2024-06-14 ucrt)"
packageVersion("polycor")
## [1] '0.8.1'
packageVersion("DescTools")
## [1] '0.99.56'