Three approaches to logistic regression in R

Responses for a binary logistic regression model in R can be specified in one of three ways:

As a 0/1 indicator
As a proportion with the total number of cases given by the weights argument.
As a two-column matrix, with the first column giving the number of successes and the second the number of failures.

This is documented on the family help page. (?family)

Example data

This data comes from the text Analysis of Categorical Data with R. Programs and data are here. The data are on field goals attempted in the NFL. A field goal is good if it goes through the goal posts. The explanatory variable is distance in yards from the field goal. Does distance help explain the probability of a field goal being “good” (good = 1)?

d <- read.csv("http://www.chrisbilder.com/categorical/Chapter2/Placekick.csv")
d <- d[, c("good", "distance")]
head(d)

##   good distance
## 1    1       21
## 2    1       21
## 3    1       20
## 4    1       28
## 5    1       20
## 6    1       25

Method 1

Response is coded as 0 or 1. One row per field goal attempt. This is the given structure of this data.

m1 <- glm(good ~ distance, data = d, family = binomial)
exp(coef(m1))

## (Intercept)    distance 
## 334.3136943   0.8913424

The odds of a successful field goal decrease about 11% for each yard we get further from the goal posts.

Method 2

Response is the proportion of successes for each distance. We need to restructure the data to implement this method.

# sum of success by distance
x <- aggregate(good ~ distance, data = d, sum)
# sum of attempts or trials by distance
n <- aggregate(good ~ distance, data = d, length)
# combine into data frame and create column for proportions
d2 <- data.frame(distance = x$distance, success = x$good, 
                 trials = n$good, proportion = x$good/n$good)
head(d2)

##   distance success trials proportion
## 1       18       2      3  0.6666667
## 2       19       7      7  1.0000000
## 3       20     776    789  0.9835234
## 4       21      19     20  0.9500000
## 5       22      12     14  0.8571429
## 6       23      26     27  0.9629630

Now use proportion as the response, but also include the argument weights = trials to indicate total number of attempts for each distance.

m2 <- glm(proportion ~ distance, data = d2, weights = trials, 
          family = binomial)
exp(coef(m1))

## (Intercept)    distance 
## 334.3136943   0.8913424

Notice the result is the same as method 1.

We could also calculate proportion “on-the-fly” using success/trials, like so:

m2 <- glm(success/trials ~ distance, data = d2, weights = trials, 
          family = binomial)

Method 3

Response is a two-column matrix, with the first column giving the number of successes and the second the number of failures. We need to add number of failures to our data frame, which is simply the difference between attempts and successes.

d2$failure <- d2$trials - d2$success

Now use cbind() to create a matrix of successes and failures and use that as our response. We no longer need to use the weights argument in this case.

m3 <- glm(cbind(success, failure) ~ distance, data = d2, 
          family = binomial)
exp(coef(m3))

## (Intercept)    distance 
## 334.3136943   0.8913424

Why one method versus the other?

Even though all three methods produce the same model coefficients, method 1 produces residuals that differ from methods 2 and 3.

Method 1 produces 2 residuals for each unique predictor value. The resulting residual versus fitted plot produces weird patterns that make it hard to assess model fit.

plot(m1, which = 1, main = "Method #1")

Methods 2 and 3 produce one residual per unique predictor value which makes it easier to assess fit.

plot(m2, which = 1, main = "Methods #2 and #3")

Rows 15, 10 and 1 appear to have a smaller proportion of success than the model predicts.

d2[c(1,15,10),"proportion"]

## [1] 0.6666667 0.7666667 0.8275862

Indeed the model predicts probability of success from those distances to be much higher.

predict(m2, newdata = data.frame(distance = c(18,32,27)), 
        type = "response")

##         1         2         3 
## 0.9768333 0.8939014 0.9374009

Of course the model with 0/1 responses makes the same predicted probabilities.

predict(m1, newdata = data.frame(distance = c(18,32,27)), 
        type = "response")

##         1         2         3 
## 0.9768333 0.8939014 0.9374009

But the associated residual versus fitted plot for that model doesn’t tip us off to that.