Load the Python modules.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
Fit the model using R-style formula syntax and print the summary. The formula interface takes care of creating all the dummy variables.
formula = "low ~ age + lwt + race + smoke + ptd + ht + ui + ftv"
mod1 = smf.glm(formula=formula, data=birthwt,
family=sm.families.Binomial()).fit()
print(mod1.summary())
##                  Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable:                    low   No. Observations:                  189
## Model:                            GLM   Df Residuals:                      178
## Model Family:                Binomial   Df Model:                           10
## Link Function:                  Logit   Scale:                          1.0000
## Method:                          IRLS   Log-Likelihood:                -97.738
## Date:                Mon, 18 Apr 2022   Deviance:                       195.48
## Time:                        09:26:02   Pearson chi2:                     179.
## No. Iterations:                     5   Pseudo R-squ. (CS):             0.1873
## Covariance Type:            nonrobust
## =================================================================================
##                     coef    std err          z      P>|z|      [0.025      0.975]
## ---------------------------------------------------------------------------------
## Intercept         0.8230      1.245      0.661      0.508      -1.617       3.263
## race[T.black]     1.1924      0.536      2.225      0.026       0.142       2.243
## race[T.other]     0.7407      0.462      1.604      0.109      -0.164       1.646
## smoke[T.1]        0.7555      0.425      1.778      0.075      -0.078       1.589
## ptd[T.1]          1.3438      0.481      2.796      0.005       0.402       2.286
## ht[T.1]           1.9132      0.721      2.654      0.008       0.501       3.326
## ui[T.1]           0.6802      0.464      1.465      0.143      -0.230       1.590
## ftv[T.1]         -0.4364      0.479     -0.910      0.363      -1.376       0.503
## ftv[T.2+]         0.1790      0.456      0.392      0.695      -0.715       1.074
## age              -0.0372      0.039     -0.962      0.336      -0.113       0.039
## lwt              -0.0157      0.007     -2.211      0.027      -0.030      -0.002
## =================================================================================
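The categorical predictors in the summary are treatment-coded against a reference level: race[T.black] and race[T.other] are indicators measured against the reference level white. As a rough illustration of that coding (a sketch using pandas' get_dummies on made-up values, not the patsy machinery statsmodels actually uses under the hood), note how only the non-reference levels get columns:

```python
import pandas as pd

# made-up race values; a pandas Categorical fixes "white" as the first
# (reference) level, matching the summary above
race = pd.Categorical(['white', 'black', 'other', 'white'],
                      categories=['white', 'black', 'other'])

# drop_first=True drops the reference level, leaving one indicator column
# per remaining level -- analogous to race[T.black] and race[T.other]
dummies = pd.get_dummies(pd.Series(race), prefix='race', drop_first=True)
print(list(dummies.columns))  # ['race_black', 'race_other']
```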
Make predictions for ptd = 0 and ptd = 1. We want to replicate the output of ggpredict in R:
# Predicted probabilities of low
ptd | Predicted |       95% CI
------------------------------
  0 |      0.13 | [0.05, 0.27]
  1 |      0.36 | [0.13, 0.67]
Adjusted for:
* age = 23.00
* lwt = 121.00
* race = white
* smoke = 0
* ht = 0
* ui = 0
* ftv = 0
Create a Pandas DataFrame for our predictors. Notice these values match the output of ggpredict above under the “Adjusted for” section. This needs to be a Pandas DataFrame because we used the formula interface when fitting the model.
# create a dictionary of values;
# these match the "Adjusted for" values in the ggpredict output above
d = {'age': 23, 'lwt': 121, 'race': "white",
     'smoke': "0", 'ptd': ["0", "1"], 'ht': "0",
     'ui': "0", 'ftv': "0"}
# convert dictionary to Pandas DataFrame.
# Is there a more direct way to do this?
nd = pd.DataFrame(data=d)
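To answer the question in the comment above: one somewhat more direct construction (a sketch; the column values mirror the dictionary, with lwt set to 121 to match the “Adjusted for” values) is to start from the one predictor that varies and let assign() broadcast the scalars across rows:

```python
import pandas as pd

# only ptd varies; assign() broadcasts each scalar value across both rows
nd = (pd.DataFrame({'ptd': ["0", "1"]})
        .assign(age=23, lwt=121, race="white",
                smoke="0", ht="0", ui="0", ftv="0"))
print(nd.shape)  # (2, 8)
```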
Now plug our new data (nd) into the model using the get_prediction method. This returns the predictions as a PredictionResults object. We can access the predicted probabilities through its predicted_mean attribute.
pred = mod1.get_prediction(exog=nd)
prob = pred.predicted_mean
print(prob)
## [0.12360891 0.35093623]
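As a sanity check, a predicted probability from a logistic model is just the inverse logit of the linear predictor. A sketch for the ptd = 0 row using the rounded coefficients from the summary above (with every categorical at its reference level, only the intercept, age, and lwt terms are nonzero, so small differences from the printed output are expected):

```python
import numpy as np

# rounded coefficients from the summary above: Intercept, age, lwt
beta = np.array([0.8230, -0.0372, -0.0157])
# ptd = 0 row: categoricals at reference levels, age = 23, lwt = 121
x = np.array([1.0, 23.0, 121.0])

eta = x @ beta                    # linear predictor on the logit scale
prob = 1 / (1 + np.exp(-eta))     # inverse logit -> predicted probability
print(prob)
```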
We can use the conf_int method to extract the confidence intervals for the predicted probabilities.
ci = pred.conf_int()
print(ci)
## [[0.05166761 0.26746827]
## [0.12724595 0.66722908]]
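Notice both intervals stay inside (0, 1). One common way such intervals are constructed (a sketch of the general idea with assumed numbers, not necessarily the exact computation statsmodels performs internally) is to form a normal-theory interval on the logit scale and push the endpoints through the inverse link, which guarantees probabilities between 0 and 1:

```python
import numpy as np

def inv_logit(z):
    # inverse link function for logistic regression
    return 1 / (1 + np.exp(-z))

# assumed linear predictor and standard error, for illustration only
eta, se = -1.93, 0.55

lo, hi = eta - 1.96 * se, eta + 1.96 * se   # 95% CI on the logit scale
print(inv_logit(lo), inv_logit(hi))          # endpoints mapped into (0, 1)
```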
Now we use the matplotlib errorbar function to create the plot. I cannot find a seaborn plot for this. (The pointplot function will automatically add error bars, but does not appear to allow you to supply the lower and upper values, which we need to do in this case.)

Of course, the errorbar function requires margins of error, not the lower and upper limits! So we have to do some subtraction to get the lower and upper margins of error.
lower = [prob[0] - ci[0,0], prob[1] - ci[1,0]]
upper = [ci[0,1] - prob[0], ci[1,1] - prob[1]]
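Equivalently, the margins can be computed in one step with NumPy array slicing (a sketch with the printed estimates hard-coded), which also scales to more than two predictions:

```python
import numpy as np

# predicted probabilities and confidence limits from the output above
prob = np.array([0.12360891, 0.35093623])
ci = np.array([[0.05166761, 0.26746827],
               [0.12724595, 0.66722908]])

lower = prob - ci[:, 0]            # distance down to the lower limit
upper = ci[:, 1] - prob            # distance up to the upper limit
yerr = np.vstack([lower, upper])   # the (2, n) shape errorbar expects
```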
Finally we can make the plot.
import matplotlib.pyplot as plt
plt.clf()
plt.errorbar(x = ["0","1"], y = prob,
yerr=[lower, upper], fmt='ok')
## <ErrorbarContainer object of 3 artists>
plt.xlabel('ptd')
plt.ylabel('low')
plt.title('Predicted probabilities of low')
plt.show()