Load the Python modules.
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
Fit the model using R-style formula syntax and print the summary. The formula interface takes care of creating all the dummy variables.
formula = "low ~ age + lwt + race + smoke + ptd + ht + ui + ftv"
mod1 = smf.glm(formula=formula, data=birthwt,
family=sm.families.Binomial()).fit()
print(mod1.summary())
##                  Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable:                    low   No. Observations:                  189
## Model:                            GLM   Df Residuals:                      178
## Model Family:                Binomial   Df Model:                           10
## Link Function:                  Logit   Scale:                          1.0000
## Method:                          IRLS   Log-Likelihood:                -97.738
## Date:                Mon, 18 Apr 2022   Deviance:                       195.48
## Time:                        09:26:02   Pearson chi2:                     179.
## No. Iterations:                     5   Pseudo R-squ. (CS):             0.1873
## Covariance Type:            nonrobust
## =================================================================================
##                     coef    std err          z      P>|z|      [0.025      0.975]
## ---------------------------------------------------------------------------------
## Intercept         0.8230      1.245      0.661      0.508      -1.617       3.263
## race[T.black]     1.1924      0.536      2.225      0.026       0.142       2.243
## race[T.other]     0.7407      0.462      1.604      0.109      -0.164       1.646
## smoke[T.1]        0.7555      0.425      1.778      0.075      -0.078       1.589
## ptd[T.1]          1.3438      0.481      2.796      0.005       0.402       2.286
## ht[T.1]           1.9132      0.721      2.654      0.008       0.501       3.326
## ui[T.1]           0.6802      0.464      1.465      0.143      -0.230       1.590
## ftv[T.1]         -0.4364      0.479     -0.910      0.363      -1.376       0.503
## ftv[T.2+]         0.1790      0.456      0.392      0.695      -0.715       1.074
## age              -0.0372      0.039     -0.962      0.336      -0.113       0.039
## lwt              -0.0157      0.007     -2.211      0.027      -0.030      -0.002
## =================================================================================
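The categorical predictors in the summary are treatment-coded against a reference level: race[T.black] and race[T.other] are indicators measured against the reference level white. As a rough illustration of that coding (a sketch using pandas' get_dummies on made-up values, not the patsy machinery statsmodels actually uses under the hood), note how only the non-reference levels get columns:

```python
import pandas as pd

# made-up race values; a pandas Categorical fixes "white" as the first
# (reference) level, matching the summary above
race = pd.Categorical(['white', 'black', 'other', 'white'],
                      categories=['white', 'black', 'other'])

# drop_first=True drops the reference level, leaving one indicator column
# per remaining level -- analogous to race[T.black] and race[T.other]
dummies = pd.get_dummies(pd.Series(race), prefix='race', drop_first=True)
print(list(dummies.columns))  # ['race_black', 'race_other']
```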
Make predictions for ptd = 0 and ptd = 1. We want to replicate the output of ggpredict in R:
# Predicted probabilities of low
ptd | Predicted |       95% CI
------------------------------
  0 |      0.13 | [0.05, 0.27]
  1 |      0.36 | [0.13, 0.67]
Adjusted for:
* age = 23.00
* lwt = 121.00
* race = white
* smoke = 0
* ht = 0
* ui = 0
* ftv = 0
Create a Pandas DataFrame for our predictors. Notice these values match the output of ggpredict above under the “Adjusted for” section. This needs to be a Pandas DataFrame because we used the formula interface when fitting the model.
# create a dictionary of values;
# these match the "Adjusted for" values in the ggpredict output above
d = {'age': 23, 'lwt': 121, 'race': "white",
     'smoke': "0", 'ptd': ["0", "1"], 'ht': "0",
     'ui': "0", 'ftv': "0"}
# convert dictionary to Pandas DataFrame.
# Is there a more direct way to do this?
nd = pd.DataFrame(data=d)
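To answer the question in the comment above: one somewhat more direct construction (a sketch; the column values mirror the dictionary, with lwt set to 121 to match the “Adjusted for” values) is to start from the one predictor that varies and let assign() broadcast the scalars across rows:

```python
import pandas as pd

# only ptd varies; assign() broadcasts each scalar value across both rows
nd = (pd.DataFrame({'ptd': ["0", "1"]})
        .assign(age=23, lwt=121, race="white",
                smoke="0", ht="0", ui="0", ftv="0"))
print(nd.shape)  # (2, 8)
```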
Now plug our new data (nd) into the model using the get_prediction method. This returns the predictions as a PredictionResults object. We can access the predicted probabilities through its predicted_mean attribute.
pred = mod1.get_prediction(exog=nd)
prob = pred.predicted_mean
print(prob)
## [0.12360891 0.35093623]
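As a sanity check, a predicted probability from a logistic model is just the inverse logit of the linear predictor. A sketch for the ptd = 0 row using the rounded coefficients from the summary above (with every categorical at its reference level, only the intercept, age, and lwt terms are nonzero, so small differences from the printed output are expected):

```python
import numpy as np

# rounded coefficients from the summary above: Intercept, age, lwt
beta = np.array([0.8230, -0.0372, -0.0157])
# ptd = 0 row: categoricals at reference levels, age = 23, lwt = 121
x = np.array([1.0, 23.0, 121.0])

eta = x @ beta                    # linear predictor on the logit scale
prob = 1 / (1 + np.exp(-eta))     # inverse logit -> predicted probability
print(prob)
```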
We can use the conf_int method to extract the confidence intervals for the predicted probabilities.
ci = pred.conf_int()
print(ci)
## [[0.05166761 0.26746827]
## [0.12724595 0.66722908]]
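Notice both intervals stay inside (0, 1). One common way such intervals are constructed (a sketch of the general idea with assumed numbers, not necessarily the exact computation statsmodels performs internally) is to form a normal-theory interval on the logit scale and push the endpoints through the inverse link, which guarantees probabilities between 0 and 1:

```python
import numpy as np

def inv_logit(z):
    # inverse link function for logistic regression
    return 1 / (1 + np.exp(-z))

# assumed linear predictor and standard error, for illustration only
eta, se = -1.93, 0.55

lo, hi = eta - 1.96 * se, eta + 1.96 * se   # 95% CI on the logit scale
print(inv_logit(lo), inv_logit(hi))          # endpoints mapped into (0, 1)
```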
Now we use the matplotlib errorbar function to create the plot. I cannot find a seaborn plot for this. (The pointplot function will automatically add error bars, but does not appear to allow you to supply the lower and upper values, which we need to do in this case.)

Of course, the errorbar function requires margins of error, not the lower and upper limits! So we have to do some subtraction to get the lower and upper margins of error.
lower = [prob[0] - ci[0,0], prob[1] - ci[1,0]]
upper = [ci[0,1] - prob[0], ci[1,1] - prob[1]]
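Equivalently, the margins can be computed in one step with NumPy array slicing (a sketch with the printed estimates hard-coded), which also scales to more than two predictions:

```python
import numpy as np

# predicted probabilities and confidence limits from the output above
prob = np.array([0.12360891, 0.35093623])
ci = np.array([[0.05166761, 0.26746827],
               [0.12724595, 0.66722908]])

lower = prob - ci[:, 0]            # distance down to the lower limit
upper = ci[:, 1] - prob            # distance up to the upper limit
yerr = np.vstack([lower, upper])   # the (2, n) shape errorbar expects
```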
Finally we can make the plot.
import matplotlib.pyplot as plt
plt.clf()
plt.errorbar(x = ["0","1"], y = prob,
yerr=[lower, upper], fmt='ok')
## <ErrorbarContainer object of 3 artists>
plt.xlabel('ptd')
plt.ylabel('low')
plt.title('Predicted probabilities of low')
plt.show()