This file is available for download at https://uvastatlab.github.io/phdplus2020/ under the final session Linear Modeling.
<-
)
QUESTIONS? email jcf2d@virginia.edu
# Load packages we'll use today
library(tidyverse)
library(ggeffects)
library(car)
library(splines)
library(stargazer)
Let’s begin by simulating some fake data
x <- 1:25
y <- 10 + 5*x
plot(x, y)
10 is the intercept, 5 is the slope. y is completely determined by x
Let’s add some “noise” to our data by adding random draws from a Normal distribution with mean = 0 and a standard deviation = 10.
set.seed(1)
ensures we all get the same “random” data
set.seed(1)
noise <- rnorm(n = 25, mean = 0, sd = 10)
Add the noise to 10 + 5*x and re-draw plot
y <- 10 + 5*x + noise
plot(x, y)
This data is the combination of two parts:
What if we were given this data and told to determine the process that generated it? In other words, fill in the blanks:
That’s basically what linear modeling is.
Traditional linear modeling assumes the following (among others):
Linear modeling tries to recover the weights in the first assumption (10 and 5) and the standard deviation in the 3rd assumption (10).
Let’s attempt to recover the data generating process. For this we use the lm()
function. We have to specify the formula for the first assumption. The 2nd and 3rd assumptions are built into lm()
.
“y ~ x” means we think Part 1 of the model is "y = intercept + slope*x". This tells lm()
to take our data and find the best intercept and slope. Notice this is the correct model!
mod <- lm(y ~ x)
summary(mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.876 -4.613 2.254 5.909 14.648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.135 3.999 2.784 0.0105 *
## x 5.042 0.269 18.743 1.98e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.7 on 23 degrees of freedom
## Multiple R-squared: 0.9386, Adjusted R-squared: 0.9359
## F-statistic: 351.3 on 1 and 23 DF, p-value: 1.981e-15
The model returns the following estimates:
intercept = 11.135 slope = 5.042 sd = 9.7 (Residual standard error)
Recall the “true” values:
intercept = 10 slope = 5 sd = 10
We could use those estimates to generate data and see if they look similar to our original data.
y2 <- 11.135 + 5.042*x + rnorm(25, 0, 9.7)
# plot original data
plot(x, y)
# plot simulated data using estimates
points(x, y2, col = "red")
Looks similar.
We could also add the original and fitted lines
plot(x, y)
points(x, y2, col = "red")
# a = intercept, b = slope
abline(a = 10, b = 5)
abline(mod, col = "red")
We could also compare distributions using histograms
hist(y, main = "observed data")
hist(y2, main = "data simulated from model")
Or we could compare using density curves (smooth version of histograms)
plot(density(y))
lines(density(y2), col = "red")
In real life, we DO NOT KNOW the formula of weighted sums, or even if a formula of weighted sums is appropriate. We also don’t know if the Normality assumption or constant variance assumption of the noise is plausible.
Again, let’s simulate some data. The function set.seed(15)
ensures we all generate the same data.
set.seed(15)
# random sample of "m" and "f"
gender <- sample(c("m", "f"), size = 200, replace = TRUE)
# random numbers from the range of 1 - 20
score <- runif(n = 200, min = 1, max = 20)
# the noise: random draws from a N(0,5) distribution
noise <- rnorm(n = 200, mean = 0, sd = 5)
Now we simulate y. The third predictor is an “interaction”. That is, the “effect” of score depends on gender.
y <- -1 + -3*(gender=="m") + 2*score + 3*(gender=="m")*score + noise
dat <- data.frame(y, gender, score)
Scatter plot of y versus x, colored by gender. Now ggplot2 comes in handy.
ggplot(dat, aes(x = score, y = y, color = gender)) +
geom_point()
Notice the distribution of y is not Normal. That’s OK. The normality assumption is for the noise/error.
ggplot(dat, aes(x = y, y = stat(density))) +
geom_histogram(bins = 20) +
geom_density()
Let’s try to recover the “true” values in the formula and the SD of the noise.
To fit an interaction, use a colon.
mod2 <- lm(y ~ gender + score + gender:score, data = dat)
# or more concisely
mod2 <- lm(y ~ gender * score, data = dat)
We come pretty close to recovering the true values…
# y <- -1 + -3*(gender=="m") + 2*score + 3*(gender=="m")*score + noise
summary(mod2)
##
## Call:
## lm(formula = y ~ gender * score, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4835 -3.4282 0.1576 3.4301 12.7603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1841 1.3389 0.884 0.37758
## genderm -5.3155 1.6563 -3.209 0.00156 **
## score 1.8227 0.1090 16.726 < 2e-16 ***
## genderm:score 3.2059 0.1372 23.363 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.12 on 196 degrees of freedom
## Multiple R-squared: 0.9631, Adjusted R-squared: 0.9625
## F-statistic: 1704 on 3 and 196 DF, p-value: < 2.2e-16
Let’s simulate data from our model and see how it compares to the original data. For this we can use the simulate()
function
sim_y <- simulate(mod2)
head(sim_y)
## sim_1
## 1 42.19224
## 2 67.73883
## 3 10.33191
## 4 31.29291
## 5 22.11217
## 6 24.98789
plot both y
and sim.y
and compare. This time we use ggplot2.
ggplot(dat, aes(x = y)) +
geom_density() +
geom_density(aes(sim_1), sim_y, color = "red")
Or the base R way
plot(density(dat$y))
lines(density(sim_y$sim_1), col = "red")
By specifying geom_smooth(method = “lm”) in the ggplot2 call we can add the fitted lines.
ggplot(dat, aes(x = score, y = y, color = gender)) +
geom_point() +
geom_abline(intercept = -1, slope = 2) + # true line for "f"
geom_abline(intercept = -4, slope = 5) + # true line for "m"
geom_smooth(method = "lm") # model predicted lines
## `geom_smooth()` using formula 'y ~ x'
Probably better to use the model to create the plot. These are usually referred to as “effect plots”. Here we use the ggpredict()
function from the ggeffects package.
eff <- ggpredict(mod2, terms = c("score", "gender"))
eff
##
## # Predicted values of y
## # x = score
##
## # gender = f
##
## x | Predicted | SE | 95% CI
## --------------------------------------
## 0 | 1.18 | 1.34 | [-1.44, 3.81]
## 5 | 10.30 | 0.87 | [ 8.59, 12.00]
## 10 | 19.41 | 0.56 | [18.32, 20.51]
## 15 | 28.52 | 0.68 | [27.19, 29.86]
## 20 | 37.64 | 1.10 | [35.48, 39.79]
##
## # gender = m
##
## x | Predicted | SE | 95% CI
## --------------------------------------
## 0 | -4.13 | 0.98 | [-6.04, -2.22]
## 5 | 21.01 | 0.65 | [19.74, 22.28]
## 10 | 46.15 | 0.49 | [45.20, 47.11]
## 15 | 71.30 | 0.63 | [70.06, 72.54]
## 20 | 96.44 | 0.96 | [94.57, 98.31]
plot(eff, add.data = TRUE)
Let’s fit a “wrong” model with no interaction.
“y ~ gender + score” means “y = intercept + b1 * gender + b2 * score”
The effect of score is the same for both males and female
mod3 <- lm(y ~ gender + score, data = dat)
summary(mod3)
##
## Call:
## lm(formula = y ~ gender + score, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.137 -7.288 -0.479 6.739 26.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.5267 1.7868 -12.05 <2e-16 ***
## genderm 29.3970 1.4206 20.69 <2e-16 ***
## score 3.8447 0.1285 29.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.935 on 197 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.8588
## F-statistic: 606.3 on 2 and 197 DF, p-value: < 2.2e-16
Compare simulated data to oberserved data; not good!
sim_y <- simulate(mod3)
plot(density(dat$y))
lines(density(sim_y$sim_1), col = "red")
Plot the fitted model using ggpredict
eff <- ggpredict(mod3, terms = c("score", "gender"))
plot(eff, add.data = TRUE)
Hence this is linear modeling:
# submit the following code to simulate some data
set.seed(2)
x1 <- sample(1:5, size = 1000, replace = TRUE,
prob = c(0.1,0.2,0.3,0.3,0.1))
x2 <- rnorm(n = 1000, mean = 12, sd = 2)
noise <- rnorm(n = 1000, mean = 0, sd = 4)
y <- 5 + 10*x1 + -4*x2 + noise
df <- data.frame(y, x1, x2)
head(df)
# Use lm() as we did above to attempt to recover the "true" values that were
# used to generate y.
Recall we (David Martin) downloaded and cleaned over 30,000 records from Albemarle County’s office of Geographic Data Services. The final clean version is available as an RDS file on GitHub. The following code will download the file and read into R. Notice we have to assign the result of readRDS
to a name. This creates an object in our workspace called “homes”.
URL <- "https://github.com/uvastatlab/phdplus2020/raw/master/data/homes_final.rds"
homes <- readRDS(file = url(URL))
glimpse(homes)
## Observations: 30,258
## Variables: 23
## $ yearbuilt <dbl> 1939, 1981, 1999, 1754, 1880, 1769, 1818, 1935, 1...
## $ yearremodeled <dbl> NA, NA, NA, NA, NA, 1988, 1991, 1994, 1978, NA, N...
## $ usecode <chr> "Single Family", "Single Family", "Single Family"...
## $ finsqft <dbl> 1596, 1616, 480, 828, 1224, 4875, 5573, 1120, 145...
## $ cooling <fct> No Central Air, No Central Air, No Central Air, N...
## $ bedroom <dbl> 3, 6, 0, 1, 2, 4, 6, 3, 3, 3, 3, 3, 3, 3, 3, 4, 2...
## $ fullbath <dbl> 1, 4, 0, 0, 1, 3, 4, 1, 1, 3, 2, 2, 3, 3, 2, 1, 1...
## $ halfbath <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0...
## $ totalrooms <dbl> 5, 9, 1, 3, 4, 8, 11, 5, 6, 6, 9, 0, 10, 8, 8, 6,...
## $ lotsize <dbl> 55.955, 5.167, 80.750, 80.718, 65.000, 5.102, 453...
## $ landvalue <dbl> 347900, 140300, 312500, 327100, 497000, 106400, 1...
## $ improvementsvalue <dbl> 80700, 212300, 21600, 13500, 113100, 827000, 8022...
## $ totalvalue <dbl> 428600, 352600, 334100, 340600, 610100, 933400, 2...
## $ lastsaleprice <dbl> 424000, 300000, 0, 0, 0, 0, 0, 2820000, 0, 0, 520...
## $ esdistrict <fct> Broadus Wood, Broadus Wood, Broadus Wood, Broadus...
## $ msdistrict <fct> Jouett, Jouett, Jouett, Jouett, Jouett, Henley, H...
## $ hsdistrict <fct> Albemarle, Albemarle, Albemarle, Albemarle, Albem...
## $ censustract <fct> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101,...
## $ lastsaledate <date> 2003-11-18, 2012-03-23, 1994-06-30, 1994-08-09, ...
## $ age <dbl> 80, 38, 20, 265, 139, 250, 201, 84, 139, 14, 15, ...
## $ condition <fct> Fair, Average, Average, Substandard, Fair, Averag...
## $ remodel <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ fp <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0...
Notice we have a lot of variability in totalvalue of the home.
summary(homes$totalvalue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4300 243500 340500 424524 491300 7859000
hist(homes$totalvalue)
# homes less than $1,000,000
hist(homes$totalvalue[homes$totalvalue < 1e6])
Perhaps we could build a linear model to help us understand that variability and estimate the expected totalvalue of a home given various predictors such as finsqft, censustract, bedrooms, lotsize, cooling, etc. Let’s try that with finsqft.
Below we propose that totalvalue = Intercept + slope * finsqft + noise, where noise is assumed to be drawn from a Normal distribution with mean 0 and some unknown standard deviation. Therefore we are estimating three quantities:
plot(totalvalue ~ finsqft, data = homes)
m1 <- lm(totalvalue ~ finsqft, data = homes)
summary(m1)
##
## Call:
## lm(formula = totalvalue ~ finsqft, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1192704 -85548 -14220 50496 6376635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.472e+05 3.082e+03 -47.76 <2e-16 ***
## finsqft 2.761e+02 1.351e+00 204.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 224600 on 30256 degrees of freedom
## Multiple R-squared: 0.5797, Adjusted R-squared: 0.5797
## F-statistic: 4.173e+04 on 1 and 30256 DF, p-value: < 2.2e-16
coef(m1)
## (Intercept) finsqft
## -147204.5943 276.0597
sigma(m1)
## [1] 224574.9
# can add the fitted line to our plot
abline(m1, col = "blue")
A naive interpretation: - Every additional square foot adds about $276 to the total value of a home.
An informal check of the “goodness” of this model is to use it to generate data and then compare that data to the original data. Below we use simulate
to generate 50 sets of totalvalues from our model. Then we plot the density of the observed totalvalue and overlay it with the densities of the simulations.
The syntax sim1[[i]]
extracts the ith column of the sim1 data frame.
sim1 <- simulate(m1, nsim = 50)
plot(density(homes$totalvalue))
for(i in 1:50)lines(density(sim1[[i]]), col = "red")
# tidyverse method - slower, more complicated
ggplot(homes, aes(x = totalvalue)) +
geom_line(stat = "density") +
geom_line(aes(x = totalvalue, group = sim),
pivot_longer(sim1, everything(),
names_to = "sim",
values_to = "totalvalue"),
stat = "density",
color = "red")
This doesn’t look good.
Recall our assumptions:
Calling plot
on our model object produces a series of plots to assess those last two assumptions.
Let’s see what these plots look like when the model assumptions are satisfied. We simulated the data for mod2
using these assumptions and then fit the correct model.
summary(mod2)
##
## Call:
## lm(formula = y ~ gender * score, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4835 -3.4282 0.1576 3.4301 12.7603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1841 1.3389 0.884 0.37758
## genderm -5.3155 1.6563 -3.209 0.00156 **
## score 1.8227 0.1090 16.726 < 2e-16 ***
## genderm:score 3.2059 0.1372 23.363 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.12 on 196 degrees of freedom
## Multiple R-squared: 0.9631, Adjusted R-squared: 0.9625
## F-statistic: 1704 on 3 and 196 DF, p-value: < 2.2e-16
plot(mod2)
How to interpret plots
Residuals vs Fitted: should have a horizontal line with uniform and symmertic scatter of points; if not, evidence that SD is not constant
Normal Q-Q: points should lie close to diagonal line; if not, evidence that noise is not drawn from N(0, SD)
Scale-Location: should have a horizontal line with uniform scatter of point; (similar to #1 but easier to detect trend in dispersion)
Residuals vs Leverage: points outside the contour lines are influential observations
Let’s look at the diagnostic plots for our homes model.
plot(m1)
Plots 1, 2, and 3 are very concerning
How to address concerns?
Non-constant SD can be evidence of a mispecified model or a very skewed response where some values are orders of magnitude larger than the rest of the data. Notice that our response is quite skewed:
hist(homes$totalvalue)
We could try transforming totalvalue to a different scale. A common transformation is a log transformation. This certainly looks more symmetric but definitely has some outlying observations.
hist(log(homes$totalvalue))
Let’s try modeling log-transformed totalvalue and checking our diagnostic plots.
m2 <- lm(log(totalvalue) ~ finsqft, data = homes)
plot(m2)
The plots look better but there is still evidence of non-constant variance. We may need to include other predictors in our model.
Let’s also simulate some data from the model and compare to the observed totalvalue:
sim_tv <- simulate(m2, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim_tv[[i]]), col = "red")
Not bad.
But first how to interpret what we have at the moment?
coef(m2) %>% round(4)
## (Intercept) finsqft
## 11.7213 0.0005
The response is log transformed, so interpretation of coefficients changes. We first need to exponentiate and then interpret as multiplicative instead of additive effect.
exp(coef(m2)) %>% round(4)
## (Intercept) finsqft
## 123168.2092 1.0005
Naive interpretation - each additional finished square foot increases price by 0.05%
We can (and should) look at the confidence interval of the estimates
exp(confint(m2)) %>% round(5)
## 2.5 % 97.5 %
## (Intercept) 122029.7943 124317.24430
## finsqft 1.0005 1.00051
Let’s examine the summary output in detail:
summary(m2)
##
## Call:
## lm(formula = log(totalvalue) ~ finsqft, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5973 -0.1611 0.0130 0.1719 3.0423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.172e+01 4.738e-03 2474.1 <2e-16 ***
## finsqft 5.049e-04 2.077e-06 243.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3452 on 30256 degrees of freedom
## Multiple R-squared: 0.6613, Adjusted R-squared: 0.6613
## F-statistic: 5.908e+04 on 1 and 30256 DF, p-value: < 2.2e-16
Call:
lm(formula = log(totalvalue) ~ finsqft, data = homes)
Repeat of the function call. Useful if result is saved and then printed later.
Residuals:
Min 1Q Median 3Q Max
-3.5973 -0.1611 0.0130 0.1719 3.0423
Quick check of the distributional (Normal) assumptions of residuals. Median should not be far from 0. Max and Min, and 1Q and 3Q, should be roughly equal in absolute value.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.172e+01 4.738e-03 2474.1 <2e-16 ***
finsqft 5.049e-04 2.077e-06 243.1 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3452 on 30256 degrees of freedom
Multiple R-squared: 0.6613, Adjusted R-squared: 0.6613
F-statistic: 5.908e+04 on 1 and 30256 DF, p-value: < 2.2e-16
m2
to include the lotsize variable, which is the size of the lot in acres on which the house is built. Call the model m3
.simulate
) and compare to the observed totalvalue. TIP: copy and paste the code used previously and update it to use m3
# update model m2 to include lotsize; save as m3
m3 <- lm()
# Interptation of lotsize coefficient:
# check diagnostics
# What do you notice about the last plot?
# OPTIONAL: simulate data and compare to observed data
Let’s include censustract in our model. This is a categorical variable.
Categorical predictors are added to models in the form of a contrast, which is a matrix of zeroes and ones. A contrast can take many forms, but the default contrast in R is the treatment contrast. One level of a categorical variable is designated as the baseline or reference level and all other levels are estimated in contrast to the reference level. Therefore if a categorical variable has k levels, the model will have k - 1 coefficients.
summary(homes$censustract)
## 101 102.01 102.02 103 104.01 104.02 105 106.01 106.02 107 108
## 1768 1925 1188 2751 1506 1391 1406 1359 1453 1205 1055
## 109.02 110 111 112.01 112.02 113.01 113.02 113.03 114
## 57 2380 3375 1385 1437 1737 293 1055 1532
plot(homes$censustract)
m4 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract, data = homes)
summary(m4)
##
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract,
## data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1120 -0.1402 0.0097 0.1596 3.1575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.183e+01 8.420e-03 1405.208 < 2e-16 ***
## finsqft 4.523e-04 1.987e-06 227.605 < 2e-16 ***
## lotsize 4.516e-03 8.368e-05 53.962 < 2e-16 ***
## censustract102.01 -8.985e-02 1.007e-02 -8.921 < 2e-16 ***
## censustract102.02 1.028e-01 1.149e-02 8.948 < 2e-16 ***
## censustract103 3.093e-03 9.359e-03 0.330 0.74103
## censustract104.01 -1.358e-01 1.067e-02 -12.735 < 2e-16 ***
## censustract104.02 4.974e-03 1.101e-02 0.452 0.65146
## censustract105 5.151e-02 1.093e-02 4.711 2.48e-06 ***
## censustract106.01 5.104e-02 1.104e-02 4.622 3.82e-06 ***
## censustract106.02 9.557e-03 1.085e-02 0.881 0.37853
## censustract107 -1.939e-01 1.151e-02 -16.855 < 2e-16 ***
## censustract108 2.153e-02 1.189e-02 1.811 0.07022 .
## censustract109.02 2.441e-01 4.095e-02 5.962 2.52e-09 ***
## censustract110 2.802e-01 9.688e-03 28.927 < 2e-16 ***
## censustract111 -4.839e-03 9.008e-03 -0.537 0.59115
## censustract112.01 -9.269e-02 1.092e-02 -8.485 < 2e-16 ***
## censustract112.02 -2.687e-03 1.085e-02 -0.248 0.80442
## censustract113.01 -1.260e-01 1.032e-02 -12.209 < 2e-16 ***
## censustract113.02 -5.746e-02 1.927e-02 -2.983 0.00286 **
## censustract113.03 2.548e-02 1.193e-02 2.137 0.03264 *
## censustract114 -4.910e-01 1.067e-02 -46.022 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3041 on 30236 degrees of freedom
## Multiple R-squared: 0.7373, Adjusted R-squared: 0.7371
## F-statistic: 4040 on 21 and 30236 DF, p-value: < 2.2e-16
Every censustract coefficient is in comparison to censustract 101. It got designated as the reference level since it came first numerically.
As a whole does censustract contribute to the model? We can’t (or shouldn’t) tell just by looking at all the p-values. The formal method for evaluating if a variable contributes to a model is the partial F test. We can easily run that in R with the anova()
function.
anova(m4)
## Analysis of Variance Table
##
## Response: log(totalvalue)
## Df Sum Sq Mean Sq F value Pr(>F)
## finsqft 1 7039.4 7039.4 76105.35 < 2.2e-16 ***
## lotsize 1 174.5 174.5 1886.41 < 2.2e-16 ***
## censustract 19 633.8 33.4 360.65 < 2.2e-16 ***
## Residuals 30236 2796.7 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This says that with finsqft and lotsize already in the model, censustract appears to help explain or account for the remaining variability.
Another equivalent way is to use anova to compare the new model with censustract to the previous model (m3) without censustract
m3 <- lm(log(totalvalue) ~ finsqft + lotsize, data = homes)
anova(m3, m4)
## Analysis of Variance Table
##
## Model 1: log(totalvalue) ~ finsqft + lotsize
## Model 2: log(totalvalue) ~ finsqft + lotsize + censustract
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30255 3430.5
## 2 30236 2796.7 19 633.81 360.65 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice the F statistic is the same.
How do the diagnostic plots and simulated data look?
plot(m4)
sim_tv <- simulate(m4, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim_tv[[i]]), col = "red")
How to interpret?
exp(coef(m4)) %>% round(4)
## (Intercept) finsqft lotsize censustract102.01
## 137632.3236 1.0005 1.0045 0.9141
## censustract102.02 censustract103 censustract104.01 censustract104.02
## 1.1082 1.0031 0.8730 1.0050
## censustract105 censustract106.01 censustract106.02 censustract107
## 1.0529 1.0524 1.0096 0.8237
## censustract108 censustract109.02 censustract110 censustract111
## 1.0218 1.2765 1.3234 0.9952
## censustract112.01 censustract112.02 censustract113.01 censustract113.02
## 0.9115 0.9973 0.8816 0.9442
## censustract113.03 censustract114
## 1.0258 0.6120
The censustract coefficients are in contrast to censustract 101. Therefore…
Are these differences significant? Check the model summary:
summary(m4)
##
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract,
## data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1120 -0.1402 0.0097 0.1596 3.1575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.183e+01 8.420e-03 1405.208 < 2e-16 ***
## finsqft 4.523e-04 1.987e-06 227.605 < 2e-16 ***
## lotsize 4.516e-03 8.368e-05 53.962 < 2e-16 ***
## censustract102.01 -8.985e-02 1.007e-02 -8.921 < 2e-16 ***
## censustract102.02 1.028e-01 1.149e-02 8.948 < 2e-16 ***
## censustract103 3.093e-03 9.359e-03 0.330 0.74103
## censustract104.01 -1.358e-01 1.067e-02 -12.735 < 2e-16 ***
## censustract104.02 4.974e-03 1.101e-02 0.452 0.65146
## censustract105 5.151e-02 1.093e-02 4.711 2.48e-06 ***
## censustract106.01 5.104e-02 1.104e-02 4.622 3.82e-06 ***
## censustract106.02 9.557e-03 1.085e-02 0.881 0.37853
## censustract107 -1.939e-01 1.151e-02 -16.855 < 2e-16 ***
## censustract108 2.153e-02 1.189e-02 1.811 0.07022 .
## censustract109.02 2.441e-01 4.095e-02 5.962 2.52e-09 ***
## censustract110 2.802e-01 9.688e-03 28.927 < 2e-16 ***
## censustract111 -4.839e-03 9.008e-03 -0.537 0.59115
## censustract112.01 -9.269e-02 1.092e-02 -8.485 < 2e-16 ***
## censustract112.02 -2.687e-03 1.085e-02 -0.248 0.80442
## censustract113.01 -1.260e-01 1.032e-02 -12.209 < 2e-16 ***
## censustract113.02 -5.746e-02 1.927e-02 -2.983 0.00286 **
## censustract113.03 2.548e-02 1.193e-02 2.137 0.03264 *
## censustract114 -4.910e-01 1.067e-02 -46.022 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3041 on 30236 degrees of freedom
## Multiple R-squared: 0.7373, Adjusted R-squared: 0.7371
## F-statistic: 4040 on 21 and 30236 DF, p-value: < 2.2e-16
Use the relevel() function to set a new reference level Let’s set the reference level to “109.02”
homes$censustract <- relevel(homes$censustract, ref = "109.02")
m4 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract, data = homes)
summary(m4)
##
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract,
## data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1120 -0.1402 0.0097 0.1596 3.1575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.208e+01 4.051e-02 298.095 < 2e-16 ***
## finsqft 4.523e-04 1.987e-06 227.605 < 2e-16 ***
## lotsize 4.516e-03 8.368e-05 53.962 < 2e-16 ***
## censustract101 -2.441e-01 4.095e-02 -5.962 2.52e-09 ***
## censustract102.01 -3.340e-01 4.088e-02 -8.169 3.22e-16 ***
## censustract102.02 -1.414e-01 4.125e-02 -3.427 0.000612 ***
## censustract103 -2.410e-01 4.070e-02 -5.922 3.22e-09 ***
## censustract104.01 -3.800e-01 4.105e-02 -9.256 < 2e-16 ***
## censustract104.02 -2.391e-01 4.111e-02 -5.817 6.05e-09 ***
## censustract105 -1.926e-01 4.109e-02 -4.687 2.78e-06 ***
## censustract106.01 -1.931e-01 4.112e-02 -4.695 2.67e-06 ***
## censustract106.02 -2.346e-01 4.107e-02 -5.711 1.13e-08 ***
## censustract107 -4.380e-01 4.125e-02 -10.619 < 2e-16 ***
## censustract108 -2.226e-01 4.136e-02 -5.382 7.42e-08 ***
## censustract110 3.612e-02 4.078e-02 0.886 0.375800
## censustract111 -2.490e-01 4.062e-02 -6.129 8.97e-10 ***
## censustract112.01 -3.368e-01 4.112e-02 -8.191 2.68e-16 ***
## censustract112.02 -2.468e-01 4.108e-02 -6.008 1.90e-09 ***
## censustract113.01 -3.702e-01 4.096e-02 -9.037 < 2e-16 ***
## censustract113.02 -3.016e-01 4.405e-02 -6.846 7.72e-12 ***
## censustract113.03 -2.186e-01 4.137e-02 -5.285 1.27e-07 ***
## censustract114 -7.351e-01 4.106e-02 -17.902 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3041 on 30236 degrees of freedom
## Multiple R-squared: 0.7373, Adjusted R-squared: 0.7371
## F-statistic: 4040 on 21 and 30236 DF, p-value: < 2.2e-16
anova(m4)
## Analysis of Variance Table
##
## Response: log(totalvalue)
## Df Sum Sq Mean Sq F value Pr(>F)
## finsqft 1 7039.4 7039.4 76105.35 < 2.2e-16 ***
## lotsize 1 174.5 174.5 1886.41 < 2.2e-16 ***
## censustract 19 633.8 33.4 360.65 < 2.2e-16 ***
## Residuals 30236 2796.7 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# reset levels back to previous order with "101" first
homes$censustract <- factor(as.character(homes$censustract))
m4
and save it as m5
.# Add the cooling variable to model `m4` and save it as `m5`.
m5 <- lm()
# Does it help the model?
# What is the interpretation of the cooling coefficient?
Often it isn’t reasonable to expect the effect of one variable to be the same regardless of other variables. For example, the effect of finsqft may depend on censustract. Or the effect of lotsize may depend on finsqft.
Use the colon to specify interactions on an individual basis.
Example: finsqft:censustract Example: finsqft:lotsize
Does the effect of finsqft depend on censustract? Does the effect of finsqft depend on lotsize?
m6 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract + cooling +
finsqft:censustract + finsqft:lotsize,
data = homes)
summary(m6)
##
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract +
## cooling + finsqft:censustract + finsqft:lotsize, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2210 -0.1409 0.0027 0.1500 3.3844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.163e+01 1.540e-02 755.166 < 2e-16 ***
## finsqft 4.436e-04 6.346e-06 69.900 < 2e-16 ***
## lotsize 8.641e-03 1.355e-04 63.775 < 2e-16 ***
## censustract102.01 -1.180e-01 2.222e-02 -5.309 1.11e-07 ***
## censustract102.02 2.029e-01 2.623e-02 7.735 1.07e-14 ***
## censustract103 4.532e-02 2.428e-02 1.866 0.062045 .
## censustract104.01 -2.104e-01 2.167e-02 -9.705 < 2e-16 ***
## censustract104.02 6.680e-02 2.475e-02 2.699 0.006953 **
## censustract105 7.831e-02 2.724e-02 2.874 0.004052 **
## censustract106.01 2.082e-01 2.896e-02 7.188 6.74e-13 ***
## censustract106.02 -1.657e-01 2.701e-02 -6.136 8.59e-10 ***
## censustract107 -3.593e-02 3.536e-02 -1.016 0.309587
## censustract108 -2.072e-02 2.637e-02 -0.786 0.431923
## censustract109.02 5.477e-01 1.434e-01 3.821 0.000133 ***
## censustract110 3.590e-01 2.124e-02 16.901 < 2e-16 ***
## censustract111 -9.553e-02 2.061e-02 -4.635 3.58e-06 ***
## censustract112.01 -1.959e-01 2.258e-02 -8.677 < 2e-16 ***
## censustract112.02 -3.515e-02 2.337e-02 -1.504 0.132650
## censustract113.01 -1.526e-01 2.231e-02 -6.842 7.95e-12 ***
## censustract113.02 2.682e-02 5.436e-02 0.493 0.621713
## censustract113.03 -1.187e-02 4.043e-02 -0.294 0.769024
## censustract114 -5.333e-01 2.203e-02 -24.207 < 2e-16 ***
## coolingCentral Air 2.742e-01 5.641e-03 48.604 < 2e-16 ***
## finsqft:censustract102.01 9.641e-07 9.997e-06 0.096 0.923172
## finsqft:censustract102.02 -4.952e-05 9.794e-06 -5.057 4.29e-07 ***
## finsqft:censustract103 -4.440e-05 1.130e-05 -3.930 8.50e-05 ***
## finsqft:censustract104.01 4.148e-05 8.953e-06 4.633 3.61e-06 ***
## finsqft:censustract104.02 -3.481e-05 9.180e-06 -3.792 0.000150 ***
## finsqft:censustract105 -2.810e-05 1.137e-05 -2.471 0.013496 *
## finsqft:censustract106.01 -1.029e-04 1.344e-05 -7.657 1.96e-14 ***
## finsqft:censustract106.02 6.919e-05 1.289e-05 5.370 7.94e-08 ***
## finsqft:censustract107 -1.449e-04 2.247e-05 -6.447 1.16e-10 ***
## finsqft:censustract108 1.522e-06 1.168e-05 0.130 0.896275
## finsqft:censustract109.02 -1.584e-04 6.375e-05 -2.485 0.012969 *
## finsqft:censustract110 -3.983e-05 7.989e-06 -4.986 6.20e-07 ***
## finsqft:censustract111 3.277e-05 8.805e-06 3.721 0.000198 ***
## finsqft:censustract112.01 6.592e-05 9.826e-06 6.709 2.00e-11 ***
## finsqft:censustract112.02 7.912e-06 9.101e-06 0.869 0.384621
## finsqft:censustract113.01 1.180e-05 1.080e-05 1.093 0.274458
## finsqft:censustract113.02 -6.905e-05 3.491e-05 -1.978 0.047937 *
## finsqft:censustract113.03 -7.817e-06 2.273e-05 -0.344 0.730884
## finsqft:censustract114 5.104e-05 1.085e-05 4.703 2.57e-06 ***
## finsqft:lotsize -1.244e-06 3.531e-08 -35.235 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2851 on 30215 degrees of freedom
## Multiple R-squared: 0.7693, Adjusted R-squared: 0.769
## F-statistic: 2399 on 42 and 30215 DF, p-value: < 2.2e-16
Is the interaction “significant”? In other words, does it make the model better? Does the interaction make the model better at explaining the variability in totalvalue. Again the partial F test can help us assess this.
The car package’s Anova
function may be preferred in this case. It assesses an interactions contribution with all other predictors in the model. The base R anova
function is sequential. It assesses the contribution of a predictor given that other predictors before it are already in the model.
anova(m6) # Type I Sum of Squares
## Analysis of Variance Table
##
## Response: log(totalvalue)
## Df Sum Sq Mean Sq F value Pr(>F)
## finsqft 1 7039.4 7039.4 86617.890 < 2.2e-16 ***
## lotsize 1 174.5 174.5 2146.979 < 2.2e-16 ***
## censustract 19 633.8 33.4 410.467 < 2.2e-16 ***
## cooling 1 209.1 209.1 2572.385 < 2.2e-16 ***
## finsqft:censustract 19 31.2 1.6 20.191 < 2.2e-16 ***
## finsqft:lotsize 1 100.9 100.9 1241.527 < 2.2e-16 ***
## Residuals 30215 2455.6 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(m6) # Type II Sum of Squares
## Anova Table (Type II tests)
##
## Response: log(totalvalue)
## Sum Sq Df F value Pr(>F)
## finsqft 3877.2 1 47708.102 < 2.2e-16 ***
## lotsize 289.8 1 3566.453 < 2.2e-16 ***
## censustract 523.9 19 339.272 < 2.2e-16 ***
## cooling 192.0 1 2362.371 < 2.2e-16 ***
## finsqft:censustract 40.5 19 26.209 < 2.2e-16 ***
## finsqft:lotsize 100.9 1 1241.527 < 2.2e-16 ***
## Residuals 2455.6 30215
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Either way, the interactions seem to be warranted.
How do we interpret? This is where effect plots help. Here we use ggpredict()
.
plot(ggpredict(m6, terms = c("finsqft", "lotsize")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
We can set our own values for lotsize
plot(ggpredict(m6, terms = c("finsqft", "lotsize[0.5, 10, 100]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
Format the labels to show dollar amounts
plot(ggpredict(m6, terms = c("finsqft", "lotsize[0.5, 10, 100]")),
labels = scales::dollar)
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
The effect of finsqft decreases as lotsize gets larger (around 100 acres). Also notice most of the plot is devoted to the handful of homes over 2.5 million dollars. If we like, we can zoom in on the plot and look at totalvalues from 0 to $1,000,000. We don’t see the interaction when zoomed in. The lines look roughly parallel.
plot(ggpredict(m6, terms = c("finsqft", "lotsize[0.5, 10, 100]")),
labels = scales::dollar) +
coord_cartesian(ylim = c(0, 1e6))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
Switch the order of the terms to put lotsize on the x-axis
plot(ggpredict(m6, terms = c("lotsize[0:200]", "finsqft[1500, 3000, 4500]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
The effect of lotsize seems more pronounced for smaller homes.
Effect plots for censustract is more difficult because there are so many levels (20). Therefore we need to look at a few at a time.
plot(ggpredict(m6, terms = c("finsqft[0:5000]",
"censustract[113.01, 113.02, 113.03]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
plot(ggpredict(m6, terms = c("finsqft[0:5000]",
"censustract[101, 107, 108, 110, 111]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
The ggpredict() result can be saved and used as a data frame with ggplot
eff_out <- ggpredict(m6, terms = c("finsqft[0:5000]",
"censustract[101, 107, 108, 110, 111]"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
names(eff_out)
## [1] "x" "predicted" "std.error" "conf.low" "conf.high" "group"
eff_out <- eff_out %>%
rename(`Census Tract` = group)
ggplot(eff_out, aes(x = x, y = predicted, color = `Census Tract`)) +
geom_line() +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = `Census Tract`),
alpha = 1/5) +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Finished Square Feet")
When should you include interactions? What kind? How many? That requires some thought and expertise in the subject.
m6
and save as m7
.# Add an interaction for cooling and finsqft to `m6` and save as `m7`.
m7 <- lm()
# Is the interaction significant?
# Create an effect plot to visualize the interaction of cooling and finsqft (or
# lack thereof).
Often the simple assumption of a linear effect of a predictor is unrealistic. Fortunately there are ways to fit non-linear effects in a linear model.
Let’s demostrate with some simulated data
# polynomial of degree 2
x <- seq(from = -10, to = 10, length.out = 100)
set.seed(3)
y <- 1.2 + 2*x + 0.9*x^2 + rnorm(100, sd = 10)
nl_dat <- data.frame(y, x)
plot(x, y)
Fit a polynomial model with the poly()
function
mod4 <- lm(y ~ poly(x, 2), data = nl_dat)
summary(mod4)
##
## Call:
## lm(formula = y ~ poly(x, 2), data = nl_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.3937 -6.1342 0.3919 7.0327 16.3910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.9164 0.8554 37.31 <2e-16 ***
## poly(x, 2)1 126.9080 8.5539 14.84 <2e-16 ***
## poly(x, 2)2 266.4715 8.5539 31.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.554 on 97 degrees of freedom
## Multiple R-squared: 0.9247, Adjusted R-squared: 0.9231
## F-statistic: 595.3 on 2 and 97 DF, p-value: < 2.2e-16
Notice the coefficients are quite large. That’s because poly() generates orthogonal polynomials by default. That makes for numerical stability. Interpretation of coefficients isn’t really worth the effort.
We could fit the raw polynomials as follows and sort of “recover” the true values.
mod5 <- lm(y ~ poly(x, 2, raw = TRUE), data = nl_dat)
summary(mod5)
##
## Call:
## lm(formula = y ~ poly(x, 2, raw = TRUE), data = nl_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.3937 -6.1342 0.3919 7.0327 16.3910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.11953 1.28319 1.652 0.102
## poly(x, 2, raw = TRUE)1 2.17624 0.14668 14.836 <2e-16 ***
## poly(x, 2, raw = TRUE)2 0.87621 0.02813 31.152 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.554 on 97 degrees of freedom
## Multiple R-squared: 0.9247, Adjusted R-squared: 0.9231
## F-statistic: 595.3 on 2 and 97 DF, p-value: < 2.2e-16
Use plot() and ggpredict() to plot fitted model with data
plot(ggpredict(mod4), add.data = TRUE)
## $x
plot(ggpredict(mod5), add.data = TRUE)
## $x
The modern approach to fitting non-linear effects is to use splines instead of polynomials.
Instead of poly()
we can use the ns()
function from the splines package. ns stands for natural splines. The second argument is the degrees of freedom. It may help to think of that as the number of times the smooth line changes directions.
mod6 <- lm(y ~ ns(x, df = 2), data = nl_dat)
summary(mod6)
##
## Call:
## lm(formula = y ~ ns(x, df = 2), data = nl_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.0764 -6.7423 0.1437 7.4876 18.2382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.746 2.420 25.92 <2e-16 ***
## ns(x, df = 2)1 -77.728 5.433 -14.31 <2e-16 ***
## ns(x, df = 2)2 91.189 2.959 30.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.642 on 97 degrees of freedom
## Multiple R-squared: 0.9231, Adjusted R-squared: 0.9215
## F-statistic: 582.2 on 2 and 97 DF, p-value: < 2.2e-16
plot(ggpredict(mod6), add.data = TRUE)
## $x
Natural splines essentially allow us to fit a series of cubic polynomials connected at knots located in the range of our data.
The ns() function transforms our data similar to a polynomial transformation.
# simple example data
z <- 1:10
# classic polynomial transformation
p2_out <- poly(z, 2, raw = TRUE)
p2_out
## 1 2
## [1,] 1 1
## [2,] 2 4
## [3,] 3 9
## [4,] 4 16
## [5,] 5 25
## [6,] 6 36
## [7,] 7 49
## [8,] 8 64
## [9,] 9 81
## [10,] 10 100
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly" "matrix"
matplot(poly(z, 2, raw = TRUE))
cor(p2_out[,1], p2_out[,2])
## [1] 0.9745586
# orthogonal, or uncorrelated, transformation
p2_out <- poly(z, 2)
p2_out
## 1 2
## [1,] -0.49543369 0.52223297
## [2,] -0.38533732 0.17407766
## [3,] -0.27524094 -0.08703883
## [4,] -0.16514456 -0.26111648
## [5,] -0.05504819 -0.34815531
## [6,] 0.05504819 -0.34815531
## [7,] 0.16514456 -0.26111648
## [8,] 0.27524094 -0.08703883
## [9,] 0.38533732 0.17407766
## [10,] 0.49543369 0.52223297
## attr(,"coefs")
## attr(,"coefs")$alpha
## [1] 5.5 5.5
##
## attr(,"coefs")$norm2
## [1] 1.0 10.0 82.5 528.0
##
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly" "matrix"
matplot(poly(z, 2))
cor(p2_out[,1], p2_out[,2])
## [1] -3.317659e-17
# natural spline transformation
ns_out <- ns(z, 2)
ns_out
## 1 2
## [1,] 0.0000000 0.00000000
## [2,] 0.1674747 -0.10982084
## [3,] 0.3219712 -0.20001558
## [4,] 0.4505112 -0.25095810
## [5,] 0.5401165 -0.24302231
## [6,] 0.5783495 -0.15739986
## [7,] 0.5652102 0.00590925
## [8,] 0.5131362 0.22809668
## [9,] 0.4351057 0.48953631
## [10,] 0.3440969 0.77060206
## attr(,"degree")
## [1] 3
## attr(,"knots")
## 50%
## 5.5
## attr(,"Boundary.knots")
## [1] 1 10
## attr(,"intercept")
## [1] FALSE
## attr(,"class")
## [1] "ns" "basis" "matrix"
matplot(ns(z, 2))
cor(ns_out[,1], ns_out[,2])
## [1] -0.04882638
Frank Harrell states in his book Regression Model Strategies that 3 to 5 DF is almost always sufficient. His basic advice is to allocate more DF to variables you think are more important.
How can we assess whether we need a non-linear effect? Partial residual plots can help. Also called term plots and component-residual plots. We’ll use the crPlots()
function from the car package.
Partial-residual plots show the relationship between a predictor and the response variable given that the other predictors are also in the model.
Let’s fit a model with no interactions. crPlot()
does not work for models with interactions.
m8 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract + cooling,
data = homes)
Notice we specify the numeric variables just as we would in a model.
The blue dashed line is the fitted slope.
The purple line is a smooth trend line.
A curving purple line indicates a non-linear effect may be justified.
crPlots(m8, terms = ~ finsqft)
crPlots(m8, terms = ~ lotsize)
# use a different smoother for the smooth trend line
crPlots(m8, terms = ~ lotsize, smooth=list(smoother=gamLine))
It appears both finsqft and lotsize may be affecting totalvalue in a non-linear fashion. Let’s fit a non-linear effect for finsqft using a natural spline with 3 DF.
m9 <- lm(log(totalvalue) ~ ns(finsqft, df = 3) + lotsize + censustract + cooling,
data = homes)
summary(m9)
##
## Call:
## lm(formula = log(totalvalue) ~ ns(finsqft, df = 3) + lotsize +
## censustract + cooling, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9955 -0.1430 -0.0029 0.1473 3.2812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.141e+01 1.697e-02 672.141 < 2e-16 ***
## ns(finsqft, df = 3)1 1.989e+00 1.248e-02 159.405 < 2e-16 ***
## ns(finsqft, df = 3)2 3.660e+00 4.318e-02 84.764 < 2e-16 ***
## ns(finsqft, df = 3)3 2.908e+00 4.584e-02 63.447 < 2e-16 ***
## lotsize 5.082e-03 7.916e-05 64.193 < 2e-16 ***
## censustract102.01 -1.308e-01 9.499e-03 -13.768 < 2e-16 ***
## censustract102.02 6.066e-02 1.080e-02 5.617 1.96e-08 ***
## censustract103 -7.335e-02 8.904e-03 -8.237 < 2e-16 ***
## censustract104.01 -1.280e-01 1.001e-02 -12.788 < 2e-16 ***
## censustract104.02 -2.734e-02 1.035e-02 -2.640 0.00828 **
## censustract105 -1.710e-02 1.032e-02 -1.657 0.09762 .
## censustract106.01 -2.368e-02 1.046e-02 -2.265 0.02350 *
## censustract106.02 -5.663e-02 1.027e-02 -5.512 3.57e-08 ***
## censustract107 -2.408e-01 1.095e-02 -21.991 < 2e-16 ***
## censustract108 -3.465e-02 1.123e-02 -3.084 0.00204 **
## censustract109.02 1.588e-01 3.844e-02 4.130 3.63e-05 ***
## censustract110 2.463e-01 9.113e-03 27.028 < 2e-16 ***
## censustract111 -5.451e-02 8.491e-03 -6.420 1.39e-10 ***
## censustract112.01 -7.046e-02 1.025e-02 -6.871 6.51e-12 ***
## censustract112.02 -2.694e-02 1.019e-02 -2.644 0.00821 **
## censustract113.01 -1.341e-01 9.702e-03 -13.825 < 2e-16 ***
## censustract113.02 -7.052e-02 1.811e-02 -3.894 9.89e-05 ***
## censustract113.03 -5.021e-02 1.134e-02 -4.429 9.51e-06 ***
## censustract114 -4.503e-01 1.003e-02 -44.904 < 2e-16 ***
## coolingCentral Air 2.238e-01 5.837e-03 38.338 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2853 on 30233 degrees of freedom
## Multiple R-squared: 0.7688, Adjusted R-squared: 0.7686
## F-statistic: 4189 on 24 and 30233 DF, p-value: < 2.2e-16
The coefficients are impossible to interpret. Effect plots are our only hope.
plot(ggpredict(m9, terms = "finsqft"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
The partial-residual plot looks better
crPlots(m9, ~ns(finsqft, df = 3))
m9
to include a non-linear effect for lotsize using a natural spline with 5 DF and save the new model as m10
.m10
.# Update model `m9` to include a non-linear effect for lotsize using a natural
# spline with 5 DF and save the new model as `m10`.
m10 <- lm()
# Generate an effect plot for the non-linear lotsize effect.
# Check the partial-residual plot for lotsize in model `m10`.
How do we determine if one model is better than another?
One way is through hypothesis testing with the partial F test.
NULL: two models are equally “good”
ALTERNATIVE: the more complex model is better
Note: the smaller model must be a subset of the larger model!
We can run this test with the base R anova()
function.
Example:
# The model from the previous exercise:
m10 <- lm(log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) +
censustract + cooling,
data = homes)
# a more complex model with an interactionvbetween ns(finsqft, df = 3) and
# ns(lotsize, df = 5)
m11 <- lm(log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) +
censustract + cooling +
ns(finsqft, df = 3):ns(lotsize, df = 5),
data = homes)
Test that both models are equally good at modeling totalvalue. Notice that m10
is a subset of m11
anova(m10, m11)
## Analysis of Variance Table
##
## Model 1: log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) +
## censustract + cooling
## Model 2: log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) +
## censustract + cooling + ns(finsqft, df = 3):ns(lotsize, df = 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30229 2286.6
## 2 30214 2215.5 15 71.039 64.587 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It appears m11
is superior to m10
The low p-value says we reject the null; the bigger model appears to be better than the smaller model.
We can also use the AIC or BIC information criteria. These are not hypothesis tests. We simply see which has a lower value. These values estimate the out-of-sample accuracy if we were to use these models to make predictions on new data. Simply use the AIC
and/or BIC
functions with the model objects. The main difference between the two is that BIC imposes a heavier penalty for models with more predictors.
AIC(m10, m11)
## df AIC
## m10 30 7780.731
## m11 45 6855.754
BIC(m10, m11)
## df BIC
## m10 30 8030.257
## m11 45 7230.042
The nice thing about AIC/BIC is that we don’t need to worry about whether the models contain a subset of predictors. They just need to have the same response variable.
While lower is better, sometimes it’s hard to know how much lower AIC/BIC needs to be. If a really complex model is only slightly lower than an easier to understand smaller model, the smaller model may be preferable.
anova
, AIC
and BIC
. Which is better?# Fit the following three models, each with increasing levels of complexity in
# the splines.
mod_a <- lm(log(totalvalue) ~ ns(finsqft, df = 3) * ns(lotsize, df = 3) +
censustract + cooling,
data = homes)
mod_b <- lm(log(totalvalue) ~ ns(finsqft, df = 4) * ns(lotsize, df = 4) +
censustract + cooling,
data = homes)
mod_c <- lm(log(totalvalue) ~ ns(finsqft, df = 5) * ns(lotsize, df = 5) +
censustract + cooling,
data = homes)
# Compare the models using `anova`, `AIC` and `BIC`. Which is better?
Once we have a linear model we may want to use it to make predictions for new data. We can do that with the predict()
function. Note the new data needs to be in a data frame.
Let’s say we want to use this model:
m12 <- lm(log(totalvalue) ~ ns(finsqft, df = 5) * ns(lotsize, df = 5) +
censustract * ns(finsqft, df = 5) + cooling,
data = homes)
summary(m12)
##
## Call:
## lm(formula = log(totalvalue) ~ ns(finsqft, df = 5) * ns(lotsize,
## df = 5) + censustract * ns(finsqft, df = 5) + cooling, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9684 -0.1316 0.0012 0.1377 3.3834
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 11.399136 0.152451 74.772
## ns(finsqft, df = 5)1 0.870256 0.141428 6.153
## ns(finsqft, df = 5)2 1.231419 0.162838 7.562
## ns(finsqft, df = 5)3 1.892883 0.178687 10.593
## ns(finsqft, df = 5)4 3.959110 0.731198 5.415
## ns(finsqft, df = 5)5 4.704075 1.331463 3.533
## ns(lotsize, df = 5)1 -1.588311 0.171270 -9.274
## ns(lotsize, df = 5)2 -0.123545 0.131748 -0.938
## ns(lotsize, df = 5)3 3.622165 0.801614 4.519
## ns(lotsize, df = 5)4 7.487665 2.991355 2.503
## ns(lotsize, df = 5)5 14.388095 6.533757 2.202
## censustract102.01 1.233490 0.206541 5.972
## censustract102.02 1.058578 0.232722 4.549
## censustract103 0.090532 0.203144 0.446
## censustract104.01 0.112442 0.126228 0.891
## censustract104.02 0.027893 0.226213 0.123
## censustract105 0.530627 0.226084 2.347
## censustract106.01 -0.302680 0.335961 -0.901
## censustract106.02 1.938215 0.225934 8.579
## censustract107 1.388717 0.277131 5.011
## censustract108 1.512674 0.282092 5.362
## censustract109.02 -65.652267 130.495842 -0.503
## censustract110 0.879675 0.173304 5.076
## censustract111 0.636620 0.146789 4.337
## censustract112.01 0.032162 0.120537 0.267
## censustract112.02 0.012006 0.159288 0.075
## censustract113.01 -0.004298 0.134484 -0.032
## censustract113.02 2.543222 0.325047 7.824
## censustract113.03 1.557342 0.317731 4.901
## censustract114 -0.422166 0.126075 -3.349
## coolingCentral Air 0.237083 0.005582 42.473
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)1 1.568478 0.160003 9.803
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)1 1.492416 0.180635 8.262
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)1 1.295066 0.182065 7.113
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)1 2.231213 0.754200 2.958
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)1 -1.504572 1.341948 -1.121
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)2 0.289468 0.122672 2.360
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)2 0.138072 0.141047 0.979
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)2 0.323484 0.167919 1.926
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)2 -0.068635 0.707302 -0.097
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)2 -1.483206 1.321174 -1.123
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)3 -0.258392 0.752757 -0.343
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)3 0.517601 0.852895 0.607
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)3 -4.274065 0.482510 -8.858
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)3 -0.251152 1.789260 -0.140
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)3 -3.480660 1.011815 -3.440
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)4 -1.673822 2.722666 -0.615
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)4 -9.178902 3.157359 -2.907
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)4 1.319223 1.392499 0.947
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)4 -18.732839 6.496717 -2.883
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)4 -5.967410 2.887742 -2.066
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)5 -6.608910 5.973005 -1.106
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)5 -22.013598 6.898977 -3.191
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)5 5.199070 2.997022 1.735
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)5 -39.413922 13.916752 -2.832
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)5 -2.239959 3.088149 -0.725
## ns(finsqft, df = 5)1:censustract102.01 -1.246213 0.193784 -6.431
## ns(finsqft, df = 5)2:censustract102.01 -1.353778 0.214052 -6.325
## ns(finsqft, df = 5)3:censustract102.01 -0.721572 0.131069 -5.505
## ns(finsqft, df = 5)4:censustract102.01 -2.629437 0.469900 -5.596
## ns(finsqft, df = 5)5:censustract102.01 -0.718808 0.387259 -1.856
## ns(finsqft, df = 5)1:censustract102.02 -0.981329 0.219482 -4.471
## ns(finsqft, df = 5)2:censustract102.02 -0.947081 0.238111 -3.977
## ns(finsqft, df = 5)3:censustract102.02 -0.627166 0.132490 -4.734
## ns(finsqft, df = 5)4:censustract102.02 -1.916222 0.501875 -3.818
## ns(finsqft, df = 5)5:censustract102.02 -0.564730 0.248293 -2.274
## ns(finsqft, df = 5)1:censustract103 -0.172455 0.192935 -0.894
## ns(finsqft, df = 5)2:censustract103 -0.079854 0.211199 -0.378
## ns(finsqft, df = 5)3:censustract103 -0.441121 0.181328 -2.433
## ns(finsqft, df = 5)4:censustract103 0.461762 0.948147 0.487
## ns(finsqft, df = 5)5:censustract103 0.514972 1.675707 0.307
## ns(finsqft, df = 5)1:censustract104.01 -0.196547 0.117192 -1.677
## ns(finsqft, df = 5)2:censustract104.01 -0.219508 0.133328 -1.646
## ns(finsqft, df = 5)3:censustract104.01 0.139010 0.098757 1.408
## ns(finsqft, df = 5)4:censustract104.01 -0.589098 0.294983 -1.997
## ns(finsqft, df = 5)5:censustract104.01 -0.250427 0.246021 -1.018
## ns(finsqft, df = 5)1:censustract104.02 -0.023345 0.212101 -0.110
## ns(finsqft, df = 5)2:censustract104.02 0.039158 0.232663 0.168
## ns(finsqft, df = 5)3:censustract104.02 0.065605 0.135283 0.485
## ns(finsqft, df = 5)4:censustract104.02 -0.092770 0.495217 -0.187
## ns(finsqft, df = 5)5:censustract104.02 -0.235059 0.305940 -0.768
## ns(finsqft, df = 5)1:censustract105 -0.428971 0.215560 -1.990
## ns(finsqft, df = 5)2:censustract105 -0.499048 0.232013 -2.151
## ns(finsqft, df = 5)3:censustract105 -0.212976 0.145587 -1.463
## ns(finsqft, df = 5)4:censustract105 -1.170131 0.510769 -2.291
## ns(finsqft, df = 5)5:censustract105 -0.928658 0.408919 -2.271
## ns(finsqft, df = 5)1:censustract106.01 0.439865 0.321445 1.368
## ns(finsqft, df = 5)2:censustract106.01 0.247408 0.343702 0.720
## ns(finsqft, df = 5)3:censustract106.01 0.027627 0.196092 0.141
## ns(finsqft, df = 5)4:censustract106.01 0.455052 0.727569 0.625
## ns(finsqft, df = 5)5:censustract106.01 -0.742363 0.502621 -1.477
## ns(finsqft, df = 5)1:censustract106.02 -1.829026 0.209948 -8.712
## ns(finsqft, df = 5)2:censustract106.02 -1.818738 0.237946 -7.643
## ns(finsqft, df = 5)3:censustract106.02 -1.351310 0.229008 -5.901
## ns(finsqft, df = 5)4:censustract106.02 -3.082205 1.323837 -2.328
## ns(finsqft, df = 5)5:censustract106.02 0.800719 2.429470 0.330
## ns(finsqft, df = 5)1:censustract107 -1.495867 0.252301 -5.929
## ns(finsqft, df = 5)2:censustract107 -1.591649 0.309368 -5.145
## ns(finsqft, df = 5)3:censustract107 -2.155146 0.563734 -3.823
## ns(finsqft, df = 5)4:censustract107 -2.509137 5.517352 -0.455
## ns(finsqft, df = 5)5:censustract107 1.018772 10.828179 0.094
## ns(finsqft, df = 5)1:censustract108 -1.307509 0.266543 -4.905
## ns(finsqft, df = 5)2:censustract108 -1.534222 0.289640 -5.297
## ns(finsqft, df = 5)3:censustract108 -0.577344 0.169913 -3.398
## ns(finsqft, df = 5)4:censustract108 -3.204861 0.601715 -5.326
## ns(finsqft, df = 5)5:censustract108 -1.472915 0.354155 -4.159
## ns(finsqft, df = 5)1:censustract109.02 65.529306 129.936438 0.504
## ns(finsqft, df = 5)2:censustract109.02 66.173941 130.717812 0.506
## ns(finsqft, df = 5)3:censustract109.02 35.888616 70.885012 0.506
## ns(finsqft, df = 5)4:censustract109.02 117.183098 255.623862 0.458
## ns(finsqft, df = 5)5:censustract109.02 10.993229 70.909936 0.155
## ns(finsqft, df = 5)1:censustract110 -0.537799 0.163605 -3.287
## ns(finsqft, df = 5)2:censustract110 -0.604673 0.178106 -3.395
## ns(finsqft, df = 5)3:censustract110 -0.424973 0.104434 -4.069
## ns(finsqft, df = 5)4:censustract110 -0.937982 0.380979 -2.462
## ns(finsqft, df = 5)5:censustract110 0.047934 0.217736 0.220
## ns(finsqft, df = 5)1:censustract111 -0.625394 0.136974 -4.566
## ns(finsqft, df = 5)2:censustract111 -0.631034 0.153347 -4.115
## ns(finsqft, df = 5)3:censustract111 -0.224152 0.111013 -2.019
## ns(finsqft, df = 5)4:censustract111 -1.076201 0.365718 -2.943
## ns(finsqft, df = 5)5:censustract111 -0.316953 0.392283 -0.808
## ns(finsqft, df = 5)1:censustract112.01 -0.086747 0.111223 -0.780
## ns(finsqft, df = 5)2:censustract112.01 -0.068064 0.128108 -0.531
## ns(finsqft, df = 5)3:censustract112.01 0.111269 0.102558 1.085
## ns(finsqft, df = 5)4:censustract112.01 -0.182073 0.298089 -0.611
## ns(finsqft, df = 5)5:censustract112.01 -0.065671 0.305253 -0.215
## ns(finsqft, df = 5)1:censustract112.02 0.051993 0.147560 0.352
## ns(finsqft, df = 5)2:censustract112.02 -0.050399 0.166245 -0.303
## ns(finsqft, df = 5)3:censustract112.02 0.382863 0.104095 3.678
## ns(finsqft, df = 5)4:censustract112.02 -0.335893 0.363788 -0.923
## ns(finsqft, df = 5)5:censustract112.02 -0.672419 0.246554 -2.727
## ns(finsqft, df = 5)1:censustract113.01 -0.060827 0.123707 -0.492
## ns(finsqft, df = 5)2:censustract113.01 -0.112603 0.143500 -0.785
## ns(finsqft, df = 5)3:censustract113.01 -0.387742 0.128849 -3.009
## ns(finsqft, df = 5)4:censustract113.01 0.020046 0.353479 0.057
## ns(finsqft, df = 5)5:censustract113.01 0.691911 0.436751 1.584
## ns(finsqft, df = 5)1:censustract113.02 -2.544365 0.290019 -8.773
## ns(finsqft, df = 5)2:censustract113.02 -2.501437 0.363376 -6.884
## ns(finsqft, df = 5)3:censustract113.02 -1.929204 0.547613 -3.523
## ns(finsqft, df = 5)4:censustract113.02 -4.030108 2.190199 -1.840
## ns(finsqft, df = 5)5:censustract113.02 0.864610 4.240977 0.204
## ns(finsqft, df = 5)1:censustract113.03 -1.498797 0.304521 -4.922
## ns(finsqft, df = 5)2:censustract113.03 -1.557835 0.331102 -4.705
## ns(finsqft, df = 5)3:censustract113.03 -1.304938 0.390905 -3.338
## ns(finsqft, df = 5)4:censustract113.03 -5.785313 2.271602 -2.547
## ns(finsqft, df = 5)5:censustract113.03 -5.855242 4.326510 -1.353
## ns(finsqft, df = 5)1:censustract114 -0.057624 0.115229 -0.500
## ns(finsqft, df = 5)2:censustract114 0.093185 0.135397 0.688
## ns(finsqft, df = 5)3:censustract114 -0.396467 0.122223 -3.244
## ns(finsqft, df = 5)4:censustract114 0.037872 0.305826 0.124
## ns(finsqft, df = 5)5:censustract114 0.270663 0.301350 0.898
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## ns(finsqft, df = 5)1 7.68e-10 ***
## ns(finsqft, df = 5)2 4.07e-14 ***
## ns(finsqft, df = 5)3 < 2e-16 ***
## ns(finsqft, df = 5)4 6.19e-08 ***
## ns(finsqft, df = 5)5 0.000411 ***
## ns(lotsize, df = 5)1 < 2e-16 ***
## ns(lotsize, df = 5)2 0.348388
## ns(lotsize, df = 5)3 6.25e-06 ***
## ns(lotsize, df = 5)4 0.012316 *
## ns(lotsize, df = 5)5 0.027665 *
## censustract102.01 2.37e-09 ***
## censustract102.02 5.42e-06 ***
## censustract103 0.655851
## censustract104.01 0.373052
## censustract104.02 0.901868
## censustract105 0.018930 *
## censustract106.01 0.367628
## censustract106.02 < 2e-16 ***
## censustract107 5.44e-07 ***
## censustract108 8.28e-08 ***
## censustract109.02 0.614899
## censustract110 3.88e-07 ***
## censustract111 1.45e-05 ***
## censustract112.01 0.789606
## censustract112.02 0.939916
## censustract113.01 0.974505
## censustract113.02 5.28e-15 ***
## censustract113.03 9.56e-07 ***
## censustract114 0.000813 ***
## coolingCentral Air < 2e-16 ***
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)1 < 2e-16 ***
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)1 < 2e-16 ***
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)1 1.16e-12 ***
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)1 0.003095 **
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)1 0.262218
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)2 0.018297 *
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)2 0.327633
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)2 0.054060 .
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)2 0.922697
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)2 0.261598
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)3 0.731405
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)3 0.543938
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)3 < 2e-16 ***
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)3 0.888371
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)3 0.000582 ***
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)4 0.538709
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)4 0.003650 **
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)4 0.343453
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)4 0.003936 **
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)4 0.038793 *
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)5 0.268535
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)5 0.001420 **
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)5 0.082796 .
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)5 0.004627 **
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)5 0.468249
## ns(finsqft, df = 5)1:censustract102.01 1.29e-10 ***
## ns(finsqft, df = 5)2:censustract102.01 2.58e-10 ***
## ns(finsqft, df = 5)3:censustract102.01 3.72e-08 ***
## ns(finsqft, df = 5)4:censustract102.01 2.22e-08 ***
## ns(finsqft, df = 5)5:censustract102.01 0.063443 .
## ns(finsqft, df = 5)1:censustract102.02 7.81e-06 ***
## ns(finsqft, df = 5)2:censustract102.02 6.98e-05 ***
## ns(finsqft, df = 5)3:censustract102.02 2.21e-06 ***
## ns(finsqft, df = 5)4:censustract102.02 0.000135 ***
## ns(finsqft, df = 5)5:censustract102.02 0.022946 *
## ns(finsqft, df = 5)1:censustract103 0.371408
## ns(finsqft, df = 5)2:censustract103 0.705359
## ns(finsqft, df = 5)3:censustract103 0.014991 *
## ns(finsqft, df = 5)4:censustract103 0.626251
## ns(finsqft, df = 5)5:censustract103 0.758605
## ns(finsqft, df = 5)1:censustract104.01 0.093525 .
## ns(finsqft, df = 5)2:censustract104.01 0.099696 .
## ns(finsqft, df = 5)3:censustract104.01 0.159260
## ns(finsqft, df = 5)4:censustract104.01 0.045828 *
## ns(finsqft, df = 5)5:censustract104.01 0.308729
## ns(finsqft, df = 5)1:censustract104.02 0.912359
## ns(finsqft, df = 5)2:censustract104.02 0.866344
## ns(finsqft, df = 5)3:censustract104.02 0.627718
## ns(finsqft, df = 5)4:censustract104.02 0.851401
## ns(finsqft, df = 5)5:censustract104.02 0.442303
## ns(finsqft, df = 5)1:censustract105 0.046597 *
## ns(finsqft, df = 5)2:censustract105 0.031488 *
## ns(finsqft, df = 5)3:censustract105 0.143510
## ns(finsqft, df = 5)4:censustract105 0.021975 *
## ns(finsqft, df = 5)5:censustract105 0.023153 *
## ns(finsqft, df = 5)1:censustract106.01 0.171197
## ns(finsqft, df = 5)2:censustract106.01 0.471633
## ns(finsqft, df = 5)3:censustract106.01 0.887959
## ns(finsqft, df = 5)4:censustract106.01 0.531686
## ns(finsqft, df = 5)5:censustract106.01 0.139691
## ns(finsqft, df = 5)1:censustract106.02 < 2e-16 ***
## ns(finsqft, df = 5)2:censustract106.02 2.18e-14 ***
## ns(finsqft, df = 5)3:censustract106.02 3.66e-09 ***
## ns(finsqft, df = 5)4:censustract106.02 0.019906 *
## ns(finsqft, df = 5)5:censustract106.02 0.741715
## ns(finsqft, df = 5)1:censustract107 3.08e-09 ***
## ns(finsqft, df = 5)2:censustract107 2.69e-07 ***
## ns(finsqft, df = 5)3:censustract107 0.000132 ***
## ns(finsqft, df = 5)4:censustract107 0.649277
## ns(finsqft, df = 5)5:censustract107 0.925042
## ns(finsqft, df = 5)1:censustract108 9.37e-07 ***
## ns(finsqft, df = 5)2:censustract108 1.19e-07 ***
## ns(finsqft, df = 5)3:censustract108 0.000680 ***
## ns(finsqft, df = 5)4:censustract108 1.01e-07 ***
## ns(finsqft, df = 5)5:censustract108 3.21e-05 ***
## ns(finsqft, df = 5)1:censustract109.02 0.614042
## ns(finsqft, df = 5)2:censustract109.02 0.612695
## ns(finsqft, df = 5)3:censustract109.02 0.612654
## ns(finsqft, df = 5)4:censustract109.02 0.646654
## ns(finsqft, df = 5)5:censustract109.02 0.876798
## ns(finsqft, df = 5)1:censustract110 0.001013 **
## ns(finsqft, df = 5)2:censustract110 0.000687 ***
## ns(finsqft, df = 5)3:censustract110 4.73e-05 ***
## ns(finsqft, df = 5)4:censustract110 0.013821 *
## ns(finsqft, df = 5)5:censustract110 0.825759
## ns(finsqft, df = 5)1:censustract111 5.00e-06 ***
## ns(finsqft, df = 5)2:censustract111 3.88e-05 ***
## ns(finsqft, df = 5)3:censustract111 0.043480 *
## ns(finsqft, df = 5)4:censustract111 0.003256 **
## ns(finsqft, df = 5)5:censustract111 0.419114
## ns(finsqft, df = 5)1:censustract112.01 0.435436
## ns(finsqft, df = 5)2:censustract112.01 0.595214
## ns(finsqft, df = 5)3:censustract112.01 0.277961
## ns(finsqft, df = 5)4:censustract112.01 0.541337
## ns(finsqft, df = 5)5:censustract112.01 0.829663
## ns(finsqft, df = 5)1:censustract112.02 0.724578
## ns(finsqft, df = 5)2:censustract112.02 0.761767
## ns(finsqft, df = 5)3:censustract112.02 0.000235 ***
## ns(finsqft, df = 5)4:censustract112.02 0.355848
## ns(finsqft, df = 5)5:censustract112.02 0.006390 **
## ns(finsqft, df = 5)1:censustract113.01 0.622930
## ns(finsqft, df = 5)2:censustract113.01 0.432638
## ns(finsqft, df = 5)3:censustract113.01 0.002621 **
## ns(finsqft, df = 5)4:censustract113.01 0.954776
## ns(finsqft, df = 5)5:censustract113.01 0.113154
## ns(finsqft, df = 5)1:censustract113.02 < 2e-16 ***
## ns(finsqft, df = 5)2:censustract113.02 5.94e-12 ***
## ns(finsqft, df = 5)3:censustract113.02 0.000427 ***
## ns(finsqft, df = 5)4:censustract113.02 0.065769 .
## ns(finsqft, df = 5)5:censustract113.02 0.838456
## ns(finsqft, df = 5)1:censustract113.03 8.62e-07 ***
## ns(finsqft, df = 5)2:censustract113.03 2.55e-06 ***
## ns(finsqft, df = 5)3:censustract113.03 0.000844 ***
## ns(finsqft, df = 5)4:censustract113.03 0.010877 *
## ns(finsqft, df = 5)5:censustract113.03 0.175957
## ns(finsqft, df = 5)1:censustract114 0.617019
## ns(finsqft, df = 5)2:censustract114 0.491312
## ns(finsqft, df = 5)3:censustract114 0.001181 **
## ns(finsqft, df = 5)4:censustract114 0.901446
## ns(finsqft, df = 5)5:censustract114 0.369102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2647 on 30107 degrees of freedom
## Multiple R-squared: 0.8018, Adjusted R-squared: 0.8008
## F-statistic: 811.9 on 150 and 30107 DF, p-value: < 2.2e-16
sim_tv <- simulate(m12, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim_tv[[i]]), col = "red")
What is the expected mean totalvalue of a home with the following characteristics:
To answer this, first we need to enter this information into a data frame:
newdata <- data.frame(finsqft = 2500,
lotsize = 1,
cooling = "Central Air",
censustract = "101")
Next we use the predict
function with our newdata
data frame:
predict(m12, newdata = newdata) %>% exp()
## 1
## 432957.1
with 95% confidence interval
predict(m12, newdata = newdata, interval = "confidence") %>% exp()
## fit lwr upr
## 1 432957.1 419529.2 446814.8
To predict a SINGLE home price, we need to set interval = "prediction"
. Notice the predicted value is the same but the interval is much wider since we’re more uncertain about the expected value of a particular house:
predict(m12, newdata = newdata, interval = "prediction") %>% exp()
## fit lwr upr
## 1 432957.1 257446.2 728120.3
Now let’s predict mean totalvalue for same measures but with lotsize ranging from 0 to 10 acreas.
newdata <- data.frame(finsqft = 2500,
lotsize = 1:10,
cooling = "Central Air",
censustract = "101")
head(newdata)
## finsqft lotsize cooling censustract
## 1 2500 1 Central Air 101
## 2 2500 2 Central Air 101
## 3 2500 3 Central Air 101
## 4 2500 4 Central Air 101
## 5 2500 5 Central Air 101
## 6 2500 6 Central Air 101
predict(m12, newdata = newdata, interval = "confidence") %>% exp()
## fit lwr upr
## 1 432957.1 419529.2 446814.8
## 2 452393.5 441256.0 463812.1
## 3 462996.6 451627.3 474652.1
## 4 468040.7 456591.0 479777.5
## 5 472796.1 461287.4 484591.9
## 6 477579.0 466003.6 489441.9
## 7 482389.3 470739.3 494327.7
## 8 487226.9 475494.1 499249.1
## 9 492091.5 480267.7 504206.4
## 10 496983.2 485060.0 509199.5
This is basically how effect plots are created.
pred_out <- predict(m12, newdata = newdata, interval = "confidence") %>%
exp() %>%
as.data.frame() %>%
mutate(lotsize = 1:10)
Now create the effect plot “by hand” with ggplot:
ggplot(pred_out, aes(x = lotsize, y = fit)) +
geom_line() +
geom_ribbon(mapping = aes(ymin = lwr, ymax = upr), alpha = 1/5)
Here’s the same plot with ggpredict; use condition
argument to set other predictors to specific values.
eff_out <- ggpredict(m12, terms = "lotsize[1:10]",
condition = c(finsqft = 2500,
cooling = "Central Air",
censustract = "101"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
plot(eff_out)
Modify the code below to produce an effect plot for finsqft (with values set to 1500, 2000, 2500, 3000) and lotsize set to 0.5 acre. Leave cooling and censustract as is.
eff_out <- ggpredict(m12, terms = "lotsize[1:10]",
condition = c(finsqft = 2500,
cooling = "Central Air",
censustract = "101"))
plot(eff_out)
The stargazer package can produce regression tables for articles and presentations. It can output tables in html, text or LaTeX. Below we demonstrate with text.
The basic usage is as follows
stargazer(m2, m3, type = "text", title = "Modeling Results")
##
## Modeling Results
## ===============================================================================
## Dependent variable:
## -----------------------------------------------------------
## log(totalvalue)
## (1) (2)
## -------------------------------------------------------------------------------
## finsqft 0.001*** 0.0005***
## (0.00000) (0.00000)
##
## lotsize 0.004***
## (0.0001)
##
## Constant 11.721*** 11.724***
## (0.005) (0.005)
##
## -------------------------------------------------------------------------------
## Observations 30,258 30,258
## R2 0.661 0.678
## Adjusted R2 0.661 0.678
## Residual Std. Error 0.345 (df = 30256) 0.337 (df = 30255)
## F Statistic 59,080.390*** (df = 1; 30256) 31,811.080*** (df = 2; 30255)
## ===============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Since our dependent variable has been log transformed, we may want to exponentiate the coefficients. We can do that with the apply.coef = exp
argument.
stargazer(m2, m3, type = "text", title = "Modeling Results",
apply.coef = exp)
##
## Modeling Results
## ===============================================================================
## Dependent variable:
## -----------------------------------------------------------
## log(totalvalue)
## (1) (2)
## -------------------------------------------------------------------------------
## finsqft 1.001*** 1.000***
## (0.00000) (0.00000)
##
## lotsize 1.004***
## (0.0001)
##
## Constant 123,168.200*** 123,494.200***
## (0.005) (0.005)
##
## -------------------------------------------------------------------------------
## Observations 30,258 30,258
## R2 0.661 0.678
## Adjusted R2 0.661 0.678
## Residual Std. Error 0.345 (df = 30256) 0.337 (df = 30255)
## F Statistic 59,080.390*** (df = 1; 30256) 31,811.080*** (df = 2; 30255)
## ===============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The style argument allows you to create tables according to styles preferred by various journals. See ?stargazer
“ajps” = American Journal of Political Science
stargazer(m2, m3,
type = "text",
title = "Modeling Results",
apply.coef = exp,
style = "ajps")
##
## Modeling Results
## -----------------------------------------------------------------------------
## log(totalvalue)
## Model 1 Model 2
## -----------------------------------------------------------------------------
## finsqft 1.001*** 1.000***
## (0.00000) (0.00000)
## lotsize 1.004***
## (0.0001)
## Constant 123168.200*** 123494.200***
## (0.005) (0.005)
## N 30258 30258
## R-squared 0.661 0.678
## Adj. R-squared 0.661 0.678
## Residual Std. Error 0.345 (df = 30256) 0.337 (df = 30255)
## F Statistic 59080.390*** (df = 1; 30256) 31811.080*** (df = 2; 30255)
## -----------------------------------------------------------------------------
## ***p < .01; **p < .05; *p < .1
stargazer(m2, m3,
type = "text",
title = "Modeling Results",
apply.coef = exp,
style = "asr")
##
## Modeling Results
## -------------------------------------------------------------------------------
## log(totalvalue)
## (1) (2)
## -------------------------------------------------------------------------------
## finsqft 1.001*** 1.000***
## lotsize 1.004***
## Constant 123,168.200*** 123,494.200***
## N 30,258 30,258
## R2 0.661 0.678
## Adjusted R2 0.661 0.678
## Residual Std. Error 0.345 (df = 30256) 0.337 (df = 30255)
## F Statistic 59,080.390*** (df = 1; 30256) 31,811.080*** (df = 2; 30255)
## -------------------------------------------------------------------------------
## *p < .05; **p < .01; ***p < .001
We can set the variable names with the covariate.labels
argument.
stargazer(m2, m3,
type = "text",
title = "Modeling Results",
style = "ajps",
apply.coef = exp,
covariate.labels = c("Finished Sq Ft",
"Lot Size",
"Intercept"))
##
## Modeling Results
## -----------------------------------------------------------------------------
## log(totalvalue)
## Model 1 Model 2
## -----------------------------------------------------------------------------
## Finished Sq Ft 1.001*** 1.000***
## (0.00000) (0.00000)
## Lot Size 1.004***
## (0.0001)
## Intercept 123168.200*** 123494.200***
## (0.005) (0.005)
## N 30258 30258
## R-squared 0.661 0.678
## Adj. R-squared 0.661 0.678
## Residual Std. Error 0.345 (df = 30256) 0.337 (df = 30255)
## F Statistic 59080.390*** (df = 1; 30256) 31811.080*** (df = 2; 30255)
## -----------------------------------------------------------------------------
## ***p < .01; **p < .05; *p < .1
We can omit certain model statistics. Below we omit the F statistic. Notice we also we set the name of the dependent variable to display as “Total Value”
stargazer(m2, m3,
type = "text",
title = "Modeling Results",
style = "ajps",
covariate.labels = c("Finished Sq Ft",
"Lot Size",
"Intercept"),
omit.stat = "f",
apply.coef = exp,
dep.var.labels = "Total Value")
##
## Modeling Results
## ---------------------------------------------------------
## Total Value
## Model 1 Model 2
## ---------------------------------------------------------
## Finished Sq Ft 1.001*** 1.000***
## (0.00000) (0.00000)
## Lot Size 1.004***
## (0.0001)
## Intercept 123168.200*** 123494.200***
## (0.005) (0.005)
## N 30258 30258
## R-squared 0.661 0.678
## Adj. R-squared 0.661 0.678
## Residual Std. Error 0.345 (df = 30256) 0.337 (df = 30255)
## ---------------------------------------------------------
## ***p < .01; **p < .05; *p < .1
Again, see ?stargazer
for many more options.
Faraway, J. (2005). Linear Models in R. London: Chapman & Hall.
Fox, J. (2002). A R and S-Plus Companion to Applied Regression. London: Sage.
Harrell, F. (2015). Regression Modeling Strategies (2nd ed.). New York: Springer.
Kutner, M., et al. (2005). Applied Linear Statistical Models (5th ed.). New York: McGraw-Hill.
Maindonald J., Braun, J.W. (2010). Data Analysis and Graphics Using R (3rd ed.). Cambridge: Cambridge Univ Press.