This file is available for download at https://uvastatlab.github.io/phdplus2020/ under the final session Linear Modeling.

RStudio tips

QUESTIONS? email jcf2d@virginia.edu

# Load packages we'll use today
library(tidyverse)
library(ggeffects)
library(car)
library(splines)
library(stargazer)

linear modeling with simulated data [Video, Part 1]

Example 1: one numeric predictor

Let’s begin by simulating some fake data

x <- 1:25
y <- 10 + 5*x
plot(x, y)

10 is the intercept, 5 is the slope. y is completely determined by x

Let’s add some “noise” to our data by adding random draws from a Normal distribution with mean = 0 and a standard deviation = 10.

set.seed(1) ensures we all get the same “random” data

set.seed(1)
noise <- rnorm(n = 25, mean = 0, sd = 10)

Add the noise to 10 + 5*x and re-draw plot

y <- 10 + 5*x + noise
plot(x, y)

This data is the combination of two parts:

  1. 10 + 5*x
  2. rnorm(n = 25, mean = 0, sd = 10)

What if we were given this data and told to determine the process that generated it? In other words, fill in the blanks:

  1. __ + __*x
  2. rnorm(n = 25, mean = 0, sd = ____)

That’s basically what linear modeling is.

Traditional linear modeling assumes the following (among others):

  1. the formula is a weighted sum of predictors (eg, 10 + 5*x)
  2. the noise is a random draw from a Normal distribution with mean = 0
  3. the standard deviation of the Normal distribution is constant

Linear modeling tries to recover the weights in the first assumption (10 and 5) and the standard deviation in the 3rd assumption (10).

Let’s attempt to recover the data generating process. For this we use the lm() function. We have to specify the formula for the first assumption. The 2nd and 3rd assumptions are built into lm().

“y ~ x” means we think Part 1 of the model is "y = intercept + slope*x". This tells lm() to take our data and find the best intercept and slope. Notice this is the correct model!

mod <- lm(y ~ x)
summary(mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.876  -4.613   2.254   5.909  14.648 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.135      3.999   2.784   0.0105 *  
## x              5.042      0.269  18.743 1.98e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.7 on 23 degrees of freedom
## Multiple R-squared:  0.9386, Adjusted R-squared:  0.9359 
## F-statistic: 351.3 on 1 and 23 DF,  p-value: 1.981e-15

The model returns the following estimates:

intercept = 11.135 slope = 5.042 sd = 9.7 (Residual standard error)

Recall the “true” values:

intercept = 10 slope = 5 sd = 10

We could use those estimates to generate data and see if they look similar to our original data.

y2 <- 11.135 + 5.042*x + rnorm(25, 0, 9.7)
# plot original data
plot(x, y)
# plot simulated data using estimates
points(x, y2, col = "red")

Looks similar.

We could also add the original and fitted lines

plot(x, y)
points(x, y2, col = "red")
# a = intercept, b = slope
abline(a = 10, b = 5)
abline(mod, col = "red")

We could also compare distributions using histograms

hist(y, main = "observed data")

hist(y2, main = "data simulated from model")

Or we could compare using density curves (smooth version of histograms)

plot(density(y))
lines(density(y2), col = "red")

In real life, we DO NOT KNOW the formula of weighted sums, or even if a formula of weighted sums is appropriate. We also don’t know if the Normality assumption or constant variance assumption of the noise is plausible.

linear modeling with simulated data [Video, Part 2]

Example 2: one categorical and one numeric predictor

Again, let’s simulate some data. The function set.seed(15) ensures we all generate the same data.

set.seed(15)

# random sample of "m" and "f"
gender <- sample(c("m", "f"), size = 200, replace = TRUE)

# random numbers from the range of 1 - 20
score <- runif(n = 200, min = 1, max = 20)

# the noise: random draws from a N(0,5) distribution
noise <- rnorm(n = 200, mean = 0, sd = 5)

Now we simulate y. The third predictor is an “interaction”. That is, the “effect” of score depends on gender.

y <- -1 + -3*(gender=="m") + 2*score + 3*(gender=="m")*score + noise
dat <- data.frame(y, gender, score)

Scatter plot of y versus x, colored by gender. Now ggplot2 comes in handy.

ggplot(dat, aes(x = score, y = y, color = gender)) +
  geom_point()

Notice the distribution of y is not Normal. That’s OK. The normality assumption is for the noise/error.

ggplot(dat, aes(x = y, y = stat(density))) +
  geom_histogram(bins = 20) +
  geom_density()

Let’s try to recover the “true” values in the formula and the SD of the noise.

To fit an interaction, use a colon.

mod2 <- lm(y ~ gender + score + gender:score, data = dat)
# or more concisely
mod2 <- lm(y ~ gender * score, data = dat)

We come pretty close to recovering the true values…

# y <- -1 + -3*(gender=="m") + 2*score + 3*(gender=="m")*score + noise
summary(mod2)
## 
## Call:
## lm(formula = y ~ gender * score, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4835  -3.4282   0.1576   3.4301  12.7603 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.1841     1.3389   0.884  0.37758    
## genderm        -5.3155     1.6563  -3.209  0.00156 ** 
## score           1.8227     0.1090  16.726  < 2e-16 ***
## genderm:score   3.2059     0.1372  23.363  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.12 on 196 degrees of freedom
## Multiple R-squared:  0.9631, Adjusted R-squared:  0.9625 
## F-statistic:  1704 on 3 and 196 DF,  p-value: < 2.2e-16

Let’s simulate data from our model and see how it compares to the original data. For this we can use the simulate() function

sim_y <- simulate(mod2)
head(sim_y)
##      sim_1
## 1 42.19224
## 2 67.73883
## 3 10.33191
## 4 31.29291
## 5 22.11217
## 6 24.98789

plot both y and sim.y and compare. This time we use ggplot2.

ggplot(dat, aes(x = y)) +
  geom_density() +
  geom_density(aes(sim_1), sim_y, color = "red")

Or the base R way

plot(density(dat$y))
lines(density(sim_y$sim_1), col = "red")

By specifying geom_smooth(method = “lm”) in the ggplot2 call we can add the fitted lines.

ggplot(dat, aes(x = score, y = y, color = gender)) +
  geom_point() +
  geom_abline(intercept = -1, slope = 2) +  # true line for "f"
  geom_abline(intercept = -4, slope = 5) +  # true line for "m"
  geom_smooth(method = "lm")                # model predicted lines
## `geom_smooth()` using formula 'y ~ x'

Probably better to use the model to create the plot. These are usually referred to as “effect plots”. Here we use the ggpredict() function from the ggeffects package.

eff <- ggpredict(mod2, terms = c("score", "gender"))
eff
## 
## # Predicted values of y
## # x = score
## 
## # gender = f
## 
##  x | Predicted |   SE |         95% CI
## --------------------------------------
##  0 |      1.18 | 1.34 | [-1.44,  3.81]
##  5 |     10.30 | 0.87 | [ 8.59, 12.00]
## 10 |     19.41 | 0.56 | [18.32, 20.51]
## 15 |     28.52 | 0.68 | [27.19, 29.86]
## 20 |     37.64 | 1.10 | [35.48, 39.79]
## 
## # gender = m
## 
##  x | Predicted |   SE |         95% CI
## --------------------------------------
##  0 |     -4.13 | 0.98 | [-6.04, -2.22]
##  5 |     21.01 | 0.65 | [19.74, 22.28]
## 10 |     46.15 | 0.49 | [45.20, 47.11]
## 15 |     71.30 | 0.63 | [70.06, 72.54]
## 20 |     96.44 | 0.96 | [94.57, 98.31]
plot(eff, add.data = TRUE)

Let’s fit a “wrong” model with no interaction.

“y ~ gender + score” means “y = intercept + b1 * gender + b2 * score”

The effect of score is the same for both males and female

mod3 <- lm(y ~ gender + score, data = dat)
summary(mod3)
## 
## Call:
## lm(formula = y ~ gender + score, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.137  -7.288  -0.479   6.739  26.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.5267     1.7868  -12.05   <2e-16 ***
## genderm      29.3970     1.4206   20.69   <2e-16 ***
## score         3.8447     0.1285   29.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.935 on 197 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.8588 
## F-statistic: 606.3 on 2 and 197 DF,  p-value: < 2.2e-16

Compare simulated data to oberserved data; not good!

sim_y <- simulate(mod3)
plot(density(dat$y))
lines(density(sim_y$sim_1), col = "red")

Plot the fitted model using ggpredict

eff <- ggpredict(mod3, terms = c("score", "gender"))
plot(eff, add.data = TRUE)

Hence this is linear modeling:

  1. propose and fit model(s)
  2. determine if the model is good
  3. use the model to explain relationships or make predictions

YOUR TURN #1

# submit the following code to simulate some data
set.seed(2)
x1 <- sample(1:5, size = 1000, replace = TRUE, 
             prob = c(0.1,0.2,0.3,0.3,0.1))
x2 <- rnorm(n = 1000, mean = 12, sd = 2)
noise <- rnorm(n = 1000, mean = 0, sd = 4)
y <- 5 + 10*x1 + -4*x2 + noise
df <- data.frame(y, x1, x2)
head(df)

# Use lm() as we did above to attempt to recover the "true" values that were
# used to generate y.

linear modeling with Albemarle County Real Estate data [Video, Part 3]

Recall we (David Martin) downloaded and cleaned over 30,000 records from Albemarle County’s office of Geographic Data Services. The final clean version is available as an RDS file on GitHub. The following code will download the file and read into R. Notice we have to assign the result of readRDS to a name. This creates an object in our workspace called “homes”.

URL <- "https://github.com/uvastatlab/phdplus2020/raw/master/data/homes_final.rds"
homes <- readRDS(file = url(URL))
glimpse(homes)
## Observations: 30,258
## Variables: 23
## $ yearbuilt         <dbl> 1939, 1981, 1999, 1754, 1880, 1769, 1818, 1935, 1...
## $ yearremodeled     <dbl> NA, NA, NA, NA, NA, 1988, 1991, 1994, 1978, NA, N...
## $ usecode           <chr> "Single Family", "Single Family", "Single Family"...
## $ finsqft           <dbl> 1596, 1616, 480, 828, 1224, 4875, 5573, 1120, 145...
## $ cooling           <fct> No Central Air, No Central Air, No Central Air, N...
## $ bedroom           <dbl> 3, 6, 0, 1, 2, 4, 6, 3, 3, 3, 3, 3, 3, 3, 3, 4, 2...
## $ fullbath          <dbl> 1, 4, 0, 0, 1, 3, 4, 1, 1, 3, 2, 2, 3, 3, 2, 1, 1...
## $ halfbath          <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0...
## $ totalrooms        <dbl> 5, 9, 1, 3, 4, 8, 11, 5, 6, 6, 9, 0, 10, 8, 8, 6,...
## $ lotsize           <dbl> 55.955, 5.167, 80.750, 80.718, 65.000, 5.102, 453...
## $ landvalue         <dbl> 347900, 140300, 312500, 327100, 497000, 106400, 1...
## $ improvementsvalue <dbl> 80700, 212300, 21600, 13500, 113100, 827000, 8022...
## $ totalvalue        <dbl> 428600, 352600, 334100, 340600, 610100, 933400, 2...
## $ lastsaleprice     <dbl> 424000, 300000, 0, 0, 0, 0, 0, 2820000, 0, 0, 520...
## $ esdistrict        <fct> Broadus Wood, Broadus Wood, Broadus Wood, Broadus...
## $ msdistrict        <fct> Jouett, Jouett, Jouett, Jouett, Jouett, Henley, H...
## $ hsdistrict        <fct> Albemarle, Albemarle, Albemarle, Albemarle, Albem...
## $ censustract       <fct> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101,...
## $ lastsaledate      <date> 2003-11-18, 2012-03-23, 1994-06-30, 1994-08-09, ...
## $ age               <dbl> 80, 38, 20, 265, 139, 250, 201, 84, 139, 14, 15, ...
## $ condition         <fct> Fair, Average, Average, Substandard, Fair, Averag...
## $ remodel           <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ fp                <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0...

Notice we have a lot of variability in totalvalue of the home.

summary(homes$totalvalue)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4300  243500  340500  424524  491300 7859000
hist(homes$totalvalue)

# homes less than $1,000,000
hist(homes$totalvalue[homes$totalvalue < 1e6])

Perhaps we could build a linear model to help us understand that variability and estimate the expected totalvalue of a home given various predictors such as finsqft, censustract, bedrooms, lotsize, cooling, etc. Let’s try that with finsqft.

Below we propose that totalvalue = Intercept + slope * finsqft + noise, where noise is assumed to be drawn from a Normal distribution with mean 0 and some unknown standard deviation. Therefore we are estimating three quantities:

  1. Intercept
  2. slope coefficient for finsqft
  3. standard deviation of the Normal distribution with mean = 0
plot(totalvalue ~ finsqft, data = homes)
m1 <- lm(totalvalue ~ finsqft, data = homes)
summary(m1)
## 
## Call:
## lm(formula = totalvalue ~ finsqft, data = homes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1192704   -85548   -14220    50496  6376635 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.472e+05  3.082e+03  -47.76   <2e-16 ***
## finsqft      2.761e+02  1.351e+00  204.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 224600 on 30256 degrees of freedom
## Multiple R-squared:  0.5797, Adjusted R-squared:  0.5797 
## F-statistic: 4.173e+04 on 1 and 30256 DF,  p-value: < 2.2e-16
coef(m1)
##  (Intercept)      finsqft 
## -147204.5943     276.0597
sigma(m1)
## [1] 224574.9
# can add the fitted line to our plot
abline(m1, col = "blue")

A naive interpretation: - Every additional square foot adds about $276 to the total value of a home.

An informal check of the “goodness” of this model is to use it to generate data and then compare that data to the original data. Below we use simulate to generate 50 sets of totalvalues from our model. Then we plot the density of the observed totalvalue and overlay it with the densities of the simulations.

The syntax sim1[[i]] extracts the ith column of the sim1 data frame.

sim1 <- simulate(m1, nsim = 50)
plot(density(homes$totalvalue))
for(i in 1:50)lines(density(sim1[[i]]), col = "red")

# tidyverse method - slower, more complicated
ggplot(homes, aes(x = totalvalue)) +
  geom_line(stat = "density") +
  geom_line(aes(x = totalvalue, group = sim), 
               pivot_longer(sim1, everything(), 
                            names_to = "sim", 
                            values_to = "totalvalue"),
            stat = "density",
            color = "red")

This doesn’t look good.

Recall our assumptions:

  1. the formula is a weighted sum of predictors (eg, 10 + 5*x)
  2. the noise is a random draw from a Normal distribution with mean = 0
  3. the standard deviation of the Normal distribution is constant

Calling plot on our model object produces a series of plots to assess those last two assumptions.

Let’s see what these plots look like when the model assumptions are satisfied. We simulated the data for mod2 using these assumptions and then fit the correct model.

summary(mod2)
## 
## Call:
## lm(formula = y ~ gender * score, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4835  -3.4282   0.1576   3.4301  12.7603 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.1841     1.3389   0.884  0.37758    
## genderm        -5.3155     1.6563  -3.209  0.00156 ** 
## score           1.8227     0.1090  16.726  < 2e-16 ***
## genderm:score   3.2059     0.1372  23.363  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.12 on 196 degrees of freedom
## Multiple R-squared:  0.9631, Adjusted R-squared:  0.9625 
## F-statistic:  1704 on 3 and 196 DF,  p-value: < 2.2e-16
plot(mod2)

How to interpret plots

  1. Residuals vs Fitted: should have a horizontal line with uniform and symmertic scatter of points; if not, evidence that SD is not constant

  2. Normal Q-Q: points should lie close to diagonal line; if not, evidence that noise is not drawn from N(0, SD)

  3. Scale-Location: should have a horizontal line with uniform scatter of point; (similar to #1 but easier to detect trend in dispersion)

  4. Residuals vs Leverage: points outside the contour lines are influential observations

Let’s look at the diagnostic plots for our homes model.

plot(m1)

Plots 1, 2, and 3 are very concerning

How to address concerns?

Non-constant SD can be evidence of a mispecified model or a very skewed response where some values are orders of magnitude larger than the rest of the data. Notice that our response is quite skewed:

hist(homes$totalvalue)

We could try transforming totalvalue to a different scale. A common transformation is a log transformation. This certainly looks more symmetric but definitely has some outlying observations.

hist(log(homes$totalvalue))

Let’s try modeling log-transformed totalvalue and checking our diagnostic plots.

m2 <- lm(log(totalvalue) ~ finsqft, data = homes)
plot(m2)

The plots look better but there is still evidence of non-constant variance. We may need to include other predictors in our model.

Let’s also simulate some data from the model and compare to the observed totalvalue:

sim_tv <- simulate(m2, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim_tv[[i]]), col = "red")

Not bad.

But first how to interpret what we have at the moment?

coef(m2) %>% round(4)
## (Intercept)     finsqft 
##     11.7213      0.0005

The response is log transformed, so interpretation of coefficients changes. We first need to exponentiate and then interpret as multiplicative instead of additive effect.

exp(coef(m2)) %>% round(4)
## (Intercept)     finsqft 
## 123168.2092      1.0005

Naive interpretation - each additional finished square foot increases price by 0.05%

We can (and should) look at the confidence interval of the estimates

exp(confint(m2)) %>% round(5)
##                   2.5 %       97.5 %
## (Intercept) 122029.7943 124317.24430
## finsqft          1.0005      1.00051

Let’s examine the summary output in detail:

summary(m2)
## 
## Call:
## lm(formula = log(totalvalue) ~ finsqft, data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5973 -0.1611  0.0130  0.1719  3.0423 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.172e+01  4.738e-03  2474.1   <2e-16 ***
## finsqft     5.049e-04  2.077e-06   243.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3452 on 30256 degrees of freedom
## Multiple R-squared:  0.6613, Adjusted R-squared:  0.6613 
## F-statistic: 5.908e+04 on 1 and 30256 DF,  p-value: < 2.2e-16
Call:
lm(formula = log(totalvalue) ~ finsqft, data = homes)

Repeat of the function call. Useful if result is saved and then printed later.

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5973 -0.1611  0.0130  0.1719  3.0423  

Quick check of the distributional (Normal) assumptions of residuals. Median should not be far from 0. Max and Min, and 1Q and 3Q, should be roughly equal in absolute value.

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.172e+01  4.738e-03  2474.1   <2e-16 ***
finsqft     5.049e-04  2.077e-06   243.1   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3452 on 30256 degrees of freedom
Multiple R-squared:  0.6613,    Adjusted R-squared:  0.6613 
F-statistic: 5.908e+04 on 1 and 30256 DF,  p-value: < 2.2e-16

YOUR TURN #2

# update model m2 to include lotsize; save as m3
m3 <- lm()

# Interptation of lotsize coefficient: 


# check diagnostics


# What do you notice about the last plot?


# OPTIONAL: simulate data and compare to observed data

Categorical predictors [Video, Part 4]

Let’s include censustract in our model. This is a categorical variable.

Categorical predictors are added to models in the form of a contrast, which is a matrix of zeroes and ones. A contrast can take many forms, but the default contrast in R is the treatment contrast. One level of a categorical variable is designated as the baseline or reference level and all other levels are estimated in contrast to the reference level. Therefore if a categorical variable has k levels, the model will have k - 1 coefficients.

summary(homes$censustract)
##    101 102.01 102.02    103 104.01 104.02    105 106.01 106.02    107    108 
##   1768   1925   1188   2751   1506   1391   1406   1359   1453   1205   1055 
## 109.02    110    111 112.01 112.02 113.01 113.02 113.03    114 
##     57   2380   3375   1385   1437   1737    293   1055   1532
plot(homes$censustract)

m4 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract, data = homes)
summary(m4)
## 
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract, 
##     data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1120 -0.1402  0.0097  0.1596  3.1575 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        1.183e+01  8.420e-03 1405.208  < 2e-16 ***
## finsqft            4.523e-04  1.987e-06  227.605  < 2e-16 ***
## lotsize            4.516e-03  8.368e-05   53.962  < 2e-16 ***
## censustract102.01 -8.985e-02  1.007e-02   -8.921  < 2e-16 ***
## censustract102.02  1.028e-01  1.149e-02    8.948  < 2e-16 ***
## censustract103     3.093e-03  9.359e-03    0.330  0.74103    
## censustract104.01 -1.358e-01  1.067e-02  -12.735  < 2e-16 ***
## censustract104.02  4.974e-03  1.101e-02    0.452  0.65146    
## censustract105     5.151e-02  1.093e-02    4.711 2.48e-06 ***
## censustract106.01  5.104e-02  1.104e-02    4.622 3.82e-06 ***
## censustract106.02  9.557e-03  1.085e-02    0.881  0.37853    
## censustract107    -1.939e-01  1.151e-02  -16.855  < 2e-16 ***
## censustract108     2.153e-02  1.189e-02    1.811  0.07022 .  
## censustract109.02  2.441e-01  4.095e-02    5.962 2.52e-09 ***
## censustract110     2.802e-01  9.688e-03   28.927  < 2e-16 ***
## censustract111    -4.839e-03  9.008e-03   -0.537  0.59115    
## censustract112.01 -9.269e-02  1.092e-02   -8.485  < 2e-16 ***
## censustract112.02 -2.687e-03  1.085e-02   -0.248  0.80442    
## censustract113.01 -1.260e-01  1.032e-02  -12.209  < 2e-16 ***
## censustract113.02 -5.746e-02  1.927e-02   -2.983  0.00286 ** 
## censustract113.03  2.548e-02  1.193e-02    2.137  0.03264 *  
## censustract114    -4.910e-01  1.067e-02  -46.022  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3041 on 30236 degrees of freedom
## Multiple R-squared:  0.7373, Adjusted R-squared:  0.7371 
## F-statistic:  4040 on 21 and 30236 DF,  p-value: < 2.2e-16

Every censustract coefficient is in comparison to censustract 101. It got designated as the reference level since it came first numerically.

As a whole does censustract contribute to the model? We can’t (or shouldn’t) tell just by looking at all the p-values. The formal method for evaluating if a variable contributes to a model is the partial F test. We can easily run that in R with the anova() function.

anova(m4)
## Analysis of Variance Table
## 
## Response: log(totalvalue)
##                Df Sum Sq Mean Sq  F value    Pr(>F)    
## finsqft         1 7039.4  7039.4 76105.35 < 2.2e-16 ***
## lotsize         1  174.5   174.5  1886.41 < 2.2e-16 ***
## censustract    19  633.8    33.4   360.65 < 2.2e-16 ***
## Residuals   30236 2796.7     0.1                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This says that with finsqft and lotsize already in the model, censustract appears to help explain or account for the remaining variability.

Another equivalent way is to use anova to compare the new model with censustract to the previous model (m3) without censustract

m3 <- lm(log(totalvalue) ~ finsqft + lotsize, data = homes)
anova(m3, m4)
## Analysis of Variance Table
## 
## Model 1: log(totalvalue) ~ finsqft + lotsize
## Model 2: log(totalvalue) ~ finsqft + lotsize + censustract
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1  30255 3430.5                                  
## 2  30236 2796.7 19    633.81 360.65 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice the F statistic is the same.

How do the diagnostic plots and simulated data look?

plot(m4)

sim_tv <- simulate(m4, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim_tv[[i]]), col = "red")

How to interpret?

exp(coef(m4)) %>% round(4)
##       (Intercept)           finsqft           lotsize censustract102.01 
##       137632.3236            1.0005            1.0045            0.9141 
## censustract102.02    censustract103 censustract104.01 censustract104.02 
##            1.1082            1.0031            0.8730            1.0050 
##    censustract105 censustract106.01 censustract106.02    censustract107 
##            1.0529            1.0524            1.0096            0.8237 
##    censustract108 censustract109.02    censustract110    censustract111 
##            1.0218            1.2765            1.3234            0.9952 
## censustract112.01 censustract112.02 censustract113.01 censustract113.02 
##            0.9115            0.9973            0.8816            0.9442 
## censustract113.03    censustract114 
##            1.0258            0.6120

The censustract coefficients are in contrast to censustract 101. Therefore…

Are these differences significant? Check the model summary:

summary(m4)
## 
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract, 
##     data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1120 -0.1402  0.0097  0.1596  3.1575 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        1.183e+01  8.420e-03 1405.208  < 2e-16 ***
## finsqft            4.523e-04  1.987e-06  227.605  < 2e-16 ***
## lotsize            4.516e-03  8.368e-05   53.962  < 2e-16 ***
## censustract102.01 -8.985e-02  1.007e-02   -8.921  < 2e-16 ***
## censustract102.02  1.028e-01  1.149e-02    8.948  < 2e-16 ***
## censustract103     3.093e-03  9.359e-03    0.330  0.74103    
## censustract104.01 -1.358e-01  1.067e-02  -12.735  < 2e-16 ***
## censustract104.02  4.974e-03  1.101e-02    0.452  0.65146    
## censustract105     5.151e-02  1.093e-02    4.711 2.48e-06 ***
## censustract106.01  5.104e-02  1.104e-02    4.622 3.82e-06 ***
## censustract106.02  9.557e-03  1.085e-02    0.881  0.37853    
## censustract107    -1.939e-01  1.151e-02  -16.855  < 2e-16 ***
## censustract108     2.153e-02  1.189e-02    1.811  0.07022 .  
## censustract109.02  2.441e-01  4.095e-02    5.962 2.52e-09 ***
## censustract110     2.802e-01  9.688e-03   28.927  < 2e-16 ***
## censustract111    -4.839e-03  9.008e-03   -0.537  0.59115    
## censustract112.01 -9.269e-02  1.092e-02   -8.485  < 2e-16 ***
## censustract112.02 -2.687e-03  1.085e-02   -0.248  0.80442    
## censustract113.01 -1.260e-01  1.032e-02  -12.209  < 2e-16 ***
## censustract113.02 -5.746e-02  1.927e-02   -2.983  0.00286 ** 
## censustract113.03  2.548e-02  1.193e-02    2.137  0.03264 *  
## censustract114    -4.910e-01  1.067e-02  -46.022  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3041 on 30236 degrees of freedom
## Multiple R-squared:  0.7373, Adjusted R-squared:  0.7371 
## F-statistic:  4040 on 21 and 30236 DF,  p-value: < 2.2e-16

Use the relevel() function to set a new reference level Let’s set the reference level to “109.02”

homes$censustract <- relevel(homes$censustract, ref = "109.02")
m4 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract, data = homes)
summary(m4)
## 
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract, 
##     data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1120 -0.1402  0.0097  0.1596  3.1575 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.208e+01  4.051e-02 298.095  < 2e-16 ***
## finsqft            4.523e-04  1.987e-06 227.605  < 2e-16 ***
## lotsize            4.516e-03  8.368e-05  53.962  < 2e-16 ***
## censustract101    -2.441e-01  4.095e-02  -5.962 2.52e-09 ***
## censustract102.01 -3.340e-01  4.088e-02  -8.169 3.22e-16 ***
## censustract102.02 -1.414e-01  4.125e-02  -3.427 0.000612 ***
## censustract103    -2.410e-01  4.070e-02  -5.922 3.22e-09 ***
## censustract104.01 -3.800e-01  4.105e-02  -9.256  < 2e-16 ***
## censustract104.02 -2.391e-01  4.111e-02  -5.817 6.05e-09 ***
## censustract105    -1.926e-01  4.109e-02  -4.687 2.78e-06 ***
## censustract106.01 -1.931e-01  4.112e-02  -4.695 2.67e-06 ***
## censustract106.02 -2.346e-01  4.107e-02  -5.711 1.13e-08 ***
## censustract107    -4.380e-01  4.125e-02 -10.619  < 2e-16 ***
## censustract108    -2.226e-01  4.136e-02  -5.382 7.42e-08 ***
## censustract110     3.612e-02  4.078e-02   0.886 0.375800    
## censustract111    -2.490e-01  4.062e-02  -6.129 8.97e-10 ***
## censustract112.01 -3.368e-01  4.112e-02  -8.191 2.68e-16 ***
## censustract112.02 -2.468e-01  4.108e-02  -6.008 1.90e-09 ***
## censustract113.01 -3.702e-01  4.096e-02  -9.037  < 2e-16 ***
## censustract113.02 -3.016e-01  4.405e-02  -6.846 7.72e-12 ***
## censustract113.03 -2.186e-01  4.137e-02  -5.285 1.27e-07 ***
## censustract114    -7.351e-01  4.106e-02 -17.902  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3041 on 30236 degrees of freedom
## Multiple R-squared:  0.7373, Adjusted R-squared:  0.7371 
## F-statistic:  4040 on 21 and 30236 DF,  p-value: < 2.2e-16
anova(m4)
## Analysis of Variance Table
## 
## Response: log(totalvalue)
##                Df Sum Sq Mean Sq  F value    Pr(>F)    
## finsqft         1 7039.4  7039.4 76105.35 < 2.2e-16 ***
## lotsize         1  174.5   174.5  1886.41 < 2.2e-16 ***
## censustract    19  633.8    33.4   360.65 < 2.2e-16 ***
## Residuals   30236 2796.7     0.1                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# reset levels back to previous order with "101" first
homes$censustract <- factor(as.character(homes$censustract))

YOUR TURN #3

# Add the cooling variable to model `m4` and save it as `m5`.
m5 <- lm()

# Does it help the model?

# What is the interpretation of the cooling coefficient? 

Interactions and Effect Plots [Video, Part 5]

Often it isn’t reasonable to expect the effect of one variable to be the same regardless of other variables. For example, the effect of finsqft may depend on censustract. Or the effect of lotsize may depend on finsqft.

Use the colon to specify interactions on an individual basis.

Example: finsqft:censustract Example: finsqft:lotsize

Does the effect of finsqft depend on censustract? Does the effect of finsqft depend on lotsize?

m6 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract + cooling + 
           finsqft:censustract + finsqft:lotsize,
         data = homes)
summary(m6)
## 
## Call:
## lm(formula = log(totalvalue) ~ finsqft + lotsize + censustract + 
##     cooling + finsqft:censustract + finsqft:lotsize, data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2210 -0.1409  0.0027  0.1500  3.3844 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.163e+01  1.540e-02 755.166  < 2e-16 ***
## finsqft                    4.436e-04  6.346e-06  69.900  < 2e-16 ***
## lotsize                    8.641e-03  1.355e-04  63.775  < 2e-16 ***
## censustract102.01         -1.180e-01  2.222e-02  -5.309 1.11e-07 ***
## censustract102.02          2.029e-01  2.623e-02   7.735 1.07e-14 ***
## censustract103             4.532e-02  2.428e-02   1.866 0.062045 .  
## censustract104.01         -2.104e-01  2.167e-02  -9.705  < 2e-16 ***
## censustract104.02          6.680e-02  2.475e-02   2.699 0.006953 ** 
## censustract105             7.831e-02  2.724e-02   2.874 0.004052 ** 
## censustract106.01          2.082e-01  2.896e-02   7.188 6.74e-13 ***
## censustract106.02         -1.657e-01  2.701e-02  -6.136 8.59e-10 ***
## censustract107            -3.593e-02  3.536e-02  -1.016 0.309587    
## censustract108            -2.072e-02  2.637e-02  -0.786 0.431923    
## censustract109.02          5.477e-01  1.434e-01   3.821 0.000133 ***
## censustract110             3.590e-01  2.124e-02  16.901  < 2e-16 ***
## censustract111            -9.553e-02  2.061e-02  -4.635 3.58e-06 ***
## censustract112.01         -1.959e-01  2.258e-02  -8.677  < 2e-16 ***
## censustract112.02         -3.515e-02  2.337e-02  -1.504 0.132650    
## censustract113.01         -1.526e-01  2.231e-02  -6.842 7.95e-12 ***
## censustract113.02          2.682e-02  5.436e-02   0.493 0.621713    
## censustract113.03         -1.187e-02  4.043e-02  -0.294 0.769024    
## censustract114            -5.333e-01  2.203e-02 -24.207  < 2e-16 ***
## coolingCentral Air         2.742e-01  5.641e-03  48.604  < 2e-16 ***
## finsqft:censustract102.01  9.641e-07  9.997e-06   0.096 0.923172    
## finsqft:censustract102.02 -4.952e-05  9.794e-06  -5.057 4.29e-07 ***
## finsqft:censustract103    -4.440e-05  1.130e-05  -3.930 8.50e-05 ***
## finsqft:censustract104.01  4.148e-05  8.953e-06   4.633 3.61e-06 ***
## finsqft:censustract104.02 -3.481e-05  9.180e-06  -3.792 0.000150 ***
## finsqft:censustract105    -2.810e-05  1.137e-05  -2.471 0.013496 *  
## finsqft:censustract106.01 -1.029e-04  1.344e-05  -7.657 1.96e-14 ***
## finsqft:censustract106.02  6.919e-05  1.289e-05   5.370 7.94e-08 ***
## finsqft:censustract107    -1.449e-04  2.247e-05  -6.447 1.16e-10 ***
## finsqft:censustract108     1.522e-06  1.168e-05   0.130 0.896275    
## finsqft:censustract109.02 -1.584e-04  6.375e-05  -2.485 0.012969 *  
## finsqft:censustract110    -3.983e-05  7.989e-06  -4.986 6.20e-07 ***
## finsqft:censustract111     3.277e-05  8.805e-06   3.721 0.000198 ***
## finsqft:censustract112.01  6.592e-05  9.826e-06   6.709 2.00e-11 ***
## finsqft:censustract112.02  7.912e-06  9.101e-06   0.869 0.384621    
## finsqft:censustract113.01  1.180e-05  1.080e-05   1.093 0.274458    
## finsqft:censustract113.02 -6.905e-05  3.491e-05  -1.978 0.047937 *  
## finsqft:censustract113.03 -7.817e-06  2.273e-05  -0.344 0.730884    
## finsqft:censustract114     5.104e-05  1.085e-05   4.703 2.57e-06 ***
## finsqft:lotsize           -1.244e-06  3.531e-08 -35.235  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2851 on 30215 degrees of freedom
## Multiple R-squared:  0.7693, Adjusted R-squared:  0.769 
## F-statistic:  2399 on 42 and 30215 DF,  p-value: < 2.2e-16

Is the interaction “significant”? In other words, does it make the model better? Does the interaction make the model better at explaining the variability in totalvalue. Again the partial F test can help us assess this.

The car package’s Anova function may be preferred in this case. It assesses an interactions contribution with all other predictors in the model. The base R anova function is sequential. It assesses the contribution of a predictor given that other predictors before it are already in the model.

anova(m6)       # Type I Sum of Squares
## Analysis of Variance Table
## 
## Response: log(totalvalue)
##                        Df Sum Sq Mean Sq   F value    Pr(>F)    
## finsqft                 1 7039.4  7039.4 86617.890 < 2.2e-16 ***
## lotsize                 1  174.5   174.5  2146.979 < 2.2e-16 ***
## censustract            19  633.8    33.4   410.467 < 2.2e-16 ***
## cooling                 1  209.1   209.1  2572.385 < 2.2e-16 ***
## finsqft:censustract    19   31.2     1.6    20.191 < 2.2e-16 ***
## finsqft:lotsize         1  100.9   100.9  1241.527 < 2.2e-16 ***
## Residuals           30215 2455.6     0.1                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(m6)  # Type II Sum of Squares
## Anova Table (Type II tests)
## 
## Response: log(totalvalue)
##                     Sum Sq    Df   F value    Pr(>F)    
## finsqft             3877.2     1 47708.102 < 2.2e-16 ***
## lotsize              289.8     1  3566.453 < 2.2e-16 ***
## censustract          523.9    19   339.272 < 2.2e-16 ***
## cooling              192.0     1  2362.371 < 2.2e-16 ***
## finsqft:censustract   40.5    19    26.209 < 2.2e-16 ***
## finsqft:lotsize      100.9     1  1241.527 < 2.2e-16 ***
## Residuals           2455.6 30215                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Either way, the interactions seem to be warranted.

How do we interpret? This is where effect plots help. Here we use ggpredict().

plot(ggpredict(m6, terms = c("finsqft", "lotsize")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

We can set our own values for lotsize

plot(ggpredict(m6, terms = c("finsqft", "lotsize[0.5, 10, 100]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Format the labels to show dollar amounts

plot(ggpredict(m6, terms = c("finsqft", "lotsize[0.5, 10, 100]")), 
     labels = scales::dollar)
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

The effect of finsqft decreases as lotsize gets larger (around 100 acres). Also notice most of the plot is devoted to the handful of homes over 2.5 million dollars. If we like, we can zoom in on the plot and look at totalvalues from 0 to $1,000,000. We don’t see the interaction when zoomed in. The lines look roughly parallel.

plot(ggpredict(m6, terms = c("finsqft", "lotsize[0.5, 10, 100]")), 
     labels = scales::dollar) +
  coord_cartesian(ylim = c(0, 1e6))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

Switch the order of the terms to put lotsize on the x-axis

plot(ggpredict(m6, terms = c("lotsize[0:200]", "finsqft[1500, 3000, 4500]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

The effect of lotsize seems more pronounced for smaller homes.

Effect plots for censustract is more difficult because there are so many levels (20). Therefore we need to look at a few at a time.

plot(ggpredict(m6, terms = c("finsqft[0:5000]", 
                             "censustract[113.01, 113.02, 113.03]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

plot(ggpredict(m6, terms = c("finsqft[0:5000]", 
                             "censustract[101, 107, 108, 110, 111]")))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

The ggpredict() result can be saved and used as a data frame with ggplot

eff_out <- ggpredict(m6, terms = c("finsqft[0:5000]", 
                        "censustract[101, 107, 108, 110, 111]"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
names(eff_out)
## [1] "x"         "predicted" "std.error" "conf.low"  "conf.high" "group"
eff_out <- eff_out %>% 
  rename(`Census Tract` = group)
ggplot(eff_out, aes(x = x, y = predicted, color = `Census Tract`)) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = `Census Tract`), 
              alpha = 1/5) +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Finished Square Feet")

When should you include interactions? What kind? How many? That requires some thought and expertise in the subject.

YOUR TURN #4

# Add an interaction for cooling and finsqft to `m6` and save as `m7`.
m7 <- lm()

# Is the interaction significant? 

# Create an effect plot to visualize the interaction of cooling and finsqft (or
# lack thereof).

Non-linear effects [Video, Part 6]

Often the simple assumption of a linear effect of a predictor is unrealistic. Fortunately there are ways to fit non-linear effects in a linear model.

Let’s demostrate with some simulated data

# polynomial of degree 2
x <- seq(from = -10, to = 10, length.out = 100)
set.seed(3)
y <- 1.2 + 2*x + 0.9*x^2 + rnorm(100, sd = 10)
nl_dat <- data.frame(y, x)
plot(x, y)

Fit a polynomial model with the poly() function

mod4 <- lm(y ~ poly(x, 2), data = nl_dat)
summary(mod4)
## 
## Call:
## lm(formula = y ~ poly(x, 2), data = nl_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.3937  -6.1342   0.3919   7.0327  16.3910 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.9164     0.8554   37.31   <2e-16 ***
## poly(x, 2)1 126.9080     8.5539   14.84   <2e-16 ***
## poly(x, 2)2 266.4715     8.5539   31.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.554 on 97 degrees of freedom
## Multiple R-squared:  0.9247, Adjusted R-squared:  0.9231 
## F-statistic: 595.3 on 2 and 97 DF,  p-value: < 2.2e-16

Notice the coefficients are quite large. That’s because poly() generates orthogonal polynomials by default. That makes for numerical stability. Interpretation of coefficients isn’t really worth the effort.

We could fit the raw polynomials as follows and sort of “recover” the true values.

mod5 <- lm(y ~ poly(x, 2, raw = TRUE), data = nl_dat)
summary(mod5)
## 
## Call:
## lm(formula = y ~ poly(x, 2, raw = TRUE), data = nl_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.3937  -6.1342   0.3919   7.0327  16.3910 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2.11953    1.28319   1.652    0.102    
## poly(x, 2, raw = TRUE)1  2.17624    0.14668  14.836   <2e-16 ***
## poly(x, 2, raw = TRUE)2  0.87621    0.02813  31.152   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.554 on 97 degrees of freedom
## Multiple R-squared:  0.9247, Adjusted R-squared:  0.9231 
## F-statistic: 595.3 on 2 and 97 DF,  p-value: < 2.2e-16

Use plot() and ggpredict() to plot fitted model with data

plot(ggpredict(mod4), add.data = TRUE)
## $x

plot(ggpredict(mod5), add.data = TRUE)
## $x

The modern approach to fitting non-linear effects is to use splines instead of polynomials.

Instead of poly() we can use the ns() function from the splines package. ns stands for natural splines. The second argument is the degrees of freedom. It may help to think of that as the number of times the smooth line changes directions.

mod6 <- lm(y ~ ns(x, df = 2), data = nl_dat)
summary(mod6)
## 
## Call:
## lm(formula = y ~ ns(x, df = 2), data = nl_dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.0764  -6.7423   0.1437   7.4876  18.2382 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.746      2.420   25.92   <2e-16 ***
## ns(x, df = 2)1  -77.728      5.433  -14.31   <2e-16 ***
## ns(x, df = 2)2   91.189      2.959   30.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.642 on 97 degrees of freedom
## Multiple R-squared:  0.9231, Adjusted R-squared:  0.9215 
## F-statistic: 582.2 on 2 and 97 DF,  p-value: < 2.2e-16
plot(ggpredict(mod6), add.data = TRUE)
## $x

Natural splines essentially allow us to fit a series of cubic polynomials connected at knots located in the range of our data.

The ns() function transforms our data similar to a polynomial transformation.

# simple example data
z <- 1:10      

# classic polynomial transformation
p2_out <- poly(z, 2, raw = TRUE)  
p2_out
##        1   2
##  [1,]  1   1
##  [2,]  2   4
##  [3,]  3   9
##  [4,]  4  16
##  [5,]  5  25
##  [6,]  6  36
##  [7,]  7  49
##  [8,]  8  64
##  [9,]  9  81
## [10,] 10 100
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly"   "matrix"
matplot(poly(z, 2, raw = TRUE))

cor(p2_out[,1], p2_out[,2])
## [1] 0.9745586
# orthogonal, or uncorrelated, transformation
p2_out <- poly(z, 2)  
p2_out
##                 1           2
##  [1,] -0.49543369  0.52223297
##  [2,] -0.38533732  0.17407766
##  [3,] -0.27524094 -0.08703883
##  [4,] -0.16514456 -0.26111648
##  [5,] -0.05504819 -0.34815531
##  [6,]  0.05504819 -0.34815531
##  [7,]  0.16514456 -0.26111648
##  [8,]  0.27524094 -0.08703883
##  [9,]  0.38533732  0.17407766
## [10,]  0.49543369  0.52223297
## attr(,"coefs")
## attr(,"coefs")$alpha
## [1] 5.5 5.5
## 
## attr(,"coefs")$norm2
## [1]   1.0  10.0  82.5 528.0
## 
## attr(,"degree")
## [1] 1 2
## attr(,"class")
## [1] "poly"   "matrix"
matplot(poly(z, 2))

cor(p2_out[,1], p2_out[,2])
## [1] -3.317659e-17
# natural spline transformation
ns_out <- ns(z, 2)                
ns_out
##               1           2
##  [1,] 0.0000000  0.00000000
##  [2,] 0.1674747 -0.10982084
##  [3,] 0.3219712 -0.20001558
##  [4,] 0.4505112 -0.25095810
##  [5,] 0.5401165 -0.24302231
##  [6,] 0.5783495 -0.15739986
##  [7,] 0.5652102  0.00590925
##  [8,] 0.5131362  0.22809668
##  [9,] 0.4351057  0.48953631
## [10,] 0.3440969  0.77060206
## attr(,"degree")
## [1] 3
## attr(,"knots")
## 50% 
## 5.5 
## attr(,"Boundary.knots")
## [1]  1 10
## attr(,"intercept")
## [1] FALSE
## attr(,"class")
## [1] "ns"     "basis"  "matrix"
matplot(ns(z, 2))

cor(ns_out[,1], ns_out[,2])
## [1] -0.04882638

Frank Harrell states in his book Regression Model Strategies that 3 to 5 DF is almost always sufficient. His basic advice is to allocate more DF to variables you think are more important.

How can we assess whether we need a non-linear effect? Partial residual plots can help. Also called term plots and component-residual plots. We’ll use the crPlots() function from the car package.

Partial-residual plots show the relationship between a predictor and the response variable given that the other predictors are also in the model.

Let’s fit a model with no interactions. crPlot() does not work for models with interactions.

m8 <- lm(log(totalvalue) ~ finsqft + lotsize + censustract + cooling,
         data = homes)

Notice we specify the numeric variables just as we would in a model.

The blue dashed line is the fitted slope.
The purple line is a smooth trend line.

A curving purple line indicates a non-linear effect may be justified.

crPlots(m8, terms = ~ finsqft)

crPlots(m8, terms = ~ lotsize)

# use a different smoother for the smooth trend line
crPlots(m8, terms = ~ lotsize, smooth=list(smoother=gamLine))

It appears both finsqft and lotsize may be affecting totalvalue in a non-linear fashion. Let’s fit a non-linear effect for finsqft using a natural spline with 3 DF.

m9 <- lm(log(totalvalue) ~ ns(finsqft, df = 3) + lotsize + censustract + cooling,
         data = homes)
summary(m9)
## 
## Call:
## lm(formula = log(totalvalue) ~ ns(finsqft, df = 3) + lotsize + 
##     censustract + cooling, data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9955 -0.1430 -0.0029  0.1473  3.2812 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.141e+01  1.697e-02 672.141  < 2e-16 ***
## ns(finsqft, df = 3)1  1.989e+00  1.248e-02 159.405  < 2e-16 ***
## ns(finsqft, df = 3)2  3.660e+00  4.318e-02  84.764  < 2e-16 ***
## ns(finsqft, df = 3)3  2.908e+00  4.584e-02  63.447  < 2e-16 ***
## lotsize               5.082e-03  7.916e-05  64.193  < 2e-16 ***
## censustract102.01    -1.308e-01  9.499e-03 -13.768  < 2e-16 ***
## censustract102.02     6.066e-02  1.080e-02   5.617 1.96e-08 ***
## censustract103       -7.335e-02  8.904e-03  -8.237  < 2e-16 ***
## censustract104.01    -1.280e-01  1.001e-02 -12.788  < 2e-16 ***
## censustract104.02    -2.734e-02  1.035e-02  -2.640  0.00828 ** 
## censustract105       -1.710e-02  1.032e-02  -1.657  0.09762 .  
## censustract106.01    -2.368e-02  1.046e-02  -2.265  0.02350 *  
## censustract106.02    -5.663e-02  1.027e-02  -5.512 3.57e-08 ***
## censustract107       -2.408e-01  1.095e-02 -21.991  < 2e-16 ***
## censustract108       -3.465e-02  1.123e-02  -3.084  0.00204 ** 
## censustract109.02     1.588e-01  3.844e-02   4.130 3.63e-05 ***
## censustract110        2.463e-01  9.113e-03  27.028  < 2e-16 ***
## censustract111       -5.451e-02  8.491e-03  -6.420 1.39e-10 ***
## censustract112.01    -7.046e-02  1.025e-02  -6.871 6.51e-12 ***
## censustract112.02    -2.694e-02  1.019e-02  -2.644  0.00821 ** 
## censustract113.01    -1.341e-01  9.702e-03 -13.825  < 2e-16 ***
## censustract113.02    -7.052e-02  1.811e-02  -3.894 9.89e-05 ***
## censustract113.03    -5.021e-02  1.134e-02  -4.429 9.51e-06 ***
## censustract114       -4.503e-01  1.003e-02 -44.904  < 2e-16 ***
## coolingCentral Air    2.238e-01  5.837e-03  38.338  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2853 on 30233 degrees of freedom
## Multiple R-squared:  0.7688, Adjusted R-squared:  0.7686 
## F-statistic:  4189 on 24 and 30233 DF,  p-value: < 2.2e-16

The coefficients are impossible to interpret. Effect plots are our only hope.

plot(ggpredict(m9, terms = "finsqft"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.

The partial-residual plot looks better

crPlots(m9, ~ns(finsqft, df = 3))

YOUR TURN #5

# Update model `m9` to include a non-linear effect for lotsize using a natural
# spline with 5 DF and save the new model as `m10`.
m10 <- lm()

# Generate an effect plot for the non-linear lotsize effect.


# Check the partial-residual plot for lotsize in model `m10`.

Comparing models [Video, Part 7]

Partial F test

How do we determine if one model is better than another?

One way is through hypothesis testing with the partial F test.

NULL: two models are equally “good”
ALTERNATIVE: the more complex model is better

Note: the smaller model must be a subset of the larger model!

We can run this test with the base R anova() function.

Example:

# The model from the previous exercise:
m10 <- lm(log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) + 
            censustract + cooling,
          data = homes) 

# a more complex model with an interactionvbetween ns(finsqft, df = 3) and
# ns(lotsize, df = 5)
m11 <- lm(log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) + 
            censustract + cooling + 
            ns(finsqft, df = 3):ns(lotsize, df = 5),
          data = homes) 

Test that both models are equally good at modeling totalvalue. Notice that m10 is a subset of m11

anova(m10, m11)
## Analysis of Variance Table
## 
## Model 1: log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) + 
##     censustract + cooling
## Model 2: log(totalvalue) ~ ns(finsqft, df = 3) + ns(lotsize, df = 5) + 
##     censustract + cooling + ns(finsqft, df = 3):ns(lotsize, df = 5)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1  30229 2286.6                                  
## 2  30214 2215.5 15    71.039 64.587 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It appears m11 is superior to m10 The low p-value says we reject the null; the bigger model appears to be better than the smaller model.

AIC and BIC

We can also use the AIC or BIC information criteria. These are not hypothesis tests. We simply see which has a lower value. These values estimate the out-of-sample accuracy if we were to use these models to make predictions on new data. Simply use the AIC and/or BIC functions with the model objects. The main difference between the two is that BIC imposes a heavier penalty for models with more predictors.

AIC(m10, m11)
##     df      AIC
## m10 30 7780.731
## m11 45 6855.754
BIC(m10, m11)
##     df      BIC
## m10 30 8030.257
## m11 45 7230.042

The nice thing about AIC/BIC is that we don’t need to worry about whether the models contain a subset of predictors. They just need to have the same response variable.

While lower is better, sometimes it’s hard to know how much lower AIC/BIC needs to be. If a really complex model is only slightly lower than an easier to understand smaller model, the smaller model may be preferable.

YOUR TURN #6

# Fit the following three models, each with increasing levels of complexity in
# the splines.
mod_a <- lm(log(totalvalue) ~ ns(finsqft, df = 3) * ns(lotsize, df = 3) + 
            censustract + cooling,
          data = homes) 
mod_b <- lm(log(totalvalue) ~ ns(finsqft, df = 4) * ns(lotsize, df = 4) + 
            censustract + cooling,
          data = homes) 
mod_c <- lm(log(totalvalue) ~ ns(finsqft, df = 5) * ns(lotsize, df = 5) + 
            censustract + cooling,
          data = homes) 

# Compare the models using `anova`, `AIC` and `BIC`. Which is better?

Using a linear model [Video, Part 8]

Once we have a linear model we may want to use it to make predictions for new data. We can do that with the predict() function. Note the new data needs to be in a data frame.

Let’s say we want to use this model:

m12 <- lm(log(totalvalue) ~ ns(finsqft, df = 5) * ns(lotsize, df = 5) + 
            censustract * ns(finsqft, df = 5) + cooling,
          data = homes)
summary(m12)
## 
## Call:
## lm(formula = log(totalvalue) ~ ns(finsqft, df = 5) * ns(lotsize, 
##     df = 5) + censustract * ns(finsqft, df = 5) + cooling, data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9684 -0.1316  0.0012  0.1377  3.3834 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                11.399136   0.152451  74.772
## ns(finsqft, df = 5)1                        0.870256   0.141428   6.153
## ns(finsqft, df = 5)2                        1.231419   0.162838   7.562
## ns(finsqft, df = 5)3                        1.892883   0.178687  10.593
## ns(finsqft, df = 5)4                        3.959110   0.731198   5.415
## ns(finsqft, df = 5)5                        4.704075   1.331463   3.533
## ns(lotsize, df = 5)1                       -1.588311   0.171270  -9.274
## ns(lotsize, df = 5)2                       -0.123545   0.131748  -0.938
## ns(lotsize, df = 5)3                        3.622165   0.801614   4.519
## ns(lotsize, df = 5)4                        7.487665   2.991355   2.503
## ns(lotsize, df = 5)5                       14.388095   6.533757   2.202
## censustract102.01                           1.233490   0.206541   5.972
## censustract102.02                           1.058578   0.232722   4.549
## censustract103                              0.090532   0.203144   0.446
## censustract104.01                           0.112442   0.126228   0.891
## censustract104.02                           0.027893   0.226213   0.123
## censustract105                              0.530627   0.226084   2.347
## censustract106.01                          -0.302680   0.335961  -0.901
## censustract106.02                           1.938215   0.225934   8.579
## censustract107                              1.388717   0.277131   5.011
## censustract108                              1.512674   0.282092   5.362
## censustract109.02                         -65.652267 130.495842  -0.503
## censustract110                              0.879675   0.173304   5.076
## censustract111                              0.636620   0.146789   4.337
## censustract112.01                           0.032162   0.120537   0.267
## censustract112.02                           0.012006   0.159288   0.075
## censustract113.01                          -0.004298   0.134484  -0.032
## censustract113.02                           2.543222   0.325047   7.824
## censustract113.03                           1.557342   0.317731   4.901
## censustract114                             -0.422166   0.126075  -3.349
## coolingCentral Air                          0.237083   0.005582  42.473
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)1   1.568478   0.160003   9.803
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)1   1.492416   0.180635   8.262
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)1   1.295066   0.182065   7.113
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)1   2.231213   0.754200   2.958
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)1  -1.504572   1.341948  -1.121
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)2   0.289468   0.122672   2.360
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)2   0.138072   0.141047   0.979
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)2   0.323484   0.167919   1.926
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)2  -0.068635   0.707302  -0.097
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)2  -1.483206   1.321174  -1.123
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)3  -0.258392   0.752757  -0.343
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)3   0.517601   0.852895   0.607
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)3  -4.274065   0.482510  -8.858
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)3  -0.251152   1.789260  -0.140
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)3  -3.480660   1.011815  -3.440
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)4  -1.673822   2.722666  -0.615
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)4  -9.178902   3.157359  -2.907
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)4   1.319223   1.392499   0.947
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)4 -18.732839   6.496717  -2.883
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)4  -5.967410   2.887742  -2.066
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)5  -6.608910   5.973005  -1.106
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)5 -22.013598   6.898977  -3.191
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)5   5.199070   2.997022   1.735
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)5 -39.413922  13.916752  -2.832
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)5  -2.239959   3.088149  -0.725
## ns(finsqft, df = 5)1:censustract102.01     -1.246213   0.193784  -6.431
## ns(finsqft, df = 5)2:censustract102.01     -1.353778   0.214052  -6.325
## ns(finsqft, df = 5)3:censustract102.01     -0.721572   0.131069  -5.505
## ns(finsqft, df = 5)4:censustract102.01     -2.629437   0.469900  -5.596
## ns(finsqft, df = 5)5:censustract102.01     -0.718808   0.387259  -1.856
## ns(finsqft, df = 5)1:censustract102.02     -0.981329   0.219482  -4.471
## ns(finsqft, df = 5)2:censustract102.02     -0.947081   0.238111  -3.977
## ns(finsqft, df = 5)3:censustract102.02     -0.627166   0.132490  -4.734
## ns(finsqft, df = 5)4:censustract102.02     -1.916222   0.501875  -3.818
## ns(finsqft, df = 5)5:censustract102.02     -0.564730   0.248293  -2.274
## ns(finsqft, df = 5)1:censustract103        -0.172455   0.192935  -0.894
## ns(finsqft, df = 5)2:censustract103        -0.079854   0.211199  -0.378
## ns(finsqft, df = 5)3:censustract103        -0.441121   0.181328  -2.433
## ns(finsqft, df = 5)4:censustract103         0.461762   0.948147   0.487
## ns(finsqft, df = 5)5:censustract103         0.514972   1.675707   0.307
## ns(finsqft, df = 5)1:censustract104.01     -0.196547   0.117192  -1.677
## ns(finsqft, df = 5)2:censustract104.01     -0.219508   0.133328  -1.646
## ns(finsqft, df = 5)3:censustract104.01      0.139010   0.098757   1.408
## ns(finsqft, df = 5)4:censustract104.01     -0.589098   0.294983  -1.997
## ns(finsqft, df = 5)5:censustract104.01     -0.250427   0.246021  -1.018
## ns(finsqft, df = 5)1:censustract104.02     -0.023345   0.212101  -0.110
## ns(finsqft, df = 5)2:censustract104.02      0.039158   0.232663   0.168
## ns(finsqft, df = 5)3:censustract104.02      0.065605   0.135283   0.485
## ns(finsqft, df = 5)4:censustract104.02     -0.092770   0.495217  -0.187
## ns(finsqft, df = 5)5:censustract104.02     -0.235059   0.305940  -0.768
## ns(finsqft, df = 5)1:censustract105        -0.428971   0.215560  -1.990
## ns(finsqft, df = 5)2:censustract105        -0.499048   0.232013  -2.151
## ns(finsqft, df = 5)3:censustract105        -0.212976   0.145587  -1.463
## ns(finsqft, df = 5)4:censustract105        -1.170131   0.510769  -2.291
## ns(finsqft, df = 5)5:censustract105        -0.928658   0.408919  -2.271
## ns(finsqft, df = 5)1:censustract106.01      0.439865   0.321445   1.368
## ns(finsqft, df = 5)2:censustract106.01      0.247408   0.343702   0.720
## ns(finsqft, df = 5)3:censustract106.01      0.027627   0.196092   0.141
## ns(finsqft, df = 5)4:censustract106.01      0.455052   0.727569   0.625
## ns(finsqft, df = 5)5:censustract106.01     -0.742363   0.502621  -1.477
## ns(finsqft, df = 5)1:censustract106.02     -1.829026   0.209948  -8.712
## ns(finsqft, df = 5)2:censustract106.02     -1.818738   0.237946  -7.643
## ns(finsqft, df = 5)3:censustract106.02     -1.351310   0.229008  -5.901
## ns(finsqft, df = 5)4:censustract106.02     -3.082205   1.323837  -2.328
## ns(finsqft, df = 5)5:censustract106.02      0.800719   2.429470   0.330
## ns(finsqft, df = 5)1:censustract107        -1.495867   0.252301  -5.929
## ns(finsqft, df = 5)2:censustract107        -1.591649   0.309368  -5.145
## ns(finsqft, df = 5)3:censustract107        -2.155146   0.563734  -3.823
## ns(finsqft, df = 5)4:censustract107        -2.509137   5.517352  -0.455
## ns(finsqft, df = 5)5:censustract107         1.018772  10.828179   0.094
## ns(finsqft, df = 5)1:censustract108        -1.307509   0.266543  -4.905
## ns(finsqft, df = 5)2:censustract108        -1.534222   0.289640  -5.297
## ns(finsqft, df = 5)3:censustract108        -0.577344   0.169913  -3.398
## ns(finsqft, df = 5)4:censustract108        -3.204861   0.601715  -5.326
## ns(finsqft, df = 5)5:censustract108        -1.472915   0.354155  -4.159
## ns(finsqft, df = 5)1:censustract109.02     65.529306 129.936438   0.504
## ns(finsqft, df = 5)2:censustract109.02     66.173941 130.717812   0.506
## ns(finsqft, df = 5)3:censustract109.02     35.888616  70.885012   0.506
## ns(finsqft, df = 5)4:censustract109.02    117.183098 255.623862   0.458
## ns(finsqft, df = 5)5:censustract109.02     10.993229  70.909936   0.155
## ns(finsqft, df = 5)1:censustract110        -0.537799   0.163605  -3.287
## ns(finsqft, df = 5)2:censustract110        -0.604673   0.178106  -3.395
## ns(finsqft, df = 5)3:censustract110        -0.424973   0.104434  -4.069
## ns(finsqft, df = 5)4:censustract110        -0.937982   0.380979  -2.462
## ns(finsqft, df = 5)5:censustract110         0.047934   0.217736   0.220
## ns(finsqft, df = 5)1:censustract111        -0.625394   0.136974  -4.566
## ns(finsqft, df = 5)2:censustract111        -0.631034   0.153347  -4.115
## ns(finsqft, df = 5)3:censustract111        -0.224152   0.111013  -2.019
## ns(finsqft, df = 5)4:censustract111        -1.076201   0.365718  -2.943
## ns(finsqft, df = 5)5:censustract111        -0.316953   0.392283  -0.808
## ns(finsqft, df = 5)1:censustract112.01     -0.086747   0.111223  -0.780
## ns(finsqft, df = 5)2:censustract112.01     -0.068064   0.128108  -0.531
## ns(finsqft, df = 5)3:censustract112.01      0.111269   0.102558   1.085
## ns(finsqft, df = 5)4:censustract112.01     -0.182073   0.298089  -0.611
## ns(finsqft, df = 5)5:censustract112.01     -0.065671   0.305253  -0.215
## ns(finsqft, df = 5)1:censustract112.02      0.051993   0.147560   0.352
## ns(finsqft, df = 5)2:censustract112.02     -0.050399   0.166245  -0.303
## ns(finsqft, df = 5)3:censustract112.02      0.382863   0.104095   3.678
## ns(finsqft, df = 5)4:censustract112.02     -0.335893   0.363788  -0.923
## ns(finsqft, df = 5)5:censustract112.02     -0.672419   0.246554  -2.727
## ns(finsqft, df = 5)1:censustract113.01     -0.060827   0.123707  -0.492
## ns(finsqft, df = 5)2:censustract113.01     -0.112603   0.143500  -0.785
## ns(finsqft, df = 5)3:censustract113.01     -0.387742   0.128849  -3.009
## ns(finsqft, df = 5)4:censustract113.01      0.020046   0.353479   0.057
## ns(finsqft, df = 5)5:censustract113.01      0.691911   0.436751   1.584
## ns(finsqft, df = 5)1:censustract113.02     -2.544365   0.290019  -8.773
## ns(finsqft, df = 5)2:censustract113.02     -2.501437   0.363376  -6.884
## ns(finsqft, df = 5)3:censustract113.02     -1.929204   0.547613  -3.523
## ns(finsqft, df = 5)4:censustract113.02     -4.030108   2.190199  -1.840
## ns(finsqft, df = 5)5:censustract113.02      0.864610   4.240977   0.204
## ns(finsqft, df = 5)1:censustract113.03     -1.498797   0.304521  -4.922
## ns(finsqft, df = 5)2:censustract113.03     -1.557835   0.331102  -4.705
## ns(finsqft, df = 5)3:censustract113.03     -1.304938   0.390905  -3.338
## ns(finsqft, df = 5)4:censustract113.03     -5.785313   2.271602  -2.547
## ns(finsqft, df = 5)5:censustract113.03     -5.855242   4.326510  -1.353
## ns(finsqft, df = 5)1:censustract114        -0.057624   0.115229  -0.500
## ns(finsqft, df = 5)2:censustract114         0.093185   0.135397   0.688
## ns(finsqft, df = 5)3:censustract114        -0.396467   0.122223  -3.244
## ns(finsqft, df = 5)4:censustract114         0.037872   0.305826   0.124
## ns(finsqft, df = 5)5:censustract114         0.270663   0.301350   0.898
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## ns(finsqft, df = 5)1                      7.68e-10 ***
## ns(finsqft, df = 5)2                      4.07e-14 ***
## ns(finsqft, df = 5)3                       < 2e-16 ***
## ns(finsqft, df = 5)4                      6.19e-08 ***
## ns(finsqft, df = 5)5                      0.000411 ***
## ns(lotsize, df = 5)1                       < 2e-16 ***
## ns(lotsize, df = 5)2                      0.348388    
## ns(lotsize, df = 5)3                      6.25e-06 ***
## ns(lotsize, df = 5)4                      0.012316 *  
## ns(lotsize, df = 5)5                      0.027665 *  
## censustract102.01                         2.37e-09 ***
## censustract102.02                         5.42e-06 ***
## censustract103                            0.655851    
## censustract104.01                         0.373052    
## censustract104.02                         0.901868    
## censustract105                            0.018930 *  
## censustract106.01                         0.367628    
## censustract106.02                          < 2e-16 ***
## censustract107                            5.44e-07 ***
## censustract108                            8.28e-08 ***
## censustract109.02                         0.614899    
## censustract110                            3.88e-07 ***
## censustract111                            1.45e-05 ***
## censustract112.01                         0.789606    
## censustract112.02                         0.939916    
## censustract113.01                         0.974505    
## censustract113.02                         5.28e-15 ***
## censustract113.03                         9.56e-07 ***
## censustract114                            0.000813 ***
## coolingCentral Air                         < 2e-16 ***
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)1  < 2e-16 ***
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)1  < 2e-16 ***
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)1 1.16e-12 ***
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)1 0.003095 ** 
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)1 0.262218    
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)2 0.018297 *  
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)2 0.327633    
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)2 0.054060 .  
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)2 0.922697    
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)2 0.261598    
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)3 0.731405    
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)3 0.543938    
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)3  < 2e-16 ***
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)3 0.888371    
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)3 0.000582 ***
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)4 0.538709    
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)4 0.003650 ** 
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)4 0.343453    
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)4 0.003936 ** 
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)4 0.038793 *  
## ns(finsqft, df = 5)1:ns(lotsize, df = 5)5 0.268535    
## ns(finsqft, df = 5)2:ns(lotsize, df = 5)5 0.001420 ** 
## ns(finsqft, df = 5)3:ns(lotsize, df = 5)5 0.082796 .  
## ns(finsqft, df = 5)4:ns(lotsize, df = 5)5 0.004627 ** 
## ns(finsqft, df = 5)5:ns(lotsize, df = 5)5 0.468249    
## ns(finsqft, df = 5)1:censustract102.01    1.29e-10 ***
## ns(finsqft, df = 5)2:censustract102.01    2.58e-10 ***
## ns(finsqft, df = 5)3:censustract102.01    3.72e-08 ***
## ns(finsqft, df = 5)4:censustract102.01    2.22e-08 ***
## ns(finsqft, df = 5)5:censustract102.01    0.063443 .  
## ns(finsqft, df = 5)1:censustract102.02    7.81e-06 ***
## ns(finsqft, df = 5)2:censustract102.02    6.98e-05 ***
## ns(finsqft, df = 5)3:censustract102.02    2.21e-06 ***
## ns(finsqft, df = 5)4:censustract102.02    0.000135 ***
## ns(finsqft, df = 5)5:censustract102.02    0.022946 *  
## ns(finsqft, df = 5)1:censustract103       0.371408    
## ns(finsqft, df = 5)2:censustract103       0.705359    
## ns(finsqft, df = 5)3:censustract103       0.014991 *  
## ns(finsqft, df = 5)4:censustract103       0.626251    
## ns(finsqft, df = 5)5:censustract103       0.758605    
## ns(finsqft, df = 5)1:censustract104.01    0.093525 .  
## ns(finsqft, df = 5)2:censustract104.01    0.099696 .  
## ns(finsqft, df = 5)3:censustract104.01    0.159260    
## ns(finsqft, df = 5)4:censustract104.01    0.045828 *  
## ns(finsqft, df = 5)5:censustract104.01    0.308729    
## ns(finsqft, df = 5)1:censustract104.02    0.912359    
## ns(finsqft, df = 5)2:censustract104.02    0.866344    
## ns(finsqft, df = 5)3:censustract104.02    0.627718    
## ns(finsqft, df = 5)4:censustract104.02    0.851401    
## ns(finsqft, df = 5)5:censustract104.02    0.442303    
## ns(finsqft, df = 5)1:censustract105       0.046597 *  
## ns(finsqft, df = 5)2:censustract105       0.031488 *  
## ns(finsqft, df = 5)3:censustract105       0.143510    
## ns(finsqft, df = 5)4:censustract105       0.021975 *  
## ns(finsqft, df = 5)5:censustract105       0.023153 *  
## ns(finsqft, df = 5)1:censustract106.01    0.171197    
## ns(finsqft, df = 5)2:censustract106.01    0.471633    
## ns(finsqft, df = 5)3:censustract106.01    0.887959    
## ns(finsqft, df = 5)4:censustract106.01    0.531686    
## ns(finsqft, df = 5)5:censustract106.01    0.139691    
## ns(finsqft, df = 5)1:censustract106.02     < 2e-16 ***
## ns(finsqft, df = 5)2:censustract106.02    2.18e-14 ***
## ns(finsqft, df = 5)3:censustract106.02    3.66e-09 ***
## ns(finsqft, df = 5)4:censustract106.02    0.019906 *  
## ns(finsqft, df = 5)5:censustract106.02    0.741715    
## ns(finsqft, df = 5)1:censustract107       3.08e-09 ***
## ns(finsqft, df = 5)2:censustract107       2.69e-07 ***
## ns(finsqft, df = 5)3:censustract107       0.000132 ***
## ns(finsqft, df = 5)4:censustract107       0.649277    
## ns(finsqft, df = 5)5:censustract107       0.925042    
## ns(finsqft, df = 5)1:censustract108       9.37e-07 ***
## ns(finsqft, df = 5)2:censustract108       1.19e-07 ***
## ns(finsqft, df = 5)3:censustract108       0.000680 ***
## ns(finsqft, df = 5)4:censustract108       1.01e-07 ***
## ns(finsqft, df = 5)5:censustract108       3.21e-05 ***
## ns(finsqft, df = 5)1:censustract109.02    0.614042    
## ns(finsqft, df = 5)2:censustract109.02    0.612695    
## ns(finsqft, df = 5)3:censustract109.02    0.612654    
## ns(finsqft, df = 5)4:censustract109.02    0.646654    
## ns(finsqft, df = 5)5:censustract109.02    0.876798    
## ns(finsqft, df = 5)1:censustract110       0.001013 ** 
## ns(finsqft, df = 5)2:censustract110       0.000687 ***
## ns(finsqft, df = 5)3:censustract110       4.73e-05 ***
## ns(finsqft, df = 5)4:censustract110       0.013821 *  
## ns(finsqft, df = 5)5:censustract110       0.825759    
## ns(finsqft, df = 5)1:censustract111       5.00e-06 ***
## ns(finsqft, df = 5)2:censustract111       3.88e-05 ***
## ns(finsqft, df = 5)3:censustract111       0.043480 *  
## ns(finsqft, df = 5)4:censustract111       0.003256 ** 
## ns(finsqft, df = 5)5:censustract111       0.419114    
## ns(finsqft, df = 5)1:censustract112.01    0.435436    
## ns(finsqft, df = 5)2:censustract112.01    0.595214    
## ns(finsqft, df = 5)3:censustract112.01    0.277961    
## ns(finsqft, df = 5)4:censustract112.01    0.541337    
## ns(finsqft, df = 5)5:censustract112.01    0.829663    
## ns(finsqft, df = 5)1:censustract112.02    0.724578    
## ns(finsqft, df = 5)2:censustract112.02    0.761767    
## ns(finsqft, df = 5)3:censustract112.02    0.000235 ***
## ns(finsqft, df = 5)4:censustract112.02    0.355848    
## ns(finsqft, df = 5)5:censustract112.02    0.006390 ** 
## ns(finsqft, df = 5)1:censustract113.01    0.622930    
## ns(finsqft, df = 5)2:censustract113.01    0.432638    
## ns(finsqft, df = 5)3:censustract113.01    0.002621 ** 
## ns(finsqft, df = 5)4:censustract113.01    0.954776    
## ns(finsqft, df = 5)5:censustract113.01    0.113154    
## ns(finsqft, df = 5)1:censustract113.02     < 2e-16 ***
## ns(finsqft, df = 5)2:censustract113.02    5.94e-12 ***
## ns(finsqft, df = 5)3:censustract113.02    0.000427 ***
## ns(finsqft, df = 5)4:censustract113.02    0.065769 .  
## ns(finsqft, df = 5)5:censustract113.02    0.838456    
## ns(finsqft, df = 5)1:censustract113.03    8.62e-07 ***
## ns(finsqft, df = 5)2:censustract113.03    2.55e-06 ***
## ns(finsqft, df = 5)3:censustract113.03    0.000844 ***
## ns(finsqft, df = 5)4:censustract113.03    0.010877 *  
## ns(finsqft, df = 5)5:censustract113.03    0.175957    
## ns(finsqft, df = 5)1:censustract114       0.617019    
## ns(finsqft, df = 5)2:censustract114       0.491312    
## ns(finsqft, df = 5)3:censustract114       0.001181 ** 
## ns(finsqft, df = 5)4:censustract114       0.901446    
## ns(finsqft, df = 5)5:censustract114       0.369102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2647 on 30107 degrees of freedom
## Multiple R-squared:  0.8018, Adjusted R-squared:  0.8008 
## F-statistic: 811.9 on 150 and 30107 DF,  p-value: < 2.2e-16
sim_tv <- simulate(m12, nsim = 50)
plot(density(log(homes$totalvalue)))
for(i in 1:50)lines(density(sim_tv[[i]]), col = "red")

What is the expected mean totalvalue of a home with the following characteristics:

To answer this, first we need to enter this information into a data frame:

newdata <- data.frame(finsqft = 2500, 
                      lotsize = 1, 
                      cooling = "Central Air",
                      censustract = "101")

Next we use the predict function with our newdata data frame:

predict(m12, newdata = newdata) %>% exp()
##        1 
## 432957.1

with 95% confidence interval

predict(m12, newdata = newdata, interval = "confidence") %>% exp()
##        fit      lwr      upr
## 1 432957.1 419529.2 446814.8

To predict a SINGLE home price, we need to set interval = "prediction". Notice the predicted value is the same but the interval is much wider since we’re more uncertain about the expected value of a particular house:

predict(m12, newdata = newdata, interval = "prediction") %>% exp()
##        fit      lwr      upr
## 1 432957.1 257446.2 728120.3

Now let’s predict mean totalvalue for same measures but with lotsize ranging from 0 to 10 acreas.

newdata <- data.frame(finsqft = 2500, 
                      lotsize = 1:10, 
                      cooling = "Central Air",
                      censustract = "101")
head(newdata)
##   finsqft lotsize     cooling censustract
## 1    2500       1 Central Air         101
## 2    2500       2 Central Air         101
## 3    2500       3 Central Air         101
## 4    2500       4 Central Air         101
## 5    2500       5 Central Air         101
## 6    2500       6 Central Air         101
predict(m12, newdata = newdata, interval = "confidence") %>% exp()
##         fit      lwr      upr
## 1  432957.1 419529.2 446814.8
## 2  452393.5 441256.0 463812.1
## 3  462996.6 451627.3 474652.1
## 4  468040.7 456591.0 479777.5
## 5  472796.1 461287.4 484591.9
## 6  477579.0 466003.6 489441.9
## 7  482389.3 470739.3 494327.7
## 8  487226.9 475494.1 499249.1
## 9  492091.5 480267.7 504206.4
## 10 496983.2 485060.0 509199.5

This is basically how effect plots are created.

pred_out <- predict(m12, newdata = newdata, interval = "confidence") %>%
  exp() %>% 
  as.data.frame() %>% 
  mutate(lotsize = 1:10)

Now create the effect plot “by hand” with ggplot:

ggplot(pred_out, aes(x = lotsize, y = fit)) +
  geom_line() +
  geom_ribbon(mapping = aes(ymin = lwr, ymax = upr), alpha = 1/5) 

Here’s the same plot with ggpredict; use condition argument to set other predictors to specific values.

eff_out <- ggpredict(m12, terms = "lotsize[1:10]", 
                    condition = c(finsqft = 2500, 
                                  cooling = "Central Air",
                                  censustract = "101"))
## Model has log-transformed response. Back-transforming predictions to original response scale. Standard errors are still on the log-scale.
plot(eff_out)

YOUR TURN #7

Modify the code below to produce an effect plot for finsqft (with values set to 1500, 2000, 2500, 3000) and lotsize set to 0.5 acre. Leave cooling and censustract as is.

eff_out <- ggpredict(m12, terms = "lotsize[1:10]", 
                    condition = c(finsqft = 2500, 
                                  cooling = "Central Air",
                                  censustract = "101"))
plot(eff_out)

Communicating modeling results

The stargazer package can produce regression tables for articles and presentations. It can output tables in html, text or LaTeX. Below we demonstrate with text.

The basic usage is as follows

stargazer(m2, m3, type = "text", title = "Modeling Results")
## 
## Modeling Results
## ===============================================================================
##                                         Dependent variable:                    
##                     -----------------------------------------------------------
##                                           log(totalvalue)                      
##                                  (1)                           (2)             
## -------------------------------------------------------------------------------
## finsqft                       0.001***                      0.0005***          
##                               (0.00000)                     (0.00000)          
##                                                                                
## lotsize                                                     0.004***           
##                                                             (0.0001)           
##                                                                                
## Constant                      11.721***                     11.724***          
##                                (0.005)                       (0.005)           
##                                                                                
## -------------------------------------------------------------------------------
## Observations                   30,258                        30,258            
## R2                              0.661                         0.678            
## Adjusted R2                     0.661                         0.678            
## Residual Std. Error      0.345 (df = 30256)            0.337 (df = 30255)      
## F Statistic         59,080.390*** (df = 1; 30256) 31,811.080*** (df = 2; 30255)
## ===============================================================================
## Note:                                               *p<0.1; **p<0.05; ***p<0.01

Since our dependent variable has been log transformed, we may want to exponentiate the coefficients. We can do that with the apply.coef = exp argument.

stargazer(m2, m3, type = "text", title = "Modeling Results",
          apply.coef = exp)
## 
## Modeling Results
## ===============================================================================
##                                         Dependent variable:                    
##                     -----------------------------------------------------------
##                                           log(totalvalue)                      
##                                  (1)                           (2)             
## -------------------------------------------------------------------------------
## finsqft                       1.001***                      1.000***           
##                               (0.00000)                     (0.00000)          
##                                                                                
## lotsize                                                     1.004***           
##                                                             (0.0001)           
##                                                                                
## Constant                   123,168.200***                123,494.200***        
##                                (0.005)                       (0.005)           
##                                                                                
## -------------------------------------------------------------------------------
## Observations                   30,258                        30,258            
## R2                              0.661                         0.678            
## Adjusted R2                     0.661                         0.678            
## Residual Std. Error      0.345 (df = 30256)            0.337 (df = 30255)      
## F Statistic         59,080.390*** (df = 1; 30256) 31,811.080*** (df = 2; 30255)
## ===============================================================================
## Note:                                               *p<0.1; **p<0.05; ***p<0.01

The style argument allows you to create tables according to styles preferred by various journals. See ?stargazer

“ajps” = American Journal of Political Science

stargazer(m2, m3, 
          type = "text", 
          title = "Modeling Results",
          apply.coef = exp,
          style = "ajps")
## 
## Modeling Results
## -----------------------------------------------------------------------------
##                                          log(totalvalue)                     
##                               Model 1                      Model 2           
## -----------------------------------------------------------------------------
## finsqft                       1.001***                     1.000***          
##                              (0.00000)                    (0.00000)          
## lotsize                                                    1.004***          
##                                                            (0.0001)          
## Constant                   123168.200***                123494.200***        
##                               (0.005)                      (0.005)           
## N                              30258                        30258            
## R-squared                      0.661                        0.678            
## Adj. R-squared                 0.661                        0.678            
## Residual Std. Error      0.345 (df = 30256)           0.337 (df = 30255)     
## F Statistic         59080.390*** (df = 1; 30256) 31811.080*** (df = 2; 30255)
## -----------------------------------------------------------------------------
## ***p < .01; **p < .05; *p < .1

“asr” American Sociological Review

stargazer(m2, m3, 
          type = "text", 
          title = "Modeling Results",
          apply.coef = exp,
          style = "asr")
## 
## Modeling Results
## -------------------------------------------------------------------------------
##                                           log(totalvalue)                      
##                                  (1)                           (2)             
## -------------------------------------------------------------------------------
## finsqft                       1.001***                      1.000***           
## lotsize                                                     1.004***           
## Constant                   123,168.200***                123,494.200***        
## N                              30,258                        30,258            
## R2                              0.661                         0.678            
## Adjusted R2                     0.661                         0.678            
## Residual Std. Error      0.345 (df = 30256)            0.337 (df = 30255)      
## F Statistic         59,080.390*** (df = 1; 30256) 31,811.080*** (df = 2; 30255)
## -------------------------------------------------------------------------------
## *p < .05; **p < .01; ***p < .001

We can set the variable names with the covariate.labels argument.

stargazer(m2, m3, 
          type = "text", 
          title = "Modeling Results",
          style = "ajps", 
          apply.coef = exp,
          covariate.labels = c("Finished Sq Ft",
                               "Lot Size", 
                               "Intercept"))
## 
## Modeling Results
## -----------------------------------------------------------------------------
##                                          log(totalvalue)                     
##                               Model 1                      Model 2           
## -----------------------------------------------------------------------------
## Finished Sq Ft                1.001***                     1.000***          
##                              (0.00000)                    (0.00000)          
## Lot Size                                                   1.004***          
##                                                            (0.0001)          
## Intercept                  123168.200***                123494.200***        
##                               (0.005)                      (0.005)           
## N                              30258                        30258            
## R-squared                      0.661                        0.678            
## Adj. R-squared                 0.661                        0.678            
## Residual Std. Error      0.345 (df = 30256)           0.337 (df = 30255)     
## F Statistic         59080.390*** (df = 1; 30256) 31811.080*** (df = 2; 30255)
## -----------------------------------------------------------------------------
## ***p < .01; **p < .05; *p < .1

We can omit certain model statistics. Below we omit the F statistic. Notice we also we set the name of the dependent variable to display as “Total Value”

stargazer(m2, m3, 
          type = "text", 
          title = "Modeling Results",
          style = "ajps", 
          covariate.labels = c("Finished Sq Ft",
                               "Lot Size",
                               "Intercept"),
          omit.stat = "f",
          apply.coef = exp,
          dep.var.labels = "Total Value")
## 
## Modeling Results
## ---------------------------------------------------------
##                                  Total Value             
##                          Model 1            Model 2      
## ---------------------------------------------------------
## Finished Sq Ft           1.001***           1.000***     
##                         (0.00000)          (0.00000)     
## Lot Size                                    1.004***     
##                                             (0.0001)     
## Intercept             123168.200***      123494.200***   
##                          (0.005)            (0.005)      
## N                         30258              30258       
## R-squared                 0.661              0.678       
## Adj. R-squared            0.661              0.678       
## Residual Std. Error 0.345 (df = 30256) 0.337 (df = 30255)
## ---------------------------------------------------------
## ***p < .01; **p < .05; *p < .1

Again, see ?stargazer for many more options.

References

  • Faraway, J. (2005). Linear Models in R. London: Chapman & Hall.

  • Fox, J. (2002). A R and S-Plus Companion to Applied Regression. London: Sage.

  • Harrell, F. (2015). Regression Modeling Strategies (2nd ed.). New York: Springer.

  • Kutner, M., et al. (2005). Applied Linear Statistical Models (5th ed.). New York: McGraw-Hill.

  • Maindonald J., Braun, J.W. (2010). Data Analysis and Graphics Using R (3rd ed.). Cambridge: Cambridge Univ Press.