Goal

The goal of this session is to teach you how to conduct basic statistical analyses in R. We will cover descriptive statistics, exploratory data analysis, t-tests, ANOVA, and methods for discrete variable analysis along with hypothesis testing and assessing the assumptions of the tests.

Setting up

Hopefully you got here by double-clicking the .Rproj file in the Instructional Materials. If not, exit out of RStudio and enter in through the .Rproj file so that you are in the proper folder.

Remember that an RStudio project (specifically the .Rproj file) makes it straightforward to keep your work in its own working directory. Creating a project takes away some of the stress of navigating through file directories and file paths. A project encapsulates the source files, images, and anything else created during your R session.

The next step of setting up will be to load packages we will need today.

# install.packages("inspectdf")
# install.packages("broom")

library(tidyverse)
library(inspectdf)
library(broom)

Remember that if your console says something like “Error in library(packagename) : there is no package called ‘packagename’”, you will need to install the package using install.packages("packagename"). I like to put the install lines in my script, run them once to install the package, and then comment them out so that the installations do not get in the way of rendering the document (in case you want to do that).

As our final set-up step, let’s load in the homes dataset that we have been using for the past few weeks. This is the version after all of the cleaning steps we did with David in the Data Prep lesson.

We will use the read_csv() function from the tidyverse to get our data into R. Once we’ve loaded it, we can type the name of the object itself (homes) to see it printed to the screen.

homes <- read_csv("albemarle_homes.csv")
homes
## # A tibble: 31,228 x 27
##    tmp   yearbuilt yearremodeled usecode condition finsqft cooling bedroom
##    <chr>     <dbl>         <dbl> <chr>   <chr>       <dbl> <chr>     <dbl>
##  1 0030…      1939            NA Single… Fair         1596 No Cen…       3
##  2 0030…      1981            NA Single… Average      1616 No Cen…       6
##  3 0030…      1999            NA Single… Average       480 No Cen…       0
##  4 0030…      1754            NA Single… Substand…     828 No Cen…       1
##  5 0030…      1880            NA Single… Fair         1224 No Cen…       2
##  6 0050…      1769          1988 Single… Average      4875 Centra…       4
##  7 0050…      1818          1991 Single… Average      5573 Centra…       6
##  8 0060…      1935          1994 Single… Average      1120 No Cen…       3
##  9 0060…      1880          1978 Single… Average      1458 Centra…       3
## 10 0060…      2005            NA Single… Average      1888 Centra…       3
## # … with 31,218 more rows, and 19 more variables: fullbath <dbl>,
## #   halfbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## #   improvementsvalue <dbl>, totalvalue <dbl>, lastsaleprice <dbl>,
## #   lastsaledate1 <chr>, esdistrict <chr>, msdistrict <chr>, hsdistrict <chr>,
## #   censustract <dbl>, lastsaledate <date>, age <dbl>, condition2 <chr>,
## #   remodel <dbl>, fp <dbl>, landuse <dbl>

Take a look at that output. The nice thing about loading the tidyverse and reading data with read_csv() from readr is that the data are displayed in a much friendlier way. This dataset has 31K+ rows and 27 columns. When you import data this way and display the object in the console, you only see the first 10 rows by default instead of all of them. The printing behavior of read.csv(), by contrast, produces thousands of lines of output that are seldom helpful. Also, if you have so many columns that the data would wrap off the edge of your screen, those columns are not displayed, and the bottom of the output tells you which columns, if any, were hidden from view.
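If you do want to see more rows in the console, the tibble print method accepts an n argument. A small sketch (n = Inf would print every row, so use it with care):

# print the first 20 rows instead of the default 10
print(homes, n = 20)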

Remember that if you want to see the whole dataset, there are two ways to do this. First, you can click on the name of the data.frame in the Environment panel in RStudio. Or you could use the View() function (with a capital V).

# View(homes)

Note: I have commented out the View() function because its behavior when rendering an Rmd is not desirable. You could also change the Rmd chunk options to prevent this chunk from being evaluated.

Recall several built-in functions that are useful for working with data frames.

Let’s try a few:

glimpse(homes)
## Observations: 31,228
## Variables: 27
## $ tmp               <chr> "00300000000500", "003000000006B0", "00300000001000…
## $ yearbuilt         <dbl> 1939, 1981, 1999, 1754, 1880, 1769, 1818, 1935, 188…
## $ yearremodeled     <dbl> NA, NA, NA, NA, NA, 1988, 1991, 1994, 1978, NA, NA,…
## $ usecode           <chr> "Single Family", "Single Family", "Single Family", …
## $ condition         <chr> "Fair", "Average", "Average", "Substandard", "Fair"…
## $ finsqft           <dbl> 1596, 1616, 480, 828, 1224, 4875, 5573, 1120, 1458,…
## $ cooling           <chr> "No Central Air", "No Central Air", "No Central Air…
## $ bedroom           <dbl> 3, 6, 0, 1, 2, 4, 6, 3, 3, 3, 3, 3, 3, 3, 3, 4, 2, …
## $ fullbath          <dbl> 1, 4, 0, 0, 1, 3, 4, 1, 1, 3, 2, 2, 3, 3, 2, 1, 1, …
## $ halfbath          <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
## $ totalrooms        <dbl> 5, 9, 1, 3, 4, 8, 11, 5, 6, 6, 9, 0, 10, 8, 8, 6, 4…
## $ lotsize           <dbl> 55.955, 5.167, 80.750, 80.718, 65.000, 5.102, 453.8…
## $ landvalue         <dbl> 347900, 140300, 312500, 327100, 497000, 106400, 161…
## $ improvementsvalue <dbl> 80700, 212300, 21600, 13500, 113100, 827000, 802200…
## $ totalvalue        <dbl> 428600, 352600, 334100, 340600, 610100, 933400, 241…
## $ lastsaleprice     <dbl> 424000, 300000, 0, 0, 0, 0, 0, 2820000, 0, 0, 52000…
## $ lastsaledate1     <chr> "11/18/2003", "03/23/2012", "06/30/1994", "08/09/19…
## $ esdistrict        <chr> "Broadus Wood", "Broadus Wood", "Broadus Wood", "Br…
## $ msdistrict        <chr> "Jouett", "Jouett", "Jouett", "Jouett", "Jouett", "…
## $ hsdistrict        <chr> "Albemarle", "Albemarle", "Albemarle", "Albemarle",…
## $ censustract       <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 1…
## $ lastsaledate      <date> 2003-11-18, 2012-03-23, 1994-06-30, 1994-08-09, 20…
## $ age               <dbl> 80, 38, 20, 265, 139, 250, 201, 84, 139, 14, 15, 12…
## $ condition2        <chr> "Fair", "Average", "Average", "Substandard", "Fair"…
## $ remodel           <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fp                <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ landuse           <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, …
names(homes)
##  [1] "tmp"               "yearbuilt"         "yearremodeled"    
##  [4] "usecode"           "condition"         "finsqft"          
##  [7] "cooling"           "bedroom"           "fullbath"         
## [10] "halfbath"          "totalrooms"        "lotsize"          
## [13] "landvalue"         "improvementsvalue" "totalvalue"       
## [16] "lastsaleprice"     "lastsaledate1"     "esdistrict"       
## [19] "msdistrict"        "hsdistrict"        "censustract"      
## [22] "lastsaledate"      "age"               "condition2"       
## [25] "remodel"           "fp"                "landuse"
summary(homes)
##      tmp              yearbuilt    yearremodeled     usecode         
##  Length:31228       Min.   :1668   Min.   :   2    Length:31228      
##  Class :character   1st Qu.:1973   1st Qu.:1992    Class :character  
##  Mode  :character   Median :1989   Median :2002    Mode  :character  
##                     Mean   :1981   Mean   :1998                      
##                     3rd Qu.:2002   3rd Qu.:2008                      
##                     Max.   :2019   Max.   :2019                      
##                     NA's   :952    NA's   :28929                     
##   condition            finsqft       cooling             bedroom      
##  Length:31228       Min.   : 144   Length:31228       Min.   : 0.000  
##  Class :character   1st Qu.:1386   Class :character   1st Qu.: 3.000  
##  Mode  :character   Median :1860   Mode  :character   Median : 3.000  
##                     Mean   :2087                      Mean   : 3.386  
##                     3rd Qu.:2573                      3rd Qu.: 4.000  
##                     Max.   :9649                      Max.   :12.000  
##                                                       NA's   :1       
##     fullbath        halfbath        totalrooms        lotsize         
##  Min.   :0.000   Min.   :0.0000   Min.   : 0.000   Min.   :   0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000   1st Qu.: 6.000   1st Qu.:   0.1620  
##  Median :2.000   Median :1.0000   Median : 7.000   Median :   0.6855  
##  Mean   :2.324   Mean   :0.6323   Mean   : 7.073   Mean   :   4.7997  
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.: 8.000   3rd Qu.:   2.7480  
##  Max.   :9.000   Max.   :5.0000   Max.   :84.000   Max.   :1067.1500  
##  NA's   :1       NA's   :110      NA's   :9                           
##    landvalue       improvementsvalue   totalvalue      lastsaleprice     
##  Min.   :      0   Min.   :      0   Min.   :   4300   Min.   :       0  
##  1st Qu.:  80000   1st Qu.: 154600   1st Qu.: 245800   1st Qu.:       0  
##  Median : 117000   Median : 225100   Median : 344050   Median :  184000  
##  Mean   : 142752   Mean   : 286777   Mean   : 429529   Mean   :  246097  
##  3rd Qu.: 151900   3rd Qu.: 340025   3rd Qu.: 500200   3rd Qu.:  351000  
##  Max.   :6651000   Max.   :4953700   Max.   :7859000   Max.   :11925000  
##                                                        NA's   :1         
##  lastsaledate1       esdistrict         msdistrict         hsdistrict       
##  Length:31228       Length:31228       Length:31228       Length:31228      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   censustract     lastsaledate             age          condition2       
##  Min.   :101.0   Min.   :1754-01-01   Min.   :  0.00   Length:31228      
##  1st Qu.:103.0   1st Qu.:2001-12-21   1st Qu.: 17.00   Class :character  
##  Median :107.0   Median :2011-02-02   Median : 30.00   Mode  :character  
##  Mean   :107.5   Mean   :2004-08-02   Mean   : 37.71                     
##  3rd Qu.:111.0   3rd Qu.:2016-08-03   3rd Qu.: 46.00                     
##  Max.   :114.0   Max.   :2019-10-24   Max.   :351.00                     
##  NA's   :5                                                               
##     remodel              fp            landuse       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :1.0000   Median :0.00000  
##  Mean   :0.07362   Mean   :0.7296   Mean   :0.05229  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
## 

Descriptive statistics

We can access individual variables within a data frame using the $ operator, e.g., mydataframe$specificVariable.
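For example, here is a quick sketch pulling one column out as a plain vector and peeking at its first few values:

# extract the totalvalue column with $ and look at the first few values
head(homes$totalvalue)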

Descriptive statistics for categorical variables

Printing out all the esdistrict values in the data will not be useful. It will print 1000 rows and then stop, trying to warn us that we didn’t really want to do that.

# Display all esdistrict values
homes$esdistrict

Now, let’s see what are the unique values of esdistrict using base R and then using a dplyr pipeline

# Get the unique values of esdistrict
unique(homes$esdistrict)
##  [1] "Broadus Wood"     "Crozet"           "Meriwether Lewis" "Baker-Butler"    
##  [5] "Stony Point"      "Stone-Robinson"   "Unassigned"       "Brownsville"     
##  [9] "Greer"            "Agnor-Hurt"       "Woodbrook"        "Hollymead"       
## [13] "Murray"           "Red Hill"         "Cale"             "Scottsville"
# Do the same thing the dplyr way
homes %>% distinct(esdistrict)
## # A tibble: 16 x 1
##    esdistrict      
##    <chr>           
##  1 Broadus Wood    
##  2 Crozet          
##  3 Meriwether Lewis
##  4 Baker-Butler    
##  5 Stony Point     
##  6 Stone-Robinson  
##  7 Unassigned      
##  8 Brownsville     
##  9 Greer           
## 10 Agnor-Hurt      
## 11 Woodbrook       
## 12 Hollymead       
## 13 Murray          
## 14 Red Hill        
## 15 Cale            
## 16 Scottsville

Think about the difference between these two types of output, one produced by the base R function and one by the dplyr equivalent. Sometimes one format is vastly preferred over the other, so it is good to note the difference.

#create a table of frequencies
table(homes$esdistrict) #base r
## 
##       Agnor-Hurt     Baker-Butler     Broadus Wood      Brownsville 
##             2144             2944             2280             3147 
##             Cale           Crozet            Greer        Hollymead 
##             2676             1859             1437             1188 
## Meriwether Lewis           Murray         Red Hill      Scottsville 
##             2161             1697             1574             1452 
##   Stone-Robinson      Stony Point       Unassigned        Woodbrook 
##             3593             1311                5             1760
homes %>% count(esdistrict) #this returns a tibble
## # A tibble: 16 x 2
##    esdistrict           n
##    <chr>            <int>
##  1 Agnor-Hurt        2144
##  2 Baker-Butler      2944
##  3 Broadus Wood      2280
##  4 Brownsville       3147
##  5 Cale              2676
##  6 Crozet            1859
##  7 Greer             1437
##  8 Hollymead         1188
##  9 Meriwether Lewis  2161
## 10 Murray            1697
## 11 Red Hill          1574
## 12 Scottsville       1452
## 13 Stone-Robinson    3593
## 14 Stony Point       1311
## 15 Unassigned           5
## 16 Woodbrook         1760
#2-factor table or cross tabulation
table(homes$hsdistrict, homes$esdistrict)
##                    
##                     Agnor-Hurt Baker-Butler Broadus Wood Brownsville Cale
##   Albemarle               2144         2944         2280           0    0
##   Monticello                 0            0            0           0 2676
##   Unassigned                 0            0            0           0    0
##   Western Albemarle          0            0            0        3147    0
##                    
##                     Crozet Greer Hollymead Meriwether Lewis Murray Red Hill
##   Albemarle              0  1437      1188                0      0        0
##   Monticello             0     0         0                0      0     1574
##   Unassigned             0     0         0                0      0        0
##   Western Albemarle   1859     0         0             2161   1697        0
##                    
##                     Scottsville Stone-Robinson Stony Point Unassigned Woodbrook
##   Albemarle                   0              0         501          0      1760
##   Monticello               1452           3593         810          0         0
##   Unassigned                  0              0           0          5         0
##   Western Albemarle           0              0           0          0         0
homes %>% count(hsdistrict, esdistrict)
## # A tibble: 17 x 3
##    hsdistrict        esdistrict           n
##    <chr>             <chr>            <int>
##  1 Albemarle         Agnor-Hurt        2144
##  2 Albemarle         Baker-Butler      2944
##  3 Albemarle         Broadus Wood      2280
##  4 Albemarle         Greer             1437
##  5 Albemarle         Hollymead         1188
##  6 Albemarle         Stony Point        501
##  7 Albemarle         Woodbrook         1760
##  8 Monticello        Cale              2676
##  9 Monticello        Red Hill          1574
## 10 Monticello        Scottsville       1452
## 11 Monticello        Stone-Robinson    3593
## 12 Monticello        Stony Point        810
## 13 Unassigned        Unassigned           5
## 14 Western Albemarle Brownsville       3147
## 15 Western Albemarle Crozet            1859
## 16 Western Albemarle Meriwether Lewis  2161
## 17 Western Albemarle Murray            1697

Stony Point is the only elementary school that feeds into 2 different high schools.

Descriptive statistics for continuous variables

Now let’s calculate some descriptive stats for the totalvalue variable.

# measures of the middle: mean, median
mean(homes$totalvalue)
## [1] 429528.6
median(homes$totalvalue)
## [1] 344050
# measures of overall spread
range(homes$totalvalue)
## [1]    4300 7859000
sd(homes$totalvalue)
## [1] 347592
# the interquartile range (difference between the 75th and 25th percentiles)
IQR(homes$totalvalue)
## [1] 254400
# quartiles (points that divide data into quarters)
quantile(homes$totalvalue)
##      0%     25%     50%     75%    100% 
##    4300  245800  344050  500200 7859000
# 90th percentile of total value
quantile(homes$totalvalue, probs = .90)
##    90% 
## 739930
# a distributional summary
summary(homes$totalvalue)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4300  245800  344050  429529  500200 7859000

What are your conclusions about this variable?

Recall the following about the mean and the median:

  • The median is the middle of the sorted data
  • The mean is the “balance point” of the data
  • Symmetric data have similar means and medians

You can calculate these same descriptive statistics using dplyr, but remember, this returns a single-row, single-column tibble, not a single scalar value like the above. This is only really useful in the context of grouping and summarizing.

# Compute the median totalvalue
homes %>% 
  summarize(median(totalvalue))
## # A tibble: 1 x 1
##   `median(totalvalue)`
##                  <dbl>
## 1               344050
# Now grouped by other variables 
homes %>% 
  group_by(esdistrict) %>% 
  summarize(med = median(totalvalue))
## # A tibble: 16 x 2
##    esdistrict          med
##    <chr>             <dbl>
##  1 Agnor-Hurt       277400
##  2 Baker-Butler     295400
##  3 Broadus Wood     383000
##  4 Brownsville      409700
##  5 Cale             337000
##  6 Crozet           275000
##  7 Greer            306100
##  8 Hollymead        349550
##  9 Meriwether Lewis 544000
## 10 Murray           729700
## 11 Red Hill         249000
## 12 Scottsville      192350
## 13 Stone-Robinson   364900
## 14 Stony Point      393200
## 15 Unassigned       685500
## 16 Woodbrook        330350
# let's order that output
homes %>% 
  group_by(esdistrict) %>% 
  summarize(med = median(totalvalue)) %>%
  arrange(desc(med))
## # A tibble: 16 x 2
##    esdistrict          med
##    <chr>             <dbl>
##  1 Murray           729700
##  2 Unassigned       685500
##  3 Meriwether Lewis 544000
##  4 Brownsville      409700
##  5 Stony Point      393200
##  6 Broadus Wood     383000
##  7 Stone-Robinson   364900
##  8 Hollymead        349550
##  9 Cale             337000
## 10 Woodbrook        330350
## 11 Greer            306100
## 12 Baker-Butler     295400
## 13 Agnor-Hurt       277400
## 14 Crozet           275000
## 15 Red Hill         249000
## 16 Scottsville      192350

Missing data

Let’s try taking the mean of a different variable, either the dplyr way or the simpler $ way.

# the dplyr way: returns a single-row single-column tibble/dataframe
homes %>% summarize(mean(totalrooms))
## # A tibble: 1 x 1
##   `mean(totalrooms)`
##                <dbl>
## 1                 NA
# returns a single value
mean(homes$totalrooms)
## [1] NA

What happened there? NA indicates missing data. Take a look at a summary of the totalrooms variable.

summary(homes$totalrooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.000   7.000   7.073   8.000  84.000       9

Notice that there are 9 missing values for totalrooms. Trying to take the mean of a bunch of observations when some are missing returns a missing value by default. This is almost universally the case with summary statistics – a single NA will cause the summary to return NA. Now look at the help for ?mean. Notice the na.rm argument. This is a logical (i.e., TRUE or FALSE) value indicating whether or not missing values should be removed prior to computing the mean. By default, it’s set to FALSE. Now try it again.

mean(homes$totalrooms, na.rm = TRUE)
## [1] 7.073032

The is.na() function tells you if a value is missing. Get the sum() of that vector, which adds up all the TRUEs to tell you how many of the values are missing.

is.na(homes$totalrooms)
sum(is.na(homes$totalrooms))
## [1] 9
# proportion missing
sum(is.na(homes$totalrooms)) / length(homes$totalrooms)
## [1] 0.0002882029

What if we want to know the amount of missing data in each column? There’s a package for that! Let’s use the inspect_na() function in the inspectdf package to help. Remember that if you get an error message that the function cannot be found, go back to the top of the script and try the library(inspectdf) line. If that fails, run install.packages("inspectdf").

inspect_na(homes) %>% print(n = Inf)
## # A tibble: 27 x 3
##    col_name            cnt     pcnt
##    <chr>             <int>    <dbl>
##  1 yearremodeled     28929 92.6    
##  2 yearbuilt           952  3.05   
##  3 halfbath            110  0.352  
##  4 totalrooms            9  0.0288 
##  5 censustract           5  0.0160 
##  6 bedroom               1  0.00320
##  7 fullbath              1  0.00320
##  8 lastsaleprice         1  0.00320
##  9 tmp                   0  0      
## 10 usecode               0  0      
## 11 condition             0  0      
## 12 finsqft               0  0      
## 13 cooling               0  0      
## 14 lotsize               0  0      
## 15 landvalue             0  0      
## 16 improvementsvalue     0  0      
## 17 totalvalue            0  0      
## 18 lastsaledate1         0  0      
## 19 esdistrict            0  0      
## 20 msdistrict            0  0      
## 21 hsdistrict            0  0      
## 22 lastsaledate          0  0      
## 23 age                   0  0      
## 24 condition2            0  0      
## 25 remodel               0  0      
## 26 fp                    0  0      
## 27 landuse               0  0

This function provides the count of missings and the proportion missing for each variable, sorted by amount missing.
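If you prefer to avoid an extra package, a rough base-R equivalent is sketched below; it gives the counts of missing values per column (not the percentages):

# count missing values in every column, sorted from most to least missing
sort(colSums(is.na(homes)), decreasing = TRUE)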

What to do about missing data?

This is a tough question, but an important one. If you have missing data and are wondering what to do about it, please request a consult with one of the PhD+ instructors, since our advice will depend a lot on the situation (how much is missing, is it missing at random or not, what are you trying to do with the data, etc.)

EDA

Now, let’s talk about exploratory data analysis (EDA).

It’s always worth examining your data visually before you start any statistical analysis or hypothesis testing. We already spent an entire day on data visualizations, so I’ll just remind us of a few of the big ones here: histograms and scatterplots. We will also re-familiarize ourselves with density plots when we go through some stats in a few minutes.

Histograms

We can learn a lot from the data just looking at the distributions of particular variables. Let’s make some basic histograms with ggplot2.

ggplot(homes, aes(finsqft)) + geom_histogram()

How would you describe the shape of this distribution?

What about age?

ggplot(homes, aes(age)) + geom_histogram() 

What do you notice?

What is going on with age around 260?

filter(homes, age < 270 & age > 250)
## # A tibble: 419 x 27
##    tmp   yearbuilt yearremodeled usecode condition finsqft cooling bedroom
##    <chr>     <dbl>         <dbl> <chr>   <chr>       <dbl> <chr>     <dbl>
##  1 0030…      1754            NA Single… Substand…     828 No Cen…       1
##  2 0070…      1754            NA Single… Substand…     863 No Cen…       0
##  3 0090…      1754            NA Single… Fair         1000 No Cen…       3
##  4 0090…      1754          1993 Single… Average      1859 No Cen…       3
##  5 0090…      1754            NA Single… Fair          520 No Cen…       1
##  6 0090…      1754            NA Single… Average      1580 No Cen…       4
##  7 0100…      1754            NA Single… Fair          768 No Cen…       2
##  8 0100…      1754            NA Single… Poor          629 No Cen…       2
##  9 0110…      1754          1987 Single… Average      3236 Centra…       4
## 10 0130…      1754            NA Single… Substand…     912 No Cen…       0
## # … with 409 more rows, and 19 more variables: fullbath <dbl>, halfbath <dbl>,
## #   totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## #   totalvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## #   esdistrict <chr>, msdistrict <chr>, hsdistrict <chr>, censustract <dbl>,
## #   lastsaledate <date>, age <dbl>, condition2 <chr>, remodel <dbl>, fp <dbl>,
## #   landuse <dbl>

They do seem to be real cases. I would love to know what happened in Charlottesville in 1754 to prompt so many houses to be built (that are still around today)!

Scatterplots

Let’s look at how a few different variables relate to each other. E.g., lotsize and finsqft:

ggplot(homes, aes(lotsize, finsqft)) + geom_point()

Some very large homes sit on small lots. But (I think) lot size is in acres, so really we are looking at a few properties that are > 300 acres.

Stats with Continuous variables

One of the most well-known basic statistical tests is the t-test. A t-test is used when we want to assess the difference between 2 group means.

T-tests assume a few things about the data:

  1. Data were randomly sampled from their populations
  2. two groups are independent from one another
  3. groups were sampled from populations with normal distributions (symmetrical, bell-shaped)
  4. variance is equal for the two populations

As we go, we will learn what to do if assumptions do not hold, but for now, here is a brief flowchart (the corresponding R calls are sketched just after this list):

  • Random sampling – if not met, no statistics will help you
  • Independent samples – if not met, use a paired t-test
  • Normality – if not met, use a Wilcoxon-Mann-Whitney U test
  • Equal variance – if not met, use a Welch’s t-test
  • If independent samples and normality are both not met – use a Wilcoxon signed rank test
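Here is that sketch of the corresponding R calls. The names outcome, group, and dat are placeholders (not objects in this lesson), so the lines are left commented out; we will see the real calls on our own data shortly.

# pooled-variance two-sample t-test (assumes equal variances)
# t.test(outcome ~ group, data = dat, var.equal = TRUE)

# Welch's t-test (the default; does not assume equal variances)
# t.test(outcome ~ group, data = dat)

# Wilcoxon-Mann-Whitney U test (non-parametric alternative)
# wilcox.test(outcome ~ group, data = dat)

# paired t-test / paired Wilcoxon (signed rank) test, given two measurements per subject
# t.test(dat$measure_time1, dat$measure_time2, paired = TRUE)
# wilcox.test(dat$measure_time1, dat$measure_time2, paired = TRUE)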

T-tests

Let’s do a two-sample t-test to assess the difference in means between two groups. The function for a t-test is t.test(). See the help using ?t.test.

We’ll investigate whether the condition of the home impacts its total value for homes that are either in Fair or Substandard condition.

Right now we have a census of all of the homes in Albemarle. Let’s create a sample of homes that are either condition == “Fair” or condition == “Substandard” to use.

set.seed(317)
lessgood <- homes %>% 
  sample_frac(.2) %>%
  filter(condition == "Fair" | condition == "Substandard")

Great!

To answer our questions about the assumptions, let’s create some exploratory plots. I like density plots colored by group. These help you see the distribution of the variable within each group to assess equal variance, inform you somewhat about normality, and help you see whether there is a noticeable difference between the groups.

ggplot(lessgood, aes(totalvalue, fill = condition)) + geom_density()

# hmm, definitely not normally distributed, but lots of $ variables are log transformed, so let's try that
ggplot(lessgood, aes(log(totalvalue), fill = condition, color = condition)) +
  geom_density(alpha = .5)

First we make sure that a t-test is the correct type of analysis. A t-test tests the difference in 2 means - yes that is what we want to do. Next we need to decide what type of t-test we need to perform by thinking through the assumptions. Domain specific knowledge and exploratory data analyses will help here.

Random sampling – YES, we sampled a random (pseudo-random, but good enough) subset of rows from our homes dataset

Independent samples – YES (fair and substandard are different homes - unrelated). It could be a paired t-test if we were assessing fair and substandard pairs where the two homes were matched by street or something

Equal variance (also called homoscedasticity) – this one is harder to judge. We could think about the populations of fair and substandard homes and what we know about how those designations affect the variability of value… if we were appraisers. I don’t feel like I have that knowledge of these variables, so I will let my EDA plots help. Looking at the width of the bases of the density plots, they don’t look too different. Maybe the substandard one is a bit wider, but I would likely call them equal in this case.
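A quick numeric check to go along with the density plots is to compute the group standard deviations (a small sketch):

# standard deviation of log(totalvalue) within each condition group
lessgood %>%
  group_by(condition) %>%
  summarize(sd_log_totalvalue = sd(log(totalvalue)))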

Normality – Our density plots don’t look too bad, but there are better ways to assess this.

Normality can be assessed graphically or via hypothesis tests. There are pros and cons to either approach.

Graphically, we could look at a histogram, boxplot, or a more specialized plot to assess normality called a QQ plot (quantile-quantile plot or quantile comparison plot or normal probability plot). A QQ plot graphs the expected data value given a normal distribution on the X axis against the observed data value on the y axis. If the data is normally distributed, we should see a 1:1 ratio between the expected values and the observed values. Let’s have a look for log(totalvalue):

ggplot(lessgood, aes(sample = log(totalvalue))) + 
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~condition) 

# both look good, but look at that class imbalance

# count by condition
lessgood %>%
  group_by(condition) %>%
  count()
## # A tibble: 2 x 2
## # Groups:   condition [2]
##   condition       n
##   <chr>       <int>
## 1 Fair          242
## 2 Substandard    32

Learning what a normal QQ plot looks like is a process.

Certain fields love hypothesis tests of normality, and sometimes reviewers will specifically request one. There is a theoretical problem with trying to prove a null hypothesis, and these tests are known to reject the null when sample sizes are large. My best advice is to use your brain, subject matter expertise, and graphical assessments as much as possible, but in case you are forced to run a hypothesis test for normality, check out shapiro.test(), since it seems to be the least awful choice (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693611/).
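If you do end up needing one, here is a minimal sketch using our lessgood sample. Note that the test is applied to each group separately, since the normality assumption is about each group’s distribution:

# Shapiro-Wilk test of normality, one condition group at a time
shapiro.test(log(lessgood$totalvalue[lessgood$condition == "Fair"]))
shapiro.test(log(lessgood$totalvalue[lessgood$condition == "Substandard"]))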

Ok. We’ve checked our assumptions and are ready to perform a two-sample, pooled variance t-test

t.test(log(totalvalue) ~ condition, data=lessgood, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  log(totalvalue) by condition
## t = 5.1591, df = 272, p-value = 4.79e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3986370 0.8906271
## sample estimates:
##        mean in group Fair mean in group Substandard 
##                  12.25437                  11.60974
# what if we had assumed unequal variance to be safe
t.test(log(totalvalue) ~ condition, data=lessgood)
## 
##  Welch Two Sample t-test
## 
## data:  log(totalvalue) by condition
## t = 3.9803, df = 35.252, p-value = 0.0003276
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3159263 0.9733378
## sample estimates:
##        mean in group Fair mean in group Substandard 
##                  12.25437                  11.60974

In either case, we conclude that the condition groups are different in terms of their log(totalvalue). We also can see a 95% confidence interval for the difference, meaning that we are 95% confident that the difference in population log(totalvalue) is between 0.39 and 0.89 (pooled) or between 0.31 and 0.97 (Welch’s).

What if we felt uncomfortable with the assumption of normality? t-tests are robust to departures from normality, so in this case I think a t-test is the best option, but for demonstration purposes, the non-parametric alternative is the Wilcoxon-Mann-Whitney U test.

wilcox.test(log(totalvalue) ~ condition, data=lessgood)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  log(totalvalue) by condition
## W = 5663, p-value = 2.135e-05
## alternative hypothesis: true location shift is not equal to 0

Notice that the output does not provide confidence intervals or summary statistics, but we can add a confidence interval for the difference and calculate the summary statistics ourselves.

# with confidence interval for the pseudo-median difference
wilcox.test(log(totalvalue) ~ condition, conf.int = TRUE, data=lessgood)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  log(totalvalue) by condition
## W = 5663, p-value = 2.135e-05
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  0.3633802 0.8756668
## sample estimates:
## difference in location 
##              0.6242179
# calculate means and medians by group
lessgood %>%
  group_by(condition) %>%
  summarize(med = median(log(totalvalue)),
            mean = mean(log(totalvalue)))
## # A tibble: 2 x 3
##   condition     med  mean
##   <chr>       <dbl> <dbl>
## 1 Fair         12.2  12.3
## 2 Substandard  11.7  11.6

A note on one-tailed versus two-tailed tests: A two-tailed test is usually more appropriate. The hypothesis you’re testing here is spelled out in the results (“alternative hypothesis: true difference in means is not equal to 0”). If the p-value is very low, you can reject the null hypothesis that there’s no difference in means. Because you may not know a priori whether the difference in means will be positive or negative, you want to do the two-tailed test. However, if we only wanted to test a very specific directionality of effect, we could use a one-tailed test and specify which direction we expect. This is more powerful if we “get it right”, but much less powerful for the opposite effect. The p-value of a one-tailed test will be half that of a two-tailed test (when the observed effect is in the hypothesized direction). But again, the two-tailed test is almost always conducted to be conservative.
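For completeness, here is a sketch of the one-tailed version of our test. With the formula interface the groups are ordered alphabetically, so alternative = "greater" asks whether the Fair group’s mean is greater than the Substandard group’s mean:

# one-tailed, pooled-variance t-test: is mean log(totalvalue) greater for Fair than Substandard?
t.test(log(totalvalue) ~ condition, data = lessgood, var.equal = TRUE, alternative = "greater")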

A note on paired versus unpaired t-tests: The t-tests we performed here were unpaired tests. Fair and substandard homes are different and not related, so an unpaired test is appropriate. An alternative design might be when data are derived from samples that have been measured at two different time points or locations, e.g., before versus after treatment, left versus right hand, etc. In this case, a paired t-test is more appropriate. A paired test takes into consideration the intra- and inter-subject variability, and is more powerful than the unpaired test. There is a paired = TRUE option for both the t-test and the Wilcoxon test.
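As a purely hypothetical illustration (these numbers are made up, not from the homes data), a paired design would be analyzed like this:

# made-up assessed values for the same five homes before and after a renovation
value_before <- c(200000, 310000, 150000, 420000, 275000)
value_after  <- c(230000, 330000, 165000, 480000, 300000)
t.test(value_before, value_after, paired = TRUE)
# wilcox.test(value_before, value_after, paired = TRUE) is the non-parametric version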

Relationship between t-tests, ANOVA, and linear models

Analysis of variance (ANOVA) and linear modeling are complex topics that deserve an entire semester dedicated to theory, design, and interpretation. Luckily, next week we will be focused on linear modeling. What follows is a necessary over-simplification with more focus on implementation, and less on theory and design.

Where t-tests and their nonparametric substitutes are used for assessing the differences in means between two groups, ANOVA is used to assess the significance of differences in means between multiple groups. In fact, a t-test is just a specific case of ANOVA when you only have two groups. And both t-tests and ANOVA are just specific cases of linear regression, where you’re trying to fit a model describing how a continuous outcome (e.g., log(totalvalue)) changes with some predictor variable (e.g., condition, hsdistrict, etc.). The distinction is that ANOVA has a categorical predictor while linear modeling has a continuous predictor.

Let’s examine the same question we just used as a t-test, but run it as an ANOVA to prove that t-test and ANOVA are really the same test. Because ANOVA has an assumption of equal variance, let’s run a t-test specifying that the variances between groups are equal.

t.test(log(totalvalue) ~ condition, data=lessgood, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  log(totalvalue) by condition
## t = 5.1591, df = 272, p-value = 4.79e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3986370 0.8906271
## sample estimates:
##        mean in group Fair mean in group Substandard 
##                  12.25437                  11.60974

It looks like fair homes have a higher log(totalvalue) than those in the substandard condition. Now, let’s do the same test in a linear modeling framework. First, let’s create the fitted model and store it in an object called fit.

fit <- lm(log(totalvalue) ~ condition, data=lessgood)

You can display the linear model object itself, but that isn’t too interesting. You can get the more familiar ANOVA table by calling the anova() function on the fit object. More generally, the summary() function on a linear model object will tell you much more. (Note this is different from dplyr’s summarize function).

fit
## 
## Call:
## lm(formula = log(totalvalue) ~ condition, data = lessgood)
## 
## Coefficients:
##          (Intercept)  conditionSubstandard  
##              12.2544               -0.6446
anova(fit)
## Analysis of Variance Table
## 
## Response: log(totalvalue)
##            Df  Sum Sq Mean Sq F value   Pr(>F)    
## condition   1  11.745 11.7446  26.616 4.79e-07 ***
## Residuals 272 120.024  0.4413                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit)
## 
## Call:
## lm(formula = log(totalvalue) ~ condition, data = lessgood)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44022 -0.45834 -0.01923  0.38438  2.43881 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           12.2544     0.0427 286.979  < 2e-16 ***
## conditionSubstandard  -0.6446     0.1250  -5.159 4.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6643 on 272 degrees of freedom
## Multiple R-squared:  0.08913,    Adjusted R-squared:  0.08578 
## F-statistic: 26.62 on 1 and 272 DF,  p-value: 4.79e-07

Notice from these pieces of output that:

  1. The p-values from all three tests (t-test, ANOVA, and linear regression) are all identical (p=4.79e-07). This is because the tests are all identical: a t-test is a specific case of ANOVA, which is a specific case of linear regression.

  2. The test statistics are all related. The t statistic from the t-test is 5.1591, which is the same as the t-statistic from the linear regression. If you square that, you get 26.62, the F statistic from the ANOVA (see the quick check after this list).
  3. The t.test() output shows you the means for the two groups, Fair and Substandard. Just displaying the fit object itself or running summary(fit) shows you the coefficients as a linear model. Here, the model assumes the “baseline” condition level is Fair (first alphabetically), and that the intercept in a regression model (e.g., \(\beta_{0}\) in the model \(Y = \beta_{0} + \beta_{1}X\)) is the mean of the Fair group (12.25 is the mean log(totalvalue) for Fair homes). Being Substandard results in a decrease in log(totalvalue) of 0.64. This is the \(\beta_{1}\) coefficient in the model.
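A quick arithmetic check of point 2 (a sketch):

# the squared t statistic equals the F statistic from the ANOVA (up to rounding)
5.1591^2  # about 26.62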

A note on dummy coding: If you have a \(k\)-level factor, R creates \(k-1\) dummy variables, or indicator variables, by default, using the alphabetically first level as the reference group. For example, the levels of condition in our lessgood dataset are “Fair” and “Substandard”. R creates a dummy variable called “conditionSubstandard” that’s 0 if the home is Fair, and 1 if the home is Substandard. The linear model is saying that a one-unit increase in conditionSubstandard, i.e., going from Fair to Substandard, results in a 0.64-unit decrease in log(totalvalue). You can change the ordering of the factor levels to change the interpretation of the model (e.g., treating Substandard as baseline and going from Substandard up to Fair). We’ll do this in the next section.
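If you want to see the dummy coding R builds behind the scenes, model.matrix() will show it; a quick sketch:

# the design matrix: an intercept column plus a 0/1 indicator for Substandard
head(model.matrix(~ condition, data = lessgood))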

ANOVA

Recap: t-tests are for assessing the differences in means between two groups. A t-test is a specific case of ANOVA, which is a specific case of a linear model. Let’s run ANOVA, but this time looking for differences in means between more than two groups.

Let’s look at the relationship between hsdistrict and log(totalvalue).

We’ll use the same technique as before to create a sample

set.seed(317)
hs <- homes %>%
  sample_frac(.005)

hs %>% count(hsdistrict)
## # A tibble: 3 x 2
##   hsdistrict            n
##   <chr>             <int>
## 1 Albemarle            65
## 2 Monticello           42
## 3 Western Albemarle    49

We have pretty balanced sample sizes, which is good for ANOVA. If you can design your study with balanced sample sizes, ANOVA will perform better (better power, a higher probability of the equal variances assumption being met, etc.).

First let’s run a grouped summary to see what we have in terms of differences in means

hs %>%
  group_by(hsdistrict) %>%
  summarize(logmean = mean(log(totalvalue)),
            mean = exp(logmean))
## # A tibble: 3 x 3
##   hsdistrict        logmean    mean
##   <chr>               <dbl>   <dbl>
## 1 Albemarle            12.6 307083.
## 2 Monticello           12.6 285528.
## 3 Western Albemarle    13.1 507660.

Ok, so Western Albemarle seems higher than the other two hsdistricts.

Now that we understand these data a bit, let’s proceed to fit the model

fit <- lm(log(totalvalue) ~ hsdistrict, data=hs)
anova(fit)
## Analysis of Variance Table
## 
## Response: log(totalvalue)
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## hsdistrict   2  9.621  4.8104  14.306 2.015e-06 ***
## Residuals  153 51.447  0.3363                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-test on the ANOVA table tells us that there is a significant difference in mean log(totalvalue) between the high school districts (p=\(2.015 \times 10^{-6}\)).

For more details, let’s take a look at the linear model output

summary(fit)
## 
## Call:
## lm(formula = log(totalvalue) ~ hsdistrict, data = hs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3530 -0.3448 -0.0337  0.3312  1.8446 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 12.63487    0.07192 175.669  < 2e-16 ***
## hsdistrictMonticello        -0.07278    0.11480  -0.634    0.527    
## hsdistrictWestern Albemarle  0.50269    0.10971   4.582 9.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5799 on 153 degrees of freedom
## Multiple R-squared:  0.1575, Adjusted R-squared:  0.1465 
## F-statistic: 14.31 on 2 and 153 DF,  p-value: 2.015e-06
exp(12.63487) # $307K
## [1] 307082
exp(12.63487-0.07278) # $285K
## [1] 285526.5
exp(12.63487+0.50269) # $508K
## [1] 507656.3

Interpretation:

  • The mean log(totalvalue) for the Albemarle hsdistrict is 12.63, or about $307,082 in actual dollars
  • The Monticello hsdistrict has a mean log(totalvalue) 0.07 less than that of Albemarle, about $285,527 in real dollars. That difference is not statistically significant
  • The Western Albemarle hsdistrict has a mean log(totalvalue) 0.50 more than that of Albemarle, about $507,656 in real dollars. That difference is statistically significant

The bottom of the output has the ANOVA information and model summary statistics.

Because the default handling of categorical variables is to treat the alphabetically first level as the baseline, the “Albemarle” high school district is treated as the baseline group (the intercept row), and the coefficients for “Monticello” and “Western Albemarle” describe how those groups’ mean log(totalvalue) differs from that of the Albemarle hsdistrict.

What if we wanted “Western Albemarle” to be the reference category, followed by Albemarle, then Monticello? Let’s use mutate() and factor() to change the factor levels.

hs <- hs %>%
  mutate(hsdistrict = factor(hsdistrict, levels = c("Western Albemarle", "Albemarle", "Monticello")))

Re-run the model to see how re-leveling the variable changes the output

fit <- lm(log(totalvalue) ~ hsdistrict, data=hs)
anova(fit)
## Analysis of Variance Table
## 
## Response: log(totalvalue)
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## hsdistrict   2  9.621  4.8104  14.306 2.015e-06 ***
## Residuals  153 51.447  0.3363                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Same F statistic and p-value overall. Let’s have a look at the summary

summary(fit)
## 
## Call:
## lm(formula = log(totalvalue) ~ hsdistrict, data = hs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3530 -0.3448 -0.0337  0.3312  1.8446 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          13.13757    0.08284 158.592  < 2e-16 ***
## hsdistrictAlbemarle  -0.50269    0.10971  -4.582 9.49e-06 ***
## hsdistrictMonticello -0.57547    0.12194  -4.719 5.30e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5799 on 153 degrees of freedom
## Multiple R-squared:  0.1575, Adjusted R-squared:  0.1465 
## F-statistic: 14.31 on 2 and 153 DF,  p-value: 2.015e-06

Now Western Albemarle is the reference group and the comparisons are to that hsdistrict.

So far, we have been looking at model output as a large paragraph. If you need to do something downstream with the output from these models, the tidy() and glance() functions in the broom package may help. For example, these functions can help tremendously if you’d like to output the results of a model into a table. These functions work for t-test, wilcox.test, ANOVA, and linear models (and maybe other types of tests too).

# coefficients section
tidy(fit)
## # A tibble: 3 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            13.1      0.0828    159.   1.24e-171
## 2 hsdistrictAlbemarle    -0.503    0.110      -4.58 9.49e-  6
## 3 hsdistrictMonticello   -0.575    0.122      -4.72 5.30e-  6
# model summary section
glance(fit)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.158         0.147 0.580      14.3 2.02e-6     3  -135.  278.  290.
## # … with 2 more variables: deviance <dbl>, df.residual <int>

If you wanted to remain in the ANOVA framework, you can run the typical post-hoc ANOVA procedures on the fit object. For example, the TukeyHSD() function will run Tukey’s test. Tukey’s test computes all pairwise differences in means, comparing each group to each other group and flagging differences that are larger than expected by chance, while controlling the type I error rate across all of the multiple comparisons. First run aov() (not anova()) on the fitted linear model object, then run TukeyHSD() on the resulting analysis of variance fit.

TukeyHSD(aov(fit))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fit)
## 
## $hsdistrict
##                                     diff        lwr        upr     p adj
## Albemarle-Western Albemarle  -0.50269252 -0.7623387 -0.2430464 0.0000282
## Monticello-Western Albemarle -0.57547268 -0.8640633 -0.2868821 0.0000157
## Monticello-Albemarle         -0.07278016 -0.3444833  0.1989230 0.8016517
tidy(TukeyHSD(aov(fit)))
## # A tibble: 3 x 6
##   term       comparison                  estimate conf.low conf.high adj.p.value
##   <chr>      <chr>                          <dbl>    <dbl>     <dbl>       <dbl>
## 1 hsdistrict Albemarle-Western Albemarle  -0.503    -0.762    -0.243   0.0000282
## 2 hsdistrict Monticello-Western Albemar…  -0.575    -0.864    -0.287   0.0000157
## 3 hsdistrict Monticello-Albemarle         -0.0728   -0.344     0.199   0.802
plot(TukeyHSD(aov(fit)))

This shows that there isn’t much of a difference between Albemarle and Monticello, but that both of these differ significantly from Western Albemarle, which has a higher log(totalvalue).

Finally, let’s visualize the differences between these groups.

# plot results
hs %>%
  ggplot(aes(hsdistrict, log(totalvalue))) + geom_boxplot()

# a fancier plot using a grouped summary
hssum <- hs %>%
  group_by(hsdistrict) %>%
  summarise(mean = mean(log(totalvalue)),
            sd = sd(log(totalvalue)),
            lower = mean - sd,
            upper = mean + sd)

ggplot() + 
  geom_jitter(data = hs, aes(x= hsdistrict, 
                             y = log(totalvalue), 
                             col = hsdistrict), 
              width = .1, alpha = .25) + 
  geom_pointrange(data = hssum, aes(x = hsdistrict, 
                                    y = mean, 
                                    ymin = lower, 
                                    ymax = upper, 
                                    col = hsdistrict), 
                  size = 1, fatten = 2) + 
  theme_classic() +
  labs(y = "Log-transformed Total Home Value",
       x = "",
       col = "High School District")

Discrete variable Statistics

Until now we’ve only discussed analyzing continuous outcomes (dependent variables). We’ve tested for differences in means between two groups with t-tests, differences among means between n groups with ANOVA, and more general relationships using linear regression. In all of these cases, the dependent variable, i.e., the outcome, or \(Y\) variable, was continuous. What if our outcome variable is discrete, e.g., “Yes/No”, “Excellent/Fair/Substandard”, etc.? Here we use a different set of procedures for assessing significant associations.

So far we have covered:

  1. T-tests – analyzing differences in one continuous variable between 2 groups
  2. ANOVA – analyzing differences in one continuous variable between 3+ groups
  3. LM – next week Clay will go through analyzing the impact of a continuous variable on another continuous variable, but we did an introduction to the functions we will need

Test of proportion

One of the most basic questions we could have is what proportion of homes are above half a million dollars?

We’ll generate a sample to see the proportion

homes %>%
  sample_frac(0.01) %>%
  summarize(prop = mean(totalvalue > 500000))
## # A tibble: 1 x 1
##    prop
##   <dbl>
## 1 0.240

Run the above code again and you will see that a different sample of homes leads to a different sample proportion. One of our questions should be: what is the variability in that sample proportion? (Remember that we usually do not have access to the whole population like we do here.)
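One way to get a feel for that variability is to draw many repeated samples and look at the spread of the resulting sample proportions. This is a sketch (no seed is set, so your exact numbers will differ):

# draw 1000 separate 1% samples and compute the proportion of homes over $500K in each
props <- replicate(1000, mean(sample_frac(homes, 0.01)$totalvalue > 500000))
summary(props)
sd(props)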

# generate a sample that is the same for all of us
set.seed(317)
samp <- homes %>%
  sample_frac(0.01) 

# see the sample proportion and the raw number
samp %>%
  summarize(prop = mean(totalvalue > 500000),
            n = sum(totalvalue > 500000))
## # A tibble: 1 x 2
##    prop     n
##   <dbl> <int>
## 1 0.237    74

Let’s create a confidence interval around this proportion

prop.test(x = sum(samp$totalvalue > 500000), n = length(samp$totalvalue))
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(samp$totalvalue > 5e+05) out of length(samp$totalvalue), null probability 0.5
## X-squared = 85.157, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.1918825 0.2890849
## sample estimates:
##         p 
## 0.2371795

We are 95% confident that the population proportion of homes over $500K in Albemarle County is between 0.19 and 0.29.

The hypothesis test this function performs by default simply tests whether the proportion equals 0.5, which is not helpful for us here.
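If you did want to test against a specific value, prop.test() has a p argument for the null proportion. For example, a sketch with 0.25 chosen arbitrarily:

# test whether the population proportion of $500K+ homes equals 0.25
prop.test(x = sum(samp$totalvalue > 500000), n = nrow(samp), p = 0.25)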

Let’s increase the sample size and see what happens to our sample proportion and to our confidence interval

# generate a sample that is the same for all of us
set.seed(317)
bigsamp <- homes %>%
  sample_frac(0.1)

# calculate the confidence interval
prop.test(x = sum(bigsamp$totalvalue > 500000), n = length(bigsamp$totalvalue))
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(bigsamp$totalvalue > 5e+05) out of length(bigsamp$totalvalue), null probability 0.5
## X-squared = 857.03, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.2231496 0.2533258
## sample estimates:
##         p 
## 0.2379123

Notice that the point estimate of our sample stayed similar but that the confidence interval shrank considerably. This is the wonderful effect of increasing your sample size!
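A rough back-of-the-envelope explanation: under the normal approximation, the standard error of a sample proportion is sqrt(p * (1 - p) / n), so making the sample 10 times larger shrinks the standard error (and roughly the width of the confidence interval) by a factor of about sqrt(10) ≈ 3.2. A sketch using our two sample sizes (about 312 and 3123 homes):

# approximate standard errors of the sample proportion for the two samples
sqrt(0.237 * (1 - 0.237) / 312)
sqrt(0.238 * (1 - 0.238) / 3123)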

Since we have access to our population, let’s see what the true proportion is

mean(homes$totalvalue > 500000)
## [1] 0.2502882

Testing 2 proportions

Often we are curious to compare 2 proportions. For example, is there a difference in the proportion of homes rated as “Substandard” or “Poor” condition between homes with and without central air?

Let’s use the bigsamp object we created above to test our question

bigsamp %>%
  group_by(cooling) %>%
  summarise(prop = mean(condition == "Poor" | condition == "Substandard"),
            n = sum(condition == "Poor" | condition == "Substandard"),
            totN = n())
## # A tibble: 2 x 4
##   cooling           prop     n  totN
##   <chr>            <dbl> <int> <int>
## 1 Central Air    0.00359    10  2785
## 2 No Central Air 0.0888     30   338

So about 0.4% of homes with central air are classified as poor or substandard, while about 8.9% of homes without central air are poor/substandard. Let’s do the test of proportions to see the confidence interval for the difference in proportions.

The prop.test() function’s first argument is called x and it is the number of “successes” for each group. The second argument, n, is the total number of homes in each group

prop.test(x = c(10,30), n = c(2785, 338))
## Warning in prop.test(x = c(10, 30), n = c(2785, 338)): Chi-squared approximation
## may be incorrect
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(10, 30) out of c(2785, 338)
## X-squared = 166.24, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.11722542 -0.05310804
## sample estimates:
##      prop 1      prop 2 
## 0.003590664 0.088757396

We are 95% confident that the difference between the population proportions is between -0.117 and -0.053, a pretty large difference. (The warning about the chi-squared approximation is triggered because some of the expected cell counts are small.)

Let’s see what the population difference in proportions is

homes %>%
  group_by(cooling) %>%
  summarise(prop = mean(condition == "Poor" | condition == "Substandard"),
            n = sum(condition == "Poor" | condition == "Substandard"),
            totN = n())
## # A tibble: 2 x 4
##   cooling           prop     n  totN
##   <chr>            <dbl> <int> <int>
## 1 Central Air    0.00352    97 27554
## 2 No Central Air 0.103     379  3674
0.00352-0.103
## [1] -0.09948

Chi Square Contingency tables

Many times we are interested in how two discrete variables relate to each other. Whenever you have count data with more than one variable, you should be thinking about contingency tables. This type of analysis is also called a chi-square test of independence.

The xtabs() function is useful for creating contingency tables from categorical variables. Let’s create a cross tabulation showing hsdistrict and condition2 of the home, and assign it to an object called xt. After making the assignment, type the name of the object to view it.

xt <- xtabs(~hsdistrict + condition2, data = homes)
xt
##                    condition2
## hsdistrict          Average Excellent Fair Good None Poor Substandard
##   Albemarle            9119       108  447 2138  334   78          30
##   Monticello           7572        61  481 1467  330  129          65
##   Unassigned              2         0    0    2    0    0           1
##   Western Albemarle    6397       121  403 1469  301  116          57

Ok, so our data is a 4 by 7 table. Looking at this table, it is clear that we don’t have a lot of homes in the Unassigned hsdistrict, so we should likely drop that prior to analysis. Let’s also drop condition2 == None since we don’t know what that means for the home.

Let’s also use a function from the forcats package to combine levels of condition2 together. The syntax for fct_collapse() is new_level = c("old level 1", "old level 2").

homes_chi <- homes %>% 
  filter(hsdistrict != "Unassigned" & condition2 != "None") %>%
  mutate(condition3 = fct_collapse(condition2,
                                   Good = c("Excellent", "Good"),
                                   Average = c("Average", "Fair"),
                                   Poor = c("Poor", "Substandard")
                                   ))


homes_chi %>%
  count(condition3)
## # A tibble: 3 x 2
##   condition3     n
##   <fct>      <int>
## 1 Average    24419
## 2 Good        5364
## 3 Poor         475

Great. We dropped ~1000 cases and are ready to re-make our xt object with our homes_chi dataset and the condition3 variable

xt <- xtabs(~hsdistrict + condition3, data = homes_chi)
xt
##                    condition3
## hsdistrict          Average Good Poor
##   Albemarle            9566 2246  108
##   Monticello           8053 1528  194
##   Western Albemarle    6800 1590  173

Nice. Now we have a 3x3 table.

There are two useful functions, addmargins() and prop.table(), that add more information or change how the table is displayed. addmargins() adds totals for rows and columns. By default, prop.table() divides the number of observations in each cell by the grand total.

# Add marginal totals
addmargins(xt)
##                    condition3
## hsdistrict          Average  Good  Poor   Sum
##   Albemarle            9566  2246   108 11920
##   Monticello           8053  1528   194  9775
##   Western Albemarle    6800  1590   173  8563
##   Sum                 24419  5364   475 30258
# Get the proportional table
prop.table(xt)
##                    condition3
## hsdistrict              Average        Good        Poor
##   Albemarle         0.316147796 0.074228303 0.003569304
##   Monticello        0.266144491 0.050499042 0.006411528
##   Western Albemarle 0.224733955 0.052548086 0.005717496
#each cell divided by grand total
# That isn't really what we want

We want to see what proportion of homes in each hsdistrict falls into each condition, so we want proportions by rows. Rows are the first dimension in R, so we specify margin = 1.

# Calculate proportions over the first margin (rows)
prop.table(xt, margin=1)
##                    condition3
## hsdistrict              Average        Good        Poor
##   Albemarle         0.802516779 0.188422819 0.009060403
##   Monticello        0.823836317 0.156317136 0.019846547
##   Western Albemarle 0.794114212 0.185682588 0.020203200

Looks like Monticello has fewer homes in “Good” condition and Albemarle has fewer homes in “Poor” condition. But are these differences significant?

The chi-square test is used to assess the independence of these two factors (hsdistrict and condition3). That is, if the null hypothesis that hsdistrict and condition3 are independent is true, then we would expect a proportionally equal number of homes within each condition across each hsdistrict.

Let’s do the test and see.

chisq.test(xt)
## 
##  Pearson's Chi-squared test
## 
## data:  xt
## X-squared = 96.724, df = 4, p-value < 2.2e-16

The p-value is small, suggesting that the distribution of conditions is not the same across the high school districts (i.e., hsdistrict and condition3 are not independent).

The last thing we will do is see where the differences are by comparing our observed table, xt, with the counts we would expect if conditions were distributed evenly across the hsdistricts.

xt
##                    condition3
## hsdistrict          Average Good Poor
##   Albemarle            9566 2246  108
##   Monticello           8053 1528  194
##   Western Albemarle    6800 1590  173
chisq.test(xt)$expected
##                    condition3
## hsdistrict           Average     Good     Poor
##   Albemarle         9619.753 2113.123 187.1241
##   Monticello        7888.682 1732.867 153.4512
##   Western Albemarle 6910.566 1518.010 134.4248
  • We expected 9619 Average homes in Albemarle, and found fewer
  • We expected 7888 Average homes in Monticello, and found more
  • We expected 6910 Average homes in WA, and found fewer

  • We expected 2113 Good homes in Albemarle, and found more
  • We expected 1732 Good homes in Monticello, and found fewer
  • We expected 1518 Good homes in WA, and found more

  • We expected 187 Poor homes in Albemarle, and found fewer
  • We expected 153 Poor homes in Monticello, and found more
  • We expected 134 Poor homes in WA, and found more

Feedback

Thank you all so much for sticking through to the end with a new format and mode of learning. Because this is our first time teaching like this, if you have any feedback about how it went or improvements we could make, I would be very grateful. David and I are teaching a full curricular coding course starting in the next few weeks, so knowing what works and what presents challenges with this virtual format will help us become better educators.

Here is a link to a Google Form where you can add your anonymous thoughts: https://forms.gle/MLKnyw36CzMKLtZQ6

Thank you in advance!