The goal of this session is to teach you how to conduct basic statistical analyses in R. We will cover descriptive statistics, exploratory data analysis, t-tests, ANOVA, and methods for discrete variable analysis along with hypothesis testing and assessing the assumptions of the tests.
Hopefully you got here by double-clicking the .Rproj file in the Instructional Materials. If not, exit out of RStudio and enter in through the .Rproj file so that you are in the proper folder.
Remember that an RStudio project (specifically the .Rproj file) makes it straightforward to keep your work in its own working directory. Creating a project takes away some of the stress of navigating through file directories and file paths, and it encapsulates the source files, images, and anything else created during your R session.
The next step of setting up will be to load packages we will need today.
# install.packages("inspectdf")
# install.packages("broom")
library(tidyverse)
library(inspectdf)
library(broom)
Remember that if your console says something like "Error: there is no package called ...", you will need to install the package using install.packages("packagename"). I like to put the install lines in my script, run them once to install the package, and then comment them out so that the installations do not get in the way of rendering the document (in case you want to do that).
As our final set-up step, let’s load in the homes dataset that we have been using for the past few weeks. This is the version after all of the cleaning steps we did with David in the Data Prep lesson.
We will use the read_csv() function from the tidyverse to get our data into R. Once we've loaded it, we can type the name of the object itself (homes) to see it printed to the screen.
homes <- read_csv("albemarle_homes.csv")
homes
## # A tibble: 31,228 x 27
## tmp yearbuilt yearremodeled usecode condition finsqft cooling bedroom
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 0030… 1939 NA Single… Fair 1596 No Cen… 3
## 2 0030… 1981 NA Single… Average 1616 No Cen… 6
## 3 0030… 1999 NA Single… Average 480 No Cen… 0
## 4 0030… 1754 NA Single… Substand… 828 No Cen… 1
## 5 0030… 1880 NA Single… Fair 1224 No Cen… 2
## 6 0050… 1769 1988 Single… Average 4875 Centra… 4
## 7 0050… 1818 1991 Single… Average 5573 Centra… 6
## 8 0060… 1935 1994 Single… Average 1120 No Cen… 3
## 9 0060… 1880 1978 Single… Average 1458 Centra… 3
## 10 0060… 2005 NA Single… Average 1888 Centra… 3
## # … with 31,218 more rows, and 19 more variables: fullbath <dbl>,
## # halfbath <dbl>, totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>,
## # improvementsvalue <dbl>, totalvalue <dbl>, lastsaleprice <dbl>,
## # lastsaledate1 <chr>, esdistrict <chr>, msdistrict <chr>, hsdistrict <chr>,
## # censustract <dbl>, lastsaledate <date>, age <dbl>, condition2 <chr>,
## # remodel <dbl>, fp <dbl>, landuse <dbl>
Take a look at that output. The nice thing about loading the tidyverse and reading data with read_csv() from readr is that the data are displayed in a much friendlier way. This dataset has 31K+ rows and 27 columns. When you import data this way and display the object in the console, you'll only see 10 rows by default instead of all of them. The printing behavior of base R's read.csv() produces thousands of lines of output that are seldom helpful. Also, if you have so many columns that the data would wrap off the edge of your screen, those columns will not be displayed, but you'll see at the bottom of the output which columns, if any, were hidden from view.
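If you do want to see more rows or columns in the console, the tibble print method takes n and width arguments. A quick sketch (output omitted):
print(homes, n = 20)       # show the first 20 rows instead of 10
print(homes, width = Inf)  # print all columns instead of truncating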
Remember that if you want to see the whole dataset, there are two ways to do this. First, you can click on the name of the data.frame in the Environment panel in RStudio. Or you could use the View() function (with a capital V).
# View(homes)
Note: I have commented out the View() function because its behavior when rendering an Rmd is not desirable. You could also change the Rmd chunk options to prevent this chunk from being evaluated.
Recall several built-in functions that are useful for working with data frames.
head(): shows the first few rows
tail(): shows the last few rows
dim(): returns a 2-element vector with the number of rows as the first element and the number of columns as the second element (the dimensions of the object)
nrow(): returns the number of rows
ncol(): returns the number of columns
names(): returns the column names
glimpse() (from dplyr): returns a glimpse of your data, telling you the structure of the dataset and information about the class, length, and content of each column
summary(): returns a frequency count for factor variables and a six-number summary for numeric variables
Let's try a few.
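As a quick sketch of the first several, each can be called directly on our homes data (output omitted to save space):
head(homes)   # first 6 rows
tail(homes)   # last 6 rows
dim(homes)    # number of rows and columns
nrow(homes)   # number of rows
ncol(homes)   # number of columns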
glimpse(homes)
## Observations: 31,228
## Variables: 27
## $ tmp <chr> "00300000000500", "003000000006B0", "00300000001000…
## $ yearbuilt <dbl> 1939, 1981, 1999, 1754, 1880, 1769, 1818, 1935, 188…
## $ yearremodeled <dbl> NA, NA, NA, NA, NA, 1988, 1991, 1994, 1978, NA, NA,…
## $ usecode <chr> "Single Family", "Single Family", "Single Family", …
## $ condition <chr> "Fair", "Average", "Average", "Substandard", "Fair"…
## $ finsqft <dbl> 1596, 1616, 480, 828, 1224, 4875, 5573, 1120, 1458,…
## $ cooling <chr> "No Central Air", "No Central Air", "No Central Air…
## $ bedroom <dbl> 3, 6, 0, 1, 2, 4, 6, 3, 3, 3, 3, 3, 3, 3, 3, 4, 2, …
## $ fullbath <dbl> 1, 4, 0, 0, 1, 3, 4, 1, 1, 3, 2, 2, 3, 3, 2, 1, 1, …
## $ halfbath <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
## $ totalrooms <dbl> 5, 9, 1, 3, 4, 8, 11, 5, 6, 6, 9, 0, 10, 8, 8, 6, 4…
## $ lotsize <dbl> 55.955, 5.167, 80.750, 80.718, 65.000, 5.102, 453.8…
## $ landvalue <dbl> 347900, 140300, 312500, 327100, 497000, 106400, 161…
## $ improvementsvalue <dbl> 80700, 212300, 21600, 13500, 113100, 827000, 802200…
## $ totalvalue <dbl> 428600, 352600, 334100, 340600, 610100, 933400, 241…
## $ lastsaleprice <dbl> 424000, 300000, 0, 0, 0, 0, 0, 2820000, 0, 0, 52000…
## $ lastsaledate1 <chr> "11/18/2003", "03/23/2012", "06/30/1994", "08/09/19…
## $ esdistrict <chr> "Broadus Wood", "Broadus Wood", "Broadus Wood", "Br…
## $ msdistrict <chr> "Jouett", "Jouett", "Jouett", "Jouett", "Jouett", "…
## $ hsdistrict <chr> "Albemarle", "Albemarle", "Albemarle", "Albemarle",…
## $ censustract <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 1…
## $ lastsaledate <date> 2003-11-18, 2012-03-23, 1994-06-30, 1994-08-09, 20…
## $ age <dbl> 80, 38, 20, 265, 139, 250, 201, 84, 139, 14, 15, 12…
## $ condition2 <chr> "Fair", "Average", "Average", "Substandard", "Fair"…
## $ remodel <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fp <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ landuse <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, …
names(homes)
## [1] "tmp" "yearbuilt" "yearremodeled"
## [4] "usecode" "condition" "finsqft"
## [7] "cooling" "bedroom" "fullbath"
## [10] "halfbath" "totalrooms" "lotsize"
## [13] "landvalue" "improvementsvalue" "totalvalue"
## [16] "lastsaleprice" "lastsaledate1" "esdistrict"
## [19] "msdistrict" "hsdistrict" "censustract"
## [22] "lastsaledate" "age" "condition2"
## [25] "remodel" "fp" "landuse"
summary(homes)
## tmp yearbuilt yearremodeled usecode
## Length:31228 Min. :1668 Min. : 2 Length:31228
## Class :character 1st Qu.:1973 1st Qu.:1992 Class :character
## Mode :character Median :1989 Median :2002 Mode :character
## Mean :1981 Mean :1998
## 3rd Qu.:2002 3rd Qu.:2008
## Max. :2019 Max. :2019
## NA's :952 NA's :28929
## condition finsqft cooling bedroom
## Length:31228 Min. : 144 Length:31228 Min. : 0.000
## Class :character 1st Qu.:1386 Class :character 1st Qu.: 3.000
## Mode :character Median :1860 Mode :character Median : 3.000
## Mean :2087 Mean : 3.386
## 3rd Qu.:2573 3rd Qu.: 4.000
## Max. :9649 Max. :12.000
## NA's :1
## fullbath halfbath totalrooms lotsize
## Min. :0.000 Min. :0.0000 Min. : 0.000 Min. : 0.0000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.000 1st Qu.: 0.1620
## Median :2.000 Median :1.0000 Median : 7.000 Median : 0.6855
## Mean :2.324 Mean :0.6323 Mean : 7.073 Mean : 4.7997
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.: 2.7480
## Max. :9.000 Max. :5.0000 Max. :84.000 Max. :1067.1500
## NA's :1 NA's :110 NA's :9
## landvalue improvementsvalue totalvalue lastsaleprice
## Min. : 0 Min. : 0 Min. : 4300 Min. : 0
## 1st Qu.: 80000 1st Qu.: 154600 1st Qu.: 245800 1st Qu.: 0
## Median : 117000 Median : 225100 Median : 344050 Median : 184000
## Mean : 142752 Mean : 286777 Mean : 429529 Mean : 246097
## 3rd Qu.: 151900 3rd Qu.: 340025 3rd Qu.: 500200 3rd Qu.: 351000
## Max. :6651000 Max. :4953700 Max. :7859000 Max. :11925000
## NA's :1
## lastsaledate1 esdistrict msdistrict hsdistrict
## Length:31228 Length:31228 Length:31228 Length:31228
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## censustract lastsaledate age condition2
## Min. :101.0 Min. :1754-01-01 Min. : 0.00 Length:31228
## 1st Qu.:103.0 1st Qu.:2001-12-21 1st Qu.: 17.00 Class :character
## Median :107.0 Median :2011-02-02 Median : 30.00 Mode :character
## Mean :107.5 Mean :2004-08-02 Mean : 37.71
## 3rd Qu.:111.0 3rd Qu.:2016-08-03 3rd Qu.: 46.00
## Max. :114.0 Max. :2019-10-24 Max. :351.00
## NA's :5
## remodel fp landuse
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :1.0000 Median :0.00000
## Mean :0.07362 Mean :0.7296 Mean :0.05229
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
##
We can access individual variables within a data frame using the $ operator, e.g., mydataframe$specificVariable.
Printing out all the esdistrict values in the data will not be useful; it will just dump thousands of values to the console.
# Display all esdistrict values
homes$esdistrict
Now let's see the unique values of esdistrict, first using base R and then using a dplyr pipeline.
# Get the unique values of esdistrict
unique(homes$esdistrict)
## [1] "Broadus Wood" "Crozet" "Meriwether Lewis" "Baker-Butler"
## [5] "Stony Point" "Stone-Robinson" "Unassigned" "Brownsville"
## [9] "Greer" "Agnor-Hurt" "Woodbrook" "Hollymead"
## [13] "Murray" "Red Hill" "Cale" "Scottsville"
# Do the same thing the dplyr way
homes %>% distinct(esdistrict)
## # A tibble: 16 x 1
## esdistrict
## <chr>
## 1 Broadus Wood
## 2 Crozet
## 3 Meriwether Lewis
## 4 Baker-Butler
## 5 Stony Point
## 6 Stone-Robinson
## 7 Unassigned
## 8 Brownsville
## 9 Greer
## 10 Agnor-Hurt
## 11 Woodbrook
## 12 Hollymead
## 13 Murray
## 14 Red Hill
## 15 Cale
## 16 Scottsville
Think about the difference between these two types of output, one produced by the base R function and one by the dplyr equivalent. Sometimes one format is vastly preferred over the other, so it is good to note the difference.
# create a table of frequencies
table(homes$esdistrict)  # base R
##
## Agnor-Hurt Baker-Butler Broadus Wood Brownsville
## 2144 2944 2280 3147
## Cale Crozet Greer Hollymead
## 2676 1859 1437 1188
## Meriwether Lewis Murray Red Hill Scottsville
## 2161 1697 1574 1452
## Stone-Robinson Stony Point Unassigned Woodbrook
## 3593 1311 5 1760
homes %>% count(esdistrict) #this returns a tibble
## # A tibble: 16 x 2
## esdistrict n
## <chr> <int>
## 1 Agnor-Hurt 2144
## 2 Baker-Butler 2944
## 3 Broadus Wood 2280
## 4 Brownsville 3147
## 5 Cale 2676
## 6 Crozet 1859
## 7 Greer 1437
## 8 Hollymead 1188
## 9 Meriwether Lewis 2161
## 10 Murray 1697
## 11 Red Hill 1574
## 12 Scottsville 1452
## 13 Stone-Robinson 3593
## 14 Stony Point 1311
## 15 Unassigned 5
## 16 Woodbrook 1760
#2-factor table or cross tabulation
table(homes$hsdistrict, homes$esdistrict)
##
## Agnor-Hurt Baker-Butler Broadus Wood Brownsville Cale
## Albemarle 2144 2944 2280 0 0
## Monticello 0 0 0 0 2676
## Unassigned 0 0 0 0 0
## Western Albemarle 0 0 0 3147 0
##
## Crozet Greer Hollymead Meriwether Lewis Murray Red Hill
## Albemarle 0 1437 1188 0 0 0
## Monticello 0 0 0 0 0 1574
## Unassigned 0 0 0 0 0 0
## Western Albemarle 1859 0 0 2161 1697 0
##
## Scottsville Stone-Robinson Stony Point Unassigned Woodbrook
## Albemarle 0 0 501 0 1760
## Monticello 1452 3593 810 0 0
## Unassigned 0 0 0 5 0
## Western Albemarle 0 0 0 0 0
homes %>% count(hsdistrict, esdistrict)
## # A tibble: 17 x 3
## hsdistrict esdistrict n
## <chr> <chr> <int>
## 1 Albemarle Agnor-Hurt 2144
## 2 Albemarle Baker-Butler 2944
## 3 Albemarle Broadus Wood 2280
## 4 Albemarle Greer 1437
## 5 Albemarle Hollymead 1188
## 6 Albemarle Stony Point 501
## 7 Albemarle Woodbrook 1760
## 8 Monticello Cale 2676
## 9 Monticello Red Hill 1574
## 10 Monticello Scottsville 1452
## 11 Monticello Stone-Robinson 3593
## 12 Monticello Stony Point 810
## 13 Unassigned Unassigned 5
## 14 Western Albemarle Brownsville 3147
## 15 Western Albemarle Crozet 1859
## 16 Western Albemarle Meriwether Lewis 2161
## 17 Western Albemarle Murray 1697
Stony Point is the only elementary school that feeds into two different high schools.
Now let's calculate some descriptive stats for the totalvalue variable.
# measures of the middle: mean, median
mean(homes$totalvalue)
## [1] 429528.6
median(homes$totalvalue)
## [1] 344050
# measures of overall spread
range(homes$totalvalue)
## [1] 4300 7859000
sd(homes$totalvalue)
## [1] 347592
# the Interquartile Range (difference between 75th and 25th quartiles)
IQR(homes$totalvalue)
## [1] 254400
# quartiles (points that divide data into quarters)
quantile(homes$totalvalue)
## 0% 25% 50% 75% 100%
## 4300 245800 344050 500200 7859000
# 90th percentile of total value
quantile(homes$totalvalue, probs = .90)
## 90%
## 739930
# a distributional summary
summary(homes$totalvalue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4300 245800 344050 429529 500200 7859000
What are your conclusions about this variable?
Recall the following about the mean and the median:
- The median is the middle of the sorted data
- The mean is the "balance point" of the data
- Symmetric data have similar means and medians (a small numeric example follows below)
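As a tiny illustration of those points (a made-up vector, not from our data), a single large value pulls the mean well away from the median:
x <- c(1, 2, 3, 4, 100)  # hypothetical skewed values
mean(x)                  # 22, dragged up by the outlier
median(x)                # 3, unaffected by the outlier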
You can calculate these same descriptive statistics using dplyr, but remember, this returns a single-row, single-column tibble, not a single scalar value like the above. This is only really useful in the context of grouping and summarizing.
# Compute the median totalvalue
homes %>%
summarize(median(totalvalue))
## # A tibble: 1 x 1
## `median(totalvalue)`
## <dbl>
## 1 344050
# Now grouped by other variables
homes %>%
group_by(esdistrict) %>%
summarize(med = median(totalvalue))
## # A tibble: 16 x 2
## esdistrict med
## <chr> <dbl>
## 1 Agnor-Hurt 277400
## 2 Baker-Butler 295400
## 3 Broadus Wood 383000
## 4 Brownsville 409700
## 5 Cale 337000
## 6 Crozet 275000
## 7 Greer 306100
## 8 Hollymead 349550
## 9 Meriwether Lewis 544000
## 10 Murray 729700
## 11 Red Hill 249000
## 12 Scottsville 192350
## 13 Stone-Robinson 364900
## 14 Stony Point 393200
## 15 Unassigned 685500
## 16 Woodbrook 330350
# let's order that output
homes %>%
group_by(esdistrict) %>%
summarize(med = median(totalvalue)) %>%
arrange(desc(med))
## # A tibble: 16 x 2
## esdistrict med
## <chr> <dbl>
## 1 Murray 729700
## 2 Unassigned 685500
## 3 Meriwether Lewis 544000
## 4 Brownsville 409700
## 5 Stony Point 393200
## 6 Broadus Wood 383000
## 7 Stone-Robinson 364900
## 8 Hollymead 349550
## 9 Cale 337000
## 10 Woodbrook 330350
## 11 Greer 306100
## 12 Baker-Butler 295400
## 13 Agnor-Hurt 277400
## 14 Crozet 275000
## 15 Red Hill 249000
## 16 Scottsville 192350
Let's try taking the mean of a different variable, either the dplyr way or the simpler $ way.
# the dplyr way: returns a single-row single-column tibble/dataframe
homes %>% summarize(mean(totalrooms))
## # A tibble: 1 x 1
## `mean(totalrooms)`
## <dbl>
## 1 NA
# returns a single value
mean(homes$totalrooms)
## [1] NA
What happened there? NA indicates missing data. Take a look at a summary of the totalrooms variable.
summary(homes$totalrooms)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 7.000 7.073 8.000 84.000 9
Notice that there are 9 missing values for totalrooms. Trying to take the mean of a bunch of observations when some are missing returns a missing value by default. This is almost universally the case with summary statistics: a single NA will cause the summary to return NA. Now look at the help with ?mean. Notice the na.rm argument. This is a logical (i.e., TRUE or FALSE) value indicating whether or not missing values should be removed prior to computing the mean. By default, it's set to FALSE. Now try it again.
mean(homes$totalrooms, na.rm = TRUE)
## [1] 7.073032
The is.na() function tells you if a value is missing. Get the sum() of that vector, which adds up all the TRUEs to tell you how many of the values are missing.
is.na(homes$totalrooms)
sum(is.na(homes$totalrooms))
## [1] 9
# proportion missing
sum(is.na(homes$totalrooms)) / length(homes$totalrooms)
## [1] 0.0002882029
What if we want to know the amount of missing data in each column? There's a package for that! Let's use the inspect_na() function from the inspectdf package to help. Remember that if you get an error message that the function cannot be found, go back to the top of the script and run the library(inspectdf) line. If that fails, run install.packages("inspectdf").
inspect_na(homes) %>% print(n = Inf)
## # A tibble: 27 x 3
## col_name cnt pcnt
## <chr> <int> <dbl>
## 1 yearremodeled 28929 92.6
## 2 yearbuilt 952 3.05
## 3 halfbath 110 0.352
## 4 totalrooms 9 0.0288
## 5 censustract 5 0.0160
## 6 bedroom 1 0.00320
## 7 fullbath 1 0.00320
## 8 lastsaleprice 1 0.00320
## 9 tmp 0 0
## 10 usecode 0 0
## 11 condition 0 0
## 12 finsqft 0 0
## 13 cooling 0 0
## 14 lotsize 0 0
## 15 landvalue 0 0
## 16 improvementsvalue 0 0
## 17 totalvalue 0 0
## 18 lastsaledate1 0 0
## 19 esdistrict 0 0
## 20 msdistrict 0 0
## 21 hsdistrict 0 0
## 22 lastsaledate 0 0
## 23 age 0 0
## 24 condition2 0 0
## 25 remodel 0 0
## 26 fp 0 0
## 27 landuse 0 0
This function provides the count of missings and the proportion missing for each variable, sorted by amount missing.
So what should you do about missing data? This is a tough question, but an important one. If you have missing data and are wondering what to do about it, please request a consult with one of the PhD+ instructors, since our advice will depend a lot on the situation (how much is missing, whether it is missing at random or not, what you are trying to do with the data, etc.).
Now, let’s talk about exploratory data analysis (EDA).
It’s always worth examining your data visually before you start any statistical analysis or hypothesis testing. We already spent an entire day on data visualizations, so I’ll just remind us of a few of the big ones here: histograms and scatterplots. We will also re-familiarize ourselves with density plots when we go through some stats in a few minutes.
We can learn a lot from the data just looking at the distributions of particular variables. Let’s make some basic histograms with ggplot2.
ggplot(homes, aes(finsqft)) + geom_histogram()
How would you describe the shape of this distribution?
What about age?
ggplot(homes, aes(age)) + geom_histogram()
What do you notice?
What is going on with age around 260?
filter(homes, age < 270 & age > 250)
## # A tibble: 419 x 27
## tmp yearbuilt yearremodeled usecode condition finsqft cooling bedroom
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 0030… 1754 NA Single… Substand… 828 No Cen… 1
## 2 0070… 1754 NA Single… Substand… 863 No Cen… 0
## 3 0090… 1754 NA Single… Fair 1000 No Cen… 3
## 4 0090… 1754 1993 Single… Average 1859 No Cen… 3
## 5 0090… 1754 NA Single… Fair 520 No Cen… 1
## 6 0090… 1754 NA Single… Average 1580 No Cen… 4
## 7 0100… 1754 NA Single… Fair 768 No Cen… 2
## 8 0100… 1754 NA Single… Poor 629 No Cen… 2
## 9 0110… 1754 1987 Single… Average 3236 Centra… 4
## 10 0130… 1754 NA Single… Substand… 912 No Cen… 0
## # … with 409 more rows, and 19 more variables: fullbath <dbl>, halfbath <dbl>,
## # totalrooms <dbl>, lotsize <dbl>, landvalue <dbl>, improvementsvalue <dbl>,
## # totalvalue <dbl>, lastsaleprice <dbl>, lastsaledate1 <chr>,
## # esdistrict <chr>, msdistrict <chr>, hsdistrict <chr>, censustract <dbl>,
## # lastsaledate <date>, age <dbl>, condition2 <chr>, remodel <dbl>, fp <dbl>,
## # landuse <dbl>
They do seem to be real cases. I would love to know what happened in Charlottesville in 1754 to prompt so many houses to be built (that are still around today)!
Let’s look at how a few different variables relate to each other. E.g., lotsize and finsqft:
ggplot(homes, aes(lotsize, finsqft)) + geom_point()
Some very large homes on small lots. But (I think) lot size is in acres, so really we are looking at a few properties that are > 300 acres.
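One quick way to check that impression is to count the very large parcels and re-plot without them. A sketch (the 300-acre cutoff is just eyeballed from the plot above):
# how many parcels are larger than ~300 acres?
homes %>% filter(lotsize > 300) %>% nrow()
# re-draw the scatterplot without those huge parcels
ggplot(filter(homes, lotsize <= 300), aes(lotsize, finsqft)) + geom_point()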
A very well-known family of basic statistical tests is the t-test. A t-test is used when we want to know whether the means of two groups differ.
T-tests assume a few things about the data: random sampling, independent samples, normality, and equal variance.
As we go, we will learn what to do if assumptions do not hold, but for now, here is a brief flowchart:
- Random sampling: if not met, no statistics will help you
- Independent samples: if not met, use a paired t-test
- Normality: if not met, use a Wilcoxon-Mann-Whitney U test
- Equal variance: if not met, use a Welch's t-test
- If independent samples and normality are both not met: use a Wilcoxon signed rank test
Let's do a two-sample t-test to assess the difference in means between two groups. The function for a t-test is t.test(). See the help using ?t.test.
We’ll investigate whether the condition of the home impacts its total value for homes that are either in Fair or Substandard condition.
Right now we have a census of all of the homes in Albemarle. Let’s create a sample of homes that are either condition == “Fair” or condition == “Substandard” to use.
set.seed(317)
lessgood <- homes %>%
sample_frac(.2) %>%
filter(condition == "Fair" | condition == "Substandard")
Great!
To answer our questions about the assumptions, let’s create some exploratory plots. I like density plots colored by group. I think these help you see the distribution of the variable to assess equal variance. This plot also informs you somewhat on normality, and helps you see if there is a noticeable difference in groups.
ggplot(lessgood, aes(totalvalue, fill = condition)) + geom_density()
# hmm, definitely not normally distributed, but lots of $ variables are log transformed, so let's try that
ggplot(lessgood, aes(log(totalvalue), fill = condition, color = condition)) +
geom_density(alpha = .5)
First we make sure that a t-test is the correct type of analysis. A t-test assesses the difference in two means, and yes, that is what we want to do. Next we need to decide what type of t-test to perform by thinking through the assumptions. Domain-specific knowledge and exploratory data analysis will help here.
Random sampling: YES, we sampled a random (pseudo-random, but good enough) subset of rows from our homes dataset.
Independent samples: YES (Fair and Substandard are different, unrelated homes). It would call for a paired t-test if we were assessing Fair and Substandard pairs where the two homes were matched by street or something similar.
Equal variance (also called homogeneity of variance, or homoscedasticity): unclear. We could think about the populations of Fair and Substandard homes and what we know about how those designations affect the variability of value, if we were appraisers. I don't feel like I have that knowledge of these variables, so I will let my EDA plots help. Looking at the width of the bases of the density plots, they don't look too different. Maybe the Substandard one is a bit wider, but I would likely call them the same in this case.
Normality: our density plots don't look too bad, but there are better ways to assess this.
Normality can be assessed graphically or via hypothesis tests. There are pros and cons to either approach.
Graphically, we could look at a histogram, a boxplot, or a more specialized plot for assessing normality called a QQ plot (quantile-quantile plot, quantile comparison plot, or normal probability plot). A QQ plot graphs the expected data value given a normal distribution on the x axis against the observed data value on the y axis. If the data are normally distributed, the points should fall along the 1:1 line, with the observed values matching the expected values. Let's have a look for log(totalvalue):
ggplot(lessgood, aes(sample = log(totalvalue))) +
geom_qq() +
geom_qq_line() +
facet_wrap(~condition)
# both look good, but look at that class imbalance
# count by condition
lessgood %>%
group_by(condition) %>%
count()
## # A tibble: 2 x 2
## # Groups: condition [2]
## condition n
## <chr> <int>
## 1 Fair 242
## 2 Substandard 32
Learning what a normal QQ plot looks like is a process.
Certain fields love hypothesis tests of normality, and sometimes reviewers will specifically request one. There is a theoretical problem with trying to prove a null hypothesis, and these tests are known to reject the null for trivial departures when sample sizes are large. My best advice is to use your brain, subject matter expertise, and graphical assessments as much as possible, but in case you are forced to do a hypothesis test for normality, check out shapiro.test(), since it seems to be the least awful choice (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693611/).
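If a reviewer does insist on one, here is a minimal sketch of running shapiro.test() within each condition group of our lessgood sample (output omitted):
shapiro.test(log(lessgood$totalvalue[lessgood$condition == "Fair"]))
shapiro.test(log(lessgood$totalvalue[lessgood$condition == "Substandard"]))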
Ok. We've checked our assumptions and are ready to perform a two-sample, pooled-variance t-test.
t.test(log(totalvalue) ~ condition, data=lessgood, var.equal = TRUE)
##
## Two Sample t-test
##
## data: log(totalvalue) by condition
## t = 5.1591, df = 272, p-value = 4.79e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3986370 0.8906271
## sample estimates:
## mean in group Fair mean in group Substandard
## 12.25437 11.60974
# what if we had assumed unequal variance to be safe
t.test(log(totalvalue) ~ condition, data=lessgood)
##
## Welch Two Sample t-test
##
## data: log(totalvalue) by condition
## t = 3.9803, df = 35.252, p-value = 0.0003276
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3159263 0.9733378
## sample estimates:
## mean in group Fair mean in group Substandard
## 12.25437 11.60974
In either case, we conclude that the condition groups differ in their mean log(totalvalue). We also get a 95% confidence interval for the difference, meaning that we are 95% confident that the difference in population mean log(totalvalue) is between 0.40 and 0.89 (pooled) or between 0.32 and 0.97 (Welch's).
What if we felt uncomfortable with the assumption of normality? t-tests are robust to departures from normality, so in this case I think a t-test is the best option, but for demonstration purposes, the non-parametric alternative is the Wilcoxon-Mann-Whitney U test.
wilcox.test(log(totalvalue) ~ condition, data=lessgood)
##
## Wilcoxon rank sum test with continuity correction
##
## data: log(totalvalue) by condition
## W = 5663, p-value = 2.135e-05
## alternative hypothesis: true location shift is not equal to 0
Notice that the output does not provide confidence intervals or summary statistics, but we can add a confidence interval for the difference and calculate the summary statistics ourselves.
# with confidence interval for the pseudo-median difference
wilcox.test(log(totalvalue) ~ condition, conf.int = TRUE, data=lessgood)
##
## Wilcoxon rank sum test with continuity correction
##
## data: log(totalvalue) by condition
## W = 5663, p-value = 2.135e-05
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 0.3633802 0.8756668
## sample estimates:
## difference in location
## 0.6242179
# calculate means and medians by group
lessgood %>%
group_by(condition) %>%
summarize(med = median(log(totalvalue)),
mean = mean(log(totalvalue)))
## # A tibble: 2 x 3
## condition med mean
## <chr> <dbl> <dbl>
## 1 Fair 12.2 12.3
## 2 Substandard 11.7 11.6
A note on one-tailed versus two-tailed tests: A two-tailed test is usually more appropriate. The hypothesis you're testing here is spelled out in the results ("alternative hypothesis: true difference in means is not equal to 0"). If the p-value is very low, you can reject the null hypothesis that there's no difference in means. Because you may not know a priori whether the difference in means will be positive or negative, you want to do the two-tailed test. However, if we only wanted to test a very specific directionality of effect, we could use a one-tailed test and specify which direction we expect. This is more powerful if we "get it right", but much less powerful for the opposite effect. The p-value of a one-tailed test will be half that of the two-tailed test (when the observed effect is in the hypothesized direction). BUT again, the two-tailed test is almost always conducted to be conservative.
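For illustration only, here is a sketch of a one-tailed version of our earlier pooled test. With the formula interface, alternative = "greater" corresponds to expecting the first (alphabetical) group, Fair, to have the larger mean:
t.test(log(totalvalue) ~ condition, data = lessgood, var.equal = TRUE, alternative = "greater")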
A note on paired versus unpaired t-tests: The t-tests we performed here were unpaired tests. Fair and Substandard homes are different and unrelated, so an unpaired test is appropriate. An alternative design might be one where data are derived from the same subjects measured at two different time points or locations, e.g., before versus after treatment, left versus right hand, etc. In that case, a paired t-test would be more appropriate. A paired test takes into consideration intra- and inter-subject variability and is more powerful than the unpaired test. There is a paired = TRUE option for both the t-test and the Wilcoxon test.
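Just to show the call shape with paired = TRUE, here is a sketch with small hypothetical before/after vectors (these numbers are invented; nothing in the homes data is actually paired):
# hypothetical assessed values for the same 5 homes before and after renovation
value_before <- c(200000, 310000, 150000, 420000, 275000)
value_after  <- c(225000, 342000, 161000, 458000, 301000)
t.test(value_after, value_before, paired = TRUE)
wilcox.test(value_after, value_before, paired = TRUE)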
Analysis of variance (ANOVA) and linear modeling are complex topics that deserve an entire semester dedicated to theory, design, and interpretation. Luckily, next week we will be focused on linear modeling. What follows is a necessary over-simplification with more focus on implementation, and less on theory and design.
Where t-tests and their nonparametric substitutes are used for assessing the differences in means between two groups, ANOVA is used to assess the significance of differences in means between multiple groups. In fact, a t-test is just a specific case of ANOVA when you only have two groups. And both t-tests and ANOVA are just specific cases of linear regression, where you’re trying to fit a model describing how a continuous outcome (e.g., log(totalvalue)) changes with some predictor variable (e.g., condition, hsdistrict, etc.). The distinction is that ANOVA has a categorical predictor while linear modeling has a continuous predictor.
Let’s examine the same question we just used as a t-test, but run it as an ANOVA to prove that t-test and ANOVA are really the same test. Because ANOVA has an assumption of equal variance, let’s run a t-test specifying that the variances between groups are equal.
t.test(log(totalvalue) ~ condition, data=lessgood, var.equal=TRUE)
##
## Two Sample t-test
##
## data: log(totalvalue) by condition
## t = 5.1591, df = 272, p-value = 4.79e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3986370 0.8906271
## sample estimates:
## mean in group Fair mean in group Substandard
## 12.25437 11.60974
It looks like Fair homes have a higher log(totalvalue) than those in the Substandard condition. Now, let's do the same test in a linear modeling framework. First, let's create the fitted model and store it in an object called fit.
fit <- lm(log(totalvalue) ~ condition, data=lessgood)
You can display the linear model object itself, but that isn't too interesting. You can get the more familiar ANOVA table by calling the anova() function on the fit object. More generally, the summary() function on a linear model object will tell you much more. (Note this is different from dplyr's summarize function.)
fit
##
## Call:
## lm(formula = log(totalvalue) ~ condition, data = lessgood)
##
## Coefficients:
## (Intercept) conditionSubstandard
## 12.2544 -0.6446
anova(fit)
## Analysis of Variance Table
##
## Response: log(totalvalue)
## Df Sum Sq Mean Sq F value Pr(>F)
## condition 1 11.745 11.7446 26.616 4.79e-07 ***
## Residuals 272 120.024 0.4413
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit)
##
## Call:
## lm(formula = log(totalvalue) ~ condition, data = lessgood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44022 -0.45834 -0.01923 0.38438 2.43881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.2544 0.0427 286.979 < 2e-16 ***
## conditionSubstandard -0.6446 0.1250 -5.159 4.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6643 on 272 degrees of freedom
## Multiple R-squared: 0.08913, Adjusted R-squared: 0.08578
## F-statistic: 26.62 on 1 and 272 DF, p-value: 4.79e-07
Notice from these pieces of output that:
The p-values from all three tests (t-test, ANOVA, and linear regression) are all identical (p=4.79e-07). This is because the tests are all identical: a t-test is a specific case of ANOVA, which is a specific case of linear regression.
The t.test() output shows you the means for the two groups, Fair and Substandard. Just displaying the fit object itself or running summary(fit) shows you the coefficients as a linear model. Here, the model takes the "baseline" condition level to be Fair (first alphabetically), and the intercept in the regression model (e.g., \(\beta_{0}\) in the model \(Y = \beta_{0} + \beta_{1}X\)) is the mean of the Fair group (12.25 is the mean log(totalvalue) for Fair homes). Being Substandard results in a decrease in log(totalvalue) of 0.64; this is the \(\beta_{1}\) coefficient in the model.
A note on dummy coding: If you have a \(k\)-level factor, R creates \(k-1\) dummy variables, or indicator variables, by default, using the alphabetically first level as the reference group. For example, the levels of condition in our lessgood dataset are "Fair" and "Substandard". R creates a dummy variable called "conditionSubstandard" that's 0 if the home is Fair and 1 if the home is Substandard. The linear model is saying that every unit increase in conditionSubstandard, i.e., going from Fair to Substandard, results in a 0.64-unit decrease in log(totalvalue). You can change the ordering of the factors to change the interpretation of the model (e.g., treating Substandard as baseline and going from Substandard up to Fair). We'll do this in the next section.
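If you want to see the dummy coding R builds behind the scenes, model.matrix() on the fitted object shows the design matrix. A quick sketch using the lessgood fit from above (just the first few rows; output omitted):
head(model.matrix(fit))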
Recap: t-tests are for assessing the differences in means between two groups. A t-test is a specific case of ANOVA, which is a specific case of a linear model. Let’s run ANOVA, but this time looking for differences in means between more than two groups.
Let's look at the relationship between hsdistrict and log(totalvalue).
We'll use the same technique as before to create a sample.
set.seed(317)
hs <- homes %>%
sample_frac(.005)
hs %>% count(hsdistrict)
## # A tibble: 3 x 2
## hsdistrict n
## <chr> <int>
## 1 Albemarle 65
## 2 Monticello 42
## 3 Western Albemarle 49
We have pretty balanced sample sizes. This is good for ANOVA. If you can choose to design your study with balanced sample sizes, ANOVA will perform better (better power, higher probability of equal variances assumption being met, etc).
First let’s run a grouped summary to see what we have in terms of differences in means
hs %>%
group_by(hsdistrict) %>%
summarize(logmean = mean(log(totalvalue)),
mean = exp(logmean))
## # A tibble: 3 x 3
## hsdistrict logmean mean
## <chr> <dbl> <dbl>
## 1 Albemarle 12.6 307083.
## 2 Monticello 12.6 285528.
## 3 Western Albemarle 13.1 507660.
Ok, so Western Albemarle seems higher than the other two hsdistricts.
Now that we understand these data a bit, let’s proceed to fit the model
fit <- lm(log(totalvalue) ~ hsdistrict, data=hs)
anova(fit)
## Analysis of Variance Table
##
## Response: log(totalvalue)
## Df Sum Sq Mean Sq F value Pr(>F)
## hsdistrict 2 9.621 4.8104 14.306 2.015e-06 ***
## Residuals 153 51.447 0.3363
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-test on the ANOVA table tells us that there is a significant difference in mean log(totalvalue) between the high school districts (p=\(2.015 \times 10^{-6}\)).
For more details, let’s take a look at the linear model output
summary(fit)
##
## Call:
## lm(formula = log(totalvalue) ~ hsdistrict, data = hs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3530 -0.3448 -0.0337 0.3312 1.8446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.63487 0.07192 175.669 < 2e-16 ***
## hsdistrictMonticello -0.07278 0.11480 -0.634 0.527
## hsdistrictWestern Albemarle 0.50269 0.10971 4.582 9.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5799 on 153 degrees of freedom
## Multiple R-squared: 0.1575, Adjusted R-squared: 0.1465
## F-statistic: 14.31 on 2 and 153 DF, p-value: 2.015e-06
exp(12.63487) # $307K
## [1] 307082
exp(12.63487-0.07278) # $285K
## [1] 285526.5
exp(12.63487+0.50269) # $508K
## [1] 507656.3
Interpretation:
- The mean log(totalvalue) for the Albemarle hsdistrict is 12.63, or about $307,082 in actual dollars.
- The Monticello hsdistrict has a mean log(totalvalue) 0.07 less than that of Albemarle, about $285,526 in real dollars. That difference is not statistically significant.
- The Western Albemarle hsdistrict has a mean log(totalvalue) 0.50 more than that of Albemarle, about $507,656 in real dollars. That difference is statistically significant.
The bottom of the output has the ANOVA information and model summary statistics.
Because the default handling of categorical variables is to treat the alphabetically first level as the baseline, the "Albemarle" high school district is treated as the baseline group (the intercept row), and the coefficients for "Monticello" and "Western Albemarle" describe how those groups' mean log(totalvalue) differs from that of the Albemarle hsdistrict.
What if we wanted "Western Albemarle" to be the reference category, followed by Albemarle, then Monticello? Let's use mutate() and factor() to change the factor levels.
hs <- hs %>%
mutate(hsdistrict = factor(hsdistrict, levels = c("Western Albemarle", "Albemarle", "Monticello")))
Re-run the model to see how re-leveling the variable changes the output
fit <- lm(log(totalvalue) ~ hsdistrict, data=hs)
anova(fit)
## Analysis of Variance Table
##
## Response: log(totalvalue)
## Df Sum Sq Mean Sq F value Pr(>F)
## hsdistrict 2 9.621 4.8104 14.306 2.015e-06 ***
## Residuals 153 51.447 0.3363
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Same F statistic and p-value overall. Let’s have a look at the summary
summary(fit)
##
## Call:
## lm(formula = log(totalvalue) ~ hsdistrict, data = hs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3530 -0.3448 -0.0337 0.3312 1.8446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.13757 0.08284 158.592 < 2e-16 ***
## hsdistrictAlbemarle -0.50269 0.10971 -4.582 9.49e-06 ***
## hsdistrictMonticello -0.57547 0.12194 -4.719 5.30e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5799 on 153 degrees of freedom
## Multiple R-squared: 0.1575, Adjusted R-squared: 0.1465
## F-statistic: 14.31 on 2 and 153 DF, p-value: 2.015e-06
Now Western Albemarle is the reference group and the comparisons are to that hsdistrict.
So far, we have been looking at model output as a large paragraph. If you need to do something downstream with the output from these models, the tidy() and glance() functions in the broom package may help. For example, these functions can help tremendously if you'd like to output the results of a model into a table. They work for t-tests, wilcox.test, ANOVA, and linear models (and maybe other types of tests too).
# coefficients section
tidy(fit)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.1 0.0828 159. 1.24e-171
## 2 hsdistrictAlbemarle -0.503 0.110 -4.58 9.49e- 6
## 3 hsdistrictMonticello -0.575 0.122 -4.72 5.30e- 6
# model summary section
glance(fit)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.158 0.147 0.580 14.3 2.02e-6 3 -135. 278. 290.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
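They also work on the hypothesis tests we ran earlier, if you want those results as a one-row tibble. A quick sketch reusing the lessgood t-test (output omitted):
tidy(t.test(log(totalvalue) ~ condition, data = lessgood, var.equal = TRUE))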
If you wanted to remain in the ANOVA framework, you can run the typical post-hoc ANOVA procedures on the fit object. For example, the TukeyHSD() function will run Tukey's test. Tukey's test computes all pairwise mean differences, comparing each group to each other group and identifying any difference between two groups larger than expected by chance, while controlling the type I error rate across all of the multiple comparisons. First run aov() (not anova()) on the fitted linear model object, then run TukeyHSD() on the resulting analysis of variance fit.
TukeyHSD(aov(fit))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = fit)
##
## $hsdistrict
## diff lwr upr p adj
## Albemarle-Western Albemarle -0.50269252 -0.7623387 -0.2430464 0.0000282
## Monticello-Western Albemarle -0.57547268 -0.8640633 -0.2868821 0.0000157
## Monticello-Albemarle -0.07278016 -0.3444833 0.1989230 0.8016517
tidy(TukeyHSD(aov(fit)))
## # A tibble: 3 x 6
## term comparison estimate conf.low conf.high adj.p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 hsdistrict Albemarle-Western Albemarle -0.503 -0.762 -0.243 0.0000282
## 2 hsdistrict Monticello-Western Albemar… -0.575 -0.864 -0.287 0.0000157
## 3 hsdistrict Monticello-Albemarle -0.0728 -0.344 0.199 0.802
plot(TukeyHSD(aov(fit)))
This shows that there isn't much of a difference between Albemarle and Monticello, but that both of these differ significantly from Western Albemarle, which has a higher mean log(totalvalue).
Finally, let’s visualize the differences between these groups.
# plot results
hs %>%
ggplot(aes(hsdistrict, log(totalvalue))) + geom_boxplot()
# a fancier plot using a grouped summary
hssum <- hs %>%
group_by(hsdistrict) %>%
summarise(mean = mean(log(totalvalue)),
sd = sd(log(totalvalue)),
lower = mean - sd,
upper = mean + sd)
ggplot() +
geom_jitter(data = hs, aes(x= hsdistrict,
y = log(totalvalue),
col = hsdistrict),
width = .1, alpha = .25) +
geom_pointrange(data = hssum, aes(x = hsdistrict,
y = mean,
ymin = lower,
ymax = upper,
col = hsdistrict),
size = 1, fatten = 2) +
theme_classic() +
labs(y = "Log-transformed Total Home Value",
x = "",
col = "High School District")
Until now we’ve only discussed analyzing continuous outcomes (dependent variables). We’ve tested for differences in means between two groups with t-tests, differences among means between n groups with ANOVA, and more general relationships using linear regression. In all of these cases, the dependent variable, i.e., the outcome, or \(Y\) variable, was continuous. What if our outcome variable is discrete, e.g., “Yes/No”, “Excellent/Fair/Substandard”, etc.? Here we use a different set of procedures for assessing significant associations.
So far we have covered:
1. t-tests: analyzing differences in one continuous variable between 2 groups
2. ANOVA: analyzing differences in one continuous variable between 3+ groups
3. Linear models (LM): next week Clay will go through analyzing the impact of a continuous variable on another continuous variable, but we did an introduction to the functions we will need.
One of the most basic questions we could have is what proportion of homes are above half a million dollars?
We’ll generate a sample to see the proportion
homes %>%
sample_frac(0.01) %>%
summarize(prop = mean(totalvalue > 500000))
## # A tibble: 1 x 1
## prop
## <dbl>
## 1 0.240
Run the above code again and we will see that a different sample of homes leads to a different sample proportion. One of our questions should be: what is the variability in that sample proportion? (Remember, we usually do not have access to the whole population like we do here.)
# generate a sample that is the same for all of us
set.seed(317)
samp <- homes %>%
sample_frac(0.01)
# see the sample proportion and the raw number
samp %>%
summarize(prop = mean(totalvalue > 500000),
n = sum(totalvalue > 500000))
## # A tibble: 1 x 2
## prop n
## <dbl> <int>
## 1 0.237 74
Let’s create a confidence interval around this proportion
prop.test(x = sum(samp$totalvalue > 500000), n = length(samp$totalvalue))
##
## 1-sample proportions test with continuity correction
##
## data: sum(samp$totalvalue > 5e+05) out of length(samp$totalvalue), null probability 0.5
## X-squared = 85.157, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.1918825 0.2890849
## sample estimates:
## p
## 0.2371795
We are 95% confident that the population proportion of homes worth more than $500K in Albemarle County is between 0.19 and 0.29.
The hypothesis test this function performs simply tests whether the proportion equals 0.5, which is not helpful for us.
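If you did want a more meaningful null hypothesis, prop.test() also accepts a p argument. A sketch testing against a null proportion of 0.25 (a value chosen purely for illustration):
prop.test(x = sum(samp$totalvalue > 500000), n = length(samp$totalvalue), p = 0.25)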
Let’s increase the sample size and see what happens to our sample proportion and to our confidence interval
# generate a sample that is the same for all of us
set.seed(317)
bigsamp <- homes %>%
sample_frac(0.1)
# calculate the confidence interval
prop.test(x = sum(bigsamp$totalvalue > 500000), n = length(bigsamp$totalvalue))
##
## 1-sample proportions test with continuity correction
##
## data: sum(bigsamp$totalvalue > 5e+05) out of length(bigsamp$totalvalue), null probability 0.5
## X-squared = 857.03, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.2231496 0.2533258
## sample estimates:
## p
## 0.2379123
Notice that the point estimate of our sample stayed similar but that the confidence interval shrank considerably. This is the wonderful effect of increasing your sample size!
Since we have access to our population, let’s see what the true proportion is
mean(homes$totalvalue > 500000)
## [1] 0.2502882
Often we are curious to compare two proportions. For example, is there a difference in the proportion of homes rated as "Substandard" or "Poor" condition between homes with and without central air?
Let's use the bigsamp object we created above to test our question.
bigsamp %>%
group_by(cooling) %>%
summarise(prop = mean(condition == "Poor" | condition == "Substandard"),
n = sum(condition == "Poor" | condition == "Substandard"),
totN = n())
## # A tibble: 2 x 4
## cooling prop n totN
## <chr> <dbl> <int> <int>
## 1 Central Air 0.00359 10 2785
## 2 No Central Air 0.0888 30 338
So about 0.4% of homes with central air are classified as Poor or Substandard, while about 8.9% of homes without central air are. Let's do the test of proportions to see the confidence interval for the difference in proportions.
The prop.test() function's first argument, x, is the number of "successes" in each group. The second argument, n, is the total number of homes in each group.
prop.test(x = c(10,30), n = c(2785, 338))
## Warning in prop.test(x = c(10, 30), n = c(2785, 338)): Chi-squared approximation
## may be incorrect
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(10, 30) out of c(2785, 338)
## X-squared = 166.24, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.11722542 -0.05310804
## sample estimates:
## prop 1 prop 2
## 0.003590664 0.088757396
The difference between our sample proportions is likely between -0.117 and -0.053, a pretty large difference.
Let’s see what the population difference in proportions is
homes %>%
group_by(cooling) %>%
summarise(prop = mean(condition == "Poor" | condition == "Substandard"),
n = sum(condition == "Poor" | condition == "Substandard"),
totN = n())
## # A tibble: 2 x 4
## cooling prop n totN
## <chr> <dbl> <int> <int>
## 1 Central Air 0.00352 97 27554
## 2 No Central Air 0.103 379 3674
0.00352-0.103
## [1] -0.09948
Many times we are interested in how two discrete variables interact with each other. Whenever you have count data with more than one variable, you should be thinking about contingency tables. This type of analysis is also called a chi-square test of independence.
The xtabs() function is useful for creating contingency tables from categorical variables. Let's create a cross tabulation showing hsdistrict and condition2 of the home, and assign it to an object called xt. After making the assignment, type the name of the object to view it.
xt <- xtabs(~hsdistrict + condition2, data = homes)
xt
## condition2
## hsdistrict Average Excellent Fair Good None Poor Substandard
## Albemarle 9119 108 447 2138 334 78 30
## Monticello 7572 61 481 1467 330 129 65
## Unassigned 2 0 0 2 0 0 1
## Western Albemarle 6397 121 403 1469 301 116 57
Ok, so our data is a 4 by 7 table. Looking at this table, it is clear that we don’t have a lot of homes in the Unassigned hsdistrict, so we should likely drop that prior to analysis. Let’s also drop condition2 == None since we don’t know what that means for the home.
Let's also use a function from the forcats package to combine levels of condition together. The syntax for fct_collapse() is new_level = c("old_level_1", "old_level_2").
homes_chi <- homes %>%
filter(hsdistrict != "Unassigned" & condition2 != "None") %>%
mutate(condition3 = fct_collapse(condition2,
Good = c("Excellent", "Good"),
Average = c("Average", "Fair"),
Poor = c("Poor", "Substandard")
))
homes_chi %>%
count(condition3)
## # A tibble: 3 x 2
## condition3 n
## <fct> <int>
## 1 Average 24419
## 2 Good 5364
## 3 Poor 475
Great. We dropped ~1,000 cases and are ready to re-make our xt object with the homes_chi dataset and the condition3 variable.
xt <- xtabs(~hsdistrict + condition3, data = homes_chi)
xt
## condition3
## hsdistrict Average Good Poor
## Albemarle 9566 2246 108
## Monticello 8053 1528 194
## Western Albemarle 6800 1590 173
Nice. Now we have a 3x3 table.
There are two useful functions, addmargins() and prop.table(), that add more information or change how the table is displayed. addmargins() adds totals for rows and columns. By default, prop.table() divides the number of observations in each cell by the grand total.
# Add marginal totals
addmargins(xt)
## condition3
## hsdistrict Average Good Poor Sum
## Albemarle 9566 2246 108 11920
## Monticello 8053 1528 194 9775
## Western Albemarle 6800 1590 173 8563
## Sum 24419 5364 475 30258
# Get the proportional table
prop.table(xt)
## condition3
## hsdistrict Average Good Poor
## Albemarle 0.316147796 0.074228303 0.003569304
## Monticello 0.266144491 0.050499042 0.006411528
## Western Albemarle 0.224733955 0.052548086 0.005717496
#each cell divided by grand total
# That isn't really what we want
We want to see what proportion of homes in each hsdistrict falls into each condition, so we want proportions by rows. Rows are the first dimension in R, so we specify margin = 1.
# Calculate proportions over the first margin (rows)
prop.table(xt, margin=1)
## condition3
## hsdistrict Average Good Poor
## Albemarle 0.802516779 0.188422819 0.009060403
## Monticello 0.823836317 0.156317136 0.019846547
## Western Albemarle 0.794114212 0.185682588 0.020203200
Looks like Monticello has fewer homes in “Good” condition and Albemarle has fewer homes in “Poor” condition. But are these differences significant?
The chi-square test is used to assess the independence of these two factors (hsdistrict and condition3). That is, if the null hypothesis that hsdistrict and condition3 are independent is true, then we would expect the proportion of homes in each condition to be the same across the hsdistricts.
Let’s do the test and see.
chisq.test(xt)
##
## Pearson's Chi-squared test
##
## data: xt
## X-squared = 96.724, df = 4, p-value < 2.2e-16
The p-value is small, suggesting that the distribution of conditions is not the same across high school districts, i.e., the two variables do not appear to be independent.
The last thing we will do is see where the differences are by comparing our observed table, xt, with the counts we would expect if condition were distributed evenly (independently) across hsdistricts.
xt
## condition3
## hsdistrict Average Good Poor
## Albemarle 9566 2246 108
## Monticello 8053 1528 194
## Western Albemarle 6800 1590 173
chisq.test(xt)$expected
## condition3
## hsdistrict Average Good Poor
## Albemarle 9619.753 2113.123 187.1241
## Monticello 7888.682 1732.867 153.4512
## Western Albemarle 6910.566 1518.010 134.4248
We expected 6910 Average homes in WA, and found fewer
We expected 1518 Good homes in WA, and found more
We expected 134 Poor homes in WA, and found more
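Another way to see where the departures are, not something we relied on above but a useful sketch, is to look at the Pearson residuals that chisq.test() also returns; large positive values mean more homes than expected in that cell, large negative values mean fewer:
chisq.test(xt)$residuals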
Thank you all so much for sticking through to the end with a new format and mode of learning. Because this is our first time teaching like this, if you have any feedback about how it went or improvements we could make, I would be very grateful. David and I are teaching a full curricular coding course starting in the next few weeks, so knowing what works and what presents challenges with this virtual format will help us become better educators.
Here is a link to a Google Form where you can add your anonymous thoughts: https://forms.gle/MLKnyw36CzMKLtZQ6
Thank you in advance!