Today’s goals include

Learning to use dplyr functions for data exploration
Thinking about data exploration and preparation as a chance to get to know the data you’re working with
Using tidyverse functions to go from raw data to a dataset ready for further analysis

dplyr and data exploration

Before we dive in with the data, let’s get to know the dplyr package. Part of the the tidyverse, dplyr is a package for data manipulation. The package implements a grammar for transforming data, based on verbs/functions that define a set of common tasks.

Functions/Verbs

dplyr functions are for data frames.

first argument of dplyr functions is always a data frame
followed by function specific arguments that detail what to do

Let’s look at some quick examples using the Albemarle homes data .

library(tidyverse)
library(lubridate)
options(tibble.print_min = 5)

homes <- read_csv("data/albemarle_homes_2020.csv")

count()

\(\color{blue}{\text{count()}}\) - tally rows by distinct values of \(\color{blue}{\text{variable}}\)

count(homes, condition)

filter()

\(\color{green}{\text{filter()}}\) - extract \(\color{green}{\text{rows}}\) that meet logical conditions

filter(homes, condition == "Excellent")

Logical tests

x < y: less than
x > y: greater than
x == y: equal to
x <= y: less than or equal to
y >= y: greater than or equal to
x != y: not equal to
x %in% y: is a member of
is.na(x): is NA
!is.na(x): is not NA

filter(homes, yearbuilt > 2011)

Boolean operators for multiple conditions

a & b: and
a | b: or
xor(a,b): exactly or
!a: not

filter(homes, esdistrict == "Murray" & bedroom <= 2)

arrange()

\(\color{green}{\text{arrange()}}\) - order \(\color{green}{\text{rows}}\) based on value of designated column(s)

arrange(homes, finsqft)

Reverse the order (largest to smallest) with desc()

arrange(homes, desc(finsqft))

select()

\(\color{blue}{\text{select()}}\) - extract \(\color{blue}{\text{variables}}\) by name

select(homes, totalvalue)

select() helpers include

select(.data, var1:var10): select range of columns
select(.data, -c(var1, var2)): select every column but
select(.data, starts_with(“string”)): select columns that start with… (or ends_with(“string”))
select(.data, contains(“string”)): select columns whose names contain…

select(homes, lotsize:totalvalue)

pipes

The pipe is an operator that allows you to chain together functions. It passes (or pipes) the result on the left into the first argument of the function on the right. It can be read as “and then…”.

For instance, if we want the totalvalue and lotsize for homes in the Murray school district arranged in descending order of lotsize, without the pipe it would look like

arrange(
  select(
    filter(homes, esdistrict == "Murray"), 
    totalvalue, lotsize), 
  desc(lotsize))

Or we could save the intervening steps

tmp <- filter(homes, esdistrict == "Murray")
tmp <- select(tmp, totalvalue, lotsize)
arrange(tmp, desc(lotsize))

Or we could use pipes – which is easier!

homes %>% 
  filter(esdistrict == "Murray") %>% 
  select(totalvalue, lotsize) %>% 
  arrange(desc(lotsize))

Keyboard Shortcut!

Mac: cmd + shift + m
Windows: ctrl + shift + m

summarize()

\(\color{blue}{\text{summarize()}}\) - summarize \(\color{blue}{\text{variables}}\) according to a summary function

Summary functions include

first(): first value
last(): last value
nth(.x, n): nth value
n(): number of values
n_distinct(): number of distinct values
min(): minimum value
max(): maximum value
mean(): mean value
median(): median value
var(): variance
sd(): standard deviation
IQR(): interquartile range

homes %>% 
  filter(yearbuilt > 0) %>% 
  summarize(oldest = min(yearbuilt), 
            newest = max(yearbuilt), 
            total = n())

Things to note: * multipe summary functions can be called within the same summarize(); * we can give the summary values new names (though we dont’ have to); * the n() function returns the number of observations (surprisingly useful).

That’s not always interesting on it’s own, but when combined with group_by, it is powerful!

group_by()

\(\color{green}{\text{group_by()}}\) - group \(\color{green}{\text{rows}}\) by value(s) of column(s)

homes %>% 
  filter(yearbuilt > 0) %>% 
  group_by(esdistrict) %>% 
  summarize(oldest = min(yearbuilt), 
            avgsqft = mean(finsqft), 
            number = n())

Data wrangling and preparation

Data is usually messy and requires preparation (aka cleaning, wrangling, munging). Now we’ll use these functions (and more) to walk through how we got from the original data sources to this more manageable data set, and finish preparing it for analysis using the Albemarle County Real Estate data as an example.

This data was actually derived from three sources within the Albemarle County Office of Geographic Data Services.

Real Estate Information - Parcel Level Data – containing information about the parcel itself such as owner information, deed acreage value, and assessed value.
Real Estate Information - Card Level Data – containing property information organized by dwellings or units on a property.
Other Parcel Characteristics – containing other parcel information such as Zoning and School Districts.

Our goal is to recreate the homes data we’ve been using from these sources. In creating this data set, the intent was to be able to understand more about the wealth derived from home ownership, the quality and values of residential homes, in our area.

Details are in the script.

Merging data

We want to take three datasets and create one using a series of merges using join functions. There are several types of join functions in R:

inner_join(x,y): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
left_join(x,y): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join(x,y): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join(x,y): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

Wrangling the data

Reference to some steps we’ll take.

Make the variable names lowercase: str_to_lower() from stringr
Fix misclassified character variables: mutate(across())
Fill in some missing values: mutate() and if_else()
Working with factors: fct_relevel(), fct_collapse()

Data Exploration and Wrangling

Michele Claibourn

2/17/2021