Today’s goals include

  • Learning to use dplyr functions for data exploration
  • Thinking about data exploration and preparation as a chance to get to know the data you’re working with
  • Using tidyverse functions to go from raw data to a dataset ready for further analysis

dplyr and data exploration

Before we dive in with the data, let’s get to know the dplyr package. Part of the the tidyverse, dplyr is a package for data manipulation. The package implements a grammar for transforming data, based on verbs/functions that define a set of common tasks.

Functions/Verbs

dplyr functions are for data frames.

  • first argument of dplyr functions is always a data frame
  • followed by function specific arguments that detail what to do

Let’s look at some quick examples using the Albemarle homes data .

library(tidyverse)
library(lubridate)
options(tibble.print_min = 5)

homes <- read_csv("data/albemarle_homes_2020.csv")

count()

\(\color{blue}{\text{count()}}\) - tally rows by distinct values of \(\color{blue}{\text{variable}}\)

count(homes, condition)

filter()

\(\color{green}{\text{filter()}}\) - extract \(\color{green}{\text{rows}}\) that meet logical conditions

filter(homes, condition == "Excellent")

Logical tests

  • x < y: less than
  • x > y: greater than
  • x == y: equal to
  • x <= y: less than or equal to
  • y >= y: greater than or equal to
  • x != y: not equal to
  • x %in% y: is a member of
  • is.na(x): is NA
  • !is.na(x): is not NA
filter(homes, yearbuilt > 2011)

Boolean operators for multiple conditions

  • a & b: and
  • a | b: or
  • xor(a,b): exactly or
  • !a: not
filter(homes, esdistrict == "Murray" & bedroom <= 2)

arrange()

\(\color{green}{\text{arrange()}}\) - order \(\color{green}{\text{rows}}\) based on value of designated column(s)

arrange(homes, finsqft)

Reverse the order (largest to smallest) with desc()

arrange(homes, desc(finsqft))

select()

\(\color{blue}{\text{select()}}\) - extract \(\color{blue}{\text{variables}}\) by name

select(homes, totalvalue)

select() helpers include

  • select(.data, var1:var10): select range of columns
  • select(.data, -c(var1, var2)): select every column but
  • select(.data, starts_with(“string”)): select columns that start with… (or ends_with(“string”))
  • select(.data, contains(“string”)): select columns whose names contain…
select(homes, lotsize:totalvalue)

pipes

The pipe is an operator that allows you to chain together functions. It passes (or pipes) the result on the left into the first argument of the function on the right. It can be read as “and then…”.

For instance, if we want the totalvalue and lotsize for homes in the Murray school district arranged in descending order of lotsize, without the pipe it would look like

arrange(
  select(
    filter(homes, esdistrict == "Murray"), 
    totalvalue, lotsize), 
  desc(lotsize))

Or we could save the intervening steps

tmp <- filter(homes, esdistrict == "Murray")
tmp <- select(tmp, totalvalue, lotsize)
arrange(tmp, desc(lotsize))

Or we could use pipes – which is easier!

homes %>% 
  filter(esdistrict == "Murray") %>% 
  select(totalvalue, lotsize) %>% 
  arrange(desc(lotsize))

Keyboard Shortcut!

  • Mac: cmd + shift + m
  • Windows: ctrl + shift + m

summarize()

\(\color{blue}{\text{summarize()}}\) - summarize \(\color{blue}{\text{variables}}\) according to a summary function

Summary functions include

  • first(): first value
  • last(): last value
  • nth(.x, n): nth value
  • n(): number of values
  • n_distinct(): number of distinct values
  • min(): minimum value
  • max(): maximum value
  • mean(): mean value
  • median(): median value
  • var(): variance
  • sd(): standard deviation
  • IQR(): interquartile range
homes %>% 
  filter(yearbuilt > 0) %>% 
  summarize(oldest = min(yearbuilt), 
            newest = max(yearbuilt), 
            total = n())

Things to note: * multipe summary functions can be called within the same summarize(); * we can give the summary values new names (though we dont’ have to); * the n() function returns the number of observations (surprisingly useful).

That’s not always interesting on it’s own, but when combined with group_by, it is powerful!

group_by()

\(\color{green}{\text{group_by()}}\) - group \(\color{green}{\text{rows}}\) by value(s) of column(s)

homes %>% 
  filter(yearbuilt > 0) %>% 
  group_by(esdistrict) %>% 
  summarize(oldest = min(yearbuilt), 
            avgsqft = mean(finsqft), 
            number = n())

Data wrangling and preparation

Data is usually messy and requires preparation (aka cleaning, wrangling, munging). Now we’ll use these functions (and more) to walk through how we got from the original data sources to this more manageable data set, and finish preparing it for analysis using the Albemarle County Real Estate data as an example.

This data was actually derived from three sources within the Albemarle County Office of Geographic Data Services.

Our goal is to recreate the homes data we’ve been using from these sources. In creating this data set, the intent was to be able to understand more about the wealth derived from home ownership, the quality and values of residential homes, in our area.

Details are in the script.

Merging data

We want to take three datasets and create one using a series of merges using join functions. There are several types of join functions in R:

  • inner_join(x,y): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

  • left_join(x,y): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  • right_join(x,y): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  • full_join(x,y): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

Wrangling the data

Reference to some steps we’ll take.

  1. Make the variable names lowercase: str_to_lower() from stringr
  2. Fix misclassified character variables: mutate(across())
  3. Fill in some missing values: mutate() and if_else()
  4. Working with factors: fct_relevel(), fct_collapse()