Today’s goals include
Before we dive in with the data, let’s get to know the dplyr
package. Part of the the tidyverse
, dplyr
is a package for data manipulation. The package implements a grammar for transforming data, based on verbs/functions that define a set of common tasks.
dplyr
functions are for d
ata frames.
dplyr
functions is always a data frameLet’s look at some quick examples using the Albemarle homes
data .
library(tidyverse)
library(lubridate)
options(tibble.print_min = 5)
homes <- read_csv("data/albemarle_homes_2020.csv")
\(\color{blue}{\text{count()}}\) - tally rows by distinct values of \(\color{blue}{\text{variable}}\)
count(homes, condition)
\(\color{green}{\text{filter()}}\) - extract \(\color{green}{\text{rows}}\) that meet logical conditions
filter(homes, condition == "Excellent")
Logical tests
filter(homes, yearbuilt > 2011)
Boolean operators for multiple conditions
filter(homes, esdistrict == "Murray" & bedroom <= 2)
\(\color{green}{\text{arrange()}}\) - order \(\color{green}{\text{rows}}\) based on value of designated column(s)
arrange(homes, finsqft)
Reverse the order (largest to smallest) with desc()
arrange(homes, desc(finsqft))
\(\color{blue}{\text{select()}}\) - extract \(\color{blue}{\text{variables}}\) by name
select(homes, totalvalue)
select() helpers include
select(homes, lotsize:totalvalue)
The pipe is an operator that allows you to chain together functions. It passes (or pipes) the result on the left into the first argument of the function on the right. It can be read as “and then…”.
For instance, if we want the totalvalue
and lotsize
for homes in the Murray school district arranged in descending order of lotsize, without the pipe it would look like
arrange(
select(
filter(homes, esdistrict == "Murray"),
totalvalue, lotsize),
desc(lotsize))
Or we could save the intervening steps
tmp <- filter(homes, esdistrict == "Murray")
tmp <- select(tmp, totalvalue, lotsize)
arrange(tmp, desc(lotsize))
Or we could use pipes – which is easier!
homes %>%
filter(esdistrict == "Murray") %>%
select(totalvalue, lotsize) %>%
arrange(desc(lotsize))
\(\color{blue}{\text{summarize()}}\) - summarize \(\color{blue}{\text{variables}}\) according to a summary function
Summary functions include
homes %>%
filter(yearbuilt > 0) %>%
summarize(oldest = min(yearbuilt),
newest = max(yearbuilt),
total = n())
Things to note: * multipe summary functions can be called within the same summarize(); * we can give the summary values new names (though we dont’ have to); * the n()
function returns the number of observations (surprisingly useful).
That’s not always interesting on it’s own, but when combined with group_by
, it is powerful!
\(\color{green}{\text{group_by()}}\) - group \(\color{green}{\text{rows}}\) by value(s) of column(s)
homes %>%
filter(yearbuilt > 0) %>%
group_by(esdistrict) %>%
summarize(oldest = min(yearbuilt),
avgsqft = mean(finsqft),
number = n())
Data is usually messy and requires preparation (aka cleaning, wrangling, munging). Now we’ll use these functions (and more) to walk through how we got from the original data sources to this more manageable data set, and finish preparing it for analysis using the Albemarle County Real Estate data as an example.
This data was actually derived from three sources within the Albemarle County Office of Geographic Data Services.
Our goal is to recreate the homes
data we’ve been using from these sources. In creating this data set, the intent was to be able to understand more about the wealth derived from home ownership, the quality and values of residential homes, in our area.
Details are in the script.
We want to take three datasets and create one using a series of merges using join
functions. There are several types of join
functions in R:
inner_join(x,y)
: return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
left_join(x,y)
: return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join(x,y)
: return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join(x,y)
: return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
Reference to some steps we’ll take.
str_to_lower()
from stringr
mutate(across())
mutate()
and if_else()
fct_relevel()
, fct_collapse()