Before we get started

Set R Global Options

  1. Go to Tools -> Global Options -> Code -> Make sure Soft-Wrap R Source Files is checked.

  2. Go to Tools -> Global Options -> RMarkdown -> Make sure Show Output Inline for all RMarkdown Documents is unchecked.

Today’s class

Today we are going to walk through how you go from disparate datasets to one unified dataset. In previous weeks we’ve worked with a dataset that was complete and clean; however, it did not start that way. This dataset was actually pulled together from three separate sources.

Albemarle County Office of Geographic Data Services

http://www.albemarle.org/department.asp?department=gds&relpage=3914Parcels

Real Estate Information - Parcel Level Data

https://gisweb.albemarle.org/gisdata/CAMA/GIS_View_Redacted_ParcelInfo_TXT.zip

This file contains information about the parcel itself such as owner information, deed acreage value, and assessed value.

Real Estate Information - Card Level Data

Card Level Data refers to property information organized by particular residential dwellings or commercial units (e.g. building details and outbuilding information) on a given property. These tables can be linked to the Parcel Level Data table (ParcelID field) via the TMP field.

Card Level Data: https://gisweb.albemarle.org/gisdata/CAMA/GIS_CardLevelData_new_TXT.zip This file includes data such as year built, finished square footage, number of rooms, and condition.

Other characteristics: https://gisweb.albemarle.org/gisdata/CAMA/CityView_View_OtherParcelCharacteristics_TXT.zip This file contains other parcel information that is managed in our development tracking system (e.g. Zoning, School Districts, Jurisdictional Areas, etc.).

## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   tmp = col_character(),
##   usecode = col_character(),
##   condition = col_character(),
##   cooling = col_character(),
##   lastsaledate1 = col_character(),
##   esdistrict = col_character(),
##   msdistrict = col_character(),
##   hsdistrict = col_character(),
##   lastsaledate = col_character(),
##   condition2 = col_character()
## )
## See spec(...) for full column specifications.

Reading in our TXT Files

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   CardNum = col_double(),
##   VisionID = col_double(),
##   BID = col_double()
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   LotSize = col_double(),
##   LotSizeJan1 = col_double(),
##   LastSalePrice = col_double(),
##   Cards = col_double(),
##   UniqueField = col_double()
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   CensusBlockGroup = col_double(),
##   CensusTract = col_double(),
##   LandUsePrimaryStructuresNumber = col_double(),
##   LandUsePrimaryDwellingUnitsNumber = col_double(),
##   LanduseSecondaryStructuresNumber = col_double(),
##   LandUseSecondaryDwellingUnitsNumber = col_double(),
##   LanduseMinorStructuresNumber = col_double(),
##   LandUseMinorDwellingUnitsNumber = col_double()
## )
## See spec(...) for full column specifications.

Data Investigation

Before we begin making changes, we need to identify what we want in the end. We want to be able to look at the value of the house (parcel), the actual makeup of the house and land (card), and other pertinent information about the area in which the house resides (other). We are going to start by looking at the card dataset to see which variables we want to keep with regard to the house/land.

There are a few things to note here. First, we want to make sure we keep the TMP variable, since that is how we will join this dataset to the others. Second, there is a lot of information here, perhaps more than we need. Finally, all the data was read in as character data, even things like number of stories and total rooms. We will want to change that.

The first and second tasks we can complete right now by using the select function.
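Here is a minimal sketch of that select step. The toy data and variable names below (card_raw, YearBuilt, FinSqFt, Owner, Notes) are stand-ins for illustration; check names() on your own card object for the real columns.

```r
library(dplyr)

# Toy stand-in for the raw card data -- real column names may differ
card_raw <- tibble(
  TMP       = c("A1", "A2"),
  YearBuilt = c("1990", "2005"),
  FinSqFt   = c("1500", "2200"),
  Owner     = c("Smith", "Jones"),  # a column we don't need
  Notes     = c("x", "y")           # another we don't need
)

# Keep TMP (the join key) plus the house/land variables we care about
card <- card_raw %>%
  select(TMP, YearBuilt, FinSqFt)

names(card)
```

Note that select both drops the unwanted columns and sets the column order of the result.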

Using Mutate to change/add things to your dataset.

So, we’ve already culled our card dataset down from 35 to 14 variables. However, we still have some issues with all the data being characters and a few other issues. To make changes to your data, like adding a new variable/column or changing the variable type, we will be utilizing the mutate function.

However, we are simply going to make a small change at first by using mutate to add a simple column.
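A small sketch of that first mutate call, using a toy card table; tagging each row with a source column like this makes it easy to tell where a row came from after the datasets are merged.

```r
library(dplyr)

card <- tibble(TMP = c("A1", "A2"), FinSqFt = c(1500, 2200))

# Tag every row with the dataset it came from
card <- card %>%
  mutate(source = "card")
```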

Now that we have started on our card dataset, we are going to move over to the parcel and other datasets and go through similar motions.


Exercise

I have selected the variables below for you to keep in our two datasets.

I want you to:

  1. Create two new datasets, parcel and other, by selecting the chosen variables from the raw datasets.

Then:

  2. Add variables to parcel and other naming the source of the data.
  3. There is no question 3…

Turning many into one (the power of the merge)

We now have 3 datasets with the variables that we want. We’ve gone from having over 110 variables to around 30. Now, the goal is to take our 3 datasets and create 1 dataset that we will work with moving forward. To do this, we are going to use a series of merges via dplyr’s join functions.

There are several types of joins (merges) that we can use. The one you use depends upon your specific needs.

inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

full_join(): return all rows and all columns from both x and y. Where there is no matching value, NA is returned for the missing side.
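The four joins above can be seen side by side on two tiny made-up tables; x and y here are purely illustrative, with id playing the role of the join key.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

inner_join(x, y, by = "id")  # ids 2 and 3 only (matches in both)
left_join(x, y, by = "id")   # ids 1, 2, 3; b is NA for id 1
right_join(x, y, by = "id")  # ids 2, 3, 4; a is NA for id 4
full_join(x, y, by = "id")   # ids 1, 2, 3, 4
```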

However, before attempting to join our data, we need to investigate the variable we are going to use to uniquely identify observations in each of the datasets.

We are going to start by combining our parcel and other datasets together. Since there is, in theory, only one row per parcel in each of the datasets, we are going to use a one-to-one join, where a parcel in one dataset should match only one row in the other dataset.

We are now going to take the card dataset and merge it onto the combined parcel dataset. Since the card dataset can have multiple observations that belong to a single parcel, this is considered a many-to-one join. We are going to use a full join for this merge, as we want to keep all data, matching or not.
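A sketch of that many-to-one full join, on toy data. The tmp key follows the source description (card rows link to parcels via the TMP field); the values and the parcel_other name are made up for illustration.

```r
library(dplyr)

parcel_other <- tibble(tmp = c("P1", "P2"), totalvalue = c(250000, 410000))
card         <- tibble(tmp = c("P1", "P1", "P2"), finsqft = c(1200, 800, 2100))

# Many-to-one: one parcel row can match several card rows,
# so the parcel's columns are repeated for each matching card
homes <- full_join(parcel_other, card, by = "tmp")
nrow(homes)  # 3 -- P1 appears twice
```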

Details, details, details.

Now the tedious work begins. This is where we have to go through and ensure that the data we have is in the shape and form that we want it to be in. One way to start is just to investigate the data as we did before.

One thing we could do, if desired, is change the variable names to all lower case. Down the road, this makes it harder to introduce mistakes by mis-capitalizing variable names.
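One way to do this, sketched on a toy homes table, is rename_all, which applies a function to every column name:

```r
library(dplyr)

homes <- tibble(TMP = "P1", TotalValue = 250000, FinSqFt = 1200)

# Apply tolower() to every column name
homes <- homes %>% rename_all(tolower)
names(homes)  # "tmp" "totalvalue" "finsqft"
```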

We are close on the variables, but we still have some cleaning up to do in regards to the value of our variables.

Here are a few things we have (make believe with me) identified as important for our data.

  1. We only want residential home records (not businesses, apartment complexes, etc.).
  2. We want properties that individual households can own and accrue wealth through, or properties that can be rented to individual households.

Much earlier, I mentioned the issue with all of our variables being read in as character variables. This is obviously not ideal, so we need to use a special version of mutate to handle our mis-classified variables.

This will look quite similar to the use of %in% above. We create a vector of variables we want to change and then we use the mutate_at function to change them.
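A small sketch of that pattern; the variable names in numeric_vars are illustrative picks from this dataset, and mutate_at accepts a character vector of column names directly:

```r
library(dplyr)

homes <- tibble(
  tmp      = "P1",
  bedroom  = "3",  # read in as character
  fullbath = "2",
  totalrooms = "7"
)

# The variables we want converted from character to numeric
numeric_vars <- c("bedroom", "fullbath", "totalrooms")

homes <- homes %>% mutate_at(numeric_vars, as.numeric)
```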

Into the Weeds

For today, the main purpose of this dataset is to look at the total value (totalvalue), the square footage (finsqft), and lot size (lotsize). Since these are the key variables for us to look at, we need to ensure that we have very reliable data for these variables.

We can also do something similar to square footage and lot size.
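The filtering step might look something like the sketch below; the zero-value thresholds and toy data are illustrative, not the workshop's exact rules.

```r
library(dplyr)

homes <- tibble(
  totalvalue = c(250000, 0, 410000),
  finsqft    = c(1200, 0, 2100),
  lotsize    = c(0.5, 0, 1.2)
)

# Drop records with implausible zero values (thresholds are illustrative)
homes <- homes %>%
  filter(totalvalue > 0, finsqft > 0, lotsize > 0)

nrow(homes)  # 2
```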

After running those filters, we now have the same number of observations in our datasets.

Fixing up the rest

This is quite the process, huh? Let’s keep moving…

The first thing we are going to do is create an age variable, since we may want to bin things by age.
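A minimal sketch of the age variable, assuming 2020 as the reference year (swap in whatever year your data reflects):

```r
library(dplyr)

homes <- tibble(yearbuilt = c(1950, 1987, 2005))

# Age relative to an assumed reference year of 2020
homes <- homes %>% mutate(age = 2020 - yearbuilt)
```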

The next step we are going to take is to relevel a factor variable. When we look at the condition of the house, the factor will automatically set the factor up in alphabetical order. This is fine at times, but sometimes you want the factor to be ordered in your own specific way.
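With forcats (loaded with the tidyverse), fct_relevel lets you impose your own level order; the condition categories below are assumed for illustration and may not match the county's exact codes.

```r
library(dplyr)
library(forcats)

homes <- tibble(condition = c("Fair", "Average", "Good", "Average"))

# By default factor() orders levels alphabetically: Average, Fair, Good.
# fct_relevel() lets us impose a best-to-worst ordering instead.
homes <- homes %>%
  mutate(condition = factor(condition),
         condition = fct_relevel(condition, "Good", "Average", "Fair"))

levels(homes$condition)
```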

The rest of the cleaning done on these datasets is shown below. These tasks tend to be repeats of things we have done earlier, but they will be good examples for you moving forward.

Run this code chunk:

#   yearremodeled, numstories, cooling, fp_open, bedroom, fullbath,
#   halfbath, totalrooms, landvalue, landusevalue, improvementsvalue,
#   lastsaleprice, esdistrict, msdistrict, hsdistrict, censustract

# yearremodeled -> remodel indicator
summary(homes$yearremodeled) # NA not remodeled; not sure about 2 or 8
homes <- homes %>% 
  mutate(remodel = if_else(!is.na(yearremodeled), 1, 0))


# numstories
table(homes$numstories) # not sure what this means; was expecting 1, 2, 3, etc. Let's drop it
homes <- homes %>% select(-numstories)


# cooling
table(homes$cooling) # fix factor -- assume 00, M1, and NULL are no air
homes <- homes %>% 
  mutate(cooling = fct_collapse(cooling, 
                                "No Central Air" = c("00", "M1", "NULL")))


# fp_open (these are characters)
table(homes$fp_open) # make a binary indicator, 0 and Null are none
homes <- homes %>% 
  mutate(fp = if_else(fp_open %in% c("0", "NULL"), 0, 1))


# bedroom, fullbath, halfbath, totalrooms 
table(homes$bedroom) # 103 homes with no bedroom is a likely error
table(homes$fullbath) # 168 homes with no full bath is a likely error
table(homes$halfbath) # ok
table(homes$totalrooms) # 479 homes with no rooms is a likely error


# landvalue
summary(homes$landvalue) # no missing, some 0s
homes %>% filter(landvalue == 0) %>% count(lotsize) # 13 total, only one with 0 lotsize

# landusevalue
summary(homes$landusevalue) # no missing
homes <- homes %>% # create binary indicator for land use (land generates revenue)
  mutate(landuse = if_else(landusevalue > 0, 1, 0)) %>% 
  select(-landusevalue) # remove variable
table(homes$landuse)

# improvementsvalue
summary(homes$improvementsvalue) # no missing, some 0s

# create a tmp file with ImprovementsValue == 0, arrange the file by finsqft
tmp <- homes %>% 
  filter(improvementsvalue == 0) %>% 
  arrange(finsqft)
# several new buildings with common values (e.g., condo/th dev?)

# lastsaleprice
summary(homes$lastsaleprice)
tmp <- homes %>% 
  filter(lastsaleprice == 0) %>% 
  arrange(lastsaledate) 
# first ~ 700 records, last sale date is prior to year built 

homes %>% 
  mutate(datecheck = if_else(yearbuilt > as.integer(year(lastsaledate)), 1, 0)) %>% 
  filter(datecheck == 1 | yearbuilt == 0) %>% 
  tally()
# suggests yearbuilt (or lastsaledate) is wrong for at least 3472 records

#..................................
# Clean up and save ----
# remove a few additional variables -- these were for examining the data
homes <- homes %>% 
  select(-c(cardnum, fp_open, owner, propname, cards, med_age))

We are finally there. If you look at your homes object and compare it to the goal dataset, we should have the same number of observations and variables. If you don’t, don’t fret; this is all just practice. Below, I am showing a way that you could compare two datasets to see just how well we did.
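One possible way to compare, sketched on toy stand-ins for homes and the goal dataset (the goal name and values here are hypothetical): check dimensions, look for column names unique to either side, then use anti_join to find rows without an exact match.

```r
library(dplyr)

goal  <- tibble(tmp = c("P1", "P2"), totalvalue = c(250000, 410000))
homes <- tibble(tmp = c("P1", "P2"), totalvalue = c(250000, 410000))

# Same dimensions?
dim(homes) == dim(goal)

# Columns in one dataset but not the other
setdiff(names(homes), names(goal))
setdiff(names(goal), names(homes))

# Rows in homes with no exact match in goal (0 rows means they agree)
anti_join(homes, goal, by = names(goal))
```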

Appendix

The following functions were not needed in this workshop, but they are ones that I find useful.

The final functions I want to show, spread and gather, are slightly out of date and have been replaced by the pivot_wider and pivot_longer functions. However, the pivot functions are quite picky about the version of tidyr you have installed, and they work in quite a similar fashion.

The spread and gather functions allow you to take your data from wide to long or long to wide, reshaping your complete dataset. There are times where you want to have every observation represent a single thing, like one state, and have the columns represent a large amount of information, like average house price from 1992-2016. This would make the data quite wide. You also might want to break your dataset into a state-year format, where you would have many observations and very few columns. This would be a long dataset.
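A small sketch of the wide-to-long and long-to-wide reshapes described above, using made-up state/price data:

```r
library(tidyr)

# Wide: one row per state, one column per year
wide <- tibble::tibble(
  state  = c("VA", "NC"),
  `1992` = c(100, 90),
  `1993` = c(110, 95)
)

# Wide -> long: one row per state-year
long <- gather(wide, key = "year", value = "price", -state)

# Long -> wide: back to one row per state
wide2 <- spread(long, year, price)
```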