This lesson will show you how to use R to explore your data in a programmatic, systematic, and visual way. The main goals of exploratory data analysis are to generate questions about your data, search for answers within your data, and then refine or create new questions. This is an iterative process that takes both programmatic and visual tools. Even if you already have questions you want to answer with your data, exploratory data analysis can still be used to ensure that your data are clean and meet expectations.
Reminder: Creating a Project
RStudio projects make it straightforward to place your work into its own working directory. Creating a project takes away some of the stress of navigating through file directories and file paths. A project creates an encapsulation for source files, images, and anything else created during your R session.
To create a Project, go to File -> New Project, and then either create a new folder for your project by going to New Directory and browsing to where you want to place your project, or use a folder you have already created by going to Existing Directory and navigating to your chosen folder.
Intro to RMarkdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. To insert an R code chunk, press Ctrl+Alt+I (Windows) or Cmd+Option+I (Mac).
The following is a code chunk. Code chunks provide a way to break your markdown file into sections of code and prose. In this code chunk, I have placed the packages that you will need to install in order to knit together a document (knitr and tinytex).
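The chunk itself is not reproduced here; a minimal version, assuming just the two packages named above, might look like this:

```r
# Install the packages needed to knit a document (only needs to be run once)
install.packages(c("knitr", "tinytex"))
```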
Getting Started
Let’s start by reading in our Albemarle homes dataset using the read_csv() function, which we get from the tidyverse package.
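A minimal sketch of that step, assuming the file is called albemarle_homes.csv and sits in your project directory (your filename may differ):

```r
library(tidyverse)

# read_csv() comes from the readr package, loaded as part of the tidyverse
homes <- read_csv("albemarle_homes.csv")
```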
dplyr
dplyr is a package in R that allows you to work with and manipulate your data. It will allow us to focus in on the variables (columns) of interest and the observations (rows) of interest.
The Pipe
The pipe is an operator in R that allows you to chain together functions in dplyr. In the past, to use multiple functions you would have to nest your functions inside of each other, but the pipe lets you chain the functions together in a readable, reproducible format. The pipe operator is %>% and essentially means “then.” You can type it using Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac).
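To see why the pipe helps, compare a nested call with its piped equivalent; both compute the same thing (a sketch using columns from this dataset):

```r
# Nested: you have to read this inside-out
summarize(filter(homes, yearbuilt >= 2010), n())

# Piped: reads top-to-bottom as "take homes, THEN filter, THEN summarize"
homes %>%
  filter(yearbuilt >= 2010) %>%
  summarize(n())
```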
Count
The count() function returns the distinct values of a column and the number of times each value appears. The following example investigates the different values of the condition variable:
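The call that produces the table below is simply:

```r
homes %>%
  count(condition)
```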
## # A tibble: 8 x 2
## condition n
## <chr> <int>
## 1 Average 23090
## 2 Excellent 290
## 3 Fair 1331
## 4 Good 5076
## 5 NULL 959
## 6 Poor 323
## 7 Substandard 153
## 8 Unknown 6
Filter
If you want to filter rows of the data where some condition is true, use the filter() function.
- The first argument is the data frame you want to filter, e.g. filter(mydata, ...).
- The second argument is a condition you must satisfy, e.g. filter(clean, variable == "levelA").
The comparison operators you can use in a condition are:
- == : equal to
- != : not equal to
- >, >= : greater than, greater than or equal to
- <, <= : less than, less than or equal to
If you want to satisfy all of multiple conditions, you can use the “and” operator, &. The “or” operator, | (the pipe character, usually Shift+Backslash), will return a subset that meets any of the conditions.
Let us say that we wanted to only look at data for those homes built in the last ten years (since 2010):
#Looking at a numeric condition
homes %>%
filter(yearbuilt >= 2010)
#Looking at a categorical condition
homes %>%
filter(condition == "Excellent")
#Combining two conditions using AND (&)
homes %>%
filter(condition == "Excellent" & yearbuilt >= 2010)
#Combining two conditions using OR (|)
homes %>%
filter(condition == "Excellent" | condition == "Average")
# You can save the result of a filter to a separate dataframe by using the assignment operator
new_homes <- homes %>%
filter(yearbuilt >= 2010)
# Filter is a useful function for filtering out observations that are missing or unwanted.
# The following code will create a dataframe that only contains houses that have a value for yearremodeled.
homes %>%
filter(!is.na(yearremodeled))
EXERCISE: count and filter
Look at the homes dataset and find a categorical variable to investigate.
- Figure out the distinct values of the variable and the number of occurrences of each.
- Filter your dataset based on one (or more) of these values. Save this filtered dataset as an object (other than homes).
- Rerun this filtered dataset through the appropriate function to check that you have the distinct values and counts you would expect for your chosen variable.
Select
Whereas the filter() function allows you to return only certain rows matching a condition, the select() function returns only certain columns. The first argument is the data, and subsequent arguments are the columns you want.
You can use the - sign to drop columns you do not want to keep.
The starts_with() and ends_with() functions are useful ways to drop or keep patterns of columns at once.
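For example, using columns we have already seen in this dataset:

```r
# Keep only the columns we name
homes %>%
  select(yearbuilt, condition, totalvalue)

# Drop a column with the - sign
homes %>%
  select(-yearremodeled)

# Keep every column whose name starts with "year"
# (here, yearbuilt and yearremodeled)
homes %>%
  select(starts_with("year"))
```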
Combining functions with the Pipe
The power of dplyr is in the ability to pipe the verbs/functions together. Essentially, the output of one function will then be piped into the input of the next function. The example below takes the output from the filter statement and feeds into the select statement as an input.
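A sketch of such a pipeline, using columns from this dataset (the original chunk is not reproduced here):

```r
# Filter the rows first, then select columns from the filtered result
homes %>%
  filter(yearbuilt >= 2010) %>%
  select(yearbuilt, condition, totalvalue)
```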
Summarize
The summarize()
function summarizes multiple values to a single value. On its own the summarize()
function doesn’t seem to be all that useful. The dplyr package provides a few convenience functions called n()
and n_distinct()
that tell you the number of observations or the number of distinct values of a particular variable.
Notice that summarize() takes a data frame and returns a data frame. In this case it’s a 1x1 data frame with a single row and a single column. The name of the column, by default, is whatever expression was used to summarize the data. This usually isn’t pretty, and if we wanted to work with this resulting data frame later on, we’d want to give that returned value a name that is easier to deal with.
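The chunk that produced the output below was presumably just:

```r
# Count the number of observations in the whole dataset
homes %>%
  summarize(n())
```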
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 31228
# The is.na function returns TRUE for each observation that is missing
homes %>%
filter(is.na(yearbuilt)) %>%
summarize(n())
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 952
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 5
# We can use our typical statistical functions, but be careful of missing values
homes %>%
summarize(median(lastsaleprice))
## # A tibble: 1 x 1
## `median(lastsaleprice)`
## <dbl>
## 1 NA
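Adding na.rm = TRUE tells these functions to drop missing values before computing; the output below presumably came from a call like:

```r
# Drop NA sale prices before computing the median and mean
homes %>%
  summarize(median(lastsaleprice, na.rm = TRUE),
            mean(lastsaleprice, na.rm = TRUE))
```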
## # A tibble: 1 x 2
## `median(lastsaleprice, na.rm = TRUE)` `mean(lastsaleprice, na.rm = TRUE)`
## <dbl> <dbl>
## 1 184000 246097.
# It is helpful to give useful variable names
homes %>%
summarize(median = median(lastsaleprice, na.rm = TRUE),
mean = mean(lastsaleprice, na.rm = TRUE))
## # A tibble: 1 x 2
## median mean
## <dbl> <dbl>
## 1 184000 246097.
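n_distinct() counts the unique values of a variable; the output below presumably came from:

```r
# How many different elementary school districts are there?
homes %>%
  summarize(n_distinct(esdistrict))
```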
## # A tibble: 1 x 1
## `n_distinct(esdistrict)`
## <int>
## 1 7
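To see the values themselves rather than just how many there are, distinct() returns one row per unique value:

```r
# List the unique elementary school districts
homes %>%
  distinct(esdistrict)
```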
## # A tibble: 7 x 1
## esdistrict
## <chr>
## 1 Broadus Wood
## 2 Baker-Butler
## 3 Stony Point
## 4 Greer
## 5 Agnor-Hurt
## 6 Woodbrook
## 7 Hollymead
Group_By
We saw that summarize() isn’t that useful on its own. Neither is group_by(). All it does is take an existing data frame and convert it into a grouped data frame where operations are performed by group.
The real power comes when group_by() and summarize() are used together. First, write the group_by() statement, then pipe the result to a call to summarize().
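For example, the median total home value by high school district (this presumably produced the output below):

```r
# Group the rows by high school district, then summarize within each group
homes %>%
  group_by(hsdistrict) %>%
  summarize(median(totalvalue, na.rm = TRUE))
```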
## # A tibble: 4 x 2
## hsdistrict `median(totalvalue, na.rm = TRUE)`
## <chr> <dbl>
## 1 Albemarle 319650
## 2 Monticello 322100
## 3 Unassigned 685500
## 4 Western Albemarle 439700
And again, you can thread these verbs all together in one pipeline:
homes %>%
filter(yearbuilt >= 2010) %>%
select(-yearremodeled) %>%
group_by(hsdistrict) %>%
summarize(median(totalvalue, na.rm = TRUE))
## # A tibble: 4 x 2
## hsdistrict `median(totalvalue, na.rm = TRUE)`
## <chr> <dbl>
## 1 Albemarle 403850
## 2 Monticello 409000
## 3 Unassigned 521800
## 4 Western Albemarle 524400
Arrange
The arrange() function does what it sounds like: it takes a data frame or tbl and arranges (or sorts) it by the column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to arrange in descending order.
homes %>%
filter(yearbuilt >= 2010 & hsdistrict != "Unassigned") %>%
select(-yearremodeled) %>%
group_by(hsdistrict) %>%
summarize(mean = mean(lotsize)) %>%
arrange(mean)
## # A tibble: 3 x 2
## hsdistrict mean
## <chr> <dbl>
## 1 Albemarle 1.87
## 2 Western Albemarle 2.29
## 3 Monticello 4.06
homes %>%
filter(yearbuilt >= 2010 & hsdistrict != "Unassigned") %>%
select(-yearremodeled) %>%
group_by(hsdistrict) %>%
summarize(mean = mean(lotsize)) %>%
arrange(-mean)
## # A tibble: 3 x 2
## hsdistrict mean
## <chr> <dbl>
## 1 Monticello 4.06
## 2 Western Albemarle 2.29
## 3 Albemarle 1.87
EXERCISE: On Your Own
Unscramble the following statements to find the median improvement value (improvementsvalue) by elementary school district (esdistrict) for homes that have been remodeled since 2010. Arrange in descending order of value and return the 3 districts with the highest median improvement value.
Basic Exploratory Plotting
Exploring the data programmatically is quite helpful; however, sometimes a picture can be worth a thousand words. This section will tie together the use of dplyr with data visualization. The exploratory data analysis process is iterative: we will go through how we can use our dplyr results to inform our visualizations, and how to use the results from our visualizations to inform our use of dplyr functions.
In this session I am going to introduce a few plot types that will be helpful when it comes to data exploration and EDA. Future sessions will expand on the plotting capabilities of R.
The main reasons to use plots in exploratory data analysis are to check for missing data, check for outliers, check for typical values, and to get an overall handle on your data.
Histogram/Density
The first plot we are going to create is a histogram, which is valuable for visualizing distributions and looking for typical values and outliers.
To start a plot using ggplot(), we first build a base/canvas using the ggplot function. We provide the function with a dataset and then map variables from that dataset onto the x and y axes. In this section we will be using the ggplot function and then the following two key aspects of using ggplot:
- a geom, which specifies how the data are represented on the plot (points, lines, bars, etc.)
- aesthetics, which map variables in the data to axes on the plot or to plotting size, shape, color, etc.
Let us first build a histogram looking at the distribution of home ages.
#First build a canvas using the homes dataset and the age variable on our x axis
ggplot(data = homes, aes(x = age))
#From there, we can tell ggplot to use a histogram to plot our values from age
ggplot(data = homes, aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Looking at this visualization, we can see that most of our data falls under 75 years old. We do have a few outliers, so maybe we would want to get the mean age of our homes, both with all of the data and without the outliers.
homes %>%
summarize(mean(age), median(age))
## # A tibble: 1 x 2
## `mean(age)` `median(age)`
## <dbl> <dbl>
## 1 37.7 30
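The second set of numbers below presumably comes from the same summary run after filtering out the older outliers, e.g.:

```r
# Recompute mean and median age with homes 75 years and older removed
homes %>%
  filter(age < 75) %>%
  summarize(mean(age), median(age))
```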
## # A tibble: 1 x 2
## `mean(age)` `median(age)`
## <dbl> <dbl>
## 1 29.6 28
#We could also filter out the outliers and only plot the values for homes that are less than 75 years old.
homes %>%
filter(age < 75) %>%
ggplot(aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Let us take a look at another variable, the total value of a lot.
ggplot(data = homes, aes(totalvalue)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Why do you think that the graph looks like it does?
homes %>% arrange(desc(totalvalue)) %>% head(5) %>% select(totalvalue)
## # A tibble: 5 x 1
## totalvalue
## <dbl>
## 1 7859000
## 2 7766800
## 3 6755600
## 4 6448900
## 5 5980300
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 75
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 1442
#Let us only look at homes that are less than $1,000,000
homes %>%
filter(totalvalue < 1000000) %>%
ggplot(aes(totalvalue)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Box Plots
Sometimes we want to look past a single variable and examine how two variables interact. When looking at a continuous and a categorical variable, a box plot is a highly appropriate choice. A box plot illustrates how varied and spread out your data are across several different levels. Box plots provide a quick way to compare your data across levels, check for outliers, and see the different levels of specific variables.
# To create a boxplot, you simply need to add a y axis to your aes call and then add a geom_boxplot() function.
ggplot(data = homes, aes(x = condition, y = totalrooms)) + geom_boxplot()
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
# What stands out to you about this graph?
# Let us go ahead and deal with these issues
homes %>% filter(condition != "NULL" & condition != "Unknown") %>%
ggplot(aes(condition, totalrooms)) + geom_boxplot()
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
Scatter
The final type of plot that we are going to look at is one that compares multiple continuous variables. The scatter plot is a good choice for investigating the relationship between two continuous variables, and it allows you to spot missing values and other unusual values. geom_point() is the function to create scatter plots.
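A minimal sketch using two continuous variables we have already seen (the exercise below uses different ones):

```r
# Each point is one home: year built on x, total assessed value on y
ggplot(data = homes, aes(x = yearbuilt, y = totalvalue)) + geom_point()
```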
EXERCISE: Let’s scatter
Plot the year a home was remodeled against the last sale price
- Which points in the plot will give us trouble down the line?
- How do you think we could handle these issues?
- Attempt to clean up the data and rerun the scatter plot. How does it look now?
## Warning: Removed 28929 rows containing missing values (geom_point).