This lesson will show you how to use R to explore your data in a programmatic, systematic, and visual way. The main goals of exploratory data analysis are to generate questions about your data, search for answers within your data, and then refine or create new questions. This is an iterative process that takes both programmatic and visual tools. Even if you already have questions you want to answer with your data, exploratory data analysis can still be used to ensure that your data are clean and meet expectations.
Reminder: Creating a Project
RStudio projects make it straightforward to place your work into its own working directory. Creating a project takes away some of the stress of navigating through file directories and file paths. A project creates an encapsulation for source files, images, and anything else created during your R session.
To create a Project, go to File -> New Project, and then either create a new folder for your project by going to New Directory and browsing to where you want to place your project, or use a folder you have already created by going to Existing Directory and navigating to your chosen folder.
Intro to RMarkdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. To insert an R code chunk, press Ctrl+Alt+I (Windows) or Cmd+Option+I (Mac).
The following is a code chunk. Code chunks provide a way to break your markdown file into sections of code and prose. In this code chunk, I have placed the packages that you will need to install in order to knit together a document (knitr and tinytex).
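The chunk itself is not reproduced here; a minimal version, assuming just the two packages named above, might look like this:

```r
# Install the packages needed to knit a document (only needs to be run once)
install.packages(c("knitr", "tinytex"))
```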
Getting Started
Let’s start by reading in our Albemarle homes dataset using the read_csv() function, which we get from the tidyverse package.
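A minimal sketch of that step, assuming the file is called albemarle_homes.csv and sits in your project directory (your filename may differ):

```r
library(tidyverse)

# read_csv() comes from the readr package, loaded as part of the tidyverse
homes <- read_csv("albemarle_homes.csv")
```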
dplyr
dplyr is a package in R that allows you to work with and manipulate your data. It will allow us to focus in on the variables (columns) of interest and the observations (rows) of interest.
The Pipe
The pipe is an operator in R that allows you to chain together functions in dplyr. In the past, to use multiple functions you would have to nest your functions inside of each other, but the pipe lets you chain the functions together in a readable, reproducible format. The pipe operator is %>% and essentially means “then.” You can type it using Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac).
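To see why the pipe helps, compare a nested call with its piped equivalent; both compute the same thing (a sketch using columns from this dataset):

```r
# Nested: you have to read this inside-out
summarize(filter(homes, yearbuilt >= 2010), n())

# Piped: reads top-to-bottom as "take homes, THEN filter, THEN summarize"
homes %>%
  filter(yearbuilt >= 2010) %>%
  summarize(n())
```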
Count
The count() function returns the distinct values of a column and the number of times each value appears. The following example investigates the different values of the condition variable:
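The call that produces the table below is simply:

```r
homes %>%
  count(condition)
```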
## # A tibble: 8 x 2
## condition n
## <chr> <int>
## 1 Average 23090
## 2 Excellent 290
## 3 Fair 1331
## 4 Good 5076
## 5 NULL 959
## 6 Poor 323
## 7 Substandard 153
## 8 Unknown 6
Filter
If you want to filter rows of the data where some condition is true, use the filter() function.
- The first argument is the data frame you want to filter, e.g. filter(mydata, ...).
- The second argument is a condition you must satisfy, e.g. filter(clean, variable == "levelA").
The comparison operators you can use in a condition are:
- == : equal to
- != : not equal to
- >, >= : greater than, greater than or equal to
- <, <= : less than, less than or equal to
If you want to satisfy all of multiple conditions, you can use the “and” operator, &. The “or” operator, | (the pipe character, usually Shift+Backslash), will return a subset that meets any of the conditions.
Let us say that we wanted to only look at data for those homes built in the last ten years (since 2010):
#Looking at a numeric condition
homes %>%
filter(yearbuilt >= 2010)
#Looking at a categorical condition
homes %>%
filter(condition == "Excellent")
#Combining two conditions using AND (&)
homes %>%
filter(condition == "Excellent" & yearbuilt >= 2010)
#Combining two conditions using OR (|)
homes %>%
filter(condition == "Excellent" | condition == "Average")
# You can save the result of a filter to a separate dataframe by using the assignment operator
new_homes <- homes %>%
filter(yearbuilt >= 2010)
# Filter is a useful function for filtering out observations that are missing or unwanted.
# The following code will create a dataframe that only contains houses that have a value for yearremodeled.
homes %>%
filter(!is.na(yearremodeled))
EXERCISE: count and filter
Look at the homes dataset and find a categorical variable to investigate.
- Figure out the distinct values of the variable and the number of occurrences of each.
- Filter your dataset based on one (or more) of these values. Save this filtered dataset as an object (other than homes).
- Rerun this filtered dataset through the appropriate function to check that you have the distinct values and counts you would expect for your chosen variable.
Select
Whereas the filter() function allows you to return only certain rows matching a condition, the select() function returns only certain columns. The first argument is the data, and subsequent arguments are the columns you want.
You can use the - sign to drop columns you do not want to keep.
The starts_with() and ends_with() functions are useful ways to drop or keep patterns of columns at once.
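For example, using columns we have already seen in this dataset:

```r
# Keep only the columns we name
homes %>%
  select(yearbuilt, condition, totalvalue)

# Drop a column with the - sign
homes %>%
  select(-yearremodeled)

# Keep every column whose name starts with "year"
# (here, yearbuilt and yearremodeled)
homes %>%
  select(starts_with("year"))
```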
Combining functions with the Pipe
The power of dplyr is in the ability to pipe the verbs/functions together. Essentially, the output of one function will then be piped into the input of the next function. The example below takes the output from the filter statement and feeds into the select statement as an input.
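A sketch of such a pipeline, using columns from this dataset (the original chunk is not reproduced here):

```r
# Filter the rows first, then select columns from the filtered result
homes %>%
  filter(yearbuilt >= 2010) %>%
  select(yearbuilt, condition, totalvalue)
```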
Summarize
The summarize()
function summarizes multiple values to a single value. On its own the summarize()
function doesn’t seem to be all that useful. The dplyr package provides a few convenience functions called n()
and n_distinct()
that tell you the number of observations or the number of distinct values of a particular variable.
Notice that summarize() takes a data frame and returns a data frame. In this case it’s a 1x1 data frame with a single row and a single column. The name of the column, by default, is whatever expression was used to summarize the data. This usually isn’t pretty, and if we wanted to work with this resulting data frame later on, we’d want to give that returned value a name that is easier to deal with.
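The chunk that produced the output below was presumably just:

```r
# Count the number of observations in the whole dataset
homes %>%
  summarize(n())
```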
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 31228
# The is.na function returns TRUE for each observation that is missing
homes %>%
filter(is.na(yearbuilt)) %>%
summarize(n())
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 952
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 5
# We can use our typical statistical functions, but be careful of missing values
homes %>%
summarize(median(lastsaleprice))
## # A tibble: 1 x 1
## `median(lastsaleprice)`
## <dbl>
## 1 NA
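Adding na.rm = TRUE tells these functions to drop missing values before computing; the output below presumably came from a call like:

```r
# Drop NA sale prices before computing the median and mean
homes %>%
  summarize(median(lastsaleprice, na.rm = TRUE),
            mean(lastsaleprice, na.rm = TRUE))
```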
## # A tibble: 1 x 2
## `median(lastsaleprice, na.rm = TRUE)` `mean(lastsaleprice, na.rm = TRUE)`
## <dbl> <dbl>
## 1 184000 246097.
# It is helpful to give useful variable names
homes %>%
summarize(median = median(lastsaleprice, na.rm = TRUE),
mean = mean(lastsaleprice, na.rm = TRUE))
## # A tibble: 1 x 2
## median mean
## <dbl> <dbl>
## 1 184000 246097.
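n_distinct() counts the unique values of a variable; the output below presumably came from:

```r
# How many different elementary school districts are there?
homes %>%
  summarize(n_distinct(esdistrict))
```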
## # A tibble: 1 x 1
## `n_distinct(esdistrict)`
## <int>
## 1 7
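To see the values themselves rather than just how many there are, distinct() returns one row per unique value:

```r
# List the unique elementary school districts
homes %>%
  distinct(esdistrict)
```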
## # A tibble: 7 x 1
## esdistrict
## <chr>
## 1 Broadus Wood
## 2 Baker-Butler
## 3 Stony Point
## 4 Greer
## 5 Agnor-Hurt
## 6 Woodbrook
## 7 Hollymead
Group_By
We saw that summarize() isn’t that useful on its own. Neither is group_by(). All it does is take an existing data frame and convert it into a grouped data frame where operations are performed by group.
The real power comes when group_by() and summarize() are used together. First, write the group_by() statement, then pipe the result to a call to summarize().
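For example, the median total home value by high school district (this presumably produced the output below):

```r
# Group the rows by high school district, then summarize within each group
homes %>%
  group_by(hsdistrict) %>%
  summarize(median(totalvalue, na.rm = TRUE))
```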
## # A tibble: 4 x 2
## hsdistrict `median(totalvalue, na.rm = TRUE)`
## <chr> <dbl>
## 1 Albemarle 319650
## 2 Monticello 322100
## 3 Unassigned 685500
## 4 Western Albemarle 439700
And again, you can thread these verbs all together in one pipeline:
homes %>%
filter(yearbuilt >= 2010) %>%
select(-yearremodeled) %>%
group_by(hsdistrict) %>%
summarize(median(totalvalue, na.rm = TRUE))
## # A tibble: 4 x 2
## hsdistrict `median(totalvalue, na.rm = TRUE)`
## <chr> <dbl>
## 1 Albemarle 403850
## 2 Monticello 409000
## 3 Unassigned 521800
## 4 Western Albemarle 524400
Arrange
The arrange() function does what it sounds like: it takes a data frame or tbl and arranges (or sorts) it by the column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to arrange in descending order.
homes %>%
filter(yearbuilt >= 2010 & hsdistrict != "Unassigned") %>%
select(-yearremodeled) %>%
group_by(hsdistrict) %>%
summarize(mean = mean(lotsize)) %>%
arrange(mean)
## # A tibble: 3 x 2
## hsdistrict mean
## <chr> <dbl>
## 1 Albemarle 1.87
## 2 Western Albemarle 2.29
## 3 Monticello 4.06
homes %>%
filter(yearbuilt >= 2010 & hsdistrict != "Unassigned") %>%
select(-yearremodeled) %>%
group_by(hsdistrict) %>%
summarize(mean = mean(lotsize)) %>%
arrange(-mean)
## # A tibble: 3 x 2
## hsdistrict mean
## <chr> <dbl>
## 1 Monticello 4.06
## 2 Western Albemarle 2.29
## 3 Albemarle 1.87
EXERCISE: On Your Own
Unscramble the following statements to find the median improvement value (improvementsvalue) by elementary school district (esdistrict) for homes that have been remodeled since 2010. Arrange in descending order of value and return the 3 districts with the highest median improvement value.
Basic Exploratory Plotting
Exploring the data programmatically is quite helpful; however, sometimes a picture can be worth a thousand words. This section will tie together the use of dplyr with data visualization. The exploratory data analysis process is iterative: we will go through how we can use our dplyr results to inform our visualizations, and how to use the results from our visualizations to inform our use of dplyr functions.
In this session I am going to introduce a few plot types that will be helpful when it comes to data exploration and EDA. Future sessions will expand on the plotting capabilities of R.
The main reasons to use plots in exploratory data analysis are to check for missing data, check for outliers, check for typical values, and to get an overall handle on your data.
Histogram/Density
The first plot we are going to create is a histogram, which is valuable for visualizing distributions and looking for typical values and outliers.
To start a plot using ggplot(), we first build a base/canvas using the ggplot function. We provide the function with a dataset and then map variables from that dataset onto the x and y axes. In this section we will be using the ggplot function and then the following two key aspects of using ggplot:
- a geom, which specifies how the data are represented on the plot (points, lines, bars, etc.)
- aesthetics, which map variables in the data to axes on the plot or to plotting size, shape, color, etc.
Let us first build a histogram looking at the distribution of home ages.
#First build a canvas using the homes dataset and the age variable on our x axis
ggplot(data = homes, aes(x = age))
#From there, we can tell ggplot to use a histogram to plot our values from age
ggplot(data = homes, aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Looking at this visualization, we can see that most of our data falls under 75 years old. We do have a few outliers, so maybe we would want to get the mean age of our homes, both with all of the data and without the outliers.
homes %>%
summarize(mean(age), median(age))
## # A tibble: 1 x 2
## `mean(age)` `median(age)`
## <dbl> <dbl>
## 1 37.7 30
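The second set of numbers below presumably comes from the same summary run after filtering out the older outliers, e.g.:

```r
# Recompute mean and median age with homes 75 years and older removed
homes %>%
  filter(age < 75) %>%
  summarize(mean(age), median(age))
```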
## # A tibble: 1 x 2
## `mean(age)` `median(age)`
## <dbl> <dbl>
## 1 29.6 28
#We could also filter out the outliers and only plot the values for homes that are less than 75 years old.
homes %>%
filter(age < 75) %>%
ggplot(aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Let us take a look at another variable, the total value of a lot.
ggplot(data = homes, aes(totalvalue)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Why do you think that the graph looks like it does?
homes %>% arrange(desc(totalvalue)) %>% head(5) %>% select(totalvalue)
## # A tibble: 5 x 1
## totalvalue
## <dbl>
## 1 7859000
## 2 7766800
## 3 6755600
## 4 6448900
## 5 5980300
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 75
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 1442
#Let us only look at homes that are less than $1,000,000
homes %>%
filter(totalvalue < 1000000) %>%
ggplot(aes(totalvalue)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Box Plots
Sometimes we want to look past a single variable and examine how two variables interact. When looking at a continuous and a categorical variable, a box plot is a highly appropriate choice. A box plot illustrates how varied and spread out your data are across several different levels. Box plots provide a quick way to compare your data across levels, check for outliers, and see the different levels of specific variables.
# To create a boxplot, you simply need to add a y axis to your aes call and then add a geom_boxplot() function.
ggplot(data = homes, aes(x = condition, y = totalrooms)) + geom_boxplot()
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
# What stands out to you about this graph?
# Let us go ahead and deal with these issues
homes %>% filter(condition != "NULL" & condition != "Unknown") %>%
ggplot(aes(condition, totalrooms)) + geom_boxplot()
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
Scatter
The final type of plot that we are going to look at is one that compares multiple continuous variables. The scatter plot is a good choice for investigating the relationship between two continuous variables, and it allows you to spot missing values and other unusual values. geom_point() is the function to create scatter plots.
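A minimal sketch using two continuous variables we have already seen (the exercise below uses different ones):

```r
# Each point is one home: year built on x, total assessed value on y
ggplot(data = homes, aes(x = yearbuilt, y = totalvalue)) + geom_point()
```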
EXERCISE: Let’s scatter
Plot the year a home was remodeled against the last sale price
- Which points in the plot will give us trouble down the line?
- How do you think we could handle these issues?
- Attempt to clean up the data and rerun the scatter plot. How does it look now?
## Warning: Removed 28929 rows containing missing values (geom_point).