PhD+ Module: Data Science Essentials in R

Sessions

Introduction to R

R is a free, open-source programming language and software environment designed for statistical analysis. RStudio is a free, open-source integrated development environment (IDE) for R that provides a friendly interface for viewing graphs, data tables, R code, and output all at the same time. This first session provides a gentle introduction to R and RStudio. We’ll get started navigating R with RStudio, loading libraries, and importing data. We’ll do some basic data manipulation and exploration, cover key practices and shortcuts for using R effectively, and point to helpful resources for learning more.

Exploratory Data Analysis

Wednesday, 2/19 Instructor: David Martin

While many researchers want to jump right into developing statistical models and making predictions, it is important to first understand the data. This initial process of getting to know your data is called exploratory data analysis, or EDA. EDA involves investigating your data using both graphical and numerical methods. This session will introduce the dplyr and ggplot2 packages as tools that allow users to look for initial patterns in their data, check for missing data, and set themselves up to move forward with statistical analysis.

Data Preparation

Wednesday, 2/26 Instructor: David Martin

Data analysis involves a large amount of preparing, cleaning, and “munging” data before any modeling can begin. This session will cover data cleaning and “tidy data,” and will introduce participants to R packages that enable data manipulation, analysis, and visualization using split-apply-combine strategies. Participants will learn how to use the dplyr package in R to manipulate data and conditionally compute summary statistics over subsets of a “big” dataset containing many observations. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).
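As a rough illustration of the split-apply-combine idea with dplyr, the sketch below splits a data frame into groups, computes summary statistics within each group (including one conditional summary), and combines the results into one row per group. It uses R's built-in mtcars data as a stand-in, since the Albemarle property data is not reproduced here; the column choices and the weight cutoff are assumptions made only for illustration.

```r
library(dplyr)

# Stand-in data: mtcars ships with R (the session itself uses other data).
cars_by_cyl <- mtcars %>%
  group_by(cyl) %>%                       # split: one group per cylinder count
  summarise(
    n_cars         = n(),                 # apply: number of rows in each group
    mean_mpg       = mean(mpg),           # apply: average mpg in each group
    mean_mpg_heavy = mean(mpg[wt > 3]),   # conditional summary: only cars with wt > 3
    .groups = "drop"                      # combine: return an ungrouped result
  )

print(cars_by_cyl)
```

The same pattern should carry over to other data by swapping in a different grouping column and different summary expressions.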

Data Visualization

Wednesday, 3/4 Instructor: Clay Ford

Exploring our data with graphs allows us to visualize relationships, spot unusual observations, or find unexpected patterns. In this session we introduce how to effectively use the ggplot2 package to explore and visualize data in R. With its consistent syntax and layered approach to making graphics, ggplot2 has revolutionized data visualization. What previously would have required hours of tedious programming can now be accomplished in a few lines of ggplot2 code. This workshop will introduce the logic behind ggplot2, how to use ggplot2 to explore your data, and how to customize and polish ggplot2 graphs. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).

Everyday Statistics

Wednesday, 3/18 Instructor: Marieke Jones

This session will provide hands-on instruction and exercises covering basic statistical analysis in R. We will cover descriptive statistics, exploratory data analysis, t-tests, ANOVA, and methods for discrete variable analysis along with hypothesis testing and assessing the assumptions of the tests. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).
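To give a flavor of the kind of analysis described above, here is a minimal sketch of a two-sample t-test together with a rough check of one of its assumptions. It uses the built-in mtcars data as a stand-in for the session's data, and the choice of variables (mpg compared across transmission type am) is an assumption for illustration, not the workshop's actual example.

```r
# Stand-in data: mtcars ships with R; am is 0 (automatic) or 1 (manual).
# Welch two-sample t-test: does mean mpg differ by transmission type?
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# Rough check of the normality assumption within each group,
# using the Shapiro-Wilk test p-value per transmission type.
shapiro_by_group <- tapply(
  mtcars$mpg, mtcars$am,
  function(x) shapiro.test(x)$p.value
)
print(shapiro_by_group)
```

By default, t.test() does not assume equal variances in the two groups; passing var.equal = TRUE requests the classical pooled-variance version instead.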

Linear Modeling

Wednesday, 3/25 Instructor: Clay Ford

The linear model is one of the most commonly used statistical models. Also called the regression model or ordinary linear regression, it is the foundation for more complex models such as generalized linear models (for example, logit or count models), mixed-effects models, and structural equation models. So it’s a good model to understand. This session will cover how to use R to fit and analyze linear models. We’ll talk about interpretation of model output and checking model assumptions. We’ll also explore dummy variables, interactions, and variable transformations. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).
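As a small sketch of the topics listed above, the code below fits a linear model with a dummy variable, an interaction, and a transformed response. It uses the built-in mtcars data as a stand-in for the session's data; the model formula and variable choices are assumptions for illustration only.

```r
# Stand-in data: mtcars ships with R.
# Recode am as a factor so R creates a dummy variable for it.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# mpg modeled on weight, transmission type, and their interaction.
fit <- lm(mpg ~ wt * am, data = cars)
print(summary(fit))    # coefficients, standard errors, R-squared

# Same predictors with a log-transformed response.
fit_log <- lm(log(mpg) ~ wt * am, data = cars)
print(coef(fit_log))
```

For a visual check of model assumptions, plot(fit, which = 1) draws residuals against fitted values.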