The Research Data Services team at the UVA Library and the Data Services team at the Health Sciences Library are partnering with UVA’s PhD Plus program to present Data Science Essentials in R, a six-session series that builds data wrangling, visualization, and analysis skills.
This site will host the materials for the series.
Register for the series through the PhD Plus events pages. PhD students who complete the series (at least four of the six sessions) will be eligible for the PhD Plus “Data Literacy in R” non-credit credential on their transcripts. These credentials are UVA-endorsed and are a good way to demonstrate interest in, and knowledge of, basic data science skills to prospective employers. You may register for individual sessions, but the sessions are cumulative, so each one assumes understanding of the material from the prior sessions. If registration exceeds the cap of 35, preference will be given to individuals who have signed up for four or more sessions.
Wednesday, 2/12
Instructor: Jenn Huck
R is a free, open-source software environment and programming language designed specifically for statistical analysis. RStudio is a free, open-source integrated development environment (IDE) for R that provides a friendly interface for viewing graphs, data tables, R code, and output all at the same time. This first session provides a gentle introduction to R and RStudio. We’ll get started navigating R with RStudio, loading libraries, and importing data. We’ll do some basic data manipulation and exploration, cover key practices and shortcuts for using R effectively, and point to helpful resources for learning more.
Wednesday, 2/19 Instructor: David Martin
While many researchers want to jump right into developing statistical models and making predictions, it is important to understand the data first. This initial process is called exploratory data analysis, or EDA. EDA involves investigating your data with both graphical and numerical methods. This session will introduce the dplyr and ggplot2 packages as tools that allow users to look for initial patterns in their data, check for missing data, and set themselves up to move forward with statistical analysis.
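As a rough illustration of a numerical first pass, the sketch below uses base R only and the built-in `airquality` data set. Both choices are assumptions made for this example; the session itself works with dplyr, ggplot2, and its own data.

```r
# airquality ships with R and contains missing values; it is a stand-in
# for illustration, not the data used in the session.
str(airquality)                 # variable types and a preview of the values
print(summary(airquality))      # per-column summaries, including NA counts

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Share of rows with no missing values at all
complete_share <- mean(complete.cases(airquality))
print(complete_share)
```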
Wednesday, 2/26 Instructor: David Martin
Much of data analysis consists of preparing, cleaning, and “munging” data so that downstream analysis is possible. This session will cover data cleaning and “tidy data,” and will introduce participants to R packages that enable data manipulation, analysis, and visualization using split-apply-combine strategies. Participants will learn how to use the dplyr package in R to manipulate a “big” dataset containing many observations and to compute conditional summary statistics over its subsets. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).
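A minimal sketch of the split-apply-combine idea with dplyr might look like the following. It assumes dplyr is installed and uses the built-in `mtcars` data as a stand-in for the workshop data; the column names in the result are made up for illustration.

```r
library(dplyr)

# mtcars ships with R; it stands in for the workshop data here.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%                        # split: one group per cylinder count
  summarise(
    n_cars         = n(),                  # apply: number of cars in the group
    mean_mpg       = mean(mpg),            # apply: average fuel economy
    mean_mpg_heavy = mean(mpg[wt > 3])     # conditional summary: cars over 3,000 lbs
  )                                        # combine: one row per group

print(mpg_by_cyl)
```

The same pattern extends to other grouping variables and summaries: change the column named in `group_by()` and the expressions inside `summarise()`.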
Wednesday, 3/4 Instructor: Clay Ford
Exploring our data with graphs allows us to visualize relationships, spot unusual observations, or find unexpected patterns. In this session we introduce how to effectively use the ggplot2 package to explore and visualize data in R. With its consistent syntax and layered approach to making graphics, ggplot2 has revolutionized data visualization. What previously would have required hours of tedious programming can now be accomplished in a few lines of ggplot2 code. This workshop will introduce the logic behind ggplot2, how to use ggplot2 to explore your data, and how to customize and polish ggplot2 graphs. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).
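To give a sense of the layered approach, here is a small sketch that builds a plot one layer at a time. It assumes ggplot2 is installed and again uses the built-in `mtcars` data rather than the workshop data.

```r
library(ggplot2)

# Start with the data and aesthetic mappings, then add layers with `+`.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                              # layer 1: scatterplot of the raw data
  geom_smooth(method = "lm", se = FALSE) +    # layer 2: fitted straight line
  labs(                                       # polish: readable labels and a title
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel economy versus weight"
  )

print(p)
```

Because each layer is added separately, swapping `geom_point()` for another geom or adding a `facet_wrap()` call changes the plot without rewriting the rest.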
Wednesday, 3/18 Instructor: Marieke Jones
This session will provide hands-on instruction and exercises covering basic statistical analysis in R. We will cover descriptive statistics, exploratory data analysis, t-tests, ANOVA, and methods for discrete variable analysis along with hypothesis testing and assessing the assumptions of the tests. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).
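As an illustrative sketch of the kinds of tests mentioned above, the following uses only base R and the built-in `mtcars` data (an assumption for this example, not the session's data).

```r
# Two-sample t-test: does mpg differ between automatic (am = 0)
# and manual (am = 1) cars? t.test() uses the Welch version by default.
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# One way to assess the normality assumption: a Shapiro-Wilk test per group
normality <- tapply(mtcars$mpg, mtcars$am, function(x) shapiro.test(x)$p.value)
print(normality)

# One-way ANOVA: does mean mpg differ across cylinder counts?
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)
print(summary(fit_aov))
```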
Wednesday, 3/25
Instructor: Clay Ford
The linear model is one of the most commonly used statistical models. Also called the regression model or ordinary linear regression, it is the foundation for more complex models such as generalized linear models (for example, logit or count models), mixed-effects models, and structural equation models. So it’s a good model to understand. This session will cover how to use R to fit and analyze linear models. We’ll talk about interpreting model output and checking model assumptions. We’ll also explore dummy variables, interactions, and variable transformations. The session will assume understanding of the material in the preceding sessions and will build on a common research case, using Albemarle Real Estate Property data (though each workshop may also introduce additional examples and data).
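A minimal sketch of fitting a linear model in base R follows, combining a dummy variable, an interaction, and a log transformation in one formula. The built-in `mtcars` data and this particular model are assumptions chosen for illustration only.

```r
# log(mpg)    : transformed response
# factor(am)  : dummy variable (0 = automatic, 1 = manual)
# wt * factor(am) expands to both main effects plus their interaction
fit <- lm(log(mpg) ~ wt * factor(am), data = mtcars)

print(summary(fit))     # coefficients, standard errors, R-squared

# Standard diagnostic plots for checking model assumptions
par(mfrow = c(2, 2))
plot(fit)
```

The four coefficients are the intercept, the slope for weight, the shift for manual cars, and the change in the weight slope for manual cars.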