Chapter 7 Basic Plotting and Visualization
This chapter discusses methods for creating plots to explore and understand data. Visualization in Python and R is a gigantic and evolving topic; we don’t pretend to present a comprehensive comparison.
The plots below make use of the palmerpenguins data set, which contains various measurements for 344 penguins across three islands in the Antarctic Palmer Archipelago. The data were collected by Kristen Gorman and colleagues, and they were made available under a CC0 public domain license by Allison Horst, Alison Hill, and Kristen Gorman.
For the R sections below, we show how to make each plot with base R and with ggplot2.
Here’s a glimpse at the data set:
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
7.1 Histograms
Visualizing the distribution of numeric data.
Python
The Python library Matplotlib provides the hist()
function for plotting a histogram. There are many arguments that can be specified in the hist()
function for customizing the histogram plot to fit your needs: Some parameters include the number of bins, the upper and lower bounds on each bin, the coloration, and more.
Below, we show how to generate a histogram of the bill lengths from the penguins
dataset. We specified 30 bins, each of which is light blue with a black outline of linewidth = 1
. The hist()
function defaults to showing no bin outline, which can make it difficult to distinguish bins clearly, so we add in the outlines here.
Note that the bins are left-inclusive and right-exclusive. For example, if a particular bin spans the range of 1 to 3, the bin will include the value 1 but will exclude the value 3 (and will include all values between 1 and 3). In short, bin ranges are [x1,x2), where x1 is the starting point of the bin and x2 is the ending point of the bin.
Notice the semicolon at the end of the plt.hist()
function. This suppresses the printing of the array generated to create the histogram.
import matplotlib.pyplot as plt
plt.clf() # clears the current figure to ensure that multiple plots are not overlaid
plt.hist(penguins.bill_length_mm, bins = 30, range = (30, 60),
color = 'lightblue', edgecolor = 'k', linewidth = 1);
plt.title("Penguin Bill Lengths")
plt.xlabel("Bill Length (mm)")
plt.ylabel("Count")
plt.show()
R
Base R’s hist()
function generates histograms, and features of the histogram—like the bar color, number of bins/breaks, and so on—can be easily customized with arguments as demonstrated below.
hist(penguins$bill_length_mm, breaks = 25, col = 'lightblue',
xlim = c(30, 60),
main = 'Penguin Bill Lengths',
xlab = 'Bill Length (mm)', ylab = 'Count')
The ggplot2 method for generating histograms follows the standard ggplot2 syntax: Initialize a plot with ggplot()
, then add layers thereto. Here, the layer to add is geom_histogram()
.
7.2 Barplots
Visualizing the distribution of categorical data.
Python
For this example, we generate a barplot showing how many penguins of each species—Adelie, Chinstrap, Gentoo—are represented in the data. We show two ways of doing so here.
First, we can use the function bar()
from the Matplotlib plotting library. We determine the per-species penguin counts and then use that data to create the bar plot.
import matplotlib.pyplot as plt
# Determine the number of each species
adelie_counts = len(penguins.loc[penguins["species"] == "Adelie"])
chinstrap_counts = len(penguins.loc[penguins["species"] == "Chinstrap"])
gentoo_counts = len(penguins.loc[penguins["species"] == "Gentoo"])
# Save the counts information as an arrays for input into the bar() function
spec = ["Adelie", "Chinstrap", "Gentoo"]
counts = [adelie_counts, chinstrap_counts, gentoo_counts]
plt.clf()
plt.bar(spec, counts)
<BarContainer object of 3 artists>
The data are stored in a pandas Dataframe, which has its own built-in plotting module, plot
. As an alternative approach to the one above, we can use the pandas bar()
function to recreate the barplot.
Using this second approach, we generated the same barplot with far less effort. Using the built-in pandas plotting routine proved to be the more efficient method here.
R
To form barplots, we’ll first create a summary data frame containing the statistics we’re looking to plot. Here, that’s simply the sample size of each species in the data set.
species Freq
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
We can plot those values using the barplot()
function in base R, specifying arguments along the way to customize the title/axis labeling, bar colors, and range of the y axis. To add values above the bars, we can follow barplot()
with a text()
call as below.
penguin_plot <- barplot(Freq ~ species, data = species_counts,
col = c('lightblue', 'cornflowerblue', 'darkslateblue'),
main = 'Species Sample Size',
xlab = 'Species', ylab = 'Count', ylim = c(0, 200))
text(x = penguin_plot, y = species_counts$Freq + 10,
labels = species_counts$Freq)
To recreate the bar plot above with ggplot2, one can add a geom_bar()
layer to a plot initialized with ggplot()
.
ggplot(species_counts, aes(x = species, y = Freq)) +
geom_bar(aes(fill = species), stat = 'identity') +
# scale_fill_manual() is used here for bar-color customization
scale_fill_manual(values = c('lightblue', 'cornflowerblue', 'darkslateblue')) +
labs(title = 'Species Sample Size', x = 'Species', y = 'Count') +
# For simplicity, we omit the legend
theme(legend.position = 'none') +
ylim(0, 200) +
# geom_text() is used here to add counts above the bars
geom_text(aes(label = Freq, vjust = -0.5))
7.3 Scatterplot
Visualizing the relationship between two numeric variables.
Python
The scatter()
function, part of Matplotlib, can produce scatterplots in Python. The x
and y
arguments specify the points to plot. In the example below, we also use the c
and marker
arguments to customize the point color and point shape, respectively. The xlabel()
, ylabel()
, and title()
functions customize plot labels.
import matplotlib.pyplot as plt
# Remove NA rows, as we only want to plot present data
penguins_no_na = penguins.dropna()
plt.clf()
# 'd' generates diamond markers (learn more about available marker shapes here: https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers)
plt.scatter(x = penguins_no_na['body_mass_g'],
y = penguins_no_na['flipper_length_mm'],
c = 'lightblue', marker = 'd')
plt.xlabel('Body Mass (g)')
plt.ylabel('Flipper Length (mm)')
plt.title('Scatterplot of Body Mass and Flipper Length')
plt.show()
R
Scatterplots can be generated in base R with the plot()
function. The pch
argument below modifies the point shape (e.g., 20 = solid circle; 24 = unfilled triangle; etc.)
plot(x = penguins$body_mass_g, y = penguins$flipper_length_mm,
col = 'navy', pch = 20,
main = 'Scatterplot of Body Mass and Flipper Length',
xlab = 'Body Mass (g)', ylab = 'Flipper Length (mm)')
To generate a scatterplot with ggplot2, initialize a plot with ggplot()
, then add a layer of points with geom_point()
.
7.4 Stripcharts
Stripcharts, or strip plots, are one-dimensional scatterplots. Like boxplots, they reveal the distribution of a numeric variable within levels of a categorical variable.
Python
The seaborn package provides the stripplot()
function. The user simply specifies which variables they want on the x and y axes. Below, we specify island
on the y axis to see the distribution of bill_depth_mm
horizontally. Specify the pandas DataFrame using the data
argument. Finally, create the plot using plt.show()
.
import seaborn as sns
import matplotlib.pyplot as plt
sns.stripplot(x = "bill_depth_mm", y = "island", data = penguins)
plt.show()
The stripplot help page provides more examples.
R
Base R offers the stripchart()
function. To indicate the numeric variable and the grouping variable, you can use formula notation: numeric_var ~ grouping_var
. Adding method = 'jitter'
to the arguments spreads the points out slightly within each level of the grouping variable, making it easier to see points that might otherwise be obscured by overlap.
stripchart(bill_depth_mm ~ island, data = penguins,
method = 'jitter',
ylab = 'Island', xlab = 'Bill Depth (mm)',
main = 'Stripchart of Bill Depth by Island')
Stripcharts can also be made with ggplot2’s geom_jitter()
function, as shown below. You can control the amount of jitter with a position
argument in geom_jitter()
.
ggplot(penguins, aes(x = island, y = bill_depth_mm)) +
geom_jitter(aes(color = island), position = position_jitter(0.1)) +
# scale_fill_manual() is used to manually specify group colors
# once aes(color = island) is specified in `geom_jitter()`
scale_color_manual(values = c('lightblue', 'cornflowerblue', 'darkslateblue')) +
labs(title = 'Stripchart of Bill Depth by Island',
x = 'Island', y = 'Bill Depth (mm)') +
theme(legend.position = 'none')
Warning: Removed 2 rows containing missing values (`geom_point()`).
7.5 Boxplots
Visualizing the relationship between a numeric variable and a categorical variable via five-number summaries.
Python
The boxplot()
function in seaborn generates boxplots, and functions from Matplotlib, like xlabel()
and ylabel()
, can be used to control the aesthetics of the plot.
R
The boxplot()
function in base R generates boxplots. A user specifies the grouping variable and the numeric variable to be plotted in formula notation: y ~ grouping_var
.
boxplot(bill_depth_mm ~ island, data = penguins,
col = c('lightblue', 'cornflowerblue', 'darkslateblue'),
main = 'Boxplot of Bill Depth by Island',
xlab = 'Island', ylab = 'Bill Depth (mm)')
To generate a boxplot with ggplot2, add a geom_boxplot()
layer to a plot initialized with ggplot()
.
7.6 Facet plots
Facet plots (also called trellis plots, lattice plots, and conditional plots) are comprised of multiple smaller plots, where each subplot contains a subset of the overall data, with subsets defined by one or more faceting variables.
Python
The seaborn package provides several functions for creating facet plots, including relplot()
, displot()
, catplot()
, and lmplot()
. Below we demonstrate the lmplot()
function, which allows a user to create scatterplots at certain levels of categorical variable. A user specifies the x and y variables as character strings using the x
and y
arguments, respectively. The grouping variable is identified with either the col
or row
arguments. By default, a least-squares lines is added to each plot. Setting ci = None
suppresses the confidence interval ribbon.
import seaborn as sns
plt.clf()
sns.lmplot(x = "bill_length_mm", y = "bill_depth_mm",
col = "species",
data = penguins, ci = None)
To facet by two variables, provide variables to both the col
and row
arguments.
plt.clf()
sns.lmplot(x = "bill_length_mm", y = "bill_depth_mm",
col = "species", row = "sex",
data = penguins, ci = None)
Set lowess = True
for smooth trend lines. Notice also that color can be mapped to the same variable used for faceting.
plt.clf()
sns.lmplot(x = "bill_length_mm", y = "bill_depth_mm",
col = "species", hue = "species",
data = penguins, ci = None, lowess = True)
The 1lmplot()1 help page showcases other examples.
R
The coplot()
function in base R produces conditioning plots using formula notation: y ~ x | grouping_var
. The rows
and columns
arguments control layout. Below we specify one row of plots. The panel
argument controls what action is carried out in each plot. The default is a scatterplot. Below we use the base R panel.smooth
function to create scatter plots with a smooth trend line.
coplot(bill_depth_mm ~ bill_length_mm | species,
data = penguins,
panel = panel.smooth,
rows = 1, col = "navy")
To condition on two variables, use formula notation with syntax: y ~ x | grp_var1 * grp_var2
.
coplot(bill_depth_mm ~ bill_length_mm | species * sex,
data = penguins,
panel = panel.smooth,
col = "navy")
The labels of the conditioning variables unfortunately use a lot of real estate in the margins, and there is no easy way to modify that. However, this design works quite well when we condition on a numeric variable. The coplot()
function automatically creates overlapping group intervals to condition on, and the stacked layout of the labels helps us visualize how the relationship between x and y changes between the groups. To manually set the number of groups, use the number
argument. Below, we specify four groups to be generated for bill_length_mm
.
coplot(flipper_length_mm ~ body_mass_g | bill_length_mm,
data = penguins,
panel = panel.smooth,
number = 4,
rows = 1, col = "navy")
Alternatively, ggplot2 provides a intuitive and easy-to-use method for generating facet plots: A user specifies the aesthetics of the plot using standard ggplot2 syntax (i.e., as a series of added layers) and then adds an additional call, facet_wrap()
or facet_grid()
(the differences are discussed below), specifying the faceting variable(s) to split up the plots by.
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(color = 'navy') +
# Add least-squares line
geom_smooth(method = 'lm', se = FALSE, color = 'salmon') +
# Use formula notation, a character vector, or vars() to
# specify faceting variables; e.g., ~species, c('species'),
# or vars(species)
facet_wrap(~ species)
`geom_smooth()` using formula = 'y ~ x'
The number of rows and columns can be manually specified with nrow
and ncol
arguments in facet_wrap()
. By default, the x and y axes of all facet plots will be on the same scale. The axis ranges can be set to vary freely by adding scales = 'free'
as an argument (or, alternatively, scales = 'free_x'
or scales = 'free_y'
to free just the x or y axis).
Both facet_wrap()
and facet_grid()
can be used to make facet plots. When faceting based on multiple variables (e.g., species and sex), facet_wrap()
will drop group combinations for which there are no data points, whereas facet_grid()
will generate a plot for all possible group combinations:
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(color = 'navy') +
geom_smooth(method = 'lm', se = F, color = 'salmon') +
facet_wrap(vars(species, sex))
`geom_smooth()` using formula = 'y ~ x'
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point(color = 'navy') +
geom_smooth(method = 'lm', se = F, color = 'salmon') +
# Note that facet_grid() has separate `rows` and `cols` arguments
# for specifying faceting variables
facet_grid(rows = vars(sex), cols = vars(species))
`geom_smooth()` using formula = 'y ~ x'