Chapter 6 Aggregation and Group Operations
This chapter looks at manipulating and summarizing data by groups.
6.1 Cross tabulation
Cross tabulation is the process of determining frequencies per group (or determining values based on frequencies, like proportions), with groups defined by one or more variables (e.g., nationality and sex).
The Python and R examples of cross tabulation below both make use of the following data set, dat
:
nationality sex
1 Canadian m
2 French f
3 French f
4 Egyptian m
5 Canadian f
Python
The pandas package contains a crosstab()
function for cross tabulation with two or more variables. Alternatively, the groupby()
function, also in pandas, facilitates cross tabulation by one or more variables when used in combination with count()
.
sex f m
nationality
Canadian 1 1
Egyptian 0 1
French 2 0
nationality
Canadian 2
Egyptian 1
French 2
Name: nationality, dtype: int64
nationality sex
Canadian f 1
m 1
Egyptian m 1
French f 2
Name: nationality, dtype: int64
R
The table()
function performs cross tabulation in R. A user can enter a single grouping variable or enter multiple grouping variables separated by a comma(s). The xtabs()
function also computes cross-tabs; a user enters the variables to be used for grouping in formula notation.
Canadian Egyptian French
2 1 2
f m
Canadian 1 1
Egyptian 0 1
French 2 0
sex
nationality f m
Canadian 1 1
Egyptian 0 1
French 2 0
6.2 Group summaries
Computing statistical summaries per group.
Python
The groupby()
function from pandas splits up a data set based on one or more grouping variables. Summarizing functions—like mean()
, sum()
, and so on—can then be applied to those groups. In the first example below, we use groupby()
to group rows of the mtcars
data set by the number of cylinders each car has; from there, we select just the mpg
column and call mean()
, thus calculating the average miles per gallon within each cylinder group. In the second example, we again group observations by cyl
, but instead of then selecting just the mpg
column, we directly call mean()
; this gives the mean for each variable in the data set within each cylinder group. Finally, in the third example, we group by two variables—cyl
and vs
—and then use the describe()
function to generate a set of descriptive statistics for mpg
within each cylinder
*vs
group (e.g., mean, SD, minimum, etc.).
import pandas as pd
mean_mpg_by_cyl = mtcars.groupby(by = 'cyl')['mpg'].mean()
print(mean_mpg_by_cyl)
cyl
4.0 26.663636
6.0 19.742857
8.0 15.100000
Name: mpg, dtype: float64
mpg disp hp ... am gear carb
cyl ...
4.0 26.663636 105.136364 82.636364 ... 0.727273 4.090909 1.545455
6.0 19.742857 183.314286 122.285714 ... 0.428571 3.857143 3.428571
8.0 15.100000 353.100000 209.214286 ... 0.142857 3.285714 3.500000
[3 rows x 10 columns]
count mean std min 25% 50% 75% max
cyl vs
4.0 0.0 1.0 26.000000 NaN 26.0 26.000 26.00 26.00 26.0
1.0 10.0 26.730000 4.748111 21.4 22.800 25.85 30.40 33.9
6.0 0.0 3.0 20.566667 0.750555 19.7 20.350 21.00 21.00 21.0
1.0 4.0 19.125000 1.631717 17.8 18.025 18.65 19.75 21.4
8.0 0.0 14.0 15.100000 2.560048 10.4 14.400 15.20 16.25 19.2
R
The aggregate()
function can be used to generate by-group statistical summaries based on one or more grouping variables. Grouping variables can be declared as a list in the function’s by
argument. Alternatively, the grouping variable(s) and the variable to be summarized can be passed to aggregate()
in formula notation: var_to_be_aggregated ~ grouping_var_1 + ... + grouping_var_N
. The summarizing function (e.g., mean()
; median()
; etc.) is declared in the FUN
argument.
# One grouping variable
# Calculating mean of `mpg` in each `cyl` group
aggregate(x = mtcars$mpg,
by = list(cyl = mtcars$cyl),
FUN = "mean")
cyl x
1 4 26.66364
2 6 19.74286
3 8 15.10000
Adding drop = FALSE
ensures that all combinations of levels are returned even if no data exist at that combination. The final row below is NA
since there are no eight-cylinder cars with a “straight” engine (vs = 1
).
# Two or more grouping variables
# Calculating max of `mpg` in each `cyl`*`vs` group
aggregate(x = mtcars$mpg,
by = list(cyl = mtcars$cyl, vs = mtcars$vs),
FUN = 'max', drop = FALSE)
cyl vs x
1 4 0 26.0
2 6 0 21.0
3 8 0 19.2
4 4 1 33.9
5 6 1 21.4
6 8 1 NA
# Or, specify the variable to summarize and the grouping variables in formula notation
aggregate(mpg ~ cyl + vs, data = mtcars, FUN = max)
The tidyverse also offers a summarizing function, summarize()
(or summarise()
, for the Britons), which is in the dplyr package. After grouping a data frame/tibble (with, e.g., dplyr’s group_by()
function), a user passes it to summarize()
, specifying in the function call how the summary statistic should be calculated.
`summarise()` has grouped output by 'cyl'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups: cyl [3]
cyl vs avg_mpg
<dbl> <dbl> <dbl>
1 4 0 26
2 4 1 26.7
3 6 0 20.6
4 6 1 19.1
5 8 0 15.1
summarize()
makes it easy to specify relatively complicated summary calculations without needing to write an external function.
mtcars %>%
group_by(cyl, vs) %>%
summarize(avg_mpg = mean(mpg),
complicated_summary_calculation =
min(mpg)^0.5 *
mean(wt)^0.5 +
mean(disp)^(1 / mean(hp)))
`summarise()` has grouped output by 'cyl'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups: cyl [3]
cyl vs avg_mpg complicated_summary_calculation
<dbl> <dbl> <dbl> <dbl>
1 4 0 26 8.51
2 4 1 26.7 8.07
3 6 0 20.6 8.41
4 6 1 19.1 8.81
5 8 0 15.1 7.48
6.3 Centering and Scaling
Centering refers to subtracting a constant, such as the mean, from every value in a set. This is sometimes performed to aid interpretation of linear model coefficients.
Scaling refers to rescaling a column or vector of values such that their mean is zero and their standard deviation is one. This is sometimes performed to put multiple variables on the same scale and is often recommended for procedures such as principal components analysis (PCA).
Python
The scale()
function from the preprocessing module of the scikit-learn (sklearn) package provides one-step centering and scaling. To center a variable at zero without scaling it, use scale()
with with_mean = True
and with_std = False
(both are True
by default).
from sklearn import preprocessing
centered_mpg = preprocessing.scale(mtcars.mpg, with_mean = True, with_std = False)
centered_mpg.mean()
-3.1086244689504383e-15
To scale a variable after centering it (so that its mean is zero and its standard deviation is one), use scale()
with with_mean = True
and with_std = True
.
from sklearn import preprocessing
scaled_mpg = preprocessing.scale(mtcars.mpg, with_mean = True, with_std = True)
scaled_mpg.mean()
-4.996003610813204e-16
1.0
R
The scale()
function can both center and scale variables.
To center a variable without scaling it, call scale()
with the center
argument set to TRUE
and the scale
argument set to FALSE
. The variable’s mean will be subtracted off of each of the variable values. (Note: If desired, the center
argument can be set to a numeric value instead of TRUE
/FALSE
; in that case, each variable value will have the argument value subtracted off of it.)
[1] 4.440892e-16
To scale a variable (while also centering it), call scale()
with the center
and scale
arguments set to TRUE
(these are the default argument values). The variable’s mean will be subtracted off of each of the variable values, and each value will then be divided by the variable’s standard deviation. (Note: As with the center
argument, the scale
argument can also be set to a numeric value instead of TRUE
/FALSE
; in that case, the divisor will be the argument value instead of the standard deviation.)
[1] 7.112366e-17
[1] 1