Chapter 6 Aggregation and Group Operations

This chapter looks at manipulating and summarizing data by groups.

6.1 Cross tabulation

Cross tabulation is the process of determining frequencies per group (or determining values based on frequencies, like proportions), with groups defined by one or more variables (e.g., nationality and sex).

The Python and R examples of cross tabulation below both make use of the following data set, dat:

dat

  nationality sex
1    Canadian   m
2      French   f
3      French   f
4    Egyptian   m
5    Canadian   f

Python

The pandas package contains a crosstab() function for cross tabulation with two or more variables. Alternatively, the groupby() function, also in pandas, facilitates cross tabulation by one or more variables when used in combination with count().

import pandas as pd
pd.crosstab(dat.nationality, dat.sex)

sex          f  m
nationality      
Canadian     1  1
Egyptian     0  1
French       2  0

dat.groupby(by = 'nationality').nationality.count()

nationality
Canadian    2
Egyptian    1
French      2
Name: nationality, dtype: int64

dat.groupby(by = ['nationality', 'sex']).nationality.count()

nationality  sex
Canadian     f      1
             m      1
Egyptian     m      1
French       f      2
Name: nationality, dtype: int64

# Or: dat.groupby(by = ['nationality', 'sex']).sex.count()

R

The table() function performs cross tabulation in R. A user can enter a single grouping variable or enter multiple grouping variables separated by a comma(s). The xtabs() function also computes cross-tabs; a user enters the variables to be used for grouping in formula notation.

table(dat$nationality)


Canadian Egyptian   French 
       2        1        2

table(dat$nationality, dat$sex)

          
           f m
  Canadian 1 1
  Egyptian 0 1
  French   2 0

xtabs(formula = ~nationality + sex, data = dat)

           sex
nationality f m
   Canadian 1 1
   Egyptian 0 1
   French   2 0

6.2 Group summaries

Computing statistical summaries per group.

Python

The groupby() function from pandas splits up a data set based on one or more grouping variables. Summarizing functions—like mean(), sum(), and so on—can then be applied to those groups. In the first example below, we use groupby() to group rows of the mtcars data set by the number of cylinders each car has; from there, we select just the mpg column and call mean(), thus calculating the average miles per gallon within each cylinder group. In the second example, we again group observations by cyl, but instead of then selecting just the mpg column, we directly call mean(); this gives the mean for each variable in the data set within each cylinder group. Finally, in the third example, we group by two variables—cyl and vs—and then use the describe() function to generate a set of descriptive statistics for mpg within each cylinder*vs group (e.g., mean, SD, minimum, etc.).

import pandas as pd

mean_mpg_by_cyl = mtcars.groupby(by = 'cyl')['mpg'].mean()
print(mean_mpg_by_cyl)

cyl
4.0    26.663636
6.0    19.742857
8.0    15.100000
Name: mpg, dtype: float64

means_all_vars = mtcars.groupby(by = 'cyl').mean()
print(means_all_vars)

           mpg        disp          hp  ...        am      gear      carb
cyl                                     ...                              
4.0  26.663636  105.136364   82.636364  ...  0.727273  4.090909  1.545455
6.0  19.742857  183.314286  122.285714  ...  0.428571  3.857143  3.428571
8.0  15.100000  353.100000  209.214286  ...  0.142857  3.285714  3.500000

[3 rows x 10 columns]

mpg_by_cyl_vs = mtcars.groupby(by = ['cyl', 'vs'])['mpg'].describe()
print(mpg_by_cyl_vs)

         count       mean       std   min     25%    50%    75%   max
cyl vs                                                               
4.0 0.0    1.0  26.000000       NaN  26.0  26.000  26.00  26.00  26.0
    1.0   10.0  26.730000  4.748111  21.4  22.800  25.85  30.40  33.9
6.0 0.0    3.0  20.566667  0.750555  19.7  20.350  21.00  21.00  21.0
    1.0    4.0  19.125000  1.631717  17.8  18.025  18.65  19.75  21.4
8.0 0.0   14.0  15.100000  2.560048  10.4  14.400  15.20  16.25  19.2

R

The aggregate() function can be used to generate by-group statistical summaries based on one or more grouping variables. Grouping variables can be declared as a list in the function’s by argument. Alternatively, the grouping variable(s) and the variable to be summarized can be passed to aggregate() in formula notation: var_to_be_aggregated ~ grouping_var_1 + ... + grouping_var_N. The summarizing function (e.g., mean(); median(); etc.) is declared in the FUN argument.

# One grouping variable
# Calculating mean of `mpg` in each `cyl` group
aggregate(x = mtcars$mpg, 
          by = list(cyl = mtcars$cyl), 
          FUN = "mean")

  cyl        x
1   4 26.66364
2   6 19.74286
3   8 15.10000

Adding drop = FALSE ensures that all combinations of levels are returned even if no data exist at that combination. The final row below is NA since there are no eight-cylinder cars with a “straight” engine (vs = 1).

# Two or more grouping variables
# Calculating max of `mpg` in each `cyl`*`vs` group
aggregate(x = mtcars$mpg, 
          by = list(cyl = mtcars$cyl, vs = mtcars$vs), 
          FUN = 'max', drop = FALSE)

  cyl vs    x
1   4  0 26.0
2   6  0 21.0
3   8  0 19.2
4   4  1 33.9
5   6  1 21.4
6   8  1   NA

# Or, specify the variable to summarize and the grouping variables in formula notation
aggregate(mpg ~ cyl + vs, data = mtcars, FUN = max)

The tidyverse also offers a summarizing function, summarize() (or summarise(), for the Britons), which is in the dplyr package. After grouping a data frame/tibble (with, e.g., dplyr’s group_by() function), a user passes it to summarize(), specifying in the function call how the summary statistic should be calculated.

library(dplyr)
mtcars %>% 
  group_by(cyl, vs) %>% 
  summarize(avg_mpg = mean(mpg))

`summarise()` has grouped output by 'cyl'. You can override using the
`.groups` argument.

# A tibble: 5 × 3
# Groups:   cyl [3]
    cyl    vs avg_mpg
  <dbl> <dbl>   <dbl>
1     4     0    26  
2     4     1    26.7
3     6     0    20.6
4     6     1    19.1
5     8     0    15.1

summarize() makes it easy to specify relatively complicated summary calculations without needing to write an external function.

mtcars %>% 
  group_by(cyl, vs) %>% 
  summarize(avg_mpg = mean(mpg),
            complicated_summary_calculation = 
              min(mpg)^0.5 * 
              mean(wt)^0.5 + 
              mean(disp)^(1 / mean(hp)))

`summarise()` has grouped output by 'cyl'. You can override using the
`.groups` argument.

# A tibble: 5 × 4
# Groups:   cyl [3]
    cyl    vs avg_mpg complicated_summary_calculation
  <dbl> <dbl>   <dbl>                           <dbl>
1     4     0    26                              8.51
2     4     1    26.7                            8.07
3     6     0    20.6                            8.41
4     6     1    19.1                            8.81
5     8     0    15.1                            7.48

6.3 Centering and Scaling

Centering refers to subtracting a constant, such as the mean, from every value in a set. This is sometimes performed to aid interpretation of linear model coefficients.

Scaling refers to rescaling a column or vector of values such that their mean is zero and their standard deviation is one. This is sometimes performed to put multiple variables on the same scale and is often recommended for procedures such as principal components analysis (PCA).

Python

The scale() function from the preprocessing module of the scikit-learn (sklearn) package provides one-step centering and scaling. To center a variable at zero without scaling it, use scale() with with_mean = True and with_std = False (both are True by default).

from sklearn import preprocessing

centered_mpg = preprocessing.scale(mtcars.mpg, with_mean = True, with_std = False)
centered_mpg.mean()

-3.1086244689504383e-15

To scale a variable after centering it (so that its mean is zero and its standard deviation is one), use scale() with with_mean = True and with_std = True.

from sklearn import preprocessing

scaled_mpg = preprocessing.scale(mtcars.mpg, with_mean = True, with_std = True)
scaled_mpg.mean()

-4.996003610813204e-16

scaled_mpg.std()

1.0

R

The scale() function can both center and scale variables.

To center a variable without scaling it, call scale() with the center argument set to TRUE and the scale argument set to FALSE. The variable’s mean will be subtracted off of each of the variable values. (Note: If desired, the center argument can be set to a numeric value instead of TRUE/FALSE; in that case, each variable value will have the argument value subtracted off of it.)

centered_mpg <- scale(mtcars$mpg, center = T, scale = F)
mean(centered_mpg)

[1] 4.440892e-16

To scale a variable (while also centering it), call scale() with the center and scale arguments set to TRUE (these are the default argument values). The variable’s mean will be subtracted off of each of the variable values, and each value will then be divided by the variable’s standard deviation. (Note: As with the center argument, the scale argument can also be set to a numeric value instead of TRUE/FALSE; in that case, the divisor will be the argument value instead of the standard deviation.)

scaled_mpg <- scale(mtcars$mpg, center = T, scale = T)
mean(scaled_mpg)

[1] 7.112366e-17

sd(scaled_mpg)

[1] 1