Complicated reshaping example

Author

Clay Ford

Published

February 15, 2024

The task

The data set in question had six different sets of variables that needed to be reshaped into “long” format. Each set was collected over five waves.

For example, here’s one set:

d[1:2,7:11]

# A tibble: 2 × 5
  w1MEI_affirm w2MEI_affirm w3MEI_affirm w4MEI_affirm w5MEI_affirm
         <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
1         3               2         2.33            3            3
2         4.33            4         5               4            5

There are five more sets like that in the data. For each I needed to reshape such that there was a column for time and a column identifying the type of measure. Something like this…

time    affirm
1       3
2       2
3       2.33
4       3
5       3
1       4.33
2       4
...

The original approach

My first attempt was pretty verbose but it worked. You can tell my mastery of pivot_longer() did not exceed much beyond the basics. Yes, I knew how to use the names_transform argument to extract the numbers from the column labels, but that was about it.

library(tidyr)

# reshape affirm
d1 <- pivot_longer(d[,1:11], -c(1:6), names_to = "time",
                   values_to = "affirm", 
                   names_transform = readr::parse_number)

# reshape explor
d2 <- pivot_longer(d[,c(1, 12:16)], -1, names_to = "time",
                   values_to = "explor", 
                   names_transform = readr::parse_number)


# reshape appSUM
d3 <- pivot_longer(d[,c(1, 17:21)], -1, names_to = "time",
                   values_to = "appSUM", 
                   names_transform = readr::parse_number)

# reshape emoSUM
d4 <- pivot_longer(d[,c(1, 22:26)], -1, names_to = "time",
                   values_to = "emoSUM", 
                   names_transform = readr::parse_number)

# reshape infoSUM
d5 <- pivot_longer(d[,c(1, 27:31)], -1, names_to = "time",
                   values_to = "infoSUM", 
                   names_transform = readr::parse_number)

# reshape racematch
d6 <- pivot_longer(d[,c(1, 32:36)], -1, names_to = "time",
                   values_to = "racematch", 
                   names_transform = readr::parse_number)

# merge reshaped dataframes into one
d_long1 <- merge(d1, d2, by = c("P_ID", "time")) |>
  merge(d3, by = c("P_ID", "time")) |> 
  merge(d4, by = c("P_ID", "time")) |> 
  merge(d5, by = c("P_ID", "time")) |> 
  merge(d6, by = c("P_ID", "time"))

head(d_long1, n = 5)

  P_ID time age gender male white race   affirm   explor appSUM emoSUM infoSUM
1    1    1  18      1    1     0    2 3.000000 2.333333     NA     NA      NA
2    1    2  18      1    1     0    2 2.000000 2.666667      9      7       8
3    1    3  18      1    1     0    2 2.333333 2.333333     11      9       9
4    1    4  18      1    1     0    2 3.000000 3.000000      6     10       8
5    1    5  18      1    1     0    2 3.000000 3.333333     NA     NA      NA
  racematch
1        NA
2         0
3         1
4         1
5        NA

This did the trick and I was able to move on, but it didn’t sit right with me. I thought I remembered Jacob demonstrating in his data wrangling workshop how to reshape datasets similar to this using just one call to pivot_longer(). So today I decided to revisit this code and see what I could do with it.

The elegant approach

After some fiddling around with the pivot_longer() function and reading the {tidyr} “Pivoting” vignette, I was finally able to implement this in a much more elegant fashion. (I should have referred to Jacob’s workshop materials, but I didn’t have them handy and decided to see if I could figure it out on my own. Of course, once I did it, I then had to go download his materials and check if this approach is what he demonstrated, and it is!)

d_long2 <- pivot_longer(d, cols = !c(1:6), 
                       names_to = c("time", ".value"), 
                       names_pattern = "(w[1-5])(.+)",
                       names_transform = list(time = readr::parse_number))
head(d_long2, n = 5)

# A tibble: 5 × 13
   P_ID   age gender     male white race       time MEI_affirm MEI_explor appSUM
  <dbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl> <dbl>      <dbl>      <dbl>  <dbl>
1     1    18 1 [Man]       1     0 2 [Asian]     1       3          2.33     NA
2     1    18 1 [Man]       1     0 2 [Asian]     2       2          2.67      9
3     1    18 1 [Man]       1     0 2 [Asian]     3       2.33       2.33     11
4     1    18 1 [Man]       1     0 2 [Asian]     4       3          3         6
5     1    18 1 [Man]       1     0 2 [Asian]     5       3          3.33     NA
# ℹ 3 more variables: emoSUM <dbl>, infoSUM <dbl>, racematch <dbl>

This is for my benefit so I remember how this works:

cols = !c(1:6) says to reshape all columns but the first six
names_to = c("time", ".value") uses the keyword .value which “indicates that the corresponding component of the column name defines the name of the output column containing the cell values” (from the help page).
names_pattern = "(w[1-5])(.+)" defines the pattern of the column names: “the letter w followed by a number 1-5, (w[1-5]), and then everything else, (.+).
names_transform = list(time = readr::parse_number) says to transform the time column to a number using the parse_number function.

In the grand scheme, I’m not sure this makes a difference. Most people don’t care about data wrangling but rather the analysis. Although my first attempt was inefficient, it was easy for me to verify that each set of columns was successfully reshaped. I don’t know if I’ll ever get to a place where it becomes routine for me to reshape data like this casually using one call to pivot_longer(). But I thought it might be of interest to see what pivot_longer() is capable of.