Presentation 2: Advanced Tidyverse / Pivot longer, wider, and nesting

Load packages

library(readxl)
library(tidyverse)

Load data

df_sales <- read_excel('../out/sales_data_2.xlsx')

Goal

The goal is to make a plot like this where we visualize the sales for each year across each employee.

The data is now in wide format. Can we make the plot in the way the data is formatted now?

No. To create the plot, we need a column for the years to use on the x-axis, a column for sales in thousands DKK, and a column for Names. While the Names column is already present, the sales data is spread across four separate columns, and there is no column for the year.

head(df_sales)
# A tibble: 6 × 9
     ID Name      Age Sex    sales_2020 sales_2021 sales_2022 sales_2023 City   
  <dbl> <chr>   <dbl> <chr>       <dbl>      <dbl>      <dbl>      <dbl> <chr>  
1     1 Alice      25 Female        100        110        120        100 Miami  
2     2 Bob        30 Male          200        210        220        230 Miami  
3     3 Charlie    22 Male          150        160        170        200 LA     
4     4 Sophie     35 Female        300        320        340        250 New Yo…
5     5 Eve        28 Female        250        240        250        270 LA     
6     6 Frank      NA Male           NA        260        270        280 New Yo…

Long format

The data set is in wide format. The data can be restructured to long format such that there is one line per year per person. For this we use the pivot_longer function.

df_sales_longer <- df_sales %>% 
  pivot_longer(cols = starts_with("sales_"),
               names_to = "year",
               values_to = "sales"
               )

head(df_sales_longer)
# A tibble: 6 × 7
     ID Name    Age Sex    City  year       sales
  <dbl> <chr> <dbl> <chr>  <chr> <chr>      <dbl>
1     1 Alice    25 Female Miami sales_2020   100
2     1 Alice    25 Female Miami sales_2021   110
3     1 Alice    25 Female Miami sales_2022   120
4     1 Alice    25 Female Miami sales_2023   100
5     2 Bob      30 Male   Miami sales_2020   200
6     2 Bob      30 Male   Miami sales_2021   210

We can transform the year to integer.

df_sales_longer <- df_sales_longer %>% 
  mutate(year = str_remove(year, 'sales_') %>% as.integer()) 

head(df_sales_longer)
# A tibble: 6 × 7
     ID Name    Age Sex    City   year sales
  <dbl> <chr> <dbl> <chr>  <chr> <int> <dbl>
1     1 Alice    25 Female Miami  2020   100
2     1 Alice    25 Female Miami  2021   110
3     1 Alice    25 Female Miami  2022   120
4     1 Alice    25 Female Miami  2023   100
5     2 Bob      30 Male   Miami  2020   200
6     2 Bob      30 Male   Miami  2021   210

Make the plot explained above (scatter and line plot over the sales development over the years for each person).

df_sales_longer %>% 
  ggplot(aes(x = year, 
             y = sales, 
             color = Name)) + 
  geom_point() + 
  geom_line() + 
  theme_bw()

# Saving the plot
# ggsave('../out/sales_2_plot.png', width = 10.37, height = 7.55, units = 'in')

Wide format

The pivot_wider function is used to get data to wide format.

df_sales_wider <- df_sales_longer %>% 
  pivot_wider(names_from = year, 
              values_from = sales,
              names_prefix = 'sales_')

# Same content
head(df_sales)
# A tibble: 6 × 9
     ID Name      Age Sex    sales_2020 sales_2021 sales_2022 sales_2023 City   
  <dbl> <chr>   <dbl> <chr>       <dbl>      <dbl>      <dbl>      <dbl> <chr>  
1     1 Alice      25 Female        100        110        120        100 Miami  
2     2 Bob        30 Male          200        210        220        230 Miami  
3     3 Charlie    22 Male          150        160        170        200 LA     
4     4 Sophie     35 Female        300        320        340        250 New Yo…
5     5 Eve        28 Female        250        240        250        270 LA     
6     6 Frank      NA Male           NA        260        270        280 New Yo…
head(df_sales_wider)
# A tibble: 6 × 9
     ID Name      Age Sex    City    sales_2020 sales_2021 sales_2022 sales_2023
  <dbl> <chr>   <dbl> <chr>  <chr>        <dbl>      <dbl>      <dbl>      <dbl>
1     1 Alice      25 Female Miami          100        110        120        100
2     2 Bob        30 Male   Miami          200        210        220        230
3     3 Charlie    22 Male   LA             150        160        170        200
4     4 Sophie     35 Female New Yo…        300        320        340        250
5     5 Eve        28 Female LA             250        240        250        270
6     6 Frank      NA Male   New Yo…         NA        260        270        280

Nesting

The long format can contain many repetitions e.g. information on the sales employee in df_sales_longer. Group data by Name and nest year and sales into a single column that we will name sales_data.

df_sales_longer_nested <- df_sales_longer %>%  
  group_by(Name) %>% 
  nest(sales_data = c(year, sales)) %>% 
  ungroup()

head(df_sales_longer_nested)
# A tibble: 6 × 6
     ID Name      Age Sex    City     sales_data      
  <dbl> <chr>   <dbl> <chr>  <chr>    <list>          
1     1 Alice      25 Female Miami    <tibble [4 × 2]>
2     2 Bob        30 Male   Miami    <tibble [4 × 2]>
3     3 Charlie    22 Male   LA       <tibble [4 × 2]>
4     4 Sophie     35 Female New York <tibble [4 × 2]>
5     5 Eve        28 Female LA       <tibble [4 × 2]>
6     6 Frank      NA Male   New York <tibble [4 × 2]>

Note: Nested data cannot be exported as an Excel or CSV file. Instead, you need to export it as an RDS file, a format specifically designed to save R objects.

We can extract the sales information from an employee using the pull function.

df_sales_longer_nested %>% 
  filter(Name == 'Bob') %>% 
  pull(sales_data)
[[1]]
# A tibble: 4 × 2
   year sales
  <int> <dbl>
1  2020   200
2  2021   210
3  2022   220
4  2023   230