library(readxl)
library(tidyverse)
Presentation 2: Advanced Tidyverse / Pivot longer, wider, and nesting
Load packages
Load data
<- read_excel('../out/sales_data_2.xlsx') df_sales
Goal
The goal is to make a plot like this where we visualize the sales for each year across each employee.
The data is now in wide format. Can we make the plot in the way the data is formatted now?
No. To create the plot, we need a column for the years to use on the x-axis, a column for sales in thousands DKK, and a column for Names. While the Names column is already present, the sales data is spread across four separate columns, and there is no column for the year.
head(df_sales)
# A tibble: 6 × 9
ID Name Age Sex sales_2020 sales_2021 sales_2022 sales_2023 City
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 Alice 25 Female 100 110 120 100 Miami
2 2 Bob 30 Male 200 210 220 230 Miami
3 3 Charlie 22 Male 150 160 170 200 LA
4 4 Sophie 35 Female 300 320 340 250 New Yo…
5 5 Eve 28 Female 250 240 250 270 LA
6 6 Frank NA Male NA 260 270 280 New Yo…
Long format
The data set is in wide format. The data can be restructured to long format such that there is one line per year per person. For this we use the pivot_longer
function.
<- df_sales %>%
df_sales_longer pivot_longer(cols = starts_with("sales_"),
names_to = "year",
values_to = "sales"
)
head(df_sales_longer)
# A tibble: 6 × 7
ID Name Age Sex City year sales
<dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 1 Alice 25 Female Miami sales_2020 100
2 1 Alice 25 Female Miami sales_2021 110
3 1 Alice 25 Female Miami sales_2022 120
4 1 Alice 25 Female Miami sales_2023 100
5 2 Bob 30 Male Miami sales_2020 200
6 2 Bob 30 Male Miami sales_2021 210
We can transform the year to integer.
<- df_sales_longer %>%
df_sales_longer mutate(year = str_remove(year, 'sales_') %>% as.integer())
head(df_sales_longer)
# A tibble: 6 × 7
ID Name Age Sex City year sales
<dbl> <chr> <dbl> <chr> <chr> <int> <dbl>
1 1 Alice 25 Female Miami 2020 100
2 1 Alice 25 Female Miami 2021 110
3 1 Alice 25 Female Miami 2022 120
4 1 Alice 25 Female Miami 2023 100
5 2 Bob 30 Male Miami 2020 200
6 2 Bob 30 Male Miami 2021 210
Make the plot explained above (scatter and line plot over the sales development over the years for each person).
%>%
df_sales_longer ggplot(aes(x = year,
y = sales,
color = Name)) +
geom_point() +
geom_line() +
theme_bw()
# Saving the plot
# ggsave('../out/sales_2_plot.png', width = 10.37, height = 7.55, units = 'in')
Wide format
The pivot_wider
function is used to get data to wide format.
<- df_sales_longer %>%
df_sales_wider pivot_wider(names_from = year,
values_from = sales,
names_prefix = 'sales_')
# Same content
head(df_sales)
# A tibble: 6 × 9
ID Name Age Sex sales_2020 sales_2021 sales_2022 sales_2023 City
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 Alice 25 Female 100 110 120 100 Miami
2 2 Bob 30 Male 200 210 220 230 Miami
3 3 Charlie 22 Male 150 160 170 200 LA
4 4 Sophie 35 Female 300 320 340 250 New Yo…
5 5 Eve 28 Female 250 240 250 270 LA
6 6 Frank NA Male NA 260 270 280 New Yo…
head(df_sales_wider)
# A tibble: 6 × 9
ID Name Age Sex City sales_2020 sales_2021 sales_2022 sales_2023
<dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Alice 25 Female Miami 100 110 120 100
2 2 Bob 30 Male Miami 200 210 220 230
3 3 Charlie 22 Male LA 150 160 170 200
4 4 Sophie 35 Female New Yo… 300 320 340 250
5 5 Eve 28 Female LA 250 240 250 270
6 6 Frank NA Male New Yo… NA 260 270 280
Nesting
The long format can contain many repetitions e.g. information on the sales employee in df_sales_longer
. Group data by Name
and nest year
and sales
into a single column that we will name sales_data
.
<- df_sales_longer %>%
df_sales_longer_nested group_by(Name) %>%
nest(sales_data = c(year, sales)) %>%
ungroup()
head(df_sales_longer_nested)
# A tibble: 6 × 6
ID Name Age Sex City sales_data
<dbl> <chr> <dbl> <chr> <chr> <list>
1 1 Alice 25 Female Miami <tibble [4 × 2]>
2 2 Bob 30 Male Miami <tibble [4 × 2]>
3 3 Charlie 22 Male LA <tibble [4 × 2]>
4 4 Sophie 35 Female New York <tibble [4 × 2]>
5 5 Eve 28 Female LA <tibble [4 × 2]>
6 6 Frank NA Male New York <tibble [4 × 2]>
Note: Nested data cannot be exported as an Excel or CSV file. Instead, you need to export it as an RDS file, a format specifically designed to save R objects.
We can extract the sales information from an employee using the pull
function.
%>%
df_sales_longer_nested filter(Name == 'Bob') %>%
pull(sales_data)
[[1]]
# A tibble: 4 × 2
year sales
<int> <dbl>
1 2020 200
2 2021 210
3 2022 220
4 2023 230