Presentation 3 - Exploratory Data Analysis (EDA)

In exercises 3A and B you will take a deeper look at the diabetes data. To prepare for that, we will learn some extra ggplot tricks here!

Load packages

library(readxl)
library(tidyverse)

Load data

df_sales <- read_excel('../out/sales_data_2.xlsx')
df_sales
# A tibble: 10 × 12
      ID Name      Age Sex    sales_2020 sales_2021 sales_2022 sales_2023 mood 
   <dbl> <chr>   <dbl> <chr>       <dbl>      <dbl>      <dbl>      <dbl> <chr>
 1     1 Alice      25 Female        100        110        120        100 happy
 2     2 Bob        30 Male          200        210        220        230 happy
 3     3 Charlie    22 Male          150        160        170        200 happy
 4     4 Sophie     35 Female        300        320        340        250 happy
 5     5 Eve        28 Female        250        240        250        270 happy
 6     6 Frank      NA Male           NA        260        270        280 happy
 7     7 Grace      40 Female        400        420        430        450 happy
 8     8 Hannah     29 Female        500        510         NA        500 happy
 9     9 Ian        21 Male          450        460        470        480 happy
10    10 Jack       33 Male          300        310        320        290 happy
# ℹ 3 more variables: raise <chr>, group <chr>, City <chr>

ggplot recap

We will not go into much detail here since this section mostly serves as a recap of the ggplot material covered in the previous course, From Excel to R.

The creed of ggplot is that every piece of information that should go into the plot must be in a column. There is one column that describes the x-axis, one for the y-axis, and one for each additional aesthetic like color, size, shape, etc.

ggplot(df_sales, aes(x = Name, y = sales_2022, color = Sex)) +
  geom_point()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

The long format is ggplot’s best friend

It follows that if I need to plot all sales data, I will need to reshape the dataframe such that all data points referring to sales are in the same column. As shown in presentation 2, we do that with pivot_longer:

sales_long <- df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value")
sales_long
# A tibble: 40 × 10
      ID Name      Age Sex    mood  raise group     City  sales_year sales_value
   <dbl> <chr>   <dbl> <chr>  <chr> <chr> <chr>     <chr> <chr>            <dbl>
 1     1 Alice      25 Female happy no    young_fe… Miami sales_2020         100
 2     1 Alice      25 Female happy no    young_fe… Miami sales_2021         110
 3     1 Alice      25 Female happy no    young_fe… Miami sales_2022         120
 4     1 Alice      25 Female happy no    young_fe… Miami sales_2023         100
 5     2 Bob        30 Male   happy yes   mature_m… Miami sales_2020         200
 6     2 Bob        30 Male   happy yes   mature_m… Miami sales_2021         210
 7     2 Bob        30 Male   happy yes   mature_m… Miami sales_2022         220
 8     2 Bob        30 Male   happy yes   mature_m… Miami sales_2023         230
 9     3 Charlie    22 Male   happy yes   young_ma… LA    sales_2020         150
10     3 Charlie    22 Male   happy yes   young_ma… LA    sales_2021         160
# ℹ 30 more rows

ggplot(sales_long, aes(x = Name, y = sales_value, color = Sex)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
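
Once the years live in a single column, that column can be mapped to an aesthetic like any other. The sketch below assumes a small hand-typed subset of the table above (toy_sales and toy_long are illustrative names, not objects from the course files) and puts sales_year on the x-axis:

```r
library(tidyverse)

# Assumption: first three rows of df_sales, typed in by hand for illustration
toy_sales <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Sex = c("Female", "Male", "Male"),
  sales_2020 = c(100, 200, 150),
  sales_2021 = c(110, 210, 160),
  sales_2022 = c(120, 220, 170),
  sales_2023 = c(100, 230, 200)
)

# Same reshaping as above: one row per employee and year
toy_long <- toy_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value")

# sales_year is now a regular column, so it can serve as the x-axis
# group = Name tells geom_line which points belong to the same line
year_plot <- ggplot(toy_long, aes(x = sales_year, y = sales_value,
                                  color = Name, group = Name)) +
  geom_point() +
  geom_line()

year_plot
```

In the wide format this plot would not be possible, because there the year is hidden in four column names instead of being a variable of its own.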

You can pipe into ggplot

You know what sucks? Having 10 million dataframes with very similar names in your environment. If you don’t need your long format dataframe for anything else, you can pipe directly into ggplot instead of saving it first and then plugging it in:

df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value") %>%
  #we omit the dataframe to plot because that is being piped into ggplot
  #remember that different plot layers are still combined with '+'
  ggplot(aes(x = Name, y = sales_value, color = Sex)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Plotting several dataframes

Sometimes we would like to add more information to a plot. Consider the one we just made above. It shows 3 or 4 dots for each employee, one for each of the years we have data for. I can now calculate the mean across the 4 years per employee:

sales_mean <- sales_long %>%
  group_by(Name) %>%
  summarise(mean = mean(sales_value, na.rm = TRUE))

sales_mean
# A tibble: 10 × 2
   Name     mean
   <chr>   <dbl>
 1 Alice    108.
 2 Bob      215 
 3 Charlie  170 
 4 Eve      252.
 5 Frank    270 
 6 Grace    425 
 7 Hannah   503.
 8 Ian      465 
 9 Jack     305 
10 Sophie   302.

And I would like to add it to the plot:

#code copy-pasted from above
df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value") %>%
  #we omit the dataframe to plot because that is being piped into ggplot
  #remember that different plot layers are still combined with '+'
  ggplot(aes(x = Name, y = sales_value, color = Sex)) +
  geom_point() +
  #add mean data by switching the dataframe!
  #I set a fixed color outside aes() because there is no Sex column in sales_mean
  geom_point(data = sales_mean, aes(x = Name, y = mean), color = 'black')
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
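
A layer that brings its own data still inherits the aesthetics from ggplot() unless told otherwise. As a hedged alternative to overriding the color, the sketch below switches inheritance off with inherit.aes = FALSE. It uses a hand-typed toy dataset (toy_long and toy_mean are illustrative names) so that it runs on its own:

```r
library(tidyverse)

# Assumption: long-format sales for three employees, typed in by hand
toy_long <- tibble(
  Name = rep(c("Alice", "Bob", "Charlie"), each = 4),
  Sex = rep(c("Female", "Male", "Male"), each = 4),
  sales_value = c(100, 110, 120, 100,
                  200, 210, 220, 230,
                  150, 160, 170, 200)
)

# One mean per employee, stored in a second dataframe
toy_mean <- toy_long %>%
  group_by(Name) %>%
  summarise(mean = mean(sales_value, na.rm = TRUE))

layered_plot <- ggplot(toy_long, aes(x = Name, y = sales_value, color = Sex)) +
  geom_point() +
  # this layer uses toy_mean and ignores the aesthetics set in ggplot(),
  # so the missing Sex column is not a problem
  geom_point(data = toy_mean, aes(x = Name, y = mean),
             color = "black", shape = 4, size = 3,
             inherit.aes = FALSE)

layered_plot
```

With inherit.aes = FALSE the layer has to declare every aesthetic it needs, which is why x is repeated in its aes().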

Plots are objects

ggplot plots are objects like any other R object and they can therefore be stored in a variable and displayed by invoking the variable’s name:

awesome_plot <- df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value") %>%
  #we omit the dataframe to plot because that is being piped into ggplot
  #remember that different plot layers are still combined with '+'
  ggplot(aes(x = Name, y = sales_value, color = Sex)) +
  geom_point() +
  #add mean data by switching the dataframe!
  #I set a fixed color outside aes() because there is no Sex column in sales_mean
  geom_point(data = sales_mean, aes(x = Name, y = mean), color = 'black')

awesome_plot
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

If R is ever being pesky about showing you plots (e.g. if you want to display them in a loop), wrapping print() around the plot name usually helps:

print(awesome_plot)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
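
The loop case is worth a quick illustration: R only auto-prints the value of a top-level expression, so a plot that is merely named inside a for loop is never drawn. A minimal sketch, assuming a hand-typed toy dataset (toy_sales and plots are illustrative names):

```r
library(tidyverse)

# Assumption: first three rows of df_sales, typed in by hand for illustration
toy_sales <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  sales_2020 = c(100, 200, 150),
  sales_2021 = c(110, 210, 160)
)

# Plots are objects, so they can be collected in a list
plots <- list(
  ggplot(toy_sales, aes(x = Name, y = sales_2020)) + geom_point(),
  ggplot(toy_sales, aes(x = Name, y = sales_2021)) + geom_point()
)

for (p in plots) {
  # writing just `p` here would draw nothing; print() forces the plot to render
  print(p)
}
```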

Aliasing column names

Lastly, we’re going to show you how to alias a column name. Have you noticed that we always need to specify the literal name of the column we want to plot? What if we want to give the column name in a variable?

plot_this <- 'Name'

ggplot(df_sales, aes(x = plot_this, y = sales_2022, color = Sex)) +
  geom_point()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Certainly not the expected outcome! ggplot didn’t evaluate plot_this to the actual column, Name. Instead it treated the string 'Name' as a constant value, so every point ends up at the same x position. We’ll have to do it this way:

plot_this <- 'Name'

ggplot(df_sales, aes(x = .data[[plot_this]], y = sales_2022, color = Sex)) +
  geom_point()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
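
Where this pays off is in reusable code, for example a function that takes the column name as an argument. The helper below, plot_sales_by(), is a hypothetical name made up for this sketch, and toy_sales is a hand-typed subset of the table above:

```r
library(tidyverse)

# Assumption: first three rows of df_sales, typed in by hand for illustration
toy_sales <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Sex = c("Female", "Male", "Male"),
  sales_2022 = c(120, 220, 170)
)

# Hypothetical helper: x_col is a string holding the name of the x-axis column
plot_sales_by <- function(data, x_col) {
  ggplot(data, aes(x = .data[[x_col]], y = sales_2022, color = Sex)) +
    geom_point() +
    # without this the axis title would read ".data[[x_col]]"
    labs(x = x_col)
}

name_plot <- plot_sales_by(toy_sales, "Name")
age_plot <- plot_sales_by(toy_sales, "Age")

name_plot
age_plot
```

The same function now produces a plot for any column of the dataframe, without copy-pasting the ggplot call.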

We hear you say ‘But that is cumbersome!’. Unfortunately we’re neither the developers nor maintainers of ggplot so we all suffer together.