Presentation 3 - Exploratory Data Analysis (EDA)
In exercises 3A and 3B you will take a deeper look at the diabetes data. To prepare for that, we will learn some extra ggplot tricks here!
Load packages
library(readxl)
library(tidyverse)
Load data
df_sales <- read_excel('../out/sales_data_2.xlsx')
df_sales
# A tibble: 10 × 12
ID Name Age Sex sales_2020 sales_2021 sales_2022 sales_2023 mood
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 Alice 25 Female 100 110 120 100 happy
2 2 Bob 30 Male 200 210 220 230 happy
3 3 Charlie 22 Male 150 160 170 200 happy
4 4 Sophie 35 Female 300 320 340 250 happy
5 5 Eve 28 Female 250 240 250 270 happy
6 6 Frank NA Male NA 260 270 280 happy
7 7 Grace 40 Female 400 420 430 450 happy
8 8 Hannah 29 Female 500 510 NA 500 happy
9 9 Ian 21 Male 450 460 470 480 happy
10 10 Jack 33 Male 300 310 320 290 happy
# ℹ 3 more variables: raise <chr>, group <chr>, City <chr>
ggplot recap
We will not go into much detail here since this section mostly serves as a recap of the ggplot material covered in the previous course, From Excel to R.
The creed of ggplot is that every piece of information that should go into the plot must be in a column. There is one column that describes the x-axis, one for the y-axis, and one for each additional aesthetic like color, size, shape, etc.
ggplot(df_sales, aes(x = Name, y = sales_2022, color = Sex)) +
  geom_point()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
The long format is ggplot’s best friend
It follows that if I need to plot all sales data, I will need to change the dataframe’s format such that all data points referring to sales are in the same column. As shown in presentation 2, we do that with pivot_longer:
sales_long <- df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value")
sales_long
# A tibble: 40 × 10
ID Name Age Sex mood raise group City sales_year sales_value
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 Alice 25 Female happy no young_fe… Miami sales_2020 100
2 1 Alice 25 Female happy no young_fe… Miami sales_2021 110
3 1 Alice 25 Female happy no young_fe… Miami sales_2022 120
4 1 Alice 25 Female happy no young_fe… Miami sales_2023 100
5 2 Bob 30 Male happy yes mature_m… Miami sales_2020 200
6 2 Bob 30 Male happy yes mature_m… Miami sales_2021 210
7 2 Bob 30 Male happy yes mature_m… Miami sales_2022 220
8 2 Bob 30 Male happy yes mature_m… Miami sales_2023 230
9 3 Charlie 22 Male happy yes young_ma… LA sales_2020 150
10 3 Charlie 22 Male happy yes young_ma… LA sales_2021 160
# ℹ 30 more rows
ggplot(sales_long, aes(x = Name, y = sales_value, color = Sex)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
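While we are reshaping anyway, pivot_longer can also tidy up the new column in the same step. The sketch below uses a made-up toy tibble (toy_sales is not the course file, and its values are invented): names_prefix strips the "sales_" part and names_transform turns the remaining year into an integer, which is handy if you later want the year on a numeric axis.

```r
library(tidyr)
library(tibble)

# Toy data standing in for df_sales (values are made up for illustration)
toy_sales <- tibble(
  Name = c("Alice", "Bob"),
  sales_2020 = c(100, 200),
  sales_2021 = c(110, 210)
)

toy_long <- toy_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               names_prefix = "sales_",                          # drop the "sales_" prefix
               names_transform = list(sales_year = as.integer),  # "2020" -> 2020L
               values_to = "sales_value")
toy_long
```

Whether you want the year as a number or as a label depends on the plot: an integer year gives a continuous axis, a character year gives a discrete one.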
You can pipe into ggplot
You know what sucks? Having 10 million dataframes with very similar names in your environment. If you don’t need your long format dataframe for anything else, you can skip saving it and pipe directly into ggplot:
df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value") %>%
  #we omit the dataframe to plot because that is being piped into ggplot
  #remember that different plot layers are still combined with '+'
  ggplot(aes(x = Name, y = sales_value, color = Sex)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Plotting several dataframes
Sometimes we would like to add more information to a plot. Consider the one we just made above. It shows 3 or 4 dots for each employee, one for each year we have data for. I can now calculate a mean across the 4 years per employee:
sales_mean <- sales_long %>%
  group_by(Name) %>%
  summarise(mean = mean(sales_value, na.rm = TRUE))
sales_mean
# A tibble: 10 × 2
Name mean
<chr> <dbl>
1 Alice 108.
2 Bob 215
3 Charlie 170
4 Eve 252.
5 Frank 270
6 Grace 425
7 Hannah 503.
8 Ian 465
9 Jack 305
10 Sophie 302.
And I would like to add it to the plot:
#copy pasta code above
df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value") %>%
  #we omit the dataframe to plot because that is being piped into ggplot
  #remember that different plot layers are still combined with '+'
  ggplot(aes(x = Name, y = sales_value, color = Sex)) +
  geom_point() +
  #add mean data by switching the dataframe!
  #I need to set a fixed color outside aes() because this layer would otherwise
  #inherit color = Sex, and there is no Sex column in sales_mean
  geom_point(data = sales_mean, aes(x = Name, y = mean), color = 'black')
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Plots are objects
ggplot plots are objects like any other R object and they can therefore be stored in a variable and displayed by invoking the variable’s name:
awesome_plot <- df_sales %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "sales_year",
               values_to = "sales_value") %>%
  #we omit the dataframe to plot because that is being piped into ggplot
  #remember that different plot layers are still combined with '+'
  ggplot(aes(x = Name, y = sales_value, color = Sex)) +
  geom_point() +
  #add mean data by switching the dataframe!
  #I need to set a fixed color outside aes() because this layer would otherwise
  #inherit color = Sex, and there is no Sex column in sales_mean
  geom_point(data = sales_mean, aes(x = Name, y = mean), color = 'black')
awesome_plot
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
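Because a plot is an object, you can also keep building on it after it has been stored. Here is a minimal sketch on made-up toy data (toy_df, base_plot and labelled_plot are names invented for this example, and the values are not from the course file):

```r
library(ggplot2)

# Toy data, made up for illustration
toy_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  sales_value = c(120, 220, 170),
  Sex = c("Female", "Male", "Male")
)

base_plot <- ggplot(toy_df, aes(x = Name, y = sales_value, color = Sex)) +
  geom_point()

# Adding to a stored plot returns a new plot; base_plot itself is unchanged
labelled_plot <- base_plot +
  labs(title = "Sales per employee", y = "Sales") +
  theme_minimal()

labelled_plot
```

This way you can keep one base plot around and try out different titles or themes without retyping the whole thing.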
If R is ever being pesky about showing you plots (e.g. if you want to display them in a loop), wrapping print() around the plot name usually helps:
print(awesome_plot)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
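The loop case looks roughly like this. Inside a for loop, just writing the plot's name does not display anything, so each plot is printed explicitly. The data and the names toy_df and plot_list are made up for this sketch:

```r
library(ggplot2)

# Toy data, made up for illustration
toy_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  sales_2022 = c(120, 220, 170),
  sales_2023 = c(100, 230, 200)
)

# Plots are objects, so they can be collected in a list
plot_list <- list(
  ggplot(toy_df, aes(x = Name, y = sales_2022)) + geom_point(),
  ggplot(toy_df, aes(x = Name, y = sales_2023)) + geom_point()
)

for (p in plot_list) {
  # A bare `p` here would show nothing; print() forces the plot to be drawn
  print(p)
}
```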
Aliasing column names
Lastly, we’re going to show you how to alias a column name. Have you noticed that we always need to specify the literal name of the column we want to plot? What if we want to give the column name in a variable?
plot_this <- 'Name'
ggplot(df_sales, aes(x = plot_this, y = sales_2022, color = Sex)) +
  geom_point()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Certainly not the expected outcome! We can see that ggplot didn’t evaluate plot_this to the name of the actual column, Name. Instead, it used the string 'Name' itself as the x value for every row. We’ll have to do it this way:
plot_this <- 'Name'
ggplot(df_sales, aes(x = .data[[plot_this]], y = sales_2022, color = Sex)) +
  geom_point()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
We hear you say ‘But that is cumbersome!’. Unfortunately we’re neither the developers nor maintainers of ggplot, so we all suffer together.
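Where the .data[[ ]] syntax does pay off is when you want to reuse the same plotting code for several columns. A rough sketch, assuming made-up toy data; plot_column() is a helper invented for this example and not part of ggplot:

```r
library(ggplot2)

# Toy data, made up for illustration
toy_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  sales_2022 = c(120, 220, 170),
  sales_2023 = c(100, 230, 200)
)

# Hypothetical helper: y_col is the column name given as a string
plot_column <- function(df, y_col) {
  ggplot(df, aes(x = Name, y = .data[[y_col]])) +
    geom_point() +
    labs(y = y_col)  # otherwise the axis title would read ".data[[y_col]]"
}

p_2022 <- plot_column(toy_df, "sales_2022")
p_2023 <- plot_column(toy_df, "sales_2023")
p_2023
```

The same function now works for any column in the dataframe, which would not be possible with a literal column name inside aes().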