library(tidyverse)
library(readxl)
Exercise 3: ggplot2 - Solutions
Getting started
Before you proceed with the exercises in this document, make sure to run the command library(tidyverse)
in order to load the core tidyverse packages (including ggplot2).
The data set used in these exercises, climate.xlsx¹, was compiled from data downloaded in 2017 from the website of the UK's national weather service, the Met Office.
The spreadsheet contains data from five UK weather stations in 2016. The following variables are included in the data set:
Variable name | Explanation |
---|---|
station | Location of weather station |
year | Year |
month | Month |
af | Days of air frost |
rain | Rainfall in mm |
sun | Sunshine duration in hours |
device | Brand of sunshine recorder / sensor |
The data set is the same as the one used for the Tidyverse exercise. If you have already imported the data, there is no need to import it again, unless you have made changes to the data assigned to climate since the original data set was imported.
climate <- read_xlsx('../../Data/climate.xlsx')
head(climate)
# A tibble: 6 × 7
station year month af rain sun device
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 armagh 2016 1 5 132. 44.5 Campbell Stokes
2 armagh 2016 2 10 62.6 71.3 Campbell Stokes
3 armagh 2016 3 4 43.8 117. Campbell Stokes
4 armagh 2016 4 5 54 140. Campbell Stokes
5 armagh 2016 5 0 41.4 210. Campbell Stokes
6 armagh 2016 6 0 75.1 114. Campbell Stokes
Need a little help? Consult the ggplot2 cheatsheet here: https://rstudio.github.io/cheatsheets/data-visualization.pdf
Scatter plot I
- Make a scatter (point) plot of rain against sun.
ggplot(climate,
aes(x = rain,
y = sun)) +
geom_point()
- Color the points in the scatter plot according to weather station. Save the plot in an object.
ggplot(climate,
aes(x = rain,
y = sun,
color = station)) +
geom_point()
- Add the segment + facet_wrap(vars(station)) to the saved plot object from above, and update the plot. What happens?
ggplot(climate,
aes(x = rain,
y = sun,
color = station)) +
geom_point() +
facet_wrap(vars(station))
- Is it necessary to have a legend in the faceted plot? How can you remove this legend? Hint: try adding a theme() with legend.position = "none" inside it.
ggplot(climate,
aes(x = rain,
y = sun,
color = station)) +
geom_point() +
facet_wrap(vars(station)) +
theme(legend.position = "none")
Graphic files
- Use ggsave(file="weather.jpeg") to remake the last ggplot as a jpeg-file and save it. The file will be saved in your working directory. Locate this file on your computer and open it.
- Use ggsave(file="weather.png", width=10, height=8, units="cm") to remake the last ggplot as a png-file and save it. What do the three other options do? Look at the help page ?ggsave to get an overview of the possible options.
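The following is a minimal sketch of the same idea on made-up data. The data frame toy, the plot object p_toy and the file name weather_toy.png are hypothetical and not part of the exercise. The plot is passed explicitly through the plot argument and written to a temporary directory, so the example does not depend on which plot was displayed last or on your working directory.

```r
library(ggplot2)

# Hypothetical toy data standing in for the climate data set
toy <- data.frame(rain = c(40, 55, 72, 90),
                  sun  = c(210, 160, 120, 60))

p_toy <- ggplot(toy, aes(x = rain, y = sun)) +
  geom_point()

# Write to a temporary directory instead of the working directory.
# width, height and units set the physical size of the saved image.
out_file <- file.path(tempdir(), "weather_toy.png")
ggsave(filename = out_file, plot = p_toy,
       width = 10, height = 8, units = "cm")

# TRUE if the file was written
file.exists(out_file)
```

If plot is left out, ggsave() saves the last plot that was displayed, which is what the exercise relies on.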
Scatter plot II: error bars
- Calculate the average and standard deviation for sunshine in each month and save it to a table called summary_stats. You will need group_by and summarize. Recall how to do this from the tidyverse exercise.
summary_stats <- climate %>%
  group_by(month) %>%
  summarize(sun_avg = mean(sun),
            sun_sd = sd(sun))
head(summary_stats)
# A tibble: 6 × 3
month sun_avg sun_sd
<dbl> <dbl> <dbl>
1 1 45.3 9.19
2 2 86.2 19.5
3 3 113. 21.8
4 4 160. 16.0
5 5 193. 19.1
6 6 130. 40.3
- Make a scatter plot of the summary_stats with month on the x-axis, and the average number of sunshine hours on the y-axis.
p <- ggplot(summary_stats,
            aes(x = month,
                y = sun_avg)) +
  geom_point()
p
- Add error bars to the plot, which represent the average number of sunshine hours plus/minus the standard deviation of the observations. The relevant geom is called geom_errorbar.
Hint:
geom_errorbar(aes(ymin = sun_avg - sun_sd, ymax = sun_avg + sun_sd), width = 0.2)
mapping: ymin = ~sun_avg - sun_sd, ymax = ~sun_avg + sun_sd
geom_errorbar: na.rm = FALSE, orientation = NA, width = 0.2
stat_identity: na.rm = FALSE
position_identity
p <- p + geom_errorbar(aes(ymin = sun_avg - sun_sd, ymax = sun_avg + sun_sd), width = 0.2)
p
- How could you make the plot with horizontal error bars instead? Tip: Think about which of the two variables, month and average sunshine hours, can meaningfully have an error.
p + coord_flip()
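The steps of this section can be sketched end to end on a small made-up data set. The data frame toy and the objects toy_stats, p_toy and p_flipped are hypothetical names used only for illustration. The sketch shows that the error still describes sunshine hours after flipping; only the orientation of the axes changes.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: two sunshine observations per month
toy <- data.frame(month = c(1, 1, 2, 2),
                  sun   = c(40, 50, 80, 100))

# Mean and standard deviation of sunshine per month
toy_stats <- toy %>%
  group_by(month) %>%
  summarize(sun_avg = mean(sun),
            sun_sd  = sd(sun))

# Points with vertical error bars (mean plus/minus one standard deviation)
p_toy <- ggplot(toy_stats, aes(x = month, y = sun_avg)) +
  geom_point() +
  geom_errorbar(aes(ymin = sun_avg - sun_sd,
                    ymax = sun_avg + sun_sd),
                width = 0.2)

# coord_flip() swaps the two axes, so the same bars are drawn horizontally
p_flipped <- p_toy + coord_flip()
p_flipped
```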
Line plot (also known as a spaghetti plot)
- Make a line plot (find the correct geom_ for this) of the rainfall observations over time (month), such that observations from the same station are connected in one line. Put month on the x-axis. Color the lines according to weather station as well.
ggplot(climate,
aes(x = month,
y = rain,
color = station)) +
geom_line()
- The month variable was read into R as a numerical variable. Convert this variable to a factor and make the line plot from the previous question again. What has changed?
climate$month <- as.factor(climate$month)
str(climate)
tibble [60 × 7] (S3: tbl_df/tbl/data.frame)
$ station: chr [1:60] "armagh" "armagh" "armagh" "armagh" ...
$ year : num [1:60] 2016 2016 2016 2016 2016 ...
$ month : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ af : num [1:60] 5 10 4 5 0 0 0 0 0 0 ...
$ rain : num [1:60] 131.9 62.6 43.8 54 41.4 ...
$ sun : num [1:60] 44.5 71.3 117.3 139.7 209.6 ...
$ device : chr [1:60] "Campbell Stokes" "Campbell Stokes" "Campbell Stokes" "Campbell Stokes" ...
p <- ggplot(climate,
            aes(x = month,
                y = rain,
                color = station,
                group = station)) +
  geom_line()
p
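group = station needs to be added when month is a factor. Without it, ggplot2 groups the data by the combination of all discrete variables in the plot (here month and station), so each group holds a single observation and no lines can be drawn. The plot now shows the individual months as discrete categories instead of placing them on a continuous scale.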
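The effect of the group aesthetic can be checked on a small made-up data set. The data frame toy and the object names below are hypothetical. The sketch uses layer_data() to inspect the group assigned to each row when the plot is built, with and without group = station.

```r
library(ggplot2)

# Hypothetical toy data: 2 stations x 3 months, month stored as a factor
toy <- data.frame(station = rep(c("a", "b"), each = 3),
                  month   = factor(rep(1:3, times = 2)),
                  rain    = c(10, 20, 15, 30, 25, 35))

# No explicit group: grouping follows the combination of the
# discrete variables month and station
p_default <- ggplot(toy, aes(x = month, y = rain, color = station)) +
  geom_line()

# Explicit group: all observations of one station belong together
p_grouped <- ggplot(toy, aes(x = month, y = rain, color = station,
                             group = station)) +
  geom_line()

# Number of distinct groups in the built layer data
n_groups_default <- length(unique(layer_data(p_default)$group))
n_groups_grouped <- length(unique(layer_data(p_grouped)$group))
c(default = n_groups_default, grouped = n_groups_grouped)
```

In the default case every row ends up in its own group, which is why geom_line() has nothing to connect.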
- Use theme(legend.position = ???) to move the color legend to the top of the plot.
p <- p + theme(legend.position = 'top')
p
Layering
We can add several geoms to the same plot to show several things at once.
- (Re)Make the line plot of monthly rainfall and add geom_point() to it.
p <- p + geom_point()
p
- Now, add geom_hline(yintercept = mean(climate$rain), linetype = "dashed") at the end of your code for the line plot, and update the plot again. Have a look at the code again and understand what it does and how. What do you think "h" in hline stands for?
hline = horizontal line.
p <- p + geom_hline(yintercept = mean(climate$rain),
                    linetype = "dashed")
p
- Finally, try adding the following code and update the plot. What changed? Replace X, Y, COL, and TITLE with some more suitable (informative) text.
labs(x = "X", y = "Y", color = "COL", title = "TITLE")
p <- p + labs(x = "Month", y = "Rain", color = "Station", title = "Rainfall over month")
p
Box plot I
- Make a box plot of sunshine per weather station.
ggplot(climate,
aes(y = sun,
x = station)) +
geom_boxplot()
- Color the boxes according to weather station.
p <- ggplot(climate,
            aes(y = sun,
                x = station,
                fill = station)) +
  geom_boxplot()
p
Box plot II - Aesthetics
There are many ways in which you can manipulate the look of your plot. For this we will use the boxplot you made in the exercise above.
- Add a different legend title with labs(fill = "Custom Title").
p <- p + labs(fill = "Station")
p
- Change the theme of the ggplot grid. Suggestions: theme_minimal(), theme_bw(), theme_dark(), theme_void().
p <- p + theme_minimal()
p
- Instead of automatically chosen colors, pick your own colors for fill = station by adding the scale_fill_manual() command. You will need five colors, one for each station. What happens if you choose too few colors?
p <- p + scale_fill_manual(values = c('magenta', 'pink1', 'deeppink', 'violet', 'hotpink'))
p
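As a variation on the solution above, the colors can be given as a named vector, so that each color is tied to a station by name rather than by position. This is only a sketch on made-up data: the data frame toy, the station names a and b, and the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data with two stations
toy <- data.frame(station = rep(c("a", "b"), each = 3),
                  sun     = c(40, 60, 80, 100, 120, 150))

# Named vector: the names must match the values of the station variable
station_colors <- c(a = "hotpink", b = "violet")

p_toy <- ggplot(toy, aes(x = station, y = sun, fill = station)) +
  geom_boxplot() +
  scale_fill_manual(values = station_colors)

# Fill colors that ggplot2 actually assigned to the boxes
fills_used <- unique(layer_data(p_toy)$fill)
fills_used
```

With a named vector the assignment does not change if the order of the stations changes, for example after re-leveling a factor.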
- Change the boxplot to a violin plot. Add the sunshine observations as scatter points to the plot. Include a boxplot inside the violin plot with geom_boxplot(width=.1).
ggplot(climate,
aes(y = sun,
x = station,
fill = station)) +
geom_violin() +
geom_point() +
geom_boxplot(width=.1)
Histogram
- Make a histogram (find the correct geom_ for this) of rain from the climate dataset. Interpret the plot, what does it show?
The plot shows the distribution of accumulated rainfall across stations and months. For most months, the rainfall is around 50 mm.
ggplot(climate,
aes(x = rain)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- R suggests that you choose a different number of bins/bin width for the histogram. Use binwidth = inside the histogram geom to experiment with different values of bin width. Look at how the histogram changes.
ggplot(climate,
aes(x = rain)) +
geom_histogram(binwidth = 25)
ggplot(climate,
aes(x = rain)) +
geom_histogram(binwidth = 3)
- Color the entire histogram. Here we are not coloring/filling according to any attribute, just the entire thing, so the argument needs to be outside aes().
ggplot(climate,
aes(x = rain)) +
geom_histogram(binwidth = 25, fill = 'hotpink')
Bar chart I
- Make a bar chart (geom_col()) which visualizes the sunshine hours per month. If you have not already converted month to a factor in the line plot section, do so now and remake the plot.
ggplot(climate,
aes(x = month,
y = sun)) +
geom_col()
- Color, i.e. divide the bars according to weather station.
ggplot(climate,
aes(x = month,
y = sun,
fill = station)) +
geom_col()
- For better comparison, place the bars for each station next to each other instead of stacking them.
ggplot(climate,
aes(x = month,
y = sun,
fill = station)) +
geom_col(position = 'dodge')
- Make the axis labels, legend title, and title of the plot more informative by customizing them like you did for the line plot above.
ggplot(climate,
aes(x = month,
y = sun,
fill = station)) +
geom_col(position = 'dodge') +
labs(x = 'Month',
y = 'Sunshine',
fill = 'Weather station',
title = 'Sunshine over month')
Bar chart II: Sorting bars
- Make a new bar chart showing the (total) annual rainfall recorded at each weather station. You will need to calculate this first. The format we need is a dataframe with summed up rain data per station.
rain_summary <- climate %>%
  group_by(station) %>%
  summarize(rain_sum = sum(rain))
rain_summary
# A tibble: 5 × 2
station rain_sum
<chr> <dbl>
1 armagh 737.
2 camborne 1147.
3 lerwick 1218.
4 oxford 658.
5 sheffield 788.
ggplot(rain_summary,
aes(x = station,
y = rain_sum)) +
geom_col()
- Sort the stations according to rainfall, either ascending or descending. This was shown in the ggplot lecture. Sort your rain dataframe from the question above by sum, then re-arrange the factor levels of "station" as shown in the lecture.
# Arrange
rain_summary <- rain_summary %>%
  arrange(desc(rain_sum))
# Change station to factor
rain_summary$station <- factor(rain_summary$station,
                               levels = rain_summary$station)
# Plot
p <- rain_summary %>%
  ggplot(aes(x = station,
             y = rain_sum)) +
  geom_col()
p
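The solution above sorts the table and then fixes the factor levels in that order. A different route, shown here only as a sketch and not as the method from the lecture, is base R's reorder(), which sets the factor levels directly from a second variable. The data frame toy_rain and its values are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy summary table: total rainfall per station
toy_rain <- data.frame(station  = c("a", "b", "c"),
                       rain_sum = c(700, 1200, 650))

# reorder() returns a factor whose levels are sorted by the second
# argument in increasing order; negating rain_sum gives descending rainfall
toy_rain$station <- reorder(toy_rain$station, -toy_rain$rain_sum)
levels(toy_rain$station)

# Bars now appear in the order of the factor levels
ggplot(toy_rain, aes(x = station, y = rain_sum)) +
  geom_col()
```

Either way, the key point is the same: ggplot2 orders a discrete axis by factor levels, not by the row order of the data frame.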
- Add labels to each bar that state the sum of the rainfall. You can do this by adding the label keyword to the aes() and adding geom_label() to the plot. Just as geoms like geom_point look at the aes() to know what to plot on the x and y axis, geom_label looks at it to know what to use for labels.
p + geom_label(aes(label = rain_sum))
- Adjust the label positions so that the labels are positioned above the bars instead of inside them.
p + geom_label(aes(label = rain_sum),
               position = position_nudge(y = 35))
- To alter size of figure in report:
{r, fig.width=10, fig.height=10}
Footnotes
Contains public sector information licensed under the Open Government Licence v3.0.