library(tidyverse)
library(readxl)
Exercise 2: Tidyverse - Solutions
Setting up
Create new Quarto document. For working on the exercise, create a new Quarto document with a descriptive name and save it where you can find it again, i.e. in the folder where you downloaded the teaching material. You can use the commands shown in presentation2.qmd to solve this exercise. There is no shame in outright copying from the presentation2.qmd script, provided you understand what the command is doing.
Load packages. You will need to load the packages
tidyverse
andreadxl
for this exercise.
Importing data and a first look at the dataset
The data set used in these exercises was compiled from data downloaded from the website of the UKβs national weather service, the Met Office. It is saved in the file climate.xlsx1 which can be found in the folder Exercises/Data/
. The spreadsheet contains monthly data from five UK weather stations for the following variables:
Variable name | Explanation |
---|---|
station | Location of weather station |
year | Year |
month | Month |
af | Days of air frost |
rain | Rainfall in mm |
sun | Sunshine duration in hours |
device | Brand of sunshine recorder / sensor |
- Load data. Start by importing the dataset using either the
read_excel()
function or theImport Dataset
button and name itclimate
. If you load withImport Dataset
it is a good idea to copy the command into your script so that the next time you run your script you can just execute that line instead of having to find the file again.
<- read_excel('../../Data/climate.xlsx') climate
- First look at data. Write the name of the dataframe, i.e.
climate
, into the console and press enter to see the first rows of the dataset. You can also click on theclimate
object in the Environment panel.
head(climate)
# A tibble: 6 Γ 7
station year month af rain sun device
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 armagh 2016 1 5 132. 44.5 Campbell Stokes
2 armagh 2016 2 10 62.6 71.3 Campbell Stokes
3 armagh 2016 3 4 43.8 117. Campbell Stokes
4 armagh 2016 4 5 54 140. Campbell Stokes
5 armagh 2016 5 0 41.4 210. Campbell Stokes
6 armagh 2016 6 0 75.1 114. Campbell Stokes
Explore your dataset and understand what data you have.
How many observations, i.e. rows are there?
How many data columns are there and what are their types?
What is the information in each row and column?
How many different stations are there?
How many rows per station?
# a: 60 rows
nrow(climate)
[1] 60
# b: 7 columns
ncol(climate)
[1] 7
str(climate)
tibble [60 Γ 7] (S3: tbl_df/tbl/data.frame)
$ station: chr [1:60] "armagh" "armagh" "armagh" "armagh" ...
$ year : num [1:60] 2016 2016 2016 2016 2016 ...
$ month : num [1:60] 1 2 3 4 5 6 7 8 9 10 ...
$ af : num [1:60] 5 10 4 5 0 0 0 0 0 0 ...
$ rain : num [1:60] 131.9 62.6 43.8 54 41.4 ...
$ sun : num [1:60] 44.5 71.3 117.3 139.7 209.6 ...
$ device : chr [1:60] "Campbell Stokes" "Campbell Stokes" "Campbell Stokes" "Campbell Stokes" ...
# c: af = air forst
colnames(climate)
[1] "station" "year" "month" "af" "rain" "sun" "device"
# d: 5 different/unique stations
%>%
climate select(station) %>%
distinct()
# A tibble: 5 Γ 1
station
<chr>
1 armagh
2 camborne
3 lerwick
4 oxford
5 sheffield
# e: 12 rows per station
%>%
climate group_by(station) %>%
count()
# A tibble: 5 Γ 2
# Groups: station [5]
station n
<chr> <int>
1 armagh 12
2 camborne 12
3 lerwick 12
4 oxford 12
5 sheffield 12
Working with the data
Before you proceed with the exercises in this document, make sure you load the tidyverse
in order to use the functions from this package.
- Count the number of rows that did not have any days with air frost.
%>%
climate filter(af == 0) %>%
count()
# A tibble: 1 Γ 1
n
<int>
1 37
- Count the number of rows per station that did not have any days with air frost.
%>%
climate group_by(station) %>%
filter(af == 0) %>%
count()
# A tibble: 5 Γ 2
# Groups: station [5]
station n
<chr> <int>
1 armagh 6
2 camborne 12
3 lerwick 7
4 oxford 6
5 sheffield 6
Select from the climate dataset (remember to
filter
rows andselect
columns):all rows from the station in Oxford
all rows from the station in Oxford when there were at least 100 hours of sunlight
all rows from the stations in Oxford and Camborne when there were at least 100 hours of sunlight
a subset that only contains the
station
,year
andrain
columns
# a
%>%
climate filter(station == 'oxford')
# A tibble: 12 Γ 7
station year month af rain sun device
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 oxford 2016 1 5 83.9 59.1 Campbell Stokes
2 oxford 2016 2 6 47.6 113. Campbell Stokes
3 oxford 2016 3 4 74.2 124. Campbell Stokes
4 oxford 2016 4 1 53.1 164. Campbell Stokes
5 oxford 2016 5 0 86.1 203. Campbell Stokes
6 oxford 2016 6 0 95.7 100. Campbell Stokes
7 oxford 2016 7 0 3.4 228. Campbell Stokes
8 oxford 2016 8 0 41.2 204. Campbell Stokes
9 oxford 2016 9 0 44.6 113. Campbell Stokes
10 oxford 2016 10 0 26.5 112. Campbell Stokes
11 oxford 2016 11 3 76.1 88.3 Campbell Stokes
12 oxford 2016 12 10 25.8 62.3 Campbell Stokes
# b
%>%
climate filter(station == 'oxford' & sun > 100)
# A tibble: 9 Γ 7
station year month af rain sun device
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 oxford 2016 2 6 47.6 113. Campbell Stokes
2 oxford 2016 3 4 74.2 124. Campbell Stokes
3 oxford 2016 4 1 53.1 164. Campbell Stokes
4 oxford 2016 5 0 86.1 203. Campbell Stokes
5 oxford 2016 6 0 95.7 100. Campbell Stokes
6 oxford 2016 7 0 3.4 228. Campbell Stokes
7 oxford 2016 8 0 41.2 204. Campbell Stokes
8 oxford 2016 9 0 44.6 113. Campbell Stokes
9 oxford 2016 10 0 26.5 112. Campbell Stokes
# c
%>%
climate filter((station == 'oxford' | station == 'camborne') & sun > 100)
# A tibble: 17 Γ 7
station year month af rain sun device
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 camborne 2016 3 0 88.4 140. Kipp Zonen
2 camborne 2016 4 0 81.4 184. Kipp Zonen
3 camborne 2016 5 0 45.6 206. Kipp Zonen
4 camborne 2016 6 0 65.8 132. Kipp Zonen
5 camborne 2016 7 0 23.2 161. Kipp Zonen
6 camborne 2016 8 0 57.4 171. Kipp Zonen
7 camborne 2016 9 0 154. 103. Kipp Zonen
8 camborne 2016 10 0 53.2 125. Kipp Zonen
9 oxford 2016 2 6 47.6 113. Campbell Stokes
10 oxford 2016 3 4 74.2 124. Campbell Stokes
11 oxford 2016 4 1 53.1 164. Campbell Stokes
12 oxford 2016 5 0 86.1 203. Campbell Stokes
13 oxford 2016 6 0 95.7 100. Campbell Stokes
14 oxford 2016 7 0 3.4 228. Campbell Stokes
15 oxford 2016 8 0 41.2 204. Campbell Stokes
16 oxford 2016 9 0 44.6 113. Campbell Stokes
17 oxford 2016 10 0 26.5 112. Campbell Stokes
# d
%>%
climate select(station, year, rain)
# A tibble: 60 Γ 3
station year rain
<chr> <dbl> <dbl>
1 armagh 2016 132.
2 armagh 2016 62.6
3 armagh 2016 43.8
4 armagh 2016 54
5 armagh 2016 41.4
6 armagh 2016 75.1
7 armagh 2016 80.6
8 armagh 2016 52.5
9 armagh 2016 65.4
10 armagh 2016 37.1
# βΉ 50 more rows
The next few questions build on each other, each adding a piece of code:
- Compute the average rainfall over the full dataset by using the
summarize
function. You can look at the examples we did at the end of presentation 2.
%>%
climate summarize(rain_avg = mean(rain))
# A tibble: 1 Γ 1
rain_avg
<dbl>
1 75.8
- Now, compute the average rainfall, standard deviation of the rainfall and the total rainfall (the sum) on the full dataset. I.e. all three measures should be inside the same resulting table. Have a look at the tidyverse lecture if you have trouble with this.
%>%
climate summarize(rain_avg = mean(rain),
rain_sd = sd(rain),
rain_sum = sum(rain))
# A tibble: 1 Γ 3
rain_avg rain_sd rain_sum
<dbl> <dbl> <dbl>
1 75.8 43.2 4548.
- Now, use
group_by
beforesummarize
in order to compute group summary statistics (average, standard deviation, and sum) but split up into each of the five weather stations.
%>%
climate group_by(station) %>%
summarize(rain_avg = mean(rain),
rain_sd = sd(rain),
rain_sum = sum(rain))
# A tibble: 5 Γ 4
station rain_avg rain_sd rain_sum
<chr> <dbl> <dbl> <dbl>
1 armagh 61.4 26.1 737.
2 camborne 95.6 59.4 1147.
3 lerwick 101. 45.6 1218.
4 oxford 54.8 28.5 658.
5 sheffield 65.6 30.5 788.
- Include a column in the summary statistics which shows how many observations, i.e. rows, the data set contains for each station.
%>%
climate group_by(station) %>%
summarize(rain_avg = mean(rain),
rain_sd = sd(rain),
rain_sum = sum(rain),
n = n())
# A tibble: 5 Γ 5
station rain_avg rain_sd rain_sum n
<chr> <dbl> <dbl> <dbl> <int>
1 armagh 61.4 26.1 737. 12
2 camborne 95.6 59.4 1147. 12
3 lerwick 101. 45.6 1218. 12
4 oxford 54.8 28.5 658. 12
5 sheffield 65.6 30.5 788. 12
- Sort the rows in the output in descending order according to average annual rainfall.
%>%
climate group_by(station) %>%
summarize(rain_avg = mean(rain),
rain_sd = sd(rain),
rain_sum = sum(rain),
n = n()) %>%
arrange(desc(rain_avg))
# A tibble: 5 Γ 5
station rain_avg rain_sd rain_sum n
<chr> <dbl> <dbl> <dbl> <int>
1 lerwick 101. 45.6 1218. 12
2 camborne 95.6 59.4 1147. 12
3 sheffield 65.6 30.5 788. 12
4 armagh 61.4 26.1 737. 12
5 oxford 54.8 28.5 658. 12
Manipulating the data
- Create a new column in
climate
and save the new dataset in a different variable so you donβt overwrite your originalclimate
data. The new column should count the number of days in each month without air frost, based on the existingaf
column. For this exercise, assume each month has 30 days. To find the number of days without air frost, subtract the value in theaf
column from 30.
<- climate %>%
climate2 mutate(no_af = 30 - af)
head(climate2)
# A tibble: 6 Γ 8
station year month af rain sun device no_af
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 armagh 2016 1 5 132. 44.5 Campbell Stokes 25
2 armagh 2016 2 10 62.6 71.3 Campbell Stokes 20
3 armagh 2016 3 4 43.8 117. Campbell Stokes 26
4 armagh 2016 4 5 54 140. Campbell Stokes 25
5 armagh 2016 5 0 41.4 210. Campbell Stokes 30
6 armagh 2016 6 0 75.1 114. Campbell Stokes 30
- Add another column to your new dataset that says whether the weather this month was good. We consider a month to be good if it had at least 100 hours of sunshine and less than 100 mm of rain. Otherwise the weather was bad.
<- climate2 %>%
climate2 mutate(good_weather = if_else(sun > 100 & rain < 100, 'yes', 'no'))
head(climate2)
# A tibble: 6 Γ 9
station year month af rain sun device no_af good_weather
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 armagh 2016 1 5 132. 44.5 Campbell Stokes 25 no
2 armagh 2016 2 10 62.6 71.3 Campbell Stokes 20 no
3 armagh 2016 3 4 43.8 117. Campbell Stokes 26 yes
4 armagh 2016 4 5 54 140. Campbell Stokes 25 yes
5 armagh 2016 5 0 41.4 210. Campbell Stokes 30 yes
6 armagh 2016 6 0 75.1 114. Campbell Stokes 30 yes
- How many months are there with good weather (use the column you made in 14) for each station? Find the station that has the most months with good weather.
%>%
climate2 filter(good_weather == 'yes') %>%
group_by(station) %>%
summarize(n = n()) %>%
arrange(desc(n)) #%>%
# A tibble: 5 Γ 2
station n
<chr> <int>
1 oxford 9
2 camborne 7
3 armagh 6
4 sheffield 5
5 lerwick 4
# head(n = 1) # to only get the first one
Complex operations
The final questions require that you combine commands and variables of the type above.
- For each weather station apart from the one in Armagh, compute the total rainfall and sunshine duration for months that had no days of air frost. Present the totals in centimetres and days, respectively.
%>%
climate2 filter(station != 'armagh' & af == 0) %>%
group_by(station) %>%
summarize(total_rain = sum(rain),
total_sun = sum(sun)) %>%
mutate(total_rain_cm = total_rain / 10, # 10 mm per cm
total_sun_days = total_sun / 24) # 24 h per day
# A tibble: 4 Γ 5
station total_rain total_sun total_rain_cm total_sun_days
<chr> <dbl> <dbl> <dbl> <dbl>
1 camborne 1147. 1437. 115. 59.9
2 lerwick 644. 800. 64.4 33.3
3 oxford 298. 961. 29.8 40.0
4 sheffield 354 841 35.4 35.0
- Identify the weather station for which the median number of monthly sunshine hours over the months April to September was largest.
%>%
climate2 filter(month %in% c(4:9)) %>%
group_by(station) %>%
summarize(sun_median = median(sun)) %>%
arrange(desc(sun_median)) #%>%
# A tibble: 5 Γ 2
station sun_median
<chr> <dbl>
1 oxford 183.
2 camborne 166
3 sheffield 160.
4 lerwick 132.
5 armagh 124.
# head(n = 1) # to only get the first/largest one
Footnotes
Contains public sector information licensed under the Open Government Licence v3.0.β©οΈ