Exercise 2: Tidyverse - Solutions

Setting up

Create new Quarto document. For working on the exercise, create a new Quarto document with a descriptive name and save it where you can find it again, i.e. in the folder where you downloaded the teaching material. You can use the commands shown in presentation2.qmd to solve this exercise. There is no shame in outright copying from the presentation2.qmd script, provided you understand what the command is doing.
Load packages. You will need to load the packages tidyverse and readxl for this exercise.

library(tidyverse)
library(readxl)

Importing data and a first look at the dataset

The data set used in these exercises was compiled from data downloaded from the website of the UK’s national weather service, the Met Office. It is saved in the file climate.xlsx¹ which can be found in the folder Exercises/Data/. The spreadsheet contains monthly data from five UK weather stations for the following variables:

Variable name	Explanation
station	Location of weather station
year	Year
month	Month
af	Days of air frost
rain	Rainfall in mm
sun	Sunshine duration in hours
device	Brand of sunshine recorder / sensor

Load data. Start by importing the dataset using either the read_excel() function or the Import Dataset button and name it climate. If you load with Import Dataset it is a good idea to copy the command into your script so that the next time you run your script you can just execute that line instead of having to find the file again.

climate <- read_excel('../../Data/climate.xlsx')

First look at data. Write the name of the dataframe, i.e. climate, into the console and press enter to see the first rows of the dataset. You can also click on the climate object in the Environment panel.

head(climate)

# A tibble: 6 × 7
  station  year month    af  rain   sun device         
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>          
1 armagh   2016     1     5 132.   44.5 Campbell Stokes
2 armagh   2016     2    10  62.6  71.3 Campbell Stokes
3 armagh   2016     3     4  43.8 117.  Campbell Stokes
4 armagh   2016     4     5  54   140.  Campbell Stokes
5 armagh   2016     5     0  41.4 210.  Campbell Stokes
6 armagh   2016     6     0  75.1 114.  Campbell Stokes

Explore your dataset and understand what data you have.
1. How many observations, i.e. rows are there?
2. How many data columns are there and what are their types?
3. What is the information in each row and column?
4. How many different stations are there?
5. How many rows per station?

# a: 60 rows
nrow(climate)

[1] 60

# b: 7 columns
ncol(climate)

[1] 7

str(climate)

tibble [60 × 7] (S3: tbl_df/tbl/data.frame)
 $ station: chr [1:60] "armagh" "armagh" "armagh" "armagh" ...
 $ year   : num [1:60] 2016 2016 2016 2016 2016 ...
 $ month  : num [1:60] 1 2 3 4 5 6 7 8 9 10 ...
 $ af     : num [1:60] 5 10 4 5 0 0 0 0 0 0 ...
 $ rain   : num [1:60] 131.9 62.6 43.8 54 41.4 ...
 $ sun    : num [1:60] 44.5 71.3 117.3 139.7 209.6 ...
 $ device : chr [1:60] "Campbell Stokes" "Campbell Stokes" "Campbell Stokes" "Campbell Stokes" ...

# c: af = air forst
colnames(climate)

[1] "station" "year"    "month"   "af"      "rain"    "sun"     "device"

# d: 5 different/unique stations
climate %>% 
  select(station) %>% 
  distinct()

# A tibble: 5 × 1
  station  
  <chr>    
1 armagh   
2 camborne 
3 lerwick  
4 oxford   
5 sheffield

# e: 12 rows per station
climate %>% 
  group_by(station) %>% 
  count()

# A tibble: 5 × 2
# Groups:   station [5]
  station       n
  <chr>     <int>
1 armagh       12
2 camborne     12
3 lerwick      12
4 oxford       12
5 sheffield    12

Working with the data

Before you proceed with the exercises in this document, make sure you load the tidyverse in order to use the functions from this package.

Count the number of rows that did not have any days with air frost.

climate %>% 
  filter(af == 0) %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1    37

Count the number of rows per station that did not have any days with air frost.

climate %>% 
  group_by(station) %>% 
  filter(af == 0) %>% 
  count()

# A tibble: 5 × 2
# Groups:   station [5]
  station       n
  <chr>     <int>
1 armagh        6
2 camborne     12
3 lerwick       7
4 oxford        6
5 sheffield     6

Select from the climate dataset (remember to filter rows and select columns):
1. all rows from the station in Oxford
2. all rows from the station in Oxford when there were at least 100 hours of sunlight
3. all rows from the stations in Oxford and Camborne when there were at least 100 hours of sunlight
4. a subset that only contains the station, year and rain columns

# a
climate %>% 
  filter(station == 'oxford')

# A tibble: 12 × 7
   station  year month    af  rain   sun device         
   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>          
 1 oxford   2016     1     5  83.9  59.1 Campbell Stokes
 2 oxford   2016     2     6  47.6 113.  Campbell Stokes
 3 oxford   2016     3     4  74.2 124.  Campbell Stokes
 4 oxford   2016     4     1  53.1 164.  Campbell Stokes
 5 oxford   2016     5     0  86.1 203.  Campbell Stokes
 6 oxford   2016     6     0  95.7 100.  Campbell Stokes
 7 oxford   2016     7     0   3.4 228.  Campbell Stokes
 8 oxford   2016     8     0  41.2 204.  Campbell Stokes
 9 oxford   2016     9     0  44.6 113.  Campbell Stokes
10 oxford   2016    10     0  26.5 112.  Campbell Stokes
11 oxford   2016    11     3  76.1  88.3 Campbell Stokes
12 oxford   2016    12    10  25.8  62.3 Campbell Stokes

# b
climate %>% 
  filter(station == 'oxford' & sun > 100)

# A tibble: 9 × 7
  station  year month    af  rain   sun device         
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>          
1 oxford   2016     2     6  47.6  113. Campbell Stokes
2 oxford   2016     3     4  74.2  124. Campbell Stokes
3 oxford   2016     4     1  53.1  164. Campbell Stokes
4 oxford   2016     5     0  86.1  203. Campbell Stokes
5 oxford   2016     6     0  95.7  100. Campbell Stokes
6 oxford   2016     7     0   3.4  228. Campbell Stokes
7 oxford   2016     8     0  41.2  204. Campbell Stokes
8 oxford   2016     9     0  44.6  113. Campbell Stokes
9 oxford   2016    10     0  26.5  112. Campbell Stokes

# c
climate %>% 
  filter((station == 'oxford' | station == 'camborne') & sun > 100)

# A tibble: 17 × 7
   station   year month    af  rain   sun device         
   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <chr>          
 1 camborne  2016     3     0  88.4  140. Kipp Zonen     
 2 camborne  2016     4     0  81.4  184. Kipp Zonen     
 3 camborne  2016     5     0  45.6  206. Kipp Zonen     
 4 camborne  2016     6     0  65.8  132. Kipp Zonen     
 5 camborne  2016     7     0  23.2  161. Kipp Zonen     
 6 camborne  2016     8     0  57.4  171. Kipp Zonen     
 7 camborne  2016     9     0 154.   103. Kipp Zonen     
 8 camborne  2016    10     0  53.2  125. Kipp Zonen     
 9 oxford    2016     2     6  47.6  113. Campbell Stokes
10 oxford    2016     3     4  74.2  124. Campbell Stokes
11 oxford    2016     4     1  53.1  164. Campbell Stokes
12 oxford    2016     5     0  86.1  203. Campbell Stokes
13 oxford    2016     6     0  95.7  100. Campbell Stokes
14 oxford    2016     7     0   3.4  228. Campbell Stokes
15 oxford    2016     8     0  41.2  204. Campbell Stokes
16 oxford    2016     9     0  44.6  113. Campbell Stokes
17 oxford    2016    10     0  26.5  112. Campbell Stokes

# d
climate %>% 
  select(station, year, rain)

# A tibble: 60 × 3
   station  year  rain
   <chr>   <dbl> <dbl>
 1 armagh   2016 132. 
 2 armagh   2016  62.6
 3 armagh   2016  43.8
 4 armagh   2016  54  
 5 armagh   2016  41.4
 6 armagh   2016  75.1
 7 armagh   2016  80.6
 8 armagh   2016  52.5
 9 armagh   2016  65.4
10 armagh   2016  37.1
# ℹ 50 more rows

The next few questions build on each other, each adding a piece of code:

Compute the average rainfall over the full dataset by using the summarize function. You can look at the examples we did at the end of presentation 2.

climate %>% 
  summarize(rain_avg = mean(rain))

# A tibble: 1 × 1
  rain_avg
     <dbl>
1     75.8

Now, compute the average rainfall, standard deviation of the rainfall and the total rainfall (the sum) on the full dataset. I.e. all three measures should be inside the same resulting table. Have a look at the tidyverse lecture if you have trouble with this.

climate %>% 
  summarize(rain_avg = mean(rain),
            rain_sd = sd(rain), 
            rain_sum = sum(rain))

# A tibble: 1 × 3
  rain_avg rain_sd rain_sum
     <dbl>   <dbl>    <dbl>
1     75.8    43.2    4548.

Now, use group_by before summarize in order to compute group summary statistics (average, standard deviation, and sum) but split up into each of the five weather stations.

climate %>% 
  group_by(station) %>% 
  summarize(rain_avg = mean(rain),
            rain_sd = sd(rain), 
            rain_sum = sum(rain))

# A tibble: 5 × 4
  station   rain_avg rain_sd rain_sum
  <chr>        <dbl>   <dbl>    <dbl>
1 armagh        61.4    26.1     737.
2 camborne      95.6    59.4    1147.
3 lerwick      101.     45.6    1218.
4 oxford        54.8    28.5     658.
5 sheffield     65.6    30.5     788.

Include a column in the summary statistics which shows how many observations, i.e. rows, the data set contains for each station.

climate %>% 
  group_by(station) %>% 
  summarize(rain_avg = mean(rain),
            rain_sd = sd(rain), 
            rain_sum = sum(rain),
            n = n())

# A tibble: 5 × 5
  station   rain_avg rain_sd rain_sum     n
  <chr>        <dbl>   <dbl>    <dbl> <int>
1 armagh        61.4    26.1     737.    12
2 camborne      95.6    59.4    1147.    12
3 lerwick      101.     45.6    1218.    12
4 oxford        54.8    28.5     658.    12
5 sheffield     65.6    30.5     788.    12

Sort the rows in the output in descending order according to average annual rainfall.

climate %>% 
  group_by(station) %>% 
  summarize(rain_avg = mean(rain),
            rain_sd = sd(rain), 
            rain_sum = sum(rain),
            n = n()) %>% 
  arrange(desc(rain_avg))

# A tibble: 5 × 5
  station   rain_avg rain_sd rain_sum     n
  <chr>        <dbl>   <dbl>    <dbl> <int>
1 lerwick      101.     45.6    1218.    12
2 camborne      95.6    59.4    1147.    12
3 sheffield     65.6    30.5     788.    12
4 armagh        61.4    26.1     737.    12
5 oxford        54.8    28.5     658.    12

Manipulating the data

Create a new column in climate and save the new dataset in a different variable so you don’t overwrite your original climate data. The new column should count the number of days in each month without air frost, based on the existing af column. For this exercise, assume each month has 30 days. To find the number of days without air frost, subtract the value in the af column from 30.

climate2 <- climate %>% 
  mutate(no_af = 30 - af)

head(climate2)

# A tibble: 6 × 8
  station  year month    af  rain   sun device          no_af
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>           <dbl>
1 armagh   2016     1     5 132.   44.5 Campbell Stokes    25
2 armagh   2016     2    10  62.6  71.3 Campbell Stokes    20
3 armagh   2016     3     4  43.8 117.  Campbell Stokes    26
4 armagh   2016     4     5  54   140.  Campbell Stokes    25
5 armagh   2016     5     0  41.4 210.  Campbell Stokes    30
6 armagh   2016     6     0  75.1 114.  Campbell Stokes    30

Add another column to your new dataset that says whether the weather this month was good. We consider a month to be good if it had at least 100 hours of sunshine and less than 100 mm of rain. Otherwise the weather was bad.

climate2 <- climate2 %>% 
  mutate(good_weather = if_else(sun > 100 & rain < 100, 'yes', 'no'))

head(climate2)

# A tibble: 6 × 9
  station  year month    af  rain   sun device          no_af good_weather
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>           <dbl> <chr>       
1 armagh   2016     1     5 132.   44.5 Campbell Stokes    25 no          
2 armagh   2016     2    10  62.6  71.3 Campbell Stokes    20 no          
3 armagh   2016     3     4  43.8 117.  Campbell Stokes    26 yes         
4 armagh   2016     4     5  54   140.  Campbell Stokes    25 yes         
5 armagh   2016     5     0  41.4 210.  Campbell Stokes    30 yes         
6 armagh   2016     6     0  75.1 114.  Campbell Stokes    30 yes

How many months are there with good weather (use the column you made in 14) for each station? Find the station that has the most months with good weather.

climate2 %>% 
  filter(good_weather == 'yes') %>% 
  group_by(station) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) #%>%

# A tibble: 5 × 2
  station       n
  <chr>     <int>
1 oxford        9
2 camborne      7
3 armagh        6
4 sheffield     5
5 lerwick       4

  # head(n = 1) # to only get the first one

Complex operations

The final questions require that you combine commands and variables of the type above.

For each weather station apart from the one in Armagh, compute the total rainfall and sunshine duration for months that had no days of air frost. Present the totals in centimetres and days, respectively.

climate2 %>% 
  filter(station != 'armagh' & af == 0) %>% 
  group_by(station) %>% 
  summarize(total_rain = sum(rain),
            total_sun = sum(sun)) %>% 
  mutate(total_rain_cm = total_rain / 10, # 10 mm per cm
         total_sun_days = total_sun / 24) # 24 h per day

# A tibble: 4 × 5
  station   total_rain total_sun total_rain_cm total_sun_days
  <chr>          <dbl>     <dbl>         <dbl>          <dbl>
1 camborne       1147.     1437.         115.            59.9
2 lerwick         644.      800.          64.4           33.3
3 oxford          298.      961.          29.8           40.0
4 sheffield       354       841           35.4           35.0

Identify the weather station for which the median number of monthly sunshine hours over the months April to September was largest.

climate2 %>% 
  filter(month %in% c(4:9)) %>% 
  group_by(station) %>% 
  summarize(sun_median = median(sun)) %>% 
  arrange(desc(sun_median)) #%>%

# A tibble: 5 × 2
  station   sun_median
  <chr>          <dbl>
1 oxford          183.
2 camborne        166 
3 sheffield       160.
4 lerwick         132.
5 armagh          124.

  # head(n = 1) # to only get the first/largest one

Like in the last exercise; imagine you need to send your code to a collaborator. Review your code to ensure it is clear and well-structured, so your collaborator can easily understand and follow your work. Render your Quarto document and look at the result.

Optional section

If you’ve gone through the exercises above and are ready for more challenges — you’re in the right place. You haven’t learned all the operations you’ll need here, so feel free to search online for help. If you’re feeling overwhelmed, no worries — take a break and come back later.

Let’s simulate some climate data for the year 2056 and merge it with the original dataset from 2016.

climate_fake <- climate

climate_fake$year <- 2056
set.seed(101)
climate_fake$af <- sample(0:11, nrow(climate_fake), replace = TRUE)
climate_fake$rain <- rnorm(nrow(climate_fake), mean = mean(climate$rain)+150, sd = sd(climate$rain)+50)
climate_fake$sun <- rnorm(nrow(climate_fake), mean = mean(climate$sun), sd = sd(climate$sun))
climate_fake$device <- paste(climate_fake$device, ', New and Improved')

climate_change <- rbind(climate, climate_fake)
head(climate_change)

# A tibble: 6 × 7
  station  year month    af  rain   sun device         
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>          
1 armagh   2016     1     5 132.   44.5 Campbell Stokes
2 armagh   2016     2    10  62.6  71.3 Campbell Stokes
3 armagh   2016     3     4  43.8 117.  Campbell Stokes
4 armagh   2016     4     5  54   140.  Campbell Stokes
5 armagh   2016     5     0  41.4 210.  Campbell Stokes
6 armagh   2016     6     0  75.1 114.  Campbell Stokes

Modify the station names so that they start with capital letters.

climate_change <- climate_change %>% 
  mutate(station = str_to_title(station))

head(climate_change)

# A tibble: 6 × 7
  station  year month    af  rain   sun device         
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>          
1 Armagh   2016     1     5 132.   44.5 Campbell Stokes
2 Armagh   2016     2    10  62.6  71.3 Campbell Stokes
3 Armagh   2016     3     4  43.8 117.  Campbell Stokes
4 Armagh   2016     4     5  54   140.  Campbell Stokes
5 Armagh   2016     5     0  41.4 210.  Campbell Stokes
6 Armagh   2016     6     0  75.1 114.  Campbell Stokes

Create a new column that contains the month names (e.g., ‘January’, ‘February’, etc.). Could someone before you have needed the month.name in R?

climate_change <- climate_change %>% 
  mutate(month_names = month.name[month])

head(climate_change)

# A tibble: 6 × 8
  station  year month    af  rain   sun device          month_names
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>           <chr>      
1 Armagh   2016     1     5 132.   44.5 Campbell Stokes January    
2 Armagh   2016     2    10  62.6  71.3 Campbell Stokes February   
3 Armagh   2016     3     4  43.8 117.  Campbell Stokes March      
4 Armagh   2016     4     5  54   140.  Campbell Stokes April      
5 Armagh   2016     5     0  41.4 210.  Campbell Stokes May        
6 Armagh   2016     6     0  75.1 114.  Campbell Stokes June

Create a new column that assigns a season (Winter, Spring, Summer, Fall) based on the month. Have a look at the case_when function.

Hint

Have a look at the case_when function.

climate_change <- climate_change %>% 
  mutate(season = case_when(month %in% c(12, 1, 2) ~ 'Winter',
                            month %in% c(3, 4, 5) ~ 'Spring',
                            month %in% c(6, 7, 8) ~ 'Summer',
                            month %in% c(9, 10, 11) ~ 'Fall'
                            )
         )

head(climate_change)

# A tibble: 6 × 9
  station  year month    af  rain   sun device          month_names season
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>           <chr>       <chr> 
1 Armagh   2016     1     5 132.   44.5 Campbell Stokes January     Winter
2 Armagh   2016     2    10  62.6  71.3 Campbell Stokes February    Winter
3 Armagh   2016     3     4  43.8 117.  Campbell Stokes March       Spring
4 Armagh   2016     4     5  54   140.  Campbell Stokes April       Spring
5 Armagh   2016     5     0  41.4 210.  Campbell Stokes May         Spring
6 Armagh   2016     6     0  75.1 114.  Campbell Stokes June        Summer

Summarize the mean rain fall, air frost, and sun for each year. Evaluate the results.

climate_change %>% 
  group_by(year) %>% 
  summarize(mean_rain = mean(rain), 
            mean_af = mean(af), 
            mean_sun = mean(sun)
            )

# A tibble: 2 × 4
   year mean_rain mean_af mean_sun
  <dbl>     <dbl>   <dbl>    <dbl>
1  2016      75.8    1.68     114.
2  2056     230.     6.17     111.

Summarize the mean rain fall, air frost, and sun for each season and year. Compare the seasons across the years. How will the seasons be different in year 2056?

climate_change %>% 
  group_by(season, year) %>% 
  summarize(mean_rain = mean(rain), 
            mean_af = mean(af), 
            mean_sun = mean(sun)
            )

`summarise()` has grouped output by 'season'. You can override using the
`.groups` argument.

# A tibble: 8 × 5
# Groups:   season [4]
  season  year mean_rain mean_af mean_sun
  <chr>  <dbl>     <dbl>   <dbl>    <dbl>
1 Fall    2016      73.5   0.933     93.7
2 Fall    2056     232.    5.6      102. 
3 Spring  2016      64.0   1.47     155. 
4 Spring  2056     249.    6.33     109. 
5 Summer  2016      66.0   0        147. 
6 Summer  2056     216.    6.67     122. 
7 Winter  2016      99.7   4.33      59.1
8 Winter  2056     223.    6.07     110.

Summarize the mean rain fall, air frost, and sun for each station, season and year. Is this a good way to get an overview of the weather changes?

climate_change %>% 
  group_by(station, season, year) %>% 
  summarize(mean_rain = mean(rain))

`summarise()` has grouped output by 'station', 'season'. You can override using
the `.groups` argument.

# A tibble: 40 × 4
# Groups:   station, season [20]
   station  season  year mean_rain
   <chr>    <chr>  <dbl>     <dbl>
 1 Armagh   Fall    2016      47.8
 2 Armagh   Fall    2056     191. 
 3 Armagh   Spring  2016      46.4
 4 Armagh   Spring  2056     271. 
 5 Armagh   Summer  2016      69.4
 6 Armagh   Summer  2056      93.6
 7 Armagh   Winter  2016      82.0
 8 Armagh   Winter  2056     235. 
 9 Camborne Fall    2016     115. 
10 Camborne Fall    2056     172. 
# ℹ 30 more rows

We can get a better overview of the weather changes by plotting the data. We are learning the ggplot method in the next presentation.

Export your data

# writexl::write_xlsx(DATASET, 'PATH/TO/DATASET')
writexl::write_xlsx(climate_change, '../../Data/climate_change.xlsx')

If you’re up for more, we can explore some additional exercises that may not perfectly align with this dataset but will introduce you to some very useful tidyverse operations.

Extract the unique stations and save them as a vector.

Hint

The unlist function will convert a one-column dataframe to a vector.

station_unique <- climate_change %>% 
  select(station) %>% 
  distinct() %>% 
  unlist()

station_unique

   station1    station2    station3    station4    station5 
   "Armagh"  "Camborne"   "Lerwick"    "Oxford" "Sheffield"

Count how many times the letter ‘A’ appears in each station.

station_unique %>% str_count('A')

[1] 1 0 0 0 0

Count the number of both uppercase ‘A’ and lowercase ‘a’ in each station.

station_unique %>% str_count('A|a')

[1] 2 1 0 0 0

In the climate change data frame, create a new column for each word in the device column. For example, you can use the str_split_i() function to extract individual words and assign them to separate columns.

climate_change <- climate_change %>% 
  mutate(device_1 = str_split_i(device, pattern = ' ', i = 1),
         device_2 = str_split_i(device, pattern = ' ', i = 2))

Modify one of the new columns to contain only lowercase letters and the other to contain only uppercase letters.

climate_change <- climate_change %>% 
  mutate(device_1 = tolower(device_1),
         device_2 = toupper(device_1))

Add a new columns that contains the first three letters of each month in uppercase.

Hint

Have a look at the str_sub function.

climate_change <- climate_change %>% 
  mutate(month_names_short = str_sub(month_names, start = 1, end = 3) %>% toupper())

Footnotes

Contains public sector information licensed under the Open Government Licence v3.0.↩︎