Exercise 2: Tidyverse

Author

HeaDS Data Science Lab, University of Copenhagen

Setting up

  1. Create new Quarto document. For working on the exercise, create a new Quarto document with a descriptive name and save it where you can find it again, e.g. in the folder where you downloaded the teaching material. You can use the commands shown in presentation2.qmd to solve this exercise. There is no shame in copying outright from presentation2.qmd, provided you understand what each command does.

  2. Load packages. You will need to load the packages tidyverse and readxl for this exercise.

Importing data and a first look at the dataset

The dataset used in these exercises was compiled from data downloaded from the website of the UK’s national weather service, the Met Office. It is saved in the file climate.xlsx [1], which can be found in the folder Exercises/Data/. The spreadsheet contains monthly data from five UK weather stations for the following variables:

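| Variable name | Explanation                         |
|---------------|-------------------------------------|
| station       | Location of weather station         |
| year          | Year                                |
| month         | Month                               |
| af            | Days of air frost                   |
| rain          | Rainfall in mm                      |
| sun           | Sunshine duration in hours          |
| device        | Brand of sunshine recorder / sensor |
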
  1. Load data. Start by importing the dataset using either the read_excel() function or the Import Dataset button and name it climate. If you load with Import Dataset it is a good idea to copy the command into your script so that the next time you run your script you can just execute that line instead of having to find the file again.

  2. First look at data. Write the name of the dataframe, i.e. climate, into the console and press enter to see the first rows of the dataset. You can also click on the climate object in the Environment panel.

  3. Explore your dataset and understand what data you have.

    1. How many observations, i.e. rows, are there?

    2. How many data columns are there and what are their types?

    3. What is the information in each row and column?

    4. How many different stations are there?

    5. How many rows per station?
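
A minimal sketch of the import and first-look steps above, not a reference solution. The path comes from the exercise text and may need adjusting to your working directory. The small `toy` tibble is made-up data (not values from climate.xlsx, and without the device column) so the exploration calls can run even without the file; the lowercase station spellings are an assumption.

```r
library(tidyverse)
library(readxl)

# Path from the exercise text; adjust it relative to your working directory.
path <- "Exercises/Data/climate.xlsx"
if (file.exists(path)) {
  climate <- read_excel(path)
}

# Made-up toy data with the same column names (assumption: not real values).
toy <- tibble(
  station = c("oxford", "oxford", "oxford", "camborne", "camborne", "armagh", "armagh"),
  year    = 2016,
  month   = c(1, 6, 7, 1, 7, 1, 7),
  af      = c(5, 0, 0, 0, 0, 8, 0),
  rain    = c(60, 40, 95, 110, 50, 90, 80),
  sun     = c(60, 190, 150, 55, 200, 45, 130)
)

n_rows <- nrow(toy)                      # number of observations
n_cols <- ncol(toy)                      # number of columns
glimpse(toy)                             # column names and types
n_stations <- n_distinct(toy$station)    # number of different stations
rows_per_station <- count(toy, station)  # rows per station
```

The same calls should work on `climate` once it has been loaded.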

Working with the data

Before you proceed with the exercises in this document, make sure you load the tidyverse in order to use the functions from this package.

  1. Count the number of rows that did not have any days with air frost.

  2. Count the number of rows per station that did not have any days with air frost.

  3. Select from the climate dataset (remember to filter rows and select columns):

    1. all rows from the station in Oxford

    2. all rows from the station in Oxford when there were at least 100 hours of sunlight

    3. all rows from the stations in Oxford and Camborne when there were at least 100 hours of sunlight

    4. a subset that only contains the station, year and rain columns
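
The filtering and selecting steps above could look roughly like this. The sketch runs on a made-up `toy` tibble (assumed values and assumed lowercase station spellings, not the real climate data); swap in `climate` and check how the station names are actually written in your file.

```r
library(tidyverse)

# Made-up toy data with the exercise's column names (assumption: not real values).
toy <- tibble(
  station = c("oxford", "oxford", "oxford", "camborne", "camborne", "armagh", "armagh"),
  year    = 2016,
  month   = c(1, 6, 7, 1, 7, 1, 7),
  af      = c(5, 0, 0, 0, 0, 8, 0),
  rain    = c(60, 40, 95, 110, 50, 90, 80),
  sun     = c(60, 190, 150, 55, 200, 45, 130)
)

# Rows without any days of air frost, overall and per station.
no_frost <- toy %>% filter(af == 0)
n_no_frost <- nrow(no_frost)
no_frost_per_station <- no_frost %>% count(station)

# Rows from one station, then with an extra condition on sunshine hours.
oxford <- toy %>% filter(station == "oxford")
oxford_sunny <- toy %>% filter(station == "oxford", sun >= 100)

# %in% matches against several stations at once.
two_stations_sunny <- toy %>%
  filter(station %in% c("oxford", "camborne"), sun >= 100)

# select() keeps columns rather than rows.
subset_cols <- toy %>% select(station, year, rain)
```

Separating conditions with a comma inside `filter()` combines them with a logical AND.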

The next few questions build on each other, each adding a piece of code:

  1. Compute the average rainfall over the full dataset by using the summarize function. You can look at the examples we did at the end of presentation 2.

  2. Now, compute the average rainfall, the standard deviation of the rainfall and the total rainfall (the sum) on the full dataset. All three measures should appear in the same resulting table. Have a look at the tidyverse lecture if you have trouble with this.

  3. Now, use group_by before summarize in order to compute group summary statistics (average, standard deviation, and sum) but split up into each of the five weather stations.

  4. Include a column in the summary statistics which shows how many observations, i.e. rows, the data set contains for each station.

  5. Sort the rows in the output in descending order according to average rainfall.
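
One way the five steps above can build on each other, shown on a made-up `toy` tibble (assumed values, not the real climate data). The summary column names such as `mean_rain` are illustrative choices, not names required by the exercise.

```r
library(tidyverse)

# Made-up toy data with the exercise's column names (assumption: not real values).
toy <- tibble(
  station = c("oxford", "oxford", "oxford", "camborne", "camborne", "armagh", "armagh"),
  year    = 2016,
  month   = c(1, 6, 7, 1, 7, 1, 7),
  af      = c(5, 0, 0, 0, 0, 8, 0),
  rain    = c(60, 40, 95, 110, 50, 90, 80),
  sun     = c(60, 190, 150, 55, 200, 45, 130)
)

# Steps 1 and 2: several summary statistics in one table, over the full data.
overall <- toy %>%
  summarize(
    mean_rain  = mean(rain),
    sd_rain    = sd(rain),
    total_rain = sum(rain)
  )

# Steps 3 to 5: the same statistics per station, plus a row count, sorted.
station_summary <- toy %>%
  group_by(station) %>%
  summarize(
    mean_rain  = mean(rain),
    sd_rain    = sd(rain),
    total_rain = sum(rain),
    n_rows     = n()
  ) %>%
  arrange(desc(mean_rain))
```

If the real data contain missing values, `mean()`, `sd()` and `sum()` return `NA` unless you pass `na.rm = TRUE`.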

Manipulating the data

  1. Create a new column in climate and save the new dataset in a different variable so you don’t overwrite your original climate data. The new column should count the number of days in each month without air frost, based on the existing af column. For this exercise, assume each month has 30 days. To find the number of days without air frost, subtract the value in the af column from 30.

  2. Add another column to your new dataset that says whether the weather this month was good. We consider a month to be good if it had at least 100 hours of sunshine and less than 100 mm of rain. Otherwise the weather was bad.

  3. How many months are there with good weather (use the column you made in the previous step) for each station? Find the station that has the most months with good weather.
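
A sketch of the three steps above on a made-up `toy` tibble (assumed values, not the real climate data). The new column names `days_no_af` and `weather`, and the labels "good" and "bad", are illustrative assumptions.

```r
library(tidyverse)

# Made-up toy data with the exercise's column names (assumption: not real values).
toy <- tibble(
  station = c("oxford", "oxford", "oxford", "camborne", "camborne", "armagh", "armagh"),
  year    = 2016,
  month   = c(1, 6, 7, 1, 7, 1, 7),
  af      = c(5, 0, 0, 0, 0, 8, 0),
  rain    = c(60, 40, 95, 110, 50, 90, 80),
  sun     = c(60, 190, 150, 55, 200, 45, 130)
)

# Save the result under a new name so the original data stay untouched.
toy_new <- toy %>%
  mutate(
    days_no_af = 30 - af,  # exercise assumption: every month has 30 days
    weather    = if_else(sun >= 100 & rain < 100, "good", "bad")
  )

# Months with good weather per station, most first.
good_months <- toy_new %>%
  filter(weather == "good") %>%
  count(station, sort = TRUE)
```

Note that a station with no good months at all would not appear in `good_months`, because its rows are filtered out before counting.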

Complex operations

The final questions require that you combine commands and variables of the type above.

  1. For each weather station apart from the one in Armagh, compute the total rainfall and sunshine duration for months that had no days of air frost. Present the totals in centimetres and days, respectively.

  2. Identify the weather station for which the median number of monthly sunshine hours over the months April to September was largest.
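
A possible shape for these two pipelines, again on a made-up `toy` tibble (assumed values, not the real climate data). Three things here are assumptions to check against your file: the lowercase station spellings, that month is stored as a number from 1 to 12, and that a "day" of sunshine means 24 hours.

```r
library(tidyverse)

# Made-up toy data with the exercise's column names (assumption: not real values).
toy <- tibble(
  station = c("oxford", "oxford", "oxford", "camborne", "camborne", "armagh", "armagh"),
  year    = 2016,
  month   = c(1, 6, 7, 1, 7, 1, 7),
  af      = c(5, 0, 0, 0, 0, 8, 0),
  rain    = c(60, 40, 95, 110, 50, 90, 80),
  sun     = c(60, 190, 150, 55, 200, 45, 130)
)

# Totals for frost-free months, excluding one station, with unit conversion.
frost_free_totals <- toy %>%
  filter(station != "armagh", af == 0) %>%
  group_by(station) %>%
  summarize(
    rain_cm  = sum(rain) / 10,  # mm to cm
    sun_days = sum(sun) / 24    # hours to days (assumes 24-hour days)
  )

# Station with the largest median monthly sunshine from April to September.
sunniest <- toy %>%
  filter(month %in% 4:9) %>%    # assumes numeric months
  group_by(station) %>%
  summarize(median_sun = median(sun)) %>%
  slice_max(median_sun, n = 1)
```

If month turns out to be stored as text, the `%in%` filter would need the month names instead of `4:9`.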

Wrapping up

  1. As in the last exercise, imagine you need to send your code to a collaborator. Review your code to ensure it is clear and well-structured, so your collaborator can easily understand and follow your work. Render your Quarto document and look at the result.

Footnotes

  1. Contains public sector information licensed under the Open Government Licence v3.0.