Presentation 4: Scripting in R

In this section we will learn more about flow control and how to make more complex code constructs in R.

library(tidyverse)

If-else statements

If-else statements are essential if you want your program to do different things depending on a condition. Here we see how to code them in R.

First define some variables.

num1 <- 8
num2 <- 5

Now that we have variables, we can evaluate logical statements about them: is num1 larger than num2? The result of such a comparison is either TRUE or FALSE:

num1 > num2
[1] TRUE

Is num1 smaller than num2?

num1 < num2
[1] FALSE

We use logical statements inside an if statement to define a condition.

if (num1 > num2){
  statement <- paste(num1, 'is larger than', num2)
}

print(statement)
[1] "8 is larger than 5"

We can add an else if to test multiple conditions. The else branch applies when all previous checks were FALSE.

Now we have three possible outcomes:

#try with different values for num2
num2 <- 10

if (num1 > num2){
  statement <- paste(num1, 'is larger than', num2)
} else if (num1 < num2) {
  statement <- paste(num1, 'is smaller than', num2)
} else {
  statement <- paste(num1, 'is equal to', num2)
} 

print(statement)
[1] "8 is smaller than 10"

For-loops

Defining a for loop

Many functions and operators in R are already vectorized, i.e. they are applied to every element of a vector at once:

df <- tibble(num1 = 1:10)
df
# A tibble: 10 × 1
    num1
   <int>
 1     1
 2     2
 3     3
 4     4
 5     5
 6     6
 7     7
 8     8
 9     9
10    10

df$num2 <- df$num1 * 10
df
# A tibble: 10 × 2
    num1  num2
   <int> <dbl>
 1     1    10
 2     2    20
 3     3    30
 4     4    40
 5     5    50
 6     6    60
 7     7    70
 8     8    80
 9     9    90
10    10   100

The above code applies * 10 to each element of column num1 without us having to invoke a loop.
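For comparison, here is a minimal sketch of the same multiplication written once in vectorized form and once as an explicit loop. The names values, vectorized and looped are illustrative and not part of the presentation.

```r
# Illustrative names; not from the presentation
values <- 1:10

# Vectorized: one expression handles every element
vectorized <- values * 10

# Explicit loop: pre-allocate the result, then fill it element by element
looped <- numeric(length(values))
for (i in seq_along(values)) {
  looped[i] <- values[i] * 10
}

# Both approaches give the same numbers
all(vectorized == looped)
```

The vectorized version is shorter and usually the better choice whenever the operation supports it.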

But sometimes we want to iterate over the elements manually because the situation requires it. For that case we can use a for loop.

We first define a list containing both numeric and character elements.

list1 <- list(1, 2, 6, 3, 2, 'hello', 'world', 'yes', 7, 8, 12, 15)

To loop through list1, we define a loop variable (here called element), which takes the value of each item in the list, one at a time.

for (element in list1) {
  print(element)
}
[1] 1
[1] 2
[1] 6
[1] 3
[1] 2
[1] "hello"
[1] "world"
[1] "yes"
[1] 7
[1] 8
[1] 12
[1] 15

The name of the loop variable is arbitrary, so you can call it anything. For example, we can use THIS_VARIABLE and get the same result. Just avoid a name that is already used by an important variable in your script, because the loop would overwrite it.

for (THIS_VARIABLE in list1) {
  print(THIS_VARIABLE)
}
[1] 1
[1] 2
[1] 6
[1] 3
[1] 2
[1] "hello"
[1] "world"
[1] "yes"
[1] 7
[1] 8
[1] 12
[1] 15

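After a loop has finished, the loop variable still exists and holds the last value it was assigned, here the last element of the list. This is because a for loop in R does not create its own scope: the loop variable lives in the environment in which the loop runs, here the global environment.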

THIS_VARIABLE
[1] 15
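A small sketch of why the choice of name matters: if the loop variable shares its name with an existing variable, the existing value is replaced. The variable count is hypothetical and only serves as an illustration.

```r
# Hypothetical variable that we would like to keep
count <- 100

# Re-using the same name as the loop variable overwrites it
for (count in c(1, 2, 3)) {
  # the body can be empty; the assignment happens in the loop header
}

# count now holds the last element of the vector, not 100
count
```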

Loop control

There are two loop control statements we can use to

  • jump to the next iteration: next
  • end the loop before finishing: break

#example for next

for (element in list1) {
  if(element == 'hello'){
    next
  }
  
  print(element)
}
[1] 1
[1] 2
[1] 6
[1] 3
[1] 2
[1] "world"
[1] "yes"
[1] 7
[1] 8
[1] 12
[1] 15

#example for break
for (element in list1) {
  if(element == 'hello'){
    break
  }
  
  print(element)
}
[1] 1
[1] 2
[1] 6
[1] 3
[1] 2
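Both statements can appear in the same loop. The following sketch, with the made-up list values and the result vector kept, skips every non-numeric element and stops completely once it meets the string 'stop'. It uses identical() for the comparison, which is one possible way to compare a single element against a fixed value without relying on type conversion.

```r
# Made-up example data; not from the presentation
values <- list(1, 2, 'skip me', 6, 'stop', 7, 8)

kept <- c()
for (value in values) {
  # leave the loop entirely when we reach the stop marker
  if (identical(value, 'stop')) {
    break
  }
  # skip anything that is not a number
  if (!is.numeric(value)) {
    next
  }
  kept <- c(kept, value)
}

kept
```

Only 1, 2 and 6 are collected: 'skip me' is skipped by next, and 7 and 8 are never reached because of break.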

Which data constructs are iterable in R?

Vectors:

my_vector <- c(1, 2, 3, 4, 5)
for (elem in my_vector) {
  print(elem)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Lists:

my_list <- list(a = 1, b = "Hello", c = TRUE)
for (elem in my_list) {
  print(elem)
}
[1] 1
[1] "Hello"
[1] TRUE

Dataframes and tibbles:

my_df <- data.frame(A = 1:3, B = c("X", "Y", "Z"))
my_df
  A B
1 1 X
2 2 Y
3 3 Z

#column-wise

for (col in my_df) {
  print(col)
}
[1] 1 2 3
[1] "X" "Y" "Z"

For row-wise iteration you can for example use the row index:

for (i in 1:nrow(my_df)) {
  print(i)
  #print row i
  print(my_df[i,])
}
[1] 1
  A B
1 1 X
[1] 2
  A B
2 2 Y
[1] 3
  A B
3 3 Z
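One caveat with the 1:nrow() pattern, sketched here with a hypothetical empty data frame called empty_df: when there are no rows, 1:0 counts down from 1 to 0, so the loop body would still run twice. seq_len() returns an empty sequence in that case and the loop is skipped.

```r
# Hypothetical data frame without any rows
empty_df <- data.frame(A = integer(0), B = character(0))

# 1:0 counts down instead of being empty
1:nrow(empty_df)

# seq_len(0) is an empty sequence
seq_len(nrow(empty_df))

# The loop body is never executed for an empty data frame
n_iterations <- 0
for (i in seq_len(nrow(empty_df))) {
  n_iterations <- n_iterations + 1
}
n_iterations
```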

If-else in loops

We can now use what we have learned to loop through list1 and multiply all numeric values by 10:

#to remember contents:
list1
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 6

[[4]]
[1] 3

[[5]]
[1] 2

[[6]]
[1] "hello"

[[7]]
[1] "world"

[[8]]
[1] "yes"

[[9]]
[1] 7

[[10]]
[1] 8

[[11]]
[1] 12

[[12]]
[1] 15

for (element in list1) {
  if (is.numeric(element)){
    statement <- paste(element, 'times 10 is', element*10)
  } else {
    statement <- paste(element, 'is not a number!')
  }
  print(statement)
}
[1] "1 times 10 is 10"
[1] "2 times 10 is 20"
[1] "6 times 10 is 60"
[1] "3 times 10 is 30"
[1] "2 times 10 is 20"
[1] "hello is not a number!"
[1] "world is not a number!"
[1] "yes is not a number!"
[1] "7 times 10 is 70"
[1] "8 times 10 is 80"
[1] "12 times 10 is 120"
[1] "15 times 10 is 150"

Note that this does not work with a vector such as vec <- c(1,2,'hello'): a vector can only contain one data type, so all elements of vec are converted to characters.
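The difference can be checked directly. This short sketch compares a vector and a list built from the same values; the name lst is illustrative.

```r
# In a vector, mixing numbers and strings converts everything to character
vec <- c(1, 2, 'hello')
class(vec)
is.numeric(vec[1])

# In a list, every element keeps its own type
lst <- list(1, 2, 'hello')
sapply(lst, class)
is.numeric(lst[[1]])
```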

User-defined Functions

User-defined functions help us reuse and structure our code.

We will use BMI calculation as an example for this part.

#measurements of one individual

weight_kg <- 70
height_m <- 1.80

We calculate BMI with this formula:

bmi <- weight_kg/height_m^2
bmi
[1] 21.60494

If we plan to calculate BMI for multiple individuals it is convenient to write the calculation into a function.

  • Function name: calculate_bmi.

  • Function parameters: weight_kg and height_m.

  • The return value: bmi.

The return statement specifies the value that the function will return when called.

calculate_bmi <- function(weight_kg, height_m){
  
  bmi <- weight_kg/height_m^2
  
  return(bmi)
  
}

We can now call the function on our previously defined variables.

calculate_bmi(weight_kg = weight_kg, 
              height_m = height_m)
[1] 21.60494

We can also pass numbers directly to the function.

calculate_bmi(weight_kg = 100, 
              height_m = 1.90)
[1] 27.70083

Argument Order in Function Calls

If we specify the parameter names, the order can be changed.

calculate_bmi(height_m = 1.90, 
              weight_kg = 100)
[1] 27.70083

If we do not specify the parameter names, the arguments are matched by position, so be careful: in the call below the height is used as weight and the weight as height.

calculate_bmi(1.90, 
              100)
[1] 0.00019

Combining function call with if-statement

We can combine user-defined functions with if-else statements, so that the if-else will decide whether we execute the function or not.

#measurements of one individual
age <- 45
weight_kg <- 85
height_m <- 1.75

For some purposes, BMI should only be calculated for individuals aged 18 or older.

if (age >= 18){
  calculate_bmi(weight_kg, height_m)
}
[1] 27.7551

Combining function call with for-loops

Or we can choose to execute our function once for every element of an iterable, e.g. every row in a dataframe:

df <- data.frame(row.names = 1:5, 
                 age = c(45, 16, 31, 56, 19), 
                 weight_kg = c(85, 65, 100, 45, 76), 
                 height_m = c(1.75, 1.45, 1.95, 1.51, 1.89))

df
  age weight_kg height_m
1  45        85     1.75
2  16        65     1.45
3  31       100     1.95
4  56        45     1.51
5  19        76     1.89

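Print ID, weight, and height of all individuals. Because c() combines the three values into a single vector, the numbers are converted to character strings in the output.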

for (id in rownames(df)){
  
  weight <- df[id, 'weight_kg']
  
  height <- df[id, 'height_m']
  
  print(c(id, weight, height))
  
}
[1] "1"    "85"   "1.75"
[1] "2"    "65"   "1.45"
[1] "3"    "100"  "1.95"
[1] "4"    "45"   "1.51"
[1] "5"    "76"   "1.89"

Call function to calculate BMI for all individuals.

for (id in rownames(df)) {
  
  weight <- df[id, 'weight_kg']
  
  height <- df[id, 'height_m']
  
  bmi <- calculate_bmi(weight, height)
  
  print(c(id, bmi))
  
}
[1] "1"                "27.7551020408163"
[1] "2"                "30.9155766944114"
[1] "3"                "26.2984878369494"
[1] "4"                "19.7359764922591"
[1] "5"                "21.2760001119789"

Combination of function call, if-statement and for-loops.

Print BMI for individuals that are 18 years old or older.

for (id in rownames(df)) {
  
  if (df[id, 'age'] >= 18) {
    
    weight <- df[id, 'weight_kg']
  
    height <- df[id, 'height_m']
    
    bmi <- calculate_bmi(weight, height)
    
    print(c(id, bmi))

  } else {
    
    print(paste(id, 'is under 18.'))
    
  }
  
}
[1] "1"                "27.7551020408163"
[1] "2 is under 18."
[1] "3"                "26.2984878369494"
[1] "4"                "19.7359764922591"
[1] "5"                "21.2760001119789"

Add BMI to the data frame.

for (id in rownames(df)){
  
  if (df[id, 'age'] >= 18) {
    
    weight <- df[id, 'weight_kg']
  
    height <- df[id, 'height_m']
    
    bmi <- calculate_bmi(weight, height)

  } else {
    
    bmi <- NA
    
  }
  
  df[id, 'bmi'] <- bmi
  
}

Have a look at the data frame.

df
  age weight_kg height_m      bmi
1  45        85     1.75 27.75510
2  16        65     1.45       NA
3  31       100     1.95 26.29849
4  56        45     1.51 19.73598
5  19        76     1.89 21.27600
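Because the BMI formula itself is vectorized, the same column can also be filled without a loop. This is a sketch of one alternative, not the approach used in the presentation: it recreates the data under the illustrative name people and uses ifelse() to set the BMI to NA for individuals under 18.

```r
# Same formula as calculate_bmi above, repeated so the example is self-contained
calculate_bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Same values as in df, stored under an illustrative name
people <- data.frame(age = c(45, 16, 31, 56, 19),
                     weight_kg = c(85, 65, 100, 45, 76),
                     height_m = c(1.75, 1.45, 1.95, 1.51, 1.89))

# ifelse() checks the condition for every row at once
people$bmi <- ifelse(people$age >= 18,
                     calculate_bmi(people$weight_kg, people$height_m),
                     NA)

people
```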

Error handling in user-defined functions

Currently our BMI function accepts all kinds of inputs. However, what happens if we give a negative weight?

calculate_bmi(weight_kg = -50, height_m = 1.80)
[1] -15.4321

We should require that both weight and height are positive numbers:

calculate_bmi_2 <- function(weight_kg, height_m) {
  
  # Check if weight and height are numeric
  if (!is.numeric(weight_kg) || !is.numeric(height_m)) {
    stop("Both weight_kg and height_m must be numeric values.")
  }
  
  # Check if weight and height are positive
  if (weight_kg <= 0) {
    stop("Weight must be a positive value.")
  }
  if (height_m <= 0) {
    stop("Height must be a positive value.")
  }
  
  # Calculate BMI
  bmi <- weight_kg / height_m^2
  
  # Check if BMI is within a reasonable range
  if (bmi < 10 || bmi > 60) {
    warning("The calculated BMI is outside the normal range. Please check your input values.")
  }
  
  return(bmi)
  
}

When we try to run calculate_bmi_2 with a negative weight we now receive an error:

calculate_bmi_2(weight_kg = -50, height_m = 1.80)
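An error raised with stop() halts the script unless it is caught. As a minimal sketch of how a caller could react instead, the example below wraps the call in tryCatch() and returns NA when an error occurs. The function checked_bmi is a hypothetical, shortened stand-in for calculate_bmi_2 so that the example runs on its own.

```r
# Hypothetical, shortened stand-in for calculate_bmi_2
checked_bmi <- function(weight_kg, height_m) {
  if (weight_kg <= 0) {
    stop("Weight must be a positive value.")
  }
  weight_kg / height_m^2
}

# tryCatch() runs the call and hands any error to the error function
result <- tryCatch(
  checked_bmi(weight_kg = -50, height_m = 1.80),
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA
  }
)

# The script continues, and result is NA
result
```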

The function also issues a warning when the calculated BMI falls outside the range 10 to 60:

calculate_bmi_2(weight_kg = 25, height_m = 1.80)
Warning in calculate_bmi_2(weight_kg = 25, height_m = 1.8): The calculated BMI
is outside the normal range. Please check your input values.
[1] 7.716049

Running calculate_bmi_2 with appropriate inputs:

calculate_bmi_2(weight_kg = 75, height_m = 1.80)
[1] 23.14815

Out-sourcing functions to an Rscript you source

It is cleaner to collect all your functions in one place, and perhaps that place should not be your analysis script. You can instead save your functions in a separate R script and source it inside your analysis script to have access to all your functions without them cluttering your workflow.

We have created a file named presentation4_functions.R and copied our two function definitions for calculate_bmi and calculate_bmi_2 into it.

Now we remove our function definitions from the global environment to demonstrate how to source them from an external file.

rm(list = c("calculate_bmi", "calculate_bmi_2"))

Sourcing a script runs it, so every object it defines at the top level (including functions) is loaded into the global environment and appears in the Environment pane. Here we source the presentation4_functions.R script. Check the environment to confirm that the two functions appeared.

source('./presentation4_functions.R')
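To try the mechanism without the presentation file, the following sketch writes a tiny function script to a temporary file and sources it. The function double_it and the temporary file are made up for this illustration.

```r
# Write a small, made-up function script to a temporary file
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "double_it <- function(x) {",
  "  x * 2",
  "}"
), functions_file)

# Sourcing the file runs it and defines double_it
source(functions_file)

double_it(21)
```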

After sourcing the functions script, the functions can be used just as if they were defined in the main script. If you work on a larger project and write multiple functions, it is best practice to keep them in a function script and source it in your main script.

calculate_bmi_2(weight_kg = 67, 
                height_m = 1.70)
[1] 23.18339

You can also use mapply as an alternative to calling the function in a for-loop:

mapply(FUN = calculate_bmi_2, 
       weight_kg = df$weight_kg, 
       height_m = df$height_m)
[1] 27.75510 30.91558 26.29849 19.73598 21.27600
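mapply calls the function once per pair of values, which suits a function whose if checks expect a single TRUE or FALSE. The plain formula has no such checks and can take whole columns directly. The sketch below, using the illustrative vectors weights and heights, shows that both routes give the same numbers for the unchecked calculate_bmi.

```r
# Same formula as calculate_bmi above, repeated so the example is self-contained
calculate_bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Illustrative vectors with the same values as the data frame columns
weights <- c(85, 65, 100, 45, 76)
heights <- c(1.75, 1.45, 1.95, 1.51, 1.89)

# Direct, vectorized call on whole vectors
direct <- calculate_bmi(weights, heights)

# Element-by-element call through mapply
mapped <- mapply(FUN = calculate_bmi,
                 weight_kg = weights,
                 height_m = heights)

all.equal(direct, mapped)
```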