Presentation 3: ggplot2

Want to code along?

Importing libraries and data

library(readxl)
library(writexl)
library(tidyverse)

Load data

The iris dataset is a widely-used dataset in data science, containing 150 observations of iris flowers with features like sepal length, sepal width, petal length, and petal width. It includes three species: Setosa, Versicolor, and Virginica, making it ideal for classification tasks and data visualization.

data('iris')
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Data wrangling

Making the data more fun to plot.

# Define 5 colors and their respective ratios
colors <- c(rep("Red", 40), 
            rep("Blue", 20), 
            rep("Yellow",30), 
            rep("Green", 20),
            rep("Purple",40))

set.seed(123)  # For reproducibility

# Shuffle the colors to mix them randomly
colors <- sample(colors, replace = TRUE)

# Add the 'Flower.Color' column to the iris dataset
iris$Flower.Color <- colors

Have a look at the dataset

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Flower.Color
1          5.1         3.5          1.4         0.2  setosa          Red
2          4.9         3.0          1.4         0.2  setosa         Blue
3          4.7         3.2          1.3         0.2  setosa       Purple
4          4.6         3.1          1.5         0.2  setosa         Blue
5          5.0         3.6          1.4         0.2  setosa          Red
6          5.4         3.9          1.7         0.4  setosa       Purple
colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
[6] "Flower.Color"

ggplot2: The basic concepts

This is the starting point of a ggplot. The dataframe and the columns we wish to plot are defined. We have not specified what type of plot we want, hence an empty plot is produced.

ggplot(iris,                    # dataframe 
       aes(x = Sepal.Length,    # x-value
           y = Petal.Length))   # y-value

                                # missing type of plot 

Scatter plot with geom_point

A scatter plot is made with the geom_point function and is used to get an overview over the relationship between two numeric variables. Here we see the relationship between sepal length and width.

ggplot(iris,                    # dataframe 
       aes(x = Sepal.Length,    # x-value
           y = Petal.Length)) + # y-value
  geom_point()                  # type of plot

Change color of entire plot by setting it outside aes().

ggplot(iris,                    
       aes(x = Sepal.Length,    
           y = Petal.Length)) + 
  geom_point(color = 'hotpink') 

Scatter plot with geom_point with color stratification

To change colors based on a feature, you need to set it inside aes(). Here we see the relationship between sepal length and width colored by species.

ggplot(iris,                   
       aes(x = Sepal.Length,   
           y = Petal.Length,     
           color = Species)) +  
  geom_point()

Boxplot with geom_boxplot

Boxplots are great to get an overview of continues variables and spot outliers. Can be shown on either axis (x and y).

ggplot(iris,                    
       aes(y = Sepal.Length)) + 
  geom_boxplot() 

Split up by categorical variable like Species:

ggplot(iris,                    
       aes(y = Sepal.Length,    
           x = Species)) +      
  geom_boxplot()              

… or color.

ggplot(iris,                    
       aes(y = Sepal.Length,    
           fill = Species)) +  
  geom_boxplot()

Violin plot with geom_violin

A violin plot shows the distribution of a continuous variable across different categories, combining the features of a box plot and a density plot. Also, the labels can be edited.

ggplot(iris,                    
       aes(y = Sepal.Length,    
           x = Species)) +      
  geom_violin() + 
  labs(y = 'Sepal Length', 
       x = 'Flower Species', 
       title = 'Violin plot of sepal length stratisfied by flower species')

Histogram with geom_histogram

Histogram shows the distribution of a continuous variable.

ggplot(iris,                    
       aes(x = Sepal.Length)) +     
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You will sometimes get a message that suggests to select another binwidth. Do what is says and you will often get nicer plot (something nothing changes).

ggplot(iris,                    
       aes(x = Sepal.Length)) +     
  geom_histogram(binwidth = 0.5)

Bar chart with geom_bar

A bar chart is made with the geom_bar function and is used to get an overview over the distribution of a single categorical variable, e.g. Flower.Color in this instance. The function treats the x-axis as categorical and calculates the bar heights based on the number of occurrences in each category. Here we see the number of flowers of each Flower.Color. Notice that the Flower.Colors are sorted alphabetically.

# Save plot in p
p <- ggplot(iris,                    
            aes(x = Flower.Color)) + 
  geom_bar()                         

# Show p
p

# Show p with new labels 
p + labs(x = 'Awesome Flower Colors', y = 'Awesome Count')

# Show p again
p

# Save p with new lables in p (overwrite / reassign)
p2 <- p + labs(x = 'Awesome Flower Colors', y = 'Awesome Count')

# Show p2
p2

Color by species. The bars are stacked by default.

ggplot(iris,                    
       aes(x = Flower.Color,   
           fill = Species)) +   
  geom_bar()                   

Add position = "dodge" for bars to be placed next to each other.

ggplot(iris,                   
      aes(x = Flower.Color,    
          fill = Species)) +   
  geom_bar(position = "dodge") 

Add position = "fill" for bars to be normalized such that heights of the bars to represent percentages rather than counts.

Additionally, themes can be added as a layer to any ggplot if you prefer a theme other than the default grey background.

ggplot(iris,                   
      aes(x = Flower.Color,   
          fill = Species)) +   
  geom_bar(position = "fill") + 
  theme_bw()

  # theme_classic()  
  # theme_minimal()
  # theme_dark()

Using facet_wrap if you want a plot to be split up according to a categorical variable.

ggplot(iris,                   
      aes(x = Flower.Color)) +
  geom_bar()  + 
  facet_wrap(vars(Species))

Ordering columns

We can order the columns such that the count goes from lowest to highest. This is actually not that easy in R.

First, we see that the class of the Flower.Color is character. Characters are always sorted alphabetically like we saw above.

class(iris$Flower.Color)
[1] "character"

Extract the number of flowers for each Flower.Color.

dl_Flower.Color <- iris %>%
  group_by(Flower.Color) %>%
  summarise(n = n()) %>% 
  arrange(desc(n))

dl_Flower.Color
# A tibble: 5 × 2
  Flower.Color     n
  <chr>        <int>
1 Red             44
2 Yellow          33
3 Purple          30
4 Green           22
5 Blue            21
dl_Flower.Color$Flower.Color
[1] "Red"    "Yellow" "Purple" "Green"  "Blue"  

Change the class of the Flower.Color feature to factor and add levels according to the number of flowers with each color.

iris$Flower.Color <- factor(iris$Flower.Color,
                            levels = dl_Flower.Color$Flower.Color)

Check class now

class(iris$Flower.Color)
[1] "factor"

Now we do the same plot as before and we see that the order has changed to range from largest to smallest Flower.Colors group. The plot is saved in the variable p.

p <- ggplot(iris,                # dataframe 
       aes(x = Flower.Color)) +  # x-value
  geom_bar()                     # type of plot 

p

We can also flip the chart. We update the plot, p, be reassignment.

p <- p + coord_flip()

p

Since we are working with colors, we can change the colors of the bars to match the groups.

R color chart here

# Define color palette
color_palette <- c("Red" = "red3", 
                   "Blue" = "cornflowerblue", 
                   "Yellow" = "lightgoldenrod1", 
                   "Green" = "darkolivegreen2", 
                   "Purple" = "darkorchid3")

p <- p + 
  aes(fill = Flower.Color) +                # add the fill ascetics 
  scale_fill_manual(values = color_palette) # set the fill color according to the color palette

print(p)

Bar chart with geom_col

Another way to make a bar chart is by using geom_col. Unlike geom_bar, which only requires an x-value and automatically counts occurrences, geom_col requires both x- and y-values. This makes geom_col ideal for cases where you already have pre-calculated values that you want to use as the bar heights.

The mean of the sepal length within each color is calcualted using the summarize function.

mean_sepal_length_pr_color <- iris %>%
  group_by(Flower.Color) %>%
  summarize(mean_Sepal.Length = mean(Sepal.Length))

head(mean_sepal_length_pr_color)
# A tibble: 5 × 2
  Flower.Color mean_Sepal.Length
  <fct>                    <dbl>
1 Red                       5.88
2 Yellow                    5.89
3 Purple                    5.91
4 Green                     5.65
5 Blue                      5.79
ggplot(mean_sepal_length_pr_color,   # dataframe 
       aes(x = Flower.Color,         # x-value
           y = mean_Sepal.Length)) + # y-value
  geom_col()                         # type of plot